The 20 newsgroups dataset is a collection of approximately 20,000
newsgroup documents, partitioned (nearly) evenly across 20 different
newsgroups. The 20 newsgroups collection has become a popular data set for
experiments in text applications of machine learning techniques, such as
text classification and text clustering. We will use the Mahout CBayes
classifier to create a model that would classify a new document into one of
the 20 newsgroups.
=======================================================
Confusion Matrix
-------------------------------------------------------
a b c d e f g h i j k l m n o p q r s t <--Classified as
381 0 0 0 0 9 1 0 0 0 1 0 0 2 0 1 0 0 3 0 |398 a=rec.motorcycles
1 284 0 0 0 0 1 0 6 3 11 0 66 3 0 6 0 4 9 0 |395 b=comp.windows.x
2 0 339 2 0 3 5 1 0 0 0 0 1 1 12 1 7 0 2 0 |376 c=talk.politics.mideast
4 0 1 327 0 2 2 0 0 2 1 1 0 5 1 4 12 0 2 0 |364 d=talk.politics.guns
7 0 4 32 27 7 7 2 0 12 0 0 6 0 100 9 7 31 0 0 |251 e=talk.religion.misc
10 0 0 0 0 359 2 2 0 0 3 0 1 6 0 1 0 0 11 0 |396 f=rec.autos
0 0 0 0 0 1 383 9 1 0 0 0 0 0 0 0 0 3 0 0 |397 g=rec.sport.baseball
1 0 0 0 0 0 9 382 0 0 0 0 1 1 1 0 2 0 2 0 |399 h=rec.sport.hockey
2 0 0 0 0 4 3 0 330 4 4 0 5 12 0 0 2 0 12 7 |385 i=comp.sys.mac.hardware
0 3 0 0 0 0 1 0 0 368 0 0 10 4 1 3 2 0 2 0 |394 j=sci.space
0 0 0 0 0 3 1 0 27 2 291 0 11 25 0 0 1 0 13 18|392 k=comp.sys.ibm.pc.hardware
8 0 1 109 0 6 11 4 1 18 0 98 1 3 11 10 27 1 1 0 |310 l=talk.politics.misc
0 11 0 0 0 3 6 0 10 6 11 0 299 13 0 2 13 0 7 8 |389 m=comp.graphics
6 0 1 0 0 4 2 0 5 2 12 0 8 321 0 4 14 0 8 6 |393 n=sci.electronics
2 0 0 0 0 0 4 1 0 3 1 0 3 1 372 6 0 2 1 2 |398 o=soc.religion.christian
4 0 0 1 0 2 3 3 0 4 2 0 7 12 6 342 1 0 9 0 |396 p=sci.med
0 1 0 1 0 1 4 0 3 0 1 0 8 4 0 2 369 0 1 1 |396 q=sci.crypt
10 0 4 10 1 5 6 2 2 6 2 0 2 1 86 15 14 152 0 1 |319 r=alt.atheism
4 0 0 0 0 9 1 1 8 1 12 0 3 0 2 0 0 0 341 2 |390 s=misc.forsale
8 5 0 0 0 1 6 0 8 5 50 0 40 2 1 0 9 0 3 256|394 t=comp.os.ms-windows.misc
=======================================================
Statistics
-------------------------------------------------------
Kappa 0.8808
Accuracy 90.8596%
Reliability 86.3632%
Reliability (standard deviation) 0.2131