Wikipedia XML parser and Naive Bayes Classifier Example
Mahout has an example script that downloads a recent XML dump of the English Wikipedia database (the entire database, if desired). After running the classification script, you can use the document classification script from the Mahout spark-shell to vectorize and classify text from outside the training and testing corpus, using a model built on the Wikipedia dataset.
You can run this script to build and test a Naive Bayes classifier for option (1) 10 arbitrary countries or option (2) 2 countries (United States and United Kingdom).
To run the example, simply execute the classify-wikipedia.sh script.
By default the script is set to run on a medium-sized Wikipedia XML dump. To run on the full set (the entire English Wikipedia), change the download by commenting out line 78 and uncommenting line 80 of classify-wikipedia.sh. However, this is not recommended unless you have the resources to do so. Be sure to clean your work directory with option (3) when changing datasets.
The step-by-step process for creating a Naive Bayes classifier for the Wikipedia XML dump is very similar to that for creating a 20 Newsgroups classifier. The only difference is that instead of running mahout seqdirectory on the unzipped 20 Newsgroups file, you run mahout seqwiki on the unzipped Wikipedia XML dump.
$ mahout seqwiki
The above command launches WikipediaToSequenceFile.java, which accepts a text file of categories and starts an MR job to parse each document in the XML file. The job extracts documents whose Wikipedia category tag matches a line in the category file (exactly, if the -exactMatchOnly option is set). If no match is found and the -all option is set, the document is placed in an "unknown" category. The documents are then written out as a <Text,Text> sequence file of the form (K: /category/document_title, V: document).
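The matching rule described above can be illustrated with a small shell sketch. It is not Mahout's implementation (the real work happens in WikipediaToSequenceFile.java inside an MR job). The function names match_category and make_key, the argument order, and the case-sensitive comparison are assumptions made purely for illustration.

```shell
# Illustrative only: mimics the category matching described above.
# Usage: match_category <category-tag> <category-file> <exact:true|false> <all:true|false>
# Prints the matched category (or "unknown") and returns 0;
# returns 1 when the document would be skipped.
match_category() (
  tag=$1 file=$2 exact=$3 all=$4
  while IFS= read -r line || [ -n "$line" ]; do
    [ -n "$line" ] || continue            # ignore blank lines
    if [ "$exact" = true ]; then
      # exact mode: the tag must equal a line of the category file
      if [ "$tag" = "$line" ]; then echo "$line"; return 0; fi
    else
      # default mode: the tag only needs to contain the category string
      case $tag in
        *"$line"*) echo "$line"; return 0 ;;
      esac
    fi
  done < "$file"
  # no match: keep the document only when "all" is requested
  if [ "$all" = true ]; then echo unknown; return 0; fi
  return 1
)

# Build a sequence-file key of the form /category/document_title.
make_key() {
  printf '/%s/%s\n' "$1" "$2"
}
```

With a category file containing the lines United States and United Kingdom, the tag "History of the United States" matches United States in the default mode but is not matched in exact mode. Whether the real job normalizes case or whitespace before comparing is not covered by this sketch.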
There are 3 different example category files available in the /examples/src/test/resources directory: country.txt, country10.txt and country2.txt. You can edit these categories to extract a different corpus from the Wikipedia dataset.
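As a rough sketch of what a custom category file might look like, the snippet below writes one category per line, which is the layout suggested by the statement that a category tag is matched against "a line in the category file". The file name my-categories.txt and the three categories are hypothetical and are not among the shipped examples.

```shell
# Hypothetical custom category file: one Wikipedia category per line.
CATEGORY_FILE="${TMPDIR:-/tmp}/my-categories.txt"
printf '%s\n' 'France' 'Germany' 'Japan' > "$CATEGORY_FILE"

# Show what was written.
cat "$CATEGORY_FILE"
```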
The CLI options for seqwiki are as follows:
    --input (-i)             input pathname String
    --output (-o)            the output pathname String
    --categories (-c)        the file containing the Wikipedia categories
    --exactMatchOnly (-e)    if set, then the Wikipedia category must match exactly instead of simply containing the category string
    --all (-all)             if set select all categories
    --removeLabels (-rl)     if set, remove [[Category:labels]] from document text after extracting label.
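A possible invocation that combines these options is sketched below as a dry run. The three paths are hypothetical placeholders, not locations used by classify-wikipedia.sh, and the sketch assumes that -e and -rl are plain switches that take no value, as their "if set" descriptions suggest.

```shell
# Hypothetical paths; replace them with your own locations.
WIKI_XML=/tmp/wikipedia/enwiki-articles.xml
SEQ_OUT=/tmp/wikipedia/wikipediainput
CATEGORIES=/tmp/wikipedia/country10.txt

# Dry run: echo prints the command instead of executing it.
# Remove the leading "echo" to actually launch the job.
echo mahout seqwiki -i "$WIKI_XML" -o "$SEQ_OUT" -c "$CATEGORIES" -e -rl
```

Printing the command first makes it easy to check the input, output, and category paths before starting an MR job over a large XML dump.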
After seqwiki, the script runs the remaining jobs, ending with testnb, as in the step-by-step 20 Newsgroups example. When all of the jobs have finished, a confusion matrix will be displayed.