Spark Naive Bayes

Intro

Mahout currently has two flavors of Naive Bayes. The first is standard Multinomial Naive Bayes. The second is an implementation of Transformed Weight-normalized Complement Naive Bayes as introduced by Rennie et al. [1]. We refer to the former as Bayes and the latter as CBayes.

While Bayes has long been a standard in text classification, CBayes is an extension of Bayes that performs particularly well on datasets with skewed classes and has been shown to be competitive with algorithms of higher complexity such as Support Vector Machines.

Implementations

The Mahout math-scala library has an implementation of both Bayes and CBayes, which is further optimized in the spark module. Currently, the Spark-optimized version provides CLI drivers for training and testing. Mahout Spark-Naive-Bayes models can also be trained, tested, and saved to the filesystem from the Mahout Spark Shell.

Preprocessing and Algorithm

As described in [1], Mahout Naive Bayes is broken down into the following steps (assignments are over all possible index values):

The main difference between Bayes and CBayes is the weight calculation step. Bayes weights a term more heavily the more likely it is to belong to class \(c\), whereas CBayes weights a term more heavily the less likely it is to belong to any class other than \(c\).
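The contrast between the two weight calculations can be illustrated with a small, self-contained sketch. This is not Mahout's code: it is a plain-Python toy that follows the weight definitions in Rennie et al. [1], and the function names (`train_weights`, `classify`), the uniform smoothing value `alpha_i = 1.0`, and the four-document dataset are all assumptions made for illustration. Class priors and the TF-IDF preprocessing are omitted to keep the example short.

```python
import math

def train_weights(docs, labels, alpha_i=1.0, complement=False):
    """Return {class: [w_ci for each term i]}.

    docs   : list of term-count vectors, all of the same length
    labels : class label for each document
    alpha_i: per-term smoothing value (assumed uniform here)
    """
    vocab = len(docs[0])
    alpha = alpha_i * vocab  # total smoothing mass
    weights = {}
    for c in sorted(set(labels)):
        # Bayes pools documents IN class c;
        # CBayes pools documents in every class EXCEPT c.
        rows = [d for d, y in zip(docs, labels) if (y != c) == complement]
        counts = [sum(r[i] for r in rows) for i in range(vocab)]
        total = sum(counts)
        theta = [(n + alpha_i) / (total + alpha) for n in counts]
        if complement:
            # A term that is rare outside c gets a large weight for c.
            w = [-math.log(t) for t in theta]
            # Weight normalization, as in the "weight-normalized" variant.
            norm = sum(abs(x) for x in w)
            w = [x / norm for x in w]
        else:
            # A term that is common inside c gets a large (less negative) weight.
            w = [math.log(t) for t in theta]
        weights[c] = w
    return weights

def classify(weights, doc):
    """Pick the class whose weights give the largest dot product with doc."""
    return max(weights, key=lambda c: sum(t * w for t, w in zip(doc, weights[c])))

# Toy corpus: three terms, two classes (hypothetical data).
docs = [[3, 0, 1], [2, 1, 0], [0, 3, 1], [0, 2, 2]]
labels = ["a", "a", "b", "b"]

bayes = train_weights(docs, labels, complement=False)
cbayes = train_weights(docs, labels, complement=True)

print(classify(bayes, [2, 0, 0]), classify(cbayes, [2, 0, 0]))
print(classify(bayes, [0, 2, 1]), classify(cbayes, [0, 2, 1]))
```

On this balanced toy corpus both variants agree on the labels; the difference in the complement estimate matters mainly when some classes have far fewer training documents than others, which is the skewed-class setting mentioned in the introduction.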

Running from the command line

Mahout provides CLI drivers for all of the above steps. Here we give a brief overview of the Mahout CLI commands used to preprocess the data, train the model, and assign labels to the training set. An example script is given for the full process, from data acquisition through classification, on the classic 20 Newsgroups corpus.

Command line options

Examples

  1. 20 Newsgroups classification
  2. Document classification with Naive Bayes in the Mahout shell

References