Spark Naive Bayes
Intro
Mahout currently has two flavors of Naive Bayes. The first is standard Multinomial Naive Bayes. The second is an implementation of Transformed Weight-normalized Complement Naive Bayes as introduced by Rennie et al. [1]. We refer to the former as Bayes and the latter as CBayes.
Where Bayes has long been a standard in text classification, CBayes is an extension of Bayes that performs particularly well on datasets with skewed classes and has been shown to be competitive with algorithms of higher complexity such as Support Vector Machines.
Implementations
The mahout math-scala library has an implementation of both Bayes and CBayes, which is further optimized in the spark module. Currently the Spark-optimized version provides CLI drivers for training and testing. Mahout SparkNaiveBayes models can also be trained, tested and saved to the filesystem from the Mahout Spark Shell.
Preprocessing and Algorithm
As described in [1], Mahout Naive Bayes is broken down into the following steps (assignments are over all possible index values):

- Let \(\vec{d}=(\vec{d_1},...,\vec{d_n})\) be a set of documents; \(d_{ij}\) is the count of word \(i\) in document \(j\).
- Let \(\vec{y}=(y_1,...,y_n)\) be their labels.
- Let \(\alpha_i\) be a smoothing parameter for all words in the vocabulary; let \(\alpha=\sum_i{\alpha_i}\).
- Preprocessing (via seq2sparse): TF-IDF transformation and L2 length normalization of \(\vec{d}\):
  1. \(d_{ij} = \sqrt{d_{ij}}\)
  2. \(d_{ij} = d_{ij}\left(\log{\frac{\sum_k 1}{\sum_k\delta_{ik}+1}}+1\right)\)
  3. \(d_{ij} = \frac{d_{ij}}{\sqrt{\sum_k{d_{kj}^2}}}\)
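The preprocessing transform above (square-root damping, TF-IDF-style reweighting, then L2 length normalization) can be sketched in plain Python on a tiny in-memory term-count matrix; the layout and function name are illustrative, not Mahout's API:

```python
import math

def preprocess(counts):
    """counts[j][i] = raw count of word i in document j (documents as rows)."""
    n_docs, n_words = len(counts), len(counts[0])
    # Step 1: dampen raw counts with a square root.
    d = [[math.sqrt(c) for c in row] for row in counts]
    # Step 2: TF-IDF-style reweighting, where df[i] counts the documents
    # containing word i (the sum of delta_ik over documents k).
    df = [sum(1 for j in range(n_docs) if counts[j][i] > 0) for i in range(n_words)]
    d = [[d[j][i] * (math.log(n_docs / (df[i] + 1)) + 1) for i in range(n_words)]
         for j in range(n_docs)]
    # Step 3: L2 length normalization of each document vector.
    out = []
    for row in d:
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        out.append([x / norm for x in row])
    return out
```

After this transform each nonempty document row has unit L2 length, matching the third preprocessing equation.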
- Training: Bayes \((\vec{d},\vec{y})\): calculate term weights \(w_{ci}\) as:
  1. \(\hat\theta_{ci}=\frac{d_{ic}+\alpha_i}{\sum_k{d_{kc}}+\alpha}\)
  2. \(w_{ci}=\log{\hat\theta_{ci}}\)
- Training: CBayes \((\vec{d},\vec{y})\): calculate term weights \(w_{ci}\) as:
  1. \(\hat\theta_{ci} = \frac{\sum_{j:y_j\neq c}d_{ij}+\alpha_i}{\sum_{j:y_j\neq c}{\sum_k{d_{kj}}}+\alpha}\)
  2. \(w_{ci}=\log{\hat\theta_{ci}}\)
  3. \(w_{ci}=\frac{w_{ci}}{\sum_i \lvert w_{ci}\rvert}\)
- Label Assignment/Testing: Let \(\vec{t}=(t_1,...,t_n)\) be a test document; let \(t_i\) be the count of word \(i\). Label the document according to \(l(t)=\arg\max_c \sum\limits_{i} t_i w_{ci}\).
As we can see, the main difference between Bayes and CBayes is the weight calculation step. Where Bayes weights terms more heavily based on the likelihood that they belong to class \(c\), CBayes derives the weights for class \(c\) from the complement, i.e. from the documents of every class other than \(c\).
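Under the same small in-memory layout, the weight calculation and label assignment steps can be sketched as follows; this is a plain-Python illustration with a uniform smoothing parameter, not Mahout's implementation:

```python
import math

def train(d, y, alpha_i=1.0, complementary=False):
    """d: preprocessed doc-term rows; y: class label per document."""
    n_words = len(d[0])
    alpha = alpha_i * n_words  # alpha = sum_i alpha_i for a uniform alpha_i
    weights = {}
    for c in set(y):
        # Bayes sums counts over documents in class c;
        # CBayes sums over the complement (documents NOT in class c).
        rows = [row for row, label in zip(d, y)
                if (label != c) == complementary]
        per_word = [sum(row[i] for row in rows) for i in range(n_words)]
        total = sum(per_word)
        w_c = [math.log((per_word[i] + alpha_i) / (total + alpha))
               for i in range(n_words)]
        if complementary:
            # CBayes weight normalization: w_ci / sum_i |w_ci|.
            z = sum(abs(x) for x in w_c)
            w_c = [x / z for x in w_c]
        weights[c] = w_c
    return weights

def classify(weights, t):
    # l(t) = argmax_c sum_i t_i * w_ci
    return max(weights, key=lambda c: sum(ti * wi for ti, wi in zip(t, weights[c])))
```

Note that with `complementary=True` the weights score the complement of each class, which is where CBayes's robustness to skewed classes comes from.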
Running from the command line

Mahout provides CLI drivers for all of the above steps. Here we give a brief overview of the Mahout CLI commands used to preprocess the data, train the model, and assign labels to the training set. An example script is given for the full process, from data acquisition through classification, of the classic 20 Newsgroups corpus.

Preprocessing: For a set of Sequence File formatted documents in PATH_TO_SEQUENCE_FILES, the mahout seq2sparse command performs the TF-IDF transformations (-wt tfidf option) and L2 length normalization (-n 2 option) as follows:
$ mahout seq2sparse -i ${PATH_TO_SEQUENCE_FILES} -o ${PATH_TO_TFIDF_VECTORS} -nv -n 2 -wt tfidf

Training: The model is then trained using mahout spark-trainnb. The default is to train a Bayes model. The -c option is given to train a CBayes model:
$ mahout spark-trainnb -i ${PATH_TO_TFIDF_VECTORS} -o ${PATH_TO_MODEL} -ow -c

Label Assignment/Testing: Classification and testing on a holdout set can then be performed via mahout spark-testnb. Again, the -c option indicates that the model is CBayes:
$ mahout spark-testnb -i ${PATH_TO_TFIDF_TEST_VECTORS} -m ${PATH_TO_MODEL} -c
Command line options

Preprocessing (note: still reliant on MapReduce seq2sparse):
Only the parameters relevant to Bayes/CBayes as detailed above are shown. Several other transformations can be performed by mahout seq2sparse and used as input to Bayes/CBayes. For a full list of mahout seq2sparse options, see the Creating vectors from text page.
$ mahout seq2sparse
  --output (-o) output             The directory pathname for output.
  --input (-i) input               Path to job input directory.
  --weight (-wt) weight            The kind of weight to use. Currently TF or TFIDF. Default: TFIDF
  --norm (-n) norm                 The norm to use, expressed as either a float or "INF" if you want to use the Infinite norm. Must be greater or equal to 0. The default is not to normalize.
  --overwrite (-ow)                If set, overwrite the output directory.
  --sequentialAccessVector (-seq)  (Optional) Whether output vectors should be SequentialAccessVectors. If set true else false.
  --namedVector (-nv)              (Optional) Whether output vectors should be NamedVectors. If set true else false.

Training:
$ mahout spark-trainnb
  --input (-i) input            Path to job input directory.
  --output (-o) output          The directory pathname for output.
  --trainComplementary (-c)     Train complementary? Default is false.
  --master (-ma)                Spark Master URL (optional). Default: "local". Note that you can specify the number of cores to get a performance improvement, for example "local[4]".
  --help (-h)                   Print out help.

Testing:
$ mahout spark-testnb
  --input (-i) input            Path to job input directory.
  --model (-m) model            The path to the model built during training.
  --testComplementary (-c)      Test complementary? Default is false.
  --master (-ma)                Spark Master URL (optional). Default: "local". Note that you can specify the number of cores to get a performance improvement, for example "local[4]".
  --help (-h)                   Print out help.
Examples
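Putting the commands above together, a minimal end-to-end sketch for a corpus such as 20 Newsgroups might look like the following; the paths are placeholders, a configured Mahout installation with Spark is assumed, and data acquisition and sequence-file conversion are not shown:

```shell
# Placeholder paths; adjust to your environment.
export PATH_TO_SEQUENCE_FILES=/tmp/20news-seq
export PATH_TO_TFIDF_VECTORS=/tmp/20news-tfidf
export PATH_TO_MODEL=/tmp/20news-model

# Preprocess: TF-IDF weighting (-wt tfidf) and L2 length normalization (-n 2).
mahout seq2sparse -i ${PATH_TO_SEQUENCE_FILES} -o ${PATH_TO_TFIDF_VECTORS} -nv -n 2 -wt tfidf

# Train a CBayes model (omit -c for standard Bayes).
mahout spark-trainnb -i ${PATH_TO_TFIDF_VECTORS} -o ${PATH_TO_MODEL} -ow -c

# Test the model (here on the training vectors themselves).
mahout spark-testnb -i ${PATH_TO_TFIDF_VECTORS} -m ${PATH_TO_MODEL} -c
```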
References

[1]: Jason D. M. Rennie, Lawrence Shih, Jaime Teevan, David R. Karger (2003). Tackling the Poor Assumptions of Naive Bayes Text Classifiers. Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003).