Naive Bayes
Naive Bayes is an algorithm that can be used to classify objects into
usually binary categories. It is one of the most common learning algorithms
in spam filters. Despite its simplicity and rather naive assumptions it has
proven to work surprisingly well in practice.
Before applying the algorithm, the objects to be classified need to be
represented by numerical features. In the case of e-mail spam each feature
might indicate whether some specific word is present or absent in the mail
to classify. The algorithm comes in two phases: Learning and application.
During learning, a set of feature vectors is given to the algorithm, each
vector labeled with the class the object it represents, belongs to. From
that it is deduced which combination of features appears with high
probability in spam messages. Given this information, during application
one can easily compute the probability of a new message being either spam
or not.
The algorithm does make several assumptions, that are not true for most
datasets, but make computations easier. The worst probably being, that all
features of an objects are considered independent. In practice, that means,
given the phrase “Statue of Liberty” was already found in a text, does not
influence the probability of seeing the phrase “New York” as well.
Strategy for a parallel Naive Bayes
See https://issues.apache.org/jira/browse/MAHOUT-9
.
Examples
20Newsgroups
- Example code showing how to train and use the Naive Bayes classifier
using the 20 Newsgroups data available at [http://people.csail.mit.edu/jrennie/20Newsgroups/]