This quick start page describes how to run the kMeans clustering algorithm on a Hadoop cluster.
Mahout’s k-Means clustering can be launched from the same command line invocation whether you are running on a single machine in stand-alone mode or on a larger Hadoop cluster. The difference is determined by the $HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set to an operating Hadoop cluster on the target machine then the invocation will run k-Means on that cluster. If either of the environment variables are missing then the stand-alone Hadoop configuration will be invoked instead.
./bin/mahout kmeans <OPTIONS>
In $MAHOUT_HOME/, build the jar containing the job (mvn install) The job will be generated in $MAHOUT_HOME/core/target/ and it’s name will contain the Mahout version number. For example, when using Mahout 0.3 release, the job will be mahout-core-0.3.job
Run the Job:
./bin/mahout kmeans -i testdata -o output -c clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cd 1 -k 25
Run the Job:
--input (-i) input Path to job input directory. Must be a SequenceFile of VectorWritable --clusters (-c) clusters The input centroids, as Vectors. Must be a SequenceFile of Writable, Cluster/Canopy. If k is also specified, then a random set of vectors will be selected and written out to this path first --output (-o) output The directory pathname for output. --distanceMeasure (-dm) distanceMeasure The classname of the DistanceMeasure. Default is SquaredEuclidean --convergenceDelta (-cd) convergenceDelta The convergence delta value. Default is 0.5 --maxIter (-x) maxIter The maximum number of iterations. --maxRed (-r) maxRed The number of reduce tasks. Defaults to 2 --k (-k) k The k in k-Means. If specified, then a random selection of k Vectors will be chosen as the Centroid and written to the clusters input path. --overwrite (-ow) If present, overwrite the output directory before running job --help (-h) Print out help --clustering (-cl) If present, run clustering after the iterations have taken place