k-Means command-line introduction

This quick start page describes how to run the k-Means clustering algorithm on a Hadoop cluster.

Steps

Mahout’s k-Means clustering can be launched with the same command-line invocation whether you are running on a single machine in stand-alone mode or on a larger Hadoop cluster. The difference is determined by the $HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set and point to an operating Hadoop cluster on the target machine, the invocation runs k-Means on that cluster. If either environment variable is missing, the stand-alone Hadoop configuration is used instead.

./bin/mahout kmeans <OPTIONS>
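The environment rule above can be checked before launching a job. The following is a minimal sketch: mahout_mode is a hypothetical helper, not part of Mahout, and it only tests whether the two variables are set. It does not verify that they point to an operating cluster.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Mahout): reports which mode the
# rule described above would select, based only on the two variables.
mahout_mode() {
  if [ -n "${HADOOP_HOME:-}" ] && [ -n "${HADOOP_CONF_DIR:-}" ]; then
    echo "cluster"
  else
    echo "stand-alone"
  fi
}

echo "k-Means would run in $(mahout_mode) mode"
```

Unsetting either variable in the current shell is one way to force a stand-alone run for a quick local test.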

In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job will be generated in $MAHOUT_HOME/core/target/ and its name will contain the Mahout version number. For example, with the Mahout 0.3 release, the job will be mahout-core-0.3.job.
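The build step might be scripted along these lines. This is a sketch that assumes Maven is on the PATH and $MAHOUT_HOME is set. The function names build_mahout_job and mahout_job_path are hypothetical, and the file name simply follows the mahout-core-<version>.job pattern from the example above.

```shell
#!/bin/sh
# Build the job file (assumes mvn is installed and MAHOUT_HOME is set).
# The subshell keeps the caller's working directory unchanged.
build_mahout_job() {
  (cd "$MAHOUT_HOME" && mvn install)
}

# Print the expected job location for a given Mahout version,
# following the naming pattern described above.
mahout_job_path() {
  echo "$MAHOUT_HOME/core/target/mahout-core-$1.job"
}
```

For example, build_mahout_job && ls -l "$(mahout_job_path 0.3)" builds the project and then confirms that the 0.3 job file exists.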

Testing it on a single machine without a cluster

Running it on the cluster

Command line options

  --input (-i) input                           Path to job input directory.
                                               Must be a SequenceFile of
                                               VectorWritable
  --clusters (-c) clusters                     The input centroids, as Vectors.
                                               Must be a SequenceFile of
                                               Writable, Cluster/Canopy. If k
                                               is also specified, then a random
                                               set of vectors will be selected
                                               and written out to this path
                                               first
  --output (-o) output                         The directory pathname for
                                               output.
  --distanceMeasure (-dm) distanceMeasure      The classname of the
                                               DistanceMeasure. Default is
                                               SquaredEuclidean
  --convergenceDelta (-cd) convergenceDelta    The convergence delta value.
                                               Default is 0.5
  --maxIter (-x) maxIter                       The maximum number of
                                               iterations.
  --maxRed (-r) maxRed                         The number of reduce tasks.
                                               Defaults to 2
  --k (-k) k                                   The k in k-Means. If specified,
                                               then a random selection of k
                                               Vectors will be chosen as the
                                               Centroid and written to the
                                               clusters input path.
  --overwrite (-ow)                            If present, overwrite the output
                                               directory before running job
  --help (-h)                                  Print out help
  --clustering (-cl)                           If present, run clustering after
                                               the iterations have taken place
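
As an illustration of how the options above combine, the sketch below assembles a k-Means invocation and prints it for review instead of running it. The three paths are hypothetical placeholders, and the values for -k, -x and -cd are arbitrary examples. No distance measure is passed, so the default listed above applies.

```shell
#!/bin/sh
# Hypothetical paths; replace them with your own directories.
INPUT=${INPUT:-/user/demo/vectors}
CLUSTERS=${CLUSTERS:-/user/demo/initial-clusters}
OUTPUT=${OUTPUT:-/user/demo/kmeans-output}

# Assemble the command line: pick 10 random input vectors as the initial
# centroids (-k), allow at most 20 iterations (-x), use a convergence
# delta of 0.5 (-cd), overwrite existing output (-ow) and run the
# clustering step after the iterations (-cl).
kmeans_command() {
  echo ./bin/mahout kmeans \
    -i "$INPUT" -c "$CLUSTERS" -o "$OUTPUT" \
    -k 10 -x 20 -cd 0.5 -ow -cl
}

# Dry run: print the command so it can be checked before use.
kmeans_command
```

Because -k is given, the random centroids are written to the clusters path first, so that directory does not need to be populated beforehand. The printed command is meant to be run from $MAHOUT_HOME.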