kMeans commandline introduction

This quick start page describes how to run the kMeans clustering algorithm from the command line, either in stand-alone mode or on a Hadoop cluster.

Steps

Mahout's k-Means clustering is launched with the same command line invocation whether you are running on a single machine in stand-alone mode or on a larger Hadoop cluster. The mode is determined by the $HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both point to a working Hadoop cluster on the target machine, the invocation runs k-Means on that cluster. If either environment variable is missing, the stand-alone Hadoop configuration is used instead.

./bin/mahout kmeans <OPTIONS>
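
The rule above can be sketched as a small shell check. This is only an illustration of the rule as stated on this page, not the logic inside bin/mahout itself; the function name mahout_mode is hypothetical, and treating an empty variable as "missing" is an assumption.

```shell
# Hypothetical helper: reports which mode the rule described above would select.
# Assumption: an empty variable counts as missing.
mahout_mode() {
  if [ -n "${HADOOP_HOME:-}" ] && [ -n "${HADOOP_CONF_DIR:-}" ]; then
    echo "cluster"
  else
    echo "stand-alone"
  fi
}

mahout_mode
```

Running it in the shell you plan to launch Mahout from shows whether both variables are visible there.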

In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job will be generated in $MAHOUT_HOME/core/target/ and its name will contain the Mahout version number. For example, when using the Mahout 0.3 release, the job will be mahout-core-0.3.job.
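
To confirm that the build produced the job file, something like the following sketch may help. The function name find_mahout_job is hypothetical, and the pattern mahout-core-*.job is an assumption generalised from the mahout-core-0.3.job example above.

```shell
# Hypothetical helper: prints the first job file found in the build output directory.
# Assumption: job files are named mahout-core-<version>.job, as in the 0.3 example.
find_mahout_job() {
  target_dir=${1:-$MAHOUT_HOME/core/target}
  for f in "$target_dir"/mahout-core-*.job; do
    if [ -f "$f" ]; then
      echo "$f"
      return 0
    fi
  done
  echo "no job file found in $target_dir; has mvn install been run?" >&2
  return 1
}
```

Calling find_mahout_job with no argument looks in $MAHOUT_HOME/core/target; a directory can be passed explicitly instead.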

Testing it on a single machine without a cluster

  • Put the data: cp testdata
  • Run the Job:

    ./bin/mahout kmeans -i testdata -o output -c clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cd 1 -k 25

Running it on the cluster

  • (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
  • Put the data: $HADOOP_HOME/bin/hadoop fs -put testdata
  • Run the Job:

    export HADOOP_HOME=
    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
    ./bin/mahout kmeans -i testdata -o output -c clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cd 1 -k 25

  • Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output to view all outputs.
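
Before launching the job on a cluster, it can be useful to check that the environment is in place. The sketch below is one possible pre-flight check; the function name check_hadoop_env is hypothetical, and the specific checks (an executable at $HADOOP_HOME/bin/hadoop and an existing configuration directory) are assumptions based on the paths used in the steps above.

```shell
# Hypothetical pre-flight check for the cluster steps above.
# Returns 0 if the environment looks usable, 1 otherwise.
check_hadoop_env() {
  if [ -z "${HADOOP_HOME:-}" ] || [ -z "${HADOOP_CONF_DIR:-}" ]; then
    echo "HADOOP_HOME and HADOOP_CONF_DIR must both be set" >&2
    return 1
  fi
  if [ ! -x "$HADOOP_HOME/bin/hadoop" ]; then
    echo "no executable found at $HADOOP_HOME/bin/hadoop" >&2
    return 1
  fi
  if [ ! -d "$HADOOP_CONF_DIR" ]; then
    echo "HADOOP_CONF_DIR is not a directory: $HADOOP_CONF_DIR" >&2
    return 1
  fi
  return 0
}
```

A typical use would be to run check_hadoop_env first and start ./bin/mahout kmeans only if it succeeds.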

Command line options

  --input (-i) input                          Path to job input directory.
                                              Must be a SequenceFile of
                                              VectorWritable
  --clusters (-c) clusters                    The input centroids, as Vectors.
                                              Must be a SequenceFile of
                                              Writable, Cluster/Canopy. If k
                                              is also specified, then a random
                                              set of vectors will be selected
                                              and written out to this path
                                              first
  --output (-o) output                        The directory pathname for
                                              output.
  --distanceMeasure (-dm) distanceMeasure     The classname of the
                                              DistanceMeasure. Default is
                                              SquaredEuclidean
  --convergenceDelta (-cd) convergenceDelta   The convergence delta value.
                                              Default is 0.5
  --maxIter (-x) maxIter                      The maximum number of
                                              iterations.
  --maxRed (-r) maxRed                        The number of reduce tasks.
                                              Defaults to 2
  --k (-k) k                                  The k in k-Means. If specified,
                                              then a random selection of k
                                              Vectors will be chosen as the
                                              Centroid and written to the
                                              clusters input path.
  --overwrite (-ow)                           If present, overwrite the output
                                              directory before running job
  --help (-h)                                 Print out help
  --clustering (-cl)                          If present, run clustering after
                                              the iterations have taken place
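
As an illustration of how the short and long option names pair up, the example invocation used earlier on this page can be spelled with the long names from the table. The sketch only prints the command rather than running it; the function name kmeans_long_form is hypothetical.

```shell
# Hypothetical helper: prints (does not run) the example kmeans invocation
# with every short option replaced by its long name from the table above.
kmeans_long_form() {
  echo "./bin/mahout kmeans" \
       "--input testdata" \
       "--output output" \
       "--clusters clusters" \
       "--distanceMeasure org.apache.mahout.common.distance.CosineDistanceMeasure" \
       "--maxIter 5" \
       "--overwrite" \
       "--convergenceDelta 1" \
       "--k 25"
}

kmeans_long_form
```

Because --k is given, the table indicates that 25 randomly selected vectors would be written to the clusters path before the iterations start.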