Running Fuzzy k-Means Clustering from the Command Line

Mahout's Fuzzy k-Means clustering can be launched with the same command line invocation whether you are running on a single machine in stand-alone mode or on a larger Hadoop cluster. The mode is determined by the $HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set and point to an operating Hadoop cluster on the target machine, the invocation runs Fuzzy k-Means on that cluster. If either variable is missing, the stand-alone Hadoop configuration is used instead.
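The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this section, not the logic of the bin/mahout script itself. The function name mahout_run_mode is hypothetical, treating an empty variable as missing is an assumption, and the sketch does not check that the cluster is actually operating.

```python
import os


def mahout_run_mode(env=None):
    """Return 'cluster' if both Hadoop variables are set, else 'stand-alone'."""
    # Use the real process environment unless a mapping is passed in.
    env = os.environ if env is None else env
    required = ("HADOOP_HOME", "HADOOP_CONF_DIR")
    # Assumption: a variable that is unset or empty counts as missing.
    if all(env.get(name) for name in required):
        return "cluster"
    return "stand-alone"
```

Calling mahout_run_mode() with no argument inspects the current shell environment, which can be a quick sanity check before launching a long job.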

./bin/mahout fkmeans <OPTIONS>
  • In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job will be generated in $MAHOUT_HOME/core/target/ and its name will contain the Mahout version number. For example, when using the Mahout 0.3 release, the job will be mahout-core-0.3.job.

Testing it on a single machine without a cluster

  • Put the data: cp testdata
  • Run the Job:

    ./bin/mahout fkmeans -i testdata

Running it on the cluster

  • (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
  • Put the data: $HADOOP_HOME/bin/hadoop fs -put testdata
  • Run the Job:

    export HADOOP_HOME=
    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
    ./bin/mahout fkmeans -i testdata

  • Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output to view all outputs.
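The run commands above pass only -i. As a rough sketch, a fuller invocation can be assembled from the flags listed under "Command line options". The helper name fkmeans_command and the example paths are hypothetical, and this page does not say which options a given Mahout release requires, so treat the selection of flags as an assumption and check --help.

```python
import shlex


def fkmeans_command(input_dir, output_dir, clusters_dir, k, m, max_iter,
                    convergence_delta=0.5, overwrite=True):
    """Build an argument list for ./bin/mahout fkmeans (illustrative only)."""
    # The options list states that m must be greater than 1.
    if m <= 1:
        raise ValueError("m must be greater than 1")
    args = [
        "./bin/mahout", "fkmeans",
        "-i", input_dir,               # --input
        "-c", clusters_dir,            # --clusters
        "-o", output_dir,              # --output
        "-k", str(k),                  # --k
        "-m", str(m),                  # --m
        "-x", str(max_iter),           # --maxIter
        "-cd", str(convergence_delta), # --convergenceDelta
    ]
    if overwrite:
        args.append("-ow")             # --overwrite
    return args


# Hypothetical paths; print the command instead of running it.
print(shlex.join(fkmeans_command("testdata", "output", "clusters",
                                 k=3, m=2.0, max_iter=10)))
```

The list form can be handed to subprocess.run unchanged, which avoids shell quoting problems when paths contain spaces.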

Command line options

  --input (-i) input                         Path to job input directory. Must be a
                                             SequenceFile of VectorWritable
  --clusters (-c) clusters                   The input centroids, as Vectors. Must be
                                             a SequenceFile of Writable,
                                             Cluster/Canopy. If k is also specified,
                                             then a random set of vectors will be
                                             selected and written out to this path
                                             first
  --output (-o) output                       The directory pathname for output.
  --distanceMeasure (-dm) distanceMeasure    The classname of the DistanceMeasure.
                                             Default is SquaredEuclidean
  --convergenceDelta (-cd) convergenceDelta  The convergence delta value. Default is
                                             0.5
  --maxIter (-x) maxIter                     The maximum number of iterations.
  --k (-k) k                                 The k in k-Means. If specified, then a
                                             random selection of k Vectors will be
                                             chosen as the Centroid and written to
                                             the clusters input path.
  --m (-m) m                                 Coefficient normalization factor, must
                                             be greater than 1
  --overwrite (-ow)                          If present, overwrite the output
                                             directory before running job
  --help (-h)                                Print out help
  --numMap (-u) numMap                       The number of map tasks. Defaults to 10
  --maxRed (-r) maxRed                       The number of reduce tasks. Defaults to 2
  --emitMostLikely (-e) emitMostLikely       True if clustering should emit the most
                                             likely point only, false for threshold
                                             clustering. Default is true
  --threshold (-t) threshold                 The pdf threshold used for cluster
                                             determination. Default is 0
  --clustering (-cl)                         If present, run clustering after the
                                             iterations have taken place
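
To make the roles of m, emitMostLikely and threshold more concrete, here is a minimal sketch of the textbook fuzzy c-means membership formula followed by a simple assignment step. It is not Mahout's implementation. It assumes plain Euclidean distance, the function names memberships and assign are hypothetical, and the reading of threshold clustering (keep every cluster whose membership is strictly above the threshold) is an assumption based on the option descriptions.

```python
import math


def memberships(point, centroids, m=2.0):
    """Textbook fuzzy membership of one point in each cluster."""
    if m <= 1:
        raise ValueError("m must be greater than 1")
    dists = [math.dist(point, c) for c in centroids]
    # A point lying exactly on one or more centroids belongs fully to them.
    zeros = [i for i, d in enumerate(dists) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if i in zeros else 0.0
                for i in range(len(dists))]
    exponent = 2.0 / (m - 1.0)
    # u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1)); the values sum to 1.
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
            for d_i in dists]


def assign(weights, emit_most_likely=True, threshold=0.0):
    """Pick cluster indices from membership weights (assumed semantics)."""
    if emit_most_likely:
        # Only the single most likely cluster is emitted.
        return [max(range(len(weights)), key=weights.__getitem__)]
    # Threshold clustering: every cluster above the threshold is emitted.
    return [i for i, w in enumerate(weights) if w > threshold]


# A point at distance 1 from the first centroid and 3 from the second.
w = memberships((1.0, 0.0), [(0.0, 0.0), (4.0, 0.0)], m=2.0)
print(w)                                                # approximately [0.9, 0.1]
print(assign(w))                                        # [0]
print(assign(w, emit_most_likely=False, threshold=0.05))  # [0, 1]
```

In this formula, values of m close to 1 push the memberships toward a hard assignment, while larger values of m spread them more evenly across clusters.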