Running Canopy Clustering from the Command Line

Mahout's Canopy clustering is launched with the same command-line invocation whether you are running on a single machine in stand-alone mode or on a larger Hadoop cluster. The difference is determined by the $HADOOP_HOME and $HADOOP_CONF_DIR environment variables: if both point at a working Hadoop cluster on the target machine, the invocation runs Canopy on that cluster. If either variable is missing, the stand-alone Hadoop configuration is used instead.
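The selection rule above can be summarized as follows (illustrative Python, not part of Mahout; the function name is made up for this sketch):

```python
import os

def hadoop_mode(env=os.environ):
    """Illustration of the driver's rule described above: run on the
    cluster only when BOTH variables are set; otherwise stand-alone."""
    if env.get("HADOOP_HOME") and env.get("HADOOP_CONF_DIR"):
        return "cluster"
    return "standalone"
```

For example, setting only HADOOP_HOME without HADOOP_CONF_DIR still yields stand-alone execution.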

./bin/mahout canopy <OPTIONS>
  • In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job will be generated in $MAHOUT_HOME/core/target/ and its name will contain the Mahout version number. For example, with the Mahout 0.3 release, the job file is mahout-core-0.3.job.

Testing it on a single machine without a cluster

  • Put the data: cp <PATH TO DATA> testdata
  • Run the Job:

    ./bin/mahout canopy -i testdata -o output -dm org.apache.mahout.common.distance.CosineDistanceMeasure -ow -t1 5 -t2 2

Running it on the cluster

  • (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
  • Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
  • Run the Job:

    export HADOOP_HOME=<path to your Hadoop installation>
    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
    ./bin/mahout canopy -i testdata -o output -dm org.apache.mahout.common.distance.CosineDistanceMeasure -ow -t1 5 -t2 2

  • Get the data out of HDFS and have a look. Use $HADOOP_HOME/bin/hadoop fs -lsr output to list all of the output files.

Command line options

  --input (-i) input                       Path to the job input directory. Must
                                           be a SequenceFile of VectorWritable.
  --output (-o) output                     The directory pathname for output.
  --overwrite (-ow)                        If present, overwrite the output
                                           directory before running the job.
  --distanceMeasure (-dm) distanceMeasure  The classname of the DistanceMeasure.
                                           Default is SquaredEuclidean.
  --t1 (-t1) t1                            T1 threshold value.
  --t2 (-t2) t2                            T2 threshold value.
  --clustering (-cl)                       If present, run clustering after the
                                           iterations have taken place.
  --help (-h)                              Print out help.
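The two thresholds govern how canopies form: every point within T1 of a canopy center joins that canopy, while points within the tighter T2 are also removed from the pool of candidate centers, which is why T1 should be larger than T2 (as in the examples above, where T1 = 5 and T2 = 2). A minimal single-pass sketch of this rule (plain Python, not Mahout code; the cosine distance mirrors the idea behind CosineDistanceMeasure):

```python
import math

def cosine_distance(a, b):
    """1 minus cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def canopy(points, t1, t2, dist):
    """Single-pass canopy clustering sketch. T1 is the loose (outer)
    threshold, T2 the tight (inner) one; T1 > T2 is expected."""
    canopies = []            # list of (center, members)
    remaining = list(points)
    while remaining:
        center = remaining.pop(0)
        members = [center]
        survivors = []
        for p in remaining:
            d = dist(center, p)
            if d < t1:
                members.append(p)   # within T1: joins this canopy
            if not d < t2:
                survivors.append(p) # outside T2: may still seed a canopy
        canopies.append((center, members))
        remaining = survivors
    return canopies
```

With 1-D points [0, 1, 8, 9], T1 = 3, and T2 = 1.5 under absolute difference, this produces two canopies, {0, 1} and {8, 9}; note that canopies may overlap when a point falls within T1 of several centers.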