Running Canopy Clustering from the Command Line

Mahout’s Canopy clustering is launched with the same command-line invocation whether you are running on a single machine in stand-alone mode or on a larger Hadoop cluster. The mode is determined by the $HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set and point to an operating Hadoop cluster on the target machine, the invocation runs Canopy on that cluster. If either variable is missing, the stand-alone Hadoop configuration is used instead.
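
The rule above can be sketched as a small shell check to run before launching a job. This is only an illustration: the function name mahout_mode is hypothetical and is not part of Mahout, and the check only tests whether the two variables are set and non-empty. It cannot tell whether they point to an operating Hadoop cluster.

```shell
# Hypothetical helper (not part of Mahout): report which mode the
# mahout script would be expected to use under the rule described above.
mahout_mode() {
  # Both variables must be set and non-empty for cluster mode.
  if [ -n "${HADOOP_HOME:-}" ] && [ -n "${HADOOP_CONF_DIR:-}" ]; then
    echo "cluster"
  else
    echo "stand-alone"
  fi
}

# Print the mode implied by the current environment.
mahout_mode
```

Unsetting either variable in the current shell (for example with unset HADOOP_CONF_DIR) should make the helper report stand-alone mode.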

./bin/mahout canopy <OPTIONS>

Testing it on a single machine without a cluster

Running it on the cluster

Command line options

  --input (-i) input                         Path to job input directory. Must
                                             be a SequenceFile of
                                             VectorWritable
  --output (-o) output                       The directory pathname for output.
  --overwrite (-ow)                          If present, overwrite the output
                                             directory before running job
  --distanceMeasure (-dm) distanceMeasure    The classname of the
                                             DistanceMeasure. Default is
                                             SquaredEuclidean
  --t1 (-t1) t1                              T1 threshold value
  --t2 (-t2) t2                              T2 threshold value
  --clustering (-cl)                         If present, run clustering after
                                             the iterations have taken place
  --help (-h)                                Print out help
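
As a rough illustration of how the options listed above combine, the sketch below wraps a full invocation in a shell function. The function name run_canopy, the MAHOUT_BIN variable, and the example paths and threshold values are assumptions made for this example and do not come from Mahout. The distance measure class name is given in fully qualified form, which is assumed to be what --distanceMeasure expects. Canopy clustering conventionally uses a T1 that is larger than T2.

```shell
# Hypothetical wrapper (not part of Mahout) around the canopy invocation.
# MAHOUT_BIN is an assumed override; it defaults to ./bin/mahout.
run_canopy() {
  input_dir=$1    # directory holding a SequenceFile of VectorWritable
  output_dir=$2   # directory that receives the job output
  t1=$3           # T1 threshold value
  t2=$4           # T2 threshold value

  "${MAHOUT_BIN:-./bin/mahout}" canopy \
    --input "$input_dir" \
    --output "$output_dir" \
    --distanceMeasure org.apache.mahout.common.distance.EuclideanDistanceMeasure \
    --t1 "$t1" \
    --t2 "$t2" \
    --overwrite \
    --clustering
}

# Example call; the paths and thresholds are placeholders:
# run_canopy testdata/vectors output/canopies 3.0 1.5
```

Setting MAHOUT_BIN=echo before calling the function prints the assembled arguments instead of starting a job, which is a cheap way to check the command line first.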