Running Fuzzy k-Means Clustering from the Command Line
Mahout's Fuzzy k-Means clustering is launched with the same command line invocation whether you are running on a single machine in stand-alone mode or on a larger Hadoop cluster. The mode is determined by the $HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set and point to a working Hadoop cluster on the target machine, the invocation runs Fuzzy k-Means on that cluster. If either variable is missing, the stand-alone Hadoop configuration is used instead.
./bin/mahout fkmeans <OPTIONS>
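The mode selection described above can be sketched as a small shell check. This is only an illustration of the stated rule, not the logic of the actual bin/mahout script; the function name mahout_run_mode is hypothetical, and the sketch only tests whether the variables are set, not whether the cluster they point to is operating.

```shell
# Hypothetical helper, not part of Mahout: reports which mode the rule
# described above would select, based only on whether both variables are set.
mahout_run_mode() {
  if [ -n "${HADOOP_HOME:-}" ] && [ -n "${HADOOP_CONF_DIR:-}" ]; then
    echo "cluster"
  else
    echo "stand-alone"
  fi
}

echo "fkmeans would run in $(mahout_run_mode) mode"
```

Running this before ./bin/mahout fkmeans is a quick way to confirm which environment the job is likely to use.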
- In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job will be generated in $MAHOUT_HOME/core/target/ and its name will contain the Mahout version number. For example, when using the Mahout 0.3 release, the job will be mahout-core-0.3.job.
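To check that the build produced a job file, a sketch like the following may help. The helper name find_mahout_job is hypothetical, and the glob mahout-core-*.job is an assumption generalized from the mahout-core-0.3.job example; adjust it if your version names the file differently.

```shell
# Hypothetical helper, not part of Mahout: lists job files under
# <mahout home>/core/target/ whose names match mahout-core-<version>.job.
find_mahout_job() {
  # $1 = Mahout home directory (falls back to $MAHOUT_HOME, then ".")
  home=${1:-${MAHOUT_HOME:-.}}
  for f in "$home"/core/target/mahout-core-*.job; do
    # An unmatched glob stays literal, so confirm the file really exists.
    if [ -e "$f" ]; then
      echo "$f"
    fi
  done
  return 0
}
```

Calling find_mahout_job with no argument after mvn install prints one path per job file found, or nothing if the build did not produce one.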
Testing it on a single machine without a cluster
- Put the data: cp
Run the Job:
./bin/mahout fkmeans -i testdata
Running it on the cluster
- (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
- Put the data: $HADOOP_HOME/bin/hadoop fs -put
Run the Job:
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
./bin/mahout fkmeans -i testdata
Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output to view all outputs.
Command line options
- --input (-i) input: Path to job input directory. Must be a SequenceFile of VectorWritable.
- --clusters (-c) clusters: The input centroids, as Vectors. Must be a SequenceFile of Writable, Cluster/Canopy. If k is also specified, then a random set of vectors will be selected and written out to this path first.
- --output (-o) output: The directory pathname for output.
- --distanceMeasure (-dm) distanceMeasure: The classname of the DistanceMeasure. Default is SquaredEuclidean.
- --convergenceDelta (-cd) convergenceDelta: The convergence delta value. Default is 0.5.
- --maxIter (-x) maxIter: The maximum number of iterations.
- --k (-k) k: The k in k-Means. If specified, then a random selection of k Vectors will be chosen as the Centroid and written to the clusters input path.
- --m (-m) m: Coefficient normalization factor. Must be greater than 1.
- --overwrite (-ow): If present, overwrite the output directory before running job.
- --help (-h): Print out help.
- --numMap (-u) numMap: The number of map tasks. Defaults to 10.
- --maxRed (-r) maxRed: The number of reduce tasks. Defaults to 2.
- --emitMostLikely (-e) emitMostLikely: True if clustering should emit the most likely point only, false for threshold clustering. Default is true.
- --threshold (-t) threshold: The pdf threshold used for cluster determination. Default is 0.
- --clustering (-cl): If present, run clustering after the iterations have taken place.
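As an illustration of how these options combine, the sketch below assembles a fuller invocation and prints it without running it. The helper name build_fkmeans_cmd, the directory names, and the values for -k, -m and -x are assumptions chosen for the example; only the flags themselves come from the list above. The helper also enforces the documented constraint that m must be greater than 1.

```shell
# Hypothetical helper, not part of Mahout: prints an fkmeans command line
# built from the documented options, without executing it.
# Arguments: $1 = input dir, $2 = clusters path, $3 = output dir, $4 = k, $5 = m
build_fkmeans_cmd() {
  # The option list says m must be greater than 1; awk handles decimals.
  if ! awk -v m="$5" 'BEGIN { exit !(m > 1) }'; then
    echo "m must be greater than 1" >&2
    return 1
  fi
  # -x 10 is an illustrative iteration cap; -ow overwrites the output
  # directory and -cl runs clustering after the iterations.
  echo "./bin/mahout fkmeans -i $1 -c $2 -o $3 -k $4 -m $5 -x 10 -ow -cl"
}

build_fkmeans_cmd testdata clusters output 3 2
# prints: ./bin/mahout fkmeans -i testdata -c clusters -o output -k 3 -m 2 -x 10 -ow -cl
```

Printing the command first makes it easy to review the flags before launching a long-running job; once it looks right, paste it into the shell from $MAHOUT_HOME.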