public final class StreamingKMeansDriver extends AbstractJob
Modifier and Type | Field and Description |
---|---|
static String | ESTIMATED_DISTANCE_CUTOFF: The initial estimated distance cutoff between two points for forming new clusters. |
static String | ESTIMATED_NUM_MAP_CLUSTERS: The number of clusters that Mappers will use should be \(O(k \log n)\), where k is the number of clusters to get at the end and n is the number of points to cluster. |
static String | IGNORE_WEIGHTS: Whether to correct the weights of the centroids after the clustering is done. |
static float | INVALID_DISTANCE_CUTOFF |
static String | MAX_NUM_ITERATIONS: After mapping finishes, we get an intermediate set of vectors that represent approximate clusterings of the data from each Mapper. |
static String | NUM_BALLKMEANS_RUNS: The number of BallKMeans runs in the reducer that determine the centroids to return. |
static String | NUM_PROJECTIONS_OPTION: The number of projections to use when using a projection searcher like ProjectionSearch or FastProjectionSearch. |
static String | RANDOM_INIT: Whether to use k-means++ initialization or random initialization of the seed centroids. |
static String | REDUCE_STREAMING_KMEANS: Whether to run another pass of StreamingKMeans on the reducer's points before BallKMeans. |
static String | SEARCH_SIZE_OPTION: When using approximate searches (anything that's not BruteSearch), more than just the seemingly closest element must be considered. |
static String | SEARCHER_CLASS_OPTION: The Searcher class to use when performing nearest neighbor search in StreamingKMeans. |
static String | TEST_PROBABILITY: The percentage of points that go into the "test" set when evaluating BallKMeans runs in the reducer. |
static String | TRIM_FRACTION: The "ball" aspect of ball k-means means that only the closest points to the centroid will actually be used for updating. |
Fields inherited from class AbstractJob: argMap, inputFile, inputPath, outputFile, outputPath, tempPath
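The \(O(k \log n)\) guidance for ESTIMATED_NUM_MAP_CLUSTERS can be turned into a concrete starting value. The following is a minimal sketch, not part of Mahout: the class name `MapClusterEstimate`, the natural logarithm, and the constant factor of 1 are all assumptions, since big-O notation leaves both the log base and the constant open.

```java
// Sketch only: a hypothetical helper, not a Mahout class.
final class MapClusterEstimate {
  private MapClusterEstimate() {}

  /**
   * Estimates a per-mapper cluster count of order k * log(n).
   *
   * @param numClusters k, the number of final clusters (must be positive)
   * @param numPoints   n, the number of points to cluster (must be positive)
   * @return ceil(k * ln(n)), but never less than k
   */
  static int estimate(int numClusters, long numPoints) {
    if (numClusters <= 0 || numPoints <= 0) {
      throw new IllegalArgumentException("k and n must be positive");
    }
    // Assumption: natural log with constant factor 1.
    double raw = numClusters * Math.log((double) numPoints);
    return (int) Math.max(numClusters, Math.ceil(raw));
  }

  public static void main(String[] args) {
    // k = 10 final clusters, n = 1,000,000 points.
    System.out.println(estimate(10, 1_000_000L)); // prints 139
  }
}
```

The lower bound of k keeps the estimate usable for very small inputs, where ln(n) is close to zero.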
Modifier and Type | Method and Description |
---|---|
static void | configureOptionsForWorkers(org.apache.hadoop.conf.Configuration conf, int numClusters, int estimatedNumMapClusters, float estimatedDistanceCutoff, int maxNumIterations, float trimFraction, boolean randomInit, boolean ignoreWeights, float testProbability, int numBallKMeansRuns, String measureClass, String searcherClass, int searchSize, int numProjections, String method, boolean reduceStreamingKMeans): Checks the parameters for a StreamingKMeans job and prepares a Configuration with them. |
static void | main(String[] args) |
static int | run(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output): Iterate over the input vectors to produce clusters and, if requested, use the results of the final iteration to cluster the input vectors. |
int | run(String[] args) |
static int | runMapReduce(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output) |
Methods inherited from class AbstractJob: addFlag, addInputOption, addOption, addOption, addOption, addOption, addOutputOption, buildOption, buildOption, getAnalyzerClassFromOption, getCLIOption, getConf, getDimensions, getFloat, getFloat, getGroup, getInputFile, getInputPath, getInt, getInt, getOption, getOption, getOption, getOptions, getOutputFile, getOutputPath, getOutputPath, getTempPath, getTempPath, hasOption, keyFor, maybePut, parseArguments, parseArguments, parseDirectories, prepareJob, prepareJob, prepareJob, prepareJob, setConf, setS3SafeCombinedInputPath, shouldRunNextPhase
public static final String ESTIMATED_NUM_MAP_CLUSTERS
public static final String ESTIMATED_DISTANCE_CUTOFF
Defaults to 10e-6.
public static final String MAX_NUM_ITERATIONS
public static final String TRIM_FRACTION
public static final String RANDOM_INIT
See Also: BallKMeans
public static final String IGNORE_WEIGHTS
public static final String TEST_PROBABILITY
public static final String NUM_BALLKMEANS_RUNS
public static final String SEARCHER_CLASS_OPTION
public static final String NUM_PROJECTIONS_OPTION
public static final String SEARCH_SIZE_OPTION
public static final String REDUCE_STREAMING_KMEANS
public static final float INVALID_DISTANCE_CUTOFF
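To illustrate what TRIM_FRACTION controls, here is a minimal sketch of a trimmed centroid update. It assumes the most literal reading of the description, namely that the closest fraction of the assigned points (by count) contributes to the new centroid. Mahout's BallKMeans may define the "ball" differently, for example by a distance threshold, so this should not be read as its implementation. The class `TrimmedCentroidUpdate` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch only: a hypothetical helper, not a Mahout class.
final class TrimmedCentroidUpdate {
  private TrimmedCentroidUpdate() {}

  /** Squared Euclidean distance between two vectors of equal length. */
  static double squaredDistance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /**
   * Returns the mean of the closest ceil(trimFraction * n) assigned points.
   * If no points are assigned, the centroid is returned unchanged (as a copy).
   */
  static double[] update(double[] centroid, List<double[]> assigned, double trimFraction) {
    if (trimFraction <= 0.0 || trimFraction > 1.0) {
      throw new IllegalArgumentException("trimFraction must be in (0, 1]");
    }
    if (assigned.isEmpty()) {
      return centroid.clone();
    }
    // Sort a copy so the caller's list is left untouched.
    List<double[]> sorted = new ArrayList<>(assigned);
    sorted.sort(Comparator.comparingDouble((double[] p) -> squaredDistance(p, centroid)));

    int keep = Math.max(1, (int) Math.ceil(trimFraction * sorted.size()));
    double[] mean = new double[centroid.length];
    for (int i = 0; i < keep; i++) {
      double[] p = sorted.get(i);
      for (int j = 0; j < mean.length; j++) {
        mean[j] += p[j];
      }
    }
    for (int j = 0; j < mean.length; j++) {
      mean[j] /= keep;
    }
    return mean;
  }
}
```

With the points (1, 0), (2, 0), (3, 0) and (100, 0) assigned to a centroid at the origin, a trim fraction of 0.75 drops the outlier and yields (2, 0), whereas a trim fraction of 1.0 yields (26.5, 0).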
public static void configureOptionsForWorkers(org.apache.hadoop.conf.Configuration conf, int numClusters, int estimatedNumMapClusters, float estimatedDistanceCutoff, int maxNumIterations, float trimFraction, boolean randomInit, boolean ignoreWeights, float testProbability, int numBallKMeansRuns, String measureClass, String searcherClass, int searchSize, int numProjections, String method, boolean reduceStreamingKMeans) throws ClassNotFoundException
conf - the Configuration to populate
numClusters - k, the number of clusters at the end
estimatedNumMapClusters - O(k log n), the number of clusters requested from each mapper
estimatedDistanceCutoff - an estimate of the minimum distance that separates two clusters (can be smaller and will be increased dynamically)
maxNumIterations - the maximum number of iterations of BallKMeans
trimFraction - the fraction of the points to be considered in updating a ball k-means
randomInit - whether to initialize the ball k-means seeds randomly
ignoreWeights - whether to ignore the invalid final ball k-means weights
testProbability - the percentage of vectors assigned to the test set for selecting the best final centers
numBallKMeansRuns - the number of BallKMeans runs in the reducer that determine the centroids to return (clusters are computed for the training set and the error is computed on the test set)
measureClass - string, name of the distance measure class; theory works for Euclidean-like distances
searcherClass - string, name of the searcher that will be used for nearest neighbor search
searchSize - the number of closest neighbors to look at for selecting the closest one in approximate nearest neighbor searches
numProjections - the number of projected vectors to use for faster searching (only useful for ProjectionSearch or FastProjectionSearch); see org.apache.mahout.math.neighborhood.ProjectionSearch
Throws: ClassNotFoundException
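The testProbability and numBallKMeansRuns descriptions above imply a holdout scheme: each run clusters a training subset, and its error is measured on the held-out test subset. The following is a minimal sketch of that idea under stated assumptions. The class `HoldoutEvaluation`, the per-point random assignment, and the sum-of-squared-distances cost are illustrative choices, not Mahout's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch only: a hypothetical helper, not a Mahout class.
final class HoldoutEvaluation {
  final List<double[]> train = new ArrayList<>();
  final List<double[]> test = new ArrayList<>();

  /** Sends each point to the test set with probability testProbability. */
  static HoldoutEvaluation split(List<double[]> points, double testProbability, long seed) {
    if (testProbability < 0.0 || testProbability > 1.0) {
      throw new IllegalArgumentException("testProbability must be in [0, 1]");
    }
    Random random = new Random(seed); // seeded so the split is reproducible
    HoldoutEvaluation result = new HoldoutEvaluation();
    for (double[] p : points) {
      if (random.nextDouble() < testProbability) {
        result.test.add(p);
      } else {
        result.train.add(p);
      }
    }
    return result;
  }

  /** Sum over all points of the squared distance to the nearest centroid. */
  static double cost(List<double[]> centroids, List<double[]> points) {
    double total = 0.0;
    for (double[] p : points) {
      double best = Double.POSITIVE_INFINITY;
      for (double[] c : centroids) {
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
          double d = p[i] - c[i];
          sum += d * d;
        }
        best = Math.min(best, sum);
      }
      total += best;
    }
    return total;
  }
}
```

Under this scheme, one would cluster `train` once per run and keep the centroids with the lowest `cost` on `test`. The clustering step itself is left out of the sketch.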
public static int run(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output) throws IOException, InterruptedException, ClassNotFoundException, ExecutionException
input - the directory pathname for input points.
output - the directory pathname for output points.
Throws: IOException, InterruptedException, ClassNotFoundException, ExecutionException
public static int runMapReduce(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output) throws IOException, ClassNotFoundException, InterruptedException
Copyright © 2008–2017 The Apache Software Foundation. All rights reserved.