# StreamingKMeans algorithm

The StreamingKMeans algorithm is a variant of Algorithm 1 from Shindler et al. and consists of two steps:

  1. Streaming step
  2. BallKMeans step

The streaming step is a randomized algorithm that makes a single pass through the data and produces as many centroids as it determines is optimal. This step can be viewed as a preparatory dimensionality reduction: if the size of the data stream is n and the expected number of clusters is k, the streaming step produces roughly k*log(n) clusters, which are passed on to the BallKMeans step, which in turn reduces the number of clusters down to k. For example, with n = 1,000,000 points and k = 20, the streaming step yields on the order of 20*log(1,000,000) ≈ 280 intermediate centroids (taking the natural logarithm). BallKMeans is a randomized Lloyd-type algorithm that has been studied in detail; see Ostrovsky et al.

## Streaming step

### Overview

The streaming step is a derivative of the streaming portion of Algorithm 1 in Shindler et al. The main difference between the two is that Algorithm 1 of Shindler et al. assumes knowledge of the size n of the data stream and uses it to set a key parameter of the algorithm. More precisely, the initial distanceCutoff (defined below), denoted by f in Shindler et al., is set to 1/(k(1+log(n))). The distanceCutoff influences the number of clusters that the algorithm will produce. In contrast, the Mahout implementation does not require knowledge of the size of the data stream. Instead, it dynamically re-evaluates the parameters that depend on the size of the data stream at runtime, as more and more data is processed. In particular, the parameter numClusters (defined below) changes its value as the data is processed.
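As a concrete illustration, the fixed cutoff of Shindler et al. could be computed as in the sketch below. Note that it requires n up front, which is precisely what the Mahout implementation avoids; the values of k and n here are made up.

    // Illustrative only: the initial cutoff f from Algorithm 1 of
    // Shindler et al., which assumes the stream size n is known.
    public class InitialCutoffExample {
        public static void main(String[] args) {
            int k = 20;              // expected number of clusters (made-up value)
            long n = 1_000_000L;     // size of the data stream (made-up value)
            double distanceCutoff = 1.0 / (k * (1.0 + Math.log(n)));
            System.out.println(distanceCutoff);  // approximately 0.0034
        }
    }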

### Parameters

The description below refers to the following parameters of the streaming step:

  * numClusters: a recommendation for the number of clusters that the streaming step should aim for; its value is re-evaluated as more data is processed.
  * distanceCutoff: the distance threshold that decides whether a new point may join an existing cluster or should start a new one; it is increased when the number of clusters grows too large.
  * clusterOvershoot: a slack factor; the number of clusters may grow up to clusterOvershoot * numClusters before the clusters are collapsed.

### Algorithm

The algorithm processes the data points one by one and makes only one pass through the data. The first point in the data stream forms the centroid of the first cluster (this designation may change as more points are processed). Suppose there are r clusters at some point and a new point p is being processed. The new point can either be added to one of the existing r clusters or become a new cluster. To decide:

  * let c be the cluster closest to p, and let d be the distance between c and p;
  * if d > distanceCutoff, create a new cluster from p (p is too far from the existing clusters to join any of them);
  * otherwise (d <= distanceCutoff), create a new cluster with probability d / distanceCutoff (the probability of creating a new cluster grows as d approaches the cutoff).

There will thus be either r or r+1 clusters after processing a new point.

As the number of clusters grows, it may exceed the limit clusterOvershoot * numClusters (numClusters represents a recommendation for the number of clusters that the streaming step should aim for, and clusterOvershoot is the slack). To decrease the number of clusters, the existing clusters are treated as data points and are re-clustered (collapsed), which tends to reduce their number. If the number of clusters is still too high after collapsing, distanceCutoff is increased.
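Putting the pieces together, here is a self-contained sketch of the streaming step in Java. It is illustrative only: the class and constant names are made up, the collapse simply re-runs the same loop over the current centroids (dropping their weights for brevity), and the cutoff growth factor is arbitrary. The actual Mahout StreamingKMeans clusterer differs in these details, e.g. it uses a searcher for nearest-centroid lookups and also grows numClusters as the stream gets longer.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Illustrative sketch of the streaming step; not the actual Mahout classes.
    public class StreamingStepSketch {

        static final double CLUSTER_OVERSHOOT = 1.3;  // slack factor (made-up value)
        static final Random RAND = new Random();

        // A centroid weighted by the number of points it has absorbed.
        static class Centroid {
            final double[] center;
            double weight;
            Centroid(double[] p) { center = p.clone(); weight = 1; }
            void add(double[] p) {  // incremental mean update
                weight++;
                for (int i = 0; i < center.length; i++) {
                    center[i] += (p[i] - center[i]) / weight;
                }
            }
        }

        static double distance(double[] a, double[] b) {
            double sum = 0;
            for (int i = 0; i < a.length; i++) {
                sum += (a[i] - b[i]) * (a[i] - b[i]);
            }
            return Math.sqrt(sum);
        }

        // One pass over the stream; the cluster count grows and is collapsed
        // as described above.
        static List<Centroid> cluster(List<double[]> stream,
                                      int numClusters, double distanceCutoff) {
            List<Centroid> centroids = new ArrayList<>();
            for (double[] p : stream) {
                if (centroids.isEmpty()) {
                    centroids.add(new Centroid(p));  // first point seeds the first cluster
                    continue;
                }
                Centroid nearest = centroids.get(0);
                double d = distance(p, nearest.center);
                for (Centroid c : centroids) {
                    double dc = distance(p, c.center);
                    if (dc < d) { d = dc; nearest = c; }
                }
                // New cluster with probability d / distanceCutoff (certain when
                // d > distanceCutoff); otherwise fold p into the nearest cluster.
                if (RAND.nextDouble() < d / distanceCutoff) {
                    centroids.add(new Centroid(p));
                } else {
                    nearest.add(p);
                }
                // Too many clusters: collapse by re-clustering the centroids
                // themselves (weights dropped here for brevity), widening the
                // cutoff when collapsing alone is not enough.
                while (centroids.size() > CLUSTER_OVERSHOOT * numClusters) {
                    List<double[]> asPoints = new ArrayList<>();
                    for (Centroid c : centroids) asPoints.add(c.center);
                    centroids = cluster(asPoints, numClusters, distanceCutoff);
                    if (centroids.size() > CLUSTER_OVERSHOOT * numClusters) {
                        distanceCutoff *= 2;  // made-up growth factor
                    }
                }
            }
            return centroids;
        }
    }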

## BallKMeans step

### Overview

BallKMeans is a Lloyd-type algorithm that takes a set of weighted vectors and returns k centroids; see Ostrovsky et al. for details. The algorithm has two stages:

  1. Seeding
  2. Ball k-means

The seeding stage produces an initial guess of where the centroids should be; this guess is then refined by the ball k-means stage.

### Parameters

The description below refers to the following parameters of the BallKMeans step:

  * numRuns: the number of independent runs to perform; the lowest-cost solution among them is returned.
  * kMeansPlusPlusInit: a boolean choosing the seeding procedure; k-means++ if true, uniformly at random otherwise.
  * trimFraction: controls how much of each cluster is kept when the trimmed center of mass is computed.
  * testProbability: the probability with which a datapoint is assigned to the test set rather than the training set.

### Algorithm

The algorithm can be instructed to take multiple independent runs (via the numRuns parameter), and it will select the best solution, i.e., the one with the lowest cost. In practice, one run is sufficient to find a good solution.

Each run operates as follows: a seeding procedure is used to select k centroids, and then ball k-means is run iteratively to refine the solution.

The seeding procedure can be set to either 'uniformly at random' or 'k-means++' using the kMeansPlusPlusInit boolean parameter. Seeding with k-means++ involves more computation but produces better results in practice.
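For reference, here is a minimal, self-contained sketch of k-means++ seeding (D^2 weighting) on plain points. The names are made up and do not match Mahout's API, and Mahout's BallKMeans actually seeds from the weighted centroids produced by the streaming step.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Illustrative k-means++ seeding on plain (unweighted) points.
    public class KMeansPlusPlusSketch {

        static double sqDist(double[] a, double[] b) {
            double sum = 0;
            for (int i = 0; i < a.length; i++) {
                sum += (a[i] - b[i]) * (a[i] - b[i]);
            }
            return sum;
        }

        static List<double[]> seed(List<double[]> points, int k, Random rand) {
            List<double[]> seeds = new ArrayList<>();
            // The first seed is chosen uniformly at random.
            seeds.add(points.get(rand.nextInt(points.size())));
            while (seeds.size() < k) {
                // D^2 weighting: a point is picked with probability
                // proportional to its squared distance to the nearest seed.
                double[] d2 = new double[points.size()];
                double total = 0;
                for (int i = 0; i < points.size(); i++) {
                    double best = Double.POSITIVE_INFINITY;
                    for (double[] s : seeds) {
                        best = Math.min(best, sqDist(points.get(i), s));
                    }
                    d2[i] = best;
                    total += best;
                }
                double r = rand.nextDouble() * total;
                int chosen = points.size() - 1;  // fallback for rounding error
                for (int i = 0; i < points.size(); i++) {
                    r -= d2[i];
                    if (r <= 0) { chosen = i; break; }
                }
                seeds.add(points.get(chosen));
            }
            return seeds;
        }
    }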

Each iteration of ball k-means runs as follows:

  1. Clusters are formed by assigning each datapoint to its nearest centroid.
  2. The centers of mass of the trimmed clusters (see the trimFraction parameter above) become the new centroids.

The data may be partitioned into a test set and a training set (see the testProbability parameter above). The seeding procedure and ball k-means run on the training set; the cost of the resulting clustering is computed on the test set.
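The following sketch shows one such iteration. The trim rule here is an assumption made for illustration: a point is kept if it lies within trimFraction times half the distance from its centroid to the nearest other centroid. Mahout's actual trimming may use a different radius.

    import java.util.ArrayList;
    import java.util.List;

    // One illustrative ball k-means iteration; the trim radius below is an
    // assumption for this sketch, not Mahout's exact rule.
    public class BallKMeansIterationSketch {

        static double dist(double[] a, double[] b) {
            double sum = 0;
            for (int i = 0; i < a.length; i++) {
                sum += (a[i] - b[i]) * (a[i] - b[i]);
            }
            return Math.sqrt(sum);
        }

        static List<double[]> iterate(List<double[]> points,
                                      List<double[]> centroids,
                                      double trimFraction) {
            int k = centroids.size();
            // 1. Assign each datapoint to its nearest centroid.
            int[] owner = new int[points.size()];
            for (int i = 0; i < points.size(); i++) {
                double best = Double.POSITIVE_INFINITY;
                for (int c = 0; c < k; c++) {
                    double d = dist(points.get(i), centroids.get(c));
                    if (d < best) { best = d; owner[i] = c; }
                }
            }
            // The "ball" of each cluster: points within trimFraction * half
            // the distance to the nearest other centroid (assumed rule).
            double[] radius = new double[k];
            for (int c = 0; c < k; c++) {
                double nearestOther = Double.POSITIVE_INFINITY;
                for (int o = 0; o < k; o++) {
                    if (o != c) {
                        nearestOther = Math.min(nearestOther,
                                                dist(centroids.get(c), centroids.get(o)));
                    }
                }
                radius[c] = trimFraction * nearestOther / 2;
            }
            // 2. The centers of mass of the trimmed clusters become the
            //    new centroids.
            List<double[]> next = new ArrayList<>();
            for (int c = 0; c < k; c++) {
                double[] sum = new double[centroids.get(c).length];
                int count = 0;
                for (int i = 0; i < points.size(); i++) {
                    if (owner[i] != c) continue;
                    if (dist(points.get(i), centroids.get(c)) > radius[c]) continue;
                    for (int j = 0; j < sum.length; j++) sum[j] += points.get(i)[j];
                    count++;
                }
                if (count == 0) {  // empty after trimming: keep the old centroid
                    next.add(centroids.get(c));
                } else {
                    for (int j = 0; j < sum.length; j++) sum[j] /= count;
                    next.add(sum);
                }
            }
            return next;
        }
    }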

## Usage of StreamingKMeans

    bin/mahout streamingkmeans
        -i <input>
        -o <output>
        -ow
        -k <k>
        -km <estimatedNumMapClusters>
        -e <estimatedDistanceCutoff>
        -mi <maxNumIterations>
        -tf <trimFraction>
        -ri
        -iw
        -testp <testProbability>
        -nbkm <numBallKMeansRuns>
        -dm <distanceMeasure>
        -sc <searcherClass>
        -np <numProjections>
        -s <searchSize>
        -rskm
        -xm <method>
        -h
        --tempDir <tempDir>
        --startPhase <startPhase>
        --endPhase <endPhase>
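For instance, a concrete invocation might look like the following. The paths and numeric values are placeholders rather than recommendations, with -km set on the order of k*log(n) as discussed above.

    bin/mahout streamingkmeans \
        -i /path/to/input \
        -o /path/to/output \
        -ow \
        -k 20 \
        -km 300 \
        -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
        -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
        -np 3 \
        -s 2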

### Details on Job-Specific Options

## References

  1. M. Shindler, A. Wong, A. Meyerson: Fast and Accurate k-means for Large Datasets.
  2. R. Ostrovsky, Y. Rabani, L. Schulman, C. Swamy: The Effectiveness of Lloyd-Type Methods for the k-means Problem.