Clustering synthetic control data


This example demonstrates clustering of time series data, specifically control charts. Control charts are tools used to determine whether a manufacturing or business process is in a state of statistical control. Such charts are generated (or, as here, simulated) repeatedly at equal time intervals. A simulated dataset is available in the UCI Machine Learning Repository.

The time series of control charts need to be clustered into closely knit groups. The dataset we use is synthetic and is meant to resemble real-world information in an anonymized format. It contains six classes: Normal, Cyclic, Increasing trend, Decreasing trend, Upward shift, and Downward shift. In this example we use Mahout to cluster the data into the corresponding class buckets.

For the sake of simplicity, we won’t use a Hadoop cluster in this example; instead, we show the commands to run the clustering examples locally with Hadoop.


We need to do some initial setup before we are able to run the example.

  1. Download the dataset to be clustered from the UCI Machine Learning Repository.

  2. Download the latest release of Mahout.

  3. Unpack the release binary and switch to the mahout-distribution-0.x folder.

  4. Make sure that the JAVA_HOME environment variable points to your local Java installation.

  5. Create a folder called testdata in the current directory and copy the dataset into this folder.
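
Steps 4 and 5 can be sketched as a pair of small shell functions. This is a minimal sketch, not part of the Mahout distribution: the function names `check_java_home` and `prepare_testdata` are hypothetical, and the dataset path is wherever you saved the download.

```shell
#!/bin/sh
# Hypothetical helpers for setup steps 4 and 5; nothing here is Mahout-specific.

# Step 4: confirm JAVA_HOME is set and contains an executable bin/java.
check_java_home() {
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "$JAVA_HOME/bin/java" ]; then
        echo "no executable java under $JAVA_HOME/bin" >&2
        return 1
    fi
    echo "JAVA_HOME looks usable: $JAVA_HOME"
}

# Step 5: create testdata in the current directory and copy the dataset into it.
# $1 is the path to the downloaded dataset file.
prepare_testdata() {
    dataset="$1"
    if [ ! -f "$dataset" ]; then
        echo "dataset not found: $dataset" >&2
        return 1
    fi
    mkdir -p testdata && cp "$dataset" testdata/
}
```

Run from the unpacked Mahout folder, a call such as `check_java_home && prepare_testdata /path/to/downloaded/dataset` would perform both steps; the path is a placeholder.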

Clustering Examples

Depending on the clustering algorithm you want to run, use one of the following commands:

bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job
bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
bin/mahout org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job
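
If you switch between algorithms often, the mapping from algorithm name to job class can be captured in a small helper. The sketch below relies only on the three class names listed above; the function name `job_class` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the job class for a given algorithm name.
# Only the three algorithms named in this example are recognized.
job_class() {
    case "$1" in
        canopy|kmeans|fuzzykmeans)
            echo "org.apache.mahout.clustering.syntheticcontrol.$1.Job"
            ;;
        *)
            echo "unknown algorithm: $1 (expected canopy, kmeans, or fuzzykmeans)" >&2
            return 1
            ;;
    esac
}
```

With this in place, `bin/mahout "$(job_class kmeans)"` is equivalent to the second command above.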

The clustering output is written to the output directory. The output data points are in vector format. To read or analyze the output, use the clusterdump utility provided by Mahout.
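
As a hedged illustration of that last step: clusterdump needs the directory holding the final clusters, and that directory's name includes an iteration count that varies between runs. The sketch below assumes a Mahout release that marks the last iteration with a `-final` suffix (for example `output/clusters-10-final`) and that the output directory is on the local filesystem rather than in HDFS. The helper name `final_clusters_dir` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the first clusters-*-final directory under $1.
# Assumes the "-final" naming convention; adjust if your release differs.
final_clusters_dir() {
    for dir in "$1"/clusters-*-final; do
        # If the glob matched nothing, $dir is the literal pattern and -d fails.
        if [ -d "$dir" ]; then
            echo "$dir"
            return 0
        fi
    done
    echo "no clusters-*-final directory under $1" >&2
    return 1
}
```

With the directory located, a clusterdump call might look like `bin/mahout clusterdump --input "$(final_clusters_dir output)" --pointsDir output/clusteredPoints --output clusteranalyze.txt`. The `clusteredPoints` directory and the output file name are assumptions here; check `bin/mahout clusterdump --help` for the options your release supports.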