#Breiman Example

#### Introduction

This page describes how to run the Breiman example, which implements the test procedure described in Leo Breiman’s paper. The basic algorithm is as follows :

• repeat I iterations
• in each iteration do
• keep 10% of the dataset apart as a testing set
• build two forests using the training set, one with m = int(log2(M) + 1) (called Random-Input) and one with m = 1 (called Single-Input)
• choose the forest that gave the lowest oob error estimation to compute the test set error
• compute the test set error using the Single Input Forest (test error), this demonstrates that even with m = 1, Decision Forests give comparable results to greater values of m
• compute the mean testset error using every tree of the chosen forest (tree error). This should indicate how well a single Decision Tree performs
• compute the mean test error for all iterations
• compute the mean tree error for all iterations

#### Running the Example

The current implementation is compatible with the UCI repository file format. We’ll show how to run this example on two datasets:

First, we deal with Glass Identification: download the dataset file called glass.data and store it onto your local machine. Next, we must generate the descriptor file glass.info for this dataset with the following command:

bin/mahout org.apache.mahout.classifier.df.tools.Describe -p /path/to/glass.data -f /path/to/glass.info -d I 9 N L


Substitute /path/to/ with the folder where you downloaded the dataset, the argument “I 9 N L” indicates the nature of the variables. Here it means 1 ignored (I) attribute, followed by 9 numerical(N) attributes, followed by the label (L).

Finally, we build and evaluate our random forest classifier as follows:

bin/mahout org.apache.mahout.classifier.df.BreimanExample -d /path/to/glass.data -ds /path/to/glass.info -i 10 -t 100 which builds 100 trees (-t argument) and repeats the test 10 iterations (-i argument)


The example outputs the following results:

• Selection error: mean test error for the selected forest on all iterations
• Single Input error: mean test error for the single input forest on all iterations
• One Tree error: mean single tree error on all iterations
• Mean Random Input Time: mean build time for random input forests on all iterations
• Mean Single Input Time: mean build time for single input forests on all iterations

We can repeat this for a Sonar usecase: download the dataset file called sonar.all-data and store it onto your local machine. Generate the descriptor file sonar.info for this dataset with the following command:

bin/mahout org.apache.mahout.classifier.df.tools.Describe -p /path/to/sonar.all-data -f /path/to/sonar.info -d 60 N L


The argument “60 N L” means 60 numerical(N) attributes, followed by the label (L). Analogous to the previous case, we run the evaluation as follows:

bin/mahout org.apache.mahout.classifier.df.BreimanExample -d /path/to/sonar.all-data -ds /path/to/sonar.info -i 10 -t 100