# Class Discovery

See http://www.cs.bham.ac.uk/~wbl/biblio/gecco1999/GP-417.pdf

CDGA uses a Genetic Algorithm to discover a classification rule for a given dataset. A dataset can be seen as a table:

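| | attribute 1 | attribute 2 | ... | attribute N |
|---|---|---|---|---|
| row 1 | value1 | value2 | ... | valueN |
| row 2 | value1 | value2 | ... | valueN |
| ... | ... | ... | ... | ... |
| row M | value1 | value2 | ... | valueN |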

An attribute can be numerical, for example a “temperature” attribute, or categorical, for example a “color” attribute. For classification purposes, one of the categorical attributes is designated as a label, which means that its value defines the class of the rows. A classification rule can be represented as follows:

| | attribute 1 | attribute 2 | ... | attribute N |
|---|---|---|---|---|
| weight | w1 | w2 | ... | wN |
| operator | op1 | op2 | ... | opN |
| value | value1 | value2 | ... | valueN |

For a given target class and a weight threshold, the classification rule reads as follows:

    for each row of the dataset
      if (rule.w1 < threshold || (rule.w1 >= threshold && row.value1 rule.op1 rule.value1)) &&
         (rule.w2 < threshold || (rule.w2 >= threshold && row.value2 rule.op2 rule.value2)) &&
         ...
         (rule.wN < threshold || (rule.wN >= threshold && row.valueN rule.opN rule.valueN)) then
        row is part of the target class


Important: The label attribute is not evaluated by the rule.

The threshold parameter allows some conditions of the rule to be skipped if their weight is too small. The operators available depend on the attribute types:

• for numerical attributes, the available operators are ‘<’ and ‘>=’
• for categorical attributes, the available operators are ‘!=’ and ‘==’
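The operator restrictions above can be sketched as a small lookup table. This is only an illustration, not Mahout's implementation: the names `ALLOWED_OPERATORS` and `condition_holds` are hypothetical, and the attribute kinds are assumed to be passed in as the strings `"numerical"` and `"categorical"`.

```python
import operator

# Operators available for each attribute type, as listed above.
# The dictionary and its keys are illustrative names, not part of Mahout.
ALLOWED_OPERATORS = {
    "numerical": {"<": operator.lt, ">=": operator.ge},
    "categorical": {"!=": operator.ne, "==": operator.eq},
}


def condition_holds(kind, op, row_value, rule_value):
    """Evaluate `row_value op rule_value` for an attribute of the given kind."""
    try:
        compare = ALLOWED_OPERATORS[kind][op]
    except KeyError:
        # Raised for an unknown kind or an operator not allowed for that kind.
        raise ValueError(f"operator {op!r} is not available for {kind} attributes") from None
    return compare(row_value, rule_value)
```

With this sketch, `condition_holds("numerical", "<", 16, 20)` evaluates the condition normally, while `condition_holds("numerical", "==", 16, 20)` raises a `ValueError` because ‘==’ is reserved for categorical attributes.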

The “threshold” and “target” are user-defined parameters. Because the label is always a categorical attribute, the target is the zero-based index of the class label value among all the possible values of the label. For example, if the label attribute can have the following values (blue, brown, green), then a target of 1 means the “brown” class.
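As a minimal sketch of how a zero-based target index selects a class, assume the possible label values are held in an ordered list. The list `label_values` and the helper `target_class` are hypothetical names used only for illustration.

```python
# Possible values of the label attribute, in a fixed order
# (assumed ordering, for illustration only).
label_values = ["blue", "brown", "green"]


def target_class(values, target):
    """Return the label value selected by a zero-based target index."""
    if not 0 <= target < len(values):
        raise IndexError(f"target {target} is outside 0..{len(values) - 1}")
    return values[target]


print(target_class(label_values, 0))  # blue
print(target_class(label_values, 1))  # brown
```

Because the index is zero-based, the first value in the list corresponds to target 0, and the valid targets run from 0 to one less than the number of label values.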

For example, we have the following dataset (the label attribute is “Eyes Color”):

| | Age | Eyes Color | Hair Color |
|---|---|---|---|
| row 1 | 16 | brown | dark |
| row 2 | 25 | green | light |
| row 3 | 12 | blue | light |

and a classification rule:

| | Age | Hair Color |
|---|---|---|
| weight | 0 | 1 |
| operator | < | != |
| value | 20 | light |

and the following parameters: threshold = 1 and target = 0 (brown).

This rule can be read as follows:

    for each row of the dataset
      if (0 < 1 || (0 >= 1 && row.value1 < 20)) &&
         (1 < 1 || (1 >= 1 && row.value2 != light)) then
        row is part of the "brown Eyes Color" class


Note how the rule skips the label attribute (Eyes Color), and how the first condition is ignored because its weight is below the threshold.
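The example can be checked with a short script. This is a sketch under stated assumptions rather than the CDGA code itself: the rule is represented as a list of hypothetical `(attribute, weight, operator, value)` tuples, rows are plain dictionaries, and `rule_matches` is an illustrative name. It also relies on the fact that `w < t || (w >= t && c)` is logically equivalent to `w < t || c`, so a condition is simply skipped when its weight is below the threshold.

```python
import operator

# Map the operator symbols used by the rule to Python comparison functions.
OPERATORS = {"<": operator.lt, ">=": operator.ge, "!=": operator.ne, "==": operator.eq}


def rule_matches(row, conditions, threshold):
    """Return True if the row satisfies every condition whose weight reaches the threshold."""
    for attribute, weight, op, value in conditions:
        if weight < threshold:
            continue  # weight too small: this condition is skipped
        if not OPERATORS[op](row[attribute], value):
            return False
    return True


dataset = [
    {"Age": 16, "Eyes Color": "brown", "Hair Color": "dark"},
    {"Age": 25, "Eyes Color": "green", "Hair Color": "light"},
    {"Age": 12, "Eyes Color": "blue", "Hair Color": "light"},
]

# One tuple per non-label attribute; the label (Eyes Color) is never evaluated.
rule = [("Age", 0, "<", 20), ("Hair Color", 1, "!=", "light")]
threshold = 1

predicted = [rule_matches(row, rule, threshold) for row in dataset]
print(predicted)  # [True, False, False]
```

Only row 1 is assigned to the target class, which agrees with its actual label (brown). Rows 2 and 3 are rejected by the hair color condition alone, since the age condition never takes part in the decision at this threshold.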

# Running the example:

NOTE: Substitute in the appropriate version for the Mahout JOB jar

1. cd /examples
2. ant job