source : [3](3.html)
LearnUnprunedTree(X,Y)
Input: X, a matrix of R rows and M columns, where X_ij = the value of the j'th attribute in the i'th input datapoint. Each column consists of either all real values or all categorical values.
Input: Y, a vector of R elements, where Y_i = the output class of the i'th datapoint. The Y_i values are categorical.
Output: an unpruned decision tree.
If all records in X have identical values in all their attributes (this includes the case where R < 2), return a Leaf Node predicting the majority output, breaking ties randomly.
If all values in Y are the same, return a Leaf Node predicting this value as the output.
Else
  Select m variables at random out of the M variables.
  For j = 1 .. m
    If the j'th attribute is categorical
      IG_j = IG(Y|X_j) (see Information Gain)
    Else (the j'th attribute is real-valued)
      IG_j = IG*(Y|X_j), the best-threshold gain (see Information Gain)
  Let j* = argmax_j IG_j (this is the splitting attribute we'll use).
  If the j*'th attribute is categorical
    For each value v of the j*'th attribute
      Let X^v = subset of rows of X in which X_ij* = v. Let Y^v = corresponding subset of Y.
      Let Child^v = LearnUnprunedTree(X^v, Y^v)
    Return a decision tree node that splits on the j*'th attribute. The number of children equals the number of values of the j*'th attribute, and the v'th child is Child^v.
  Else (the j*'th attribute is real-valued; let t be the best split threshold)
    Let X^LO = subset of rows of X in which X_ij* <= t. Let Y^LO = corresponding subset of Y.
    Let Child^LO = LearnUnprunedTree(X^LO, Y^LO)
    Let X^HI = subset of rows of X in which X_ij* > t. Let Y^HI = corresponding subset of Y.
    Let Child^HI = LearnUnprunedTree(X^HI, Y^HI)
    Return a decision tree node that splits on the j*'th attribute. It has two children corresponding to whether the j*'th attribute is below or above the threshold t.
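The pseudocode above translates fairly directly into code. Below is a minimal, illustrative Python sketch of LearnUnprunedTree; the Leaf/Node classes and the information_gain / best_threshold_gain helpers (sketched in the Information Gain section below) are assumptions of this sketch, not part of the original sources or of any library.

```python
import random
from collections import Counter

class Leaf:
    def __init__(self, label):
        self.label = label           # predicted class

class Node:
    def __init__(self, attribute, threshold=None, children=None):
        self.attribute = attribute   # index j* of the splitting attribute
        self.threshold = threshold   # split threshold t (real-valued splits only)
        self.children = children     # dict: attribute value -> subtree, or 'LO'/'HI'

def learn_unpruned_tree(X, Y, m, is_categorical):
    # Stopping cases: all rows identical, or all labels identical -> leaf with the
    # majority (or only) class; ties are broken arbitrarily here, randomly in the pseudocode.
    if all(row == X[0] for row in X) or len(set(Y)) == 1:
        return Leaf(Counter(Y).most_common(1)[0][0])

    # Select m candidate attributes at random out of the M available (m <= M assumed).
    M = len(X[0])
    candidates = random.sample(range(M), m)

    # Score candidates: IG for categorical attributes, best-threshold IG* for real ones.
    best_j, best_gain, best_t = None, 0.0, None
    for j in candidates:
        if is_categorical[j]:
            gain, t = information_gain(X, Y, j), None
        else:
            gain, t = best_threshold_gain(X, Y, j)
        if gain > best_gain:
            best_j, best_gain, best_t = j, gain, t

    # Guard (not in the pseudocode above): if no candidate improves on H(Y), stop.
    if best_j is None:
        return Leaf(Counter(Y).most_common(1)[0][0])

    if is_categorical[best_j]:
        # One child per observed value v of the j*'th attribute.
        children = {}
        for v in set(row[best_j] for row in X):
            rows = [i for i, row in enumerate(X) if row[best_j] == v]
            children[v] = learn_unpruned_tree([X[i] for i in rows],
                                              [Y[i] for i in rows], m, is_categorical)
        return Node(best_j, children=children)
    else:
        # Two children: rows at or below the threshold, and rows above it.
        lo = [i for i, row in enumerate(X) if row[best_j] <= best_t]
        hi = [i for i, row in enumerate(X) if row[best_j] > best_t]
        return Node(best_j, threshold=best_t, children={
            'LO': learn_unpruned_tree([X[i] for i in lo], [Y[i] for i in lo], m, is_categorical),
            'HI': learn_unpruned_tree([X[i] for i in hi], [Y[i] for i in hi], m, is_categorical)})
```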
Note: There are alternatives to Information Gain for splitting nodes
source : [3](3.html)
Suppose X can take one of m values V_1, V_2, ..., V_m, with P(X=V_1) = p_1, P(X=V_2) = p_2, ..., P(X=V_m) = p_m.
H(X) = - sum_{j=1..m} p_j log2(p_j)   (the entropy of X)
H(Y|X=v) = the entropy of Y among only those records in which X has value v
H(Y|X) = sum_j p_j H(Y|X=v_j)
IG(Y|X) = H(Y) - H(Y|X)
Suppose X is real-valued. Define:
IG(Y|X:t) = H(Y) - H(Y|X:t)
H(Y|X:t) = H(Y|X<t) P(X<t) + H(Y|X>=t) P(X>=t)
IG*(Y|X) = max_t IG(Y|X:t)
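A minimal Python sketch of these quantities follows, assuming the probabilities p_j are estimated from empirical label counts. The names entropy, information_gain and best_threshold_gain are illustrative assumptions; they are the helpers assumed by the tree-learning sketch above.

```python
import math
from collections import Counter

def entropy(labels):
    # H(Y) = -sum_j p_j log2(p_j) over the empirical label distribution.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(X, Y, j):
    # IG(Y|X_j) = H(Y) - sum_v P(X_j=v) H(Y|X_j=v), for a categorical attribute j.
    n = len(Y)
    h_cond = 0.0
    for v in set(row[j] for row in X):
        y_v = [Y[i] for i, row in enumerate(X) if row[j] == v]
        h_cond += (len(y_v) / n) * entropy(y_v)
    return entropy(Y) - h_cond

def best_threshold_gain(X, Y, j):
    # IG*(Y|X_j) = max_t IG(Y|X_j : t), scanning candidate thresholds placed
    # midway between consecutive distinct observed values of the j'th attribute.
    n = len(Y)
    values = sorted(set(row[j] for row in X))
    best_gain, best_t = 0.0, None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0
        y_lo = [Y[i] for i, row in enumerate(X) if row[j] < t]
        y_hi = [Y[i] for i, row in enumerate(X) if row[j] >= t]
        h_cond = (len(y_lo) / n) * entropy(y_lo) + (len(y_hi) / n) * entropy(y_hi)
        gain = entropy(Y) - h_cond
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_gain, best_t
```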
source : [1](1.html)
Each tree is grown as follows:
source : [2](2.html)
Random Forests are easy to use: the only two parameters a user of the technique has to determine are the number of trees to be used and the number of variables (m) to be randomly selected from the available set of variables. Breiman's recommendation is to pick a large number of trees and to use the square root of the number of variables for m.
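As a rough illustration of these two parameters, the hypothetical sketch below grows n_trees trees with m set to the square root of the number of variables, reusing learn_unpruned_tree from the earlier sketch. It assumes each tree is grown on a bootstrap sample of the training data, the standard bagging step in Breiman's procedure, which the excerpt above does not spell out.

```python
import math
import random

def grow_forest(X, Y, is_categorical, n_trees=100):
    M = len(X[0])
    m = max(1, round(math.sqrt(M)))   # Breiman's recommendation for m
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample: R rows drawn with replacement from the training data.
        idx = [random.randrange(len(X)) for _ in range(len(X))]
        forest.append(learn_unpruned_tree([X[i] for i in idx], [Y[i] for i in idx],
                                          m, is_categorical))
    return forest
```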
Classify(node, V)
Input: node, a node of the decision tree; if node.attribute = j then the split at this node is done on the j'th attribute.
Input: V, a vector of M columns, where V_j = the value of the j'th attribute.
Output: the predicted label of V.
If node is a Leaf then
  Return the value predicted by node
Else
  Let j = node.attribute
  If the j'th attribute is categorical
    Let v = V_j
    Let Child^v = the child node corresponding to the attribute value v
    Return Classify(Child^v, V)
  Else (the j'th attribute is real-valued)
    Let t = node.threshold (the split threshold)
    If V_j < t then
      Let Child^LO = the child node corresponding to (< t)
      Return Classify(Child^LO, V)
    Else
      Let Child^HI = the child node corresponding to (>= t)
      Return Classify(Child^HI, V)
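A minimal Python sketch of Classify over the Leaf/Node classes assumed in the earlier tree-learning sketch (children keyed by attribute value, or by 'LO'/'HI' for real-valued splits; categorical values are assumed to have been seen during training). forest_classify adds the usual majority vote across the trees of a forest and is not part of the pseudocode above.

```python
from collections import Counter

def classify(node, v):
    if isinstance(node, Leaf):
        return node.label
    j = node.attribute
    if node.threshold is None:                  # categorical split
        return classify(node.children[v[j]], v)
    elif v[j] < node.threshold:                 # real-valued split, below threshold
        return classify(node.children['LO'], v)
    else:                                       # real-valued split, at or above threshold
        return classify(node.children['HI'], v)

def forest_classify(forest, v):
    # Each tree votes; the forest predicts the majority label.
    votes = Counter(classify(tree, v) for tree in forest)
    return votes.most_common(1)[0][0]
```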
source : [1](1.html)
In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the run, as follows:
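The excerpt stops before describing the procedure; as a rough illustration, the sketch below follows Breiman's out-of-bag (OOB) idea from source [1]: each training case is classified only by the trees whose bootstrap samples did not contain it, and the error rate of those OOB votes estimates the test error. It reuses the hypothetical learn_unpruned_tree and classify from the earlier sketches.

```python
import math
import random
from collections import Counter

def oob_error(X, Y, is_categorical, n_trees=100):
    R, M = len(X), len(X[0])
    m = max(1, round(math.sqrt(M)))
    oob_votes = [Counter() for _ in range(R)]   # per-case votes from trees that skipped it
    for _ in range(n_trees):
        idx = [random.randrange(R) for _ in range(R)]   # bootstrap sample
        tree = learn_unpruned_tree([X[i] for i in idx], [Y[i] for i in idx],
                                   m, is_categorical)
        for i in set(range(R)) - set(idx):              # out-of-bag cases for this tree
            oob_votes[i][classify(tree, X[i])] += 1
    # Error rate over cases that were out-of-bag for at least one tree.
    scored = [(votes.most_common(1)[0][0], Y[i])
              for i, votes in enumerate(oob_votes) if votes]
    return sum(pred != true for pred, true in scored) / len(scored)
```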
source : [1](1.html)
[1](1.html) Random Forests - Classification Description. http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
[2](2.html) B. Larivière & D. Van Den Poel, 2004. "Predicting Customer Retention and Profitability by Using Random Forests and Regression Forests Techniques," Working Papers of Faculty of Economics and Business Administration, Ghent University, Belgium 04/282. Available online: http://ideas.repec.org/p/rug/rugwps/04-282.html
[3](3.html) Decision Trees - Andrew W. Moore. http://www.cs.cmu.edu/~awm/tutorials
[4](4.html) Information Gain - Andrew W. Moore. http://www.cs.cmu.edu/~awm/tutorials