Machine Learning Methods: Decision trees and forests
This post contains our crib notes on the basics of decision trees and forests. We first discuss the construction of individual trees, and then introduce random and boosted forests. We also discuss efficient implementations of greedy tree construction algorithms, showing that a single tree can be constructed in O(m n log n) time, where n is the number of training examples and m the number of features per example.
Introduction
Decision trees constitute a class of simple functions that are frequently used for carrying out regression and classification. They are constructed by hierarchically splitting a feature space into disjoint regions, where each split divides one of the already-existing regions in two. In most common implementations, the splits are always taken along one of the feature axes, which causes the regions to be rectangular in shape. An example is shown in Fig. 1 below: There, a two-dimensional feature space is first split along one of its two feature axes, and each subsequent split then subdivides one of the resulting rectangular regions.
Once a decision tree is constructed, it can be used for making predictions on unlabeled feature vectors, i.e., points in feature space not included in our training set. This is done by first deciding which of the regions a new feature vector belongs to, and then returning as its hypothesis label an average over the training example labels within that region: The mean of the region's training labels is returned for regression problems and the mode for classification problems. For instance, the tree in Fig. 1 would return an average of the five training examples sitting in the region containing the new point.
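This lookup can be sketched in a few lines of python (an illustration only; the two-split tree below is hypothetical and hard-coded by hand, rather than learned from data):

```python
# Minimal sketch of decision tree prediction: each internal node tests one
# feature against a threshold; each leaf stores an average training label.

class Node:
    def __init__(self, feature=None, threshold=None,
                 left=None, right=None, value=None):
        self.feature = feature      # index of the feature tested at this node
        self.threshold = threshold  # split location along that feature
        self.left = left            # subtree for feature value <= threshold
        self.right = right          # subtree for feature value > threshold
        self.value = value          # mean training label (leaf nodes only)

def predict(node, x):
    # descend until we reach a leaf, then return its stored average label
    while node.value is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value

# hypothetical tree: split on feature 0 at 0.5, then on feature 1 at 2.0
tree = Node(feature=0, threshold=0.5,
            left=Node(value=1.0),
            right=Node(feature=1, threshold=2.0,
                       left=Node(value=3.0),
                       right=Node(value=5.0)))

print(predict(tree, [0.2, 9.9]))  # -> 1.0
print(predict(tree, [0.9, 1.5]))  # -> 3.0
```

Note that a prediction requires only one comparison per level of the tree, a point we return to in exercise 2 below.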
The art and science of tree construction is in deciding how many splits should be taken and where those splits should take place. The goal is to find a tree that provides a reasonable, piece-wise constant approximation to the underlying distribution or function that has generated the training data provided. This can be attempted through choosing a tree that breaks space up into regions such that the examples in any given region have identical — or at least similar — labels. We discuss some common approaches to finding such trees in the next section.
Individual trees have the important benefit of being easy to interpret and visualize, but they are often not as accurate as other common machine learning algorithms. However, individual trees can be used as simple building blocks with which to construct more complex, competitive models. In the third section of this note, we discuss three very popular constructions of this sort: bagging, random forests (a variant on bagging), and boosting. We then discuss the runtime complexity of tree/forest construction and conclude with a summary, exercises, and an appendix containing example python code.
Constructing individual decision trees
REGRESSION
Regression tree construction typically proceeds by attempting to minimize a squared-error cost function: Given a training set {(x_i, y_i) : i = 1, …, n}, one seeks a tree whose leaf regions R_1, …, R_k minimize

J = Σ_{j=1}^{k} Σ_{i ∈ R_j} (y_i − ŷ_{R_j})²,

where ŷ_{R_j} is the mean of the training labels sitting in region R_j.
Unfortunately, actually minimizing this cost over all possible trees is computationally intractable: The number of candidate trees grows explosively with the number of splits taken, so an exhaustive search is out of the question. Instead, one of the following two approximate search strategies is typically applied.
- Greedy algorithm: The tree is constructed recursively, one branching step at a time. At each step, one takes the split that will most significantly reduce the cost function J, relative to its current value. In this way, after k − 1 splits, a tree with k regions (leaves) is obtained; Fig. 2 provides an illustration of this process. The algorithm terminates whenever some specified stopping criterion is satisfied, examples of which are given below.
- Randomized algorithm: Randomized tree-search protocols can sometimes find global minima inaccessible to the gradient-descent-like greedy algorithm. These randomized protocols also proceed recursively. However, at each step, some randomization is introduced by hand. For example, one common approach is to select only r candidate splits through random sampling at each branching point. The candidate split that most significantly reduces J is selected, and the process repeats. The benefit of this approach is that it can sometimes find paths that appear suboptimal in their first few steps, but are ultimately favorable.
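A brute-force version of the greedy split search can be sketched as follows (an illustration only, on made-up toy data; efficient implementations instead use the incremental updates discussed in the runtime section below):

```python
import numpy as np

def sse(y):
    # squared-error cost of a region: deviations from the region's mean label
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_split(X, y):
    """Exhaustively search every (feature, threshold) split of the region
    (X, y), returning the one that most reduces the squared-error cost."""
    best_f, best_t, best_cost = None, None, sse(y)
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f])[:-1]:  # candidate thresholds along f
            mask = X[:, f] <= t
            cost = sse(y[mask]) + sse(y[~mask])
            if cost < best_cost:
                best_f, best_t, best_cost = f, t, cost
    return best_f, best_t, best_cost

# toy data: the label depends only on whether feature 0 exceeds 1.5
X = np.array([[1.0, 5.0], [1.2, 1.0], [2.0, 5.0], [2.2, 1.0]])
y = np.array([0.0, 0.0, 10.0, 10.0])
f, t, cost = best_split(X, y)
print(f, t, cost)  # splits feature 0 at 1.2, leaving zero residual cost
```

Greedy construction simply applies this search recursively: split the best region, then search each sub-region in turn until a stopping criterion fires.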
CLASSIFICATION
In classification problems, the training labels take on a discrete set of values, often having no numerical significance. This means that a squared-error cost function, like that used above for regression, is not appropriate. Instead, classification trees are typically constructed by minimizing one of three related cost functions: the misclassification error rate, the Gini index

G = Σ_{j=1}^{k} Σ_{l} p̂_{jl} (1 − p̂_{jl}),

or the cross-entropy, where p̂_{jl} is the fraction of the training examples in region j that carry label l.
Each of the summands here is plotted in Fig. 3 for the special case of binary classification (two labels only). Each is unfavorably maximized at the most mixed state, where the two labels appear in equal proportion, and minimized at the pure states, where a region contains examples of only one label.
Although the error rate is perhaps the most natural of these measures, the Gini index and cross-entropy are usually preferred in practice: Being strictly concave, they strictly decrease under any split whose two sub-regions have differing label proportions, a property the error rate does not share (see exercise 1).
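For reference, the three measures can be sketched in the binary case as follows, with p denoting the fraction of a region's examples carrying the first label (a minimal illustration):

```python
import numpy as np

def error_rate(p):
    # misclassification rate of a region whose majority label is returned
    return min(p, 1 - p)

def gini(p):
    # Gini index: expected error when labeling at random according to p
    return 2 * p * (1 - p)

def cross_entropy(p):
    # cross-entropy (in nats); like the others, maximized at p = 1/2
    return 0.0 if p in (0.0, 1.0) else -p * np.log(p) - (1 - p) * np.log(1 - p)

# each measure is smallest at the pure states and largest at p = 1/2
for p in (0.0, 0.25, 0.5):
    print(p, error_rate(p), gini(p), round(cross_entropy(p), 3))
```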
BIAS-VARIANCE TRADE-OFF AND STOPPING CONDITIONS
Decision trees that are allowed to split indefinitely will have low bias but will over-fit their training data. Placing different stopping criteria on a tree’s growth can ameliorate this latter effect. Two typical conditions often used for this purpose are given by a) placing an upper bound on the number of levels permitted in the tree, or b) requiring that each region (tree leaf) retains at least some minimum number of training examples. To optimize over such constraints, one can apply cross-validation.
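For instance, in scikit-learn such constraints can be cross-validated directly (a sketch; the parameter grid below is an arbitrary choice):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# cross-validate over the two stopping criteria mentioned above:
# maximum tree depth and minimum examples retained per leaf
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8, None],
                "min_samples_leaf": [1, 5, 20]},
    cv=5)
grid.fit(X, y)
print(grid.best_params_)  # the constraint combination with best CV accuracy
```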
Bagging, random forests, and boosting
Another approach to alleviating the high-variance, over-fitting issue associated with decision trees is to average over many of them. This approach is motivated by the observation that the average of N independent, identically distributed random variables has a variance N times smaller than that of any one of them; averaging over an ensemble of trees can similarly reduce the variance of the resulting model.
BAGGING AND RANDOM FORESTS
Bootstrap aggregation, or “bagging”, provides one common method for constructing ensemble tree models. In this approach, one samples with replacement from the training set to obtain B bootstrapped training sets, fits an independent tree to each, and then returns as the ensemble's prediction an average over the individual tree predictions (a majority vote in classification). If the individual trees each have variance σ² and pairwise correlation ρ, the variance of the averaged model is

ρ σ² + (1 − ρ) σ² / B ≈ ρ σ²,

where the latter form is accurate in the large-B limit.
One nice thing about bagging methods, in general, is that one can train on the entire set of available labeled training data and still obtain an estimate of the generalization error. Such estimates are obtained by considering the error on each point in the training set, in each case averaging only over those trees that did not train on the point in question. The resulting estimate, called the out-of-bag error, typically provides a slight overestimate to the generalization error. This is because accuracy generally improves with growing ensemble size, and the full ensemble is usually about three times larger than the sub-ensemble used to vote on any particular training example in the out-of-bag error analysis.
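In scikit-learn, for example, this estimate is exposed through the oob_score flag (a sketch):

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)

# with oob_score=True, each training point is scored using only the
# trees whose bootstrap samples did not include that point
forest = RandomForestClassifier(n_estimators=200, oob_score=True,
                                random_state=0)
forest.fit(X, y)
print(forest.oob_score_)  # out-of-bag accuracy estimate
```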
Random forests provide a popular variation on the bagging method. The individual decision trees making up a random forest are, again, each fit to an independent, bootstrapped subsample of the training data. However, at each step in their recursive construction process, these trees are restricted in that they are only allowed to split on a small, randomly selected subset of the feature directions (a common choice is √m of the m available features). This restriction serves to decorrelate the individual trees, which further reduces the variance of their averaged prediction.
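In scikit-learn, this restriction is controlled by the max_features parameter; the sketch below contrasts plain bagging with a random forest on the digits data (parameter choices are arbitrary):

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# plain bagging: every split may consider all m features
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                           random_state=0)
# random forest: each split considers only sqrt(m) randomly chosen features
forest = RandomForestClassifier(n_estimators=50, max_features="sqrt",
                                random_state=0)

bagged_score = cross_val_score(bagged, X, y, cv=3).mean()
forest_score = cross_val_score(forest, X, y, cv=3).mean()
print(bagged_score, forest_score)
```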
BOOSTING
The final method we’ll discuss is boosting, which again consists of a set of individual trees that collectively determine the ultimate prediction returned by the model. However, in the boosting scenario, one fits each of the trees to the full data
set, rather than to a small sample. Because they are fit to the full data set, these trees are usually restricted to being only two or three levels deep, so as to avoid over-fitting. Further, the individual trees in a boosted forest are constructed sequentially.
For instance, in regression, the process typically works as follows: In the first step, a shallow tree is fit to the full, original training set. In each subsequent step, a new tree is fit to the residuals of the current ensemble, and a small multiple of its prediction is then added to the ensemble's running estimate,

f_new(x) = f_old(x) + λ × (new tree's fit to the residuals, evaluated at x).

Here, λ is a small learning-rate parameter; keeping it small causes the ensemble to learn slowly, which typically improves its generalization error.
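The regression recipe above can be sketched directly, with shallow scikit-learn trees standing in for the sequentially-fit components (an illustration on made-up data; the learning rate of 0.1 is an arbitrary choice):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# toy regression data, made up for illustration
rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0])

learning_rate = 0.1  # the small parameter lambda in the text
prediction = np.zeros_like(y)
trees = []

# each new shallow tree is fit to the residuals of the current ensemble,
# and a small multiple of its output is added to the running prediction
for _ in range(100):
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, y - prediction)
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

def boosted_predict(trees, X_new):
    # the ensemble's prediction is the learning-rate-weighted sum of trees
    return learning_rate * np.sum([t.predict(X_new) for t in trees], axis=0)

print(np.mean((y - prediction) ** 2))  # training error shrinks as trees accumulate
```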
Boosted classification tree ensembles are constructed in a fashion similar to that above. However, in contrast to the regression scenario, the same, original training labels are used to fit each new tree in the ensemble (as opposed to an evolving residual). To bring about a similar, gradual learning process, boosted classification ensembles instead sample from the training set with weights that are sample-dependent and that change over time: When constructing a new tree for the ensemble, one more heavily weights those examples that have been poorly fit in prior iterations. AdaBoost is a popular algorithm for carrying out boosted classification. This and other generalizations are covered in the text Elements of Statistical Learning.
Implementation runtime complexity
Before concluding, we take here a moment to consider the runtime complexity of tree construction. This exercise gives one a sense of how tree algorithms are constructed in practice. We begin by considering the greedy construction of a single classification tree. The extension to regression trees is straightforward.
INDIVIDUAL DECISION TREES
Consider the problem of greedily training a single classification tree on a set of n training examples, each characterized by m features. As a preliminary step, the training examples can be sorted along each of the m feature directions, at cost O(m n log n); we assume below that these m sorted lists are maintained throughout the construction process.
Focus on an intermediate moment in the construction process where one particular node has just been split, resulting in two new regions. To decide where to split next, we must evaluate a set of candidate cuts within each new region: one candidate along each feature direction between each pair of adjacent example values in that direction's sorted list.
The left side of Fig. 4 illustrates one method for efficiently carrying out these test cuts: For each feature direction, we proceed sequentially through that direction's ordered list, considering one cut at a time. In the first cut, we take only one example in the left sub-region induced, and all others on the right. In the second cut, we have the first two examples in the left sub-region, etc. Proceeding in this way, it turns out that the cost function of each new candidate split considered can always be evaluated in O(1) time, given its value at the previous candidate (see exercises 3 and 4).
The above analysis gives the time needed to search for the optimal split within a given region: A region containing n_R examples can be fully scanned in O(m n_R) time. Because the regions at any level of the tree partition the full training set, an entire level can be processed in O(m n) time, provided the sorted feature lists are kept up to date as regions are split.
In summary, we see that achieving this efficiency requires an initial O(m n log n) sort plus O(m n) work per tree level; for an approximately balanced tree of depth O(log n), the total construction cost is O(m n log n).
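For regression, the O(1) per-candidate update can be sketched using running sums (an illustration only; see exercise 4 for the underlying identity):

```python
import numpy as np

def scan_splits(y_sorted):
    """Scan the candidate splits of a region whose labels are sorted by the
    feature being cut, updating the squared-error cost in O(1) per candidate.
    Uses the identity sum((y - mean)^2) = sum(y^2) - (sum(y))^2 / n."""
    n = len(y_sorted)
    total_s, total_s2 = y_sorted.sum(), (y_sorted ** 2).sum()
    left_s = left_s2 = 0.0
    costs = []
    for i in range(1, n):  # the first i examples fall in the left sub-region
        left_s += y_sorted[i - 1]
        left_s2 += y_sorted[i - 1] ** 2
        right_s, right_s2 = total_s - left_s, total_s2 - left_s2
        costs.append((left_s2 - left_s ** 2 / i)
                     + (right_s2 - right_s ** 2 / (n - i)))
    return costs

# labels fall into two clusters; the middle candidate cut separates them
y = np.array([0.0, 0.1, 10.0, 10.1])
costs = scan_splits(y)
print(costs)  # minimized at the second candidate, between the two clusters
```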
FORESTS, PARALLELIZATION
If a forest of B trees is to be constructed, the total work is simply B times that of a single tree. However, the individual trees of bagged and random forests are mutually independent, and so can be trained in parallel. The trees of a boosted forest cannot: Each is fit to the residuals (or reweighted samples) left by its predecessors, so they must be constructed sequentially.
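In scikit-learn, for example, this parallelism is exposed through the n_jobs parameter (a sketch):

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)

# bagged/random forest trees are independent of one another,
# so they can be grown on all available cores at once
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
forest.fit(X, y)
print(forest.score(X, y))  # training-set accuracy
```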
Discussion
In this note, we’ve quickly reviewed the basics of tree-based models and their constructions. Looking back over what we have learned, we can now consider some of the reasons why tree-based methods are so popular among practitioners. First — and very importantly — individual trees are often useful for gaining insight into the geometry of datasets in high dimensions. This is because tree structures can be visualized using simple diagrams, like that in Fig. 1. In contrast, most other machine learning algorithm outputs cannot be easily visualized — consider, e.g., support-vector machines, which return hyper-plane decision boundaries. A related point is that tree-based approaches are able to automatically fit non-linear decision boundaries. In contrast, linear algorithms can only fit such boundaries if appropriate non-linear feature combinations are constructed. This requires that one first identify these appropriate feature combinations, which can be a challenging task for feature spaces that cannot be directly visualized. Three additional positive qualities of decision trees are given by a) the fact that they are insensitive to feature scale, which reduces the need for related data preprocessing, b) the fact that they can make use of data missing certain feature values, and c) that they are relatively robust against outliers and noisy-labeling issues.
Although boosted and random forests are not as easily visualized as individual decision trees, these ensemble methods are popular because they are often quite competitive. Boosted forests typically have a slightly lower generalization error than their random forest counterparts, and so are often used when accuracy is highly valued; see Fig. 5 for an example learning curve consistent with this rule of thumb (generalization error rate versus training set size for a hand-written digits learning problem). However, the individual trees in a bagged forest can be constructed in parallel, a benefit not shared by boosted forests. This can favor random forests as a go-to, out-of-the-box approach for treating large-scale machine learning problems.
Exercises follow that detail some further points of interest relating to decision trees and their construction.
REFERENCES
[1] The Elements of Statistical Learning, by Hastie, Tibshirani, and Friedman.
[2] An Introduction to Statistical Learning, by James, Witten, Hastie, and Tibshirani.
[3] Random Forests, by Breiman (Machine Learning 45, 2001).
[4] scikit-learn documentation on decision tree runtime complexity, section 1.8.4.
Exercises
1) JENSEN’S INEQUALITY AND CLASSIFICATION TREE COST FUNCTIONS
a) Consider a real, concave function f. Prove Jensen's inequality: For any set of weights p_i ≥ 0 satisfying Σ_i p_i = 1,

f(Σ_i p_i x_i) ≥ Σ_i p_i f(x_i).

When does equality hold?
b) Consider binary tree classification guided by the minimization of the error rate. Use part a) to show that no split can ever increase this cost function, and characterize the splits that leave it unchanged.
c) How about if the Gini index or the cross-entropy is used instead? Show that, because these measures are strictly concave, any split whose two sub-regions have differing label proportions strictly decreases the cost.
2) DECISION TREE PREDICTION RUNTIME COMPLEXITY
Suppose one has constructed an approximately balanced decision tree, where each node contains one of the tree's binary split criteria and each leaf retains O(1) training examples. Show that the runtime needed to generate a prediction for a single test feature vector scales as O(log n), where n is the number of training examples.
3) CLASSIFICATION TREE CONSTRUCTION RUNTIME COMPLEXITY
a) Consider a region R containing n_R training examples, a fraction p̂_l of which carry label l. If a single example is moved from R into a neighboring region, show that each of the label fractions p̂_l can be updated in O(1) time, provided a count of each label is stored for the region.
b) Show that a region's Gini coefficient (or cross-entropy) can likewise be updated in O(1) time when a single example is moved out of it.
4) REGRESSION TREE CONSTRUCTION RUNTIME COMPLEXITY
Consider a region R containing n_R training examples with labels y_i. Show that if the running sums Σ_i y_i and Σ_i y_i² are stored for the region, both its mean label ŷ_R and its squared-error cost can be updated in O(1) time when a single example is moved out of it.
These results allow for the cost function of a region to be updated in O(1) time with each new candidate split considered in the sequential scans described in the text.
5) CHEBYCHEV’S INEQUALITY AND RANDOM FOREST CLASSIFIER ACCURACY
Adapted from [3].
a) Let X be a random variable with mean μ and variance σ². Prove Chebychev's inequality,

P(|X − μ| ≥ t) ≤ σ² / t².

b) Consider a binary classification problem aimed at fitting a sampled function y(x) taking values in {−1, 1}, using a forest whose individual trees h(x; θ) are indexed by independent, identically distributed random parameter vectors θ. Define the forest's margin at a labeled point (x, y) as

m(x, y) = P_θ( h(x; θ) = y ) − P_θ( h(x; θ) ≠ y ).

This is equal to 2 P_θ( h(x; θ) = y ) − 1, and the forest misclassifies (x, y) in the large-ensemble limit precisely when m(x, y) < 0.
c) Show that the variance of the margin over the data distribution satisfies var(m) ≤ ρ̄ (1 − s²), where ρ̄ is a mean correlation between pairs of trees in the forest and s ≡ E[m(x, y)] is the forest's strength.
d) Writing, PE* ≡ P_{x,y}( m(x, y) < 0 )
for the forest's generalization error, assume s > 0 and combine parts a) and c) to obtain Breiman's bound,

PE* ≤ ρ̄ (1 − s²) / s².

The bound illustrates the two qualities a good forest must balance: strong individual trees (large s) and weak correlation between them (small ρ̄).
Appendix: python/sk-learn implementations
Here, we provide the python/sk-learn code used to construct Fig. 5 in the body of this note: Learning curves on sk-learn’s “digits” dataset for a single tree, a random forest, and a boosted forest.
```python
from sklearn.datasets import load_digits
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np
import matplotlib.pyplot as plt

# load data: digits.data and digits.target,
# arrays of features and labels, resp.
digits = load_digits(n_class=10)

n_train = []
t1_accuracy = []
t2_accuracy = []
t3_accuracy = []

# below, we average over "trials" fits at each sample size
# in order to estimate the average generalization error.
trials = 25
clf = DecisionTreeClassifier()
clf2 = GradientBoostingClassifier(max_depth=3)
clf3 = RandomForestClassifier()
num_test = 500

# loop over different training set sizes
for num_train in range(2, len(digits.target) - num_test, 25):
    acc1, acc2, acc3 = 0, 0, 0
    for j in range(trials):
        # re-draw random train/test splits until the training
        # sample contains at least two distinct labels
        perm = [0]
        while len(set(digits.target[perm[:num_train]])) < 2:
            perm = np.random.permutation(len(digits.data))
        train_X = digits.data[perm[:num_train]]
        train_y = digits.target[perm[:num_train]]
        test_X = digits.data[perm[-num_test:]]
        test_y = digits.target[perm[-num_test:]]
        acc1 += clf.fit(train_X, train_y).score(test_X, test_y)
        acc2 += clf2.fit(train_X, train_y).score(test_X, test_y)
        acc3 += clf3.fit(train_X, train_y).score(test_X, test_y)
    n_train.append(num_train)
    t1_accuracy.append(acc1 / trials)
    t2_accuracy.append(acc2 / trials)
    t3_accuracy.append(acc3 / trials)

plt.plot(n_train, t1_accuracy, color='red')    # single tree
plt.plot(n_train, t2_accuracy, color='green')  # boosted forest
plt.plot(n_train, t3_accuracy, color='blue')   # random forest
plt.show()
```