Python for Data Science - Decision Trees With CART
Decision Tree
A decision tree is a decision-support tool that models a series of decisions and their possible consequences in order to predict likely outcomes.
Decision Trees Have Three Types of Nodes
- Root node
- Decision nodes
- Leaf nodes
Decision Tree Algorithms
A class of supervised machine learning methods that are useful for making predictions from nonlinear data
Decision tree algorithms are an appropriate fit for:
- Continuous input and/or output variables
- Categorical input and/or output variables
Two Types of Decision Trees
- Categorical variable decision tree (or classification tree): used when a decision tree predicts a categorical target variable
- Continuous variable decision tree (or regression tree): used when a decision tree predicts a continuous target variable (both types are sketched in code below)
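Both tree types are available in scikit-learn; a minimal sketch on made-up toy data (the arrays below are purely illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])  # one predictor

# Classification tree: categorical target
y_class = np.array([0, 0, 0, 1, 1, 1])
clf = DecisionTreeClassifier(max_depth=2).fit(X, y_class)
print(clf.predict([[2.5]]))   # predicted class label

# Regression tree: continuous target
y_reg = np.array([1.1, 1.9, 3.2, 7.8, 8.1, 8.4])
reg = DecisionTreeRegressor(max_depth=2).fit(X, y_reg)
print(reg.predict([[2.5]]))   # mean response of the matching leaf
```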
Main Benefits of Decision Tree Algorithms
- Decision trees are nonlinear models
- Decision trees are very easy to interpret
- Trees can be easily represented graphically, helping in interpretation
- Decision trees require relatively little data preparation (for example, no feature scaling)
Assumptions of Decision Tree Algorithms
- Root node = Entire training set
- Predictive features are either categorical or, if continuous, binned prior to model building (a binning sketch follows this list)
- Records are distributed recursively on the basis of attribute values
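If continuous features are binned before training, as the assumption above describes, a simple sketch with pandas could look like this (the feature name and bin edges are made up for illustration):

```python
import pandas as pd

# Hypothetical continuous feature: customer age
ages = pd.Series([22, 35, 47, 51, 63, 78], name="age")

# Bin into ordered categories before building the tree
age_bins = pd.cut(ages, bins=[0, 30, 50, 70, 100],
                  labels=["young", "middle", "senior", "elderly"])
print(age_bins.value_counts())
```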
How Decision Tree Algorithms Work
- Deploy recursive binary splitting to stratify or segment the predictor space into a number of simple, nonoverlapping regions.
- Make a prediction for a given observation by using the mean or the mode of the training observations in the region to which it belongs.
- Use the set of splitting rules to summarize the tree (see the sketch below).
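A brief sketch of these three steps with scikit-learn, using the Iris dataset purely for illustration; export_text prints the fitted set of splitting rules:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X, y = data.data, data.target

# Recursive binary splitting stratifies the predictor space into regions;
# each leaf then predicts the mode of the training observations it contains.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Summarize the tree by its splitting rules
print(export_text(tree, feature_names=data.feature_names))
```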
Recursive Binary Splitting
Recursive binary splitting is the process that's used to segment a predictor space into regions in order to create a binary decision tree.
How Recursive Binary Splitting Works
At every stage, the region is split into two according to one of the following criteria (both are sketched in code after this list):
- In regression trees: the sum of squared errors (SSE) serves as the loss function for identifying the best split
- In classification trees: the Gini index serves as the loss function for identifying the best split
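A small sketch of how each criterion could be scored for one candidate split, written with plain NumPy rather than scikit-learn's internals (the function names here are made up):

```python
import numpy as np

def split_sse(y_left, y_right):
    """Regression criterion: sum of squared errors around each child's mean."""
    return (np.sum((y_left - y_left.mean()) ** 2)
            + np.sum((y_right - y_right.mean()) ** 2))

def gini(y):
    """Gini index of a single node: 1 - sum of squared class proportions."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gini(y_left, y_right):
    """Classification criterion: size-weighted Gini of the two child nodes."""
    n = len(y_left) + len(y_right)
    return len(y_left) / n * gini(y_left) + len(y_right) / n * gini(y_right)

# The candidate split with the lowest criterion value is chosen at each stage
print(split_sse(np.array([1.0, 1.2]), np.array([5.0, 5.5, 6.0])))
print(split_gini(np.array([0, 0, 0]), np.array([1, 1, 0])))
```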
Characteristics of Recursive Binary Splitting
- Top-down
- Greedy
Disadvantages of Decision Tree Algorithms
- Not very robust: small changes in the training data can produce a very different tree
- Highly sensitive to the particular training sample
- A globally optimal tree is not guaranteed (greedy splitting optimizes each split locally)
Using Decision Trees for Regression
Regression trees are appropriate when:
- Target is a continuous variable
- Nonlinear (or otherwise complex) relationship between features and target
Output values from terminal nodes represent the mean response:
- New data points are predicted with that mean (see the sketch below)
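A short sketch (toy data, made-up values) showing that a regression tree's prediction is the mean response of the matching terminal node:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([2.0, 2.4, 2.6, 9.0, 9.5, 10.0])

# A depth-1 tree produces exactly two terminal nodes
reg = DecisionTreeRegressor(max_depth=1).fit(X, y)

# A new point falling into the first region is predicted with that leaf's mean
print(reg.predict([[2.5]]))   # approximately 2.333
print(np.mean(y[:3]))         # the same mean computed directly
```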
Recursive Splitting in Regression Trees
At every stage in a regression tree, the region is split into two according to the sum of squared errors (SSE):
The model begins with the entire dataset, S, and searches every distinct value of every predictor to find the predictor and split value that partition the data into two groups (S1 and S2) such that the overall sum of squared errors is minimized:
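The criterion being minimized is the standard two-group sum of squared errors, where the group means of the response in S1 and S2 are written as y-bar-1 and y-bar-2:

$$\mathrm{SSE} = \sum_{i \in S_1} (y_i - \bar{y}_1)^2 + \sum_{i \in S_2} (y_i - \bar{y}_2)^2$$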
Using Decision Trees for Classification
Classification trees are appropriate when:
- Target is a categorical variable (for example, a binary class label)
Output values from terminal nodes represent the mode response:
- New data points are predicted with that mode (the most common class in the leaf)
Recursive Splitting in Classification Trees
At every stage in a classification tree, the region is split into two according to a user-defined metric, for example:
The Gini index (G) is a measure of total variance across the K classes; it reflects the probability of misclassification
- G takes on a small value if all of the p̂_mk values are close to zero or one (notation defined below)
- "Measure of node purity": a small G value indicates that a node contains predominantly observations from a single class
Tree Pruning
Tree pruning is the process used to overcome model overfitting by removing subnodes of a decision tree (that is, replacing a whole subtree with a leaf node).
Why Prune a Decision Tree
Why tree pruning is necessary:
A deep, fully grown tree overfits the training data and therefore performs poorly on new data
A subtree is pruned (replaced by a single leaf) when its expected error rate is greater than that of the leaf
Two Popular Tree Pruning Methods
- Hold-out test: Fastest, simplest pruning method
- Cost-complexity pruning: trades training error off against tree size using a complexity parameter, alpha (a sketch follows)
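A minimal sketch of cost-complexity pruning with scikit-learn; the Breast Cancer dataset and the choice of alpha are purely illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate pruning strengths (effective alphas) for this training set
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Refit with one candidate alpha; a larger alpha prunes more aggressively
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
print(pruned.get_n_leaves(), pruned.score(X_test, y_test))
```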