Python for Data Science - Decision Trees With CART
Decision Tree
A decision tree is a decision-support tool that models a series of decisions and their possible consequences in order to predict likely outcomes.
Decision Trees Have Three Types of Nodes
- Root node
- Decision nodes
- Leaf nodes
Decision Tree Algorithms
A class of supervised machine learning methods that are useful for making predictions from nonlinear data
Decision tree algorithms are an appropriate fit for:
- Continuous input and/or output variables
- Categorical input and/or output variables
Two Types of Decision Trees
- Categorical variable decision tree (or classification tree): used when a decision tree predicts a categorical target variable
- Continuous variable decision tree (or regression tree): used when a decision tree predicts a continuous target variable (both types are sketched in code below)
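Both tree types are available in scikit-learn; a minimal sketch on made-up toy data (the arrays below are purely illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])  # one predictor

# Classification tree: categorical target
y_class = np.array([0, 0, 0, 1, 1, 1])
clf = DecisionTreeClassifier(max_depth=2).fit(X, y_class)
print(clf.predict([[2.5]]))   # predicted class label

# Regression tree: continuous target
y_reg = np.array([1.1, 1.9, 3.2, 7.8, 8.1, 8.4])
reg = DecisionTreeRegressor(max_depth=2).fit(X, y_reg)
print(reg.predict([[2.5]]))   # mean response of the matching leaf
```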
Main Benefits of Decision Tree Algorithms
- Decision trees are nonlinear models
- Decision trees are very easy to interpret
- Trees can be easily represented graphically, helping in interpretation
- Decision trees require relatively little data preparation (for example, no feature scaling)
Assumptions of Decision Tree Algorithms
- Root node = Entire training set
- Predictive features are either categorical or, if continuous, binned prior to model building (a binning sketch follows this list)
- Records are distributed recursively on the basis of attribute values
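If continuous features are binned before training, as the assumption above describes, a simple sketch with pandas could look like this (the feature name and bin edges are made up for illustration):

```python
import pandas as pd

# Hypothetical continuous feature: customer age
ages = pd.Series([22, 35, 47, 51, 63, 78], name="age")

# Bin into ordered categories before building the tree
age_bins = pd.cut(ages, bins=[0, 30, 50, 70, 100],
                  labels=["young", "middle", "senior", "elderly"])
print(age_bins.value_counts())
```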
How Decision Tree Algorithms Work
- Deploy recursive binary splitting to stratify or segment the predictor space into a number of simple, nonoverlapping regions.
- Make a prediction for a given observation by using the mean or the mode of the training observations in the region to which it belongs.
- Use the set of splitting rules to summarize the tree (see the sketch below).
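A brief sketch of these three steps with scikit-learn, using the Iris dataset purely for illustration; export_text prints the fitted set of splitting rules:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X, y = data.data, data.target

# Recursive binary splitting stratifies the predictor space into regions;
# each leaf then predicts the mode of the training observations it contains.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Summarize the tree by its splitting rules
print(export_text(tree, feature_names=data.feature_names))
```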
Recursive Binary Splitting
Recursive binary splitting is the process that's used to segment a predictor space into regions in order to create a binary decision tree.
How Recursive Binary Splitting Works
At every stage, the region is split into two according to one of the following criteria (both are sketched in code after this list):
- In regression trees: the sum of squared errors (SSE) serves as the loss function for identifying the best split
- In classification trees: the Gini index serves as the loss function for identifying the best split
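A small sketch of how each criterion could be scored for one candidate split, written with plain NumPy rather than scikit-learn's internals (the function names here are made up):

```python
import numpy as np

def split_sse(y_left, y_right):
    """Regression criterion: sum of squared errors around each child's mean."""
    return (np.sum((y_left - y_left.mean()) ** 2)
            + np.sum((y_right - y_right.mean()) ** 2))

def gini(y):
    """Gini index of a single node: 1 - sum of squared class proportions."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gini(y_left, y_right):
    """Classification criterion: size-weighted Gini of the two child nodes."""
    n = len(y_left) + len(y_right)
    return len(y_left) / n * gini(y_left) + len(y_right) / n * gini(y_right)

# The candidate split with the lowest criterion value is chosen at each stage
print(split_sse(np.array([1.0, 1.2]), np.array([5.0, 5.5, 6.0])))
print(split_gini(np.array([0, 0, 0]), np.array([1, 1, 0])))
```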
Characteristics of Recursive Binary Splitting
- Top-down
- Greedy
Disadvantages of Decision Tree Algorithms
- Not very robust: small changes in the training data can produce a very different tree
- Highly sensitive to the particular training sample
- A globally optimal tree is not guaranteed (greedy splitting optimizes each split locally)
Using Decision Trees for Regression
Regression trees are appropriate when:
- Target is a continuous variable
- Nonlinear (or otherwise complex) relationship between features and target
Output values from terminal nodes represent the mean response:
- New data points are predicted with that mean (see the sketch below)
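A short sketch (toy data, made-up values) showing that a regression tree's prediction is the mean response of the matching terminal node:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([2.0, 2.4, 2.6, 9.0, 9.5, 10.0])

# A depth-1 tree produces exactly two terminal nodes
reg = DecisionTreeRegressor(max_depth=1).fit(X, y)

# A new point falling into the first region is predicted with that leaf's mean
print(reg.predict([[2.5]]))   # approximately 2.333
print(np.mean(y[:3]))         # the same mean computed directly
```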
Recursive Splitting in Regression Trees
At every stage in a regression tree, the region is split into two according to the sum of squared errors (SSE):
The model begins with the entire dataset, S, and searches every distinct value of every predictor to find the predictor and split value that partition the data into two groups (S1 and S2) such that the overall sum of squared errors is minimized:
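The criterion being minimized is the standard two-group sum of squared errors, where the group means of the response in S1 and S2 are written as y-bar-1 and y-bar-2:

$$\mathrm{SSE} = \sum_{i \in S_1} (y_i - \bar{y}_1)^2 + \sum_{i \in S_2} (y_i - \bar{y}_2)^2$$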
Using Decision Trees for Classification
Classification trees are appropriate when:
- Target is a categorical variable (for example, a binary class label)
Output values from terminal nodes represent the mode response:
- New data points are predicted with that mode (the most common class in the leaf)
Recursive Splitting in Classification Trees
At every stage in a classification tree, the region is split into two according to a user-defined metric, for example:
The Gini index (G) is a measure of total variance across the K classes; it reflects the probability of misclassification
- G takes on a small value if all of the p̂_mk values are close to zero or one (notation defined below)
- "Measure of node purity": a small G value indicates that a node contains predominantly observations from a single class
Tree Pruning
Tree pruning is the process used to overcome model overfitting by removing subnodes of a decision tree (that is, replacing a whole subtree with a leaf node).
Why Prune a Decision Tree
Why tree pruning is necessary:
A deep, fully grown tree overfits the training data and therefore performs poorly on new data
A subtree is pruned (replaced by a single leaf) when its expected error rate is greater than that of the leaf
Two Popular Tree Pruning Methods
- Hold-out test: Fastest, simplest pruning method
- Cost-complexity pruning: trades training error off against tree size using a complexity parameter, alpha (a sketch follows)
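A minimal sketch of cost-complexity pruning with scikit-learn; the Breast Cancer dataset and the choice of alpha are purely illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate pruning strengths (effective alphas) for this training set
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Refit with one candidate alpha; a larger alpha prunes more aggressively
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
print(pruned.get_n_leaves(), pruned.score(X_test, y_test))
```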