    Python for Data Science - Decision Trees With CART

    Decision Tree

    A decision tree is a decision-support tool that models decisions and their possible consequences in order to predict probable outcomes.

    Decision Trees Have Three Types of Nodes

    • Root node
    • Decision nodes
    • Leaf nodes
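    A fitted tree makes these node types concrete. The following is a minimal sketch (assuming scikit-learn, which this section does not name) that trains a small classification tree and prints its structure: the first split is the root node, nested splits are decision nodes, and the terminal class assignments are leaf nodes.

        # Minimal sketch: train a small tree and print its node structure.
        # Assumes scikit-learn; the section itself names no library.
        from sklearn.datasets import load_iris
        from sklearn.tree import DecisionTreeClassifier, export_text

        X, y = load_iris(return_X_y=True)
        tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

        # The first split printed is the root node, nested splits are
        # decision nodes, and lines ending in "class: ..." are leaf nodes.
        print(export_text(tree, feature_names=load_iris().feature_names))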

    Decision Tree Algorithms

    A class of supervised machine learning methods that are useful for making predictions from nonlinear data

    Decision tree algorithms are an appropriate fit for:

    • Continuous input and/or output variables
    • Categorical input and/or output variables

    Two Types of Decision Trees

    • Categorical variable decision tree (or classification tree): when you use a decision tree to predict a categorical target variable
    • Continuous variable decision tree (or regression tree): when you use a decision tree to predict a continuous target variable
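    In scikit-learn (an assumption about tooling; the section names no library), the two types map onto two estimators, sketched below: DecisionTreeClassifier for a categorical target and DecisionTreeRegressor for a continuous one.

        # Sketch of the two tree types; assumes scikit-learn.
        from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

        # Categorical target -> classification tree
        clf = DecisionTreeClassifier()
        clf.fit([[0], [1], [2], [3]], ["no", "no", "yes", "yes"])
        print(clf.predict([[2.5]]))   # predicts a class label

        # Continuous target -> regression tree
        reg = DecisionTreeRegressor()
        reg.fit([[0], [1], [2], [3]], [0.0, 0.9, 2.1, 2.9])
        print(reg.predict([[2.5]]))   # predicts a numeric value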

    Main Benefits of Decision Tree Algorithms

    • Decision trees are nonlinear models
    • Decision trees are very easy to interpret
    • Trees can be easily represented graphically, helping in interpretation
    • Decision trees require less data preparation

    Assumptions of Decision Tree Algorithms

    • Root node = Entire training set
    • Predictive features are either categorical or, if continuous, binned before the tree is built
    • Records are distributed recursively on the basis of attribute values

    How Decision Tree Algorithms Work

    1. Deploy recursive binary splitting to stratify or segment the predictor space into a number of simple, nonoverlapping regions.
    2. Make a prediction for a given observation by using the mean or the mode of the training observations in the region to which it belongs.
    3. Use the set of splitting rules to summarize the tree.
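    Step 2 can be shown directly: once the regions are fixed, the prediction for a new observation is simply the mean (regression) or mode (classification) of the training targets in its region. A plain-Python sketch with hand-picked, hypothetical regions:

        # Sketch of step 2: predict with the mean or mode of a region.
        # The split point x < 5 is hypothetical, hand-picked for
        # illustration rather than found by recursive binary splitting.
        from statistics import mean, mode

        region_targets = {"R1": [1.0, 1.2, 0.8], "R2": [4.9, 5.1, 5.0]}
        region_labels = {"R1": ["a", "a", "b"], "R2": ["b", "b", "b"]}

        def predict_regression(x):
            region = "R1" if x < 5 else "R2"
            return mean(region_targets[region])   # mean response

        def predict_classification(x):
            region = "R1" if x < 5 else "R2"
            return mode(region_labels[region])    # modal response

        print(predict_regression(3.0))    # 1.0, the mean of R1's targets
        print(predict_classification(7))  # 'b', the mode of R2's labels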

    Recursive Binary Splitting

    Recursive binary splitting is the process that's used to segment a predictor space into regions in order to create a binary decision tree.

    How Recursive Binary Splitting Works

    At every stage, the region is split in two according to one of the following criteria:

    • In regression trees: use the sum of squares error (SSE) as the loss function to identify the best split
    • In classification trees: use the Gini index as the loss function to identify the best split

    Characteristics of Recursive Binary Splitting

    • Top-down: splitting begins at the root node, with the full training set, and works downward
    • Greedy: the best split is chosen at each step, without looking ahead to splits that might yield a better tree later

    Disadvantages of Decision Tree Algorithms

    • Very non-robust: sensitive to small changes in the training data, which can produce a very different tree
    • A globally optimal tree is not guaranteed, because greedy splitting optimizes each split in isolation

    Using Decision Trees for Regression

    Regression trees are appropriate when:

    • Target is a continuous variable
    • Nonlinear or complex relationship between features and target (with a truly linear relationship, linear regression typically performs better)

    Output values from terminal nodes represent the mean response:

    • Predictions for new data points are taken from that mean

    Recursive Splitting in Regression Trees

    At every stage in a regression tree, the region is split in two according to the sum of squares error:

    The model begins with the entire data set, S, and searches every distinct value of every predictor to find the predictor and split value that partition the data into two groups (S1 and S2) such that the overall sum of squares error is minimized:

    \[ SSE = \sum_{i \in S_1}{(y_i - \overline{y}_1)^2} + \sum_{i \in S_2}{(y_i - \overline{y}_2)^2} \]
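    This search is short enough to sketch by brute force. Assuming a single predictor for simplicity (a real implementation loops over all predictors), the code below tries every distinct value as a split point and keeps the one with the smallest total SSE, exactly as defined above:

        # Sketch: exhaustive search for the SSE-minimizing split on one
        # predictor; illustration only, not a full CART implementation.
        import numpy as np

        def best_split_sse(x, y):
            best_s, best_sse = None, np.inf
            for s in np.unique(x):
                left, right = y[x <= s], y[x > s]
                if len(left) == 0 or len(right) == 0:
                    continue
                sse = ((left - left.mean()) ** 2).sum() \
                    + ((right - right.mean()) ** 2).sum()
                if sse < best_sse:
                    best_s, best_sse = s, sse
            return best_s, best_sse

        x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
        y = np.array([1.1, 0.9, 1.0, 5.0, 5.2, 4.8])
        print(best_split_sse(x, y))   # best split is x <= 3.0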

    Using Decision Trees for Classification

    Classification trees are appropriate when:

    • Target is a categorical variable (binary or multiclass)

    Output values from terminal nodes represent the mode response:

    • Predictions for new data points are taken from that mode

    Recursive Splitting in Classification Trees

    At every stage in a classification tree, the region is split in two according to a user-defined metric, for example:

    The Gini index (G) is a measure of total variance across the K classes; it measures the probability of misclassification.

    \[ G = \sum^{K}_{k=1}{\hat{p}_{mk}(1 - \hat{p}_{mk})} \]

    where \( \hat{p}_{mk} \) is the proportion of training observations in the mth region that belong to the kth class.

    • G takes on a small value if all of the \( \hat{p}_{mk} \) values are close to zero or one
    • G is a "measure of node purity": a small value indicates that a node contains predominantly observations from a single class

    Tree Pruning

    Tree pruning is the process used to overcome model overfitting by removing subtrees from a decision tree (that is, replacing a whole subtree with a leaf node).

    Why Prune a Decision Tree

    Why tree pruning is necessary:

    • Deep tree = model overfitting = poor performance on new data
    • Prune a subtree when its expected error rate is greater than that of the single leaf that would replace it

    Two Popular Tree Pruning Methods

    • Hold-out test: the fastest, simplest pruning method; prune a node if doing so does not hurt accuracy on a held-out validation set
    • Cost-complexity pruning: penalizes the size of the tree with a complexity parameter, trading training fit against subtree size
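    scikit-learn (again an assumption about tooling) exposes cost-complexity pruning through the ccp_alpha parameter; the sketch below computes the pruning path on a sample dataset and refits at increasing penalties, so the tree shrinks as alpha grows:

        # Sketch of cost-complexity pruning via scikit-learn's ccp_alpha.
        from sklearn.datasets import load_breast_cancer
        from sklearn.model_selection import train_test_split
        from sklearn.tree import DecisionTreeClassifier

        X, y = load_breast_cancer(return_X_y=True)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

        # Each alpha on the path corresponds to a smaller pruned subtree.
        path = DecisionTreeClassifier(random_state=0) \
            .cost_complexity_pruning_path(X_tr, y_tr)

        for alpha in path.ccp_alphas[::5]:
            tree = DecisionTreeClassifier(random_state=0,
                                          ccp_alpha=alpha).fit(X_tr, y_tr)
            print(f"alpha={alpha:.4f}  leaves={tree.get_n_leaves()}  "
                  f"test accuracy={tree.score(X_te, y_te):.3f}")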