
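Decision Trees, Part 1
A decision tree is a supervised learning algorithm that can be used for both classification and regression. Its main strengths are that the model is easy to read and that prediction is fast. A decision tree is usually learned by minimizing a loss function.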

scikit-learn provides two kinds of decision trees, and both use an optimized version of the CART algorithm.

1. Regression decision tree (DecisionTreeRegressor)

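DecisionTreeRegressor implements a regression decision tree for regression problems. Its signature, as of the older scikit-learn release this post describes, is: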

class sklearn.tree.DecisionTreeRegressor(criterion='mse',splitter='best',max_depth=None,min_samples_split=2,min_samples_leaf=1,min_weight_fraction_leaf=0.0,max_features=None,random_state=None,max_leaf_nodes=None,presort=False)

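Parameters

criterion: a string specifying the criterion used to measure the quality of a split. In the release described here the default is 'mse' (mean squared error), and it is the only supported value.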
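To make the mean-squared-error criterion concrete, here is a minimal pure-Python sketch of how a single split on one feature could be chosen. It is an illustration, not scikit-learn's implementation: the helper names `sse` and `best_split` and the toy data are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (n times the MSE)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, total_sse) for the split with the lowest summed squared error."""
    pairs = sorted(zip(xs, ys))
    best = (None, sse(ys))  # baseline: no split at all
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between two equal feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Toy data: two flat groups, so the ideal threshold lies between 3 and 10.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
threshold, error = best_split(xs, ys)
print(threshold, error)
```

Minimizing the summed squared error of the two children is equivalent to minimizing their sample-weighted MSE, which is why the sketch never divides by the group sizes.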

splitter: a string specifying the splitting strategy. It can be one of the following.
- 'best': choose the best split.
- 'random': choose a random split.
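max_features: an integer, a float, a string, or None, specifying how many features to consider when looking for the best split.
- If an integer, only max_features features are considered at each split.
- If a float, only max_features*n_features features are considered at each split (max_features is a fraction).
- If the string 'auto', max_features equals n_features; if the string 'sqrt', max_features equals sqrt(n_features).
- If the string 'log2', max_features equals log2(n_features).
- If None, max_features equals n_features.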
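The rules above can be checked by reading the fitted `max_features_` attribute. This is a small sketch on made-up data with 16 features; the dataset and the `resolved` dictionary are illustrative assumptions, and 'auto' is left out because recent scikit-learn releases no longer accept it.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 16)                 # 16 features
y = X[:, 0] + 0.1 * rng.rand(200)     # arbitrary synthetic target

resolved = {}
for setting in (3, 0.5, "sqrt", "log2", None):
    regr = DecisionTreeRegressor(max_features=setting, random_state=0)
    regr.fit(X, y)
    resolved[setting] = regr.max_features_  # value inferred during fit
    print(repr(setting), "->", regr.max_features_)
```

With 16 features, the fraction 0.5 resolves to 8 features, while 'sqrt' and 'log2' both resolve to 4.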
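max_depth: an integer or None, specifying the maximum depth of the tree.
- If None, the depth of the tree is unlimited.
- If max_leaf_nodes is not None, this option is ignored (in the release described here).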
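min_samples_split: an integer specifying the minimum number of samples an internal (non-leaf) node must contain in order to be split.

min_samples_leaf: an integer specifying the minimum number of samples each leaf node must contain.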
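The effect of these two limits can be inspected on a fitted tree. The sketch below assumes the low-level `tree_` arrays `children_left` (which is -1 for leaves) and `n_node_samples`; the data and the chosen values 20 and 5 are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = 5 * rng.rand(200, 1)
y = np.sin(X).ravel() + 0.1 * rng.randn(200)

regr = DecisionTreeRegressor(min_samples_split=20, min_samples_leaf=5,
                             random_state=0).fit(X, y)

tree = regr.tree_
is_leaf = tree.children_left == -1               # leaves have no children
leaf_sizes = tree.n_node_samples[is_leaf]        # samples reaching each leaf
internal_sizes = tree.n_node_samples[~is_leaf]   # samples at each split node

print("smallest leaf:", leaf_sizes.min())
print("smallest split node:", internal_sizes.min())
```

Every leaf holds at least 5 samples, and every node that was actually split held at least 20.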

min_weight_fraction_leaf: a float giving the minimum fraction of the total sample weight that a leaf node must hold.

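max_leaf_nodes: an integer or None, specifying the maximum number of leaf nodes.
- If None, the number of leaf nodes is unlimited.
- If not None, max_depth is ignored (in the release described here).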
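As a rough illustration of the two size limits, the sketch below fits three trees on the same synthetic data and reads back their depth and leaf count with `get_depth()` and `get_n_leaves()`. Those two methods only exist in reasonably recent scikit-learn releases, and the data and limits are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = 5 * rng.rand(200, 1)
y = np.sin(X).ravel() + 0.1 * rng.randn(200)

unlimited = DecisionTreeRegressor(random_state=0).fit(X, y)
shallow = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
few_leaves = DecisionTreeRegressor(max_leaf_nodes=8, random_state=0).fit(X, y)

for name, model in [("unlimited", unlimited),
                    ("max_depth=3", shallow),
                    ("max_leaf_nodes=8", few_leaves)]:
    print(name, "depth:", model.get_depth(), "leaves:", model.get_n_leaves())
```

The unlimited tree keeps splitting until the noisy targets are fit exactly, which is why it ends up far deeper and with far more leaves than either constrained tree.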
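class_weight: a dictionary, a list of dictionaries, the string 'balanced', or None, specifying the class weights in the form {class_label: weight}. Note that this parameter belongs to DecisionTreeClassifier; it does not appear in the regressor signature above.
- If None, every class has weight 1.
- The string 'balanced' means each class weight is inversely proportional to the frequency of that class in the samples.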
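random_state: an integer, a RandomState instance, or None.
- If an integer, it is used as the seed of the random number generator.
- If a RandomState instance, it is used as the random number generator.
- If None, the default random number generator is used.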
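A quick way to see what random_state buys you is to fit the same randomized tree more than once. The sketch below uses splitter='random' so that the seed clearly matters; the data, the seed 42, and the variable names are arbitrary assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = 5 * rng.rand(100, 1)
y = np.sin(X).ravel()

# Same integer seed twice, then an equivalent RandomState instance.
a = DecisionTreeRegressor(splitter="random", max_depth=3, random_state=42).fit(X, y)
b = DecisionTreeRegressor(splitter="random", max_depth=3, random_state=42).fit(X, y)
c = DecisionTreeRegressor(splitter="random", max_depth=3,
                          random_state=np.random.RandomState(42)).fit(X, y)

same_int_seed = np.array_equal(a.tree_.threshold, b.tree_.threshold)
same_instance = np.array_equal(a.tree_.threshold, c.tree_.threshold)
print(same_int_seed, same_instance)
```

Fixing the seed makes the split thresholds, and therefore the predictions, reproducible from run to run.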
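presort: a boolean specifying whether to presort the data to speed up the search for the best split. Setting it to True may slow down overall training on large datasets, but can speed up training on small datasets or when the maximum depth is limited.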

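There are five attributes:

feature_importances_: the importance of each feature. The higher the value, the more important the feature (also known as Gini importance).

max_features_: the inferred value of max_features.

n_features_: the number of features, available after fit has been called.

n_outputs_: the number of outputs, available after fit has been called.

tree_: a Tree object, the underlying decision tree.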
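The sketch below reads several of these attributes from a fitted model. The data is synthetic and built so that only the first of three features carries signal. `n_features_` is left out because newer scikit-learn releases renamed it to `n_features_in_`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(300, 3)
y = 10 * X[:, 0] + 0.01 * rng.rand(300)   # only feature 0 matters

regr = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

importances = regr.feature_importances_
print("importances:", importances)         # normalized to sum to 1
print("max_features_:", regr.max_features_)
print("n_outputs_:", regr.n_outputs_)
print("nodes in tree_:", regr.tree_.node_count)
```

Because the target depends almost entirely on the first column, nearly all of the importance is assigned to feature 0.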

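There are three methods:

fit(X,y[,sample_weight,check_input,....]): trains the model.

predict(X[,check_input]): predicts with the model and returns the predicted values.

score(X,y[,sample_weight]): returns the prediction score, which for a regressor is the coefficient of determination R².

The score never exceeds 1, but it can be negative when the predictions are very poor.

The higher the score, the better the prediction performance.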
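The bounds on the score follow from the definition of R², which can be written as 1 - SS_res / SS_tot. Here is a minimal pure-Python sketch; the helper name `r2_score_manual` and the toy targets are hypothetical.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual error
    ss_tot = sum((t - mean) ** 2 for t in y_true)               # variance around the mean
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r2_score_manual(y_true, y_true)                  # exact predictions
mean_only = r2_score_manual(y_true, [2.5] * 4)             # always predict the mean
reversed_ = r2_score_manual(y_true, [4.0, 3.0, 2.0, 1.0])  # worse than the mean

print(perfect, mean_only, reversed_)
```

A perfect model scores 1, always predicting the mean scores 0, and anything worse than the mean goes negative.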

Here is an example:
```
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
# sklearn.cross_validation was removed; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split

def create_data(n):
    # Generate a random dataset: a sine curve with noise added to every 5th point
    np.random.seed(0)
    X = 5 * np.random.rand(n,1)
    y = np.sin(X).ravel()
    noise_num = len(y[::5]) # matches the slice even when n is not a multiple of 5
    y[::5] += 3 * (0.5 - np.random.rand(noise_num))
    return train_test_split(X,y,test_size=0.25,random_state=1)

def test_DecisionTreeRegressor(*data):
    # Fit a default tree, print its scores, and plot the fitted curve
    X_train,X_test,y_train,y_test = data
    regr = DecisionTreeRegressor()
    regr.fit(X_train,y_train)
    print("Training score:%f"%(regr.score(X_train,y_train)))
    print("Testing score:%f"%(regr.score(X_test,y_test)))
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    X = np.arange(0.0,5.0,0.01)[:,np.newaxis]
    Y = regr.predict(X)
    ax.scatter(X_train,y_train,label="train sample",c='g')
    ax.scatter(X_test,y_test,label="test sample",c='r')
    ax.plot(X,Y,label='predict_value',linewidth=2,alpha=0.5)
    ax.set_xlabel("data")
    ax.set_ylabel("target")
    ax.set_title("Decision Tree Regression")
    ax.legend(framealpha=0.5)
    plt.show()

#X_train,X_test,y_train,y_test = create_data(100)
#test_DecisionTreeRegressor(X_train,X_test,y_train,y_test)

def test_DecisionTreeRegressor_splitter(*data):
    # Compare random splitting with best splitting
    X_train,X_test,y_train,y_test = data
    splitters = ['best','random']
    for splitter in splitters:
        regr = DecisionTreeRegressor(splitter=splitter)
        regr.fit(X_train,y_train)
        print("Splitter %s" % splitter)
        print("Training score:%f" % (regr.score(X_train,y_train)))
        print("Testing score:%f" % (regr.score(X_test,y_test)))


#X_train,X_test,y_train,y_test = create_data(100)
#test_DecisionTreeRegressor_splitter(X_train,X_test,y_train,y_test)


def test_DecisionTreeRegressor_depth(*data,maxdepth):
    # Examine how the tree depth affects the scores
    X_train, X_test, y_train, y_test = data
    depths = np.arange(1,maxdepth)
    training_scores = []
    testing_scores = []
    for depth in depths:
        regr = DecisionTreeRegressor(max_depth=depth)
        regr.fit(X_train,y_train)
        training_scores.append(regr.score(X_train,y_train))
        testing_scores.append(regr.score(X_test,y_test))

    # Plot training and testing scores against depth
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    ax.plot(depths,training_scores,label="training score")
    ax.plot(depths,testing_scores,label="testing score")
    ax.set_xlabel("maxdepth")
    ax.set_ylabel("score")
    ax.set_title("Decision Tree Regression")
    ax.legend(framealpha=0.5)
    plt.show()

X_train,X_test,y_train,y_test = create_data(100)
test_DecisionTreeRegressor_depth(X_train,X_test,y_train,y_test,maxdepth=20)
```
Original post: https://www.cnblogs.com/freeman818/p/7800765.html