Machine Learning in Action, Chapter 3: Decision Trees (Notes)

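I. Decision tree (DT) overview

In one sentence: using information entropy as the measure, build the tree along which entropy drops fastest, until the entropy at each leaf node is ideally 0 (all instances in a leaf belong to one class).

Construction is top-down and recursive: choose the best feature, split the training data on it so that each resulting subset is as pure as possible, then repeat the process on each subset.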
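To make the entropy measure concrete, here is a minimal sketch using the standard Shannon entropy over class proportions. The function name `entropy` and the toy label lists are assumptions made for illustration; they are not part of the book's code.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy in bits: the sum of p * log2(1/p) over the class proportions p."""
    total = len(labels)
    counts = Counter(labels)
    return sum((n / total) * log2(total / n) for n in counts.values())

# Hypothetical toy label lists: an evenly mixed node and a pure node.
mixed = ['yes', 'yes', 'no', 'no']
pure = ['yes', 'yes', 'yes', 'yes']
print(entropy(mixed))
print(entropy(pure))
```

A pure node has zero entropy, which is why a leaf whose instances all share one class needs no further splitting.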

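Algorithms: ID3 selects features by information gain, C4.5 by information gain ratio, and CART by the Gini index (for classification trees);

Nodes: internal nodes are features (attributes); leaf nodes are classes;

Types: classification trees (the target is discrete) and regression trees (the target is continuous);

Mitigating overfitting: pruning.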
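For comparison with entropy, the Gini index that CART uses can be sketched as follows. The helper names `gini` and `weighted_gini` and the example groups are hypothetical; this is an illustration of the standard definition (one minus the sum of squared class proportions), not CART itself.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def weighted_gini(groups):
    """Impurity of a split: each group's Gini weighted by its share of the samples."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * gini(g) for g in groups)

# Hypothetical splits of the same four samples.
perfect_split = [['yes', 'yes'], ['no', 'no']]
useless_split = [['yes', 'no'], ['yes', 'no']]
print(weighted_gini(perfect_split))
print(weighted_gini(useless_split))
```

A split is preferred when its weighted impurity is lower, in the same way that ID3 prefers the split with the lower weighted entropy.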

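II. Pros and cons

Pros:

1. (Easy to understand and to reason about) Cheap to compute, results are easy to interpret, and the corresponding logical expression can be read directly off the tree;

2. (Tolerant of imperfect data) Data preprocessing is simple, and the method is insensitive to missing intermediate values;

3. (Irrelevant features) Can handle data containing irrelevant features;

4. (Many features) A decision tree can be built for datasets with many features;

5. (Speed) Produces workable, good-quality classification results on large datasets in a relatively short time;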

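Cons:

1. Prone to overfitting (mitigated by pruning);

2. Ignores correlations between features;

3. Sensitive to noisy data. PS: noisy data are records that contain errors or outliers (values deviating from what is expected); they interfere with the analysis.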

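Ways to handle missing data:

1. Fill with the median (numeric features) or the mode (categorical features). PS: the mode is the value that occurs most often in a distribution and represents the typical level of the data;

2. Fill with a weighted estimate computed from the same feature in the other training samples.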
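The first approach above (median or mode) can be sketched with the standard library. The function `impute_column`, the use of `None` to mark a missing value, and the sample columns are all assumptions made for this illustration.

```python
from statistics import median, mode

def impute_column(values, kind):
    """Replace None entries with the median (numeric) or the mode (categorical)."""
    present = [v for v in values if v is not None]
    fill = median(present) if kind == "numeric" else mode(present)
    return [fill if v is None else v for v in values]

# Hypothetical columns with missing entries marked as None.
ages = [23, None, 31, 27, None]
colors = ["red", "blue", None, "red"]
print(impute_column(ages, "numeric"))
print(impute_column(colors, "categorical"))
```

The fill value is computed only from the entries that are present, so the missing entries themselves do not distort the median or the mode.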

III. Decision tree code

The code is from Machine Learning in Action.

'''
Created on Oct 12, 2010
Decision Tree Source Code for Machine Learning in Action Ch. 3
@author: Peter Harrington
'''
from math import log
import operator
import pickle

def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 0, 'yes'],
               [0, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no'],
               [0, 0, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 0, 'no'],
               [0, 0, 'no']]
    labels = ['no surfacing', 'flippers']  # feature names, used to label internal nodes
    # change to discrete values
    return dataSet, labels

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:  # count the occurrences of each class label
        currentLabel = featVec[-1]
        labelCounts[currentLabel] = labelCounts.get(currentLabel, 0) + 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)  # log base 2
    return shannonEnt

def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]  # chop out the axis used for splitting
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1  # the last column is used for the labels
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):  # iterate over all the features
        featList = [example[i] for example in dataSet]  # all values of this feature
        uniqueVals = set(featList)  # the distinct values of this feature
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)  # share of the data * Shannon entropy of the subset
        infoGain = baseEntropy - newEntropy  # information gain, i.e. reduction in entropy
        if infoGain > bestInfoGain:  # compare this to the best gain so far
            bestInfoGain = infoGain  # if better than the current best, keep it
            bestFeature = i
    return bestFeature  # returns an integer index

def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        classCount[vote] = classCount.get(vote, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]  # stop splitting when all of the classes are equal
    if len(dataSet[0]) == 1:  # stop splitting when there are no more features in dataSet
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)  # index of the best feature
    bestFeatLabel = labels[bestFeat]  # name of the best feature
    myTree = {bestFeatLabel: {}}  # the best feature's name becomes the internal node
    del labels[bestFeat]  # the name now lives in the tree; note that this mutates the caller's list
    featValues = [example[bestFeat] for example in dataSet]  # values of the best feature
    uniqueVals = set(featValues)  # the distinct values of the best feature
    for value in uniqueVals:
        subLabels = labels[:]  # copy all of labels, so recursive calls don't mess up existing labels
        # For each value of the chosen feature, split off the matching sub-dataset and
        # pass the remaining feature names down with it. The recursive call returns
        # either a class name or a nested dict (a subtree), so one key can map to
        # several values, each of which may itself be a dict.
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree

def classify(inputTree, featLabels, testVec):
    firstStr = next(iter(inputTree))  # dict.keys() is not indexable in Python 3
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    key = testVec[featIndex]
    valueOfFeat = secondDict[key]
    if isinstance(valueOfFeat, dict):
        classLabel = classify(valueOfFeat, featLabels, testVec)
    else:
        classLabel = valueOfFeat
    return classLabel

def storeTree(inputTree, filename):
    with open(filename, 'wb') as fw:  # pickle needs binary mode
        pickle.dump(inputTree, fw)

def grabTree(filename):
    with open(filename, 'rb') as fr:
        return pickle.load(fr)


if __name__ == '__main__':
    dataSet, labels = createDataSet()
    print(createTree(dataSet, labels))
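As a usage sketch for the classification step, the snippet below walks a hand-written nested dict of the same shape that createTree produces. The tree literal is illustrative only (it is not claimed to be the output for the dataset above), and `classify_sample` is a hypothetical standalone variant of classify so that the snippet runs on its own.

```python
import pickle

def classify_sample(tree, feat_labels, sample):
    """Follow the sample's feature values down the nested dict until a leaf is reached."""
    feature = next(iter(tree))                   # the root key is a feature name
    branches = tree[feature]                     # maps feature value -> subtree or class label
    value = sample[feat_labels.index(feature)]   # the sample's value for that feature
    subtree = branches[value]
    if isinstance(subtree, dict):                # still a dict: keep descending
        return classify_sample(subtree, feat_labels, sample)
    return subtree                               # a leaf: this is the class label

# Illustrative tree (assumption), not computed from the dataset above.
tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
feat_labels = ['no surfacing', 'flippers']

print(classify_sample(tree, feat_labels, [1, 1]))
print(classify_sample(tree, feat_labels, [1, 0]))

# The tree is a plain dict, so it survives a pickle round trip unchanged.
restored = pickle.loads(pickle.dumps(tree))
print(restored == tree)
```

Note that the feature-name list passed to the classifier must be the full, original list: createTree deletes entries from the list it is given, so pass it a copy if the names are needed afterwards.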

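Summary:

1. The code above implements the ID3 algorithm, which overfits easily; pruning can reduce the overfitting;

2. When different features lead to the same class assignments, ID3 cannot tell them apart well; C4.5, which normalizes information gain into a gain ratio, is one way to improve on it;

3. To compute the entropy for a feature, take the entropy of each subset produced by one value of that feature, multiply it by that subset's share of the data, and sum the results; the total is the (conditional) entropy for that feature;

4. The trained model is stored as a nested dict, so classifying a test sample means walking the tree recursively. Start from the root key (a feature name) and look up the sample's value for that feature; if that value still maps to a dict, there is more splitting to follow, otherwise it maps to the class label, which is the stopping condition of the recursion;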

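5. The sample rows do not carry the feature names themselves, so the names have to be supplied separately (the labels list); take care to keep the feature names aligned with the feature columns.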
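Points 2 and 3 can be checked numerically on the 10-row dataset from createDataSet. The sketch below computes the information gain (ID3's criterion) and the gain ratio (C4.5's criterion) for each feature; the helper names `entropy` and `info_gain_and_ratio` are assumptions for this illustration and do not come from the book.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy in bits of a list of discrete values."""
    total = len(values)
    return sum(n / total * log2(total / n) for n in Counter(values).values())

def info_gain_and_ratio(rows, col):
    """Return (information gain, gain ratio) for splitting rows on column col."""
    total = len(rows)
    cond_entropy = 0.0
    for value in set(r[col] for r in rows):
        subset = [r[-1] for r in rows if r[col] == value]
        cond_entropy += len(subset) / total * entropy(subset)  # share * subset entropy
    gain = entropy([r[-1] for r in rows]) - cond_entropy
    split_info = entropy([r[col] for r in rows])  # entropy of the feature's own values
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio

# Same rows as createDataSet() in the listing above.
rows = [[1, 1, 'yes'], [1, 0, 'yes'], [0, 0, 'no'], [0, 1, 'no'], [0, 1, 'no'],
        [0, 0, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 0, 'no'], [0, 0, 'no']]

for col, name in enumerate(['no surfacing', 'flippers']):
    gain, ratio = info_gain_and_ratio(rows, col)
    print(f"{name}: gain={gain:.4f}, gain ratio={ratio:.4f}")
```

On this dataset the first feature has the larger gain, so it is the one ID3 places at the root. Both features here have the same value distribution (four 1s and six 0s), so the gain ratio ranks them the same way; the two criteria only diverge when features differ in how many values they take.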

Original post: https://www.cnblogs.com/gwzz/p/13180907.html