Principles of the Decision Tree Algorithm

The Idea Behind the ID3 Decision Tree Algorithm

The ID3 algorithm uses information gain to decide which feature the current node should split on: it computes the information gain of every candidate feature and builds the current node from the feature with the largest gain. Here is a concrete example of the calculation. Suppose we have a sample set D of 15 samples whose output is 0 or 1; 9 outputs are 1 and 6 outputs are 0. The samples have a feature A, which takes the values A1, A2 and A3. Among the samples with value A1, 3 outputs are 1 and 2 are 0; among the samples with value A2, 2 outputs are 1 and 3 are 0; among the samples with value A3, 4 outputs are 1 and 1 is 0.

The entropy of the sample set D is: H(D) = −(9/15·log₂(9/15) + 6/15·log₂(6/15)) = 0.971

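The conditional entropy of D given feature A, where D1, D2 and D3 are the subsets of samples with A = A1, A2 and A3, is:

H(D|A) = 5/15·H(D1) + 5/15·H(D2) + 5/15·H(D3)

= −5/15·(3/5·log₂(3/5) + 2/5·log₂(2/5)) − 5/15·(2/5·log₂(2/5) + 3/5·log₂(3/5)) − 5/15·(4/5·log₂(4/5) + 1/5·log₂(1/5))

= 0.888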

The corresponding information gain is I(D,A)=H(D)−H(D|A)=0.083
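
As a quick check of the figures above, here is a minimal Python sketch that recomputes them. The function names (`entropy`, `conditional_entropy`, `information_gain`) and the list encoding of the 15 samples are illustrative assumptions, not part of the original text.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def conditional_entropy(feature_values, labels):
    """Entropy of the labels after grouping samples by feature value."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weight each group's entropy by the fraction of samples it holds.
    return sum(len(g) / total * entropy(g) for g in groups.values())


def information_gain(feature_values, labels):
    """I(D, A) = H(D) - H(D|A)."""
    return entropy(labels) - conditional_entropy(feature_values, labels)


# The 15 samples from the example: 5 per value of feature A.
feature_a = ["A1"] * 5 + ["A2"] * 5 + ["A3"] * 5
outputs = ([1, 1, 1, 0, 0]      # A1: three 1s, two 0s
           + [1, 1, 0, 0, 0]    # A2: two 1s, three 0s
           + [1, 1, 1, 1, 0])   # A3: four 1s, one 0

print(round(entropy(outputs), 3))                         # 0.971
print(round(conditional_entropy(feature_a, outputs), 3))  # 0.888
print(round(information_gain(feature_a, outputs), 3))     # 0.083
```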

Now let's walk through the algorithm step by step.

The input is a set of m samples with output set D; each sample has n discrete features, and the feature set is A. The output is a decision tree T.

The process of the algorithm is:

1) Initialize the information gain threshold ε.

2) Check whether all samples belong to the same output class Di. If so, return a single-node tree T labeled with class Di.

3) Check whether the feature set A is empty. If so, return a single-node tree T labeled with the class that has the most instances in the output D.

4) Calculate the information gain of each of the n features in A with respect to the output D, and select the feature Ag with the largest information gain.

5) If the information gain of Ag is less than the threshold ε, return a single-node tree T labeled with the class that has the most instances in the output D.

6) Otherwise, partition the samples D into subsets Di according to the distinct values Agi of feature Ag. Each subset produces a child node whose corresponding feature value is Agi. Return the tree T with these nodes added.

7) For each child node, let D=Di and A=A−{Ag}, then recursively apply steps 2-6 to obtain the subtree Ti and return it.
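
The steps above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not a reference implementation: samples are assumed to be dicts mapping feature names to discrete values, the tree is assumed to be a nested dict of the form `{feature: {value: subtree}}` with class labels as leaves, and the names `id3`, `majority_class` and `epsilon` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """H(D) - H(D|feature) for one candidate feature."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    cond = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - cond


def majority_class(labels):
    return Counter(labels).most_common(1)[0][0]


def id3(rows, labels, features, epsilon=1e-3):
    # Step 1 is the choice of epsilon (assumed default above).
    # Step 2: all samples share one class, so return a leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # Step 3: no features left, so return the majority class.
    if not features:
        return majority_class(labels)
    # Step 4: pick the feature with the largest information gain.
    gains = {f: information_gain(rows, labels, f) for f in features}
    best = max(gains, key=gains.get)
    # Step 5: gain below the threshold, so return the majority class.
    if gains[best] < epsilon:
        return majority_class(labels)
    # Step 6: partition the samples by the values of the chosen feature.
    partitions = {}
    for row, label in zip(rows, labels):
        sub_rows, sub_labels = partitions.setdefault(row[best], ([], []))
        sub_rows.append(row)
        sub_labels.append(label)
    # Step 7: recurse on each subset with the chosen feature removed.
    remaining = [f for f in features if f != best]
    return {best: {value: id3(r, l, remaining, epsilon)
                   for value, (r, l) in partitions.items()}}


# Usage with the 15-sample example (feature A only).
rows = [{"A": v} for v in ["A1"] * 5 + ["A2"] * 5 + ["A3"] * 5]
labels = [1, 1, 1, 0, 0] + [1, 1, 0, 0, 0] + [1, 1, 1, 1, 0]
tree = id3(rows, labels, ["A"], epsilon=0.05)
print(tree)  # {'A': {'A1': 1, 'A2': 0, 'A3': 1}}
```

With a threshold above the gain of 0.083, for example `epsilon=0.1`, step 5 applies and the call returns the single majority class instead of a split.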
Original article: https://www.cnblogs.com/aiden-liu/p/10773686.html