Data Analysis (Part 3)
Before analyzing the UCI data, it helps to review a few decision tree concepts.
- A recommended blog post on decision trees: http://www.cnblogs.com/yonghao/p/5061873.html
- Basic characteristics of a decision tree (DT)
  - A DT is a supervised learning method, so we need labeled data.
  - It runs as a single process only, so it is not a good fit for giant datasets.
  - PS: It works quite well on small, clean datasets.
- Characteristics of the UCI data: the UCI credit approval data set
  - 690 data entries, a relatively small dataset
  - 15 attributes, which is quite few
  - Only about 5% of entries have missing values
  - 2 classes
- Comparing these two lists, we can expect a DT to work well for our dataset.
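To sum up, we can now try implementing a decision tree in code. Use the skeleton provided by Teacher Duan (段老师) and write your own code in the following steps: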
- Copy and paste your code into the function readfile(file_name), under the comment # Your code here.
- Make sure your input and output match what is described in the docstring.
- Make a minor improvement to handle missing data: in this case, use the string "missing" to represent missing data. Note that the raw file marks it as "?".
- Implement is_missing(value), class_counts(rows), and is_numeric(value) as directed in the docstring.
- Implement the class Determine. This object represents a node of our DT.
  - It has 2 inputs and a method.
  - We can think of it as the question we ask at each node.
- Implement the method partition(rows, question) as described in the docstring.
  - Use the Determine class to partition the data into 2 groups.
- Implement the method gini(rows) as described in the docstring.
- Implement the method info_gain(left, right, current_uncertainty) as described in the docstring.
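The last few steps can be sketched roughly as follows. This is only an illustration under stated assumptions, not the skeleton's actual code: the constructor arguments `column` and `value`, the method name `match`, the ">=" rule for numeric columns, and the convention that the label is the last element of each row are all assumptions, so the skeleton's docstrings take precedence. The sketch uses the usual Gini impurity (1 minus the sum of squared label proportions) and defines the information gain as the parent's impurity minus the size-weighted impurity of the two child groups.

```python
from collections import Counter


def is_numeric(value):
    """True for int/float values (bool is deliberately excluded)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


class Determine:
    """A question asked at a node: compares one column against a value.

    Assumed interface: `column` is an index into a row, `value` is the
    threshold (numeric) or category (string) to compare against.
    """

    def __init__(self, column, value):
        self.column = column
        self.value = value

    def match(self, row):
        """Return True if `row` answers the question with 'yes'."""
        candidate = row[self.column]
        if is_numeric(candidate) and is_numeric(self.value):
            return candidate >= self.value
        # Strings (including "missing") fall back to an equality test
        return candidate == self.value


def partition(rows, question):
    """Split rows into (true_rows, false_rows) according to the question."""
    true_rows, false_rows = [], []
    for row in rows:
        (true_rows if question.match(row) else false_rows).append(row)
    return true_rows, false_rows


def class_counts(rows):
    """Count rows per label, assuming the label is the last element."""
    return dict(Counter(row[-1] for row in rows))


def gini(rows):
    """Gini impurity: 1 - sum(p_i ** 2) over the label proportions."""
    total = len(rows)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in class_counts(rows).values())


def info_gain(left, right, current_uncertainty):
    """Parent impurity minus the size-weighted impurity of the children."""
    p = len(left) / (len(left) + len(right))
    return current_uncertainty - p * gini(left) - (1 - p) * gini(right)


# Tiny made-up sample: [numeric feature, categorical feature, label]
rows = [[1.5, 't', '+'], [3.0, 't', '+'], [0.5, 'f', '-'], [0.2, 'f', '-']]
true_rows, false_rows = partition(rows, Determine(1, 't'))
gain = info_gain(true_rows, false_rows, gini(rows))
print(gain)  # prints 0.5: the split separates the two labels perfectly
```

In this toy sample the parent impurity is 0.5 and both child groups are pure, so the gain equals the full parent impurity. On real data, each candidate question would be scored this way and the one with the highest gain kept.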
My code follows, for reference only.
def readfile(file_name):
    """
    This function reads a data file and returns structured, cleaned data in a list
    :param file_name: relative path under data folder
    :return: data, in this case it should be a 2-D list of the form
    [[data1_1, data1_2, ...],
     [data2_1, data2_2, ...],
     [data3_1, data3_2, ...],
     ...]
    e.g.
    [['a', 58.67, 4.46, 'u', 'g', 'q', 'h', 3.04, 't', 't', 6.0, 'f', 'g', '00043', 560.0, '+'],
     ['a', 24.5, 0.5, 'u', 'g', 'q', 'h', 1.5, 't', 'f', 0.0, 'f', 'g', '00280', 824.0, '+'],
     ['b', 27.83, 1.54, 'u', 'g', 'w', 'v', 3.75, 't', 't', 5.0, 't', 'g', '00100', 3.0, '+'],
     ...]
    A couple of things you should note:
    1. You need to handle missing data. In this case let's use "missing" to represent all missing data
    2. Be careful with data types. For instance,
       "58.67" and "0.2356" should be numbers, not strings
       "00043" should be a string, not a number
       It is OK to treat all numbers as float in this case. (You don't need to worry about differentiating integer and float)
    """
    # Your code here
    output = []
    # "with" closes the file even if parsing raises
    with open(file_name, 'r') as data_file:
        for line in data_file:
            # strip() drops the trailing newline without eating the last
            # character of a final line that has no newline
            line = line.strip()
            if not line:
                continue  # skip blank lines
            data = []
            for substr in line.split(","):
                if substr == '?':
                    # the raw file marks missing values with "?"
                    data.append('missing')
                elif substr.isdigit() and len(substr) > 1 and substr.startswith('0'):
                    # zero-padded codes such as "00043" stay strings
                    data.append(substr)
                else:
                    try:
                        # all numbers become float, as the docstring allows
                        data.append(float(substr))
                    except ValueError:
                        # anything else is a categorical string
                        data.append(substr)
            output.append(data)
    return output
def is_missing(value):
    """
    Determines if the given value is missing data; please refer back to readfile(), where we defined what "missing" data is
    :param value: value to be checked
    :return: boolean (True, False) of whether the input value is the same as our "missing" notation
    """
    return value == 'missing'
def class_counts(rows):
    """
    Count how many data samples there are for each label
    :param rows: Input is a 2D list in the form of what you have returned in readfile()
    :return: Output is a dictionary/map in the form:
    {"label_1": #count,
     "label_2": #count,
     "label_3": #count,
     ...
    }
    """
    # This first version is hard-coded: it only counts the two labels given here ('+' and '-'). To handle data with arbitrary labels, it is extended by the methods below.
    # label_dict = {}
    # count1 = 0
    # count2 = 0
    # # rows is the result returned by readfile
    # for row in rows:
    #     if row[-1] == '+':
    #         count1 += 1
    #     elif row[-1] == '-':
    #         count2 += 1
    # label_dict['+'] = count1
    # label_dict['-'] = count2
    # return label_dict
    # Extension 1
    # This version counts any set of labels using two loops. The first loop collects the distinct labels found in the data into a list, lable_list.
    # It then iterates over lable_list, with a nested loop over all rows that counts how many rows carry each label in lable_list.
    # label_dict = {}
    # lable_list = []