• Data Analysis Series: Condensed Highlights (Part 3)


    Data Analysis (Part 3)

    Before analyzing the UCI data, it helps to review a few decision tree (DT) concepts.

    • A recommended blog post on decision trees:
      http://www.cnblogs.com/yonghao/p/5061873.html
    • Basic characteristics of a decision tree (DT)

      • A DT is a supervised learning method, so it needs labeled data

      • It runs as a single process only, so it is not a good fit for very large datasets

      • PS: It works quite well on small, clean datasets

    • Characteristics of the UCI data: UCI credit approval data set

      • 690 data entries, a relatively small dataset

      • 15 attributes, which is quite few

      • Missing values affect only about 5% of the data

      • Two classes

    • Comparing these two lists, we can expect a DT to work well for our dataset
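
    The figures above can be checked directly against a raw file. The sketch below is a minimal illustration, assuming a comma-separated file with one record per line, "?" as the missing marker, and the class label in the last field. The helper name profile and the sample rows are hypothetical (made-up data, not real records from the credit approval file).

```python
from collections import Counter

def profile(lines):
    """Summarize raw comma-separated records (label assumed to be the last field)."""
    records = [line.strip().split(",") for line in lines if line.strip()]
    # A record counts as incomplete if any of its fields is exactly "?"
    with_missing = sum(1 for rec in records if "?" in rec)
    return {
        "entries": len(records),
        "attributes": len(records[0]) - 1 if records else 0,
        "missing_share": with_missing / len(records) if records else 0.0,
        "classes": dict(Counter(rec[-1] for rec in records)),
    }

# Made-up sample in a similar layout (assumption: not real data)
sample = [
    "a,30.5,u,t,+",
    "b,?,u,f,-",
    "b,41.0,y,t,+",
    "a,22.0,u,f,-",
]
summary = profile(sample)
print(summary)
```

    On the real file, the same function could be called with the lines read from disk; the interesting numbers are the entry count, the share of incomplete records, and the class balance.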

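    In summary, we can now try implementing a decision tree in code. Using the skeleton provided by our instructor, Duan, write your own code in the following steps: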

    • Copy and paste your code into the function readfile(file_name), under the comment # Your code here.

    • Make sure your input and output match what I described in the docstring

    • Make a minor improvement to handle missing data: use the string "missing" to represent it. Note that the file marks missing data as "?".

    • Implement is_missing(value), class_counts(rows), and is_numeric(value) as directed in the docstrings
    • Implement class Determine. This object represents a node of our DT.
      • It has 2 inputs and one method.

      • We can think of it as the question we ask at each node.

    • Implement the method partition(rows, question) as described in the docstring
      • Use the Determine class to partition the data into 2 groups

    • Implement the method gini(rows) as described in the docstring
      • Here is the formula for Gini impurity:

        $G = 1 - \sum_{i=1}^{n} p_i^2$

        • where $n$ is the number of classes

        • $p_i$ is the percentage of the given class $i$

    • Implement the method info_gain(left, right, current_uncertainty) as described in the docstring
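      • Here is the formula for Information Gain:

        $IG = I_{\text{current}} - p_{\text{left}} \, G(\text{left}) - p_{\text{right}} \, G(\text{right})$

        • where $I_{\text{current}}$ is current_uncertainty

        • $G(\cdot)$ is the Gini impurity of the given branch, as computed by gini(rows)

        • $p_{\text{left}}$ is the percentage/probability of the left branch, same story for $p_{\text{right}}$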
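
    To make the two formulas concrete, here is a small worked sketch on made-up labels. It is a standalone illustration rather than the skeleton's code: gini_of_labels and information_gain are hypothetical helpers that take plain label lists instead of full rows.

```python
from collections import Counter

def gini_of_labels(labels):
    """Gini impurity 1 - sum(p_i ** 2) of a list of class labels (0.0 for an empty list)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def information_gain(left, right, current_uncertainty):
    """Parent impurity minus the size-weighted impurity of the two branches."""
    p_left = len(left) / (len(left) + len(right))
    p_right = 1.0 - p_left
    return (current_uncertainty
            - p_left * gini_of_labels(left)
            - p_right * gini_of_labels(right))

parent = ["+", "+", "+", "-", "-", "-"]   # three of each class
left = ["+", "+", "+", "-"]               # mostly "+"
right = ["-", "-"]                        # a pure branch

parent_gini = gini_of_labels(parent)      # 1 - (0.5**2 + 0.5**2) = 0.5
left_gini = gini_of_labels(left)          # 1 - (0.75**2 + 0.25**2) = 0.375
gain = information_gain(left, right, parent_gini)  # 0.5 - (4/6)*0.375 - (2/6)*0
print(parent_gini, left_gini, gain)
```

    The split removes half of the parent's impurity here: the pure right branch contributes nothing, and the left branch contributes its impurity weighted by its share of the rows.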

    • My code is as follows, for reference only

      def readfile(file_name):
          """
          This function reads a data file and returns structured, cleaned data in a list
          :param file_name: relative path under data folder
          :return: data, in this case it should be a 2-D list of the form
          [[data1_1, data1_2, ...],
           [data2_1, data2_2, ...],
           [data3_1, data3_2, ...],
           ...]

          i.e.
          [['a', 58.67, 4.46, 'u', 'g', 'q', 'h', 3.04, 't', 't', 6.0, 'f', 'g', '00043', 560.0, '+'],
           ['a', 24.5, 0.5, 'u', 'g', 'q', 'h', 1.5, 't', 'f', 0.0, 'f', 'g', '00280', 824.0, '+'],
           ['b', 27.83, 1.54, 'u', 'g', 'w', 'v', 3.75, 't', 't', 5.0, 't', 'g', '00100', 3.0, '+'],
          ...]

          A couple of things you should note:
          1. You need to handle missing data. In this case let's use "missing" to represent all missing data
          2. Be careful with data types. For instance,
              "58.67" and "0.2356" should be numbers, not strings
              "00043" should be a string, not a number
              It is OK to treat all numbers as float in this case. (You don't need to worry about differentiating integer and float)
          """
          output = []
          # A with-block closes the file even if parsing raises
          with open(file_name, 'r') as data_file:
              for line in data_file:
                  line = line.strip()
                  if not line:
                      continue    # skip blank lines, e.g. at the end of the file
                  row = []
                  for field in line.split(","):
                      if field == '?':
                          row.append('missing')    # the file marks missing data as "?"
                      elif field.isdigit() and len(field) > 1 and field.startswith('0'):
                          # Digit strings with a leading zero (e.g. "00043") stay strings.
                          # This also keeps values such as "06" as strings, so a column can mix str and float.
                          row.append(field)
                      else:
                          try:
                              row.append(float(field))    # all numbers are treated as float
                          except ValueError:
                              row.append(field)    # categorical value such as 'a' or 'u'
                  output.append(row)
          return output




      def is_missing(value):
          """
          Determines whether the given value is missing data; refer back to readfile(), where we defined the "missing" notation
          :param value: value to be checked
          :return: boolean (True, False) of whether the input value is the same as our "missing" notation
          """
          return value == 'missing'


      def class_counts(rows):
          """
          Count how many data samples there are for each label
          :param rows: Input is a 2D list in the form of what you have returned in readfile()
          :return: Output is a dictionary/map in the form:
          {"label_1": #count,
           "label_2": #count,
           "label_3": #count,
           ...
          }
          """
          # First attempt: hard-coded, so it only works for the labels given here ('+', '-').
          # The extended versions below were written to count data with arbitrary labels.
          # label_dict = {}
          # count1 = 0
          # count2 = 0
          # # rows is the result returned by readfile
          # for row in rows:
          #     if row[-1] == '+':
          #         count1 += 1
          #     elif row[-1] == '-':
          #         count2 += 1
          # label_dict['+'] = count1
          # label_dict['-'] = count2
          # return label_dict

          # Extended version 1
          # Counts any set of labels using two loops. The first loop collects the distinct
          # labels found in the data into label_list. The second walks label_list and, in a
          # nested loop over all rows, counts the rows whose label equals the current one.
          # label_dict = {}
          # label_list = []
          # for row in rows:
          #     label = row[-1]
          #     if label_list == []:
          #         label_list.append(label)
          #     else:
          #         if label in label_list:
          #             continue
          #         else:
          #             label_list.append(label)
          #
          # for label_i in label_list:
          #     count_row_i = 0
          #     for row_i in rows:
          #         if label_i == row_i[-1]:
          #             count_row_i += 1
          #     label_dict[label_i] = count_row_i
          # print(label_dict)
          # return label_dict

          # Extended version 2
          # Uses membership in the dictionary's keys to record each label and accumulate its count.
          label_dict = {}
          for row in rows:
              label = row[-1]
              if label in label_dict:
                  label_dict[label] += 1
              else:
                  label_dict[label] = 1
          return label_dict


      def is_numeric(value):
          """
          Test if the input is a number (float/int)
          :param value: Input is a value to be tested
          :return: Boolean (True/False)
          """
          # An earlier attempt used eval() on str(value) and checked the resulting type:
          # if type(eval(str(value))) == int or type(eval(str(value))) == float:
          #     return True
          # eval() is not needed here, and it is a security risk on untrusted input.

          # readfile() already converts numeric fields to numbers and leaves letters,
          # "missing", and leading-zero strings such as "00043" as str, so a type check
          # is enough. bool is excluded because it is a subclass of int.
          return isinstance(value, (int, float)) and not isinstance(value, bool)


      class Determine:
          """
          This class is used for comparisons. It stores a column index and a value,
          and its match method compares numbers or strings.
          It can be understood as the "question" asked at each node of the decision tree, e.g.:
              Is today's temperature cold or hot?
              Is today's weather sunny, cloudy, or rainy?
          """
          def __init__(self, column, value):
              """
              initial structure of our object
              :param column: column index of our "question"
              :param value: splitting value of our "question"
              """
              self.column = column
              self.value = value

          def match(self, example):
              """
              Compares example data and self.value
              note that you need to determine whether the data asked is numeric or categorical/string
              Be careful with missing data
              :param example: a full row of data
              :return: boolean(True/False) of whether input data is GREATER THAN self.value (numeric) or the SAME AS self.value (string)
              """
              # "missing" is a string too, so it needs no special case here
              example_value = example[self.column]
              # Both sides must be numeric before using ">". The check on self.value was added
              # for column index 10, whose types are not uniform (leading-zero strings next to
              # numbers), and a number cannot be compared with a str.
              if is_numeric(example_value) and is_numeric(self.value):
                  return example_value > self.value
              return example_value == self.value

          def __repr__(self):
              """
              Used when printing the tree
              :return: the question as a readable string
              """
              # ">" matches the strict comparison in match()
              condition = ">" if is_numeric(self.value) else "=="
              # header is not defined in this listing; it is assumed to be a list of
              # column names provided elsewhere (e.g. by the skeleton).
              return "{} {} {}?".format(
                  header[self.column], condition, str(self.value))


      def partition(rows, question):
          """
          Split the data: rows that satisfy the question go into true_rows, the rest into false_rows
          :param rows: data set/subset
          :param question: Determine object you implemented above
          :return: 2 lists based on the answer of the question
          """
          # question is a Determine object
          true_rows, false_rows = [], []
          # Iterate row by row, because Determine.match only looks at the value
          # at the given column index of a single row
          for row in rows:
              if question.match(row):
                  true_rows.append(row)
              else:
                  false_rows.append(row)
          return true_rows, false_rows


      def gini(rows):
          """
          Compute the Gini value of a set of rows, one way of expressing how mixed the labels are
          :param rows: data set/subset
          :return: Gini value, i.e. the impurity
          """
          data_set_size = len(rows)    # total number of rows
          if data_set_size == 0:
              return 0.0    # an empty set has no impurity (without this guard the formula would give 1)
          class_dict = class_counts(rows)
          sum_subgini = 0
          for class_count in class_dict.values():
              sum_subgini += (class_count / data_set_size) ** 2
          return 1 - sum_subgini



      def info_gain(left, right, current_uncertainty):
          """
          Compute the information gain
          Please refer to the .md tutorial for details
          :param left: left branch
          :param right: right branch
          :param current_uncertainty: current uncertainty (data)
          :return: current uncertainty minus the size-weighted Gini impurity of the two branches
          """
          p_left = len(left) / (len(left) + len(right))
          p_right = 1 - p_left
          return current_uncertainty - p_left * gini(left) - p_right * gini(right)




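      # Use this data to test the quality of your code
      # (adjust the path to wherever crx.data lives on your machine)
      data = readfile("E:datacrx.data")
      # The split value must be a number: with the string '1.8', match() falls back to an
      # equality test that no numeric value in column 2 satisfies
      t, f = partition(data, Determine(2, 1.8))
      print(info_gain(t, f, gini(data)))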
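
    Because the path above is specific to one machine, a quick check on hand-made rows can be handy. The following is a minimal self-contained sketch, not the skeleton's code: Question is a hypothetical stand-in for Determine, and the rows are made up. It illustrates how comparing with ">" only when both sides are numbers sends "missing" or other string values to the false branch instead of raising a TypeError.

```python
def is_numeric(value):
    """True for int/float values; strings such as "missing" or "00043" are not numeric."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)

class Question:
    """Hypothetical stand-in for Determine: a column index plus a splitting value."""

    def __init__(self, column, value):
        self.column = column
        self.value = value

    def match(self, example):
        cell = example[self.column]
        # Use ">" only when both sides are numbers; otherwise test equality
        if is_numeric(cell) and is_numeric(self.value):
            return cell > self.value
        return cell == self.value

def partition(rows, question):
    """Split rows into those that match the question and those that do not."""
    true_rows, false_rows = [], []
    for row in rows:
        (true_rows if question.match(row) else false_rows).append(row)
    return true_rows, false_rows

# Made-up rows in the form [category, amount, label]
rows = [
    ["a", 2.5, "+"],
    ["b", 0.5, "-"],
    ["a", "missing", "-"],
    ["b", 4.0, "+"],
]

high, low = partition(rows, Question(1, 1.8))    # numeric split on column 1
is_a, not_a = partition(rows, Question(0, "a"))  # categorical split on column 0
print(len(high), len(low), len(is_a), len(not_a))
```

    The row whose amount is "missing" lands in the false branch of the numeric split, which is one simple way to treat missing data; other policies are possible.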

     

    January 2, 2019

  • Original article: https://www.cnblogs.com/jcjc/p/10234562.html