• [Homework 3] Hsuan-Tien Lin's Machine Learning Techniques


    This post covers homework questions Q13~Q20, which mainly ask for a basic C&RT classification tree and a Random Forest built from such trees.

    The implementation plan for the basic C&RT classification tree is as follows:

    (I) First, abstract out a few building blocks:

    1) Read the data from a local file and convert it into a numpy.array, tolerating empty lines ( def read_input_data(path) )

    2) For a single feature dimension, compute the branch criterion that this feature yields, which in this problem is a decision stump ( def learn_decisionStump(x,y) )

    3) Compare the branch criteria produced by the different features, pick the best one as the current node's b(x), and split the data into two parts for the left and right subtrees ( def splited_by_decisionStump(x, y) )

    4) Given a set of labels y, compute its Gini index ( def calculate_GiniIndex(y) ); the formulas are written out right after this list.
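
    For reference, these are the standard formulas that the code below implements (they correspond to calculate_GiniIndex and learn_decisionStump). For a binary label set D with N+ positives and N- negatives (N = N+ + N-):

        Gini(D) = 1 - (N+/N)^2 - (N-/N)^2

    and the branch criterion minimized over the candidate thresholds theta is the size-weighted impurity of the two children:

        b(theta) = |D_left| * Gini(D_left) + |D_right| * Gini(D_right)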

    (II) With the building blocks above, implement the C&RT classification tree with a recursive DFS:

    1) Termination conditions: the data is empty (return None), or all input data belong to one class (this is a pitfall: you cannot simply return None here, you must construct a leaf node and return it; the leaf's index is set to -1 to mark that a leaf has been reached)

    2) Otherwise, use the building blocks to compute the branch criterion and create a new internal node

    3) Split the data into two groups and recurse on the left and right branches to obtain their sub branch criteria

    4) Return the node created at this level

    (III) How is the learned model stored?

    As the tree grows, record each node's branch criterion and store the model as a binary tree (each TreeNode holds its own branch criterion and pointers to its children).

    (IV) How to predict?

    When a new sample arrives, at each node use the index and val of its branch criterion to decide which way to go (note that this routing has nothing to do with sign); once a leaf is reached (detected by index==-1), the leaf's sign is the predicted label of the input.

    Following (I)~(IV) carefully, the model can be written out. The Random Forest part that follows is then straightforward: repeatedly sample the training data (with replacement) and grow a C&RT tree on each sample.

    I did not look into efficient sampling methods here, and my own sampling code feels rather crude, but with a little patience it still produces the correct results.
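
    For comparison, bootstrap sampling can also be written in a couple of lines with numpy's fancy indexing. This is just a sketch of an alternative; the name bootstrap_sampling is mine and is not part of the homework code below:

    import numpy as np

    def bootstrap_sampling(x, y):
        # draw N row indices uniformly with replacement, then index both arrays at once
        idx = np.random.randint(0, x.shape[0], x.shape[0])
        return x[idx], y[idx]

    Calling bootstrap_sampling(x, y) returns a resampled training set of the same size as the original, which is exactly what Bagging needs.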

    The full code is below:

    # encoding=utf8
    import math
    from random import randint

    import numpy as np
    
    
    ##
    # tree node storing one decision node of the C&RT model
    # index: feature index used for branching ( -1 marks a leaf )
    # val:   decision-stump threshold value
    # sign:  leaf prediction ( +1 / -1 ), only meaningful at leaf nodes
    # left / right: child TreeNodes
    class TreeNode:
        def __init__(self, i, v):
            self.index = i
            self.val = v
            self.sign = 0
            self.left = None
            self.right = None
    
    ##
    # read data from a local file ( the last column is the label )
    # return x, y as numpy arrays; empty lines are skipped
    def read_input_data(path):
        x = []
        y = []
        for line in open(path).readlines():
            if line.strip() == '': continue
            items = line.strip().split()
            x.append([float(v) for v in items[:-1]])
            y.append(float(items[-1]))
        return np.array(x), np.array(y)
    
    ##
    # input: all data at the current node ( binary categories in this context )
    # learn a decision stump for every feature, keep the best one, and split the data with it
    # return: left/right splits, plus the chosen feature index and threshold val
    def splited_by_decisionStump(x, y):
        # sorted indices per feature, so each column can be scanned in increasing order
        sorted_index = []
        for i in range(0, x.shape[1]):
            sorted_index.append(np.argsort(x[:, i]))
        # learn the best feature for this node's decision stump
        Branch = float("inf")
        index = -1
        val = 0
        for i in range(0, x.shape[1]):
            # learn a decision stump on feature i
            xi = x[sorted_index[i], i]
            yi = y[sorted_index[i]]
            # minimize the weighted-impurity cost of feature i
            b, v = learn_decisionStump(xi, yi)
            # keep the feature with the least impurity ( index, val )
            if Branch > b:
                Branch = b
                index = i
                val = v
        # split the data with the best feature and its threshold val
        leftX = x[np.where(x[:, index] < val)]
        leftY = y[np.where(x[:, index] < val)]
        rightX = x[np.where(x[:, index] >= val)]
        rightY = y[np.where(x[:, index] >= val)]
        return leftX, leftY, rightX, rightY, index, val
    
    # learn a decision-stump threshold from one ( sorted ) feature dimension
    def learn_decisionStump(x, y):
        # candidate thresholds: midpoints between consecutive sorted values
        thetas = np.array([(x[i] + x[i+1]) / 2 for i in range(0, x.shape[0] - 1)])
        B = float("inf")
        target_theta = 0.0
        # try every candidate threshold
        for theta in thetas:
            ly = y[np.where(x < theta)]
            ry = y[np.where(x >= theta)]
            # size-weighted Gini impurity of the two children
            b = ly.shape[0]*calculate_GiniIndex(ly) + ry.shape[0]*calculate_GiniIndex(ry)
            if B > b:
                B = b
                target_theta = theta
        return B, target_theta
    
    
    ## 
    # input: labels y ( binary categories in this context )
    # return the Gini index of the label set
    def calculate_GiniIndex(y):
        if y.shape[0] == 0: return 0
        n1 = sum(y == 1)
        n2 = sum(y == -1)
        if (n1 + n2) == 0: return 0
        return 1.0 - math.pow(1.0*n1/(n1+n2), 2) - math.pow(1.0*n2/(n1+n2), 2)
    
    
    ## 
    # DFS learning algorithm of the C&RT tree
    # return the learned model as a binary tree of TreeNodes
    def CART(x, y):
        if x.shape[0]==0: return None # none case
        if calculate_GiniIndex(y)==0: # terminal case ( only one category )
            node = TreeNode(-1, -1)
            node.sign = 1 if y[0]==1 else -1
            return node
        leftX, leftY, rightX, rightY, index, val = splited_by_decisionStump(x,y)
        node = TreeNode(index,val)
        node.left = CART(leftX, leftY)
        node.right = CART(rightX, rightY)
        return node
    
    ## Q13
    # count internal ( non-leaf ) nodes of the learned tree
    def count_internal_nodes(root):
        if root is None: return 0
        if root.left is None and root.right is None: return 0
        # print(root.index, root.val)  # debug output of each internal node's branch criterion
        l = 0
        r = 0
        if root.left is not None:
            l = count_internal_nodes(root.left)
        if root.right is not None:
            r = count_internal_nodes(root.right)
        return 1 + l + r
    
    ## Q15
    # predict
    def predict(root, x):
        if root.index==-1: return root.sign
        if x[root.index]<root.val:
            return predict(root.left, x)
        else:
            return predict(root.right, x)
    # calculate the 0/1 error rate on the dataset at `path` ( Ein or Eout, depending on the file )
    def calculate_E(model, path):
        x,y = read_input_data(path)
        error_count = 0
        for i in range(0, x.shape[0]):
            error_count = error_count + (1 if predict(model, x[i])!=y[i] else 0)
        return 1.0*error_count/x.shape[0]
    
    ## Q16
    # Random Forest via Bagging and average Ein(gt)
    def randomForest(x, y, T):
        error_rate = 0
        trees = []
        for i in range(0,T):
            xi,yi = naive_sampling(x, y)
            model = CART(xi,yi)
            error_rate += calculate_E(model,"train.dat")
            trees.append(model)
        return error_rate/T, trees
    # naive bootstrap sampling with replacement ( slow but correct )
    def naive_sampling(x, y):
        sampleX = np.zeros(x.shape)
        sampleY = np.zeros(y.shape)
        for i in range(0, x.shape[0]):
            index = randint(0, x.shape[0]-1)
            sampleX[i] = x[index]
            sampleY[i] = y[index]
        return sampleX, sampleY
    
    ## Q17 Q18
    # Ein(G)
    def calculate_RF_E(trees, path):
        x,y = read_input_data(path)
        error_count = 0
        for i in range(0, x.shape[0]):
            yp = rf_predict(trees, x[i])
            error_count += 1 if yp!=y[i] else 0
        return 1.0*error_count/x.shape[0]
    # random forest predict process
    def rf_predict(trees, x):
        positives = 0
        negatives = 0
        for tree in trees:
            yp = predict(tree, x)
            if yp==1:
                positives += 1
            else:
                negatives += 1
        return 1 if positives>negatives else -1
    
    ## Q19
    # C&RT pruned to a single branching: one internal node with two leaves
    def one_branch_CART(x, y):
        if x.shape[0]==0: return None # none case
        if calculate_GiniIndex(y)==0: # terminal case ( only one category )
            node = TreeNode(-1, -1)
            node.sign = 1 if y[0]==1 else -1
            return node
        leftX, leftY, rightX, rightY, index, val = splited_by_decisionStump(x,y)
        node = TreeNode(index, val)
        node.left = TreeNode(-1, -1)
        node.right = TreeNode(-1, -1)
        ly = y[np.where(x[:,index]<val)]
        node.left.sign = 1 if sum(ly==1)>sum(ly==-1) else -1
        node.right.sign = -node.left.sign
        return node
    
    def one_branch_randomForest(x, y, T):
        trees = []
        for i in range(0,T):
            xi,yi = naive_sampling(x, y)
            model = one_branch_CART(xi, yi)
            trees.append(model)
        return trees
    
    
    
    def main():
        # x, y = read_input_data("unitTestSplitedByDecisionStump.dat")
        x, y = read_input_data("train.dat")
        root = CART(x, y)
        print(count_internal_nodes(root))           # Q13: number of internal nodes
        print(calculate_E(root, "test.dat"))        # Q15: Eout of the single fully grown tree
        error_rate, trees = randomForest(x, y, 301)
        print(error_rate)                           # Q16: average Ein(gt) over the trees
        print(calculate_RF_E(trees, "train.dat"))   # Q17/Q18: Ein(G) of the forest
        print(calculate_RF_E(trees, "test.dat"))    # Q17/Q18: Eout(G) of the forest
        trees = one_branch_randomForest(x, y, 301)
        print(calculate_RF_E(trees, "train.dat"))   # Q19/Q20: Ein(G) with one-branch trees
        print(calculate_RF_E(trees, "test.dat"))    # Q19/Q20: Eout(G) with one-branch trees
    
    if __name__ == '__main__':
        main()

    Working through these questions highlights the benefits of Random Forest:

    1) No single tree in the Random Forest is the strongest, but aggregated together they can be very strong (e.g., Ein can be driven to 0)

    2) Although a single fully grown (unpruned) tree can also reach Ein = 0, its generalization performance (Eout) is comparatively weak

    3) Random Forest makes clever use of Bagging to aggregate many trees: no pruning is needed, and generalization still improves (the voting rule is sketched after this list)

    4) Weakening each individual tree in the Random Forest (e.g., pruning it down to a single branch) does degrade the overall performance
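
    As a side note, the uniform voting that rf_predict implements is the standard Bagging aggregation

        G(x) = sign( sum_{t=1..T} g_t(x) )

    where each g_t is a C&RT tree grown on one bootstrap sample of the training data.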

  • Original post: https://www.cnblogs.com/xbf9xbf/p/4716834.html