• [Original] How xgboost computes feature scores


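    xgboost is an improved algorithm built on the GBDT principle; it is efficient and supports parallel computation.

    During training it can also report a score for each feature, which indicates how important that feature is to the model.

    This post does not walk through the calling code in detail; it focuses on how the score is computed. The source of get_fscore is shown below,

    taken from the installation package: xgboost/python-package/xgboost/core.py

    As the source shows, get_fscore uses importance_type='weight', so the feature score is the number of times a feature is used to split a node across all trees. This differs somewhat from

    the formula in section 10.13.1 of "The Elements of Statistical Learning: Data Mining, Inference, and Prediction", which is worth keeping in mind.

    Note: the two approaches look at importance from different angles, so the calculations differ slightly.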

        def get_fscore(self, fmap=''):
            """Get feature importance of each feature.
    
            Parameters
            ----------
            fmap: str (optional)
               The name of feature map file
            """
    
            return self.get_score(fmap, importance_type='weight')
    
        def get_score(self, fmap='', importance_type='weight'):
            """Get feature importance of each feature.
            Importance type can be defined as:
                'weight' - the number of times a feature is used to split the data across all trees.
                'gain' - the average gain of the feature when it is used in trees
                'cover' - the average coverage of the feature when it is used in trees
    
            Parameters
            ----------
            fmap: str (optional)
               The name of feature map file
            """
    
            if importance_type not in ['weight', 'gain', 'cover']:
                msg = "importance_type mismatch, got '{}', expected 'weight', 'gain', or 'cover'"
                raise ValueError(msg.format(importance_type))
    
            # if it's weight, then omap stores the number of missing values
            if importance_type == 'weight':
                # do a simpler tree dump to save time
                trees = self.get_dump(fmap, with_stats=False)
    
                fmap = {}
                for tree in trees:
                    for line in tree.split('\n'):
                        # look for the opening square bracket
                        arr = line.split('[')
                        # if no opening bracket (leaf node), ignore this line
                        if len(arr) == 1:
                            continue
    
                        # extract feature name from string between []
                        fid = arr[1].split(']')[0].split('<')[0]
    
                        if fid not in fmap:
                            # if the feature hasn't been seen yet
                            fmap[fid] = 1
                        else:
                            fmap[fid] += 1
    
                return fmap
    
            else:
                trees = self.get_dump(fmap, with_stats=True)
    
                importance_type += '='
                fmap = {}
                gmap = {}
                for tree in trees:
                    for line in tree.split('\n'):
                        # look for the opening square bracket
                        arr = line.split('[')
                        # if no opening bracket (leaf node), ignore this line
                        if len(arr) == 1:
                            continue
    
                        # look for the closing bracket, extract only info within that bracket
                        fid = arr[1].split(']')
    
                        # extract gain or cover from string after closing bracket
                        g = float(fid[1].split(importance_type)[1].split(',')[0])
    
                        # extract feature name from string before closing bracket
                        fid = fid[0].split('<')[0]
    
                        if fid not in fmap:
                            # if the feature hasn't been seen yet
                            fmap[fid] = 1
                            gmap[fid] = g
                        else:
                            fmap[fid] += 1
                            gmap[fid] += g
    
                # calculate average value (gain/cover) for each feature
                for fid in gmap:
                    gmap[fid] = gmap[fid] / fmap[fid]
    
                return gmap
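
    The parsing logic above can be tried without training a model. The following is a minimal sketch, not the library's implementation: it uses a regular expression instead of the chained split calls, and it runs on a hand-written sample that imitates the text dump format. The names SAMPLE_TREES, SPLIT_RE and score_features, as well as the feature names and numbers, are made up for illustration.

```python
import re
from collections import defaultdict

# Hand-written sample imitating a text dump with statistics.
# Feature names and numbers are invented for illustration only.
SAMPLE_TREES = [
    "0:[f0<0.5] yes=1,no=2,missing=1,gain=10.0,cover=4\n"
    "\t1:leaf=0.1,cover=2\n"
    "\t2:[f1<1.5] yes=3,no=4,missing=3,gain=4.0,cover=2\n"
    "\t\t3:leaf=-0.2,cover=1\n"
    "\t\t4:leaf=0.3,cover=1\n",
    "0:[f0<0.7] yes=1,no=2,missing=1,gain=6.0,cover=4\n"
    "\t1:leaf=0.05,cover=3\n"
    "\t2:leaf=-0.1,cover=1\n",
]

# Captures the feature name inside [...] and the stats after the bracket.
SPLIT_RE = re.compile(r"\[([^<\]]+)<[^\]]*\]\s*(.*)")


def score_features(trees, importance_type="weight"):
    """Score features from dumped trees (hypothetical helper).

    'weight' counts how often a feature is used in a split;
    'gain' and 'cover' average that statistic over the feature's splits.
    """
    if importance_type not in ("weight", "gain", "cover"):
        raise ValueError("unexpected importance_type: %r" % importance_type)

    counts = defaultdict(int)    # number of splits per feature
    totals = defaultdict(float)  # summed gain/cover per feature

    for tree in trees:
        for line in tree.split("\n"):
            match = SPLIT_RE.search(line)
            if match is None:
                # leaf nodes and blank lines carry no split
                continue
            feature, stats = match.group(1), match.group(2)
            counts[feature] += 1
            if importance_type != "weight":
                fields = dict(item.split("=") for item in stats.split(","))
                totals[feature] += float(fields[importance_type])

    if importance_type == "weight":
        return dict(counts)
    return {f: totals[f] / counts[f] for f in counts}


if __name__ == "__main__":
    for kind in ("weight", "gain", "cover"):
        print(kind, score_features(SAMPLE_TREES, kind))
```

    In this sample f0 is split on twice and f1 once, so the 'weight' ranking simply follows the split counts, while 'gain' and 'cover' divide the summed statistic by those counts.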
    

    How GBDT computes feature scores:

    Link: 1. http://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/

    For a detailed, code-level explanation, follow the link above through to this one:

    http://stats.stackexchange.com/questions/162162/relative-variable-importance-for-boosting
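
    For comparison with the split-count score, the relative importance measure from section 10.13.1 of the book can be written roughly as below. The notation is reproduced from memory as an aid to reading, so check the symbols against the book before relying on them.

```latex
% Importance of variable X_\ell in a single tree T with J-1 internal nodes:
\mathcal{I}_\ell^2(T) = \sum_{t=1}^{J-1} \hat{\imath}_t^{\,2} \, I\bigl(v(t) = \ell\bigr)

% Averaged over the M trees of the boosted model:
\mathcal{I}_\ell^2 = \frac{1}{M} \sum_{m=1}^{M} \mathcal{I}_\ell^2(T_m)
```

    Here v(t) is the variable used to split internal node t, \hat{\imath}_t^2 is the estimated improvement in squared error from that split, and I(.) is the indicator function. The 'weight' score in get_score corresponds to replacing \hat{\imath}_t^2 with 1 and summing over trees instead of averaging, which is why the two measures can rank features differently.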

  • Original article: https://www.cnblogs.com/haobang008/p/5929378.html