• XGBboost 特征评分的计算原理


      xgboost是基于GBDT原理进行改进的算法,效率高,并且可以进行并行化运算,而且可以在训练的过程中给出各个特征的评分,从而表明每个特征对模型训练的重要性,

    调用的源码就不准备详述,本文主要侧重的是计算的原理,函数get_fscore源码如下,源码来自安装包:xgboost/python-package/xgboost/core.py

      通过下面的源码可以看出,特征评分可以看成是被用来分离决策树的次数。

    def get_fscore(self, fmap=''):
            """Get feature importance of each feature.
    
            Parameters
            ----------
            fmap: str (optional)
               The name of feature map file
            """
    
            return self.get_score(fmap, importance_type='weight')
    
        def get_score(self, fmap='', importance_type='weight'):
            """Get feature importance of each feature.
            Importance type can be defined as:
                'weight' - the number of times a feature is used to split the data across all trees.
                'gain' - the average gain of the feature when it is used in trees
                'cover' - the average coverage of the feature when it is used in trees
    
            Parameters
            ----------
            fmap: str (optional)
               The name of feature map file
            """
    
            if importance_type not in ['weight', 'gain', 'cover']:
                msg = "importance_type mismatch, got '{}', expected 'weight', 'gain', or 'cover'"
                raise ValueError(msg.format(importance_type))
    
            # if it's weight, then omap stores the number of missing values
            if importance_type == 'weight':
                # do a simpler tree dump to save time
                trees = self.get_dump(fmap, with_stats=False)
    
                fmap = {}
                for tree in trees:
                    for line in tree.split('
    '):
                        # look for the opening square bracket
                        arr = line.split('[')
                        # if no opening bracket (leaf node), ignore this line
                        if len(arr) == 1:
                            continue
    
                        # extract feature name from string between []
                        fid = arr[1].split(']')[0].split('<')[0]
    
                        if fid not in fmap:
                            # if the feature hasn't been seen yet
                            fmap[fid] = 1
                        else:
                            fmap[fid] += 1
    
                return fmap
    
            else:
                trees = self.get_dump(fmap, with_stats=True)
    
                importance_type += '='
                fmap = {}
                gmap = {}
                for tree in trees:
                    for line in tree.split('
    '):
                        # look for the opening square bracket
                        arr = line.split('[')
                        # if no opening bracket (leaf node), ignore this line
                        if len(arr) == 1:
                            continue
    
                        # look for the closing bracket, extract only info within that bracket
                        fid = arr[1].split(']')
    
                        # extract gain or cover from string after closing bracket
                        g = float(fid[1].split(importance_type)[1].split(',')[0])
    
                        # extract feature name from string before closing bracket
                        fid = fid[0].split('<')[0]
    
                        if fid not in fmap:
                            # if the feature hasn't been seen yet
                            fmap[fid] = 1
                            gmap[fid] = g
                        else:
                            fmap[fid] += 1
                            gmap[fid] += g
    
                # calculate average value (gain/cover) for each feature
                for fid in gmap:
                    gmap[fid] = gmap[fid] / fmap[fid]
    
                return gmap
    

      

  • 相关阅读:
    day02 Python 字符串编码
    地坛——我的最爱 (2006-11-12 09:33:18)
    心灵噬血虫 (2007-01-02 12:33:36)
    ArcSDE for oracle10g安装后post的时候出现错误
    创建featureclass,为它赋别名,并移动到数据集下
    feature.shape和feature.shapecopy的区别
    IPoint从自定义的投影坐标系转换到自定义的地理坐标系
    女儿傻 女儿悲 2014-2-23
    自嘲 2014-2-7
    写在双节 2014-2-14
  • 原文地址:https://www.cnblogs.com/lvpengbo/p/8822288.html
Copyright © 2020-2023  润新知