• feature_importance in RandomForest


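    The random forest algorithm (RandomForest) exposes an output attribute called feature_importances_, that is, feature importance. This post tries to explain what it actually means.

    According to the official site and other sources, RF can report two kinds of feature importance: Variable importance and Gini importance. Both are feature importances; they differ only in how they are computed.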

    Variable importance

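    Pick a feature M, artificially add noise to feature M in all OOB samples, and then measure the model's accuracy on the OOB samples again. The drop in accuracy relative to the noise-free case indicates how important that feature is.

    If a feature matters a lot for classifying the data, then once its values are no longer accurate the test results suffer noticeably, whereas for unimportant features, noise has little effect on the results. That is the simple idea behind the Variable importance method.

    [On "adding noise": the official site's wording is "randomly permute the values of variable m in the oob cases". Here "permute" means randomly shuffling the order of that variable's values across the OOB cases, which breaks the link between the feature and the label while keeping the feature's distribution unchanged. It does not mean adding white noise to the data.]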
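The permutation idea above can be sketched in a few lines of Python. This is a minimal illustration, not Breiman's exact procedure: it assumes scikit-learn and its built-in breast cancer dataset, and it uses a held-out test split as a stand-in for the per-tree OOB samples. The variable names (`baseline`, `perm_importance`) are chosen here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
# Assumption: a held-out split plays the role of the OOB samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
baseline = rf.score(X_test, y_test)  # accuracy without any shuffling

rng = np.random.default_rng(0)
perm_importance = np.zeros(X.shape[1])
for m in range(X.shape[1]):
    X_perm = X_test.copy()
    # Shuffle only column m; every other feature stays untouched.
    X_perm[:, m] = rng.permutation(X_perm[:, m])
    # Importance of feature m = drop in accuracy after shuffling it.
    perm_importance[m] = baseline - rf.score(X_perm, y_test)

# Indices of the five features whose shuffling hurts accuracy the most.
top5 = np.argsort(perm_importance)[::-1][:5]
print("baseline accuracy:", baseline)
print("top-5 feature indices:", top5)
```

A single shuffle per feature is noisy; scikit-learn's `sklearn.inspection.permutation_importance` repeats the shuffle several times (`n_repeats`) and reports the mean and standard deviation, which is usually the more practical choice.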

    Gini importance

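    Pick a feature M and, over every tree in the RF, sum the decrease in Gini index (that is, the decrease in impurity) at the split nodes that use M. That sum is the importance of M.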
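To make the definition concrete, here is a rough sketch that recomputes the impurity decrease by hand for a single decision tree and compares it with what scikit-learn reports. It assumes scikit-learn's fitted `tree_` structure (`children_left`, `children_right`, `feature`, `impurity`, `weighted_n_node_samples`); the array name `manual` is made up for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
t = clf.tree_

manual = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf node: no split, so no impurity decrease
        continue
    # Sample-weighted Gini of the parent minus that of its two children.
    decrease = (
        t.weighted_n_node_samples[node] * t.impurity[node]
        - t.weighted_n_node_samples[left] * t.impurity[left]
        - t.weighted_n_node_samples[right] * t.impurity[right]
    )
    manual[t.feature[node]] += decrease

manual /= manual.sum()  # normalize so the importances sum to 1
print(np.allclose(manual, clf.feature_importances_))
```

Each node's Gini value is weighted by the number of samples reaching it, so a split near the root counts for more than an equally clean split deep in the tree.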

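    Comparing the two, the former is more expensive to compute, while the latter only needs some bookkeeping while the DTs are being built. Judging from how the official sklearn documentation describes the feature_importances_ attribute, sklearn appears to rank features by Gini importance, and it normalizes the Gini importances so that they sum to 1, which yields the final feature_importances_ output.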
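The normalization can be checked directly. The sketch below assumes scikit-learn's `RandomForestClassifier` and its built-in breast cancer dataset, and relies on the forest-level value being the average of the per-tree importances rescaled to sum to 1, which is how current scikit-learn versions behave as far as I know. The name `averaged` is illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(data.data, data.target)

importances = rf.feature_importances_

# Rebuild the forest-level value from the individual trees.
per_tree = np.array([tree.feature_importances_ for tree in rf.estimators_])
averaged = per_tree.mean(axis=0)
averaged = averaged / averaged.sum()

print("sum of importances:", importances.sum())
print("matches per-tree average:", np.allclose(importances, averaged))

# Rank features from most to least important.
order = np.argsort(importances)[::-1]
for idx in order[:5]:
    print(f"{data.feature_names[idx]:<25s} {importances[idx]:.4f}")
```

Because the values are normalized, they are only meaningful relative to each other within one fitted model; they are not comparable across models trained on different feature sets.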

    References:

    RandomForest official site https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

    Variable importance

    The variable importances are critical. The run computing importances is done by switching imp =0 to imp =1 in the above parameter list. The output has four columns:

    	gene number 
    	the raw importance score 
    	the z-score obtained by dividing the raw score by its standard error 
    	the significance level.

    The highest 25 gene importances are listed sorted by their z-scores. To get the output on a disk file, put impout =1, and give a name to the corresponding output file. If impout is put equal to 2 the results are written to screen and you will see a display similar to that immediately below:

    gene       raw     z-score  significance
    number    score
      667     1.414     1.069     0.143
      689     1.259     0.961     0.168
      666     1.112     0.903     0.183
      668     1.031     0.849     0.198
      682     0.820     0.803     0.211
      878     0.649     0.736     0.231
     1080     0.514     0.729     0.233
     1104     0.514     0.718     0.237
      879     0.591     0.713     0.238
      895     0.519     0.685     0.247
     3621     0.552     0.684     0.247
     3529     0.650     0.683     0.247
     3404     0.453     0.661     0.254
      623     0.286     0.655     0.256
     3617     0.498     0.654     0.257
      650     0.505     0.650     0.258
      645     0.380     0.644     0.260
     3616     0.497     0.636     0.262
      938     0.421     0.635     0.263
      915     0.426     0.631     0.264
      669     0.484     0.626     0.266
      663     0.550     0.625     0.266
      723     0.334     0.610     0.271
      685     0.405     0.605     0.272
     3631     0.402     0.603     0.273
    

    Using important variables

    Another useful option is to do an automatic rerun using only those variables that were most important in the original run. Say we want to use only the 15 most important variables found in the first run in the second run. Then in the options change mdim2nd=0 to mdim2nd=15 , keep imp=1 and compile. Directing output to screen, you will see the same output as above for the first run plus the following output for the second run. Then the importances are output for the 15 variables used in the 2nd run.

    gene       raw     z-score  significance
    number    score
     3621     6.235     2.753     0.003
     1104     6.059     2.709     0.003
     3529     5.671     2.568     0.005
      666     7.837     2.389     0.008
     3631     4.657     2.363     0.009
      667     7.005     2.275     0.011
      668     6.828     2.255     0.012
      689     6.637     2.182     0.015
      878     4.733     2.169     0.015
      682     4.305     1.817     0.035
      644     2.710     1.563     0.059
      879     1.750     1.283     0.100
      686     1.937     1.261     0.104
     1080     0.927     0.906     0.183
      623     0.564     0.847     0.199
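A similar "rerun on the most important variables" workflow can be approximated in scikit-learn. This is only a sketch under stated assumptions: it ranks features by sklearn's Gini-based `feature_importances_` rather than by the permutation z-scores used in the original Fortran program, and it uses the breast cancer dataset instead of the gene data shown above. The value `k = 15` mirrors `mdim2nd=15`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
k = 15  # number of variables kept for the second run

# First run: all features, with the OOB accuracy estimate enabled.
first = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
first.fit(X, y)

# Keep the k features with the highest importance in the first run.
top_k = np.argsort(first.feature_importances_)[::-1][:k]

# Second run: refit using only the selected columns.
second = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
second.fit(X[:, top_k], y)

print("OOB accuracy, all features:", first.oob_score_)
print("OOB accuracy, top-k features:", second.oob_score_)
```

Comparing the two OOB scores gives a quick indication of whether the discarded features carried much information.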
    	

    Variable interactions

    Another option is looking at interactions between variables. If variable m1 is correlated with variable m2 then a split on m1 will decrease the probability of a nearby split on m2 . The distance between splits on any two variables is compared with their theoretical difference if the variables were independent. The latter is subtracted from the former-a large resulting value is an indication of a repulsive interaction. To get this output, change interact =0 to interact=1 leaving imp =1 and mdim2nd =10.

    The output consists of a code list: telling us the numbers of the genes corresponding to id. 1-10. The interactions are rounded to the closest integer and given in the matrix following two column list that tells which gene number is number 1 in the table, etc.

    		
         1   2   3   4   5   6   7   8   9  10
     1   0  13   2   4   8  -7   3  -1  -7  -2
     2  13   0  11  14  11   6   3  -1   6   1
     3   2  11   0   6   7  -4   3   1   1  -2
     4   4  14   6   0  11  -2   1  -2   2  -4
     5   8  11   7  11   0  -1   3   1  -8   1
     6  -7   6  -4  -2  -1   0   7   6  -6  -1
     7   3   3   3   1   3   7   0  24  -1  -1
     8  -1  -1   1  -2   1   6  24   0  -2  -3
     9  -7   6   1   2  -8  -6  -1  -2   0  -5
    10  -2   1  -2  -4   1  -1  -1  -3  -5   0

    There are large interactions between gene 2 and genes 1,3,4,5 and between 7 and 8.


  • Original article: https://www.cnblogs.com/webRobot/p/10899580.html