• Feature Selection with Boruta


    A good feature subset is one that:

    contains features highly correlated with (predictive of) the class,

    yet uncorrelated with (not predictive of) each other. 
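    One way to make this criterion concrete is the merit heuristic used in correlation-based feature selection (CFS). The sketch below is only an illustration under that assumption: the helper names `pearson` and `cfs_merit` and the toy data are made up for this example and do not come from the original note.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def cfs_merit(features, target):
    """CFS-style merit: k * r_cf / sqrt(k + k * (k - 1) * r_ff).

    r_cf is the mean |correlation| between each feature and the class,
    r_ff is the mean |correlation| between pairs of features.
    """
    k = len(features)
    r_cf = sum(abs(pearson(f, target)) for f in features) / k
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    if pairs:
        r_ff = sum(abs(pearson(features[i], features[j])) for i, j in pairs) / len(pairs)
    else:
        r_ff = 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

# Toy data, invented for illustration only.
y = [0, 0, 0, 1, 1, 1, 0, 1]
f1 = [1, 2, 1, 5, 6, 5, 2, 6]        # predictive of y
f2 = [2, 4, 2, 10, 12, 10, 4, 12]    # f1 scaled by 2: fully redundant with f1
f3 = [3, 1, 2, 7, 5, 6, 1, 7]        # predictive of y, less correlated with f1

redundant_merit = cfs_merit([f1, f2], y)
diverse_merit = cfs_merit([f1, f3], y)
print(redundant_merit, diverse_merit)
```

    Adding the redundant copy `f2` leaves the merit at the level of `f1` alone, while pairing `f1` with the less correlated `f3` raises it, which is the behaviour the definition above describes.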

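    Three approaches to feature selection:

    1) Univariate selection: assumes a linear relationship between each feature and the response variable y, and scores each feature by how strongly it correlates with y.

    2) Random forest: makes no linearity assumption, so it can capture non-linear relationships between the features and y. Features are ranked by importance and selected from that ranking.

    3) RFE (recursive feature elimination): repeatedly fits a model and drops the weakest features until the desired number remains.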
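    The three approaches can be sketched side by side with scikit-learn. This is a minimal illustration, not a recommendation: the synthetic dataset, the choice of `f_classif` as the univariate score, the logistic regression inside RFE, and the target of three features are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption): 10 features, 3 of them informative.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# 1) Univariate selection: score each feature against y on its own.
univariate = SelectKBest(score_func=f_classif, k=3).fit(X, y)
univariate_idx = univariate.get_support(indices=True)

# 2) Random forest: rank features by impurity-based importance.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
forest_rank = forest.feature_importances_.argsort()[::-1]  # most important first

# 3) RFE: refit the model repeatedly, dropping the weakest feature each round.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)
rfe_idx = rfe.get_support(indices=True)

print("univariate:", univariate_idx)
print("forest top 3:", forest_rank[:3])
print("RFE:", rfe_idx)
```

    The three methods need not agree on the selected indices, which is one reason to compare them on the same data before settling on one.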

    Use Pipeline + GridSearchCV to tune the feature-selection step and the classifier's parameters together.

    Because RandomizedLogisticRegression is used for feature selection, it needs to be cross-validated as part of a pipeline. You can apply GridSearchCV to a Pipeline that contains it as a feature selection step along with your classifier of choice. Note that RandomizedLogisticRegression was deprecated in scikit-learn 0.19 and removed in 0.21, so the example only runs on older versions. It might look like:

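    # Requires scikit-learn < 0.21 (RandomizedLogisticRegression was removed in 0.21)
    from sklearn.linear_model import LogisticRegression, RandomizedLogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    pipeline = Pipeline([
        ('fs', RandomizedLogisticRegression()),   # feature selection step
        ('clf', LogisticRegression())             # final classifier
    ])

    # 'fs__C' addresses the C parameter of the 'fs' step
    params = {'fs__C': [0.1, 1, 10]}

    # X_train, y_train: your own training data
    grid_search = GridSearchCV(pipeline, params)
    grid_search.fit(X_train, y_train)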
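    RandomizedLogisticRegression is no longer available in scikit-learn 0.21 and later. On current versions, the same pipeline pattern can be sketched with a different selector. The version below swaps in `SelectFromModel` around an L1-penalized logistic regression, which is not stability selection, only a stand-in that keeps the "selector + classifier under one grid search" structure. The synthetic data and parameter grid are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic data (assumption): 20 features, 4 of them informative.
X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    # Selector: keep features whose L1-penalized coefficient is non-zero.
    ("fs", SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear"))),
    # Final classifier trained on the selected features only.
    ("clf", LogisticRegression(max_iter=1000)),
])

# "fs__estimator__C" reaches the C of the estimator wrapped by the "fs" step.
params = {"fs__estimator__C": [0.1, 1, 10]}

# Selection is refit inside every CV fold, so it never sees the held-out fold.
grid_search = GridSearchCV(pipeline, params, cv=5)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
print(grid_search.score(X_test, y_test))
```

    Keeping the selector inside the pipeline matters because selecting features on the full training set before cross-validation would leak information into the validation folds.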

    References:

    http://blog.datadive.net/selecting-good-features-part-iv-stability-selection-rfe-and-everything-side-by-side/

    Before running Boruta, missing values must be imputed.
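    A minimal sketch of that preprocessing step in Python, assuming numeric features and median imputation with scikit-learn's `SimpleImputer`. The note does not say which fill strategy to use, so the median here is only one reasonable choice, and the small array is invented for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with missing entries (invented for illustration).
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan],
              [5.0, 30.0]])

# Replace each NaN with the median of its column.
imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X)

print(X_filled)
```

    After this step the matrix contains no missing values and can be passed on to the feature-selection routine.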

    https://www.analyticsvidhya.com/blog/2016/03/select-important-variables-boruta-package/

    Variable selection is an important aspect of model building which every analyst must learn. After all, it helps in building predictive models free from correlated variables, biases and unwanted noise.

    A lot of novice analysts assume that keeping all (or more) variables will result in the best model as you are not losing any information. Sadly, that is not true!

    How many times has removing a variable from a model increased its accuracy?

    At least, it has happened to me. Such variables are often correlated with others and keep the model from reaching higher accuracy. Today, we'll learn one way to get rid of such variables in R. I must say, R has an incredible CRAN repository, and one of its packages for variable selection is the Boruta package.
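    The excerpt names the Boruta package for R but does not spell out the procedure. As a rough illustration in Python, the core idea can be sketched as a single round of "shadow feature" comparison: shuffle a copy of every column, train a random forest on real and shuffled columns together, and keep the real features that outscore the best shuffled one. This is a simplification under stated assumptions. The actual algorithm repeats the comparison many times and applies a statistical test, and the synthetic data and variable names below are invented for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 400

# Synthetic data (assumption): columns 0 and 1 drive the label, 2 to 4 are noise.
informative = rng.normal(size=(n, 2))
noise = rng.normal(size=(n, 3))
X = np.hstack([informative, noise])
y = (informative[:, 0] + informative[:, 1] > 0).astype(int)

# Shadow features: each column shuffled on its own, which destroys any
# relationship with y while keeping the column's distribution.
shadows = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
X_aug = np.hstack([X, shadows])

forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_aug, y)
importances = forest.feature_importances_
real_imp = importances[:X.shape[1]]
shadow_imp = importances[X.shape[1]:]

# A real feature scores a "hit" if it beats the most important shadow feature.
threshold = shadow_imp.max()
hits = np.flatnonzero(real_imp > threshold)

print("shadow threshold:", threshold)
print("features beating every shadow:", hits)
```

    A single round like this can still let a noise column through by chance, which is why the full method accumulates hits over many rounds before confirming or rejecting a feature.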

  • Original article: https://www.cnblogs.com/xinping-study/p/7007507.html