Feature Selection in Practice


    Preface

    Data and features determine the upper bound of machine learning; models and algorithms merely approach that bound. The goal of feature engineering is to extract as much useful information as possible from the raw data for algorithms and models to use. This article focuses on commonly used feature selection methods.

    Over-sampling

    For imbalanced data, the simplest remedy is to generate additional samples for the minority classes. The most basic way to do so is to draw random samples (with replacement) from the existing minority-class observations; the RandomOverSampler class from imbalanced-learn implements exactly this.

    from sklearn.datasets import make_classification
    from collections import Counter
    # Build a 3-class toy dataset in which class 2 dominates (~94% of samples)
    X, y = make_classification(n_samples=5000, n_features=2, n_informative=2, n_redundant=0, n_repeated=0, n_classes=3,
                               n_clusters_per_class=1,
                               weights=[0.01, 0.05, 0.94],
                               class_sep=0.8, random_state=0)
    Counter(y)
    Out[10]: Counter({0: 64, 1: 262, 2: 4674})
    
    from imblearn.over_sampling import RandomOverSampler
    
    ros = RandomOverSampler(random_state=0)
    # fit_resample replaces the deprecated fit_sample in recent imbalanced-learn
    X_resampled, y_resampled = ros.fit_resample(X, y)
    
    
    sorted(Counter(y_resampled).items())
    Out[13]:
    [(0, 4674), (1, 4674), (2, 4674)]
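
    After resampling, each minority class has been topped up by duplication to the size of the majority class, so all three classes now have 4674 samples.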
    

    Feature Processing

    Standardization

    StandardScaler

    The formula is (X - mean) / std, applied to each feature (each column) independently: subtract the column's mean, then divide by its standard deviation. The result is that every feature is centered around 0 with unit variance.

    >>> from sklearn.preprocessing import StandardScaler
    >>> data = [[0, 0], [0, 0], [1, 1], [1, 1]]
    >>> scaler = StandardScaler()
    >>> print(scaler.fit(data))
    StandardScaler()
    >>> print(scaler.mean_)
    [0.5 0.5]
    >>> print(scaler.scale_)
    [0.5 0.5]
    >>> print(scaler.transform(data))
    [[-1. -1.]
     [-1. -1.]
     [ 1.  1.]
     [ 1.  1.]]
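
    As a quick check: each column is [0, 0, 1, 1], with mean 0.5 and standard deviation 0.5, so (0 - 0.5) / 0.5 = -1 and (1 - 0.5) / 0.5 = 1, matching the transformed output above.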
    

    RobustScaler

    If the data contains outliers, StandardScaler may perform poorly. In that case, RobustScaler can be used instead: it centers and scales the data using statistics that are robust to outliers. Specifically, RobustScaler removes the median and scales the data according to a quantile range (by default the IQR, the range between the 1st and 3rd quartiles).

    >>> from sklearn.preprocessing import RobustScaler
    >>> X = [[ 1., -2.,  2.],
    ...      [ -2.,  1.,  3.],
    ...      [ 4.,  1., -2.]]
    >>> transformer = RobustScaler().fit(X)
    >>> transformer
    RobustScaler()
    >>> transformer.transform(X)
    array([[ 0. , -2. ,  0. ],
           [-1. ,  0. ,  0.4],
           [ 1. ,  0. , -1.6]])
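
    As a quick check, take the third column [2, 3, -2]: its median is 2 and its quartiles are Q1 = 0 and Q3 = 2.5, so the IQR is 2.5; then (3 - 2) / 2.5 = 0.4 and (-2 - 2) / 2.5 = -1.6, matching the output above.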
    

    Feature Selection

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler
    
    # Rescale a vector of raw scores to [0, 1] and map each feature name to its
    # rounded score; order=-1 flips the sign for metrics where smaller is better.
    def rank_to_dict(ranks, names, order=1):
        minmax = MinMaxScaler()
        ranks = minmax.fit_transform(order * np.array([ranks]).T).T[0]
        ranks = [round(x, 2) for x in ranks]
        return dict(zip(names, ranks))
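
    For example, a score vector is rescaled so that the strongest feature maps to 1.0 and the weakest to 0.0 (the feature names here are purely illustrative):

    rank_to_dict(np.array([0.1, 0.5, 0.9]), ['f1', 'f2', 'f3'])
    # {'f1': 0.0, 'f2': 0.5, 'f3': 1.0}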
    

    IV statistics

    import pickle
    
    # Initialize the dictionary that will collect one ranking per method.
    # X_res/y_res are the resampled data, and Xtrain is assumed to be the
    # original training DataFrame, whose columns give the feature names.
    ranks = {}
    X = X_res
    Y = y_res
    names = Xtrain.columns
    
    # Information Value vs. predictive power:
    #   < 0.02        useless for prediction
    #   0.02 to 0.1   weak predictor
    #   0.1 to 0.3    medium predictor
    #   0.3 to 0.5    strong predictor
    #   > 0.5         suspicious, or too good to be true
    
    print("start iv")
    import information_value  # third-party module providing a WOE/IV implementation
    woe = information_value.WOE()
    res_woe, res_iv = woe.woe(X, Y)
    ranks["IV"] = rank_to_dict(np.abs(res_iv), names)
    pickle.dump(ranks, open(folder + 'ranks', 'wb'))  # folder is assumed to be defined earlier
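
    The information_value module is a third-party package rather than part of scikit-learn. As a rough illustration of what it computes, here is a minimal sketch of the WOE/IV calculation for a single numeric feature against a binary target (the quantile binning, the smoothing constant eps, and the helper name iv_single_feature are all illustrative assumptions, not the package's actual implementation):

    import numpy as np

    def iv_single_feature(x, y, n_bins=10, eps=1e-6):
        """Information Value of one numeric feature x against a binary target y."""
        # Quantile-based bin edges (duplicates removed for skewed features)
        edges = np.unique(np.quantile(x, np.linspace(0, 1, n_bins + 1)))
        bin_idx = np.digitize(x, edges[1:-1])  # bin index in [0, len(edges) - 2]
        total_event = max((y == 1).sum(), 1)
        total_non_event = max((y == 0).sum(), 1)
        iv = 0.0
        for b in range(len(edges) - 1):
            in_bin = bin_idx == b
            # Share of events / non-events that fall into this bin (smoothed)
            pct_event = (y[in_bin] == 1).sum() / total_event + eps
            pct_non_event = (y[in_bin] == 0).sum() / total_non_event + eps
            # WOE_b = ln(%events_b / %non-events_b); IV accumulates gap * WOE
            iv += (pct_event - pct_non_event) * np.log(pct_event / pct_non_event)
        return iv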
    

    LinearRegression

    ### Simple Linear Regression ###
    print("start lr")
    from sklearn.linear_model import LinearRegression
    # The normalize argument was removed from recent scikit-learn releases;
    # X_res_std is already standardized, so the plain estimator suffices
    lr = LinearRegression()
    lr.fit(X_res_std, Y)
    ranks["LR"] = rank_to_dict(np.abs(lr.coef_), names)
    

    Ridge Regression

    ### Ridge Regression ###
    from sklearn.linear_model import Ridge
    ridge = Ridge(alpha=7)  # alpha sets the strength of the L2 penalty
    ridge.fit(X_res_std, Y)
    ranks["Ridge"] = rank_to_dict(np.abs(ridge.coef_), names)
    pickle.dump(ranks, open(folder + 'ranks', 'wb'))
    

    Lasso Regression based on AIC

    ### Lasso Regression based on AIC ###
    from sklearn.linear_model import LassoLarsIC
    lasso_aic = LassoLarsIC(criterion='aic',  max_iter=50000)
    lasso_aic.fit(X_res_std, Y)
    ranks["Lasso_aic"] = rank_to_dict(np.abs(lasso_aic.coef_), names)
    

    Lasso Regression based on BIC

    ### Lasso Regression based on BIC ###
    
    lasso_bic = LassoLarsIC(criterion='bic', max_iter=50000)
    lasso_bic.fit(X_res_std, Y)
    ranks["Lasso_bic"] = rank_to_dict(np.abs(lasso_bic.coef_), names)
    pickle.dump(ranks, open(folder + 'ranks', 'wb'))
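
    BIC penalizes model complexity more heavily than AIC, so the BIC-selected model is typically sparser: more coefficients are driven exactly to zero, and those features drop to the bottom of the ranking.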
    

    RandomForestClassifier

    ### Random Forest ###
    from sklearn.ensemble import RandomForestClassifier
    rf = RandomForestClassifier(n_estimators=100)
    rf.fit(X,Y)
    ranks["RF"] = rank_to_dict(rf.feature_importances_, names)
    pickle.dump(ranks, open(folder + 'ranks', 'wb'))
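
    Note that impurity-based feature_importances_ are known to favor high-cardinality features, so the tree-based rankings are best read alongside the linear-model ones rather than on their own.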
    

    ExtraTreesClassifier

    ### Extra Trees ###
    from sklearn.ensemble import ExtraTreesClassifier
    forest = ExtraTreesClassifier(n_estimators=100,
                                  random_state=0)
    
    forest.fit(X,Y)
    ranks["ERT"] = rank_to_dict(forest.feature_importances_,names)
    pickle.dump(ranks, open(folder + 'ranks', 'wb'))
    

    f_classif

    ### Univariate ANOVA F-test ###
    from sklearn.feature_selection import f_classif
    # f_classif takes only X and y; the center argument belongs to f_regression
    f, pval = f_classif(X, Y)
    ranks["Corr"] = rank_to_dict(f, names)
    pickle.dump(ranks, open(folder + 'ranks', 'wb'))
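
    f_classif scores each feature with a univariate ANOVA F-value: it measures how well a feature separates the classes on its own and ignores interactions between features.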
    

    chi2

    ### chi2 ###
    # chi2 requires non-negative feature values, which likely explains why
    # this step was left commented out
    #from sklearn.feature_selection import chi2
    #c, p = chi2(X, Y)
    #ranks["chi2"] = rank_to_dict(c, names)
    

    Mean

    ### Mean Calculation ###
    
    # Average each feature's normalized score across all ranking methods
    r = {}
    for name in names:
        r[name] = round(np.mean([ranks[method][name] for method in ranks.keys()]), 2)
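
    To turn the averaged scores into a readable ranking, one option (a small sketch using the r dictionary built above) is to print the features sorted by score:

    for name, score in sorted(r.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name}: {score}")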
    
    

    Reference

    Original post: https://www.cnblogs.com/duanxingxing/p/10769100.html