• Feature Selection in Practice


    Preface

    Data and features determine the upper bound of machine learning; models and algorithms merely approximate that bound. The goal of feature engineering is to extract as many useful features as possible from the raw data for algorithms and models to use. This article focuses on commonly used feature selection methods.

    Over-sampling

    For imbalanced data, the simplest remedy is to generate additional minority-class samples. The most basic way to do this is to draw random samples (with replacement) from the minority classes; the RandomOverSampler class implements exactly this.

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import RandomOverSampler
    
    # Build a deliberately imbalanced 3-class dataset (~1% / 5% / 94%)
    X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
                               n_redundant=0, n_repeated=0, n_classes=3,
                               n_clusters_per_class=1,
                               weights=[0.01, 0.05, 0.94],
                               class_sep=0.8, random_state=0)
    Counter(y)
    Out[10]: Counter({0: 64, 1: 262, 2: 4674})
    
    ros = RandomOverSampler(random_state=0)
    # fit_resample replaced fit_sample in imbalanced-learn 0.4
    X_resampled, y_resampled = ros.fit_resample(X, y)
    
    sorted(Counter(y_resampled).items())
    Out[13]:
    [(0, 4674), (1, 4674), (2, 4674)]
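
    A quick sanity check: three balanced classes of 4,674 samples each give 14,022 rows, with the two original features preserved.

    X_resampled.shape   # (14022, 2)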
    

    Feature Processing

    Standardization

    StandardScaler

    The formula is (X - mean) / std, computed independently for each attribute/column: subtract the column mean from each value and divide by the column standard deviation. As a result, for every attribute/column the data is centered around 0 with unit variance.

    >>> from sklearn.preprocessing import StandardScaler
    >>> data = [[0, 0], [0, 0], [1, 1], [1, 1]]
    >>> scaler = StandardScaler()
    >>> print(scaler.fit(data))
    StandardScaler(copy=True, with_mean=True, with_std=True)
    >>> print(scaler.mean_)
    [0.5 0.5]
    >>> print(scaler.scale_)
    [0.5 0.5]
    >>> print(scaler.transform(data))
    [[-1. -1.]
     [-1. -1.]
     [ 1.  1.]
     [ 1.  1.]]
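
    The fitted statistics can be reused on unseen data; a new point [2, 2], for example, maps to (2 - 0.5) / 0.5 = 3 in both columns:

    >>> print(scaler.transform([[2, 2]]))
    [[3. 3.]]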
    

    RobustScaler

    If the data contains outliers, StandardScaler may perform poorly. In that case RobustScaler can be used: it centers and scales the data with statistics that are robust to outliers.
    RobustScaler removes the median and scales the data according to a quantile range (by default the IQR, the range between the 1st and 3rd quartiles).

    >>> from sklearn.preprocessing import RobustScaler
    >>> X = [[ 1., -2.,  2.],
    ...      [ -2.,  1.,  3.],
    ...      [ 4.,  1., -2.]]
    >>> transformer = RobustScaler().fit(X)
    >>> transformer
    RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
           with_scaling=True)
    >>> transformer.transform(X)
    array([[ 0. , -2. ,  0. ],
           [-1. ,  0. ,  0.4],
           [ 1. ,  0. , -1.6]])
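
    The fitted medians and IQRs are likewise applied to new observations, for example:

    >>> transformer.transform([[ 4., -2.,  3.]])
    array([[ 1. , -2. ,  0.4]])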
    

    Feature Selection

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler
    
    # Build a {feature name: score} dict; scores are min-max normalized to
    # [0, 1] and rounded to two decimals. order=-1 flips the ranking direction.
    def rank_to_dict(ranks, names, order=1):
        minmax = MinMaxScaler()
        ranks = minmax.fit_transform(order * np.array([ranks]).T).T[0]
        ranks = map(lambda x: round(x, 2), ranks)
        return dict(zip(names, ranks))
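
    As a quick illustration with three hypothetical features a, b, c: raw scores [0.2, 3.0, 1.0] are rescaled so the smallest maps to 0 and the largest to 1:

    rank_to_dict([0.2, 3.0, 1.0], ['a', 'b', 'c'])
    # {'a': 0.0, 'b': 1.0, 'c': 0.29}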
    

    IV statistics

    import pickle
    
    # Initialize the dictionary that collects every method's ranking
    ranks = {}
    X = X_res        # resampled features from the over-sampling step
    Y = y_res        # resampled labels
    names = Xtrain.columns
    
    # Information Value     Predictive Power
    # < 0.02                useless for prediction
    # 0.02 to 0.1           weak predictor
    # 0.1 to 0.3            medium predictor
    # 0.3 to 0.5            strong predictor
    # > 0.5                 suspicious, or too good to be true
    
    print("start iv")
    import information_value
    woe = information_value.WOE()
    res_woe, res_iv = woe.woe(X, Y)
    ranks["IV"] = rank_to_dict(np.abs(res_iv), names)
    pickle.dump(ranks, open(folder + 'ranks', 'wb'))  # folder is defined elsewhere
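
    If the information_value package is not available, the underlying computation can be sketched directly. Below is a minimal, hypothetical implementation assuming a binary target (1 = event) and an already-discretized feature; iv_for_binned_feature is a name introduced here, not part of the package:

    import numpy as np
    
    def iv_for_binned_feature(bins, y, eps=1e-6):
        """Minimal IV sketch: bins is a discretized feature, y a binary target."""
        total_event = (y == 1).sum()
        total_non = (y == 0).sum()
        iv = 0.0
        for b in np.unique(bins):
            mask = bins == b
            # Share of events / non-events that fall into this bin
            pct_event = ((y[mask] == 1).sum() + eps) / (total_event + eps)
            pct_non = ((y[mask] == 0).sum() + eps) / (total_non + eps)
            woe = np.log(pct_event / pct_non)  # weight of evidence of the bin
            iv += (pct_event - pct_non) * woe
        return iv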
    

    LinearRegression

    ### Linear Modeling ###
    print("start lr")
    ### Simple Linear Regression ###
    from sklearn.linear_model import LinearRegression
    # X_res_std is already standardized, which keeps coefficient magnitudes
    # comparable across features (the normalize= argument was removed in
    # scikit-learn 1.2, so it is not used here).
    lr = LinearRegression()
    lr.fit(X_res_std, Y)
    ranks["LR"] = rank_to_dict(np.abs(lr.coef_), names)
    

    Ridge Regression

    ### Ridge Regression ###
    from sklearn.linear_model import Ridge
    ridge = Ridge(alpha=7)
    ridge.fit(X_res_std, Y)
    ranks["Ridge"] = rank_to_dict(np.abs(ridge.coef_), names)
    pickle.dump(ranks, open(folder + 'ranks', 'wb'))
    

    Lasso Regression based on AIC

    ### Lasso Regression based on AIC ###
    from sklearn.linear_model import LassoLarsIC
    lasso_aic = LassoLarsIC(criterion='aic',  max_iter=50000)
    lasso_aic.fit(X_res_std, Y)
    ranks["Lasso_aic"] = rank_to_dict(np.abs(lasso_aic.coef_), names)
    

    Lasso Regression based on BIC

    ### Lasso Regression based on BIC ###
    
    lasso_bic = LassoLarsIC(criterion='bic', max_iter=50000)
    lasso_bic.fit(X_res_std, Y)
    ranks["Lasso_bic"] = rank_to_dict(np.abs(lasso_bic.coef_), names)
    pickle.dump(ranks, open(folder + 'ranks', 'wb'))
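
    Note that BIC penalizes model complexity more heavily than AIC, so the BIC-fitted Lasso typically zeroes out more coefficients and yields a sparser ranking than the AIC variant.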
    

    RandomForestClassifier

    ### Random Forest ###
    from sklearn.ensemble import RandomForestClassifier
    rf = RandomForestClassifier(n_estimators=100)
    rf.fit(X, Y)
    ranks["RF"] = rank_to_dict(rf.feature_importances_, names)
    pickle.dump(ranks, open(folder + 'ranks', 'wb'))
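
    Impurity-based feature_importances_ can be biased toward high-cardinality features, so permutation importance is a common cross-check. Below is a sketch using scikit-learn's inspection module (available since version 0.22); the "RF_perm" key is introduced here for illustration:

    from sklearn.inspection import permutation_importance
    
    # Shuffle each feature in turn and measure the resulting drop in score
    perm = permutation_importance(rf, X, Y, n_repeats=10, random_state=0)
    ranks["RF_perm"] = rank_to_dict(perm.importances_mean, names)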
    

    ExtraTreesClassifier

    ### Extra Trees ###
    from sklearn.ensemble import ExtraTreesClassifier
    forest = ExtraTreesClassifier(n_estimators=100,
                                  random_state=0)
    
    forest.fit(X,Y)
    ranks["ERT"] = rank_to_dict(forest.feature_importances_,names)
    pickle.dump(ranks, open(folder + 'ranks', 'wb'))
    

    f_classif

    ### Correlation (ANOVA F-test) ###
    from sklearn.feature_selection import f_classif
    # f_classif takes only (X, y); the center argument belongs to f_regression
    f, pval = f_classif(X, Y)
    ranks["Corr"] = rank_to_dict(f, names)
    pickle.dump(ranks, open(folder + 'ranks', 'wb'))
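
    The same scores drive scikit-learn's univariate selectors. For instance, keeping only the ten highest-scoring features (k=10 is an arbitrary choice for illustration):

    from sklearn.feature_selection import SelectKBest
    
    selector = SelectKBest(f_classif, k=10).fit(X, Y)
    X_top10 = selector.transform(X)  # keeps the 10 best-scoring columns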
    

    chi2

    ### chi2 ###
    # chi2 requires non-negative feature values (e.g. counts), so it is
    # left disabled here.
    #from sklearn.feature_selection import chi2
    #c, p = chi2(X, Y)
    #ranks["chi2"] = rank_to_dict(c, names)
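
    If a chi2 ranking is wanted anyway, one workaround is to rescale the features to [0, 1] first; a sketch under that assumption (X_pos is a name introduced here):

    from sklearn.feature_selection import chi2
    from sklearn.preprocessing import MinMaxScaler
    
    X_pos = MinMaxScaler().fit_transform(X)  # chi2 needs non-negative inputs
    c, p = chi2(X_pos, Y)
    ranks["chi2"] = rank_to_dict(c, names)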
    

    Mean

    ### Mean Calculation ###
    
    # Average each feature's normalized score across all ranking methods
    r = {}
    for name in names:
        r[name] = round(np.mean([ranks[method][name] for method in ranks.keys()]), 2)
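
    Sorting the averaged scores then gives the final consensus ranking, for example:

    # Print features from strongest to weakest consensus score
    for name, score in sorted(r.items(), key=lambda kv: kv[1], reverse=True):
        print(name, score)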
    
    

    Reference

  • Original article: https://www.cnblogs.com/duanxingxing/p/10769100.html