• Feature selection


    1, Removing features with low variance

    sklearn.feature_selection.VarianceThreshold(threshold=0.0)
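As a minimal sketch of how VarianceThreshold behaves, the example below uses a small made-up array (not from the original post) whose first column is constant. With the default threshold=0.0, only features that never change should be dropped.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data invented for this illustration:
# column 0 is constant, columns 1 and 2 vary.
X = np.array([
    [0, 2, 0],
    [0, 1, 4],
    [0, 3, 1],
    [0, 2, 5],
])

selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)

# Boolean mask of the features that were kept.
print(selector.get_support())
# Shape after dropping the constant column.
print(X_reduced.shape)
```

Raising threshold would also remove features that vary only slightly; a suitable value depends on the scale of each feature.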

    2, Univariate feature selection

    sklearn.feature_selection.SelectKBest(score_func=<function f_classif>, k=10)

    Selects the k highest-scoring features and discards the rest.

    sklearn.feature_selection.SelectPercentile(score_func=<function f_classif>, percentile=10)

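    Selects the given percentage of the highest-scoring features.

    Score functions: f_regression (F-test from a univariate linear regression) is used for regression;
    chi2 (chi-squared test), f_classif (ANOVA F-value), and similar functions are used for classification.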
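The following sketch shows SelectKBest and SelectPercentile side by side. It assumes the iris dataset (chosen here only for illustration; its features are non-negative, which chi2 requires) and arbitrary settings of k=2 and percentile=50.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, SelectPercentile, chi2, f_classif

# 150 samples, 4 non-negative features, 3 classes.
X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest chi-squared scores.
kbest = SelectKBest(score_func=chi2, k=2).fit(X, y)
X_kbest = kbest.transform(X)

# Keep the top 50% of features ranked by the ANOVA F-value.
pct = SelectPercentile(score_func=f_classif, percentile=50).fit(X, y)
X_pct = pct.transform(X)

print(kbest.scores_)            # one score per original feature
print(kbest.get_support())      # mask of features kept by SelectKBest
print(X_kbest.shape, X_pct.shape)
```

Both selectors expose scores_ and get_support(), so the scores can be inspected before committing to a value of k or percentile.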

    sklearn.feature_selection.SelectFpr(score_func=<function f_classif>, alpha=0.05)

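    Selects features with a configurable strategy: mode chooses the selection method and param sets its parameter.

    sklearn.feature_selection.GenericUnivariateSelect(score_func=<function f_classif>, mode='percentile', param=1e-05)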
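A short sketch of the two selectors above, again assuming the iris dataset purely for illustration. GenericUnivariateSelect is configured with mode='k_best' so that param plays the role of k, and SelectFpr keeps the features whose p-value is below alpha.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import GenericUnivariateSelect, SelectFpr, f_classif

X, y = load_iris(return_X_y=True)

# mode picks the strategy ('percentile', 'k_best', 'fpr', 'fdr', 'fwe');
# param is the setting for that strategy (here: k=2).
gus = GenericUnivariateSelect(score_func=f_classif, mode='k_best', param=2)
X_k = gus.fit_transform(X, y)

# Keep every feature whose F-test p-value is below alpha.
fpr = SelectFpr(score_func=f_classif, alpha=0.05)
X_fpr = fpr.fit_transform(X, y)

print(X_k.shape)
print(fpr.pvalues_)   # per-feature p-values used for the decision
print(X_fpr.shape)
```

Because the strategy is just a parameter, GenericUnivariateSelect is convenient when the selection method itself should be tuned in a hyper-parameter search.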

    3, Recursive feature elimination (RFE)

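    The core idea of recursive feature elimination is to build a model repeatedly (for example an SVM or a regression model), pick the best (or worst) feature each time (for instance by coefficient magnitude), set that feature aside, and repeat the process on the remaining features until every feature has been processed. The order in which features are eliminated gives their ranking, so RFE is a greedy algorithm for finding the best feature subset.
    The stability of RFE depends heavily on the model used in each iteration. For example, if RFE is built on ordinary regression without regularization, which is unstable, then RFE is unstable too; if it is built on Ridge, whose regularized regression is stable, then RFE is stable.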
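To make the greedy loop concrete, here is a simplified hand-written sketch of the idea. It is not scikit-learn's RFE implementation: it assumes a binary problem, a LogisticRegression base model, synthetic data, and the rule "drop the feature with the smallest absolute coefficient", all chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, generated only for this illustration.
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

remaining = list(range(X.shape[1]))  # indices of features still in play
eliminated = []                      # features in the order they were dropped

while len(remaining) > 1:
    # Refit the model on the features that are still in play.
    model = LogisticRegression(max_iter=1000).fit(X[:, remaining], y)
    # Binary problem: coef_ has shape (1, n_remaining).
    importance = np.abs(model.coef_).ravel()
    weakest = int(np.argmin(importance))
    eliminated.append(remaining.pop(weakest))

# The last surviving feature ranks first; the earliest eliminated ranks last.
order_best_to_worst = remaining + eliminated[::-1]
print(order_best_to_worst)
```

Swapping the base model changes the coefficients at every step, which is why the resulting ranking inherits the stability of that model.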

    sklearn.feature_selection.RFECV(estimator, step=1, cv=None, scoring=None, verbose=0)

    from sklearn.svm import SVC
    from sklearn.datasets import load_digits
    from sklearn.feature_selection import RFE
    import matplotlib.pyplot as plt

    # Load the digits dataset
    digits = load_digits()
    X = digits.images.reshape((len(digits.images), -1))
    y = digits.target

    # Create the RFE object and rank each pixel
    svc = SVC(kernel="linear", C=1)
    rfe = RFE(estimator=svc, n_features_to_select=1, step=1)
    rfe.fit(X, y)
    ranking = rfe.ranking_.reshape(digits.images[0].shape)

    # Plot pixel ranking
    plt.matshow(ranking)
    plt.colorbar()
    plt.title("Ranking of pixels with RFE")
    plt.show()
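The example above uses RFE with a fixed number of features. As a complementary sketch, RFECV chooses the number of features by cross-validation. The synthetic dataset, cv=5, and scoring="accuracy" below are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC

# Synthetic data, generated only for this illustration.
X, y = make_classification(n_samples=300, n_features=15, n_informative=4,
                           n_redundant=0, random_state=0)

# The linear kernel is needed so that the SVC exposes coef_.
rfecv = RFECV(estimator=SVC(kernel="linear", C=1), step=1, cv=5,
              scoring="accuracy")
rfecv.fit(X, y)

print(rfecv.n_features_)  # number of features chosen by cross-validation
print(rfecv.support_)     # boolean mask of the selected features
print(rfecv.ranking_)     # selected features have rank 1
```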

    4, Feature selection using SelectFromModel

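    SelectFromModel is a meta-transformer that can be used with any estimator (machine learning algorithm) that exposes a coef_ or feature_importances_ attribute after fitting.
    Features whose coef_ or feature_importances_ value falls below the given threshold are considered unimportant and are removed. Besides a numeric threshold, you can pass a string that makes a built-in heuristic compute the threshold. The available strings are "mean", "median", and either of these multiplied by a float, for example "0.1*mean".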

    sklearn.feature_selection.SelectFromModel(estimator, threshold=None, prefit=False)
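A minimal sketch of the string thresholds described above. It assumes a RandomForestClassifier on the iris dataset (both chosen only for illustration) and compares "mean" with "0.1*mean"; the resolved numeric value is available as threshold_ after fitting.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# "mean": keep features whose importance is at least the mean importance.
sfm_mean = SelectFromModel(forest, threshold="mean").fit(X, y)
# "0.1*mean": a looser cut at one tenth of the mean importance.
sfm_scaled = SelectFromModel(forest, threshold="0.1*mean").fit(X, y)

print(sfm_mean.threshold_, sfm_mean.get_support())
print(sfm_scaled.threshold_, sfm_scaled.get_support())
```

Forest importances sum to 1, so with 4 features the "mean" threshold resolves to 0.25 and the scaled one to 0.025.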

    Used together with Lasso to select the two best features from the Boston dataset.

    import matplotlib.pyplot as plt
    import numpy as np

    from sklearn.datasets import load_boston
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import LassoCV

    # Load the boston dataset.
    # load_boston was removed in scikit-learn 1.2; this example needs an older release.
    boston = load_boston()
    X, y = boston['data'], boston['target']

    # We use the base estimator LassoCV since the L1 norm promotes sparsity of features.
    clf = LassoCV()

    # Set a minimum threshold of 0.25
    sfm = SelectFromModel(clf, threshold=0.25)
    sfm.fit(X, y)
    n_features = sfm.transform(X).shape[1]

    # Reset the threshold till the number of features equals two.
    # Note that the attribute can be set directly instead of repeatedly
    # fitting the metatransformer.
    while n_features > 2:
        sfm.threshold += 0.1
        X_transform = sfm.transform(X)
        n_features = X_transform.shape[1]

    # Plot the selected two features from X.
    plt.title(
        "Features selected from Boston using SelectFromModel with "
        "threshold %0.3f." % sfm.threshold)
    feature1 = X_transform[:, 0]
    feature2 = X_transform[:, 1]
    plt.plot(feature1, feature2, 'r.')
    plt.xlabel("Feature number 1")
    plt.ylabel("Feature number 2")
    plt.ylim([np.min(feature2), np.max(feature2)])
    plt.show()

    4.1,L1-based feature selection

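    L1 regularization adds the l1 norm of the coefficient vector w to the loss function as a penalty. Because the penalty is non-zero for every non-zero coefficient, it drives the coefficients of weak features to exactly 0. Models learned with L1 regularization therefore tend to be sparse (many entries of w are 0), which makes L1 regularization a good feature selection method.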

    from sklearn.svm import LinearSVC
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectFromModel
    iris = load_iris()
    X, y = iris.data, iris.target
    X.shape
    #(150, 4)
    lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
    model = SelectFromModel(lsvc, prefit=True)
    X_new = model.transform(X)
    X_new.shape
    #(150, 3)
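To see the sparsity effect directly, the sketch below fits an L1-penalized model (Lasso) and an L2-penalized model (Ridge) on the same synthetic regression data and counts coefficients that are exactly zero. The dataset and alpha=1.0 are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data made up for this illustration: 10 features, 3 informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=1.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

# Count coefficients that were driven to exactly zero.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("zero coefficients with L1:", n_zero_lasso)
print("zero coefficients with L2:", n_zero_ridge)
```

The L2 penalty shrinks coefficients but rarely makes them exactly zero, so it is the L1 model whose zero pattern can be read as a feature selection.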

    4.2, Randomized sparse models

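    A limitation of L1-based sparse models is that, when faced with a group of correlated features, they tend to select only one of them. To mitigate this, you can use randomization: perturb the design matrix or subsample the data, refit the sparse model many times, and count how often each regressor is selected.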

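    RandomizedLasso implements this strategy for regression settings using Lasso. (RandomizedLasso and RandomizedLogisticRegression were deprecated in scikit-learn 0.19 and removed in 0.21, so the signatures below apply to older releases.)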

    sklearn.linear_model.RandomizedLasso(alpha='aic', scaling=0.5, sample_fraction=0.75, n_resampling=200, selection_threshold=0.25, fit_intercept=True, verbose=False, normalize=True, precompute='auto', max_iter=500, eps=2.2204460492503131e-16, random_state=None, n_jobs=1, pre_dispatch='3*n_jobs', memory=Memory(cachedir=None))
    RandomizedLogisticRegression uses logistic regression and is suited to classification tasks.

    sklearn.linear_model.RandomizedLogisticRegression(C=1, scaling=0.5, sample_fraction=0.75, n_resampling=200, selection_threshold=0.25, tol=0.001, fit_intercept=True, verbose=False, normalize=True, random_state=None, n_jobs=1, pre_dispatch='3*n_jobs', memory=Memory(cachedir=None))
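For scikit-learn releases that no longer ship these two classes, the resampling idea can be sketched by hand. The code below is an illustrative approximation, not the library's implementation: the variable names mirror the parameters in the signatures above (sample_fraction, scaling, n_resampling, selection_threshold), while the synthetic data, Lasso(alpha=1.0), and the choice of 100 refits are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data, generated only for this illustration.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=1.0, random_state=0)

rng = np.random.RandomState(0)
n_resampling = 100          # number of perturbed refits
sample_fraction = 0.75      # fraction of rows drawn for each refit
scaling = 0.5               # random per-feature scale is either 0.5 or 1.0
selection_threshold = 0.25  # minimum selection frequency to keep a feature
n_samples, n_features = X.shape
counts = np.zeros(n_features)

for _ in range(n_resampling):
    # Subsample the rows without replacement.
    rows = rng.choice(n_samples, int(sample_fraction * n_samples), replace=False)
    # Randomly shrink some columns, which perturbs the design matrix.
    weights = rng.choice([scaling, 1.0], size=n_features)
    model = Lasso(alpha=1.0).fit(X[rows] * weights, y[rows])
    # Count the features that received a non-zero coefficient.
    counts += (model.coef_ != 0)

selection_frequency = counts / n_resampling
selected = np.where(selection_frequency >= selection_threshold)[0]
print(selection_frequency)
print(selected)
```

Features that are selected in most refits are robust to the perturbation, which is what makes the frequency a more stable signal than a single Lasso fit.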

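    4.3, Tree-based feature selection
    Tree-based estimators (see the sklearn.tree module and the forests of trees in the sklearn.ensemble module) can be used to compute feature importances, which in turn can be used to discard irrelevant features (when coupled with the sklearn.feature_selection.SelectFromModel meta-transformer):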

    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectFromModel
    iris = load_iris()
    X, y = iris.data, iris.target
    X.shape
    #(150, 4)
    clf = ExtraTreesClassifier()
    clf = clf.fit(X, y)
    clf.feature_importances_
    #array([ 0.04..., 0.05..., 0.4..., 0.4...])
    model = SelectFromModel(clf, prefit=True)
    X_new = model.transform(X)
    X_new.shape
    #(150, 2)

    5, Feature selection as part of a pipeline
    Feature selection is usually used as a preprocessing step before the actual learning. The recommended way to do this in scikit-learn is sklearn.pipeline.Pipeline.

    sklearn.pipeline.Pipeline(steps)
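    Pipeline of transforms with a final estimator.
    A Pipeline sequentially applies a list of transforms followed by a final estimator. The intermediate steps must be transforms, that is, they must implement the fit and transform methods. The final estimator only needs to implement fit.

    The purpose of a pipeline is to assemble several steps that can be cross-validated together while setting different parameters. To that end, it lets you set the parameters of the individual steps using the step name and the parameter name separated by '__', as in the example below: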

    from sklearn import svm
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import f_regression
    from sklearn.pipeline import Pipeline
    # generate some data to play with
    X, y = make_classification(
        n_informative=5, n_redundant=0, random_state=42)
    # ANOVA SVM-C
    anova_filter = SelectKBest(f_regression, k=5)
    clf = svm.SVC(kernel='linear')
    anova_svm = Pipeline([('anova', anova_filter), ('svc', clf)])
    # You can set the parameters using the names issued
    # For instance, fit using a k of 10 in the SelectKBest
    # and a parameter 'C' of the svm
    anova_svm.set_params(anova__k=10, svc__C=.1).fit(X, y)
    #Pipeline(steps=[...])
    prediction = anova_svm.predict(X)
    anova_svm.score(X, y)
    #0.77...
    # getting the selected features chosen by anova_filter
    anova_svm.named_steps['anova'].get_support()
    #array([ True, True, True, False, False, True, False, True, True, True,
    #       False, False, True, False, True, False, False, False, False,
    #       True], dtype=bool)
    A simple syntax example:

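    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC
    clf = Pipeline([
        # dual=False: the l1 penalty is not supported with the dual formulation
        ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),
        ('classification', RandomForestClassifier())
    ])
    clf.fit(X, y)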
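The '__' naming described in this section is what allows the selection step and the classifier to be tuned together. The sketch below is one way to do that with GridSearchCV; the synthetic dataset, the step names 'anova' and 'svc', the use of f_classif, and the grid values are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic data, generated only for this illustration.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_redundant=0, random_state=42)

pipe = Pipeline([
    ('anova', SelectKBest(f_classif)),
    ('svc', SVC(kernel='linear')),
])

# Keys follow the pattern <step name>__<parameter name>.
param_grid = {
    'anova__k': [5, 10, 20],
    'svc__C': [0.1, 1, 10],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

Because the selector is refit inside every cross-validation fold, the reported score is not inflated by feature selection that has already seen the held-out data.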

    Author: 面向未来的历史
    Source: CSDN
    Original article: https://blog.csdn.net/a1368783069/article/details/52048349

     

  • Original URL: https://www.cnblogs.com/jian-gao/p/10762259.html