• Machine learning


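    Features set the upper bound on the performance you can achieve; algorithms and models only bring results closer to that bound. That is why feature engineering and the choice of features matter so much.

    Below are some feature selection and dimensionality reduction techniques.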

    # -*- coding:utf-8 -*-
    import scipy as sc
    import scipy.stats  # load the stats submodule explicitly so that sc.stats.pearsonr resolves
    import libsvm_file_process as data_process
    import numpy as np
    from minepy import MINE
    from sklearn.feature_selection import VarianceThreshold
    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import chi2
    from sklearn.feature_selection import f_regression
    from sklearn.feature_selection import RFE
    from sklearn.svm import SVR
    from sklearn.linear_model import LogisticRegression
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    
    
    class feature_select:
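        """
        Feature selection methods:
            Reference: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection
            Pearson correlation
            Mutual information
            Univariate tests - chi-squared, F-value, false positive rate
            Variance filtering
            Recursive feature elimination - removes one feature per step (with step=1), ranked by the feature coefficients
            Model-based selection (LR/GBDT, etc.) via SelectFromModel
                the model (LR/GBDT) must expose a feature_importances_ or coef_ attribute
        Dimensionality reduction:
            PCA (unsupervised): typically used when no labels are available; with labels, it can also apply a mild reduction to remove noise before LDA is used
    
            LDA (supervised): at its core a classifier; when used for reduction, the output dimension must be smaller than the number of classes (at most n_classes - 1)
        """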
    
        def __init__(self):
            self.data_path = "/trainData/libsvm2/"
            self.trainData = ["20180101"]
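            # Maximal information coefficient (MIC), a mutual-information-based score
            self.mine = MINE(alpha=0.6, c=15, est="mic_approx")
            # Variance filter; typically used for unsupervised learning (it needs no labels)
            self.variance_filter = VarianceThreshold(threshold=0.1)
            # Univariate selection: chi2 - chi-squared test; f_regression - F-value; SelectFpr - false positive rate; etc.
            # Despite the attribute name, the score function used here is f_regression.
            self.chi_squared = SelectKBest(f_regression, k=2)
            # Recursive feature elimination
            self.estimator = LogisticRegression()  # alternative: SVR(kernel="linear")
            # n_features_to_select is passed by keyword, as recent scikit-learn releases require
            self.selector = RFE(self.estimator, n_features_to_select=5, step=1)
            # PCA dimensionality reduction
            self.pca = PCA(n_components=5)
            # LDA dimensionality reduction; n_components must not exceed n_classes - 1
            self.lda = LinearDiscriminantAnalysis(n_components=2)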
    
        def select(self):
            for i in range(len(self.trainData)):
                generator = data_process.get_data_batch(self.data_path + self.trainData[i] + "/part-00000", 100000)
                labels, features = next(generator)
                # Variance filter
                filter1 = self.variance_filter.fit_transform(features)
                print(filter1.shape, features.shape)
                print(self.variance_filter.get_support())
                # Univariate selection (F-value via f_regression; swap in chi2 for a chi-squared test)
                filter2 = self.chi_squared.fit_transform(features, labels)
                print(filter2.shape)
                print(self.chi_squared.get_support())
                # Recursive feature elimination (slow, so commented out for now)
                # self.selector.fit(features, labels)
                # print(self.selector.support_)
                # PCA dimensionality reduction
                transform1 = self.pca.fit_transform(features)
                print('transform1:', transform1)
                # LDA dimensionality reduction
                self.lda.fit(features, labels)
                transform2 = self.lda.transform(features)
                print('transform2:', transform2)
                # Score each feature column from index 870 onward against the labels
                for j in range(int(features.shape[1]) - 870):
                    features_j = features[0:, j + 870: j + 871]
                    self.mine.compute_score(features_j.flatten(), labels.flatten())
                    # Maximal information coefficient
                    print(self.mine.mic())
                    # Pearson correlation coefficient (pearsonr expects 1-D arrays)
                    print(j, sc.stats.pearsonr(features_j.flatten(), labels.flatten()))
    
    
    if __name__ == '__main__':
        feature_util = feature_select()
        feature_util.select()
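
The class docstring lists model-based selection with SelectFromModel, but the script never exercises it. Here is a minimal sketch. It assumes synthetic data from make_classification in place of the libsvm files and an L1-penalised LogisticRegression as the model; both are illustrative choices, not taken from the original post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the libsvm training data (assumption, not the author's data).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_redundant=0,
    random_state=0,
)

# An L1-penalised logistic regression pushes the coefficients of weak features
# to zero; SelectFromModel keeps the columns whose |coef_| reaches its threshold.
estimator = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(estimator)
X_selected = selector.fit_transform(X, y)

support = selector.get_support()  # boolean mask over the original columns
print(X.shape, "->", X_selected.shape)
print("kept columns:", np.flatnonzero(support))
```

Any estimator that exposes coef_ or feature_importances_ after fitting (a gradient-boosted tree model, for instance) could replace the logistic regression here.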
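
The docstring also suggests using PCA for a mild, noise-removing reduction before applying LDA when labels are available. The sketch below chains the two in a scikit-learn pipeline. The three-class synthetic data and the choice of 15 PCA components are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

# Synthetic 3-class data standing in for the real features (assumption).
X, y = make_classification(
    n_samples=600, n_features=30, n_informative=8, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

n_classes = len(set(y))
# LDA can produce at most min(n_classes - 1, n_features) components.
lda_dims = n_classes - 1

reducer = make_pipeline(
    PCA(n_components=15),                               # mild unsupervised reduction
    LinearDiscriminantAnalysis(n_components=lda_dims),  # supervised projection
)
X_reduced = reducer.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)
```

Deriving lda_dims from the labels avoids hard-coding a value that is too large: with binary labels, for example, LDA can return only a single component.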
  • Original article: https://www.cnblogs.com/tengpan-cn/p/8445224.html