• sklearn data preprocessing 1: StandardScaler


    Purpose: removes the mean and scales to unit variance. This is done per feature (column), not per sample (row).
    Note:
    Standardization does not always help an estimator. The scikit-learn documentation describes when it matters:
    “Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).”
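
    To illustrate the per-feature point, here is a minimal sketch. The toy data, the scale factors, and the variable names are assumptions made for illustration only. It checks the statistics along each axis after scaling:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 5 samples (rows), 3 features (columns)
# on deliberately different scales.
rng = np.random.RandomState(0)
data = rng.randn(5, 3) * np.array([1.0, 10.0, 100.0]) + np.array([0.0, 5.0, -50.0])

scaled = StandardScaler().fit_transform(data)

# Per-feature statistics (axis=0): mean ~0 and std ~1 for every column.
col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)

# Per-sample statistics (axis=1): the rows carry no such guarantee.
row_means = scaled.mean(axis=1)

print('column means:', col_means)
print('column stds: ', col_stds)
print('row means:   ', row_means)
```

    The column means and standard deviations come out at 0 and 1 up to floating-point error, while the row means are generally not zero.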

    Example code

    # coding=utf-8
    # Compute the mean and std of the training set
    from sklearn.preprocessing import StandardScaler
    import numpy as np
    
    
    def test_algorithm():
        np.random.seed(123)
        print('use sklearn')
        # Note: shape of data is [n_samples, n_features]
        data = np.random.randn(10, 4)
        scaler = StandardScaler()
        scaler.fit(data)
        trans_data = scaler.transform(data)
        print('original data: ')
        print(data)
        print('transformed data: ')
        print(trans_data)
        print('scaler info: scaler.mean_: {}, scaler.var_: {}'.format(scaler.mean_, scaler.var_))
        print('\n')
    
        print('use numpy by self')
        mean = np.mean(data, axis=0)
        std = np.std(data, axis=0)
        var = std * std
        print('mean: {}, std: {}, var: {}'.format(mean, std, var))
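        # numpy broadcasting: mean has shape (n_features,) and is subtracted from every row
        another_trans_data = data - mean
        # Note: divide by the standard deviation, not the variance
        another_trans_data = another_trans_data / std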
        print('another_trans_data: ')
        print(another_trans_data)
    
    
    if __name__ == '__main__':
        test_algorithm()
    

    The program prints the following:

    use sklearn
    original data: 
    [[-1.0856306   0.99734545  0.2829785  -1.50629471]
     [-0.57860025  1.65143654 -2.42667924 -0.42891263]
     [ 1.26593626 -0.8667404  -0.67888615 -0.09470897]
     [ 1.49138963 -0.638902   -0.44398196 -0.43435128]
     [ 2.20593008  2.18678609  1.0040539   0.3861864 ]
     [ 0.73736858  1.49073203 -0.93583387  1.17582904]
     [-1.25388067 -0.6377515   0.9071052  -1.4286807 ]
     [-0.14006872 -0.8617549  -0.25561937 -2.79858911]
     [-1.7715331  -0.69987723  0.92746243 -0.17363568]
     [ 0.00284592  0.68822271 -0.87953634  0.28362732]]
    transformed data: 
    [[-0.94511643  0.58665507  0.5223171  -0.93064483]
     [-0.53659117  1.16247784 -2.13366794  0.06768082]
     [ 0.9495916  -1.05437488 -0.42049501  0.3773612 ]
     [ 1.13124423 -0.85379954 -0.19024378  0.06264126]
     [ 1.70696485  1.63376764  1.22910949  0.8229693 ]
     [ 0.52371324  1.02100318 -0.67235312  1.55466934]
     [-1.08067913 -0.85278672  1.13408114 -0.858726  ]
     [-0.18325687 -1.04998594 -0.00561227 -2.1281129 ]
     [-1.49776284 -0.9074785   1.15403514  0.30422599]
     [-0.06810748  0.31452186 -0.61717074  0.72793583]]
    scaler info: scaler.mean_: [ 0.08737571  0.33094968 -0.24989369 -0.50195303], scaler.var_: [ 1.54038781  1.29032409  1.04082479  1.16464894]


    use numpy by self
    mean: [ 0.08737571  0.33094968 -0.24989369 -0.50195303], std: [ 1.24112361  1.13592433  1.02020821  1.07918902], var: [ 1.54038781  1.29032409  1.04082479  1.16464894]
    another_trans_data: 
    [[-0.94511643  0.58665507  0.5223171  -0.93064483]
     [-0.53659117  1.16247784 -2.13366794  0.06768082]
     [ 0.9495916  -1.05437488 -0.42049501  0.3773612 ]
     [ 1.13124423 -0.85379954 -0.19024378  0.06264126]
     [ 1.70696485  1.63376764  1.22910949  0.8229693 ]
     [ 0.52371324  1.02100318 -0.67235312  1.55466934]
     [-1.08067913 -0.85278672  1.13408114 -0.858726  ]
     [-0.18325687 -1.04998594 -0.00561227 -2.1281129 ]
     [-1.49776284 -0.9074785   1.15403514  0.30422599]
     [-0.06810748  0.31452186 -0.61717074  0.72793583]]
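
    The two results agree because StandardScaler stores the per-feature mean and standard deviation it learned in fit and applies them in transform. A common use of this, sketched below under assumptions (the train/test split, the shifted test data, and the variable names are hypothetical), is to fit on the training set only and reuse the same statistics on new data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(123)
train = rng.randn(10, 4)       # hypothetical training set
test = rng.randn(3, 4) + 2.0   # hypothetical new data with a shifted mean

# Statistics are learned from the training set only.
scaler = StandardScaler().fit(train)
test_scaled = scaler.transform(test)

# Same result by hand, using the stored training statistics.
# scaler.scale_ is the per-feature standard deviation, i.e. sqrt(scaler.var_).
manual = (test - scaler.mean_) / scaler.scale_

# inverse_transform maps scaled values back to the original units.
restored = scaler.inverse_transform(test_scaled)

print(np.allclose(test_scaled, manual))
print(np.allclose(restored, test))
# np.std defaults to ddof=0 (population std), which matches scaler.scale_ here.
print(np.allclose(scaler.scale_, np.std(train, axis=0)))
```

    Note that the scaled test data will generally not have zero mean or unit variance, since the statistics come from the training set.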
  • Original article: https://www.cnblogs.com/mfryf/p/9019891.html