• Pandas Beginner Study Notes 3


    3 Summarizing and Computing Descriptive Statistics

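    pandas objects come with a set of common mathematical and statistical methods. Most of them are reductions or summary statistics, and they are built to handle missing data: NaN values are skipped by default.
    The examples below show the main options of the reduction methods (axis and skipna):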

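    # Assumed setup for the sessions below: import numpy as np; import pandas as pd; from pandas import Series, DataFrame
    In [81]: df = DataFrame([[1.4,np.nan],[7.1,-4.5],[np.nan,np.nan],[0.73,-1.3]],index=list('abcd'),columns=['one','two'])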
    
    In [82]: df
    Out[82]:
        one  two
    a  1.40  NaN
    b  7.10 -4.5
    c   NaN  NaN
    d  0.73 -1.3
    
    In [83]: df.sum()
    Out[83]:
    one    9.23
    two   -5.80
    dtype: float64
    
    In [84]: df.mean()
    Out[84]:
    one    3.076667
    two   -2.900000
    dtype: float64
    
    In [85]: df.mean(axis=1)  # axis=1: compute across the columns, one result per row
    Out[85]:
    a    1.400
    b    1.300
    c      NaN
    d   -0.285
    dtype: float64
    
    In [86]: df.mean(axis=1, skipna=False) # skipna=False: do not skip NaN, so any NaN in a row yields NaN
    Out[86]:
    a      NaN
    b    1.300
    c      NaN
    d   -0.285
    dtype: float64
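
The effect of skipna can be checked with a small self-contained sketch. It rebuilds the same DataFrame as above; the variable names (col_sum_skip and so on) are only illustrative and not part of the original notes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.73, -1.3]],
    index=list("abcd"),
    columns=["one", "two"],
)

# Default behaviour: NaN entries are skipped within each column.
col_sum_skip = df.sum()

# skipna=False: a single NaN in a column makes the whole result NaN.
col_sum_strict = df.sum(skipna=False)

# axis=1 reduces across the columns, giving one value per row.
row_mean = df.mean(axis=1)
```

Row c contains only NaN, so its row mean stays NaN even with the default skipna=True.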
    

    Below are the methods related to descriptive and summary statistics:

    In [88]: df.describe()
    /Users/yangfeilong/anaconda/lib/python2.7/site-packages/numpy/lib/function_base.py:3834: RuntimeWarning: Invalid value encountered in percentile
      RuntimeWarning)
    Out[88]:
                one       two
    count  3.000000  2.000000
    mean   3.076667 -2.900000
    std    3.500376  2.262742
    min    0.730000 -4.500000
    25%         NaN       NaN
    50%         NaN       NaN
    75%         NaN       NaN
    max    7.100000 -1.300000
    
    In [89]: df.max()
    Out[89]:
    one    7.1
    two   -1.3
    dtype: float64
    
    In [90]: df.min(axis=1)
    Out[90]:
    a    1.4
    b   -4.5
    c    NaN
    d   -1.3
    dtype: float64
    
    In [91]: df.quantile()
    Out[91]:
    one   NaN
    two   NaN
    dtype: float64
    
    In [92]: s1 = Series(np.arange(100))
    
    In [93]: s1.quantile()
    Out[93]: 49.5
    
    In [95]: s1.quantile(0.2)
    Out[95]: 19.800000000000001
    
    In [96]: s1.quantile(0.24)
    Out[96]: 23.759999999999998
    
    In [97]: s1.quantile(0.25)
    Out[97]: 24.75
    
    In [98]: s1.quantile(0.5)
    Out[98]: 49.5
    
    In [99]: s1.quantile()
    Out[99]: 49.5
    
    In [100]: s1.median()
    Out[100]: 49.5
    
    In [101]: s1.mad()
    Out[101]: 25.0
    
    In [102]: df = DataFrame(np.arange(100).reshape(10,10),columns=list('abcdefghij'))
    
    In [103]: df
    Out[103]:
        a   b   c   d   e   f   g   h   i   j
    0   0   1   2   3   4   5   6   7   8   9
    1  10  11  12  13  14  15  16  17  18  19
    2  20  21  22  23  24  25  26  27  28  29
    3  30  31  32  33  34  35  36  37  38  39
    4  40  41  42  43  44  45  46  47  48  49
    5  50  51  52  53  54  55  56  57  58  59
    6  60  61  62  63  64  65  66  67  68  69
    7  70  71  72  73  74  75  76  77  78  79
    8  80  81  82  83  84  85  86  87  88  89
    9  90  91  92  93  94  95  96  97  98  99
    
    In [104]: df.mad()
    Out[104]:
    a    25.0
    b    25.0
    c    25.0
    d    25.0
    e    25.0
    f    25.0
    g    25.0
    h    25.0
    i    25.0
    j    25.0
    dtype: float64
    
    In [105]: df.mad(axis=1)
    Out[105]:
    0    2.5
    1    2.5
    2    2.5
    3    2.5
    4    2.5
    5    2.5
    6    2.5
    7    2.5
    8    2.5
    9    2.5
    dtype: float64
    
    In [106]: df.var(axis=1)
    Out[106]:
    0    9.166667
    1    9.166667
    2    9.166667
    3    9.166667
    4    9.166667
    5    9.166667
    6    9.166667
    7    9.166667
    8    9.166667
    9    9.166667
    dtype: float64
    
    In [107]: df.var(axis=0)
    Out[107]:
    a    916.666667
    b    916.666667
    c    916.666667
    d    916.666667
    e    916.666667
    f    916.666667
    g    916.666667
    h    916.666667
    i    916.666667
    j    916.666667
    dtype: float64
    
    In [108]: df.cummax()
    Out[108]:
        a   b   c   d   e   f   g   h   i   j
    0   0   1   2   3   4   5   6   7   8   9
    1  10  11  12  13  14  15  16  17  18  19
    2  20  21  22  23  24  25  26  27  28  29
    3  30  31  32  33  34  35  36  37  38  39
    4  40  41  42  43  44  45  46  47  48  49
    5  50  51  52  53  54  55  56  57  58  59
    6  60  61  62  63  64  65  66  67  68  69
    7  70  71  72  73  74  75  76  77  78  79
    8  80  81  82  83  84  85  86  87  88  89
    9  90  91  92  93  94  95  96  97  98  99
    
    In [109]: df
    Out[109]:
        a   b   c   d   e   f   g   h   i   j
    0   0   1   2   3   4   5   6   7   8   9
    1  10  11  12  13  14  15  16  17  18  19
    2  20  21  22  23  24  25  26  27  28  29
    3  30  31  32  33  34  35  36  37  38  39
    4  40  41  42  43  44  45  46  47  48  49
    5  50  51  52  53  54  55  56  57  58  59
    6  60  61  62  63  64  65  66  67  68  69
    7  70  71  72  73  74  75  76  77  78  79
    8  80  81  82  83  84  85  86  87  88  89
    9  90  91  92  93  94  95  96  97  98  99
    
    In [112]: df.diff()
    Out[112]:
          a     b     c     d     e     f     g     h     i     j
    0   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
    1  10.0  10.0  10.0  10.0  10.0  10.0  10.0  10.0  10.0  10.0
    2  10.0  10.0  10.0  10.0  10.0  10.0  10.0  10.0  10.0  10.0
    3  10.0  10.0  10.0  10.0  10.0  10.0  10.0  10.0  10.0  10.0
    4  10.0  10.0  10.0  10.0  10.0  10.0  10.0  10.0  10.0  10.0
    5  10.0  10.0  10.0  10.0  10.0  10.0  10.0  10.0  10.0  10.0
    6  10.0  10.0  10.0  10.0  10.0  10.0  10.0  10.0  10.0  10.0
    7  10.0  10.0  10.0  10.0  10.0  10.0  10.0  10.0  10.0  10.0
    8  10.0  10.0  10.0  10.0  10.0  10.0  10.0  10.0  10.0  10.0
    9  10.0  10.0  10.0  10.0  10.0  10.0  10.0  10.0  10.0  10.0
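
Some of the numbers in the session above can be reproduced by hand. The sketch below assumes the default linear interpolation of quantile and writes the mean absolute deviation out explicitly. To my knowledge mad() is no longer available in recent pandas releases, so the explicit form may be the safer choice there. All variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(100))

# With linear interpolation the q-quantile sits at position q * (n - 1)
# in the sorted data; here the values 0..99 equal their positions.
q = 0.24
quantile_by_hand = q * (len(s) - 1)

# Mean absolute deviation around the mean, written out explicitly.
mad_series = (s - s.mean()).abs().mean()

df = pd.DataFrame(np.arange(100).reshape(10, 10), columns=list("abcdefghij"))
mad_cols = (df - df.mean()).abs().mean()                        # per column
mad_rows = df.sub(df.mean(axis=1), axis=0).abs().mean(axis=1)   # per row

# var() computes the sample variance (divides by n - 1) by default.
var_rows = df.var(axis=1)
centered = df.sub(df.mean(axis=1), axis=0)
var_rows_by_hand = centered.pow(2).sum(axis=1) / (df.shape[1] - 1)
```

The percentile rows shown as NaN by describe() and quantile() earlier come from the NaN entries under the older NumPy/pandas combination used in these notes; newer versions may report different values there.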
    
    

    3.1 Correlation and Covariance

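    Key concepts:

    • Correlation coefficient: a statistic that measures the strength of the linear relationship between two variables.
    • Covariance: in probability theory and statistics, a measure of how two variables vary together. Variance is the special case in which the two variables are the same.

    In [92]: s1 = Series(np.arange(100))
    
    In [118]: s2 = Series(np.arange(2,202,2))
    
    In [119]: s1.corr(s2)
    Out[119]: 0.99999999999999989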
    
    In [120]: s1.cov(s2)
    Out[120]: 1683.3333333333335
    
    In [102]: df = DataFrame(np.arange(100).reshape(10,10),columns=list('abcdefghij'))
    
    In [122]: df.corr()
    Out[122]:
         a    b    c    d    e    f    g    h    i    j
    a  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
    b  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
    c  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
    d  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
    e  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
    f  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
    g  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
    h  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
    i  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
    j  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
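
The relationship between the two quantities can be sketched as follows: the Pearson correlation is the covariance divided by the product of the two standard deviations. The sketch rebuilds s1 and s2 from the session above; corr_by_hand and cov_self are illustrative names.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(np.arange(100))
s2 = pd.Series(np.arange(2, 202, 2))   # exactly 2 * s1 + 2

cov = s1.cov(s2)

# Pearson correlation = covariance / (std of s1 * std of s2).
corr_by_hand = cov / (s1.std() * s2.std())
corr = s1.corr(s2)

# The covariance of a variable with itself is its variance.
cov_self = s1.cov(s1)
```

Because s2 is an exact linear function of s1 with positive slope, the correlation is 1 up to floating-point rounding, which is also why every entry of df.corr() above is 1.0.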
    
    

    3.2 Unique Values, Value Counts, and Membership

    unique: returns the distinct values, in order of first appearance

    In [124]: s1 = Series(list('ccbbabcddacbd'))
    
    In [125]: s1.unique()
    Out[125]: array(['c', 'b', 'a', 'd'], dtype=object)
    

    value_counts: counts how often each value occurs

    In [129]: s1.value_counts()  # sorted by count in descending order by default
    Out[129]:
    b    4
    c    4
    d    3
    a    2
    dtype: int64
    
    In [130]: pd.value_counts(s1,sort=False)  # also available as a top-level pandas function
    Out[130]:
    a    2
    c    4
    b    4
    d    3
    dtype: int64
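
A short sketch of the same calls through the Series method, plus relative frequencies via normalize=True. As far as I know, the top-level pd.value_counts is deprecated in recent pandas releases, so the method form is used here; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series(list("ccbbabcddacbd"))

distinct = s.unique()                      # distinct values, first-seen order
counts = s.value_counts()                  # absolute counts, most frequent first
shares = s.value_counts(normalize=True)    # relative frequencies summing to 1
```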
    
    

    isin: vectorized set-membership test

    In [134]: obj = Series(list('abcbcdcddcba'))
    
    In [135]: mask = obj.isin(['b','c'])
    
    In [136]: mask
    Out[136]:
    0     False
    1      True
    2      True
    3      True
    4      True
    5     False
    6      True
    7     False
    8     False
    9      True
    10     True
    11    False
    dtype: bool
    
    In [137]: obj[mask]
    Out[137]:
    1     b
    2     c
    3     b
    4     c
    6     c
    9     c
    10    b
    dtype: object
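
The boolean mask can also be inverted with ~ to select the values that are not in the set. A minimal sketch using the same data; kept and dropped are illustrative names.

```python
import pandas as pd

obj = pd.Series(list("abcbcdcddcba"))

mask = obj.isin(["b", "c"])   # True where the value is 'b' or 'c'
kept = obj[mask]              # only 'b' and 'c'
dropped = obj[~mask]          # everything else
```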
    
    

    Given a table like the one below, you can compute a histogram (value counts) over several related columns:

    In [138]: data = DataFrame({'Qu1':[1,3,4,3,4],'Qu2':[2,3,1,2,3],'Qu3':[1,5,2,4,4]})
    
    In [139]: data
    Out[139]:
       Qu1  Qu2  Qu3
    0    1    2    1
    1    3    3    5
    2    4    1    2
    3    3    2    4
    4    4    3    4
    
    In [143]: data.apply(pd.value_counts).fillna(0)
    Out[143]:
       Qu1  Qu2  Qu3
    1  1.0  1.0  1.0
    2  0.0  2.0  1.0
    3  2.0  2.0  0.0
    4  2.0  0.0  2.0
    5  0.0  0.0  1.0
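
The same histogram can be sketched with the Series method instead of the top-level function, converting the counts back to integers after the fill. The lambda and the name hist are illustrative choices, not the original author's code.

```python
import pandas as pd

data = pd.DataFrame({"Qu1": [1, 3, 4, 3, 4],
                     "Qu2": [2, 3, 1, 2, 3],
                     "Qu3": [1, 5, 2, 4, 4]})

# Count each column separately; a value that never occurs in a column
# comes back as NaN once the per-column results are aligned.
hist = data.apply(lambda col: col.value_counts())

# Replace the NaN gaps with 0 and restore integer counts.
hist = hist.fillna(0).astype(int).sort_index()
```

The row labels of hist are the distinct values found in any column, and each cell is the number of times that value occurs in that column.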
    

    4 Handling Missing Data

    pandas uses the floating-point value NaN (Not a Number) to represent missing data. It is only a sentinel that marks a value as missing.

    In [144]: string_data = Series(['hello',np.nan,'world'])
    
    In [145]: string_data
    Out[145]:
    0    hello
    1      NaN
    2    world
    dtype: object
    
    In [146]: string_data.isnull()
    Out[146]:
    0    False
    1     True
    2    False
    dtype: bool
    
    

    Note: Python's built-in None is also treated as a missing value, just like NaN.
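
A minimal sketch of that note: both np.nan and None are reported as missing by isnull, and notnull gives the complement. The extra None element is added here for illustration only.

```python
import numpy as np
import pandas as pd

string_data = pd.Series(["hello", np.nan, "world", None])

missing = string_data.isnull()            # True for both NaN and None
n_present = string_data.notnull().sum()   # number of non-missing entries
```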

    4.1 Filtering Out Missing Data

    You can always filter out missing data by hand, but that is tedious; dropna makes it simpler.

    In [147]: data = Series([1,np.nan,3,np.nan])
    
    In [148]: data
    Out[148]:
    0    1.0
    1    NaN
    2    3.0
    3    NaN
    dtype: float64
    
    In [149]: data.dropna()
    Out[149]:
    0    1.0
    2    3.0
    dtype: float64
    
    

    Boolean indexing works just as well.

    In [150]: data[data.notnull()]
    Out[150]:
    0    1.0
    2    3.0
    dtype: float64
    
    

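    For a DataFrame things are a little more involved: by default dropna drops every row that contains at least one NaN.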

    In [152]: df = DataFrame([[1,2,3],[np.nan,np.nan,np.nan],[3,4,np.nan],[2,3,4]])
    
    In [153]: df
    Out[153]:
         0    1    2
    0  1.0  2.0  3.0
    1  NaN  NaN  NaN
    2  3.0  4.0  NaN
    3  2.0  3.0  4.0
    
    In [154]: df.dropna()
    Out[154]:
         0    1    2
    0  1.0  2.0  3.0
    3  2.0  3.0  4.0
    
    In [155]: df.dropna(how='all')  # drop only the rows in which every value is NaN
    Out[155]:
         0    1    2
    0  1.0  2.0  3.0
    2  3.0  4.0  NaN
    3  2.0  3.0  4.0
    
    In [164]: df[4] = np.nan
    
    In [165]: df
    Out[165]:
         0    1    2   4
    0  1.0  2.0  3.0 NaN
    1  NaN  NaN  NaN NaN
    2  3.0  4.0  NaN NaN
    3  2.0  3.0  4.0 NaN
    
    In [166]: df.dropna(axis=1,how='all')
    Out[166]:
         0    1    2
    0  1.0  2.0  3.0
    1  NaN  NaN  NaN
    2  3.0  4.0  NaN
    3  2.0  3.0  4.0
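
Besides how, dropna also accepts thresh, which keeps a row only if it has at least that many non-missing values. The sketch below compares the three variants on the same 4x3 frame used above (before column 4 was added); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3],
                   [np.nan, np.nan, np.nan],
                   [3, 4, np.nan],
                   [2, 3, 4]])

any_dropped = df.dropna()              # drop rows containing any NaN
all_dropped = df.dropna(how="all")     # drop rows that are entirely NaN
at_least_two = df.dropna(thresh=2)     # keep rows with >= 2 non-missing values
```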
    
    
    

    4.2 Filling In Missing Data

    Generate some data:

    In [167]: df = DataFrame(np.random.randn(4,4),columns=list('abcd'))
    
    In [168]: df
    Out[168]:
              a         b         c         d
    0 -0.010218 -0.256541 -0.507837  0.470124
    1  0.293587  0.517149 -1.813092 -0.791727
    2  0.434398  1.352332  0.012355 -1.687852
    3  0.573836 -0.701182 -0.548737  0.022037
    
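    In [169]: df.iloc[:3, 2]  # originally df.ix[:2,2]; .ix was removed in pandas 1.0, and the iloc end bound is exclusive
    Out[169]:
    0   -0.507837
    1   -1.813092
    2    0.012355
    Name: c, dtype: float64
    
    In [170]: df.iloc[:3, 2] = np.nan
    
    In [171]: df.iloc[:2, 3] = np.nan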
    
    In [172]: df
    Out[172]:
              a         b         c         d
    0 -0.010218 -0.256541       NaN       NaN
    1  0.293587  0.517149       NaN       NaN
    2  0.434398  1.352332       NaN -1.687852
    3  0.573836 -0.701182 -0.548737  0.022037
    
    In [173]: df.fillna(0)  # fill every NaN with 0
    Out[173]:
              a         b         c         d
    0 -0.010218 -0.256541  0.000000  0.000000
    1  0.293587  0.517149  0.000000  0.000000
    2  0.434398  1.352332  0.000000 -1.687852
    3  0.573836 -0.701182 -0.548737  0.022037
    
    
    In [176]: df.fillna({'c':0,'d':0.5}) # a different fill value for each column
    Out[176]:
              a         b         c         d
    0 -0.010218 -0.256541  0.000000  0.500000
    1  0.293587  0.517149  0.000000  0.500000
    2  0.434398  1.352332  0.000000 -1.687852
    3  0.573836 -0.701182 -0.548737  0.022037
    
    # By default fillna returns a new object; pass inplace=True to modify the original instead.
    In [177]: _ = df.fillna({'c':0,'d':0.5},inplace=True)
    
    In [178]: df
    Out[178]:
              a         b         c         d
    0 -0.010218 -0.256541  0.000000  0.500000
    1  0.293587  0.517149  0.000000  0.500000
    2  0.434398  1.352332  0.000000 -1.687852
    3  0.573836 -0.701182 -0.548737  0.022037
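
A common variation, sketched here under the assumption that a per-column mean is a sensible replacement for your data, is to pass the column means to fillna. The small frame below uses fixed values instead of random ones so the result is reproducible; it is not the author's data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"c": [np.nan, np.nan, np.nan, -0.5],
                   "d": [np.nan, np.nan, -1.5, 0.5]})

# df.mean() skips NaN, so each column is filled with the mean of its known values.
filled = df.fillna(df.mean())

# Without inplace=True the original frame is left untouched.
still_missing = int(df.isnull().sum().sum())
```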
    
    

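    Other options are available as well, such as method and limit (shown here on the DataFrame as it was before the in-place fill):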

    In [181]: df
    Out[181]:
              a         b         c         d
    0 -0.010218 -0.256541       NaN       NaN
    1  0.293587  0.517149       NaN       NaN
    2  0.434398  1.352332       NaN -1.687852
    3  0.573836 -0.701182 -0.548737  0.022037
    
    In [184]: df.fillna(method='bfill',limit=2)
    Out[184]:
              a         b         c         d
    0 -0.010218 -0.256541       NaN -1.687852
    1  0.293587  0.517149 -0.548737 -1.687852
    2  0.434398  1.352332 -0.548737 -1.687852
    3  0.573836 -0.701182 -0.548737  0.022037
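
The limit behaviour can be sketched on a small Series with the dedicated bfill and ffill methods. As far as I know, recent pandas releases prefer these over fillna(method=...). The data and names below are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, np.nan, 4.0])

# Backward fill: each NaN takes the next valid value, but at most
# `limit` consecutive NaN are filled, so index 0 stays NaN.
back = s.bfill(limit=2)

# Forward fill: each NaN takes the previous valid value.
fwd = pd.Series([1.0, np.nan, np.nan, np.nan]).ffill(limit=1)
```

This mirrors column c in the output above, where only the two NaN closest to the valid value in row 3 were filled.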
    

    To be continued...

  • Original article: https://www.cnblogs.com/felo/p/6361835.html