• Pandas分组与聚合


    分组 (groupby)

    • 对数据集进行分组,然后对每组进行统计分析

    • SQL能够对数据进行过滤,分组聚合

    • pandas能利用groupby进行更加复杂的分组运算

    • 分组运算过程:split->apply->combine

      1. 拆分:进行分组的根据

      2. 应用:每个分组运行的计算规则

      3. 合并:把每个分组的计算结果合并起来

    一、GroupBy对象:DataFrameGroupBy,SeriesGroupBy

    1. 分组操作

    groupby()进行分组,GroupBy对象没有进行实际运算,只是包含分组的中间数据

    按列名分组:obj.groupby(‘label’)

    import pandas as pd
    import numpy as np
    
    dict_obj = {'key1' : ['a', 'b', 'a', 'b',
                          'a', 'b', 'a', 'a'],
                'key2' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                'data1': np.random.randn(8),
                'data2': np.random.randn(8)}
    df_obj = pd.DataFrame(dict_obj)
    print(df_obj)
    # dataframe根据key1进行分组
    print(type(df_obj.groupby('key1')))
    
    # dataframe的 data1 列根据 key1 进行分组
    print(type(df_obj['data1'].groupby(df_obj['key1'])))

    效果

      key1   key2     data1     data2
    0    a    one -0.781769  0.210258
    1    b    one  0.396690 -0.315129
    2    a    two -1.819083  0.394317
    3    b  three  0.142387 -0.237055
    4    a    two -0.466665 -0.159256
    5    b    two  1.482328 -1.391806
    6    a    one -1.252323 -0.332732
    7    a  three  2.090247 -0.142536
    <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
    <class 'pandas.core.groupby.generic.SeriesGroupBy'>

    2. 分组运算

    对GroupBy对象进行分组运算/多重分组运算,如mean()

    非数值数据不进行分组运算

    import pandas as pd
    import numpy as np
    
    dict_obj = {'key1' : ['a', 'b', 'a', 'b',
                          'a', 'b', 'a', 'a'],
                'key2' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                'data1': np.random.randn(8),
                'data2': np.random.randn(8)}
    df_obj = pd.DataFrame(dict_obj)
    print(df_obj)
    # 分组运算
    grouped1 = df_obj.groupby('key1')
    print(grouped1.mean())
    
    grouped2 = df_obj['data1'].groupby(df_obj['key1'])
    print(grouped2.mean())

    效果

    key1   key2     data1     data2
    0    a    one  0.405285 -1.813457
    1    b    one -1.037339  1.096004
    2    a    two -0.766808 -0.090973
    3    b  three -0.167037  1.077283
    4    a    two  0.543593 -0.323844
    5    b    two -1.178883  0.670555
    6    a    one -0.361643 -0.789085
    7    a  three  0.247109  0.374743
             data1     data2
    key1                    
    a     0.013507 -0.528523
    b    -0.794420  0.947948
    key1
    a    0.013507
    b   -0.794420
    Name: data1, dtype: float64

    size() 返回每个分组的元素个数

    import pandas as pd
    import numpy as np
    
    dict_obj = {'key1' : ['a', 'b', 'a', 'b',
                          'a', 'b', 'a', 'a'],
                'key2' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                'data1': np.random.randn(8),
                'data2': np.random.randn(8)}
    df_obj = pd.DataFrame(dict_obj)
    print(df_obj)
    # 分组运算
    grouped1 = df_obj.groupby('key1')
    print(grouped1.mean())
    
    grouped2 = df_obj['data1'].groupby(df_obj['key1'])
    print(grouped2.mean())
    # size
    print(grouped1.size())
    print(grouped2.size())

    效果

      key1   key2     data1     data2
    0    a    one -0.304378  1.082606
    1    b    one  2.915354  0.618401
    2    a    two  0.517311  0.400274
    3    b  three  0.305766 -0.688211
    4    a    two -0.321439 -1.570279
    5    b    two -0.598021 -0.002561
    6    a    one -1.572549  0.917218
    7    a  three  1.147867 -1.142880
             data1     data2
    key1                    
    a    -0.106638 -0.062612
    b     0.874366 -0.024124
    key1
    a   -0.106638
    b    0.874366
    Name: data1, dtype: float64
    key1
    a    5
    b    3
    dtype: int64
    key1
    a    5
    b    3
    Name: data1, dtype: int64

    3. 按自定义的key分组

    obj.groupby(self_def_key)

    自定义的key可为列表或多层列表

    obj.groupby([‘label1’, ‘label2’])->多层dataframe

    import pandas as pd
    import numpy as np
    
    dict_obj = {'key1' : ['a', 'b', 'a', 'b',
                          'a', 'b', 'a', 'a'],
                'key2' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                'data1': np.random.randn(8),
                'data2': np.random.randn(8)}
    df_obj = pd.DataFrame(dict_obj)
    print(df_obj)
    # 按自定义key分组,列表
    self_def_key = [0, 1, 2, 3, 3, 4, 5, 7]
    print(df_obj.groupby(self_def_key).size())
    
    # 按自定义key分组,多层列表
    print(df_obj.groupby([df_obj['key1'], df_obj['key2']]).size())
    
    # 按多个列多层分组
    grouped2 = df_obj.groupby(['key1', 'key2'])
    print(grouped2.size())
    
    # 多层分组按key的顺序进行
    grouped3 = df_obj.groupby(['key2', 'key1'])
    print(grouped3.mean())
    # unstack可以将多层索引的结果转换成单层的dataframe
    print(grouped3.mean().unstack())

    效果

     key1   key2     data1     data2
    0    a    one  0.997420  1.427648
    1    b    one -0.439493  0.864381
    2    a    two -0.586568 -0.266915
    3    b  three  0.626975 -1.411871
    4    a    two  0.549642  0.875476
    5    b    two -1.275776 -1.124462
    6    a    one -0.635578  0.039922
    7    a  three  1.037085 -1.842645
    0    1
    1    1
    2    1
    3    2
    4    1
    5    1
    7    1
    dtype: int64
    key1  key2 
    a     one      2
          three    1
          two      2
    b     one      1
          three    1
          two      1
    dtype: int64
    key1  key2 
    a     one      2
          three    1
          two      2
    b     one      1
          three    1
          two      1
    dtype: int64
                   data1     data2
    key2  key1                    
    one   a     0.180921  0.733785
          b    -0.439493  0.864381
    three a     1.037085 -1.842645
          b     0.626975 -1.411871
    two   a    -0.018463  0.304281
          b    -1.275776 -1.124462
              data1               data2          
    key1          a         b         a         b
    key2                                         
    one    0.180921 -0.439493  0.733785  0.864381
    three  1.037085  0.626975 -1.842645 -1.411871
    two   -0.018463 -1.275776  0.304281 -1.124462

    二、GroupBy对象支持迭代操作

    每次迭代返回一个元组 (group_name, group_data)

    可用于分组数据的具体运算

    1. 单层分组

    import pandas as pd
    import numpy as np
    
    dict_obj = {'key1': ['a', 'b', 'a', 'b',
                         'a', 'b', 'a', 'a'],
                'key2': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                'data1': np.random.randn(8),
                'data2': np.random.randn(8)}
    df_obj = pd.DataFrame(dict_obj)
    print(df_obj)
    
    grouped1 = df_obj.groupby('key1')
    # 单层分组,根据key1
    print("---")
    for group_name, group_data in grouped1:
        print(group_name)
        print(group_data)

    效果

     key1   key2     data1     data2
    0    a    one  1.362713  0.545293
    1    b    one  1.276236  0.747326
    2    a    two -0.224800  0.347855
    3    b  three -0.522983 -0.150675
    4    a    two  0.233628  2.042220
    5    b    two  0.090460 -0.611156
    6    a    one  0.372772 -0.779909
    7    a  three  0.281172 -0.192912
    ---
    a
      key1   key2     data1     data2
    0    a    one  1.362713  0.545293
    2    a    two -0.224800  0.347855
    4    a    two  0.233628  2.042220
    6    a    one  0.372772 -0.779909
    7    a  three  0.281172 -0.192912
    b
      key1   key2     data1     data2
    1    b    one  1.276236  0.747326
    3    b  three -0.522983 -0.150675
    5    b    two  0.090460 -0.611156

    2. 多层分组

    import pandas as pd
    import numpy as np
    
    dict_obj = {'key1' : ['a', 'b', 'a', 'b',
                          'a', 'b', 'a', 'a'],
                'key2' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                'data1': np.random.randn(8),
                'data2': np.random.randn(8)}
    df_obj = pd.DataFrame(dict_obj)
    print(df_obj)
    
    # 按多个列多层分组
    grouped2 = df_obj.groupby(['key1', 'key2'])
    
    # 多层分组
    for group_name, group_data in grouped2:
        print(group_name)
        print(group_data)

    效果

    key1   key2     data1     data2
    0    a    one -1.045283  1.719594
    1    b    one -0.578823  0.432679
    2    a    two -1.018024  0.031187
    3    b  three -0.014241 -0.361193
    4    a    two  0.450544  0.266884
    5    b    two -0.304399  0.382678
    6    a    one  1.650063 -1.184126
    7    a  three -1.373968 -0.368473
    ('a', 'one')
      key1 key2     data1     data2
    0    a  one -1.045283  1.719594
    6    a  one  1.650063 -1.184126
    ('a', 'three')
      key1   key2     data1     data2
    7    a  three -1.373968 -0.368473
    ('a', 'two')
      key1 key2     data1     data2
    2    a  two -1.018024  0.031187
    4    a  two  0.450544  0.266884
    ('b', 'one')
      key1 key2     data1     data2
    1    b  one -0.578823  0.432679
    ('b', 'three')
      key1   key2     data1     data2
    3    b  three -0.014241 -0.361193
    ('b', 'two')
      key1 key2     data1     data2
    5    b  two -0.304399  0.382678

    三、GroupBy对象可以转换成列表或字典

    import pandas as pd
    import numpy as np
    
    dict_obj = {'key1': ['a', 'b', 'a', 'b',
                         'a', 'b', 'a', 'a'],
                'key2': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                'data1': np.random.randn(8),
                'data2': np.random.randn(8)}
    df_obj = pd.DataFrame(dict_obj)
    
    grouped1 = df_obj.groupby('key1')
    # GroupBy对象转换list
    print(list(grouped1))
    
    # GroupBy对象转换dict
    print(dict(list(grouped1)))

    效果

    [
    (
    'a', key1 key2 data1 data2 0 a one 0.377288 -1.662824 2 a two -0.206785 2.269787 4 a two -2.085155 -0.067927 6 a one -0.510846 0.888039 7 a three 0.782220 0.438419)
    , (
    'b',
      key1 key2 data1 data2
    1 b one -1.999189 -0.934550 3 b three 0.696655 -2.640779 5 b two 0.247095 -0.227695)
    ]
    {
    'a':
      key1 key2 data1 data2 0 a one
    0.377288 -1.662824 2 a two -0.206785 2.269787 4 a two -2.085155 -0.067927 6 a one -0.510846 0.888039 7 a three 0.782220 0.438419, 'b': key1 key2 data1 data2 1 b one -1.999189 -0.934550 3 b three 0.696655 -2.640779 5 b two 0.247095 -0.227695
    }

    1. 按列分组、按数据类型分组

    import pandas as pd
    import numpy as np
    
    dict_obj = {'key1': ['a', 'b', 'a', 'b',
                         'a', 'b', 'a', 'a'],
                'key2': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                'data1': np.random.randn(8),
                'data2': np.random.randn(8)}
    df_obj = pd.DataFrame(dict_obj)
    
    grouped1 = df_obj.groupby('key1')
    
    # 按列分组
    print('----------------')
    # 按列分组
    print(df_obj.dtypes)
    print('----------------')
    # 按数据类型分组
    print(df_obj.groupby(df_obj.dtypes, axis=1).size())

    效果

    ----------------
    key1      object
    key2      object
    data1    float64
    data2    float64
    dtype: object
    ----------------
    float64    2
    object     2
    dtype: int64

    2. 其他分组方法

    import pandas as pd
    import numpy as np
    
    df_obj2 = pd.DataFrame(np.random.randint(1, 10, (5, 5)),
                           columns=['a', 'b', 'c', 'd', 'e'],
                           index=['A', 'B', 'C', 'D', 'E'])
    df_obj2.loc[1, 1:4] = np.NaN
    print(df_obj2)

    效果:

         a    b    c    d    e
    A  3.0  4.0  9.0  4.0  9.0
    B  2.0  4.0  8.0  5.0  6.0
    C  5.0  4.0  5.0  4.0  4.0
    D  8.0  8.0  5.0  4.0  7.0
    E  9.0  1.0  1.0  3.0  5.0
    1  NaN  NaN  NaN  NaN  NaN

    3. 通过字典分组

    import pandas as pd
    import numpy as np
    
    # 通过字典分组
    df_obj2 = pd.DataFrame(np.random.randint(1, 10, (5, 5)),
                           columns=['a', 'b', 'c', 'd', 'e'],
                           index=['A', 'B', 'C', 'D', 'E'])
    mapping_dict = {'a':'Python', 'b':'Python', 'c':'Java', 'd':'C', 'e':'Java'}
    print('---------------')
    print(df_obj2)
    print('---------------')
    print(df_obj2.groupby(mapping_dict, axis=1).size())
    print('---------------')
    print(df_obj2.groupby(mapping_dict, axis=1).count()) # 非NaN的个数
    print('---------------')
    print(df_obj2.groupby(mapping_dict, axis=1).sum())

    效果:

    ---------------
       a  b  c  d  e
    A  8  2  7  3  1
    B  6  5  6  6  4
    C  3  8  2  4  9
    D  8  5  5  6  8
    E  1  7  6  4  6
    ---------------
    C         1
    Java      2
    Python    2
    dtype: int64
    ---------------
       C  Java  Python
    A  1     2       2
    B  1     2       2
    C  1     2       2
    D  1     2       2
    E  1     2       2
    ---------------
       C  Java  Python
    A  3     8      10
    B  6    10      11
    C  4    11      11
    D  6    13      13

    4. 通过函数分组,函数传入的参数为行索引或列索引

    # 通过函数分组
    df_obj3 = pd.DataFrame(np.random.randint(1, 10, (5,5)),
                           columns=['a', 'b', 'c', 'd', 'e'],
                           index=['AA', 'BBB', 'CC', 'D', 'EE'])
    #df_obj3
    
    def group_key(idx):
        """
            idx 为列索引或行索引
        """
        #return idx
        return len(idx)
    
    print(df_obj3.groupby(group_key).size())
    
    # 以上自定义函数等价于
    #df_obj3.groupby(len).size()

    效果:

    1    1
    2    3
    3    1
    dtype: int64

    5. 通过索引级别分组

    # 通过索引级别分组
    columns = pd.MultiIndex.from_arrays([['Python', 'Java', 'Python', 'Java', 'Python'],
                                         ['A', 'A', 'B', 'C', 'B']], names=['language', 'index'])
    df_obj4 = pd.DataFrame(np.random.randint(1, 10, (5, 5)), columns=columns)
    print(df_obj4)
    
    # 根据language进行分组
    print(df_obj4.groupby(level='language', axis=1).sum())
    # 根据index进行分组
    print(df_obj4.groupby(level='index', axis=1).sum())

    效果:

    language Python Java Python Java Python
    index         A    A      B    C      B
    0             7    2      4    6      5
    1             2    8      8    8      8
    2             2    5      6    2      8
    3             8    7      6    9      5
    4             8    3      4    9      2
    language  Java  Python
    0            8      16
    1           16      18
    2            7      16
    3           16      19
    4           12      14
    index   A   B  C
    0       9   9  6
    1      10  16  8
    2       7  14  2
    3      15  11  9
    4      11   6  9

    聚合 (aggregation)

    • 数组产生标量的过程,如mean()、count()等

    • 常用于对分组后的数据进行计算

    import pandas as pd
    import numpy as np
    
    dict_obj = {'key1' : ['a', 'b', 'a', 'b',
                          'a', 'b', 'a', 'a'],
                'key2' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                'data1': np.random.randint(1,10, 8),
                'data2': np.random.randint(1,10, 8)}
    df_obj5 = pd.DataFrame(dict_obj)
    print(df_obj5)

    效果:

      key1   key2  data1  data2
    0    a    one      4      4
    1    b    one      1      6
    2    a    two      5      3
    3    b  three      7      2
    4    a    two      6      3
    5    b    two      7      5
    6    a    one      3      9
    7    a  three      3      8

    1. 内置的聚合函数

    sum(), mean(), max(), min(), count(), size(), describe()

    import pandas as pd
    import numpy as np
    
    
    df_obj5 = pd.DataFrame(np.random.randint(1, 10, (5,5)),
                           columns=['key1', 'b', 'c', 'd', 'e'],
                           index=['key1', 'B', 'C', 'D', 'E'])
    print(df_obj5.groupby('key1').sum())
    print(df_obj5.groupby('key1').max())
    print(df_obj5.groupby('key1').min())
    print(df_obj5.groupby('key1').mean())
    print(df_obj5.groupby('key1').size())
    print(df_obj5.groupby('key1').count())
    print(df_obj5.groupby('key1').describe())

    效果:

          b   c  d  e
    key1              
    2      6   7  7  6
    3      1   6  5  8
    7      9   2  3  1
    9     14  12  9  5
          b  c  d  e
    key1            
    2     6  7  7  6
    3     1  6  5  8
    7     9  2  3  1
    9     9  8  8  3
          b  c  d  e
    key1            
    2     6  7  7  6
    3     1  6  5  8
    7     9  2  3  1
    9     5  4  1  2
            b    c    d    e
    key1                    
    2     6.0  7.0  7.0  6.0
    3     1.0  6.0  5.0  8.0
    7     9.0  2.0  3.0  1.0
    9     7.0  6.0  4.5  2.5
    key1
    2    1
    3    1
    7    1
    9    2
    dtype: int64
          b  c  d  e
    key1            
    2     1  1  1  1
    3     1  1  1  1
    7     1  1  1  1
    9     2  2  2  2
             b                                ...         e                           
         count mean       std  min  25%  50%  ...       std  min   25%  50%   75%  max
    key1                                      ...                                     
    2      1.0  6.0       NaN  6.0  6.0  6.0  ...       NaN  6.0  6.00  6.0  6.00  6.0
    3      1.0  1.0       NaN  1.0  1.0  1.0  ...       NaN  8.0  8.00  8.0  8.00  8.0
    7      1.0  9.0       NaN  9.0  9.0  9.0  ...       NaN  1.0  1.00  1.0  1.00  1.0
    9      2.0  7.0  2.828427  5.0  6.0  7.0  ...  0.707107  2.0  2.25  2.5  2.75  3.0
    
    [4 rows x 32 columns]

    2. 可自定义函数,传入agg方法中

    grouped.agg(func)

    func的参数为groupby索引对应的记录

    import pandas as pd
    import numpy as np
    
    df_obj5 = pd.DataFrame(np.random.randint(1, 10, (5, 5)),
                           columns=['key1', 'b', 'c', 'd', 'e'],
                           index=['key1', 'B', 'C', 'D', 'E'])
    df_obj = pd.DataFrame(np.random.randint(1, 10, (5,5)),
                           columns=['key1', 'b', 'c', 'd', 'e'],
                           index=['key1', 'B', 'C', 'D', 'E'])
    
    # 自定义聚合函数
    def peak_range(df):
        """
            返回数值范围
        """
        # print type(df) #参数为索引所对应的记录
        return df.max() - df.min()
    
    
    print(df_obj5.groupby('key1').agg(peak_range))
    print(df_obj.groupby('key1').agg(lambda df: df.max() - df.min()))

    效果:

          b  c  d  e
    key1            
    1     0  0  0  0
    2     0  0  0  0
    5     0  0  0  0
    9     6  3  4  5
          b  c  d  e
    key1            
    2     1  5  7  7
    3     0  0  0  0
    5     0  0  0  0
    6     0  0  0  0

    3. 应用多个聚合函数

    同时应用多个函数进行聚合操作,使用函数列表

    import pandas as pd
    import numpy as np
    
    df_obj5 = pd.DataFrame(np.random.randint(1, 10, (5, 5)),
                           columns=['key1', 'b', 'c', 'd', 'e'],
                           index=['key1', 'B', 'C', 'D', 'E'])
    df_obj = pd.DataFrame(np.random.randint(1, 10, (5,5)),
                           columns=['key1', 'b', 'c', 'd', 'e'],
                           index=['key1', 'B', 'C', 'D', 'E'])
    
    # 自定义聚合函数
    def peak_range(df):
        """
            返回数值范围
        """
        # print type(df) #参数为索引所对应的记录
        return df.max() - df.min()
    
    
    # 应用多个聚合函数
    
    # 同时应用多个聚合函数
    print(df_obj.groupby('key1').agg(['mean', 'std', 'count', peak_range])) # 默认列名为函数名
    
    print(df_obj.groupby('key1').agg(['mean', 'std', 'count', ('range', peak_range)])) # 通过元组提供新的列名

    效果:

            b                         c  ...          d    e                     
         mean std count peak_range mean  ... peak_range mean std count peak_range
    key1                                 ...                                     
    1       4 NaN     1          0    3  ...          0    8 NaN     1          0
    2       5 NaN     1          0    9  ...          0    5 NaN     1          0
    7       4 NaN     1          0    4  ...          0    7 NaN     1          0
    8       2 NaN     1          0    8  ...          0    7 NaN     1          0
    9       5 NaN     1          0    8  ...          0    8 NaN     1          0
    
    [5 rows x 16 columns]
            b                    c            ...   d                e                
         mean std count range mean std count  ... std count range mean std count range
    key1                                      ...                                     
    1       4 NaN     1     0    3 NaN     1  ... NaN     1     0    8 NaN     1     0
    2       5 NaN     1     0    9 NaN     1  ... NaN     1     0    5 NaN     1     0
    7       4 NaN     1     0    4 NaN     1  ... NaN     1     0    7 NaN     1     0
    8       2 NaN     1     0    8 NaN     1  ... NaN     1     0    7 NaN     1     0
    9       5 NaN     1     0    8 NaN     1  ... NaN     1     0    8 NaN     1     0
    
    [5 rows x 16 columns]

    4. 对不同的列分别作用不同的聚合函数,使用dict

    import pandas as pd
    import numpy as np
    
    df_obj = pd.DataFrame(np.random.randint(1, 10, (5, 5)),
                          columns=['a', 'b', 'c', 'd', 'e'],
                          index=['A', 'B', 'C', 'D', 'E'])
    # 应用多个聚合函数
    
    # 每列作用不同的聚合函数
    dict_mapping = {'a': 'mean',
                    'b': 'sum'}
    print(df_obj.groupby('a').agg(dict_mapping))
    print('--------')
    dict_mapping = {'a': ['mean', 'max'],
                    'b': 'sum'}
    print(df_obj.groupby('a').agg(dict_mapping))

    效果

       a  b
    a      
    2  2  8
    4  4  4
    5  5  3
    7  7  7
    --------
         a       b
      mean max sum
    a             
    2    2   2   8
    4    4   4   4
    5    5   5   3

    5. 常用的内置聚合函数

    import pandas as pd
    import numpy as np
    
    dict_obj = {'key1' : ['a', 'b', 'a', 'b', 
                          'a', 'b', 'a', 'a'],
                'key2' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                'data1': np.random.randint(1, 10, 8),
                'data2': np.random.randint(1, 10, 8)}
    df_obj = pd.DataFrame(dict_obj)
    print(df_obj)
    
    # 按key1分组后,计算data1,data2的统计信息并附加到原始表格中,并添加表头前缀
    k1_sum = df_obj.groupby('key1').sum().add_prefix('sum_')
    print(k1_sum)

    效果:

      key1   key2  data1  data2
    0    a    one      7      7
    1    b    one      4      9
    2    a    two      8      2
    3    b  three      6      3
    4    a    two      3      3
    5    b    two      3      3
    6    a    one      6      2
    7    a  three      9      1
          sum_data1  sum_data2
    key1                      
    a            33         15
    b            13         15

    1. merge

    使用merge的外连接,比较复杂

    # 方法1,使用merge
    k1_sum_merge = pd.merge(df_obj, k1_sum, left_on='key1', right_index=True)
    print(k1_sum_merge)

    2. transform

    transform的计算结果和原始数据的形状保持一致,

    如:grouped.transform(np.sum)

    # 方法2,使用transform
    k1_sum_tf = df_obj.groupby('key1').transform(np.sum).add_prefix('sum_')
    df_obj[k1_sum_tf.columns] = k1_sum_tf
    print(df_obj)

    也可传入自定义函数,

    # 自定义函数传入transform
    def diff_mean(s):
        """
            返回数据与均值的差值
        """
        return s - s.mean()
    
    print(df_obj.groupby('key1').transform(diff_mean))

    整体代码:

    import pandas as pd
    import numpy as np
    
    dict_obj = {'key1': ['a', 'b', 'a', 'b',
                         'a', 'b', 'a', 'a'],
                'key2': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                'data1': np.random.randint(1, 10, 8),
                'data2': np.random.randint(1, 10, 8)}
    df_obj = pd.DataFrame(dict_obj)
    print(df_obj)
    
    # 按key1分组后,计算data1,data2的统计信息并附加到原始表格中,并添加表头前缀
    k1_sum = df_obj.groupby('key1').sum().add_prefix('sum_')
    print(k1_sum)
    # 方法1,使用merge
    k1_sum_merge = pd.merge(df_obj, k1_sum, left_on='key1', right_index=True)
    print(k1_sum_merge)
    # 方法2,使用transform
    k1_sum_tf = df_obj.groupby('key1').transform(np.sum).add_prefix('sum_')
    df_obj[k1_sum_tf.columns] = k1_sum_tf
    print(df_obj)
    
    
    # 自定义函数传入transform
    def diff_mean(s):
        """
            返回数据与均值的差值
        """
        return s - s.mean()
    
    
    print(df_obj.groupby('key1').transform(diff_mean))

    效果:

      key1   key2  data1  data2
    0    a    one      9      8
    1    b    one      5      3
    2    a    two      3      5
    3    b  three      3      7
    4    a    two      3      3
    5    b    two      2      2
    6    a    one      4      2
    7    a  three      6      8
          sum_data1  sum_data2
    key1                      
    a            25         26
    b            10         12
      key1   key2  data1  data2  sum_data1  sum_data2
    0    a    one      9      8         25         26
    2    a    two      3      5         25         26
    4    a    two      3      3         25         26
    6    a    one      4      2         25         26
    7    a  three      6      8         25         26
    1    b    one      5      3         10         12
    3    b  three      3      7         10         12
    5    b    two      2      2         10         12
      key1   key2  data1  data2           sum_key2 sum_data1 sum_data2
    0    a    one      9      8  onetwotwoonethree        25        26
    1    b    one      5      3        onethreetwo        10        12
    2    a    two      3      5  onetwotwoonethree        25        26
    3    b  three      3      7        onethreetwo        10        12
    4    a    two      3      3  onetwotwoonethree        25        26
    5    b    two      2      2        onethreetwo        10        12
    6    a    one      4      2  onetwotwoonethree        25        26
    7    a  three      6      8  onetwotwoonethree        25        26
          data1  data2 sum_data1 sum_data2
    0  4.000000    2.8         0         0
    1  1.666667   -1.0         0         0
    2 -2.000000   -0.2         0         0
    3 -0.333333    3.0         0         0
    4 -2.000000   -2.2         0         0
    5 -1.333333   -2.0         0         0
    6 -1.000000   -3.2         0         0
    7  1.000000    2.8         0         0

    groupby.apply(func)

    func函数也可以在各分组上分别调用,最后结果通过pd.concat组装到一起(数据合并)

    import pandas as pd
    import numpy as np
    
    dataset_path = './starcraft.csv'
    df_data = pd.read_csv(dataset_path, usecols=['LeagueIndex', 'Age', 'HoursPerWeek', 
                                                 'TotalHours', 'APM'])
    
    def top_n(df, n=3, column='APM'):
        """
            返回每个分组按 column 的 top n 数据
        """
        return df.sort_values(by=column, ascending=False)[:n]
    
    print(df_data.groupby('LeagueIndex').apply(top_n))

    1. 产生层级索引:外层索引是分组名,内层索引是df_obj的行索引

    # apply函数接收的参数会传入自定义的函数中
    print(df_data.groupby('LeagueIndex').apply(top_n, n=2, column='Age'))

    2. 禁止层级索引, group_keys=False

    print(df_data.groupby('LeagueIndex', group_keys=False).apply(top_n))

    apply可以用来处理不同分组内的缺失数据填充,填充该分组的均值。

     整体代码:

    import pandas as pd
    import numpy as np
    
    dataset_path = './starcraft.csv'
    df_data = pd.read_csv(dataset_path, usecols=['LeagueIndex', 'Age', 'HoursPerWeek',
                                                 'TotalHours', 'APM'])
    
    def top_n(df, n=3, column='APM'):
        """
            返回每个分组按 column 的 top n 数据
        """
        return df.sort_values(by=column, ascending=False)[:n]
    
    
    print(df_data.groupby('LeagueIndex').apply(top_n))
    
    # apply函数接收的参数会传入自定义的函数中
    print(df_data.groupby('LeagueIndex').apply(top_n, n=2, column='Age'))
    
    print(df_data.groupby('LeagueIndex', group_keys=False).apply(top_n))
  • 相关阅读:
    Spring Cloud Stream
    解决eclipse中git插件中的cannot open git-upload-pack问题
    Unsupported major.minor version 51.0解决办法
    git将本地仓库上传到远程仓库
    Ubuntu下如何安装与运行Redis
    git中进入带有空格的目录下的解决办法
    http://www.111cn.net/jsp/Jsp-Servlet/45158.htm
    MySQL 5.6 for Windows 解压缩版配置安装
    http://blog.csdn.net/congcong68/article/details/39252897
    http://jingyan.baidu.com/article/bad08e1ee14ae409c85121cf.html
  • 原文地址:https://www.cnblogs.com/loaderman/p/11967365.html
Copyright © 2020-2023  润新知