• Data Visualization Fundamentals (34): Pandas Basics (14) Grouping (2): Aggregation/apply


    Aggregation

    Once the GroupBy object has been created, several methods are available to perform a computation on the grouped data. These operations are similar to the aggregating API, window API, and resample API.

    An obvious one is aggregation via the aggregate() or equivalently agg() method:

    In [67]: grouped = df.groupby("A")
    
    In [68]: grouped.aggregate(np.sum)
    Out[68]: 
                C         D
    A                      
    bar  0.392940  1.732707
    foo -1.796421  2.824590
    
    In [69]: grouped = df.groupby(["A", "B"])
    
    In [70]: grouped.aggregate(np.sum)
    Out[70]: 
                      C         D
    A   B                        
    bar one    0.254161  1.511763
        three  0.215897 -0.990582
        two   -0.077118  1.211526
    foo one   -0.983776  1.614581
        three -0.862495  0.024580
        two    0.049851  1.185429

    As you can see, the result of the aggregation will have the group names as the new index along the grouped axis. In the case of multiple keys, the result is a MultiIndex by default, though this can be changed by using the as_index option:

    In [71]: grouped = df.groupby(["A", "B"], as_index=False)
    
    In [72]: grouped.aggregate(np.sum)
    Out[72]: 
         A      B         C         D
    0  bar    one  0.254161  1.511763
    1  bar  three  0.215897 -0.990582
    2  bar    two -0.077118  1.211526
    3  foo    one -0.983776  1.614581
    4  foo  three -0.862495  0.024580
    5  foo    two  0.049851  1.185429
    
    In [73]: df.groupby("A", as_index=False).sum()
    Out[73]: 
         A         C         D
    0  bar  0.392940  1.732707
    1  foo -1.796421  2.824590

    Note that you could use the reset_index DataFrame function to achieve the same result as the column names are stored in the resulting MultiIndex:

    In [74]: df.groupby(["A", "B"]).sum().reset_index()
    Out[74]: 
         A      B         C         D
    0  bar    one  0.254161  1.511763
    1  bar  three  0.215897 -0.990582
    2  bar    two -0.077118  1.211526
    3  foo    one -0.983776  1.614581
    4  foo  three -0.862495  0.024580
    5  foo    two  0.049851  1.185429

    Another simple aggregation example is to compute the size of each group. This is included in GroupBy as the size method. It returns a Series whose index consists of the group names and whose values are the sizes of each group (with as_index=False, as in the grouped object used below, the result comes back as a DataFrame instead).

    In [75]: grouped.size()
    Out[75]: 
         A      B  size
    0  bar    one     1
    1  bar  three     1
    2  bar    two     1
    3  foo    one     2
    4  foo  three     1
    5  foo    two     2
    
    In [76]: grouped.describe()
    Out[76]: 
          C                                                    ...         D                                                  
      count      mean       std       min       25%       50%  ...       std       min       25%       50%       75%       max
    0   1.0  0.254161       NaN  0.254161  0.254161  0.254161  ...       NaN  1.511763  1.511763  1.511763  1.511763  1.511763
    1   1.0  0.215897       NaN  0.215897  0.215897  0.215897  ...       NaN -0.990582 -0.990582 -0.990582 -0.990582 -0.990582
    2   1.0 -0.077118       NaN -0.077118 -0.077118 -0.077118  ...       NaN  1.211526  1.211526  1.211526  1.211526  1.211526
    3   2.0 -0.491888  0.117887 -0.575247 -0.533567 -0.491888  ...  0.761937  0.268520  0.537905  0.807291  1.076676  1.346061
    4   1.0 -0.862495       NaN -0.862495 -0.862495 -0.862495  ...       NaN  0.024580  0.024580  0.024580  0.024580  0.024580
    5   2.0  0.024925  1.652692 -1.143704 -0.559389  0.024925  ...  1.462816 -0.441652  0.075531  0.592714  1.109898  1.627081
    
    [6 rows x 16 columns]

    Another aggregation example is to compute the number of unique values of each group. This is similar to the value_counts function, except that it only counts unique values.

    In [77]: ll = [['foo', 1], ['foo', 2], ['foo', 2], ['bar', 1], ['bar', 1]]
    
    In [78]: df4 = pd.DataFrame(ll, columns=["A", "B"])
    
    In [79]: df4
    Out[79]: 
         A  B
    0  foo  1
    1  foo  2
    2  foo  2
    3  bar  1
    4  bar  1
    
    In [80]: df4.groupby("A")["B"].nunique()
    Out[80]: 
    A
    bar    1
    foo    2
    Name: B, dtype: int64

    Aggregating functions are the ones that reduce the dimension of the returned objects. Some common aggregating functions are tabulated below:

    Function     Description
    mean()       Compute mean of groups
    sum()        Compute sum of group values
    size()       Compute group sizes
    count()      Compute count of group values
    std()        Standard deviation of groups
    var()        Compute variance of groups
    sem()        Standard error of the mean of groups
    describe()   Generates descriptive statistics
    first()      Compute first of group values
    last()       Compute last of group values
    nth()        Take nth value, or a subset if n is a list
    min()        Compute min of group values
    max()        Compute max of group values

    The aggregating functions above will exclude NA values. Any function which reduces a Series to a scalar value is an aggregation function and will work; a trivial example is df.groupby('A').agg(lambda ser: 1). Note that nth() can act as a reducer or a filter; a short sketch follows.
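
    The following is a minimal sketch (the sample df is rebuilt here under the assumption that it has the same A/B/C/D layout as in the earlier posts in this series) showing the trivial scalar reducer mentioned above and the two faces of nth():

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(
        {
            "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
            "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
            "C": np.random.randn(8),
            "D": np.random.randn(8),
        }
    )

    # Trivial reducer: every group collapses to the scalar 1.
    df.groupby("A").agg(lambda ser: 1)

    # nth(0) acts as a reducer: one row per group.
    df.groupby("A").nth(0)

    # nth([0, 1]) acts as a filter: it keeps up to two rows of each group.
    df.groupby("A").nth([0, 1])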

    1 Applying multiple functions at once

    With grouped Series you can also pass a list or dict of functions to do aggregation with, outputting a DataFrame:

    In [81]: grouped = df.groupby("A")
    
    In [82]: grouped["C"].agg([np.sum, np.mean, np.std])
    Out[82]: 
              sum      mean       std
    A                                
    bar  0.392940  0.130980  0.181231
    foo -1.796421 -0.359284  0.912265

    On a grouped DataFrame, you can pass a list of functions to apply to each column, which produces an aggregated result with a hierarchical index:

    In [83]: grouped.agg([np.sum, np.mean, np.std])
    Out[83]: 
                C                             D                    
              sum      mean       std       sum      mean       std
    A                                                              
    bar  0.392940  0.130980  0.181231  1.732707  0.577569  1.366330
    foo -1.796421 -0.359284  0.912265  2.824590  0.564918  0.884785

    The resulting aggregations are named for the functions themselves. If you need to rename, then you can add in a chained operation for a Series like this:

    In [84]: (
       ....:     grouped["C"]
       ....:     .agg([np.sum, np.mean, np.std])
       ....:     .rename(columns={"sum": "foo", "mean": "bar", "std": "baz"})
       ....: )
       ....: 
    Out[84]: 
              foo       bar       baz
    A                                
    bar  0.392940  0.130980  0.181231
    foo -1.796421 -0.359284  0.912265

    For a grouped DataFrame, you can rename in a similar manner:

    In [85]: (
       ....:     grouped.agg([np.sum, np.mean, np.std]).rename(
       ....:         columns={"sum": "foo", "mean": "bar", "std": "baz"}
       ....:     )
       ....: )
       ....: 
    Out[85]: 
                C                             D                    
              foo       bar       baz       foo       bar       baz
    A                                                              
    bar  0.392940  0.130980  0.181231  1.732707  0.577569  1.366330
    foo -1.796421 -0.359284  0.912265  2.824590  0.564918  0.884785

    In general, the output column names should be unique: you can't apply the same function (or two functions with the same name) to the same column. Recent pandas versions raise a SpecificationError in that case; older versions simply produced duplicated columns, as shown below.

    In [86]: grouped["C"].agg(["sum", "sum"])
    Out[86]: 
              sum       sum
    A                      
    bar  0.392940  0.392940
    foo -1.796421 -1.796421
    

    pandas does allow you to provide multiple lambdas. In this case, pandas will mangle the name of the (nameless) lambda functions, appending _<i> to each subsequent lambda.

    In [87]: grouped["C"].agg([lambda x: x.max() - x.min(), lambda x: x.median() - x.mean()])
    Out[87]: 
         <lambda_0>  <lambda_1>
    A                          
    bar    0.331279    0.084917
    foo    2.337259   -0.215962
    

    2 Named aggregation

    To support column-specific aggregation with control over the output column names, pandas accepts the special syntax in GroupBy.agg(), known as “named aggregation”, where

    • The keywords are the output column names

    • The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column. pandas provides the pandas.NamedAgg namedtuple with the fields ['column', 'aggfunc'] to make it clearer what the arguments are. As usual, the aggregation can be a callable or a string alias.

    In [88]: animals = pd.DataFrame(
       ....:     {
       ....:         "kind": ["cat", "dog", "cat", "dog"],
       ....:         "height": [9.1, 6.0, 9.5, 34.0],
       ....:         "weight": [7.9, 7.5, 9.9, 198.0],
       ....:     }
       ....: )
       ....: 
    
    In [89]: animals
    Out[89]: 
      kind  height  weight
    0  cat     9.1     7.9
    1  dog     6.0     7.5
    2  cat     9.5     9.9
    3  dog    34.0   198.0
    
    In [90]: animals.groupby("kind").agg(
       ....:     min_height=pd.NamedAgg(column="height", aggfunc="min"),
       ....:     max_height=pd.NamedAgg(column="height", aggfunc="max"),
       ....:     average_weight=pd.NamedAgg(column="weight", aggfunc=np.mean),
       ....: )
       ....: 
    Out[90]: 
          min_height  max_height  average_weight
    kind                                        
    cat          9.1         9.5            8.90
    dog          6.0        34.0          102.75

    pandas.NamedAgg is just a namedtuple. Plain tuples are allowed as well.

    In [91]: animals.groupby("kind").agg(
       ....:     min_height=("height", "min"),
       ....:     max_height=("height", "max"),
       ....:     average_weight=("weight", np.mean),
       ....: )
       ....: 
    Out[91]: 
          min_height  max_height  average_weight
    kind                                        
    cat          9.1         9.5            8.90
    dog          6.0        34.0          102.75

    If your desired output column names are not valid Python keywords, construct a dictionary and unpack the keyword arguments:

    In [92]: animals.groupby("kind").agg(
       ....:     **{
       ....:         "total weight": pd.NamedAgg(column="weight", aggfunc=sum)
       ....:     }
       ....: )
       ....: 
    Out[92]: 
          total weight
    kind              
    cat           17.8
    dog          205.5

    Additional keyword arguments are not passed through to the aggregation functions. Only pairs of (column, aggfunc) should be passed as **kwargs. If your aggregation function requires additional arguments, partially apply them with functools.partial(), as sketched below.
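
    As a hedged sketch (reusing the animals frame from above), pd.Series.quantile takes an extra q argument, so it is bound with functools.partial before being handed to named aggregation; the names q90 and height_q90 are purely illustrative:

    from functools import partial

    import pandas as pd

    animals = pd.DataFrame(
        {
            "kind": ["cat", "dog", "cat", "dog"],
            "height": [9.1, 6.0, 9.5, 34.0],
            "weight": [7.9, 7.5, 9.9, 198.0],
        }
    )

    # Bind q=0.9 up front; agg only ever sees a one-argument callable.
    q90 = partial(pd.Series.quantile, q=0.9)

    animals.groupby("kind").agg(
        height_q90=pd.NamedAgg(column="height", aggfunc=q90),
    )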

    Note

    For Python 3.5 and earlier, the order of **kwargs in a function was not preserved. This means that the output column ordering would not be consistent. To ensure consistent ordering, the keys (and so the output columns) are always sorted on Python 3.5 and earlier.

    Named aggregation is also valid for Series groupby aggregations. In this case there’s no column selection, so the values are just the functions.

    In [93]: animals.groupby("kind").height.agg(
       ....:     min_height="min",
       ....:     max_height="max",
       ....: )
       ....: 
    Out[93]: 
          min_height  max_height
    kind                        
    cat          9.1         9.5
    dog          6.0        34.0

    3 Applying different functions to DataFrame columns

    By passing a dict to aggregate you can apply a different aggregation to the columns of a DataFrame:

    In [94]: grouped.agg({"C": np.sum, "D": lambda x: np.std(x, ddof=1)})
    Out[94]: 
                C         D
    A                      
    bar  0.392940  1.366330
    foo -1.796421  0.884785

    The function names can also be strings. In order for a string to be valid it must be either implemented on GroupBy or available via dispatching:

    In [95]: grouped.agg({"C": "sum", "D": "std"})
    Out[95]: 
                C         D
    A                      
    bar  0.392940  1.366330
    foo -1.796421  0.884785

    4 Cython-optimized aggregation functions

    Some common aggregations, currently only sum, mean, std, and sem, have optimized Cython implementations:

    In [96]: df.groupby("A").sum()
    Out[96]: 
                C         D
    A                      
    bar  0.392940  1.732707
    foo -1.796421  2.824590
    
    In [97]: df.groupby(["A", "B"]).mean()
    Out[97]: 
                      C         D
    A   B                        
    bar one    0.254161  1.511763
        three  0.215897 -0.990582
        two   -0.077118  1.211526
    foo one   -0.491888  0.807291
        three -0.862495  0.024580
        two    0.024925  0.592714

    Of course sum and mean are implemented on pandas objects, so the above code would work even without the special versions via dispatching (see below).
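
    A small sketch of that equivalence (with an assumed toy frame): the GroupBy method, the string alias, and the plain NumPy reducer all produce the same grouped sums; whether the optimized Cython path is used is an internal detail.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(
        {
            "A": ["foo", "bar", "foo", "bar"],
            "C": [1.0, 2.0, 3.0, 4.0],
            "D": [5.0, 6.0, 7.0, 8.0],
        }
    )

    r1 = df.groupby("A").sum()        # Cython-optimized GroupBy method
    r2 = df.groupby("A").agg("sum")   # string alias, resolved on GroupBy
    r3 = df.groupby("A").agg(np.sum)  # NumPy reducer passed as a callable

    assert r1.equals(r2) and r2.equals(r3)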

    Flexible apply

    Some operations on the grouped data might not fit into either the aggregate or transform categories. Or, you may simply want GroupBy to infer how to combine the results. For these, use the apply function, which can be substituted for both aggregate and transform in many standard use cases. However, apply can handle some exceptional use cases, for example:

    In [156]: df
    Out[156]: 
         A      B         C         D
    0  foo    one -0.575247  1.346061
    1  bar    one  0.254161  1.511763
    2  foo    two -1.143704  1.627081
    3  bar  three  0.215897 -0.990582
    4  foo    two  1.193555 -0.441652
    5  bar    two -0.077118  1.211526
    6  foo    one -0.408530  0.268520
    7  foo  three -0.862495  0.024580
    
    In [157]: grouped = df.groupby("A")
    
    # could also just call .describe()
    In [158]: grouped["C"].apply(lambda x: x.describe())
    Out[158]: 
    A         
    bar  count    3.000000
         mean     0.130980
         std      0.181231
         min     -0.077118
         25%      0.069390
                    ...   
    foo  min     -1.143704
         25%     -0.862495
         50%     -0.575247
         75%     -0.408530
         max      1.193555
    Name: C, Length: 16, dtype: float64

    The dimension of the returned result can also change:

    In [159]: grouped = df.groupby('A')['C']
    
    In [160]: def f(group):
       .....:     return pd.DataFrame({'original': group,
       .....:                          'demeaned': group - group.mean()})
       .....: 
    
    In [161]: grouped.apply(f)
    Out[161]: 
       original  demeaned
    0 -0.575247 -0.215962
    1  0.254161  0.123181
    2 -1.143704 -0.784420
    3  0.215897  0.084917
    4  1.193555  1.552839
    5 -0.077118 -0.208098
    6 -0.408530 -0.049245
    7 -0.862495 -0.503211

    apply on a Series can operate on a returned value from the applied function that is itself a Series, and possibly upcast the result to a DataFrame:

    In [162]: def f(x):
       .....:     return pd.Series([x, x ** 2], index=["x", "x^2"])
       .....: 
    
    In [163]: s = pd.Series(np.random.rand(5))
    
    In [164]: s
    Out[164]: 
    0    0.321438
    1    0.493496
    2    0.139505
    3    0.910103
    4    0.194158
    dtype: float64
    
    In [165]: s.apply(f)
    Out[165]: 
              x       x^2
    0  0.321438  0.103323
    1  0.493496  0.243538
    2  0.139505  0.019462
    3  0.910103  0.828287
    4  0.194158  0.037697