• The difference between the transform and apply functions in pandas

    There are two major differences between the transform and apply groupby methods.

    • apply implicitly passes all the columns for each group as a DataFrame to the custom function, while transform passes each column for each group as a Series to the custom function
    • The custom function passed to apply can return a scalar, a Series, or a DataFrame (or a NumPy array, or even a list). The custom function passed to transform must return a sequence (a one-dimensional Series, array, or list) the same length as the group.

    So, transform works on just one Series at a time and apply works on the entire DataFrame at once.
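
    To make the difference in return shape concrete, here is a minimal sketch on a hypothetical toy frame (the names toy, key, a and b are illustrative and not from the text above). Under apply the function sees each group as a DataFrame and yields one value per group; under transform the same kind of reduction is broadcast back to one value per row.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({"key": ["x", "x", "y"], "a": [1, 2, 3], "b": [4, 5, 6]})
grouped = toy.groupby("key")[["a", "b"]]

# apply: the function receives each group as a DataFrame, so it can combine columns.
per_group = grouped.apply(lambda g: (g["a"] * g["b"]).sum())

# transform: the reduction is computed per column, and the scalar result is
# broadcast back to every row of the group.
per_row = grouped.transform(lambda s: s.sum())

print(per_group)  # one value per group
print(per_row)    # same number of rows as toy
```

    The column selection [["a", "b"]] is a choice made for this sketch so that the grouping column itself is not passed to the functions.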

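    The transform function:

        1. Operates on only one Series at a time; a function that subtracts column 'b' from column 'a' raises an exception.

        2. Must return a one-dimensional sequence with the same number of rows as the group.

        3. May also return a single scalar, as in .transform(sum); the scalar is then broadcast to every row of the group.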


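    The apply function:

        1. Unlike transform, which handles one Series at a time, apply operates on the whole DataFrame of each group.

        2. apply implicitly passes all the columns of each group to the custom function as a DataFrame.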


    import numpy as np
    import pandas as pd
    data = pd.DataFrame({'a': [4, 5, 1, 3], 'b': [6, 10, 3, 11],
                         'state': ['Florida', 'Florida', 'Texas', 'Texas']})
    #    a   b    state
    # 0  4   6  Florida
    # 1  5  10  Florida
    # 2  1   3    Texas
    # 3  3  11    Texas
    def sub_two(X):
        return X['a'] - X['b']
    data1 = data.groupby(data['state']).apply(sub_two)  # using transform here would raise an error
    # state     
    # Florida  0   -2
    #          1   -5
    # Texas    2   -2
    #          3   -8
    # dtype: int64
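
    The failing transform call mentioned in the comment above can be checked with a short sketch. It rebuilds the same four rows and catches the exception broadly, because the exact exception type and message may vary between pandas versions; the variable name failed is only for illustration.

```python
import pandas as pd

# Same values as the table printed above.
data = pd.DataFrame({"a": [4, 5, 1, 3], "b": [6, 10, 3, 11],
                     "state": ["Florida", "Florida", "Texas", "Texas"]})

def sub_two(X):
    return X["a"] - X["b"]

try:
    # transform hands sub_two a single column, which has no 'a' or 'b' label.
    data.groupby("state").transform(sub_two)
    failed = False
except Exception as exc:  # exception type may differ by pandas version
    failed = True
    print(type(exc).__name__, exc)

print(failed)
```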


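    Note that transform and apply produce output of different shapes: transform returns as many rows as the input data, whereas apply, given an aggregating function, returns one row per group.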


    def group_sum(x):
        return x.sum()
    data3 = data.groupby(data['state']).transform(group_sum)    # returns as many rows as the input data
    #    a   b
    # 0  9  16
    # 1  9  16
    # 2  4  14
    # 3  4  14
    data4 = data.groupby(data['state']).apply(group_sum)
    #          a   b           state
    # state                         
    # Florida  9  16  FloridaFlorida
    # Texas    4  14      TexasTexas
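
    Because transform keeps the original row count, its result can be combined directly with the source columns. The sketch below is one possible use, not something from the original example: it computes each row's share of its state's total for column a. The names group_total and share are assumptions made for illustration.

```python
import pandas as pd

data = pd.DataFrame({"a": [4, 5, 1, 3],
                     "state": ["Florida", "Florida", "Texas", "Texas"]})

# One total per row, aligned with the original index.
group_total = data.groupby("state")["a"].transform("sum")

# Row-wise division works because both Series have the same length and index.
data["share"] = data["a"] / group_total
print(data)
```

    With apply or an aggregation, the per-group totals would first have to be merged back onto the rows before this division could be done.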

    The other difference is that transform must return a one-dimensional sequence the same size as the group. In this particular instance, each group has two rows, so transform must return a sequence of two rows. If it does not, an error is raised.
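
    A minimal sketch of that failure, assuming a hypothetical toy frame and a helper named drop_last (neither is from the original post): the function returns one element fewer than the group contains, so transform cannot line the result up with the group's rows. The exception type and message may differ between pandas versions, so the sketch catches Exception broadly.

```python
import pandas as pd

# Hypothetical data: two groups of three rows each.
toy = pd.DataFrame({"key": ["x", "x", "x", "y", "y", "y"],
                    "v": [1, 2, 3, 4, 5, 6]})

def drop_last(s):
    # Returns an array one element shorter than the group,
    # which is not a valid result for transform.
    return s.to_numpy()[:-1]

try:
    toy.groupby("key")["v"].transform(drop_last)
    raised = False
except Exception as exc:  # exception type may differ by pandas version
    raised = True
    print(type(exc).__name__, exc)

print(raised)
```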


    df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                              'foo', 'bar', 'foo', 'foo'],
                       'B' : ['one', 'one', 'two', 'three',
                             'two', 'two', 'one', 'three'],
                       'C' : np.random.randn(8), 'D' : np.random.randn(8)})
    #      A      B         C         D
    # 0  foo    one  0.824188  0.640573
    # 1  bar    one  0.479966 -0.786443
    # 2  foo    two  1.173468  0.608870
    # 3  bar  three  0.909048 -0.931012
    # 4  foo    two -0.571721  0.978222
    # 5  bar    two -0.109497 -0.736918
    # 6  foo    one  0.019028 -0.298733
    # 7  foo  three -0.943761 -0.460587
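    def zscore(x):
        return (x - x.mean()) / x.std()
    # select C and D explicitly; older pandas versions dropped the string column B automatically
    print(df.groupby('A')[['C', 'D']].transform(zscore))
    # with the same columns selected, apply gives the same values (the index may differ by pandas version)
    print(df.groupby('A')[['C', 'D']].apply(zscore))
    # df.groupby('A').apply(zscore) would raise an error, because apply passes the whole DataFrame, string columns included
    df['sum_c'] = df.groupby('A')['C'].transform(sum)   # group by column A, then compute the per-group sum of column C
    df = df.sort_values('A')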
    #      A      B         C         D     sum_c
    # 1  bar    one  0.479966 -0.786443  1.279517
    # 3  bar  three  0.909048 -0.931012  1.279517
    # 5  bar    two -0.109497 -0.736918  1.279517
    # 0  foo    one  0.824188  0.640573  0.501202
    # 2  foo    two  1.173468  0.608870  0.501202
    # 4  foo    two -0.571721  0.978222  0.501202
    # 6  foo    one  0.019028 -0.298733  0.501202
    # 7  foo  three -0.943761 -0.460587  0.501202
    print(df.groupby('A')['C'].sum())   # for comparison, the aggregated per-group sums
    # A
    # bar    1.279517
    # foo    0.501202
    # Name: C, dtype: float64

    The function passed to transform must return a number, a row, or an object of the same shape as the argument. If it is a number, that number is assigned to every element in the group; if it is a row, it is broadcast to all the rows in the group.
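
    Two of these return shapes can be sketched on a single column. The frame and the names toy, key and v below are hypothetical; the first call returns a number per group, the second returns a Series of the same shape as its argument.

```python
import pandas as pd

# Hypothetical data with interleaved groups, to show that row order is preserved.
toy = pd.DataFrame({"key": ["x", "y", "x", "y"],
                    "v": [1.0, 10.0, 3.0, 20.0]})
g = toy.groupby("key")["v"]

# A number per group: broadcast to every row of that group.
group_max = g.transform(lambda s: s.max())

# Same shape as the argument: values are kept row by row.
demeaned = g.transform(lambda s: s - s.mean())

print(group_max.tolist())
print(demeaned.tolist())
```

    In both cases the result has the same index as toy, so it can be assigned straight back as a new column.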


