• 30分钟了解pandas


    参考资料:https://pandas.pydata.org/docs/user_guide/10min.html

    创建对象

    创建一个Series对象

    In [150]: s = pd.Series([1, 3, 5, np.nan, 6, 8])                                                            
    
    In [151]: s                                                                                                 
    Out[151]: 
    0    1.0
    1    3.0
    2    5.0
    3    NaN
    4    6.0
    5    8.0
    dtype: float64
    

      

    创建一个DataFrame对象通过NumPy的数组

    In [152]: dates = pd.date_range('20130101', periods=6)                                                      
    
    In [153]: dates                                                                                             
    Out[153]: 
    DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
                   '2013-01-05', '2013-01-06'],
                  dtype='datetime64[ns]', freq='D')
    
    In [154]: df = pd.DataFrame(np.random.randn(6,4),index=dates, columns=list('ABCD'))                         
    
    In [155]: df                                                                                                
    Out[155]: 
                       A         B         C         D
    2013-01-01  0.911951  0.119077  2.244598 -1.524168
    2013-01-02 -0.711591  1.814327  0.859623 -0.249116
    2013-01-03 -0.041417 -1.158472  1.037752  1.124356
    2013-01-04  1.222247 -0.651681  1.764630  0.500507
    2013-01-05 -0.332192 -1.424871 -0.680254  0.370109
    2013-01-06 -0.724522 -1.238991  0.223425 -1.775954  

    通过一个字典来生成一个df对象

    In [179]: df2 = pd.DataFrame( 
         ...:    ...:     { 
         ...:    ...:         "A": 1.0, 
         ...:    ...:         "B": pd.Timestamp("20130102"), 
         ...:    ...:         "C": pd.Series(1, index=list(range(4)), dtype="float32"), 
         ...:    ...:         "D": np.array([3] * 4, dtype="int32"), 
         ...:    ...:         "E": pd.Categorical(["test", "train", "test", "train"]), 
         ...:    ...:         "F": "foo", 
         ...:    ...:     }, 
         ...:    ...:     
         ...: )                                                                                                 
    
    In [180]: df2                                                                                               
    Out[180]: 
         A          B    C  D      E    F
    0  1.0 2013-01-02  1.0  3   test  foo
    1  1.0 2013-01-02  1.0  3  train  foo
    2  1.0 2013-01-02  1.0  3   test  foo
    3  1.0 2013-01-02  1.0  3  train  foo
    
    In [181]: df2.dtypes                                                                                        
    Out[181]: 
    A           float64
    B    datetime64[ns]
    C           float32
    D             int32
    E          category
    F            object
    dtype: object
    
    In [182]: 
    

      通过dtypes可以查看相关索引的内容类型。

    查看数据

    通过head, tail,可以查看部分内容,通过index与columns可以查看行索引与列索引的信息

    In [183]: df.head()                                                                                         
    Out[183]: 
                       A         B         C         D
    2013-01-01  0.911951  0.119077  2.244598 -1.524168
    2013-01-02 -0.711591  1.814327  0.859623 -0.249116
    2013-01-03 -0.041417 -1.158472  1.037752  1.124356
    2013-01-04  1.222247 -0.651681  1.764630  0.500507
    2013-01-05 -0.332192 -1.424871 -0.680254  0.370109
    
    In [184]: df.tail(2)                                                                                        
    Out[184]: 
                       A         B         C         D
    2013-01-05 -0.332192 -1.424871 -0.680254  0.370109
    2013-01-06 -0.724522 -1.238991  0.223425 -1.775954
    
    In [185]: df.head(1)                                                                                        
    Out[185]: 
                       A         B         C         D
    2013-01-01  0.911951  0.119077  2.244598 -1.524168
    
    In [186]: df.index                                                                                          
    Out[186]: 
    DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
                   '2013-01-05', '2013-01-06'],
                  dtype='datetime64[ns]', freq='D')
    
    In [187]: df.columns                                                                                        
    Out[187]: Index(['A', 'B', 'C', 'D'], dtype='object')
    
    In [188]:       
    

      通过DataFrame.to_numpy()可以给你一个NumPy的array数据,但有时候这会很耗费电脑,因为pandas是根据每一列一个dtypes的,而NumPy的array是整个数据一个dtype的

    In [194]: df.to_numpy()                                                                                     
    Out[194]: 
    array([[ 0.9119509 ,  0.11907694,  2.24459767, -1.52416844],
           [-0.71159066,  1.81432742,  0.85962346, -0.24911614],
           [-0.0414173 , -1.15847237,  1.03775241,  1.12435552],
           [ 1.22224697, -0.65168145,  1.76462966,  0.50050719],
           [-0.33219183, -1.42487132, -0.68025439,  0.37010889],
           [-0.72452176, -1.23899146,  0.22342519, -1.77595409]])
    
    In [195]: df2.to_numpy()                                                                                    
    Out[195]: 
    array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
           [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
           [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
           [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
          dtype=object)
    

      上面演示了两种to_numpy的方式,其中第二种的花费要多很多。

    describe() 将会展示一些数据的基础信息,比如统计数量,标准差,最小值,最大值.......

    In [202]: df.describe()                                                                                     
    Out[202]: 
                  A         B         C         D
    count  6.000000  6.000000  6.000000  6.000000
    mean   0.054079 -0.423435  0.908296 -0.259045
    std    0.830825  1.229820  1.051730  1.165178
    min   -0.724522 -1.424871 -0.680254 -1.775954
    25%   -0.616741 -1.218862  0.382475 -1.205405
    50%   -0.186805 -0.905077  0.948688  0.060496
    75%    0.673609 -0.073613  1.582910  0.467908
    max    1.222247  1.814327  2.244598  1.124356
    

      通过T可以把行索引与列索引进行转置。

    In [204]: df.T                                                                                              
    Out[204]: 
       2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
    A    0.911951   -0.711591   -0.041417    1.222247   -0.332192   -0.724522
    B    0.119077    1.814327   -1.158472   -0.651681   -1.424871   -1.238991
    C    2.244598    0.859623    1.037752    1.764630   -0.680254    0.223425
    D   -1.524168   -0.249116    1.124356    0.500507    0.370109   -1.775954
    

      通过索引进行排序

    通过sort_index按照索引的数值大小,可以进行排序

    In [207]: df.sort_index(ascending=False)                                                                    
    Out[207]: 
                       A         B         C         D
    2013-01-06 -0.724522 -1.238991  0.223425 -1.775954
    2013-01-05 -0.332192 -1.424871 -0.680254  0.370109
    2013-01-04  1.222247 -0.651681  1.764630  0.500507
    2013-01-03 -0.041417 -1.158472  1.037752  1.124356
    2013-01-02 -0.711591  1.814327  0.859623 -0.249116
    2013-01-01  0.911951  0.119077  2.244598 -1.524168
    
    In [208]: df.sort_index?                                                                                    
    
    In [209]: df.sort_index(axis=1, ascending=False)                                                            
    Out[209]: 
                       D         C         B         A
    2013-01-01 -1.524168  2.244598  0.119077  0.911951
    2013-01-02 -0.249116  0.859623  1.814327 -0.711591
    2013-01-03  1.124356  1.037752 -1.158472 -0.041417
    2013-01-04  0.500507  1.764630 -0.651681  1.222247
    2013-01-05  0.370109 -0.680254 -1.424871 -0.332192
    2013-01-06 -1.775954  0.223425 -1.238991 -0.724522
    

      当然最后还可以通过具体某一列的数值大小进行排序

    In [212]: df.sort_values(by='B')                                                                            
    Out[212]: 
                       A         B         C         D
    2013-01-05 -0.332192 -1.424871 -0.680254  0.370109
    2013-01-06 -0.724522 -1.238991  0.223425 -1.775954
    2013-01-03 -0.041417 -1.158472  1.037752  1.124356
    2013-01-04  1.222247 -0.651681  1.764630  0.500507
    2013-01-01  0.911951  0.119077  2.244598 -1.524168
    2013-01-02 -0.711591  1.814327  0.859623 -0.249116
    

      

    选择

    pandas建议用过.at.iat.loc and .iloc.这4类方法进行取值。

    选择单列数据,返回的是一个Series对象,单个数值的时候返回列数据

    In [226]: df['A']                                                                                           
    Out[226]: 
    2013-01-01    0.911951
    2013-01-02   -0.711591
    2013-01-03   -0.041417
    2013-01-04    1.222247
    2013-01-05   -0.332192
    2013-01-06   -0.724522
    Freq: D, Name: A, dtype: float64
    

      通过切片取值的时候返回多行数据的df对象。[理解为被切片的df对象],切片为取头取尾。

    In [229]: df[0:3]                                                                                           
    Out[229]: 
                       A         B         C         D
    2013-01-01  0.911951  0.119077  2.244598 -1.524168
    2013-01-02 -0.711591  1.814327  0.859623 -0.249116
    2013-01-03 -0.041417 -1.158472  1.037752  1.124356
    
    In [230]: df["20130102":'20130103']                                                                         
    Out[230]: 
                       A         B         C         D
    2013-01-02 -0.711591  1.814327  0.859623 -0.249116
    2013-01-03 -0.041417 -1.158472  1.037752  1.124356
    

      

    通过便签取值,国内书中有说是显式取值。通过loc的方法,注意调用该方法时不用()直接后面加[]就可以.

    In [238]: df.loc[dates[0]]                                                                                  
    Out[238]: 
    A    0.911951
    B    0.119077
    C    2.244598
    D   -1.524168
    Name: 2013-01-01 00:00:00, dtype: float64
    

      

    取多列坐标数据

    In [239]: df.loc[:,['A','B']]                                                                               
    Out[239]: 
                       A         B
    2013-01-01  0.911951  0.119077
    2013-01-02 -0.711591  1.814327
    2013-01-03 -0.041417 -1.158472
    2013-01-04  1.222247 -0.651681
    2013-01-05 -0.332192 -1.424871
    2013-01-06 -0.724522 -1.238991
    

     通过切片选择不同的行与列的信息

    In [240]: df.loc["20130102":"20130104",["A","B"]]                                                           
    Out[240]: 
                       A         B
    2013-01-02 -0.711591  1.814327
    2013-01-03 -0.041417 -1.158472
    2013-01-04  1.222247 -0.651681
    

      只选取一行,选取指定列

    In [241]: df.loc["20130102",["A","B"]]                                                                      
    Out[241]: 
    A   -0.711591
    B    1.814327
    Name: 2013-01-02 00:00:00, dtype: float64
    

     最后介绍两种方式取出单个数值,也就是所谓的标量

    In [242]: df.loc[dates[0],"A"]                                                                              
    Out[242]: 0.91195089904327
    
    In [243]: df.at[dates[0],"A"]                                                                               
    Out[243]: 0.91195089904327
    
    In [244]:  
    

      

    通过位置来取值,也就是所谓的隐式传参 iloc

    跟loc使用差不多,可以传入单值,多值,还有切片

    单值,返回某一行的数据的Series对象

    In [244]: df.iloc[0]                                                                                        
    Out[244]: 
    A    0.911951
    B    0.119077
    C    2.244598
    D   -1.524168
    Name: 2013-01-01 00:00:00, dtype: float64
    
    In [245]: df.iloc[3]                                                                                        
    Out[245]: 
    A    1.222247
    B   -0.651681
    C    1.764630
    D    0.500507
    Name: 2013-01-04 00:00:00, dtype: float64
    

      也可以通过切片,这是跟Python一样,取头不取尾的

    In [246]: df.iloc[3:5,0:2]                                                                                  
    Out[246]: 
                       A         B
    2013-01-04  1.222247 -0.651681
    2013-01-05 -0.332192 -1.424871
    

      还可以传入多个位置参数来取,位置参数之间用逗号分割

    In [247]: df.iloc[[1,2,4],[0,2]]                                                                            
    Out[247]: 
                       A         C
    2013-01-02 -0.711591  0.859623
    2013-01-03 -0.041417  1.037752
    2013-01-05 -0.332192 -0.680254
    

      可以通过单个:来全选数据

    In [248]: df.iloc[1:3]                                                                                      
    Out[248]: 
                       A         B         C         D
    2013-01-02 -0.711591  1.814327  0.859623 -0.249116
    2013-01-03 -0.041417 -1.158472  1.037752  1.124356
    
    In [249]: df.iloc[1:3,:]                                                                                    
    Out[249]: 
                       A         B         C         D
    2013-01-02 -0.711591  1.814327  0.859623 -0.249116
    2013-01-03 -0.041417 -1.158472  1.037752  1.124356
    
    In [250]: df.iloc[:,1:3]                                                                                    
    Out[250]: 
                       B         C
    2013-01-01  0.119077  2.244598
    2013-01-02  1.814327  0.859623
    2013-01-03 -1.158472  1.037752
    2013-01-04 -0.651681  1.764630
    2013-01-05 -1.424871 -0.680254
    2013-01-06 -1.238991  0.223425
    

      也可以通过输入两个数值坐标来取标量。

    In [251]: df.iloc[1,1]                                                                                      
    Out[251]: 1.8143274155708045
    
    In [252]: df.iat[1,1]                                                                                       
    Out[252]: 1.8143274155708045
    

      

    bolllean 索引,有些书称为掩码取值

     使用单一的列数据来选择数据

    In [8]: df[df['A'] > 0]                                                                                                                                                
    Out[8]: 
                       A         B         C         D
    2013-01-02  0.314732  1.187635 -0.290305 -1.091247
    2013-01-03  0.535155  0.453357  0.308055 -0.533740
    2013-01-04  0.688215 -1.474558 -2.561580 -1.211188
    2013-01-06  1.104539 -1.272097  1.423444  0.417609
    

      针对df对象所有对象进行选值

    In [9]: df[df>0]                                                                                                                                                       
    Out[9]: 
                       A         B         C         D
    2013-01-01       NaN  0.954575       NaN  0.070635
    2013-01-02  0.314732  1.187635       NaN       NaN
    2013-01-03  0.535155  0.453357  0.308055       NaN
    2013-01-04  0.688215       NaN       NaN       NaN
    2013-01-05       NaN       NaN  1.522803       NaN
    2013-01-06  1.104539       NaN  1.423444  0.417609
    

      使用isin()来选择数据

    In [14]: df2                                                                                                                                                           
    Out[14]: 
                       A         B         C         D      E
    2013-01-01 -1.629264  0.954575 -0.102982  0.070635    one
    2013-01-02  0.314732  1.187635 -0.290305 -1.091247    one
    2013-01-03  0.535155  0.453357  0.308055 -0.533740    two
    2013-01-04  0.688215 -1.474558 -2.561580 -1.211188  three
    2013-01-05 -1.579950 -0.317184  1.522803 -0.702557   four
    2013-01-06  1.104539 -1.272097  1.423444  0.417609  three
    
    In [15]: df2[df2['E'].isin(['two','four'])]                                                                                                                            
    Out[15]: 
                       A         B         C         D     E
    2013-01-03  0.535155  0.453357  0.308055 -0.533740   two
    2013-01-05 -1.579950 -0.317184  1.522803 -0.702557  four
    

      设定

    设置一个新的列将自动对齐数据的索引。

    In [16]: s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20130102',periods=6))                                                                                       
    
    In [17]: s1                                                                                                                                                            
    Out[17]: 
    2013-01-02    1
    2013-01-03    2
    2013-01-04    3
    2013-01-05    4
    2013-01-06    5
    2013-01-07    6
    Freq: D, dtype: int64
    
    In [18]: df['F'] = s1  
    

      分别通过便签与索引设定值

    In [19]: df.at[dates[0],"A"] = 0                                                                                                                                       
    
    In [20]: df.iat[0,1] = 0                                                                                                                                               
    

      通过NumPy的数组来分配数据

    In [26]: df.loc[:,"D"] = np.array([5] * len(df))                                                                                                                       
    
    In [27]: df                                                                                                                                                            
    Out[27]: 
                       A         B         C  D    F
    2013-01-01  0.000000  0.000000 -0.102982  5  NaN
    2013-01-02  0.314732  1.187635 -0.290305  5  1.0
    2013-01-03  0.535155  0.453357  0.308055  5  2.0
    2013-01-04  0.688215 -1.474558 -2.561580  5  3.0
    2013-01-05 -1.579950 -0.317184  1.522803  5  4.0
    2013-01-06  1.104539 -1.272097  1.423444  5  5.0
    

      操作具体选定的数据

    In [30]: df2                                                                                                                                                           
    Out[30]: 
                       A         B         C  D    F
    2013-01-01  0.000000  0.000000 -0.102982  5  NaN
    2013-01-02  0.314732  1.187635 -0.290305  5  1.0
    2013-01-03  0.535155  0.453357  0.308055  5  2.0
    2013-01-04  0.688215 -1.474558 -2.561580  5  3.0
    2013-01-05 -1.579950 -0.317184  1.522803  5  4.0
    2013-01-06  1.104539 -1.272097  1.423444  5  5.0
    
    In [31]: df2[df2>0] = -df2                                                                                                                                             
    
    In [32]: df2                                                                                                                                                           
    Out[32]: 
                       A         B         C  D    F
    2013-01-01  0.000000  0.000000 -0.102982 -5  NaN
    2013-01-02 -0.314732 -1.187635 -0.290305 -5 -1.0
    2013-01-03 -0.535155 -0.453357 -0.308055 -5 -2.0
    2013-01-04 -0.688215 -1.474558 -2.561580 -5 -3.0
    2013-01-05 -1.579950 -0.317184 -1.522803 -5 -4.0
    2013-01-06 -1.104539 -1.272097 -1.423444 -5 -5.0
    

      缺失数据

    pandas主要使用np.nan来表示缺失值,一般情况下它不参与计算。

    使用Reindexing可以改变指定象限的索引值。

    In [17]: df1 = df.reindex(index=dates[0:4],columns=list(df.columns) + ["E"])                                                                                               
    
    In [18]: df1                                                                                                                                                               
    Out[18]: 
                       A         B         C  D    F   E
    2013-01-01  0.000000  0.000000  0.009247  5  NaN NaN
    2013-01-02  1.012606  0.986936  1.007574  5  1.0 NaN
    2013-01-03 -0.659537 -0.193782 -0.037052  5  2.0 NaN
    2013-01-04  1.957053  1.212591  0.639196  5  3.0 NaN
    
    In [19]: df1.loc[dates[0]:dates[1],'E'] = 1                                                                                                                                
    
    In [20]: df1                                                                                                                                                               
    Out[20]: 
                       A         B         C  D    F    E
    2013-01-01  0.000000  0.000000  0.009247  5  NaN  1.0
    2013-01-02  1.012606  0.986936  1.007574  5  1.0  1.0
    2013-01-03 -0.659537 -0.193782 -0.037052  5  2.0  NaN
    2013-01-04  1.957053  1.212591  0.639196  5  3.0  NaN
    

      放弃任何有nan的行

    In [21]: df1.dropna(how='any')                                                                                                                                             
    Out[21]: 
                       A         B         C  D    F    E
    2013-01-02  1.012606  0.986936  1.007574  5  1.0  1.0
    

      填写缺失的数据

    In [22]: df1.fillna(value=5)                                                                                                                                               
    Out[22]: 
                       A         B         C  D    F    E
    2013-01-01  0.000000  0.000000  0.009247  5  5.0  1.0
    2013-01-02  1.012606  0.986936  1.007574  5  1.0  1.0
    2013-01-03 -0.659537 -0.193782 -0.037052  5  2.0  5.0
    2013-01-04  1.957053  1.212591  0.639196  5  3.0  5.0
    

      通过bool掩码值获取是否是nan的df对象

    In [23]: pd.isna(df1)                                                                                                                                                      
    Out[23]: 
                    A      B      C      D      F      E
    2013-01-01  False  False  False  False   True  False
    2013-01-02  False  False  False  False  False  False
    2013-01-03  False  False  False  False  False   True
    2013-01-04  False  False  False  False  False   True
    

      Operations[操作]

    Stats[操作]

    操作一般不包括缺失的数据,进行描述性统计

    In [12]: df                                                                                                                                                      
    Out[12]: 
                       A         B         C  D    F
    2013-01-01  0.000000  0.000000  0.913359  5  NaN
    2013-01-02 -1.970074  0.380875 -0.648481  5  1.0
    2013-01-03  0.281023  0.861409  0.084303  5  2.0
    2013-01-04 -0.795286  1.961307  1.569313  5  3.0
    2013-01-05  0.974005  0.648449 -1.928545  5  4.0
    2013-01-06  0.000812 -0.274185 -0.469916  5  5.0
    
    In [13]: df.mean()                                                                                                                                               
    Out[13]: 
    A   -0.251587
    B    0.596309
    C   -0.079994
    D    5.000000
    F    3.000000
    dtype: float64
    
    In [14]: df.mean(1)                                                                                                                                              
    Out[14]: 
    2013-01-01    1.478340
    2013-01-02    0.752464
    2013-01-03    1.645347
    2013-01-04    2.147067
    2013-01-05    1.738782
    2013-01-06    1.851342
    Freq: D, dtype: float64
    

      针对不同的象限取平均。

    操作不同维度和需要对齐的对象。pandas将会自动沿指定的维度广播。

    In [27]: s = pd.Series([1,3,5,np.nan,6,8],index=dates).shift(2)                                                                                                  
    
    In [28]: s                                                                                                                                                       
    Out[28]: 
    2013-01-01    NaN
    2013-01-02    NaN
    2013-01-03    1.0
    2013-01-04    3.0
    2013-01-05    5.0
    2013-01-06    NaN
    Freq: D, dtype: float64
    
    In [29]: df.sub?                                                                                                                                                 
    
    In [30]: df.sub(s,axis='index')                                                                                                                                  
    Out[30]: 
                       A         B         C    D    F
    2013-01-01       NaN       NaN       NaN  NaN  NaN
    2013-01-02       NaN       NaN       NaN  NaN  NaN
    2013-01-03 -0.718977 -0.138591 -0.915697  4.0  1.0
    2013-01-04 -3.795286 -1.038693 -1.430687  2.0  0.0
    2013-01-05 -4.025995 -4.351551 -6.928545  0.0 -1.0
    2013-01-06       NaN       NaN       NaN  NaN  NaN
    

      

    Apply

    申请

    Applying functions to the data:

    将函数应用与数据

    In [31]: df                                                                                                                                                      
    Out[31]: 
                       A         B         C  D    F
    2013-01-01  0.000000  0.000000  0.913359  5  NaN
    2013-01-02 -1.970074  0.380875 -0.648481  5  1.0
    2013-01-03  0.281023  0.861409  0.084303  5  2.0
    2013-01-04 -0.795286  1.961307  1.569313  5  3.0
    2013-01-05  0.974005  0.648449 -1.928545  5  4.0
    2013-01-06  0.000812 -0.274185 -0.469916  5  5.0
    
    In [32]: df.apply(np.cumsum)                                                                                                                                     
    Out[32]: 
                       A         B         C   D     F
    2013-01-01  0.000000  0.000000  0.913359   5   NaN
    2013-01-02 -1.970074  0.380875  0.264878  10   1.0
    2013-01-03 -1.689051  1.242284  0.349182  15   3.0
    2013-01-04 -2.484337  3.203590  1.918495  20   6.0
    2013-01-05 -1.510332  3.852039 -0.010051  25  10.0
    2013-01-06 -1.509519  3.577854 -0.479966  30  15.0
    
    In [33]: df.apply?                                                                                                                                               
    
    In [34]: df.apply(lambda x: x.max() - x.min())                                                                                                                   
    Out[34]: 
    A    2.944080
    B    2.235492
    C    3.497859
    D    0.000000
    F    4.000000
    dtype: float64
    
    In [35]:       
    

      默认对列数据进行操作,添加一个函数。

    Histogramming

    柱状图

    See more at Histogramming and Discretization.

    In [35]: s = pd.Series(np.random.randint(0,7,size=10))                                                                                                           
    
    In [36]: s                                                                                                                                                       
    Out[36]: 
    0    1
    1    0
    2    1
    3    3
    4    4
    5    0
    6    0
    7    5
    8    0
    9    1
    dtype: int64
    
    In [37]: s.value_counts()                                                                                                                                        
    Out[37]: 
    0    4
    1    3
    5    1
    4    1
    3    1
    dtype: int64
    
    In [38]: s.value_counts().__class__                                                                                                                              
    Out[38]: pandas.core.series.Series
    
    In [39]: s.value_counts().index                                                                                                                                  
    Out[39]: Int64Index([0, 1, 5, 4, 3], dtype='int64')
    

      前面的索引时出现的数字,后面是出现的次数。

    String Methods

    字符串方法

    Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses regular expressions by default (and in some cases always uses them). See more at Vectorized String Methods.

    Series在str属性中配备了一组字符串处理方法,这些方法可以轻松地对数组的每个元素进行操作,

    In [40]: s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])                                                                            
    
    In [41]: s.str.lower()                                                                                                                                           
    Out[41]: 
    0       a
    1       b
    2       c
    3    aaba
    4    baca
    5     NaN
    6    caba
    7     dog
    8     cat
    dtype: object
    
    In [42]:      
    

    Merge

    合并

    Concat

    pandas provides various facilities for easily combining together Series and DataFrame objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.

    See the Merging section.

    Concatenating pandas objects together with concat():

    用concat连接pandas对象

    In [6]: df = pd.DataFrame(np.random.randn(10, 4),index=[chr(i) for i in range(65,75)])                                                          
    
    In [7]: df                                                                                                                                      
    Out[7]: 
              0         1         2         3
    A -0.352374  1.278078  0.543435  0.085902
    B -0.572048 -0.053377  1.278151 -0.143507
    C  0.058471 -0.678905 -0.928697 -0.275203
    D -0.338097  2.616637 -1.541948  0.127263
    E -0.460966 -1.571678 -1.958119 -0.533291
    F -0.832041 -0.103917  1.635657  0.887229
    G -0.171535 -2.181684 -1.682033  0.462375
    H -1.001720  0.925746 -0.205054 -1.493562
    I -0.213802 -1.214741 -0.864377  0.735728
    J -0.375842  0.406692  0.207853  0.626410
    
    In [8]: prices = [df[:3],df[3:7],df[7:]]                                                                                                        
    
    In [9]: prices                                                                                                                                  
    Out[9]: 
    [          0         1         2         3
     A -0.352374  1.278078  0.543435  0.085902
     B -0.572048 -0.053377  1.278151 -0.143507
     C  0.058471 -0.678905 -0.928697 -0.275203,
               0         1         2         3
     D -0.338097  2.616637 -1.541948  0.127263
     E -0.460966 -1.571678 -1.958119 -0.533291
     F -0.832041 -0.103917  1.635657  0.887229
     G -0.171535 -2.181684 -1.682033  0.462375,
               0         1         2         3
     H -1.001720  0.925746 -0.205054 -1.493562
     I -0.213802 -1.214741 -0.864377  0.735728
     J -0.375842  0.406692  0.207853  0.626410]
    
    In [10]: pd.concat(prices)                                                                                                                      
    Out[10]: 
              0         1         2         3
    A -0.352374  1.278078  0.543435  0.085902
    B -0.572048 -0.053377  1.278151 -0.143507
    C  0.058471 -0.678905 -0.928697 -0.275203
    D -0.338097  2.616637 -1.541948  0.127263
    E -0.460966 -1.571678 -1.958119 -0.533291
    F -0.832041 -0.103917  1.635657  0.887229
    G -0.171535 -2.181684 -1.682033  0.462375
    H -1.001720  0.925746 -0.205054 -1.493562
    I -0.213802 -1.214741 -0.864377  0.735728
    J -0.375842  0.406692  0.207853  0.626410
    

      

    Join

    SQL style merges. See the Database style joining section.

    类似与SQL样式的合并

    In [21]: left                                                                                                                                   
    Out[21]: 
       key  lval
    0  foo     1
    1  foo     2
    
    In [22]: right                                                                                                                                  
    Out[22]: 
       key  rval
    0  foo     4
    1  foo     5
    
    In [23]: pd.merge(left, right, on="key")                                                                                                        
    Out[23]: 
       key  lval  rval
    0  foo     1     4
    1  foo     1     5
    2  foo     2     4
    3  foo     2     5
    

      类似与sql里面的join的内联

    另一种形式的样式

    In [24]: left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})                                                                           
    
    In [25]: right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]}) 
        ...:                                                                                                                                        
    
    In [26]: pd.merge(left, right, on="key")                                                                                                        
    Out[26]: 
       key  lval  rval
    0  foo     1     4
    1  bar     2     5
    
    In [27]: left                                                                                                                                   
    Out[27]: 
       key  lval
    0  foo     1
    1  bar     2
    
    In [28]: right                                                                                                                                  
    Out[28]: 
       key  rval
    0  foo     4
    1  bar     5
    

      我的理解与测试,需要两个相同的列名,取两个列名中的交集。

    Grouping

    By “group by” we are referring to a process involving one or more of the following steps:

    • Splitting the data into groups based on some criteria

    • 根据某些条件将数据分成不同的组
    • Applying a function to each group independently

    • 将功能独立应用到每一个组
    • Combining the results into a data structure

    • 将结果合并成一个数据结构

    See the Grouping section.

    In [33]: df = pd.DataFrame( 
        ...:    ....:     { 
        ...:    ....:         "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"], 
        ...:    ....:         "B": ["one", "one", "two", "three", "two", "two", "one", "three"], 
        ...:    ....:         "C": np.random.randn(8), 
        ...:    ....:         "D": np.random.randn(8), 
        ...:    ....:     } 
        ...:    ....: )                                                                                                                             
    
    In [34]: df                                                                                                                                     
    Out[34]: 
         A      B         C         D
    0  foo    one  0.964528 -0.464514
    1  bar    one  2.223520  1.743989
    2  foo    two -0.311332  0.242599
    3  bar  three -1.119041 -1.223514
    4  foo    two -0.879870 -1.140529
    5  bar    two -1.158119 -0.577413
    6  foo    one  0.349084 -0.770265
    7  foo  three -0.063142 -2.306210
    
    In [35]: df.groupby('A')                                                                                                                        
    Out[35]: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f87e4b64d60>
    
    In [36]: df.groupby('A').count()                                                                                                                
    Out[36]: 
         B  C  D
    A           
    bar  3  3  3
    foo  5  5  5
    
    In [37]: df.groupby('A').sum()                                                                                                                  
    Out[37]: 
                C         D
    A                      
    bar -0.053639 -0.056937
    foo  0.059268 -4.438920
    

      Grouping by multiple columns forms a hierarchical index, and again we can apply the sum() function.

    In [38]: df.groupby(["A", "B"]).sum()                                                                                                           
    Out[38]: 
                      C         D
    A   B                        
    bar one    2.223520  1.743989
        three -1.119041 -1.223514
        two   -1.158119 -0.577413
    foo one    1.313612 -1.234779
        three -0.063142 -2.306210
        two   -1.191203 -0.897931
    

      

    Reshaping

    See the sections on Hierarchical Indexing and Reshaping.

    tuples                                                                                                                                 
    Out[49]: 
    [('bar', 'one'),
     ('bar', 'two'),
     ('baz', 'one'),
     ('baz', 'two'),
     ('foo', 'one'),
     ('foo', 'two'),
     ('qux', 'one'),
     ('qux', 'two')]
    
    In [50]: index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])                                                                   
    
    In [52]: df = pd.DataFrame(np.random.randn(8,2),index=index,columns=['A','B'])                                                                  
    
    In [53]: df                                                                                                                                     
    Out[53]: 
                         A         B
    first second                    
    bar   one     1.468431  1.239477
          two     1.134492  0.163859
    baz   one    -0.646760 -1.188458
          two     1.389973 -1.437824
    foo   one     0.438599  0.053046
          two     0.588275  0.074167
    qux   one    -0.197791 -1.983309
          two     0.725947  0.305324
    
    In [54]: df = df[:4]                                                                                                                            
    
    In [55]: df                                                                                                                                     
    Out[55]: 
                         A         B
    first second                    
    bar   one     1.468431  1.239477
          two     1.134492  0.163859
    baz   one    -0.646760 -1.188458
          two     1.389973 -1.437824
    

      建了一份数据,用了联合索引,取了前面4条

    The stack() method “compresses” a level in the DataFrame’s columns.

    我的理解是stack是将列的索引添加到行的索引进行叠加,使行索引变成多层索引[多一层索引],多列数据变成单列数据[少一层索引],unstack就使将行索引转换到列索引了。

    In [76]: stacked = df.stack()                                                                                                                   
    
    In [77]: stacked                                                                                                                                
    Out[77]: 
    first  second   
    bar    one     A    1.468431
                   B    1.239477
           two     A    1.134492
                   B    0.163859
    baz    one     A   -0.646760
                   B   -1.188458
           two     A    1.389973
                   B   -1.437824
    dtype: float64
    
    In [78]: stacked.index                                                                                                                          
    Out[78]: 
    MultiIndex([('bar', 'one', 'A'),
                ('bar', 'one', 'B'),
                ('bar', 'two', 'A'),
                ('bar', 'two', 'B'),
                ('baz', 'one', 'A'),
                ('baz', 'one', 'B'),
                ('baz', 'two', 'A'),
                ('baz', 'two', 'B')],
               names=['first', 'second', None])
    
    In [79]: stacked.__class__                                                                                                                      
    Out[79]: pandas.core.series.Series
    

      从输出可以看到,stack的方法,将df从df对象,转换成Series对象,本来两层的index,变成了三层

    With a “stacked” DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is unstack(), which by default unstacks the last level:

    In [80]: stacked                                                                                                                                
    Out[80]: 
    first  second   
    bar    one     A    1.468431
                   B    1.239477
           two     A    1.134492
                   B    0.163859
    baz    one     A   -0.646760
                   B   -1.188458
           two     A    1.389973
                   B   -1.437824
    dtype: float64
    
    In [81]: stacked.unstack()                                                                                                                      
    Out[81]: 
                         A         B
    first second                    
    bar   one     1.468431  1.239477
          two     1.134492  0.163859
    baz   one    -0.646760 -1.188458
          two     1.389973 -1.437824
    
    In [82]: stacked.unstack().__class__                                                                                                            
    Out[82]: pandas.core.frame.DataFrame
    
    In [83]: stacked.unstack(0)                                                                                                                     
    Out[83]: 
    first          bar       baz
    second                      
    one    A  1.468431 -0.646760
           B  1.239477 -1.188458
    two    A  1.134492  1.389973
           B  0.163859 -1.437824
    

      有了stack的基础,unstack就更加好理解了,默认转移的是多层index的最后一层,但也可以指定多层index的索引。

    Pivot tables

    See the section on Pivot Tables.

     数据透视表,根据我自身的理解,感觉就是在原有的数据上面,挑选合适的列选为index,合适的列的column,合适的列为值。
    In [84]: df = pd.DataFrame( 
        ...:    .....:     { 
        ...:    .....:         "A": ["one", "one", "two", "three"] * 3, 
        ...:    .....:         "B": ["A", "B", "C"] * 4, 
        ...:    .....:         "C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 2, 
        ...:    .....:         "D": np.random.randn(12), 
        ...:    .....:         "E": np.random.randn(12), 
        ...:    .....:     } 
        ...:    .....: ) 
        ...:    .....:                                                                                                                              
    
    In [85]: pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])                                                                        
    Out[85]: 
    C             bar       foo
    A     B                    
    one   A -1.549085  0.117403
          B -0.343714 -0.231218
          C -0.890250 -1.735391
    three A -0.590521       NaN
          B       NaN -2.155207
          C -0.771405       NaN
    two   A       NaN  1.263717
          B  0.061399       NaN
          C       NaN  0.912017
    
    In [86]: pd.pivot_table(df, values="E", index=["A", "B"], columns=["C"])                                                                        
    Out[86]: 
    C             bar       foo
    A     B                    
    one   A -0.908049  0.088130
          B -0.661966  1.091079
          C -1.096862  0.875454
    three A -0.425327       NaN
          B       NaN  0.474478
          C -0.928048       NaN
    two   A       NaN -0.663361
          B -1.199722       NaN
          C       NaN -0.270919
    
    In [87]: pd.pivot_table(df, values=["E",'D'], index=["A", "B"], columns=["C"])                                                                  
    Out[87]: 
                    D                   E          
    C             bar       foo       bar       foo
    A     B                                        
    one   A -1.549085  0.117403 -0.908049  0.088130
          B -0.343714 -0.231218 -0.661966  1.091079
          C -0.890250 -1.735391 -1.096862  0.875454
    three A -0.590521       NaN -0.425327       NaN
          B       NaN -2.155207       NaN  0.474478
          C -0.771405       NaN -0.928048       NaN
    two   A       NaN  1.263717       NaN -0.663361
          B  0.061399       NaN -1.199722       NaN
          C       NaN  0.912017       NaN -0.270919
    

      

    Time series

    pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications. See the Time Series section.

      

  • 相关阅读:

    《银行》
    动画
    定位
    浮动
    《css的总结》
    Java的Debug调试
    Eclipse "Unable to install breakpoint due to missing line number attributes..."
    解决 Tomcat Server in Eclipse unable to start within 45 seconds 不能启动的问题
    Java下载文件(流的形式)
  • 原文地址:https://www.cnblogs.com/sidianok/p/14368384.html
Copyright © 2020-2023  润新知