• Pandas数据规整学习笔记1


    清理、转换、合并、重塑

    1 合并数据集

    pandas有一些内置的合并数据集方法:

    • pandas.merge:根据一个或多个键将多个DataFrame连接起来,类似数据库连接;
    • pandas.concat:可以沿着一个轴将多个对象堆叠起来;
    • 实例方法combine_first可以将重复数据编制在一起,用以填充另一个对象的缺失值。

    1.1 数据库风格的DataFrame合并

    数据库合并(merge)或连接(join)运行将一个或多个键将行链接起来。来个例子:

    In [1]: import pandas as pd
    
    In [2]: from pandas import Series,DataFrame
    
    In [3]: df1 = DataFrame({'key':list('bbacaab'),'data1':range(7)})
    
    In [4]: df2 = DataFrame({'key':list('abd'),'data2':range(3)})
    
    In [5]: df1
    Out[5]:
       data1 key
    0      0   b
    1      1   b
    2      2   a
    3      3   c
    4      4   a
    5      5   a
    6      6   b
    
    In [6]: df2
    Out[6]:
       data2 key
    0      0   a
    1      1   b
    2      2   d
    
    # merge 合并数据集,相同列的共同值。
    In [7]: pd.merge(df1, df2)    #不指定列名,默认会选择列名相同的key列。
    Out[7]:
       data1 key  data2
    0      0   b      1
    1      1   b      1
    2      6   b      1
    3      2   a      0
    4      4   a      0
    5      5   a      0
    
    In [8]: pd.merge(df1, df2, on='key') # 指定列名
    Out[8]:
       data1 key  data2
    0      0   b      1
    1      1   b      1
    2      6   b      1
    3      2   a      0
    4      4   a      0
    5      5   a      0
    
    # 可以将不同列名进行链接。
    
    In [9]: df3 = DataFrame({'lkey':list('bbacaab'),'data1':range(7)})
    
    In [10]: df4 = DataFrame({'rkey':list('abd'),'data2':range(3)})
    
    In [11]: pd.merge(df3,df4,left_on='lkey',right_on='rkey')
    Out[11]:
       data1 lkey  data2 rkey
    0      0    b      1    b
    1      1    b      1    b
    2      6    b      1    b
    3      2    a      0    a
    4      4    a      0    a
    5      5    a      0    a
    
    # 默认inner,通过how可以指定outer
    In [12]: pd.merge(df1,df2,how='outer')
    Out[12]:
       data1 key  data2
    0    0.0   b    1.0
    1    1.0   b    1.0
    2    6.0   b    1.0
    3    2.0   a    0.0
    4    4.0   a    0.0
    5    5.0   a    0.0
    6    3.0   c    NaN
    7    NaN   d    2.0
    
    In [13]: pd.merge(df1,df2,how='left')
    Out[13]:
       data1 key  data2
    0      0   b    1.0
    1      1   b    1.0
    2      2   a    0.0
    3      3   c    NaN
    4      4   a    0.0
    5      5   a    0.0
    6      6   b    1.0
    
    In [14]: pd.merge(df1,df2,how='right',on='key')
    Out[14]:
       data1 key  data2
    0    0.0   b      1
    1    1.0   b      1
    2    6.0   b      1
    3    2.0   a      0
    4    4.0   a      0
    5    5.0   a      0
    6    NaN   d      2
    
    # 多个键 合并
    
    In [15]: left = DataFrame({'key1':['foo','foo','bar'],'key2':['one','two','one']
        ...: ,'lval':[1,2,3]})
    
    In [16]: right = DataFrame({'key1':['foo','foo','bar','bar'],'key2':['one','one'
        ...: ,'one','two'],'rval':[4,5,6,7]})
    
    In [17]: pd.merge(left,right,on=['key1','key2'],how='outer')
    Out[17]:
      key1 key2  lval  rval
    0  foo  one   1.0   4.0
    1  foo  one   1.0   5.0
    2  foo  two   2.0   NaN
    3  bar  one   3.0   6.0
    4  bar  two   NaN   7.0
    
    # 如果使用的on选项后,还有重复列名,怎么办?
    # merge有一个suffixes选项,可以处理重复列名
    
    In [18]: pd.merge(left,right,on='key1')
    Out[18]:
      key1 key2_x  lval key2_y  rval
    0  foo    one     1    one     4
    1  foo    one     1    one     5
    2  foo    two     2    one     4
    3  foo    two     2    one     5
    4  bar    one     3    one     6
    5  bar    one     3    two     7
    
    In [19]: pd.merge(left,right,on='key1',suffixes=('_left','_right'))
    Out[19]:
      key1 key2_left  lval key2_right  rval
    0  foo       one     1        one     4
    1  foo       one     1        one     5
    2  foo       two     2        one     4
    3  foo       two     2        one     5
    4  bar       one     3        one     6
    5  bar       one     3        two     7
    
    

    merge选项:

    1.2 索引上的合并

    传入left_index=True或者right_index=True,可以将索引作为链接键使用。

    In [20]: left1 = DataFrame({'key':list('abaabc'),'value':range(6)})
    
    In [21]: left2 = DataFrame({'group_val':[3.5,7]},index=['a','b'])
    
    In [22]: left1
    Out[22]:
      key  value
    0   a      0
    1   b      1
    2   a      2
    3   a      3
    4   b      4
    5   c      5
    
    In [23]: left2
    Out[23]:
       group_val
    a        3.5
    b        7.0
    
    # 将右DataFrame的index作为链接键;
    In [24]: pd.merge(left1, left2 ,left_on='key', right_index=True)
    Out[24]:
      key  value  group_val
    0   a      0        3.5
    2   a      2        3.5
    3   a      3        3.5
    1   b      1        7.0
    4   b      4        7.0
    
    In [25]: pd.merge(left1, left2 ,left_on='key', right_index=True, how='outer')
    Out[25]:
      key  value  group_val
    0   a      0        3.5
    2   a      2        3.5
    3   a      3        3.5
    1   b      1        7.0
    4   b      4        7.0
    5   c      5        NaN
    
    # 对于层次化索引,比较复杂
    In [28]: lefth = DataFrame({'key1':['Ohio','Ohio','Ohio','Nevada','Nevada'],'key
        ...: 2':[2000,2001,2002,2001,2002],'data':np.arange(5.)})
    
    In [29]: righth = DataFrame(np.arange(12).reshape(6,2),index=[['Nevada','Nevada'
        ...: ,'Ohio','Ohio','Ohio','Ohio'],[2001,2000,2000,2000,2001,2002]],columns=
        ...: ['event1','event2'])
    
    In [30]: lefth
    Out[30]:
       data    key1  key2
    0   0.0    Ohio  2000
    1   1.0    Ohio  2001
    2   2.0    Ohio  2002
    3   3.0  Nevada  2001
    4   4.0  Nevada  2002
    
    In [31]: righth
    Out[31]:
                 event1  event2
    Nevada 2001       0       1
           2000       2       3
    Ohio   2000       4       5
           2000       6       7
           2001       8       9
           2002      10      11
    
    # 使用列表的形式,指明多个列作为合并键
    In [32]: pd.merge(lefth, righth, left_on=['key1','key2'],right_index=True)
    Out[32]:
       data    key1  key2  event1  event2
    0   0.0    Ohio  2000       4       5
    0   0.0    Ohio  2000       6       7
    1   1.0    Ohio  2001       8       9
    2   2.0    Ohio  2002      10      11
    3   3.0  Nevada  2001       0       1
    
    In [33]: pd.merge(lefth, righth, left_on=['key1','key2'],right_index=True, how='
        ...: outer')
    Out[33]:
       data    key1    key2  event1  event2
    0   0.0    Ohio  2000.0     4.0     5.0
    0   0.0    Ohio  2000.0     6.0     7.0
    1   1.0    Ohio  2001.0     8.0     9.0
    2   2.0    Ohio  2002.0    10.0    11.0
    3   3.0  Nevada  2001.0     0.0     1.0
    4   4.0  Nevada  2002.0     NaN     NaN
    4   NaN  Nevada  2000.0     2.0     3.0
    
    In [34]: left2 = DataFrame([[1,2],[3,4],[5,6]],index=['a','c','e'],columns=['Ohi
        ...: o','Nevada'])
    
    In [35]: right2 = DataFrame([[7,8],[9,10],[11,12],[13,14]],index=list('bcde'),co
        ...: lumns=['Missouri','Alabama'])
    
    In [36]: left2
    Out[36]:
       Ohio  Nevada
    a     1       2
    c     3       4
    e     5       6
    
    In [37]: right2
    Out[37]:
       Missouri  Alabama
    b         7        8
    c         9       10
    d        11       12
    e        13       14
    
    # 同时根据两边索引合并
    In [38]: pd.merge(left2, right2, how='outer', left_index=True, right_index=True)
        ...:
    Out[38]:
       Ohio  Nevada  Missouri  Alabama
    a   1.0     2.0       NaN      NaN
    b   NaN     NaN       7.0      8.0
    c   3.0     4.0       9.0     10.0
    d   NaN     NaN      11.0     12.0
    e   5.0     6.0      13.0     14.0
    

    DataFrame还有join方法,更好的合并索引。

    In [39]: left2.join(right2, how='outer')
    Out[39]:
       Ohio  Nevada  Missouri  Alabama
    a   1.0     2.0       NaN      NaN
    b   NaN     NaN       7.0      8.0
    c   3.0     4.0       9.0     10.0
    d   NaN     NaN      11.0     12.0
    e   5.0     6.0      13.0     14.0
    In [44]: right1
    Out[44]:
       group_val
    a        3.5
    b        7.0
    
    In [45]: left1
    Out[45]:
      key  value
    0   a      0
    1   b      1
    2   a      2
    3   a      3
    4   b      4
    5   c      5
    
    # left1 使用key来合并,right1使用索引。
    In [46]: left1.join(right1, on='key')
    Out[46]:
      key  value  group_val
    0   a      0        3.5
    1   b      1        7.0
    2   a      2        3.5
    3   a      3        3.5
    4   b      4        7.0
    5   c      5        NaN
    
    # 对于简单的索引合并,可以在join传入一个DataFrame组。
    
    In [47]: another = DataFrame([[7,8],[9,10],[11,12],[16,17]],index=list('acef'),c
        ...: olumns=['NewYork','Oregon'])
    
    In [48]: left2.join([right2, another])  # 传入list
    Out[48]:
       Ohio  Nevada  Missouri  Alabama  NewYork  Oregon
    a     1       2       NaN      NaN        7       8
    c     3       4       9.0     10.0        9      10
    e     5       6      13.0     14.0       11      12
    In [50]: left2.join([right2, another],how='outer')
    Out[50]:
       Ohio  Nevada  Missouri  Alabama  NewYork  Oregon
    a   1.0     2.0       NaN      NaN      7.0     8.0
    b   NaN     NaN       7.0      8.0      NaN     NaN
    c   3.0     4.0       9.0     10.0      9.0    10.0
    d   NaN     NaN      11.0     12.0      NaN     NaN
    e   5.0     6.0      13.0     14.0     11.0    12.0
    f   NaN     NaN       NaN      NaN     16.0    17.0
    

    concat函数也能实现类似功能,后面介绍。

    1.3 轴向连接

    这是另一种连接,也叫作:连接、绑定或堆叠。

    # numpy的函数:concatenation
    In [51]: arr = np.arange(12).reshape(3,4)
    
    In [52]: np.concatenate([arr,arr],axis=1)
    Out[52]:
    array([[ 0,  1,  2,  3,  0,  1,  2,  3],
           [ 4,  5,  6,  7,  4,  5,  6,  7],
           [ 8,  9, 10, 11,  8,  9, 10, 11]])
    
    In [53]: np.concatenate([arr,arr],axis=0)
    Out[53]:
    array([[ 0,  1,  2,  3],
           [ 4,  5,  6,  7],
           [ 8,  9, 10, 11],
           [ 0,  1,  2,  3],
           [ 4,  5,  6,  7],
           [ 8,  9, 10, 11]])
    
    

    对于pandas也能推广,

    In [54]: s1 = Series([0,1],index=list('ab'))
    
    In [55]: s2 = Series([2,3,4],index=list('cde'))
    
    In [56]: s3 = Series([5,6],index=list('fg'))
    
    In [57]: pd.concat([s1,s2,s3])
    Out[57]:
    a    0
    b    1
    c    2
    d    3
    e    4
    f    5
    g    6
    dtype: int64
    
    # 做一个DataFrame出来。
    In [58]: pd.concat([s1,s2,s3],axis=1)
    Out[58]:
         0    1    2
    a  0.0  NaN  NaN
    b  1.0  NaN  NaN
    c  NaN  2.0  NaN
    d  NaN  3.0  NaN
    e  NaN  4.0  NaN
    f  NaN  NaN  5.0
    g  NaN  NaN  6.0
    
    
    # 使用inner选项,求并集
    In [59]: s4 = pd.concat([s1*5,s3])
    
    In [60]: s4
    Out[60]:
    a    0
    b    5
    f    5
    g    6
    dtype: int64
    
    In [61]: pd.concat([s1,s4],axis=1)
    Out[61]:
         0  1
    a  0.0  0
    b  1.0  5
    f  NaN  5
    g  NaN  6
    
    In [62]: pd.concat([s1,s4],axis=1,join='inner')
    Out[62]:
       0  1
    a  0  0
    b  1  5
    
    
    # 指定求那些索引的并集
    In [63]: pd.concat([s1,s4],axis=1,join_axes=[list('acbe')])
    Out[63]:
         0    1
    a  0.0  0.0
    c  NaN  NaN
    b  1.0  5.0
    e  NaN  NaN
    
    In [64]: pd.concat([s1,s4],axis=1,join_axes=[list('acbef')])
    Out[64]:
         0    1
    a  0.0  0.0
    c  NaN  NaN
    b  1.0  5.0
    e  NaN  NaN
    f  NaN  5.0
    
    # 用keys创建一个自己的层次化索引
    In [65]: result = pd.concat([s1,s2,s3],keys=['one','two','three'])
    
    In [66]: result
    Out[66]:
    one    a    0
           b    1
    two    c    2
           d    3
           e    4
    three  f    5
           g    6
    dtype: int64
    
    # 在axis=1轴上连接的话,keys就成了列
    In [70]: result = pd.concat([s1,s2,s3],keys=['one','two','three'],axis=1)
    
    In [71]: result
    Out[71]:
       one  two  three
    a  0.0  NaN    NaN
    b  1.0  NaN    NaN
    c  NaN  2.0    NaN
    d  NaN  3.0    NaN
    e  NaN  4.0    NaN
    f  NaN  NaN    5.0
    g  NaN  NaN    6.0
    

    同样的逻辑,对DataFrame也是一样,

    In [72]: df1 = DataFrame(np.arange(6).reshape(3,2),index=list('abc'),columns=['o
        ...: ne','two'])
    
    In [73]: df2 = DataFrame(np.arange(4).reshape(2,2),index=list('ac'),columns=['th
        ...: ree','four'])
    
    In [74]: pd.concat([df1,df2],axis=1,keys=['level1','level2'])
    Out[74]:
      level1     level2     
         one two  three four
    a      0   1    0.0  1.0
    b      2   3    NaN  NaN
    c      4   5    2.0  3.0
    
    In [75]: pd.concat([df1,df2],axis=1)
    Out[75]:
       one  two  three  four
    a    0    1    0.0   1.0
    b    2    3    NaN   NaN
    c    4    5    2.0   3.0
    
    In [76]: pd.concat([df1,df2])
    Out[76]:
       four  one  three  two
    a   NaN  0.0    NaN  1.0
    b   NaN  2.0    NaN  3.0
    c   NaN  4.0    NaN  5.0
    a   1.0  NaN    0.0  NaN
    c   3.0  NaN    2.0  NaN
    
    
    # 还可以传入字典
    In [77]: pd.concat({'level1':df1,'level2':df2},axis=1)
    Out[77]:
      level1     level2     
         one two  three four
    a      0   1    0.0  1.0
    b      2   3    NaN  NaN
    c      4   5    2.0  3.0
    
    # 没用的索引怎么处理?
    In [78]: df3 = DataFrame(np.random.randn(3,4),columns=list('abcd'))
    
    In [79]: df4 = DataFrame(np.random.randn(2,3),columns=list('bda'))
    
      ## 产生新的一列索引,顺序从0开始
    In [80]: pd.concat([df3,df4],ignore_index=True)
    Out[80]:
              a         b         c         d
    0  0.322805 -1.495024  0.971860 -0.904791
    1  0.332766 -2.320841 -0.903456  0.026713
    2 -0.163670  0.387144 -0.746927 -0.360291
    3 -0.074411 -0.557447       NaN  0.704929
    4  2.673221 -1.143143       NaN  0.592930
    

    1.4 合并重叠数据

    合并重叠数据不能使用merge和concat来实现,使用np.where函数。

    In [81]: a = Series([np.nan,2.5,np.nan,3.5,4.5,np.nan],index=list('fedcba'))
    
    In [85]: b = Series([0,1,2,3,4,np.nan],index=list('fedcba'))
    
    In [86]: a
    Out[86]:
    f    NaN
    e    2.5
    d    NaN
    c    3.5
    b    4.5
    a    NaN
    dtype: float64
    
    In [87]: b
    Out[87]:
    f    0.0
    e    1.0
    d    2.0
    c    3.0
    b    4.0
    a    NaN
    dtype: float64
    
    # 实现如果a为空,就是用b
    In [88]: np.where(pd.isnull(a), b, a)
    Out[88]: array([ 0. ,  2.5,  2. ,  3.5,  4.5,  nan])
    
    # 相同的目标
    In [95]: a.combine_first(b)
    Out[95]:
    f    0.0
    e    2.5
    d    2.0
    c    3.5
    b    4.5
    a    NaN
    dtype: float64
    
    

    针对DataFrame:

    In [96]: df1 = DataFrame({'a':[1,np.nan,5,np.nan],'b':[np.nan,2,np.nan,6],'c':ra
        ...: nge(2,18,4)})
    
    In [97]: df2 = DataFrame({'a':[5,4,np.nan,3,7],'b':[np.nan,3,4,6,8]})
    
    In [98]: df1.combine_first(df2)
    Out[98]:
         a    b     c
    0  1.0  NaN   2.0
    1  4.0  2.0   6.0
    2  5.0  4.0  10.0
    3  3.0  6.0  14.0
    4  7.0  8.0   NaN
    

    待续。。。

  • 相关阅读:
    kubernetes安装记录
    nginx ssl 客户端认证配置调研
    centos 6 docker 安装(通过yum)
    elk docker 部署实战
    react 技术栈项目轻量化方案调研
    service worker 实战
    记一次 node-fetch 使用时踩的坑
    那些年使用npm进行依赖管理所踩的坑
    ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2)
    linux 环境变量设置错误导致 command not found
  • 原文地址:https://www.cnblogs.com/felo/p/6367485.html
Copyright © 2020-2023  润新知