• Merging datasets in Python with merge and concat


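    Data wrangling: merging, cleaning, filtering

    pandas and the Python standard library provide a set of high-level, flexible, and efficient core functions and algorithms for wrangling data into the form you need.

    This post covers:

    Merging datasets: pd.merge(), pd.concat(), and related methods, which resemble the join operations of SQL and other relational databases.

    Merging datasets

    1) The merge function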

    merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
    

      

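      Parameter        Description

      left             Left DataFrame to merge

      right            Right DataFrame to merge

      how              Join type: 'inner' (default, intersection of keys), 'outer' (union of keys), 'left' (all left rows, matching right rows), or 'right' (all right rows, matching left rows)

      on               Column name(s) to join on; must be present in both DataFrames. If not specified and no other key arguments are given, the column names common to left and right are used as the join keys

      left_on          Column(s) in the left DataFrame to use as join keys

      right_on         Column(s) in the right DataFrame to use as join keys

      left_index       Use the left DataFrame's row index as its join key

      right_index      Use the right DataFrame's row index as its join key

      sort             Sort the merged data by the join keys; defaults to False, as in the signature above. Leaving it off can give better performance on large datasets

      suffixes         Tuple of strings appended to overlapping column names; defaults to ('_x', '_y'). For example, if both DataFrames have a 'data' column, the result contains 'data_x' and 'data_y'

      copy             If False, avoids copying data into the result in some special cases. By default the data is always copied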
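
    To make the how and suffixes rows of the table concrete, here is a minimal sketch. The frames left and right and their "key" and "data" columns are made up for illustration and are not part of the original post.

```python
import pandas as pd

# Two small hypothetical frames that share a "key" column
# and both carry a non-key column named "data".
left = pd.DataFrame({"key": ["a", "b", "c"], "data": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "data": [20, 30, 40]})

# how='inner' (default): only keys present on both sides
inner = pd.merge(left, right, on="key")

# how='outer': every key from either side; gaps become NaN
outer = pd.merge(left, right, on="key", how="outer")

# how='left' / how='right': keep every row of one side
left_join = pd.merge(left, right, on="key", how="left")
right_join = pd.merge(left, right, on="key", how="right")

# The overlapping "data" column receives the default suffixes
print(inner.columns.tolist())

# Custom suffixes make the origin of each column easier to read
custom = pd.merge(left, right, on="key", suffixes=("_left", "_right"))
print(custom.columns.tolist())
```

    With these inputs, the inner join keeps only the keys b and c, while the outer join keeps all four keys.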

    Example:

    import numpy as np
    import pandas as pd

    # Sample table keyed by "id"; "price" contains missing values
    df = pd.DataFrame({"id":[1001,1002,1003,1004,1005,1006], 
     "date":pd.date_range('20130102', periods=6),
      "city":['Beijing ', 'SH', 'guangzhou ', 'Shenzhen', 'shanghai', 'BEIJING '],
     "age":[23,44,54,32,34,32],
     "category":['100-A','100-B','110-A','110-C','210-A','130-F'],
      "price":[1200,np.nan,2133,5433,np.nan,4432]},
      columns =['id','date','city','category','age','price'])
    
    df
    Out[46]: 
         id       date        city category  age   price
    0  1001 2013-01-02    Beijing     100-A   23  1200.0
    1  1002 2013-01-03          SH    100-B   44     NaN
    2  1003 2013-01-04  guangzhou     110-A   54  2133.0
    3  1004 2013-01-05    Shenzhen    110-C   32  5433.0
    4  1005 2013-01-06    shanghai    210-A   34     NaN
    5  1006 2013-01-07    BEIJING     130-F   32  4432.0
    
    df1=pd.DataFrame({"acct_no":[1001,1002,1003,1004,1005,1006,1007,1008], 
    "gender":['male','female','male','female','male','female','male','female'],
    "pay":['Y','N','Y','Y','N','Y','N','Y',],
    "m-point":[10,12,20,40,40,40,30,20]})
    
    df1
    Out[48]: 
       acct_no  gender pay  m-point
    0     1001    male   Y       10
    1     1002  female   N       12
    2     1003    male   Y       20
    3     1004  female   Y       40
    4     1005    male   N       40
    5     1006  female   Y       40
    6     1007    male   N       30
    7     1008  female   Y       20
    
    # Inner join (the default): df's "id" is matched against df1's "acct_no"
    data1 = pd.merge(df,df1,left_on="id",right_on="acct_no")
    
    data1 
    Out[50]: 
         id       date        city category   ...    acct_no  gender  pay m-point
    0  1001 2013-01-02    Beijing     100-A   ...       1001    male    Y      10
    1  1002 2013-01-03          SH    100-B   ...       1002  female    N      12
    2  1003 2013-01-04  guangzhou     110-A   ...       1003    male    Y      20
    3  1004 2013-01-05    Shenzhen    110-C   ...       1004  female    Y      40
    4  1005 2013-01-06    shanghai    210-A   ...       1005    male    N      40
    5  1006 2013-01-07    BEIJING     130-F   ...       1006  female    Y      40
    
    [6 rows x 10 columns]
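
    The inner join above silently drops accounts 1007 and 1008, which have no matching id. One way to see which rows failed to match is the indicator argument listed in the merge signature. The sketch below uses cut-down, hypothetical versions of the two tables (named orders and accounts here) so that it runs on its own.

```python
import pandas as pd

# Cut-down, hypothetical stand-ins for df and df1
orders = pd.DataFrame({"id": [1001, 1002, 1003],
                       "price": [1200.0, None, 2133.0]})
accounts = pd.DataFrame({"acct_no": [1001, 1002, 1003, 1007],
                         "gender": ["male", "female", "male", "male"]})

# An outer join keeps unmatched rows; indicator=True adds a "_merge"
# column whose value is 'both', 'left_only', or 'right_only'.
merged = pd.merge(orders, accounts, left_on="id", right_on="acct_no",
                  how="outer", indicator=True)
unmatched = merged[merged["_merge"] != "both"]
print(unmatched)

# With left_on/right_on both key columns survive the merge; after an
# inner join they hold the same values, so one of them can be dropped.
inner = pd.merge(orders, accounts, left_on="id", right_on="acct_no")
inner = inner.drop(columns="acct_no")
print(inner.columns.tolist())
```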

    2) The concat function

    Here is another way to combine data: just as NumPy has the concatenate function, pandas has concat.

    concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=None, copy=True)

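      Parameter          Description

      objs               List or dict of pandas objects to concatenate; the only required argument

      axis=0             Axis to concatenate along: 0 stacks the objects vertically (row-wise), 1 places them side by side (column-wise); defaults to 0

      join               Whether the indexes on the other axis are combined by intersection ('inner') or union ('outer'); defaults to 'outer'

      join_axes          Specific indexes to use for the other n-1 axes instead of performing union/intersection logic. This argument was deprecated in pandas 0.25 and removed in 1.0

      keys               Values to associate with the objects being concatenated, forming the outer level of a hierarchical index along the concatenation axis. Can be a list or array of arbitrary values, an array of tuples, or a list of arrays (if levels is given as multiple levels)

      levels             Specific indexes to use as the level(s) of the hierarchical index when keys is passed

      names              Names for the hierarchical index levels created when keys and/or levels are passed

      verify_integrity   Check the new concatenated axis for duplicates and raise an exception if any are found; defaults to False, which allows duplicates

      ignore_index       Do not preserve the index along the concatenation axis; produce a new index range(total_length) instead

    Example: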
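
    The keys, names, verify_integrity, and ignore_index rows are easier to follow with a small example. This is a minimal sketch on two made-up Series (s1 and s2 are illustrative names, not from the original post).

```python
import pandas as pd

# Two hypothetical Series whose indexes overlap on the label "b"
s1 = pd.Series([0, 1], index=["a", "b"])
s2 = pd.Series([2, 3], index=["b", "c"])

# keys adds an outer index level identifying the source object;
# names labels the levels of the resulting MultiIndex.
stacked = pd.concat([s1, s2], keys=["first", "second"],
                    names=["source", "label"])
print(stacked)

# Selecting by the outer key recovers one of the original pieces
piece = stacked.loc["second"]

# verify_integrity=True raises ValueError because "b" would
# appear twice on the concatenated axis.
try:
    pd.concat([s1, s2], verify_integrity=True)
    duplicate_detected = False
except ValueError:
    duplicate_detected = True

# ignore_index=True discards the labels and renumbers from 0
renumbered = pd.concat([s1, s2], ignore_index=True)
print(renumbered.index.tolist())
```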

    pd.concat([df,df1])  # defaults: stack vertically, take the union of the columns
    Out[57]: 
       acct_no   age category        city   ...        id m-point  pay   price
    0      NaN  23.0    100-A    Beijing    ...    1001.0     NaN  NaN  1200.0
    1      NaN  44.0    100-B          SH   ...    1002.0     NaN  NaN     NaN
    2      NaN  54.0    110-A  guangzhou    ...    1003.0     NaN  NaN  2133.0
    3      NaN  32.0    110-C    Shenzhen   ...    1004.0     NaN  NaN  5433.0
    4      NaN  34.0    210-A    shanghai   ...    1005.0     NaN  NaN     NaN
    5      NaN  32.0    130-F    BEIJING    ...    1006.0     NaN  NaN  4432.0
    0   1001.0   NaN      NaN         NaN   ...       NaN    10.0    Y     NaN
    1   1002.0   NaN      NaN         NaN   ...       NaN    12.0    N     NaN
    2   1003.0   NaN      NaN         NaN   ...       NaN    20.0    Y     NaN
    3   1004.0   NaN      NaN         NaN   ...       NaN    40.0    Y     NaN
    4   1005.0   NaN      NaN         NaN   ...       NaN    40.0    N     NaN
    5   1006.0   NaN      NaN         NaN   ...       NaN    40.0    Y     NaN
    6   1007.0   NaN      NaN         NaN   ...       NaN    30.0    N     NaN
    7   1008.0   NaN      NaN         NaN   ...       NaN    20.0    Y     NaN
    
    [14 rows x 10 columns]
    
    pd.concat([df,df1],ignore_index = True)  # vertical union; the original row labels are replaced by a new index 0..13
    Out[58]: 
        acct_no   age category        city   ...        id m-point  pay   price
    0       NaN  23.0    100-A    Beijing    ...    1001.0     NaN  NaN  1200.0
    1       NaN  44.0    100-B          SH   ...    1002.0     NaN  NaN     NaN
    2       NaN  54.0    110-A  guangzhou    ...    1003.0     NaN  NaN  2133.0
    3       NaN  32.0    110-C    Shenzhen   ...    1004.0     NaN  NaN  5433.0
    4       NaN  34.0    210-A    shanghai   ...    1005.0     NaN  NaN     NaN
    5       NaN  32.0    130-F    BEIJING    ...    1006.0     NaN  NaN  4432.0
    6    1001.0   NaN      NaN         NaN   ...       NaN    10.0    Y     NaN
    7    1002.0   NaN      NaN         NaN   ...       NaN    12.0    N     NaN
    8    1003.0   NaN      NaN         NaN   ...       NaN    20.0    Y     NaN
    9    1004.0   NaN      NaN         NaN   ...       NaN    40.0    Y     NaN
    10   1005.0   NaN      NaN         NaN   ...       NaN    40.0    N     NaN
    11   1006.0   NaN      NaN         NaN   ...       NaN    40.0    Y     NaN
    12   1007.0   NaN      NaN         NaN   ...       NaN    30.0    N     NaN
    13   1008.0   NaN      NaN         NaN   ...       NaN    20.0    Y     NaN
    
    [14 rows x 10 columns]
    
    pd.concat([df,df1],axis = 1,join = 'inner')  # side by side, intersection of the row indexes; note this can fail when an input has duplicate index labels
    Out[59]: 
         id       date        city category   ...    acct_no  gender  pay m-point
    0  1001 2013-01-02    Beijing     100-A   ...       1001    male    Y      10
    1  1002 2013-01-03          SH    100-B   ...       1002  female    N      12
    2  1003 2013-01-04  guangzhou     110-A   ...       1003    male    Y      20
    3  1004 2013-01-05    Shenzhen    110-C   ...       1004  female    Y      40
    4  1005 2013-01-06    shanghai    210-A   ...       1005    male    N      40
    5  1006 2013-01-07    BEIJING     130-F   ...       1006  female    Y      40
    
    [6 rows x 10 columns]
    
    pd.concat([df,df1],axis = 1,join = 'outer')  # side by side, union of the row indexes; note this can fail when an input has duplicate index labels
    Out[60]: 
           id       date        city category   ...    acct_no  gender  pay m-point
    0  1001.0 2013-01-02    Beijing     100-A   ...       1001    male    Y      10
    1  1002.0 2013-01-03          SH    100-B   ...       1002  female    N      12
    2  1003.0 2013-01-04  guangzhou     110-A   ...       1003    male    Y      20
    3  1004.0 2013-01-05    Shenzhen    110-C   ...       1004  female    Y      40
    4  1005.0 2013-01-06    shanghai    210-A   ...       1005    male    N      40
    5  1006.0 2013-01-07    BEIJING     130-F   ...       1006  female    Y      40
    6     NaN        NaT         NaN      NaN   ...       1007    male    N      30
    7     NaN        NaT         NaN      NaN   ...       1008  female    Y      20
    
    [8 rows x 10 columns]
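
    Note that concat with axis=1 lines rows up by index label, not by the values of id or acct_no. The rows above pair correctly only because both tables happen to list the accounts in the same order. The sketch below, on small made-up frames (df_a and df_b are hypothetical), shows the difference and one possible way to align on the key by moving it into the index first.

```python
import pandas as pd

# Hypothetical frames; df_b lists the same accounts in another order
df_a = pd.DataFrame({"id": [1001, 1002, 1003], "age": [23, 44, 54]})
df_b = pd.DataFrame({"acct_no": [1003, 1001, 1002], "pay": ["Y", "Y", "N"]})

# Alignment is by row label: label 0 pairs id 1001 with acct_no 1003
by_position = pd.concat([df_a, df_b], axis=1)
print(by_position)

# Moving the key columns into the index makes concat align on the key
by_key = pd.concat([df_a.set_index("id"), df_b.set_index("acct_no")],
                   axis=1, join="inner")
print(by_key)
```

    For key-based joins like this, pd.merge is usually the more direct tool; concat is the better fit when the indexes already carry the alignment.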
    

    3) The combine_first function (filling missing values where indexes overlap)

    # Fully overlapping indexes
    a = pd.Series([np.nan,2.5,np.nan,3.5,4.5,np.nan],index = ['a','b','c','d','e','f'])
    
    a
    Out[62]: 
    a    NaN
    b    2.5
    c    NaN
    d    3.5
    e    4.5
    f    NaN
    dtype: float64
    
    b = pd.Series(np.arange(len(a)),index = ['a','b','c','d','e','f'])
    
    b
    Out[64]: 
    a    0
    b    1
    c    2
    d    3
    e    4
    f    5
    dtype: int32
    
    a.combine_first(b)  # the missing values in a are filled from b
    Out[65]: 
    a    0.0
    b    2.5
    c    2.0
    d    3.5
    e    4.5
    f    5.0
    dtype: float64
    
    # Partially overlapping indexes
    a = pd.Series([np.nan,2.5,np.nan,3.5,4.5,np.nan],index = ['g','b','c','d','e','f'])
    
    a
    Out[67]: 
    g    NaN
    b    2.5
    c    NaN
    d    3.5
    e    4.5
    f    NaN
    dtype: float64
    
    a.combine_first(b)  # the result covers the union of both indexes; 'g' is absent from b, so it stays NaN
    Out[68]: 
    a    0.0
    b    2.5
    c    2.0
    d    3.5
    e    4.5
    f    5.0
    g    NaN
    dtype: float64
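
    combine_first also works on DataFrames, where the patching happens column by column over the union of the row indexes. A minimal sketch, assuming two made-up frames (df_main and df_backup are illustrative names):

```python
import numpy as np
import pandas as pd

# Hypothetical frames: df_main has gaps, df_backup has one extra row
df_main = pd.DataFrame({"a": [1.0, np.nan, 5.0],
                        "b": [np.nan, 2.0, np.nan]})
df_backup = pd.DataFrame({"a": [5.0, 4.0, np.nan, 3.0],
                          "b": [np.nan, 3.0, 4.0, 6.0]})

# Values from df_main win wherever they are present; df_backup only
# fills the holes and contributes the row that df_main lacks (label 3).
patched = df_main.combine_first(df_backup)
print(patched)
```

    A cell remains NaN only when it is missing in both frames, as with column b at row 0 here.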
  • Original article: https://www.cnblogs.com/Christina-Notebook/p/10220785.html