• Pandas的拼接操作


    pandas的拼接操作

    pandas的拼接分为两种:

    • 级联:pd.concat, pd.append
    • 合并:pd.merge, pd.join
    import pandas as pd
    import numpy as np
    from pandas import DataFrame,Series
    

    一. 使用pd.concat()级联

    pandas使用pd.concat函数,与np.concatenate函数类似,只是多了一些参数:

    objs
    axis=0
    keys
    join='outer' / 'inner':表示的是级联的方式,outer会将所有的项进行级联(忽略匹配和不匹配),而inner只会将匹配的项级联到一起,不匹配的不级联
    ignore_index=False
    

    1)匹配级联

    行列索引均一致

    df1 = DataFrame(data=np.random.randint(0,100,size=(3,4)))
    df1
    
    0 1 2 3
    0 61 89 68 51
    1 46 79 1 55
    2 52 4 72 18
    df2 = DataFrame(data=np.random.randint(0,100,size=(3,4)))
    df2
    
    0 1 2 3
    0 15 62 20 78
    1 60 79 70 58
    2 71 87 20 95
    pd.concat((df1,df2),axis=0)  # axis=0表示Y轴级联
    
    0 1 2 3
    0 61 89 68 51
    1 46 79 1 55
    2 52 4 72 18
    0 15 62 20 78
    1 60 79 70 58
    2 71 87 20 95

    2) 不匹配级联

    不匹配指的是级联的维度的索引不一致。例如纵向级联时列索引不一致,横向级联时行索引不一致

    有2种连接方式:

    • 外连接:补NaN(默认模式)

    • 内连接:只连接匹配的项

    df1 = DataFrame(data=np.random.randint(0,100,size=(3,4)))
    df2 = DataFrame(data=np.random.randint(0,100,size=(3,3)))
    
    pd.concat((df1,df2),axis=0)
    
    0 1 2 3
    0 55 61 54 56.0
    1 10 14 6 62.0
    2 39 27 99 81.0
    0 31 49 80 NaN
    1 73 42 44 NaN
    2 67 68 97 NaN
    pd.concat((df1,df2),axis=0,join='inner')  # inner内连接,只级联匹配的项
    
    0 1 2
    0 55 61 54
    1 10 14 6
    2 39 27 99
    0 31 49 80
    1 73 42 44
    2 67 68 97

    二. 使用pd.merge()合并

    merge与concat的区别在于,merge需要依据某一共同的列来进行合并

    使用pd.merge()合并时,会自动根据两者相同column名称的那一列,作为key来进行合并。

    注意每一列元素的顺序不要求一致

    参数:

    • how:outer取并集(外连接) inner取交集(内连接)

    • on:当有多列相同的时候,可以使用on来指定使用那一列进行合并,on的值为一个列表

    1) 一对一合并

    df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
                    'group':['Accounting','Engineering','Engineering'],
                    })
    df1
    
    employee group
    0 Bob Accounting
    1 Jake Engineering
    2 Lisa Engineering
    df2 = DataFrame({'employee':['Lisa','Bob','Jake'],
                    'hire_date':[2004,2008,2012],
                    })
    df2
    
    employee hire_date
    0 Lisa 2004
    1 Bob 2008
    2 Jake 2012
    pd.merge(df1, df2)  # 按照employee进行了合并
    
    employee group hire_date
    0 Bob Accounting 2008
    1 Jake Engineering 2012
    2 Lisa Engineering 2004

    2) 多对一合并

    df3 = DataFrame({
        'employee':['Lisa','Jake'],
        'group':['Accounting','Engineering'],
        'hire_date':[2004,2016]})
    df3
    
    employee group hire_date
    0 Lisa Accounting 2004
    1 Jake Engineering 2016
    df4 = DataFrame({'group':['Accounting','Engineering','Engineering'],
                           'supervisor':['Carly','Guido','Steve']
                    })
    df4
    
    group supervisor
    0 Accounting Carly
    1 Engineering Guido
    2 Engineering Steve
    pd.merge(df3, df4)
    
    employee group hire_date supervisor
    0 Lisa Accounting 2004 Carly
    1 Jake Engineering 2016 Guido
    2 Jake Engineering 2016 Steve

    3) 多对多合并

    df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
                     'group':['Accounting','Engineering','Engineering']})
    df1
    
    employee group
    0 Bob Accounting
    1 Jake Engineering
    2 Lisa Engineering
    df2 = DataFrame({'group':['Engineering','Engineering','HR'],
                    'supervisor':['Carly','Guido','Steve']
                    })
    df2
    
    group supervisor
    0 Engineering Carly
    1 Engineering Guido
    2 HR Steve
    pd.merge(df1,df2,how='right')  # right表示右连接
    
    employee group supervisor
    0 Jake Engineering Carly
    1 Lisa Engineering Carly
    2 Jake Engineering Guido
    3 Lisa Engineering Guido
    4 NaN HR Steve

    4) key的规范化

    • 当列冲突时,即有多个列名称相同时,需要使用on=来指定哪一个列作为key,配合suffixes指定冲突列名
    df1 = DataFrame({'employee':['Jack',"Summer","Steve"],
                     'group':['Accounting','Finance','Marketing']})
    df1
    
    employee group
    0 Jack Accounting
    1 Summer Finance
    2 Steve Marketing
    df2 = DataFrame({'employee':['Jack','Bob',"Jake"],
                     'hire_date':[2003,2009,2012],
                    'group':['Accounting','sell','ceo']})
    df2
    
    employee group hire_date
    0 Jack Accounting 2003
    1 Bob sell 2009
    2 Jake ceo 2012
    pd.merge(df1,df2,on='employee')  # 默认按照employee和group进行合并,可以指定列名
    
    employee group_x group_y hire_date
    0 Jack Accounting Accounting 2003
    • 当两张表没有可进行连接的列时,可使用left_on和right_on手动指定merge中左右两边的哪一列列作为连接的列
    df1 = DataFrame({'employee':['Bobs','Linda','Bill'],
                    'group':['Accounting','Product','Marketing'],
                   'hire_date':[1998,2017,2018]})
    df1
    
    employee group hire_date
    0 Bobs Accounting 1998
    1 Linda Product 2017
    2 Bill Marketing 2018
    df2 = DataFrame({'name':['Lisa','Bobs','Bill'],
                    'hire_dates':[1998,2016,2007]})
    df2
    
    hire_dates name
    0 1998 Lisa
    1 2016 Bobs
    2 2007 Bill
    pd.merge(df1,df2,left_on='employee',right_on='name',how='outer')
    
    employee group hire_date hire_dates name
    0 Bobs Accounting 1998.0 2016.0 Bobs
    1 Linda Product 2017.0 NaN NaN
    2 Bill Marketing 2018.0 2007.0 Bill
    3 NaN NaN NaN 1998.0 Lisa

    5) 内合并与外合并:out取并集 inner取交集

    • 内合并:只保留两者都有的key(默认模式)
    df6 = DataFrame({'name':['Peter','Paul','Mary'],
                   'food':['fish','beans','bread']}
                   )
    df6
    
    food name
    0 fish Peter
    1 beans Paul
    2 bread Mary
    df7 = DataFrame({'name':['Mary','Joseph'],
                    'drink':['wine','beer']})
    df7
    
    drink name
    0 wine Mary
    1 beer Joseph
    pd.merge(df6, df7)
    
    food name drink
    0 bread Mary wine
    • 外合并 how='outer':补NaN
    pd.merge(df6, df7, how='outer')
    
    food name drink
    0 fish Peter NaN
    1 beans Paul NaN
    2 bread Mary wine
    3 NaN Joseph beer
  • 相关阅读:
    机器学习
    机器学习
    JavaWeb之tomcat安装、配置与使用(一)
    Tomcat安装、配置和部署笔记
    Java配置----JDK开发环境搭建及环境变量配置
    安装SQL2012
    SQLServer 数据库变成单个用户后无法访问问题的解决方法
    临时记录
    SQL Server 动态生成数据库所有表Insert语句
    SQL2000查看表的大小
  • 原文地址:https://www.cnblogs.com/zyyhxbs/p/11708522.html
Copyright © 2020-2023  润新知