• pandas的级联操作


    级联操作

    • pd.concat, pd.append
    import pandas as pd
    from pandas import DataFrame
    import numpy as np
    

    pandas使用pd.concat函数,与np.concatenate函数类似,只是多了一些参数:

    objs
    axis=0
    keys
    join='outer' / 'inner':表示的是级联的方式,outer会将所有的项进行级联(忽略匹配和不匹配),而inner只会将匹配的项级联到一起,不匹配的不级联
    ignore_index=False
    
    • 匹配级联
    df1 = DataFrame({'employee':['Bobs','Linda','Bill'],
                    'group':['Accounting','Product','Marketing'],
                   'hire_date':[1998,2017,2018]})
    df1
    
    employee group hire_date
    0 Bobs Accounting 1998
    1 Linda Product 2017
    2 Bill Marketing 2018
    pd.concat((df1,df1),axis=0)
    
    employee group hire_date
    0 Bobs Accounting 1998
    1 Linda Product 2017
    2 Bill Marketing 2018
    0 Bobs Accounting 1998
    1 Linda Product 2017
    2 Bill Marketing 2018
    • 不匹配级联
      • 不匹配指的是级联的维度的索引不一致。例如纵向级联时列索引不一致,横向级联时行索引不一致
      • 有2种连接方式:
        • 外连接:补NaN(默认模式)
        • 内连接:只连接匹配的项
    df2 = df1.copy()
    df2.columns = ['employee','groupps','hire_date']
    df2
    
    employee groupps hire_date
    0 Bobs Accounting 1998
    1 Linda Product 2017
    2 Bill Marketing 2018
    pd.concat((df1,df2),axis=0)
    
    # 按列级联,发现不同列就加在表中,用NaN补全空的字段
    
    employee group groupps hire_date
    0 Bobs Accounting NaN 1998
    1 Linda Product NaN 2017
    2 Bill Marketing NaN 2018
    0 Bobs NaN Accounting 1998
    1 Linda NaN Product 2017
    2 Bill NaN Marketing 2018
    • join:
      • inner:只对可以匹配的项进行级联
      • outer:可以级联所有的项
    pd.concat((df1,df2),axis=0,join='inner')
    
    employee hire_date
    0 Bobs 1998
    1 Linda 2017
    2 Bill 2018
    0 Bobs 1998
    1 Linda 2017
    2 Bill 2018
    • append函数的使用: append只可以进行纵向的级联

      employee group groupps hire_date
      0 Bobs Accounting NaN 1998
      1 Linda Product NaN 2017
      2 Bill Marketing NaN 2018
      0 Bobs NaN Accounting 1998
      1 Linda NaN Product 2017
      2 Bill NaN Marketing 2018

    合并操作

    • merge与concat的区别在于,merge需要依据某一共同列来进行合并

    • 使用pd.merge()合并时,会自动根据两者相同column名称的那一列,作为key来进行合并。

    • 注意每一列元素的顺序不要求一致

    一对一合并

    df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
                    'group':['Accounting','Engineering','Engineering'],
                    })
    df1
    
    employee group
    0 Bob Accounting
    1 Jake Engineering
    2 Lisa Engineering
    df2 = DataFrame({'employee':['Lisa','Bob','Jake'],
                    'hire_date':[2004,2008,2012],
                    })
    df2
    
    employee hire_date
    0 Lisa 2004
    1 Bob 2008
    2 Jake 2012
    pd.merge(df1,df2,on='employee')
    
    employee group hire_date
    0 Bob Accounting 2008
    1 Jake Engineering 2012
    2 Lisa Engineering 2004

    一对多合并

    df3 = DataFrame({
        'employee':['Lisa','Jake'],
        'group':['Accounting','Engineering'],
        'hire_date':[2004,2016]})
    df
    
    employee group hire_date
    0 Lisa Accounting 2004
    1 Jake Engineering 2016
    df4 = DataFrame({'group':['Accounting','Engineering','Engineering'],
                           'supervisor':['Carly','Guido','Steve']
                    })
    df4
    
    group supervisor
    0 Accounting Carly
    1 Engineering Guido
    2 Engineering Steve
    pd.merge(df3,df4)
    
    # 会默认以两个表的共同字段 group 进行关联,将两张表中的全部数据进行合并
    
    employee group hire_date supervisor
    0 Lisa Accounting 2004 Carly
    1 Jake Engineering 2016 Guido
    2 Jake Engineering 2016 Steve

    多对多合并

    df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
                     'group':['Accounting','Engineering','Engineering']})
    df1
    
    employee group
    0 Bob Accounting
    1 Jake Engineering
    2 Lisa Engineering
    df5 = DataFrame({'group':['Engineering','Engineering','HR'],
                    'supervisor':['Carly','Guido','Steve']
                    })
    df5
    
    group supervisor
    0 Engineering Carly
    1 Engineering Guido
    2 HR Steve
    pd.merge(df1,df5,how='outer')
    
    employee group supervisor
    0 Bob Accounting NaN
    1 Jake Engineering Carly
    2 Jake Engineering Guido
    3 Lisa Engineering Carly
    4 Lisa Engineering Guido
    5 NaN HR Steve

    key的规范化

    • 当列冲突时,即有多个列名称相同时,需要使用on=来指定哪一个列作为key,配合suffixes指定冲突列名
    df1 = DataFrame({'employee':['Jack',"Summer","Steve"],
                     'group':['Accounting','Finance','Marketing']})
    df1
    
    employee group
    0 Jack Accounting
    1 Summer Finance
    2 Steve Marketing
    df2 = DataFrame({'employee':['Jack','Bob',"Jake"],
                     'hire_date':[2003,2009,2012],
                    'group':['Accounting','sell','ceo']})
    df2
    
    employee group hire_date
    0 Jack Accounting 2003
    1 Bob sell 2009
    2 Jake ceo 2012
    pd.merge(df1,df2,on='group')
    
    # 指定按照 group 这一列来进行分组,会将两个表中相同的字段名都加在合并后的表中,两个字段会改变
    # 也可以通过这个 suffixes=('_x', '_y') 参数进行一个修改
    
    employee_x group employee_y hire_date
    0 Jack Accounting Jack 2003
    • 当两张表没有可进行连接的列时,可使用left_on和right_on手动指定merge中左右两边的哪一列作为连接的列
    df1 = DataFrame({'employee':['Bobs','Linda','Bill'],
                    'group':['Accounting','Product','Marketing'],
                   'hire_date':[1998,2017,2018]})
    df1
    
    employee group hire_date
    0 Bobs Accounting 1998
    1 Linda Product 2017
    2 Bill Marketing 2018
    df5 = DataFrame({'name':['Lisa','Bobs','Bill'],
                    'hire_dates':[1998,2016,2007]})
    df5
    
    hire_dates name
    0 1998 Lisa
    1 2016 Bobs
    2 2007 Bill
    pd.merge(df1,df5,left_on='employee',right_on='name',how='outer')
    
    # 分别指定df1表中的employee字段和df5表中的name字段进行一个比对连接,
    
    employee group hire_date hire_dates name
    0 Bobs Accounting 1998.0 2016.0 Bobs
    1 Linda Product 2017.0 NaN NaN
    2 Bill Marketing 2018.0 2007.0 Bill
    3 NaN NaN NaN 1998.0 Lisa

    内合并与外合并:out取并集 inner取交集

  • 相关阅读:
    NSDate的处理:前一天、后一天等关于时区偏移的处理以及在数据库中的使用
    《powershell 的版本号所引起的载入 FSharp 编译器问题》基本解决
    hdu 2055 An easy problem (java)
    昨天登陆页面,无法进入后台,今天攻克了
    关于ubuntu下qt编译显示Cannot connect creator comm socket /tmp/qt_temp.xxx/stub-socket的解决的方法
    (转)Struts2的拦截器
    (转)Struts2的标签库
    (转)OGNL与值栈
    (转)Struts2访问Servlet的API及......
    (转)Struts2快速入门
  • 原文地址:https://www.cnblogs.com/zhufanyu/p/12031641.html
Copyright © 2020-2023  润新知