• 5 dataframe级联与合并操作


    import pandas as pd
    import numpy as np
    

    级联操作

    • pd.concat, pd.append

    pandas使用pd.concat函数,与np.concatenate函数类似,只是多了一些参数:

    objs
    axis=0
    keys
    join='outer' / 'inner':表示的是级联的方式,outer会将所有的项进行级联(忽略匹配和不匹配),而inner只会将匹配的项级联到一起,不匹配的不级联
    ignore_index=False
    
    • 匹配级联
    df1 = pd.DataFrame(data=np.random.randint(0,100,size=(5,3)),columns=['A','B','C'])
    df2 = pd.DataFrame(data=np.random.randint(0,100,size=(5,3)),columns=['A','D','C'])
    
    pd.concat((df1,df1),axis=1) #行列索引都一致的级联叫做匹配级联
    
    A B C A B C
    0 26 63 95 26 63 95
    1 66 86 35 66 86 35
    2 74 3 4 74 3 4
    3 85 0 67 85 0 67
    4 59 28 65 59 28 65
    • 不匹配级联
      • 不匹配指的是级联的维度的索引不一致。例如纵向级联时列索引不一致,横向级联时行索引不一致
      • 有2种连接方式:
        • 外连接:补NaN(默认模式)
        • 内连接:只连接匹配的项
    pd.concat((df1,df2),axis=0)
    
    /Users/bobo/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
    of pandas will change to not sort by default.
    
    To accept the future behavior, pass 'sort=False'.
    
    To retain the current behavior and silence the warning, pass 'sort=True'.
    
      """Entry point for launching an IPython kernel.
    
    A B C D
    0 26 63.0 95 NaN
    1 66 86.0 35 NaN
    2 74 3.0 4 NaN
    3 85 0.0 67 NaN
    4 59 28.0 65 NaN
    0 98 NaN 87 2.0
    1 24 NaN 3 52.0
    2 68 NaN 61 65.0
    3 57 NaN 29 29.0
    4 99 NaN 41 51.0
    pd.concat((df1,df2),axis=0,join='inner') #inner直把可以级联的级联不能级联不处理
    
    A C
    0 26 95
    1 66 35
    2 74 4
    3 85 67
    4 59 65
    0 98 87
    1 24 3
    2 68 61
    3 57 29
    4 99 41
    • 如果想要保留数据的完整性必须使用outer(外连接)

    • append函数的使用

    df1.append(df1)
    
    A B C
    0 26 63 95
    1 66 86 35
    2 74 3 4
    3 85 0 67
    4 59 28 65
    0 26 63 95
    1 66 86 35
    2 74 3 4
    3 85 0 67
    4 59 28 65

    合并操作

    • merge与concat的区别在于,merge需要依据某一共同列来进行合并

    • 使用pd.merge()合并时,会自动根据两者相同column名称的那一列,作为key来进行合并。

    • 注意每一列元素的顺序不要求一致

    一对一合并

    from pandas import DataFrame
    
    df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
                    'group':['Accounting','Engineering','Engineering'],
                    })
    df1
    
    employee group
    0 Bob Accounting
    1 Jake Engineering
    2 Lisa Engineering
    df2 = DataFrame({'employee':['Lisa','Bob','Jake'],
                    'hire_date':[2004,2008,2012],
                    })
    df2
    
    employee hire_date
    0 Lisa 2004
    1 Bob 2008
    2 Jake 2012
    pd.merge(df1,df2,on='employee')
    
    employee group hire_date
    0 Bob Accounting 2008
    1 Jake Engineering 2012
    2 Lisa Engineering 2004

    一对多合并

    df3 = DataFrame({
        'employee':['Lisa','Jake'],
        'group':['Accounting','Engineering'],
        'hire_date':[2004,2016]})
    df3
    
    employee group hire_date
    0 Lisa Accounting 2004
    1 Jake Engineering 2016
    df4 = DataFrame({'group':['Accounting','Engineering','Engineering'],
                           'supervisor':['Carly','Guido','Steve']
                    })
    df4
    
    group supervisor
    0 Accounting Carly
    1 Engineering Guido
    2 Engineering Steve
    pd.merge(df3,df4)#on如果不写,默认情况下使用两表中公有的列作为合并条件
    
    employee group hire_date supervisor
    0 Lisa Accounting 2004 Carly
    1 Jake Engineering 2016 Guido
    2 Jake Engineering 2016 Steve

    多对多合并

    df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
                     'group':['Accounting','Engineering','Engineering']})
    df1
    
    employee group
    0 Bob Accounting
    1 Jake Engineering
    2 Lisa Engineering
    df5 = DataFrame({'group':['Engineering','Engineering','HR'],
                    'supervisor':['Carly','Guido','Steve']
                    })
    df5
    
    group supervisor
    0 Engineering Carly
    1 Engineering Guido
    2 HR Steve
    pd.merge(df1,df5,how='right')
    
    employee group supervisor
    0 Jake Engineering Carly
    1 Lisa Engineering Carly
    2 Jake Engineering Guido
    3 Lisa Engineering Guido
    4 NaN HR Steve
    pd.merge(df1,df5,how='right')
    
    employee group supervisor
    0 Jake Engineering Carly
    1 Lisa Engineering Carly
    2 Jake Engineering Guido
    3 Lisa Engineering Guido
    4 NaN HR Steve

    key的规范化

    • 当两张表没有可进行连接的列时,可使用left_on和right_on手动指定merge中左右两边的哪一列列作为连接的列
    df1 = DataFrame({'employee':['Bobs','Linda','Bill'],
                    'group':['Accounting','Product','Marketing'],
                   'hire_date':[1998,2017,2018]})
    df1
    
    employee group hire_date
    0 Bobs Accounting 1998
    1 Linda Product 2017
    2 Bill Marketing 2018
    df5 = DataFrame({'name':['Lisa','Bobs','Bill'],
                    'hire_dates':[1998,2016,2007]})
    df5
    
    name hire_dates
    0 Lisa 1998
    1 Bobs 2016
    2 Bill 2007
    pd.merge(df1,df5,left_on='employee',right_on='name')
    
    employee group hire_date name hire_dates
    0 Bobs Accounting 1998 Bobs 2016
    1 Bill Marketing 2018 Bill 2007

    内合并与外合并:out取并集 inner取交集

    df6 = DataFrame({'name':['Peter','Paul','Mary'],
                   'food':['fish','beans','bread']}
                   )
    df7 = DataFrame({'name':['Mary','Joseph'],
                    'drink':['wine','beer']})
    
    
    
    
    df6 = DataFrame({'name':['Peter','Paul','Mary'],
                   'food':['fish','beans','bread']}
                   )
    df7 = DataFrame({'name':['Mary','Joseph'],
                    'drink':['wine','beer']})
    
    
    
    
    作者:华王 博客:https://www.cnblogs.com/huahuawang/
  • 相关阅读:
    Java SE 第十一讲(面向对象之封装)续二
    Java SE 第二十六讲 包与导入语句剖析
    Java SE 第三十一,二,三 Java数组剖析,Java数组内存地址解析
    Java SE 第三十四,五,六讲 Array类解析及数组疑难剖析,冒泡排序,交换排序以及快速排序
    Java SE 第三十八,九,四十,四十一,四十二,三四IDE详细介绍,ArrayList源代码深入剖析,L
    java高效的获取指定的精度的double数
    C++ 字符常量
    C++ endl
    C++基本数据类型
    vs2005的treeview简单使用之无限级别菜单建立
  • 原文地址:https://www.cnblogs.com/huahuawang/p/14889002.html
Copyright © 2020-2023  润新知