• Pandas Cookbook -- 09 Combining Pandas Objects and Using Databases


    Combining Pandas Objects and Using Databases

    Based on the translation by SeanCheney on Jianshu; I adjusted the formatting and section structure to make it easier to read and to look things up later.

    import pandas as pd
    import numpy as np
    

    1 Inserting into a DataFrame

    Read the names dataset

    names = pd.read_csv('data/names.csv')
    names
    
    Name Age
    0 Cornelia 70
    1 Abbas 69
    2 Penelope 4
    3 Niko 2

    1.1 Assigning a new row directly with loc

    new_data_list = ['Aria', 1]
    names.loc[4] = new_data_list
    names
    
    Name Age
    0 Cornelia 70
    1 Abbas 69
    2 Penelope 4
    3 Niko 2
    4 Aria 1

    A dictionary can also be used to assign a new row

    names.loc[len(names)] = {'Name':'Zayd', 'Age':2}
    

    With a dictionary or Series, the keys do not need to follow the column order

    names.loc[len(names)] = pd.Series({'Age':32, 'Name':'Dean'})
    

    1.2 Adding new rows with append

    1. Append a dictionary directly. Note that append returns a new DataFrame rather than modifying names in place:

    names.append({'Name':'Aria2222', 'Age':12222},ignore_index=True)
    
    Name Age
    0 Cornelia 70
    1 Abbas 69
    2 Penelope 4
    3 Niko 2
    4 Aria 1
    5 Zayd 2
    6 Dean 32
    7 Aria2222 12222

    2. append can also concatenate a DataFrame with a Series

    Create a Series object

    s = pd.Series({'Name': 'Zach', 'Age': 3}, name=len(names))
    
    names.append(s)
    
    Name Age
    0 Cornelia 70
    1 Abbas 69
    2 Penelope 4
    3 Niko 2
    4 Aria 1
    5 Zayd 2
    6 Dean 32
    7 Zach 3

    append can add multiple rows at once; just put the objects in a list

    s1 = pd.Series({'Name': 'Zach', 'Age': 3}, name=len(names))
    s2 = pd.Series({'Name': 'Zayd', 'Age': 2}, name='USA')
    names.append([s1, s2])
    
    Name Age
    0 Cornelia 70
    1 Abbas 69
    2 Penelope 4
    3 Niko 2
    4 Aria 1
    5 Zayd 2
    6 Dean 32
    7 Zach 3
    USA Zayd 2
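    A note for current pandas: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. A minimal sketch of the same multi-row example using pd.concat instead (data mirrors the names example above, trimmed to two rows):

```python
import pandas as pd

# Rebuild a small version of the names frame from this section
names = pd.DataFrame({'Name': ['Cornelia', 'Abbas'], 'Age': [70, 69]})
s1 = pd.Series({'Name': 'Zach', 'Age': 3}, name=len(names))
s2 = pd.Series({'Name': 'Zayd', 'Age': 2}, name='USA')

# Each Series' name becomes its row label when building a frame from a list,
# so concat reproduces what names.append([s1, s2]) used to return
new_rows = pd.DataFrame([s1, s2])
result = pd.concat([names, new_rows])
print(result)
```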

    1.3 Adding a new column with insert

    DataFrame.insert(loc, column, value, allow_duplicates=False)

    Insert column into DataFrame at specified location.
    Raises a ValueError if column is already contained in the DataFrame, unless allow_duplicates is set to True.


    • loc : int
      • Insertion index. Must verify 0 <= loc <= len(columns)
    • column : string, number, or hashable object
      • label of the inserted column
    • value : int, Series, or array-like
    • allow_duplicates : bool, optional

    names.insert(0,'id',np.random.randint(len(names)))
    names
    
    id Name Age
    0 3 Cornelia 70
    1 3 Abbas 69
    2 3 Penelope 4
    3 3 Niko 2
    4 3 Aria 1
    5 3 Zayd 2
    6 3 Dean 32
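    Note that np.random.randint(len(names)) above returns a single scalar, which insert broadcasts to every row -- that is why every id in the output is 3. To draw one random id per row, pass a size argument. A sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Cornelia', 'Abbas', 'Penelope'],
                   'Age': [70, 69, 4]})

# size=len(df) draws one value per row instead of broadcasting a scalar
rng = np.random.default_rng(0)
df.insert(0, 'id', rng.integers(0, 10, size=len(df)))
print(df)
```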

    2 Combining DataFrames

    Read the stocks_2016, stocks_2017, and stocks_2018 datasets, using Symbol as the row index

    years = 2016, 2017, 2018
    stock_tables = [pd.read_csv('data/stocks_{}.csv'.format(year), index_col='Symbol') for year in years]
    stocks_2016, stocks_2017, stocks_2018 = stock_tables
    
    names = ['prices', 'transactions']
    food_tables = [pd.read_csv('data/food_{}.csv'.format(name)) for name in names]
    

    2.1 concat

    Put the two DataFrames in a list and connect them with pandas' concat function

    s_list = [stocks_2016, stocks_2017]
    pd.concat(s_list)
    
    Shares Low High
    Symbol
    AAPL 80 95 110
    TSLA 50 80 130
    WMT 40 55 70
    AAPL 50 120 140
    GE 100 30 40
    IBM 87 75 95
    SLB 20 55 85
    TXN 500 15 23
    TSLA 100 100 300

    concat is the only pandas function that can concatenate DataFrames vertically

    The keys parameter labels each DataFrame; the labels appear at the outermost level of the row index, producing a MultiIndex. The names parameter renames each index level

    pd.concat(s_list, keys=['2016', '2017'], names=['Year', 'Symbol'])
    
    Shares Low High
    Year Symbol
    2016 AAPL 80 95 110
    TSLA 50 80 130
    WMT 40 55 70
    2017 AAPL 50 120 140
    GE 100 30 40
    IBM 87 75 95
    SLB 20 55 85
    TXN 500 15 23
    TSLA 100 100 300
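    Because keys produces a MultiIndex, a single year can then be pulled out with one .loc label. A toy sketch mirroring the stocks example (values made up):

```python
import pandas as pd

df16 = pd.DataFrame({'Shares': [80]}, index=pd.Index(['AAPL'], name='Symbol'))
df17 = pd.DataFrame({'Shares': [50]}, index=pd.Index(['AAPL'], name='Symbol'))

# keys become the outer level of the MultiIndex, so .loc['2016'] selects a year
both = pd.concat([df16, df17], keys=['2016', '2017'], names=['Year', 'Symbol'])
print(both.loc['2016'])
```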

    Horizontal concatenation works too; just set the axis parameter to 'columns' (or 1)

    pd.concat(s_list, keys=['2016', '2017'], axis='columns', names=['Year', None],sort=True)
    
    Year 2016 2017
    Shares Low High Shares Low High
    AAPL 80.0 95.0 110.0 50.0 120.0 140.0
    GE NaN NaN NaN 100.0 30.0 40.0
    IBM NaN NaN NaN 87.0 75.0 95.0
    SLB NaN NaN NaN 20.0 55.0 85.0
    TSLA 50.0 80.0 130.0 100.0 100.0 300.0
    TXN NaN NaN NaN 500.0 15.0 23.0
    WMT 40.0 55.0 70.0 NaN NaN NaN

    By default concat uses an outer join, keeping every row of each DataFrame. Setting the join parameter switches to an inner join:

    pd.concat(s_list, join='inner', keys=['2016', '2017'], axis='columns', names=['Year', None])
    
    Year 2016 2017
    Shares Low High Shares Low High
    Symbol
    AAPL 80 95 110 50 120 140
    TSLA 50 80 130 100 100 300

    Tips for concat

    Name the DataFrames with a dict before concatenating

    pd.concat(dict(zip(years,stock_tables)), axis='columns',sort=True)
    
    2016 2017 2018
    Shares Low High Shares Low High Shares Low High
    AAPL 80.0 95.0 110.0 50.0 120.0 140.0 40.0 135.0 170.0
    AMZN NaN NaN NaN NaN NaN NaN 8.0 900.0 1125.0
    GE NaN NaN NaN 100.0 30.0 40.0 NaN NaN NaN
    IBM NaN NaN NaN 87.0 75.0 95.0 NaN NaN NaN
    SLB NaN NaN NaN 20.0 55.0 85.0 NaN NaN NaN
    TSLA 50.0 80.0 130.0 100.0 100.0 300.0 50.0 220.0 400.0
    TXN NaN NaN NaN 500.0 15.0 23.0 NaN NaN NaN
    WMT 40.0 55.0 70.0 NaN NaN NaN NaN NaN NaN

    append is a greatly simplified version of concat; internally, append simply calls concat. The vertical concat example at the start of this section can also be written as:

    stocks_2016.append(stocks_2017)
    
    Shares Low High
    Symbol
    AAPL 80 95 110
    TSLA 50 80 130
    WMT 40 55 70
    AAPL 50 120 140
    GE 100 30 40
    IBM 87 75 95
    SLB 20 55 85
    TXN 500 15 23
    TSLA 100 100 300

    2.2 join

    Use join to connect DataFrames; if any column names overlap, set lsuffix or rsuffix to tell them apart

    stocks_2016.join(stocks_2017, lsuffix='_2016', rsuffix='_2017', how='outer')
    
    Shares_2016 Low_2016 High_2016 Shares_2017 Low_2017 High_2017
    Symbol
    AAPL 80.0 95.0 110.0 50.0 120.0 140.0
    GE NaN NaN NaN 100.0 30.0 40.0
    IBM NaN NaN NaN 87.0 75.0 95.0
    SLB NaN NaN NaN 20.0 55.0 85.0
    TSLA 50.0 80.0 130.0 100.0 100.0 300.0
    TXN NaN NaN NaN 500.0 15.0 23.0
    WMT 40.0 55.0 70.0 NaN NaN NaN

    2.3 merge

    Inspect the two small datasets food_prices and food_transactions

    food_prices, food_transactions = food_tables
    
    food_prices
    
    item store price Date
    0 pear A 0.99 2017
    1 pear B 1.99 2017
    2 peach A 2.99 2017
    3 peach B 3.49 2017
    4 banana A 0.39 2017
    5 banana B 0.49 2017
    6 steak A 5.99 2017
    7 steak B 6.99 2017
    8 steak B 4.99 2015

    food_transactions
    
    custid item store quantity
    0 1 pear A 5
    1 1 banana A 10
    2 2 steak B 3
    3 2 pear B 1
    4 2 peach B 2
    5 2 steak B 1
    6 2 coconut B 4

    Merge food_transactions with food_prices on the keys item and store

    food_transactions.merge(food_prices, on=['item', 'store'])
    
    custid item store quantity price Date
    0 1 pear A 5 0.99 2017
    1 1 banana A 10 0.39 2017
    2 2 steak B 3 6.99 2017
    3 2 steak B 3 4.99 2015
    4 2 steak B 1 6.99 2017
    5 2 steak B 1 4.99 2015
    6 2 pear B 1 1.99 2017
    7 2 peach B 2 3.49 2017

    Because steak appears twice in each table (for store B), the merge takes the Cartesian product of those rows, producing four steak rows in the result.
    Because coconut has no matching price, it is absent from the result.
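    Both situations can be surfaced explicitly with merge's indicator parameter, which adds a _merge column showing which side each row came from. A sketch on hypothetical mini versions of the two tables:

```python
import pandas as pd

prices = pd.DataFrame({'item': ['pear', 'steak', 'steak'],
                       'store': ['B', 'B', 'B'],
                       'price': [1.99, 6.99, 4.99]})
transactions = pd.DataFrame({'item': ['pear', 'steak', 'coconut'],
                             'store': ['B', 'B', 'B'],
                             'quantity': [1, 3, 4]})

# how='left' keeps unmatched transactions; _merge marks them 'left_only'
check = transactions.merge(prices, on=['item', 'store'],
                           how='left', indicator=True)
print(check[check['_merge'] == 'left_only'])
```

    The duplicated-key case can also be turned into an error up front with merge's validate parameter (e.g. validate='many_to_one' would raise here, since prices has duplicate steak keys).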

    Below, only the 2017 prices are merged in

    food_transactions.merge(food_prices.query('Date == 2017'), how='left')
    
    custid item store quantity price Date
    0 1 pear A 5 0.99 2017.0
    1 1 banana A 10 0.39 2017.0
    2 2 steak B 3 6.99 2017.0
    3 2 pear B 1 1.99 2017.0
    4 2 peach B 2 3.49 2017.0
    5 2 steak B 1 6.99 2017.0
    6 2 coconut B 4 NaN NaN

    2.4 Differences between concat, join, and merge

    concat:

    1. A pandas function
    2. Concatenates two or more pandas objects vertically or horizontally
    3. Aligns on the index only
    4. Raises an error when duplicate values appear in the index
    5. Outer join by default (inner is also available)
    6. The only function that can concatenate DataFrames vertically

    join:

    1. A DataFrame method
    2. Joins two or more pandas objects horizontally only
    3. Aligns the calling DataFrame's columns or row index against the other objects' row index (never their columns)
    4. Handles duplicate index values with a Cartesian product
    5. Left join by default (inner, outer, and right are also available)

    merge:

    1. A DataFrame method
    2. Joins exactly two DataFrames horizontally
    3. Aligns the calling DataFrame's columns or row index against the other DataFrame's columns or row index
    4. Handles duplicate join keys with a Cartesian product
    5. Inner join by default (left, outer, and right are also available)
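    The close relationship between join and merge can be shown directly: a default join on the index is the same as a merge with both index flags set and how='left'. A minimal sketch with made-up frames:

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2]}, index=['x', 'y'])
right = pd.DataFrame({'b': [3, 4]}, index=['x', 'z'])

# join defaults to a left join on the index...
j = left.join(right)
# ...which is equivalent to merge aligning both indexes with how='left'
m = left.merge(right, left_index=True, right_index=True, how='left')
print(j.equals(m))
```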

    3 Connecting to a SQL Database

    3.1 Creating a SQLAlchemy engine

    Before reading the chinook database, create a SQLAlchemy engine

    from sqlalchemy import create_engine
    engine = create_engine('sqlite:///data/chinook.db')
    
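    If chinook.db isn't at hand, the same calls can be tried against an in-memory SQLite database. A self-contained sketch (the table and its contents below are made up for illustration):

```python
import pandas as pd
from sqlalchemy import create_engine

# 'sqlite://' with no path creates an in-memory database
engine = create_engine('sqlite://')

# Write a tiny table, then read it back just as with chinook
pd.DataFrame({'GenreId': [1, 2], 'Name': ['Rock', 'Jazz']}).to_sql(
    'genres', engine, index=False)
genres = pd.read_sql_table('genres', engine)
print(genres)
```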

    3.2 The read_sql_table function

    read_sql_table reads an entire table: the first argument is the table name and the second is the engine; it returns a DataFrame

    tracks = pd.read_sql_table('tracks', engine)
    tracks.iloc[:5,:5]
    
    TrackId Name AlbumId MediaTypeId GenreId
    0 1 For Those About To Rock (We Salute You) 1 1 1
    1 2 Balls to the Wall 2 2 1
    2 3 Fast As a Shark 3 2 1
    3 4 Restless and Wild 3 2 1
    4 5 Princess of the Dawn 3 2 1

    genres = pd.read_sql_table('genres', engine)
    genres.head()
    
    GenreId Name
    0 1 Rock
    1 2 Jazz
    2 3 Metal
    3 4 Alternative & Punk
    4 5 Rock And Roll

    Find the average track length for each genre

    genre_track = genres.merge(
        tracks[['GenreId', 'Milliseconds']], 
        on='GenreId', 
        how='left').drop('GenreId', axis='columns')
    
    genre_track.head()
    
    Name Milliseconds
    0 Rock 343719
    1 Rock 342562
    2 Rock 230619
    3 Rock 252051
    4 Rock 375418

    Average the milliseconds per genre, then convert the result to the timedelta data type

    genre_time = genre_track.groupby('Name')['Milliseconds'].mean()
    
    pd.to_timedelta(genre_time, unit='ms').dt.floor('s').sort_values()[:10]
    
    Name
    Rock And Roll        00:02:14
    Opera                00:02:54
    Hip Hop/Rap          00:02:58
    Easy Listening       00:03:09
    Bossa Nova           00:03:39
    R&B/Soul             00:03:40
    World                00:03:44
    Pop                  00:03:49
    Latin                00:03:52
    Alternative & Punk   00:03:54
    Name: Milliseconds, dtype: timedelta64[ns]
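    The conversion step can be seen in isolation on a small hand-made Series (the millisecond values here are chosen to reproduce two of the rows above, not taken from the database):

```python
import pandas as pd

# Mean track lengths in milliseconds; values made up for illustration
genre_time = pd.Series({'Rock And Roll': 134000.0, 'Opera': 174000.5})

# unit='ms' interprets the floats as milliseconds; floor('s') drops
# the sub-second part so the durations print cleanly
td = pd.to_timedelta(genre_time, unit='ms').dt.floor('s')
print(td)
```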
    

    Find each customer's total spending

    cust = pd.read_sql_table('customers', engine, columns=['CustomerId', 'FirstName', 'LastName'])
    invoice = pd.read_sql_table('invoices', engine, columns=['InvoiceId','CustomerId'])
    ii = pd.read_sql_table('invoice_items', engine, columns=['InvoiceId', 'UnitPrice', 'Quantity'])
    
    cust_inv = cust.merge(invoice, on='CustomerId').merge(ii, on='InvoiceId')
    
    cust_inv.head()
    
    CustomerId FirstName LastName InvoiceId UnitPrice Quantity
    0 1 Luís Gonçalves 98 1.99 1
    1 1 Luís Gonçalves 98 1.99 1
    2 1 Luís Gonçalves 121 0.99 1
    3 1 Luís Gonçalves 121 0.99 1
    4 1 Luís Gonçalves 121 0.99 1

    Now multiplying quantity by unit price gives each customer's total spending

    total = cust_inv['Quantity'] * cust_inv['UnitPrice']
    cols = ['CustomerId', 'FirstName', 'LastName']
    cust_inv.assign(Total = total).groupby(cols)['Total'].sum().sort_values(ascending=False).head()
    
    CustomerId  FirstName  LastName  
    6           Helena     Holý          49.62
    26          Richard    Cunningham    47.62
    57          Luis       Rojas         46.62
    46          Hugh       O'Reilly      45.62
    45          Ladislav   Kovács        45.62
    Name: Total, dtype: float64
    

    3.3 The read_sql_query function

    pd.read_sql_query('select * from tracks limit 5', engine).iloc[:,:5]
    
    TrackId Name AlbumId MediaTypeId GenreId
    0 1 For Those About To Rock (We Salute You) 1 1 1
    1 2 Balls to the Wall 2 2 1
    2 3 Fast As a Shark 3 2 1
    3 4 Restless and Wild 3 2 1
    4 5 Princess of the Dawn 3 2 1

    Long SQL strings can also be passed to read_sql_query

    sql_string1 = '''
              select 
                  Name, 
                  time(avg(Milliseconds) / 1000, 'unixepoch') as avg_time
              from (
                      select 
                          g.Name, 
                          t.Milliseconds
                      from 
                          genres as g 
                      join
                          tracks as t
                          on 
                              g.genreid == t.genreid
                  )
              group by 
                  Name
              order by 
                  avg_time
              '''
    
    pd.read_sql_query(sql_string1, engine)[:10]
    
    Name avg_time
    0 Rock And Roll 00:02:14
    1 Opera 00:02:54
    2 Hip Hop/Rap 00:02:58
    3 Easy Listening 00:03:09
    4 Bossa Nova 00:03:39
    5 R&B/Soul 00:03:40
    6 World 00:03:44
    7 Pop 00:03:49
    8 Latin 00:03:52
    9 Alternative & Punk 00:03:54

    sql_string2 = '''
              select 
                    c.customerid, 
                    c.FirstName, 
                    c.LastName, 
                    sum(ii.quantity *  ii.unitprice) as Total
              from
                    customers as c
              join
                    invoices as i
                         on c.customerid = i.customerid
              join
                    invoice_items as ii
                         on i.invoiceid = ii.invoiceid
              group by
                    c.customerid, c.FirstName, c.LastName
              order by
                    Total desc
              '''
    
    pd.read_sql_query(sql_string2, engine)[:10]
    
    CustomerId FirstName LastName Total
    0 6 Helena Holý 49.62
    1 26 Richard Cunningham 47.62
    2 57 Luis Rojas 46.62
    3 45 Ladislav Kovács 45.62
    4 46 Hugh O'Reilly 45.62
    5 37 Fynn Zimmermann 43.62
    6 24 Frank Ralston 43.62
    7 28 Julia Barnett 43.62
    8 25 Victor Stevens 42.62
    9 7 Astrid Gruber 42.62
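    When a query needs user-supplied values, read_sql_query's params argument passes them safely instead of string formatting. A self-contained sketch using the standard-library sqlite3 driver (the table and numbers below are made up for illustration):

```python
import sqlite3

import pandas as pd

# In-memory database with a tiny totals table
con = sqlite3.connect(':memory:')
pd.DataFrame({'CustomerId': [6, 26], 'Total': [49.62, 47.62]}).to_sql(
    'totals', con, index=False)

# '?' is SQLite's placeholder; params supplies the value
top = pd.read_sql_query(
    'select * from totals where Total > ?', con, params=(48,))
print(top)
```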
  • Original: https://www.cnblogs.com/shiyushiyu/p/9818378.html