This is adapted from SeanCheney's translation on Jianshu; I have tweaked the formatting and restructured the sections to suit my own reading, so things are easier to find when I come back to it later.
import pandas as pd
import numpy as np
1 Inserting into a DataFrame
Read the names dataset:
names = pd.read_csv('data/names.csv')
names
| | Name | Age |
---|---|---|
0 | Cornelia | 70 |
1 | Abbas | 69 |
2 | Penelope | 4 |
3 | Niko | 2 |
1.1 Assigning a new row directly with loc
new_data_list = ['Aria', 1]
names.loc[4] = new_data_list
names
| | Name | Age |
---|---|---|
0 | Cornelia | 70 |
1 | Abbas | 69 |
2 | Penelope | 4 |
3 | Niko | 2 |
4 | Aria | 1 |
A new row can also be assigned with a dict:
names.loc[len(names)] = {'Name':'Zayd', 'Age':2}
With a dict or Series, the columns may be given in any order:
names.loc[len(names)] = pd.Series({'Age':32, 'Name':'Dean'})
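One caveat worth noting: loc assigns by label, so names.loc[len(names)] only appends while the index is the default 0..n-1 RangeIndex; if that label already exists, the existing row is overwritten instead. A minimal sketch with a hypothetical frame:
df = pd.DataFrame({'Name': ['A', 'B'], 'Age': [1, 2]}, index=[0, 2])
df.loc[len(df)] = ['C', 3]  # len(df) == 2 and label 2 exists: row 'B' is overwritten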
1.2 Adding new rows with append
1. Append a dict directly; ignore_index=True is required here, and the result is returned as a new DataFrame (names itself is unchanged):
names.append({'Name': 'Aria2222', 'Age': 12222}, ignore_index=True)
| | Name | Age |
---|---|---|
0 | Cornelia | 70 |
1 | Abbas | 69 |
2 | Penelope | 4 |
3 | Niko | 2 |
4 | Aria | 1 |
5 | Zayd | 2 |
6 | Dean | 32 |
7 | Aria2222 | 12222 |
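Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. On current versions the same result can be obtained with pd.concat; a minimal sketch of the equivalent:
# Wrap the new row in a one-row DataFrame, then concatenate;
# ignore_index=True regenerates the 0..n-1 integer index.
new_row = pd.DataFrame([{'Name': 'Aria2222', 'Age': 12222}])
pd.concat([names, new_row], ignore_index=True)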
2. The append method can also attach a Series to a DataFrame
Create a Series object; its name attribute becomes the label of the new row:
s = pd.Series({'Name': 'Zach', 'Age': 3}, name=len(names))
names.append(s)
| | Name | Age |
---|---|---|
0 | Cornelia | 70 |
1 | Abbas | 69 |
2 | Penelope | 4 |
3 | Niko | 2 |
4 | Aria | 1 |
5 | Zayd | 2 |
6 | Dean | 32 |
7 | Zach | 3 |
append can add several rows at once; just pass the objects in a list:
s1 = pd.Series({'Name': 'Zach', 'Age': 3}, name=len(names))
s2 = pd.Series({'Name': 'Zayd', 'Age': 2}, name='USA')
names.append([s1, s2])
| | Name | Age |
---|---|---|
0 | Cornelia | 70 |
1 | Abbas | 69 |
2 | Penelope | 4 |
3 | Niko | 2 |
4 | Aria | 1 |
5 | Zayd | 2 |
6 | Dean | 32 |
7 | Zach | 3 |
USA | Zayd | 2 |
1.3 Adding new columns with insert
DataFrame.insert(loc, column, value, allow_duplicates=False)
Insert column into DataFrame at specified location.
Raises a ValueError if column is already contained in the DataFrame, unless allow_duplicates is set to True.
- loc : int
- Insertion index. Must satisfy 0 <= loc <= len(columns)
- column : string, number, or hashable object
- label of the inserted column
- value : scalar, Series, or array-like
- allow_duplicates : bool, optional
names.insert(0, 'id', np.random.randint(len(names)))  # a single random int, broadcast to every row
names
| | id | Name | Age |
---|---|---|---|
0 | 3 | Cornelia | 70 |
1 | 3 | Abbas | 69 |
2 | 3 | Penelope | 4 |
3 | 3 | Niko | 2 |
4 | 3 | Aria | 1 |
5 | 3 | Zayd | 2 |
6 | 3 | Dean | 32 |
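Unlike most DataFrame methods, insert modifies the DataFrame in place and returns None. If mutation is undesirable, assign offers a copy-based alternative, though the new column always lands at the end; a minimal sketch (the is_adult column is purely illustrative):
# assign returns a new DataFrame and leaves `names` untouched.
names_with_flag = names.assign(is_adult=names['Age'] >= 18)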
2 Combining DataFrames
Read the stocks_2016, stocks_2017, and stocks_2018 datasets, using Symbol as the row index:
years = 2016, 2017, 2018
stock_tables = [pd.read_csv('data/stocks_{}.csv'.format(year), index_col='Symbol') for year in years]
stocks_2016, stocks_2017, stocks_2018 = stock_tables
# Also read the two food datasets for later use
# (note: this rebinds `names`, shadowing the DataFrame from section 1)
names = ['prices', 'transactions']
food_tables = [pd.read_csv('data/food_{}.csv'.format(name)) for name in names]
2.1 concat
Put the two DataFrames in a list and connect them with pandas' concat function:
s_list = [stocks_2016, stocks_2017]
pd.concat(s_list)
| | Shares | Low | High |
---|---|---|---|
Symbol | | | |
AAPL | 80 | 95 | 110 |
TSLA | 50 | 80 | 130 |
WMT | 40 | 55 | 70 |
AAPL | 50 | 120 | 140 |
GE | 100 | 30 | 40 |
IBM | 87 | 75 | 95 |
SLB | 20 | 55 | 85 |
TXN | 500 | 15 | 23 |
TSLA | 100 | 100 | 300 |
concat is the only pandas function that can concatenate DataFrames vertically (append is just a method that calls concat internally)
The keys parameter assigns a label to each DataFrame; the labels appear at the outermost level of the row index, producing a MultiIndex. The names parameter then names each index level:
pd.concat(s_list, keys=['2016', '2017'], names=['Year', 'Symbol'])
| | | Shares | Low | High |
---|---|---|---|---|
Year | Symbol | | | |
2016 | AAPL | 80 | 95 | 110 |
| | TSLA | 50 | 80 | 130 |
| | WMT | 40 | 55 | 70 |
2017 | AAPL | 50 | 120 | 140 |
| | GE | 100 | 30 | 40 |
| | IBM | 87 | 75 | 95 |
| | SLB | 20 | 55 | 85 |
| | TXN | 500 | 15 | 23 |
| | TSLA | 100 | 100 | 300 |
Concatenation can also be horizontal; just set the axis parameter to 'columns' or 1:
pd.concat(s_list, keys=['2016', '2017'], axis='columns', names=['Year', None], sort=True)
Year | 2016 | | | 2017 | | |
---|---|---|---|---|---|---|
| | Shares | Low | High | Shares | Low | High |
AAPL | 80.0 | 95.0 | 110.0 | 50.0 | 120.0 | 140.0 |
GE | NaN | NaN | NaN | 100.0 | 30.0 | 40.0 |
IBM | NaN | NaN | NaN | 87.0 | 75.0 | 95.0 |
SLB | NaN | NaN | NaN | 20.0 | 55.0 | 85.0 |
TSLA | 50.0 | 80.0 | 130.0 | 100.0 | 100.0 | 300.0 |
TXN | NaN | NaN | NaN | 500.0 | 15.0 | 23.0 |
WMT | 40.0 | 55.0 | 70.0 | NaN | NaN | NaN |
By default, concat uses an outer join, keeping all rows from every DataFrame. Setting the join parameter switches to an inner join:
pd.concat(s_list, join='inner', keys=['2016', '2017'], axis='columns', names=['Year', None])
Year | 2016 | | | 2017 | | |
---|---|---|---|---|---|---|
| | Shares | Low | High | Shares | Low | High |
Symbol | | | | | | |
AAPL | 80 | 95 | 110 | 50 | 120 | 140 |
TSLA | 50 | 80 | 130 | 100 | 100 | 300 |
Tips for concat
Pass the DataFrames as a dict to name them before concatenating; the dict keys become the outermost column level:
pd.concat(dict(zip(years, stock_tables)), axis='columns', sort=True)
| | 2016 | | | 2017 | | | 2018 | | |
---|---|---|---|---|---|---|---|---|---|
| | Shares | Low | High | Shares | Low | High | Shares | Low | High |
AAPL | 80.0 | 95.0 | 110.0 | 50.0 | 120.0 | 140.0 | 40.0 | 135.0 | 170.0 |
AMZN | NaN | NaN | NaN | NaN | NaN | NaN | 8.0 | 900.0 | 1125.0 |
GE | NaN | NaN | NaN | 100.0 | 30.0 | 40.0 | NaN | NaN | NaN |
IBM | NaN | NaN | NaN | 87.0 | 75.0 | 95.0 | NaN | NaN | NaN |
SLB | NaN | NaN | NaN | 20.0 | 55.0 | 85.0 | NaN | NaN | NaN |
TSLA | 50.0 | 80.0 | 130.0 | 100.0 | 100.0 | 300.0 | 50.0 | 220.0 | 400.0 |
TXN | NaN | NaN | NaN | 500.0 | 15.0 | 23.0 | NaN | NaN | NaN |
WMT | 40.0 | 55.0 | 70.0 | NaN | NaN | NaN | NaN | NaN | NaN |
append is a heavily simplified version of concat; internally it simply calls concat. The vertical pd.concat(s_list) example at the start of this section can also be written as:
stocks_2016.append(stocks_2017)
| | Shares | Low | High |
---|---|---|---|
Symbol | | | |
AAPL | 80 | 95 | 110 |
TSLA | 50 | 80 | 130 |
WMT | 40 | 55 | 70 |
AAPL | 50 | 120 | 140 |
GE | 100 | 30 | 40 |
IBM | 87 | 75 | 95 |
SLB | 20 | 55 | 85 |
TXN | 500 | 15 | 23 |
TSLA | 100 | 100 | 300 |
2.2 join
join connects DataFrames on their indexes; when column names collide, lsuffix or rsuffix must be set to tell them apart:
stocks_2016.join(stocks_2017, lsuffix='_2016', rsuffix='_2017', how='outer')
| | Shares_2016 | Low_2016 | High_2016 | Shares_2017 | Low_2017 | High_2017 |
---|---|---|---|---|---|---|
Symbol | | | | | | |
AAPL | 80.0 | 95.0 | 110.0 | 50.0 | 120.0 | 140.0 |
GE | NaN | NaN | NaN | 100.0 | 30.0 | 40.0 |
IBM | NaN | NaN | NaN | 87.0 | 75.0 | 95.0 |
SLB | NaN | NaN | NaN | 20.0 | 55.0 | 85.0 |
TSLA | 50.0 | 80.0 | 130.0 | 100.0 | 100.0 | 300.0 |
TXN | NaN | NaN | NaN | 500.0 | 15.0 | 23.0 |
WMT | 40.0 | 55.0 | 70.0 | NaN | NaN | NaN |
2.3 merge
Take a look at the two small datasets, food_prices and food_transactions:
food_prices, food_transactions = food_tables
food_prices
| | item | store | price | Date |
---|---|---|---|---|
0 | pear | A | 0.99 | 2017 |
1 | pear | B | 1.99 | 2017 |
2 | peach | A | 2.99 | 2017 |
3 | peach | B | 3.49 | 2017 |
4 | banana | A | 0.39 | 2017 |
5 | banana | B | 0.49 | 2017 |
6 | steak | A | 5.99 | 2017 |
7 | steak | B | 6.99 | 2017 |
8 | steak | B | 4.99 | 2015 |
food_transactions
| | custid | item | store | quantity |
---|---|---|---|---|
0 | 1 | pear | A | 5 |
1 | 1 | banana | A | 10 |
2 | 2 | steak | B | 3 |
3 | 2 | pear | B | 1 |
4 | 2 | peach | B | 2 |
5 | 2 | steak | B | 1 |
6 | 2 | coconut | B | 4 |
Merge food_transactions with food_prices on the keys item and store:
food_transactions.merge(food_prices, on=['item', 'store'])
| | custid | item | store | quantity | price | Date |
---|---|---|---|---|---|---|
0 | 1 | pear | A | 5 | 0.99 | 2017 |
1 | 1 | banana | A | 10 | 0.39 | 2017 |
2 | 2 | steak | B | 3 | 6.99 | 2017 |
3 | 2 | steak | B | 3 | 4.99 | 2015 |
4 | 2 | steak | B | 1 | 6.99 | 2017 |
5 | 2 | steak | B | 1 | 4.99 | 2015 |
6 | 2 | pear | B | 1 | 1.99 | 2017 |
7 | 2 | peach | B | 2 | 3.49 | 2017 |
Because steak at store B appears twice in each table, the merge forms their Cartesian product, producing four steak rows in the result.
Because coconut has no matching price row, it is dropped from this (default inner-join) result.
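When such a Cartesian blow-up is unintended, merge's validate parameter can assert the expected key relationship up front; a minimal sketch:
# Assert that each (item, store) pair matches at most one price row;
# the duplicated 2015/2017 steak prices make this raise a MergeError.
try:
    food_transactions.merge(food_prices, on=['item', 'store'],
                            validate='many_to_one')
except pd.errors.MergeError as e:
    print(e)  # merge keys are not unique in the right dataset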
Below, only the 2017 prices are merged. A left join keeps coconut this time; its missing price and Date become NaN, which also forces the Date column to float:
food_transactions.merge(food_prices.query('Date == 2017'), how='left')
| | custid | item | store | quantity | price | Date |
---|---|---|---|---|---|---|
0 | 1 | pear | A | 5 | 0.99 | 2017.0 |
1 | 1 | banana | A | 10 | 0.39 | 2017.0 |
2 | 2 | steak | B | 3 | 6.99 | 2017.0 |
3 | 2 | pear | B | 1 | 1.99 | 2017.0 |
4 | 2 | peach | B | 2 | 3.49 | 2017.0 |
5 | 2 | steak | B | 1 | 6.99 | 2017.0 |
6 | 2 | coconut | B | 4 | NaN | NaN |
2.4 Differences between concat, join, and merge
concat:
- A pandas module-level function
- Can concatenate two or more pandas objects vertically or horizontally
- Aligns only on the index
- Raises an error when the index being aligned on contains duplicates (rather than forming a Cartesian product)
- Defaults to an outer join (inner is also available)
- The only pandas function that can concatenate DataFrames vertically
join:
- A DataFrame method
- Can only connect two or more pandas objects horizontally
- Aligns the calling DataFrame's columns or row index with the other objects' row index (never their columns)
- Handles duplicate index values by forming their Cartesian product
- Defaults to a left join (inner, outer, and right are also available)
merge:
- A DataFrame method
- Can only connect two DataFrame objects horizontally
- Aligns the calling DataFrame's columns or row index with the other DataFrame's columns or row index
- Handles duplicate keys by forming their Cartesian product
- Defaults to an inner join (left, outer, and right are also available); the sketch below shows the same side-by-side combination done with all three tools
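A minimal sketch contrasting the three on the stocks tables above (the suffix and key labels are illustrative; the column layouts differ, but all three align on the Symbol index):
# concat: function, index-aligned, outer join by default
c = pd.concat([stocks_2016, stocks_2017], axis='columns',
              keys=['2016', '2017'], sort=True)

# join: method, index-on-index, left join by default (how='outer' to match concat)
j = stocks_2016.join(stocks_2017, lsuffix='_2016', rsuffix='_2017', how='outer')

# merge: method, inner join by default; here merging on the index from both sides
m = stocks_2016.merge(stocks_2017, left_index=True, right_index=True,
                      how='outer', suffixes=('_2016', '_2017'))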
3 Connecting to SQL databases
3.1 Creating a SQLAlchemy engine
Before reading from the chinook database, a SQLAlchemy engine must be created:
from sqlalchemy import create_engine
engine = create_engine('sqlite:///data/chinook.db')
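One quick way to sanity-check the connection is to list the tables the engine can see, using SQLAlchemy's inspector; a minimal sketch:
from sqlalchemy import inspect

inspect(engine).get_table_names()  # e.g. ['albums', 'artists', 'customers', ...]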
3.2 The read_sql_table function
read_sql_table reads an entire table: the first argument is the table name and the second is the engine; it returns a DataFrame:
tracks = pd.read_sql_table('tracks', engine)
tracks.iloc[:5,:5]
| | TrackId | Name | AlbumId | MediaTypeId | GenreId |
---|---|---|---|---|---|
0 | 1 | For Those About To Rock (We Salute You) | 1 | 1 | 1 |
1 | 2 | Balls to the Wall | 2 | 2 | 1 |
2 | 3 | Fast As a Shark | 3 | 2 | 1 |
3 | 4 | Restless and Wild | 3 | 2 | 1 |
4 | 5 | Princess of the Dawn | 3 | 2 | 1 |
genres = pd.read_sql_table('genres', engine)
genres.head()
| | GenreId | Name |
---|---|---|
0 | 1 | Rock |
1 | 2 | Jazz |
2 | 3 | Metal |
3 | 4 | Alternative & Punk |
4 | 5 | Rock And Roll |
Find the average song length for each genre:
genre_track = genres.merge(
tracks[['GenreId', 'Milliseconds']],
on='GenreId',
how='left').drop('GenreId', axis='columns')
genre_track.head()
| | Name | Milliseconds |
---|---|---|
0 | Rock | 343719 |
1 | Rock | 342562 |
2 | Rock | 230619 |
3 | Rock | 252051 |
4 | Rock | 375418 |
Average Milliseconds per genre and convert the result to the timedelta data type:
genre_time = genre_track.groupby('Name')['Milliseconds'].mean()
pd.to_timedelta(genre_time, unit='ms').dt.floor('s').sort_values()[:10]
Name
Rock And Roll 00:02:14
Opera 00:02:54
Hip Hop/Rap 00:02:58
Easy Listening 00:03:09
Bossa Nova 00:03:39
R&B/Soul 00:03:40
World 00:03:44
Pop 00:03:49
Latin 00:03:52
Alternative & Punk 00:03:54
Name: Milliseconds, dtype: timedelta64[ns]
Find the total amount spent by each customer:
cust = pd.read_sql_table('customers', engine, columns=['CustomerId', 'FirstName', 'LastName'])
invoice = pd.read_sql_table('invoices', engine, columns=['InvoiceId','CustomerId'])
ii = pd.read_sql_table('invoice_items', engine, columns=['InvoiceId', 'UnitPrice', 'Quantity'])
cust_inv = cust.merge(invoice, on='CustomerId').merge(ii, on='InvoiceId')
cust_inv.head()
| | CustomerId | FirstName | LastName | InvoiceId | UnitPrice | Quantity |
---|---|---|---|---|---|---|
0 | 1 | Luís | Gonçalves | 98 | 1.99 | 1 |
1 | 1 | Luís | Gonçalves | 98 | 1.99 | 1 |
2 | 1 | Luís | Gonçalves | 121 | 0.99 | 1 |
3 | 1 | Luís | Gonçalves | 121 | 0.99 | 1 |
4 | 1 | Luís | Gonçalves | 121 | 0.99 | 1 |
Now multiply quantity by unit price to get each customer's total spending:
total = cust_inv['Quantity'] * cust_inv['UnitPrice']
cols = ['CustomerId', 'FirstName', 'LastName']
cust_inv.assign(Total = total).groupby(cols)['Total'].sum().sort_values(ascending=False).head()
CustomerId FirstName LastName
6 Helena Holý 49.62
26 Richard Cunningham 47.62
57 Luis Rojas 46.62
46 Hugh O'Reilly 45.62
45 Ladislav Kovács 45.62
Name: Total, dtype: float64
3.3 The read_sql_query function
pd.read_sql_query('select * from tracks limit 5', engine).iloc[:,:5]
| | TrackId | Name | AlbumId | MediaTypeId | GenreId |
---|---|---|---|---|---|
0 | 1 | For Those About To Rock (We Salute You) | 1 | 1 | 1 |
1 | 2 | Balls to the Wall | 2 | 2 | 1 |
2 | 3 | Fast As a Shark | 3 | 2 | 1 |
3 | 4 | Restless and Wild | 3 | 2 | 1 |
4 | 5 | Princess of the Dawn | 3 | 2 | 1 |
Long query strings can also be passed to read_sql_query:
sql_string1 = '''
select
Name,
time(avg(Milliseconds) / 1000, 'unixepoch') as avg_time
from (
select
g.Name,
t.Milliseconds
from
genres as g
join
tracks as t
on
g.genreid == t.genreid
)
group by
Name
order by
avg_time
'''
pd.read_sql_query(sql_string1, engine)[:10]
| | Name | avg_time |
---|---|---|
0 | Rock And Roll | 00:02:14 |
1 | Opera | 00:02:54 |
2 | Hip Hop/Rap | 00:02:58 |
3 | Easy Listening | 00:03:09 |
4 | Bossa Nova | 00:03:39 |
5 | R&B/Soul | 00:03:40 |
6 | World | 00:03:44 |
7 | Pop | 00:03:49 |
8 | Latin | 00:03:52 |
9 | Alternative & Punk | 00:03:54 |
sql_string2 = '''
select
c.customerid,
c.FirstName,
c.LastName,
sum(ii.quantity * ii.unitprice) as Total
from
customers as c
join
invoices as i
on c.customerid = i.customerid
join
invoice_items as ii
on i.invoiceid = ii.invoiceid
group by
c.customerid, c.FirstName, c.LastName
order by
Total desc
'''
pd.read_sql_query(sql_string2, engine)[:10]
| | CustomerId | FirstName | LastName | Total |
---|---|---|---|---|
0 | 6 | Helena | Holý | 49.62 |
1 | 26 | Richard | Cunningham | 47.62 |
2 | 57 | Luis | Rojas | 46.62 |
3 | 45 | Ladislav | Kovács | 45.62 |
4 | 46 | Hugh | O'Reilly | 45.62 |
5 | 37 | Fynn | Zimmermann | 43.62 |
6 | 24 | Frank | Ralston | 43.62 |
7 | 28 | Julia | Barnett | 43.62 |
8 | 25 | Victor | Stevens | 42.62 |
9 | 7 | Astrid | Gruber | 42.62 |
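read_sql_query also accepts bind parameters through its params argument, which is safer than interpolating values into the SQL string. The placeholder style is the driver's own; for SQLite that is the qmark style. A minimal sketch (exact behavior can vary slightly across pandas/SQLAlchemy versions; the Rock filter is illustrative):
# Fetch a few tracks from a single genre using a bound parameter.
sql_string3 = '''
select t.Name, t.Milliseconds
from tracks as t
join genres as g on g.GenreId = t.GenreId
where g.Name = ?
limit 5
'''
pd.read_sql_query(sql_string3, engine, params=('Rock',))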