This is adapted from SeanCheney's translation on Jianshu; I have tweaked the formatting and restructured the sections to suit my own reading, so things are easier to find when I come back to it later.
import pandas as pd
import numpy as np
1 Inserting into a DataFrame
Read the names dataset:
names = pd.read_csv('data/names.csv')
names
| | Name | Age |
---|---|---|
0 | Cornelia | 70 |
1 | Abbas | 69 |
2 | Penelope | 4 |
3 | Niko | 2 |
1.1 Assigning a new row directly with loc
new_data_list = ['Aria', 1]
names.loc[4] = new_data_list
names
| | Name | Age |
---|---|---|
0 | Cornelia | 70 |
1 | Abbas | 69 |
2 | Penelope | 4 |
3 | Niko | 2 |
4 | Aria | 1 |
A new row can also be assigned with a dict:
names.loc[len(names)] = {'Name':'Zayd', 'Age':2}
With a dict or Series, the columns may be given in any order:
names.loc[len(names)] = pd.Series({'Age':32, 'Name':'Dean'})
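One caveat worth noting: loc assigns by label, so names.loc[len(names)] only appends while the index is the default 0..n-1 RangeIndex; if that label already exists, the existing row is overwritten instead. A minimal sketch with a hypothetical frame:
df = pd.DataFrame({'Name': ['A', 'B'], 'Age': [1, 2]}, index=[0, 2])
df.loc[len(df)] = ['C', 3]  # len(df) == 2 and label 2 exists: row 'B' is overwritten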
1.2 Adding new rows with append
1. Append a dict directly; ignore_index=True is required here, and the result is returned as a new DataFrame (names itself is unchanged):
names.append({'Name': 'Aria2222', 'Age': 12222}, ignore_index=True)
| | Name | Age |
---|---|---|
0 | Cornelia | 70 |
1 | Abbas | 69 |
2 | Penelope | 4 |
3 | Niko | 2 |
4 | Aria | 1 |
5 | Zayd | 2 |
6 | Dean | 32 |
7 | Aria2222 | 12222 |
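Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. On current versions the same result can be obtained with pd.concat; a minimal sketch of the equivalent:
# Wrap the new row in a one-row DataFrame, then concatenate;
# ignore_index=True regenerates the 0..n-1 integer index.
new_row = pd.DataFrame([{'Name': 'Aria2222', 'Age': 12222}])
pd.concat([names, new_row], ignore_index=True)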
2. The append method can also attach a Series to a DataFrame
Create a Series object; its name attribute becomes the label of the new row:
s = pd.Series({'Name': 'Zach', 'Age': 3}, name=len(names))
names.append(s)
| | Name | Age |
---|---|---|
0 | Cornelia | 70 |
1 | Abbas | 69 |
2 | Penelope | 4 |
3 | Niko | 2 |
4 | Aria | 1 |
5 | Zayd | 2 |
6 | Dean | 32 |
7 | Zach | 3 |
append can add several rows at once; just pass the objects in a list:
s1 = pd.Series({'Name': 'Zach', 'Age': 3}, name=len(names))
s2 = pd.Series({'Name': 'Zayd', 'Age': 2}, name='USA')
names.append([s1, s2])
| | Name | Age |
---|---|---|
0 | Cornelia | 70 |
1 | Abbas | 69 |
2 | Penelope | 4 |
3 | Niko | 2 |
4 | Aria | 1 |
5 | Zayd | 2 |
6 | Dean | 32 |
7 | Zach | 3 |
USA | Zayd | 2 |
1.3 Adding new columns with insert
DataFrame.insert(loc, column, value, allow_duplicates=False)
Insert column into DataFrame at specified location.
Raises a ValueError if column is already contained in the DataFrame, unless allow_duplicates is set to True.
- loc : int
- Insertion index. Must satisfy 0 <= loc <= len(columns)
- column : string, number, or hashable object
- label of the inserted column
- value : scalar, Series, or array-like
- allow_duplicates : bool, optional
names.insert(0, 'id', np.random.randint(len(names)))  # a single random int, broadcast to every row
names
| | id | Name | Age |
---|---|---|---|
0 | 3 | Cornelia | 70 |
1 | 3 | Abbas | 69 |
2 | 3 | Penelope | 4 |
3 | 3 | Niko | 2 |
4 | 3 | Aria | 1 |
5 | 3 | Zayd | 2 |
6 | 3 | Dean | 32 |
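Unlike most DataFrame methods, insert modifies the DataFrame in place and returns None. If mutation is undesirable, assign offers a copy-based alternative, though the new column always lands at the end; a minimal sketch (the is_adult column is purely illustrative):
# assign returns a new DataFrame and leaves `names` untouched.
names_with_flag = names.assign(is_adult=names['Age'] >= 18)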
2 Combining DataFrames
Read the stocks_2016, stocks_2017, and stocks_2018 datasets, using Symbol as the row index:
years = 2016, 2017, 2018
stock_tables = [pd.read_csv('data/stocks_{}.csv'.format(year), index_col='Symbol') for year in years]
stocks_2016, stocks_2017, stocks_2018 = stock_tables
# Also read the two food datasets for later use
# (note: this rebinds `names`, shadowing the DataFrame from section 1)
names = ['prices', 'transactions']
food_tables = [pd.read_csv('data/food_{}.csv'.format(name)) for name in names]
2.1 concat
Put the two DataFrames in a list and connect them with pandas' concat function:
s_list = [stocks_2016, stocks_2017]
pd.concat(s_list)
| | Shares | Low | High |
---|---|---|---|
Symbol | | | |
AAPL | 80 | 95 | 110 |
TSLA | 50 | 80 | 130 |
WMT | 40 | 55 | 70 |
AAPL | 50 | 120 | 140 |
GE | 100 | 30 | 40 |
IBM | 87 | 75 | 95 |
SLB | 20 | 55 | 85 |
TXN | 500 | 15 | 23 |
TSLA | 100 | 100 | 300 |
concat is the only pandas function that can concatenate DataFrames vertically (append is just a method that calls concat internally)
The keys parameter assigns a label to each DataFrame; the labels appear at the outermost level of the row index, producing a MultiIndex. The names parameter then names each index level:
pd.concat(s_list, keys=['2016', '2017'], names=['Year', 'Symbol'])
| | | Shares | Low | High |
---|---|---|---|---|
Year | Symbol | | | |
2016 | AAPL | 80 | 95 | 110 |
| | TSLA | 50 | 80 | 130 |
| | WMT | 40 | 55 | 70 |
2017 | AAPL | 50 | 120 | 140 |
| | GE | 100 | 30 | 40 |
| | IBM | 87 | 75 | 95 |
| | SLB | 20 | 55 | 85 |
| | TXN | 500 | 15 | 23 |
| | TSLA | 100 | 100 | 300 |
Concatenation can also be horizontal; just set the axis parameter to 'columns' or 1:
pd.concat(s_list, keys=['2016', '2017'], axis='columns', names=['Year', None], sort=True)
Year | 2016 | | | 2017 | | |
---|---|---|---|---|---|---|
| | Shares | Low | High | Shares | Low | High |
AAPL | 80.0 | 95.0 | 110.0 | 50.0 | 120.0 | 140.0 |
GE | NaN | NaN | NaN | 100.0 | 30.0 | 40.0 |
IBM | NaN | NaN | NaN | 87.0 | 75.0 | 95.0 |
SLB | NaN | NaN | NaN | 20.0 | 55.0 | 85.0 |
TSLA | 50.0 | 80.0 | 130.0 | 100.0 | 100.0 | 300.0 |
TXN | NaN | NaN | NaN | 500.0 | 15.0 | 23.0 |
WMT | 40.0 | 55.0 | 70.0 | NaN | NaN | NaN |
By default, concat uses an outer join, keeping all rows from every DataFrame. Setting the join parameter switches to an inner join:
pd.concat(s_list, join='inner', keys=['2016', '2017'], axis='columns', names=['Year', None])
Year | 2016 | | | 2017 | | |
---|---|---|---|---|---|---|
| | Shares | Low | High | Shares | Low | High |
Symbol | | | | | | |
AAPL | 80 | 95 | 110 | 50 | 120 | 140 |
TSLA | 50 | 80 | 130 | 100 | 100 | 300 |
Tips for concat
Pass the DataFrames as a dict to name them before concatenating; the dict keys become the outermost column level:
pd.concat(dict(zip(years, stock_tables)), axis='columns', sort=True)
| | 2016 | | | 2017 | | | 2018 | | |
---|---|---|---|---|---|---|---|---|---|
| | Shares | Low | High | Shares | Low | High | Shares | Low | High |
AAPL | 80.0 | 95.0 | 110.0 | 50.0 | 120.0 | 140.0 | 40.0 | 135.0 | 170.0 |
AMZN | NaN | NaN | NaN | NaN | NaN | NaN | 8.0 | 900.0 | 1125.0 |
GE | NaN | NaN | NaN | 100.0 | 30.0 | 40.0 | NaN | NaN | NaN |
IBM | NaN | NaN | NaN | 87.0 | 75.0 | 95.0 | NaN | NaN | NaN |
SLB | NaN | NaN | NaN | 20.0 | 55.0 | 85.0 | NaN | NaN | NaN |
TSLA | 50.0 | 80.0 | 130.0 | 100.0 | 100.0 | 300.0 | 50.0 | 220.0 | 400.0 |
TXN | NaN | NaN | NaN | 500.0 | 15.0 | 23.0 | NaN | NaN | NaN |
WMT | 40.0 | 55.0 | 70.0 | NaN | NaN | NaN | NaN | NaN | NaN |
append is a heavily simplified version of concat; internally it simply calls concat. The vertical pd.concat(s_list) example at the start of this section can also be written as:
stocks_2016.append(stocks_2017)
| | Shares | Low | High |
---|---|---|---|
Symbol | | | |
AAPL | 80 | 95 | 110 |
TSLA | 50 | 80 | 130 |
WMT | 40 | 55 | 70 |
AAPL | 50 | 120 | 140 |
GE | 100 | 30 | 40 |
IBM | 87 | 75 | 95 |
SLB | 20 | 55 | 85 |
TXN | 500 | 15 | 23 |
TSLA | 100 | 100 | 300 |
2.2 join
join connects DataFrames on their indexes; when column names collide, lsuffix or rsuffix must be set to tell them apart:
stocks_2016.join(stocks_2017, lsuffix='_2016', rsuffix='_2017', how='outer')
| | Shares_2016 | Low_2016 | High_2016 | Shares_2017 | Low_2017 | High_2017 |
---|---|---|---|---|---|---|
Symbol | | | | | | |
AAPL | 80.0 | 95.0 | 110.0 | 50.0 | 120.0 | 140.0 |
GE | NaN | NaN | NaN | 100.0 | 30.0 | 40.0 |
IBM | NaN | NaN | NaN | 87.0 | 75.0 | 95.0 |
SLB | NaN | NaN | NaN | 20.0 | 55.0 | 85.0 |
TSLA | 50.0 | 80.0 | 130.0 | 100.0 | 100.0 | 300.0 |
TXN | NaN | NaN | NaN | 500.0 | 15.0 | 23.0 |
WMT | 40.0 | 55.0 | 70.0 | NaN | NaN | NaN |
2.3 merge
Take a look at the two small datasets, food_prices and food_transactions:
food_prices, food_transactions = food_tables
food_prices
| | item | store | price | Date |
---|---|---|---|---|
0 | pear | A | 0.99 | 2017 |
1 | pear | B | 1.99 | 2017 |
2 | peach | A | 2.99 | 2017 |
3 | peach | B | 3.49 | 2017 |
4 | banana | A | 0.39 | 2017 |
5 | banana | B | 0.49 | 2017 |
6 | steak | A | 5.99 | 2017 |
7 | steak | B | 6.99 | 2017 |
8 | steak | B | 4.99 | 2015 |
food_transactions
| | custid | item | store | quantity |
---|---|---|---|---|
0 | 1 | pear | A | 5 |
1 | 1 | banana | A | 10 |
2 | 2 | steak | B | 3 |
3 | 2 | pear | B | 1 |
4 | 2 | peach | B | 2 |
5 | 2 | steak | B | 1 |
6 | 2 | coconut | B | 4 |
Merge food_transactions with food_prices on the keys item and store:
food_transactions.merge(food_prices, on=['item', 'store'])
| | custid | item | store | quantity | price | Date |
---|---|---|---|---|---|---|
0 | 1 | pear | A | 5 | 0.99 | 2017 |
1 | 1 | banana | A | 10 | 0.39 | 2017 |
2 | 2 | steak | B | 3 | 6.99 | 2017 |
3 | 2 | steak | B | 3 | 4.99 | 2015 |
4 | 2 | steak | B | 1 | 6.99 | 2017 |
5 | 2 | steak | B | 1 | 4.99 | 2015 |
6 | 2 | pear | B | 1 | 1.99 | 2017 |
7 | 2 | peach | B | 2 | 3.49 | 2017 |
Because steak at store B appears twice in each table, the merge forms their Cartesian product, producing four steak rows in the result.
Because coconut has no matching price row, it is dropped from this (default inner-join) result.
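When such a Cartesian blow-up is unintended, merge's validate parameter can assert the expected key relationship up front; a minimal sketch:
# Assert that each (item, store) pair matches at most one price row;
# the duplicated 2015/2017 steak prices make this raise a MergeError.
try:
    food_transactions.merge(food_prices, on=['item', 'store'],
                            validate='many_to_one')
except pd.errors.MergeError as e:
    print(e)  # merge keys are not unique in the right dataset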
Below, only the 2017 prices are merged. A left join keeps coconut this time; its missing price and Date become NaN, which also forces the Date column to float:
food_transactions.merge(food_prices.query('Date == 2017'), how='left')
| | custid | item | store | quantity | price | Date |
---|---|---|---|---|---|---|
0 | 1 | pear | A | 5 | 0.99 | 2017.0 |
1 | 1 | banana | A | 10 | 0.39 | 2017.0 |
2 | 2 | steak | B | 3 | 6.99 | 2017.0 |
3 | 2 | pear | B | 1 | 1.99 | 2017.0 |
4 | 2 | peach | B | 2 | 3.49 | 2017.0 |
5 | 2 | steak | B | 1 | 6.99 | 2017.0 |
6 | 2 | coconut | B | 4 | NaN | NaN |
2.4 Differences between concat, join, and merge
concat:
- A pandas module-level function
- Can concatenate two or more pandas objects vertically or horizontally
- Aligns only on the index
- Raises an error when the index being aligned on contains duplicates (rather than forming a Cartesian product)
- Defaults to an outer join (inner is also available)
- The only pandas function that can concatenate DataFrames vertically
join:
- A DataFrame method
- Can only connect two or more pandas objects horizontally
- Aligns the calling DataFrame's columns or row index with the other objects' row index (never their columns)
- Handles duplicate index values by forming their Cartesian product
- Defaults to a left join (inner, outer, and right are also available)
merge:
- A DataFrame method
- Can only connect two DataFrame objects horizontally
- Aligns the calling DataFrame's columns or row index with the other DataFrame's columns or row index
- Handles duplicate keys by forming their Cartesian product
- Defaults to an inner join (left, outer, and right are also available); the sketch below shows the same side-by-side combination done with all three tools
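A minimal sketch contrasting the three on the stocks tables above (the suffix and key labels are illustrative; the column layouts differ, but all three align on the Symbol index):
# concat: function, index-aligned, outer join by default
c = pd.concat([stocks_2016, stocks_2017], axis='columns',
              keys=['2016', '2017'], sort=True)

# join: method, index-on-index, left join by default (how='outer' to match concat)
j = stocks_2016.join(stocks_2017, lsuffix='_2016', rsuffix='_2017', how='outer')

# merge: method, inner join by default; here merging on the index from both sides
m = stocks_2016.merge(stocks_2017, left_index=True, right_index=True,
                      how='outer', suffixes=('_2016', '_2017'))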
3 Connecting to SQL databases
3.1 Creating a SQLAlchemy engine
Before reading from the chinook database, a SQLAlchemy engine must be created:
from sqlalchemy import create_engine
engine = create_engine('sqlite:///data/chinook.db')
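One quick way to sanity-check the connection is to list the tables the engine can see, using SQLAlchemy's inspector; a minimal sketch:
from sqlalchemy import inspect

inspect(engine).get_table_names()  # e.g. ['albums', 'artists', 'customers', ...]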
3.2 The read_sql_table function
read_sql_table reads an entire table: the first argument is the table name and the second is the engine; it returns a DataFrame:
tracks = pd.read_sql_table('tracks', engine)
tracks.iloc[:5,:5]
| | TrackId | Name | AlbumId | MediaTypeId | GenreId |
---|---|---|---|---|---|
0 | 1 | For Those About To Rock (We Salute You) | 1 | 1 | 1 |
1 | 2 | Balls to the Wall | 2 | 2 | 1 |
2 | 3 | Fast As a Shark | 3 | 2 | 1 |
3 | 4 | Restless and Wild | 3 | 2 | 1 |
4 | 5 | Princess of the Dawn | 3 | 2 | 1 |
genres = pd.read_sql_table('genres', engine)
genres.head()
| | GenreId | Name |
---|---|---|
0 | 1 | Rock |
1 | 2 | Jazz |
2 | 3 | Metal |
3 | 4 | Alternative & Punk |
4 | 5 | Rock And Roll |
Find the average song length for each genre:
genre_track = genres.merge(
tracks[['GenreId', 'Milliseconds']],
on='GenreId',
how='left').drop('GenreId', axis='columns')
genre_track.head()
| | Name | Milliseconds |
---|---|---|
0 | Rock | 343719 |
1 | Rock | 342562 |
2 | Rock | 230619 |
3 | Rock | 252051 |
4 | Rock | 375418 |
Average Milliseconds per genre and convert the result to the timedelta data type:
genre_time = genre_track.groupby('Name')['Milliseconds'].mean()
pd.to_timedelta(genre_time, unit='ms').dt.floor('s').sort_values()[:10]
Name
Rock And Roll 00:02:14
Opera 00:02:54
Hip Hop/Rap 00:02:58
Easy Listening 00:03:09
Bossa Nova 00:03:39
R&B/Soul 00:03:40
World 00:03:44
Pop 00:03:49
Latin 00:03:52
Alternative & Punk 00:03:54
Name: Milliseconds, dtype: timedelta64[ns]
Find the total amount spent by each customer:
cust = pd.read_sql_table('customers', engine, columns=['CustomerId', 'FirstName', 'LastName'])
invoice = pd.read_sql_table('invoices', engine, columns=['InvoiceId','CustomerId'])
ii = pd.read_sql_table('invoice_items', engine, columns=['InvoiceId', 'UnitPrice', 'Quantity'])
cust_inv = cust.merge(invoice, on='CustomerId').merge(ii, on='InvoiceId')
cust_inv.head()
| | CustomerId | FirstName | LastName | InvoiceId | UnitPrice | Quantity |
---|---|---|---|---|---|---|
0 | 1 | Luís | Gonçalves | 98 | 1.99 | 1 |
1 | 1 | Luís | Gonçalves | 98 | 1.99 | 1 |
2 | 1 | Luís | Gonçalves | 121 | 0.99 | 1 |
3 | 1 | Luís | Gonçalves | 121 | 0.99 | 1 |
4 | 1 | Luís | Gonçalves | 121 | 0.99 | 1 |
Now multiply quantity by unit price to get each customer's total spending:
total = cust_inv['Quantity'] * cust_inv['UnitPrice']
cols = ['CustomerId', 'FirstName', 'LastName']
cust_inv.assign(Total = total).groupby(cols)['Total'].sum().sort_values(ascending=False).head()
CustomerId FirstName LastName
6 Helena Holý 49.62
26 Richard Cunningham 47.62
57 Luis Rojas 46.62
46 Hugh O'Reilly 45.62
45 Ladislav Kovács 45.62
Name: Total, dtype: float64
3.3 The read_sql_query function
pd.read_sql_query('select * from tracks limit 5', engine).iloc[:,:5]
| | TrackId | Name | AlbumId | MediaTypeId | GenreId |
---|---|---|---|---|---|
0 | 1 | For Those About To Rock (We Salute You) | 1 | 1 | 1 |
1 | 2 | Balls to the Wall | 2 | 2 | 1 |
2 | 3 | Fast As a Shark | 3 | 2 | 1 |
3 | 4 | Restless and Wild | 3 | 2 | 1 |
4 | 5 | Princess of the Dawn | 3 | 2 | 1 |
Long query strings can also be passed to read_sql_query:
sql_string1 = '''
select
Name,
time(avg(Milliseconds) / 1000, 'unixepoch') as avg_time
from (
select
g.Name,
t.Milliseconds
from
genres as g
join
tracks as t
on
g.genreid == t.genreid
)
group by
Name
order by
avg_time
'''
pd.read_sql_query(sql_string1, engine)[:10]
| | Name | avg_time |
---|---|---|
0 | Rock And Roll | 00:02:14 |
1 | Opera | 00:02:54 |
2 | Hip Hop/Rap | 00:02:58 |
3 | Easy Listening | 00:03:09 |
4 | Bossa Nova | 00:03:39 |
5 | R&B/Soul | 00:03:40 |
6 | World | 00:03:44 |
7 | Pop | 00:03:49 |
8 | Latin | 00:03:52 |
9 | Alternative & Punk | 00:03:54 |
sql_string2 = '''
select
c.customerid,
c.FirstName,
c.LastName,
sum(ii.quantity * ii.unitprice) as Total
from
customers as c
join
invoices as i
on c.customerid = i.customerid
join
invoice_items as ii
on i.invoiceid = ii.invoiceid
group by
c.customerid, c.FirstName, c.LastName
order by
Total desc
'''
pd.read_sql_query(sql_string2, engine)[:10]
| | CustomerId | FirstName | LastName | Total |
---|---|---|---|---|
0 | 6 | Helena | Holý | 49.62 |
1 | 26 | Richard | Cunningham | 47.62 |
2 | 57 | Luis | Rojas | 46.62 |
3 | 45 | Ladislav | Kovács | 45.62 |
4 | 46 | Hugh | O'Reilly | 45.62 |
5 | 37 | Fynn | Zimmermann | 43.62 |
6 | 24 | Frank | Ralston | 43.62 |
7 | 28 | Julia | Barnett | 43.62 |
8 | 25 | Victor | Stevens | 42.62 |
9 | 7 | Astrid | Gruber | 42.62 |
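read_sql_query also accepts bind parameters through its params argument, which is safer than interpolating values into the SQL string. The placeholder style is the driver's own; for SQLite that is the qmark style. A minimal sketch (exact behavior can vary slightly across pandas/SQLAlchemy versions; the Rock filter is illustrative):
# Fetch a few tracks from a single genre using a bound parameter.
sql_string3 = '''
select t.Name, t.Milliseconds
from tracks as t
join genres as g on g.GenreId = t.GenreId
where g.Name = ?
limit 5
'''
pd.read_sql_query(sql_string3, engine, params=('Rock',))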