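1. Series and DataFrame
from pandas import Series, DataFrame
import pandas as pd
A Series is the simplest structure to create: a one-dimensional array of data plus an associated array of labels, its index. A plain list of values is enough: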
obj = Series([4,7,-5,3])
obj
0    4
1    7
2   -5
3    3
dtype: int64
A Series exposes its data through the values attribute and its labels through the index attribute:
obj.values
array([ 4, 7, -5, 3], dtype=int64)
obj.index
RangeIndex(start=0, stop=4, step=1)
Alternatively, supply your own labels through the index argument:
obj2 = Series([4,7,-5,3],index=['d','b','a','c'])
obj2
d    4
b    7
a   -5
c    3
dtype: int64
obj2.index
Index(['d', 'b', 'a', 'c'], dtype='object')
Look up values by index label:
obj2['a']
-5
obj2[['c','a','d']]  # pass a list of labels to select several values
c    3
a   -5
d    4
dtype: int64
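As a small sketch of label-based selection (the Series is rebuilt so the snippet runs on its own, and the label 'z' is a made-up example of a label that does not exist), .loc makes the label-based intent explicit, while reindex tolerates unknown labels by filling them with NaN:

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# .loc is the explicit, label-based equivalent of obj2[['c', 'a', 'd']]
subset = obj2.loc[['c', 'a', 'd']]

# reindex accepts labels that are absent from the index and fills them with NaN
# ('z' is a hypothetical label used only for illustration)
with_missing = obj2.reindex(['a', 'z'])

print(subset)
print(with_missing)
```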
NumPy-style array operations preserve the link between index labels and values:
obj2
d    4
b    7
a   -5
c    3
dtype: int64
# Filter with a boolean array
obj2 > 0
d     True
b     True
a    False
c     True
dtype: bool
obj2[obj2 > 0]
d    4
b    7
c    3
dtype: int64
# Scalar multiplication
obj2 * 2
d     8
b    14
a   -10
c     6
dtype: int64
# Math functions
import numpy as np
np.exp(obj2)
d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64
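The following sketch, which assumes nothing beyond NumPy and pandas, checks the point above directly: each operation returns a Series whose labels still line up with the original values.

```python
import numpy as np
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

positive = s[s > 0]    # boolean mask keeps only the matching labels
doubled = s * 2        # scalar multiplication is applied element-wise
absolute = np.abs(s)   # a NumPy ufunc returns a Series, not a bare array

print(positive)
print(doubled)
print(type(absolute).__name__, list(absolute.index))
```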
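A Series can also be thought of as a fixed-length, ordered dict, since it maps index labels to data values. The in operator tests for a label, not a value: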
'b' in obj2
True
'e' in obj2
False
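To make the dict analogy concrete, here is a minimal sketch. It shows that membership tests look at the labels rather than the values, and that to_dict converts a Series back into a plain dict:

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

label_present = 'b' in s        # checks the index labels
value_as_label = 7 in s         # 7 is a value, not a label
value_present = 7 in s.values   # search the underlying array instead

as_dict = s.to_dict()           # label -> value mapping as a plain dict

print(label_present, value_as_label, value_present)
print(as_dict)
```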
If the data is already in a dict, you can create a Series directly from it:
# The dict keys become the Series index
sdata = {'Ohio':35000,'Texas':71000,'Oregon':16000,'Utah':5000}
obj3 = Series(sdata)
obj3
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
Pass an index to control which labels appear and in what order:
states = ['California','Ohio','Oregon','Texas']
obj4 = Series(sdata,index=states)
obj4
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
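The three values in sdata whose keys match labels in states are placed at the corresponding positions. California has no entry in sdata, so its value is NaN, which marks missing data. Utah is not listed in states, so it is left out of the result. Because NaN is a floating-point value, the dtype changes from int64 to float64.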
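The behaviour described above can be verified with a short sketch that reuses the same sdata and states values:

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']

obj3 = pd.Series(sdata)                 # integer data, labels taken from the keys
obj4 = pd.Series(sdata, index=states)   # labels taken from states instead

print(obj3.dtype, obj4.dtype)           # the NaN forces a float dtype in obj4
print('Utah' in obj4)                   # Utah was not requested, so it is absent
print(pd.isna(obj4['California']))      # California has no source value
```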
The pandas functions isnull and notnull detect missing data. They are also available as Series methods:
pd.isnull(obj4)
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
pd.notnull(obj4)
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
obj4.isnull()
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
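A common follow-up, sketched here as an illustration rather than as part of the original notes, is to count the missing entries or drop them. Summing a boolean mask counts its True values, and dropna returns a new Series without the NaN rows:

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj4 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

missing_count = int(obj4.isnull().sum())   # True counts as 1 when summed
complete = obj4.dropna()                   # new Series; obj4 itself is unchanged

print(missing_count)
print(complete)
```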
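One of the most important features of a Series is that arithmetic operations automatically align data by index label.
The result's index is the union of the two indexes. Only labels that have a value on both sides produce a number; a label that is missing on either side becomes NaN:
obj3
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
obj4
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
obj3 + obj4
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64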
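If NaN for one-sided labels is not what you want, the add method with fill_value is one option. The sketch below is an illustration, not something the original notes cover. A label present on only one side is treated as 0 on the other, while a label that is missing on both sides still yields NaN:

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

plain = obj3 + obj4                    # one-sided labels become NaN
filled = obj3.add(obj4, fill_value=0)  # one-sided labels are treated as 0

print(plain)
print(filled)
```

Here Utah exists only in obj3, so filled keeps its value, whereas California is NaN in obj4 and absent from obj3, so it stays NaN in both results.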
Both the Series object and its index have a name attribute:
obj4.name = 'population'
obj4.index.name = 'state'
obj4
state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64
The index of a Series can be replaced in place by assignment:
obj.index = ['Bob','Steve','Jeff','Ryan']
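One detail worth knowing, shown in this small sketch: the new index must have the same length as the Series. Assigning a shorter list raises a ValueError and leaves the existing index untouched:

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']   # same length: accepted

length_error = False
try:
    obj.index = ['Bob', 'Steve']               # wrong length: rejected
except ValueError as exc:
    length_error = True
    print('rejected:', exc)

print(obj)
```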
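DataFrame:
Create a DataFrame from a dict of equal-length lists:
data = {'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
        'year':[2000,2001,2002,2001,2002],
        'pop':[1.5,1.7,3.6,2.4,2.9]}
frame = DataFrame(data)
frame
The dict becomes a DataFrame: its keys become the column names, and a default integer index is added automatically: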
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
Pass columns to set the column order:
DataFrame(data,columns=['year','state','pop'])
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9
If a requested column is not present in the data, it is filled with NaN:
frame2 = DataFrame(data,columns=['year','state','pop','debt'],
                   index=['one','two','three','four','five'])
frame2
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
frame2.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')
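The columns argument works as a selection as well as an ordering. In this sketch, which uses a shortened copy of the data above, a dict key that is not listed in columns is dropped, and a listed name that is not a dict key becomes an all-NaN column:

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Nevada'],
        'year': [2000, 2001, 2001],
        'pop': [1.5, 1.7, 2.4]}

# 'state' and 'pop' are not requested, 'debt' does not exist in data
frame = pd.DataFrame(data, columns=['year', 'debt'])

print(frame)
print(list(frame.columns))
```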
Retrieve a column as a Series with dict-like notation:
frame2['state']
one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
Name: state, dtype: object
Or retrieve it by attribute access:
frame2.year
one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64
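Attribute access is convenient but has a limitation that the data in these notes happens to trigger: it only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute. The column named pop collides with the DataFrame.pop method, as this sketch shows, so bracket notation is the safer choice:

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001], 'pop': [1.5, 1.7]})

by_attr = frame.year            # works: 'year' is not a DataFrame attribute
by_key = frame['pop']           # works for any column name
pop_attr = frame.pop            # this is the DataFrame.pop method, not the column

print(by_attr.equals(frame['year']))
print(callable(pop_attr))
print(type(by_key).__name__)
```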
Retrieve a row by label with loc:
frame2.loc['three']
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
Modify a column by assignment, using either a scalar or an array:
frame2['debt'] = 16.5
frame2
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
frame2['debt'] = np.arange(5.)
frame2
       year   state  pop  debt
one    2000    Ohio  1.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
four   2001  Nevada  2.4   3.0
five   2002  Nevada  2.9   4.0
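When you assign a list or array to a DataFrame column, its length must match the length of the DataFrame. If you assign a Series instead, its labels are aligned with the DataFrame's index and any gaps are filled with NaN:
val = Series([-1.2,-1.5,-1.7],index=['two','four','five'])
frame2['debt'] = val
frame2
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7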
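The two assignment rules can be contrasted in a short sketch. The frame is a reduced version of frame2, and the column name 'bad' is a hypothetical name used only to trigger the error: a Series is aligned by label, whereas a list of the wrong length is rejected with a ValueError.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001, 2002]},
                     index=['one', 'two', 'three'])

# Series assignment: matched by label, unmatched rows become NaN
frame['debt'] = pd.Series([-1.2, -1.7], index=['two', 'three'])

# List assignment: matched by position, so the length must be exactly 3
length_error = False
try:
    frame['bad'] = [1, 2]        # 'bad' is a made-up column name
except ValueError as exc:
    length_error = True
    print('rejected:', exc)

print(frame)
```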
Assigning to a column that does not exist creates a new column:
frame2['eastern'] = frame2.state == 'Ohio'
frame2
       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False
Delete a column with del:
del frame2['eastern']
frame2.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')
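del modifies the DataFrame in place. When the original should stay intact, the drop method is an alternative, sketched here on a small stand-in frame: it returns a new DataFrame and leaves the source unchanged.

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})
frame['eastern'] = frame['state'] == 'Ohio'

trimmed = frame.drop(columns=['eastern'])        # returns a new DataFrame
kept_in_source = 'eastern' in frame.columns      # the source still has the column

del frame['eastern']                             # removes the column in place

print(list(trimmed.columns), kept_in_source, list(frame.columns))
```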
Nested dict (a dict of dicts):
pop = {'Nevada':{2001:2.4,2002:2.9},
       'Ohio':{2000:1.5,2001:1.7,2002:3.6}}
frame3 = DataFrame(pop)
frame3
Nevada Ohio
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
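The outer dict keys become the columns, and the inner dict keys become the row index. Rows that an inner dict does not cover are filled with NaN.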
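The mapping from nested keys to rows and columns can be checked with the sketch below. It calls sort_index explicitly, on the assumption that the row order produced by the constructor may differ between pandas versions, so the printed result does not depend on that detail:

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

# Outer keys -> columns, inner keys -> row labels (union of all inner keys)
frame3 = pd.DataFrame(pop).sort_index()

print(frame3)
print(frame3.T.shape)   # transposing swaps rows and columns
```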
Transpose:
frame3.T
Specify the index explicitly:
DataFrame(pop,index=[2001,2002,2003])
A dict of Series:
pdata = {'Ohio':frame3['Ohio'][:-1],
         'Nevada':frame3['Nevada'][:2]}
DataFrame(pdata)
      Ohio  Nevada
2000   1.5     NaN
2001   1.7     2.4
where:
frame3['Ohio']
2000 1.5
2001 1.7
2002 3.6
Name: Ohio, dtype: float64
Set the name attributes of a DataFrame's index and columns; they are then shown in the output:
frame3.index.name = 'year'
frame3.columns.name = 'state'
frame3
state Nevada Ohio
year
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
The values attribute returns the data as a two-dimensional ndarray:
frame3.values
array([[nan, 1.5],
       [2.4, 1.7],
       [2.9, 3.6]])
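The dtype of the returned array depends on the columns. As a sketch with two small stand-in frames: when every column is numeric the array is float64, and when numeric and string columns are mixed, pandas falls back to object so that one array can hold all the values.

```python
import pandas as pd

numeric = pd.DataFrame({'Nevada': [2.4, 2.9], 'Ohio': [1.7, 3.6]})
mixed = pd.DataFrame({'year': [2001, 2002], 'state': ['Ohio', 'Nevada']})

numeric_values = numeric.values   # homogeneous floats
mixed_values = mixed.values       # integers and strings share one array

print(numeric_values.dtype, numeric_values.shape)
print(mixed_values.dtype, mixed_values.shape)
```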
2019.10.30 14:38:04