一、字符串操作
Python内置的字符串操作和re正则模块可以解决很多场景下的字符串操作需求。但是在数据分析过程中,它们有时候比较尴尬,比如:
In [143]: dic= {'one':'feixue', 'two':np.nan, 'three':'tom', 'five':'jerry@film'}
In [144]: s = pd.Series(dic)
In [145]: s
Out[145]:
one feixue
two NaN
three tom
five jerry@film
dtype: object
现在想将s中的字母都大写,通过Python内置字符串方法,可能会这么设计:
In [159]: s.map(lambda x : x.upper())
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
但是,弹出了异常,原因是数据有一个缺失值NaN,这个值不是字符串,没有upper方法。
那怎么办呢?Pandas为这一类整体性的操作,提供了专门的字符串函数,跳过缺失值等异常情况,对能够进行操作的每个元素进行处理:
In [160]: s.str.upper()
Out[160]:
one FEIXUE
two NaN
three TOM
five JERRY@FILM
dtype: object
这就是Series的str属性,在它的基础上甚至可以使用正则表达式的函数。
下面是部分可用的Series.str的字符串操作方法,名字基本和Python字符串内置方法相同:
- cat :粘合字符串
- contains:是否包含的判断
- count:计数
- extract:返回匹配的字符串组
- endswith:以xx结尾判断
- startswith:以xx开始判断
- findall:查找
- get:获取
- isalnum:类型判断
- isalpha:类型判断
- isdecimal:类型判断
- isdigit:类型判断
- islower:是否小写
- isnumeric:类型判断
- isupper:是否大写
- join:连接
- len:长度
- lower:小写
- upper:大写
- match:匹配
- pad:将空白加到字符串的左边、右边或者两边
- center:居中
- repeat:重复
- replace:替换
- slice:切片
- split:分割
- strip:脱除
- lstrip:左脱除
- rstrip:右脱除
二、合并连接
可以通过多种方式将Pandas对象联合到一起:
- pandas.merge: 根据一个或多个键进行连接。类似SQL的连接操作
- pandas.concat:使对象在轴向上进行粘合或者‘堆叠’
- combine_first:将重叠的数据拼接在一起,使用一个对象中的值填充另一个对象中的缺失值
1.merge连接
merge方法将两个pandas对象连接在一起,类似SQL的连接操作。默认情况下,它执行的是内连接,也就是两个对象的交集。通过参数how,还可以指定外连接、左连接和右连接。参数on指定在哪个键上连接,参数left_on
和right_on
分别指定左右对象的连接键。
- 外连接:并集
- 内连接:交集
- 左连接:左边对象全部保留
- 右连接:右边对象全部保留
In [23]: df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
...: 'data1': range(7)})
...:
In [24]: df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
...: 'data2': range(3)})
...:
In [25]: df1
Out[25]:
key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6
In [26]: df2
Out[26]:
key data2
0 a 0
1 b 1
2 d 2
In [27]: pd.merge(df1,df2) # 默认内链接,并智能地查找连接的键
Out[27]:
key data1 data2
0 b 0 1
1 b 1 1
2 b 6 1
3 a 2 0
4 a 4 0
5 a 5 0
In [28]: pd.merge(df1,df2,on='key') # 最好是显式地指定连接的键
Out[28]:
key data1 data2
0 b 0 1
1 b 1 1
2 b 6 1
3 a 2 0
4 a 4 0
5 a 5 0
In [30]: pd.merge(df1, df2, how='outer') # 外连接
Out[30]:
key data1 data2
0 b 0.0 1.0
1 b 1.0 1.0
2 b 6.0 1.0
3 a 2.0 0.0
4 a 4.0 0.0
5 a 5.0 0.0
6 c 3.0 NaN
7 d NaN 2.0
In [31]: pd.merge(df1, df2, how='left') # 左连接
Out[31]:
key data1 data2
0 b 0 1.0
1 b 1 1.0
2 a 2 0.0
3 c 3 NaN
4 a 4 0.0
5 a 5 0.0
6 b 6 1.0
In [32]: pd.merge(df1, df2, how='right') #右连接
Out[32]:
key data1 data2
0 b 0.0 1
1 b 1.0 1
2 b 6.0 1
3 a 2.0 0
4 a 4.0 0
5 a 5.0 0
6 d NaN 2
In [33]: df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
...: 'data1': range(7)})
...:
In [34]: df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],
...: 'data2': range(3)})
...:
In [35]: pd.merge(df3, df4, left_on='lkey', right_on='rkey') # 指定两边的键
Out[35]:
lkey data1 rkey data2
0 b 0 b 1
1 b 1 b 1
2 b 6 b 1
3 a 2 a 0
4 a 4 a 0
5 a 5 a 0
多对多的merge连接是行的笛卡儿积。比如左边如果有3个‘b’行,右边有2个‘b’行,那么结果是3x2,6个‘b’行。
也可以同时指定多个键进行连接:
In [36]: left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
...: 'key2': ['one', 'two', 'one'],
...: 'lval': [1, 2, 3]})
In [37]: right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
...: 'key2': ['one', 'one', 'one', 'two'],
...: 'rval': [4, 5, 6, 7]})
In [38]: pd.merge(left, right, on=['key1', 'key2'], how='outer')
Out[38]:
key1 key2 lval rval
0 foo one 1.0 4.0
1 foo one 1.0 5.0
2 foo two 2.0 NaN
3 bar one 3.0 6.0
4 bar two NaN 7.0
merge操作中还有一个重叠列名的问题,比如上面的left和right两个数据,为此,我们可以使用suffixes参数,手动指定为重复的列名添加后缀:
In [41]: pd.merge(left, right, on='key1')
Out[41]:
key1 key2_x lval key2_y rval
0 foo one 1 one 4
1 foo one 1 one 5
2 foo two 2 one 4
3 foo two 2 one 5
4 bar one 3 one 6
5 bar one 3 two 7
In [42]: pd.merge(left, right, on='key1', suffixes=('_left', '_right'))
Out[42]:
key1 key2_left lval key2_right rval
0 foo one 1 one 4
1 foo one 1 one 5
2 foo two 2 one 4
3 foo two 2 one 5
4 bar one 3 one 6
5 bar one 3 two 7
有时候,用于merge合并的键可能是某个对象的行索引:
In [43]: left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
...: 'value': range(6)})
...:
In [44]: right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
In [45]: left1
Out[45]:
key value
0 a 0
1 b 1
2 a 2
3 a 3
4 b 4
5 c 5
In [46]: right1
Out[46]:
group_val
a 3.5
b 7.0
In [47]: pd.merge(left1, right1, left_on='key', right_index=True)
Out[47]:
key value group_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 7.0
4 b 4 7.0
In [48]: pd.merge(left1, right1, left_on='key', right_index=True, how='outer')
Out[48]:
key value group_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 7.0
4 b 4 7.0
5 c 5 NaN
使用right_index=True
参数显式地指出,右边的对象right1使用它的行索引作为连接的键。
事实上Pandas有一个join方法,可以帮助我们直接用行索引进行连接:
In [49]: left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
...: index=['a', 'c', 'e'],
...: columns=['Ohio', 'Nevada'])
...:
In [50]: right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
...: index=['b', 'c', 'd', 'e'],
...: columns=['Missouri', 'Alabama'])
...:
In [51]: left2
Out[51]:
Ohio Nevada
a 1.0 2.0
c 3.0 4.0
e 5.0 6.0
In [52]: right2
Out[52]:
Missouri Alabama
b 7.0 8.0
c 9.0 10.0
d 11.0 12.0
e 13.0 14.0
In [53]: pd.merge(left2, right2, how='outer', left_index=True, right_index=True)
Out[53]:
Ohio Nevada Missouri Alabama
a 1.0 2.0 NaN NaN
b NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0
d NaN NaN 11.0 12.0
e 5.0 6.0 13.0 14.0
In [54]: left2.join(right2, how='outer') # 与上面的操作效果一样
Out[54]:
Ohio Nevada Missouri Alabama
a 1.0 2.0 NaN NaN
b NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0
d NaN NaN 11.0 12.0
e 5.0 6.0 13.0 14.0
2.concat轴向连接
concat方法可以实现对象在轴向的的粘合或者堆叠。
In [55]: s1 = pd.Series([0, 1], index=['a', 'b'])
In [56]: s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
In [57]: s3 = pd.Series([5, 6], index=['f', 'g'])
In [58]: pd.concat([s1, s2, s3]) # 要以列表的方式提供参数
Out[58]:
a 0
b 1
c 2
d 3
e 4
f 5
g 6
dtype: int64
In [59]: pd.concat([s1, s2, s3], axis=1) # 横向堆叠,但出现警告信息
C:ProgramDataAnaconda3Scriptsipython:1: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
......
Out[59]:
0 1 2
a 0.0 NaN NaN
b 1.0 NaN NaN
c NaN 2.0 NaN
d NaN 3.0 NaN
e NaN 4.0 NaN
f NaN NaN 5.0
g NaN NaN 6.0
In [60]: pd.concat([s1, s2, s3], axis=1,sort=True) # 按人家的要求做
Out[60]:
0 1 2
a 0.0 NaN NaN
b 1.0 NaN NaN
c NaN 2.0 NaN
d NaN 3.0 NaN
e NaN 4.0 NaN
f NaN NaN 5.0
g NaN NaN 6.0
对于DataFrame,默认情况下都是按行往下合并的,当然也可以设置axis参数:
In [66]: df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
...: columns=['one', 'two'])
...:
In [67]: df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
...: columns=['three', 'four'])
...:
In [68]: df1
Out[68]:
one two
a 0 1
b 2 3
c 4 5
In [69]: df2
Out[69]:
three four
a 5 6
c 7 8
In [71]: pd.concat([df1, df2], sort=True)
Out[71]:
four one three two
a NaN 0.0 NaN 1.0
b NaN 2.0 NaN 3.0
c NaN 4.0 NaN 5.0
a 6.0 NaN 5.0 NaN
c 8.0 NaN 7.0 NaN
In [72]: pd.concat([df1, df2], axis=1, sort=True)
Out[72]:
one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0
3.combine_first联合叠加
有这么种场景,某个对象里缺失的值,拿另外一个对象的相应位置的值来填补。在Numpy层面,可以这么做:
In [74]: a = pd.Series([np.nan, 2.5, 0, 3.5, 4.5, np.nan],
...: index=['f', 'e', 'd', 'c', 'b', 'a'])
In [75]: b = pd.Series([0, np.nan, 2.1, np.nan, np.nan, 5], index=list('abcdef'))
In [76]: a
Out[76]:
f NaN
e 2.5
d 0.0
c 3.5
b 4.5
a NaN
dtype: float64
In [77]: b
Out[77]:
a 0.0
b NaN
c 2.1
d NaN
e NaN
f 5.0
dtype: float64
In [78]: np.where(pd.isnull(a), b, a)
Out[78]: array([0. , 2.5, 0. , 3.5, 4.5, 5. ])
np.where(pd.isnull(a), b, a)
,这一句里,首先去pd.isnull(a)
种判断元素,如果是True,从b里拿数据,否则从a里拿,得到最终结果。
实际上,Pandas为这种场景提供了一个专门的combine_first
方法:
In [80]: b.combine_first(a)
Out[80]:
a 0.0
b 4.5
c 2.1
d 0.0
e 2.5
f 5.0
dtype: float64
对于DataFrame对象,combine_first
逐列做相同的操作,因此你可以认为它是根据你传入的对象来‘修补’调用对象的缺失值。
In [81]: df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
...: 'b': [np.nan, 2., np.nan, 6.],
...: 'c': range(2, 18, 4)})
...:
In [82]: df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
...: 'b': [np.nan, 3., 4., 6., 8.]})
...:
In [83]: df1
Out[83]:
a b c
0 1.0 NaN 2
1 NaN 2.0 6
2 5.0 NaN 10
3 NaN 6.0 14
In [84]: df2
Out[84]:
a b
0 5.0 NaN
1 4.0 3.0
2 NaN 4.0
3 3.0 6.0
4 7.0 8.0
In [85]: df1.combine_first(df2)
Out[85]:
a b c
0 1.0 NaN 2.0
1 4.0 2.0 6.0
2 5.0 4.0 10.0
3 3.0 6.0 14.0
4 7.0 8.0 NaN
三、重塑
对表格型数据进行重新排列的操作,被称作重塑。
使用多层索引进行重塑主要有stack和unstack操作,前面有介绍过。
In [93]: df = pd.DataFrame(np.arange(6).reshape(2,3),
...: index=pd.Index(['河南','山西'], name='省份'),
...: columns=pd.Index(['one','two','three'],name='number'))
In [94]: df
Out[94]:
number one two three
省份
河南 0 1 2
山西 3 4 5
In [95]: result = df.stack()
In [96]: result
Out[96]:
省份 number
河南 one 0
two 1
three 2
山西 one 3
two 4
three 5
dtype: int32
In [97]: result.unstack()
Out[97]:
number one two three
省份
河南 0 1 2
山西 3 4 5
stack操作使得df的所有列都变成了分层行索引,产生了一个新的Series。
unstack默认情况下拆分最内层索引,然后将数据放入一个DataFrame中。可以传入一个层级序号或名称来拆分不同的层级。
In [98]: result.unstack(0)
Out[98]:
省份 河南 山西
number
one 0 3
two 1 4
three 2 5
In [99]: result.unstack('省份')
Out[99]:
省份 河南 山西
number
one 0 3
two 1 4
three 2 5
如果层级中的所有值并未有包含于每个子分组中,拆分可能会导致缺失值的产生:
In [100]: s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
In [101]: s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
In [102]: data2 = pd.concat([s1, s2], keys=['one', 'two'])
In [103]: data2 # 注意concat的结果是一个分层索引
Out[103]:
one a 0
b 1
c 2
d 3
two c 4
d 5
e 6
dtype: int64
In [104]: data2.unstack()
Out[104]:
a b c d e
one 0.0 1.0 2.0 3.0 NaN
two NaN NaN 4.0 5.0 6.0
In [105]: data2.unstack().stack() # 结果是可逆的
Out[105]:
one a 0.0
b 1.0
c 2.0
d 3.0
two c 4.0
d 5.0
e 6.0
dtype: float64
In [106]: data2.unstack().stack(dropna=False) # 保留缺失值
Out[106]:
one a 0.0
b 1.0
c 2.0
d 3.0
e NaN
two a NaN
b NaN
c 4.0
d 5.0
e 6.0
dtype: float64
而在DataFrame对象拆堆时,被拆的层级会变成结果中最低的层级:
In [107]: df = pd.DataFrame({'left': result, 'right': result + 5},
...: columns=pd.Index(['left', 'right'], name='side'))
...:
In [108]: df
Out[108]:
side left right
省份 number
河南 one 0 5
two 1 6
three 2 7
山西 one 3 8
two 4 9
three 5 10
In [109]: df.unstack('省份') # 因为作死引入了中文,所以版式不太对齐
Out[109]:
side left right
省份 河南 山西 河南 山西
number
one 0 3 5 8
two 1 4 6 9
three 2 5 7 10