import numpy as np
import pandas as pd
This section will walk you through the fundamental mechanics of interacting with the data contained in a Series or DataFrame.
In the chapters to come, we will delve more deeply into data analysis and manipulation topics using pandas. This book is not intended to serve as exhaustive documentation for the pandas library; instead, we'll focus on the most important features, leaving the less common (i.e., more esoteric) things for you to explore on your own.
Reindexing
An important method on pandas objects is reindex, which means to create a new object with the data conformed to a new index. Consider an example:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
d 4.5
b 7.2
a -5.3
c 3.6
dtype: float64
Calling reindex on this Series rearranges the data according to the new index, introducing missing values if any index values were not already present:
-> If a new label has no corresponding value, the result holds NaN
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2
a -5.3
b 7.2
c 3.6
d 4.5
e NaN
dtype: float64
For ordered data like time series, it may be desirable to do some interpolation or filling of values when reindexing. The method option allows us to do this, using a method such as ffill, which forward-fills the values.
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0,2,4])
obj3
0 blue
2 purple
4 yellow
dtype: object
"ffill - forward-fill"
obj3.reindex(range(6), method='ffill')
'ffill - forward-fill'
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
With DataFrame, reindex can alter either the (row) index, the columns, or both. When passed only a sequence, it reindexes the rows in the result:
frame = pd.DataFrame(np.arange(9).reshape((3,3)),
index=['a', 'c', 'd'],
columns=['Ohio', 'Texas', 'California']
)
frame
Ohio | Texas | California | |
---|---|---|---|
a | 0 | 1 | 2 |
c | 3 | 4 | 5 |
d | 6 | 7 | 8 |
"Reindexing: labels with no match become NaN"
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2
'Reindexing: labels with no match become NaN'
Ohio | Texas | California | |
---|---|---|---|
a | 0.0 | 1.0 | 2.0 |
b | NaN | NaN | NaN |
c | 3.0 | 4.0 | 5.0 |
d | 6.0 | 7.0 | 8.0 |
The columns can be reindexed with the columns keyword:
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)
Texas | Utah | California | |
---|---|---|---|
a | 1 | NaN | 2 |
c | 4 | NaN | 5 |
d | 7 | NaN | 8 |
See Table 5-3 for more about the arguments to reindex.
As we'll explore in more detail, you can reindex more succinctly by label-indexing with loc, and many users prefer to use it exclusively:
"loc[[row labels], [column labels]]"
frame.loc[['a', 'b', 'c', 'd'], states]
'loc[[row labels], [column labels]]'
c:\python\python36\lib\site-packages\ipykernel_launcher.py:2: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
Texas | Utah | California | |
---|---|---|---|
a | 1.0 | NaN | 2.0 |
b | NaN | NaN | NaN |
c | 4.0 | NaN | 5.0 |
d | 7.0 | NaN | 8.0 |
reindex() arguments:
- index
- method 'ffill' forward-fill, 'bfill' backward-fill
- fill_value value to use for labels with no data
- limit maximum number of consecutive elements to fill
- level Match simple Index on level of MultiIndex; otherwise select subset of.
- copy
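As a small illustration of these arguments (a sketch of mine, not from the examples above), fill_value and limit can each be combined with reindex:

```python
import pandas as pd

obj = pd.Series(['blue', 'purple'], index=[0, 2])

# fill_value supplies a constant for labels introduced by reindexing
filled = obj.reindex(range(4), fill_value='none')

# limit caps how many consecutive gaps ffill may bridge
capped = obj.reindex(range(4), method='ffill', limit=1)
```

Here each gap is a single missing label, so limit=1 still fills both of them.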
Dropping Entries from an Axis
Dropping one or more entries from an axis is easy if you already have an index array or list without those entries. As that can require a bit of munging and set logic, the drop method will return a new object with the indicated value or values deleted from an axis:
obj = pd.Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])
obj
a 0
b 1
c 2
d 3
e 4
dtype: int32
"drop('row label') removes that row along with its data"
new_obj = obj.drop('c')
new_obj
"drop('row label') removes that row along with its data"
a 0
b 1
d 3
e 4
dtype: int32
"Drop multiple labels by passing a list - not in-place"
obj.drop(['d', 'c'])
'Drop multiple labels by passing a list - not in-place'
a 0
b 1
e 4
dtype: int32
obj
a 0
b 1
c 2
d 3
e 4
dtype: int32
With DataFrame, index values can be deleted from either axis. To illustrate this, we first create an example DataFrame:
data = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four']
)
data
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colorado | 4 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
Calling drop with a sequence of labels will drop values from the row labels (axis 0): -> (dropping a row label removes that row's data)
'drop([row_name1, row_name2]) drops those rows, not in-place'
data.drop(['Colorado', 'Ohio'])
'drop([row_name1, row_name2]) drops those rows, not in-place'
one | two | three | four | |
---|---|---|---|---|
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
".drop() is not in-place, while del is in-place"
data
'.drop() is not in-place, while del is in-place'
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colorado | 4 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
You can drop values from the columns by passing axis=1 or axis='columns'.
"To drop columns, pass axis=1 or axis='columns'"
data.drop(['two', 'four'], axis='columns')
"To drop columns, pass axis=1 or axis='columns'"
one | three | |
---|---|---|
Ohio | 0 | 2 |
Colorado | 4 | 6 |
Utah | 8 | 10 |
New York | 12 | 14 |
"drop() is not in-place by default, for rows or columns; this can be overridden"
data
'drop() is not in-place by default, for rows or columns; this can be overridden'
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colorado | 4 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
Many functions, like drop, which modify the size or shape of a Series or DataFrame, can manipulate an object in-place without returning a new object: -> (pass inplace=True for in-place modification; many methods support it)
"Drop columns 'two' and 'three' in place"
data.drop(['two', 'three'], axis='columns', inplace=True)
"The original data has been modified"
data
'Drop columns 'two' and 'three' in place'
'The original data has been modified'
one | four | |
---|---|---|
Ohio | 0 | 3 |
Colorado | 4 | 7 |
Utah | 8 | 11 |
New York | 12 | 15 |
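A related option worth noting (an aside, not covered above): drop raises a KeyError for labels that are absent, unless you pass errors='ignore':

```python
import pandas as pd

s = pd.Series([0, 1, 2], index=['a', 'b', 'c'])

# errors='ignore' silently skips labels that are not present ('z' here)
result = s.drop(['b', 'z'], errors='ignore')
```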
Indexing, Selection, and Filtering
Series indexing (obj[...]) works analogously to NumPy array indexing, except you can use the Series's index values instead of only integers. Here are some examples of this:
obj = pd.Series(np.arange(4), index=['a', 'b', 'c','d'])
obj
a 0
b 1
c 2
d 3
dtype: int32
"obj[row_index_name]"
obj['b']
"Select by multiple index labels"
obj[['b', 'd']]
'obj[row_index_name]'
1
'Select by multiple index labels'
b 1
d 3
dtype: int32
"Select by integer position: positions start at 0"
"obj[1]"
"Take the second element"
obj[1]
"Contiguous: rows 1 through 3, half-open like a list slice"
obj[0:3]
"Discrete: rows 1 and 4"
obj[[0, 3]]
'Select by integer position: positions start at 0'
'obj[1]'
'Take the second element'
1
'Contiguous: rows 1 through 3, half-open like a list slice'
a 0
b 1
c 2
dtype: int32
'Discrete: rows 1 and 4'
a 0
d 3
dtype: int32
"Boolean indexing: values less than 2"
obj[ obj < 2]
'Boolean indexing: values less than 2'
a 0
b 1
dtype: int32
Slicing with labels behaves differently than normal Python slicing in that the end-point is inclusive. -> the interval is closed on the right
"The interval is closed"
obj['b':'c']
'The interval is closed'
b 1
c 2
dtype: int32
Setting using these methods modifies the corresponding section of the Series:
"Assignment through these selections modifies the data in place"
obj['b':'c'] = 5
obj
'Assignment through these selections modifies the data in place'
a 0
b 5
c 5
d 3
dtype: int32
Indexing into a DataFrame retrieves one or more columns, either with a single value or a sequence:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])
data
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colorado | 4 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
"df[col_name] selects a single column"
data['two']
"df[[col1, col2]] selects multiple columns"
data[['three', 'four']]
'df[col_name] selects a single column'
Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int32
'df[[col1, col2]] selects multiple columns'
three | four | |
---|---|---|
Ohio | 2 | 3 |
Colorado | 6 | 7 |
Utah | 10 | 11 |
New York | 14 | 15 |
Indexing like this has a few special cases. First, slicing or selecting data with a boolean array.
"df[:2] slices rows by position: here, the first 2 rows"
data[:2]
"Boolean row indexing: rows where column 'three' is greater than 5"
data[data['three'] > 5]
'df[:2] slices rows by position: here, the first 2 rows'
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colorado | 4 | 5 | 6 | 7 |
'Boolean row indexing: rows where column 'three' is greater than 5'
one | two | three | four | |
---|---|---|---|---|
Colorado | 4 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
The row selection syntax data[:2] is provided as a convenience. Passing a single element or a list to the [] operator selects columns.
-> If df[] receives column names it selects columns; if it receives a slice it selects rows. Easy to confuse, and there is no single universal rule.
Another use case is indexing with a boolean DataFrame, such as one produced by a scalar comparison:
data < 5
one | two | three | four | |
---|---|---|---|---|
Ohio | True | True | True | True |
Colorado | True | False | False | False |
Utah | False | False | False | False |
New York | False | False | False | False |
"Scan the whole DataFrame and replace values less than 5 with 0, in place"
data[data < 5] = 0
data
'Scan the whole DataFrame and replace values less than 5 with 0, in place'
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 0 | 0 | 0 |
Colorado | 0 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
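A non-mutating counterpart to this pattern (an aside of mine, using pandas's mask method) replaces values where a condition holds while leaving the original untouched:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(4).reshape((2, 2)), columns=['a', 'b'])

# mask replaces values where the condition is True, returning a new object
clipped = df.mask(df < 2, 0)
```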
This makes DataFrame syntactically more like a two-dimensional NumPy array in this particular case.
Selection with loc and iloc
For DataFrame label-indexing on the rows, I introduce the special indexing operators loc and iloc. They enable you to select a subset of the rows and columns from a DataFrame with NumPy-like notation, using either axis labels (loc) or integers (iloc).
As a preliminary example, let's select a single row and multiple columns by label:
data
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 0 | 0 | 0 |
Colorado | 0 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
"df.loc[[row labels], [column labels]]"
data.loc[['Colorado'], ['two', 'three']]
'df.loc[[row labels], [column labels]]'
two | three | |
---|---|---|
Colorado | 5 | 6 |
We'll then perform some similar selections with integers using iloc:
"Select row 3 and columns 4, 1, 2: values 11, 8, 9"
data.iloc[2, [3,0,1]]
"One dimension gives a Series, two a DataFrame"
data.iloc[[2,3], [3,0,1]]
'Select row 3 and columns 4, 1, 2: values 11, 8, 9'
four 11
one 8
two 9
Name: Utah, dtype: int32
'One dimension gives a Series, two a DataFrame'
four | one | two | |
---|---|---|---|
Utah | 11 | 8 | 9 |
New York | 15 | 12 | 13 |
Both indexing functions work with slices in addition to single labels or lists of labels:
"Both support slicing"
data.loc[:'Utah', 'two']
'Both support slicing'
Ohio 0
Colorado 5
Utah 9
Name: two, dtype: int32
"All rows, the first three columns, then keep rows where 'three' is greater than 5"
data.iloc[:, :3][data.three > 5]
'All rows, the first three columns, then keep rows where 'three' is greater than 5'
one | two | three | |
---|---|---|---|
Colorado | 0 | 5 | 6 |
Utah | 8 | 9 | 10 |
New York | 12 | 13 | 14 |
So there are many ways to select and rearrange the data contained in a pandas object. For DataFrame, Table 5-4 provides a short summary of many of them. As you'll see later, there are a number of additional options for working with hierarchical indexes.
When originally designing pandas, I felt that having to type frame[:, col] to select a column was too verbose (and error-prone), since column selection is one of the most common operations. I made the design trade-off to push all of the fancy indexing behavior (both labels and integers) into the ix operator. In practice, this led to many edge cases in data with integer axis labels, so the pandas team decided to create the loc and iloc operators to deal with strictly label-based and integer-based indexing, respectively.
The ix indexing operator still exists, but it is deprecated. I do not recommend using it.
Indexing options with DataFrame
- df[col_name] / df[[col1, col2]] Select a single column or multiple columns
- df.loc[row_name] Select a single row or multiple rows by label
- df.loc[:, val] Select a single column or multiple columns by label
- df.loc[row, col] Select both rows and columns by label
- df.iloc[where] Select rows by integer position
- df.iloc[:, where] Select columns by integer position
- df.iloc[where_i, where_j] Select rows and columns by integer position
- df.at[label_i, label_j] Select a single scalar value by row and column label
- df.iat[i, j] Select a single scalar value by row and column position
- reindex method Select either rows or columns by labels
- get_value, set_value methods Select single value by row and column label
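The scalar accessors at and iat from the list above can be sketched like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(4).reshape((2, 2)),
                  index=['a', 'b'], columns=['x', 'y'])

by_label = df.at['b', 'y']     # scalar lookup by row and column label
by_position = df.iat[1, 1]     # scalar lookup by integer positions
```

Both fetch the same single cell; they are faster than loc/iloc when you only need one value.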
Integer Indexes
Working with pandas objects indexed by integers is something that often trips up new users, due to some differences with indexing semantics on built-in Python data structures like lists and tuples. For example, you might not expect the following code to generate an error:
ser = pd.Series(np.arange(3))
ser
0 0
1 1
2 2
dtype: int32
"Python habit: expecting ser[-1] to be the last element"
ser[-1]
'Python habit: expecting ser[-1] to be the last element'
In this case, pandas could 'fall back' on integer indexing, but it is difficult to do this in general without introducing subtle bugs. Here we have an index containing 0, 1, 2, but inferring what the user wants (label-based indexing or position-based) is difficult:
ser
0 0
1 1
2 2
dtype: int32
On the other hand, with a non-integer index, there is no potential for ambiguity:
-> With labels as the index, there is no ambiguity.
ser2 = pd.Series(np.arange(3), index=['a', 'b', 'c'])
ser2
a 0
b 1
c 2
dtype: int32
"No ambiguity now: an integer can only mean a position"
ser2[-1]
'No ambiguity now: an integer can only mean a position'
2
To keep things consistent, if you have an axis index containing integers, data selection will always be label-oriented. For more precise handling, use loc (for labels) or iloc (for integers):
-> Good style: df[label] to select columns, loc[row_label] for row labels, iloc[] for integer positions; don't mix them up.
'Select the first row: half-open on the right'
ser[:1]
'Select rows 1 and 2: closed on the right'
ser.loc[:1]
ser.iloc[:1]
'Select the first row: half-open on the right'
0 0
dtype: int32
'Select rows 1 and 2: closed on the right'
0 0
1 1
dtype: int32
0 0
dtype: int32
Arithmetic and Data Alignment
An important pandas feature for some applications is the behavior of arithmetic between objects with different indexes. When you are adding together objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs. For users with database experience, this is similar to an automatic outer join on the index labels. Let's look at an example:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],
index=['a', 'c', 'e', 'f', 'g'])
s1
s2
a 7.3
c -2.5
d 3.4
e 1.5
dtype: float64
a -2.1
c 3.6
e -1.5
f 4.0
g 3.1
dtype: float64
Adding these together yields:
s1 + s2
a 5.2
c 1.1
d NaN
e 0.0
f NaN
g NaN
dtype: float64
The internal data alignment introduces missing values in the label locations that don't overlap. Missing values will then propagate in further arithmetic computations.
In the case of DataFrame, alignment is performed on both the rows and the columns:
df1 = pd.DataFrame(np.arange(9).reshape((3, 3)),
columns=list('bcd'),
index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12).reshape((4,3)),
columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1
df2
b | c | d | |
---|---|---|---|
Ohio | 0 | 1 | 2 |
Texas | 3 | 4 | 5 |
Colorado | 6 | 7 | 8 |
b | d | e | |
---|---|---|---|
Utah | 0 | 1 | 2 |
Ohio | 3 | 4 | 5 |
Texas | 6 | 7 | 8 |
Oregon | 9 | 10 | 11 |
Adding these together returns a DataFrame whose index and columns are the unions of the ones in each DataFrame:
"Automatic alignment"
df1 + df2
'Automatic alignment'
b | c | d | e | |
---|---|---|---|---|
Colorado | NaN | NaN | NaN | NaN |
Ohio | 3.0 | NaN | 6.0 | NaN |
Oregon | NaN | NaN | NaN | NaN |
Texas | 9.0 | NaN | 12.0 | NaN |
Utah | NaN | NaN | NaN | NaN |
Since the 'c' and 'e' columns are not found in both DataFrame objects, they appear as all missing in the result. The same holds for the rows whose labels are not common to both objects.
If you add DataFrame objects with no column or row labels in common, the result will contain all nulls:
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'B': [3, 4]})
df1
df2
A | |
---|---|
0 | 1 |
1 | 2 |
B | |
---|---|
0 | 3 |
1 | 4 |
"With no shared row or column labels, the labels are kept but every value is NaN"
df1 + df2
'With no shared row or column labels, the labels are kept but every value is NaN'
A | B | |
---|---|---|
0 | NaN | NaN |
1 | NaN | NaN |
Arithmetic methods with fill values
In arithmetic operations between differently indexed objects, you might want to fill with a special value, like 0, when an axis label is found in one object but not the other:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
columns=list('abcde'))
"Set the element at row 2, column 'b' to NaN"
df2.loc[1, 'b'] = np.nan
df1
df2
'Set the element at row 2, column 'b' to NaN'
a | b | c | d | |
---|---|---|---|---|
0 | 0.0 | 1.0 | 2.0 | 3.0 |
1 | 4.0 | 5.0 | 6.0 | 7.0 |
2 | 8.0 | 9.0 | 10.0 | 11.0 |
a | b | c | d | e | |
---|---|---|---|---|---|
0 | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 |
1 | 5.0 | NaN | 7.0 | 8.0 | 9.0 |
2 | 10.0 | 11.0 | 12.0 | 13.0 | 14.0 |
3 | 15.0 | 16.0 | 17.0 | 18.0 | 19.0 |
Adding these together results in NA values in the locations that don't overlap:
"Values at shared labels are combined; non-overlapping positions become NaN"
df1 + df2
'Values at shared labels are combined; non-overlapping positions become NaN'
a | b | c | d | e | |
---|---|---|---|---|---|
0 | 0.0 | 2.0 | 4.0 | 6.0 | NaN |
1 | 9.0 | NaN | 13.0 | 15.0 | NaN |
2 | 18.0 | 20.0 | 22.0 | 24.0 | NaN |
3 | NaN | NaN | NaN | NaN | NaN |
Using the add method on df1, I pass df2 and an argument to fill_value: -> (positions present in only one operand are treated as 0 before the operation)
df1.add(df2, fill_value=0)
a | b | c | d | e | |
---|---|---|---|---|---|
0 | 0.0 | 2.0 | 4.0 | 6.0 | 4.0 |
1 | 9.0 | 5.0 | 13.0 | 15.0 | 9.0 |
2 | 18.0 | 20.0 | 22.0 | 24.0 | 14.0 |
3 | 15.0 | 16.0 | 17.0 | 18.0 | 19.0 |
See Table 5-5 for a listing of Series and DataFrame methods for arithmetic. Each of them has a counterpart, starting with the letter r, that has its arguments flipped. So these two statements are equivalent:
1 / df1
"The plain operator and the flipped r-method give the same result"
df1.rdiv(1)
a | b | c | d | |
---|---|---|---|---|
0 | inf | 1.000000 | 0.500000 | 0.333333 |
1 | 0.250000 | 0.200000 | 0.166667 | 0.142857 |
2 | 0.125000 | 0.111111 | 0.100000 | 0.090909 |
'The plain operator and the flipped r-method give the same result'
a | b | c | d | |
---|---|---|---|---|
0 | inf | 1.000000 | 0.500000 | 0.333333 |
1 | 0.250000 | 0.200000 | 0.166667 | 0.142857 |
2 | 0.125000 | 0.111111 | 0.100000 | 0.090909 |
Relatedly, when reindexing a Series or DataFrame, you can also specify a different fill value:
'When reindexing, fill_value fills labels that had no match, like column e here'
df1.reindex(columns=df2.columns, fill_value=0)
'When reindexing, fill_value fills labels that had no match, like column e here'
a | b | c | d | e | |
---|---|---|---|---|---|
0 | 0.0 | 1.0 | 2.0 | 3.0 | 0 |
1 | 4.0 | 5.0 | 6.0 | 7.0 | 0 |
2 | 8.0 | 9.0 | 10.0 | 11.0 | 0 |
- add, radd
- sub, rsub
- div, rdiv
- floordiv, rfloordiv
- mul, rmul
- pow, rpow
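To make the flipped-argument idea concrete, here is a small sketch with rsub:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0])

# s.rsub(10) computes 10 - s: the arguments are flipped
flipped = s.rsub(10)
direct = 10 - s
```

The r-methods matter mainly when the left operand is not a pandas object, as here.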
Operations between DataFrame and Series
As with NumPy arrays of different dimensions, arithmetic between DataFrame and Series is also defined. First, as a motivating example, consider the difference between a two-dimensional array and one of its rows:
arr = np.arange(12).reshape((3, 4))
arr
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
'Take the first row'
arr[0]
'Broadcasting: the row is subtracted from every row, since the shapes match'
arr - arr[0]
'Take the first row'
array([0, 1, 2, 3])
'Broadcasting: the row is subtracted from every row, since the shapes match'
array([[0, 0, 0, 0],
[4, 4, 4, 4],
[8, 8, 8, 8]])
When we subtract arr[0] from arr, the subtraction is performed once for each row. This is referred to as broadcasting and is explained in more detail as it relates to general NumPy arrays in Appendix A. Operations between a DataFrame and a Series are similar:
frame = pd.DataFrame(np.arange(12).reshape((4,3)),
columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
"Select the first row"
series = frame.iloc[0]
series
'Select the first row'
b 0
d 1
e 2
Name: Utah, dtype: int32
By default, arithmetic between DataFrame and Series matches the index of the Series on the DataFrame's columns, broadcasting down the rows: -> (matching is on the column labels; broadcasting goes down the rows)
"Match left to right, broadcast top to bottom"
frame - series
'Match left to right, broadcast top to bottom'
b | d | e | |
---|---|---|---|
Utah | 0 | 0 | 0 |
Ohio | 3 | 3 | 3 |
Texas | 6 | 6 | 6 |
Oregon | 9 | 9 | 9 |
If an index value is not found in either the DataFrame's columns or the Series's index, the objects will be reindexed to form the union:
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
"Unmatched labels naturally become NA"
frame + series2
'Unmatched labels naturally become NA'
b | d | e | f | |
---|---|---|---|---|
Utah | 0.0 | NaN | 3.0 | NaN |
Ohio | 3.0 | NaN | 6.0 | NaN |
Texas | 6.0 | NaN | 9.0 | NaN |
Oregon | 9.0 | NaN | 12.0 | NaN |
If you want to instead broadcast over the columns, matching on the rows, you have to use one of the arithmetic methods. For example:
series3 = frame['d']
frame
series3
b | d | e | |
---|---|---|---|
Utah | 0 | 1 | 2 |
Ohio | 3 | 4 | 5 |
Texas | 6 | 7 | 8 |
Oregon | 9 | 10 | 11 |
Utah 1
Ohio 4
Texas 7
Oregon 10
Name: d, dtype: int32
"Match on axis 0 (the row index) and broadcast across the columns"
frame.sub(series3, axis='index')
'Match on axis 0 (the row index) and broadcast across the columns'
b | d | e | |
---|---|---|---|
Utah | -1 | 0 | 1 |
Ohio | -1 | 0 | 1 |
Texas | -1 | 0 | 1 |
Oregon | -1 | 0 | 1 |
The axis number that you pass is the axis to match on. In this case we mean to match on the DataFrame's row index (axis='index' or axis=0) and broadcast across.
Function Application and Mapping
NumPy ufuncs (element-wise array methods) also work with pandas objects: => element-wise functions work on a DataFrame too
# randn draws from the standard normal distribution
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
b | d | e | |
---|---|---|---|
Utah | 0.872481 | -0.026409 | 1.130246 |
Ohio | -0.793998 | -1.394605 | -0.224205 |
Texas | -0.120480 | 0.243161 | 1.627977 |
Oregon | -2.734813 | -2.009582 | 0.905505 |
np.abs(frame)
b | d | e | |
---|---|---|---|
Utah | 0.872481 | 0.026409 | 1.130246 |
Ohio | 0.793998 | 1.394605 | 0.224205 |
Texas | 0.120480 | 0.243161 | 1.627977 |
Oregon | 2.734813 | 2.009582 | 0.905505 |
Another frequent operation is applying a function on one-dimensional arrays to each column or row. DataFrame's apply method does exactly this:
"Define a function computing the range (max - min) to pass to apply"
f = lambda x: x.max() - x.min()
"Applied to each column by default"
frame.apply(f)
'Define a function computing the range (max - min) to pass to apply'
'Applied to each column by default'
b 3.607294
d 2.252743
e 1.852182
dtype: float64
Here the function f, which computes the difference between the maximum and minimum of a Series, is invoked once on each column of frame. The result is a Series having the columns of frame as its index.
If you pass axis='columns' to apply, the function will be invoked once per row instead:
"Default is down the rows (axis=0): one result per column"
"axis='columns' means the direction is along the columns: the function is applied once per row"
frame.apply(f, axis=1)
"Default is down the rows (axis=0): one result per column"
'axis='columns' means the direction is along the columns: the function is applied once per row'
Utah 1.156654
Ohio 1.170400
Texas 1.748457
Oregon 3.640318
dtype: float64
Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary.
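For instance, the built-in reduction and the apply version agree (a quick sketch):

```python
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0], 'y': [3.0, 5.0]})

totals = df.sum()                            # built-in column-wise reduction
via_apply = df.apply(lambda col: col.sum())  # same result via apply
```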
The function passed to apply need not return a scalar value; it can also return a Series with multiple values. -> apply() can return multiple values
def f(x):
    "Return the min and max of each column as a Series"
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
"Pass the function object itself to apply"
frame.apply(f)
'Pass the function object itself to apply'
b | d | e | |
---|---|---|---|
min | -2.734813 | -2.009582 | -0.224205 |
max | 0.872481 | 0.243161 | 1.627977 |
Element-wise Python functions can be used too. Suppose you wanted to compute a formatted string from each floating-point value in frame. You can do this with applymap: -> e.g., formatting every value in the DataFrame to two decimal places
format = lambda x: '%.2f' % x
"applymap() maps over every element, whereas apply works along an axis"
frame.applymap(format)
'applymap() maps over every element, whereas apply works along an axis'
b | d | e | |
---|---|---|---|
Utah | 0.87 | -0.03 | 1.13 |
Ohio | -0.79 | -1.39 | -0.22 |
Texas | -0.12 | 0.24 | 1.63 |
Oregon | -2.73 | -2.01 | 0.91 |
The reason for the name applymap is that Series has a map method for applying an element-wise function:
"applymap is named after Series's map method"
frame['e'].map(format)
'applymap is named after Series's map method'
Utah 1.13
Ohio -0.22
Texas 1.63
Oregon 0.91
Name: e, dtype: object
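As an aside (not covered in the text above), map also accepts a dict as well as a function, translating each element by lookup:

```python
import pandas as pd

s = pd.Series(['cat', 'dog', 'cat'])

# each element is looked up in the dict; unmatched elements would become NaN
mapped = s.map({'cat': 0, 'dog': 1})
```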
Sorting and Ranking
Sorting a dataset by some criterion is another important built-in operation. To sort lexicographically by row or column index, use the sort_index method, which returns a new, sorted object:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
"sort_index() sorts by index"
obj.sort_index()
'sort_index() sorts by index'
a 1
b 2
c 3
d 0
dtype: int64
With a DataFrame, you can sort by index on either axis:
frame = pd.DataFrame(np.arange(8).reshape((2,4)),
index=['three', 'one'],
columns=['d', 'a', 'b', 'c'])
frame
"Sort by row index: the default, axis=0"
frame.sort_index()
"Sort by column index with axis=1"
frame.sort_index(axis=1)
d | a | b | c | |
---|---|---|---|---|
three | 0 | 1 | 2 | 3 |
one | 4 | 5 | 6 | 7 |
'Sort by row index: the default, axis=0'
d | a | b | c | |
---|---|---|---|---|
one | 4 | 5 | 6 | 7 |
three | 0 | 1 | 2 | 3 |
'Sort by column index with axis=1'
a | b | c | d | |
---|---|---|---|---|
three | 1 | 2 | 3 | 0 |
one | 5 | 6 | 7 | 4 |
The data is sorted in ascending order by default, but can be sorted in descending order, too:
frame.sort_index(axis=1, ascending=False)
d | c | b | a | |
---|---|---|---|---|
three | 0 | 3 | 2 | 1 |
one | 4 | 7 | 6 | 5 |
To sort a Series by its values, use its sort_values method.
obj = pd.Series([4, 7, -3, 2])
"sort_values() sorts by value"
obj.sort_values()
'sort_values() sorts by value'
2 -3
3 2
0 4
1 7
dtype: int64
Any missing values are sorted to the end of the Series by default: -> NaN goes last
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
"Missing values go last"
obj.sort_values()
'Missing values go last'
4 -3.0
5 2.0
0 4.0
2 7.0
1 NaN
3 NaN
dtype: float64
When sorting a DataFrame, you can use the data in one or more columns as the sort keys. To do so, pass one or more column names to the by option of sort_values:
frame = pd.DataFrame({
'b':[4,7,3,-2],
'a':[0,1,0, 1]
})
frame
b | a | |
---|---|---|
0 | 4 | 0 |
1 | 7 | 1 |
2 | 3 | 0 |
3 | -2 | 1 |
"by='column_name' sorts by that column"
frame.sort_values(by='b')
"by='column_name' sorts by that column"
b | a | |
---|---|---|
3 | -2 | 1 |
2 | 3 | 0 |
0 | 4 | 0 |
1 | 7 | 1 |
"To sort by multiple columns, pass a list of names"
frame.sort_values(by=['a', 'b'])
'To sort by multiple columns, pass a list of names'
b | a | |
---|---|---|
2 | 3 | 0 |
0 | 4 | 0 |
3 | -2 | 1 |
1 | 7 | 1 |
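A further option worth knowing (an aside, not shown above): ascending can also be a list, one flag per sort key:

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, 3, -2], 'a': [0, 1, 0, 1]})

# sort by 'a' ascending, then by 'b' descending within each group of 'a'
out = frame.sort_values(by=['a', 'b'], ascending=[True, False])
```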
Ranking assigns ranks from one through the number of valid data points in an array. The rank methods for Series and DataFrame are the place to look; by default, rank breaks ties by assigning each group the mean rank: -> tied values share the average of the ranks they would otherwise occupy
obj = pd.Series([7, -5, 7, 4, 2, 8, 4])
obj.rank()
"Ranks can also be assigned according to the order in which they're observed in the data"
obj.rank(method='first')
0 5.5
1 1.0
2 5.5
3 3.5
4 2.0
5 7.0
6 3.5
dtype: float64
"Ranks can also be assigned according to the order in which they're observed in the data"
0 5.0
1 1.0
2 6.0
3 3.0
4 2.0
5 7.0
6 4.0
dtype: float64
Here, instead of using the average rank 5.5 for the entries 0 and 2, they have instead been set to 5 and 6 because label 0 precedes label 2 in the data.
You can rank in descending order, too:
# Assign values the maximum rank in the group
obj.rank(ascending=False, method='max')
0 3.0
1 7.0
2 3.0
3 5.0
4 6.0
5 1.0
6 5.0
dtype: float64
See Table 5-6 for a list of tie-breaking methods available.
DataFrame can compute ranks over the rows or the columns:
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
'c': [-2, 5, 8, -2.5]})
frame
b | a | c | |
---|---|---|---|
0 | 4.3 | 0 | -2.0 |
1 | 7.0 | 1 | 5.0 |
2 | -3.0 | 0 | 8.0 |
3 | 2.0 | 1 | -2.5 |
frame.rank(axis=1)
b | a | c | |
---|---|---|---|
0 | 3.0 | 2.0 | 1.0 |
1 | 3.0 | 1.0 | 2.0 |
2 | 1.0 | 2.0 | 3.0 |
3 | 3.0 | 2.0 | 1.0 |
rank method options:
- average Default: assign the average rank to each entry in the equal group
- max Use the maximum rank for the whole group
- min Use the minimum rank for the whole group
- first Assign ranks in the order the values appear in the data
- dense Like method='min', but ranks always increase by 1 between groups
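The tie-breaking options above can be checked against the same Series (a quick sketch):

```python
import pandas as pd

s = pd.Series([7, -5, 7, 4, 2, 8, 4])

min_ranks = s.rank(method='min')      # the whole tied group gets the lowest rank
dense_ranks = s.rank(method='dense')  # like 'min', but no gaps between groups
```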
Axis Indexes with Duplicate Labels
Up until now all of the examples we've looked at have had unique axis labels (index values). While many pandas functions (like reindex) require that the labels be unique, it's not mandatory. Let's consider a small Series with duplicate indices:
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj
a 0
a 1
b 2
b 3
c 4
dtype: int64
The index's is_unique property can tell you whether its labels are unique or not:
"The is_unique property tells you whether the index has duplicates"
obj.index.is_unique
'The is_unique property tells you whether the index has duplicates'
False
Data selection is one of the main things that behaves differently with duplicates. Indexing a label with multiple entries returns a Series, while single entries return a scalar value. -> with a duplicated label, selection returns a Series; a unique label returns a scalar
df = pd.DataFrame(np.random.randn(4,3), index=['a', 'a', 'b', 'b'])
df
0 | 1 | 2 | |
---|---|---|---|
a | -1.160530 | -0.226480 | 0.608358 |
a | -1.052758 | -0.783890 | 0.920109 |
b | -0.520996 | -0.706842 | 0.459379 |
b | 0.813595 | 1.052030 | 0.263111 |
"Select rows with label 'b'; since it is duplicated, a DataFrame is returned"
df.loc['b']
'Select rows with label 'b'; since it is duplicated, a DataFrame is returned'
0 | 1 | 2 | |
---|---|---|---|
b | -0.520996 | -0.706842 | 0.459379 |
b | 0.813595 | 1.052030 | 0.263111 |
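If duplicate labels get in the way, one way to keep only the first occurrence of each label (an aside, using Index.duplicated) is:

```python
import pandas as pd

s = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

# boolean mask marking repeated labels; invert it to keep each label's first entry
deduped = s[~s.index.duplicated(keep='first')]
```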