import numpy as np
import pandas as pd
pandas will be a major tool of interest throughout much of the rest of the book. It contains data structures and manipulation tools designed to make data cleaning and analysis fast and easy in Python. pandas is often used in tandem with numerical computing tools like NumPy and SciPy, analytical libraries like statsmodels and scikit-learn, and data visualization libraries like matplotlib. pandas adopts significant parts of NumPy's idiomatic style of array-based computing, especially array-based functions and a preference for data processing without for loops (array-oriented programming).
While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data. -> pandas is one of the best options for handling tabular data (the author is clearly a fan).
Since becoming an open source project in 2010, pandas has matured into a quite large library that is applicable in a broad set of real-world use cases. -> It is widely used. The developer community has grown to over 800 distinct contributors, who have been helping build the project as they have used it to solve their day-to-day data problems. -> It handles a large share of everyday data-processing problems.
Throughout the rest of the book, I use the following import convention for pandas:
import pandas as pd
# from pandas import Series, DataFrame
Thus, whenever you see pd in code, it is referring to pandas. You may also find it easier to import Series and DataFrame into the local namespace since they are frequently used:
from pandas import Series, DataFrame
To get started with pandas, you will need to get comfortable with its two workhorse data structures: Series and DataFrame. While they are not a universal solution for every problem, they provide a solid, easy-to-use basis for most applications.
Series
A Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data. -> A Series is like a one-dimensional NumPy array with an index.
obj = pd.Series([4, 7, -5, 3])
obj
0 4
1 7
2 -5
3 3
dtype: int64
The string representation of a Series displayed interactively shows the index on the left and the values on the right. Since we did not specify an index for the data, a default one consisting of the integers 0 through N-1 (where N is the length of the data) is created. You can get the array representation and index object of the Series via its values and index attributes, respectively: -> Access them through the values and index attributes.
obj.values
array([ 4, 7, -5, 3], dtype=int64)
obj.index # like range(4)
RangeIndex(start=0, stop=4, step=1)
Often it will be desirable to create a Series with an index identifying each data point with a label:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2
"Print the index"
obj2.index
d 4
b 7
a -5
c 3
dtype: int64
'Print the index'
Index(['d', 'b', 'a', 'c'], dtype='object')
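Compared with NumPy arrays, you can use labels in the index when selecting single values or a set of values. -> Select one or several elements by index label.
"Select a single element: [label]"
obj2['a']
"Modify an element by direct assignment; the change is in place"
obj2['d'] = 'cj'
"Select several elements: [[labels]]. In this pandas version a missing label yields NaN with a FutureWarning; newer versions raise KeyError"
obj2[['c', 'a', 'd', 'xx']]
'Select a single element: [label]'
-5
'Modify an element by direct assignment; the change is in place'
'Select several elements: [[labels]]. In this pandas version a missing label yields NaN with a FutureWarning; newer versions raise KeyError'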
c:\python\python36\lib\site-packages\pandas\core\series.py:851: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
return self.loc[key]
c 3
a -5
d cj
xx NaN
dtype: object
"Assigning to an element modifies the Series in place by default"
obj2
'Assigning to an element modifies the Series in place by default'
d cj
b 7
a -5
c 3
dtype: object
Here ['c', 'a', 'd'] is interpreted as a list of indices, even though it contains strings instead of integers. -> To select by several labels, put them in a list and pass that list as the single indexing argument.
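A minimal sketch of label-list selection, rebuilt on a fresh Series so it runs on its own. The FutureWarning shown earlier points to reindex() as the alternative when some labels may be missing; 'xx' is a made-up label used only for illustration.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# A list of labels selects several values at once, in the order given.
subset = obj2[['c', 'a', 'd']]
print(subset)

# reindex() tolerates labels that are absent and fills them with NaN.
# Introducing NaN forces the integer data to become float64.
with_missing = obj2.reindex(['c', 'a', 'd', 'xx'])
print(with_missing)
```

The result of reindex() is a new Series; obj2 itself is left unchanged.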
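Using NumPy functions or NumPy-like operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, will preserve the index-value link: -> A Series can be operated on like a NumPy array: boolean masks, scalar multiplication, math functions, and so on.
"Keep the elements of the Series greater than 0, together with their index labels"
"Restore the data first: strings cannot be compared with numbers"
obj2['d'] = 4
obj2[obj2 > 0]
"Scalar arithmetic"
obj2 * 2
"Call a NumPy function"
"obj2 has object dtype after the earlier string assignment, so np.exp(obj2) fails; the numeric array from obj is used here instead"
np.exp(obj.values)
'Keep the elements of the Series greater than 0, together with their index labels'
'Restore the data first: strings cannot be compared with numbers'
d 4
b 7
c 3
dtype: object
'Scalar arithmetic'
d 8
b 14
a -10
c 6
dtype: object
'Call a NumPy function'
'obj2 has object dtype after the earlier string assignment, so np.exp(obj2) fails; the numeric array from obj is used here instead'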
array([5.45981500e+01, 1.09663316e+03, 6.73794700e-03, 2.00855369e+01])
"cj test"
obj2 > 0
np.exp(obj2)
'cj test'
d True
b True
a False
c True
dtype: bool
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-39-86002a981278> in <module>
2 obj2 > 0
3
----> 4 np.exp(obj2)
AttributeError: 'int' object has no attribute 'exp'
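The traceback above most likely comes from the object dtype that obj2 picked up when a string was assigned to it, not from the index. The sketch below, on a fresh Series, shows a NumPy function keeping the index when the dtype is numeric, and one way (astype, my choice here) to get back to a numeric dtype.

```python
import numpy as np
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# With a numeric dtype, a NumPy ufunc returns a Series with the same index.
exp_numeric = np.exp(obj2)
print(exp_numeric)

# Mimic the state obj2 is left in after a string was assigned and removed:
# the values are integers again, but the dtype stays object.
obj_mixed = obj2.astype(object)
print(obj_mixed.dtype)

# Converting back to a numeric dtype lets the ufunc work on the Series.
exp_fixed = np.exp(obj_mixed.astype('int64'))
print(exp_fixed)
```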
Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. -> A Series can be seen as an ordered dict that maps index labels (keys) to values. It can be used in many contexts where you might use a dict:
"As with a dict, membership tests and selection operate on the keys (the index)"
'b' in obj2
'xxx' in obj2
'As with a dict, membership tests and selection operate on the keys (the index)'
True
False
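A few more dict-style operations, as a small sketch on a fresh Series. The methods used (get, items, to_dict) are standard Series methods; the label 'xxx' and the default value 0 are just illustrative.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

has_b = 'b' in obj2              # membership is tested against the index labels
fallback = obj2.get('xxx', 0)    # like dict.get: default for a missing label
pairs = list(obj2.items())       # (label, value) pairs in index order
as_dict = obj2.to_dict()         # back to a plain Python dict

print(has_b, fallback)
print(pairs)
print(as_dict)
```

One difference from a dict worth keeping in mind: iterating over a Series directly (for x in obj2) yields the values, not the labels.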
Should you have data contained in a Python dict, you can create a Series from it by passing the dict: -> A Python dict converts directly to a Series, with the keys as the index.
sdata = {'Ohio':35000, 'Texas':71000, 'Oregon':16000, 'Utah':5000}
"A dict can be converted directly to a Series"
obj3 = pd.Series(sdata)
obj3
'A dict can be converted directly to a Series'
Ohio 35000
Texas 71000
Oregon 16000
Utah 5000
dtype: int64
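# cj test
"Nested dicts are accepted too, but only the top-level keys become the index; each inner dict is stored as a value"
cj_data = {'Ohio':{'sex':1, 'age':18}, 'Texas':{'cj':123}}
pd.Series(cj_data)
'Nested dicts are accepted too, but only the top-level keys become the index; each inner dict is stored as a value'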
Ohio {'sex': 1, 'age': 18}
Texas {'cj': 123}
dtype: object
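When you are only passing a dict, the index in the resulting Series follows the dict's key order, as the output above shows (older pandas versions sorted the keys instead). You can override this by passing the dict keys in the order you want them to appear in the resulting Series: -> With a dict, the default index is its keys; passing an explicit index gives whatever layout we want:
"Override the original index"
states = ['California', 'Ohio', 'Oregon', 'Texas']
"Labels found in the dict take its values; labels not found show up as NA"
obj4 = pd.Series(sdata, index=states)
obj4
'Override the original index'
'Labels found in the dict take its values; labels not found show up as NA'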
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
Here, three values found in sdata were placed in the appropriate locations, but since no value for 'California' was found, it appears as NaN (not a number), which is used in pandas to mark missing or NA values. Since 'Utah' was not included in states, it is excluded from the resulting object.
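The same construction as a self-contained sketch, with two small checks that make the rule explicit: labels requested but absent from the dict become NaN, and dict keys that were not requested are dropped. The helper names missing and dropped are mine.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']

obj4 = pd.Series(sdata, index=states)

# Labels in `states` that have no entry in the dict end up as NaN.
missing = obj4.index[obj4.isna()].tolist()

# Dict keys that do not appear in `states` are left out of the result.
dropped = sorted(set(sdata) - set(obj4.index))

print(missing)
print(dropped)
print(obj4.dtype)   # NaN forces a floating-point dtype
```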
I will use the terms 'missing' or 'NA' interchangeably to refer to missing data. The isnull and notnull functions in pandas should be used to detect missing data:
"pd.isnull() and pd.notnull() detect missing values"
pd.isnull(obj4)
"The opposite test"
pd.notnull(obj4)
"Series also has these as instance methods:"
obj4.notnull()
'pd.isnull() and pd.notnull() detect missing values'
California True
Ohio False
Oregon False
Texas False
dtype: bool
'The opposite test'
California False
Ohio True
Oregon True
Texas True
dtype: bool
'Series also has these as instance methods:'
California False
Ohio True
Oregon True
Texas True
dtype: bool
I discuss working with missing data in more detail in Chapter 7.
A useful Series feature for many applications is that it automatically aligns by index label in arithmetic operations. -> In arithmetic, a Series aligns on index labels automatically: values with the same label are matched up. This is a key point.
obj3
obj4
"obj3 + obj4: values with the same label are added; a label present on only one side gives NaN"
obj3 + obj4
Ohio 35000
Texas 71000
Oregon 16000
Utah 5000
dtype: int64
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
'obj3 + obj4: values with the same label are added; a label present on only one side gives NaN'
California NaN
Ohio 70000.0
Oregon 32000.0
Texas 142000.0
Utah NaN
dtype: float64
Data alignment features will be addressed in more detail later. If you have experience with databases, you can think about this as being similar to a join operation. -> Data alignment works much like a database join (here, an outer join on the index labels).
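A short sketch of the outer-join behaviour, rebuilt from the same data. The add method with fill_value is one way (not the only one) to treat a label missing on one side as 0 instead of producing NaN.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

# The result carries the union of both indexes; a label that lacks a value
# on either side gives NaN.
total = obj3 + obj4
print(total)

# fill_value substitutes 0 when only one side is missing. 'California' is
# missing on both sides (absent from obj3, NaN in obj4), so it stays NaN.
filled = obj3.add(obj4, fill_value=0)
print(filled)
```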
Both the Series object itself and its index have a name attribute, which integrates with other key areas of pandas functionality: -> The name attributes tie in with other parts of pandas.
"Set the name of the Series: obj4.name = 'xxx'"
obj4.name = 'population'
"Set the name of the index: obj4.index.name = 'xxx'"
obj4.index.name = 'state'
obj4
"Set the name of the Series: obj4.name = 'xxx'"
"Set the name of the index: obj4.index.name = 'xxx'"
state
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
Name: population, dtype: float64
A Series's index can be altered in place by assignment. -> The index can be changed in place by assigning to it.
obj
"Assigning obj.index = [...] changes the index in place; a list of the wrong length raises an error"
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj
Bob 4
Steve 7
Jeff -5
Ryan 3
dtype: int64
"Assigning obj.index = [...] changes the index in place; a list of the wrong length raises an error"
Bob 4
Steve 7
Jeff -5
Ryan 3
dtype: int64
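A small sketch of the length check mentioned in the cell comment, run on a fresh Series. The three-label list is deliberately too short; catching ValueError is how I chose to show the failure without stopping the script.

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']   # four labels for four values

# Assigning the wrong number of labels is rejected.
try:
    obj.index = ['only', 'three', 'labels']
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(type(mismatch_error).__name__)
print(obj.index.tolist())   # the earlier, valid index is still in place
```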
DataFrame
A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). -> Each column can hold a different data type. The DataFrame has both a row and a column index (index for the rows, columns for the columns).
It can be thought of as a dict of Series all sharing the same index. Under the hood, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays. The exact details of DataFrame's internals are outside the scope of this book.
While a DataFrame is physically two-dimensional, you can use it to represent higher dimensional data in a tabular format using hierarchical indexing, a subject we will discuss in Chapter 8 and an ingredient in some of the more advanced data-handling features in pandas. -> Hierarchical indexing handles higher-dimensional data and underlies some of the more advanced features of pandas.
There are many ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays. -> The most common way to build a DataFrame is to pass a dict of equal-length lists or arrays.
data = {
'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
}
frame = pd.DataFrame(data)
The resulting DataFrame will have its index assigned automatically, as with Series, and the columns follow the order of the dict keys (older pandas versions placed them in sorted order):
frame
| | state | year | pop |
---|---|---|---|
0 | Ohio | 2000 | 1.5 |
1 | Ohio | 2001 | 1.7 |
2 | Ohio | 2002 | 3.6 |
3 | Nevada | 2001 | 2.4 |
4 | Nevada | 2002 | 2.9 |
5 | Nevada | 2003 | 3.2 |
If you are using the Jupyter notebook, pandas DataFrame objects will be displayed as a more browser-friendly HTML table.
For large DataFrames, the head method selects only the first five rows: -> df.head() shows the first 5 rows by default.
frame.head()
| | state | year | pop |
---|---|---|---|
0 | Ohio | 2000 | 1.5 |
1 | Ohio | 2001 | 1.7 |
2 | Ohio | 2002 | 3.6 |
3 | Nevada | 2001 | 2.4 |
4 | Nevada | 2002 | 2.9 |
If you specify a sequence of columns, the DataFrame's columns will be arranged in that order: -> Specify the column order.
"Arrange the columns in the given order"
pd.DataFrame(data, columns=['year', 'state', 'pop'])
'Arrange the columns in the given order'
| | year | state | pop |
---|---|---|---|
0 | 2000 | Ohio | 1.5 |
1 | 2001 | Ohio | 1.7 |
2 | 2002 | Ohio | 3.6 |
3 | 2001 | Nevada | 2.4 |
4 | 2002 | Nevada | 2.9 |
5 | 2003 | Nevada | 3.2 |
If you pass a column that isn't contained in the dict, it will appear with missing values in the result:
frame2 = pd.DataFrame(data,
columns=['year', 'state', 'pop', 'debt'],
index=['one', 'two', 'three', 'four', 'five', 'six'])
"A column that is not in the data is created and filled with NaN"
frame2
"An index whose length does not match the data raises an error; frame2.columns shows the column index"
frame2.columns
'A column that is not in the data is created and filled with NaN'
| | year | state | pop | debt |
---|---|---|---|---|
one | 2000 | Ohio | 1.5 | NaN |
two | 2001 | Ohio | 1.7 | NaN |
three | 2002 | Ohio | 3.6 | NaN |
four | 2001 | Nevada | 2.4 | NaN |
five | 2002 | Nevada | 2.9 | NaN |
six | 2003 | Nevada | 3.2 | NaN |
'An index whose length does not match the data raises an error; frame2.columns shows the column index'
Index(['year', 'state', 'pop', 'debt'], dtype='object')
A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute:
-> (Use the column name as a key, or df.column_name.)
"Bracket indexing: [column name]"
frame2['state']
"Attribute access: df.column_name"
frame2.state
'Bracket indexing: [column name]'
one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
six Nevada
Name: state, dtype: object
'Attribute access: df.column_name'
one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
six Nevada
Name: state, dtype: object
Attribute-like access (e.g., frame2.year) and tab completion of column names in IPython are provided as a convenience. -> Selecting a column by attribute is handy.
frame2[column] works for any column name, but frame2.column only works when the column name is a valid Python variable name.
Note that the returned Series have the same index as the DataFrame, and their name attribute has been appropriately set.
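A small sketch of where attribute access falls short. The 'total debt' column is a made-up name with a space in it. The 'pop' column is the one from these notes, and it happens to collide with the existing DataFrame.pop method, so frame.pop does not return the column.

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'],
                      'pop': [1.5, 2.4],
                      'total debt': [0.1, 0.2]})   # hypothetical column name

by_key = frame['state']
by_attr = frame.state            # same Series as frame['state']

# frame.pop resolves to the DataFrame.pop method, not to the 'pop' column.
pop_is_method = callable(frame.pop)
pop_col = frame['pop']           # bracket notation always reaches the column

# A name with a space is not a valid Python identifier, so only brackets work.
debt = frame['total debt']

print(by_key.equals(by_attr), pop_is_method)
print(pop_col.name, debt.name)
```

Bracket notation is therefore the safer default in scripts; attribute access is mostly an interactive convenience.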
Rows can also be retrieved by position or name with the special loc attribute (much more on this later). -> The loc attribute selects rows.
"Select the row with index label three: loc[label]"
frame2.loc['three']
"Select the rows labeled 1 through 2 (the second and third rows): frame.loc[1:2]"
frame.loc[1:2]
'Select the row with index label three: loc[label]'
year 2002
state Ohio
pop 3.6
debt NaN
Name: three, dtype: object
'Select the rows labeled 1 through 2 (the second and third rows): frame.loc[1:2]'
| | state | year | pop |
---|---|---|---|
1 | Ohio | 2001 | 1.7 |
2 | Ohio | 2002 | 3.6 |
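To separate selection by name from selection by position, here is a sketch on a cut-down three-row frame. loc works with index labels; iloc (not used in the cells above) is the positional counterpart. Note that a label slice includes both end points, while a position slice excludes the stop.

```python
import pandas as pd

frame2 = pd.DataFrame({'year': [2000, 2001, 2002],
                       'state': ['Ohio', 'Ohio', 'Ohio'],
                       'pop': [1.5, 1.7, 3.6]},
                      index=['one', 'two', 'three'])

row_by_label = frame2.loc['three']      # row selected by its index label
row_by_pos = frame2.iloc[2]             # the same row, selected by position

label_slice = frame2.loc['one':'two']   # both end labels are included
pos_slice = frame2.iloc[0:2]            # the stop position is excluded

print(row_by_label)
print(label_slice)
```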
Columns can be modified by assignment. For example, the empty 'debt' column could be assigned a scalar value or an array of values: -> The values are modified in place.
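# 'debet' differs from the existing 'debt' label, so this adds a new column rather than filling 'debt'
frame2['debet'] = 16.5
"The whole column is set in place"
frame2
'The whole column is set in place'
| | year | state | pop | debt | debet |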
---|---|---|---|---|---|
one | 2000 | Ohio | 1.5 | NaN | 16.5 |
two | 2001 | Ohio | 1.7 | NaN | 16.5 |
three | 2002 | Ohio | 3.6 | NaN | 16.5 |
four | 2001 | Nevada | 2.4 | NaN | 16.5 |
five | 2002 | Nevada | 2.9 | NaN | 16.5 |
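six | 2003 | Nevada | 3.2 | NaN | 16.5 |
"Modify in place; the array values line up with the rows in order"
frame2['debet'] = np.arange(6)
"Drop the debt column: axis=1 means columns, inplace=True drops it in place"
frame2.drop(labels='debt', axis=1, inplace=True)
frame2
'Modify in place; the array values line up with the rows in order'
'Drop the debt column: axis=1 means columns, inplace=True drops it in place'
| | year | state | pop | debet |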
---|---|---|---|---|
one | 2000 | Ohio | 1.5 | 0 |
two | 2001 | Ohio | 1.7 | 1 |
three | 2002 | Ohio | 3.6 | 2 |
four | 2001 | Nevada | 2.4 | 3 |
five | 2002 | Nevada | 2.9 | 4 |
six | 2003 | Nevada | 3.2 | 5 |
frame2.columns
Index(['year', 'state', 'pop', 'debet'], dtype='object')
frame2.drop()
frame2['debt']
one 0
two 1
three 2
four 3
five 4
six 5
Name: debt, dtype: int32
When you are assigning lists or arrays to a column, the value's length must match the length of the DataFrame. If you assign a Series, its labels will be realigned exactly to the DataFrame's index, inserting missing values in any holes:
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
"Aligned automatically by index label"
frame2['debet'] = val
frame2
'Aligned automatically by index label'
| | year | state | pop | debet |
---|---|---|---|---|
one | 2000 | Ohio | 1.5 | NaN |
two | 2001 | Ohio | 1.7 | -1.2 |
three | 2002 | Ohio | 3.6 | NaN |
four | 2001 | Nevada | 2.4 | -1.5 |
five | 2002 | Nevada | 2.9 | -1.7 |
six | 2003 | Nevada | 3.2 | NaN |
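The two assignment rules side by side, as a sketch on a three-row frame: a Series is aligned by label (extra labels are ignored, holes become NaN), while a plain list must match the frame's length. The column name 'short' and the two-element list are made up to trigger the length error.

```python
import pandas as pd

frame2 = pd.DataFrame({'year': [2000, 2001, 2002]},
                      index=['one', 'two', 'three'])

# 'four' does not exist in frame2.index, so that value is simply not used;
# rows 'one' and 'three' have no matching label and receive NaN.
val = pd.Series([-1.2, -1.5], index=['two', 'four'])
frame2['debt'] = val

# A list carries no labels, so its length has to match the number of rows.
try:
    frame2['short'] = [1, 2]
    length_error = None
except ValueError as exc:
    length_error = exc

print(frame2)
print(type(length_error).__name__)
```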
Assigning a column that doesn't exist will create a new column. The del keyword will delete columns as with a dict. -> Use del to delete a column.
As an example of del, I first add a new column of boolean values where the state column equals 'Ohio':
frame2['eastern'] = frame2.state == 'Ohio'
"First add a new column, eastern"
frame2
"Then delete that column with the del keyword"
del frame2['eastern']
"Show the column names: eastern is gone. The drop() method would work as well"
frame2.columns
'First add a new column, eastern'
| | year | state | pop | debet | eastern |
---|---|---|---|---|---|
one | 2000 | Ohio | 1.5 | NaN | True |
two | 2001 | Ohio | 1.7 | -1.2 | True |
three | 2002 | Ohio | 3.6 | NaN | True |
four | 2001 | Nevada | 2.4 | -1.5 | False |
five | 2002 | Nevada | 2.9 | -1.7 | False |
six | 2003 | Nevada | 3.2 | NaN | False |
'Then delete that column with the del keyword'
'Show the column names: eastern is gone. The drop() method would work as well'
Index(['year', 'state', 'pop', 'debet'], dtype='object')
The column returned from indexing a DataFrame is a view on the underlying data, not a copy. Thus, any in-place modifications to the Series will be reflected in the DataFrame. The column can be explicitly copied with the Series's copy method. -> Copy the column explicitly when needed; otherwise you are working on a view.
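A minimal sketch of the explicit copy. I only demonstrate the copy side here, because whether a write through an uncopied column reaches the DataFrame depends on the pandas version and its copy-on-write setting (an assumption worth checking against the version in use). An explicit copy is independent of the frame in every version.

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})

# copy() gives a Series that owns its own data.
pop_copy = frame['pop'].copy()
pop_copy.iloc[0] = 99.0

print(pop_copy.tolist())        # the copy was changed
print(frame['pop'].tolist())    # the DataFrame was not
```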
Another common form of data is a nested dict of dicts:
pop = {
'Nevada': {2001:2.4, 2002:2.9},
'Ohio': {2000:1.5, 2001:1.7, 2002:3.6}
}
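If the nested dict is passed to the DataFrame, pandas will interpret the outer dict keys as the columns and the inner keys as the row indices: -> With one level of nesting, the outer keys become the columns and the inner keys become the index.
frame3 = pd.DataFrame(pop)
"Outer dict keys become the columns; the keys of each inner dict become the index"
frame3
'Outer dict keys become the columns; the keys of each inner dict become the index'
| | Nevada | Ohio |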
---|---|---|
2000 | NaN | 1.5 |
2001 | 2.4 | 1.7 |
2002 | 2.9 | 3.6 |
You can transpose the DataFrame (swap rows and columns) with similar syntax to a NumPy array:
"Transpose"
frame3.T
'Transpose'
| | 2000 | 2001 | 2002 |
---|---|---|---|
Nevada | NaN | 2.4 | 2.9 |
Ohio | 1.5 | 1.7 | 3.6 |
The keys in the inner dicts are combined and sorted to form the index in the result. This isn't true if an explicit index is specified:
# pd.DataFrame(pop, index=('a', 'b','c'))
Dicts of Series are treated in much the same way.
pdata = {
'Ohio': frame3['Ohio'][:-1],
'Nevada': frame3['Nevada'][:2]
}
pd.DataFrame(pdata)
| | Ohio | Nevada |
---|---|---|
2000 | 1.5 | NaN |
2001 | 1.7 | 2.4 |
For a complete list of things you can pass to the DataFrame constructor, see Table 5-1.
If a DataFrame's index and columns have their name attributes set, these will also be displayed: -> Set the name attributes of the row and column indexes.
frame3.index.name = 'year'
frame3.columns.name = 'state'
frame3
state | Nevada | Ohio |
---|---|---|
year | | |
2000 | NaN | 1.5 |
2001 | 2.4 | 1.7 |
2002 | 2.9 | 3.6 |
As with Series, the values attribute returns the data contained in the DataFrame as a two-dimensional ndarray: -> The values attribute returns a two-dimensional array.
frame3.values
array([[nan, 1.5],
[2.4, 1.7],
[2.9, 3.6]])
If the DataFrame's columns are different dtypes, the dtype of the values array will be chosen to accommodate all of the columns.
"A dtype that can hold the data of every column is chosen automatically"
frame2.values
'A dtype that can hold the data of every column is chosen automatically'
array([[2000, 'Ohio', 1.5, nan],
[2001, 'Ohio', 1.7, nan],
[2002, 'Ohio', 3.6, nan],
[2001, 'Nevada', 2.4, nan],
[2002, 'Nevada', 2.9, nan],
[2003, 'Nevada', 3.2, nan]], dtype=object)
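A quick sketch of how the common dtype is picked, on two tiny made-up frames: integer and float columns together give float64, while mixing numbers with strings falls back to object.

```python
import pandas as pd

numeric = pd.DataFrame({'a': [1, 2], 'b': [1.5, 2.5]})
mixed = pd.DataFrame({'year': [2000, 2001], 'state': ['Ohio', 'Nevada']})

# int64 and float64 columns can both be represented as float64.
print(numeric.values.dtype)

# Numbers and strings only share the generic object dtype.
print(mixed.values.dtype)

# Either way the result is a two-dimensional array, rows by columns.
print(numeric.values.shape, mixed.values.shape)
```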
Table 5-1. Possible data inputs to the DataFrame constructor
- 2D ndarray A matrix of data, passing optional row and column labels
- ... (remaining entries omitted for now; to be looked up when needed)
Index Objects
pandas's Index objects are responsible for holding the axis labels and other metadata (like the axis name or names). Any array or other sequence of labels you use when constructing a Series or DataFrame is internally converted to an Index:
obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index
index[1:]
obj
Index(['a', 'b', 'c'], dtype='object')
Index(['b', 'c'], dtype='object')
a 0
b 1
c 2
dtype: int64
Index objects are immutable and thus can't be modified by the user:
index[1] = 'd'
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-14-a452e55ce13b> in <module>
----> 1 index[1] = 'd'
c:\python\python36\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
2063
2064 def __setitem__(self, key, value):
-> 2065 raise TypeError("Index does not support mutable operations")
2066
2067 def __getitem__(self, key):
TypeError: Index does not support mutable operations
"The index is immutable"
index
'The index is immutable'
Index(['a', 'b', 'c'], dtype='object')
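Since an Index cannot be edited element by element, changing a label means producing new labels. The sketch below catches the TypeError and then uses Series.rename with a mapping, which is one of several ways to do it (my choice for illustration).

```python
import pandas as pd

obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index

# Item assignment on an Index is rejected.
try:
    index[1] = 'd'
    error = None
except TypeError as exc:
    error = exc

# rename() returns a new Series with a new Index; obj keeps its old labels.
renamed = obj.rename(index={'b': 'd'})

print(type(error).__name__)
print(obj.index.tolist())
print(renamed.index.tolist())
```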
labels = pd.Index(np.arange(3))
labels
Int64Index([0, 1, 2], dtype='int64')
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2
0 1.5
1 -2.5
2 0.0
dtype: float64
obj2.index is labels
True
Unlike Python sets, a pandas Index can contain duplicate labels.
Selections with duplicate labels will select all occurrences of that label.
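A tiny sketch with made-up labels 'foo' and 'bar': selecting a duplicated label returns every matching entry as a Series, while a unique label still returns a single value.

```python
import pandas as pd

dup = pd.Series([1, 2, 3], index=['foo', 'foo', 'bar'])

both_foo = dup['foo']       # two matches, so the result is a Series
single_bar = dup['bar']     # one match, so the result is a scalar

print(dup.index.is_unique)
print(both_foo)
print(single_bar)
```

Because the return type changes with the data, code that indexes by label may need to check is_unique first.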
Each Index has a number of methods and properties for set logic, which answer other common questions about the data it contains. Some useful ones are summarized in Table 5-2.
- append Concatenate with additional Index objects, producing a new index
- difference Compute set difference as Index
- intersection Compute set intersection
- union Compute set union
- isin Compute boolean array indicating whether each value is contained in the passed collection
- delete Compute new index with element at index i deleted
- drop Compute new index by deleting passed values
- insert Compute new index by inserting element at index i
- is_unique Return True if the index has no duplicate values
- unique Compute the array of unique values in the index.
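The methods in the list can be tried on two small made-up indexes. Each call returns a new object, in line with Index immutability.

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

union = left.union(right)                 # labels found in either index
intersection = left.intersection(right)   # labels found in both
difference = left.difference(right)       # labels only in left
membership = left.isin(['a', 'd'])        # boolean array, one entry per label

appended = left.append(right)             # plain concatenation, duplicates kept
inserted = left.insert(1, 'z')            # new index with 'z' at position 1
deleted = left.delete(0)                  # new index without position 0
dropped = left.drop(['a'])                # new index without the label 'a'

print(union.tolist(), intersection.tolist(), difference.tolist())
print(membership.tolist())
print(appended.tolist(), appended.is_unique)
print(inserted.tolist(), deleted.tolist(), dropped.tolist())
```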