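I. Introduction
1. The Python Data Analysis Library, or pandas, is a tool built on NumPy. It is mainly used for data processing (cleaning, manipulating, storing, and reading data) and for data analysis.
2. Website and PDF documentation: http://pandas.pydata.org/ and https://pandas.pydata.org/docs/pandas.pdf
3. pandas provides several data structures (classes). The main ones are Series (a one-dimensional labeled array), DataFrame (a two-dimensional table), and Panel (a three-dimensional array; Panel was deprecated in pandas 0.20 and removed in 0.25).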
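II. Series
1. A one-dimensional array with labels (the index) that can hold any data type (int, str, float, Python objects, etc.). By default the axis labels (the index) start at 0. A Series can be thought of as one column of a table.
2. Creation: pd.Series(data, index=index), where data can be a dict, an ndarray, or a scalar
(1) Dict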
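import pandas as pd

# From a dict
# 1. Without an explicit index, the dict keys become the index and the dict values become the data.
#    (The output below comes from an older pandas that sorted the keys; pandas 0.23+ keeps the dict's insertion order.)
d = {'b': 1, 'a': 0, 'c': 2}
s = pd.Series(d)
print(s)
# 2. With an explicit index, the values matching the index labels are pulled out of the data.
#    NaN (not a number) is the standard missing-data marker in pandas.
s2 = pd.Series(d, index=['b', 'c', 'd', 'a'])
print(s2)
----------------------------------------------------------
a    0
b    1
c    2
dtype: int64
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64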
(2) ndarray
import numpy as np
import pandas as pd

# From an ndarray: if an index is given, its length must equal the length of the data
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
# Without an index, a default index with the values [0, ..., len(data) - 1] is created
s2 = pd.Series(np.random.randn(5))
print(s)
print(s2)
-------------------------------------------
a   -0.019921
b   -2.324644
c   -0.429393
d    1.436731
e    2.564406
dtype: float64
0   -0.925714
1    0.319075
2    0.528071
3   -0.385841
4    0.963207
dtype: float64
(3) Scalar
import pandas as pd

# From a scalar: an index must be provided, and the value is repeated to match the length of the index
s = pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
print(s)
-------------------------------------------------------
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64
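Because a Series is indexed by label, both lookups and arithmetic go through the labels rather than the positions. The following is a minimal sketch of that behaviour; the names s1, s2 and total are illustrative and do not come from the listings above.

```python
import numpy as np
import pandas as pd

# Two small Series whose indexes only partly overlap (illustrative data)
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Values are looked up by label, not by position
value_b = s1['b']

# Arithmetic aligns on labels; a label present in only one Series yields NaN
total = s1 + s2
print(value_b)
print(total)
```

This is the same mechanism that produces NaN for the label 'd' when a dict is combined with an explicit index.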
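III. DataFrame
1. A two-dimensional, table-like data structure with an index (row labels) and columns (column labels). If no labels are given, they are numbered automatically from 0 to length-1.
2. Creation: pd.DataFrame(data, index=index, columns=columns), where data can be an ndarray, a list, a dict, a Series, or another DataFrame. This means a Series can easily be turned into a DataFrame.
(1) Dict {'column': Series}, recommended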
import pandas as pd

# From a dict: the keys become the columns and each value is a Series.
# On old Python/pandas versions the column order from a dict was not guaranteed;
# with Python 3.7+ and pandas 0.23+ it follows the dict's insertion order.
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df)
# If no columns are passed, the columns are the dict keys in order
print(pd.DataFrame(d, index=['d', 'b', 'a']))
# A requested column that is not in the dict ('three') is filled with NaN
print(pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three']))
-----------------------------------------------------------
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0
   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN
(2) Dict {'column': ndarray / list}
import pandas as pd

# From a dict of arrays or lists: all of them must have the same length
d = {'one': [1., 2., 3., 4.],
     'two': [4., 3., 2., 1.]}
df = pd.DataFrame(d)
print(df)  # without an index, the result uses range(n)
print(pd.DataFrame(d, index=['a', 'b', 'c', 'd']))  # an explicit index must have the same length as the arrays
----------------------------------------
   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0
   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0
(3) List of dicts [{'column': value}, {...}]
import pandas as pd

# From a list of dicts: each dict is one record (one row) and the row order is fixed; the dict keys become the columns
data2 = [{'a': 1, 'b': 2},
         {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data2)
print(df)
print(pd.DataFrame(data2, index=['first', 'second']))
print(pd.DataFrame(data2, columns=['a', 'b']))
-----------------------------------------------------
   a   b     c
0  1   2   NaN
1  5  10  20.0
        a   b     c
first   1   2   NaN
second  5  10  20.0
   a   b
0  1   2
1  5  10
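The remark above that a Series converts easily into a DataFrame can be sketched as follows. The Series name 'one' and the column name 'value' are illustrative choices, not something the original listings define.

```python
import pandas as pd

# A named Series; the name becomes the column label after conversion
s = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'], name='one')

df_from_method = s.to_frame()            # Series method
df_from_ctor = pd.DataFrame(s)           # DataFrame constructor, same result
df_renamed = pd.DataFrame({'value': s})  # dict form lets you pick the column name

print(df_from_method)
print(df_renamed)
```

In all three cases the Series index is kept as the row index of the resulting table.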
3. Common attributes
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(20, 3))

print(df)
print(df.dtypes)   # dtype of each column
print(df.index)    # row labels (index), the labels of the first dimension
print(df.columns)  # column labels, the labels of the second dimension
print(df.values)   # the data as a NumPy ndarray
print(df.axes)     # list of the row labels and the column labels, i.e. the list of axes
print(df.ndim)     # number of axes / array dimensions; 2 for a DataFrame
print(df.size)     # total number of elements, rows * columns
print(df.shape)    # shape of the table as a tuple (rows, columns)
print(df.empty)    # whether the table is empty
---------------------------------------------
           0         1         2
0   1.463760  0.889478  2.719325
1   1.151600 -1.543238 -1.092630
2  -0.915165  0.182080  0.015987
3   0.997108 -1.062458 -0.290915
4   0.506596  0.521730  1.204838
5  -1.205070  0.593703 -0.237471
6  -0.941276  0.375154  2.109682
7   0.490349 -0.333887 -2.234917
8   0.927428  1.178269 -0.252521
9  -0.734907 -1.701942  0.008140
10  0.066335 -0.279483 -1.536980
11  0.381364  0.527889 -0.735369
12 -1.759830  0.837367 -0.311767
13 -0.331585 -0.081331 -1.250890
14  0.010716  0.100442  0.030236
15 -0.718699  1.051054  0.990649
16 -0.295118  0.463517 -0.011839
17  0.216745  1.397626 -0.242623
18  0.667472  1.096342 -1.638717
19 -0.972141 -0.502762 -0.484464
0    float64
1    float64
2    float64
dtype: object
RangeIndex(start=0, stop=20, step=1)
RangeIndex(start=0, stop=3, step=1)
[[ 1.46376002  0.88947846  2.71932544]
 [ 1.15159974 -1.54323848 -1.09262975]
 [-0.91516515  0.18208043  0.01598746]
 [ 0.99710758 -1.06245845 -0.29091497]
 [ 0.506596    0.52173002  1.20483791]
 [-1.20507018  0.59370326 -0.23747127]
 [-0.94127614  0.37515381  2.10968156]
 [ 0.49034868 -0.33388734 -2.23491685]
 [ 0.9274285   1.17826853 -0.25252075]
 [-0.73490667 -1.70194233  0.00814038]
 [ 0.06633457 -0.27948286 -1.53698016]
 [ 0.3813643   0.52788901 -0.73536879]
 [-1.7598303   0.83736726 -0.31176709]
 [-0.33158538 -0.08133123 -1.25089025]
 [ 0.01071576  0.10044171  0.03023593]
 [-0.71869935  1.05105444  0.99064853]
 [-0.29511769  0.46351668 -0.01183942]
 [ 0.216745    1.3976258  -0.24262259]
 [ 0.66747248  1.0963415  -1.63871673]
 [-0.97214094 -0.50276224 -0.48446367]]
[RangeIndex(start=0, stop=20, step=1), RangeIndex(start=0, stop=3, step=1)]
2
60
(20, 3)
False
4. Common methods
(1) Viewing data. Indexing comes in three forms: direct indexing, position-based indexing with iloc, and label-based indexing with loc
import pandas as pd

df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
                  index=['one', 'two', 'three'], columns=['A', 'B', 'C'])
print(df)
# A DataFrame reads left to right and top to bottom; the leftmost column is the index, and each index entry identifies one record
print(df.head())  # show the table from the top (first few rows)
print(df.tail())  # show the table from the bottom (last few rows)
print('#'*30)

# Indexing
# 1. By label
print(df.at['two', 'B'])  # a single value, addressed by row label and column label
# loc indexes by label; tuples can also be used to index multi-level data
print(df.loc['one'])  # by row label, same as df.loc['one', :]
print(df.loc['two', 'B'])  # a single value by row and column label, same as df.loc['two'].at['B']
print(df.loc[['one', 'two'], 'A':'B'])  # by a list of row labels and a range of column labels
print(df.loc[df['C'] > 6, df.loc['three'] > 10])  # conditional (boolean) indexing

print('#'*30)
# 2. By position
print(df.iat[1, 2])  # same as df.iloc[1].iat[2]
# iloc works like loc, but with integer positions instead of labels:
# a comma separates the row and column indexers, a colon denotes a range, a list selects several positions
print(df.iloc[2])  # the row at position 2, i.e. the third row
print(df.iloc[1, 1])  # a single value by position
print(df.iloc[0:1, [1, 2]])

# 3. Direct indexing
print(df['A'])  # select a column; only labels are accepted; same as df.A
print(df.transpose()['one'])  # a row can be selected by transposing first
print(df[1:2])  # a slice of row positions selects rows
print(df['A']['one'])  # first the column label, then the row label
print(df[1:2]['A'])  # this returns a one-dimensional Series
-------------------------------------------------------
        A   B   C
one     0   2   3
two     0   4   1
three  10  20  30
        A   B   C
one     0   2   3
two     0   4   1
three  10  20  30
        A   B   C
one     0   2   3
two     0   4   1
three  10  20  30
##############################
4
A    0
B    2
C    3
Name: one, dtype: int64
4
     A  B
one  0  2
two  0  4
        B   C
three  20  30
##############################
1
A    10
B    20
C    30
Name: three, dtype: int64
4
     B  C
one  2  3
one       0
two       0
three    10
Name: A, dtype: int64
A    0
B    2
C    3
Name: one, dtype: int64
     A  B  C
two  0  4  1
0
two    0
Name: A, dtype: int64
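One detail of loc and iloc that is easy to miss is how slices treat their end point. The sketch below reuses the same small table; the variable names by_label, by_pos, mask and filtered are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
                  index=['one', 'two', 'three'], columns=['A', 'B', 'C'])

# Label slices with loc include BOTH end points
by_label = df.loc['one':'two', 'A':'B']

# Position slices with iloc exclude the end, as in normal Python slicing
by_pos = df.iloc[0:2, 0:2]

# A boolean Series used directly as an indexer keeps the rows where it is True
mask = df['C'] > 2
filtered = df[mask]

print(by_label)
print(filtered)
```

Both slices select the same 2 x 2 block, even though the label slice names 'two' and 'B' explicitly while the position slice stops before position 2.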
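(2) Add, delete, modify, query
[1] Add: appending another table to the original one
import numpy as np
import pandas as pd

# df and df1 are not defined in the original listing; they are reconstructed here from the printed output below
df = pd.DataFrame([[0.0, np.nan, 3], [0.0, '', 1.0], [1.0, True, False]],
                  index=['one', 'two', 'three'], columns=['A', 'B', 'C'])
df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                   index=['two', 'three', 'four'], columns=['B', 'C', 'D'])
print('df:')
print(df)
print('df1:')
print(df1)

# Add: the new data should match the existing format and labels as closely as possible;
# missing data becomes np.nan, and missing labels continue from the last sequence number
# (1) Append rows: one or more tables are stacked below the original, which stays unchanged.
#     sort controls whether the columns are sorted; ignore_index controls whether the old index is discarded.
#     The original post used df.append, which was removed in pandas 2.0; pd.concat is the replacement.
print(pd.concat([df, df1], sort=False, ignore_index=False))
print(pd.concat([df, df1, df1], sort=False, ignore_index=True))  # stack two more tables below the first

# (2) join adds a table side by side; the original is unchanged. It is typically used to add a small table
#     to a large one. The result keeps the index of the table being joined to, and columns with the same
#     label in both tables can be told apart with suffixes.
print(df.join(df1, lsuffix='_left', rsuffix='_right'))

# (3) assign adds a column computed by a simple function; the original is unchanged
print(df.assign(new_column=lambda x: x['A'] + 4))

# (4) Direct assignment adds a column and changes the original
df['D'] = np.nan

# (5) update modifies the original in place (and returns None) using the values of the other table.
#     Values are overwritten by default; the index and columns stay those of the original table,
#     and only matching labels are updated.
print(df.update(df1))
print(df)
-----------------------------------------------
df:
         A     B      C
one    0.0   NaN      3
two    0.0            1
three  1.0  True  False
df1:
       B  C  D
two    1  2  3
three  4  5  6
four   7  8  9
         A     B      C    D
one    0.0   NaN      3  NaN
two    0.0            1  NaN
three  1.0  True  False  NaN
two    NaN     1      2  3.0
three  NaN     4      5  6.0
four   NaN     7      8  9.0
     A     B      C    D
0  0.0   NaN      3  NaN
1  0.0            1  NaN
2  1.0  True  False  NaN
3  NaN     1      2  3.0
4  NaN     4      5  6.0
5  NaN     7      8  9.0
6  NaN     1      2  3.0
7  NaN     4      5  6.0
8  NaN     7      8  9.0
         A B_left C_left  B_right  C_right    D
one    0.0    NaN      3      NaN      NaN  NaN
two    0.0             1      1.0      2.0  3.0
three  1.0   True  False      4.0      5.0  6.0
         A     B      C  new_column
one    0.0   NaN      3         4.0
two    0.0            1         4.0
three  1.0  True  False         5.0
None
         A    B  C    D
one    0.0  NaN  3  NaN
two    0.0    1  2  3.0
three  1.0    4  5  6.0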
[2] Delete
import numpy as np
import pandas as pd

df = pd.DataFrame([[0.0, np.nan, 3], [0.0, '', 1.0], [1.0, True, False], [1.0, True, False]],
                  index=['one', 'two', 'three', 'four'], columns=['A', 'B', 'C'])
# Delete
# drop removes rows and columns and returns the resulting table; the original is unchanged
print(df.drop(columns=['B', 'C']))
print(df.drop(index=['one']))
print(df.drop(columns=['B'], index=['one']))  # this removes the whole row and the whole column, not a single value, unlike indexing

# drop_duplicates removes duplicate rows.
# subset: the columns to compare (default None means all columns); keep: which duplicate to keep,
# one of 'first' (default), 'last', or False; inplace: whether to modify the original table directly
print(df.duplicated())  # check for duplicate rows
print(df.drop_duplicates(subset=['A'], keep='last', inplace=True))  # prints None because inplace=True
print(df)
------------------------------------
         A
one    0.0
two    0.0
three  1.0
four   1.0
         A     B      C
two    0.0            1
three  1.0  True  False
four   1.0  True  False
         A      C
two    0.0      1
three  1.0  False
four   1.0  False
one      False
two      False
three    False
four      True
dtype: bool
None
        A     B      C
two   0.0            1
four  1.0  True  False
[3] Modify
import numpy as np
import pandas as pd

# Modify
df = pd.DataFrame([[0.0, np.nan, 3], [0.0, '', 1.0], [1.0, True, False], [1.0, True, False]],
                  index=['one', 'two', 'three', 'four'], columns=['A', 'B', 'C'])

# (1) Locate the values by indexing and assign directly; the original table changes
df['A'] = np.nan
print(df)

# (2) Rename column and row labels; the original is unchanged and a copy is returned
print(df.rename(columns={"A": "a", "B": "c"}, index={'one': 0, 'two': 1}))

# (3) Change the labels. Where a method accepts inplace, False returns a copy and True modifies the original
print(df.reset_index(drop=False))  # replace the index with 0-based numbers; drop decides whether the old index is kept as an 'index' column
# Set new row labels. The original post passed inplace=True here; that parameter was removed
# from set_axis in pandas 2.0, so the result is assigned back instead.
df = df.set_axis(['I', 'II', 'IIII', 'IV'], axis='index')

# (4) Replace values; a copy is returned by default
print(df.replace([np.nan, ''], [0, 1]))  # first list: values to replace; second list: their replacements
print(df.replace({'A': np.nan, 'B': True}, 100))  # a dict targets specific columns and values
print(df)
-------------------------------------
         A     B      C
one    NaN   NaN      3
two    NaN            1
three  NaN  True  False
four   NaN  True  False
         a     c      C
0      NaN   NaN      3
1      NaN            1
three  NaN  True  False
four   NaN  True  False
   index   A     B      C
0    one NaN   NaN      3
1    two NaN            1
2  three NaN  True  False
3   four NaN  True  False
        A     B      C
I     0.0     0      3
II    0.0     1      1
IIII  0.0  True  False
IV    0.0  True  False
          A    B      C
I     100.0  NaN      3
II    100.0           1
IIII  100.0  100  False
IV    100.0  100  False
       A     B      C
I    NaN   NaN      3
II   NaN            1
IIII NaN  True  False
IV   NaN  True  False
[4] Query
import numpy as np
import pandas as pd

df = pd.DataFrame([[0.0, np.nan, 3], [0.0, 6, 1.0], [1.0, True, False], [1.0, True, False]],
                  index=['one', 'two', 'three', 'four'], columns=['A', 'B', 'C'])
# Query
print(df)
print(df.info())  # summary of the table: index, columns, non-null counts, dtypes, memory
print(df.describe())  # basic descriptive statistics of the numeric columns
print(df.duplicated())  # find duplicate rows
print(df.select_dtypes(include='float'))  # keep only columns of a given dtype
print(df.select_dtypes(exclude=['float']))  # drop columns of a given dtype
print(df.isna())  # test for missing values
print(df.notna())  # test for non-missing values
print(df.items())  # iterates over (column name, Series) pairs; columns are the keys, which is why df['A'] is a column
print(df.iterrows())  # iterates over (row label, Series) pairs
print(df.isin([0, 2]))  # whether each value is in the list
print(df.isin({'B': [0, 3]}))  # whether each value of the given column is in the list; all other columns are False
print(df.to_numpy())  # the data as an ndarray
# print(df.idxmax(axis=1))  # label of the maximum value in each row
# df.equals(df1)  # whether two tables contain the same elements
-------------------------------------------------------------
         A     B      C
one    0.0   NaN      3
two    0.0     6      1
three  1.0  True  False
four   1.0  True  False
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, one to four
Data columns (total 3 columns):
A    4 non-null float64
B    3 non-null object
C    4 non-null object
dtypes: float64(1), object(2)
memory usage: 128.0+ bytes
None
             A
count  4.00000
mean   0.50000
std    0.57735
min    0.00000
25%    0.00000
50%    0.50000
75%    1.00000
max    1.00000
one      False
two      False
three    False
four      True
dtype: bool
         A
one    0.0
two    0.0
three  1.0
four   1.0
          B      C
one     NaN      3
two       6      1
three  True  False
four   True  False
           A      B      C
one    False   True  False
two    False  False  False
three  False  False  False
four   False  False  False
          A      B     C
one    True  False  True
two    True   True  True
three  True   True  True
four   True   True  True
<generator object DataFrame.iteritems at 0x00000234BF340750>
<generator object DataFrame.iterrows at 0x00000234BF340750>
           A      B      C
one     True  False  False
two     True  False  False
three  False  False   True
four   False  False   True
           A      B      C
one    False  False  False
two    False  False  False
three  False  False  False
four   False  False  False
[[0.0 nan 3]
 [0.0 6 1.0]
 [1.0 True False]
 [1.0 True False]]
(3) Handling missing data
import numpy as np
import pandas as pd

dates = pd.date_range('20180507', periods=3)
df = pd.DataFrame(np.arange(12).reshape(3, 4), index=dates, columns=list('ABCD'))
# np.nan marks missing data; it is excluded from computations by default.
# The original post used df.ix, which was removed in pandas 1.0; loc does the same job here.
df.loc[dates[1], 'C'] = np.nan
print(df)
# Drop missing data
print(df.dropna(axis=0, how='any'))  # drop a row if it contains any missing value
print(df.dropna(axis=1, how='all'))  # drop a column only if all of its values are missing
# Fill missing data
print(df.fillna(value=0))  # replace NaN with 0
# Check for missing data
print(df.isnull())  # same as print(df.isna())
-----------------------------------------------------------
            A  B     C   D
2018-05-07  0  1   2.0   3
2018-05-08  4  5   NaN   7
2018-05-09  8  9  10.0  11
            A  B     C   D
2018-05-07  0  1   2.0   3
2018-05-09  8  9  10.0  11
            A  B     C   D
2018-05-07  0  1   2.0   3
2018-05-08  4  5   NaN   7
2018-05-09  8  9  10.0  11
            A  B     C   D
2018-05-07  0  1   2.0   3
2018-05-08  4  5   0.0   7
2018-05-09  8  9  10.0  11
                A      B      C      D
2018-05-07  False  False  False  False
2018-05-08  False  False   True  False
2018-05-09  False  False  False  False
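Filling every gap with one constant is often too coarse. As a small sketch on illustrative data (the table and the names filled_const, filled_mean and n_missing are assumptions, not part of the original post), fillna also accepts a per-column mapping, for example a dict or the Series of column means:

```python
import numpy as np
import pandas as pd

# Illustrative table with one missing value per column
df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, 5.0, 6.0]})

# Count the missing values per column
n_missing = df.isna().sum()

# A dict gives each column its own fill value
filled_const = df.fillna({'A': 0.0, 'B': -1.0})

# df.mean() is a Series indexed by column name, so each column is filled with its own mean
filled_mean = df.fillna(df.mean())

print(n_missing)
print(filled_const)
print(filled_mean)
```

None of these calls modifies df; each returns a new table.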
(4) Computation and statistics
import pandas as pd

df = pd.DataFrame([[1, -5, 3], [4, 5, 6], [7, 8, 9]],
                  index=['one', 'two', 'three'], columns=['A', 'B', 'C'])
print(df)

# Statistics and computation
print(df.abs())  # absolute value of every element
print(df.clip(-4, 6))  # clip every value to the range [-4, 6]
print(df.corr(method='pearson'))  # correlation between columns; methods: pearson, kendall, spearman
print(df.count())  # number of non-NA elements per column (NA means None, NaN, NaT; inf is not NA by default)
print(df.cov())  # covariance matrix of the columns
print(df.cummin())  # cumulative minimum down each column; cummax also exists
print(df.cumsum())  # cumulative sum down each column; cumprod (cumulative product) also exists
print(df.diff())  # difference between each element and the one in the previous row (periods=1 by default)
print(df.eval('A+B'))  # evaluate a string expression
print(df.max())  # maximum of each column; min also exists
print(df.idxmax())  # index label of the maximum of each column
print(df.std())  # standard deviation
print(df.var())  # variance
print(df.mean())  # mean
print(df.median())  # median
print(df.describe())  # descriptive statistics; 25% is the 25th percentile
print(df['A'].value_counts())  # number of occurrences of each value
print(df.round(2))  # round to 2 decimal places
# print(df.all())  # whether all elements in each column are True
-------------------------------------------------
       A  B  C
one    1 -5  3
two    4  5  6
three  7  8  9
       A  B  C
one    1  5  3
two    4  5  6
three  7  8  9
       A  B  C
one    1 -4  3
two    4  5  6
three  6  6  6
          A         B         C
A  1.000000  0.954919  1.000000
B  0.954919  1.000000  0.954919
C  1.000000  0.954919  1.000000
A    3
B    3
C    3
dtype: int64
      A          B     C
A   9.0  19.500000   9.0
B  19.5  46.333333  19.5
C   9.0  19.500000   9.0
       A  B  C
one    1 -5  3
two    1 -5  3
three  1 -5  3
        A  B   C
one     1 -5   3
two     5  0   9
three  12  8  18
         A     B    C
one    NaN   NaN  NaN
two    3.0  10.0  3.0
three  3.0   3.0  3.0
one      -4
two       9
three    15
dtype: int64
A    7
B    8
C    9
dtype: int64
A    three
B    three
C    three
dtype: object
A    3.000000
B    6.806859
C    3.000000
dtype: float64
A     9.000000
B    46.333333
C     9.000000
dtype: float64
A    4.000000
B    2.666667
C    6.000000
dtype: float64
A    4.0
B    5.0
C    6.0
dtype: float64
         A         B    C
count  3.0  3.000000  3.0
mean   4.0  2.666667  6.0
std    3.0  6.806859  3.0
min    1.0 -5.000000  3.0
25%    2.5  0.000000  4.5
50%    4.0  5.000000  6.0
75%    5.5  6.500000  7.5
max    7.0  8.000000  9.0
7    1
1    1
4    1
Name: A, dtype: int64
       A  B  C
one    1 -5  3
two    4  5  6
three  7  8  9
(5) Reading and saving
import pandas as pd

# Read with read_<format>, save with to_<format>
# Formats include csv/excel/hdf/sql/json/html/stata/sas/pickle/records/markdown/dict/latex, among others
# Fixing garbled characters (encoding problems): https://blog.csdn.net/leonzhouwei/article/details/8447643
df = pd.DataFrame([[1, -5, 3], [4, 5, 6], [7, 8, 9]],
                  index=['one', 'two', 'three'], columns=['A', 'B', 'C'])
df.to_csv('foo.csv', index=False)  # index=False leaves the row labels out of the file
data = pd.read_csv('foo.csv')
print(data)
----------------------------------
   A  B  C
0  1 -5  3
1  4  5  6
2  7  8  9
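The listing above writes with index=False, so the row labels 'one', 'two', 'three' are lost on the way back. A minimal sketch of a round trip that keeps them is shown below; it writes to an in-memory io.StringIO buffer instead of a file purely to stay self-contained, and the names buffer and restored are illustrative.

```python
import io

import pandas as pd

df = pd.DataFrame([[1, -5, 3], [4, 5, 6], [7, 8, 9]],
                  index=['one', 'two', 'three'], columns=['A', 'B', 'C'])

# Write the table including its index (the default) into an in-memory text buffer
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)

# index_col=0 tells read_csv to use the first column as the row labels again
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```

The same two calls work with a file path such as 'foo.csv' in place of the buffer.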
(6) Visualization
import matplotlib.pyplot as plt
import pandas as pd

# Plotting is done through matplotlib
# freq='M' means month end; recent pandas releases (2.2+) prefer the alias 'ME'
df = pd.DataFrame({
    'sales': [3, 2, 3, 9, 10, 6],
    'signups': [5, 5, 6, 12, 14, 13],
    'visits': [20, 42, 28, 62, 81, 50],
}, index=pd.date_range(start='2018/01/01', end='2018/07/01',
                       freq='M'))
df.plot()  # line plot (the default)
df.plot.area()  # area plot
df.plot.bar()  # vertical bar plot
df.plot.barh()  # horizontal bar plot
df.plot.hist()  # histogram of the columns
df.plot.line()  # line plot
df.plot.scatter(x='sales', y='signups')  # scatter plot
plt.show()  # display the figures
(7) Miscellaneous
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'A': [2, 3, 2],
     'B': [7, 8, 6],
     'C': [7, 11, 9]}
)
print(df)

print(df.insert(loc=1, column='D', value=[3, 7, 1]))  # insert a column at a given position; works in place and returns None
print(df.astype({'A': 'float16'}))  # convert data types
print(df.copy())  # copy
print(df.applymap(lambda x: x + 1))  # apply a function to every element (pandas 2.1+ names this DataFrame.map)
df2 = df.groupby(by=['A'])  # group the rows by the distinct values of a column
print(df2.get_group(2))  # get the sub-table for one group
print(df.aggregate(np.median))  # aggregate with one or more operations along the given axis

print('#'*30)
print(df)

print(df.sort_values(by='C', ascending=False))  # sort the whole table by the values of one column; ascending=False gives descending order
# Convert sparse data back to dense. The original post called df.to_dense(), which was removed in pandas 1.0;
# the sparse accessor is used instead, on a copy converted to a sparse dtype for the sake of the example.
print(df.astype(pd.SparseDtype('int64', 0)).sparse.to_dense())
-------------------------------------------------
   A  B   C
0  2  7   7
1  3  8  11
2  2  6   9
None
     A  D  B   C
0  2.0  3  7   7
1  3.0  7  8  11
2  2.0  1  6   9
   A  D  B   C
0  2  3  7   7
1  3  7  8  11
2  2  1  6   9
   A  D  B   C
0  3  4  8   8
1  4  8  9  12
2  3  2  7  10
   A  D  B  C
0  2  3  7  7
2  2  1  6  9
A    2.0
D    3.0
B    7.0
C    9.0
dtype: float64
##############################
   A  D  B   C
0  2  3  7   7
1  3  7  8  11
2  2  1  6   9
   A  D  B   C
1  3  7  8  11
2  2  1  6   9
0  2  3  7   7
   A  D  B   C
0  2  3  7   7
1  3  7  8  11
2  2  1  6   9
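The groupby call above only retrieves one group. In practice a grouping is usually followed by an aggregation, which can be sketched like this on the same three-row table (the names grouped, sums and stats are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 3, 2],
                   'B': [7, 8, 6],
                   'C': [7, 11, 9]})

# Rows with the same value in column A form one group
grouped = df.groupby('A')

# One aggregate per group and column; the group keys become the index
sums = grouped.sum()

# Several aggregates of a single column at once
stats = grouped['C'].agg(['min', 'max', 'mean'])

print(sums)
print(stats)
```

Here the group A=2 contains rows 0 and 2, so its sums are B=13 and C=16, and the mean of C is 8.0.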
5. Operations on two tables
(1) Combining
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[0.0, np.nan, 3], [0.0, 3, 1.0], [1.0, True, False]],
                   index=['one', 'two', 'three'], columns=['A', 'B', 'C'])
df2 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                   index=['two', 'three', 'four'], columns=['B', 'C', 'D'])

# (1) concat: axis=0 stacks the tables vertically, axis=1 places them side by side;
#     ignore_index=True discards the old index and renumbers it;
#     join='outer' takes the union of the labels, join='inner' the intersection;
#     sort decides whether the non-concatenation axis is sorted (strings sort alphabetically)
print(pd.concat([df1, df2], axis=1, ignore_index=False, sort=True, join='outer'))
print(pd.concat([df1, df2], axis=0, ignore_index=False, sort=False, join='inner'))

# (2) merge: combine on key columns
#     how: one of 'left', 'right', 'outer', 'inner' (keep the keys of the left table, of the right table, the union, or the intersection)
#     left_index / right_index: whether to use the index of the left / right table as the key (True or False)
#     suffixes: markers appended to columns that have the same name but different data in the two tables
print(df1)
print(df2)
print(pd.merge(df1, df2, on=['B'], suffixes=['_left', '_right'], how='outer'))  # merge on the shared key column 'B'
----------------------------------------------------------
         A     B      C    B    C    D
four   NaN   NaN    NaN  7.0  8.0  9.0
one    0.0   NaN      3  NaN  NaN  NaN
three  1.0  True  False  4.0  5.0  6.0
two    0.0     3      1  1.0  2.0  3.0
          B      C
one     NaN      3
two       3      1
three  True  False
two       1      2
three     4      5
four      7      8
         A     B      C
one    0.0   NaN      3
two    0.0     3      1
three  1.0  True  False
       B  C  D
two    1  2  3
three  4  5  6
four   7  8  9
     A     B C_left  C_right    D
0  0.0   NaN      3      NaN  NaN
1  0.0     3      1      NaN  NaN
2  1.0  True  False      2.0  3.0
3  NaN     4    NaN      5.0  6.0
4  NaN     7    NaN      8.0  9.0
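The comments above mention left_index and right_index, but the listing only merges on a column. A small sketch of merging on the row labels instead, using simplified illustrative tables (the data and the names inner and outer are assumptions):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3]}, index=['one', 'two', 'three'])
df2 = pd.DataFrame({'D': [10, 20, 30]}, index=['two', 'three', 'four'])

# Use the index of both tables as the join key
inner = pd.merge(df1, df2, left_index=True, right_index=True, how='inner')
outer = pd.merge(df1, df2, left_index=True, right_index=True, how='outer')

print(inner)  # only the labels present in both tables
print(outer)  # all labels; gaps are filled with NaN
```

With how='inner' only 'two' and 'three' survive, while how='outer' keeps all four labels and leaves NaN where one side has no row.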
(2) Computation: works much like NumPy
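As a rough illustration of how this resembles NumPy and where it differs, the sketch below uses two small illustrative tables (the data and all variable names are assumptions). Scalars broadcast and NumPy functions apply element by element, but arithmetic between two tables is aligned on labels first.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], index=['one', 'two'], columns=['A', 'B'])
df2 = pd.DataFrame([[10, 20], [30, 40]], index=['two', 'three'], columns=['A', 'B'])

# Element-wise arithmetic with scalar broadcasting, as in NumPy
scaled = df1 * 2 + 1

# NumPy universal functions accept a DataFrame and return a DataFrame
roots = np.sqrt(df1)

# Between two tables, values are matched by label; labels found in only one table give NaN
aligned = df1 + df2

# The method form can treat a missing side as 0 instead
filled = df1.add(df2, fill_value=0)

print(scaled)
print(aligned)
print(filled)
```

Only the row 'two' exists in both tables, so it is the only row of aligned without NaN.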
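IV. General notes
1. Any operation on columns has an equivalent operation on rows, and vice versa, by transposing the table first.
2. In pandas, index holds the row labels and corresponds to axis=0; columns holds the column labels and corresponds to axis=1.
3. When a DataFrame is created from a dict, the resulting table has the index running top to bottom on the far left, the columns running left to right along the top, and the dict data in the middle, where every value corresponds to one index label and one column label.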
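Points 1 and 2 can be checked with a short sketch on the table used in the statistics section (the names col_sums, row_sums, rows_via_transpose and column_b_via_transpose are illustrative):

```python
import pandas as pd

df = pd.DataFrame([[1, -5, 3], [4, 5, 6], [7, 8, 9]],
                  index=['one', 'two', 'three'], columns=['A', 'B', 'C'])

# axis=0 reduces along the index: one result per column
col_sums = df.sum(axis=0)

# axis=1 reduces along the columns: one result per row
row_sums = df.sum(axis=1)

# The row-wise result equals the column-wise result on the transposed table
rows_via_transpose = df.T.sum(axis=0)

# Selecting a row of the transposed table is the same as selecting a column
column_b_via_transpose = df.T.loc['B']

print(col_sums)
print(row_sums)
```

Transposing is convenient for small tables like this one; for large tables, passing the axis argument directly avoids building the transposed copy.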