• pandas-1.0.3


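    I. Introduction

    1. pandas (Python Data Analysis Library) is a tool built on NumPy, used mainly for data processing (cleaning, manipulating, storing, and reading data) and data analysis.

    2. Website: http://pandas.pydata.org/; PDF documentation: https://pandas.pydata.org/docs/pandas.pdf

    3. pandas provides several data structures (classes). The two used most are Series (a one-dimensional labeled array) and DataFrame (a two-dimensional table). The three-dimensional Panel from older releases was removed in pandas 0.25, so it is not available in 1.0.3.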

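    II. Series

    1. A one-dimensional array with labels (the index). It can hold any data type (int, str, float, Python objects, and so on). When no index is given, the labels default to integers starting at 0. A Series can be thought of as a single column of a table.

    2. Creation: pd.Series(data, index=index), where data can be a dict, an ndarray, or a scalar.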

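    (1) From a dict

    import pandas as pd
    # 1. With no index argument, the dict keys become the index and the dict values become the data
    #    (on Python 3.6+ with pandas 0.23+ the index keeps the dict's insertion order)
    d = {'b': 1, 'a': 0, 'c': 2}
    s = pd.Series(d)
    print(s)
    # 2. With an index argument, the values whose keys match the index labels are pulled out in index order;
    #    labels with no matching key get NaN (not a number), the standard missing-data marker in pandas
    s2 = pd.Series(d, index=['b', 'c', 'd', 'a'])
    print(s2)
    ----------------------------------------------------------
    b    1
    a    0
    c    2
    dtype: int64
    b    1.0
    c    2.0
    d    NaN
    a    0.0
    dtype: float64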

    (2) From an ndarray

    import numpy as np
    import pandas as pd
    # If an index is passed, its length must equal the length of the data
    s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
    # If no index is passed, a default index with the values [0, ..., len(data) - 1] is created
    s2 = pd.Series(np.random.randn(5))
    print(s)   # the values are random and differ from run to run
    print(s2)
    -------------------------------------------
    a   -0.019921
    b   -2.324644
    c   -0.429393
    d    1.436731
    e    2.564406
    dtype: float64
    0   -0.925714
    1    0.319075
    2    0.528071
    3   -0.385841
    4    0.963207
    dtype: float64

    (3) From a scalar

    import pandas as pd
    # If data is a scalar, an index must be provided; the value is repeated to match the length of the index
    s = pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
    print(s)
    -------------------------------------------------------
    a    5.0
    b    5.0
    c    5.0
    d    5.0
    e    5.0
    dtype: float64
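    The points above (labels, default integer positions, and NaN as the missing-data marker) can be tried out together. The following is a minimal sketch; the variable names and values are made up for illustration and are not from the original notes. It reads values by label and by position, then adds two Series whose indexes only partly overlap.

```python
import pandas as pd

# Illustrative data, not taken from the article
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

by_label = s.loc['b']      # label-based access
by_position = s.iloc[0]    # position-based access
subset = s[['a', 'c']]     # a list of labels returns a smaller Series

# Arithmetic aligns on labels: labels present in only one operand give NaN
other = pd.Series([1, 2], index=['a', 'd'])
total = s + other

print(by_label, by_position)
print(subset)
print(total)
```

    Only the label 'a' exists in both Series, so it is the only position in total that holds a number.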

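    III. DataFrame

    1. A two-dimensional tabular data structure with an index (row labels) and columns (column labels). If labels are not specified, they are generated automatically from 0 to length-1.

    2. Creation: pd.DataFrame(data, index=index, columns=columns), where data can be an ndarray, a list, a dict, a Series, or another DataFrame. This means a Series is easy to convert into a DataFrame.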

    (1) From a dict {'column': Series} (recommended)

    import pandas as pd
    # The dict keys become the column labels and each value is one Series;
    # on Python 3.6+ the columns follow the dict's insertion order
    d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
         'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(d)
    print(df)
    print(pd.DataFrame(d, index=['d', 'b', 'a']))  # if columns is not passed, the columns are the dict keys in order
    print(pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three']))  # a column with no matching key is filled with NaN
    -----------------------------------------------------------
       one  two
    a  1.0  1.0
    b  2.0  2.0
    c  3.0  3.0
    d  NaN  4.0
       one  two
    d  NaN  4.0
    b  2.0  2.0
    a  1.0  1.0
       two three
    d  4.0   NaN
    b  2.0   NaN
    a  1.0   NaN

    (2) From a dict {'column': ndarray / list}

    import pandas as pd
    # All the ndarrays (or lists) must have the same length
    d = {'one': [1., 2., 3., 4.],
         'two': [4., 3., 2., 1.]}
    df = pd.DataFrame(d)
    print(df)  # if no index is passed, the index is range(n)
    print(pd.DataFrame(d, index=['a', 'b', 'c', 'd']))  # if an index is passed, it must have the same length as the arrays
    ----------------------------------------
       one  two
    0  1.0  4.0
    1  2.0  3.0
    2  3.0  2.0
    3  4.0  1.0
       one  two
    a  1.0  4.0
    b  2.0  3.0
    c  3.0  2.0
    d  4.0  1.0

    (3) From a list of dicts [{'column': value}, {...}]

    import pandas as pd
    # Each dict is one record (one row), in list order; the dict keys become the column labels
    data2 = [{'a': 1, 'b': 2},
             {'a': 5, 'b': 10, 'c': 20}]
    df = pd.DataFrame(data2)
    print(df)
    print(pd.DataFrame(data2, index=['first', 'second']))
    print(pd.DataFrame(data2, columns=['a', 'b']))
    -----------------------------------------------------
       a   b     c
    0  1   2   NaN
    1  5  10  20.0
            a   b     c
    first   1   2   NaN
    second  5  10  20.0
       a   b
    0  1   2
    1  5  10
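    The remark that a Series converts easily into a DataFrame can be illustrated with a short sketch. The Series below reuses the 'one' values from the dict example; the column label 'uno' is a made-up name for illustration.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'], name='one')

# to_frame() returns a one-column DataFrame; the Series name becomes the column label
df_from_method = s.to_frame()

# Wrapping the Series in a dict lets you choose the column label explicitly
df_from_dict = pd.DataFrame({'uno': s})

print(df_from_method)
print(df_from_dict)
```

    In both cases the Series index is carried over unchanged as the row labels.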

    3. Common attributes

    import numpy as np
    import pandas as pd
    df = pd.DataFrame(np.random.randn(20, 3))

    print(df)
    print(df.dtypes)       # data type of each column
    print(df.index)        # row labels (the index), i.e. the labels of axis 0
    print(df.columns)      # column labels, i.e. the labels of axis 1
    print(df.values)       # the data as a NumPy ndarray
    print(df.axes)         # list holding the row labels and the column labels, one entry per axis
    print(df.ndim)         # number of axes (dimensions); 2 for a DataFrame
    print(df.size)         # total number of elements, rows * columns
    print(df.shape)        # shape of the table as a tuple (rows, columns)
    print(df.empty)        # whether the table is empty
    ---------------------------------------------
               0         1         2
    0   1.463760  0.889478  2.719325
    1   1.151600 -1.543238 -1.092630
    2  -0.915165  0.182080  0.015987
    3   0.997108 -1.062458 -0.290915
    4   0.506596  0.521730  1.204838
    5  -1.205070  0.593703 -0.237471
    6  -0.941276  0.375154  2.109682
    7   0.490349 -0.333887 -2.234917
    8   0.927428  1.178269 -0.252521
    9  -0.734907 -1.701942  0.008140
    10  0.066335 -0.279483 -1.536980
    11  0.381364  0.527889 -0.735369
    12 -1.759830  0.837367 -0.311767
    13 -0.331585 -0.081331 -1.250890
    14  0.010716  0.100442  0.030236
    15 -0.718699  1.051054  0.990649
    16 -0.295118  0.463517 -0.011839
    17  0.216745  1.397626 -0.242623
    18  0.667472  1.096342 -1.638717
    19 -0.972141 -0.502762 -0.484464
    0    float64
    1    float64
    2    float64
    dtype: object
    RangeIndex(start=0, stop=20, step=1)
    RangeIndex(start=0, stop=3, step=1)
    [[ 1.46376002  0.88947846  2.71932544]
     [ 1.15159974 -1.54323848 -1.09262975]
     [-0.91516515  0.18208043  0.01598746]
     [ 0.99710758 -1.06245845 -0.29091497]
     [ 0.506596    0.52173002  1.20483791]
     [-1.20507018  0.59370326 -0.23747127]
     [-0.94127614  0.37515381  2.10968156]
     [ 0.49034868 -0.33388734 -2.23491685]
     [ 0.9274285   1.17826853 -0.25252075]
     [-0.73490667 -1.70194233  0.00814038]
     [ 0.06633457 -0.27948286 -1.53698016]
     [ 0.3813643   0.52788901 -0.73536879]
     [-1.7598303   0.83736726 -0.31176709]
     [-0.33158538 -0.08133123 -1.25089025]
     [ 0.01071576  0.10044171  0.03023593]
     [-0.71869935  1.05105444  0.99064853]
     [-0.29511769  0.46351668 -0.01183942]
     [ 0.216745    1.3976258  -0.24262259]
     [ 0.66747248  1.0963415  -1.63871673]
     [-0.97214094 -0.50276224 -0.48446367]]
    [RangeIndex(start=0, stop=20, step=1), RangeIndex(start=0, stop=3, step=1)]
    2
    60
    (20, 3)
    False

    4. Common methods

    (1) Viewing data. Indexing comes in three forms: direct indexing, position-based indexing with iloc, and label-based indexing with loc.

    import pandas as pd
    df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
                      index=['one', 'two', 'three'], columns=['A', 'B', 'C'])
    print(df)
    # A DataFrame reads left to right, top to bottom; the leftmost column holds the index, and each index label identifies one record (row)
    print(df.head())         # show the first rows of the table (5 by default)
    print(df.tail())         # show the last rows of the table (5 by default)
    print('#'*30)

    # Indexing
    # 1. By label
    print(df.at['two', 'B'])               # a single value by row label and column label
    # loc indexes by label; tuples can also be used as keys when the axis is a MultiIndex
    print(df.loc['one'])                   # one row by label, same as df.loc['one', :]
    print(df.loc['two', 'B'])              # one value by row label and column label, same as df.loc['two'].at['B']
    print(df.loc[['one', 'two'], 'A':'B']) # a list of row labels plus a slice of column labels (label slices include both ends)
    print(df.loc[df['C'] > 6, df.loc['three'] > 10])  # boolean (conditional) indexing on rows and columns

    print('#'*30)
    # 2. By position
    print(df.iat[1, 2])                  # a single value by position, same as df.iloc[1].iat[2]
    # iloc works like loc but takes integer positions instead of labels: a comma separates the row and
    # column indexers, a colon is a slice (end excluded), and a list selects several positions
    print(df.iloc[2])                    # the row at position 2, i.e. the third row
    print(df.iloc[1, 1])                 # a single value by position
    print(df.iloc[0:1, [1, 2]])

    # 3. Direct indexing
    print(df['A'])                      # select one column, by label only; same as df.A
    print(df.transpose()['one'])        # a row can be selected by transposing first
    print(df[1:2])                      # an integer slice selects rows by position
    print(df['A']['one'])               # column label first, then row label
    print(df[1:2]['A'])                 # this returns a one-dimensional Series
    -------------------------------------------------------
            A   B   C
    one     0   2   3
    two     0   4   1
    three  10  20  30
            A   B   C
    one     0   2   3
    two     0   4   1
    three  10  20  30
            A   B   C
    one     0   2   3
    two     0   4   1
    three  10  20  30
    ##############################
    4
    A    0
    B    2
    C    3
    Name: one, dtype: int64
    4
         A  B
    one  0  2
    two  0  4
            B   C
    three  20  30
    ##############################
    1
    A    10
    B    20
    C    30
    Name: three, dtype: int64
    4
         B  C
    one  2  3
    one       0
    two       0
    three    10
    Name: A, dtype: int64
    A    0
    B    2
    C    3
    Name: one, dtype: int64
         A  B  C
    two  0  4  1
    0
    two    0
    Name: A, dtype: int64
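    The same indexers can be used on the left side of an assignment. The sketch below is an illustration added here, not part of the original notes, and the replacement values 40 and -1 are arbitrary. It writes through loc in a single step. Chained forms such as df['B']['two'] = 40 may end up writing to a temporary copy (pandas reports this as SettingWithCopyWarning), which is why a single loc call is the safer habit.

```python
import pandas as pd

df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
                  index=['one', 'two', 'three'], columns=['A', 'B', 'C'])

# One-step label-based assignment: changes df itself
df.loc['two', 'B'] = 40

# A boolean mask on the rows combined with a column label:
# set A to -1 wherever C is greater than 6 (only row 'three' here)
df.loc[df['C'] > 6, 'A'] = -1

print(df)
```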

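     (2) Add, delete, modify, query

     [1] Add

    import numpy as np
    import pandas as pd
    # df and df1 are reconstructed here from the printed output below
    df = pd.DataFrame([[0.0, np.nan, 3], [0.0, '', 1], [1.0, True, False]],
                      index=['one', 'two', 'three'], columns=['A', 'B', 'C'])
    df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                       index=['two', 'three', 'four'], columns=['B', 'C', 'D'])
    print('df:', df)
    print('df1:', df1)
    # Adding: the new data should match the existing format and labels as far as possible; cells with no data become np.nan
    # (1) append adds one or more rows at the bottom and returns a new table (the original is unchanged);
    #     sort controls whether the columns are sorted, ignore_index=True discards the old index and renumbers from 0.
    #     append works in pandas 1.0.3 but was removed in pandas 2.0, where pd.concat replaces it
    print(df.append(df1, sort=False, ignore_index=False))
    print(df.append([df1, df1], sort=False, ignore_index=True))  # append two tables below the first one

    # (2) join adds a table side by side and returns a new table; it is typically used to attach a small table to a large one.
    #     The result keeps the index of the calling table; lsuffix/rsuffix rename column labels that appear in both tables
    print(df.join(df1, lsuffix='_left', rsuffix='_right'))

    # (3) assign adds a column computed by a simple function and returns a new table
    print(df.assign(new_column=lambda x: x['A'] + 4))

    # (4) Direct assignment adds a column and changes the original table
    df['D'] = np.nan

    # (5) update changes the original table in place using the non-NA values of the other table and returns None;
    #     only labels that already exist in the original index and columns are updated (no rows or columns are added)
    print(df.update(df1))
    print(df)
    -----------------------------------------------
    df:          A     B      C
    one    0.0   NaN      3
    two    0.0            1
    three  1.0  True  False
    df1:        B  C  D
    two    1  2  3
    three  4  5  6
    four   7  8  9
             A     B      C    D
    one    0.0   NaN      3  NaN
    two    0.0            1  NaN
    three  1.0  True  False  NaN
    two    NaN     1      2  3.0
    three  NaN     4      5  6.0
    four   NaN     7      8  9.0
         A     B      C    D
    0  0.0   NaN      3  NaN
    1  0.0            1  NaN
    2  1.0  True  False  NaN
    3  NaN     1      2  3.0
    4  NaN     4      5  6.0
    5  NaN     7      8  9.0
    6  NaN     1      2  3.0
    7  NaN     4      5  6.0
    8  NaN     7      8  9.0
             A B_left C_left  B_right  C_right    D
    one    0.0    NaN      3      NaN      NaN  NaN
    two    0.0             1      1.0      2.0  3.0
    three  1.0   True  False      4.0      5.0  6.0
             A     B      C  new_column
    one    0.0   NaN      3         4.0
    two    0.0            1         4.0
    three  1.0  True  False         5.0
    None
             A    B  C    D
    one    0.0  NaN  3  NaN
    two    0.0    1  2  3.0
    three  1.0    4  5  6.0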
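    DataFrame.append was removed in pandas 2.0, so on newer installations the row-wise additions above need pd.concat instead. The following is a minimal sketch of the equivalent calls, using the same two tables as the output above; the names stacked and renumbered are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0.0, np.nan, 3], [0.0, '', 1], [1.0, True, False]],
                  index=['one', 'two', 'three'], columns=['A', 'B', 'C'])
df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                   index=['two', 'three', 'four'], columns=['B', 'C', 'D'])

# Counterpart of df.append(df1, sort=False, ignore_index=False)
stacked = pd.concat([df, df1], sort=False)

# Counterpart of df.append([df1, df1], sort=False, ignore_index=True)
renumbered = pd.concat([df, df1, df1], sort=False, ignore_index=True)

print(stacked)
print(renumbered)
```

    As with append, neither input table is modified; each call returns a new DataFrame.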

     [2] Delete

    import numpy as np
    import pandas as pd
    df = pd.DataFrame([[0.0, np.nan, 3], [0.0, '', 1.0], [1.0, True, False], [1.0, True, False]],
                      index=['one', 'two', 'three', 'four'], columns=['A', 'B', 'C'])

    # drop removes rows and columns and returns the resulting table; the original is unchanged
    print(df.drop(columns=['B', 'C']))
    print(df.drop(index=['one']))
    print(df.drop(columns=['B'], index=['one']))  # removes the whole row and the whole column, not a single value (unlike indexing)

    # drop_duplicates removes duplicate rows.
    # subset: the columns to compare (default None compares all columns); keep: which duplicate to retain,
    # 'first' (default), 'last', or False (drop them all); inplace: whether to modify the original table (the call then returns None)
    print(df.duplicated())                # flag duplicate rows
    print(df.drop_duplicates(subset=['A'], keep='last', inplace=True))
    print(df)
    ------------------------------------
             A
    one    0.0
    two    0.0
    three  1.0
    four   1.0
             A     B      C
    two    0.0            1
    three  1.0  True  False
    four   1.0  True  False
             A      C
    two    0.0      1
    three  1.0  False
    four   1.0  False
    one      False
    two      False
    three    False
    four      True
    dtype: bool
    None
            A     B      C
    two   0.0            1
    four  1.0  True  False
     [3] Modify

    import numpy as np
    import pandas as pd
    df = pd.DataFrame([[0.0, np.nan, 3], [0.0, '', 1.0], [1.0, True, False], [1.0, True, False]],
                      index=['one', 'two', 'three', 'four'], columns=['A', 'B', 'C'])

    # (1) Locate the values by indexing and assign directly; the original table changes
    df['A'] = np.nan
    print(df)

    # (2) rename changes column and row labels; the original labels stay as they are and a copy is returned
    print(df.rename(columns={"A": "a", "B": "c"}, index={'one': 0, 'two': 1}))

    # (3) Replace the labels; inplace chooses between returning a copy (False) and changing the original table (True)
    print(df.reset_index(drop=False))  # renumber the index from 0; drop=False keeps the old labels in a new 'index' column
    print(df.set_axis(['I', 'II', 'IIII', 'IV'], axis='index', inplace=True))  # set new labels; prints None because inplace=True

    # (4) replace substitutes values; it returns a copy by default
    print(df.replace([np.nan, ''], [0, 1]))    # first list: values to replace; second list: the matching replacements
    print(df.replace({'A': np.nan, 'B': True}, 100))  # a dict targets specific values in specific columns
    print(df)
    -------------------------------------
            A     B      C
    one   NaN   NaN      3
    two   NaN            1
    three NaN  True  False
    four  NaN  True  False
            a     c      C
    0     NaN   NaN      3
    1     NaN            1
    three NaN  True  False
    four  NaN  True  False
       index   A     B      C
    0    one NaN   NaN      3
    1    two NaN            1
    2  three NaN  True  False
    3   four NaN  True  False
    None
            A     B      C
    I     0.0     0      3
    II    0.0     1      1
    IIII  0.0  True  False
    IV    0.0  True  False
              A    B      C
    I     100.0  NaN      3
    II    100.0           1
    IIII  100.0  100  False
    IV    100.0  100  False
           A     B      C
    I    NaN   NaN      3
    II   NaN            1
    IIII NaN  True  False
    IV   NaN  True  False
     [4] Query

    import numpy as np
    import pandas as pd
    df = pd.DataFrame([[0.0, np.nan, 3], [0.0, 6, 1.0], [1.0, True, False], [1.0, True, False]],
                      index=['one', 'two', 'three', 'four'], columns=['A', 'B', 'C'])

    print(df)
    print(df.info())                              # print a summary of the table (index, columns, non-null counts, dtypes); returns None
    print(df.describe())                          # summary statistics of the numeric columns
    print(df.duplicated())                        # flag duplicate rows
    print(df.select_dtypes(include='float'))      # keep only the columns of a given type
    print(df.select_dtypes(exclude=['float']))    # drop the columns of a given type
    print(df.isna())                              # test for missing values
    print(df.notna())                             # test for non-missing values
    print(df.items())                             # generator of (column name, column) pairs; a DataFrame is keyed by column name, which is why df['A'] is a column
    print(df.iterrows())                          # generator of (row label, row) pairs
    print(df.isin([0, 2]))                        # whether each value is in the list
    print(df.isin({'B': [0, 3]}))                 # whether each value of the given column is in the list; other columns are False
    print(df.to_numpy())                          # the data as a NumPy array
    # print(df.idxmax(axis=1))                    # label of the maximum value in each row
    # df.equals(df1)                              # whether two tables have the same shape and the same elements
    -------------------------------------------------------------
             A     B      C
    one    0.0   NaN      3
    two    0.0     6      1
    three  1.0  True  False
    four   1.0  True  False
    <class 'pandas.core.frame.DataFrame'>
    Index: 4 entries, one to four
    Data columns (total 3 columns):
    A    4 non-null float64
    B    3 non-null object
    C    4 non-null object
    dtypes: float64(1), object(2)
    memory usage: 128.0+ bytes
    None
                 A
    count  4.00000
    mean   0.50000
    std    0.57735
    min    0.00000
    25%    0.00000
    50%    0.50000
    75%    1.00000
    max    1.00000
    one      False
    two      False
    three    False
    four      True
    dtype: bool
             A
    one    0.0
    two    0.0
    three  1.0
    four   1.0
              B      C
    one     NaN      3
    two       6      1
    three  True  False
    four   True  False
               A      B      C
    one    False   True  False
    two    False  False  False
    three  False  False  False
    four   False  False  False
              A      B     C
    one    True  False  True
    two    True   True  True
    three  True   True  True
    four   True   True  True
    <generator object DataFrame.iteritems at 0x00000234BF340750>
    <generator object DataFrame.iterrows at 0x00000234BF340750>
               A      B      C
    one     True  False  False
    two     True  False  False
    three  False  False   True
    four   False  False   True
               A      B      C
    one    False  False  False
    two    False  False  False
    three  False  False  False
    four   False  False  False
    [[0.0 nan 3]
     [0.0 6 1.0]
     [1.0 True False]
     [1.0 True False]]
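    Printing df.items() or df.iterrows() shows only a generator object, so the pairs they yield are easier to see in a loop. The following is a small sketch with made-up data (a two-row table, not the one above): items() yields (column label, column Series) pairs and iterrows() yields (row label, row Series) pairs.

```python
import pandas as pd

# Illustrative data, not taken from the article
df = pd.DataFrame({'A': [0.0, 1.0], 'B': [6, 7]}, index=['one', 'two'])

# items(): iterate column by column
column_names = [name for name, column in df.items()]

# iterrows(): iterate row by row; each row arrives as a Series indexed by column label
row_sums = {label: float(row.sum()) for label, row in df.iterrows()}

print(column_names)
print(row_sums)
```

    Row-wise Python loops are slow on large tables, so they are best kept for inspection and small data.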

    (3) Handling missing data

    import numpy as np
    import pandas as pd
    dates = pd.date_range('20180507', periods=3)
    df = pd.DataFrame(np.arange(12).reshape(3, 4), index=dates, columns=list('ABCD'))
    # np.nan marks missing data; by default it is left out of computations
    # (.ix was removed in pandas 1.0, so the cell is set through .loc instead)
    df.loc[dates[1], 'C'] = np.nan
    print(df)
    # Drop the affected data
    print(df.dropna(axis=0, how='any'))  # drop rows: a row is dropped if it contains any missing value
    print(df.dropna(axis=1, how='all'))  # drop columns: a column is dropped only if all its values are missing
    # Fill the affected data
    print(df.fillna(value=0))  # replace NaN with 0
    # Check for missing data
    print(df.isnull())  # same as df.isna()
    -----------------------------------------------------------
                A  B     C   D
    2018-05-07  0  1   2.0   3
    2018-05-08  4  5   NaN   7
    2018-05-09  8  9  10.0  11
                A  B     C   D
    2018-05-07  0  1   2.0   3
    2018-05-09  8  9  10.0  11
                A  B     C   D
    2018-05-07  0  1   2.0   3
    2018-05-08  4  5   NaN   7
    2018-05-09  8  9  10.0  11
                A  B     C   D
    2018-05-07  0  1   2.0   3
    2018-05-08  4  5   0.0   7
    2018-05-09  8  9  10.0  11
                    A      B      C      D
    2018-05-07  False  False  False  False
    2018-05-08  False  False   True  False
    2018-05-09  False  False  False  False
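    Filling with a constant is not the only option. The sketch below is an addition for illustration: it shows two other common fills on the same kind of table, the column mean and the previous row's value. The table is created as float here, which is an assumption made so that assigning NaN does not change any column's type.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180507', periods=3)
df = pd.DataFrame(np.arange(12, dtype=float).reshape(3, 4),
                  index=dates, columns=list('ABCD'))
df.loc[dates[1], 'C'] = np.nan

# df.mean() skips NaN, so column C's mean is (2 + 10) / 2 = 6;
# passing the resulting Series fills each column with its own mean
filled_mean = df.fillna(df.mean())

# Forward fill: copy the last valid value downward (2.0 for the missing cell)
filled_prev = df.ffill()

print(filled_mean)
print(filled_prev)
```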

    (4) Computation and statistics

    import pandas as pd
    df = pd.DataFrame([[1, -5, 3], [4, 5, 6], [7, 8, 9]],
                      index=['one', 'two', 'three'], columns=['A', 'B', 'C'])
    print(df)

    # Statistics and computation
    print(df.abs())                   # absolute value of every element
    print(df.clip(-4, 6))             # limit every value to the range [-4, 6]
    print(df.corr(method='pearson'))  # correlation between columns; methods: pearson, kendall, spearman
    print(df.count())                 # number of non-NA (None, NaN, NaT) elements per column; inf counts as a value by default
    print(df.cov())                   # covariance matrix of the columns
    print(df.cummin())                # cumulative minimum down each column; cummax works the same way
    print(df.cumsum())                # cumulative sum down each column; cumprod gives the cumulative product
    print(df.diff())                  # difference between each row and the previous row (periods=1), so the first row is NaN
    print(df.eval('A+B'))             # evaluate a string expression on the columns
    print(df.max())                   # maximum of each column; min gives the minimum
    print(df.idxmax())                # index label of the maximum of each column
    print(df.std())                   # standard deviation
    print(df.var())                   # variance
    print(df.mean())                  # mean
    print(df.median())                # median
    print(df.describe())              # summary statistics; 25% is the 25th percentile
    print(df['A'].value_counts())     # number of occurrences of each distinct value
    print(df.round(2))                # round to 2 decimal places
    # print(df.all())                 # whether all elements of each column are True
    -------------------------------------------------
           A  B  C
    one    1 -5  3
    two    4  5  6
    three  7  8  9
           A  B  C
    one    1  5  3
    two    4  5  6
    three  7  8  9
           A  B  C
    one    1 -4  3
    two    4  5  6
    three  6  6  6
              A         B         C
    A  1.000000  0.954919  1.000000
    B  0.954919  1.000000  0.954919
    C  1.000000  0.954919  1.000000
    A    3
    B    3
    C    3
    dtype: int64
          A          B     C
    A   9.0  19.500000   9.0
    B  19.5  46.333333  19.5
    C   9.0  19.500000   9.0
           A  B  C
    one    1 -5  3
    two    1 -5  3
    three  1 -5  3
            A  B   C
    one     1 -5   3
    two     5  0   9
    three  12  8  18
             A     B    C
    one    NaN   NaN  NaN
    two    3.0  10.0  3.0
    three  3.0   3.0  3.0
    one      -4
    two       9
    three    15
    dtype: int64
    A    7
    B    8
    C    9
    dtype: int64
    A    three
    B    three
    C    three
    dtype: object
    A    3.000000
    B    6.806859
    C    3.000000
    dtype: float64
    A     9.000000
    B    46.333333
    C     9.000000
    dtype: float64
    A    4.000000
    B    2.666667
    C    6.000000
    dtype: float64
    A    4.0
    B    5.0
    C    6.0
    dtype: float64
             A         B    C
    count  3.0  3.000000  3.0
    mean   4.0  2.666667  6.0
    std    3.0  6.806859  3.0
    min    1.0 -5.000000  3.0
    25%    2.5  0.000000  4.5
    50%    4.0  5.000000  6.0
    75%    5.5  6.500000  7.5
    max    7.0  8.000000  9.0
    7    1
    1    1
    4    1
    Name: A, dtype: int64
           A  B  C
    one    1 -5  3
    two    4  5  6
    three  7  8  9
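    The numbers printed by cov(), std(), and corr() above are tied together by the definition of the Pearson coefficient: the covariance divided by the product of the two standard deviations. The following worked check uses columns A and B of the same table; the variable names are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, -5, 3], [4, 5, 6], [7, 8, 9]],
                  index=['one', 'two', 'three'], columns=['A', 'B', 'C'])

cov_ab = df['A'].cov(df['B'])                       # sample covariance of A and B
manual = cov_ab / (df['A'].std() * df['B'].std())   # covariance / (std_A * std_B)
builtin = df['A'].corr(df['B'])                     # Pearson correlation computed by pandas

print(cov_ab, manual, builtin)
```

    With cov(A, B) = 19.5, std(A) = 3 and std(B) ≈ 6.806859, the quotient is about 0.954919, the value shown in the correlation matrix.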

    (5) Reading and saving

    import pandas as pd
    # Read with pd.read_<format> and save with df.to_<format>
    # Formats include csv/excel/hdf/sql/json/html/stata/pickle; sas can only be read, and records/markdown/dict/latex are export-only (to_)
    # Fixing garbled characters (encoding problems): https://blog.csdn.net/leonzhouwei/article/details/8447643
    df = pd.DataFrame([[1, -5, 3], [4, 5, 6], [7, 8, 9]],
                      index=['one', 'two', 'three'], columns=['A', 'B', 'C'])
    df.to_csv('foo.csv', index=False)  # index=False leaves the row labels out of the file
    data = pd.read_csv('foo.csv')      # so the table read back gets a fresh 0..n-1 index
    print(data)
    ----------------------------------
       A  B  C
    0  1 -5  3
    1  4  5  6
    2  7  8  9
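    Because the example above writes the file with index=False, the row labels 'one', 'two', 'three' are lost on the way back. The sketch below shows one way to keep them, assuming the labels should survive the round trip: write the index and read it back with index_col=0. An in-memory io.StringIO buffer stands in for the file so that nothing is written to disk.

```python
import io

import pandas as pd

df = pd.DataFrame([[1, -5, 3], [4, 5, 6], [7, 8, 9]],
                  index=['one', 'two', 'three'], columns=['A', 'B', 'C'])

buffer = io.StringIO()
df.to_csv(buffer)      # the index is written as the first column (index=True is the default)
buffer.seek(0)         # rewind before reading

restored = pd.read_csv(buffer, index_col=0)   # use the first column as the row labels
print(restored)
```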

    (6) Visualization

    # Plotting is built on matplotlib
    import pandas as pd
    import matplotlib.pyplot as plt
    df = pd.DataFrame({
        'sales': [3, 2, 3, 9, 10, 6],
        'signups': [5, 5, 6, 12, 14, 13],
        'visits': [20, 42, 28, 62, 81, 50],
    }, index=pd.date_range(start='2018/01/01', end='2018/07/01',
                           freq='M'))
    df.plot()         # line plot (the default kind)
    df.plot.area()    # area plot
    df.plot.bar()     # vertical bar chart
    df.plot.barh()    # horizontal bar chart
    df.plot.hist()    # histogram of the columns
    df.plot.line()    # line plot, same as df.plot()
    df.plot.scatter(x='sales', y='signups')  # scatter plot
    plt.show()        # display the figures

    (7) Miscellaneous

    import numpy as np
    import pandas as pd
    df = pd.DataFrame(
        {'A': [2, 3, 2],
         'B': [7, 8, 6],
         'C': [7, 11, 9]}
    )
    print(df)

    print(df.insert(loc=1, column='D', value=[3, 7, 1]))  # insert a column at the given position; changes df in place and returns None
    print(df.astype({'A': 'float16'}))                # convert data types (returns a copy)
    print(df.copy())                                  # copy the table
    print(df.applymap(lambda x: x + 1))               # apply a function to every element
    df2 = df.groupby(by=['A'])                        # group the rows by the distinct values of a column
    print(df2.get_group(2))                           # the DataFrame holding one group
    print(df.aggregate(np.median))                    # aggregate with one or more operations over the given axis

    print('#'*30)
    print(df)

    print(df.sort_values(by='C', ascending=False))  # sort the whole table by the values of one column; ascending=False gives descending order
    # to_dense converts sparse data to dense; it has been deprecated since pandas 0.25 and may be unavailable in your version
    print(df.to_dense())
    -------------------------------------------------
       A  B   C
    0  2  7   7
    1  3  8  11
    2  2  6   9
    None
         A  D  B   C
    0  2.0  3  7   7
    1  3.0  7  8  11
    2  2.0  1  6   9
       A  D  B   C
    0  2  3  7   7
    1  3  7  8  11
    2  2  1  6   9
       A  D  B   C
    0  3  4  8   8
    1  4  8  9  12
    2  3  2  7  10
       A  D  B  C
    0  2  3  7  7
    2  2  1  6  9
    A    2.0
    D    3.0
    B    7.0
    C    9.0
    dtype: float64
    ##############################
       A  D  B   C
    0  2  3  7   7
    1  3  7  8  11
    2  2  1  6   9
       A  D  B   C
    1  3  7  8  11
    2  2  1  6   9
    0  2  3  7   7
       A  D  B   C
    0  2  3  7   7
    1  3  7  8  11
    2  2  1  6   9
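    groupby is usually followed by an aggregation rather than by get_group alone. The following is a minimal sketch on the same three-row table; the choice of 'sum' for B and 'mean' for C is arbitrary and made here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 3, 2],
                   'B': [7, 8, 6],
                   'C': [7, 11, 9]})

# One row per distinct value of A; each column gets its own aggregation
summary = df.groupby('A').agg({'B': 'sum', 'C': 'mean'})
print(summary)
```

    The group A == 2 contains rows 0 and 2, so its B total is 7 + 6 = 13 and its C mean is (7 + 9) / 2 = 8.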

    5. Operations on two tables

    (1) Combining

    import numpy as np
    import pandas as pd
    df1 = pd.DataFrame([[0.0, np.nan, 3], [0.0, 3, 1.0], [1.0, True, False]],
                       index=['one', 'two', 'three'], columns=['A', 'B', 'C'])
    df2 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                       index=['two', 'three', 'four'], columns=['B', 'C', 'D'])

    # (1) concat: axis=0 stacks the tables vertically, axis=1 places them side by side;
    # ignore_index=True discards the old labels along the concatenation axis and renumbers them;
    # join='outer' keeps the union of the labels on the other axis, 'inner' keeps the intersection;
    # sort controls whether the other axis is sorted when its labels differ (strings sort alphabetically)
    print(pd.concat([df1, df2], axis=1, ignore_index=False, sort=True, join='outer'))
    print(pd.concat([df1, df2], axis=0, ignore_index=False, sort=False, join='inner'))

    # (2) merge combines tables on key columns
    # how: one of 'left', 'right', 'outer', 'inner'; keep the keys of the left table, of the right table, their union, or their intersection
    # left_index and right_index: whether to use the left or right index as the join key (True or False)
    # suffixes: markers added to columns that share a label but hold different data, to tell them apart
    print(df1)
    print(df2)
    print(pd.merge(df1, df2, on=['B'], suffixes=['_left', '_right'], how='outer'))  # merge on the shared column 'B'
    ----------------------------------------------------------
             A     B      C    B    C    D
    four   NaN   NaN    NaN  7.0  8.0  9.0
    one    0.0   NaN      3  NaN  NaN  NaN
    three  1.0  True  False  4.0  5.0  6.0
    two    0.0     3      1  1.0  2.0  3.0
              B      C
    one     NaN      3
    two       3      1
    three  True  False
    two       1      2
    three     4      5
    four      7      8
             A     B      C
    one    0.0   NaN      3
    two    0.0     3      1
    three  1.0  True  False
           B  C  D
    two    1  2  3
    three  4  5  6
    four   7  8  9
         A     B C_left  C_right    D
    0  0.0   NaN      3      NaN  NaN
    1  0.0     3      1      NaN  NaN
    2  1.0  True  False      2.0  3.0
    3  NaN     4    NaN      5.0  6.0
    4  NaN     7    NaN      8.0  9.0
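    The left_index and right_index parameters are described above but not exercised. The sketch below merges on the row labels instead of on a column. The two small tables are simplified stand-ins made up for this illustration, and indicator=True is an extra pd.merge option that adds a _merge column recording where each row came from.

```python
import pandas as pd

# Simplified stand-in tables (illustrative values)
left = pd.DataFrame({'A': [0, 0, 1]}, index=['one', 'two', 'three'])
right = pd.DataFrame({'D': [3, 6, 9]}, index=['two', 'three', 'four'])

# Join on the row labels of both tables; 'outer' keeps the union of the labels
merged = pd.merge(left, right, left_index=True, right_index=True,
                  how='outer', indicator=True)

# 'inner' keeps only the labels present in both tables
common = pd.merge(left, right, left_index=True, right_index=True, how='inner')

print(merged)
print(common)
```

    Rows found in only one table are marked left_only or right_only in _merge, and their missing cells are NaN.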

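    (2) Arithmetic: works much as it does in NumPy

    IV. General notes

    1. Any operation on columns has an equivalent operation on rows, and vice versa, by way of transposition.

    2. In pandas, index holds the row labels and corresponds to axis=0; columns holds the column labels and corresponds to axis=1. A reduction with axis=0 runs down the rows and returns one result per column; with axis=1 it runs across the columns and returns one result per row.

    3. When a DataFrame is created from a dict, the printed table shows the index labels down the left side (top to bottom), the column labels across the top (left to right), and the dict data in the middle, each value located by its index label and column label.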
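    Notes 1 and 2 can be checked on a small table. The following sketch, added here for illustration, sums along each axis and then shows that summing the rows equals summing the columns of the transposed table.

```python
import pandas as pd

df = pd.DataFrame([[1, -5, 3], [4, 5, 6], [7, 8, 9]],
                  index=['one', 'two', 'three'], columns=['A', 'B', 'C'])

col_totals = df.sum(axis=0)      # runs down the rows: one total per column
row_totals = df.sum(axis=1)      # runs across the columns: one total per row

# The row totals are the column totals of the transposed table
via_transpose = df.T.sum(axis=0)

print(col_totals)
print(row_totals)
```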

  • Original article: https://www.cnblogs.com/yu-liang/p/12982819.html