• Pandas Basic Commands Cheat Sheet


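    This article is translated and adapted from Pandas Cheat Sheet - Python for Data Science. Because K-Lab is an interactive tool, each command in the cheat sheet is also run here with concrete inputs.

    Contents

    Click the Fork button in the upper-right corner to try the code yourself. Click a title below to jump to that section.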

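    • [Abbreviations & library imports]
    • [Importing data]
    • [Exporting data]
    • [Creating test objects]
    • [Viewing and inspecting data]
    • [Selecting data]
    • [Cleaning data]
    • [Filtering (filter), sorting (sort), and grouping (groupby)]
    • [Joining (join) and combining (combine) data]
    • [Statistics]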
     
    Abbreviations & library imports
     

    df --- any pandas DataFrame object
    s --- any pandas Series (one-dimensional labeled array) object
    pandas and numpy are the most basic and most central libraries for data analysis in Python.

    In [2]:
    import pandas as pd  # import pandas under the alias pd
    import numpy as np  # import numpy under the alias np
    
    
     
    Importing data
     
    pd.read_csv(filename) # read data from a CSV file
    pd.read_table(filename) # read data from a delimited text file (such as TSV)
    pd.read_excel(filename) # read data from an Excel file
    pd.read_sql(query, connection_object) # read data from a SQL table or database
    pd.read_json(json_string) # read data from a JSON string, URL, or file
    pd.read_html(url) # parse the HTML tables found at a URL into a list of DataFrames
    pd.read_clipboard() # read data from the system clipboard
    pd.DataFrame(dict) # build a DataFrame from a Python dict: keys become the column names, values become the column contents
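
    The names filename, json_string, url, and dict in the list above are placeholders, so the lines cannot be run as written. A minimal sketch of two of the readers with concrete inputs follows. The CSV text and the column names are made up for illustration, and io.StringIO stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content, held in memory instead of a file on disk.
csv_text = "name,score\nalice,90\nbob,85\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# Build a DataFrame from a dict: keys become column names,
# values become the column contents.
data = {"name": ["alice", "bob"], "score": [90, 85]}
df_dict = pd.DataFrame(data)

print(df_csv)
print(df_dict)
```

    Both calls should produce a two-row table with the columns name and score.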
    
     
    Exporting data
     
    df.to_csv(filename) # write the DataFrame to a CSV file
    df.to_excel(filename) # write the DataFrame to an Excel file
    df.to_sql(table_name, connection_object) # write the DataFrame to a SQL table
    df.to_json(filename) # write the DataFrame to a JSON file
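
    The export calls above need an existing df and a target path. As a small sketch that avoids touching the file system, to_csv and to_json can be called without a path, in which case they return the serialized text. The sample data is invented for illustration.

```python
import io

import pandas as pd

df = pd.DataFrame({"name": ["alice", "bob"], "score": [90, 85]})

# With no path argument, to_csv and to_json return the text
# instead of writing a file.
csv_text = df.to_csv(index=False)
json_text = df.to_json(orient="records")

# Round trip: read the CSV text back into a DataFrame.
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(json_text)
```

    Passing index=False leaves the row index out of the CSV, which is usually what is wanted when the index is just the default 0, 1, 2, ...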
    
     
    Creating test objects
     
    pd.DataFrame(np.random.rand(10,5)) # create a DataFrame of random floats with 10 rows and 5 columns
    
    In [6]:
    pd.DataFrame(np.random.rand(10,5))
    
    Out[6]:
     0 1 2 3 4
    0 0.178801 0.846355 0.705159 0.196188 0.874350
    1 0.362044 0.390863 0.760347 0.555912 0.689457
    2 0.201675 0.673297 0.180532 0.648759 0.483332
    3 0.645076 0.932788 0.182940 0.722370 0.542127
    4 0.578884 0.839314 0.734570 0.691949 0.538795
    5 0.999395 0.383014 0.192030 0.315428 0.940216
    6 0.980939 0.475735 0.674909 0.112695 0.961567
    7 0.389256 0.855763 0.026823 0.876811 0.274633
    8 0.108523 0.267471 0.988235 0.991163 0.271738
    9 0.403084 0.935190 0.628058 0.296839 0.386862
    In [2]:
    pd.DataFrame(np.random.rand(10,5))
    
    Out[2]:
     0 1 2 3 4
    0 0.647736 0.372628 0.255864 0.853542 0.613267
    1 0.064364 0.156340 0.575021 0.561911 0.479901
    2 0.036473 0.876819 0.255325 0.393240 0.543039
    3 0.357489 0.006578 0.093966 0.531294 0.029009
    4 0.550582 0.504600 0.273546 0.011693 0.052523
    5 0.721563 0.170689 0.702163 0.447883 0.905983
    6 0.839726 0.935997 0.343133 0.356957 0.377116
    7 0.931894 0.026684 0.719148 0.911425 0.676187
    8 0.115619 0.114894 0.130696 0.321598 0.170082
    9 0.194649 0.526141 0.965442 0.275433 0.880765
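
    The two cells above print different numbers because np.random.rand draws fresh values on every call. If a repeatable test object is needed, one option (a sketch, not part of the original cheat sheet) is a seeded NumPy generator. The seed value 0 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# A seeded generator yields the same sequence on every run.
rng = np.random.default_rng(seed=0)  # seed 0 is an arbitrary choice
df_a = pd.DataFrame(rng.random((10, 5)))

# Re-creating the generator with the same seed reproduces the data.
rng = np.random.default_rng(seed=0)
df_b = pd.DataFrame(rng.random((10, 5)))

print(df_a.equals(df_b))
```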
     
    pd.Series(my_list) # create a Series from an iterable object my_list
    
    In [7]:
    my_list = ['huang', 100, 'xiaolei',4,56]
    pd.Series(my_list)
    
    Out[7]:
    0      huang
    1        100
    2    xiaolei
    3          4
    4         56
    dtype: object
    In [3]:
    my_list = ['Kesci',100,'Welcome to Kesci']
    pd.Series(my_list)
    
    Out[3]:
    0               Kesci
    1                 100
    2    Welcome to Kesci
    dtype: object
     
    df.index = pd.date_range('2017/1/1', periods=df.shape[0]) # add a date index with one date per row
    
    In [4]:
    df = pd.DataFrame(np.random.rand(10,5))
    df.index = pd.date_range('2017/1/1', periods=df.shape[0])
    df
    
    Out[4]:
     0 1 2 3 4
    2017-01-01 0.248515 0.647889 0.111346 0.540434 0.159914
    2017-01-02 0.445073 0.329843 0.823678 0.737438 0.707598
    2017-01-03 0.526543 0.876826 0.717986 0.271920 0.719657
    2017-01-04 0.471256 0.657647 0.973484 0.598997 0.249301
    2017-01-05 0.958465 0.474331 0.004078 0.842343 0.819295
    2017-01-06 0.271308 0.271988 0.434776 0.449652 0.369188
    2017-01-07 0.989573 0.928428 0.452436 0.058590 0.732283
    2017-01-08 0.435328 0.730214 0.909400 0.683413 0.186820
    2017-01-09 0.897414 0.687525 0.122937 0.018102 0.440427
    2017-01-10 0.743821 0.134602 0.210326 0.877157 0.815462
     
    Viewing and inspecting data
     
    df.head(n)  # view the first n rows of the DataFrame
    
    In [9]:
    df = pd.DataFrame(np.random.rand(10, 5))
    df.head(5)
    
    Out[9]:
     0 1 2 3 4
    0 0.857171 0.900692 0.500228 0.636632 0.395819
    1 0.332900 0.856592 0.645121 0.311064 0.836480
    2 0.815698 0.667021 0.328536 0.924848 0.400043
    3 0.693114 0.551914 0.696962 0.703079 0.645103
    4 0.842381 0.466469 0.279249 0.740606 0.941279
    In [5]:
    df = pd.DataFrame(np.random.rand(10,5))
    df.head(3)
    
    Out[5]:
     0 1 2 3 4
    0 0.705884 0.845813 0.770585 0.481049 0.381055
    1 0.733309 0.542363 0.264334 0.254283 0.859442
    2 0.497977 0.474898 0.806073 0.384412 0.242989
     
    df.tail(n) # view the last n rows of the DataFrame
    
    In [10]:
    df = pd.DataFrame(np.random.rand(15,8))
    df.tail(4)
    
    Out[10]:
     0 1 2 3 4 5 6 7
    11 0.785491 0.243000 0.991953 0.367337 0.512946 0.740280 0.897460 0.799860
    12 0.602312 0.440157 0.985066 0.992641 0.550723 0.387046 0.047515 0.566604
    13 0.726211 0.132540 0.302954 0.542220 0.029554 0.963806 0.436351 0.462788
    14 0.516992 0.624268 0.423005 0.476461 0.627335 0.635427 0.173666 0.034728
    In [6]:
    df = pd.DataFrame(np.random.rand(10,5))
    df.tail(3)
    
    Out[6]:
     0 1 2 3 4
    7 0.617289 0.009801 0.220155 0.992743 0.944472
    8 0.261141 0.940925 0.063394 0.052104 0.517853
    9 0.634541 0.897483 0.748453 0.805861 0.344938
     
    df.shape # view the number of rows and columns of the DataFrame
    
    In [11]:
    df = pd.DataFrame(np.random.rand(14, 5))
    df.shape
    
    Out[11]:
    (14, 5)
    In [7]:
    df = pd.DataFrame(np.random.rand(10,5))
    df.shape
    
    Out[7]:
    (10, 5)
     
    df.info() # view the index, data types, and memory usage of the DataFrame
    
    In [13]:
    df = pd.DataFrame(np.random.rand(10, 4))
    df.info()
    
     
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 10 entries, 0 to 9
    Data columns (total 4 columns):
    0    10 non-null float64
    1    10 non-null float64
    2    10 non-null float64
    3    10 non-null float64
    dtypes: float64(4)
    memory usage: 400.0 bytes
    
    In [8]:
    df = pd.DataFrame(np.random.rand(10,5))
    df.info()
    
     
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 10 entries, 0 to 9
    Data columns (total 5 columns):
    0    10 non-null float64
    1    10 non-null float64
    2    10 non-null float64
    3    10 non-null float64
    4    10 non-null float64
    dtypes: float64(5)
    memory usage: 480.0 bytes
    
     
    df.describe() # show descriptive statistics for the numeric columns
    
    In [14]:
    df.describe()
    
    Out[14]:
     0 1 2 3
    count 10.000000 10.000000 10.000000 10.000000
    mean 0.459510 0.467315 0.616311 0.546682
    std 0.401191 0.319752 0.304275 0.205285
    min 0.017633 0.150638 0.068416 0.160698
    25% 0.108201 0.183076 0.535336 0.419520
    50% 0.409686 0.381424 0.729697 0.610982
    75% 0.846220 0.751856 0.831845 0.688182
    max 0.970186 0.959066 0.905394 0.779920
    In [9]:
    df.describe()
    
    Out[9]:
     0 1 2 3 4
    count 10.000000 10.000000 10.000000 10.000000 10.000000
    mean 0.410631 0.497585 0.506200 0.322960 0.603119
    std 0.280330 0.322573 0.254780 0.260299 0.256370
    min 0.043731 0.031742 0.070668 0.044822 0.143786
    25% 0.240661 0.211625 0.416827 0.145298 0.422969
    50% 0.346297 0.544697 0.479648 0.217359 0.635974
    75% 0.493105 0.669044 0.557353 0.468119 0.782573
    max 0.937583 0.945573 0.987328 0.883157 0.992891
     
    s.value_counts(dropna=False) # count the occurrences of each unique value (dropna=False also counts NaN)
    
    In [16]:
    s = pd.Series([1,2,5,6,6,6,6,5,5,'huang'])
    s.value_counts(dropna=False)
    
    Out[16]:
    6        4
    5        3
    huang    1
    2        1
    1        1
    dtype: int64
    In [10]:
    s = pd.Series([1,2,3,3,4,np.nan,5,5,5,6,7])
    s.value_counts(dropna=False)
    
    Out[10]:
     5.0    3
     3.0    2
     7.0    1
     6.0    1
    NaN     1
     4.0    1
     2.0    1
     1.0    1
    dtype: int64
     
    df.apply(pd.Series.value_counts) # count the occurrences of each unique value in every column of the DataFrame
    
    In [19]:
    df = pd.DataFrame(np.random.rand(10, 5), columns=list('abcde'))
    print(df)
    df.apply(pd.Series.value_counts)
    
     
              a         b         c         d         e
    0  0.743688  0.081938  0.693243  0.647515  0.835997
    1  0.162604  0.421371  0.422371  0.930136  0.732234
    2  0.842065  0.139927  0.675018  0.543914  0.017094
    3  0.535794  0.078217  0.964779  0.607462  0.432429
    4  0.560279  0.544811  0.304371  0.797165  0.505008
    5  0.695691  0.696121  0.741812  0.502741  0.484697
    6  0.775342  0.410536  0.275251  0.810911  0.081818
    7  0.584267  0.917728  0.379231  0.097702  0.622885
    8  0.754810  0.809628  0.102337  0.283509  0.615719
    9  0.003056  0.536268  0.187236  0.181844  0.255499
    
    Out[19]:
     a b c d e
    0.003056 1.0 NaN NaN NaN NaN
    0.017094 NaN NaN NaN NaN 1.0
    0.078217 NaN 1.0 NaN NaN NaN
    0.081818 NaN NaN NaN NaN 1.0
    0.081938 NaN 1.0 NaN NaN NaN
    0.097702 NaN NaN NaN 1.0 NaN
    0.102337 NaN NaN 1.0 NaN NaN
    0.139927 NaN 1.0 NaN NaN NaN
    0.162604 1.0 NaN NaN NaN NaN
    0.181844 NaN NaN NaN 1.0 NaN
    0.187236 NaN NaN 1.0 NaN NaN
    0.255499 NaN NaN NaN NaN 1.0
    0.275251 NaN NaN 1.0 NaN NaN
    0.283509 NaN NaN NaN 1.0 NaN
    0.304371 NaN NaN 1.0 NaN NaN
    0.379231 NaN NaN 1.0 NaN NaN
    0.410536 NaN 1.0 NaN NaN NaN
    0.421371 NaN 1.0 NaN NaN NaN
    0.422371 NaN NaN 1.0 NaN NaN
    0.432429 NaN NaN NaN NaN 1.0
    0.484697 NaN NaN NaN NaN 1.0
    0.502741 NaN NaN NaN 1.0 NaN
    0.505008 NaN NaN NaN NaN 1.0
    0.535794 1.0 NaN NaN NaN NaN
    0.536268 NaN 1.0 NaN NaN NaN
    0.543914 NaN NaN NaN 1.0 NaN
    0.544811 NaN 1.0 NaN NaN NaN
    0.560279 1.0 NaN NaN NaN NaN
    0.584267 1.0 NaN NaN NaN NaN
    0.607462 NaN NaN NaN 1.0 NaN
    0.615719 NaN NaN NaN NaN 1.0
    0.622885 NaN NaN NaN NaN 1.0
    0.647515 NaN NaN NaN 1.0 NaN
    0.675018 NaN NaN 1.0 NaN NaN
    0.693243 NaN NaN 1.0 NaN NaN
    0.695691 1.0 NaN NaN NaN NaN
    0.696121 NaN 1.0 NaN NaN NaN
    0.732234 NaN NaN NaN NaN 1.0
    0.741812 NaN NaN 1.0 NaN NaN
    0.743688 1.0 NaN NaN NaN NaN
    0.754810 1.0 NaN NaN NaN NaN
    0.775342 1.0 NaN NaN NaN NaN
    0.797165 NaN NaN NaN 1.0 NaN
    0.809628 NaN 1.0 NaN NaN NaN
    0.810911 NaN NaN NaN 1.0 NaN
    0.835997 NaN NaN NaN NaN 1.0
    0.842065 1.0 NaN NaN NaN NaN
    0.917728 NaN 1.0 NaN NaN NaN
    0.930136 NaN NaN NaN 1.0 NaN
    0.964779 NaN NaN 1.0 NaN NaN
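
    With random floats every value is unique, so the table above is mostly NaN and each count is 1. The call is easier to read on columns with repeated values. The following sketch uses a small invented frame with two categorical columns x and y.

```python
import pandas as pd

df_cat = pd.DataFrame({
    "x": ["a", "b", "a", "a"],
    "y": ["b", "b", "c", "a"],
})

# One value_counts per column, aligned on the union of the values.
# A value that never occurs in a column shows up as NaN.
counts = df_cat.apply(pd.Series.value_counts)

# Optional: treat "never occurs" as a count of zero.
counts_filled = counts.fillna(0).astype(int)

print(counts)
print(counts_filled)
```

    Here "c" never appears in column x, so that cell is NaN in counts and 0 in counts_filled.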
     
    Selecting data
     
    df[col] # return the selected column as a Series
    
    In [23]:
    df = pd.DataFrame(np.random.rand(5, 6), columns=list('abcdef'))
    df['c']
    
    Out[23]:
    0    0.238355
    1    0.641129
    2    0.716013
    3    0.549903
    4    0.997134
    Name: c, dtype: float64
    In [11]:
    df = pd.DataFrame(np.random.rand(5,5),columns=list('ABCDE'))
    df['C']
    
    Out[11]:
    0    0.720965
    1    0.360155
    2    0.474067
    3    0.116206
    4    0.774503
    Name: C, dtype: float64
     
    df[[col1, col2]] # return the selected columns as a new DataFrame
    
    In [25]:
    df = pd.DataFrame(np.random.rand(5, 4), columns=list('abcd'))
    df[['a','d']]
    
    Out[25]:
     a d
    0 0.689811 0.446470
    1 0.022796 0.101198
    2 0.724498 0.555124
    3 0.923610 0.952664
    4 0.990061 0.891120
    In [12]:
    df = pd.DataFrame(np.random.rand(5,5),columns=list('ABCDE'))
    df[['B','E']]
    
    Out[12]:
     B E
    0 0.205912 0.333909
    1 0.475620 0.540206
    2 0.144041 0.065117
    3 0.636970 0.406317
    4 0.451541 0.944245
     
    s.iloc[0] # select by position
    
    In [11]:
    s = pd.Series(np.array(['huang','xiao','lei']))
    print(s)
    s.iloc[1]
    
     
    0    huang
    1     xiao
    2      lei
    dtype: object
    
    Out[11]:
    'xiao'
    In [13]:
    s = pd.Series(np.array(['I','Love','Data']))
    s.iloc[0]
    
    Out[13]:
    'I'
     
    s.loc['index_one'] # select by index label
    
    In [10]:
    s = pd.Series(np.array(['df','s','df']))
    print(s)
    s.loc[1]
    
     
    0    df
    1     s
    2    df
    dtype: object
    
    Out[10]:
    's'
    In [14]:
    s = pd.Series(np.array(['I','Love','Data']))
    s.loc[1]
    
    Out[14]:
    'Love'
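
    In the cells above the Series has the default index 0, 1, 2, so iloc and loc look interchangeable. The difference shows once the index holds other labels. A minimal sketch, assuming a made-up string index x, y, z:

```python
import pandas as pd

s = pd.Series(["I", "Love", "Data"], index=["x", "y", "z"])

first_by_position = s.iloc[0]  # position 0, whatever its label is
by_label = s.loc["y"]          # the element whose label is "y"

# Slices differ too: label slices include the end point,
# position slices exclude it.
label_slice = s.loc["x":"y"]
position_slice = s.iloc[0:1]

print(first_by_position, by_label)
print(label_slice.tolist(), position_slice.tolist())
```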
     
    df.iloc[0,:] # select the first row
    
    In [24]:
    df = pd.DataFrame(np.random.rand(5, 5),columns= list('abcde'))
    print(df)
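    # df.iloc[1, :]  # would select the second row by position
    df.loc[1:3]  # label-based slice: both end points (rows 1 and 3) are included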
    
     
              a         b         c         d         e
    0  0.293829  0.636855  0.383047  0.182288  0.991080
    1  0.098706  0.984684  0.362848  0.865179  0.191418
    2  0.238197  0.027557  0.847372  0.478444  0.286712
    3  0.816694  0.886405  0.637459  0.917760  0.218578
    4  0.962678  0.322024  0.489059  0.675897  0.024523
    
    Out[24]:
     a b c d e
    1 0.098706 0.984684 0.362848 0.865179 0.191418
    2 0.238197 0.027557 0.847372 0.478444 0.286712
    3 0.816694 0.886405 0.637459 0.917760 0.218578
    In [15]:
    df = pd.DataFrame(np.random.rand(5,5),columns=list('ABCDE'))
    df.iloc[0,:]
    
    Out[15]:
    A    0.234156
    B    0.513754
    C    0.593067
    D    0.856575
    E    0.291528
    Name: 0, dtype: float64
     
    df.iloc[0,0] # select the first element of the first row
    
    In [26]:
    df = pd.DataFrame(np.random.rand(10, 5), columns=list('asdfg'))
    print(df)
    df.iloc[1,3]
    
     
              a         s         d         f         g
    0  0.819962  0.011747  0.969565  0.467551  0.281303
    1  0.741277  0.645715  0.113062  0.495135  0.169768
    2  0.862192  0.433940  0.726602  0.692266  0.796443
    3  0.701999  0.222973  0.553875  0.253598  0.090833
    4  0.354669  0.779308  0.282878  0.729156  0.972402
    5  0.310698  0.253160  0.435239  0.465066  0.393626
    6  0.449286  0.079748  0.778311  0.651505  0.659701
    7  0.621606  0.883868  0.059535  0.015870  0.056286
    8  0.762552  0.159625  0.716243  0.179370  0.161484
    9  0.695830  0.388746  0.759827  0.325159  0.379626
    
    Out[26]:
    0.49513455869985046
    In [16]:
    df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
    df.iloc[0,0]
    
    Out[16]:
    0.91525996455410763
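
    The same position-versus-label distinction applies to a DataFrame, where the first selector addresses rows and the second addresses columns. The sketch below uses an invented frame with row labels r1 to r3 so the two styles can be compared on the same cell.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3], "B": [4, 5, 6]},
    index=["r1", "r2", "r3"],
)

first_row = df.iloc[0, :]      # first row, returned as a Series
cell = df.iloc[0, 0]           # first row, first column, by position
same_cell = df.loc["r1", "A"]  # the same cell, addressed by labels
block = df.iloc[0:2, 0:1]      # first two rows of the first column

print(first_row)
print(cell, same_cell)
print(block)
```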
     
    Cleaning data
     
    df.columns = ['a','b'] # rename the columns of the DataFrame (one name per existing column)
    
    In [36]:
    df = pd.DataFrame({'a':np.array([1,2,5,8,4,3]), 'b':np.array([9,3,7,5,3,4]), 'c':'htl'})
    df.columns = ['q','e','r']
    df
    
    Out[36]:
     q e r
    0 1 9 htl
    1 2 3 htl
    2 5 7 htl
    3 8 5 htl
    4 4 3 htl
    5 3 4 htl
    In [30]:
    df = pd.DataFrame({'A':np.array([1,np.nan,2,3,6,np.nan]),
                     'B':np.array([np.nan,4,np.nan,5,9,np.nan]),
                      'C':'foo'})
    df.columns = ['a','b','c']
    df
    
    Out[30]:
     a b c
    0 1.0 NaN foo
    1 NaN 4.0 foo
    2 2.0 NaN foo
    3 3.0 5.0 foo
    4 6.0 9.0 foo
    5 NaN NaN foo
     
    pd.isnull() # check for null values; returns booleans (True, False) in the same shape as the input
    
    In [37]:
    df = pd.DataFrame({'a':np.array([1,np.nan,2,3,6,np.nan]),
                      'b':np.array([np.nan,4,np.nan,5,9,np.nan]),
                       'c':'sdf'})
    pd.isnull(df)
    
    Out[37]:
     a b c
    0 False True False
    1 True False False
    2 False True False
    3 False False False
    4 False False False
    5 True True False
    In [18]:
    df = pd.DataFrame({'A':np.array([1,np.nan,2,3,6,np.nan]),
                     'B':np.array([np.nan,4,np.nan,5,9,np.nan]),
                      'C':'foo'})
    pd.isnull(df)
    
    Out[18]:
     A B C
    0 False True False
    1 True False False
    2 False True False
    3 False False False
    4 False False False
    5 True True False
     
    pd.notnull() # check for non-null values; returns booleans (True, False) in the same shape as the input
    
    In [39]:
    df = pd.DataFrame({
                    'a':np.array([1,np.nan,2,3,4,np.nan]),
                    'b':np.array([np.nan,4,np.nan,5,9,np.nan]),
                    'c':'foo'
                        })
    pd.notnull(df)
    
    Out[39]:
     a b c
    0 True False True
    1 False True True
    2 True False True
    3 True True True
    4 True True True
    5 False False True
    In [40]:
    df = pd.DataFrame({'A':np.array([1,np.nan,2,3,6,np.nan]),
                     'B':np.array([np.nan,4,np.nan,5,9,np.nan]),
                      'C':'foo'})
    pd.notnull(df)
    
    Out[40]:
     A B C
    0 True False True
    1 False True True
    2 True False True
    3 True True True
    4 True True True
    5 False False True
     
    df.dropna() # drop the rows of the DataFrame that contain null values
    
    In [20]:
    df = pd.DataFrame({'A':np.array([1,np.nan,2,3,6,np.nan]),
                     'B':np.array([np.nan,4,np.nan,5,9,np.nan]),
                      'C':'foo'})
    df.dropna()
    
    Out[20]:
     A B C
    3 3.0 5.0 foo
    4 6.0 9.0 foo
     
    df.dropna(axis=1) # drop the columns of the DataFrame that contain null values
    
    In [45]:
    df = pd.DataFrame({
            'a':np.array([1,np.nan,2,3,4,np.nan]),
            'b':np.array([np.nan,4,np.nan,5,9,np.nan]),
            'c':'foo'
                        })
    print(df)
    df.dropna(axis=1)
    
     
         a    b    c
    0  1.0  NaN  foo
    1  NaN  4.0  foo
    2  2.0  NaN  foo
    3  3.0  5.0  foo
    4  4.0  9.0  foo
    5  NaN  NaN  foo
    
    Out[45]:
     c
    0 foo
    1 foo
    2 foo
    3 foo
    4 foo
    5 foo
    In [21]:
    df = pd.DataFrame({'A':np.array([1,np.nan,2,3,6,np.nan]),
                     'B':np.array([np.nan,4,np.nan,5,9,np.nan]),
                      'C':'foo'})
    df.dropna(axis=1)
    
    Out[21]:
     C
    0 foo
    1 foo
    2 foo
    3 foo
    4 foo
    5 foo
     
    df.dropna(axis=1,thresh=n) # drop the columns of df that have fewer than n non-null values
    
    In [73]:
    df = pd.DataFrame({'A':np.array([1,np.nan,2,3,6,np.nan]),
                     'B':np.array([np.nan,4,np.nan,5,9,np.nan]),
                      'C':'foo'})
    print(df)
    df.dropna(axis=1,thresh=3)
    
     
         A    B    C
    0  1.0  NaN  foo
    1  NaN  4.0  foo
    2  2.0  NaN  foo
    3  3.0  5.0  foo
    4  6.0  9.0  foo
    5  NaN  NaN  foo
    
    Out[73]:
     A B C
    0 1.0 NaN foo
    1 NaN 4.0 foo
    2 2.0 NaN foo
    3 3.0 5.0 foo
    4 6.0 9.0 foo
    5 NaN NaN foo
    In [22]:
    df = pd.DataFrame({'A':np.array([1,np.nan,2,3,6,np.nan]),
                     'B':np.array([np.nan,4,np.nan,5,9,np.nan]),
                      'C':'foo'})
    test = df.dropna(axis=1,thresh=1)
    test
    
    Out[22]:
     A B C
    0 1.0 NaN foo
    1 NaN 4.0 foo
    2 2.0 NaN foo
    3 3.0 5.0 foo
    4 6.0 9.0 foo
    5 NaN NaN foo
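
    Neither cell above actually drops anything, because thresh counts non-null values and every column has at least 3 of them. The sketch below reuses the same data and raises the threshold until columns start to disappear, which makes the rule visible.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, np.nan, 2, 3, 6, np.nan],       # 4 non-null values
    "B": [np.nan, 4, np.nan, 5, 9, np.nan],  # 3 non-null values
    "C": "foo",                              # 6 non-null values
})

# Number of non-null values per column, for reference.
non_null = df.notnull().sum()

# A column is kept only if it has at least `thresh` non-null values.
kept_3 = df.dropna(axis=1, thresh=3)  # keeps A, B, C
kept_4 = df.dropna(axis=1, thresh=4)  # drops B
kept_5 = df.dropna(axis=1, thresh=5)  # drops A and B

print(non_null)
print(list(kept_3.columns), list(kept_4.columns), list(kept_5.columns))
```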
     
    df.fillna(x) # replace all null values in the DataFrame with x
    
    In [76]:
    df = pd.DataFrame({'A':np.array([1,np.nan,2,3,6,np.nan]),
                     'B':np.array([np.nan,4,np.nan,5,9,np.nan]),
                      'C':'foo'})
    print(df)
    df.fillna('huang')
    
     
         A    B    C
    0  1.0  NaN  foo
    1  NaN  4.0  foo
    2  2.0  NaN  foo
    3  3.0  5.0  foo
    4  6.0  9.0  foo
    5  NaN  NaN  foo
    
    Out[76]:
     A B C
    0 1 huang foo
    1 huang 4 foo
    2 2 huang foo
    3 3 5 foo
    4 6 9 foo
    5 huang huang foo
    In [23]:
    df = pd.DataFrame({'A':np.array([1,np.nan,2,3,6,np.nan]),
                     'B':np.array([np.nan,4,np.nan,5,9,np.nan]),
                      'C':'foo'})
    df.fillna('Test')
    
    Out[23]:
     A B C
    0 1 Test foo
    1 Test 4 foo
    2 2 Test foo
    3 3 5 foo
    4 6 9 foo
    5 Test Test foo
     

    s.fillna(s.mean()) # replace all null values with the mean of the Series

    In [82]:
    s = pd.Series([1,3,4,np.nan,7,8,9])
    a = s.fillna(s.mean())
    print(a)
    
     
    0    1.000000
    1    3.000000
    2    4.000000
    3    5.333333
    4    7.000000
    5    8.000000
    6    9.000000
    dtype: float64
    
    In [24]:
    s = pd.Series([1,3,5,np.nan,7,9,9])
    s.fillna(s.mean())
    
    Out[24]:
    0    1.000000
    1    3.000000
    2    5.000000
    3    5.666667
    4    7.000000
    5    9.000000
    6    9.000000
    dtype: float64
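
    The same idea extends from a Series to a whole DataFrame: passing a Series of column means to fillna fills each column with its own mean. This is a sketch on invented data. numeric_only=True is used so that the text column C is skipped when the means are computed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, 4.0, 8.0],
    "C": ["x", "y", "z"],
})

# Mean of each numeric column; NaN values are ignored by default.
col_means = df.mean(numeric_only=True)

# fillna matches the Series entries to columns by name.
filled = df.fillna(col_means)

print(col_means)
print(filled)
```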
     
    s.astype(float) # convert the data type of the Series to float
    
    In [85]:
    s = pd.Series([1,2,4,np.nan,5,6,6])
    a = s.fillna(s.mean())
    a.astype(int)
    
    Out[85]:
    0    1
    1    2
    2    4
    3    4
    4    5
    5    6
    6    6
    dtype: int64
    In [25]:
    s = pd.Series([1,3,5,np.nan,7,9,9])
    s.astype(float)
    
    Out[25]:
    0    1.0
    1    3.0
    2    5.0
    3    NaN
    4    7.0
    5    9.0
    6    9.0
    dtype: float64
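
    The first cell above fills the NaN before calling astype(int) for a reason: a plain integer dtype cannot hold NaN, so the cast fails while missing values remain. As a sketch of one alternative, pandas also offers the nullable integer dtype "Int64" (capital I), which keeps the gap as a missing value.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 7])

# Casting to a plain int fails while NaN is present.
try:
    s.astype(int)
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True

# The nullable integer dtype stores the gap as a missing value.
nullable = s.astype("Int64")

print(cast_failed)
print(nullable)
```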
     
    s.replace(1,'one') # replace every 1 in the Series with 'one'
    
    In [86]:
    s = pd.Series([1,2,4,np.nan,5,6,7])
    s.replace(1,'yi')
    
    Out[86]:
    0     yi
    1      2
    2      4
    3    NaN
    4      5
    5      6
    6      7
    dtype: object
    In [26]:
    s = pd.Series([1,3,5,np.nan,7,9,9])
    s.replace(1,'one')
    
    Out[26]:
    0    one
    1      3
    2      5
    3    NaN
    4      7
    5      9
    6      9
    dtype: object
     
    s.replace([1,3],['one','three']) # replace every 1 in the Series with 'one' and every 3 with 'three'
    
    In [87]:
    s = pd.Series([1,3,4,np.nan,7,3,5])
    s.replace([1,4],['sd', 'dsf'])
    
    Out[87]:
    0     sd
    1      3
    2    dsf
    3    NaN
    4      7
    5      3
    6      5
    dtype: object
    In [27]:
    s = pd.Series([1,3,5,np.nan,7,9,9])
    s.replace([1,3],['one','three'])
    
    Out[27]:
    0      one
    1    three
    2        5
    3      NaN
    4        7
    5        9
    6        9
    dtype: object
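
    When several values are replaced at once, a dict keeps each old value next to its replacement, which can be easier to read than two parallel lists. A short sketch on the same data as the cell above:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 7, 9, 9])

# Two equivalent spellings of the same replacement.
by_lists = s.replace([1, 3], ["one", "three"])
by_dict = s.replace({1: "one", 3: "three"})

print(by_dict)
```

    In both cases the result has dtype object, because the Series now mixes strings and numbers.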
     
    df.rename(columns=lambda x: x + 2) # rename all columns by applying a function to each name
    
    In [20]:
    df = pd.DataFrame(np.random.rand(4, 4))
    
    df.rename(columns=lambda x:x+2 )
    
    Out[20]:
     2 3 4 5
    0 0.081634 0.064494 0.171152 0.568444
    1 0.355771 0.934762 0.634321 0.505097
    2 0.544467 0.824562 0.742992 0.937263
    3 0.524025 0.620101 0.764900 0.211475
    In [28]:
    df = pd.DataFrame(np.random.rand(4,4))
    df.rename(columns=lambda x: x+ 2)
    
    Out[28]:
     2 3 4 5
    0 0.753588 0.137984 0.022013 0.900072
    1 0.947073 0.815182 0.769708 0.729688
    2 0.334815 0.204315 0.707794 0.437704
    3 0.467212 0.738360 0.853463 0.529946
     
    df.rename(columns={'old_name': 'new_name'}) # rename selected columns
    
    In [24]:
    df = pd.DataFrame(np.random.rand(10, 5), columns=list('asdfp'))
    df.rename(columns={'a':'huang', 'd':'xiao'})
    
    Out[24]:
     huang s xiao f p
    0 0.883222 0.073876 0.740827 0.035460 0.929947
    1 0.161005 0.276637 0.095228 0.490336 0.433798
    2 0.245889 0.763647 0.472240 0.718072 0.260942
    3 0.933051 0.400177 0.494481 0.173994 0.800894
    4 0.762221 0.170352 0.507960 0.383658 0.533412
    5 0.665419 0.515597 0.538217 0.305045 0.072796
    6 0.723260 0.661109 0.793995 0.391161 0.724623
    7 0.829130 0.896624 0.732372 0.317762 0.745941
    8 0.302628 0.320006 0.420980 0.400016 0.556747
    9 0.574811 0.952172 0.573045 0.343735 0.930765
    In [29]:
    df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
    df.rename(columns={'A':'newA','C':'newC'})
    
    Out[29]:
     newA B newC D E
    0 0.169072 0.694563 0.069313 0.637560 0.475181
    1 0.910271 0.800067 0.676448 0.934767 0.025608
    2 0.825186 0.451545 0.135421 0.635303 0.419758
    3 0.401979 0.510304 0.014901 0.209211 0.121889
    4 0.579282 0.001947 0.036519 0.750415 0.453078
    5 0.896213 0.557514 0.028147 0.527471 0.575772
    6 0.443222 0.095459 0.319582 0.912069 0.781455
    7 0.067923 0.590470 0.602999 0.507358 0.703022
    8 0.301491 0.682629 0.283103 0.565754 0.089268
    9 0.399671 0.925416 0.020578 0.278000 0.591522
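
    Two details of rename are easy to miss in the cells above: the function form works with any callable, not only a lambda, and rename returns a new DataFrame instead of changing df. A minimal sketch with invented column names:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])

# Function form: the callable is applied to every column name.
lowered = df.rename(columns=str.lower)

# Dict form: only the listed columns change.
partial = df.rename(columns={"A": "newA"})

# The original frame keeps its names unless the result is assigned back.
print(list(lowered.columns), list(partial.columns), list(df.columns))
```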
     
    df.set_index('column_one') # use the column column_one as the index
    
    In [27]:
    df = pd.DataFrame(np.random.rand(10, 5), columns=list('asdfg'))
    print(df)
    df.set_index('a')
    
     
              a         s         d         f         g
    0  0.483397  0.944772  0.678662  0.439009  0.588450
    1  0.984601  0.110966  0.331303  0.578410  0.467633
    2  0.001784  0.431582  0.593597  0.238572  0.429771
    3  0.644358  0.102394  0.935862  0.863739  0.118716
    4  0.514392  0.928633  0.750763  0.026851  0.049935
    5  0.749309  0.961028  0.383087  0.052621  0.598980
    6  0.963810  0.087193  0.569974  0.440941  0.384748
    7  0.000576  0.538573  0.171773  0.802815  0.556191
    8  0.731837  0.934994  0.998125  0.485058  0.745950
    9  0.599032  0.462614  0.234398  0.833158  0.521382
    
    Out[27]:
     s d f g
    a    
    0.483397 0.944772 0.678662 0.439009 0.588450
    0.984601 0.110966 0.331303 0.578410 0.467633
    0.001784 0.431582 0.593597 0.238572 0.429771
    0.644358 0.102394 0.935862 0.863739 0.118716
    0.514392 0.928633 0.750763 0.026851 0.049935
    0.749309 0.961028 0.383087 0.052621 0.598980
    0.963810 0.087193 0.569974 0.440941 0.384748
    0.000576 0.538573 0.171773 0.802815 0.556191
    0.731837 0.934994 0.998125 0.485058 0.745950
    0.599032 0.462614 0.234398 0.833158 0.521382
    In [30]:
    df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
    df.set_index('B')
    
    Out[30]:
     A C D E
    B    
    0.311742 0.972069 0.557977 0.114267 0.795128
    0.931644 0.725425 0.082130 0.993764 0.136923
    0.206382 0.980647 0.947041 0.038841 0.879139
    0.157801 0.402233 0.249151 0.724130 0.108238
    0.314238 0.341221 0.512180 0.218882 0.046379
    0.029040 0.470619 0.666784 0.036655 0.823498
    0.843928 0.779437 0.926912 0.189213 0.624111
    0.282773 0.993681 0.048483 0.135934 0.576662
    0.759600 0.235513 0.359139 0.488255 0.669043
    0.088552 0.893269 0.277296 0.889523 0.398392
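
    An index of random floats, as above, is hard to look anything up in. The sketch below uses an invented key column to show what set_index is typically for (label lookups with loc) and how reset_index turns the index back into an ordinary column.

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["k1", "k2", "k3"],
    "value": [10, 20, 30],
})

# Move the key column into the index, then look a row up by label.
indexed = df.set_index("key")
value_k2 = indexed.loc["k2", "value"]

# reset_index moves the index back into a regular column.
restored = indexed.reset_index()

print(indexed)
print(value_k2)
print(restored)
```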
     
    df.rename(index = lambda x: x+ 1) # relabel the whole index by applying a function to each label
    
    In [29]:
    df = pd.DataFrame(np.random.rand(10, 5))
    df.rename(index = lambda x: x+1)
    
    Out[29]:
     0 1 2 3 4
    1 0.932421 0.478929 0.051820 0.721526 0.016739
    2 0.359403 0.327488 0.503009 0.352523 0.169186
    3 0.894238 0.268052 0.906756 0.726393 0.973686
    4 0.188892 0.056018 0.156585 0.643488 0.321641
    5 0.661594 0.043409 0.392303 0.469758 0.157635
    6 0.582072 0.992046 0.060181 0.202060 0.119541
    7 0.073971 0.157798 0.616039 0.516502 0.472920
    8 0.885208 0.158675 0.211644 0.763249 0.762270
    9 0.907770 0.455217 0.430548 0.473017 0.240695
    10 0.043648 0.259251 0.365041 0.518889 0.765609
    In [31]:
    df = pd.DataFrame(np.random.rand(10,5))
    df.rename(index = lambda x: x+ 1)
    
    Out[31]:
     0 1 2 3 4
    1 0.386542 0.031932 0.963200 0.790339 0.602533
    2 0.053492 0.652174 0.889465 0.465296 0.843528
    3 0.411836 0.460788 0.110352 0.083247 0.389855
    4 0.336156 0.830522 0.560991 0.667896 0.233841
    5 0.307933 0.995207 0.506680 0.957895 0.636461
    6 0.724975 0.842118 0.123139 0.244357 0.803936
    7 0.059176 0.117784 0.330192 0.418764 0.464144
    8 0.104323 0.222367 0.930414 0.659232 0.562155
    9 0.484089 0.024045 0.879834 0.492231 0.949636
    10 0.201583 0.280658 0.356804 0.890706 0.236174
     
    Filtering (filter), sorting (sort), and grouping (groupby)
     
    df[df[col] > 0.5] # select the rows of df whose value in column col is greater than 0.5 (all columns are returned)
    
    In [33]:
    df = pd.DataFrame(np.random.rand(10, 5), columns=list('asdfg'))
    print(df)
    df[df['a']>0.5]
    
     
              a         s         d         f         g
    0  0.191880  0.437651  0.780847  0.836473  0.086490
    1  0.997351  0.671057  0.212071  0.946415  0.768535
    2  0.506504  0.800164  0.968510  0.513060  0.258659
    3  0.791777  0.632927  0.624002  0.799357  0.270455
    4  0.207246  0.152955  0.007859  0.257787  0.208638
    5  0.620649  0.557626  0.393774  0.331476  0.855253
    6  0.220170  0.358326  0.811410  0.667446  0.085703
    7  0.554684  0.994837  0.054684  0.854683  0.749515
    8  0.759856  0.771095  0.571663  0.189677  0.177212
    9  0.887868  0.617078  0.487259  0.462189  0.673066
    
    Out[33]:
     a s d f g
    1 0.997351 0.671057 0.212071 0.946415 0.768535
    2 0.506504 0.800164 0.968510 0.513060 0.258659
    3 0.791777 0.632927 0.624002 0.799357 0.270455
    5 0.620649 0.557626 0.393774 0.331476 0.855253
    7 0.554684 0.994837 0.054684 0.854683 0.749515
    8 0.759856 0.771095 0.571663 0.189677 0.177212
    9 0.887868 0.617078 0.487259 0.462189 0.673066
    In [32]:
    df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
    df[df['A'] > 0.5]
    
    Out[32]:
     A B C D E
    0 0.534886 0.863546 0.236718 0.326766 0.415460
    2 0.953931 0.070198 0.483749 0.922528 0.295505
    8 0.880175 0.056811 0.520499 0.533152 0.548145
     
    df[(df[col] > 0.5) & (df[col] < 0.7)] # select the rows of df whose value in column col is greater than 0.5 and less than 0.7
    
    In [34]:
    df = pd.DataFrame(np.random.rand(10,6),columns= list('qwerty'))
    df[(df['e'] > 0.5) &(df['t'] < 0.7) ]
    
    Out[34]:
     q w e r t y
    2 0.176275 0.358433 0.895002 0.739299 0.050452 0.114546
    3 0.726330 0.591592 0.909450 0.120671 0.677124 0.837148
    4 0.318870 0.805787 0.600435 0.629595 0.045091 0.891886
    5 0.270306 0.143335 0.519607 0.118409 0.079835 0.071877
    In [33]:
    df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
    df[(df['C'] > 0.5) & (df['D'] < 0.7)]
    
    Out[33]:
     A B C D E
    2 0.953112 0.174517 0.645300 0.308216 0.171177
    6 0.853087 0.863079 0.701823 0.354019 0.311754
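
    The expression inside the brackets is a boolean Series (a mask) with one True or False per row, and df[mask] keeps the rows marked True. The sketch below builds the mask as a separate step on small invented data. It also shows DataFrame.query as an alternative spelling of the same filter.

```python
import pandas as pd

df = pd.DataFrame({
    "C": [0.2, 0.6, 0.9, 0.55],
    "D": [0.1, 0.8, 0.3, 0.65],
})

# Each comparison yields a boolean Series. Combine them with & (and),
# | (or), ~ (not), and wrap every comparison in parentheses.
mask = (df["C"] > 0.5) & (df["D"] < 0.7)
filtered = df[mask]

# The same filter written as a query string.
queried = df.query("C > 0.5 and D < 0.7")

# Rows matching at least one of the two conditions.
either = df[(df["C"] > 0.5) | (df["D"] < 0.7)]

print(mask.tolist())
print(filtered)
```

    The parentheses matter because & and | bind more tightly than > and < in Python.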
     
    df.sort_values(col1) # sort df by column col1 in ascending order
    
    In [35]:
    df = pd.DataFrame(np.random.rand(10,6),columns=list('adsfgh'))
    df.sort_values('a')
    
    Out[35]:
     a d s f g h
    8 0.012038 0.240554 0.900154 0.630489 0.971382 0.889947
    3 0.174606 0.704540 0.284934 0.412725 0.261158 0.807697
    9 0.324203 0.834741 0.624353 0.676012 0.580034 0.436738
    1 0.386444 0.256227 0.924961 0.000652 0.589956 0.476489
    5 0.479683 0.080173 0.333917 0.741830 0.219858 0.550681
    6 0.546706 0.358566 0.875383 0.921672 0.004955 0.631361
    4 0.581234 0.001990 0.737987 0.203702 0.231551 0.235576
    7 0.762742 0.800615 0.945827 0.434820 0.755877 0.312649
    2 0.888132 0.019374 0.555217 0.618628 0.396756 0.924784
    0 0.904388 0.758854 0.450406 0.487383 0.666163 0.430539
    In [34]:
    df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
    df.sort_values('E')
    
    Out[34]:
     A B C D E
    3 0.024096 0.623842 0.775949 0.828343 0.317729
    6 0.220055 0.381614 0.463676 0.762644 0.391758
    4 0.589411 0.727439 0.064528 0.319521 0.413518
    1 0.878490 0.229301 0.699506 0.726879 0.464106
    8 0.438101 0.970649 0.050256 0.697440 0.499057
    9 0.566100 0.558798 0.723253 0.254244 0.524486
    7 0.613603 0.933109 0.677036 0.808160 0.544953
    5 0.079326 0.711673 0.266434 0.910628 0.816783
    2 0.132114 0.145395 0.908436 0.521271 0.889645
    0 0.432677 0.216837 0.203532 0.093214 0.977671
     
    df.sort_values(col2,ascending=False) # sort df by column col2 in descending order
    
    In [36]:
    df = pd.DataFrame(np.random.rand(10, 8),columns=list('qwertyui'))
    df.sort_values('e', ascending=False)
    
    Out[36]:
     q w e r t y u i
    8 0.541191 0.443107 0.804432 0.475763 0.332738 0.169072 0.350597 0.234079
    9 0.278131 0.672111 0.766488 0.555026 0.271935 0.453826 0.491817 0.986139
    1 0.758781 0.041056 0.732308 0.974348 0.219851 0.211953 0.524819 0.300156
    2 0.065457 0.556341 0.655507 0.205678 0.606155 0.945356 0.915438 0.642333
    4 0.916662 0.179418 0.620904 0.689385 0.477483 0.262302 0.868513 0.002603
    6 0.934955 0.970812 0.331655 0.507056 0.012076 0.643469 0.579360 0.416791
    3 0.372486 0.775326 0.250734 0.021345 0.267355 0.059874 0.253597 0.244643
    7 0.598279 0.031159 0.205364 0.715331 0.340993 0.918638 0.918882 0.971622
    5 0.062437 0.923440 0.119125 0.755429 0.744593 0.421468 0.366993 0.103529
    0 0.965093 0.630529 0.034310 0.500022 0.736686 0.484777 0.595759 0.281686
    In [35]:
    df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
    df.sort_values('A',ascending=False)
    
    Out[35]:
             A        B        C        D        E
    9 0.977172 0.930607 0.889285 0.475032 0.031715
    0 0.864511 0.229990 0.678612 0.042491 0.148123
    2 0.694747 0.580891 0.817524 0.392417 0.055003
    6 0.684327 0.802028 0.862043 0.241838 0.800401
    7 0.612324 0.099445 0.714120 0.215054 0.280343
    8 0.441434 0.315553 0.564762 0.800143 0.330030
    1 0.438734 0.161109 0.610750 0.647330 0.792404
    4 0.365880 0.710768 0.344320 0.998757 0.979497
    3 0.202511 0.769728 0.575057 0.511384 0.696753
    5 0.029527 0.560114 0.224787 0.086291 0.318322
     
    df.sort_values([col1,col2],ascending=[True,False]) # sort df by col1 ascending, then by col2 descending among rows with equal col1
    
    In [37]:
    df = pd.DataFrame(np.random.rand(5,6),columns=list('qwerty'))
    df.sort_values(['q', 'w'],ascending=[True, False])
    
    Out[37]:
             q        w        e        r        t        y
    3 0.039156 0.902539 0.544040 0.715766 0.476489 0.968014
    4 0.369672 0.760559 0.339207 0.773287 0.112713 0.465799
    2 0.446962 0.675626 0.805690 0.869418 0.553809 0.310547
    0 0.898922 0.210659 0.024452 0.310047 0.492718 0.530260
    1 0.981514 0.476470 0.435834 0.613164 0.071609 0.771960
    In [36]:
    df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
    df.sort_values(['A','E'],ascending=[True,False])
    
    Out[36]:
             A        B        C        D        E
    6 0.075863 0.696980 0.648945 0.336977 0.113122
    2 0.199316 0.632063 0.787358 0.133175 0.060568
    5 0.242081 0.818550 0.618439 0.215761 0.924459
    7 0.261237 0.400725 0.659224 0.555746 0.132572
    0 0.390540 0.358432 0.754028 0.194403 0.889624
    8 0.410481 0.463811 0.343021 0.736340 0.291121
    4 0.578705 0.544711 0.881707 0.396593 0.414465
    3 0.600541 0.459247 0.591303 0.027464 0.496864
    9 0.720029 0.419921 0.740225 0.904391 0.226958
    1 0.777955 0.992290 0.144495 0.600207 0.647018
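
With random floats the first sort key almost never contains ties, so the second key in the examples above has no visible effect. The following is a minimal sketch on a small hand-made frame (the column names `team` and `score` are hypothetical) where ties in the first key make the second key matter.

```python
import pandas as pd

# Hypothetical data: "team" contains repeated values, so the second key decides
# the order of rows within each team.
df = pd.DataFrame({
    "team": ["b", "a", "b", "a"],
    "score": [1, 5, 3, 2],
})

# Ascending by team, then descending by score within each team.
sorted_df = df.sort_values(["team", "score"], ascending=[True, False])
print(sorted_df)

# sort_values returns a new frame; the original row order is left untouched.
print(df.index.tolist())
```

The original index labels travel with their rows, which is why the index of the sorted result is no longer 0, 1, 2, 3.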
     
    df.groupby(col) # group df by the values of one column
    
    In [3]:
    df = pd.DataFrame({
                    'a':np.array(['huang','huang','huang','xiao','xiao','xiao']),
                    'b':np.array(['lei','lei','lei','xiao','xiao','lei']),
                    'c':np.array(['small','medium','large','small','large','medium']),
                    'd':np.array([1,2,3,4,5,6])
                        })
    df.groupby('a').count()
    
    Out[3]:
           b  c  d
    a
    huang  3  3  3
    xiao   3  3  3
    In [38]:
    df = pd.DataFrame({'A':np.array(['foo','foo','foo','foo','bar','bar']),
          'B':np.array(['one','one','two','two','three','three']),
         'C':np.array(['small','medium','large','large','small','small']),
         'D':np.array([1,2,2,3,3,5])})
    print(df)
    
    df.groupby('A').count()
    
     
         A      B       C  D
    0  foo    one   small  1
    1  foo    one  medium  2
    2  foo    two   large  2
    3  foo    two   large  3
    4  bar  three   small  3
    5  bar  three   small  5
    
    Out[38]:
         B  C  D
    A
    bar  2  2  2
    foo  4  4  4
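
`df.groupby(col)` on its own only builds a grouping object; nothing is computed until an aggregation such as `count()` is called. As a small sketch (the frame below is a shortened, hypothetical version of the one above), the groups can also be inspected directly:

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "bar"], "D": [1, 2, 3]})

grouped = df.groupby("A")   # a GroupBy object, no aggregation yet
sizes = grouped.size()      # number of rows in each group

# Iterating yields (group key, sub-frame) pairs, ordered by key.
for name, group in grouped:
    print(name, group["D"].tolist())
```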
     
    df.groupby([col1,col2]) # group df by the combination of columns col1 and col2
    
    In [4]:
    df = pd.DataFrame({
                        'a':np.array(['s','s','s','e','e','e']),
                        'b':np.array(['q','w','e','e','e','w']),
                        'c':np.array(['t','t','t','hu','hi','jk'])
                        })
    print(df)
    df.groupby(['a','b']).count()
    
     
       a  b   c
    0  s  q   t
    1  s  w   t
    2  s  e   t
    3  e  e  hu
    4  e  e  hi
    5  e  w  jk
    
    Out[4]:
         c
    a b
    e e  2
      w  1
    s e  1
      q  1
      w  1
    In [39]:
    df = pd.DataFrame({'A':np.array(['foo','foo','foo','foo','bar','bar']),
          'B':np.array(['one','one','two','two','three','three']),
         'C':np.array(['small','medium','large','large','small','small']),
         'D':np.array([1,2,2,3,3,5])})
    print(df)
    df.groupby(['B','C']).sum()
    
     
         A      B       C  D
    0  foo    one   small  1
    1  foo    one  medium  2
    2  foo    two   large  2
    3  foo    two   large  3
    4  bar  three   small  3
    5  bar  three   small  5
    
    Out[39]:
                  D
    B     C
    one   medium  2
          small   1
    three small   8
    two   large   5
     
    df.groupby(col1)[col2].mean() # group df by col1 and return the mean of col2 for each group
    
    In [10]:
    df = pd.DataFrame({
            'a':np.array(['ho','ho','ho','e','e','e']),
            'b':np.array(['huang','huang','lei','lei','xiao','xiao']),
            'c':np.array([1,2,3,4,5,6])
        })
    df.groupby('a')['c'].mean()
    
    Out[10]:
    a
    e     5
    ho    2
    Name: c, dtype: int64
    In [39]:
    df = pd.DataFrame({'A':np.array(['foo','foo','foo','foo','bar','bar']),
          'B':np.array(['one','one','two','two','three','three']),
         'C':np.array(['small','medium','large','large','small','small']),
         'D':np.array([1,2,2,3,3,5])})
    df.groupby('B')['D'].mean()
    
    Out[39]:
    B
    one      1.5
    three    4.0
    two      2.5
    Name: D, dtype: float64
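
The result above is a Series whose index holds the group keys. When a flat table is more convenient, one option (a sketch, not the only way) is to call `reset_index()` on the result:

```python
import pandas as pd

df = pd.DataFrame({
    "B": ["one", "one", "two", "two", "three", "three"],
    "D": [1, 2, 2, 3, 3, 5],
})

means = df.groupby("B")["D"].mean()  # Series indexed by the values of B
flat = means.reset_index()           # DataFrame with ordinary columns B and D
print(flat)
```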
     
    df.pivot_table(index=col1,values=[col2,col3],aggfunc='mean') # build a pivot table indexed by col1 that aggregates the value columns col2 and col3 with the mean
    
    In [11]:
    df = pd.DataFrame({'A':np.array(['foo','foo','foo','foo','bar','bar']),
          'B':np.array(['one','one','two','two','three','three']),
         'C':np.array(['small','medium','large','large','small','small']),
         'D':np.array([1,2,2,3,3,5])})
    print(df)
    # df is already the caller, so it is not passed again as an argument
    df.pivot_table(index=['A','B'],
                   columns=['C'],aggfunc='sum')
    
     
         A      B       C  D
    0  foo    one   small  1
    1  foo    one  medium  2
    2  foo    two   large  2
    3  foo    two   large  3
    4  bar  three   small  3
    5  bar  three   small  5
    
    Out[11]:
                  D
    C         large medium small
    A   B
    bar three   NaN    NaN   8.0
    foo one     NaN    2.0   1.0
        two     5.0    NaN   NaN
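
A pivot table can be thought of as a `groupby` followed by `unstack`. The sketch below rebuilds the table above both ways on the same data; naming `values="D"` explicitly is a choice made here so that the result has a single level of column labels.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "foo", "foo", "foo", "bar", "bar"],
    "B": ["one", "one", "two", "two", "three", "three"],
    "C": ["small", "medium", "large", "large", "small", "small"],
    "D": [1, 2, 2, 3, 3, 5],
})

# Pivot: rows are (A, B) pairs, columns are the values of C, cells are sums of D.
pivot = df.pivot_table(values="D", index=["A", "B"], columns="C", aggfunc="sum")

# The same numbers via groupby: aggregate first, then move C into the columns.
by_group = df.groupby(["A", "B", "C"])["D"].sum().unstack("C")

print(pivot)
print(by_group)
```

Combinations that never occur in the data, such as (`bar`, `three`, `large`), show up as NaN in both results.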
     
    df.groupby(col1).agg(np.mean) # group df by col1 and take the mean of the remaining numeric columns
    
    In [12]:
    df = pd.DataFrame({'A':np.array(['foo','foo','foo','foo','bar','bar']),
          'B':np.array(['one','one','two','two','three','three']),
         'C':np.array(['small','medium','large','large','small','small']),
         'D':np.array([1,2,2,3,3,5])})
    print(df)
    df.groupby('A')[['D']].agg('mean')  # select the numeric column first; pandas 2.0+ raises on the string columns B and C
    
     
         A      B       C  D
    0  foo    one   small  1
    1  foo    one  medium  2
    2  foo    two   large  2
    3  foo    two   large  3
    4  bar  three   small  3
    5  bar  three   small  5
    
    Out[12]:
     D
    A 
    bar 4
    foo 2
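
`agg` also accepts a list of function names, which gives several statistics per group in one call. A minimal sketch, reusing columns A and D from the frame above:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "foo", "foo", "foo", "bar", "bar"],
    "D": [1, 2, 2, 3, 3, 5],
})

# One row per group, one column per requested statistic.
summary = df.groupby("A")["D"].agg(["mean", "sum", "max"])
print(summary)
```

Passing the names as strings rather than NumPy functions is a choice made for this sketch; it avoids the deprecation warnings that recent pandas versions emit for `np.mean` and `np.sum` in `agg`.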
     
    df.apply(np.mean) # compute the mean of each column of df
    
    In [13]:
    df = pd.DataFrame(np.random.rand(10, 5),columns=list('adsfg'))
    df.apply(np.mean)
    
    Out[13]:
    a    0.539334
    d    0.500330
    s    0.508882
    f    0.580603
    g    0.523317
    dtype: float64
    In [42]:
    df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
    df.apply(np.mean)
    
    Out[42]:
    A    0.388075
    B    0.539564
    C    0.607983
    D    0.518634
    E    0.482960
    dtype: float64
     
    df.apply(np.max,axis=1) # compute the maximum of each row of df
    
    In [14]:
    df = pd.DataFrame(np.random.rand(10, 6),columns=list('asdfrg'))
    df.apply(np.max, axis=1)
    
    Out[14]:
    0    0.845378
    1    0.998686
    2    0.968602
    3    0.843231
    4    0.940353
    5    0.908892
    6    0.949700
    7    0.663064
    8    0.876051
    9    0.975562
    dtype: float64
    In [43]:
    df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
    df.apply(np.max,axis=1)
    
    Out[43]:
    0    0.904163
    1    0.804519
    2    0.924102
    3    0.761781
    4    0.952084
    5    0.923679
    6    0.796320
    7    0.582907
    8    0.761310
    9    0.893564
    dtype: float64
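
For common reductions, `apply` gives the same numbers as the built-in DataFrame methods. The sketch below uses a small fixed frame (values chosen only for illustration) so the results can be checked by eye.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 9, 3], [7, 2, 5]], columns=list("ABC"))

col_means = df.apply(np.mean)        # one value per column
row_max = df.apply(np.max, axis=1)   # one value per row

# The built-in reductions produce the same values.
print(col_means.tolist(), df.mean().tolist())
print(row_max.tolist(), df.max(axis=1).tolist())
```

`apply` is mainly useful for functions that have no built-in equivalent; for mean, max and similar, the direct methods are shorter.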
     
    Joining (join) and combining (combine) data
     
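    df1.append(df2) # append the rows of df2 to the end of df1; both frames should have the same columns. DataFrame.append was removed in pandas 2.0, where pd.concat([df1, df2]) does the same job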
    
    In [44]:
    df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3'],
                        'C': ['C0', 'C1', 'C2', 'C3'],
                        'D': ['D0', 'D1', 'D2', 'D3']},
                       index=[0, 1, 2, 3])
    df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                        'B': ['B4', 'B5', 'B6', 'B7'],
                        'C': ['C4', 'C5', 'C6', 'C7'],
                        'D': ['D4', 'D5', 'D6', 'D7']},
                       index=[4, 5, 6, 7])
    
    pd.concat([df1, df2])  # same result as df1.append(df2), which was removed in pandas 2.0
    
    Out[44]:
       A  B  C  D
    0 A0 B0 C0 D0
    1 A1 B1 C1 D1
    2 A2 B2 C2 D2
    3 A3 B3 C3 D3
    4 A4 B4 C4 D4
    5 A5 B5 C5 D5
    6 A6 B6 C6 D6
    7 A7 B7 C7 D7
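
When rows are stacked, each frame keeps its own index labels. In the example above the two frames were given non-overlapping indexes by hand; with default indexes the labels repeat. A short sketch of the difference, using `ignore_index=True` to renumber:

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})
df2 = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]})

stacked = pd.concat([df1, df2])                        # index labels 0, 1, 0, 1
renumbered = pd.concat([df1, df2], ignore_index=True)  # index labels 0, 1, 2, 3

print(stacked)
print(renumbered)
```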
     
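    pd.concat([df1, df2],axis=1) # add the columns of df2 to the right of df1; rows are matched by index label, so both frames should share the same index (labels found in only one frame are filled with NaN)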
    
    In [45]:
    df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3'],
                        'C': ['C0', 'C1', 'C2', 'C3'],
                        'D': ['D0', 'D1', 'D2', 'D3']},
                       index=[0, 1, 2, 3])
    df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                        'B': ['B4', 'B5', 'B6', 'B7'],
                        'C': ['C4', 'C5', 'C6', 'C7'],
                        'D': ['D4', 'D5', 'D6', 'D7']},
                       index=[4, 5, 6, 7])
    pd.concat([df1,df2],axis=1)
    
    Out[45]:
         A    B    C    D    A    B    C    D
    0   A0   B0   C0   D0  NaN  NaN  NaN  NaN
    1   A1   B1   C1   D1  NaN  NaN  NaN  NaN
    2   A2   B2   C2   D2  NaN  NaN  NaN  NaN
    3   A3   B3   C3   D3  NaN  NaN  NaN  NaN
    4  NaN  NaN  NaN  NaN   A4   B4   C4   D4
    5  NaN  NaN  NaN  NaN   A5   B5   C5   D5
    6  NaN  NaN  NaN  NaN   A6   B6   C6   D6
    7  NaN  NaN  NaN  NaN   A7   B7   C7   D7
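
The NaN blocks in the output above appear because the two frames share no index labels. The following sketch (with hypothetical frames `left`, `right` and `shifted`) shows column-wise concatenation with identical, partly overlapping, and intersected indexes.

```python
import pandas as pd

left = pd.DataFrame({"A": ["A0", "A1"]}, index=[0, 1])
right = pd.DataFrame({"B": ["B0", "B1"]}, index=[0, 1])
shifted = pd.DataFrame({"B": ["B0", "B1"]}, index=[1, 2])

aligned = pd.concat([left, right], axis=1)                # same labels: 2 rows, no NaN
outer = pd.concat([left, shifted], axis=1)                # union of labels: 3 rows, NaN where missing
inner = pd.concat([left, shifted], axis=1, join="inner")  # only labels present in both frames

print(aligned)
print(outer)
print(inner)
```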
     
    df1.join(df2,on=col1,how='inner') # inner-join df1 and df2, matching column col1 of df1 against the index of df2
    
    In [46]:
    df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],           
                         'B': ['B0', 'B1', 'B2', 'B3'],
                         'key': ['K0', 'K1', 'K0', 'K1']})
       
    
    df2 = pd.DataFrame({'C': ['C0', 'C1'],
                          'D': ['D0', 'D1']},
                         index=['K0', 'K1'])
       
    
    df1.join(df2, on='key')
    
    Out[46]:
        A   B  key   C   D
    0  A0  B0   K0  C0  D0
    1  A1  B1   K1  C1  D1
    2  A2  B2   K0  C0  D0
    3  A3  B3   K1  C1  D1
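
`join` defaults to `how='left'`, which keeps every row of the calling frame. In the example above every key has a match, so left and inner joins look the same. The sketch below adds a hypothetical unmatched key `K2` to show the difference, and the equivalent `merge` call.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A0", "A1", "A2"], "key": ["K0", "K1", "K2"]})
df2 = pd.DataFrame({"C": ["C0", "C1"]}, index=["K0", "K1"])

left = df1.join(df2, on="key")                # default how="left": K2 kept, C is NaN
inner = df1.join(df2, on="key", how="inner")  # K2 dropped

# merge expresses the same inner join with explicit key arguments.
merged = df1.merge(df2, left_on="key", right_index=True, how="inner")

print(left)
print(inner)
```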
     

    <div id = 'p10'>Statistics</div>

     
    df.describe() # descriptive statistics for each numeric column of df
    
    In [4]:
    df = pd.DataFrame(np.random.rand(10, 5),columns=list('abcde'))
    df.describe()
    
    Out[4]:
                   a          b          c          d          e
    count  10.000000  10.000000  10.000000  10.000000  10.000000
    mean    0.401144   0.359406   0.603465   0.627617   0.408927
    std     0.314415   0.276410   0.225576   0.338007   0.277260
    min     0.052844   0.015361   0.255718   0.121600   0.082777
    25%     0.148306   0.141934   0.498205   0.320862   0.198211
    50%     0.328256   0.301379   0.575852   0.661513   0.332168
    75%     0.603549   0.584706   0.665217   0.922541   0.581780
    max     0.899552   0.838164   0.973688   0.986095   0.933372
    In [47]:
    df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
    df.describe()
    
    Out[47]:
                   A          B          C          D          E
    count  10.000000  10.000000  10.000000  10.000000  10.000000
    mean    0.398648   0.451699   0.443472   0.739478   0.412954
    std     0.330605   0.221586   0.303084   0.308798   0.262148
    min     0.004457   0.188689   0.079697   0.113562   0.052935
    25%     0.088177   0.270355   0.205663   0.715005   0.205685
    50%     0.315533   0.457229   0.332148   0.885872   0.400232
    75%     0.749716   0.497208   0.737900   0.948651   0.634670
    max     0.782956   0.825671   0.851065   0.962922   0.815447
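
By default `describe()` summarises only the numeric columns. A small sketch on a hypothetical mixed frame shows this, along with `include="all"` to cover the other columns too:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4], "label": ["a", "b", "a", "b"]})

numeric = df.describe()                  # only column x
everything = df.describe(include="all")  # adds count/unique/top/freq for label

print(numeric)
print(everything)
```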
     
    df.mean() # mean of each column of df
    
    In [6]:
    df = pd.DataFrame(np.random.rand(10, 5),columns=list('abcde'))
    df.mean()
    
    Out[6]:
    a    0.501247
    b    0.596623
    c    0.525627
    d    0.503693
    e    0.420740
    dtype: float64
    In [5]:
    df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
    df.mean()
    
    Out[5]:
    A    0.554337
    B    0.574231
    C    0.438493
    D    0.514337
    E    0.532763
    dtype: float64
     
    df.corr() # pairwise correlation coefficients between the columns of df
    
    In [7]:
    df = pd.DataFrame(np.random.rand(10, 5),columns=list('abcde'))
    df.corr()
    
    Out[7]:
               a          b          c          d          e
    a   1.000000  -0.314863   0.145670   0.569909  -0.089665
    b  -0.314863   1.000000   0.241693  -0.105917   0.510971
    c   0.145670   0.241693   1.000000   0.073844  -0.070198
    d   0.569909  -0.105917   0.073844   1.000000  -0.425560
    e  -0.089665   0.510971  -0.070198  -0.425560   1.000000
    In [49]:
    df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
    df.corr()
    
    Out[49]:
               A          B          C          D          E
    A   1.000000  -0.634931  -0.354824  -0.354131   0.170957
    B  -0.634931   1.000000   0.225222  -0.338124  -0.043300
    C  -0.354824   0.225222   1.000000   0.098285   0.297133
    D  -0.354131  -0.338124   0.098285   1.000000  -0.324209
    E   0.170957  -0.043300   0.297133  -0.324209   1.000000
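
Correlations between random columns are hard to sanity-check. The sketch below uses hand-made columns (hypothetical names `x`, `y`, `z`) with exact linear relationships, so the expected Pearson coefficients are +1 and -1.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],   # y = 2 * x, perfectly positively correlated with x
    "z": [4, 3, 2, 1],   # z = 5 - x, perfectly negatively correlated with x
})

corr = df.corr()  # Pearson correlation by default
print(corr)
```

The matrix is symmetric and its diagonal is 1, since every column is perfectly correlated with itself.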
     
    df.count() # number of non-null values in each column of df
    
    In [8]:
    df = pd.DataFrame(np.random.rand(10, 5),columns=list('abcde'))
    df.count()
    
    Out[8]:
    a    10
    b    10
    c    10
    d    10
    e    10
    dtype: int64
    In [50]:
    df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
    df.count()
    
    Out[50]:
    A    10
    B    10
    C    10
    D    10
    E    10
    dtype: int64
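
The frames above contain no missing values, so every count is simply the number of rows. A minimal sketch with a few NaN entries (values chosen for illustration) shows what `count()` actually measures:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
})

non_null = df.count()      # non-missing values per column
missing = df.isna().sum()  # missing values per column

print(non_null)
print(missing)
```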
     
    df.max() # maximum of each column of df
    
    In [12]:
    df = pd.DataFrame(np.random.rand(10, 5),columns=list('abcde'))
    print(df)
    print(df.max())
    
     
              a         b         c         d         e
    0  0.743688  0.081938  0.693243  0.647515  0.835997
    1  0.162604  0.421371  0.422371  0.930136  0.732234
    2  0.842065  0.139927  0.675018  0.543914  0.017094
    3  0.535794  0.078217  0.964779  0.607462  0.432429
    4  0.560279  0.544811  0.304371  0.797165  0.505008
    5  0.695691  0.696121  0.741812  0.502741  0.484697
    6  0.775342  0.410536  0.275251  0.810911  0.081818
    7  0.584267  0.917728  0.379231  0.097702  0.622885
    8  0.754810  0.809628  0.102337  0.283509  0.615719
    9  0.003056  0.536268  0.187236  0.181844  0.255499
    a    0.842065
    b    0.917728
    c    0.964779
    d    0.930136
    e    0.835997
    dtype: float64
    In [51]:
    df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
    df.max()
    
    Out[51]:
    A    0.933848
    B    0.730197
    C    0.921751
    D    0.715280
    E    0.940010
    dtype: float64
     
    df.min() # minimum of each column of df
    
    In [52]:
    df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
    df.min()
    
    Out[52]:
    A    0.107516
    B    0.001635
    C    0.024502
    D    0.092810
    E    0.019898
    dtype: float64
     
    df.median() # median of each column of df
    
    In [53]:
    df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
    df.median()
    
    Out[53]:
    A    0.497591
    B    0.359854
    C    0.661607
    D    0.342418
    E    0.588468
    dtype: float64
     
    df.std() # standard deviation of each column of df
    
    In [54]:
    df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
    df.std()
    
    Out[54]:
    A    0.231075
    B    0.286691
    C    0.276511
    D    0.304167
    E    0.272570
    dtype: float64
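
One detail worth knowing: pandas `std()` computes the sample standard deviation (`ddof=1`), whereas `np.std` defaults to the population version (`ddof=0`). The sketch below uses a fixed column whose population standard deviation is exactly 2 to show the difference.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]})

sample_std = df["x"].std()              # divides by n - 1 (pandas default)
population_std = df["x"].std(ddof=0)    # divides by n
numpy_std = np.std(df["x"].to_numpy())  # NumPy default also divides by n

print(sample_std, population_std, numpy_std)
```

Pass `ddof=0` to `df.std()` when results need to match NumPy's default.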
  • Original article: https://www.cnblogs.com/heitaoq/p/7965964.html