• pandas study notes: reading the official documentation


    1. Creating objects

    (1) Create a simple series with pd.Series

    >>> import numpy as np
    >>> import pandas as pd
    >>> s = pd.Series([1,3,5,np.nan,6,8])
    >>> s
    0    1.0
    1    3.0
    2    5.0
    3    NaN   # note the missing value
    4    6.0
    5    8.0
    dtype: float64

    (2) Create a sequence of dates with pd.date_range

    >>> dates = pd.date_range('20130101', periods=6)
    >>> dates
    DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
                   '2013-01-05', '2013-01-06'],
                  dtype='datetime64[ns]', freq='D')

    (3) Building a DataFrame

    >>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
    # index supplies the row labels; columns supplies the column names
    
    >>> df
                       A         B         C         D
    2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
    2013-01-02  1.212112 -0.173215  0.119209 -1.044236
    2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
    2013-01-04  0.721555 -0.706771 -1.039575  0.271860
    2013-01-05 -0.424972  0.567020  0.276232 -1.087401
    2013-01-06 -0.673690  0.113648 -1.478427  0.524988
    >>> df2 = pd.DataFrame({'A': 1.,
    ...                     'B': pd.Timestamp('20130102'),
    ...                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    ...                     'D': np.array([3] * 4, dtype='int32'),
    ...                     'E': pd.Categorical(["test", "train", "test", "train"]),
    ...                     'F': 'foo'})

    >>> df2
         A        B    C    D     E    F
    0  1.0 2013-01-02  1.0  3   test  foo
    1  1.0 2013-01-02  1.0  3  train  foo
    2  1.0 2013-01-02  1.0  3   test  foo
    3  1.0 2013-01-02  1.0  3  train  foo
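
    Because each column of df2 is built from a different kind of value, the columns end up with different dtypes. A minimal sketch that rebuilds df2 as above and inspects `dtypes`; the exact dtype reported for the datetime and string columns may vary with the pandas version, so no output is shown:

```python
import numpy as np
import pandas as pd

# Same construction as df2 above: scalars are broadcast to the length of the index.
df2 = pd.DataFrame({
    "A": 1.0,
    "B": pd.Timestamp("20130102"),
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo",
})

# One dtype per column.
print(df2.dtypes)
```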

    2. Viewing data

    (1) First n rows (head) and last n rows (tail)

    >>> df.head(2)
                       A         B         C         D
    2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
    2013-01-02  1.212112 -0.173215  0.119209 -1.044236
    
    
    >>> df.tail(3)
                       A         B         C         D
    2013-01-04  0.721555 -0.706771 -1.039575  0.271860
    2013-01-05 -0.424972  0.567020  0.276232 -1.087401
    2013-01-06 -0.673690  0.113648 -1.478427  0.524988

    (2) Show the row labels (index), column labels (columns), and values

    >>> df.index
    DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
                   '2013-01-05', '2013-01-06'],
                  dtype='datetime64[ns]', freq='D')
    
    >>> df.columns
    Index(['A', 'B', 'C', 'D'], dtype='object')
    
    >>> df.values
    array([[ 0.4691, -0.2829, -1.5091, -1.1356],
           [ 1.2121, -0.1732,  0.1192, -1.0442],
           [-0.8618, -2.1046, -0.4949,  1.0718],
           [ 0.7216, -0.7068, -1.0396,  0.2719],
           [-0.425 ,  0.567 ,  0.2762, -1.0874],
           [-0.6737,  0.1136, -1.4784,  0.525 ]])

    (3) Quick summary statistics with describe

    >>> df.describe()
                  A         B         C         D
    count  6.000000  6.000000  6.000000  6.000000
    mean   0.073711 -0.431125 -0.687758 -0.233103
    std    0.843157  0.922818  0.779887  0.973118
    min   -0.861849 -2.104569 -1.509059 -1.135632
    25%   -0.611510 -0.600794 -1.368714 -1.076610
    50%    0.022070 -0.228039 -0.767252 -0.386188
    75%    0.658444  0.041933 -0.034326  0.461706
    max    1.212112  0.567020  0.276232  1.071804

    (4) Transpose: df.T
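
    The notes give no output for the transpose, so here is a minimal sketch. It uses a small hypothetical frame named `small` (not the random df above) so that the result is easy to check by eye:

```python
import pandas as pd

# Hypothetical 2x2 frame, chosen only for illustration.
small = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["r1", "r2"])

# .T swaps the axes: column labels become row labels and vice versa.
transposed = small.T

print(transposed)
```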

    (5) Sort by axis labels

    Descending: ascending=False
    Ascending: ascending=True (the default)
    By column labels: df.sort_index(axis=1, ascending=False)
    By row labels (index): df.sort_index(axis=0, ascending=False)
    >>> df.sort_index(axis=1, ascending=False)
                       D         C         B         A
    2013-01-01 -1.135632 -1.509059 -0.282863  0.469112
    2013-01-02 -1.044236  0.119209 -0.173215  1.212112
    2013-01-03  1.071804 -0.494929 -2.104569 -0.861849
    2013-01-04  0.271860 -1.039575 -0.706771  0.721555
    2013-01-05 -1.087401  0.276232  0.567020 -0.424972
    2013-01-06  0.524988 -1.478427  0.113648 -0.673690

    (6) Sort by values

    >>> df.sort_values(by='B')
                       A         B         C         D
    2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
    2013-01-04  0.721555 -0.706771 -1.039575  0.271860
    2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
    2013-01-02  1.212112 -0.173215  0.119209 -1.044236
    2013-01-06 -0.673690  0.113648 -1.478427  0.524988
    2013-01-05 -0.424972  0.567020  0.276232 -1.087401

    3. Selection (similar to MATLAB)

    Select a single column: df.A is equivalent to df['A']

    Select a range of rows by slicing: df[0:3]

    Select by label: df.loc[dates[0]]
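
    The three forms of selection above can be sketched as follows. To keep the results predictable, this example fills df with the integers 0..23 instead of random numbers; that substitution is an assumption made for illustration only:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Deterministic values instead of np.random.randn, so results are easy to verify.
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

col_attr = df.A               # attribute access to column A
col_item = df["A"]            # item access, same column
first_three = df[0:3]         # positional row slice: rows 0, 1 and 2
first_row = df.loc[dates[0]]  # label-based: the row for 2013-01-01, as a Series

print(first_three)
print(first_row)
```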

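    4. Missing data

    Missing values are represented by np.nan.

    Drop rows that contain any missing data: df.dropna(how='any')

    Fill in missing data: df.fillna(value=5)

    Check which values are missing: pd.isna(df1) (df1 stands for a DataFrame that contains NaN values; it is not defined in these notes)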
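
    A minimal sketch of the three calls above. Since df1 is not defined in these notes, the frame below is a hypothetical stand-in with two missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df1: one NaN in each column.
df1 = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, np.nan]})

dropped = df1.dropna(how="any")  # keep only rows with no NaN at all
filled = df1.fillna(value=5)     # replace every NaN with 5
mask = pd.isna(df1)              # boolean frame, True where a value is missing

print(dropped)
print(filled)
print(mask)
```

    All three calls return new objects; df1 itself is left unchanged.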

    5. Statistics

    Mean of each column: df.mean()
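
    By default `mean` works down each column; passing `axis=1` computes the mean across each row instead. A short sketch on a small hypothetical frame (the values are chosen for illustration only):

```python
import pandas as pd

# Hypothetical frame with easy-to-check values.
df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]})

col_means = df.mean()        # one value per column
row_means = df.mean(axis=1)  # one value per row

print(col_means)
print(row_means)
```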

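    6. Applying functions

    # The output below is the one shown in the official tutorial, where df had
    # already been modified (column D set to a constant, column F added), so it
    # does not match the df defined in section 1.
    >>> df.apply(lambda x: x.max() - x.min())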
    A    2.073961
    B    2.671590
    C    1.785291
    D    0.000000
    F    4.000000
    dtype: float64
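
    To make the behaviour of `apply` easier to follow, here is a minimal sketch on a small hypothetical frame. The lambda receives one column at a time by default, or one row at a time with `axis=1`:

```python
import pandas as pd

# Hypothetical frame: column B is constant, so its range is 0.
df = pd.DataFrame({"A": [1.0, 4.0, 2.0], "B": [5.0, 5.0, 5.0]})

# Range (max minus min) of each column.
col_range = df.apply(lambda x: x.max() - x.min())

# Range of each row.
row_range = df.apply(lambda x: x.max() - x.min(), axis=1)

print(col_range)
print(row_range)
```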
  • Original article: https://www.cnblogs.com/syyy/p/7908075.html