• Learning pandas (1)


    pandas has two core data structures of its own: Series and DataFrame.

    Series

    Data structure
    A Series pairs each value with an index label:
    data     100  300  500
    index      0    1    2
    or, shown vertically:
    index  data
    0      100
    1      300
    2      500
    
    
    Creating a Series object
    
    In [1]: import numpy as np
    In [2]: from pandas import Series,DataFrame
    In [3]: import pandas as pd
    
    Passing a list creates the Series with a default integer index
    In [4]: s1 = Series([1,3,6,-1,2,8]) 
    In [5]: s1
    Out[5]: 
    0    1
    1    3
    2    6
    3   -1
    4    2
    5    8
    dtype: int64
    
    
    Passing a list together with a custom index
    In [9]: s2 = Series([1,3,6,-1,2,8],index = ["a","c","d","e","b","g"])
    In [10]: s2
    Out[10]: 
    a    1
    c    3
    d    6
    e   -1
    b    2
    g    8
    dtype: int64
    
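    Passing a dict (the keys become the index). Older pandas versions sort the keys, as in the output below; recent versions keep the dict's insertion order.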
    In [11]: SD = {"python":100,"java":101,"scala":102}
    In [12]: s3 = Series(SD)
    In [14]: s3
    Out[14]: 
    java      101
    python    100
    scala     102
    dtype: int64
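
The three construction routes above can be put side by side in one script. This is a minimal sketch; the extra key "go" and the name s4 are illustrative assumptions that do not appear in the session above.

```python
import pandas as pd

# 1) From a list: pandas builds the integer index 0..n-1
s1 = pd.Series([1, 3, 6, -1, 2, 8])

# 2) From a list plus explicit labels (lengths must match)
s2 = pd.Series([1, 3, 6, -1, 2, 8], index=["a", "c", "d", "e", "b", "g"])

# 3) From a dict: the keys become the index labels
scores = {"python": 100, "java": 101, "scala": 102}
s3 = pd.Series(scores)

# Passing index= together with a dict selects and reorders by key;
# a key missing from the dict ("go", hypothetical) becomes NaN
s4 = pd.Series(scores, index=["java", "python", "go"])

print(s4)
```

Because of the NaN, s4 is stored as float64 rather than int64.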
    
    
    
    Displaying the values (values) and the index (index)
    In [6]: s1.values
    Out[6]: array([ 1,  3,  6, -1,  2,  8])
    In [7]: s1.index
    Out[7]: RangeIndex(start=0, stop=6, step=1)
    
    
    
    In [17]: s1
    Out[17]: 
    0    1
    1    3
    2    6
    3   -1
    4    2
    5    8
    dtype: int64
    
    Assigning custom index labels
    In [18]: s1.index = ["p1","p2","p3","p4","p5","p6"]
    In [19]: s1
    Out[19]: 
    p1    1
    p2    3
    p3    6
    p4   -1
    p5    2
    p6    8
    dtype: int64
    
    
    
    Reading and modifying a value by index label
    In [20]: s1['p1']
    Out[20]: 1
    In [21]: s1['p1']=100
    In [22]: s1
    Out[22]: 
    p1    100
    p2      3
    p3      6
    p4     -1
    p5      2
    p6      8
    dtype: int64
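
Plain square brackets work for label lookups as shown above. As a hedged sketch, the same read and write can also be spelled out with .loc (by label) and .iloc (by position), which avoids ambiguity once the labels are no longer the default integers. The three-element Series here is a shortened stand-in for s1.

```python
import pandas as pd

s = pd.Series([1, 3, 6], index=["p1", "p2", "p3"])

first_by_label = s.loc["p1"]   # label-based lookup
first_by_pos = s.iloc[0]       # position-based lookup, same element

s.loc["p1"] = 100              # assignment modifies the Series in place
print(s)
```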
    
    Checking for missing (null) values
    In [29]: pd.isnull(s1)
    Out[29]: 
    p1    False
    p2    False
    p3    False
    p4    False
    p5    False
    p6    False
    dtype: bool
    
    In [30]: pd.notnull(s1)
    Out[30]: 
    p1    True
    p2    True
    p3    True
    p4    True
    p5    True
    p6    True
    dtype: bool
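
Since s1 has no missing values, every flag above is the same. A small sketch with one np.nan shows what the two functions are for; the data is made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0], index=["p1", "p2", "p3"])

mask = pd.isnull(s)            # True where the value is missing
n_missing = int(mask.sum())    # True counts as 1
clean = s[pd.notnull(s)]       # keep only the non-missing entries

print(mask)
print(clean)
```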
    
    
    Operations (boolean filtering and element-wise arithmetic)
    In [31]: s2
    Out[31]: 
    a    1
    c    3
    d    6
    e   -1
    b    2
    g    8
    dtype: int64
    In [32]: s2[s2>5]
    Out[32]: 
    d    6
    g    8
    dtype: int64
    
    
    In [33]: s2*10
    Out[33]: 
    a    10
    c    30
    d    60
    e   -10
    b    20
    g    80
    dtype: int64
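
The filter and the multiplication above both return new Series and leave s2 untouched. The sketch below repeats them and adds a combined condition; the variable names big, mid and scaled are illustrative.

```python
import pandas as pd

s2 = pd.Series([1, 3, 6, -1, 2, 8], index=["a", "c", "d", "e", "b", "g"])

big = s2[s2 > 5]                   # boolean mask keeps matching rows
mid = s2[(s2 > 0) & (s2 < 5)]      # combine conditions with & and parentheses
scaled = s2 * 10                   # arithmetic is applied element-wise

print(mid)
```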
    
    

    DataFrame

    The main components of a DataFrame are index, columns, and values.

    Passing a list creates a Series, and pandas builds a default integer index. The np.nan entry makes the dtype float64.
    In [34]: s = pd.Series([1,3,5,np.nan,6,8])
    In [35]: s
    Out[35]: 
    0    1.0
    1    3.0
    2    5.0
    3    NaN
    4    6.0
    5    8.0
    dtype: float64
    
    Creating a DataFrame from a numpy array, with a datetime index and column labels
    In [48]: dates = pd.date_range("20170101",periods = 6)
    
    In [49]: df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list("ABCD"))
    
    In [50]: df
    Out[50]: 
                       A         B         C         D
    2017-01-01  0.198724  1.455237 -1.165803 -0.474382
    2017-01-02  0.622154 -0.280253 -0.492515  0.002470
    2017-01-03  1.764839 -1.734531 -0.195002  0.128216
    2017-01-04 -0.520130  1.372930 -2.240510  0.362139
    2017-01-05  1.530835  0.406480 -1.714226 -0.289591
    2017-01-06  0.675166  0.210024 -0.773319 -1.410746
    In [51]: dates
    Out[51]: 
    DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
                   '2017-01-05', '2017-01-06'],
                  dtype='datetime64[ns]', freq='D')
    
    The same constructor with explicit string labels for rows and columns:
    d1=DataFrame(np.arange(12).reshape((3,4)),index=['a','b','c'],columns=['a1','a2','a3','a4'])
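
The random values in the session above differ on every run. As a sketch, fixing a seed (an assumption added here purely for reproducibility) makes the frame repeatable while the construction stays the same.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20170101", periods=6)   # six consecutive days

rng = np.random.RandomState(0)                 # fixed seed, illustrative choice
df = pd.DataFrame(rng.randn(6, 4), index=dates, columns=list("ABCD"))

# Array data with explicit row and column labels
d1 = pd.DataFrame(np.arange(12).reshape((3, 4)),
                  index=["a", "b", "c"],
                  columns=["a1", "a2", "a3", "a4"])

print(df.shape)
print(d1)
```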
    
    
    Common sources are equal-length lists, dicts, numpy arrays, and data files.
    In [61]: data = {'name':['zxx','lxx','gxx','hxx'],'age':[12,13,14,15],'addr':['JX','JS','BJ','SH']}
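    Converting the dict to a DataFrame. d2 uses the default index and, in older pandas versions, gets its columns sorted by name as in Out[63]; d3 sets both the column order and the index explicitly.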
    In [62]: d2 = DataFrame(data)
    In [63]: d2
    Out[63]: 
      addr  age name
    0   JX   12  zxx
    1   JS   13  lxx
    2   BJ   14  gxx
    3   SH   15  hxx
    
    In [64]: d3 = DataFrame(data,columns=['name','age','addr'],index=['a','b','c','d']) 
    In [65]: d3
    Out[65]: 
      name  age addr
    a  zxx   12   JX
    b  lxx   13   JS
    c  gxx   14   BJ
    d  hxx   15   SH
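
A short sketch of how the columns= argument behaves with dict input: it fixes the column order, and a name that is not a key of the dict yields a column of NaN. The column "tel" and the name d5 are hypothetical and only serve to show that case.

```python
import pandas as pd

data = {"name": ["zxx", "lxx", "gxx", "hxx"],
        "age": [12, 13, 14, 15],
        "addr": ["JX", "JS", "BJ", "SH"]}

d2 = pd.DataFrame(data)                                  # default integer index
d3 = pd.DataFrame(data, columns=["name", "age", "addr"],
                  index=["a", "b", "c", "d"])            # explicit order and labels

# "tel" is not a key in data, so the whole column is NaN
d5 = pd.DataFrame(data, columns=["name", "age", "tel"])

print(d3)
print(d5)
```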
    
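    df.dtypes    data type of each column
    df.<Tab>     tab completion in IPython lists all attributes and column names
    df.head(2)   first two rows
    df.tail(2)   last two rows
    df.index     the index
    df.columns   the column names
    df.values    the underlying numpy data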
    
    
    In [56]: df.head(2)
    Out[56]: 
                       A         B         C         D
    2017-01-01  0.198724  1.455237 -1.165803 -0.474382
    2017-01-02  0.622154 -0.280253 -0.492515  0.002470
    
    In [57]: df.tail(2)
    Out[57]: 
                       A         B         C         D
    2017-01-05  1.530835  0.406480 -1.714226 -0.289591
    2017-01-06  0.675166  0.210024 -0.773319 -1.410746
    
    In [58]: df.index
    Out[58]: 
    DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
                   '2017-01-05', '2017-01-06'],
                  dtype='datetime64[ns]', freq='D')
    
    In [59]: df.columns
    Out[59]: Index([u'A', u'B', u'C', u'D'], dtype='object')
    
    In [60]: df.values
    Out[60]: 
    array([[ 0.19872446,  1.45523672, -1.16580285, -0.47438238],
           [ 0.62215406, -0.28025262, -0.49251531,  0.00247041],
           [ 1.76483913, -1.73453082, -0.19500168,  0.12821624],
           [-0.52013049,  1.37292972, -2.24051045,  0.36213914],
           [ 1.53083459,  0.40647992, -1.71422601, -0.28959076],
           [ 0.67516588,  0.2100239 , -0.77331882, -1.41074624]])
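
The inspection attributes can be exercised together on a small deterministic frame. This is only a sketch; the values 0..23 replace the random data used in the session above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24, dtype=float).reshape(6, 4),
                  index=pd.date_range("20170101", periods=6),
                  columns=list("ABCD"))

head = df.head(2)        # first two rows
tail = df.tail(2)        # last two rows
raw = df.values          # plain numpy array without labels

print(df.dtypes)
print(df.index)
print(df.columns)
```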
    
    

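    Selecting data

    Here data is the dict with the full names defined at In [95] further down, not the one from In [61].
    In [71]: d3=DataFrame(data,columns=['name','age','addr'],index=['a','b','c','d'])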
    
    In [72]: d3
    Out[72]: 
           name  age     addr
    a  zhanghua   40   jianxi
    b   liuting   45   pudong
    c    gaofei   50  beijing
    d    hedong   46     xian
    
    Selecting columns
    In [78]: d3[['name','age']]
    Out[78]: 
           name  age
    a  zhanghua   40
    b   liuting   45
    c    gaofei   50
    d    hedong   46
    
    Selecting rows by label slice (both endpoints are included)
    In [84]: d3['a':'c']
    Out[84]: 
           name  age     addr
    a  zhanghua   40   jianxi
    b   liuting   45   pudong
    c    gaofei   50  beijing
    
    Selecting rows by position (the end position is excluded)
    In [87]: d3[1:3]
    Out[87]: 
          name  age     addr
    b  liuting   45   pudong
    c   gaofei   50  beijing
    
    Filtering with a condition
    In [90]: d3[d3['age']>40]
    Out[90]: 
          name  age     addr
    b  liuting   45   pudong
    c   gaofei   50  beijing
    d   hedong   46     xian
    
    
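    obj.loc[labels,[columns]] filters on rows and columns at the same time; obj.iloc does the same by position. Both replace the older obj.ix, which was deprecated in pandas 0.20 and removed in 1.0.
    In [91]: d3.loc[['a','c'],['name','age']]
    Out[91]: 
           name  age
    a  zhanghua   40
    c    gaofei   50
    
    In [93]: d3.loc['a':'c',['name','age']]
    Out[93]: 
           name  age
    a  zhanghua   40
    b   liuting   45
    c    gaofei   50
    
    In [94]: d3.iloc[0:3][['name','age']]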
    Out[94]: 
           name  age
    a  zhanghua   40
    b   liuting   45
    c    gaofei   50
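
The row and column selections in this section can be collected into one runnable sketch using only .loc and .iloc, which work on current pandas releases. The result names are illustrative.

```python
import pandas as pd

data = {"name": ["zhanghua", "liuting", "gaofei", "hedong"],
        "age": [40, 45, 50, 46],
        "addr": ["jianxi", "pudong", "beijing", "xian"]}
d3 = pd.DataFrame(data, columns=["name", "age", "addr"],
                  index=["a", "b", "c", "d"])

by_label = d3.loc[["a", "c"], ["name", "age"]]     # explicit row labels
label_slice = d3.loc["a":"c", ["name", "age"]]     # label slice includes "c"
by_pos = d3.iloc[0:3][["name", "age"]]             # positions 0, 1, 2 only
filtered = d3.loc[d3["age"] > 40, "name"]          # boolean mask plus one column

print(label_slice)
print(filtered)
```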
    

    Modifying data

    In [95]: data={'name':['zhanghua','liuting','gaofei','hedong'],'age':[40,45,50,46],'addr':['jianxi','pudong','beijing','xian']}
    
    In [96]: d3=DataFrame(data,columns=['name','age','addr'],index=['a','b','c','d'])
    
    Dropping a row (drop returns a new DataFrame; d3 itself is unchanged)
    In [97]: d3.drop('d',axis=0)
    Out[97]: 
           name  age     addr
    a  zhanghua   40   jianxi
    b   liuting   45   pudong
    c    gaofei   50  beijing
    
    Dropping a column
    In [99]: d3.drop('age',axis=1)
    Out[99]: 
           name     addr
    a  zhanghua   jianxi
    b   liuting   pudong
    c    gaofei  beijing
    d    hedong     xian
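
That drop leaves the original frame alone is visible above, where row d reappears in Out[99]. A minimal sketch that makes this explicit and shows reassignment as one way to keep the result:

```python
import pandas as pd

data = {"name": ["zhanghua", "liuting", "gaofei", "hedong"],
        "age": [40, 45, 50, 46],
        "addr": ["jianxi", "pudong", "beijing", "xian"]}
d3 = pd.DataFrame(data, columns=["name", "age", "addr"],
                  index=["a", "b", "c", "d"])

without_d = d3.drop("d", axis=0)       # new frame without row "d"
without_age = d3.drop("age", axis=1)   # new frame without column "age"

# d3 still has all rows and columns at this point
unchanged = (list(d3.index), list(d3.columns))

# Reassign to make the change stick
d3 = d3.drop("age", axis=1)
print(d3)
```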
    
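    Appending a row; ignore_index=True is required when appending a dict. DataFrame.append was removed in pandas 2.0, where pd.concat takes its place.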
    In [103]: d4 = d3.append({'name':'wxx','age':38,'addr':'HN'},ignore_index=True)
    
    In [104]: d4
    Out[104]: 
           name  age     addr
    0  zhanghua   40   jianxi
    1   liuting   45   pudong
    2    gaofei   50  beijing
    3    hedong   46     xian
    4       wxx   38       HN
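
On pandas 2.0 and later, where DataFrame.append no longer exists, the same row can be added with pd.concat. This sketch wraps the dict in a one-row DataFrame first; the name new_row is illustrative.

```python
import pandas as pd

data = {"name": ["zhanghua", "liuting", "gaofei", "hedong"],
        "age": [40, 45, 50, 46],
        "addr": ["jianxi", "pudong", "beijing", "xian"]}
d3 = pd.DataFrame(data, columns=["name", "age", "addr"],
                  index=["a", "b", "c", "d"])

# A list with one dict becomes a one-row DataFrame
new_row = pd.DataFrame([{"name": "wxx", "age": 38, "addr": "HN"}])

# ignore_index=True discards the old labels and renumbers from 0
d4 = pd.concat([d3, new_row], ignore_index=True)
print(d4)
```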
    
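    Assigning through a label that does not exist adds a row. The string '4' below does not match the integer label 4, so a new row labelled '4' is created instead of row 4 being updated, and the age column becomes float. (.ix has since been removed from pandas; .loc is its label-based replacement.)
    In [105]: d4.ix['4','age']=100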
    
    In [106]: d4
    Out[106]: 
           name    age     addr
    0  zhanghua   40.0   jianxi
    1   liuting   45.0   pudong
    2    gaofei   50.0  beijing
    3    hedong   46.0     xian
    4       wxx   38.0       HN
    4       NaN  100.0      NaN
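
If the goal is to update the existing last row rather than add one, the label has to be the integer 4. A hedged sketch using .loc, with d4 rebuilt here so the block runs on its own:

```python
import pandas as pd

d4 = pd.DataFrame({"name": ["zhanghua", "liuting", "gaofei", "hedong", "wxx"],
                   "age": [40, 45, 50, 46, 38],
                   "addr": ["jianxi", "pudong", "beijing", "xian", "HN"]})

# Integer label 4 matches the existing row, so no new row is created
d4.loc[4, "age"] = 100
print(d4)
```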
    
    Changing the index
    In [111]: d3.index=['a','b','c','d'] 
    
    In [112]: d3
    Out[112]: 
           name  age     addr
    a  zhanghua   40   jianxi
    b   liuting   45   pudong
    c    gaofei   50  beijing
    d    hedong   46     xian
    

    Summary statistics

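    Common statistical methods
    count      number of non-NA values
    describe   summary statistics for each column
    min, max   minimum and maximum
    sum        sum
    mean       mean
    var        sample variance
    std        sample standard deviation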
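
A small worked sketch of the methods listed above, on made-up scores with one missing value. It shows that missing values are skipped and that var and std use the sample formula (divisor n-1) by default.

```python
import numpy as np
import pandas as pd

# Hypothetical scores; the NaN is ignored by all methods below
scores = pd.Series([90, 88, 67, np.nan, 75])

count = scores.count()          # 4 non-missing values
total = scores.sum()            # 320.0
mean = scores.mean()            # 80.0
var = scores.var()              # sample variance: 358 / 3
std = scores.std()              # square root of the sample variance
pop_var = scores.var(ddof=0)    # population variance: 358 / 4

print(scores.describe())
```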
    
    
    
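    Importing and saving data
    Read a csv file, or a comma-separated txt file
    pd.read_csv('wu.csv')
    Read from an HDF5 store (requires the PyTables package; the key is the string 'df')
    pd.read_hdf('wu.h5',key='df')
    Write to a csv file
    df.to_csv('wu.csv')
    Write to an HDF5 store
    df.to_hdf('wu.h5',key='df')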
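
A CSV round trip can be tried without touching real data. This sketch writes to a temporary directory (an assumption made so the example cleans up after itself) and passes index=False, since otherwise the index is written as an extra unnamed column that shows up again on reading.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"name": ["zxx", "lxx"], "age": [12, 13]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "wu.csv")
    df.to_csv(path, index=False)     # do not write the index column
    loaded = pd.read_csv(path)       # returns a DataFrame directly

print(loaded)
```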
    
    
    
    In [15]: inputfile = '/home/hadoop/wujiadong/wu1_stud_score.txt'
    In [16]: data = pd.read_csv(inputfile)
    In [40]: df = DataFrame(data)
    In [41]: df.head(3)
    Out[41]: 
        stud_code  sub_code sub_name  sub_tech  sub_score   stat_date
    0  2015101000     10101     数学分析       NaN         90  0000-00-00
    1  2015101000     10102     高等代数       NaN         88  0000-00-00
    2  2015101000     10103     大学物理       NaN         67  0000-00-00
    
    
    In [42]: df.count()
    Out[42]: 
    stud_code    121
    sub_code     121
    sub_name     121
    sub_tech       0
    sub_score    121
    stat_date    121
    dtype: int64
    
    
    In [43]: df['sub_score'].describe()
    Out[43]: 
    count    121.000000
    mean      78.561983
    std       12.338215
    min       48.000000
    25%       69.000000
    50%       80.000000
    75%       89.000000
    max       98.000000
    Name: sub_score, dtype: float64
    
    Standard deviation of the student scores
    In [44]: df['sub_score'].std()
    Out[44]: 12.338214729032906
    
    
    
    

    Applying functions and mapping

    In [45]: d1 = DataFrame(np.arange(12).reshape((3,4)),index=['a','b','c'],columns=['a1','a2','a3','a4'])
    
    In [46]: d1
    Out[46]: 
       a1  a2  a3  a4
    a   0   1   2   3
    b   4   5   6   7
    c   8   9  10  11
    
    Processing every element (DataFrame.applymap is deprecated since pandas 2.1 in favour of DataFrame.map)
    In [47]: d1.applymap(lambda x:x+2)
    Out[47]: 
       a1  a2  a3  a4
    a   2   3   4   5
    b   6   7   8   9
    c  10  11  12  13
    
    Processing one row (selected by position with .iloc)
    In [54]: d1.iloc[1].map(lambda x:x+2)
    Out[54]: 
    a1    6
    a2    7
    a3    8
    a4    9
    Name: b, dtype: int64
    
    Column-wise processing (axis=0 passes each column to the function)
    In [62]: d1.apply(lambda x:x.max()-x.min(),axis=0)
    Out[62]: 
    a1    8
    a2    8
    a3    8
    a4    8
    dtype: int64
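
The three levels of processing above can be compared in one sketch that avoids deprecated calls. The element-wise case uses plain arithmetic, which gives the same result as the lambda for this simple function; row_range is an added illustration of axis=1.

```python
import numpy as np
import pandas as pd

d1 = pd.DataFrame(np.arange(12).reshape((3, 4)),
                  index=["a", "b", "c"],
                  columns=["a1", "a2", "a3", "a4"])

plus_two = d1 + 2                                    # every element
row_b = d1.loc["b"].map(lambda x: x + 2)             # one row, as a Series
col_range = d1.apply(lambda col: col.max() - col.min(), axis=0)  # per column
row_range = d1.apply(lambda row: row.max() - row.min(), axis=1)  # per row

print(col_range)
print(row_range)
```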
    
    


  • Original article: https://www.cnblogs.com/wujiadong2014/p/6545229.html