• Learning pandas, part 1


    pandas provides two core data structures of its own: Series and DataFrame.

    Series

    Data structure: each value (data) is paired with a label (index)
    data    100 300 500
    index    0   1   2
    or, laid out vertically:
    index data
    0      100
    1      300
    2      500
    
    
    Creating a Series object
    
    In [1]: import numpy as np
    In [2]: from pandas import Series,DataFrame
    In [3]: import pandas as pd
    
    Pass a list to create a Series; pandas assigns a default integer index
    In [4]: s1 = Series([1,3,6,-1,2,8]) 
    In [5]: s1
    Out[5]: 
    0    1
    1    3
    2    6
    3   -1
    4    2
    5    8
    dtype: int64
    
    
    Pass a list of labels as index to create a Series with a custom index
    In [9]: s2 = Series([1,3,6,-1,2,8],index = ["a","c","d","e","b","g"])
    In [10]: s2
    Out[10]: 
    a    1
    c    3
    d    6
    e   -1
    b    2
    g    8
    dtype: int64
    
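    Pass a dict to create a Series (older pandas sorts the keys, as in the output below; recent versions keep insertion order)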
    In [11]: SD = {"python":100,"java":101,"scala":102}
    In [12]: s3 = Series(SD)
    In [14]: s3
    Out[14]: 
    java      101
    python    100
    scala     102
    dtype: int64
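The three construction paths above can be reproduced in a plain script. This is a minimal sketch, assuming a recent pandas on Python 3.7 or later, where a Series built from a dict keeps the dict's insertion order instead of sorting the keys as in the session output. The variable names simply mirror the session.

```python
import pandas as pd

# From a list: pandas assigns a default integer index 0..n-1.
s1 = pd.Series([1, 3, 6, -1, 2, 8])

# From a list plus explicit labels: values and labels are paired by position.
s2 = pd.Series([1, 3, 6, -1, 2, 8], index=["a", "c", "d", "e", "b", "g"])

# From a dict: keys become the index, values become the data.
s3 = pd.Series({"python": 100, "java": 101, "scala": 102})

print(list(s1.index))
print(s2["e"])
print(list(s3.index))
```

If a sorted index is wanted regardless of the pandas version, calling s3.sort_index() makes the order explicit.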
    
    
    
    Display the data (values) and the index (index)
    In [6]: s1.values
    Out[6]: array([ 1,  3,  6, -1,  2,  8])
    In [7]: s1.index
    Out[7]: RangeIndex(start=0, stop=6, step=1)
    
    
    
    In [17]: s1
    Out[17]: 
    0    1
    1    3
    2    6
    3   -1
    4    2
    5    8
    dtype: int64
    
    Assign custom index labels
    In [18]: s1.index = ["p1","p2","p3","p4","p5","p6"]
    In [19]: s1
    Out[19]: 
    p1    1
    p2    3
    p3    6
    p4   -1
    p5    2
    p6    8
    dtype: int64
    
    
    
    Read and modify a value by its index label
    In [20]: s1['p1']
    Out[20]: 1
    In [21]: s1['p1']=100
    In [22]: s1
    Out[22]: 
    p1    100
    p2      3
    p3      6
    p4     -1
    p5      2
    p6      8
    dtype: int64
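Label-based reads and writes can also be spelled with .loc, which makes it explicit that the key is a label rather than a position. The sketch below uses a shortened, hypothetical three-element Series in place of s1.

```python
import pandas as pd

s = pd.Series([1, 3, 6], index=["p1", "p2", "p3"])

# Read one value by label.
first = s.loc["p1"]

# Assign through a label to update the Series in place.
s.loc["p1"] = 100

# Membership tests look at the index labels, not at the values.
has_p2 = "p2" in s
has_p9 = "p9" in s

print(first, s.loc["p1"], has_p2, has_p9)
```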
    
    Check whether values are missing (NaN)
    In [29]: pd.isnull(s1)
    Out[29]: 
    p1    False
    p2    False
    p3    False
    p4    False
    p5    False
    p6    False
    dtype: bool
    
    In [30]: pd.notnull(s1)
    Out[30]: 
    p1    True
    p2    True
    p3    True
    p4    True
    p5    True
    p6    True
    dtype: bool
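Because s1 holds no missing values, both checks above return uniform results. A small sketch with an actual NaN shows what the masks look like and how they are typically used; the Series here is made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0], index=["p1", "p2", "p3"])

missing = pd.isnull(s)    # True where the value is NaN
present = pd.notnull(s)   # element-wise opposite of the mask above

n_missing = int(missing.sum())   # each True counts as 1
clean = s[present]               # keep only the non-missing entries

print(n_missing)
print(clean)
```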
    
    
    Operations
    In [31]: s2
    Out[31]: 
    a    1
    c    3
    d    6
    e   -1
    b    2
    g    8
    dtype: int64
    In [32]: s2[s2>5]
    Out[32]: 
    d    6
    g    8
    dtype: int64
    
    
    In [33]: s2*10
    Out[33]: 
    a    10
    c    30
    d    60
    e   -10
    b    20
    g    80
    dtype: int64
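The filter s2[s2>5] works in two steps: the comparison first builds a boolean Series, which is then used as a mask. The sketch below separates the steps; the names mask, big and scaled are illustrative only.

```python
import pandas as pd

s2 = pd.Series([1, 3, 6, -1, 2, 8], index=["a", "c", "d", "e", "b", "g"])

mask = s2 > 5       # boolean Series with the same index as s2
big = s2[mask]      # keeps only the entries where the mask is True
scaled = s2 * 10    # arithmetic is applied element by element

print(mask.tolist())
print(big.to_dict())
print(scaled.tolist())
```

Both operations keep the original index labels, so results can still be looked up by label afterwards.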
    
    

    DataFrame

    The main components of a DataFrame are index, columns, and values.

    Pass a list to create a Series; pandas assigns a default integer index (the np.nan entry makes the dtype float64)
    In [34]: s = pd.Series([1,3,5,np.nan,6,8])
    In [35]: s
    Out[35]: 
    0    1.0
    1    3.0
    2    5.0
    3    NaN
    4    6.0
    5    8.0
    dtype: float64
    
    Create a DataFrame from a NumPy array, with a datetime index and column labels
    In [48]: dates = pd.date_range("20170101",periods = 6)
    
    In [49]: df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list("ABCD"))
    
    In [50]: df
    Out[50]: 
                       A         B         C         D
    2017-01-01  0.198724  1.455237 -1.165803 -0.474382
    2017-01-02  0.622154 -0.280253 -0.492515  0.002470
    2017-01-03  1.764839 -1.734531 -0.195002  0.128216
    2017-01-04 -0.520130  1.372930 -2.240510  0.362139
    2017-01-05  1.530835  0.406480 -1.714226 -0.289591
    2017-01-06  0.675166  0.210024 -0.773319 -1.410746
    In [51]: dates
    Out[51]: 
    DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
                   '2017-01-05', '2017-01-06'],
                  dtype='datetime64[ns]', freq='D')
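Since np.random.randn produces different numbers on every run, the values shown above cannot be reproduced exactly. A minimal sketch with a seeded generator gives a repeatable frame of the same shape; the seed value 0 is an arbitrary choice, not something from the original session.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20170101", periods=6)   # six consecutive days

rng = np.random.default_rng(0)                 # arbitrary seed for repeatability
df = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))

print(df.shape)
print(df.index[0], df.index[-1])
```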
    
    d1=DataFrame(np.arange(12).reshape((3,4)),index=['a','b','c'],columns=['a1','a2','a3','a4'])    
    
    
    Other common sources are equal-length lists, dicts, NumPy arrays, and data files
    In [61]: data = {'name':['zxx','lxx','gxx','hxx'],'age':[12,13,14,15],'addr':['JX','JS','BJ','SH']}
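    Convert the dict to a DataFrame: d2 uses the default index, d3 specifies the column order and the index
    (older pandas sorts the columns alphabetically when none are given, as in Out[63])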
    In [62]: d2 = DataFrame(data)
    In [63]: d2
    Out[63]: 
      addr  age name
    0   JX   12  zxx
    1   JS   13  lxx
    2   BJ   14  gxx
    3   SH   15  hxx
    
    In [64]: d3 = DataFrame(data,columns=['name','age','addr'],index=['a','b','c','d']) 
    In [65]: d3
    Out[65]: 
      name  age addr
    a  zxx   12   JX
    b  lxx   13   JS
    c  gxx   14   BJ
    d  hxx   15   SH
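The same dict-based construction as a standalone sketch, assuming a recent pandas, where the default column order follows the dict's insertion order instead of being sorted as in Out[63]. The columns argument reorders the columns and index supplies the row labels.

```python
import pandas as pd

data = {
    "name": ["zxx", "lxx", "gxx", "hxx"],
    "age": [12, 13, 14, 15],
    "addr": ["JX", "JS", "BJ", "SH"],
}

# Default: integer index, columns in dict order.
d2 = pd.DataFrame(data)

# Explicit column order and row labels.
d3 = pd.DataFrame(data, columns=["addr", "name", "age"], index=list("abcd"))

print(list(d2.columns))
print(list(d3.columns))
print(d3.loc["b", "name"])
```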
    
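    df.dtypes    data type of each column
    df.<Tab>     tab completion in IPython lists all attributes and column names
    df.head(2)   first two rows
    df.tail(2)   last two rows
    df.index     the index
    df.columns   the column labels
    df.values    the underlying NumPy data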
    
    
    In [56]: df.head(2)
    Out[56]: 
                       A         B         C         D
    2017-01-01  0.198724  1.455237 -1.165803 -0.474382
    2017-01-02  0.622154 -0.280253 -0.492515  0.002470
    
    In [57]: df.tail(2)
    Out[57]: 
                       A         B         C         D
    2017-01-05  1.530835  0.406480 -1.714226 -0.289591
    2017-01-06  0.675166  0.210024 -0.773319 -1.410746
    
    In [58]: df.index
    Out[58]: 
    DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
                   '2017-01-05', '2017-01-06'],
                  dtype='datetime64[ns]', freq='D')
    
    In [59]: df.columns
    Out[59]: Index([u'A', u'B', u'C', u'D'], dtype='object')
    
    In [60]: df.values
    Out[60]: 
    array([[ 0.19872446,  1.45523672, -1.16580285, -0.47438238],
           [ 0.62215406, -0.28025262, -0.49251531,  0.00247041],
           [ 1.76483913, -1.73453082, -0.19500168,  0.12821624],
           [-0.52013049,  1.37292972, -2.24051045,  0.36213914],
           [ 1.53083459,  0.40647992, -1.71422601, -0.28959076],
           [ 0.67516588,  0.2100239 , -0.77331882, -1.41074624]])
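Of the attributes listed above, dtypes is the only one not demonstrated in the session. The sketch below shows it on a small made-up frame with mixed column types, together with head and tail.

```python
import pandas as pd

# Hypothetical frame with one text, one integer and one float column.
df = pd.DataFrame({
    "name": ["zxx", "lxx", "gxx"],
    "age": [12, 13, 14],
    "score": [90.5, 88.0, 67.0],
})

dtypes = df.dtypes        # one dtype per column, returned as a Series
first_two = df.head(2)    # rows 0 and 1
last_two = df.tail(2)     # rows 1 and 2

print(dtypes)
print(first_two.index.tolist(), last_two.index.tolist())
```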
    
    

    Selecting data

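    In [70]: data={'name':['zhanghua','liuting','gaofei','hedong'],'age':[40,45,50,46],'addr':['jianxi','pudong','beijing','xian']}
    In [71]: d3=DataFrame(data,columns=['name','age','addr'],index=['a','b','c','d'])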
    
    In [72]: d3
    Out[72]: 
           name  age     addr
    a  zhanghua   40   jianxi
    b   liuting   45   pudong
    c    gaofei   50  beijing
    d    hedong   46     xian
    
    Select columns
    In [78]: d3[['name','age']]
    Out[78]: 
           name  age
    a  zhanghua   40
    b   liuting   45
    c    gaofei   50
    d    hedong   46
    
    Select rows (label slice; both endpoints are included)
    In [84]: d3['a':'c']
    Out[84]: 
           name  age     addr
    a  zhanghua   40   jianxi
    b   liuting   45   pudong
    c    gaofei   50  beijing
    
    Select rows (by position; the stop is excluded)
    In [87]: d3[1:3]
    Out[87]: 
          name  age     addr
    b  liuting   45   pudong
    c   gaofei   50  beijing
    
    Filter with a boolean condition
    In [90]: d3[d3['age']>40]
    Out[90]: 
          name  age     addr
    b  liuting   45   pudong
    c   gaofei   50  beijing
    d   hedong   46     xian
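The selections above differ in what they return: a single column name gives a Series, while a list of names gives a DataFrame. A short sketch using the same data as the session; the result names are illustrative.

```python
import pandas as pd

data = {
    "name": ["zhanghua", "liuting", "gaofei", "hedong"],
    "age": [40, 45, 50, 46],
    "addr": ["jianxi", "pudong", "beijing", "xian"],
}
d3 = pd.DataFrame(data, columns=["name", "age", "addr"], index=["a", "b", "c", "d"])

ages = d3["age"]               # one name: a Series
subset = d3[["name", "age"]]   # list of names: a DataFrame
older = d3[d3["age"] > 40]     # boolean mask keeps the matching rows
rows_bc = d3.iloc[1:3]         # positions 1 and 2, stop excluded

print(type(ages).__name__, type(subset).__name__)
print(older.index.tolist(), rows_bc.index.tolist())
```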
    
    
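    obj.ix[rows,[columns]] filters on rows and columns at the same time
    (.ix was deprecated in pandas 0.20 and removed in 1.0; on current versions use .loc for labels and .iloc for positions)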
    In [91]: d3.ix[['a','c'],['name','age']]
    Out[91]: 
           name  age
    a  zhanghua   40
    c    gaofei   50
    
    In [93]: d3.ix['a':'c',['name','age']]
    Out[93]: 
           name  age
    a  zhanghua   40
    b   liuting   45
    c    gaofei   50
    
    In [94]: d3.ix[0:3,['name','age']]
    Out[94]: 
           name  age
    a  zhanghua   40
    b   liuting   45
    c    gaofei   50
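On pandas 1.0 and later the three .ix calls above no longer run. A sketch of equivalent selections with .loc (labels) and .iloc (positions), using the same data, could look like this; the result names are illustrative.

```python
import pandas as pd

data = {
    "name": ["zhanghua", "liuting", "gaofei", "hedong"],
    "age": [40, 45, 50, 46],
    "addr": ["jianxi", "pudong", "beijing", "xian"],
}
d3 = pd.DataFrame(data, columns=["name", "age", "addr"], index=["a", "b", "c", "d"])

# Label-based: a list of row labels and a list of column labels.
by_labels = d3.loc[["a", "c"], ["name", "age"]]

# Label slices include both endpoints, so 'a':'c' returns rows a, b and c.
by_label_slice = d3.loc["a":"c", ["name", "age"]]

# Position-based: 0:3 excludes the stop, as with Python lists.
by_position = d3.iloc[0:3][["name", "age"]]

print(by_labels)
print(by_position.equals(by_label_slice))
```

Keeping label access and positional access in separate accessors removes the ambiguity that .ix had when the index itself contained integers.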
    

    Modifying data

    In [95]: data={'name':['zhanghua','liuting','gaofei','hedong'],'age':[40,45,50,46],'addr':['jianxi','pudong','beijing','xian']}
    
    In [96]: d3=DataFrame(data,columns=['name','age','addr'],index=['a','b','c','d'])
    
    Drop a row (drop returns a new DataFrame; d3 itself is unchanged)
    In [97]: d3.drop('d',axis=0)
    Out[97]: 
           name  age     addr
    a  zhanghua   40   jianxi
    b   liuting   45   pudong
    c    gaofei   50  beijing
    
    Drop a column
    In [99]: d3.drop('age',axis=1)
    Out[99]: 
           name     addr
    a  zhanghua   jianxi
    b   liuting   pudong
    c    gaofei  beijing
    d    hedong     xian
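That drop leaves the original untouched is visible above: row d is still present in Out[99] after being dropped in In [97]. The sketch below makes this explicit and uses the index= and columns= keywords, which are an alternative to axis=0 and axis=1.

```python
import pandas as pd

data = {
    "name": ["zhanghua", "liuting", "gaofei", "hedong"],
    "age": [40, 45, 50, 46],
    "addr": ["jianxi", "pudong", "beijing", "xian"],
}
d3 = pd.DataFrame(data, columns=["name", "age", "addr"], index=["a", "b", "c", "d"])

without_d = d3.drop(index="d")        # same as d3.drop('d', axis=0)
without_age = d3.drop(columns="age")  # same as d3.drop('age', axis=1)

# d3 still has all four rows and three columns.
print(d3.shape, without_d.shape, without_age.shape)
```

To keep the result, assign it back, for example d3 = d3.drop(index="d").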
    
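    Append a row; ignore_index=True is required when appending a dict
    (DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat replaces it)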
    In [103]: d4 = d3.append({'name':'wxx','age':38,'addr':'HN'},ignore_index=True)
    
    In [104]: d4
    Out[104]: 
           name  age     addr
    0  zhanghua   40   jianxi
    1   liuting   45   pudong
    2    gaofei   50  beijing
    3    hedong   46     xian
    4       wxx   38       HN
    
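    Note that '4' below is a string label, not the integer 4, so pandas adds a new row labeled '4'
    instead of updating the existing row 4; d4.ix[4,'age']=100 would have updated it
    In [105]: d4.ix['4','age']=100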
    
    In [106]: d4
    Out[106]: 
           name    age     addr
    0  zhanghua   40.0   jianxi
    1   liuting   45.0   pudong
    2    gaofei   50.0  beijing
    3    hedong   46.0     xian
    4       wxx   38.0       HN
    4       NaN  100.0      NaN
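On pandas 2.0 and later neither append nor .ix is available. A sketch of the same two steps with pd.concat and .loc follows, this time with the integer label 4 so that the existing row is updated rather than a new one created. The name new_row is illustrative.

```python
import pandas as pd

data = {
    "name": ["zhanghua", "liuting", "gaofei", "hedong"],
    "age": [40, 45, 50, 46],
    "addr": ["jianxi", "pudong", "beijing", "xian"],
}
d3 = pd.DataFrame(data, columns=["name", "age", "addr"], index=["a", "b", "c", "d"])

# Wrap the new record in a one-row DataFrame and concatenate.
new_row = pd.DataFrame([{"name": "wxx", "age": 38, "addr": "HN"}])
d4 = pd.concat([d3, new_row], ignore_index=True)   # index becomes 0..4

# Integer label 4 matches the appended row, so the cell is updated in place.
d4.loc[4, "age"] = 100

print(d4)
```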
    
    Change the index
    In [111]: d3.index=['a','b','c','d'] 
    
    In [112]: d3
    Out[112]: 
           name  age     addr
    a  zhanghua   40   jianxi
    b   liuting   45   pudong
    c    gaofei   50  beijing
    d    hedong   46     xian
    

    Summary statistics

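    Common statistical methods
    count       number of non-NA values
    describe    summary statistics for each column
    min, max    minimum and maximum
    sum         sum
    mean        mean
    var         sample variance
    std         sample standard deviation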
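A small worked example of the methods in the table, on made-up scores with one missing value. It illustrates two points: missing values are skipped, and var and std are the sample versions (dividing by n-1); passing ddof=0 gives the population versions instead.

```python
import numpy as np
import pandas as pd

# Hypothetical scores; the NaN is ignored by all of the statistics below.
scores = pd.Series([90.0, 88.0, 67.0, np.nan, 75.0], name="sub_score")

count = scores.count()          # 4 non-NA values
mean = scores.mean()            # (90 + 88 + 67 + 75) / 4
var = scores.var()              # squared deviations sum to 358, divided by n-1 = 3
std = scores.std()              # square root of the sample variance
pop_var = scores.var(ddof=0)    # population variance: 358 / 4

print(count, mean, var, std, pop_var)
```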
    
    
    
    Loading and saving data
    Read a CSV file (or a comma-separated txt file)
    pd.read_csv('wu.csv')
    Read from an HDF5 store
    pd.read_hdf('wu.h5',key='df')
    Write a CSV file
    df.to_csv('wu.csv')
    Write to an HDF5 store
    df.to_hdf('wu.h5',key='df')
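The CSV round trip can be tried without touching the file system by writing to an in-memory buffer. This is only a sketch: the two rows are invented, and index=False is an added choice so that the row index is not written out as an extra column.

```python
import io

import pandas as pd

# Hypothetical data standing in for a real score file.
df = pd.DataFrame({"stud_code": [2015101000, 2015101001], "sub_score": [90, 88]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)   # write header plus rows, no index column
buffer.seek(0)                   # rewind before reading back

restored = pd.read_csv(buffer)

print(buffer.getvalue())
print(restored.equals(df))
```

Without index=False, the index is written as an unnamed first column and shows up as "Unnamed: 0" when the file is read back.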
    
    
    
    In [15]: inputfile = '/home/hadoop/wujiadong/wu1_stud_score.txt'
    In [16]: data = pd.read_csv(inputfile)
    In [40]: df = DataFrame(data)
    In [41]: df.head(3)
    Out[41]: 
        stud_code  sub_code sub_name  sub_tech  sub_score   stat_date
    0  2015101000     10101     数学分析       NaN         90  0000-00-00
    1  2015101000     10102     高等代数       NaN         88  0000-00-00
    2  2015101000     10103     大学物理       NaN         67  0000-00-00
    
    
    In [42]: df.count()
    Out[42]: 
    stud_code    121
    sub_code     121
    sub_name     121
    sub_tech       0
    sub_score    121
    stat_date    121
    dtype: int64
    
    
    In [43]: df['sub_score'].describe()
    Out[43]: 
    count    121.000000
    mean      78.561983
    std       12.338215
    min       48.000000
    25%       69.000000
    50%       80.000000
    75%       89.000000
    max       98.000000
    Name: sub_score, dtype: float64
    
    Standard deviation of the students' scores
    In [44]: df['sub_score'].std()
    Out[44]: 12.338214729032906
    
    
    
    

    Applying functions and mapping

    In [45]: d1 = DataFrame(np.arange(12).reshape((3,4)),index=['a','b','c'],columns=['a1','a2','a3','a4'])
    
    In [46]: d1
    Out[46]: 
       a1  a2  a3  a4
    a   0   1   2   3
    b   4   5   6   7
    c   8   9  10  11
    
    Process every element (applymap was deprecated in pandas 2.1 in favor of DataFrame.map)
    In [47]: d1.applymap(lambda x:x+2)
    Out[47]: 
       a1  a2  a3  a4
    a   2   3   4   5
    b   6   7   8   9
    c  10  11  12  13
    
    Process one row (on current pandas, d1.iloc[1] replaces d1.ix[1])
    In [54]: d1.ix[1].map(lambda x:x+2)
    Out[54]: 
    a1    6
    a2    7
    a3    8
    a4    9
    Name: b, dtype: int64
    
    Process column by column (axis=0 passes each column to the function)
    In [62]: d1.apply(lambda x:x.max()-x.min(),axis=0)
    Out[62]: 
    a1    8
    a2    8
    a3    8
    a4    8
    dtype: int64
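The three levels of processing above, rewritten as a sketch that avoids the removed .ix accessor and the deprecated applymap. For simple arithmetic such as adding 2, plain vectorized operators give the same result as an element-wise function; the row-wise apply with axis=1 is added for contrast.

```python
import numpy as np
import pandas as pd

d1 = pd.DataFrame(
    np.arange(12).reshape((3, 4)),
    index=["a", "b", "c"],
    columns=["a1", "a2", "a3", "a4"],
)

# Element level: vectorized equivalent of applymap(lambda x: x + 2).
plus_two = d1 + 2

# One row: select it by position, then map a function over its values.
row_b = d1.iloc[1].map(lambda x: x + 2)

# Column level: each column is passed to the function as a Series.
col_range = d1.apply(lambda col: col.max() - col.min(), axis=0)

# Row level: axis=1 passes each row instead.
row_range = d1.apply(lambda row: row.max() - row.min(), axis=1)

print(col_range.tolist(), row_range.tolist())
```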
    
    


  • Original post: https://www.cnblogs.com/wujiadong2014/p/6545229.html