• Pandas Beginner Study Notes 2


    2 Basic Functionality

    These notes cover only basic functionality; more advanced topics are left to be explored when they are needed.

    2.1 Reindexing

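    reindex is an important pandas method. The sessions below assume import numpy as np and from pandas import Series, DataFrame. For example: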

    In [101]: obj = Series([4,7,-5,3.4],index=['c','a','b','d'])
    
    In [102]: obj
    Out[102]:
    c    4.0
    a    7.0
    b   -5.0
    d    3.4
    dtype: float64
    
    In [103]: obj2 = obj.reindex(['a','b','c','d','e'])
    
    In [104]: obj2
    Out[104]:
    a    7.0
    b   -5.0
    c    4.0
    d    3.4
    e    NaN
    dtype: float64
    
    # The fill value for missing entries can be customized
    
    In [105]: obj.reindex(['a','b','c','d','e'],fill_value=0)
    Out[105]:
    a    7.0
    b   -5.0
    c    4.0
    d    3.4
    e    0.0  # missing value filled in
    dtype: float64
    
    
    

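    The method option of reindex controls how missing values are filled:

    Argument            Description
    ffill or pad        Fill forward (carry the previous value forward)
    bfill or backfill   Fill backward (carry the next value backward)

    In [106]: obj3 = Series(['blue','purple','yellow'],index=[0,2,4])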
    
    # Forward fill
    In [107]: obj3.reindex(range(6),method='ffill')
    Out[107]:
    0      blue
    1      blue
    2    purple
    3    purple
    4    yellow
    5    yellow
    dtype: object
    
    # Backward fill
    In [109]: obj3.reindex(range(6),method='bfill')
    Out[109]:
    0      blue
    1    purple
    2    purple
    3    yellow
    4    yellow
    5       NaN
    dtype: object
    

    For a DataFrame, reindex can change the rows, the columns, or both.

    In [111]: frame = DataFrame(np.arange(9).reshape(3,3), index=['a','b','c'],columns=['Ohio','Texas','California'])
    
    In [112]: frame
    Out[112]:
       Ohio  Texas  California
    a     0      1           2
    b     3      4           5
    c     6      7           8
    
    In [113]: frame2 = frame.reindex(['a','b','c','d'])  # rows are reindexed by default
    
    In [115]: frame2
    Out[115]:
       Ohio  Texas  California
    a   0.0    1.0         2.0
    b   3.0    4.0         5.0
    c   6.0    7.0         8.0
    d   NaN    NaN         NaN
    
    In [116]: states = ['Texas','Utah','California']
    
    In [117]: frame.reindex(columns=states)  # reindex the columns
    Out[117]:
       Texas  Utah  California
    a      1   NaN           2
    b      4   NaN           5
    c      7   NaN           8
    
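    # Reindex rows and columns in one call, with interpolation.
    # In the pandas version used here, the fill is applied only along axis 0, i.e. down the rows.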
    In [118]: frame.reindex(index=['a','b','c','d'],method='ffill',columns=states)
    Out[118]:
       Texas  Utah  California
    a      1   NaN           2
    b      4   NaN           5
    c      7   NaN           8
    d      7   NaN           8
    
    # ix is more concise (note: ix was deprecated in pandas 0.20 and removed in 1.0).
    In [119]: frame.ix[['a','b','c','d'],states]
    Out[119]:
       Texas  Utah  California
    a    1.0   NaN         2.0
    b    4.0   NaN         5.0
    c    7.0   NaN         8.0
    d    NaN   NaN         NaN
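
    Since ix no longer exists on pandas 1.0 and later, the last call fails there. A minimal sketch of an equivalent, assuming a recent pandas: passing both index and columns to reindex gives the same frame, while loc is the label-based selector for labels that already exist.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=['a', 'b', 'c'],
                     columns=['Ohio', 'Texas', 'California'])
states = ['Texas', 'Utah', 'California']

# reindex accepts labels that do not exist yet and fills them with NaN
result = frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)
print(result)

# loc is label-based too, but is meant for labels that are already present
existing = frame.loc[['a', 'c'], ['Texas', 'California']]
print(existing)
```

    The split is deliberate: reindex is for conforming data to a new set of labels, and loc is for selecting from the labels already there.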
    

    [Table: parameters of the reindex function]

    2.2 Dropping Entries from an Axis

    To drop entries, pass a single index label or a list of labels. The drop method returns a new object with the specified entries removed.

    In [120]: obj = Series(np.arange(5.),index=['a','b','c','d','e'])
    
    In [121]: new_obj = obj.drop('c')
    
    In [122]: new_obj
    Out[122]:
    a    0.0
    b    1.0
    d    3.0
    e    4.0
    dtype: float64
    
    In [124]: obj.drop(['d','c'])
    Out[124]:
    a    0.0
    b    1.0
    e    4.0
    dtype: float64
    
    

    For a DataFrame, index values can be deleted from either axis.

    In [125]: data = DataFrame(np.arange(16).reshape(4,4),index=['Ohio','Colorado',
         ...: 'Utah','New York'],columns=['one','two','three','four'])
    
    In [126]: data
    Out[126]:
              one  two  three  four
    Ohio        0    1      2     3
    Colorado    4    5      6     7
    Utah        8    9     10    11
    New York   12   13     14    15
    
    In [127]: data.drop(['Colorado','Ohio'])
    Out[127]:
              one  two  three  four
    Utah        8    9     10    11
    New York   12   13     14    15
    
    In [128]: data.drop(['two',],axis=1)
    Out[128]:
              one  three  four
    Ohio        0      2     3
    Colorado    4      6     7
    Utah        8     10    11
    New York   12     14    15
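
    Because drop returns a new object, the original is left unchanged. A short sketch of that behaviour, which also uses the index= and columns= keywords that recent pandas versions accept as an alternative to axis:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Drop one row and one column in a single call
trimmed = data.drop(index=['Ohio'], columns=['two'])
print(trimmed)

# The original frame still has all four rows and four columns
print(data.shape)
```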
    

    2.3 Indexing, Selection, and Filtering

    In [129]: obj = Series(np.arange(4.),index=['a','b','c','d'])
    
    In [130]: obj['a']  # index by label
    Out[130]: 0.0
    
    In [131]: obj[0]    # index by integer position
    Out[131]: 0.0
    
    In [132]: obj[1]
    Out[132]: 1.0
    
    In [133]: obj
    Out[133]:
    a    0.0
    b    1.0
    c    2.0
    d    3.0
    dtype: float64
    
    In [134]: obj[1:2]  # slice by position
    Out[134]:
    b    1.0
    dtype: float64
    
    In [135]: obj[1:3]
    Out[135]:
    b    1.0
    c    2.0
    dtype: float64
    
    In [136]: obj[obj<2]  # filter with a boolean condition
    Out[136]:
    a    0.0
    b    1.0
    dtype: float64
    
    In [137]: obj['b':'c']  # slice by label; note that both endpoints are included
    Out[137]:
    b    1.0
    c    2.0
    dtype: float64
    
    In [138]: obj['b':'c'] = 100  # assignment
    
    In [139]: obj
    Out[139]:
    a      0.0
    b    100.0
    c    100.0
    d      3.0
    dtype: float64
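
    Plain square brackets mix label and position lookups, which can be ambiguous. A minimal sketch of the explicit alternatives: loc always selects by label and iloc always selects by position. Recent pandas releases warn when an integer such as obj[0] is used as a position on a labelled Series, so iloc is the safer spelling there.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = obj.loc['b':'c']      # label slice: both endpoints included
by_position = obj.iloc[1:3]      # position slice: end excluded

print(by_label)
print(by_position)
print(obj.iloc[0])               # first element, by position
```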
    
    

    For a DataFrame, plain [] indexing retrieves one or more columns.
    Column name(s): selects columns
    Integer slice or boolean values: selects rows

    In [140]: data
    Out[140]:
              one  two  three  four
    Ohio        0    1      2     3
    Colorado    4    5      6     7
    Utah        8    9     10    11
    New York   12   13     14    15
    
    In [141]: data['two']  # get the column named 'two'
    Out[141]:
    Ohio         1
    Colorado     5
    Utah         9
    New York    13
    Name: two, dtype: int32
    
    In [142]: data[['two','one']]  # get columns in the requested order
    Out[142]:
              two  one
    Ohio        1    0
    Colorado    5    4
    Utah        9    8
    New York   13   12
    
    In [143]: data[:2]  # get the first two rows; an integer slice selects rows
    Out[143]:
              one  two  three  four
    Ohio        0    1      2     3
    Colorado    4    5      6     7
    
    In [144]: data[data['three']>5]  # rows where column 'three' is greater than 5
    Out[144]:
              one  two  three  four
    Colorado    4    5      6     7
    Utah        8    9     10    11
    New York   12   13     14    15
    

    Syntactically, a DataFrame is quite similar to an ndarray in this respect.

    In [146]: data < 5
    Out[146]:
                one    two  three   four
    Ohio       True   True   True   True
    Colorado   True  False  False  False
    Utah      False  False  False  False
    New York  False  False  False  False
    
    In [147]: data[data<5] = 0
    
    In [148]: data
    Out[148]:
              one  two  three  four
    Ohio        0    0      0     0
    Colorado    0    5      6     7
    Utah        8    9     10    11
    New York   12   13     14    15
    
    

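    The ix indexing field:
    It selects a subset of the rows and columns of a DataFrame using NumPy-like notation plus axis labels.
    The ix syntax is also concise. (ix was deprecated in pandas 0.20 and removed in 1.0; loc and iloc cover the same cases.)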

    In [150]: data.ix['Colorado',['two','three']]
    Out[150]:
    two      5
    three    6
    Name: Colorado, dtype: int32
    
    In [151]: data.ix[['Colorado','Utah'],[3,0,1]]
    Out[151]:
              four  one  two
    Colorado     7    0    5
    Utah        11    8    9
    
    In [152]: data.ix[2]
    Out[152]:
    one       8
    two       9
    three    10
    four     11
    Name: Utah, dtype: int32
    
    In [153]: data.ix[:'Utah','two']
    Out[153]:
    Ohio        0
    Colorado    5
    Utah        9
    Name: two, dtype: int32
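
    On pandas 1.0 and later, the four ix selections above can be written with loc (labels) and iloc (positions). A sketch, rebuilding data in the state it has at this point in the session (values below 5 already set to 0):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data[data < 5] = 0

# One row label with a list of column labels
r1 = data.loc['Colorado', ['two', 'three']]

# Row labels with column positions: translate the positions to labels first
r2 = data.loc[['Colorado', 'Utah'], data.columns[[3, 0, 1]]]

# A single row by position
r3 = data.iloc[2]

# Label slice on the rows (end label included), one column
r4 = data.loc[:'Utah', 'two']

print(r1, r2, r3, r4, sep='\n\n')
```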
    

    [Table: DataFrame indexing options]

    2.4 Arithmetic and Data Alignment

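    The index of an arithmetic result is the union of the operands' indexes. Where a label is missing from either operand, the result is NaN.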

    In [4]: s1 = Series([-2,-3,5,-1],index=list('abcd'))
    
    In [5]: s2 = Series([9,2,5,1,5],index=list('badef'))
    
    In [6]: s1 + s2
    Out[6]:
    a    0.0
    b    6.0
    c    NaN
    d    4.0
    e    NaN
    f    NaN
    dtype: float64
    
    

    The same holds for DataFrame, where alignment happens on both the rows and the columns.
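
    A small sketch of that two-axis alignment; the two frames and their labels are made up for illustration:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9.).reshape(3, 3),
                   columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape(4, 3),
                   columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Only cells whose row label AND column label exist in both frames get a value
total = df1 + df2
print(total)
```

    Only the cells at rows Ohio and Texas, columns b and d, are present in both inputs; every other cell of the result is NaN.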

    Fill values in arithmetic methods

    In [7]: df1 = DataFrame(np.arange(12.).reshape(3,4),columns=list('abcd'))
    
    In [8]: df2 = DataFrame(np.arange(20.).reshape(4,5),columns=list('abcde'))
    
    In [9]: df1
    Out[9]:
         a    b     c     d
    0  0.0  1.0   2.0   3.0
    1  4.0  5.0   6.0   7.0
    2  8.0  9.0  10.0  11.0
    
    In [10]: df2
    Out[10]:
          a     b     c     d     e
    0   0.0   1.0   2.0   3.0   4.0
    1   5.0   6.0   7.0   8.0   9.0
    2  10.0  11.0  12.0  13.0  14.0
    3  15.0  16.0  17.0  18.0  19.0
    
    In [11]: df1 + df2  # no fill value
    Out[11]:
          a     b     c     d   e
    0   0.0   2.0   4.0   6.0 NaN
    1   9.0  11.0  13.0  15.0 NaN
    2  18.0  20.0  22.0  24.0 NaN
    3   NaN   NaN   NaN   NaN NaN
    
    In [12]: df1.add(df2, fill_value=0)  # fill with 0
    Out[12]:
          a     b     c     d     e
    0   0.0   2.0   4.0   6.0   4.0
    1   9.0  11.0  13.0  15.0   9.0
    2  18.0  20.0  22.0  24.0  14.0
    3  15.0  16.0  17.0  18.0  19.0
    
    In [13]: df1.reindex(columns=df2.columns, method='ffill')
    Out[13]:
         a    b     c     d   e
    0  0.0  1.0   2.0   3.0 NaN
    1  4.0  5.0   6.0   7.0 NaN
    2  8.0  9.0  10.0  11.0 NaN
    
    In [14]: df1.reindex(columns=df2.columns, fill_value=0)  # a fill value can also be given when reindexing
    Out[14]:
         a    b     c     d  e
    0  0.0  1.0   2.0   3.0  0
    1  4.0  5.0   6.0   7.0  0
    2  8.0  9.0  10.0  11.0  0
    
    

    The available arithmetic methods are:

    • add: addition
    • sub: subtraction
    • div: division
    • mul: multiplication
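
    Each of these accepts fill_value in the same way as add. A brief sketch that rebuilds the df1 and df2 frames from above; the fill values chosen here (0 for subtraction, 1 for multiplication and division) are only illustrative:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape(3, 4), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape(4, 5), columns=list('abcde'))

diff = df1.sub(df2, fill_value=0)    # df1 - df2, missing cells treated as 0
prod = df1.mul(df2, fill_value=1)    # df1 * df2, missing cells treated as 1
ratio = df2.div(df1, fill_value=1)   # df2 / df1, missing cells treated as 1

print(diff)
print(prod)
print(ratio)
```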

    Operations between DataFrame and Series

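    These operations use broadcasting: by default the index of the Series is matched against the columns of the DataFrame, and the operation is then applied to every row.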

    In [15]: frame = DataFrame(np.arange(12.).reshape(4,3),columns=list('bde'),index
        ...: =['Utah','Ohio','Texas','Oregon'])
    
    In [16]: series = frame.ix[0]  # get the first row
    
    In [17]: frame
    Out[17]:
              b     d     e
    Utah    0.0   1.0   2.0
    Ohio    3.0   4.0   5.0
    Texas   6.0   7.0   8.0
    Oregon  9.0  10.0  11.0
    
    In [18]: series
    Out[18]:
    b    0.0
    d    1.0
    e    2.0
    Name: Utah, dtype: float64
    
    In [19]: frame - series   # automatically broadcast to every row
    Out[19]:
              b    d    e
    Utah    0.0  0.0  0.0
    Ohio    3.0  3.0  3.0
    Texas   6.0  6.0  6.0
    Oregon  9.0  9.0  9.0
    
    In [20]: series2 = Series(np.arange(3),index=list('bef'))
    
    In [21]: series2
    Out[21]:
    b    0
    e    1
    f    2
    dtype: int64
    
    In [22]: frame + series2  # columns missing from either side become NaN
    Out[22]:
              b   d     e   f
    Utah    0.0 NaN   3.0 NaN
    Ohio    3.0 NaN   6.0 NaN
    Texas   6.0 NaN   9.0 NaN
    Oregon  9.0 NaN  12.0 NaN
    
    In [23]: series3 = frame['d']   # get a column
    
    In [24]: frame.sub(series3, axis=0) # subtract a column; axis=0 matches on the row index
    Out[24]:
              b    d    e
    Utah   -1.0  0.0  1.0
    Ohio   -1.0  0.0  1.0
    Texas  -1.0  0.0  1.0
    Oregon -1.0  0.0  1.0
    
    

    2.5 Function Application and Mapping

    NumPy universal functions (ufuncs) also work on pandas Series and DataFrame objects.

    In [31]: np.abs(frame)
    Out[31]:
              b     d     e
    Utah    0.0   1.0   2.0
    Ohio    3.0   4.0   5.0
    Texas   6.0   7.0   8.0
    Oregon  9.0  10.0  11.0
    
    In [32]: np.max(frame)
    Out[32]:
    b     9.0
    d    10.0
    e    11.0
    dtype: float64
    
    

    DataFrame has an apply method that accepts a custom function.

    In [33]: f = lambda x: np.max(x) - np.min(x)
    
    In [34]: frame.apply(f)
    Out[34]:
    b    9.0
    d    9.0
    e    9.0
    dtype: float64
    
    In [35]: frame
    Out[35]:
              b     d     e
    Utah    0.0   1.0   2.0
    Ohio    3.0   4.0   5.0
    Texas   6.0   7.0   8.0
    Oregon  9.0  10.0  11.0
    
    In [36]: f = lambda x : np.exp2(x)
    
    In [37]: frame.apply(f)
    Out[37]:
                b       d       e
    Utah      1.0     2.0     4.0
    Ohio      8.0    16.0    32.0
    Texas    64.0   128.0   256.0
    Oregon  512.0  1024.0  2048.0
    
    

    Many common operations are already implemented as DataFrame methods, so there is no need to write them by hand with apply.
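
    For instance, the range computed with apply above can be obtained directly from the built-in max and min reductions. A minimal sketch:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Column-wise range, written with apply and with built-in reductions
via_apply = frame.apply(lambda x: x.max() - x.min())
via_methods = frame.max() - frame.min()

# The same reductions along axis=1 give one value per row
row_range = frame.max(axis=1) - frame.min(axis=1)

print(via_methods)
print(row_range)
```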

    In [38]: f = lambda x: Series([np.max(x),np.min(x)],index=['max','min'])
    
    In [39]: frame.apply(f)
    Out[39]:
           b     d     e
    max  9.0  10.0  11.0
    min  0.0   1.0   2.0
    
    # If f is an element-wise function, use applymap
    In [40]: f = lambda x : '%.2f' % x
    
    In [41]: frame.applymap(f)
    Out[41]:
               b      d      e
    Utah    0.00   1.00   2.00
    Ohio    3.00   4.00   5.00
    Texas   6.00   7.00   8.00
    Oregon  9.00  10.00  11.00
    
    # Likewise, use map for a Series; it is the counterpart of DataFrame's applymap.
    In [43]: series
    Out[43]:
    b    0.0
    d    1.0
    e    2.0
    Name: Utah, dtype: float64
    
    In [44]: series.map(f)
    Out[44]:
    b    0.00
    d    1.00
    e    2.00
    Name: Utah, dtype: object
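
    In pandas 2.1 and later, DataFrame.applymap is deprecated in favour of DataFrame.map, which does the same element-wise job. A version-tolerant sketch; the hasattr check is just one possible way to support both spellings:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

def two_decimals(x):
    """Format a single number as a string with two decimal places."""
    return '%.2f' % x

# Prefer DataFrame.map when it exists, otherwise fall back to applymap
elementwise = frame.map if hasattr(frame, 'map') else frame.applymap
formatted = elementwise(two_decimals)
print(formatted)
```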
    
    
    

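    2.6 Sorting and Ranking

    Sorting

    Sorting can be done with:

    • the sort_index method: sorts by index
    • the sort_values method (formerly order): sorts by value; DataFrame.sort_values takes the column name(s) through the by parameter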
    In [45]: obj = Series(range(4),index=list('dbca'))
    
    In [46]: obj
    Out[46]:
    d    0
    b    1
    c    2
    a    3
    dtype: int64
    
    In [47]: obj.sort_index()
    Out[47]:
    a    3
    b    1
    c    2
    d    0
    dtype: int64
    
    In [50]: frame
    Out[50]:
              b     d     e
    Utah    0.0   1.0   2.0
    Ohio    3.0   4.0   5.0
    Texas   6.0   7.0   8.0
    Oregon  9.0  10.0  11.0
    
    In [51]: frame.sort_index()
    Out[51]:
              b     d     e
    Ohio    3.0   4.0   5.0
    Oregon  9.0  10.0  11.0
    Texas   6.0   7.0   8.0
    Utah    0.0   1.0   2.0
    
    In [52]: frame.sort_index(axis=0)
    Out[52]:
              b     d     e
    Ohio    3.0   4.0   5.0
    Oregon  9.0  10.0  11.0
    Texas   6.0   7.0   8.0
    Utah    0.0   1.0   2.0
    
    In [53]: frame.sort_index(axis=1)
    Out[53]:
              b     d     e
    Utah    0.0   1.0   2.0
    Ohio    3.0   4.0   5.0
    Texas   6.0   7.0   8.0
    Oregon  9.0  10.0  11.0
    
    In [54]: frame.sort_index(axis=1, ascending=False) # descending order
    Out[54]:
               e     d    b
    Utah     2.0   1.0  0.0
    Ohio     5.0   4.0  3.0
    Texas    8.0   7.0  6.0
    Oregon  11.0  10.0  9.0
    
    

    Sorting by value:

    In [55]: s1 = Series([3,-2,-7,4])
    
    In [56]: s1.order()
    /Users/yangfeilong/anaconda/bin/ipython:1: FutureWarning: order is deprecated, use sort_values(...)
      #!/bin/bash /Users/yangfeilong/anaconda/bin/python.app
    Out[56]:
    2   -7
    1   -2
    0    3
    3    4
    dtype: int64
    
    
    In [58]: frame.sort_index(by='b')
    /Users/yangfeilong/anaconda/bin/ipython:1: FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...)
      #!/bin/bash /Users/yangfeilong/anaconda/bin/python.app
    Out[58]:
              b     d     e
    Utah    0.0   1.0   2.0
    Ohio    3.0   4.0   5.0
    Texas   6.0   7.0   8.0
    Oregon  9.0  10.0  11.0
    
    In [59]: frame.sort_values(by='b')
    Out[59]:
              b     d     e
    Utah    0.0   1.0   2.0
    Ohio    3.0   4.0   5.0
    Texas   6.0   7.0   8.0
    Oregon  9.0  10.0  11.0
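
    The warnings above point to sort_values as the replacement for both order and sort_index(by=...). A short sketch of its use on a Series and on a DataFrame; the small frame here is invented for illustration:

```python
import pandas as pd

s1 = pd.Series([3, -2, -7, 4])
ascending = s1.sort_values()
descending = s1.sort_values(ascending=False)
print(ascending)
print(descending)

df = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Sort by column 'a' first, then by 'b' within equal values of 'a'
by_two = df.sort_values(by=['a', 'b'])
print(by_two)
```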
    
    
    

    Ranking

    By default, the rank method breaks ties by assigning each tied value the mean rank of its group:

    In [60]: s1 = Series([7,-5,7,4,2,0,4])
    
    In [61]: s1.rank()  # the values at index 0 and 2 are both 7 and would rank 6 and 7, so each gets the mean, 6.5
    Out[61]:
    0    6.5
    1    1.0
    2    6.5
    3    4.5
    4    3.0
    5    2.0
    6    4.5
    dtype: float64
    
    

    There are, of course, several other ways to break such ties.

    In [62]: s1.rank(method='first')  # ties are ranked in the order they appear in the data
    Out[62]:
    0    6.0
    1    1.0
    2    7.0
    3    4.0
    4    3.0
    5    2.0
    6    5.0
    dtype: float64
    
    In [63]: s1.rank(ascending=False, method='max')  # descending; ties get the maximum rank of their group
    Out[63]:
    0    2.0
    1    7.0
    2    2.0
    3    4.0
    4    5.0
    5    6.0
    6    4.0
    dtype: float64
    
    

    For a DataFrame, rank can compute ranks down the columns or across the rows, selected with the axis parameter.
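
    A minimal sketch of ranking along each axis; the frame is invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({'b': [4.3, 7, -3, 2],
                   'a': [0, 1, 0, 1],
                   'c': [-2, 5, 8, -2.5]})

col_ranks = df.rank()          # axis=0 (default): rank within each column
row_ranks = df.rank(axis=1)    # axis=1: rank within each row

print(col_ranks)
print(row_ranks)
```

    Column a contains ties, so its ranks under the default average method are 1.5, 3.5, 1.5, 3.5.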

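    2.7 Axis Indexes with Duplicate Values

    In all the examples so far the index labels have been unique, and many pandas functions (reindex, for example) require unique labels.
    Uniqueness is not mandatory, however.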

    In [64]: obj  = Series(range(5),index=list('aabbc'))
    
    In [65]: obj
    Out[65]:
    a    0
    a    1
    b    2
    b    3
    c    4
    dtype: int64
    
    In [67]: obj.index.is_unique
    Out[67]: False
    
    In [68]: obj['a']
    Out[68]:
    a    0
    a    1
    dtype: int64
    
    In [69]: obj['c']
    Out[69]: 4
    

    The same is true for DataFrame.

    In [70]: df =DataFrame(np.random.randn(4,3),index=list('aabb'))
    
    In [79]: df.ix['a']
    Out[79]:
              0         1         2
    a  1.099692 -0.491098  0.625690
    a -0.816857  1.025018  0.558494
    
    In [80]: df.reindex(['b','a'])  # a DataFrame with duplicate index labels cannot be reindexed
    ...
    ValueError: cannot reindex from a duplicate axis
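
    One way around this error, sketched under the assumption that keeping only the first row for each label is acceptable, is to drop the duplicated labels with Index.duplicated before reindexing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12.).reshape(4, 3), index=list('aabb'))
print(df.index.is_unique)

# duplicated() marks every repeat of a label; ~ keeps the first occurrence only
first_only = df[~df.index.duplicated(keep='first')]

# With unique labels, reindex works again
reordered = first_only.reindex(['b', 'a'])
print(reordered)
```

    Whether to keep the first row, the last row, or an aggregate of the duplicates depends on the data; keep='first' is only an example.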
    

    To be continued...

  • Original article: https://www.cnblogs.com/felo/p/6359895.html