• Chapter 5: Getting Started with pandas [1] Series and DataFrame


    1. Series and DataFrame

    from pandas import Series,DataFrame
    import pandas as pd
    

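    A Series is a one-dimensional array of data plus an associated set of labels (its index). The simplest Series is built from a single sequence of values; with no index given, pandas assigns the default integer labels 0 to N-1: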

    obj = Series([4,7,-5,3])
    obj
    0    4
    1    7
    2   -5
    3    3
    dtype: int64

    The values and index attributes of a Series:

    obj.values
    array([ 4,  7, -5,  3], dtype=int64)

    obj.index
    RangeIndex(start=0, stop=4, step=1)

    You can also supply your own labels for the index:

    obj2 = Series([4,7,-5,3],index=['d','b','a','c'])
    obj2
    d    4
    b    7
    a   -5
    c    3
    dtype: int64

    obj2.index
    Index(['d', 'b', 'a', 'c'], dtype='object')

    Look up values by index label:

    obj2['a']
    -5

    obj2[['c','a','d']]  # pass a list of labels (note the nested brackets)
    c    3
    a   -5
    d    4
    dtype: int64
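
The two lookup styles above can also be written with the explicit .loc indexer, while .iloc gives positional access regardless of the labels. A minimal sketch, assuming a reasonably recent pandas; the variable names single, subset, and first are illustrative only:

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

single = obj2.loc['a']               # scalar lookup by label
subset = obj2.loc[['c', 'a', 'd']]   # list of labels: new Series in the requested order
first = obj2.iloc[0]                 # positional lookup, independent of the labels

print(single, first)
print(subset)
```

Using .loc and .iloc makes it explicit whether a key is meant as a label or as a position, which matters once the index itself contains integers.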

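    NumPy-style array operations (boolean filtering, scalar multiplication, math functions) preserve the link between index labels and values: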

    obj2
    d    4
    b    7
    a   -5
    c    3
    dtype: int64

    # Filter with a boolean array
    obj2 > 0
    d     True
    b     True
    a    False
    c     True
    dtype: bool

    obj2[obj2 > 0]
    d    4
    b    7
    c    3
    dtype: int64

    # Scalar multiplication
    obj2 * 2
    d     8
    b    14
    a   -10
    c     6
    dtype: int64

    # Math functions
    import numpy as np
    np.exp(obj2)
    d      54.598150
    b    1096.633158
    a       0.006738
    c      20.085537
    dtype: float64
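
As a small self-contained check of the claim above, the sketch below reruns the filtering and arithmetic and confirms that the labels travel with the values. np.sqrt stands in for np.exp purely to keep the numbers readable; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

positive = obj2[obj2 > 0]    # boolean mask keeps the labels of the surviving rows
doubled = obj2 * 2           # element-wise arithmetic keeps every label
roots = np.sqrt(positive)    # NumPy ufuncs return a Series with the same index

print(list(positive.index))
print(doubled['a'])
print(roots['d'])
```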

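    A Series can also be thought of as a fixed-length, ordered dict, since it maps index labels to data values. The in operator therefore tests the index labels: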

    'b' in obj2
    True

    'e' in obj2
    False
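
To make the dict analogy concrete, here is a minimal sketch. Note that membership is checked against the labels, not the values; Series.get and Series.to_dict are the dict-like helpers used, and the variable names are illustrative.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

label_hit = 'b' in obj2          # True: 'b' is an index label
value_as_key = 7 in obj2         # False: 7 is a value, not a label
value_hit = 7 in obj2.values     # True: test the underlying array for values
fallback = obj2.get('e', 0)      # like dict.get: default when the label is absent
as_dict = obj2.to_dict()         # back to a plain Python dict

print(label_hit, value_as_key, value_hit, fallback)
print(as_dict)
```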

    If the data already lives in a Python dict, you can create a Series directly from it:

    # Convert a dict to a Series; the dict keys become the Series index
    sdata = {'Ohio':35000,'Texas':71000,'Oregon':16000,'Utah':5000}
    obj3 = Series(sdata)
    obj3
    Ohio      35000
    Texas     71000
    Oregon    16000
    Utah       5000
    dtype: int64

    Passing an explicit index selects and orders the entries:

    states = ['California','Ohio','Oregon','Texas']
    obj4 = Series(sdata,index=states)
    obj4
    California        NaN
    Ohio          35000.0
    Oregon        16000.0
    Texas         71000.0
    dtype: float64

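    The three values in sdata whose keys match labels in states are found and placed in the corresponding positions. 'California' has no matching key in sdata, so its value is NaN, which marks missing data; 'Utah' is not in states, so it is left out. The dtype becomes float64 because NaN is a floating-point value.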
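
One way to read this behaviour: building the Series from the dict and then reindexing gives the same result. The sketch below is only an illustration of that equivalence, not a description of how pandas implements it internally.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']

obj4 = pd.Series(sdata, index=states)      # select and order by the given labels
same = pd.Series(sdata).reindex(states)    # build first, then conform to the labels

print(obj4)
print(obj4.equals(same))
```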

    pandas provides the isnull and notnull functions for detecting missing data; both are also available as Series methods:

    pd.isnull(obj4)
    California     True
    Ohio          False
    Oregon        False
    Texas         False
    dtype: bool

    pd.notnull(obj4)
    California    False
    Ohio           True
    Oregon         True
    Texas          True
    dtype: bool

    obj4.isnull()
    California     True
    Ohio          False
    Oregon        False
    Texas         False
    dtype: bool
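
The boolean masks returned by these functions are usually put to work rather than just displayed. A minimal sketch, using isna and notna (aliases of isnull and notnull in current pandas); n_missing and present are illustrative names:

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj4 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

n_missing = int(obj4.isna().sum())   # True counts as 1, so the sum is the number of NaNs
present = obj4[obj4.notna()]         # keep only the rows that have data

print(n_missing)
print(present)
```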

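    The most important feature of Series: arithmetic operations automatically align data by index label.

    The result's index is the union of the two indexes. Values whose labels appear in both operands are combined; a label that appears in only one operand, or whose value is already NaN, yields NaN: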

    obj3
    Ohio      35000
    Texas     71000
    Oregon    16000
    Utah       5000
    dtype: int64

    obj4
    California        NaN
    Ohio          35000.0
    Oregon        16000.0
    Texas         71000.0
    dtype: float64

    obj3 + obj4
    California         NaN
    Ohio           70000.0
    Oregon         32000.0
    Texas         142000.0
    Utah               NaN
    dtype: float64
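
If NaN for non-overlapping labels is not what you want, Series.add accepts a fill_value that is substituted when a value is missing on one side only. A minimal sketch with the same data; total and filled are illustrative names:

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

total = obj3 + obj4                     # union of labels, NaN where either side is missing
filled = obj3.add(obj4, fill_value=0)   # a value missing on one side is treated as 0

print(total)
print(filled)
```

In filled, 'Utah' keeps its value from obj3, while 'California' is still NaN because it is missing on both sides.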

    Both the Series object itself and its index have a name attribute:

    obj4.name = 'population'
    obj4.index.name = 'state'
    obj4
    state
    California        NaN
    Ohio          35000.0
    Oregon        16000.0
    Texas         71000.0
    Name: population, dtype: float64

    The index can be modified in place by assignment:

    obj.index = ['Bob','Steve','Jeff','Ryan']
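
The assignment only relabels the existing values, so the new index must have the same length as the Series. A minimal sketch of both the successful case and the length mismatch; the error variable is illustrative:

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']   # same length: values keep their positions

error = None
try:
    obj.index = ['Bob', 'Steve']               # wrong length
except ValueError as exc:
    error = exc                                # pandas reports a length mismatch

print(obj)
print(type(error).__name__)
```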
    

    DataFrame:

    Create a DataFrame from a dict of equal-length lists:

    data = {'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
           'year':[2000,2001,2002,2001,2002],
           'pop':[1.5,1.7,3.6,2.4,2.9]}
    frame = DataFrame(data)
    frame

    The dict becomes a DataFrame: its keys become the column names, and a default integer index is added automatically:

        state  year  pop
    0    Ohio  2000  1.5
    1    Ohio  2001  1.7
    2    Ohio  2002  3.6
    3  Nevada  2001  2.4
    4  Nevada  2002  2.9

    You can specify the column order with the columns argument:

    DataFrame(data,columns=['year','state','pop'])
       year   state  pop
    0  2000    Ohio  1.5
    1  2001    Ohio  1.7
    2  2002    Ohio  3.6
    3  2001  Nevada  2.4
    4  2002  Nevada  2.9

    If a requested column is not present in the data, it is filled with missing (NaN) values:

    frame2 = DataFrame(data,columns=['year','state','pop','debt'],
                      index=['one','two','three','four','five'])
    frame2
           year   state  pop debt
    one    2000    Ohio  1.5  NaN
    two    2001    Ohio  1.7  NaN
    three  2002    Ohio  3.6  NaN
    four   2001  Nevada  2.4  NaN
    five   2002  Nevada  2.9  NaN

    frame2.columns
    Index(['year', 'state', 'pop', 'debt'], dtype='object')
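
A quick self-contained check that the extra column really is all missing; the name all_missing is illustrative:

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

frame2 = pd.DataFrame(data,
                      columns=['year', 'state', 'pop', 'debt'],   # 'debt' is not a key of data
                      index=['one', 'two', 'three', 'four', 'five'])

all_missing = frame2['debt'].isna().all()   # every entry of the new column is NaN
print(all_missing)
```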

    A column can be retrieved as a Series using dict-like notation:

    frame2['state']
    one        Ohio
    two        Ohio
    three      Ohio
    four     Nevada
    five     Nevada
    Name: state, dtype: object

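    A column can also be retrieved as a Series by attribute access. This works only when the column name is a valid Python identifier that does not collide with a DataFrame attribute or method; for example, frame2.pop is the DataFrame.pop method, not the 'pop' column: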

    frame2.year
    one      2000
    two      2001
    three    2002
    four     2001
    five     2002
    Name: year, dtype: int64

    Retrieve a row by label with loc:

    frame2.loc['three']
    year     2002
    state    Ohio
    pop       3.6
    debt      NaN
    Name: three, dtype: object
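
Rows can be fetched by position as well as by label. A minimal sketch contrasting loc (labels) with iloc (integer positions); the variable names are illustrative:

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, index=['one', 'two', 'three', 'four', 'five'])

row_by_label = frame2.loc['three']   # look up the row labelled 'three'
row_by_pos = frame2.iloc[2]          # the third row, counting from 0

print(row_by_label)
print(row_by_pos.name)
```

Both calls return the same row as a Series whose index is the DataFrame's column names and whose name is the row label.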

    Columns can be modified by assignment, either with a scalar or with an array of values:

    frame2['debt'] = 16.5
    frame2
           year   state  pop  debt
    one    2000    Ohio  1.5  16.5
    two    2001    Ohio  1.7  16.5
    three  2002    Ohio  3.6  16.5
    four   2001  Nevada  2.4  16.5
    five   2002  Nevada  2.9  16.5

    frame2['debt'] = np.arange(5.)
    frame2
           year   state  pop  debt
    one    2000    Ohio  1.5   0.0
    two    2001    Ohio  1.7   1.0
    three  2002    Ohio  3.6   2.0
    four   2001  Nevada  2.4   3.0
    five   2002  Nevada  2.9   4.0

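    When you assign a list or array to a column, its length must match the length of the DataFrame. When you assign a Series, its labels are aligned to the DataFrame's index, and rows with no matching label are filled with NaN: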

    val = Series([-1.2,-1.5,-1.7],index=['two','four','five'])
    frame2['debt'] = val
    frame2
           year   state  pop  debt
    one    2000    Ohio  1.5   NaN
    two    2001    Ohio  1.7  -1.2
    three  2002    Ohio  3.6   NaN
    four   2001  Nevada  2.4  -1.5
    five   2002  Nevada  2.9  -1.7
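
The two rules above can be checked side by side. In this minimal sketch the column name 'short' and the flag length_error are hypothetical, introduced only for the demonstration:

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, index=['one', 'two', 'three', 'four', 'five'])

# A Series is aligned by label; rows without a matching label get NaN.
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val

# A plain list carries no labels, so its length has to match the frame.
length_error = False
try:
    frame2['short'] = [1, 2, 3]
except ValueError:
    length_error = True

print(frame2)
print(length_error)
```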

    Assigning to a column that does not exist creates a new column:

    frame2['eastern'] = frame2.state == 'Ohio'
    frame2
           year   state  pop  debt  eastern
    one    2000    Ohio  1.5   NaN     True
    two    2001    Ohio  1.7  -1.2     True
    three  2002    Ohio  3.6   NaN     True
    four   2001  Nevada  2.4  -1.5    False
    five   2002  Nevada  2.9  -1.7    False

    Delete a column with del:

    del frame2['eastern']
    frame2.columns
    Index(['year', 'state', 'pop', 'debt'], dtype='object')
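
del modifies the DataFrame in place. When the original should stay untouched, DataFrame.drop returns a new frame instead. A minimal sketch comparing the two; trimmed and still_there are illustrative names:

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, index=['one', 'two', 'three', 'four', 'five'])
frame2['eastern'] = frame2['state'] == 'Ohio'

trimmed = frame2.drop(columns=['eastern'])     # new frame without the column
still_there = 'eastern' in frame2.columns      # the original is unchanged so far

del frame2['eastern']                          # removes the column in place

print(list(trimmed.columns))
print(still_there, 'eastern' in frame2.columns)
```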

    Nested dict (a dict of dicts):

    pop = {'Nevada':{2001:2.4,2002:2.9},
          'Ohio':{2000:1.5,2001:1.7,2002:3.6}}
    frame3 = DataFrame(pop)
    frame3
          Nevada  Ohio
    2000     NaN   1.5
    2001     2.4   1.7
    2002     2.9   3.6

    The keys of the outer dict become the columns, and the keys of the inner dicts become the row index.

    Transpose:

    frame3.T

    Specify the index explicitly:

    DataFrame(pop,index=[2001,2002,2003])
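
Since the two calls above are shown without output, here is a minimal sketch that runs them and prints the results. The row order of a DataFrame built from a dict of dicts may differ between pandas versions, so the sketch calls sort_index() explicitly; transposed and explicit are illustrative names.

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

frame3 = pd.DataFrame(pop).sort_index()   # outer keys -> columns, inner keys -> rows
transposed = frame3.T                     # swap rows and columns
explicit = pd.DataFrame(pop, index=[2001, 2002, 2003])   # only these row labels are used

print(frame3)
print(transposed)
print(explicit)
```

With an explicit index, 2000 is dropped because it was not requested, and 2003 appears as a row of NaN because neither inner dict has that key.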

    A dict of Series is handled in the same way:

    pdata = {'Ohio':frame3['Ohio'][:-1],
            'Nevada':frame3['Nevada'][:2]}
    DataFrame(pdata)
          Ohio  Nevada
    2000   1.5     NaN
    2001   1.7     2.4

    For reference, the full frame3['Ohio'] column is:

    frame3['Ohio']
    2000    1.5
    2001    1.7
    2002    3.6
    Name: Ohio, dtype: float64

    Set the name attributes of the DataFrame's index and columns; they are then shown in the output:

    frame3.index.name = 'year'
    frame3.columns.name = 'state'
    frame3
    state  Nevada  Ohio
    year
    2000      NaN   1.5
    2001      2.4   1.7
    2002      2.9   3.6

    The values attribute returns the data as a two-dimensional NumPy array:

    frame3.values
    array([[nan, 1.5],
           [2.4, 1.7],
           [2.9, 3.6]])
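
The dtype of the returned array depends on the columns. A minimal sketch, using DataFrame.to_numpy (the method counterpart of the values attribute in current pandas): an all-float frame gives a float64 array, while a frame mixing strings and floats falls back to dtype object. The small mixed frame is made up for the demonstration.

```python
import numpy as np
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop).sort_index()

arr = frame3.to_numpy()                  # 3 rows x 2 columns of floats
n_nan = int(np.isnan(arr).sum())         # the single missing Nevada value

mixed = pd.DataFrame({'state': ['Ohio', 'Nevada'],
                      'pop': [1.5, 2.4]}).to_numpy()   # strings + floats

print(arr.dtype, arr.shape, n_nan)
print(mixed.dtype)
```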

    2019.10.30 14:38:04

  • Original article: https://www.cnblogs.com/direwolf22/p/11763030.html