Pandas Cookbook -- 06 Index Alignment


    Index Alignment

    Based on SeanCheney's translation on Jianshu (简书); I adjusted the formatting and restructured the table of contents to better suit my own reading, so it is easier to look things up later.

    import pandas as pd
    import numpy as np
    

    1 Index Methods

    college = pd.read_csv('data/college.csv')
    college.iloc[:5,:5]
    
    INSTNM CITY STABBR HBCU MENONLY
    0 Alabama A & M University Normal AL 1.0 0.0
    1 University of Alabama at Birmingham Birmingham AL 0.0 0.0
    2 Amridge University Montgomery AL 0.0 0.0
    3 University of Alabama in Huntsville Huntsville AL 0.0 0.0
    4 Alabama State University Montgomery AL 1.0 0.0

    1.1 Basic Methods

    Extract all the column labels

    columns = college.columns
    
    columns
    
    Index(['INSTNM', 'CITY', 'STABBR', 'HBCU', 'MENONLY', 'WOMENONLY', 'RELAFFIL',
           'SATVRMID', 'SATMTMID', 'DISTANCEONLY', 'UGDS', 'UGDS_WHITE',
           'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN', 'UGDS_NHPI',
           'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN', 'PPTUG_EF', 'CURROPER', 'PCTPELL',
           'PCTFLOAN', 'UG25ABV', 'MD_EARN_WNE_P10', 'GRAD_DEBT_MDN_SUPP'],
          dtype='object')
    

    Use the values attribute to access the underlying NumPy array

    columns.values
    
    array(['INSTNM', 'CITY', 'STABBR', 'HBCU', 'MENONLY', 'WOMENONLY',
           'RELAFFIL', 'SATVRMID', 'SATMTMID', 'DISTANCEONLY', 'UGDS',
           'UGDS_WHITE', 'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN',
           'UGDS_NHPI', 'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN', 'PPTUG_EF',
           'CURROPER', 'PCTPELL', 'PCTFLOAN', 'UG25ABV', 'MD_EARN_WNE_P10',
           'GRAD_DEBT_MDN_SUPP'], dtype=object)
    

    Select the 6th value of the index (position 5)

    columns[5]
    
    'WOMENONLY'
    

    Select the 2nd, 9th, and 11th values (positions 1, 8, and 10)

    columns[[1,8,10]]
    
    Index(['CITY', 'SATMTMID', 'UGDS'], dtype='object')
    

    Slice with negative positions

    columns[-7:-4]
    
    Index(['PPTUG_EF', 'CURROPER', 'PCTPELL'], dtype='object')
    

    Index objects share many methods with Series and DataFrames

    columns.min(), columns.max(), columns.isnull().sum()
    
    ('CITY', 'WOMENONLY', 0)
    

    Index objects can be transformed with element-wise string operations, which return a copy. The Index itself is immutable; attempting to modify it in place raises an error

    columns + '_A'
    
    Index(['INSTNM_A', 'CITY_A', 'STABBR_A', 'HBCU_A', 'MENONLY_A', 'WOMENONLY_A',
           'RELAFFIL_A', 'SATVRMID_A', 'SATMTMID_A', 'DISTANCEONLY_A', 'UGDS_A',
           'UGDS_WHITE_A', 'UGDS_BLACK_A', 'UGDS_HISP_A', 'UGDS_ASIAN_A',
           'UGDS_AIAN_A', 'UGDS_NHPI_A', 'UGDS_2MOR_A', 'UGDS_NRA_A',
           'UGDS_UNKN_A', 'PPTUG_EF_A', 'CURROPER_A', 'PCTPELL_A', 'PCTFLOAN_A',
           'UG25ABV_A', 'MD_EARN_WNE_P10_A', 'GRAD_DEBT_MDN_SUPP_A'],
          dtype='object')
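A minimal sketch of the immutability claim above, using a small toy Index rather than the college columns:

```python
import pandas as pd

idx = pd.Index(['a', 'b', 'c'])

# Element-wise string operations return a new Index; the original is untouched
new_idx = idx + '_A'
print(list(new_idx))  # ['a_A', 'b_A', 'c_A']

# In-place modification is not allowed: Index objects are immutable
try:
    idx[0] = 'z'
except TypeError as e:
    print('TypeError:', e)
```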
    

    Comparison operators applied to an Index produce a boolean array

    columns > 'G'
    
    array([ True, False,  True,  True,  True,  True,  True,  True,  True,
           False,  True,  True,  True,  True,  True,  True,  True,  True,
            True,  True,  True, False,  True,  True,  True,  True,  True])
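Such a boolean array can be passed straight to .loc as a column mask. A small illustrative sketch with a toy DataFrame (the column names are borrowed from the college dataset; the values are made up):

```python
import pandas as pd

df = pd.DataFrame({'INSTNM': ['x'], 'CITY': ['y'], 'STABBR': ['z'], 'HBCU': [1.0]})

# Compare every column label against 'G'; strings compare lexicographically
mask = df.columns > 'G'

# Use the mask with .loc to keep only columns whose name sorts after 'G'
print(df.loc[:, mask].columns.tolist())  # ['INSTNM', 'STABBR', 'HBCU']
```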
    

    1.2 Set Methods

    Index objects support set operations: union, intersection, difference, and symmetric difference. (Older pandas versions also accepted the |, &, and ^ operators for these set operations, as shown below; in recent versions those operators are element-wise logical operations, so the named methods are preferable.)

    c1 = columns[:4]
    c2 = columns[2:5]
    

    1.2.1 Union

    c1.union(c2)
    
    Index(['CITY', 'HBCU', 'INSTNM', 'MENONLY', 'STABBR'], dtype='object')
    
    c1 | c2
    
    Index(['CITY', 'HBCU', 'INSTNM', 'MENONLY', 'STABBR'], dtype='object')
    

    1.2.2 Symmetric Difference

    c1.symmetric_difference(c2)
    
    Index(['CITY', 'INSTNM', 'MENONLY'], dtype='object')
    
    c1 ^ c2
    
    Index(['CITY', 'INSTNM', 'MENONLY'], dtype='object')
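The other two set operations mentioned above, intersection and difference, work the same way. A quick sketch with the same c1/c2 slices, with the column names copied from the output earlier:

```python
import pandas as pd

c1 = pd.Index(['INSTNM', 'CITY', 'STABBR', 'HBCU'])   # columns[:4]
c2 = pd.Index(['STABBR', 'HBCU', 'MENONLY'])          # columns[2:5]

# Intersection: labels present in both indexes (STABBR, HBCU)
print(c1.intersection(c2))

# Difference: labels in c1 but not in c2 (CITY, INSTNM)
print(c1.difference(c2))
```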
    

    2 Cartesian Products and Index Explosion

    Create two Series with different indexes that share some labels

    s1 = pd.Series(index=list('aaab'), data=np.arange(4))
    s2 = pd.Series(index=list('cababb'), data=np.arange(6))
    

    2.1 Producing a Cartesian Product

    Unless the two indexes contain exactly the same elements in exactly the same order, a Cartesian product is produced

    s1 + s2
    
    a    1.0
    a    3.0
    a    2.0
    a    4.0
    a    3.0
    a    5.0
    b    5.0
    b    7.0
    b    8.0
    c    NaN
    dtype: float64
    

    2.2 Avoiding a Cartesian Product

    When the indexes are identical, the data aligns by index. In the example below, the two Series are exactly the same, and the result keeps its integer dtype

    s1 = pd.Series(index=list('aaabb'), data=np.arange(5))
    s2 = pd.Series(index=list('aaabb'), data=np.arange(5))
    s1 + s2
    
    a    0
    a    2
    a    4
    b    6
    b    8
    dtype: int64
    

    If the indexes contain the same elements but in a different order, a Cartesian product is still produced

     s1 = pd.Series(index=list('aaabb'), data=np.arange(5))
     s2 = pd.Series(index=list('bbaaa'), data=np.arange(5))
     s1 + s2
    
    a    2
    a    3
    a    4
    a    3
    a    4
    a    5
    a    4
    a    5
    a    6
    b    3
    b    4
    b    4
    b    5
    dtype: int64
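The length of such a result can be predicted: for each label, the number of output rows is the product of its counts in the two indexes. A small sketch verifying this for the Series above:

```python
import pandas as pd
import numpy as np

s1 = pd.Series(index=list('aaabb'), data=np.arange(5))
s2 = pd.Series(index=list('bbaaa'), data=np.arange(5))

# Count each label in both indexes, multiply the aligned counts, and sum
c1 = s1.index.value_counts()
c2 = s2.index.value_counts()
expected = (c1 * c2).sum()   # a: 3*3 + b: 2*2 = 13

print(expected, len(s1 + s2))  # 13 13
```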
    

    2.2.1 Index Explosion

    Read the employee dataset, setting RACE as the row index

    employee = pd.read_csv('data/employee.csv', index_col='RACE')
    

    2.2.1.1 Data Preparation

    The copy method -- copy the data rather than create a reference

    salary1 = employee['BASE_SALARY']
    salary2 = employee['BASE_SALARY']
    salary1 is salary2
    
    True
    

    The result is True, showing that both names point to the same object, so modifying one would also change the other. To get an entirely new object, use the copy method. (Whether repeated column access returns the same cached object can depend on the pandas version.)

    salary1 = employee['BASE_SALARY'].copy()
    salary2 = employee['BASE_SALARY'].copy()
    salary1 is salary2
    
    False
    

    2.2.1.2 Indexes in Different Order

    salary1 = salary1.sort_index()
    
    salary_add1 = salary1 + salary2
    
    salary_add1.shape
    
    (1175424,)
    

    2.2.1.3 Identical Indexes

    salary_add2 = salary1 + salary1
    

    2.2.1.4 Comparing the Results

    len(salary1), len(salary2), len(salary_add1), len(salary_add2)
    
    (2000, 2000, 1175424, 2000)
    

    Comparing the lengths of the results: the aligned sum stays at 2,000 rows, while the misaligned one exploded to over 1.17 million

    2.2.1.5 Verifying the Result

    Because the Cartesian product pairs up identical index labels, the result length can be verified by summing the squares of the label counts

    index_vc = salary2.index.value_counts(dropna=False)
    
    index_vc
    
    Black or African American            700
    White                                665
    Hispanic/Latino                      480
    Asian/Pacific Islander               107
    NaN                                   35
    American Indian or Alaskan Native     11
    Others                                 2
    Name: RACE, dtype: int64
    
    index_vc.pow(2).sum()
    
    1175424
    

    3 Filling Values When Indexes Differ

    3.1 Series with Different Indexes

    Read the three baseball datasets, setting playerID as the row index

    baseball_14 = pd.read_csv('data/baseball14.csv', index_col='playerID')
    baseball_15 = pd.read_csv('data/baseball15.csv', index_col='playerID')
    baseball_16 = pd.read_csv('data/baseball16.csv', index_col='playerID')
    baseball_14.iloc[:5,:]
    
    yearID stint teamID lgID G AB R H 2B 3B ... RBI SB CS BB SO IBB HBP SH SF GIDP
    playerID
    altuvjo01 2014 1 HOU AL 158 660 85 225 47 3 ... 59.0 56.0 9.0 36 53.0 7.0 5.0 1.0 5.0 20.0
    cartech02 2014 1 HOU AL 145 507 68 115 21 1 ... 88.0 5.0 2.0 56 182.0 6.0 5.0 0.0 4.0 12.0
    castrja01 2014 1 HOU AL 126 465 43 103 21 2 ... 56.0 1.0 0.0 34 151.0 1.0 9.0 1.0 3.0 11.0
    corpoca01 2014 1 HOU AL 55 170 22 40 6 0 ... 19.0 0.0 0.0 14 37.0 0.0 3.0 1.0 2.0 3.0
    dominma01 2014 1 HOU AL 157 564 51 121 17 0 ... 57.0 0.0 1.0 29 125.0 2.0 5.0 2.0 7.0 23.0

    5 rows × 21 columns

    3.1.1 Index Analysis

    Use the Index method difference to find the labels that are in baseball_14 but not in baseball_15 or baseball_16

    baseball_14.index.difference(baseball_15.index)
    
    Index(['corpoca01', 'dominma01', 'fowlede01', 'grossro01', 'guzmaje01',
           'hoeslj01', 'krausma01', 'preslal01', 'singljo02'],
          dtype='object', name='playerID')
    
    baseball_14.index.difference(baseball_16.index)
    
    Index(['cartech02', 'corpoca01', 'dominma01', 'fowlede01', 'grossro01',
           'guzmaje01', 'hoeslj01', 'krausma01', 'preslal01', 'singljo02',
           'villajo01'],
          dtype='object', name='playerID')
    

    3.1.2 Data Analysis

    Find each player's hits over the past three seasons; the H column holds this data

    hits_14 = baseball_14['H']
    hits_15 = baseball_15['H']
    hits_16 = baseball_16['H']
    

    Add the hits_14 and hits_15 Series

    (hits_14 + hits_15).head()
    
    playerID
    altuvjo01    425.0
    cartech02    193.0
    castrja01    174.0
    congeha01      NaN
    corpoca01      NaN
    Name: H, dtype: float64
    

    3.1.3 Missing Values

    3.1.3.1 Repairable Missing Values

    congeha01 and corpoca01 each have hits recorded in one of the two seasons, yet the sums above came out missing. Use the add method with the fill_value parameter to avoid producing missing values

    hits_14.add(hits_15, fill_value=0).head()
    
    playerID
    altuvjo01    425.0
    cartech02    193.0
    castrja01    174.0
    congeha01     46.0
    corpoca01     40.0
    Name: H, dtype: float64
    

    Then add the 2016 data as well

    hits_total = hits_14.add(hits_15, fill_value=0).add(hits_16, fill_value=0)
    
    hits_total.head()
    
    playerID
    altuvjo01    641.0
    bregmal01     53.0
    cartech02    193.0
    castrja01    243.0
    congeha01     46.0
    Name: H, dtype: float64
    

    3.1.3.2 Unrepairable Missing Values

    If an element is missing in both Series, the sum stays missing even with fill_value

    s = pd.Series(index=['a', 'b', 'c', 'd'], data=[np.nan, 3, np.nan, 1])
    
    s1 = pd.Series(index=['a', 'b', 'c'], data=[np.nan, 6, 10])
    
    s.add(s1, fill_value=5)
    
    a     NaN
    b     9.0
    c    15.0
    d     6.0
    dtype: float64
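If a fully non-missing result is wanted, one option (a sketch, not the book's recipe) is to fill the NaN values inside each Series first, and only then add with fill_value:

```python
import pandas as pd
import numpy as np

s = pd.Series(index=['a', 'b', 'c', 'd'], data=[np.nan, 3, np.nan, 1])
s1 = pd.Series(index=['a', 'b', 'c'], data=[np.nan, 6, 10])

# fillna handles NaN *values*; fill_value then handles missing *labels*
result = s.fillna(0).add(s1.fillna(0), fill_value=0)
print(result['a'], result['d'])  # 0.0 1.0
```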
    

    3.2 DataFrames with Different Indexes

    Select some columns from baseball_14

    df_14 = baseball_14[['G','AB', 'R', 'H']]
    df_14.head()
    
    G AB R H
    playerID
    altuvjo01 158 660 85 225
    cartech02 145 507 68 115
    castrja01 126 465 43 103
    corpoca01 55 170 22 40
    dominma01 157 564 51 121

    Then select some columns from baseball_15, some shared with df_14 and some not

    df_15 = baseball_15[['AB', 'R', 'H', 'HR']]
    df_15.head()
    
    AB R H HR
    playerID
    altuvjo01 638 86 200 15
    cartech02 391 50 78 24
    castrja01 337 38 71 11
    congeha01 201 25 46 11
    correca01 387 52 108 22

    Adding the two produces a missing value wherever a row or column fails to align. The highlight_null method of the style attribute highlights the missing values

    (df_14 + df_15).head(10).style.highlight_null('yellow')
    
    AB G H HR R
    playerID
    altuvjo01 1298 nan 425 nan 171
    cartech02 898 nan 193 nan 118
    castrja01 802 nan 174 nan 81
    congeha01 nan nan nan nan nan
    corpoca01 nan nan nan nan nan
    correca01 nan nan nan nan nan
    dominma01 nan nan nan nan nan
    fowlede01 nan nan nan nan nan
    gattiev01 nan nan nan nan nan
    gomezca01 nan nan nan nan nan

    Even with fill_value=0, some values are still missing, because some row-column combinations simply do not exist in the input data

    df_14.add(df_15, fill_value=0).head(10).style.highlight_null('yellow')
    
    AB G H HR R
    playerID
    altuvjo01 1298 158 425 15 171
    cartech02 898 145 193 24 118
    castrja01 802 126 174 11 81
    congeha01 201 nan 46 11 25
    corpoca01 170 55 40 nan 22
    correca01 387 nan 108 22 52
    dominma01 564 157 121 nan 51
    fowlede01 434 116 120 nan 61
    gattiev01 566 nan 139 27 66
    gomezca01 149 nan 36 4 19

    4 Appending Columns from a Different DataFrame

    Read the employee data and select the 'DEPARTMENT' and 'BASE_SALARY' columns

    employee = pd.read_csv('data/employee.csv')
    dept_sal = employee[['DEPARTMENT', 'BASE_SALARY']]
    

    4.1 Adding a Series with a Unique Index

    Sort by DEPARTMENT, and within each department by BASE_SALARY in descending order

    dept_sal = dept_sal.sort_values(['DEPARTMENT', 'BASE_SALARY'],ascending=[True, False])
    

    Use drop_duplicates to keep the first row of each department

    max_dept_sal = dept_sal.drop_duplicates(subset='DEPARTMENT')
    max_dept_sal.head()
    
    DEPARTMENT BASE_SALARY
    1494 Admn. & Regulatory Affairs 140416.0
    149 City Controller's Office 64251.0
    236 City Council 100000.0
    647 Convention and Entertainment 38397.0
    1500 Dept of Neighborhoods (DON) 89221.0

    Set DEPARTMENT as the row index

    max_dept_sal = max_dept_sal.set_index('DEPARTMENT')
    employee = employee.set_index('DEPARTMENT')
    

    Now that the row indexes contain matching values, a new column can be added to the employee DataFrame

    employee['MAX_DEPT_SALARY'] = max_dept_sal['BASE_SALARY']
    
    employee.head()
    
    UNIQUE_ID POSITION_TITLE BASE_SALARY RACE EMPLOYMENT_TYPE GENDER EMPLOYMENT_STATUS HIRE_DATE JOB_DATE MAX_DEPT_SALARY
    DEPARTMENT
    Municipal Courts Department 0 ASSISTANT DIRECTOR (EX LVL) 121862.0 Hispanic/Latino Full Time Female Active 2006-06-12 2012-10-13 121862.0
    Library 1 LIBRARY ASSISTANT 26125.0 Hispanic/Latino Full Time Female Active 2000-07-19 2010-09-18 107763.0
    Houston Police Department-HPD 2 POLICE OFFICER 45279.0 White Full Time Male Active 2015-02-03 2015-02-03 199596.0
    Houston Fire Department (HFD) 3 ENGINEER/OPERATOR 63166.0 White Full Time Male Active 1982-02-08 1991-05-25 210588.0
    General Services Department 4 ELECTRICIAN 56347.0 White Full Time Male Active 1989-06-19 1994-10-22 89194.0

    Use query to check whether any BASE_SALARY exceeds its department's MAX_DEPT_SALARY (the result is empty, as expected)

    employee.query('BASE_SALARY > MAX_DEPT_SALARY')
    
    UNIQUE_ID POSITION_TITLE BASE_SALARY RACE EMPLOYMENT_TYPE GENDER EMPLOYMENT_STATUS HIRE_DATE JOB_DATE MAX_DEPT_SALARY
    DEPARTMENT
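The same per-department maximum column can also be built without touching the index at all, using groupby with transform. This is an alternative sketch on toy data, not the book's recipe; the column names follow the employee dataset:

```python
import pandas as pd

employee = pd.DataFrame({
    'DEPARTMENT': ['Library', 'Library', 'Police', 'Police'],
    'BASE_SALARY': [26125.0, 107763.0, 45279.0, 199596.0],
})

# transform('max') broadcasts each group's maximum back to every row of the group
employee['MAX_DEPT_SALARY'] = (employee.groupby('DEPARTMENT')['BASE_SALARY']
                                       .transform('max'))
print(employee['MAX_DEPT_SALARY'].tolist())
# [107763.0, 107763.0, 199596.0, 199596.0]
```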

    4.2 Adding a Series with a Duplicated Index

    Use sample to draw 10 random rows from dept_sal, without replacement

    random_salary = dept_sal.sample(n=10).set_index('DEPARTMENT')
    
    random_salary
    
    BASE_SALARY
    DEPARTMENT
    Houston Police Department-HPD 86534.0
    Fleet Management Department 49088.0
    Houston Airport System (HAS) 76097.0
    Houston Police Department-HPD 66614.0
    Library 59748.0
    Houston Airport System (HAS) 29286.0
    Houston Police Department-HPD 61643.0
    Houston Fire Department (HFD) 52644.0
    Solid Waste Management 36712.0
    Houston Police Department-HPD NaN

    random_salary has duplicate index labels, so a single label in the employee DataFrame now corresponds to multiple labels in random_salary.

    Adding a RANDOM_SALARY column therefore raises an error

    # employee['RANDOM_SALARY'] = random_salary['BASE_SALARY']
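A minimal reproduction of that error with toy data (the label 'HPD' in the target matches two rows in the source, so pandas cannot decide which value to use):

```python
import pandas as pd

target = pd.DataFrame({'x': [1, 2]}, index=['HPD', 'Library'])
source = pd.Series([10, 20, 30], index=['HPD', 'HPD', 'Library'])

# Assigning a Series whose index has duplicates matching target labels fails
try:
    target['y'] = source
except ValueError as e:
    print('ValueError:', e)
```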
    

    4.3 Adding a Series That Covers Part of the Index

    Take the first three rows of max_dept_sal['BASE_SALARY'] and assign them to employee['MAX_SALARY2']

    employee['MAX_SALARY2'] = max_dept_sal['BASE_SALARY'].head(3)
    
    employee.MAX_SALARY2.value_counts()
    
    140416.0    29
    100000.0    11
    64251.0      5
    Name: MAX_SALARY2, dtype: int64
    

    Because only three departments were filled in, all other departments are missing in the result

    employee.MAX_SALARY2.isnull().mean()
    
    0.9775
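The same partial-coverage alignment can be sketched with toy data: labels not covered by the assigned Series simply become NaN.

```python
import pandas as pd

df = pd.DataFrame({'v': [1, 2, 3]}, index=['a', 'b', 'c'])

# Assigning a Series that covers only part of the index aligns on labels;
# uncovered labels become NaN
df['w'] = pd.Series({'a': 10.0})
print(df['w'].tolist())  # [10.0, nan, nan]
```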
    Original article: https://www.cnblogs.com/shiyushiyu/p/9745648.html