• Pandas DataFrame: Selecting and Filtering Data


    A DataFrame method that filters rows with a callable would allow chaining operations like:

    pd.read_csv('imdb.txt')
      .sort(columns='year')
      .filter(lambda x: x['year']>1990)   # <---this is missing in Pandas
      .to_csv('filtered.csv')

    For current alternatives see:

    http://stackoverflow.com/questions/11869910/pandas-filter-rows-of-dataframe-with-operator-chaining

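    You can do it like this:

    import pandas as pd
    # DataFrame.sort(columns=...) was removed from pandas; sort_values(by=...) replaces it.
    df = pd.read_csv('imdb.txt').sort_values(by='year')
    df[df['year']>1990].to_csv('filtered.csv')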
    

      

    # However, pandas could potentially support something like this (proposed syntax, not valid Python):

    pd.read_csv('imdb.txt')
      .sort(columns='year')
      .[lambda x: x['year']>1990]
      .to_csv('filtered.csv')

    or, with a callable passed to .loc, which current pandas versions accept:

    (pd.read_csv('imdb.txt')
       .sort_values(by='year')
       .loc[lambda x: x['year']>1990]
       .to_csv('filtered.csv'))
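
    As a runnable illustration of the chaining idea, the sketch below uses a small in-memory table in place of imdb.txt. The `title` column and all values are made up for illustration. It shows two ways of filtering inside a chain: a callable passed to `.loc`, and `DataFrame.query`.

```python
import pandas as pd

# Hypothetical stand-in for the contents of 'imdb.txt'.
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "year": [1995, 1984, 2003, 1990],
})

# Filter inside the chain with a callable indexer:
# the lambda receives the DataFrame produced by the previous step.
after_1990 = (
    movies
    .sort_values(by="year")
    .loc[lambda x: x["year"] > 1990]
)

# The same filter written with DataFrame.query.
after_1990_q = movies.sort_values(by="year").query("year > 1990")

print(after_1990)
```

    Both forms keep the whole pipeline in one expression, so no intermediate variable is needed before calling something like to_csv.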

      

    Source: https://yangjin795.github.io/pandas_df_selection.html

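    Pandas (the Python Data Analysis Library) is a Python library built on NumPy and designed for data analysis. It provides many tools and methods that make working with large amounts of data in Python efficient and convenient.

    This article covers the methods and tools Pandas provides for filtering and selecting data in a DataFrame. The raw data used throughout is as follows: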

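    import numpy as np
    import pandas as pd

    # The index is reconstructed from the output below; the values are random, so yours will differ.
    dates = pd.date_range('2017-04-01', periods=6)
    df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))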
    
        Out[9]: 
                         A         B         C         D
        2017-04-01  0.522241  0.495106 -0.268194 -0.035003
        2017-04-02  2.104572 -0.977768 -0.139632 -0.735926
        2017-04-03  0.480507  1.215048  1.313314 -0.072320
        2017-04-04  1.700309  0.287588 -0.012103  0.525291
        2017-04-05  0.526615 -0.417645  0.405853 -0.835213
        2017-04-06  1.143858 -0.326720  1.425379  0.531037
    

    Selecting

    Selecting with []

    Select one or several columns:

    df['A']
    Out:
        2017-04-01    0.522241
        2017-04-02    2.104572
        2017-04-03    0.480507
        2017-04-04    1.700309
        2017-04-05    0.526615
        2017-04-06    1.143858
    
    df[['A','B']]
    Out:
                           A         B
        2017-04-01  0.522241  0.495106
        2017-04-02  2.104572 -0.977768
        2017-04-03  0.480507  1.215048
        2017-04-04  1.700309  0.287588
        2017-04-05  0.526615 -0.417645
        2017-04-06  1.143858 -0.326720
    

    Select one or several rows:

    df['2017-04-01':'2017-04-01']
    Out:
                           A         B         C         D
        2017-04-01  0.522241  0.495106 -0.268194 -0.035003
    
    df['2017-04-01':'2017-04-03']
    Out:
                           A         B         C         D
        2017-04-01  0.522241  0.495106 -0.268194 -0.035003
        2017-04-02  2.104572 -0.977768 -0.139632 -0.735926
        2017-04-03  0.480507  1.215048  1.313314 -0.072320
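
    The return type of [] depends on what is passed in. The sketch below makes this explicit; it assumes a deterministic table built with np.arange instead of the random values above, so the results are reproducible.

```python
import numpy as np
import pandas as pd

# Illustrative data: same shape and index as the article's df, but fixed values.
dates = pd.date_range("2017-04-01", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

one_col = df["A"]                     # a single label returns a Series
two_cols = df[["A", "B"]]             # a list of labels returns a DataFrame
rows = df["2017-04-01":"2017-04-03"]  # a slice selects rows; both ends are included

print(type(one_col).__name__, two_cols.shape, rows.shape)
```

    Note that only a slice selects rows here; a single string such as df['A'] is always looked up among the column labels.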
    

    loc: selecting data by row label

    df.loc['2017-04-01','A']
    
    df.loc['2017-04-01']
    Out:
        A    0.522241
        B    0.495106
        C   -0.268194
        D   -0.035003
    
    df.loc['2017-04-01':'2017-04-03']
    Out:
                           A         B         C         D
        2017-04-01  0.522241  0.495106 -0.268194 -0.035003
        2017-04-02  2.104572 -0.977768 -0.139632 -0.735926
        2017-04-03  0.480507  1.215048  1.313314 -0.072320
    
    df.loc['2017-04-01':'2017-04-04',['A','B']]
    Out:
                           A         B
        2017-04-01  0.522241  0.495106
        2017-04-02  2.104572 -0.977768
        2017-04-03  0.480507  1.215048
        2017-04-04  1.700309  0.287588
    
    df.loc[:,['A','B']]
    Out:
                           A         B
        2017-04-01  0.522241  0.495106
        2017-04-02  2.104572 -0.977768
        2017-04-03  0.480507  1.215048
        2017-04-04  1.700309  0.287588
        2017-04-05  0.526615 -0.417645
        2017-04-06  1.143858 -0.326720
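
    The loc forms above differ mainly in what they return. A minimal sketch, again assuming a fixed np.arange table rather than the article's random data:

```python
import numpy as np
import pandas as pd

# Illustrative data with fixed values.
dates = pd.date_range("2017-04-01", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

cell = df.loc["2017-04-01", "A"]                        # row label + column label: one value
row = df.loc["2017-04-01"]                              # row label only: a Series over the columns
block = df.loc["2017-04-01":"2017-04-04", ["A", "B"]]   # label slice + column list: a DataFrame

print(cell, list(row), block.shape)
```

    Unlike ordinary Python slices, a label slice passed to loc includes its end label, which is why the block above has four rows.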
    

    iloc: selecting data by row number (integer position)

    df.iloc[2]
    Out:
        A    0.480507
        B    1.215048
        C    1.313314
        D   -0.072320
    
    df.iloc[1:3]
    Out:
                           A         B         C         D
        2017-04-02  2.104572 -0.977768 -0.139632 -0.735926
        2017-04-03  0.480507  1.215048  1.313314 -0.072320
    
    df.iloc[1,1]
    
    df.iloc[1:3,1]
    
    df.iloc[1:3,1:2]
    
    df.iloc[[1,3],[2,3]]
    Out:
                           C         D
        2017-04-02 -0.139632 -0.735926
        2017-04-04 -0.012103  0.525291
    
    df.iloc[[1,3],:]
    
    df.iloc[:,[2,3]]
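
    Several of the iloc calls above are shown without output. The sketch below, on an assumed fixed np.arange table, shows what kind of object each one returns.

```python
import numpy as np
import pandas as pd

# Illustrative data with fixed values.
dates = pd.date_range("2017-04-01", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

value = df.iloc[1, 1]        # two integers: a single value
column_part = df.iloc[1:3, 1]    # row slice + integer: a Series
frame_part = df.iloc[1:3, 1:2]   # row slice + column slice: a one-column DataFrame

print(value, list(column_part), frame_part.shape)
```

    Position slices follow normal Python rules, so 1:3 covers positions 1 and 2 and excludes 3.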
    

    iat: getting the value of a single cell

    df.iat[1,2]
    Out:
        -0.13963224781812655
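
    iat addresses a cell by integer position. Its label-based counterpart is at. A short sketch on an assumed fixed table shows both reaching the same cell:

```python
import numpy as np
import pandas as pd

# Illustrative data with fixed values.
dates = pd.date_range("2017-04-01", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

by_position = df.iat[1, 2]                          # second row, third column
by_label = df.at[pd.Timestamp("2017-04-02"), "C"]   # the same cell, addressed by labels

print(by_position, by_label)
```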
    

    Filtering

    Filtering with []

    Put a boolean expression inside []; every row for which it evaluates to True is selected.

    df[df.A>1]
    Out:
                           A         B         C         D
        2017-04-02  2.104572 -0.977768 -0.139632 -0.735926
        2017-04-04  1.700309  0.287588 -0.012103  0.525291
        2017-04-06  1.143858 -0.326720  1.425379  0.531037
    
    df[df>1]
    Out:
                           A         B         C   D
        2017-04-01       NaN       NaN       NaN NaN
        2017-04-02  2.104572       NaN       NaN NaN
        2017-04-03       NaN  1.215048  1.313314 NaN
        2017-04-04  1.700309       NaN       NaN NaN
        2017-04-05       NaN       NaN       NaN NaN
        2017-04-06  1.143858       NaN  1.425379 NaN
    
    df[df.A+df.B>1.5]
    Out:
                           A         B         C         D      
        2017-04-03  0.480507  1.215048  1.313314 -0.072320  
        2017-04-04  1.700309  0.287588 -0.012103  0.525291  
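
    Boolean conditions can also be combined. The sketch below assumes a fixed np.arange table and uses & (and) and | (or); each comparison needs its own parentheses because these operators bind more tightly than > and <.

```python
import numpy as np
import pandas as pd

# Illustrative data with fixed values.
dates = pd.date_range("2017-04-01", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

both = df[(df["A"] > 4) & (df["D"] < 20)]     # rows satisfying both conditions
either = df[(df["A"] < 4) | (df["A"] > 16)]   # rows satisfying at least one condition

print(both)
print(either)
```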
    

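    Below is a more complex example. It selects the rows whose index lies between '2017-04-01' and '2017-04-04' (inclusive) and, because df.sum() adds up each column, only the columns whose total is greater than 1: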

    df.loc['2017-04-01':'2017-04-04',df.sum()>1]
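
    Since df.sum() totals each column while df.sum(axis=1) totals each row, the two produce masks for different axes. The sketch below contrasts them on an assumed fixed np.arange table; the thresholds 70 and 30 are arbitrary values chosen to suit that data.

```python
import numpy as np
import pandas as pd

# Illustrative data with fixed values.
dates = pd.date_range("2017-04-01", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# df.sum() is indexed by column label, so it works as a column mask.
col_mask = df.sum() > 70
by_column_total = df.loc["2017-04-01":"2017-04-04", col_mask]

# df.sum(axis=1) is indexed by row label, so it works as a row mask.
first_four = df.loc["2017-04-01":"2017-04-04"]
by_row_total = first_four[first_four.sum(axis=1) > 30]

print(by_column_total)
print(by_row_total)
```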
    

    You can also combine [] with the apply method to build more complex filters, using any function that returns a boolean as the filter condition:

    df[df.apply(lambda x: x['B'] > x['C'], axis=1)]
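
    For a simple comparison between two columns, apply is not strictly needed: comparing the columns directly gives the same mask. The sketch below checks this on a tiny made-up table.

```python
import pandas as pd

# Made-up values chosen so that only the first row has B > C.
df = pd.DataFrame({"B": [3, 1, 5], "C": [2, 4, 5]})

# Row-by-row: the lambda is called once per row.
with_apply = df[df.apply(lambda row: row["B"] > row["C"], axis=1)]

# Vectorized: the two columns are compared element-wise in one operation.
vectorized = df[df["B"] > df["C"]]

print(vectorized)
```

    apply remains useful when the condition is an arbitrary Python function that cannot be written as column arithmetic.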
    

    Using isin

    df['E']=['one', 'one','two','three','four','three']
    df
    Out:
                           A         B         C         D      E
        2017-04-01  0.522241  0.495106 -0.268194 -0.035003    one
        2017-04-02  2.104572 -0.977768 -0.139632 -0.735926    one
        2017-04-03  0.480507  1.215048  1.313314 -0.072320    two
        2017-04-04  1.700309  0.287588 -0.012103  0.525291  three
        2017-04-05  0.526615 -0.417645  0.405853 -0.835213   four
        2017-04-06  1.143858 -0.326720  1.425379  0.531037  three
    
    df[df.E.isin(['one'])]
    Out:
                           A         B         C         D    E
        2017-04-01  0.522241  0.495106 -0.268194 -0.035003  one
        2017-04-02  2.104572 -0.977768 -0.139632 -0.735926  one
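
    isin accepts several values at once, and the resulting mask can be inverted with ~. A minimal sketch, assuming a table that contains only the E column from above:

```python
import pandas as pd

# Only the E column from the article, on the same date index.
dates = pd.date_range("2017-04-01", periods=6)
df = pd.DataFrame({"E": ["one", "one", "two", "three", "four", "three"]}, index=dates)

in_set = df[df["E"].isin(["one", "three"])]   # rows whose E is any of the listed values
not_one = df[~df["E"].isin(["one"])]          # ~ inverts the mask: rows whose E is not 'one'

print(in_set)
print(not_one)
```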
  • Original article: https://www.cnblogs.com/bonelee/p/9882287.html