• Python机器学习(九十五)Pandas 有条件访问


    有时需要根据条件,访问DataFrame中的数据。例如,找出电影数据集中,某个导演的电影:

    # 加载数据
    movies_df = pd.read_csv("IMDB-Movie-Data.csv", index_col="Title")
    movies_df.columns = ['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime', 
                         'rating', 'votes', 'revenue_millions', 'metascore']
    
    # 有条件访问
    condition = (movies_df['director'] == "Ridley Scott")
    
    condition.head()

    输出

    Title
    Guardians of the Galaxy    False
    Prometheus                  True
    Split                      False
    Sing                       False
    Suicide Squad              False
    Name: director, dtype: bool

    isnull()类似,它返回一系列真值和假值: 如果是该导演的电影为真,否则为假。

    上面的方法,没有过滤掉不属于该导演的电影。下面的例子,选出某个导演的所有电影,过滤掉不属于他的电影:

    movies_df[movies_df['director'] == "Ridley Scott"]

    输出

                            rank                     genre  ... revenue_millions metascore
    Title                                                   ...
    Prometheus                 2  Adventure,Mystery,Sci-Fi  ...           126.46      65.0
    The Martian              103    Adventure,Drama,Sci-Fi  ...           228.43      80.0
    Robin Hood               388    Action,Adventure,Drama  ...           105.22      53.0
    American Gangster        471     Biography,Crime,Drama  ...           130.13      76.0
    Exodus: Gods and Kings   517    Action,Adventure,Drama  ...            65.01      52.0
    The Counselor            522      Crime,Drama,Thriller  ...            16.97      48.0
    A Good Year              531      Comedy,Drama,Romance  ...             7.46      47.0
    Body of Lies             738      Action,Drama,Romance  ...            39.38      57.0
    
    [8 rows x 11 columns]

    此代码可以理解为下面的sql语句:

    Select * from movies_df where director = 'Ridley Scott'

    让我们看看使用数值的条件选择,通过评分过滤DataFrame:

    movies_df[movies_df['rating'] >= 8.6].head(3)

    输出

                     rank                    genre  ... revenue_millions metascore
    Title                                           ...
    Interstellar       37   Adventure,Drama,Sci-Fi  ...           187.99      74.0
    The Dark Knight    55       Action,Crime,Drama  ...           533.32      82.0
    Inception          81  Action,Adventure,Sci-Fi  ...           292.57      74.0
    
    [3 rows x 11 columns]

    可以使用逻辑运算符|(或),&(与)来生成更复杂的条件语句。

    例如,只显示克里斯托弗·诺兰(Christopher Nolan)或雷德利·斯科特(Ridley Scott)的电影:

    movies_df[(movies_df['director'] == 'Christopher Nolan') | (movies_df['director'] == 'Ridley Scott')].head()

    输出

                     rank                     genre  ... revenue_millions metascore
    Title                                            ...
    Prometheus          2  Adventure,Mystery,Sci-Fi  ...           126.46      65.0
    Interstellar       37    Adventure,Drama,Sci-Fi  ...           187.99      74.0
    The Dark Knight    55        Action,Crime,Drama  ...           533.32      82.0
    The Prestige       65      Drama,Mystery,Sci-Fi  ...            53.08      66.0
    Inception          81   Action,Adventure,Sci-Fi  ...           292.57      74.0
    
    [5 rows x 11 columns]

    可以使用isin()方法,代码更简洁:

    movies_df[movies_df['director'].isin(['Christopher Nolan', 'Ridley Scott'])].head()

    输出

                     rank                     genre  ... revenue_millions metascore
    Title                                            ...
    Prometheus          2  Adventure,Mystery,Sci-Fi  ...           126.46      65.0
    Interstellar       37    Adventure,Drama,Sci-Fi  ...           187.99      74.0
    The Dark Knight    55        Action,Crime,Drama  ...           533.32      82.0
    The Prestige       65      Drama,Mystery,Sci-Fi  ...            53.08      66.0
    Inception          81   Action,Adventure,Sci-Fi  ...           292.57      74.0
    
    [5 rows x 11 columns]

    下面的例子,选取所有在2005年到2010年间上映的电影,其评级都在8.0以上,但票房收入却低于25%。

    movies_df[
        ((movies_df['year'] >= 2005) & (movies_df['year'] <= 2010))
        & (movies_df['rating'] > 8.0)
        & (movies_df['revenue_millions'] < movies_df['revenue_millions'].quantile(0.25))
    ]

    输出

                         rank               genre  ... revenue_millions metascore
    Title                                          ...
    3 Idiots              431        Comedy,Drama  ...             6.52      67.0
    The Lives of Others   477      Drama,Thriller  ...            11.28      89.0
    Incendies             714   Drama,Mystery,War  ...             6.86      80.0
    Taare Zameen Par      992  Drama,Family,Music  ...             1.20      42.0
    
    [4 rows x 11 columns]

    这里只有4部电影符合这个标准。

    回想一下,前面章节,使用.describe()时,收入的第25个百分位数大约是17.4,所以可以使用方法quantile(0.25)直接访问这个值。

  • 相关阅读:
    Android开源框架——Volley
    Android中的事件传递机制
    @ViewDebug.ExportedProperty的使用
    字符间距——扩展
    读取assets文件夹下图片(ods_interview)
    Android消息推送——JPush极光推送
    深入模块
    正则表达式和re模块
    初识模块
    迭代器和生成器
  • 原文地址:https://www.cnblogs.com/huanghanyu/p/13174175.html
Copyright © 2020-2023  润新知