• Pandas CookBook -- 03数据分析入门


    数据分析入门

    简书大神SeanCheney的译作,我作了些格式调整和文章目录结构的变化,更适合自己阅读,以后翻阅是更加方便自己查找吧

    import pandas as pd
    import numpy as np
    

    设定最大列数和最大行数

    pd.set_option('max_columns',5 , 'max_rows', 5)
    

    1 基础分析

    读取查看数据

    college = pd.read_csv('data/college.csv')
    

    1.1 基础信息

    college.info()
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 7535 entries, 0 to 7534
    Data columns (total 27 columns):
    INSTNM                7535 non-null object
    CITY                  7535 non-null object
    STABBR                7535 non-null object
    HBCU                  7164 non-null float64
    MENONLY               7164 non-null float64
    WOMENONLY             7164 non-null float64
    RELAFFIL              7535 non-null int64
    SATVRMID              1185 non-null float64
    SATMTMID              1196 non-null float64
    DISTANCEONLY          7164 non-null float64
    UGDS                  6874 non-null float64
    UGDS_WHITE            6874 non-null float64
    UGDS_BLACK            6874 non-null float64
    UGDS_HISP             6874 non-null float64
    UGDS_ASIAN            6874 non-null float64
    UGDS_AIAN             6874 non-null float64
    UGDS_NHPI             6874 non-null float64
    UGDS_2MOR             6874 non-null float64
    UGDS_NRA              6874 non-null float64
    UGDS_UNKN             6874 non-null float64
    PPTUG_EF              6853 non-null float64
    CURROPER              7535 non-null int64
    PCTPELL               6849 non-null float64
    PCTFLOAN              6849 non-null float64
    UG25ABV               6718 non-null float64
    MD_EARN_WNE_P10       6413 non-null object
    GRAD_DEBT_MDN_SUPP    7503 non-null object
    dtypes: float64(20), int64(2), object(5)
    memory usage: 1.6+ MB
    

    1.2 统计数值列

    college.describe(include=[np.number]).T
    
    count mean ... 75% max
    HBCU 7164.0 0.014238 ... 0.000000 1.0
    MENONLY 7164.0 0.009213 ... 0.000000 1.0
    ... ... ... ... ... ...
    PCTFLOAN 6849.0 0.522211 ... 0.745000 1.0
    UG25ABV 6718.0 0.410021 ... 0.572275 1.0

    22 rows × 8 columns

    1.3 统计对象和类型列

    college.describe(include=[np.object, pd.Categorical]).T
    
    count unique top freq
    INSTNM 7535 7535 Austin Graduate School of Theology 1
    CITY 7535 2514 New York 87
    STABBR 7535 59 CA 773
    MD_EARN_WNE_P10 6413 598 PrivacySuppressed 822
    GRAD_DEBT_MDN_SUPP 7503 2038 PrivacySuppressed 1510

    2 数据类型的转换

    DataFrame.astype(dtype, copy=True, errors='raise', **kwargs)

    • dtype : data type, or dict of column name -> data type

      • Use a numpy.dtype or Python type to cast entire pandas object to the same type.
      • Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.
      • 使用numpy的数据类型,或者直接用Python内置的类型。
      • 如果只传递一个dtype的话整个dataframe都会被转成该类型,可以传递一个{列名:类型}的字典,对dataframe中的列分别转换。
    • copy : bool, default True.

    • Return a copy when copy=True (be very careful setting copy=False as changes to values then may propagate to other pandas objects).

    • 如果设置了copy = True 的话,不改变原有的dataframe

    2.1 选取5列

    different_cols = ['RELAFFIL', 'SATMTMID', 'CURROPER', 'INSTNM', 'STABBR']
    
    col2 = college.loc[:, different_cols]
    
    col2.dtypes
    
    RELAFFIL      int64
    SATMTMID    float64
    CURROPER      int64
    INSTNM       object
    STABBR       object
    dtype: object
    

    用memory_usage方法查看每列的内存消耗

    original_mem = col2.memory_usage(deep=True)
    original_mem
    
    Index           80
    RELAFFIL     60280
                 ...  
    INSTNM      660240
    STABBR      444565
    Length: 6, dtype: int64
    

    2.2 检查数值列的独立个数

    col2['RELAFFIL'].unique()
    
    array([0, 1])
    

    RELAFFIL这列只包含0或1,因此没必要用64位,使用astype方法将其变为8位(1字节)整数

    2.3 检查两个对象列的独立值的个数

    col2.select_dtypes(include=['object']).nunique()
    
    INSTNM    7535
    STABBR      59
    dtype: int64
    

    STABBR列可以转变为“类型”(Categorical),独立值的个数小于总数的1%

    2.4 类型转化

    col3 = col2.astype({'STABBR':'category','RELAFFIL':np.int8})
    
    col3.dtypes
    
    RELAFFIL        int8
    SATMTMID     float64
    CURROPER       int64
    INSTNM        object
    STABBR      category
    dtype: object
    
    new_mem = col3.memory_usage(deep = True)
    new_mem
    
    Index           80
    RELAFFIL      7535
                 ...  
    INSTNM      660699
    STABBR       13576
    Length: 6, dtype: int64
    

    2.5 效果对比

    new_mem / original_mem
    
    Index       1.000000
    RELAFFIL    0.125000
                  ...   
    INSTNM      1.000695
    STABBR      0.030538
    Length: 6, dtype: float64
    

    通过和原始数据比较,RELAFFIL列变为了原来的八分之一,STABBR列只有原始大小的3%

    3 排序与去重

    3.1 排序

    DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

    • by : str or list of str
      • Name or list of names to sort by.
    • ascending : bool or list of bool, default True
      • Sort ascending vs. descending. Specify list for multiple sort orders. If this is a list of bools, must match the length of the by.
    • inplace : bool, default False
      • if True, perform operation in-place
    • na_position : {‘first’, ‘last’}, default ‘last’
      • first puts NaNs at the beginning, last puts NaNs at the end

    3.1.1 单列排序

    选取出三列。按照title_year降序排列

    movie = pd.read_csv('data/movie.csv')
    
    movie2 = movie[['movie_title', 'title_year', 'imdb_score']]
    
    movie2.sort_values('title_year', ascending=False).head()
    
    movie_title title_year imdb_score
    3884 The Veil 2016.0 4.7
    2375 My Big Fat Greek Wedding 2 2016.0 6.1
    2794 Miracles from Heaven 2016.0 6.8
    92 Independence Day: Resurgence 2016.0 5.5
    153 Kung Fu Panda 3 2016.0 7.2

    3.1.2 多列排序

    通过给ascending设置列表,可以同时对一列降序排列,一列升序排列

    movie4 = movie[['movie_title', 'title_year', 'content_rating', 'budget']]
    
    movie4_sorted = movie4.sort_values(['title_year', 'content_rating', 'budget'], ascending=[False, False, True])
    
    movie4_sorted.head()
    
    movie_title title_year content_rating budget
    4026 Compadres 2016.0 R 3000000.0
    3884 The Veil 2016.0 R 4000000.0
    3682 Fifty Shades of Black 2016.0 R 5000000.0
    3685 The Perfect Match 2016.0 R 5000000.0
    3396 The Neon Demon 2016.0 R 7000000.0

    3.2 去重

    DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)

    • subset : column label or sequence of labels, optional
      • Only consider certain columns for identifying duplicates, by default use all of the columns
      • 默认整行的所有列数据都重复了,才会被认为重复。
      • subset 传递为列名的列表时,在判断重复时候,只要该几列值重复了,就认为该行是重复的。
    • keep : {‘first’, ‘last’, False}, default ‘first’
      • first : Drop duplicates except for the first occurrence.
      • last : Drop duplicates except for the last occurrence.
      • False : Drop all duplicates.
    • inplace : boolean, default False
      • Whether to drop duplicates in place or to return a copy
    movie3 = movie2.sort_values(['title_year','imdb_score'], ascending=False)
    movie3
    
    movie_title title_year imdb_score
    4312 Kickboxer: Vengeance 2016.0 9.1
    4277 A Beginner's Guide to Snuff 2016.0 8.7
    ... ... ... ...
    3246 The Bold and the Beautiful NaN 3.5
    2119 The Bachelor NaN 2.9

    4916 rows × 3 columns

    用drop_duplicates去重,只保留每年的第一条数据

    movie_top_year = movie3.drop_duplicates(subset='title_year',keep='first')
    
    movie_top_year
    
    movie_title title_year imdb_score
    4312 Kickboxer: Vengeance 2016.0 9.1
    3745 Running Forever 2015.0 8.6
    ... ... ... ...
    4695 Intolerance: Love's Struggle Throughout the Ages 1916.0 8.0
    2725 Towering Inferno NaN 9.5

    92 rows × 3 columns

    4 从最大中选择最小

    读取movie.csv,选取'movie_title', 'imdb_score', 'budget'三列

     movie2 = movie[['movie_title', 'imdb_score', 'budget']]
    
     movie2.head()
    
    movie_title imdb_score budget
    0 Avatar 7.9 237000000.0
    1 Pirates of the Caribbean: At World's End 7.1 300000000.0
    2 Spectre 6.8 245000000.0
    3 The Dark Knight Rises 8.5 250000000.0
    4 Star Wars: Episode VII - The Force Awakens 7.1 NaN

    4.1 nlargest与nsmallest

    用nlargest方法,选出imdb_score分数最高的100个

    movie2.nlargest(100, 'imdb_score').head()
    
    movie_title imdb_score budget
    2725 Towering Inferno 9.5 NaN
    1920 The Shawshank Redemption 9.3 25000000.0
    3402 The Godfather 9.2 6000000.0
    2779 Dekalog 9.1 NaN
    4312 Kickboxer: Vengeance 9.1 17000000.0

    用链式操作,nsmallest方法再从中挑出预算最小的五部

    movie2.nlargest(100, 'imdb_score').nsmallest(5, 'budget')
    
    movie_title imdb_score budget
    4804 Butterfly Girl 8.7 180000.0
    4801 Children of Heaven 8.5 180000.0
    4706 12 Angry Men 8.9 350000.0
    4550 A Separation 8.4 500000.0
    4636 The Other Dream Team 8.4 500000.0

    4.2 用sort_values方法

    用sort_values方法,选取imdb_score最高的100个

    movie2.sort_values('imdb_score', ascending=False).head(100).head()
    
    movie_title imdb_score budget
    2725 Towering Inferno 9.5 NaN
    1920 The Shawshank Redemption 9.3 25000000.0
    3402 The Godfather 9.2 6000000.0
    2779 Dekalog 9.1 NaN
    4312 Kickboxer: Vengeance 9.1 17000000.0

    然后可以再.sort_values('budget').head(),选出预算最低的5个,结果如下

     movie2.nlargest(100, 'imdb_score').tail()
    
    movie_title imdb_score budget
    4023 Oldboy 8.4 3000000.0
    4163 To Kill a Mockingbird 8.4 2000000.0
    4395 Reservoir Dogs 8.4 1200000.0
    4550 A Separation 8.4 500000.0
    4636 The Other Dream Team 8.4 500000.0
    天下风云出我辈,一入江湖岁月催
  • 相关阅读:
    点击图片等比缩放
    C# 使用 NPOI 库读写 Excel 文件
    vue url生产二维码
    centos Linux 同步Windows 时间
    centos 搭建 docker sentry
    centos 安装 pip
    webpack 提升90%的构建速度 HardSourceWebpackPlugin
    webpack 原生分析
    webpack 分析
    nodejs——js 实现webSocket 兼容移动端
  • 原文地址:https://www.cnblogs.com/shiyushiyu/p/9738861.html
Copyright © 2020-2023  润新知