Pandas CookBook -- 03数据分析入门

简书大神SeanCheney的译作，我作了些格式调整和文章目录结构的变化，更适合自己阅读，以后翻阅是更加方便自己查找吧

import pandas as pd
import numpy as np

设定最大列数和最大行数

pd.set_option('max_columns',5 , 'max_rows', 5)

1 基础分析

读取查看数据

college = pd.read_csv('data/college.csv')

1.1 基础信息

college.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7535 entries, 0 to 7534
Data columns (total 27 columns):
INSTNM                7535 non-null object
CITY                  7535 non-null object
STABBR                7535 non-null object
HBCU                  7164 non-null float64
MENONLY               7164 non-null float64
WOMENONLY             7164 non-null float64
RELAFFIL              7535 non-null int64
SATVRMID              1185 non-null float64
SATMTMID              1196 non-null float64
DISTANCEONLY          7164 non-null float64
UGDS                  6874 non-null float64
UGDS_WHITE            6874 non-null float64
UGDS_BLACK            6874 non-null float64
UGDS_HISP             6874 non-null float64
UGDS_ASIAN            6874 non-null float64
UGDS_AIAN             6874 non-null float64
UGDS_NHPI             6874 non-null float64
UGDS_2MOR             6874 non-null float64
UGDS_NRA              6874 non-null float64
UGDS_UNKN             6874 non-null float64
PPTUG_EF              6853 non-null float64
CURROPER              7535 non-null int64
PCTPELL               6849 non-null float64
PCTFLOAN              6849 non-null float64
UG25ABV               6718 non-null float64
MD_EARN_WNE_P10       6413 non-null object
GRAD_DEBT_MDN_SUPP    7503 non-null object
dtypes: float64(20), int64(2), object(5)
memory usage: 1.6+ MB

1.2 统计数值列

college.describe(include=[np.number]).T

	count	mean	...	75%	max
HBCU	7164.0	0.014238	...	0.000000	1.0
MENONLY	7164.0	0.009213	...	0.000000	1.0
...	...	...	...	...	...
PCTFLOAN	6849.0	0.522211	...	0.745000	1.0
UG25ABV	6718.0	0.410021	...	0.572275	1.0

22 rows × 8 columns

1.3 统计对象和类型列

college.describe(include=[np.object, pd.Categorical]).T

	count	unique	top	freq
INSTNM	7535	7535	Austin Graduate School of Theology	1
CITY	7535	2514	New York	87
STABBR	7535	59	CA	773
MD_EARN_WNE_P10	6413	598	PrivacySuppressed	822
GRAD_DEBT_MDN_SUPP	7503	2038	PrivacySuppressed	1510

2 数据类型的转换

DataFrame.astype(dtype, copy=True, errors='raise', **kwargs)

dtype : data type, or dict of column name -> data type
- Use a numpy.dtype or Python type to cast entire pandas object to the same type.
- Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.
- 使用numpy的数据类型，或者直接用Python内置的类型。
- 如果只传递一个dtype的话整个dataframe都会被转成该类型，可以传递一个{列名:类型}的字典，对dataframe中的列分别转换。
copy : bool, default True.
Return a copy when copy=True (be very careful setting copy=False as changes to values then may propagate to other pandas objects).
如果设置了copy = True 的话，不改变原有的dataframe

2.1 选取5列

different_cols = ['RELAFFIL', 'SATMTMID', 'CURROPER', 'INSTNM', 'STABBR']

col2 = college.loc[:, different_cols]

col2.dtypes

RELAFFIL      int64
SATMTMID    float64
CURROPER      int64
INSTNM       object
STABBR       object
dtype: object

用memory_usage方法查看每列的内存消耗

original_mem = col2.memory_usage(deep=True)
original_mem

Index           80
RELAFFIL     60280
             ...  
INSTNM      660240
STABBR      444565
Length: 6, dtype: int64

2.2 检查数值列的独立个数

col2['RELAFFIL'].unique()

array([0, 1])

RELAFFIL这列只包含0或1，因此没必要用64位，使用astype方法将其变为8位（1字节）整数

2.3 检查两个对象列的独立值的个数

col2.select_dtypes(include=['object']).nunique()

INSTNM    7535
STABBR      59
dtype: int64

STABBR列可以转变为“类型”（Categorical），独立值的个数小于总数的1%

2.4 类型转化

col3 = col2.astype({'STABBR':'category','RELAFFIL':np.int8})

col3.dtypes

RELAFFIL        int8
SATMTMID     float64
CURROPER       int64
INSTNM        object
STABBR      category
dtype: object

new_mem = col3.memory_usage(deep = True)
new_mem

Index           80
RELAFFIL      7535
             ...  
INSTNM      660699
STABBR       13576
Length: 6, dtype: int64

2.5 效果对比

new_mem / original_mem

Index       1.000000
RELAFFIL    0.125000
              ...   
INSTNM      1.000695
STABBR      0.030538
Length: 6, dtype: float64

通过和原始数据比较，RELAFFIL列变为了原来的八分之一，STABBR列只有原始大小的3%

3 排序与去重

3.1 排序

DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

by : str or list of str
- Name or list of names to sort by.
ascending : bool or list of bool, default True
- Sort ascending vs. descending. Specify list for multiple sort orders. If this is a list of bools, must match the length of the by.
inplace : bool, default False
- if True, perform operation in-place
na_position : {‘first’, ‘last’}, default ‘last’
- first puts NaNs at the beginning, last puts NaNs at the end

3.1.1 单列排序

选取出三列。按照title_year降序排列

movie = pd.read_csv('data/movie.csv')

movie2 = movie[['movie_title', 'title_year', 'imdb_score']]

movie2.sort_values('title_year', ascending=False).head()

	movie_title	title_year	imdb_score
3884	The Veil	2016.0	4.7
2375	My Big Fat Greek Wedding 2	2016.0	6.1
2794	Miracles from Heaven	2016.0	6.8
92	Independence Day: Resurgence	2016.0	5.5
153	Kung Fu Panda 3	2016.0	7.2

3.1.2 多列排序

通过给ascending设置列表，可以同时对一列降序排列，一列升序排列

movie4 = movie[['movie_title', 'title_year', 'content_rating', 'budget']]

movie4_sorted = movie4.sort_values(['title_year', 'content_rating', 'budget'], ascending=[False, False, True])

movie4_sorted.head()

	movie_title	title_year	content_rating	budget
4026	Compadres	2016.0	R	3000000.0
3884	The Veil	2016.0	R	4000000.0
3682	Fifty Shades of Black	2016.0	R	5000000.0
3685	The Perfect Match	2016.0	R	5000000.0
3396	The Neon Demon	2016.0	R	7000000.0

3.2 去重

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)

subset : column label or sequence of labels, optional
- Only consider certain columns for identifying duplicates, by default use all of the columns
- 默认整行的所有列数据都重复了，才会被认为重复。
- subset 传递为列名的列表时，在判断重复时候，只要该几列值重复了，就认为该行是重复的。
keep : {‘first’, ‘last’, False}, default ‘first’
- first : Drop duplicates except for the first occurrence.
- last : Drop duplicates except for the last occurrence.
- False : Drop all duplicates.
inplace : boolean, default False
- Whether to drop duplicates in place or to return a copy

movie3 = movie2.sort_values(['title_year','imdb_score'], ascending=False)
movie3

	movie_title	title_year	imdb_score
4312	Kickboxer: Vengeance	2016.0	9.1
4277	A Beginner's Guide to Snuff	2016.0	8.7
...	...	...	...
3246	The Bold and the Beautiful	NaN	3.5
2119	The Bachelor	NaN	2.9

4916 rows × 3 columns

用drop_duplicates去重，只保留每年的第一条数据

movie_top_year = movie3.drop_duplicates(subset='title_year',keep='first')

movie_top_year

	movie_title	title_year	imdb_score
4312	Kickboxer: Vengeance	2016.0	9.1
3745	Running Forever	2015.0	8.6
...	...	...	...
4695	Intolerance: Love's Struggle Throughout the Ages	1916.0	8.0
2725	Towering Inferno	NaN	9.5

92 rows × 3 columns

4 从最大中选择最小

读取movie.csv，选取'movie_title', 'imdb_score', 'budget'三列

 movie2 = movie[['movie_title', 'imdb_score', 'budget']]

 movie2.head()

	movie_title	imdb_score	budget
0	Avatar	7.9	237000000.0
1	Pirates of the Caribbean: At World's End	7.1	300000000.0
2	Spectre	6.8	245000000.0
3	The Dark Knight Rises	8.5	250000000.0
4	Star Wars: Episode VII - The Force Awakens	7.1	NaN

4.1 nlargest与nsmallest

用nlargest方法，选出imdb_score分数最高的100个

movie2.nlargest(100, 'imdb_score').head()

	movie_title	imdb_score	budget
2725	Towering Inferno	9.5	NaN
1920	The Shawshank Redemption	9.3	25000000.0
3402	The Godfather	9.2	6000000.0
2779	Dekalog	9.1	NaN
4312	Kickboxer: Vengeance	9.1	17000000.0

用链式操作，nsmallest方法再从中挑出预算最小的五部

movie2.nlargest(100, 'imdb_score').nsmallest(5, 'budget')

	movie_title	imdb_score	budget
4804	Butterfly Girl	8.7	180000.0
4801	Children of Heaven	8.5	180000.0
4706	12 Angry Men	8.9	350000.0
4550	A Separation	8.4	500000.0
4636	The Other Dream Team	8.4	500000.0

4.2 用sort_values方法

用sort_values方法，选取imdb_score最高的100个

movie2.sort_values('imdb_score', ascending=False).head(100).head()

	movie_title	imdb_score	budget
2725	Towering Inferno	9.5	NaN
1920	The Shawshank Redemption	9.3	25000000.0
3402	The Godfather	9.2	6000000.0
2779	Dekalog	9.1	NaN
4312	Kickboxer: Vengeance	9.1	17000000.0

然后可以再.sort_values('budget').head()，选出预算最低的5个，结果如下

 movie2.nlargest(100, 'imdb_score').tail()

	movie_title	imdb_score	budget
4023	Oldboy	8.4	3000000.0
4163	To Kill a Mockingbird	8.4	2000000.0
4395	Reservoir Dogs	8.4	1200000.0
4550	A Separation	8.4	500000.0
4636	The Other Dream Team	8.4	500000.0

天下风云出我辈，一入江湖岁月催

相关阅读:
点击图片等比缩放
C# 使用 NPOI 库读写 Excel 文件
vue url生产二维码
centos Linux 同步Windows 时间
centos 搭建 docker sentry
centos 安装 pip
webpack 提升90%的构建速度 HardSourceWebpackPlugin
webpack 原生分析
webpack 分析
nodejs——js 实现webSocket 兼容移动端

原文地址：https://www.cnblogs.com/shiyushiyu/p/9738861.html