• Pandas: Descriptive Statistics Examples


    Overview

    Jupyter notebook: https://nbviewer.jupyter.org/github/chenjieyouge/jupyter_share/blob/master/share/pandas- 描述性统计.ipynb

    import numpy as np
    import pandas as pd
    

    pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics, methods that extract a single value (like the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame. Compared with the similar methods found on NumPy arrays, they have built-in handling for missing data. Consider a small DataFrame -> (pandas provides common statistical methods; the input is typically a Series or the rows/columns of a DataFrame. Notably, missing values are handled automatically and excluded from the computations.)

    df = pd.DataFrame([
        [1.4, np.nan],
        [7.6, -4.5],
        [np.nan, np.nan],
        [3, -1.5]
    ],
    index=list('abcd'), columns=['one', 'two'])
    
    df
    
       one  two
    a  1.4  NaN
    b  7.6 -4.5
    c  NaN  NaN
    d  3.0 -1.5

    Calling DataFrame's sum method returns a Series containing column sums:

    "默认axis=0, 行方向, 下方, 展示每列, 忽略缺失值"
    df.sum()
    
    df.mean()
    "在计算平均值时, NaN 不计入样本"
    
    '默认axis=0, 行方向, 下方, 展示每列, 忽略缺失值'
    
    one    12.0
    two    -6.0
    dtype: float64
    
    one    4.0
    two   -3.0
    dtype: float64
    
    '在计算平均值时, NaN 不计入样本'
    

    Passing axis='columns' or axis=1 sums across the columns instead. -> choosing the axis direction

    "按行统计, aixs=1, 列方向, 右边"
    df.sum(axis=1)
    
    '按行统计, aixs=1, 列方向, 右边'
    
    a    1.4
    b    3.1
    c    0.0
    d    1.5
    dtype: float64
    

    NA values are excluded unless the entire slice (row or column in this case) is NA. This can be disabled with the skipna option: -> missing values are skipped automatically and not counted in the sample

    "默认是忽略缺失值的, 要缺失值, 则手动指定一下"
    df.mean(skipna=False, axis='columns') # 列方向, 行哦
    
    '默认是忽略缺失值的, 要缺失值, 则手动指定一下'
    
    a     NaN
    b    1.55
    c     NaN
    d    0.75
    dtype: float64
    

    See Table 5-7 for a list of common options for each reduction method.

    Method  Description
    axis    Axis to reduce over; 0 for the DataFrame's rows and 1 for its columns
    skipna  Exclude missing values; True by default
    level   Reduce grouped by level if the axis is hierarchically indexed (MultiIndex)
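
    The level option applies when the axis has a MultiIndex. A minimal sketch on a hypothetical two-level frame (newer pandas releases deprecate the level= argument of reductions in favor of the equivalent groupby shown here):

    # Hypothetical MultiIndex frame to illustrate level-wise reduction
    midx = pd.MultiIndex.from_tuples(
        [('x', 1), ('x', 2), ('y', 1), ('y', 2)], names=['key', 'num'])
    mdf = pd.DataFrame({'val': [1.0, 2.0, np.nan, 4.0]}, index=midx)

    # Sum within each value of the outer level 'key'; NaNs skipped as usual.
    # Equivalent to the older mdf.sum(level='key')
    mdf.groupby(level='key').sum()

         val
    key
    x    3.0
    y    4.0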

    Some methods, like idxmax and idxmin, return indirect statistics, like the index where the minimum or maximum values are attained.

    "idxmax() 返回最大值的第一个索引标签"
    df.idxmax()
    
    'idxmax() 返回最大值的第一个索引标签'
    
    one    b
    two    d
    dtype: object
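
    The idxmin counterpart works the same way on the df defined above:

    # First index label at which the minimum occurs, per column
    df.idxmin()

    one    a
    two    b
    dtype: object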
    

    Other methods are accumulations: -> cumulative sums; default axis=0, down the rows

    "累积求和, 默认axis=0, 忽略NA"
    df.cumsum()
    
    "也可指定axis=1列方向"
    df.cumsum(axis=1)
    
    '累积求和, 默认axis=0, 忽略NA'
    
    one two
    a 1.4 NaN
    b 9.0 -4.5
    c NaN NaN
    d 12.0 -6.0
    '也可指定axis=0列方向'
    
    one two
    a 1.4 NaN
    b 7.6 3.1
    c NaN NaN
    d 3.0 1.5

    Another type of method is neither a reduction nor an accumulation. describe is one such example, producing multiple summary statistics in one shot: --> (describe() computes descriptive statistics for each column)

    "describe() 返回列变量分位数, 均值, count, std等常用统计指标"
    " roud(2)保留2位小数"
    
    df.describe().round(2)
    
    'describe() 返回列变量分位数, 均值, count, std等常用统计指标'
    
    ' roud(2)保留2位小数'
    
    one two
    count 3.00 2.00
    mean 4.00 -3.00
    std 3.22 2.12
    min 1.40 -4.50
    25% 2.20 -3.75
    50% 3.00 -3.00
    75% 5.30 -2.25
    max 7.60 -1.50

    On non-numeric data, describe produces alternative summary statistics: --> on categorical/object columns it automatically switches to a categorical summary

    obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
    
    # describe() on an object column reports count, unique, top and freq
    obj.describe()
    
    count     16
    unique     3
    top        a
    freq       8
    dtype: object
    

    See Table 5-8 for a full list of summary statistics and related methods.

    Method          Description
    count           Number of non-NA values
    describe        Summary statistics for a Series or each DataFrame column
    min, max        Minimum and maximum values
    argmin, argmax  Integer positions at which the minimum or maximum is attained
    idxmin, idxmax  Index labels at which the minimum or maximum is attained
    quantile        Sample quantile
    sum             Sum of values
    mean            Mean of values
    median          Median (50% quantile) of values
    var             Sample variance of values
    std             Sample standard deviation of values
    skew            Sample skewness of values
    kurt            Sample kurtosis of values
    cumsum          Cumulative sum of values
    cumprod         Cumulative product of values
    diff            Compute first arithmetic difference (useful for time series)
    pct_change      Compute percent changes
    df.idxmax()
    
    one    b
    two    d
    dtype: object
    
    df['one'].argmax()
    
    c:\python\python36\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: 'argmax' is deprecated, use 'idxmax' instead. The behavior of 'argmax' will be corrected to return the positional maximum in the future. Use 'series.values.argmax' to get the position of the maximum now.
      """Entry point for launching an IPython kernel.
    
    'b'
    
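    A few of the table's methods are not demonstrated above; a quick sketch on the same df defined earlier:

    # Sample quantile per column (0.5 is the median); NaNs skipped
    df.quantile(0.5)

    one    3.0
    two   -3.0
    Name: 0.5, dtype: float64

    # First arithmetic difference, row over row (NaN where there is no prior row)
    df.diff()

    # Percent change between consecutive values
    df['one'].pct_change()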
    

    Correlation and Covariance

    Some summary statistics, like correlation and covariance, are computed from pairs of arguments. Let's consider some DataFrames of stock prices and volumes obtained from Yahoo! Finance using the add-on pandas-datareader package. If you don't have it installed already, it can be obtained via conda or pip:

    conda install pandas-datareader  # or: pip install pandas-datareader

    I use the pandas_datareader module to download some data for a few stock tickers:

    import pandas_datareader.data as web
    
    # Dict comprehension (the download itself is skipped here; the data is
    # loaded from a pickle file below)
    # all_data = {ticker: web.get_data_yahoo(ticker)
    #             for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}
    
    # Binary data: read with read_pickle(), save with to_pickle()
    returns = pd.read_pickle("../examples/yahoo_volume.pkl")
    
    returns.tail()
    
                    AAPL     GOOG       IBM      MSFT
    Date
    2016-10-17  23624900  1089500   5890400  23830000
    2016-10-18  24553500  1995600  12770600  19149500
    2016-10-19  20034600   116600   4632900  22878400
    2016-10-20  24125800  1734200   4023100  49455600
    2016-10-21  22384800  1260500   4401900  79974200
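
    As the annotation above notes, to_pickle() is the writing counterpart of read_pickle(); a minimal round-trip sketch (the output path is hypothetical):

    # Save the frame in pandas' binary pickle format, then load it back
    returns.to_pickle('volume_copy.pkl')               # hypothetical path
    pd.read_pickle('volume_copy.pkl').equals(returns)  # -> True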

    The corr method of Series computes the correlation of the overlapping, non-NA, aligned-by-index values in two Series. Relatedly, cov computes the covariance: -> (corr computes the correlation coefficient, cov the covariance)

    returns.describe()
    
                   AAPL          GOOG           IBM          MSFT
    count  1.714000e+03  1.714000e+03  1.714000e+03  1.714000e+03
    mean   9.595085e+07  4.111642e+06  4.815604e+06  4.630359e+07
    std    6.010914e+07  2.948526e+06  2.345484e+06  2.437393e+07
    min    1.304640e+07  7.900000e+03  1.415800e+06  9.009100e+06
    25%    5.088832e+07  1.950025e+06  3.337950e+06  3.008798e+07
    50%    8.270255e+07  3.710000e+06  4.216750e+06  4.146035e+07
    75%    1.235752e+08  5.243550e+06  5.520500e+06  5.558810e+07
    max    4.702495e+08  2.976060e+07  2.341650e+07  3.193179e+08
    
    "The correlation between MSFT and IBM is: {}".format(returns['MSFT'].corr(returns['IBM']))
    
    'The correlation between MSFT and IBM is: 0.42589249800808743'
    
    "The covariance between MSFT and IBM is: {}".format(returns['MSFT'].cov(returns['IBM']))
    
    'The covariance between MSFT and IBM is: 24347708920434.156'
    
    

    Since MSFT is a valid Python attribute, we can also select these columns using more concise syntax:

    "通过 DF.col_name 这样的属性来选取字段, 面对对象, 支持"
    
    returns.MSFT.corr(returns.IBM)
    
    '通过 DF.col_name 这样的属性来选取字段, 面对对象, 支持'
    
    
    0.42589249800808743
    
    

    DataFrame's corr and cov methods, on the other hand, return a full correlation or covariance matrix as a DataFrame, respectively. -> df.corr() returns the correlation matrix, df.cov() the covariance matrix

    "DF.corr() 返回矩阵, 这个厉害了, 不知道有无中心化过程"
    
    returns.corr()
    
    "DF.cov() 返回协方差矩阵"
    returns.cov()
    
    'DF.corr() 返回矩阵, 这个厉害了, 不知道有无中心化过程'
    
    
    AAPL GOOG IBM MSFT
    AAPL 1.000000 0.576030 0.383942 0.490353
    GOOG 0.576030 1.000000 0.438424 0.490446
    IBM 0.383942 0.438424 1.000000 0.425892
    MSFT 0.490353 0.490446 0.425892 1.000000
    'DF.cov() 返回协方差矩阵'
    
    
    AAPL GOOG IBM MSFT
    AAPL 3.613108e+15 1.020917e+14 5.413005e+13 7.184135e+14
    GOOG 1.020917e+14 8.693806e+12 3.032022e+12 3.524694e+13
    IBM 5.413005e+13 3.032022e+12 5.501297e+12 2.434771e+13
    MSFT 7.184135e+14 3.524694e+13 2.434771e+13 5.940884e+14

    Using the DataFrame's corrwith method, you can compute pairwise correlations between a DataFrame's columns or rows with another Series or DataFrame. Passing a Series returns a Series with the correlation value computed for each column.


    "corrwith() 计算成对相关"
    
    "计算IMB与其他几个的相关"
    returns.corrwith(returns.IBM)
    
    'corrwith() 计算成对相关'
    
    
    '计算IMB与其他几个的相关'
    
    
    AAPL    0.383942
    GOOG    0.438424
    IBM     1.000000
    MSFT    0.425892
    dtype: float64
    
    
    returns.corrwith(returns)
    
    AAPL    1.0
    GOOG    1.0
    IBM     1.0
    MSFT    1.0
    dtype: float64
    
    

    Passing axis='columns' does things row-by-row instead, producing one correlation per row. In all cases, the data points are aligned by label before the correlation is computed. -> computed row by row, provided the data is aligned by label; see the sketch below.
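
    A minimal sketch of the row-wise form, on hypothetical label-aligned frames (the tickers and values here are made up):

    # Two frames sharing row and column labels
    left = pd.DataFrame(np.random.randn(5, 4),
                        columns=['AAPL', 'GOOG', 'IBM', 'MSFT'])
    right = left + 0.1 * np.random.randn(5, 4)

    # axis='columns': one correlation per row, computed across the four columns
    left.corrwith(right, axis='columns')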

    Unique Values, Value Counts, and Membership

    Another class of related methods extracts information about the values contained in a one-dimensional Series. To illustrate these, consider this example:

    obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
    
    # unique() returns the distinct values
    obj.unique()
    
    array(['c', 'a', 'd', 'b'], dtype=object)
    
    

    The unique values are not necessarily returned in sorted order, but could be sorted after the fact if needed (uniques.sort()). Relatedly, value_counts computes a Series containing value frequencies: -> value_counts() counts frequencies

    "统计词频, value_counts()"
    
    obj.value_counts()
    
    '统计词频, value_counts()'
    
    
    a    3
    c    3
    b    2
    d    1
    dtype: int64
    
    

    The Series is sorted by value in descending order as a convenience. value_counts is also available as a top-level pandas method that can be used with any array or sequence: -> counts frequencies, in descending order

    "统计词频并降序排列"
    
    "默认是降序的"
    pd.value_counts(obj.values)
    
    "手动自动不排序"
    pd.value_counts(obj.values, sort=False)
    
    
    '统计词频并降序排列'
    
    
    '默认是降序的'
    
    
    a    3
    c    3
    b    2
    d    1
    dtype: int64
    
    
    '手动自动不排序'
    
    
    c    3
    b    2
    d    1
    a    3
    dtype: int64
    
    

    isin performs a vectorized set membership check and can be useful in filtering a dataset down to a subset of values in a Series or column in a DataFrame: -> isin does membership tests

    obj
    
    0    c
    1    a
    2    d
    3    a
    4    a
    5    b
    6    b
    7    c
    8    c
    dtype: object
    
    
    mask = obj.isin(['b', 'c'])
    mask
    
    0     True
    1    False
    2    False
    3    False
    4    False
    5     True
    6     True
    7     True
    8     True
    dtype: bool
    
    
    "bool 过滤条件, True的则返回"
    obj[mask]
    
    'bool 过滤条件, True的则返回'
    
    
    0    c
    5    b
    6    b
    7    c
    8    c
    dtype: object
    
    

    Related to isin is the Index.get_indexer method, which gives you an index array from an array of possibly non-distinct values into another array of distinct values:

    to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
    
    unique_vals = pd.Series(['c', 'b', 'a'])
    
    # Map each value of to_match to its position in unique_vals
    pd.Index(unique_vals).get_indexer(to_match)
    
    array([0, 2, 1, 1, 0, 2], dtype=int64)
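
    In other words, get_indexer returns, for each element of to_match, the position of that value in the distinct-valued index; those positions index back into unique_vals:

    # 'c' is at position 0, 'b' at 1, 'a' at 2 in unique_vals,
    # hence the result [0, 2, 1, 1, 0, 2]
    indexer = pd.Index(unique_vals).get_indexer(to_match)
    unique_vals.take(indexer)  # recovers the values of to_match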
    
    

    See Table 5-9 for a reference on these methods.

    Method        Description
    isin          Boolean array indicating whether each value is contained in the passed sequence of values
    match         Integer indices mapping each value into an array of distinct values; helpful for data alignment (superseded by Index.get_indexer)
    unique        Array of unique values, in order of appearance
    value_counts  Value frequencies, in descending count order by default

    In some cases, you may want to compute a histogram on multiple related columns in a DataFrame. Here's an example:

    data = pd.DataFrame({
        'Qu1': [1, 3, 4, 3, 4],
        'Qu2': [2, 3, 1, 2, 3],
        'Qu3': [1, 5, 2, 4, 4]})
    
    data
    
       Qu1  Qu2  Qu3
    0    1    2    1
    1    3    3    5
    2    4    1    2
    3    3    2    4
    4    4    3    4

    Passing pandas.value_counts to this DataFrame's apply function gives: -> counts each value's frequency per column, filling absent counts with 0

    result = data.apply(pd.value_counts).fillna(0)
    result
    
       Qu1  Qu2  Qu3
    1  1.0  1.0  1.0
    2  0.0  2.0  1.0
    3  2.0  2.0  0.0
    4  2.0  0.0  2.0
    5  0.0  0.0  1.0

    Here, the row labels in the result are the distinct values occurring in all of the columns. The values are the respective counts of these values in each column.


    Conclusion

    In the next chapter, we will discuss tools for reading (or loading) and writing datasets with pandas. After that, we will dig deeper into data cleaning, wrangling, analysis, and visualization tools using pandas.

    The chapters ahead cover reading and writing data, then cleaning, transformation, wrangling, analysis and modeling, mining, and visualization.

• Original post: https://www.cnblogs.com/chenjieyouge/p/11879235.html