• 直方图


    直方图

    • 数值型数据可视化的一种方式。
    • 如果原始数值型数据经过类别化(分组)处理,则可以使用直方图来观察数据的分布。
    • 如果原始数据没有经过分组处理,则使用茎叶图、箱线图、小提琴图、点图、核密度图等来观察数据的分布。

    作用:展示数据分布的一种常用方式,通过直方图可观察数据分布的大致形状,能看分布是否对称、快速判断数据是否近似服从正态分布。

    import matplotlib.pyplot as plt
    plt.rcParams["font.sans-serif"] = ["SimHei"]
    plt.rcParams["axes.unicode_minus"] = False
    

    下面来看plt.hist()的语法:

    plt.hist(x, bins=None, range=None, normed=False, weights=None, cumulative=False, bottom=None, histtype='bar',
    align='mid', orientation='vertical', rwidth=None, log=False, color=None, label=None, stacked=False,
    hold=None, data=None, **kwargs)

      Parameters
    ----------
    x : (n,) array or sequence of (n,) arrays  指定要绘制直方图的数据
        Input values, this takes either a single array or a sequency of
        arrays which are not required to be of the same length
    
    bins : integer or array_like or 'auto', optional    指定直方图条形的个数,一种是给定整数,指定绘制多少个条形;一种是给出一个序列,可随意指定间距
        If an integer is given, `bins + 1` bin edges are returned,   
        consistently with :func:`numpy.histogram` for numpy version >=
        1.3.
    
        Unequally spaced bins are supported if `bins` is a sequence.   
    
        If Numpy 1.11 is installed, may also be ``'auto'``.
    
        Default is taken from the rcParam ``hist.bins``.
    
    range : tuple or None, optional   指定直方图数据的上下界,默认包含绘图数据的max和min
        The lower and upper range of the bins. Lower and upper outliers
        are ignored. If not provided, `range` is (x.min(), x.max()). Range
        has no effect if `bins` is a sequence.
    
        If `bins` is a sequence or `range` is specified, autoscaling
        is based on the specified bin range instead of the
        range of x.
    
        Default is ``None``
    
    normed : boolean, optional   是否将直方图的频数转换成频率
        If `True`, the first element of the return tuple will
        be the counts normalized to form a probability density, i.e.,
        ``n/(len(x)`dbin)``, i.e., the integral of the histogram will sum
        to 1. If *stacked* is also *True*, the sum of the histograms is
        normalized to 1.
    
        Default is ``False``     默认是False,频数图;频率图需要改成True
    
    weights : (n, ) array_like or None, optional  该参数可为每一个数据设置权重
        An array of weights, of the same shape as `x`.  Each value in `x`
        only contributes its associated weight towards the bin count
        (instead of 1).  If `normed` is True, the weights are normalized,
        so that the integral of the density over the range remains 1.
    
        Default is ``None``
    
    cumulative : boolean, optional   是否需要计算累计频数或频率
        If `True`, then a histogram is computed where each bin gives the
        counts in that bin plus all bins for smaller values. The last bin
        gives the total number of datapoints.  If `normed` is also `True`
        then the histogram is normalized such that the last bin equals 1.
        If `cumulative` evaluates to less than 0 (e.g., -1), the direction
        of accumulation is reversed.  In this case, if `normed` is also
        `True`, then the histogram is normalized such that the first bin
        equals 1.
    
        Default is ``False``  默认是False,普通的频数或频率图;True,即设置为累计频数图or累计频率图
    
    bottom : array_like, scalar, or None    可以为直方图的每个条形添加基准线,默认0
        Location of the bottom baseline of each bin.  If a scalar,
        the base line for each bin is shifted by the same amount.
        If an array, each bin is shifted independently and the length
        of bottom must match the number of bins.  If None, defaults to 0.
    
        Default is ``None``
    
    histtype : {'bar', 'barstacked', 'step',  'stepfilled'}, optional   指定直方图的类型,默认为bar,还有barstacked,step,stepfilled
        The type of histogram to draw.
    
        - 'bar' is a traditional bar-type histogram.  If multiple data
          are given the bars are aranged side by side.
    
        - 'barstacked' is a bar-type histogram where multiple
          data are stacked on top of each other.
    
        - 'step' generates a lineplot that is by default
          unfilled.
    
        - 'stepfilled' generates a lineplot that is by default
          filled.
    
        Default is 'bar'
    
    align : {'left', 'mid', 'right'}, optional   设置条形边界值的对齐方式,默认mid,还有left and right
        Controls how the histogram is plotted.
    
            - 'left': bars are centered on the left bin edges.
    
            - 'mid': bars are centered between the bin edges.
    
            - 'right': bars are centered on the right bin edges.
    
        Default is 'mid'
    
    orientation : {'horizontal', 'vertical'}, optional     设置直方图的摆放方向,默认为垂直方向
        If 'horizontal', `~matplotlib.pyplot.barh` will be used for
        bar-type histograms and the *bottom* kwarg will be the left edges.
    
    rwidth : scalar or None, optional   设置直方图条形宽的的百分比
        The relative width of the bars as a fraction of the bin width.  If
        `None`, automatically compute the width.
    
        Ignored if `histtype` is 'step' or 'stepfilled'.
    
        Default is ``None``
    
    log : boolean, optional  对数据是否进行log变换
        If `True`, the histogram axis will be set to a log scale. If `log`
        is `True` and `x` is a 1D array, empty bins will be filtered out
        and only the non-empty (`n`, `bins`, `patches`) will be returned.
    
        Default is ``False``
    
    color : color or array_like of colors or None, optional   直方图的填充色
        Color spec or sequence of color specs, one per dataset.  Default
        (`None`) uses the standard line color sequence.
    
        Default is ``None``
    
    label : string or None, optional  直方图的标签,可通过legend展示其图例
        String, or sequence of strings to match multiple datasets.  Bar
        charts yield multiple patches per dataset, but only the first gets
        the label, so that the legend command will work as expected.
    
        default is ``None``
    
    stacked : boolean, optional 当有多个数据时,是否需要将直方图堆叠摆放,默认水平摆放
        If `True`, multiple data are stacked on top of each other If
        `False` multiple data are aranged side by side if histtype is
        'bar' or on top of each other if histtype is 'step'
    
        Default is ``False``
     
     facecolor: 直方图颜色
     alpha: 条形透明度
    

    借用Titanic的年龄数据来做例子:

    import pandas as pd
    data = pd.read_csv(r"G:KaggleTitanic	rain.csv")
    
    data.Age.isnull().sum()
    
    177
    

    存在177个缺失值,我们选择删除掉这些样本。

    data.dropna(subset=["Age"], inplace=True)
    
    plt.hist(data.Age)  #普通版
    plt.show()
    

    png

    下面逐步美化一下

    plt.style.use("ggplot")
    plt.hist(data.Age, bins=20, color="#FF7373" )     
    plt.show()
    

    png

    界限不清晰,再加一个边界颜色看看效果。下面先画一个指定组数的,一个指定组距的直方图,根据需要使用。

    plt.hist(data.Age, bins=20, color="#FF7373",  edgecolor="K")
    plt.title("乘客年龄分布直方图(指定20组)", fontsize=18)
    plt.xlabel("年龄", size=15)
    plt.ylabel("频数", size=15)
    plt.show()
    

    png

    plt.hist(data.Age, bins=np.arange(data.Age.min(),data.Age.max(),5), color="#FF7373",  edgecolor="K")
    plt.title("乘客年龄直方图(指定组距为5)",fontsize=18)
    plt.xlabel("年龄", size=15)
    plt.ylabel("频数", size=15)
    plt.show()
    

    png

    看到年龄分布有点接近正态分布

    下面绘制个累计频率直方图

    import numpy as np
    plt.figure(figsize=(5,6))
    plt.hist(data.Age, bins=np.arange(data.Age.min(),data.Age.max(),5), normed=True, cumulative=True, color="#FF7373",  edgecolor="K")
    plt.title("乘客年龄累计频率直方图",fontsize=18)
    plt.xlabel("年龄", size=15)
    plt.ylabel("频数", size=15)
    plt.show()
    

    png

    从上图看出乘客年龄的分布情况,接近70%的乘客年龄在35岁以下,接近80%的乘客年龄在40岁以下。

    下面给直方图添加两条曲线:正态分布曲线and核密度曲线。一般我们用核密度曲线,正态分布曲线与现实一般都是不一样的,只是理论上的曲线。

    import matplotlib.mlab as mlab
    plt.figure(figsize=(5.5,5.5))
    plt.hist(data.Age, bins = np.arange(data.Age.min(),data.Age.max(), 5), normed = True, color="#FF7373", edgecolor = "k")#画出原数据的频率直方图
    x1 = np.linspace(data.Age.min(), data.Age.max(), 1000)    #生成正态分布的数据
    normal = mlab.normpdf(x1, data.Age.mean(), data.Age.std())  
    line1, = plt.plot(x1,normal, "b--", linewidth = 3)        #画出正态分布曲线
    kde = mlab.GaussianKDE(data.Age)    #生存核密度数据
    x2 = np.linspace(data.Age.min(), data.Age.max(), 1000)
    line2, = plt.plot(x2,kde(x2), "g-", linewidth = 3, alpha=0.6)     #画出核密度曲线
    plt.legend([line1, line2],[ "正态分布曲线", "核密度曲线"],loc= "best")  #生成图例
    plt.title("乘客年龄分布直方图", fontsize=18)
    plt.xlabel("年龄", size=15)
    plt.ylabel("频率", size=15)
    plt.show()
    

    png

    乘客的年龄不服从正态分布,从核密度曲线看出年龄的实际分布情况。

  • 相关阅读:
    隐藏虚拟网卡
    Eclipse3.2编码选中对象着色
    PHP里的字符串定义小技巧汇总
    【原创】交互型网页防止IP欺骗的技巧
    VS2005的报错让我“二”了一把
    【原创】利用PHP5的__autoload代替繁琐低效的的外部文件包含方式
    关于WebDataWindow.Net的一些开发小细节
    PHP效率损失操作汇总
    动态添加按钮及关联方法(带参数)
    GridView中模版列使用RowCommand事件如何得到当前列的行索引或记录ID
  • 原文地址:https://www.cnblogs.com/wyy1480/p/9672301.html
Copyright © 2020-2023  润新知