主要讲述直方图与kernel density estimation,参考维基百科中的经典论述,从直方图和核密度估计的实现对比来说明这两种经典的非参数密度估计方法,具体的细节不做深入剖析。
In statistics, kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable. Kernel density estimation is a fundamental data smoothing problem where inferences about the population are made, based on a finite data sample. In some fields such as signal processing and econometrics it is also termed the Parzen–Rosenblatt window method, after Emanuel Parzen and Murray Rosenblatt, who are usually credited with independently creating it in its current form.
Let (x1, x2, …, xn) be an independent and identically distributed sample drawn from some distribution with an unknown density ƒ. We are interested in estimating the shape of this function ƒ. Its kernel density estimator is
where
A range of kernel functions are commonly used: uniform, triangular, biweight, triweight, Epanechnikov, normal, and others. The Epanechnikov kernel is optimal in a mean square error sense,[3] though the loss of efficiency is small for the kernels listed previously,[4] and due to its convenient mathematical properties, the normal kernel is often used
The construction of a kernel density estimate finds interpretations in fields outside of density estimation. For example, in thermodynamics, this is equivalent to the amount of heat generated when heat kernels (the fundamental solution to the heat equation) are placed at each data point locations xi. Similar methods are used to construct discrete Laplace operators on point clouds for manifold learning.
Kernel density estimates are closely related to histograms, but can be endowed with properties such as smoothness or continuity by using a suitable kernel. To see this, we compare the construction of histogram and kernel density estimators, using these 6 data points:
For the kernel density estimate, we place a normal kernel with variance 2.25 (indicated by the red dashed lines) on each of the data points x_i. The kernels are summed to make the kernel density estimate (solid blue curve). The smoothness of the kernel density estimate is evident compared to the discreteness of the histogram, as kernel density estimates converge faster to the true underlying density for continuous random variables.
我们通过这个例子来分析一下直方图方法与核密度估计方法的异同之处:我们利用下面的6个数据点来做:
直方图的方法是将
核密度估计的方法生成的密度估计是一个连续的光滑的曲线,具体的方法是在对应的数据点上放置一个kernel函数,然后将所有数据点上的kernel叠加在一起,就可以构成一个光滑的密度函数。以上面的样本为例,上图右边的图中红色的虚线表示的是在对应6个数据点上放置的kernel函数,而蓝色的线代表的就是将所有的kernel函数叠加在一起构成的密度函数。
从两者的对比来看,直方图生成的离散的密度估计,而kernel则等效于将直方图采用kernel函数进行了平滑。
2015-8-28
艺少