Overview
Reference
A Gaussian Mixture Model (GMM) is a parametric probability density function represented as a weighted sum of Gaussian component densities. GMMs are commonly used as a parametric model of the probability distribution of continuous measurements or features in a biometric system, such as vocal-tract related spectral features in a speaker recognition system. GMM parameters are estimated from training data using the iterative Expectation-Maximization (EM) algorithm or Maximum A Posteriori (MAP) estimation from a well-trained prior model.
The Gaussian mixture model is a classic probabilistic/generative model, commonly used in pattern-recognition applications such as speaker recognition and speech recognition. Its parameters are usually estimated by maximum likelihood, implemented in practice with the Expectation-Maximization (EM) algorithm: the E-step computes each component's responsibility for every sample, and the M-step re-estimates the weights, means, and covariances from those responsibilities.
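For reference, the weighted-sum form described above can be written out explicitly (a minimal rendering in the notation of the quoted paper: $M$ components with weights $w_i$, means $\mu_i$, and covariances $\Sigma_i$):
$p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, g(x \mid \mu_i, \Sigma_i), \qquad \sum_{i=1}^{M} w_i = 1$
where $g(x \mid \mu_i, \Sigma_i)$ is a Gaussian density and $\lambda = \{w_i, \mu_i, \Sigma_i\}$ collects the model parameters.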
The sklearn.mixture module implements mixture modeling algorithms. It provides GaussianMixture and BayesianGaussianMixture, both of which inherit from BaseMixture.
GaussianMixture
Represents a Gaussian mixture probability distribution and estimates its parameters.
See sklearn.mixture.GaussianMixture and its source code.
class sklearn.mixture.GaussianMixture(n_components=1, *, covariance_type='full', tol=0.001, reg_covar=1e-06, max_iter=100, n_init=1, init_params='kmeans', weights_init=None, means_init=None, precisions_init=None, random_state=None, warm_start=False, verbose=0, verbose_interval=10)
Initialization
Initialize the GaussianMixture class. Usage:
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=20, max_iter=200, covariance_type='diag', n_init=3)
This GMM consists of 20 Gaussian components; the EM algorithm runs for up to 200 iterations during training; the covariance type is diag (each Gaussian component has its own diagonal covariance matrix); initialization is performed 3 times, and the best result is kept.
Parameter initialization (weights, means, precisions) defaults to the kmeans method. precisions_init defaults to None; when supplied, its shape is determined by covariance_type, e.g. (n_components, n_features) when covariance_type='diag'.
warm_start defaults to False.
Note: n_samples >= n_components is required.
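A minimal sketch on synthetic placeholder data showing the shapes of the fitted parameters for covariance_type='diag' (matching the configuration above):
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.RandomState(0).randn(500, 13)  # 500 samples, 13 synthetic features
gmm = GaussianMixture(n_components=20, max_iter=200, covariance_type='diag', n_init=3)
gmm.fit(X)

print(gmm.weights_.shape)      # (20,): mixture weights
print(gmm.means_.shape)        # (20, 13): (n_components, n_features)
print(gmm.covariances_.shape)  # (20, 13) for covariance_type='diag'
print(gmm.precisions_.shape)   # (20, 13): same shape as covariances_ for 'diag'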
.fit
Estimate model parameters with the EM algorithm. The method fits the model n_init times and sets the parameters with which the model has the largest likelihood or lower bound. Within each trial, the method iterates between E-step and M-step for max_iter times until the change of likelihood or lower bound is less than tol, otherwise, a ConvergenceWarning is raised.
If warm_start is True, then n_init is ignored and a single initialization is performed upon the first call. Upon consecutive calls, training starts where it left off.
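A small sketch of the warm_start behavior on synthetic placeholder data; with such a small max_iter, a ConvergenceWarning may be raised on each call:
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.RandomState(1).randn(300, 2)  # synthetic placeholder data
gmm = GaussianMixture(n_components=3, max_iter=5, warm_start=True, random_state=0)
gmm.fit(X)  # first call: single initialization, then up to 5 EM iterations
gmm.fit(X)  # consecutive call: EM resumes from the previous parameters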
See the source code. Usage:
gmm.fit(datas)
datas is array-like of shape (n_samples, n_features): a list of n_features-dimensional data points, where each row corresponds to a single data point.
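A minimal, self-contained fit example; the two synthetic clusters below stand in for real feature vectors. After fitting, the converged_ and n_iter_ attributes report how the EM run went:
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# two synthetic clusters standing in for real n_features-dimensional data
datas = np.vstack([rng.randn(200, 3) - 2.0, rng.randn(200, 3) + 2.0])

gmm = GaussianMixture(n_components=2, max_iter=100, random_state=0)
gmm.fit(datas)
print(gmm.converged_)  # True if EM reached tol within max_iter in the best run
print(gmm.n_iter_)     # number of EM iterations used by the best run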
.score
Compute the per-sample average log-likelihood of the given data X.
See the source code. Usage:
ll_score = gmm.score(test_datas)
test_datas is array-like of shape (n_samples, n_dimensions): a list of n_features-dimensional data points, where each row corresponds to a single data point. The returned ll_score is a float: the log-likelihood of the Gaussian mixture given test_datas.
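A short sketch on synthetic placeholder data showing the relationship between score and score_samples: score(X) is the mean of the per-sample log-likelihoods returned by score_samples(X):
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
train_datas = rng.randn(500, 4)  # synthetic training features
test_datas = rng.randn(100, 4)   # synthetic held-out features

gmm = GaussianMixture(n_components=5, random_state=0).fit(train_datas)
ll_score = gmm.score(test_datas)            # float: mean per-sample log-likelihood
per_sample = gmm.score_samples(test_datas)  # shape (100,): log-likelihood per sample
assert np.isclose(ll_score, per_sample.mean())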
preprocessing
Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).
For example, many elements used in the objective function of a learning algorithm (such as the RBF kernel of SVMs, or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance in the same order.
If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
The sklearn.preprocessing module includes scaling, centering, normalization, and binarization methods.
.scale
Before training a Gaussian mixture model, the features are usually standardized. This can be done with sklearn.preprocessing.scale, which standardizes a dataset along any axis: center to the mean and component-wise scale to unit variance.
See sklearn.preprocessing.scale and its source code. Usage:
from sklearn import preprocessing
X_tr = preprocessing.scale(X, axis=0, with_mean=True, with_std=True, copy=True)
Note that the input X is an array-like or sparse matrix of shape (n_samples, n_features): the data to center and scale.
The axis parameter selects the axis along which the means and standard deviations are computed: if axis=0, each feature is standardized independently; otherwise (axis=1), each sample is standardized.
The returned transformed data X_tr is an ndarray or sparse matrix of shape (n_samples, n_features).
Warning: risk of data leak
Do not use scale unless you know what you are doing. A common mistake is to apply it to the entire data before splitting into training and test sets. This will bias the model evaluation because information would have leaked from the test set to the training set. In general, we recommend using StandardScaler within a Pipeline in order to prevent most risks of data leaking: pipe = make_pipeline(StandardScaler(), LogisticRegression()).
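As a toy demonstration of the behavior described above (the array values are arbitrary), scale with the default axis=0 leaves each column with mean ~0 and standard deviation ~1:
import numpy as np
from sklearn import preprocessing

X = np.array([[1.0, -1.0, 2.0],
              [2.0,  0.0, 0.0],
              [0.0,  1.0, -1.0]])
X_tr = preprocessing.scale(X)  # default axis=0: standardize each feature (column)
print(X_tr.mean(axis=0))       # approximately [0. 0. 0.]
print(X_tr.std(axis=0))        # approximately [1. 1. 1.]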
Not used yet?
.StandardScaler
Standardize features by removing the mean and scaling to unit variance. Computed as: $z = (x - \mu)/\sigma$.
See sklearn.preprocessing.StandardScaler:
class sklearn.preprocessing.StandardScaler(*, copy=True, with_mean=True, with_std=True)
with_mean and with_std both default to True, so the scaler centers the input data and scales it to unit standard deviation.
Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using transform.
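A minimal sketch of this fit/transform pattern on synthetic placeholder data (the shapes and values are assumptions for illustration): the statistics are computed on the training set only and then reused on the test set, which also avoids the leakage warned about above:
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_train = rng.randn(300, 13) * 5.0 + 10.0  # synthetic training features
X_test = rng.randn(50, 13) * 5.0 + 10.0    # synthetic held-out features

scaler = StandardScaler().fit(X_train)  # mean/std estimated from training data only
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)   # reuses the stored training statistics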