Machine Learning with sklearn (8): Feature Engineering (1), Feature Discretization (1): K-bins Discretization

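Discretization (sometimes called quantization or binning) partitions continuous features into discrete values. Some datasets with continuous features benefit from discretization, because it turns a dataset of continuous attributes into one that has only nominal attributes. (Translator's note: nominal attributes are simply categorical features.)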

One-hot encoded discretized features can make a model more expressive while keeping it interpretable. For example, preprocessing with a discretizer can introduce non-linearity into a linear model.
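To make the non-linearity claim concrete, here is a minimal NumPy-only sketch. It fits ordinary least squares to a sine curve twice: once on the raw feature and once on a one-hot encoding of 10 equal-width bins. The data, the bin count, and all variable names are assumptions chosen for illustration; none of them come from the original text.

```python
import numpy as np

# Toy data (an assumption for illustration): a noiseless sine curve.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3.0, 3.0, size=200))
y = np.sin(x)

# Linear model on the raw feature (slope plus intercept).
A_raw = np.column_stack([x, np.ones_like(x)])
coef_raw, *_ = np.linalg.lstsq(A_raw, y, rcond=None)
mse_raw = np.mean((A_raw @ coef_raw - y) ** 2)

# Discretize x into 10 equal-width bins and one-hot encode the bin index.
n_bins = 10
edges = np.linspace(x.min(), x.max(), n_bins + 1)
codes = np.searchsorted(edges[1:-1], x, side="right")
A_binned = np.eye(n_bins)[codes]

# The same linear solver now learns one constant per bin,
# which is a piecewise-constant (non-linear) function of x.
coef_binned, *_ = np.linalg.lstsq(A_binned, y, rcond=None)
mse_binned = np.mean((A_binned @ coef_binned - y) ** 2)

print(f"MSE on raw feature:    {mse_raw:.4f}")
print(f"MSE on binned feature: {mse_binned:.4f}")
```

The model is still linear in its inputs; only the inputs changed. Each learned coefficient is the fitted value for one bin, which is also why the result stays easy to interpret.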

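The KBinsDiscretizer class discretizes features into k bins. The bins are equal-width only with strategy='uniform'; the example below uses the default strategy, 'quantile'.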

>>> import numpy as np
>>> from sklearn import preprocessing
>>> X = np.array([[ -3., 5., 15 ],
...               [  0., 6., 14 ],
...               [  6., 3., 11 ]])
>>> est = preprocessing.KBinsDiscretizer(n_bins=[3, 2, 2], encode='ordinal').fit(X)

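By default the output is one-hot encoded into a sparse matrix (see Encoding categorical features); this can be configured with the encode parameter. For each feature, the bin edges and the total number of bins are computed during fit, and together they define the intervals. For the current example, the intervals are:

Feature 1: [-∞,-1), [-1,2), [2,∞)
Feature 2: [-∞,5), [5,∞)
Feature 3: [-∞,14), [14,∞)

Based on these bin intervals, X is transformed as follows: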

>>> est.transform(X)                      
array([[ 0., 1., 1.],
       [ 1., 1., 1.],
       [ 2., 0., 0.]])
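The intervals and codes above can be reproduced by hand with NumPy. This is a sketch under the assumption that the 'quantile' strategy places edges at evenly spaced quantile levels and that a value equal to an edge falls into the bin on its right; it is not the library's own code.

```python
import numpy as np

X = np.array([[-3.0, 5.0, 15.0],
              [0.0, 6.0, 14.0],
              [6.0, 3.0, 11.0]])
n_bins = [3, 2, 2]

edges = []
codes = np.empty(X.shape, dtype=int)
for j, k in enumerate(n_bins):
    column = X[:, j]
    # k + 1 evenly spaced quantile levels delimit k bins.
    col_edges = np.quantile(column, np.linspace(0.0, 1.0, k + 1))
    edges.append(col_edges)
    # Only interior edges are used for lookup, so the first and last
    # bins effectively extend to -inf and +inf.
    codes[:, j] = np.searchsorted(col_edges[1:-1], column, side="right")

for j, col_edges in enumerate(edges):
    print(f"feature {j + 1} edges: {col_edges}")
print(codes)
```

Feature 1 gets interior edges at about -1 and 2, feature 2 at 5, and feature 3 at 14, matching the intervals listed above.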

The resulting dataset contains ordinal attributes, which can be used further in a sklearn.pipeline.Pipeline.
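As a rough illustration of the pipeline usage, the sketch below chains the discretizer with a decision tree. The labels y and the choice of DecisionTreeClassifier are assumptions made up for this example; the original text does not say which downstream estimator to use.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

# Same toy matrix as in the example above.
X = np.array([[-3.0, 5.0, 15.0],
              [0.0, 6.0, 14.0],
              [6.0, 3.0, 11.0]])
# Hypothetical labels, invented purely for illustration.
y = np.array([0, 0, 1])

# The discretizer emits ordinal bin codes, which the tree consumes directly.
pipeline = make_pipeline(
    KBinsDiscretizer(n_bins=[3, 2, 2], encode="ordinal"),
    DecisionTreeClassifier(random_state=0),
)
pipeline.fit(X, y)
predictions = pipeline.predict(X)
print(predictions)
```

Keeping the discretizer inside the pipeline means the bin edges are learned only from the data passed to fit, and the same edges are reused at prediction time.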

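Discretization is similar to building a histogram for continuous data. However, a histogram focuses on counting how many values fall into each bin, whereas discretization focuses on assigning each feature value to a bin.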
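The difference is easy to see with NumPy on a handful of made-up values (both the values and the edges below are assumptions for illustration): np.histogram returns one count per bin, while np.digitize returns one bin index per sample.

```python
import numpy as np

x = np.array([0.5, 1.5, 1.7, 2.2, 3.9])
edges = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

# Histogram view: how many samples landed in each of the 4 bins.
counts, _ = np.histogram(x, bins=edges)

# Discretization view: which bin each individual sample belongs to.
codes = np.digitize(x, edges[1:-1])

print(counts)  # one entry per bin
print(codes)   # one entry per sample
```

Counting the codes with np.bincount recovers the histogram, so the two views carry the same bin assignment at different levels of detail.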

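KBinsDiscretizer implements several binning strategies, selected with the strategy parameter. The 'uniform' strategy uses bins of constant width. The 'quantile' strategy uses quantile values of each feature so that every bin holds roughly the same number of samples. The 'kmeans' strategy defines bins from a k-means clustering run independently on each feature.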
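A small NumPy sketch shows how differently 'uniform' and 'quantile' behave on skewed data. The sample values and the helper name bin_counts are hypothetical, and the edge formulas are the textbook definitions rather than a copy of the library code; 'kmeans' is left out here.

```python
import numpy as np

# Skewed toy data (an assumption): eight small values and one outlier.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 100.0])
n_bins = 3

# 'uniform': equal-width bins between the minimum and the maximum.
uniform_edges = np.linspace(x.min(), x.max(), n_bins + 1)
# 'quantile': edges at evenly spaced quantile levels.
quantile_edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))


def bin_counts(values, edges):
    """Count samples per bin, using only the interior edges for lookup."""
    codes = np.searchsorted(edges[1:-1], values, side="right")
    return np.bincount(codes, minlength=len(edges) - 1)


uniform_counts = bin_counts(x, uniform_edges)
quantile_counts = bin_counts(x, quantile_edges)
print("uniform :", uniform_edges, uniform_counts)
print("quantile:", quantile_edges, quantile_counts)
```

With equal-width bins the single outlier stretches the range, so almost every sample ends up in the first bin and the middle bin stays empty. Quantile edges adapt to the data and spread the samples evenly.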


class sklearn.preprocessing.KBinsDiscretizer(n_bins=5, *, encode='onehot', strategy='quantile', dtype=None)

Bin continuous data into intervals.

Read more in the User Guide.

New in version 0.20.

Parameters

n_bins : int or array-like of shape (n_features,), default=5

The number of bins to produce. Raises ValueError if n_bins < 2.

encode : {'onehot', 'onehot-dense', 'ordinal'}, default='onehot'

Method used to encode the transformed result.

onehot: Encode the transformed result with one-hot encoding and return a sparse matrix. Ignored features are always stacked to the right.
onehot-dense: Encode the transformed result with one-hot encoding and return a dense array. Ignored features are always stacked to the right.
ordinal: Return the bin identifier encoded as an integer value.
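The relationship between the 'ordinal' and 'onehot-dense' encodings can be sketched for a single feature with NumPy. The codes below are hypothetical; with several features, the one-hot blocks of each feature would sit side by side as columns.

```python
import numpy as np

# Hypothetical ordinal codes for one feature discretized into 3 bins.
codes = np.array([0, 1, 2, 2])
n_bins = 3

# Row i of the identity matrix is the one-hot vector for bin i.
onehot_dense = np.eye(n_bins)[codes]
print(onehot_dense)

# The ordinal code can be recovered as the position of the 1 in each row.
recovered = onehot_dense.argmax(axis=1)
```

The 'onehot' option carries the same information in a sparse matrix, which saves memory when there are many bins.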

strategy : {'uniform', 'quantile', 'kmeans'}, default='quantile'

Strategy used to define the widths of the bins.

uniform: All bins in each feature have identical widths.
quantile: All bins in each feature have the same number of points.
kmeans: Values in each bin have the same nearest center of a 1D k-means cluster.
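The 'kmeans' option can be pictured with a plain NumPy version of 1D k-means. This is a simplified stand-in, not the library's implementation: the function name kmeans_bin_edges and the n_iter parameter are hypothetical, and starting the centers at the midpoints of equal-width bins and placing edges halfway between sorted centers are assumptions of this sketch.

```python
import numpy as np


def kmeans_bin_edges(column, n_bins, n_iter=100):
    """Sketch: derive bin edges for one feature from 1D k-means centers."""
    column = np.asarray(column, dtype=float)
    lo, hi = column.min(), column.max()
    # Assumed initialisation: midpoints of equal-width bins.
    uniform_edges = np.linspace(lo, hi, n_bins + 1)
    centers = (uniform_edges[:-1] + uniform_edges[1:]) / 2.0
    for _ in range(n_iter):
        # Assign every value to its nearest center.
        distances = np.abs(column[:, None] - centers[None, :])
        labels = distances.argmin(axis=1)
        new_centers = centers.copy()
        for k in range(n_bins):
            members = column[labels == k]
            if members.size:  # leave an empty cluster where it is
                new_centers[k] = members.mean()
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    centers = np.sort(centers)
    # An edge halfway between neighbouring centers gives every value in a bin
    # the same nearest center.
    inner = (centers[:-1] + centers[1:]) / 2.0
    return np.concatenate([[lo], inner, [hi]])


x = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0, 20.0, 21.0, 22.0])
kmeans_edges = kmeans_bin_edges(x, n_bins=3)
print(kmeans_edges)
```

On this toy data the centers settle at 1, 11 and 21, so the interior edges land at 6 and 16. Simple Lloyd iterations like these can get stuck with an empty cluster on less tidy data.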

dtype : {np.float32, np.float64}, default=None

The desired data-type for the output. If None, output dtype is consistent with input dtype. Only np.float32 and np.float64 are supported.

New in version 0.24.

Attributes

n_bins_ : ndarray of shape (n_features,), dtype=np.int_. Number of bins per feature. Bins whose width is too small (i.e., <= 1e-8) are removed with a warning.
bin_edges_ : ndarray of ndarray of shape (n_features,). The edges of each bin. Contains arrays of varying shapes (n_bins_, ). Ignored features will have empty arrays.

Methods

`fit`(X[, y])	Fit the estimator.
`fit_transform`(X[, y])	Fit to data, then transform it.
`get_params`([deep])	Get parameters for this estimator.
`inverse_transform`(Xt)	Transform discretized data back to original feature space.
`set_params`(**params)	Set the parameters of this estimator.
`transform`(X)	Discretize the data.

Examples

>>> from sklearn.preprocessing import KBinsDiscretizer
>>> X = [[-2, 1, -4,   -1],
...      [-1, 2, -3, -0.5],
...      [ 0, 3, -2,  0.5],
...      [ 1, 4, -1,    2]]
>>> est = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
>>> est.fit(X)
KBinsDiscretizer(...)
>>> Xt = est.transform(X)
>>> Xt  
array([[ 0., 0., 0., 0.],
       [ 1., 1., 1., 0.],
       [ 2., 2., 2., 1.],
       [ 2., 2., 2., 2.]])

Sometimes it may be useful to convert the data back into the original feature space. The inverse_transform function converts the binned data into the original feature space. Each value will be equal to the mean of the two bin edges.

>>> est.bin_edges_[0]
array([-2., -1.,  0.,  1.])
>>> est.inverse_transform(Xt)
array([[-1.5,  1.5, -3.5, -0.5],
       [-0.5,  2.5, -2.5, -0.5],
       [ 0.5,  3.5, -1.5,  0.5],
       [ 0.5,  3.5, -1.5,  1.5]])
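The first column of the output above can be checked by hand: each ordinal code is replaced by the midpoint of its bin. The sketch below uses the edges printed by est.bin_edges_[0] and the first column of Xt; the variable names are illustrative.

```python
import numpy as np

# Edges of feature 0 and its ordinal codes, copied from the example above.
edges = np.array([-2.0, -1.0, 0.0, 1.0])
codes = np.array([0, 1, 2, 2])

# Midpoint of each bin: the mean of its left and right edge.
centers = (edges[:-1] + edges[1:]) / 2.0

# Look up the midpoint for every code.
restored = centers[codes]
print(restored)
```

The last two samples, 0 and 1, fell into the same bin, so both map back to 0.5. The inverse is therefore only an approximation of the original values.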

Original article: https://www.cnblogs.com/qiu-hua/p/14903384.html
