• Machine Learning with sklearn (8): Feature Engineering (1), Feature Discretization (1): K-bins Discretization


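    Discretization (also known as quantization or binning) provides a way to partition continuous features into discrete values. Some datasets with continuous features benefit from discretization, because it can transform a dataset with continuous attributes into one with only nominal attributes. (Translator's note: nominal attributes are essentially categorical features.)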

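    One-hot encoded discretized features can make a model more expressive while preserving its interpretability. For example, pre-processing with a discretizer can introduce non-linearity into a linear model.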
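
    As a rough illustration of the non-linearity point, the sketch below fits the same linear model twice: once on a raw feature and once on one-hot encoded bins of that feature. The sine-wave data, the choice of 10 uniform bins, and the variable names are assumptions made for this example only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic 1-D data (an assumption for illustration): a noiseless sine wave.
X = np.linspace(0.0, 2.0 * np.pi, 200).reshape(-1, 1)
y = np.sin(X).ravel()

# Plain linear regression on the raw feature.
linear = LinearRegression().fit(X, y)

# The same linear model, fitted on one-hot encoded bins of the feature.
binned = make_pipeline(
    KBinsDiscretizer(n_bins=10, encode="onehot-dense", strategy="uniform"),
    LinearRegression(),
).fit(X, y)

r2_linear = linear.score(X, y)
r2_binned = binned.score(X, y)
print(f"R^2 on raw feature:    {r2_linear:.3f}")
print(f"R^2 on binned feature: {r2_binned:.3f}")
```

    The binned model is a step function (one constant per bin), so it can follow the curve in a way a single straight line cannot, and each learned coefficient still reads as "the offset for this interval".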

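    The KBinsDiscretizer class discretizes each feature into k bins. Note that with the default strategy='quantile' the bins are equally populated rather than equal-width; equal-width bins require strategy='uniform'.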

    >>> import numpy as np
    >>> from sklearn import preprocessing
    >>> X = np.array([[ -3., 5., 15 ],
    ...               [  0., 6., 14 ],
    ...               [  6., 3., 11 ]])
    >>> est = preprocessing.KBinsDiscretizer(n_bins=[3, 2, 2], encode='ordinal').fit(X)

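    By default, the output is one-hot encoded into a sparse matrix (see Encoding categorical features); this can be configured with the encode parameter. For each feature, the bin edges and the total number of bins are computed during fit, and together they define the intervals. For the current example, the intervals are: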

    • Feature 1: (-∞, -1), [-1, 2), [2, ∞)
    • Feature 2: (-∞, 5), [5, ∞)
    • Feature 3: (-∞, 14), [14, ∞)

    Based on these bin intervals, X is transformed as follows:

    >>> est.transform(X)                      
    array([[ 0., 1., 1.],
           [ 1., 1., 1.],
           [ 2., 0., 0.]])
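
    The mapping from values to bin indices can be checked by hand. The following is a minimal sketch that uses np.digitize with the interior edges from the intervals listed above; it is not how KBinsDiscretizer is implemented internally, only a way to verify the result.

```python
import numpy as np

X = np.array([[-3.0, 5.0, 15.0],
              [ 0.0, 6.0, 14.0],
              [ 6.0, 3.0, 11.0]])

# Interior bin edges per feature, copied from the intervals listed above.
interior_edges = [
    np.array([-1.0, 2.0]),  # feature 1: three bins
    np.array([5.0]),        # feature 2: two bins
    np.array([14.0]),       # feature 3: two bins
]

# np.digitize(x, edges) returns i such that edges[i-1] <= x < edges[i],
# which matches left-closed, right-open intervals.
codes = np.column_stack([
    np.digitize(X[:, j], edges) for j, edges in enumerate(interior_edges)
])
print(codes)
# [[0 1 1]
#  [1 1 1]
#  [2 0 0]]
```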

    The resulting dataset contains ordinal attributes, which can be used further in a sklearn.pipeline.Pipeline.

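    Discretization is similar to building a histogram for continuous data. However, a histogram focuses on counting how many values fall into each bin, whereas discretization focuses on assigning a bin to each feature value.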
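
    The difference between the two views can be shown with plain NumPy. The values and edges below are made up for illustration.

```python
import numpy as np

values = np.array([0.5, 1.2, 1.8, 2.5, 3.1, 3.9])
edges = np.array([0.0, 1.0, 2.0, 3.0, 4.0])  # four equal-width bins

# Histogram view: how many values fall into each bin.
counts, _ = np.histogram(values, bins=edges)

# Discretization view: which bin each individual value is assigned to.
bin_ids = np.digitize(values, edges[1:-1])

print(counts)   # [1 2 1 2]
print(bin_ids)  # [0 1 1 2 3 3]
```

    Counting the bin ids (for example with np.bincount) recovers the histogram, but the reverse is not possible: the histogram no longer knows which sample went where.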

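    The KBinsDiscretizer class implements several binning strategies, selected with the strategy parameter. The 'uniform' strategy uses constant-width bins. The 'quantile' strategy uses the quantile values of each feature so that the bins are equally populated. The 'kmeans' strategy defines bins from a k-means clustering procedure performed on each feature independently.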
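
    To see how the strategies differ, the sketch below fits all three on one skewed feature and prints the resulting bin edges. The data values are made up for illustration, and the exact 'quantile' and 'kmeans' edges may vary between scikit-learn versions, so no output is shown for them.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# A single skewed feature (made-up values for illustration).
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 100.0]).reshape(-1, 1)

edges_by_strategy = {}
for strategy in ("uniform", "quantile", "kmeans"):
    est = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy=strategy)
    est.fit(X)
    # bin_edges_ holds one array of edges per feature; take the only feature.
    edges_by_strategy[strategy] = est.bin_edges_[0]
    print(f"{strategy:>8}: {np.round(est.bin_edges_[0], 2)}")
```

    With 'uniform', the edges are evenly spaced between the minimum and maximum (0, 25, 50, 75, 100 here), so most samples end up in the first bin; the other two strategies adapt the edges to where the data actually lies.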

    API reference and example

    class sklearn.preprocessing.KBinsDiscretizer(n_bins=5, *, encode='onehot', strategy='quantile', dtype=None)

    Bin continuous data into intervals.

    Read more in the User Guide.

    New in version 0.20.

    Parameters
    n_bins : int or array-like of shape (n_features,), default=5

    The number of bins to produce. Raises ValueError if n_bins < 2.

    encode : {'onehot', 'onehot-dense', 'ordinal'}, default='onehot'

    Method used to encode the transformed result.

    onehot

    Encode the transformed result with one-hot encoding and return a sparse matrix. Ignored features are always stacked to the right.

    onehot-dense

    Encode the transformed result with one-hot encoding and return a dense array. Ignored features are always stacked to the right.

    ordinal

    Return the bin identifier encoded as an integer value.
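
    A short sketch of what each encode option returns. The toy data and the use of strategy='uniform' are assumptions chosen to keep the result easy to check by hand.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0.0, 10.0],
              [1.0, 20.0],
              [2.0, 30.0],
              [3.0, 40.0]])

outputs = {}
for encode in ("onehot", "onehot-dense", "ordinal"):
    est = KBinsDiscretizer(n_bins=3, encode=encode, strategy="uniform")
    outputs[encode] = est.fit_transform(X)
    # Print the container type and shape produced by each encoding.
    print(encode, type(outputs[encode]).__name__, outputs[encode].shape)

# The sparse and dense one-hot outputs hold the same values.
print(sparse.issparse(outputs["onehot"]))
print(np.array_equal(outputs["onehot"].toarray(), outputs["onehot-dense"]))
```

    With 2 features and 3 bins each, both one-hot variants have 6 columns, while 'ordinal' keeps one column per feature.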

    strategy : {'uniform', 'quantile', 'kmeans'}, default='quantile'

    Strategy used to define the widths of the bins.

    uniform

    All bins in each feature have identical widths.

    quantile

    All bins in each feature have the same number of points.

    kmeans

    Values in each bin have the same nearest center of a 1D k-means cluster.

    dtype : {np.float32, np.float64}, default=None

    The desired data-type for the output. If None, output dtype is consistent with input dtype. Only np.float32 and np.float64 are supported.

    New in version 0.24.

    Attributes
    n_bins_ : ndarray of shape (n_features,), dtype=np.int_

    Number of bins per feature. Bins whose width is too small (i.e., <= 1e-8) are removed with a warning.

    bin_edges_ : ndarray of ndarray of shape (n_features,)

    The edges of each bin. Contains one array per feature, of shape (n_bins_[i] + 1,), so the shapes can vary between features. Ignored features will have empty arrays.
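
    The fitted attributes can be inspected directly. The sketch below uses a different number of bins per feature; the toy data and strategy='uniform' are assumptions chosen so that the edges are easy to verify.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0.0, 10.0],
              [1.0, 20.0],
              [2.0, 30.0],
              [3.0, 40.0]])

# A different number of bins per feature: 3 for the first, 2 for the second.
est = KBinsDiscretizer(n_bins=[3, 2], encode="ordinal", strategy="uniform").fit(X)

print(est.n_bins_)  # number of bins actually used per feature
for j, edges in enumerate(est.bin_edges_):
    # Each feature has one more edge than it has bins.
    print(j, edges, len(edges) == est.n_bins_[j] + 1)
```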

    Methods

    fit(X[, y])

    Fit the estimator.

    fit_transform(X[, y])

    Fit to data, then transform it.

    get_params([deep])

    Get parameters for this estimator.

    inverse_transform(Xt)

    Transform discretized data back to original feature space.

    set_params(**params)

    Set the parameters of this estimator.

    transform(X)

    Discretize the data.

    Examples

    >>> from sklearn.preprocessing import KBinsDiscretizer
    >>> X = [[-2, 1, -4,   -1],
    ...      [-1, 2, -3, -0.5],
    ...      [ 0, 3, -2,  0.5],
    ...      [ 1, 4, -1,    2]]
    >>> est = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
    >>> est.fit(X)
    KBinsDiscretizer(...)
    >>> Xt = est.transform(X)
    >>> Xt  
    array([[ 0., 0., 0., 0.],
           [ 1., 1., 1., 0.],
           [ 2., 2., 2., 1.],
           [ 2., 2., 2., 2.]])

    Sometimes it may be useful to convert the data back into the original feature space. The inverse_transform function converts the binned data into the original feature space. Each value will be equal to the mean of the two bin edges.

    >>> est.bin_edges_[0]
    array([-2., -1.,  0.,  1.])
    >>> est.inverse_transform(Xt)
    array([[-1.5,  1.5, -3.5, -0.5],
           [-0.5,  2.5, -2.5, -0.5],
           [ 0.5,  3.5, -1.5,  0.5],
           [ 0.5,  3.5, -1.5,  1.5]])
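
    The "mean of the two bin edges" rule can be reproduced by hand from bin_edges_. This is a minimal sketch for checking the result on the example data; the variable names manual and centers are introduced here for illustration.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[-2, 1, -4, -1],
              [-1, 2, -3, -0.5],
              [0, 3, -2, 0.5],
              [1, 4, -1, 2]], dtype=float)

est = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform").fit(X)
Xt = est.transform(X)

# Rebuild the inverse by hand: bin center = mean of the bin's two edges.
manual = np.empty_like(Xt)
for j, edges in enumerate(est.bin_edges_):
    centers = (edges[:-1] + edges[1:]) / 2.0
    manual[:, j] = centers[Xt[:, j].astype(int)]

print(np.allclose(manual, est.inverse_transform(Xt)))  # True
```

    Because every value in a bin maps back to the same center, the round trip is lossy: the last two rows of the original X differ, but their reconstructions agree in the first three columns.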
  • Original article: https://www.cnblogs.com/qiu-hua/p/14903384.html