• Machine Learning with sklearn (20): Feature Engineering (11) Feature Encoding (5) Categorical Feature Encoding (3) One-Hot Encoding with OneHotEncoder


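    Another encoding that converts nominal features into a form usable by scikit-learn models is one-of-K, also called one-hot or dummy encoding. It is implemented in the OneHotEncoder class, which transforms each categorical feature with n_categories possible values into a binary feature vector of length n_categories in which exactly one position is 1 and all others are 0.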
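    To make the definition concrete, here is a minimal hand-written sketch of one-hot encoding for a single column. The helper name one_hot_column is hypothetical and only illustrates the idea; it is not how OneHotEncoder is implemented.

```python
import numpy as np

def one_hot_column(values):
    """One-hot encode one categorical column (illustrative helper, not part of scikit-learn)."""
    categories = sorted(set(values))                     # one output column per unique value
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)))
    for row, value in enumerate(values):
        encoded[row, index[value]] = 1.0                 # exactly one 1 per row
    return categories, encoded

categories, encoded = one_hot_column(['male', 'female', 'female'])
print(categories)
print(encoded)
```

    Each row contains a single 1 in the column of its category, so every row sums to 1.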

    Continuing the example above:

    >>> from sklearn import preprocessing
    >>> enc = preprocessing.OneHotEncoder()
    >>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
    >>> enc.fit(X)
    OneHotEncoder()
    >>> enc.transform([['female', 'from US', 'uses Safari'],
    ...                ['male', 'from Europe', 'uses Safari']]).toarray()
    array([[1., 0., 0., 1., 0., 1.],
           [0., 1., 1., 0., 0., 1.]])

    By default, the values each feature can take are inferred automatically from the dataset. They can be found in the categories_ attribute:

    >>> enc.categories_
    [array(['female', 'male'], dtype=object), array(['from Europe', 'from US'], dtype=object), array(['uses Firefox', 'uses Safari'], dtype=object)]
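    The output columns follow the order of categories_, feature by feature. As a rough sketch, flattening that list gives one label per output column; the variable names below are illustrative.

```python
from sklearn.preprocessing import OneHotEncoder

X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc = OneHotEncoder().fit(X)

# Flatten categories_ in feature order: one label per column of the encoded matrix.
column_labels = [str(cat) for cats in enc.categories_ for cat in cats]
n_columns = enc.transform(X).shape[1]

print(column_labels)
print(n_columns)
```

    The number of labels matches the width of the transformed matrix, which is the sum of the category counts across features.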

    The categories can also be specified explicitly with the categories parameter. Our dataset has two genders, four possible continents and four web browsers:

    >>> genders = ['female', 'male']
    >>> locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
    >>> browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']
    >>> enc = preprocessing.OneHotEncoder(categories=[genders, locations, browsers])
    >>> # Note that some categories of the 2nd and 3rd features
    >>> # do not appear in the training data
    >>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
    >>> enc.fit(X)
    OneHotEncoder(categories=[...])
    >>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
    array([[1., 0., 0., 1., 0., 0., 1., 0., 0., 0.]])
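    One practical effect of passing categories explicitly is that the output width no longer depends on which values happen to occur in the training data. A small sketch comparing the two settings on the same data:

```python
from sklearn.preprocessing import OneHotEncoder

genders = ['female', 'male']
locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]

auto_enc = OneHotEncoder().fit(X)                                         # categories inferred from X
fixed_enc = OneHotEncoder(categories=[genders, locations, browsers]).fit(X)

auto_width = auto_enc.transform(X).shape[1]    # 2 + 2 + 2 columns
fixed_width = fixed_enc.transform(X).shape[1]  # 2 + 4 + 4 columns

print(auto_width, fixed_width)
```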

    If the training data may be missing some categories, it is usually better to specify handle_unknown='ignore' than to set the categories manually as above. When handle_unknown='ignore' is specified and an unknown category is encountered during transform, no error is raised, and the one-hot encoded columns for that feature are all zeros (handle_unknown='ignore' is only supported for one-hot encoding):

    >>> enc = preprocessing.OneHotEncoder(handle_unknown='ignore')
    >>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
    >>> enc.fit(X)
    OneHotEncoder(handle_unknown='ignore')
    >>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
    array([[1., 0., 0., 0., 0., 0.]])
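    For contrast, the sketch below runs the same unseen sample through the default encoder and through one with handle_unknown='ignore', then maps the all-zero blocks back with inverse_transform. The variable names are illustrative.

```python
from sklearn.preprocessing import OneHotEncoder

X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
unseen = [['female', 'from Asia', 'uses Chrome']]

# Default behaviour (handle_unknown='error'): unknown categories raise ValueError.
strict_enc = OneHotEncoder().fit(X)
try:
    strict_enc.transform(unseen)
    raised = False
except ValueError:
    raised = True

# With 'ignore', unknown categories become all-zero blocks instead.
lenient_enc = OneHotEncoder(handle_unknown='ignore').fit(X)
encoded = lenient_enc.transform(unseen).toarray()
decoded = lenient_enc.inverse_transform(encoded)  # unknown entries come back as None

print(raised)
print(encoded)
print(decoded)
```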

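    The drop parameter can also be used to encode each column into n_categories - 1 columns instead of n_categories columns. It lets the user specify, for each feature, one category to drop. This helps avoid collinearity in the input matrix for some models. It is useful, for example, with unregularized regression (linear regression), where collinearity makes the covariance matrix non-invertible. When this parameter is not None, handle_unknown must be set to 'error':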

    >>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
    >>> drop_enc = preprocessing.OneHotEncoder(drop='first').fit(X)
    >>> drop_enc.categories_
    [array(['female', 'male'], dtype=object), array(['from Europe', 'from US'], dtype=object), array(['uses Firefox', 'uses Safari'], dtype=object)]
    >>> drop_enc.transform(X).toarray()
    array([[1., 1., 1.],
           [0., 0., 0.]])
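    The collinearity argument can be checked numerically. In the sketch below (a toy dataset chosen for illustration), the full one-hot columns of each feature sum to the intercept column, so the design matrix loses rank, while the drop='first' version has full column rank.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = [['male', 'from US'], ['female', 'from Europe'],
     ['female', 'from US'], ['male', 'from Europe']]

full = OneHotEncoder().fit_transform(X).toarray()
reduced = OneHotEncoder(drop='first').fit_transform(X).toarray()

def with_intercept(matrix):
    """Prepend a column of ones, as an unregularized linear model with intercept would."""
    return np.hstack([np.ones((matrix.shape[0], 1)), matrix])

full_design = with_intercept(full)        # intercept + 4 one-hot columns
reduced_design = with_intercept(reduced)  # intercept + 2 one-hot columns

full_rank = np.linalg.matrix_rank(full_design)
reduced_rank = np.linalg.matrix_rank(reduced_design)

print(full_design.shape[1], full_rank)
print(reduced_design.shape[1], reduced_rank)
```

    A rank below the number of columns means the matrix product of the design matrix with its transpose is singular, which is the non-invertibility mentioned above.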

    Categorical features are sometimes represented as dicts rather than scalars; see Loading features from dicts for details.

    class sklearn.preprocessing.OneHotEncoder(*, categories='auto', drop=None, sparse=True, dtype=<class 'numpy.float64'>, handle_unknown='error')

    Encode categorical features as a one-hot numeric array.

    The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for each category and returns a sparse matrix or dense array (depending on the sparse parameter).

    By default, the encoder derives the categories based on the unique values in each feature. Alternatively, you can also specify the categories manually.

    This encoding is needed for feeding categorical data to many scikit-learn estimators, notably linear models and SVMs with the standard kernels.

    Note: a one-hot encoding of y labels should use a LabelBinarizer instead.
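    As a brief illustration of that note, the following sketch binarizes a small made-up label vector with LabelBinarizer:

```python
from sklearn.preprocessing import LabelBinarizer

y = ['cat', 'dog', 'bird', 'cat']  # toy target labels, for illustration only

lb = LabelBinarizer()
y_encoded = lb.fit_transform(y)    # one column per class, classes in sorted order

print(lb.classes_)
print(y_encoded)
```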

    Read more in the User Guide.

    Parameters
    categories : ‘auto’ or a list of array-like, default=’auto’

    Categories (unique values) per feature:

    • ‘auto’ : Determine categories automatically from the training data.

    • list : categories[i] holds the categories expected in the ith column. The passed categories should not mix strings and numeric values within a single feature, and should be sorted in case of numeric values.

    The used categories can be found in the categories_ attribute.

    New in version 0.20.

    drop : {‘first’, ‘if_binary’} or an array-like of shape (n_features,), default=None

    Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into a neural network or an unregularized regression.

    However, dropping one category breaks the symmetry of the original representation and can therefore induce a bias in downstream models, for instance for penalized linear classification or regression models.

    • None : retain all features (the default).

    • ‘first’ : drop the first category in each feature. If only one category is present, the feature will be dropped entirely.

    • ‘if_binary’ : drop the first category in each feature with two categories. Features with 1 or more than 2 categories are left intact.

    • array : drop[i] is the category in feature X[:, i] that should be dropped.

    New in version 0.21: The parameter drop was added in 0.21.

    Changed in version 0.23: The option drop='if_binary' was added in 0.23.

    sparse : bool, default=True

    If True, transform returns a sparse matrix; otherwise it returns a dense array.

    dtype : number type, default=float

    Desired dtype of output.

    handle_unknown : {‘error’, ‘ignore’}, default=’error’

    Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
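    A short sketch of how the output-related parameters show up in practice: the sparse setting is left at its default here, and dtype is set to an integer type purely for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

X = [['Male', 1], ['Female', 3], ['Female', 2]]

enc = OneHotEncoder(dtype=np.int64)  # integer output instead of the default float
encoded = enc.fit_transform(X)

is_sparse = sparse.issparse(encoded)  # the default output is a SciPy sparse matrix
dense = encoded.toarray()             # convert to a dense NumPy array when needed

print(is_sparse)
print(dense.dtype)
print(dense)
```

    Keeping the sparse output is usually the cheaper choice when features have many categories, since most entries are zero.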

    Attributes
    categories_ : list of arrays

    The categories of each feature determined during fitting (in order of the features in X and corresponding with the output of transform). This includes the category specified in drop (if any).

    drop_idx_ : array of shape (n_features,)
    • drop_idx_[i] is the index in categories_[i] of the category to be dropped for each feature.

    • drop_idx_[i] = None if no category is to be dropped from the feature with index i, e.g. when drop='if_binary' and the feature isn’t binary.

    • drop_idx_ = None if all the transformed features will be retained.

    Changed in version 0.23: Added the possibility to contain None values.
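    The following sketch inspects drop_idx_ for two drop settings, assuming a scikit-learn version that supports drop='if_binary' (0.23 or later, per the note above). The variable names are illustrative.

```python
from sklearn.preprocessing import OneHotEncoder

X = [['Male', 1], ['Female', 3], ['Female', 2]]

first_enc = OneHotEncoder(drop='first').fit(X)
binary_enc = OneHotEncoder(drop='if_binary').fit(X)

# Look up the dropped category of each feature through its index in categories_.
first_dropped = [cats[idx] for cats, idx in zip(first_enc.categories_, first_enc.drop_idx_)]

# With 'if_binary', the non-binary second feature keeps all categories (entry is None).
binary_idx = list(binary_enc.drop_idx_)

print(first_dropped)
print(binary_idx)
```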

    Examples

    Given a dataset with two features, we let the encoder find the unique values per feature and transform the data to a binary one-hot encoding.

    >>> from sklearn.preprocessing import OneHotEncoder

    One can discard categories not seen during fit:

    >>> enc = OneHotEncoder(handle_unknown='ignore')
    >>> X = [['Male', 1], ['Female', 3], ['Female', 2]]
    >>> enc.fit(X)
    OneHotEncoder(handle_unknown='ignore')
    >>> enc.categories_
    [array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
    >>> enc.transform([['Female', 1], ['Male', 4]]).toarray()
    array([[1., 0., 1., 0., 0.],
           [0., 1., 0., 0., 0.]])
    >>> enc.inverse_transform([[0, 1, 1, 0, 0], [0, 0, 0, 1, 0]])
    array([['Male', 1],
           [None, 2]], dtype=object)
    >>> enc.get_feature_names(['gender', 'group'])
    array(['gender_Female', 'gender_Male', 'group_1', 'group_2', 'group_3'],
      dtype=object)

    One can always drop the first column for each feature:

    >>> drop_enc = OneHotEncoder(drop='first').fit(X)
    >>> drop_enc.categories_
    [array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
    >>> drop_enc.transform([['Female', 1], ['Male', 2]]).toarray()
    array([[0., 0., 0.],
           [1., 1., 0.]])

    Or drop a column only for features that have exactly 2 categories:

    >>> drop_binary_enc = OneHotEncoder(drop='if_binary').fit(X)
    >>> drop_binary_enc.transform([['Female', 1], ['Male', 2]]).toarray()
    array([[0., 1., 0., 0.],
           [1., 0., 1., 0.]])

    Methods

    fit(X[, y])

    Fit OneHotEncoder to X.

    fit_transform(X[, y])

    Fit OneHotEncoder to X, then transform X.

    get_feature_names([input_features])

    Return feature names for output features.

    get_params([deep])

    Get parameters for this estimator.

    inverse_transform(X)

    Convert the data back to the original representation.

    set_params(**params)

    Set the parameters of this estimator.

    transform(X)

    Transform X using one-hot encoding.
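    As a rough usage sketch tying several of these methods together, the snippet below encodes a small dataset with fit_transform, recovers the original values with inverse_transform, and reads one setting back through get_params:

```python
from sklearn.preprocessing import OneHotEncoder

X = [['Male', 1], ['Female', 3], ['Female', 2]]

enc = OneHotEncoder()
encoded = enc.fit_transform(X)             # fit and transform in one step
restored = enc.inverse_transform(encoded)  # map one-hot rows back to the original values
params = enc.get_params()                  # constructor parameters as a dict

print(restored)
print(params['handle_unknown'])
```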

  • Original article: https://www.cnblogs.com/qiu-hua/p/14904617.html