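Another way to convert categorical features into features that scikit-learn estimators can use is one-of-K encoding, also known as one-hot or dummy encoding. This type of encoding is implemented in the OneHotEncoder class, which transforms each categorical feature with n_categories possible values into a binary feature vector of length n_categories, with exactly one entry equal to 1 and all others equal to 0.
Continuing the example above: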
>>> enc = preprocessing.OneHotEncoder()
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X)
OneHotEncoder(categorical_features=None, categories=None,
       dtype=<... 'numpy.float64'>, handle_unknown='error',
       n_values=None, sparse=True)
>>> enc.transform([['female', 'from US', 'uses Safari'],
...                ['male', 'from Europe', 'uses Safari']]).toarray()
array([[1., 0., 0., 1., 0., 1.],
       [0., 1., 1., 0., 0., 1.]])
By default, the values each feature can take are inferred automatically from the dataset and can be found in the categories_ attribute:
>>> enc.categories_
[array(['female', 'male'], dtype=object),
 array(['from Europe', 'from US'], dtype=object),
 array(['uses Firefox', 'uses Safari'], dtype=object)]
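To make the link between categories_ and the output columns concrete, here is a minimal pure-Python sketch of the idea. It is not scikit-learn's implementation: the helper names fit_categories and one_hot_row are hypothetical and exist only for this illustration, and the sketch assumes that the categories of each feature are simply the sorted unique values seen during fitting, as the output above suggests.

```python
def fit_categories(rows):
    """Collect the sorted unique values of each column (hypothetical helper)."""
    n_features = len(rows[0])
    return [sorted({row[j] for row in rows}) for j in range(n_features)]


def one_hot_row(row, categories):
    """Encode one sample as the concatenation of per-feature indicator blocks."""
    encoded = []
    for value, cats in zip(row, categories):
        # One slot per known category; the slot matching the value is set to 1.0.
        encoded.extend(1.0 if value == cat else 0.0 for cat in cats)
    return encoded


X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]
categories = fit_categories(X)
encoded = one_hot_row(['female', 'from US', 'uses Safari'], categories)
print(categories)
print(encoded)
```

Each feature contributes a block of columns as wide as its category list, which is why two features with two categories each plus a third with two categories give six output columns in the example above.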
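The categories can also be specified explicitly using the categories parameter. There are two genders, four possible continents and four web browsers in our dataset: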
>>> genders = ['female', 'male']
>>> locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
>>> browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']
>>> enc = preprocessing.OneHotEncoder(categories=[genders, locations, browsers])
>>> # Note that there are missing categorical values for the 2nd and 3rd
>>> # feature
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X)
OneHotEncoder(categorical_features=None,
       categories=[...], drop=None,
       dtype=<... 'numpy.float64'>, handle_unknown='error',
       n_values=None, sparse=True)
>>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
array([[1., 0., 0., 1., 0., 0., 1., 0., 0., 0.]])
If the training data might be missing some categories, it is often better to specify handle_unknown='ignore' than to set the categories manually as above. When handle_unknown='ignore' is specified and an unknown category is encountered during transform, no error is raised, but the one-hot encoded columns for that feature are all zeros (handle_unknown='ignore' is only supported for one-hot encoding):
>>> enc = preprocessing.OneHotEncoder(handle_unknown='ignore')
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X)
OneHotEncoder(categorical_features=None, categories=None, drop=None,
       dtype=<... 'numpy.float64'>, handle_unknown='ignore',
       n_values=None, sparse=True)
>>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
array([[1., 0., 0., 0., 0., 0.]])
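For contrast, the sketch below runs the same unseen sample through an encoder with the default handle_unknown='error' and through one with handle_unknown='ignore'. It assumes that the default setting raises a ValueError on an unknown category, which is the behaviour I am aware of in scikit-learn releases; the exact error message is not relied upon. The variable names are illustrative only.

```python
from sklearn.preprocessing import OneHotEncoder

X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]
unseen = [['female', 'from Asia', 'uses Chrome']]

# Default behaviour: unknown categories are rejected at transform time.
strict_enc = OneHotEncoder(handle_unknown='error').fit(X)
try:
    strict_enc.transform(unseen)
    raised = False
except ValueError:
    # 'from Asia' and 'uses Chrome' were not seen during fit.
    raised = True

# With 'ignore', the blocks of the unknown features are left as zeros.
lenient_enc = OneHotEncoder(handle_unknown='ignore').fit(X)
row = lenient_enc.transform(unseen).toarray()[0]
print(raised, row)
```

Note that an all-zero block is indistinguishable from "no category at all", so a downstream model cannot tell two different unknown values apart.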
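It is also possible to encode each column into n_categories - 1 columns instead of n_categories columns by using the drop parameter. This parameter lets the user specify, for each feature, a category to drop. This helps avoid collinearity in the input matrix for some estimators. It is useful, for example, with non-regularized regression (LinearRegression), where collinearity would make the covariance matrix non-invertible. When this parameter is not None, handle_unknown must be set to error: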
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> drop_enc = preprocessing.OneHotEncoder(drop='first').fit(X)
>>> drop_enc.categories_
[array(['female', 'male'], dtype=object),
 array(['from Europe', 'from US'], dtype=object),
 array(['uses Firefox', 'uses Safari'], dtype=object)]
>>> drop_enc.transform(X).toarray()
array([[1., 1., 1.],
       [0., 0., 0.]])
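The collinearity argument can be checked numerically. The following is a small NumPy sketch under stated assumptions: the data is a made-up design with two categorical features (two and three categories), encoded by hand rather than with OneHotEncoder, and the column choice in the "dropped" matrix mimics what drop='first' would keep.

```python
import numpy as np

# Hand-encoded samples: columns are [a, b | x, y, z], i.e. a full one-hot
# block for a 2-category feature followed by one for a 3-category feature.
full = np.array([[1., 0., 1., 0., 0.],   # (a, x)
                 [1., 0., 0., 1., 0.],   # (a, y)
                 [0., 1., 0., 0., 1.],   # (b, z)
                 [0., 1., 1., 0., 0.],   # (b, x)
                 [1., 0., 0., 0., 1.],   # (a, z)
                 [0., 1., 0., 1., 0.]])  # (b, y)

# Both blocks sum to a column of ones, so a + b == x + y + z and the
# columns are linearly dependent: the Gram matrix cannot be inverted.
rank_full = np.linalg.matrix_rank(full)
gram_rank = np.linalg.matrix_rank(full.T @ full)

# Keep only [b | y, z], dropping the first category of each feature.
dropped = full[:, [1, 3, 4]]
rank_dropped = np.linalg.matrix_rank(dropped)

print(full.shape[1], rank_full, gram_rank)
print(dropped.shape[1], rank_dropped)
```

The full encoding has five columns but rank four, while the reduced encoding has three columns and full column rank, which is the property an unregularized least-squares solver needs.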
Categorical features are sometimes represented as dicts rather than scalars; for that case, see Loading features from dicts.
class sklearn.preprocessing.OneHotEncoder(*, categories='auto', drop=None, sparse=True, dtype=<class 'numpy.float64'>, handle_unknown='error')
Encode categorical features as a one-hot numeric array.
The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are encoded using a one-hot (aka 'one-of-K' or 'dummy') encoding scheme. This creates a binary column for each category and returns a sparse matrix or dense array (depending on the sparse parameter).
By default, the encoder derives the categories based on the unique values in each feature. Alternatively, you can also specify the categories manually.
This encoding is needed for feeding categorical data to many scikit-learn estimators, notably linear models and SVMs with the standard kernels.
Note: a one-hot encoding of y labels should use a LabelBinarizer instead.
Read more in the User Guide.
Parameters
- categories : 'auto' or a list of array-like, default='auto'
  Categories (unique values) per feature:
  - 'auto' : Determine categories automatically from the training data.
  - list : categories[i] holds the categories expected in the ith column. The passed categories should not mix strings and numeric values within a single feature, and should be sorted in case of numeric values.
  The used categories can be found in the categories_ attribute.
  New in version 0.20.
- drop : {'first', 'if_binary'} or an array-like of shape (n_features,), default=None
  Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into a neural network or an unregularized regression.
  However, dropping one category breaks the symmetry of the original representation and can therefore induce a bias in downstream models, for instance for penalized linear classification or regression models.
  - None : retain all features (the default).
  - 'first' : drop the first category in each feature. If only one category is present, the feature will be dropped entirely.
  - 'if_binary' : drop the first category in each feature with two categories. Features with 1 or more than 2 categories are left intact.
  - array : drop[i] is the category in feature X[:, i] that should be dropped.
  New in version 0.21: The parameter drop was added in 0.21.
  Changed in version 0.23: The option drop='if_binary' was added in 0.23.
- sparse : bool, default=True
  Return a sparse matrix if True, otherwise a dense array.
- dtype : number type, default=float
  Desired dtype of output.
- handle_unknown : {'error', 'ignore'}, default='error'
  Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to 'ignore' and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
Attributes
- categories_ : list of arrays
  The categories of each feature determined during fitting (in order of the features in X and corresponding with the output of transform). This includes the category specified in drop (if any).
- drop_idx_ : array of shape (n_features,)
  - drop_idx_[i] is the index in categories_[i] of the category to be dropped for each feature.
  - drop_idx_[i] = None if no category is to be dropped from the feature with index i, e.g. when drop='if_binary' and the feature isn't binary.
  - drop_idx_ = None if all the transformed features will be retained.
  Changed in version 0.23: Added the possibility to contain None values.
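The three cases described for drop_idx_ can be inspected side by side. This is a short sketch that assumes a scikit-learn version in which drop='if_binary' is available (0.23 or later, per the note above); the variable names are illustrative, and the values are read back with list() so that only the indices, not the array repr, are relied upon.

```python
from sklearn.preprocessing import OneHotEncoder

# First feature has two categories, second feature has three.
X = [['Male', 1], ['Female', 3], ['Female', 2]]

no_drop = OneHotEncoder().fit(X)
first = OneHotEncoder(drop='first').fit(X)
binary = OneHotEncoder(drop='if_binary').fit(X)

# drop=None: the attribute itself is None because every column is retained.
print(no_drop.drop_idx_)
# drop='first': index 0 of categories_[i] is dropped for each feature.
print(list(first.drop_idx_))
# drop='if_binary': only the two-category feature gets an index.
print(list(binary.drop_idx_))
```

Reading drop_idx_ together with categories_ tells you which category acts as the implicit baseline for each feature after encoding.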
Examples
Given a dataset with two features, we let the encoder find the unique values per feature and transform the data to a binary one-hot encoding.
>>> from sklearn.preprocessing import OneHotEncoder
One can discard categories not seen during fit:
>>> enc = OneHotEncoder(handle_unknown='ignore')
>>> X = [['Male', 1], ['Female', 3], ['Female', 2]]
>>> enc.fit(X)
OneHotEncoder(handle_unknown='ignore')
>>> enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
>>> enc.transform([['Female', 1], ['Male', 4]]).toarray()
array([[1., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.]])
>>> enc.inverse_transform([[0, 1, 1, 0, 0], [0, 0, 0, 1, 0]])
array([['Male', 1],
       [None, 2]], dtype=object)
>>> enc.get_feature_names(['gender', 'group'])
array(['gender_Female', 'gender_Male', 'group_1', 'group_2', 'group_3'],
  dtype=object)
One can always drop the first column for each feature:
>>> drop_enc = OneHotEncoder(drop='first').fit(X)
>>> drop_enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
>>> drop_enc.transform([['Female', 1], ['Male', 2]]).toarray()
array([[0., 0., 0.],
       [1., 1., 0.]])
Or drop a column only for features that have exactly 2 categories:
>>> drop_binary_enc = OneHotEncoder(drop='if_binary').fit(X)
>>> drop_binary_enc.transform([['Female', 1], ['Male', 2]]).toarray()
array([[0., 1., 0., 0.],
       [1., 0., 1., 0.]])
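Methods
| fit(X[, y]) | Fit OneHotEncoder to X. |
| fit_transform(X[, y]) | Fit OneHotEncoder to X, then transform X. |
| get_feature_names([input_features]) | Return feature names for output features. |
| get_params([deep]) | Get parameters for this estimator. |
| inverse_transform(X) | Convert the data back to the original representation. |
| set_params(**params) | Set the parameters of this estimator. |
| transform(X) | Transform X using one-hot encoding. |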