Column Transformer with Mixed Types
https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py
Use ColumnTransformer to apply different preprocessing and feature extraction pipelines to different subsets of features. This tool is very convenient for handling heterogeneous datasets, for example scaling the numeric features and one-hot encoding the categorical ones.
This example illustrates how to apply different preprocessing and feature extraction pipelines to different subsets of features, using ColumnTransformer. This is particularly handy for the case of datasets that contain heterogeneous data types, since we may want to scale the numeric features and one-hot encode the categorical ones.
For the numeric data, missing values are first imputed with the median, then standard scaling is applied. For the categorical data, missing values are first filled with a 'missing' category, then the data is one-hot encoded.
In this example, the numeric data is standard-scaled after median-imputation, while the categorical data is one-hot encoded after imputing missing values with a new category ('missing').
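As a concrete illustration of that description, the categorical sub-pipeline could be sketched as follows. This is only a minimal sketch of the imputation-plus-encoding idea; the code shown later in this example uses OneHotEncoder(handle_unknown='ignore') on its own.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Fill missing categorical values with the constant label 'missing',
# then one-hot encode the resulting categories.
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])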
Columns can be dispatched to a particular processor in different ways: by column name or by column data type.
In addition, we show two different ways to dispatch the columns to the particular pre-processor: by column names and by column data types.
Finally, the preprocessing pipeline is integrated into a full prediction pipeline together with a simple classification model.
Finally, the preprocessing pipeline is integrated in a full prediction pipeline using Pipeline, together with a simple classification model.
Use ColumnTransformer by selecting columns by name
(1) Build a transformation pipeline for the numeric data, numeric_transformer, consisting of the imputer SimpleImputer and the scaler StandardScaler. This is a sub-pipeline that is later plugged into the ColumnTransformer.
(2) Build a transformer for the categorical data, categorical_transformer.
Then integrate (1) and (2) into a ColumnTransformer by column name, forming the column transformer named preprocessor.
Finally, combine preprocessor and a LogisticRegression model into the final pipeline.
We will train our classifier with the following features:

Numeric Features:
- age: float;
- fare: float.

Categorical Features:
- embarked: categories encoded as strings {'C', 'S', 'Q'};
- sex: categories encoded as strings {'female', 'male'};
- pclass: ordinal integers {1, 2, 3}.

We create the preprocessing pipelines for both numeric and categorical data. Note that pclass could either be treated as a categorical or numeric feature.
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the Titanic dataset as a pandas DataFrame (features) and Series (target).
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))
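As noted above, pclass could also be treated as a numeric feature. A minimal sketch of that variant, assuming the transformers and train/test split defined above (the names preprocessor_alt and clf_alt are only for illustration):

# Hypothetical variant: move 'pclass' from the categorical to the numeric columns.
numeric_features = ['age', 'fare', 'pclass']
categorical_features = ['embarked', 'sex']

preprocessor_alt = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

clf_alt = Pipeline(steps=[('preprocessor', preprocessor_alt),
                          ('classifier', LogisticRegression())])
clf_alt.fit(X_train, y_train)
print("model score (pclass as numeric): %.3f" % clf_alt.score(X_test, y_test))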
HTML representation of Pipeline
View the pipeline

When the Pipeline is printed out in a jupyter notebook, an HTML representation of the estimator is displayed as follows:

from sklearn import set_config

set_config(display='diagram')
clf
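Outside a notebook, the same diagram can be rendered to an HTML string and saved to a file with sklearn.utils.estimator_html_repr (available in recent scikit-learn versions); the file name below is just an illustration:

from sklearn.utils import estimator_html_repr

# Render the pipeline diagram to HTML and save it so it can be
# opened in a browser outside of the notebook.
with open('pipeline_diagram.html', 'w') as f:
    f.write(estimator_html_repr(clf))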
StandardScaler
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler
Numeric features usually have different units and therefore very different value ranges: some are large, some are small. Without normalization, the features with the largest ranges dominate the model, while the influence of features with smaller ranges may effectively be lost. Scaling all features to a similar range balances the influence of each feature.
Standardize features by removing the mean and scaling to unit variance.

The standard score of a sample x is calculated as:

z = (x - u) / s

where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using transform.

Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).
For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance of the same order. If a feature has a variance that is orders of magnitude larger than that of others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
This scaler can also be applied to sparse CSR or CSC matrices by passing with_mean=False to avoid breaking the sparsity structure of the data.
>>> from sklearn.preprocessing import StandardScaler
>>> data = [[0, 0], [0, 0], [1, 1], [1, 1]]
>>> scaler = StandardScaler()
>>> print(scaler.fit(data))
StandardScaler()
>>> print(scaler.mean_)
[0.5 0.5]
>>> print(scaler.transform(data))
[[-1. -1.]
 [-1. -1.]
 [ 1.  1.]
 [ 1.  1.]]
>>> print(scaler.transform([[2, 2]]))
[[3. 3.]]
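As mentioned above, sparse input requires with_mean=False so that centering does not destroy the sparsity structure; a minimal sketch using a SciPy CSR matrix:

from scipy.sparse import csr_matrix
from sklearn.preprocessing import StandardScaler

# Centering would densify the matrix, so only scale to unit variance.
X_sparse = csr_matrix([[0., 0.], [0., 0.], [1., 1.], [1., 1.]])
scaler = StandardScaler(with_mean=False)
X_scaled = scaler.fit_transform(X_sparse)  # result stays sparse
print(scaler.var_)         # per-feature variance: [0.25 0.25]
print(X_scaled.toarray())  # values divided by the std (0.5), so 1. becomes 2.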
Use ColumnTransformer by selecting columns by data type
If all columns of a given data type share the same processing steps, a column selector can be used to pick out the columns of that type and send them to the corresponding transformer or preprocessing sub-pipeline.
When dealing with a cleaned dataset, the preprocessing can be automatic by using the data types of the columns to decide whether to treat a column as a numerical or categorical feature. sklearn.compose.make_column_selector gives this possibility. First, let's only select a subset of columns to simplify our example.

subset_feature = ['embarked', 'sex', 'pclass', 'age', 'fare']
X_train, X_test = X_train[subset_feature], X_test[subset_feature]
Then, we introspect the information regarding each column data type.
X_train.info()
Out:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1047 entries, 1118 to 684
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   embarked  1045 non-null   category
 1   sex       1047 non-null   category
 2   pclass    1047 non-null   float64
 3   age       841 non-null    float64
 4   fare      1046 non-null   float64
dtypes: category(2), float64(3)
memory usage: 35.0 KB

We can observe that the embarked and sex columns were tagged as category columns when loading the data with fetch_openml. Therefore, we can use this information to dispatch the categorical columns to the categorical_transformer and the remaining columns to the numeric_transformer.
from sklearn.compose import make_column_selector as selector

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, selector(dtype_exclude="category")),
    ('cat', categorical_transformer, selector(dtype_include="category"))
])

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))
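To see what make_column_selector actually does, it can also be called directly on the dataframe: it is a callable that returns the list of matching column names. A small sketch, assuming the X_train defined above:

from sklearn.compose import make_column_selector as selector

# The selector is a callable: given a dataframe, it returns the names of
# the columns whose dtype matches the include/exclude patterns.
categorical_columns = selector(dtype_include="category")(X_train)
numeric_columns = selector(dtype_exclude="category")(X_train)
print(categorical_columns)  # expected: ['embarked', 'sex']
print(numeric_columns)      # expected: ['pclass', 'age', 'fare']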
Using the prediction pipeline in a grid search
Use grid search to tune the parameters of the pipeline steps, including the imputation strategy for the numeric data and the classifier's hyperparameters.
Grid search can also be performed on the different preprocessing steps defined in the ColumnTransformer object, together with the classifier's hyperparameters as part of the Pipeline. We will search for both the imputer strategy of the numeric preprocessing and the regularization parameter of the logistic regression using GridSearchCV.

from sklearn.model_selection import GridSearchCV

param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10, 100],
}

grid_search = GridSearchCV(clf, param_grid, cv=10)
grid_search
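The double-underscore keys in param_grid follow the step__parameter naming convention for nested estimators; the full list of names that can be tuned for this pipeline can be inspected with get_params, e.g.:

# List every parameter name that can appear in param_grid for this pipeline.
for name in sorted(clf.get_params().keys()):
    print(name)

# The grid keys above drill down through the nesting:
# 'preprocessor__num__imputer__strategy' -> 'preprocessor' step ->
# 'num' transformer -> 'imputer' step -> its 'strategy' parameter.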
Inspect the parameters corresponding to the best result.
Calling 'fit' triggers the cross-validated search for the best hyper-parameter combination:
grid_search.fit(X_train, y_train)

print(f"Best params:")
print(grid_search.best_params_)
Out:
Best params:
{'classifier__C': 0.1, 'preprocessor__num__imputer__strategy': 'mean'}

The internal cross-validation score obtained with those parameters is:
print(f"Internal CV score: {grid_search.best_score_:.3f}")
Out:
Internal CV score: 0.784
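Because GridSearchCV refits the model with the best parameters on the full training set by default (refit=True), the refitted pipeline is available as best_estimator_; a minimal sketch (the name best_clf is only for illustration):

# The refitted pipeline with the best hyper-parameters found above.
best_clf = grid_search.best_estimator_
print(best_clf.named_steps['classifier'].C)  # expected: 0.1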
Inspect the five highest-scoring parameter combinations from the grid search.
We can also introspect the top grid search results as a pandas dataframe:
import pandas as pd

cv_results = pd.DataFrame(grid_search.cv_results_)
cv_results = cv_results.sort_values("mean_test_score", ascending=False)
cv_results[["mean_test_score", "std_test_score",
            "param_preprocessor__num__imputer__strategy",
            "param_classifier__C"]].head(5)
The best hyper-parameters have been used to re-fit a final model on the full training set. We can evaluate that final model on held-out test data that was not used for hyperparameter tuning.
print(("best logistic regression from grid search: %.3f" % grid_search.score(X_test, y_test)))
Out:
best logistic regression from grid search: 0.794