- 特征工程
特征工程可以有效地改善模型效果,减少训练时间。
简单的方法包括:
1. 进行特征转换
2. 增加语义特征
A Guiding Principle of Feature Engineering
For a feature to be useful, it must have a relationship to the target that your model is able to learn. Linear models, for instance, are only able to learn linear relationships. So, when using a linear model, your goal is to transform the features to make their relationship to the target linear.
The key idea here is that a transformation you apply to a feature becomes in essence a part of the model itself. Say you were trying to predict the
Price
of square plots of land from theLength
of one side. Fitting a linear model directly toLength
gives poor results: the relationship is not linear.A linear model fits poorly with only Length as feature. If we square the
Length
feature to get'Area'
, however, we create a linear relationship. AddingArea
to the feature set means this linear model can now fit a parabola. Squaring a feature, in other words, gave the linear model the ability to fit squared features.
- 互信息
面对大量特征,可以使用特征效益函数对特征进行排序,选择有效的特征子集。互信息是最常用的方法,相较于相关度,互信息可以衡量各种关系,而相关度只能衡量线性关 系。互信息以不确定性(熵)的角度进行特征相关度的衡量。
互信息的好处:
- easy to use and interpret,
- computationally efficient,
- theoretically well-founded,
- resistant to overfitting, and,
- able to detect any kind of relationship
互信息的下限是0,上限是正无穷,但一般不大于2.
互信息是一种单变量尺度。
注意:类别特征编码
# Label encoding for categoricals for colname in X.select_dtypes("object"): X[colname], _ = X[colname].factorize()
针对实数类型的标签使用mutual_info_regression,针对整数类型的标签使用mutual_info_classif