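Sometimes it is hard to tell the three "-izations" apart
Feature standardization
Why
Features whose values span a much larger numeric range have a disproportionate effect on the computed result. If we consider the features equally weighted but their value ranges differ, we need to normalize them.
Example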
time | distance | weight |
---|---|---|
1.2 | 5000 | 80 |
1.6 | 6000 | 90 |
1.0 | 3000 | 50 |
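For example, suppose we consider time, distance, and weight to carry equal weight. In feature analysis, distance will clearly have the largest influence on the result, simply because its values are the largest.
Therefore, we use normalization to rescale the values into the range 0 to 1.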
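To make the claim concrete, here is a minimal sketch in plain Python that measures how much each feature contributes to the squared Euclidean distance between the first two rows of the table. The variable names (`rows`, `share`) are illustrative and not from the original note, and Euclidean distance is only one assumed example of a "computed result".

```python
# Rows from the table above: (time, distance, weight).
rows = [
    (1.2, 5000.0, 80.0),
    (1.6, 6000.0, 90.0),
    (1.0, 3000.0, 50.0),
]
names = ["time", "distance", "weight"]

# Squared difference between the first two rows, feature by feature.
a, b = rows[0], rows[1]
sq_diff = [(x - y) ** 2 for x, y in zip(a, b)]

# Fraction of the total squared Euclidean distance owed to each feature.
total = sum(sq_diff)
share = {name: d / total for name, d in zip(names, sq_diff)}

for name in names:
    print(f"{name}: {share[name]:.6f}")
```

On the raw values, the distance column accounts for more than 99.9% of the total, so time and weight barely matter until the features are rescaled.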
Min-max normalization
\(x_{\text{new}} = \frac{x - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}}\)
# x: a single numeric vector (one feature column)
cle <- function(x) {
  x_new <- (x - min(x)) / (max(x) - min(x))
  return(x_new)
}
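A Python counterpart of the function above might look like the following sketch. It assumes NumPy is available and scales each column independently; the helper name `min_max_scale` and the guard for constant columns are additions for illustration, not part of the original note.

```python
import numpy as np


def min_max_scale(X):
    """Rescale each column of X to the range [0, 1] (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    span = col_max - col_min
    # Assumption: a constant column is mapped to 0 instead of dividing by zero.
    span[span == 0] = 1.0
    return (X - col_min) / span


# Columns: time, distance, weight (values from the table above).
data = np.array([
    [1.2, 5000.0, 80.0],
    [1.6, 6000.0, 90.0],
    [1.0, 3000.0, 50.0],
])

scaled = min_max_scale(data)
print(scaled)
```

After scaling, every column has minimum 0 and maximum 1, so no single feature dominates just because of its units.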
Mean-variance standardization
\(x_{\text{scale}} = \frac{x - x_{\text{mean}}}{s}\), where \(s\) is the standard deviation of the feature.
# x: a single numeric vector (one feature column)
cle <- function(x) {
  x_new <- (x - mean(x)) / sd(x)  # sd() is the sample standard deviation
  return(x_new)
}
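The same computation can be sketched in plain Python with the standard library. The function name `standardize` is hypothetical. `statistics.stdev` is used because it is the sample standard deviation (dividing by n - 1), as in R's `sd()`.

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores of values using the sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # divides by n - 1
    return [(v - m) / s for v in values]


# The distance column from the table above.
distance = [5000.0, 6000.0, 3000.0]
z = standardize(distance)
print(z)
```

The result has mean 0 and sample standard deviation 1. Unlike min-max normalization, the values are not confined to the range 0 to 1.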
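In Python, scikit-learn provides the StandardScaler class, which applies mean-variance standardization directly to NumPy arrays.
For reference, see:
Standardization
scale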
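As a small sketch of StandardScaler on its own, the snippet below applies it to the three-row table from the example, assuming scikit-learn and NumPy are installed. One detail worth knowing: StandardScaler divides by the population standard deviation (equivalent to `ddof=0` in NumPy), so its output differs slightly from a formula based on the sample standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: time, distance, weight (values from the table above).
data = np.array([
    [1.2, 5000.0, 80.0],
    [1.6, 6000.0, 90.0],
    [1.0, 3000.0, 50.0],
])

# fit_transform learns each column's mean and std, then rescales the data.
scaler = StandardScaler()
scaled = scaler.fit_transform(data)

# Manual equivalent: population standard deviation (NumPy's default, ddof=0).
manual = (data - data.mean(axis=0)) / data.std(axis=0)

print(scaled)
print(np.allclose(scaled, manual))
```

In practice the scaler is usually fitted on the training data only and then reused to transform the test data, which is what placing it inside a Pipeline takes care of.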
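# Import the necessary modules
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
# X (features) and y (labels) are assumed to be already loaded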
# Setup the pipeline steps: steps
steps = [('scaler', StandardScaler()),
('knn', KNeighborsClassifier())]
# Create the pipeline: pipeline
pipeline = Pipeline(steps)
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Fit the pipeline to the training set: knn_scaled
knn_scaled = pipeline.fit(X_train, y_train)
# Instantiate and fit a k-NN classifier to the unscaled data
knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)
# Compute and print metrics
print('Accuracy with Scaling: {}'.format(knn_scaled.score(X_test, y_test)))
print('Accuracy without Scaling: {}'.format(knn_unscaled.score(X_test, y_test)))
<script.py> output:
Accuracy with Scaling: 0.7700680272108843
Accuracy without Scaling: 0.6979591836734694
Clearly, on this dataset the model trained on standardized data predicts more accurately (about 0.77 versus 0.70).