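Categorical data comes up all the time in ML work. How to handle it depends on the situation:
1. When there are only a few categories and their value_counts are roughly balanced, one-hot encoding is enough.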
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    le = LabelEncoder()
    var_to_encode = ['Device_Type', 'Filled_Form', 'Gender', 'Var1', 'Var2', 'Mobile_Verified', 'Source']

    # Map each category to an integer code first
    for col in var_to_encode:
        data[col] = le.fit_transform(data[col])

    # Expand each encoded column into one indicator column per category
    data = pd.get_dummies(data, columns=var_to_encode)
    data.columns
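For a quick check of what the one-hot step produces, here is a minimal sketch on a toy DataFrame. The rows are made up for illustration; only the column names `Gender` and `Device_Type` come from the list above, and `Income` is a hypothetical numeric column. It also suggests that `pd.get_dummies` can take string columns directly, so the `LabelEncoder` pass is optional for this purpose.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Device_Type": ["Mobile", "Web-browser", "Mobile", "Mobile"],
    "Income": [1000, 2000, 1500, 1200],  # hypothetical numeric column, left untouched
})

# One indicator column per category; columns not listed are kept as they are.
encoded = pd.get_dummies(toy, columns=["Gender", "Device_Type"])
encoded_columns = list(encoded.columns)
print(encoded_columns)
```

Each listed column is replaced by indicator columns named `<column>_<category>`, so two columns with two categories each become four indicator columns.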
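2. Sometimes, though, the value_counts differ widely between categories. In that case, keep a few selected categories and merge all the remaining low-count ones into a single 'others' category.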
    # Keep 'S122' and 'S133'; every other value becomes 'others'
    data['Source'] = data['Source'].apply(lambda x: 'others' if x not in ['S122', 'S133'] else x)
    data['Source'].value_counts()
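Instead of hard-coding the categories to keep, one option is to pick them by a count threshold. The sketch below is only one way to do it: `lump_rare_categories`, `min_count`, and `other_label` are hypothetical names introduced here, and the sample values other than 'S122' and 'S133' are made up.

```python
import pandas as pd

def lump_rare_categories(series, min_count, other_label="others"):
    """Replace categories seen fewer than min_count times with other_label.

    Hypothetical helper for illustration. Missing values are also
    replaced by other_label, because isin() is False for them.
    """
    counts = series.value_counts()
    keep = counts[counts >= min_count].index
    # where() keeps values where the condition holds and replaces the rest.
    return series.where(series.isin(keep), other_label)

# Invented sample values; only 'S122' and 'S133' appear in the text above.
source = pd.Series(["S122", "S122", "S122", "S133", "S133", "S143", "S159"])
lumped = lump_rare_categories(source, min_count=2)
lumped_counts = lumped.value_counts().to_dict()
print(lumped_counts)
```

With a threshold, the kept categories follow the data instead of a fixed list, which may be more convenient when the same step is applied to several columns.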