【IJCAI-2018】Search Advertising - Imbalanced Data
I am not good at competitions, nor at engineering features, nor at tuning parameters, and I have no server to run things in parallel. Everyone's baseline is better than my model. I am writing this post mainly to share my understanding of the data and the rough framework of my thinking, in the hope that it offers a little inspiration or help.
As someone with no experience, no track record, and no teammates, whose feature engineering stops at dummy variables, whose dimensionality reduction stops at PCA, whose models stop at LR and SVM, whose tuning stops at CV, and whose ensembling stops at simple averaging, my presence in any competition just enlarges the denominator. When I saw everyone sharing their baselines on the forum I was genuinely happy and excited, and when I watched them construct all kinds of magical features that actually improved the model's logloss, I was full of admiration. Having neither a clever head nor enough patience for detail, I went full "borrow-and-run": I copied a baseline from the forum and ran it on my machine. Wow, impressive: 0.084 on the first try, only 0.004 behind the leading 0.080. I was clearly on track to storm the leaderboard! Riding the excitement I worked even harder, grinding through models and pounding out code, walking around with my own background music, feeling like the whole world belonged to me. But... when I wanted to see which samples were predicted as 1, I was stunned: no one! Out of my 18,000 test samples, the model predicted not a single 1. Not one!
And that is the real problem. The clever among you probably knew long ago that CTR data is imbalanced, but dull as I am, I had not even noticed!
Rant over.
So for imbalanced data, such as the binary CTR prediction here, what should we do?
The positive and negative samples are severely imbalanced, with a ratio of about 50:1. If we predict directly on such data, the recall on the minority class will be extremely low.
This is because traditional learning methods aim at maximizing overall classification accuracy and treat every sample alike, so the resulting classifier is accurate on the majority class but very inaccurate on the minority class. In the 50:1 CTR example, an algorithm that predicts every sample as the majority class already reaches about 98% accuracy (50/51). And since the minority samples are few yet weighted the same as everyone else, the cost of missing all of them is tiny, so the learned model simply favors the majority. Traditional learning algorithms are therefore of very limited use on imbalanced data.
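To see the accuracy trap concretely, here is a minimal sketch on synthetic 50:1 data (not the competition data): a constant "always negative" predictor scores about 98% accuracy while its recall on the positive class is exactly zero.

```python
# Synthetic illustration only: a constant majority-class predictor
# looks accurate but never finds a single positive.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
y_true = (rng.random(51_000) < 1 / 51).astype(int)  # roughly 50:1
y_pred = np.zeros_like(y_true)                      # always predict 0

print(accuracy_score(y_true, y_pred))  # ~0.98
print(recall_score(y_true, y_pred))    # 0.0 on the minority class
```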
The solutions fall into two broad categories.
The first works at the data level, mainly through sampling: since the samples are imbalanced, we can resample them under some strategy so the data becomes more balanced. Resampling methods include over-sampling, under-sampling, and combinations of the two; over-sampling increases the number of minority samples, while under-sampling decreases the number of majority samples.
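As a rough sketch of this data-level route, the snippet below uses imbalanced-learn (already a dependency of the attached script via BalancedBaggingClassifier) on synthetic data; it assumes a version of imblearn whose resamplers expose fit_resample.

```python
# Synthetic data; RandomOverSampler duplicates minority samples,
# RandomUnderSampler drops majority samples, SMOTEENN is one combination.
import numpy as np
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTEENN

rng = np.random.default_rng(0)
X = rng.normal(size=(5100, 4))
y = np.r_[np.ones(100, dtype=int), np.zeros(5000, dtype=int)]  # ~50:1

for sampler in (RandomOverSampler(random_state=0),
                RandomUnderSampler(random_state=0),
                SMOTEENN(random_state=0)):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))
```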
The second works at the algorithm level: take the different costs of different misclassifications into account and make the algorithm cost-sensitive, so that it still performs well on imbalanced data. Concretely, rewrite the cost function so that misclassifying a minority label carries a large cost.
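A minimal sketch of this algorithm-level route on the same kind of synthetic data; the attached script applies the identical idea through LightGBM's sample_weight, where the 0.02 weight on the majority class makes a majority-class error roughly 50 times cheaper than a minority-class one.

```python
# Synthetic data; weighting the classes ~50:1 rebalances the effective
# loss so that minority errors are no longer nearly free.
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(5100, 4))
y = np.r_[np.ones(100, dtype=int), np.zeros(5000, dtype=int)]

clf = lgb.LGBMClassifier(random_state=0)
clf.fit(X, y, sample_weight=np.where(y == 1, 1.0, 0.02))
# With the reweighted loss, the average predicted positive probability
# moves toward 0.5 instead of staying pinned at the ~2% base rate.
print(clf.predict_proba(X)[:, 1].mean())
```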
PS: The attached Python code compares the models on both logloss and AUC. It runs as-is and does not hit a memory error.
```python
# -*- coding: utf-8 -*-
"""
Created on Wed Apr 4 10:53:58 2018
@author : HaiyanJiang
@email : jianghaiyan.cn@gmail.com

What does this script do?
Some ideas for improving the accuracy of imbalanced-data classification.

data characteristics:
    imbalanced data.
the models:
    model_baseline  : lgb
    model_baseline2 : another lgb
    model_baseline3 : bagging

Other notes:
Besides the basic features, the script adds the user's click-count
statistics for the current hour and the current day, plus the hour itself:
    'context_day', 'context_hour',
    'user_query_day', 'user_query_hour', 'user_query_day_hour'
non_feat = [
    'instance_id', 'user_id', 'context_id', 'item_category_list',
    'item_property_list', 'predict_category_property',
    'context_timestamp', 'TagTime', 'context_day'
    ]
"""

import time
import itertools

import numpy as np
import pandas as pd
import lightgbm as lgb
import matplotlib.pyplot as plt

from sklearn.metrics import log_loss, confusion_matrix, auc, roc_curve
from sklearn.ensemble import BaggingClassifier
from imblearn.ensemble import BalancedBaggingClassifier

# np.interp is used below in place of the removed scipy.interp
# (they were the same function).


def read_bigcsv(filename, **kw):
    """Read a large csv in chunks to avoid a MemoryError."""
    with open(filename) as rf:
        reader = pd.read_csv(rf, **kw, iterator=True)
        chunk_size = 100000
        chunks = []
        while True:
            try:
                chunks.append(reader.get_chunk(chunk_size))
            except StopIteration:
                print("Iteration is stopped.")
                break
    return pd.concat(chunks, axis=0, join='outer', ignore_index=True)


def timestamp2datetime(value):
    return time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(value))


'''
Exploratory notes (not executed):

from matplotlib import pyplot as plt
tt = data['context_timestamp']
plt.plot(tt)
# The timestamps are not sorted and contain some disorder. For an online
# model they must be sorted first.
# aa = data[data['user_id'] == 24779788309075]
aa = data_train[data_train.duplicated(subset=None, keep='first')]
bb = data_train[data_train.duplicated(subset=None, keep='last')]
cc = data_train[data_train.duplicated(subset=None, keep=False)]

a2 = pd.DataFrame(train_id)[pd.DataFrame(train_id).duplicated(keep=False)]
b2 = train_id[train_id.duplicated(keep='last')]
c2 = train_id[train_id.duplicated(keep=False)]

# Verified: 'instance_id' contains duplicates, e.g.
a3 = Xdata[Xdata['instance_id'] == 1037061371711078396]
'''


def convert_timestamp(data):
    '''
    1. Convert timestamp to datetime; no sort, no reindex.
    2. TagTime ranges from '2018-09-18 00:00:01' to '2018-09-24 23:59:47'.
    Correlations worth checking:
    np.corrcoef(data['user_query_day'], data['user_query_hour'])
    np.corrcoef(data['user_query_hour'], data['user_query_day_hour'])
    np.corrcoef(data['user_query_day'], data['user_query_day_hour'])
    '''
    data['TagTime'] = data['context_timestamp'].apply(timestamp2datetime)
    data['context_day'] = data['TagTime'].apply(lambda x: int(x[8:10]))
    data['context_hour'] = data['TagTime'].apply(lambda x: int(x[11:13]))
    # Per-user click counts for the day, the hour, and the (day, hour) pair.
    query_day = data.groupby(['user_id', 'context_day']).size(
        ).reset_index().rename(columns={0: 'user_query_day'})
    data = pd.merge(data, query_day, 'left', on=['user_id', 'context_day'])
    query_hour = data.groupby(['user_id', 'context_hour']).size(
        ).reset_index().rename(columns={0: 'user_query_hour'})
    data = pd.merge(data, query_hour, 'left', on=['user_id', 'context_hour'])
    query_day_hour = data.groupby(
        by=['user_id', 'context_day', 'context_hour']).size(
        ).reset_index().rename(columns={0: 'user_query_day_hour'})
    data = pd.merge(data, query_day_hour, 'left',
                    on=['user_id', 'context_day', 'context_hour'])
    return data


def plot_confusion_matrix(cm, classes, normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    Print and plot the confusion matrix.
    Normalization can be applied by setting normalize=True.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')


def data_baseline():
    filename = '../round1_ijcai_18_data/round1_ijcai_18_train_20180301.txt'
    data = read_bigcsv(filename, sep=' ')
    # data = pd.read_csv(filename, sep=' ')
    data.drop_duplicates(inplace=True)
    data.reset_index(drop=True, inplace=True)  # very important
    data = convert_timestamp(data)
    train = data.loc[data['context_day'] < 24]  # days 18..23
    test = data.loc[data['context_day'] == 24]  # day 24 as validation, for now
    features = [
        'item_id', 'item_brand_id', 'item_city_id', 'item_price_level',
        'item_sales_level', 'item_collected_level', 'item_pv_level',
        'user_gender_id', 'user_age_level', 'user_occupation_id',
        'user_star_level', 'context_page_id', 'shop_id',
        'shop_review_num_level', 'shop_review_positive_rate',
        'shop_star_level', 'shop_score_service',
        'shop_score_delivery', 'shop_score_description',
        'user_query_day', 'user_query_day_hour', 'context_hour',
        ]
    x_train = train[features]
    x_test = test[features]
    y_train = train['is_trade']
    y_test = test['is_trade']
    return x_train, x_test, y_train, y_test
    # usage: x_train, x_test, y_train, y_test = data_baseline()


def model_baseline(x_train, y_train, x_test, y_test):
    cat_names = [
        'item_price_level',
        'item_sales_level',
        'item_collected_level',
        'item_pv_level',
        'user_gender_id',
        'user_age_level',
        'user_occupation_id',
        'user_star_level',
        'context_page_id',
        'shop_review_num_level',
        'shop_star_level',
        ]
    print("begin train...")
    kw_lgb = dict(num_leaves=63, max_depth=7, n_estimators=80, random_state=6)
    clf = lgb.LGBMClassifier(**kw_lgb)
    clf.fit(x_train, y_train, categorical_feature=cat_names)
    prob = clf.predict_proba(x_test)[:, 1]
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    # print(loss_val)  # 0.0848226750637
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = np.interp(mean_fpr, fpr, tpr)
    x_auc = auc(fpr, tpr)
    fig = plt.figure('fig1')
    ax = fig.add_subplot(1, 1, 1)
    name = 'base_lgb'
    plt.plot(mean_fpr, mean_tpr, linestyle='--',
             label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
             (x_auc, loss_val), lw=2)
    y_pred = clf.predict(x_test)
    cm1 = plt.figure()
    cm = confusion_matrix(y_test, y_pred)
    plot_confusion_matrix(cm, classes=[0, 1], title='Confusion matrix base1')
    # Refit with sample weights derived from the labels.
    clf = lgb.LGBMClassifier(**kw_lgb)
    clf.fit(x_train, y_train,
            sample_weight=[1 if y == 1 else 0.02 for y in y_train],
            categorical_feature=cat_names)
    prob = clf.predict_proba(x_test)[:, 1]
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = np.interp(mean_fpr, fpr, tpr)
    x_auc = auc(fpr, tpr)
    name = 'base_lgb_weighted'
    plt.figure('fig1')  # select the ROC figure
    plt.plot(
        mean_fpr, mean_tpr, linestyle='--',
        label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
        (x_auc, loss_val), lw=2)
    y_pred = clf.predict(x_test)
    cm2 = plt.figure()
    cm = confusion_matrix(y_test, y_pred)
    plot_confusion_matrix(cm, classes=[0, 1],
                          title='Confusion matrix base model')
    plt.figure('fig1')  # select the ROC figure
    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='k', label='Luck')
    # make nice plotting
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.get_xaxis().tick_bottom()
    ax.get_yaxis().tick_left()
    ax.spines['left'].set_position(('outward', 10))
    ax.spines['bottom'].set_position(('outward', 10))
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()
    return cm1, cm2, fig


def model_baseline3(x_train, y_train, x_test, y_test):
    bagging = BaggingClassifier(random_state=0)
    balanced_bagging = BalancedBaggingClassifier(random_state=0)
    bagging.fit(x_train, y_train)
    balanced_bagging.fit(x_train, y_train)
    prob = bagging.predict_proba(x_test)[:, 1]
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = np.interp(mean_fpr, fpr, tpr)
    x_auc = auc(fpr, tpr)
    fig = plt.figure('Bagging')
    ax = fig.add_subplot(1, 1, 1)
    name = 'base_Bagging'
    plt.plot(mean_fpr, mean_tpr, linestyle='--',
             label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
             (x_auc, loss_val), lw=2)
    y_pred_bagging = bagging.predict(x_test)
    cm_bagging = confusion_matrix(y_test, y_pred_bagging)
    cm1 = plt.figure()
    plot_confusion_matrix(cm_bagging,
                          classes=[0, 1],
                          title='Confusion matrix of BaggingClassifier')
    # balanced_bagging
    prob = balanced_bagging.predict_proba(x_test)[:, 1]
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = np.interp(mean_fpr, fpr, tpr)
    x_auc = auc(fpr, tpr)
    plt.figure('Bagging')  # select the ROC figure
    name = 'base_Balanced_Bagging'
    plt.plot(
        mean_fpr, mean_tpr, linestyle='--',
        label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
        (x_auc, loss_val), lw=2)
    y_pred_balanced_bagging = balanced_bagging.predict(x_test)
    cm_balanced_bagging = confusion_matrix(y_test, y_pred_balanced_bagging)
    cm2 = plt.figure()
    plot_confusion_matrix(cm_balanced_bagging,
                          classes=[0, 1],
                          title='Confusion matrix of BalancedBagging')
    plt.figure('Bagging')  # select the ROC figure
    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='k', label='Luck')
    # make nice plotting
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.get_xaxis().tick_bottom()
    ax.get_yaxis().tick_left()
    ax.spines['left'].set_position(('outward', 10))
    ax.spines['bottom'].set_position(('outward', 10))
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()
    return cm1, cm2, fig


def model_baseline2(x_train, y_train, x_test, y_test):
    params = {
        'task': 'train',
        'boosting_type': 'gbdt',
        'objective': 'multiclass',
        'num_class': 2,
        'verbose': 0,
        'metric': 'multi_logloss',  # matches the multiclass objective
        'max_bin': 255,
        'max_depth': 7,
        'learning_rate': 0.3,
        'nthread': 4,
        'num_leaves': 63,
        'feature_fraction': 0.8,
        'num_boost_round': 160,  # single alias for the boosting-round count
        }
    lgb_train = lgb.Dataset(x_train, label=y_train)
    lgb_eval = lgb.Dataset(x_test, label=y_test, reference=lgb_train)
    print("begin train...")
    bst = lgb.train(params, lgb_train, valid_sets=lgb_eval)
    prob = bst.predict(x_test)[:, 1]
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    y_pred = [1 if x > 0.5 else 0 for x in predict_score]
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    x_auc = auc(fpr, tpr)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = np.interp(mean_fpr, fpr, tpr)
    fig = plt.figure('weighted')
    ax = fig.add_subplot(1, 1, 1)
    name = 'base_lgb'
    plt.plot(mean_fpr, mean_tpr, linestyle='--',
             label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
             (x_auc, loss_val), lw=2)
    cm1 = plt.figure()
    cm = confusion_matrix(y_test, y_pred)
    plot_confusion_matrix(cm, classes=[0, 1],
                          title='Confusion matrix base model')
    # Retrain with sample weights derived from the labels.
    lgb_train = lgb.Dataset(
        x_train, label=y_train,
        weight=[1 if y == 1 else 0.02 for y in y_train])
    lgb_eval = lgb.Dataset(
        x_test, label=y_test, reference=lgb_train,
        weight=[1 if y == 1 else 0.02 for y in y_test])
    bst = lgb.train(params, lgb_train, valid_sets=lgb_eval)
    prob = bst.predict(x_test)[:, 1]
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    y_pred = [1 if x > 0.5 else 0 for x in predict_score]
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = np.interp(mean_fpr, fpr, tpr)
    x_auc = auc(fpr, tpr)
    plt.figure('weighted')  # select the ROC figure
    name = 'base_lgb_weighted'
    plt.plot(
        mean_fpr, mean_tpr, linestyle='--',
        label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
        (x_auc, loss_val), lw=2)
    cm2 = plt.figure()
    cm = confusion_matrix(y_test, y_pred)
    plot_confusion_matrix(cm, classes=[0, 1],
                          title='Confusion matrix weighted model')
    plt.figure('weighted')  # select the ROC figure
    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='k', label='Luck')
    # make nice plotting
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.get_xaxis().tick_bottom()
    ax.get_yaxis().tick_left()
    ax.spines['left'].set_position(('outward', 10))
    ax.spines['bottom'].set_position(('outward', 10))
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()
    return cm1, cm2, fig


'''
1. logloss VS AUC
Although the baseline's logloss = 0.0819 looks very small, the confusion
matrix shows y_pred is all 0: the model classifies everything into the
majority class. Adding weights makes it slightly better.

The AUC is only 0.64~0.67. Why is the AUC so low? Because the labels are
extremely imbalanced: the proportion of 1s is only about 2%, i.e. 50:1.
AUC is a friendlier measure of classification performance on imbalanced
data; selecting features by AUC may well give better results. This is only
a rough direction for improvement.
2. handling imbalanced data:
    1. resampling, over- or under-:
       over- is increasing # of minority, under- is decreasing # of majority.
    2. revalue the loss function by giving a large loss to misclassifying
       the minority labels.
'''


if __name__ == "__main__":
    x_train, x_test, y_train, y_test = data_baseline()
    cm11, cm12, fig1 = model_baseline(x_train, y_train, x_test, y_test)
    cm21, cm22, fig2 = model_baseline2(x_train, y_train, x_test, y_test)
    cm31, cm32, fig3 = model_baseline3(x_train, y_train, x_test, y_test)

    fig1.savefig('./base_lgb_weighted.jpg', format='jpg')
    cm11.savefig('./Confusion matrix1.jpg', format='jpg')
    cm12.savefig('./Confusion matrix2.jpg', format='jpg')
```
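Regarding the logloss-vs-AUC note at the end of the script, a tiny synthetic check (again, not the competition data) shows why a small logloss alone is not reassuring at a ~2% positive rate: a model that always outputs the base rate already gets a logloss near 0.10, while its AUC is 0.5, i.e. chance level.

```python
# Synthetic check: constant base-rate predictions on 50:1 data.
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

rng = np.random.default_rng(0)
y = (rng.random(51_000) < 1 / 51).astype(int)
p = np.full(len(y), 1 / 51)  # always predict the base rate

print(log_loss(y, p))        # ~0.096: looks "good" on its own
print(roc_auc_score(y, p))   # 0.5: no ranking ability at all
```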