• Search Advertising


    [IJCAI-2018] Search Advertising - Imbalanced Data

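    I am not good at competitions, at feature engineering, or at parameter tuning, and I have no server for parallel runs. Everyone's baseline beats my model. I am writing this post mainly to share my understanding of the data and the rough framework I have in mind, in the hope that it offers a little inspiration or help.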

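    I have no experience, no track record, and no teammates. My feature engineering stops at dummy variables, my dimensionality reduction at PCA, my models at LR and SVM, my tuning at CV, and my ensembling at averaging, so my only contribution to a competition is usually to enlarge the denominator. I was thrilled when people shared their baselines on the forum, and genuinely impressed when they kept constructing ingenious features that actually improved the model's logloss. Having neither a clever mind nor enough patience for detail, I took the "borrowing" approach: I copied a baseline from the forum and ran it on my machine. Wow, 0.084 on the first try, only 0.004 behind the top score of 0.080. I was about to storm the leaderboard! Fired up, I dug through models and hammered out code, feeling as if I had my own background music and the whole world belonged to me. But when I went to check which samples were predicted as 1, I was stunned: no one! Across my roughly 18,000 rows of test data, the model did not predict a single 1. Not one!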

    [Figure: Confusion matrix1]

    And that is the problem. The clever among you probably knew about the class imbalance in CTR data all along, but slow as I am, I had not noticed it!

    So much for the rant.

     

    For imbalanced data, such as the binary classification in this CTR task, what should we do?

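    The positive and negative samples are severely imbalanced, with a negative-to-positive ratio of about 50:1. If we predict directly on this data, recall for the smaller class will be extremely low.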

     

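    This happens because traditional learning methods aim to maximize overall accuracy (that is, to reduce the overall classification error) and treat every sample equally, so the classifier is accurate on the majority class and poor on the minority class. In the 50:1 CTR example, predicting the majority class for every sample already gives about 98% accuracy (50/51), so traditional learning algorithms are quite limited on imbalanced datasets. Their predictions favor the majority: the minority is small and is treated the same as everything else, so the cost of missing the minority is tiny, and the result favors the majority.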
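To make the 50/51 arithmetic concrete, here is a minimal sketch in plain Python. The labels are synthetic (50,000 negatives and 1,000 positives, chosen only to match the 50:1 ratio), and `confusion_counts` is a helper written for this illustration, not part of the attached script. It scores a "model" that always predicts the majority class.

```python
# Illustration only: synthetic 50:1 labels, not the competition data.
n_neg, n_pos = 50_000, 1_000
y_true = [0] * n_neg + [1] * n_pos
y_pred = [0] * len(y_true)  # always predict the majority class


def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary 0/1 labels."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 1:
            fn += 1
        elif p == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.4f}")  # 50/51, about 0.98
print(f"recall   = {recall:.4f}")    # no positive is ever found
```

Accuracy looks excellent while recall on the minority class is zero, which matches the all-zero predictions described earlier.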

     

    There are two main kinds of solutions.

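    The first works on the data, mainly by sampling: since the samples are imbalanced, we can resample with some strategy to make the data more balanced. Resampling methods include over-sampling, under-sampling, and combinations of the two. Over-sampling increases the number of minority samples; under-sampling decreases the number of majority samples.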
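As a rough sketch of the two basic resampling directions, the following plain-Python functions balance a binary dataset by random under-sampling and random over-sampling. The function names, the `seed` parameter, and the toy data are assumptions made for this example; libraries such as imbalanced-learn offer more complete implementations.

```python
import random


def _split_indices(labels):
    """Return (minority_indices, majority_indices) for binary 0/1 labels."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    return (pos, neg) if len(pos) <= len(neg) else (neg, pos)


def random_undersample(rows, labels, seed=0):
    """Randomly drop majority rows until both classes have the same size."""
    rng = random.Random(seed)
    minority, majority = _split_indices(labels)
    keep = sorted(minority + rng.sample(majority, len(minority)))
    return [rows[i] for i in keep], [labels[i] for i in keep]


def random_oversample(rows, labels, seed=0):
    """Randomly duplicate minority rows until both classes have the same size."""
    rng = random.Random(seed)
    minority, majority = _split_indices(labels)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    keep = sorted(minority + majority + extra)
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Toy data: 500 negatives and 10 positives (50:1); rows are just ids here.
labels = [0] * 500 + [1] * 10
rows = list(range(len(labels)))

x_under, y_under = random_undersample(rows, labels)
x_over, y_over = random_oversample(rows, labels)
print(len(y_under), sum(y_under))  # 20 rows, 10 of them positive
print(len(y_over), sum(y_over))    # 1000 rows, 500 of them positive
```

Resampling would normally be applied to the training split only. It also changes the positive rate the model sees, so predicted probabilities tend to drift upward, and a probability-sensitive metric such as logloss may get worse unless the scores are recalibrated.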

     

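    The second works on the algorithm: it accounts for the different costs of different kinds of misclassification, so that the algorithm also performs well on imbalanced data. In practice this means rewriting the cost function to assign a large cost to misclassifying minority labels.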
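One way to see what class weights do to the cost function is to look at the best constant prediction under a weighted log loss. The sketch below is an illustration under assumed numbers: synthetic counts at 50:1, and a weight of 1 for positives versus 0.02 for negatives, which is the weighting the attached script passes as `sample_weight`. `weighted_log_loss` and `best_constant` are names invented for this example.

```python
import math


def weighted_log_loss(y_true, p_pred, w_pos=1.0, w_neg=1.0):
    """Binary cross-entropy where each class carries its own weight."""
    total = 0.0
    weight_sum = 0.0
    for y, p in zip(y_true, p_pred):
        w = w_pos if y == 1 else w_neg
        total -= w * math.log(p if y == 1 else 1.0 - p)
        weight_sum += w
    return total / weight_sum


def best_constant(n_pos, n_neg, w_pos=1.0, w_neg=1.0):
    """Constant probability minimising the weighted loss.

    Setting the derivative of -(a*log(p) + b*log(1-p)) to zero,
    with a = w_pos*n_pos and b = w_neg*n_neg, gives p = a / (a + b).
    """
    a = w_pos * n_pos
    b = w_neg * n_neg
    return a / (a + b)


n_pos, n_neg = 100, 5000  # synthetic 50:1 counts
y_true = [1] * n_pos + [0] * n_neg

p_plain = best_constant(n_pos, n_neg)                  # equal weights
p_weighted = best_constant(n_pos, n_neg, 1.0, 0.02)    # 1 : 0.02 weights
print(f"unweighted optimum = {p_plain:.4f}")   # 1/51, about 0.0196
print(f"weighted optimum   = {p_weighted:.4f}")  # 0.5000

loss_at_half = weighted_log_loss(y_true, [0.5] * len(y_true), 1.0, 0.02)
loss_at_base = weighted_log_loss(y_true, [p_plain] * len(y_true), 1.0, 0.02)
print(loss_at_half < loss_at_base)  # under the weights, 0.5 beats the base rate
```

With equal weights the loss is minimised at the base rate of about 0.02, far below a 0.5 decision threshold. With the 1 : 0.02 weights the two classes carry equal total weight and the optimum moves to 0.5, which is why a weighted model is far more willing to predict 1.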

    PS: The attachment contains Python code that compares logloss and AUC. It runs as-is and does not hit a memory error.

    [Figure: base_lgb_weighted]

    # -*- coding: utf-8 -*-
    """
    Created on Wed Apr 4 10:53:58 2018
    @author : HaiyanJiang
    @email : jianghaiyan.cn@gmail.com

    What does this script do?
    Some ideas for improving the accuracy of imbalanced data classification.

    Data characteristics:
        imbalanced data.

    The models:
        model_baseline  : lgb
        model_baseline2 : another lgb
        model_baseline3 : bagging

    Other notes:
    Besides the basic features, the features include statistics of the
    user's click counts within the current hour and the current day,
    plus the current hour.
        'context_day', 'context_hour',
        'user_query_day', 'user_query_hour', 'user_query_day_hour',
        non_feat = [
            'instance_id', 'user_id', 'context_id', 'item_category_list',
            'item_property_list', 'predict_category_property',
            'context_timestamp', 'TagTime', 'context_day'
        ]
    """

    import itertools
    import time

    import lightgbm as lgb
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    from imblearn.ensemble import BalancedBaggingClassifier
    # scipy.interp was a deprecated alias of numpy.interp; use NumPy directly
    from numpy import interp
    from sklearn.ensemble import BaggingClassifier
    from sklearn.metrics import auc, confusion_matrix, log_loss, roc_curve


    def read_bigcsv(filename, **kw):
        """Read a large CSV file chunk by chunk and concatenate the chunks."""
        with open(filename) as rf:
            reader = pd.read_csv(rf, iterator=True, **kw)
            chunk_size = 100000
            chunks = []
            while True:
                try:
                    chunk = reader.get_chunk(chunk_size)
                    chunks.append(chunk)
                except StopIteration:
                    print("Iteration is stopped.")
                    break
            df = pd.concat(chunks, axis=0, join='outer', ignore_index=True)
        return df


    def timestamp2datetime(value):
        """Convert a Unix timestamp to a 'YYYY-MM-DD HH:MM:SS' local-time string."""
        value = time.localtime(value)
        dt = time.strftime('%Y-%m-%d %H:%M:%S', value)
        return dt


    '''
    from matplotlib import pyplot as plt
    tt = data['context_timestamp']
    plt.plot(tt)
    # The plot shows that the timestamps are not sorted; some are out of order.
    # For an online model, the data must be sorted by time first.
    # aa = data[data['user_id']==24779788309075]
    aa = data_train[data_train.duplicated(subset=None, keep='first')]
    bb = data_train[data_train.duplicated(subset=None, keep='last')]
    cc = data_train[data_train.duplicated(subset=None, keep=False)]

    a2 = pd.DataFrame(train_id)[pd.DataFrame(train_id).duplicated(keep=False)]
    b2 = train_id[train_id.duplicated(keep='last')]
    c2 = train_id[train_id.duplicated(keep=False)]

    c2 = data_train[data_train.duplicated(subset=None, keep=False)]

    Verified: 'instance_id' has duplicates.
    a3 = Xdata[Xdata['instance_id']==1037061371711078396]
    '''


    def convert_timestamp(data):
        '''
        1. convert timestamp to datetime.
        2. no sort, no reindex.
        data.duplicated(subset=None, keep='first')
        TagTime from-to is ('2018-09-18 00:00:01', '2018-09-24 23:59:47')
        'user_query_day', 'user_query_day_hour', 'hour',
        np.corrcoef(data['user_query_day'], data['user_query_hour'])
        np.corrcoef(data['user_query_hour'], data['user_query_day_hour'])
        np.corrcoef(data['user_query_day'], data['user_query_day_hour'])
        '''
        data['TagTime'] = data['context_timestamp'].apply(timestamp2datetime)
        # data['TagTime'][0], data['TagTime'][len(data) - 1]
        # x = data['TagTime'][len(data) - 1]
        data['context_day'] = data['TagTime'].apply(lambda x: int(x[8:10]))
        data['context_hour'] = data['TagTime'].apply(lambda x: int(x[11:13]))
        # rows per user per day
        query_day = data.groupby(['user_id', 'context_day']).size(
            ).reset_index().rename(columns={0: 'user_query_day'})
        data = pd.merge(data, query_day, 'left', on=['user_id', 'context_day'])
        # rows per user per hour of day (summed over all days)
        query_hour = data.groupby(['user_id', 'context_hour']).size(
            ).reset_index().rename(columns={0: 'user_query_hour'})
        data = pd.merge(data, query_hour, 'left', on=['user_id', 'context_hour'])
        # rows per user per (day, hour)
        query_day_hour = data.groupby(
            by=['user_id', 'context_day', 'context_hour']).size(
            ).reset_index().rename(columns={0: 'user_query_day_hour'})
        data = pd.merge(data, query_day_hour, 'left',
                        on=['user_id', 'context_day', 'context_hour'])
        return data


    def plot_confusion_matrix(cm, classes, normalize=False,
                              title='Confusion matrix',
                              cmap=plt.cm.Blues):
        """
        This function prints and plots the confusion matrix.
        Normalization can be applied by setting 'normalize=True'.
        """
        if normalize:
            cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
            print("Normalized confusion matrix")
        else:
            print('Confusion matrix, without normalization')
        print(cm)
        plt.imshow(cm, interpolation='nearest', cmap=cmap)
        plt.title(title)
        plt.colorbar()
        tick_marks = np.arange(len(classes))
        plt.xticks(tick_marks, classes, rotation=45)
        plt.yticks(tick_marks, classes)
        fmt = '.2f' if normalize else 'd'
        thresh = cm.max() / 2.
        for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
            plt.text(j, i, format(cm[i, j], fmt),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")
        plt.tight_layout()
        plt.ylabel('True label')
        plt.xlabel('Predicted label')


    def data_baseline():
        filename = '../round1_ijcai_18_data/round1_ijcai_18_train_20180301.txt'
        data = read_bigcsv(filename, sep=' ')
        # data = pd.read_csv(filename, sep=' ')
        data.drop_duplicates(inplace=True)
        data.reset_index(drop=True, inplace=True)  # very important
        data = convert_timestamp(data)
        train = data.loc[data['context_day'] < 24]  # days 18-23 for training
        test = data.loc[data['context_day'] == 24]  # use day 24 as the validation set for now
        features = [
            'item_id', 'item_brand_id', 'item_city_id', 'item_price_level',
            'item_sales_level', 'item_collected_level', 'item_pv_level',
            'user_gender_id', 'user_age_level', 'user_occupation_id',
            'user_star_level', 'context_page_id', 'shop_id',
            'shop_review_num_level', 'shop_review_positive_rate',
            'shop_star_level', 'shop_score_service',
            'shop_score_delivery', 'shop_score_description',
            'user_query_day', 'user_query_day_hour', 'context_hour',
        ]
        x_train = train[features]
        x_test = test[features]
        y_train = train['is_trade']
        y_test = test['is_trade']
        return x_train, x_test, y_train, y_test
    # x_train, x_test, y_train, y_test = data_baseline()


    def model_baseline(x_train, y_train, x_test, y_test):
        cat_names = [
            'item_price_level',
            'item_sales_level',
            'item_collected_level',
            'item_pv_level',
            'user_gender_id',
            'user_age_level',
            'user_occupation_id',
            'user_star_level',
            'context_page_id',
            'shop_review_num_level',
            'shop_star_level',
        ]
        print("begin train...")
        kw_lgb = dict(num_leaves=63, max_depth=7, n_estimators=80, random_state=6,)
        clf = lgb.LGBMClassifier(**kw_lgb)
        clf.fit(x_train, y_train, categorical_feature=cat_names,)
        prob = clf.predict_proba(x_test,)[:, 1]
        # Scores are rounded to two decimals, which coarsens the values
        # passed to log_loss and roc_curve.
        predict_score = [float('%.2f' % x) for x in prob]
        loss_val = log_loss(y_test, predict_score)
        # print(loss_val)  # 0.0848226750637
        fpr, tpr, thresholds = roc_curve(y_test, predict_score)
        mean_fpr = np.linspace(0, 1, 100)
        mean_tpr = interp(mean_fpr, fpr, tpr)
        x_auc = auc(fpr, tpr)
        fig = plt.figure('fig1')
        ax = fig.add_subplot(1, 1, 1)
        name = 'base_lgb'
        plt.plot(mean_fpr, mean_tpr, linestyle='--',
                 label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
                 (x_auc, loss_val), lw=2)
        y_pred = clf.predict(x_test)
        cm1 = plt.figure()
        cm = confusion_matrix(y_test, y_pred)
        plot_confusion_matrix(cm, classes=[0, 1], title='Confusion matrix base1')
        # add weights according to the labels: 1 for positives, 0.02 for negatives
        clf = lgb.LGBMClassifier(**kw_lgb)
        clf.fit(x_train, y_train,
                sample_weight=[1 if y == 1 else 0.02 for y in y_train],
                categorical_feature=cat_names)
        prob = clf.predict_proba(x_test,)[:, 1]
        predict_score = [float('%.2f' % x) for x in prob]
        loss_val = log_loss(y_test, predict_score)
        fpr, tpr, thresholds = roc_curve(y_test, predict_score)
        mean_fpr = np.linspace(0, 1, 100)
        mean_tpr = interp(mean_fpr, fpr, tpr)
        x_auc = auc(fpr, tpr)
        name = 'base_lgb_weighted'
        plt.figure('fig1')  # select the ROC figure
        plt.plot(
            mean_fpr, mean_tpr, linestyle='--',
            label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
            (x_auc, loss_val), lw=2)
        y_pred = clf.predict(x_test)
        cm2 = plt.figure()
        cm = confusion_matrix(y_test, y_pred)
        plot_confusion_matrix(cm, classes=[0, 1],
                              title='Confusion matrix basemodel')
        plt.figure('fig1')  # select the ROC figure
        plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='k', label='Luck')
        # make nice plotting
        ax.spines['top'].set_visible(False)
        ax.spines['right'].set_visible(False)
        ax.get_xaxis().tick_bottom()
        ax.get_yaxis().tick_left()
        ax.spines['left'].set_position(('outward', 10))
        ax.spines['bottom'].set_position(('outward', 10))
        plt.xlim([0, 1])
        plt.ylim([0, 1])
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.title('Receiver Operating Characteristic')
        plt.legend(loc="lower right")
        plt.show()
        return cm1, cm2, fig


    def model_baseline3(x_train, y_train, x_test, y_test):
        bagging = BaggingClassifier(random_state=0)
        balanced_bagging = BalancedBaggingClassifier(random_state=0)
        bagging.fit(x_train, y_train)
        balanced_bagging.fit(x_train, y_train)
        prob = bagging.predict_proba(x_test)[:, 1]
        predict_score = [float('%.2f' % x) for x in prob]
        loss_val = log_loss(y_test, predict_score)
        y_pred = [1 if x > 0.5 else 0 for x in predict_score]
        fpr, tpr, thresholds = roc_curve(y_test, predict_score)
        mean_fpr = np.linspace(0, 1, 100)
        mean_tpr = interp(mean_fpr, fpr, tpr)
        x_auc = auc(fpr, tpr)
        fig = plt.figure('Bagging')
        ax = fig.add_subplot(1, 1, 1)
        name = 'base_Bagging'
        plt.plot(mean_fpr, mean_tpr, linestyle='--',
                 label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
                 (x_auc, loss_val), lw=2)
        y_pred_bagging = bagging.predict(x_test)
        cm_bagging = confusion_matrix(y_test, y_pred_bagging)
        cm1 = plt.figure()
        plot_confusion_matrix(cm_bagging,
                              classes=[0, 1],
                              title='Confusion matrix of BaggingClassifier')
        # balanced_bagging
        prob = balanced_bagging.predict_proba(x_test)[:, 1]
        predict_score = [float('%.2f' % x) for x in prob]
        loss_val = log_loss(y_test, predict_score)
        fpr, tpr, thresholds = roc_curve(y_test, predict_score)
        mean_fpr = np.linspace(0, 1, 100)
        mean_tpr = interp(mean_fpr, fpr, tpr)
        x_auc = auc(fpr, tpr)
        plt.figure('Bagging')  # select the ROC figure
        name = 'base_Balanced_Bagging'
        plt.plot(
            mean_fpr, mean_tpr, linestyle='--',
            label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
            (x_auc, loss_val), lw=2)
        y_pred_balanced_bagging = balanced_bagging.predict(x_test)
        cm_balanced_bagging = confusion_matrix(y_test, y_pred_balanced_bagging)
        cm2 = plt.figure()
        plot_confusion_matrix(cm_balanced_bagging,
                              classes=[0, 1],
                              title='Confusion matrix of BalancedBagging')
        plt.figure('Bagging')  # select the ROC figure
        plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='k', label='Luck')
        # make nice plotting
        ax.spines['top'].set_visible(False)
        ax.spines['right'].set_visible(False)
        ax.get_xaxis().tick_bottom()
        ax.get_yaxis().tick_left()
        ax.spines['left'].set_position(('outward', 10))
        ax.spines['bottom'].set_position(('outward', 10))
        plt.xlim([0, 1])
        plt.ylim([0, 1])
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.title('Receiver Operating Characteristic')
        plt.legend(loc="lower right")
        plt.show()
        return cm1, cm2, fig


    def model_baseline2(x_train, y_train, x_test, y_test):
        params = {
            'task': 'train',
            'boosting_type': 'gbdt',
            # two-class softmax, so predict() returns one column per class
            'objective': 'multiclass',
            'num_class': 2,
            'verbose': 0,
            # LightGBM's metric name for multiclass log loss
            'metric': 'multi_logloss',
            'max_bin': 255,
            'max_depth': 7,
            'learning_rate': 0.3,
            'nthread': 4,
            # 'n_estimators' and 'num_boost_round' are both aliases of
            # num_iterations; LightGBM will use only one of them
            'n_estimators': 85,
            'num_leaves': 63,
            'feature_fraction': 0.8,
            'num_boost_round': 160,
        }
        lgb_train = lgb.Dataset(x_train, label=y_train)
        lgb_eval = lgb.Dataset(x_test, label=y_test, reference=lgb_train)
        print("begin train...")
        bst = lgb.train(params, lgb_train, valid_sets=lgb_eval)
        prob = bst.predict(x_test)[:, 1]
        predict_score = [float('%.2f' % x) for x in prob]
        loss_val = log_loss(y_test, predict_score)
        y_pred = [1 if x > 0.5 else 0 for x in predict_score]
        fpr, tpr, thresholds = roc_curve(y_test, predict_score)
        x_auc = auc(fpr, tpr)
        mean_fpr = np.linspace(0, 1, 100)
        mean_tpr = interp(mean_fpr, fpr, tpr)
        fig = plt.figure('weighted')
        ax = fig.add_subplot(1, 1, 1)
        name = 'base_lgb'
        plt.plot(mean_fpr, mean_tpr, linestyle='--',
                 label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
                 (x_auc, loss_val), lw=2)
        cm1 = plt.figure()
        cm = confusion_matrix(y_test, y_pred)
        plot_confusion_matrix(cm, classes=[0, 1],
                              title='Confusion matrix basemodel')
        # add weights according to the labels: 1 for positives, 0.02 for negatives
        lgb_train = lgb.Dataset(
            x_train, label=y_train,
            weight=[1 if y == 1 else 0.02 for y in y_train])
        lgb_eval = lgb.Dataset(
            x_test, label=y_test, reference=lgb_train,
            weight=[1 if y == 1 else 0.02 for y in y_test])
        bst = lgb.train(params, lgb_train, valid_sets=lgb_eval)
        prob = bst.predict(x_test)[:, 1]
        predict_score = [float('%.2f' % x) for x in prob]
        loss_val = log_loss(y_test, predict_score)
        y_pred = [1 if x > 0.5 else 0 for x in predict_score]
        fpr, tpr, thresholds = roc_curve(y_test, predict_score)
        mean_fpr = np.linspace(0, 1, 100)
        mean_tpr = interp(mean_fpr, fpr, tpr)
        x_auc = auc(fpr, tpr)
        plt.figure('weighted')  # select the ROC figure
        name = 'base_lgb_weighted'
        plt.plot(
            mean_fpr, mean_tpr, linestyle='--',
            label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
            (x_auc, loss_val), lw=2)
        cm2 = plt.figure()
        cm = confusion_matrix(y_test, y_pred)
        plot_confusion_matrix(cm, classes=[0, 1],
                              title='Confusion matrix basemodel')
        plt.figure('weighted')  # select the ROC figure
        plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='k', label='Luck')
        # make nice plotting
        ax.spines['top'].set_visible(False)
        ax.spines['right'].set_visible(False)
        ax.get_xaxis().tick_bottom()
        ax.get_yaxis().tick_left()
        ax.spines['left'].set_position(('outward', 10))
        ax.spines['bottom'].set_position(('outward', 10))
        plt.xlim([0, 1])
        plt.ylim([0, 1])
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.title('Receiver Operating Characteristic')
        plt.legend(loc="lower right")
        plt.show()
        return cm1, cm2, fig


    '''
    1. logloss VS AUC
    Although the baseline logloss is 0.0819, which is a very small value,
    the confusion matrix shows that y_pred is all 0: the model favors the
    majority class. Adding weights makes it slightly better?

    AUC is only 0.64~0.67.
    Why is the AUC so low when the logloss looks so good?
    Because the labels are extremely imbalanced: only about 2% are 1 (50:1).
    AUC is a friendlier measure of classification performance on imbalanced
    data, so selecting features by AUC may give better results.
    This is only a rough direction for improvement.

    2. handling imbalanced data:
        1. resampling, over- or under-:
           over- is increasing # of minority,
           under- is decreasing # of majority.
        2. revalue the loss function by giving a large loss to
           misclassifying the minority labels.
    '''


    if __name__ == "__main__":
        x_train, x_test, y_train, y_test = data_baseline()
        cm11, cm12, fig1 = model_baseline(x_train, y_train, x_test, y_test)
        cm21, cm22, fig2 = model_baseline2(x_train, y_train, x_test, y_test)
        cm31, cm32, fig3 = model_baseline3(x_train, y_train, x_test, y_test)

        fig1.savefig('./base_lgb_weighted.jpg', format='jpg')
        cm11.savefig('./Confusion matrix1.jpg', format='jpg')
        cm12.savefig('./Confusion matrix2.jpg', format='jpg')
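
The logloss-versus-AUC note in the script can be checked with a small calculation. The sketch below, using only the standard library, scores a predictor that outputs the same probability (the base rate) for every sample. The 50:1 counts are synthetic, and `binary_log_loss` and `pairwise_auc` are helper names assumed for this example rather than the scikit-learn functions used in the script.

```python
import math


def binary_log_loss(y_true, p_pred):
    """Mean binary cross-entropy for 0/1 labels and probabilities in (0, 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(y_true)


def pairwise_auc(y_true, scores):
    """AUC as the chance that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in the sample size, so toy data only.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


n_pos, n_neg = 100, 5000  # synthetic 50:1 counts
y_true = [1] * n_pos + [0] * n_neg
base_rate = n_pos / (n_pos + n_neg)
constant_scores = [base_rate] * len(y_true)  # no ranking ability at all

base_loss = binary_log_loss(y_true, constant_scores)
base_auc = pairwise_auc(y_true, constant_scores)
print(f"logloss = {base_loss:.4f}")  # about 0.0965
print(f"AUC     = {base_auc:.2f}")   # 0.50
```

A predictor with no ranking ability at all already reaches a logloss of roughly 0.0965 at this ratio, so a logloss around 0.08 is only a modest improvement over the base rate, which is consistent with an AUC in the 0.64 to 0.67 range.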
  • Original article: https://www.cnblogs.com/fujian-code/p/8757345.html