• "Magic Mirror Cup" Risk Control Algorithm Competition


    Competition Overview

    PPDai's "Magic Mirror" risk control system evaluates a user's current credit standing across an average of 400 data dimensions and assigns every borrower a credit score; on top of that, combined with the information of each newly posted listing, it predicts the listing's overdue rate within 6 months, giving investors a key basis for their decisions and promoting healthy, efficient internet finance. For the first time, PPDai is opening up its rich, real historical data and inviting you to take on the "Magic Mirror" system: using machine learning, can you design a default-prediction algorithm with higher predictive accuracy and better computational performance?

    Competition Rules

    Teams build a prediction model from the training set and use it to score the test set (the higher the score, the more likely the loan is to default).

    Evaluation metric: the competition is judged by AUC, i.e. the area under the ROC (Receiver Operating Characteristic) curve, which plots the True Positive Rate on the vertical axis against the False Positive Rate on the horizontal axis.
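
    For reference, a minimal sketch of how this metric can be computed with scikit-learn's roc_auc_score (the labels and scores below are made up purely for illustration):

    from sklearn.metrics import roc_auc_score

    # Hypothetical ground-truth labels and model scores, for illustration only
    y_true = [0, 0, 1, 0, 1]
    y_score = [0.1, 0.4, 0.8, 0.2, 0.6]
    print(roc_auc_score(y_true, y_score))  # 1.0 here: every default is scored above every non-default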

    Competition Data

    The competition releases loan-risk data from the domestic online lending industry, including the credit default label (dependent variable), the base and derived fields needed for modeling (independent variables), and the raw web-behavior data of the users involved. To protect borrower privacy and PPDai's intellectual property, the data fields have been anonymized.

    The data are encoded in GBK. The preliminary round provides a training set of 30,000 records and a test set of 20,000 records; the final round adds another 30,000 records for teams to refine their models, plus a further 10,000 records as its test set. Every training and test set consists of 3 CSV files.

    Preliminary-round data download link

    Final-round data download link

    Data dictionary download link

    Master

    Each row is one sample (a successfully funded loan) and contains 200+ fields of various kinds.

    • Idx: unique key of each loan; it joins to the Idx column in the other two files.
    • UserInfo_*: borrower profile fields
    • WeblogInfo_*: web behavior fields
    • Education_Info*: education fields
    • ThirdParty_Info_PeriodN_*: third-party data fields for time period N
    • SocialNetwork_*: social network fields
    • ListingInfo: loan funding date
    • target: default label (1 = loan default, 0 = repaid normally). The test set does not contain the target column.

    Log_Info

    Login records of the borrowers.

    • ListingInfo: loan funding date
    • LogInfo1: operation code
    • LogInfo2: operation category
    • LogInfo3: login time
    • Idx: unique key of each loan

    Userupdate_Info

    Profile-update records of the borrowers.

    • ListingInfo1: loan funding date
    • UserupdateInfo1: the field that was updated
    • UserupdateInfo2: update time
    • Idx: unique key of each loan
    # Import packages
    import numpy as np
    import pandas as pd
    from pandas import Series, DataFrame
    
    import matplotlib.pyplot as plt
    %matplotlib inline
    
    import seaborn as sns
    sns.set(style='whitegrid')
    
    import arrow
    
    # Use the arrow lib to parse dates into year, month, day, week, weekday, and early/mid/late month; one-hot encode these before feeding them into the model
    def parse_date(date_str, str_format='YYYY/MM/DD'):
        d = arrow.get(date_str, str_format)
        # Early / mid / late month (1 / 2 / 3)
        month_stage = int((d.day-1) / 10) + 1
        return (d.timestamp, d.year, d.month, d.day, d.week, d.isoweekday(), month_stage)
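
    # Example of what parse_date returns for a hypothetical date (assuming arrow < 1.0, where
    # Arrow.timestamp is still a property; in arrow >= 1.0 it became the method timestamp()):
    #   parse_date('2014/03/21')
    #   -> (1395360000, 2014, 3, 21, 12, 5, 3)
    #      i.e. (timestamp, year, month, day, ISO week, ISO weekday, month stage 1/2/3)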
    
    # Print the column names
    def show_cols(df):
        for c in df.columns:
            print(c)
    

    Load the data

    path = '/Training Set'
    # path = './PPD-First-Round-Data-Update/Training Set'
    train_master = pd.read_csv('PPD_Training_Master_GBK_3_1_Training_Set.csv', encoding='gbk')
    train_loginfo = pd.read_csv('PPD_LogInfo_3_1_Training_Set.csv', encoding='gbk')
    train_userinfo = pd.read_csv('PPD_Userupdate_Info_3_1_Training_Set.csv', encoding='gbk')
    

    Data Cleaning

    • Drop columns with a large proportion of missing values, e.g. more than 20% NaN
    • Drop rows with many missing values, keeping the number of dropped rows under 1% of the total
    • Fill the remaining missing values: inspect value_counts to judge whether a variable is continuous or discrete, then fill NaN with the most frequent value or the mean. Relying on inspection rather than simply checking whether the dtype is object is closer to the actual data
    # Number of null values in each column of train_master
    null_sum = train_master.isnull().sum()
    # Keep only the columns whose null count is non-zero
    null_sum = null_sum[null_sum!=0]
    null_sum_df = DataFrame(null_sum, columns=['num'])
    # Missing ratio
    null_sum_df['ratio'] = null_sum_df['num'] / 30000.0
    null_sum_df.sort_values(by='ratio', ascending=False, inplace=True)
    print(null_sum_df.head(10))
    
    # Drop the columns with severe missingness
    train_master.drop(['WeblogInfo_3', 'WeblogInfo_1', 'UserInfo_11', 'UserInfo_13', 'UserInfo_12', 'WeblogInfo_20'],
                      axis=1, inplace=True)
    
                     num     ratio
    WeblogInfo_3   29030  0.967667
    WeblogInfo_1   29030  0.967667
    UserInfo_11    18909  0.630300
    UserInfo_13    18909  0.630300
    UserInfo_12    18909  0.630300
    WeblogInfo_20   8050  0.268333
    WeblogInfo_21   3074  0.102467
    WeblogInfo_19   2963  0.098767
    WeblogInfo_2    1658  0.055267
    WeblogInfo_4    1651  0.055033
    
    # Drop the rows with severe missingness
    record_nan = train_master.isnull().sum(axis=1).sort_values(ascending=False)
    print(record_nan.head())
    # Drop rows with >= 5 missing values
    drop_record_index = [i for i in record_nan.loc[(record_nan>=5)].index]
    # Before dropping: (30000, 222)
    print('before train_master shape {}'.format(train_master.shape))
    train_master.drop(drop_record_index, inplace=True)
    # After dropping: (29189, 222)
    print('after train_master shape {}'.format(train_master.shape))
    # len(drop_record_index)
    
    29341    33
    18637    31
    17386    31
    29130    31
    29605    31
    dtype: int64
    before train_master shape (30000, 222)
    after train_master shape (29189, 222)
    
    # Total number of NaN values
    print('before all nan num: {}'.format(train_master.isnull().sum().sum()))
    
    # Where UserInfo_2 is null, fill it with the placeholder string '位置地点'
    train_master.loc[train_master['UserInfo_2'].isnull(), 'UserInfo_2'] = '位置地点'
    # Where UserInfo_4 is null, fill it with the placeholder string '位置地点'
    train_master.loc[train_master['UserInfo_4'].isnull(), 'UserInfo_4'] = '位置地点'
    
    def fill_nan(f, method):
        if method == 'most':
            # Fill with the most frequent value
            common_value = pd.value_counts(train_master[f], ascending=False).index[0]
        else:
            # Fill with the mean
            common_value = train_master[f].mean()
        train_master.loc[train_master[f].isnull(), f] = common_value
    
    # The choices below come from inspecting pd.value_counts(train_master[f])
    fill_nan('UserInfo_1', 'most')
    fill_nan('UserInfo_3', 'most')
    fill_nan('WeblogInfo_2', 'most')
    fill_nan('WeblogInfo_4', 'mean')
    fill_nan('WeblogInfo_5', 'mean')
    fill_nan('WeblogInfo_6', 'mean')
    fill_nan('WeblogInfo_19', 'most')
    fill_nan('WeblogInfo_21', 'most')
    
    print('after all nan num: {}'.format(train_master.isnull().sum().sum()))
    
    before all nan num: 0
    9725
    13478
    25688
    24185
    23997
    after all nan num: 0
    

    Feature Categorization

    • For every feature, if its most frequent value accounts for more than a threshold (50%) of the observations, convert the column to binary; e.g. [0,1,2,0,0,0,4,0,3] becomes [0,1,1,0,0,0,1,0,1]
    • Split the remaining features into numerical and categorical according to dtype
    • Among the numerical features, any with no more than 10 unique values is also moved to the categorical group
    ratio_threshold = 0.5
    binarized_features = []
    binarized_features_most_freq_value = []
    
    # Aggregating the means of the third_party features across periods did not work well, so it was abandoned
    # third_party_features = []
    
    # Loop over all columns and process every one except target as follows
    for f in train_master.columns:
        if f in ['target']:
            continue
        # Number of non-null values
        not_null_sum = (train_master[f].notnull()).sum()
        # Count of occurrences of the most frequent value
        most_count = pd.value_counts(train_master[f], ascending=False).iloc[0]
        # The most frequent value itself
        most_value = pd.value_counts(train_master[f], ascending=False).index[0]
        # Share of the most frequent value among the non-null values
        ratio = most_count / not_null_sum
        # If the ratio exceeds the threshold, mark this feature for binarization
        if ratio > ratio_threshold:
            binarized_features.append(f)
            binarized_features_most_freq_value.append(most_value)
    
    # Numerical features (non-object dtypes, excluding 'Idx', 'target' and the binarized features)
    numerical_features = [f for f in train_master.select_dtypes(exclude = ['object']).columns 
                          if f not in(['Idx', 'target']) and f not in binarized_features]
    
    # Categorical features (object dtypes, excluding 'Idx', 'target' and the binarized features)
    categorical_features = [f for f in train_master.select_dtypes(include = ["object"]).columns 
                            if f not in(['Idx', 'target']) and f not in binarized_features]
    
    # For each binarized feature, add a column prefixed with b_: 0 for the most frequent value, 1 for everything else; then drop the original column
    for i in range(len(binarized_features)):
        f = binarized_features[i]
        most_value = binarized_features_most_freq_value[i]
        train_master['b_' + f] = 1
        train_master.loc[train_master[f] == most_value, 'b_' + f] = 0
        train_master.drop([f], axis=1, inplace=True)
    
    feature_unique_count = []
    # For each numerical feature, count how many of its unique values are non-zero
    for f in numerical_features:
        feature_unique_count.append((np.count_nonzero(train_master[f].unique()), f))
        
    # print(sorted(feature_unique_count))
    
    # Move features with no more than 10 such values from numerical to categorical
    for c, f in feature_unique_count:
        if c <= 10:
            print('{} moved from numerical to categorical'.format(f))
            numerical_features.remove(f)
            categorical_features.append(f)
    
    [(60, 'WeblogInfo_4'), (59, 'WeblogInfo_6'), (167, 'WeblogInfo_7'), (64, 'WeblogInfo_16'), (103, 'WeblogInfo_17'), (38, 'UserInfo_18'), (273, 'ThirdParty_Info_Period1_1'), (252, 'ThirdParty_Info_Period1_2'), (959, 'ThirdParty_Info_Period1_3'), (916, 'ThirdParty_Info_Period1_4'), (387, 'ThirdParty_Info_Period1_5'), (329, 'ThirdParty_Info_Period1_6'), (1217, 'ThirdParty_Info_Period1_7'), (563, 'ThirdParty_Info_Period1_8'), (111, 'ThirdParty_Info_Period1_11'), (18784, 'ThirdParty_Info_Period1_13'), (17989, 'ThirdParty_Info_Period1_14'), (5073, 'ThirdParty_Info_Period1_15'), (20047, 'ThirdParty_Info_Period1_16'), (14785, 'ThirdParty_Info_Period1_17'), (336, 'ThirdParty_Info_Period2_1'), (298, 'ThirdParty_Info_Period2_2'), (1192, 'ThirdParty_Info_Period2_3'), (1149, 'ThirdParty_Info_Period2_4'), (450, 'ThirdParty_Info_Period2_5'), (431, 'ThirdParty_Info_Period2_6'), (1524, 'ThirdParty_Info_Period2_7'), (715, 'ThirdParty_Info_Period2_8'), (134, 'ThirdParty_Info_Period2_11'), (21685, 'ThirdParty_Info_Period2_13'), (20719, 'ThirdParty_Info_Period2_14'), (6582, 'ThirdParty_Info_Period2_15'), (22385, 'ThirdParty_Info_Period2_16'), (18554, 'ThirdParty_Info_Period2_17'), (339, 'ThirdParty_Info_Period3_1'), (293, 'ThirdParty_Info_Period3_2'), (1172, 'ThirdParty_Info_Period3_3'), (1168, 'ThirdParty_Info_Period3_4'), (453, 'ThirdParty_Info_Period3_5'), (428, 'ThirdParty_Info_Period3_6'), (1511, 'ThirdParty_Info_Period3_7'), (707, 'ThirdParty_Info_Period3_8'), (129, 'ThirdParty_Info_Period3_11'), (21521, 'ThirdParty_Info_Period3_13'), (20571, 'ThirdParty_Info_Period3_14'), (6569, 'ThirdParty_Info_Period3_15'), (22247, 'ThirdParty_Info_Period3_16'), (18311, 'ThirdParty_Info_Period3_17'), (324, 'ThirdParty_Info_Period4_1'), (295, 'ThirdParty_Info_Period4_2'), (1183, 'ThirdParty_Info_Period4_3'), (1143, 'ThirdParty_Info_Period4_4'), (447, 'ThirdParty_Info_Period4_5'), (422, 'ThirdParty_Info_Period4_6'), (1524, 'ThirdParty_Info_Period4_7'), (706, 'ThirdParty_Info_Period4_8'), (130, 'ThirdParty_Info_Period4_11'), (20894, 'ThirdParty_Info_Period4_13'), (20109, 'ThirdParty_Info_Period4_14'), (6469, 'ThirdParty_Info_Period4_15'), (21644, 'ThirdParty_Info_Period4_16'), (17849, 'ThirdParty_Info_Period4_17'), (322, 'ThirdParty_Info_Period5_1'), (284, 'ThirdParty_Info_Period5_2'), (1144, 'ThirdParty_Info_Period5_3'), (1119, 'ThirdParty_Info_Period5_4'), (436, 'ThirdParty_Info_Period5_5'), (401, 'ThirdParty_Info_Period5_6'), (1470, 'ThirdParty_Info_Period5_7'), (685, 'ThirdParty_Info_Period5_8'), (126, 'ThirdParty_Info_Period5_11'), (20010, 'ThirdParty_Info_Period5_13'), (19145, 'ThirdParty_Info_Period5_14'), (6033, 'ThirdParty_Info_Period5_15'), (20723, 'ThirdParty_Info_Period5_16'), (17149, 'ThirdParty_Info_Period5_17'), (312, 'ThirdParty_Info_Period6_1'), (265, 'ThirdParty_Info_Period6_2'), (1074, 'ThirdParty_Info_Period6_3'), (1046, 'ThirdParty_Info_Period6_4'), (414, 'ThirdParty_Info_Period6_5'), (363, 'ThirdParty_Info_Period6_6'), (1411, 'ThirdParty_Info_Period6_7'), (637, 'ThirdParty_Info_Period6_8'), (71, 'ThirdParty_Info_Period6_9'), (15, 'ThirdParty_Info_Period6_10'), (123, 'ThirdParty_Info_Period6_11'), (95, 'ThirdParty_Info_Period6_12'), (16605, 'ThirdParty_Info_Period6_13'), (16170, 'ThirdParty_Info_Period6_14'), (5188, 'ThirdParty_Info_Period6_15'), (17220, 'ThirdParty_Info_Period6_16'), (14553, 'ThirdParty_Info_Period6_17')]
    

    Feature Engineering

    numerical features

    • For every numerical feature, plot its distribution under each target value with a stripplot (with jitter); it is similar to a boxplot but makes large-value outliers easier to spot
    • Plot the density of every numerical feature; the plots show that all of them can be brought closer to a normal distribution by taking logarithms
    • After the log transform, a few extreme outliers can be dropped as well
    melt = pd.melt(train_master, id_vars=['target'], value_vars = [f for f in numerical_features])
    print(melt.head(50))
    print(melt.shape)
    g = sns.FacetGrid(data=melt, col="variable", col_wrap=4, sharex=False, sharey=False)
    g.map(sns.stripplot, 'target', 'value', jitter=True, palette="muted")
    
        target      variable      value
    0        0  WeblogInfo_4   1.000000
    1        0  WeblogInfo_4   1.000000
    2        0  WeblogInfo_4   2.000000
    3        0  WeblogInfo_4   3.027468
    4        0  WeblogInfo_4   1.000000
    5        0  WeblogInfo_4   2.000000
    6        1  WeblogInfo_4  13.000000
    7        0  WeblogInfo_4  12.000000
    8        1  WeblogInfo_4  10.000000
    9        0  WeblogInfo_4   1.000000
    10       0  WeblogInfo_4   3.000000
    11       0  WeblogInfo_4   1.000000
    12       0  WeblogInfo_4  11.000000
    13       1  WeblogInfo_4   1.000000
    14       0  WeblogInfo_4   3.000000
    15       0  WeblogInfo_4   2.000000
    16       0  WeblogInfo_4   4.000000
    17       0  WeblogInfo_4   4.000000
    18       1  WeblogInfo_4   1.000000
    19       0  WeblogInfo_4   2.000000
    20       0  WeblogInfo_4   3.000000
    21       0  WeblogInfo_4   3.000000
    22       0  WeblogInfo_4   8.000000
    23       0  WeblogInfo_4   1.000000
    24       0  WeblogInfo_4   1.000000
    25       0  WeblogInfo_4   2.000000
    26       0  WeblogInfo_4   9.000000
    27       0  WeblogInfo_4   2.000000
    28       0  WeblogInfo_4   2.000000
    29       0  WeblogInfo_4   2.000000
    30       0  WeblogInfo_4   3.000000
    31       0  WeblogInfo_4   6.000000
    32       0  WeblogInfo_4   1.000000
    33       0  WeblogInfo_4   3.000000
    34       0  WeblogInfo_4   3.027468
    35       0  WeblogInfo_4   6.000000
    36       0  WeblogInfo_4   9.000000
    37       0  WeblogInfo_4   2.000000
    38       1  WeblogInfo_4   5.000000
    39       0  WeblogInfo_4   2.000000
    40       0  WeblogInfo_4   2.000000
    41       0  WeblogInfo_4   3.000000
    42       0  WeblogInfo_4   3.027468
    43       0  WeblogInfo_4  15.000000
    44       0  WeblogInfo_4   2.000000
    45       0  WeblogInfo_4   3.000000
    46       0  WeblogInfo_4   3.000000
    47       0  WeblogInfo_4   2.000000
    48       0  WeblogInfo_4   3.000000
    49       0  WeblogInfo_4   2.000000
    (2714577, 3)
    
    
    E:\Anaconda3\envs\sklearn\lib\site-packages\seaborn\axisgrid.py:715: UserWarning: Using the stripplot function without specifying `order` is likely to produce an incorrect plot.
      warnings.warn(warning)
    
    
    
    
    
    <seaborn.axisgrid.FacetGrid at 0x4491c80860>
    

    png

    # Using the seaborn plots above, inspect each feature's distribution in positive vs. negative samples and drop the outliers
    
    print('{} lines before drop'.format(train_master.shape[0]))
    
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_1 > 250) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period6_2 > 400].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_2 > 250) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period6_3 > 2000].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_3 > 1250) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period6_4 > 1500].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_4 > 1250) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_5 > 400)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_7 > 2000)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_6 > 1500)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_5 > 1000) & (train_master.target == 0)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_8 > 1500)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_8 > 1000) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_16 > 2000000) & (train_master.target == 0)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_14 > 1000000) & (train_master.target == 0)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_12 > 60)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_11 > 120) & (train_master.target == 0)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_11 > 20) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_13 > 200000)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_13 > 150000) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_15 > 40000) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_17 > 130000) & (train_master.target == 0)].index, inplace=True)
    
    
    train_master.drop(train_master[train_master.ThirdParty_Info_Period5_1 > 500].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period5_2 > 500].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period5_3 > 3000) & (train_master.target == 0)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period5_3 > 2000)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period5_5 > 500].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period5_4 > 2000) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period5_6 > 700].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period5_6 > 300) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period5_7 > 4000)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period5_8 > 800)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period5_11 > 200)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period5_13 > 200000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period5_14 > 150000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period5_15 > 75000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period5_16 > 180000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period5_17 > 150000].index, inplace=True)
    
    # go above
    
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period4_1 > 400)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period4_2 > 350)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period4_3 > 1500)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period4_4 > 1600].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period4_4 > 1250) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period4_5 > 500].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period4_6 > 800].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period4_6 > 400) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period4_8 > 1000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period4_13 > 250000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period4_14 > 200000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period4_15 > 70000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period4_16 > 210000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period4_17 > 160000].index, inplace=True)
    
    
    train_master.drop(train_master[train_master.ThirdParty_Info_Period3_1 > 400].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period3_2 > 380].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period3_3 > 1750].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period3_4 > 1750].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period3_4 > 1250) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period3_5 > 600].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period3_6 > 800].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period3_6 > 400) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period3_7 > 1600) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period3_8 > 1000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period3_13 > 300000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period3_14 > 200000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period3_15 > 80000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period3_16 > 300000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period3_17 > 150000].index, inplace=True)
    
    
    train_master.drop(train_master[train_master.ThirdParty_Info_Period2_1 > 400].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period2_1 > 300) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period2_2 > 400].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period2_2 > 300) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period2_3 > 1800].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period2_3 > 1500) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period2_4 > 1500].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period2_5 > 580].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period2_6 > 800].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period2_6 > 400) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period2_7 > 2100].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period2_8 > 700) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period2_11 > 120].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period2_13 > 300000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period2_14 > 170000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period2_15 > 80000].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period2_15 > 50000) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period2_16 > 300000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period2_17 > 150000].index, inplace=True)
    
    
    train_master.drop(train_master[train_master.ThirdParty_Info_Period1_1 > 350].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period1_1 > 200) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period1_2 > 300].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period1_2 > 190) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period1_3 > 1500].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period1_4 > 1250].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period1_5 > 400].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period1_6 > 500].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period1_6 > 250) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period1_7 > 1800].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period1_8 > 720].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period1_8 > 600) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period1_11 > 100].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period1_13 > 200000].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period1_13 > 140000) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period1_14 > 150000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period1_15 > 70000].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period1_15 > 30000) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period1_16 > 200000].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period1_16 > 100000) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period1_17 > 100000].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period1_17 > 80000) & (train_master.target == 1)].index, inplace=True)
    
    train_master.drop(train_master[train_master.WeblogInfo_4 > 40].index, inplace=True)
    train_master.drop(train_master[train_master.WeblogInfo_6 > 40].index, inplace=True)
    train_master.drop(train_master[train_master.WeblogInfo_7 > 150].index, inplace=True)
    train_master.drop(train_master[train_master.WeblogInfo_16 > 50].index, inplace=True)
    train_master.drop(train_master[(train_master.WeblogInfo_16 > 25) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.WeblogInfo_17 > 100].index, inplace=True)
    train_master.drop(train_master[(train_master.WeblogInfo_17 > 80) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.UserInfo_18 < 10].index, inplace=True)
    
    print('{} lines after drop'.format(train_master.shape[0]))
    
    29189 lines before drop
    28074 lines after drop
    
    # melt = pd.melt(train_master, id_vars=['target'], value_vars = [f for f in numerical_features if f != 'Idx'])
    g = sns.FacetGrid(data=melt, col="variable", col_wrap=4, sharex=False, sharey=False)
    g.map(sns.distplot, "value")
    
    E:\Anaconda3\envs\sklearn\lib\site-packages\scipy\stats\stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
      return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
    
    
    
    
    
    <seaborn.axisgrid.FacetGrid at 0x44984f02e8>
    

    png

    # train_master_log = train_master.copy()
    numerical_features_log = [f for f in numerical_features if f not in ['Idx']]
    
    # Apply log1p to the numerical features; values equal to -1 produce -inf here, which is handled below
    for f in numerical_features_log:
        train_master[f + '_log'] = np.log1p(train_master[f])
        train_master.drop([f], axis=1, inplace=True)
    
    
    E:\Anaconda3\envs\sklearn\lib\site-packages\ipykernel_launcher.py:6: RuntimeWarning: divide by zero encountered in log1p
    
    from math import inf
    
    (train_master == -inf).sum().sum()
    
    206845
    
    train_master.replace(-inf, -1, inplace=True)
    
    # Density plots after the log transform; the distributions should now be closer to normal
    melt = pd.melt(train_master, id_vars=['target'], value_vars = [f+'_log' for f in numerical_features])
    g = sns.FacetGrid(data=melt, col="variable", col_wrap=4, sharex=False, sharey=False)
    g.map(sns.distplot, "value")
    
    <seaborn.axisgrid.FacetGrid at 0x44f45c2470>
    

    png

    # Stripplots after the log transform, to check for outliers remaining on the log scale
    g = sns.FacetGrid(data=melt, col="variable", col_wrap=4, sharex=False, sharey=False)
    g.map(sns.stripplot, 'target', 'value', jitter=True, palette="muted")
    
    E:\Anaconda3\envs\sklearn\lib\site-packages\seaborn\axisgrid.py:715: UserWarning: Using the stripplot function without specifying `order` is likely to produce an incorrect plot.
      warnings.warn(warning)
    
    
    
    
    
    <seaborn.axisgrid.FacetGrid at 0x44e4270908>
    

    png

    categorical features

    melt = pd.melt(train_master, id_vars=['target'], value_vars=[f for f in categorical_features])
    g = sns.FacetGrid(melt, col='variable', col_wrap=4, sharex=False, sharey=False)
    g.map(sns.countplot, 'value', palette="muted")
    
    E:\Anaconda3\envs\sklearn\lib\site-packages\seaborn\axisgrid.py:715: UserWarning: Using the countplot function without specifying `order` is likely to produce an incorrect plot.
      warnings.warn(warning)
    
    
    
    
    
    <seaborn.axisgrid.FacetGrid at 0x44e4c3eac8>
    

    png

    Checking correlations

    target_corr = np.abs(train_master.corr()['target']).sort_values(ascending=False)
    target_corr
    
    target                            1.000000
    ThirdParty_Info_Period6_5_log     0.139606
    ThirdParty_Info_Period6_11_log    0.139083
    ThirdParty_Info_Period6_4_log     0.137962
    ThirdParty_Info_Period6_7_log     0.135729
    ThirdParty_Info_Period6_3_log     0.132310
    ThirdParty_Info_Period6_14_log    0.131138
    ThirdParty_Info_Period6_8_log     0.130577
    ThirdParty_Info_Period6_16_log    0.128451
    ThirdParty_Info_Period6_13_log    0.128013
    ThirdParty_Info_Period5_5_log     0.126701
    ThirdParty_Info_Period6_17_log    0.126456
    ThirdParty_Info_Period5_4_log     0.121786
    ThirdParty_Info_Period6_10_log    0.121729
    ThirdParty_Info_Period6_1_log     0.121112
    ThirdParty_Info_Period5_11_log    0.117162
    ThirdParty_Info_Period5_7_log     0.114794
    ThirdParty_Info_Period6_2_log     0.112041
    ThirdParty_Info_Period6_9_log     0.112039
    ThirdParty_Info_Period5_14_log    0.111374
    ThirdParty_Info_Period5_3_log     0.108039
    ThirdParty_Info_Period5_16_log    0.104786
    ThirdParty_Info_Period6_12_log    0.104733
    ThirdParty_Info_Period5_13_log    0.104688
    ThirdParty_Info_Period5_1_log     0.104191
    ThirdParty_Info_Period5_8_log     0.102859
    ThirdParty_Info_Period4_5_log     0.101329
    ThirdParty_Info_Period5_17_log    0.100960
    ThirdParty_Info_Period4_4_log     0.094715
    ThirdParty_Info_Period5_2_log     0.090261
                                        ...   
    ThirdParty_Info_Period4_15_log    0.004560
    b_ThirdParty_Info_Period4_12      0.004331
    b_WeblogInfo_13                   0.004090
    b_SocialNetwork_4                 0.003752
    b_SocialNetwork_3                 0.003752
    b_SocialNetwork_2                 0.003752
    b_SocialNetwork_16                0.003711
    b_SocialNetwork_6                 0.003701
    b_SocialNetwork_5                 0.003701
    b_WeblogInfo_44                   0.003542
    WeblogInfo_7_log                  0.003414
    b_WeblogInfo_32                   0.002961
    WeblogInfo_16_log                 0.002954
    b_ThirdParty_Info_Period2_12      0.002925
    b_WeblogInfo_29                   0.002550
    b_WeblogInfo_41                   0.002522
    ThirdParty_Info_Period4_6_log     0.002362
    b_WeblogInfo_11                   0.002257
    b_WeblogInfo_12                   0.002209
    b_WeblogInfo_8                    0.001922
    b_WeblogInfo_40                   0.001759
    b_WeblogInfo_36                   0.001554
    b_WeblogInfo_26                   0.001357
    ThirdParty_Info_Period1_3_log     0.000937
    b_WeblogInfo_31                   0.000896
    b_WeblogInfo_23                   0.000276
    ThirdParty_Info_Period1_8_log     0.000194
    b_WeblogInfo_38                   0.000077
    b_WeblogInfo_10                        NaN
    b_WeblogInfo_49                        NaN
    Name: target, Length: 215, dtype: float64
    
    # at_home: guessing that UserInfo_2 and UserInfo_8 represent the user's current city and registered hometown, so their equality suggests the user lives in their hometown.
    train_master['at_home'] = np.where(train_master['UserInfo_2']==train_master['UserInfo_8'], 1, 0)
    train_master['at_home']
    
    0        1
    1        1
    2        1
    3        1
    4        1
    5        0
    6        0
    7        0
    9        0
    10       1
    11       1
    12       1
    13       1
    14       0
    15       1
    16       0
    17       1
    18       1
    19       1
    20       1
    21       0
    22       0
    23       0
    24       0
    25       1
    26       1
    27       1
    28       1
    29       0
    30       1
            ..
    29970    0
    29971    1
    29972    1
    29973    0
    29974    0
    29975    1
    29976    0
    29977    1
    29978    1
    29979    1
    29980    0
    29981    0
    29982    1
    29983    0
    29984    0
    29985    0
    29986    0
    29987    1
    29988    1
    29989    1
    29990    0
    29991    1
    29992    1
    29993    1
    29994    0
    29995    1
    29996    1
    29997    0
    29998    0
    29999    1
    Name: at_home, Length: 28074, dtype: int32
    
    train_master_ = train_master.copy()
    
    def parse_ListingInfo(date):
        d = parse_date(date, 'YYYY/M/D')
        return Series(d, 
                      index=['ListingInfo_timestamp', 'ListingInfo_year', 'ListingInfo_month',
                               'ListingInfo_day', 'ListingInfo_week', 'ListingInfo_isoweekday', 'ListingInfo_month_stage'], 
                      dtype=np.int32)
    
    ListingInfo_parsed = train_master_['ListingInfo'].apply(parse_ListingInfo)
    print('before train_master_ shape {}'.format(train_master_.shape))
    train_master_ = train_master_.merge(ListingInfo_parsed, how='left', left_index=True, right_index=True)
    print('after train_master_ shape {}'.format(train_master_.shape))
    
    before train_master_ shape (28074, 223)
    after train_master_ shape (28074, 230)
    

    train_loginfo: borrower login records

    • Group by Idx and extract the number of records, the number of distinct LogInfo1 values, the number of active days, and the date span
    def loginfo_aggr(group):
        # Number of records in the group
        loginfo_num = group.shape[0]
        # Number of distinct operation codes
        loginfo_LogInfo1_unique_num = group['LogInfo1'].unique().shape[0]
        # Number of distinct login dates (active days)
        loginfo_active_day_num = group['LogInfo3'].unique().shape[0]
        # Parse the earliest login date
        min_day = parse_date(np.min(group['LogInfo3']), str_format='YYYY-MM-DD')
        # Parse the latest login date
        max_day = parse_date(np.max(group['LogInfo3']), str_format='YYYY-MM-DD')
        # Number of days between the latest and earliest dates
        gap_day = round((max_day[0] - min_day[0]) / 86400)
    
        indexes = {
            'loginfo_num': loginfo_num, 
            'loginfo_LogInfo1_unique_num': loginfo_LogInfo1_unique_num, 
            'loginfo_active_day_num': loginfo_active_day_num, 
            'loginfo_gap_day': gap_day, 
            'loginfo_last_day_timestamp': max_day[0]
        }
        
        # TODO every individual LogInfo1,LogInfo2 count
    
        def sub_aggr_loginfo(sub_group):
            return sub_group.shape[0]
    
        sub_group = group.groupby(by=['LogInfo1', 'LogInfo2']).apply(sub_aggr_loginfo)
        indexes['loginfo_LogInfo12_unique_num'] = sub_group.shape[0]
        return Series(data=[indexes[c] for c in indexes], index=[c for c in indexes])
        
    train_loginfo_grouped = train_loginfo.groupby(by=['Idx']).apply(loginfo_aggr)
    train_loginfo_grouped.head()
    
    loginfo_num loginfo_LogInfo1_unique_num loginfo_active_day_num loginfo_gap_day loginfo_last_day_timestamp loginfo_LogInfo12_unique_num
    Idx
    3 26 4 8 63 1383264000 9
    5 11 6 4 13 1383696000 8
    8 125 7 13 12 1383696000 11
    12 199 8 11 328 1383264000 14
    16 15 4 7 8 1383523200 6
    train_loginfo_grouped.to_csv('train_loginfo_grouped.csv', header=True, index=True)
    
    train_loginfo_grouped = pd.read_csv('train_loginfo_grouped.csv')
    train_loginfo_grouped.head()
    
    Idx loginfo_num loginfo_LogInfo1_unique_num loginfo_active_day_num loginfo_gap_day loginfo_last_day_timestamp loginfo_LogInfo12_unique_num
    0 3 26 4 8 63 1383264000 9
    1 5 11 6 4 13 1383696000 8
    2 8 125 7 13 12 1383696000 11
    3 12 199 8 11 328 1383264000 14
    4 16 15 4 7 8 1383523200 6

    train_userinfo: borrower profile updates

    • Group by Idx and extract the number of records, the number of distinct UserupdateInfo1 values, the number of distinct UserupdateInfo2 update days, and the date span, plus a count of each kind of UserupdateInfo1 update
    def userinfo_aggr(group):
        op_columns = ['_EducationId', '_HasBuyCar', '_LastUpdateDate',
           '_MarriageStatusId', '_MobilePhone', '_QQ', '_ResidenceAddress',
           '_ResidencePhone', '_ResidenceTypeId', '_ResidenceYears', '_age',
           '_educationId', '_gender', '_hasBuyCar', '_idNumber',
           '_lastUpdateDate', '_marriageStatusId', '_mobilePhone', '_qQ',
           '_realName', '_regStepId', '_residenceAddress', '_residencePhone',
           '_residenceTypeId', '_residenceYears', '_IsCash', '_CompanyPhone',
           '_IdNumber', '_Phone', '_RealName', '_CompanyName', '_Age',
           '_Gender', '_OtherWebShopType', '_turnover', '_WebShopTypeId',
           '_RelationshipId', '_CompanyAddress', '_Department',
           '_flag_UCtoBcp', '_flag_UCtoPVR', '_WorkYears', '_ByUserId',
           '_DormitoryPhone', '_IncomeFrom', '_CompanyTypeId',
           '_CompanySizeId', '_companyTypeId', '_department',
           '_companyAddress', '_workYears', '_contactId', '_creationDate',
           '_flag_UCtoBCP', '_orderId', '_phone', '_relationshipId', '_userId',
           '_companyName', '_companyPhone', '_isCash', '_BussinessAddress',
           '_webShopUrl', '_WebShopUrl', '_SchoolName', '_HasBusinessLicense',
           '_dormitoryPhone', '_incomeFrom', '_schoolName', '_NickName',
           '_CreationDate', '_CityId', '_DistrictId', '_ProvinceId',
           '_GraduateDate', '_GraduateSchool', '_IdAddress', '_companySizeId',
           '_HasPPDaiAccount', '_PhoneType', '_PPDaiAccount', '_SecondEmail',
           '_SecondMobile', '_nickName', '_HasSbOrGjj', '_Position']
    
        # Number of records in the group
        userinfo_num = group.shape[0]
        # Number of distinct updated fields
        userinfo_unique_num = group['UserupdateInfo1'].unique().shape[0]
        # Number of distinct update dates
        userinfo_active_day_num = group['UserupdateInfo2'].unique().shape[0]
        # Parse the earliest update date
        min_day = parse_date(np.min(group['UserupdateInfo2']))
        # Parse the latest update date
        max_day = parse_date(np.max(group['UserupdateInfo2']))
        # Number of days between the latest and earliest dates
        gap_day = round((max_day[0] - min_day[0]) / (86400))
    
        indexes = {
            'userinfo_num': userinfo_num, 
            'userinfo_unique_num': userinfo_unique_num, 
            'userinfo_active_day_num': userinfo_active_day_num, 
            'userinfo_gap_day': gap_day, 
            'userinfo_last_day_timestamp': max_day[0]
        }
        
        for c in op_columns:
            indexes['userinfo' + c + '_num'] = 0
    
        def sub_aggr(sub_group):
            return sub_group.shape[0]
    
        sub_group = group.groupby(by=['UserupdateInfo1']).apply(sub_aggr)
        for c in sub_group.index:
            indexes['userinfo' + c + '_num'] = sub_group.loc[c]
        return Series(data=[indexes[c] for c in indexes], index=[c for c in indexes])
        
    train_userinfo_grouped = train_userinfo.groupby(by=['Idx']).apply(userinfo_aggr)
    train_userinfo_grouped.head()
    
    userinfo_num userinfo_unique_num userinfo_active_day_num userinfo_gap_day userinfo_last_day_timestamp userinfo_EducationId_num userinfo_HasBuyCar_num userinfo_LastUpdateDate_num userinfo_MarriageStatusId_num userinfo_MobilePhone_num ... userinfo_IdAddress_num userinfo_companySizeId_num userinfo_HasPPDaiAccount_num userinfo_PhoneType_num userinfo_PPDaiAccount_num userinfo_SecondEmail_num userinfo_SecondMobile_num userinfo_nickName_num userinfo_HasSbOrGjj_num userinfo_Position_num
    Idx
    3 13 11 1 0 1377820800 1 1 1 1 2 ... 0 0 0 0 0 0 0 0 0 0
    5 13 11 1 0 1382572800 1 1 2 1 2 ... 0 0 0 0 0 0 0 0 0 0
    8 14 12 2 10 1383523200 1 1 1 1 2 ... 0 0 0 0 0 0 0 0 0 0
    12 14 14 2 298 1380672000 1 1 1 1 0 ... 0 0 0 0 0 0 0 0 0 0
    16 13 12 2 9 1383609600 1 1 1 1 2 ... 0 0 0 0 0 0 0 0 0 0

    5 rows × 91 columns

    train_userinfo_grouped.to_csv('train_userinfo_grouped.csv', header=True, index=True)
    
    train_userinfo_grouped = pd.read_csv('train_userinfo_grouped.csv')
    train_userinfo_grouped.head()
    
    Idx userinfo_num userinfo_unique_num userinfo_active_day_num userinfo_gap_day userinfo_last_day_timestamp userinfo_EducationId_num userinfo_HasBuyCar_num userinfo_LastUpdateDate_num userinfo_MarriageStatusId_num ... userinfo_IdAddress_num userinfo_companySizeId_num userinfo_HasPPDaiAccount_num userinfo_PhoneType_num userinfo_PPDaiAccount_num userinfo_SecondEmail_num userinfo_SecondMobile_num userinfo_nickName_num userinfo_HasSbOrGjj_num userinfo_Position_num
    0 3 13 11 1 0 1377820800 1 1 1 1 ... 0 0 0 0 0 0 0 0 0 0
    1 5 13 11 1 0 1382572800 1 1 2 1 ... 0 0 0 0 0 0 0 0 0 0
    2 8 14 12 2 10 1383523200 1 1 1 1 ... 0 0 0 0 0 0 0 0 0 0
    3 12 14 14 2 298 1380672000 1 1 1 1 ... 0 0 0 0 0 0 0 0 0 0
    4 16 13 12 2 9 1383609600 1 1 1 1 ... 0 0 0 0 0 0 0 0 0 0

    5 rows × 92 columns

    print('before merge, train_master shape:{}'.format(train_master_.shape))
    
    # train_master_ = train_master_.merge(train_loginfo_grouped, how='left', left_on='Idx', right_index=True)
    # train_master_ = train_master_.merge(train_userinfo_grouped, how='left', left_on='Idx', right_index=True)
    
    train_master_ = train_master_.merge(train_loginfo_grouped, how='left', left_on='Idx', right_on='Idx')
    train_master_ = train_master_.merge(train_userinfo_grouped, how='left', left_on='Idx', right_on='Idx')
    
    train_master_.fillna(0, inplace=True)
    
    print('after merge, train_master shape:{}'.format(train_master_.shape))
    
    before merge, train_master shape:(28074, 230)
    after merge, train_master shape:(28074, 327)
    

    one-hot encoding features

    Do not let get_dummies infer which columns to encode: pandas would automatically pick only the object-dtype columns, but some non-object features are categorical in meaning and also need to be one-hot encoded.
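
    As a toy illustration of this point (a hypothetical frame, not the competition data): without an explicit columns= argument, get_dummies encodes only object-dtype columns, so integer-coded categorical features are silently left alone.

    df_demo = DataFrame({'grade': [1, 2, 1], 'city': ['a', 'b', 'a']})
    print(pd.get_dummies(df_demo).columns.tolist())
    # ['grade', 'city_a', 'city_b']  <- the integer column is not encoded
    print(pd.get_dummies(df_demo, columns=['grade', 'city']).columns.tolist())
    # ['grade_1', 'grade_2', 'city_a', 'city_b']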

    drop_columns = ['Idx', 'ListingInfo', 'UserInfo_20',  'UserInfo_19', 'UserInfo_8', 'UserInfo_7', 
                    'UserInfo_4','UserInfo_2',
                   'ListingInfo_timestamp', 'loginfo_last_day_timestamp', 'userinfo_last_day_timestamp']
    train_master_ = train_master_.drop(drop_columns, axis=1)
    
    dummy_columns = categorical_features.copy()
    dummy_columns.extend(['ListingInfo_year', 'ListingInfo_month', 'ListingInfo_day', 'ListingInfo_week', 
                          'ListingInfo_isoweekday', 'ListingInfo_month_stage'])
    finally_dummy_columns = []
    
    for c in dummy_columns:
        if c not in drop_columns:
            finally_dummy_columns.append(c)
    
    print('before get_dummies train_master_ shape {}'.format(train_master_.shape))
    train_master_ = pd.get_dummies(train_master_, columns=finally_dummy_columns)
    print('after get_dummies train_master_ shape {}'.format(train_master_.shape))
    
    before get_dummies train_master_ shape (28074, 316)
    after get_dummies train_master_ shape (28074, 444)
    

    normalized

    from sklearn.preprocessing import StandardScaler
    
    X_train = train_master_.drop(['target'], axis=1)
    X_train = StandardScaler().fit_transform(X_train)
    y_train = train_master_['target']
    print(X_train.shape, y_train.shape)
    
    E:\Anaconda3\envs\sklearn\lib\site-packages\sklearn\preprocessing\data.py:617: DataConversionWarning: Data with input dtype uint8, int32, int64, float64 were all converted to float64 by StandardScaler.
      return self.partial_fit(X, y)
    
    
    (28074, 443) (28074,)
    
    
    E:\Anaconda3\envs\sklearn\lib\site-packages\sklearn\base.py:462: DataConversionWarning: Data with input dtype uint8, int32, int64, float64 were all converted to float64 by StandardScaler.
      return self.fit(X, **fit_params).transform(X)
    
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import StratifiedKFold
    # from scikitplot import plotters as skplt
    
    from sklearn.linear_model import LogisticRegression
    from sklearn.linear_model import RidgeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.ensemble import AdaBoostClassifier
    from xgboost import XGBClassifier
    from sklearn.svm import SVC, LinearSVC
    
    # Use StratifiedKFold so each fold preserves the target distribution, with random shuffling
    cv = StratifiedKFold(n_splits=3, shuffle=True)
    
    # Evaluate auc, accuracy and recall via cross-validation
    def estimate(estimator, name='estimator'):
        auc = cross_val_score(estimator, X_train, y_train, scoring='roc_auc', cv=cv).mean()
        accuracy = cross_val_score(estimator, X_train, y_train, scoring='accuracy', cv=cv).mean()
        recall = cross_val_score(estimator, X_train, y_train, scoring='recall', cv=cv).mean()
    
        print("{}: auc:{:f}, recall:{:f}, accuracy:{:f}".format(name, auc, recall, accuracy))
    
    #     skplt.plot_learning_curve(estimator, X_train, y_train)
    #     plt.show()
    
    #     estimator.fit(X_train, y_train)
    #     y_probas = estimator.predict_proba(X_train)
    #     skplt.plot_roc_curve(y_true=y_train, y_probas=y_probas)
    #     plt.show()
    
    estimate(XGBClassifier(learning_rate=0.1, n_estimators=20, objective='binary:logistic'), 'XGBClassifier')
    estimate(RidgeClassifier(), 'RidgeClassifier')
    estimate(LogisticRegression(), 'LogisticRegression')
    # estimate(RandomForestClassifier(), 'RandomForestClassifier')
    estimate(AdaBoostClassifier(), 'AdaBoostClassifier')
    # estimate(SVC(), 'SVC')# too long to wait
    # estimate(LinearSVC(), 'LinearSVC')
    
    # XGBClassifier: auc:0.747668, recall:0.000000, accuracy:0.944575
    # RidgeClassifier: auc:0.754218, recall:0.000000, accuracy:0.944433
    # LogisticRegression: auc:0.758454, recall:0.015424, accuracy:0.942010
    # AdaBoostClassifier: auc:0.784086, recall:0.013495, accuracy:0.943791
    
    XGBClassifier: auc:0.755890, recall:0.000000, accuracy:0.944575
    RidgeClassifier: auc:0.753939, recall:0.000000, accuracy:0.944575
    
    
    E:\Anaconda3\envs\sklearn\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
      FutureWarning)
    
    
    LogisticRegression: auc:0.759646, recall:0.022494, accuracy:0.942438
    AdaBoostClassifier: auc:0.792333, recall:0.017988, accuracy:0.943827
    

    VotingClassifier

    from sklearn.ensemble import VotingClassifier
    
    estimators = []
    # estimators.append(('RidgeClassifier', RidgeClassifier()))
    estimators.append(('LogisticRegression', LogisticRegression()))
    estimators.append(('XGBClassifier', XGBClassifier(learning_rate=0.1, n_estimators=20, objective='binary:logistic')))
    estimators.append(('AdaBoostClassifier', AdaBoostClassifier()))
    # estimators.append(('RandomForestClassifier', RandomForestClassifier()))
    
    #voting: auc:0.794587, recall:0.000642, accuracy:0.944433
    
    voting = VotingClassifier(estimators = estimators, voting='soft')
    estimate(voting, 'voting')
    
    E:\Anaconda3\envs\sklearn\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
      FutureWarning)
    
    
    voting: auc:0.790281, recall:0.000642, accuracy:0.944361
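
    Scoring the test set

    The notebook stops at cross-validation. To produce an actual submission, the chosen model would be fitted on the full training set and used to score the test set, as the rules require. A minimal sketch, assuming X_test has gone through exactly the same cleaning, feature-engineering and scaling pipeline as X_train and that test_idx holds the matching Idx values (both names, and the output file name, are placeholders):

    voting.fit(X_train, y_train)
    # The predicted probability of class 1 (default) is the submitted score:
    # higher means the loan is judged more likely to default.
    test_scores = voting.predict_proba(X_test)[:, 1]
    submission = DataFrame({'Idx': test_idx, 'score': test_scores})
    submission.to_csv('PPD_submission.csv', index=False)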