• Coggle 30 Days of ML: Structured-Data Competition: Tianchi Used-Car Price Prediction (Part 2)


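    Task 4: Encode the competition fields with feature engineering

    Apply one-hot encoding to the categorical fields in the dataset that take more than two distinct values.

    Encode the categorical features with one-hot encoding (here via pd.get_dummies):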

    Train_data_onehot = pd.get_dummies(Train_data,columns = ['model', 'brand', 'bodyType', 'fuelType',
                                         'gearbox', 'notRepairedDamage'])
    
    Train_data_onehot
    
    SaleID name regDate power kilometer regionCode seller offerType creatDate price ... fuelType_2.0 fuelType_3.0 fuelType_4.0 fuelType_5.0 fuelType_6.0 gearbox_0.0 gearbox_1.0 notRepairedDamage_- notRepairedDamage_0.0 notRepairedDamage_1.0
    0 0 736 20040402 60 12.5 1046 0 0 20160404 1850 ... 0 0 0 0 0 1 0 0 1 0
    1 1 2262 20030301 0 15.0 4366 0 0 20160309 3600 ... 0 0 0 0 0 1 0 1 0 0
    2 2 14874 20040403 163 12.5 2806 0 0 20160402 6222 ... 0 0 0 0 0 1 0 0 1 0
    3 3 71865 19960908 193 15.0 434 0 0 20160312 2400 ... 0 0 0 0 0 0 1 0 1 0
    4 4 111080 20120103 68 5.0 6977 0 0 20160313 5200 ... 0 0 0 0 0 1 0 0 1 0
    ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
    149995 149995 163978 20000607 163 15.0 4576 0 0 20160327 5900 ... 0 0 0 0 0 0 1 0 1 0
    149996 149996 184535 20091102 125 10.0 2826 0 0 20160312 9500 ... 0 0 0 0 0 1 0 0 1 0
    149997 149997 147587 20101003 90 6.0 3302 0 0 20160328 7500 ... 0 0 0 0 0 1 0 0 1 0
    149998 149998 45907 20060312 156 15.0 1877 0 0 20160401 4999 ... 0 0 0 0 0 1 0 0 1 0
    149999 149999 177672 19990204 193 12.5 235 0 0 20160305 4700 ... 0 0 0 0 0 0 1 0 1 0

    150000 rows × 334 columns

    Test_data_onehot = pd.get_dummies(Test_data,columns = ['model', 'brand', 'bodyType', 'fuelType',
                                         'gearbox', 'notRepairedDamage'])
    
    Test_data_onehot
    
    SaleID name regDate power kilometer regionCode seller offerType creatDate v_0 ... fuelType_2.0 fuelType_3.0 fuelType_4.0 fuelType_5.0 fuelType_6.0 gearbox_0.0 gearbox_1.0 notRepairedDamage_- notRepairedDamage_0.0 notRepairedDamage_1.0
    0 200000 133777 20000501 101 15.0 5019 0 0 20160308 42.142061 ... 0 0 0 0 0 1 0 0 1 0
    1 200001 61206 19950211 73 6.0 1505 0 0 20160310 43.907034 ... 0 0 0 0 0 1 0 0 1 0
    2 200002 67829 20090606 120 5.0 1776 0 0 20160309 45.389665 ... 0 0 0 0 0 1 0 1 0 0
    3 200003 8892 20020601 58 15.0 26 0 0 20160314 42.788775 ... 0 0 0 0 0 1 0 0 1 0
    4 200004 76998 20030301 116 15.0 738 0 0 20160306 43.670763 ... 0 0 0 0 0 1 0 0 1 0
    ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
    49995 249995 111443 20041005 150 15.0 5564 0 0 20160309 46.321013 ... 0 0 0 0 0 0 1 1 0 0
    49996 249996 152834 20130409 179 4.0 5220 0 0 20160323 48.086547 ... 0 0 0 0 0 1 0 0 1 0
    49997 249997 132531 20041211 147 12.5 3795 0 0 20160316 46.145279 ... 0 0 0 0 0 0 1 0 1 0
    49998 249998 143405 20020702 176 15.0 61 0 0 20160327 45.507088 ... 0 0 0 0 0 0 1 0 1 0
    49999 249999 78202 20090708 0 3.0 4158 0 0 20160401 44.289471 ... 0 0 0 0 0 1 0 0 1 0

    50000 rows × 329 columns
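
    The encoded training frame has 334 columns and the test frame 329. Only one of those five extra columns is the price label, so four dummy columns differ, presumably because some category values occur in only one of the two sets. A minimal sketch of one way to align them, using toy frames with hypothetical values in place of Train_data and Test_data:

```python
import pandas as pd

# Toy frames standing in for Train_data / Test_data (hypothetical values).
train = pd.DataFrame({"brand": [0, 1, 2], "power": [60, 0, 163], "price": [1850, 3600, 6222]})
test = pd.DataFrame({"brand": [0, 1, 1], "power": [101, 73, 120]})

train_onehot = pd.get_dummies(train, columns=["brand"])
test_onehot = pd.get_dummies(test, columns=["brand"])

# Use the training feature columns (everything except the label) as the reference.
feature_columns = [c for c in train_onehot.columns if c != "price"]

# reindex adds the dummy columns the test set lacks (filled with 0)
# and drops any that never occurred in training.
test_aligned = test_onehot.reindex(columns=feature_columns, fill_value=0)

print(test_aligned.columns.tolist())
```

    After this step the model sees the same feature columns, in the same order, at training and prediction time.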

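    Extract year, month, and day information from the date features. The first step is to parse the integer dates into datetimes: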

    Train_data_create = pd.to_datetime(Train_data['creatDate'],format='%Y%m%d', errors='coerce')
    Test_data_create = pd.to_datetime(Test_data['creatDate'],format='%Y%m%d', errors='coerce')
    Train_data_reg = pd.to_datetime(Train_data['regDate'],format='%Y%m%d', errors='coerce')
    Test_data_reg = pd.to_datetime(Test_data['regDate'],format='%Y%m%d', errors='coerce')
    
    Train_data_create
    
    0        2016-04-04
    1        2016-03-09
    2        2016-04-02
    3        2016-03-12
    4        2016-03-13
                ...    
    149995   2016-03-27
    149996   2016-03-12
    149997   2016-03-28
    149998   2016-04-01
    149999   2016-03-05
    Name: creatDate, Length: 150000, dtype: datetime64[ns]
    
    Train_data_reg
    
    0        2004-04-02
    1        2003-03-01
    2        2004-04-03
    3        1996-09-08
    4        2012-01-03
                ...    
    149995   2000-06-07
    149996   2009-11-02
    149997   2010-10-03
    149998   2006-03-12
    149999   1999-02-04
    Name: regDate, Length: 150000, dtype: datetime64[ns]
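
    Once the dates are parsed, the year, month, and day can be read through the .dt accessor. The sketch below uses two rows copied from the output above; the new column names (reg_year, used_days, and so on) are hypothetical, and the day difference is an extra illustrative feature, not something the original post computes.

```python
import pandas as pd

# Two rows copied from the outputs above.
df = pd.DataFrame({"regDate": [20040402, 20030301], "creatDate": [20160404, 20160309]})

# Cast to str first so the %Y%m%d format is applied to text such as "20040402".
reg = pd.to_datetime(df["regDate"].astype(str), format="%Y%m%d", errors="coerce")
create = pd.to_datetime(df["creatDate"].astype(str), format="%Y%m%d", errors="coerce")

# Calendar parts of the registration date.
df["reg_year"] = reg.dt.year
df["reg_month"] = reg.dt.month
df["reg_day"] = reg.dt.day

# Hypothetical derived feature: days between registration and listing.
df["used_days"] = (create - reg).dt.days

print(df)
```

    With errors='coerce', any date that cannot be parsed becomes NaT, so the derived columns will contain missing values for those rows.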
    

    Task 5: Train and predict with a basic tree model from sklearn

    Learn how to split data for five-fold cross-validation (KFold)

    import numpy as np
    from sklearn.model_selection import KFold
    
    X = np.array([[1,2],[3,4],[1,2],[3,4],[3,4]])
    y = np.array([1,2,3,4,5])
    kf = KFold(n_splits = 5)
    
    for train_index,test_index in kf.split(X):
        print("TRAIN:",train_index,"TEST:",test_index)
        X_train,X_test = X[train_index],X[test_index]
        y_train,y_test = y[train_index],y[test_index]
        print(X_train,X_test)
        print(y_train,y_test)
    
    TRAIN: [1 2 3 4] TEST: [0]
    [[3 4]
     [1 2]
     [3 4]
     [3 4]] [[1 2]]
    [2 3 4 5] [1]
    TRAIN: [0 2 3 4] TEST: [1]
    [[1 2]
     [1 2]
     [3 4]
     [3 4]] [[3 4]]
    [1 3 4 5] [2]
    TRAIN: [0 1 3 4] TEST: [2]
    [[1 2]
     [3 4]
     [3 4]
     [3 4]] [[1 2]]
    [1 2 4 5] [3]
    TRAIN: [0 1 2 4] TEST: [3]
    [[1 2]
     [3 4]
     [1 2]
     [3 4]] [[3 4]]
    [1 2 3 5] [4]
    TRAIN: [0 1 2 3] TEST: [4]
    [[1 2]
     [3 4]
     [1 2]
     [3 4]] [[3 4]]
    [1 2 3 4] [5]
    

    Split the price label into 10 equal-sized parts by value, then split with StratifiedKFold

    # Sort by value and split into 10 equal-sized parts
    Y_data = Train_data['price']
    Y_data = Y_data.sort_values()
    Y_data_dict = {}
    for i in range(10):
        Y_data_dict[i]=Y_data[i*15000:(i+1)*15000]
    
    Y_data_list = []
    Y_data_iloc = list(Y_data_dict.keys())
    for i in Y_data_iloc:
        Y_data_list.append(list(Y_data_dict[i]))
    
    Y_data_iloc
    
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    
    len(Y_data_list[0])
    
    15000
    
    Y_data_iloc = [0,1,0,1,0,1,0,1,0,1]
    
    import numpy as np 
    from sklearn.model_selection import StratifiedKFold
    
    skf=StratifiedKFold(n_splits = 5,shuffle=True,random_state=0)
    
    Y_data_list = np.array(Y_data_list)
    Y_data_iloc = np.array(Y_data_iloc)
    
    for train_index,test_index in skf.split(Y_data_list,Y_data_iloc):
        print("TRAIN:",train_index,"TEST:",test_index) 
        X_train, X_test = Y_data_list[train_index], Y_data_list[test_index]
        y_train,y_test = Y_data_iloc[train_index],Y_data_iloc[test_index]
    
    TRAIN: [0 3 4 5 6 7 8 9] TEST: [1 2]
    TRAIN: [0 1 2 3 6 7 8 9] TEST: [4 5]
    TRAIN: [1 2 4 5 6 7 8 9] TEST: [0 3]
    TRAIN: [0 1 2 3 4 5 7 8] TEST: [6 9]
    TRAIN: [0 1 2 3 4 5 6 9] TEST: [7 8]
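
    A different way to read the same task is to assign every row a price-bin label and stratify the rows themselves on that label, so each fold covers the whole price range. The following is a minimal sketch under that assumption, with synthetic prices in place of Train_data['price']; pd.qcut and the X_dummy placeholder are choices made for this illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for Train_data['price'] (hypothetical values).
rng = np.random.default_rng(0)
price = pd.Series(rng.integers(100, 100000, size=1000), name="price")

# Cut price into 10 equal-frequency bins; the bin index is the stratification label.
price_bin = pd.qcut(price, q=10, labels=False, duplicates="drop")

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
X_dummy = np.zeros((len(price), 1))  # split() only needs the row count from X here

fold_bin_counts = []
for train_index, val_index in skf.split(X_dummy, price_bin):
    # Count how many validation rows fall into each price bin.
    counts = np.bincount(price_bin.iloc[val_index], minlength=10)
    fold_bin_counts.append(counts)
    print("val size:", len(val_index), "rows per bin:", counts)
```

    Each validation fold should then contain roughly the same number of rows from every price bin.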
    

    Learn to use the random forest model in sklearn

    Reference blog post: https://www.cnblogs.com/banshaohuan/p/13308680.html
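
    As a quick orientation, here is a minimal sketch of fitting RandomForestRegressor and scoring it with MAE. The data is synthetic (hypothetical features and target), and the single hold-out split is only for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (hypothetical): y depends on the first two features plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Same hyperparameters as used later in this post.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

val_mae = mean_absolute_error(y_val, model.predict(X_val))
print("validation MAE:", val_mae)

# Impurity-based importances; they sum to 1 across features.
print("feature importances:", model.feature_importances_)
```

    The feature_importances_ attribute gives a rough view of which columns the forest relies on most.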

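    Task 6: Submit the tree model's prediction file to Tianchi

    Train and predict using StratifiedKFold together with a random forest

    In each fold, record the model's predictions on the validation set and the test set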

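    # Feature columns used for modelling (the 18 columns shown in the output below)
    feature_cols = ['gearbox', 'power', 'kilometer'] + ['v_' + str(i) for i in range(15)]
    X_data = Train_data[feature_cols]
    X_data = X_data.fillna(-1)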
    
    X_data
    
    gearbox power kilometer v_0 v_1 v_2 v_3 v_4 v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14
    0 0.0 60 12.5 43.357796 3.966344 0.050257 2.159744 1.143786 0.235676 0.101988 0.129549 0.022816 0.097462 -2.881803 2.804097 -2.420821 0.795292 0.914762
    1 0.0 0 15.0 45.305273 5.236112 0.137925 1.380657 -1.422165 0.264777 0.121004 0.135731 0.026597 0.020582 -4.900482 2.096338 -1.030483 -1.722674 0.245522
    2 0.0 163 12.5 45.978359 4.823792 1.319524 -0.998467 -0.996911 0.251410 0.114912 0.165147 0.062173 0.027075 -4.846749 1.803559 1.565330 -0.832687 -0.229963
    3 1.0 193 15.0 45.687478 4.492574 -0.050616 0.883600 -2.228079 0.274293 0.110300 0.121964 0.033395 0.000000 -4.509599 1.285940 -0.501868 -2.438353 -0.478699
    4 0.0 68 5.0 44.383511 2.031433 0.572169 -1.571239 2.246088 0.228036 0.073205 0.091880 0.078819 0.121534 -1.896240 0.910783 0.931110 2.834518 1.923482
    ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
    149995 1.0 163 15.0 45.316543 -3.139095 -1.269707 -0.736609 -1.505820 0.280264 0.000310 0.048441 0.071158 0.019174 1.988114 -2.983973 0.589167 -1.304370 -0.302592
    149996 0.0 125 10.0 45.972058 -3.143764 -0.023523 -2.366699 0.698012 0.253217 0.000777 0.084079 0.099681 0.079371 1.839166 -2.774615 2.553994 0.924196 -0.272160
    149997 0.0 90 6.0 44.733481 -3.105721 0.595454 -2.279091 1.423661 0.233353 0.000705 0.118872 0.100118 0.097914 2.439812 -1.630677 2.290197 1.891922 0.414931
    149998 0.0 156 15.0 45.658634 -3.204785 -0.441680 -1.179812 0.620680 0.256369 0.000252 0.081479 0.083558 0.081498 2.075380 -2.633719 1.414937 0.431981 -1.659014
    149999 1.0 193 12.5 45.536383 -3.200326 -1.612893 -0.067144 -1.396166 0.284475 0.000000 0.040072 0.062543 0.025819 1.978453 -3.179913 0.031724 -1.483350 -0.342674

    150000 rows × 18 columns

    Y_data = Train_data['price']
    Y_data
    
    0         1850
    1         3600
    2         6222
    3         2400
    4         5200
              ... 
    149995    5900
    149996    9500
    149997    7500
    149998    4999
    149999    4700
    Name: price, Length: 150000, dtype: int64
    
    # In each fold, record the MAE on the training and validation sets, then average over the folds
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error, mean_absolute_error
    # Define the random forest model
    randomforest = RandomForestRegressor(n_estimators=50,random_state=0)
        
    scores_train = []
    scores_val = []
        
    for train_ind,val_ind in skf.split(X_data,Y_data):
            
        train_x = X_data.iloc[train_ind].values
        train_y = Y_data.iloc[train_ind]
        val_x = X_data.iloc[val_ind].values
        val_y = Y_data.iloc[val_ind]
            
        randomforest.fit(train_x,train_y)
        pred_train_random = randomforest.predict(train_x)
        pred_val_random = randomforest.predict(val_x)
            
        score_train = mean_absolute_error(train_y,pred_train_random)
        scores_train.append(score_train)
        score = mean_absolute_error(val_y,pred_val_random)
        scores_val.append(score)
    
    print('Train mae:',np.mean(scores_train))
    print('Val mae',np.mean(scores_val))
    
    Train mae: 253.97298353737415
    Val mae 667.4268940009031
    
    X_Test_data = Test_data[feature_cols]
    X_Test_data
    
    gearbox power kilometer v_0 v_1 v_2 v_3 v_4 v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14
    0 0.0 101 15.0 42.142061 -3.094739 -0.721300 1.466344 1.009846 0.236520 0.000241 0.105319 0.046233 0.094522 3.619512 -0.280607 -2.019761 0.978828 0.803322
    1 0.0 73 6.0 43.907034 -3.244605 -0.766430 1.276718 -1.065338 0.261518 0.000000 0.120323 0.046784 0.035385 2.997376 -1.406705 -1.020884 -1.349990 -0.200542
    2 0.0 120 5.0 45.389665 3.372384 -0.965565 -2.447316 0.624268 0.261691 0.090836 0.000000 0.079655 0.073586 -3.951084 -0.433467 0.918964 1.634604 1.027173
    3 0.0 58 15.0 42.788775 4.035052 -0.217403 1.708806 1.119165 0.236050 0.101777 0.098950 0.026830 0.096614 -2.846788 2.800267 -2.524610 1.076819 0.461610
    4 0.0 116 15.0 43.670763 -3.135382 -1.134107 0.470315 0.134032 0.257000 0.000000 0.066732 0.057771 0.068852 2.839010 -1.659801 -0.924142 0.199423 0.451014
    ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
    49995 1.0 150 15.0 46.321013 -3.304401 0.073363 -0.622359 -0.778349 0.263668 0.000292 0.141804 0.076393 0.039272 2.072901 -2.531869 1.716978 -1.063437 0.326587
    49996 0.0 179 4.0 48.086547 -3.318641 0.965881 -2.672160 0.357440 0.255310 0.000991 0.155868 0.108425 0.067841 1.358504 -3.290295 4.269809 0.140524 0.556221
    49997 1.0 147 12.5 46.145279 -3.305263 -0.015283 -0.288329 -0.687112 0.262933 0.000318 0.141872 0.071968 0.042966 2.165658 -2.417885 1.370612 -1.073133 0.270602
    49998 1.0 176 15.0 45.507088 -3.197006 -1.141252 -0.434930 -1.845040 0.282106 0.000023 0.067483 0.067526 0.009006 2.030114 -2.939244 0.569078 -1.718245 0.316379
    49999 0.0 0 3.0 44.289471 4.181452 0.547068 -0.775841 1.789601 0.231449 0.103947 0.096027 0.062328 0.110180 -3.689090 2.032376 0.109157 2.202828 0.847469

    50000 rows × 18 columns

    X_Test_data = X_Test_data.fillna(0)
    
    X_Test_data.info()
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 50000 entries, 0 to 49999
    Data columns (total 18 columns):
     #   Column     Non-Null Count  Dtype  
    ---  ------     --------------  -----  
     0   gearbox    50000 non-null  float64
     1   power      50000 non-null  int64  
     2   kilometer  50000 non-null  float64
     3   v_0        50000 non-null  float64
     4   v_1        50000 non-null  float64
     5   v_2        50000 non-null  float64
     6   v_3        50000 non-null  float64
     7   v_4        50000 non-null  float64
     8   v_5        50000 non-null  float64
     9   v_6        50000 non-null  float64
     10  v_7        50000 non-null  float64
     11  v_8        50000 non-null  float64
     12  v_9        50000 non-null  float64
     13  v_10       50000 non-null  float64
     14  v_11       50000 non-null  float64
     15  v_12       50000 non-null  float64
     16  v_13       50000 non-null  float64
     17  v_14       50000 non-null  float64
    dtypes: float64(17), int64(1)
    memory usage: 6.9 MB
    
    from sklearn.model_selection import train_test_split
    # Define the model-building function
    def build_model_randomforest(x_train,y_train):
        model = RandomForestRegressor(n_estimators=50,random_state=0)
        model.fit(x_train, y_train)
        return model
    
    model_random_pre = build_model_randomforest(X_data,Y_data)
    subpre = model_random_pre.predict(X_Test_data)
    subpre
    
    array([1227.94      , 1832.4       , 8610.005     , ..., 5474.99      ,
           5055.48      , 5637.44666667])
    

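    Average the test-set predictions over the folds and write them to a csv for submission to Tianchi. The code below takes a simpler route: it writes the predictions of the single model trained on all the training data above.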

    sub = pd.DataFrame()
    sub['SaleID'] = Test_data.SaleID
    sub['price'] = subpre
    sub.to_csv('submit.csv',index = False)
    
    sub.head()
    
    SaleID price
    0 200000 1227.940
    1 200001 1832.400
    2 200002 8610.005
    3 200003 929.880
    4 200004 2075.360
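
    For comparison, a sketch of what fold averaging could look like: each fold's model predicts the test set, and the per-fold predictions are averaged. It assumes synthetic stand-ins for X_data, Y_data, and the test features, uses plain KFold for brevity where the post uses StratifiedKFold, and writes to a hypothetical file name, submit_kfold.csv.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Synthetic stand-ins for X_data, Y_data and X_Test_data (hypothetical values).
rng = np.random.default_rng(0)
X_data = pd.DataFrame(rng.normal(size=(300, 3)), columns=["f0", "f1", "f2"])
Y_data = pd.Series(1000 + 500 * X_data["f0"] + rng.normal(scale=10, size=300))
X_test = pd.DataFrame(rng.normal(size=(50, 3)), columns=["f0", "f1", "f2"])
test_ids = np.arange(200000, 200050)  # mimics the SaleID range of the test set

kf = KFold(n_splits=5, shuffle=True, random_state=0)
oof_pred = np.zeros(len(X_data))   # out-of-fold predictions on the training rows
test_pred = np.zeros(len(X_test))  # running average of per-fold test predictions

for train_ind, val_ind in kf.split(X_data):
    # A fresh model per fold, so no fold sees its own validation rows.
    model = RandomForestRegressor(n_estimators=20, random_state=0)
    model.fit(X_data.iloc[train_ind], Y_data.iloc[train_ind])
    oof_pred[val_ind] = model.predict(X_data.iloc[val_ind])
    test_pred += model.predict(X_test) / kf.get_n_splits()

print("OOF MAE:", mean_absolute_error(Y_data, oof_pred))

sub = pd.DataFrame({"SaleID": test_ids, "price": test_pred})
sub.to_csv("submit_kfold.csv", index=False)
```

    The out-of-fold MAE gives one validation estimate over all training rows, and the averaged test predictions tend to be a little more stable than those of any single fold's model.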

    Submit the results to Tianchi

  • Original post: https://www.cnblogs.com/MurasameLory-chenyulong/p/15394549.html