• Kaggle study notes


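    The material in this part is fragmented, but the workflow is always the same, so first memorize the overall outline and then fill in the details bit by bit.

    Loading the data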

    import pandas as pd
    
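    # Read the train data (file name assumed by analogy with test.csv)
    train = pd.read_csv('train.csv')
    # Read the test data
    test = pd.read_csv('test.csv')
    # Print train and test columns to inspect the column (variable) names
    print('Train columns:', train.columns.tolist())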
    print('Test columns:', test.columns.tolist())
    
    # Read the sample submission file
    sample_submission = pd.read_csv('sample_submission.csv')
    
    # Look at the head() of the sample submission
    print(sample_submission.head())
    

    submission

    Public vs Private leaderboard
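    I haven't fully sorted out the difference between public and private yet. In short: the public leaderboard is scored on one part of the test set and is visible during the competition, while the private leaderboard is scored on the rest of the test set and decides the final ranking.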

    overfit

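    train: overfit: the error on the training set is small while the error on the validation set is large; in that case the model has overfit the training set.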

    train_test_split

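    This splits the data into a training set and a test set.
    The train_test_split function splits the original data set into a training set and a test set in a given proportion; the model is trained on the training part and evaluated on the test part.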
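
    A minimal sketch of how train_test_split might be used. The toy data frame, its column names and the 80/20 ratio are assumptions for illustration, not taken from the course:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data set (hypothetical): 10 rows, two features and a target
df = pd.DataFrame({
    'store': np.arange(10),
    'item': np.arange(10, 20),
    'sales': np.arange(100, 110),
})

# Hold out 20% of the rows; random_state makes the split reproducible
train_part, test_part = train_test_split(df, test_size=0.2, random_state=123)

print(train_part.shape, test_part.shape)
```

    Every row ends up in exactly one of the two parts, so the test part can be used to estimate the error on unseen data.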

    Training error vs. test error

    Compare the training error and the test error side by side to judge whether the model is overfitting.

    import xgboost as xgb
    from sklearn.metrics import mean_squared_error
    
    dtrain = xgb.DMatrix(data=train[['store', 'item']])
    dtest = xgb.DMatrix(data=test[['store', 'item']])
    
    # For each of 3 trained models
    # (xg_depth_2, xg_depth_8 and xg_depth_15 are assumed to have been trained beforehand)
    for model in [xg_depth_2, xg_depth_8, xg_depth_15]:
        # Make predictions
        train_pred = model.predict(dtrain)
        test_pred = model.predict(dtest)
        
        # Calculate metrics
        mse_train = mean_squared_error(train['sales'], train_pred)
        mse_test = mean_squared_error(test['sales'], test_pred)
        print('MSE Train: {:.3f}. MSE Test: {:.3f}'.format(mse_train, mse_test))
    
    <script.py> output:
        MSE Train: 631.275. MSE Test: 558.522
        MSE Train: 183.771. MSE Test: 337.337
        MSE Train: 134.984. MSE Test: 355.534
    

    Custom metric function

    import numpy as np
    
    # Import log_loss from sklearn
    from sklearn.metrics import log_loss
    
    # Define your own LogLoss function
    def own_logloss(y_true, prob_pred):
        # Find loss for each observation
        terms = y_true * np.log(prob_pred) + (1 - y_true) * np.log(1 - prob_pred)
        # Find mean over all observations
        err = np.mean(terms)
        return -err
    
    print('Sklearn LogLoss: {:.5f}'.format(log_loss(y_classification_true, y_classification_pred)))
    print('Your LogLoss: {:.5f}'.format(own_logloss(y_classification_true, y_classification_pred)))
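
    The comparison above relies on y_classification_true and y_classification_pred, which come from the exercise environment. A self-contained sketch on made-up arrays is below; the eps clipping is my own addition (an assumption, not part of the original function) so that a predicted probability of exactly 0 or 1 does not produce log(0):

```python
import numpy as np
from sklearn.metrics import log_loss

def own_logloss(y_true, prob_pred, eps=1e-15):
    # Clip probabilities away from 0 and 1 so that log() stays finite
    # (eps is an illustrative assumption, not part of the original notes)
    p = np.clip(prob_pred, eps, 1 - eps)
    # Loss for each observation
    terms = y_true * np.log(p) + (1 - y_true) * np.log(1 - p)
    # Mean over all observations, with the sign flipped
    return -np.mean(terms)

# Made-up labels and predicted probabilities of the positive class
y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

ours = own_logloss(y_true, y_prob)
theirs = log_loss(y_true, y_prob)
print('Sklearn LogLoss: {:.5f}'.format(theirs))
print('Your LogLoss: {:.5f}'.format(ours))
```

    The two numbers should agree up to floating point error.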
    

    EDA

    PLOT

    import pandas as pd
    import matplotlib.pyplot as plt
    
    # Create hour feature
    train['pickup_datetime'] = pd.to_datetime(train.pickup_datetime)
    train['hour'] = train.pickup_datetime.dt.hour
    
    # Find median fare_amount for each hour
    hour_price = train.groupby('hour', as_index=False)['fare_amount'].median()
    
    # Plot the line plot
    plt.plot(hour_price['hour'], hour_price['fare_amount'], marker='o')
    plt.xlabel('Hour of the day')
    plt.ylabel('Median fare amount')
    plt.title('Fare amount based on day time')
    plt.xticks(range(24))
    plt.show()
    

    Local validation

    Kfold

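    KFold cross-validation: split the data set into n_splits mutually exclusive subsets (folds). In each round one fold is used as the test set and the remaining (n_splits - 1) folds as the training set, so n_splits experiments are run and n_splits results are obtained.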
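
    To see the "mutually exclusive" part concretely, here is a small sketch on six made-up observations (the array X is an assumption for illustration). It prints the index sets of each fold and collects all test indices:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(6)  # six toy observations (hypothetical data)
kf = KFold(n_splits=3, shuffle=True, random_state=123)

test_rows = []
for fold, (train_index, test_index) in enumerate(kf.split(X)):
    print('Fold {}: train={}, test={}'.format(fold, train_index, test_index))
    # Remember which rows served as test rows in this fold
    test_rows.extend(int(i) for i in test_index)
```

    Each observation appears in the test fold exactly once across the three rounds.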

    # Import KFold
    from sklearn.model_selection import KFold
    
    # Create a KFold object
    kf = KFold(n_splits=3, shuffle=True, random_state=123)
    
    # Loop through each split
    fold = 0
    for train_index, test_index in kf.split(train):
        # Obtain training and testing folds
        cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
        print('Fold: {}'.format(fold))
        print('CV train shape: {}'.format(cv_train.shape))
        print('Medium interest listings in CV train: {}\n'.format(sum(cv_train.interest_level == 'medium')))
        fold += 1
    
    <script.py> output:
        Fold: 0
        CV train shape: (666, 9)
        Medium interest listings in CV train: 175
        
        Fold: 1
        CV train shape: (667, 9)
        Medium interest listings in CV train: 165
        
        Fold: 2
        CV train shape: (667, 9)
        Medium interest listings in CV train: 162
    

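    data leakage

    Time-based splitting

    With time-ordered data, a random K-fold split lets the model train on later observations and validate on earlier ones, which leaks future information into training. TimeSeriesSplit keeps every validation fold after its training fold.

    from sklearn.model_selection import TimeSeriesSplit
    
    # Create TimeSeriesSplit object
    time_kfold = TimeSeriesSplit(n_splits=3)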
    
    # Sort train data by date
    train = train.sort_values('date')
    
    # Iterate through each split
    fold = 0
    for train_index, test_index in time_kfold.split(train):
        cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
        
        print('Fold :', fold)
        print('Train date range: from {} to {}'.format(cv_train.date.min(), cv_train.date.max()))
        print('Test date range: from {} to {}\n'.format(cv_test.date.min(), cv_test.date.max()))
        fold += 1
    
    <script.py> output:
        Fold : 0
        Train date range: from 2017-12-01 to 2017-12-08
        Test date range: from 2017-12-08 to 2017-12-16
        
        Fold : 1
        Train date range: from 2017-12-01 to 2017-12-16
        Test date range: from 2017-12-16 to 2017-12-24
        
        Fold : 2
        Train date range: from 2017-12-01 to 2017-12-24
        Test date range: from 2017-12-24 to 2017-12-31
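
    In the output above the last train date and the first test date coincide, presumably because several rows share the same date. A small sketch on a made-up daily series with unique dates (the toy frame is an assumption) checks that each training fold ends strictly before its test fold starts:

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical daily series: 8 consecutive dates, already sorted
toy = pd.DataFrame({'date': pd.date_range('2017-12-01', periods=8, freq='D')})

tscv = TimeSeriesSplit(n_splits=3)
no_overlap = []
for train_index, test_index in tscv.split(toy):
    cv_train, cv_test = toy.iloc[train_index], toy.iloc[test_index]
    # The latest training date must be earlier than the earliest test date
    no_overlap.append(cv_train['date'].max() < cv_test['date'].min())

print(no_overlap)
```

    Note that the training window grows from fold to fold, while the test window always lies after it.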
    

    Validation error

    from sklearn.model_selection import TimeSeriesSplit
    import numpy as np
    
    # Sort train data by date
    train = train.sort_values('date')
    
    # Initialize 3-fold time cross-validation
    kf = TimeSeriesSplit(n_splits=3)
    
    # Get MSE scores for each cross-validation split
    # (get_fold_mse is a helper from the exercise environment; it is not defined in these notes)
    mse_scores = get_fold_mse(train, kf)
    
    print('Mean validation MSE: {:.5f}'.format(np.mean(mse_scores)))
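
    The notes do not show get_fold_mse. Below is one way such a helper could look; it is a hypothetical sketch, not the course's implementation. The choice of LinearRegression, the feature columns and the synthetic data are all assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

def get_fold_mse(train, kf, features=('store', 'item'), target='sales'):
    """Hypothetical helper: fit one model per fold, return the validation MSEs."""
    mse_scores = []
    for train_index, test_index in kf.split(train):
        fold_train, fold_test = train.iloc[train_index], train.iloc[test_index]
        # Fit on the training fold only
        model = LinearRegression()
        model.fit(fold_train[list(features)], fold_train[target])
        # Score on the held-out fold
        pred = model.predict(fold_test[list(features)])
        mse_scores.append(mean_squared_error(fold_test[target], pred))
    return mse_scores

# Synthetic data, made up for this sketch
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'date': pd.date_range('2017-12-01', periods=40, freq='D'),
    'store': rng.integers(1, 4, size=40),
    'item': rng.integers(1, 6, size=40),
})
toy['sales'] = 10 * toy['store'] + 2 * toy['item'] + rng.normal(0, 1, size=40)

toy = toy.sort_values('date')
mse_scores = get_fold_mse(toy, TimeSeriesSplit(n_splits=3))
print('Mean validation MSE: {:.5f}'.format(np.mean(mse_scores)))
```

    Averaging the per-fold scores gives a single number for comparing models; the spread between folds is also worth looking at.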
    

    feature engineering

    Arithmetical features

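    numerical
    Numerical features can be combined directly with arithmetic operations.

    # Combine the features by adding them up (here three area columns are summed)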
    train['TotalArea'] = train['TotalBsmtSF'] + train['FirstFlrSF'] + train['SecondFlrSF']
    

    Date features

    Extracting date and time features

    pd.to_datetime

    # Concatenate train and test together
    taxi = pd.concat([train, test])
    
    # Convert pickup date to datetime object
    taxi['pickup_datetime'] = pd.to_datetime(taxi['pickup_datetime'])
    # Create a day of week feature
    taxi['dayofweek'] = taxi['pickup_datetime'].dt.dayofweek
    
    # Create an hour feature
    taxi['hour'] = taxi['pickup_datetime'].dt.hour
    
    # Split back into train and test
    new_train = taxi[taxi['id'].isin(train['id'])]
    new_test = taxi[taxi['id'].isin(test['id'])]
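
    A tiny self-contained check of what .dt.dayofweek and .dt.hour return, on two made-up timestamps (the toy frame is an assumption for illustration):

```python
import pandas as pd

# Two hypothetical pickup times: a Friday morning and a Sunday night
toy = pd.DataFrame({'pickup_datetime': ['2017-12-01 08:30:00', '2017-12-03 23:05:00']})

# Convert the strings to datetime objects
toy['pickup_datetime'] = pd.to_datetime(toy['pickup_datetime'])

# Extract the day of week and the hour
toy['dayofweek'] = toy['pickup_datetime'].dt.dayofweek
toy['hour'] = toy['pickup_datetime'].dt.hour
print(toy)
```

    In pandas, dayofweek counts from Monday = 0 to Sunday = 6, so the Friday row gets 4 and the Sunday row gets 6.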
    

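    Categorical features: the encoding problem

    This is a big topic.

    label encoding
    The feature has an inherent order (ordinal feature)

    one hot encoding
    The feature has no inherent order, number of categories < 4

    target encoding (mean encoding, likelihood encoding, impact encoding)
    The feature has no inherent order, number of categories > 4

    beta target encoding
    The feature has no inherent order, number of categories > 4, K-fold cross validation

    No preprocessing (the model encodes the feature itself)
    CatBoost, lightgbm
    Text (categorical) features

    Ordered categorical features

    Unordered categorical features

    There are two main ways to handle them: label encoding and one-hot encoding.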

    Label encoding

    For a feature with m categories, label encoding maps each category to an integer between 0 and m-1. Label encoding is suited to ordinal features (features with an inherent order).

    # Concatenate train and test together
    houses = pd.concat([train, test])
    
    # Label encoder
    # In practice, fit and transform are usually called separately
    from sklearn.preprocessing import LabelEncoder
    le = LabelEncoder()
    
    # Create new features
    houses['RoofStyle_enc'] = le.fit_transform(houses['RoofStyle'])
    houses['CentralAir_enc'] = le.fit_transform(houses['CentralAir'])
    
    # Look at new features
    print(houses[['RoofStyle', 'RoofStyle_enc', 'CentralAir', 'CentralAir_enc']].head())
    <script.py> output:
          RoofStyle  RoofStyle_enc CentralAir  CentralAir_enc
        0     Gable              1          Y               1
        1     Gable              1          Y               1
        2     Gable              1          Y               1
        3     Gable              1          Y               1
        4     Gable              1          Y               1
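
    One caveat worth sketching: LabelEncoder assigns codes in sorted label order, which is not necessarily the feature's real order. The example below uses a made-up ordinal feature (the quality column and the order dictionary are assumptions) and compares LabelEncoder with an explicit mapping. It also calls fit and transform separately:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical ordinal feature with the real order Low < Medium < High
quality = pd.Series(['Low', 'High', 'Medium', 'Low'])

# LabelEncoder sorts the labels: High -> 0, Low -> 1, Medium -> 2
le = LabelEncoder()
le.fit(quality)                       # learn the categories
alphabetical = le.transform(quality)  # apply the learned mapping

# Explicit mapping that preserves Low < Medium < High
order = {'Low': 0, 'Medium': 1, 'High': 2}
ordinal = quality.map(order)

print(list(le.classes_))
print(alphabetical.tolist(), ordinal.tolist())
```

    When the order matters to the model, the explicit mapping is the safer choice; for a binary feature such as CentralAir the difference does not matter.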
    

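    one-hot

    For a feature with m categories, one-hot encoding (OHE) turns it into m binary features, one per category. These m binary features are mutually exclusive: exactly one is active for each row.

    One-hot encoding solves the problem that the original feature has no inherent order. The drawback is that for a high-cardinality categorical feature (one with many categories) the encoded feature space becomes too large (PCA dimensionality reduction can be considered here). In addition, one-hot features are quite unbalanced, so each split in a tree model yields little gain, and tree models usually need to grow very deep to reach good accuracy. OHE is therefore generally used when the number of categories is < 4.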

    # Concatenate train and test together
    houses = pd.concat([train, test])
    
    # Label encode binary 'CentralAir' feature
    from sklearn.preprocessing import LabelEncoder
    le = LabelEncoder()
    houses['CentralAir_enc'] = le.fit_transform(houses['CentralAir'])
    
    # Create One-Hot encoded features
    ohe = pd.get_dummies(houses['RoofStyle'], prefix='RoofStyle')
    
    # Concatenate OHE features to houses
    houses = pd.concat([houses, ohe], axis=1)
    
    # Look at OHE features
    print(houses[[col for col in houses.columns if 'RoofStyle' in col]].head(3))
    

    Target encoding

    Mean target encoding

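    When the target variable is used to build features, it is very important not to leak any information from the validation set.
    All target-based encodings should be computed on the training set and then simply merged or joined onto the validation and test sets.
    Even if the validation set contains the target variable, it must not be used in any encoding computation; otherwise the validation error estimate will be overly optimistic.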

    • Calculate the mean on the train, apply to the test
    • Split train into K folds. Calculate the out-of-fold mean for each fold, apply to this particular fold

    Encoding the feature

    def mean_target_encoding(train, test, target, categorical, alpha=5):
      
        # Get the train feature
        train_feature = train_mean_target_encoding(train, target, categorical, alpha)
      
        # Get the test feature
        test_feature = test_mean_target_encoding(train, test, target, categorical, alpha)
        
        # Return new features to add to the model
        return train_feature, test_feature
    

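    mean_target_encoding

    I don't understand this part very well yet...
    K-fold cross-validation

    from sklearn.model_selection import KFold
    
    # Create 5-fold cross-validation
    kf = KFold(n_splits=5, random_state=123, shuffle=True)
    
    # For each folds split
    for train_index, test_index in kf.split(bryant_shots):
        # Copy the folds so that adding a column does not trigger SettingWithCopyWarning
        cv_train, cv_test = bryant_shots.iloc[train_index].copy(), bryant_shots.iloc[test_index].copy()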
    
        # Create mean target encoded feature
        cv_train['game_id_enc'], cv_test['game_id_enc'] = mean_target_encoding(train=cv_train,
                                                                               test=cv_test,
                                                                               target='shot_made_flag',
                                                                               categorical='game_id',
                                                                               alpha=5)
        # Look at the encoding
        print(cv_train[['game_id', 'shot_made_flag', 'game_id_enc']].sample(n=1))
    

    Missing data

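    Handling missing values

    1. XGBoost and LightGBM do not require missing values to be filled in beforehand, because they handle them natively.

    Count the missing values per column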

    df.isnull().sum()
    

    Mean imputation

    # Import SimpleImputer
    from sklearn.impute import SimpleImputer
    
    # Create mean imputer
    mean_imputer = SimpleImputer(strategy='mean')
    
    # Price imputation
    rental_listings[['price']] = mean_imputer.fit_transform(rental_listings[['price']])
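
    A self-contained version of the two steps above (count the missing values, then fill them with the mean) on a made-up frame; the toy prices are an assumption for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical listings with one missing price
toy = pd.DataFrame({'price': [100.0, np.nan, 300.0]})

# Count missing values per column before imputation
missing_before = toy.isnull().sum()
print(missing_before)

# Replace the missing price with the mean of the observed prices
mean_imputer = SimpleImputer(strategy='mean')
toy[['price']] = mean_imputer.fit_transform(toy[['price']])
print(toy)
```

    The missing price is replaced by 200.0, the mean of the two observed values.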
    

    Baseline model

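    I plan to work through a concrete example for this part. The video was a bit vague here, but the Kaggle site does have many useful baselines. Parts of the workflow are fixed, so once you have the overall idea you can build on it.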

  • Original post: https://www.cnblogs.com/gaowenxingxing/p/12660261.html