• Mercari Price Suggestion in Kaggle


    Mercari Price Suggestion

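    • I recently came across a competition whose task is to predict the price of an item from known features such as its description, brand, category, and condition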

    • Submissions are evaluated with the Root Mean Squared Logarithmic Error (RMSLE), where n is the number of items, p_i is the predicted price, and a_i is the actual price:

      \[ \epsilon = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( \log(p_i + 1) - \log(a_i + 1) \right)^2 } \]
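The metric can be computed directly with NumPy. The sketch below is a minimal illustration, not the official scorer; the function name `rmsle` and the sample prices are made up for the example.

```python
import numpy as np

def rmsle(predicted, actual):
    """Root Mean Squared Logarithmic Error for non-negative prices."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # np.log1p(x) computes log(x + 1), matching the terms in the formula
    log_diff = np.log1p(predicted) - np.log1p(actual)
    return float(np.sqrt(np.mean(log_diff ** 2)))

# Hypothetical prices, for illustration only
actual = [10.0, 52.0, 35.0]
predicted = [12.0, 40.0, 35.0]
score = rmsle(predicted, actual)
print(score)
```

Because the error is measured on log(price + 1), a model trained on the log-transformed target with plain RMSE is optimizing the same quantity.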

    • The submission file has two columns, test_id and price: the item id from the test data and the predicted price
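As a small sketch of that file layout, the snippet below writes a two-column CSV to an in-memory buffer. The ids and prices are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical predictions; real ids come from the test file
submission = pd.DataFrame({
    "test_id": [0, 1, 2],
    "price": [12.5, 40.0, 18.0],
})

buffer = io.StringIO()
submission.to_csv(buffer, index=False)  # index=False keeps only the two columns
csv_text = buffer.getvalue()
print(csv_text)
```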

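    • The training and test sets contain the following features:

      • train_id / test_id: the item id; each listing has exactly one id
      • name: the listing title
      • item_condition_id: the condition of the item
      • category_name: the category
      • brand_name: the brand
      • price: the sale price; this column is absent from the test set and is the value to predict
      • shipping: 1 if the shipping fee is paid by the seller and 0 if paid by the buyer, i.e. 1 means shipping is included and 0 means it is not
      • item_description: the full description of the item; price-like values have been removed from the text and replaced with [rm]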
    import pandas as pd
    import numpy as np
    
    df = pd.read_csv('input/train.tsv', sep='\t')
    

    data information

    df.head()
    
       train_id  name                                 item_condition_id  category_name                                      brand_name  price  shipping  item_description
    0  0         MLB Cincinnati Reds T Shirt Size XL  3                  Men/Tops/T-shirts                                  NaN         10.0   1         No description yet
    1  1         Razer BlackWidow Chroma Keyboard     3                  Electronics/Computers & Tablets/Components & P...  Razer       52.0   0         This keyboard is in great condition and works ...
    2  2         AVA-VIV Blouse                       1                  Women/Tops & Blouses/Blouse                        Target      10.0   1         Adorable top with a hint of lace and a key hol...
    3  3         Leather Horse Statues                1                  Home/Home Décor/Home Décor Accents                 NaN         35.0   1         New with tags. Leather horses. Retail for [rm]...
    4  4         24K GOLD plated rose                 1                  Women/Jewelry/Necklaces                            NaN         44.0   0         Complete with certificate of authenticity
    df.info()
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1482535 entries, 0 to 1482534
    Data columns (total 8 columns):
    train_id             1482535 non-null int64
    name                 1482535 non-null object
    item_condition_id    1482535 non-null int64
    category_name        1476208 non-null object
    brand_name           849853 non-null object
    price                1482535 non-null float64
    shipping             1482535 non-null int64
    item_description     1482531 non-null object
    dtypes: float64(1), int64(3), object(4)
    memory usage: 90.5+ MB
    

    price distribution

    df.price.describe()
    
    count    1.482535e+06
    mean     2.673752e+01
    std      3.858607e+01
    min      0.000000e+00
    25%      1.000000e+01
    50%      1.700000e+01
    75%      2.900000e+01
    max      2.009000e+03
    Name: price, dtype: float64
    
    import matplotlib.pyplot as plt
    
    
    plt.subplot(1, 2, 1)  # one row, two columns; this is the first plot: plt.subplot(nrows, ncols, index)
    df.price.plot.hist(bins=50, figsize=(12, 6), edgecolor = 'white', range = [0, 250])
    plt.xlabel('price', fontsize=12)
    plt.title('Price Distribution', fontsize=12)
    plt.subplot(1, 2, 2)
    np.log((df.price+1)).plot.hist(bins=50, figsize=(12, 6), edgecolor = 'white')
    plt.xlabel('log(price+1)', fontsize=12)
    plt.title('log(Price+1) Distribution', fontsize=12)
    
    Text(0.5, 1.0, 'log(Price+1) Distribution')
    

    [Figure: histograms of price (range 0 to 250) and of log(price+1)]

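    • The price feature is strongly right-skewed: most prices cluster around 10 to 20, while the maximum is 2009. It therefore needs a log transform; after the transform, log(price+1) follows a fairly regular, approximately normal distribution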
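The effect of the log transform on skewness can be checked numerically. The sketch below uses synthetic log-normal "prices" (an assumption for illustration, not the Mercari data) and compares the sample skewness before and after `np.log1p`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic, log-normally distributed prices (not the competition data)
prices = pd.Series(rng.lognormal(mean=3.0, sigma=0.75, size=10_000))

raw_skew = prices.skew()             # large positive value: long right tail
log_skew = np.log1p(prices).skew()   # close to 0: roughly symmetric
print(f"skewness of price:        {raw_skew:.2f}")
print(f"skewness of log(price+1): {log_skew:.2f}")
```

Running the same two `.skew()` calls on `df.price` would give the corresponding numbers for the real training set.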

    Effect of shipping on price

    df['shipping'].value_counts(normalize=True)
    
    0    0.552726
    1    0.447274
    Name: shipping, dtype: float64
    
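    • About 55.3% of items have shipping paid by the buyer and 44.7% have it paid by the seller, so it is worth checking whether shipping affects the price
    shipping_yes = df.loc[df['shipping'] == 1, 'price']  # seller pays shipping
    shipping_no = df.loc[df['shipping'] == 0, 'price']  # buyer pays shipping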
    
    fig,ax  = plt.subplots(figsize=(8,5))
    ax.hist(shipping_yes,color='r',alpha=0.5,bins=30,range=[0,100],label='shipping_yes')
    ax.hist(shipping_no,color='green',alpha=0.5,bins=30,range=[0,100],label=
           'shipping_no')
    plt.xlabel('price',fontsize=12)
    plt.ylabel('frequency',fontsize=12)
    plt.title('price_distribution by shipping method')
    plt.tick_params(labelsize=12)
    plt.legend()
    plt.show()
    

    [Figure: price distribution by shipping method, range 0 to 100]

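    print("Average price when the buyer pays shipping: %s dollars" % (round(shipping_no.mean(), 2)))
    print("Average price when the seller pays shipping: %s dollars" % (round(shipping_yes.mean(), 2)))
    
    
    Average price when the buyer pays shipping: 30.11 dollars
    Average price when the seller pays shipping: 22.57 dollars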
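The two group means can also be obtained in one step with `groupby`, which makes it easy to add the median (less sensitive to the long price tail). This is a sketch on a toy frame with made-up rows; on the real data the same call would be applied to `df`.

```python
import pandas as pd

# Toy listings, invented for illustration (not rows from train.tsv)
toy = pd.DataFrame({
    "shipping": [0, 0, 0, 1, 1, 1],
    "price":    [30.0, 45.0, 12.0, 10.0, 25.0, 16.0],
})

# One row per shipping value: 0 = buyer pays, 1 = seller pays
summary = toy.groupby("shipping")["price"].agg(["count", "mean", "median"])
print(summary)
```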
    
    
    fig,ax  = plt.subplots(figsize=(8,5))
    ax.hist(np.log(shipping_yes+1),color='r',alpha=0.5,bins=50,label='shipping_yes')
    ax.hist(np.log(shipping_no+1),color='green',alpha=0.5,bins=50,label=
           'shipping_no')
    plt.xlabel('log(price+1)',fontsize=12)
    plt.ylabel('frequency',fontsize=12)
    plt.title('log(price+1)_distribution by shipping method')
    plt.tick_params(labelsize=12)
    plt.legend()
    plt.show()
    
    

    [Figure: log(price+1) distribution by shipping method]

    Handling categorical data

    "The dataset has {} records in total".format(df.shape[0])
    
    
    'The dataset has 1482535 records in total'
    
    
    • The name, category_name, brand_name, and item_condition_id columns all need to be converted to categorical data
    df['category_name'].value_counts()
    # 1287 distinct categories in total
    
    
    Women/Athletic Apparel/Pants, Tights, Leggings                 60177
    Women/Tops & Blouses/T-Shirts                                  46380
    Beauty/Makeup/Face                                             34335
    Beauty/Makeup/Lips                                             29910
    Electronics/Video Games & Consoles/Games                       26557
    Beauty/Makeup/Eyes                                             25215
    Electronics/Cell Phones & Accessories/Cases, Covers & Skins    24676
    Women/Underwear/Bras                                           21274
    Women/Tops & Blouses/Tank, Cami                                20284
    Women/Tops & Blouses/Blouse                                    20284
    Women/Dresses/Above Knee, Mini                                 20082
    Women/Jewelry/Necklaces                                        19758
    Women/Athletic Apparel/Shorts                                  19528
    Beauty/Makeup/Makeup Palettes                                  19103
    Women/Shoes/Boots                                              18864
    Beauty/Fragrance/Women                                         18628
    Beauty/Skin Care/Face                                          15836
    Women/Women's Handbags/Shoulder Bag                            15328
    Men/Tops/T-shirts                                              15108
    Women/Dresses/Knee-Length                                      14770
    Women/Athletic Apparel/Shirts & Tops                           14738
    Women/Shoes/Sandals                                            14662
    Women/Jewelry/Bracelets                                        14497
    Men/Shoes/Athletic                                             14257
    Kids/Toys/Dolls & Accessories                                  13957
    Women/Women's Accessories/Wallets                              13616
    Women/Jeans/Slim, Skinny                                       13392
    Home/Home Décor/Home Décor Accents                             13004
    Women/Swimwear/Two-Piece                                       12758
    Women/Shoes/Athletic                                           12662
                                                                   ...  
    Men/Suits/Four Button                                              1
    Handmade/Bags and Purses/Other                                     1
    Handmade/Dolls and Miniatures/Primitive                            1
    Handmade/Furniture/Fixture                                         1
    Handmade/Housewares/Bathroom                                       1
    Handmade/Woodworking/Sculptures                                    1
    Men/Suits/One Button                                               1
    Handmade/Geekery/Housewares                                        1
    Kids/Safety/Crib Netting                                           1
    Vintage & Collectibles/Furniture/Entertainment                     1
    Home/Furniture/Bathroom Furniture                                  1
    Handmade/Glass/Vases                                               1
    Handmade/Geekery/Videogame                                         1
    Handmade/Woodworking/Sports                                        1
    Handmade/Art/Aceo                                                  1
    Vintage & Collectibles/Paper Ephemera/Map                          1
    Handmade/Patterns/Painting                                         1
    Handmade/Housewares/Cleaning                                       1
    Home/Home Décor/Doorstops                                          1
    Handmade/Accessories/Belt                                          1
    Handmade/Patterns/Accessories                                      1
    Vintage & Collectibles/Housewares/Towel                            1
    Other/Automotive/RV Parts & Accessories                            1
    Handmade/Paper Goods/Pad                                           1
    Handmade/Accessories/Cozy                                          1
    Kids/Diapering/Washcloths & Towels                                 1
    Handmade/Pets/Blanket                                              1
    Handmade/Needlecraft/Clothing                                      1
    Handmade/Furniture/Shelf                                           1
    Handmade/Quilts/Bed                                                1
    Name: category_name, Length: 1287, dtype: int64
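To illustrate what converting a string column to the pandas `category` dtype does, the sketch below uses a handful of category strings copied from the output above. Each distinct string is stored once, and the rows hold small integer codes.

```python
import pandas as pd

# A few category_name values, used here only as a toy example
toy = pd.Series([
    "Beauty/Makeup/Face",
    "Men/Tops/T-shirts",
    "Beauty/Makeup/Face",
    "Women/Jewelry/Necklaces",
])

as_cat = toy.astype("category")
categories = as_cat.cat.categories.tolist()  # distinct values, sorted
codes = as_cat.cat.codes.tolist()            # integer code per row
print(categories)
print(codes)
```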
    

    item_condition_id vs price

    • A standard box plot
    import seaborn as sns
    
    
    sns.boxplot(x = 'item_condition_id', y = np.log(df['price']+1), data = df, palette = sns.color_palette('RdBu',5))
    
    
    <matplotlib.axes._subplots.AxesSubplot at 0x127d5bdd8>
    
    

    [Figure: box plot of log(price+1) by item_condition_id]

    • Prices differ considerably across item conditions
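A box plot shows quartiles per group, and the same numbers can be read off with `groupby(...).describe()`. The sketch below uses a toy frame with invented prices, so its values say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "item_condition_id": [1, 1, 2, 3, 3, 3],
    "price": [44.0, 10.0, 20.0, 52.0, 10.0, 8.0],
})
toy["log_price"] = np.log1p(toy["price"])

# Quartiles of log(price+1) per condition: the box edges and the median line
stats = toy.groupby("item_condition_id")["log_price"].describe()[["25%", "50%", "75%"]]
print(stats)
```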

    The competition workhorse: LightGBM

    • settings
    NUM_BRANDS = 4000
    NUM_CATEGORIES = 1000
    NAME_MIN_DF =10
    MAX_FEATURES_ITEM_DESCRIPTION =50000
    
    
    "There are %d items that do not have a category name" % df['category_name'].isnull().sum()
    
    
    'There are 6327 items that do not have a category name'
    
    
    "There are %d items that do not have a brand name" % df['brand_name'].isnull().sum()
    
    
    'There are 632682 items that do not have a brand name'
    
    
    "There are %d items that do not have an item_description" % df['item_description'].isnull().sum()
    
    
    'There are 4 items that do not have an item_description'
    
    
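    def handling_missing_inplace(datasets):
        # Assign back to each column rather than calling inplace=True on a
        # column selection, which is unreliable under pandas copy-on-write.
        datasets['category_name'] = datasets['category_name'].fillna('missing')
        datasets['brand_name'] = datasets['brand_name'].fillna('missing')
        # 'No description yet' is a placeholder that is easy to miss without reading the data closely
        datasets['item_description'] = datasets['item_description'].replace('No description yet', 'missing')
        datasets['item_description'] = datasets['item_description'].fillna('missing')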
    
    
    def cutting(datasets):
        pop_brand = datasets['brand_name'].value_counts().loc[lambda x:x.index!='missing'].index[:NUM_BRANDS]
        datasets.loc[~datasets['brand_name'].isin(pop_brand),'brand_name'] ='missing'
        pop_category = datasets['category_name'].value_counts().loc[lambda x:x.index!='missing'].index[:NUM_CATEGORIES]
        datasets.loc[~datasets['category_name'].isin(pop_category),'category_name'] ='missing'
    
    def to_category(datasets):
        datasets['category_name'] = datasets['category_name'].astype('category')
        datasets['brand_name'] = datasets['brand_name'].astype('category')
        datasets['item_condition_id'] = datasets['item_condition_id'].astype('category')
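The `cutting` step keeps only the most frequent brands and categories and maps everything else to 'missing'. The sketch below replays that logic on a toy frame; `NUM_BRANDS_TOY` and the rows are hypothetical stand-ins for `NUM_BRANDS = 4000` and the real data.

```python
import pandas as pd

NUM_BRANDS_TOY = 2  # hypothetical small cap; the post uses NUM_BRANDS = 4000

toy = pd.DataFrame({
    "brand_name": ["Nike", "Nike", "Nike", "Razer", "Razer", "Target", None],
})
toy["brand_name"] = toy["brand_name"].fillna("missing")

# value_counts() sorts by frequency, so the first N index labels are the top N brands
counts = toy["brand_name"].value_counts()
popular = counts[counts.index != "missing"].index[:NUM_BRANDS_TOY]

# Every brand outside the top N is collapsed into 'missing'
toy.loc[~toy["brand_name"].isin(popular), "brand_name"] = "missing"
result = toy["brand_name"].tolist()
print(result)
```

Capping the vocabulary this way bounds the number of one-hot columns that the later encoding step produces.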
    
    • Looking at the price counts, some items surprisingly have a price of 0, so those rows need to be removed
    df['price'].value_counts().reset_index().sort_values(by='index').head()
    
    
         index  price
    25     3.0  18703
    28     4.0  16139
    17     5.0  31502
    261    5.5     33
    16     6.0  32260
    df=df[df['price']!=0].reset_index(drop=True)
    
    
    df.head()
    
    
       train_id  name                                 item_condition_id  category_name                                      brand_name  price  shipping  item_description
    0  0         MLB Cincinnati Reds T Shirt Size XL  3                  Men/Tops/T-shirts                                  NaN         10.0   1         No description yet
    1  1         Razer BlackWidow Chroma Keyboard     3                  Electronics/Computers & Tablets/Components & P...  Razer       52.0   0         This keyboard is in great condition and works ...
    2  2         AVA-VIV Blouse                       1                  Women/Tops & Blouses/Blouse                        Target      10.0   1         Adorable top with a hint of lace and a key hol...
    3  3         Leather Horse Statues                1                  Home/Home Décor/Home Décor Accents                 NaN         35.0   1         New with tags. Leather horses. Retail for [rm]...
    4  4         24K GOLD plated rose                 1                  Women/Jewelry/Necklaces                            NaN         44.0   0         Complete with certificate of authenticity
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.preprocessing import LabelBinarizer
    import lightgbm as lgb
    from scipy.sparse import csr_matrix, hstack  # for building and stacking sparse matrices
    # reference: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.hstack.html
    import gc
    import time
    from sklearn.linear_model import Ridge
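The pipeline that follows turns each column into a sparse matrix and stacks them side by side. The sketch below shows the idea on a three-row toy frame (rows invented for illustration) using the same `CountVectorizer`, `LabelBinarizer`, and `hstack` calls.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelBinarizer

# Toy listings, invented for illustration
toy = pd.DataFrame({
    "name": ["gold rose necklace", "razer keyboard", "rose blouse"],
    "brand_name": ["missing", "Razer", "Target"],
})

# Bag-of-words counts: one column per distinct token in `name`
X_name = CountVectorizer().fit_transform(toy["name"])
# One-hot encoding of the brand, kept sparse
X_brand = LabelBinarizer(sparse_output=True).fit_transform(toy["brand_name"])

# Column-wise concatenation; CSR supports the row slicing used to split train and test
X = hstack((X_brand, X_name)).tocsr()
print(X.shape)
```

Here the three brands give 3 columns and the six distinct tokens give 6, so the stacked matrix has 9 columns. On the real data the column count runs into the tens of thousands, which is why everything stays sparse.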
    
    def main():
        start_time = time.time()
    
        train = pd.read_table('input/train.tsv', engine='c')
        # train=train[train['price']!=0]
        test = pd.read_table('input/test_stg2.tsv', engine='c')
        print('[{}] Finished to load data'.format(time.time() - start_time))
        print('Train shape: ', train.shape)
        print('Test shape: ', test.shape)
    
        nrow_train = train.shape[0]
        y = np.log1p(train["price"])
        merge: pd.DataFrame = pd.concat([train, test])
        submission: pd.DataFrame = test[['test_id']].copy()  # copy so the price column can be added later without a SettingWithCopyWarning
    
        del train
        del test
        gc.collect()
    
        handling_missing_inplace(merge)
        print('[{}] Finished to handle missing'.format(time.time() - start_time))
    
        cutting(merge)
        print('[{}] Finished to cut'.format(time.time() - start_time))
    
        to_category(merge)
        print('[{}] Finished to convert categorical'.format(time.time() - start_time))
    
        cv = CountVectorizer(min_df=NAME_MIN_DF)
        X_name = cv.fit_transform(merge['name'])
        print('[{}] Finished count vectorize `name`'.format(time.time() - start_time))
    
        cv = CountVectorizer()
        X_category = cv.fit_transform(merge['category_name'])
        print('[{}] Finished count vectorize `category_name`'.format(time.time() - start_time))
    
        tv = TfidfVectorizer(max_features=MAX_FEATURES_ITEM_DESCRIPTION,
                             ngram_range=(1, 3),
                             stop_words='english')
        X_description = tv.fit_transform(merge['item_description'])
        print('[{}] Finished TFIDF vectorize `item_description`'.format(time.time() - start_time))
    
        lb = LabelBinarizer(sparse_output=True)
        X_brand = lb.fit_transform(merge['brand_name'])
        print('[{}] Finished label binarize `brand_name`'.format(time.time() - start_time))
    
        X_dummies = csr_matrix(pd.get_dummies(merge[['item_condition_id', 'shipping']],
                                              sparse=True).values)
        print('[{}] Finished to get dummies on `item_condition_id` and `shipping`'.format(time.time() - start_time))
    
        sparse_merge = hstack((X_dummies, X_description, X_brand, X_category, X_name)).tocsr()
        print('[{}] Finished to create sparse merge'.format(time.time() - start_time))
    
        X = sparse_merge[:nrow_train]
        X_test = sparse_merge[nrow_train:]
        
        #train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size = 0.1, random_state = 144) 
        d_train = lgb.Dataset(X, label=y)
        #d_valid = lgb.Dataset(valid_X, label=valid_y, max_bin=8192)
        #watchlist = [d_train, d_valid]
        
        params = {
            'learning_rate': 0.73,
            'application': 'regression',
            'max_depth': 3,
            'num_leaves': 100,
            'verbosity': -1,
            'metric': 'RMSE',
        }
    
    
        # verbose_eval was removed from lgb.train in LightGBM 4.0; with no validation set it logged nothing anyway
        model = lgb.train(params, train_set=d_train, num_boost_round=3000)
        preds = 0.56*model.predict(X_test)
    
    
        model = Ridge(solver="sag", fit_intercept=True, random_state=42)
        model.fit(X, y)
        print('[{}] Finished to train ridge'.format(time.time() - start_time))
        preds += 0.44*model.predict(X=X_test)
        print('[{}] Finished to predict ridge'.format(time.time() - start_time))
    
    
        submission['price'] = np.expm1(preds)
        submission.loc[submission['price'] < 0.0, 'price'] = 0.0
        submission.to_csv("sample_submission_stg2.csv", index=False)
        
    
    if __name__ == '__main__':
        main()
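The final prediction in `main()` is a weighted average of the LightGBM and Ridge outputs, taken in log(price+1) space and then mapped back to prices with `np.expm1`. The sketch below isolates that step; the 0.56 and 0.44 weights come from the code above, while the prediction arrays are hypothetical.

```python
import numpy as np

# Hypothetical log1p(price) predictions from the two models
log_pred_lgb = np.array([2.40, 3.95, 3.10])
log_pred_ridge = np.array([2.60, 3.85, 3.30])

# Weighted blend in log space, using the weights from main()
blend_log = 0.56 * log_pred_lgb + 0.44 * log_pred_ridge

# expm1 inverts log1p; clip guards against negative prices
price = np.clip(np.expm1(blend_log), 0.0, None)
print(price)
```

Blending before the inverse transform keeps the average on the same scale as the RMSLE metric, which is one plausible reason for doing it in this order.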
    
    
    
  • Original article: https://www.cnblogs.com/onemorepoint/p/11087155.html