Mercari Price Suggestion in Kaggle

Mercari Price Suggestion

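I recently came across a competition whose task is to predict the price of an item from known features such as its description, brand, category, and condition.
Submissions are evaluated by Root Mean Squared Logarithmic Error (RMSLE).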

\[ \epsilon = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( \log\left( p_i + 1 \right) - \log\left( a_i + 1 \right) \right)^{2} } \]
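
As a minimal sketch of the metric above, assuming p_i is the predicted price and a_i the actual price, the score can be computed with NumPy. The function name `rmsle` and the sample prices are hypothetical and only serve as an illustration.

```python
import numpy as np

def rmsle(predicted, actual):
    """Root Mean Squared Logarithmic Error between two price arrays."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # log1p(x) computes log(x + 1), matching the formula above
    log_diff = np.log1p(predicted) - np.log1p(actual)
    return float(np.sqrt(np.mean(log_diff ** 2)))

# Hypothetical prices, for illustration only
actual_prices = [10.0, 52.0, 35.0]
perfect = rmsle(actual_prices, actual_prices)          # identical arrays
off_by_some = rmsle([12.0, 40.0, 35.0], actual_prices)  # imperfect guess
print(perfect, off_by_some)
```

Because the error is taken on log(price + 1), being off by a fixed ratio costs roughly the same on a cheap item as on an expensive one.
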
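The submission file has two columns, test_id and price: the ID from the test data and the predicted price.
The training and test sets contain the following features:
- train_id or test_id the listing ID; each item has exactly one ID
- name the item name
- item_condition_id the condition of the item
- category_name the category
- brand_name the brand
- price the price the item sold for; this column is absent from the test set and is the value to predict
- shipping 1 if the shipping fee is paid by the seller and 0 if it is paid by the buyer
- item_description the full description of the item; text that looks like a price has been removed and replaced with [rm]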

import pandas as pd
import numpy as np

df = pd.read_csv('input/train.tsv', sep='\t')

data information

df.head()

	train_id	name	item_condition_id	category_name	brand_name	price	shipping	item_description
0	0	MLB Cincinnati Reds T Shirt Size XL	3	Men/Tops/T-shirts	NaN	10.0	1	No description yet
1	1	Razer BlackWidow Chroma Keyboard	3	Electronics/Computers & Tablets/Components & P...	Razer	52.0	0	This keyboard is in great condition and works ...
2	2	AVA-VIV Blouse	1	Women/Tops & Blouses/Blouse	Target	10.0	1	Adorable top with a hint of lace and a key hol...
3	3	Leather Horse Statues	1	Home/Home Décor/Home Décor Accents	NaN	35.0	1	New with tags. Leather horses. Retail for [rm]...
4	4	24K GOLD plated rose	1	Women/Jewelry/Necklaces	NaN	44.0	0	Complete with certificate of authenticity

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1482535 entries, 0 to 1482534
Data columns (total 8 columns):
train_id             1482535 non-null int64
name                 1482535 non-null object
item_condition_id    1482535 non-null int64
category_name        1476208 non-null object
brand_name           849853 non-null object
price                1482535 non-null float64
shipping             1482535 non-null int64
item_description     1482531 non-null object
dtypes: float64(1), int64(3), object(4)
memory usage: 90.5+ MB

price distribution

df.price.describe()

count    1.482535e+06
mean     2.673752e+01
std      3.858607e+01
min      0.000000e+00
25%      1.000000e+01
50%      1.700000e+01
75%      2.900000e+01
max      2.009000e+03
Name: price, dtype: float64

import matplotlib.pyplot as plt

plt.subplot(1, 2, 1)  # one row, two columns; this is the first plot: plt.subplot(nrows, ncols, index)
df.price.plot.hist(bins=50, figsize=(12, 6), edgecolor = 'white', range = [0, 250])
plt.xlabel('price', fontsize=12)
plt.title('Price Distribution', fontsize=12)
plt.subplot(1, 2, 2)
np.log((df.price+1)).plot.hist(bins=50, figsize=(12, 6), edgecolor = 'white')
plt.xlabel('log(price+1)', fontsize=12)
plt.title('log(Price+1) Distribution', fontsize=12)

[Figure: histograms of price (left, range 0 to 250) and log(price+1) (right)]

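The price distribution is strongly right-skewed: most prices fall around 10 to 20, while the maximum is 2009. Taking the logarithm, log(price+1), gives a distribution that is much closer to normal, so the model is trained on the log-transformed price.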
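
The effect of the transform can be checked numerically. The sketch below uses a small hypothetical price sample with a long right tail (not the real data) and compares the sample skewness before and after `np.log1p`; `np.expm1` is the inverse used later to turn predictions back into prices.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with a long right tail, for illustration only
prices = pd.Series([3, 5, 8, 10, 10, 12, 15, 17, 20, 29, 44, 120, 2009], dtype=float)

log_prices = np.log1p(prices)    # log(price + 1)
restored = np.expm1(log_prices)  # inverse transform, back to prices

print("skew before:", prices.skew())
print("skew after: ", log_prices.skew())
```

A positive skewness means a long right tail; on this sample the value shrinks after the transform, which is the behaviour the histograms show on the full data.
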

Effect of shipping on price

df['shipping'].value_counts(normalize=True)

0    0.552726
1    0.447274
Name: shipping, dtype: float64

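About 55.3% of items have shipping paid by the buyer and 44.7% have shipping paid by the seller, so it is worth checking whether shipping affects the price.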

shipping_yes = df.loc[df['shipping'] == 1, 'price']  # seller pays shipping
shipping_no = df.loc[df['shipping'] == 0, 'price']  # buyer pays shipping

fig,ax  = plt.subplots(figsize=(8,5))
ax.hist(shipping_yes,color='r',alpha=0.5,bins=30,range=[0,100],label='shipping_yes')
ax.hist(shipping_no,color='green',alpha=0.5,bins=30,range=[0,100],label=
       'shipping_no')
plt.xlabel('price',fontsize=12)
plt.ylabel('frequency',fontsize=12)
plt.title('price_distribution by shipping method')
plt.tick_params(labelsize=12)
plt.legend()
plt.show()

[Figure: overlaid price histograms (range 0 to 100) for shipping_yes and shipping_no]

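print("Average price when the buyer pays shipping: %s dollars" % round(shipping_no.mean(), 2))
print("Average price when the seller pays shipping: %s dollars" % round(shipping_yes.mean(), 2))

Average price when the buyer pays shipping: 30.11 dollars
Average price when the seller pays shipping: 22.57 dollars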

fig,ax  = plt.subplots(figsize=(8,5))
ax.hist(np.log(shipping_yes+1),color='r',alpha=0.5,bins=50,label='shipping_yes')
ax.hist(np.log(shipping_no+1),color='green',alpha=0.5,bins=50,label=
       'shipping_no')
plt.xlabel('log(price+1)',fontsize=12)
plt.ylabel('frequency',fontsize=12)
plt.title('log(price+1)_distribution by shipping method')
plt.tick_params(labelsize=12)
plt.legend()
plt.show()

[Figure: overlaid log(price+1) histograms for shipping_yes and shipping_no]
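
The same comparison can be summarised in one table with a `groupby`. This is only a sketch on hypothetical toy rows; the column names follow the dataset, while `summary` and the aggregate names are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the real analysis uses the full training set
toy = pd.DataFrame({
    "shipping": [0, 0, 0, 1, 1, 1],
    "price": [30.0, 45.0, 12.0, 10.0, 25.0, 8.0],
})

summary = (
    toy.assign(log_price=np.log1p(toy["price"]))
       .groupby("shipping")
       .agg(mean_price=("price", "mean"),
            median_price=("price", "median"),
            mean_log_price=("log_price", "mean"))
)
print(summary)
```

Looking at the median and the mean of the log price next to the plain mean helps, since a few very expensive items can dominate the mean.
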

Handling category data

"The dataset has {} records in total".format(df.shape[0])

'The dataset has 1482535 records in total'

The name, category, brand, and item_condition_id columns all need to be converted to categorical data.

df['category_name'].value_counts()
#  1287 distinct categories in total

Women/Athletic Apparel/Pants, Tights, Leggings                 60177
Women/Tops & Blouses/T-Shirts                                  46380
Beauty/Makeup/Face                                             34335
Beauty/Makeup/Lips                                             29910
Electronics/Video Games & Consoles/Games                       26557
Beauty/Makeup/Eyes                                             25215
Electronics/Cell Phones & Accessories/Cases, Covers & Skins    24676
Women/Underwear/Bras                                           21274
Women/Tops & Blouses/Tank, Cami                                20284
Women/Tops & Blouses/Blouse                                    20284
Women/Dresses/Above Knee, Mini                                 20082
Women/Jewelry/Necklaces                                        19758
Women/Athletic Apparel/Shorts                                  19528
Beauty/Makeup/Makeup Palettes                                  19103
Women/Shoes/Boots                                              18864
Beauty/Fragrance/Women                                         18628
Beauty/Skin Care/Face                                          15836
Women/Women's Handbags/Shoulder Bag                            15328
Men/Tops/T-shirts                                              15108
Women/Dresses/Knee-Length                                      14770
Women/Athletic Apparel/Shirts & Tops                           14738
Women/Shoes/Sandals                                            14662
Women/Jewelry/Bracelets                                        14497
Men/Shoes/Athletic                                             14257
Kids/Toys/Dolls & Accessories                                  13957
Women/Women's Accessories/Wallets                              13616
Women/Jeans/Slim, Skinny                                       13392
Home/Home Décor/Home Décor Accents                             13004
Women/Swimwear/Two-Piece                                       12758
Women/Shoes/Athletic                                           12662
                                                               ...  
Men/Suits/Four Button                                              1
Handmade/Bags and Purses/Other                                     1
Handmade/Dolls and Miniatures/Primitive                            1
Handmade/Furniture/Fixture                                         1
Handmade/Housewares/Bathroom                                       1
Handmade/Woodworking/Sculptures                                    1
Men/Suits/One Button                                               1
Handmade/Geekery/Housewares                                        1
Kids/Safety/Crib Netting                                           1
Vintage & Collectibles/Furniture/Entertainment                     1
Home/Furniture/Bathroom Furniture                                  1
Handmade/Glass/Vases                                               1
Handmade/Geekery/Videogame                                         1
Handmade/Woodworking/Sports                                        1
Handmade/Art/Aceo                                                  1
Vintage & Collectibles/Paper Ephemera/Map                          1
Handmade/Patterns/Painting                                         1
Handmade/Housewares/Cleaning                                       1
Home/Home Décor/Doorstops                                          1
Handmade/Accessories/Belt                                          1
Handmade/Patterns/Accessories                                      1
Vintage & Collectibles/Housewares/Towel                            1
Other/Automotive/RV Parts & Accessories                            1
Handmade/Paper Goods/Pad                                           1
Handmade/Accessories/Cozy                                          1
Kids/Diapering/Washcloths & Towels                                 1
Handmade/Pets/Blanket                                              1
Handmade/Needlecraft/Clothing                                      1
Handmade/Furniture/Shelf                                           1
Handmade/Quilts/Bed                                                1
Name: category_name, Length: 1287, dtype: int64

item_condition_id vs price

Common box plot annotations

import seaborn as sns

sns.boxplot(x = 'item_condition_id', y = np.log(df['price']+1), data = df, palette = sns.color_palette('RdBu',5))

[Figure: box plot of log(price+1) by item_condition_id]

Prices differ considerably across item conditions.

LightGBM, a competition favorite

settings

NUM_BRANDS = 4000
NUM_CATEGORIES = 1000
NAME_MIN_DF =10
MAX_FEATURES_ITEM_DESCRIPTION =50000

"There are %d items that do not have a category name" % df['category_name'].isnull().sum()

'There are 6327 items that do not have a category name'

"There are %d items that do not have a brand name" % df['brand_name'].isnull().sum()

'There are 632682 items that do not have a brand name'

"There are %d items that do not have a item_description " % df['item_description'].isnull().sum()

'There are 4 items that do not have a item_description '

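def handling_missing_inplace(datasets):
    # Assign back to the column instead of calling inplace=True on a selected column
    datasets['category_name'] = datasets['category_name'].fillna('missing')
    datasets['brand_name'] = datasets['brand_name'].fillna('missing')
    # 'No description yet' is a placeholder for an empty description; it only shows up on a close look at the data
    datasets['item_description'] = datasets['item_description'].replace('No description yet', 'missing')
    datasets['item_description'] = datasets['item_description'].fillna('missing')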

def cutting(datasets):
    pop_brand = datasets['brand_name'].value_counts().loc[lambda x:x.index!='missing'].index[:NUM_BRANDS]
    datasets.loc[~datasets['brand_name'].isin(pop_brand),'brand_name'] ='missing'
    pop_category = datasets['category_name'].value_counts().loc[lambda x:x.index!='missing'].index[:NUM_CATEGORIES]
    datasets.loc[~datasets['category_name'].isin(pop_category),'category_name'] ='missing'

def to_category(datasets):
    datasets['category_name'] = datasets['category_name'].astype('category')
    datasets['brand_name'] = datasets['brand_name'].astype('category')
    datasets['item_condition_id'] = datasets['item_condition_id'].astype('category')
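
To make the truncation step concrete, here is a minimal sketch on hypothetical toy data. `TOP_N` plays the role of NUM_BRANDS, and the brand values are invented for the example; the logic mirrors the helper functions above but is not a drop-in replacement for them.

```python
import pandas as pd

# Hypothetical toy data; TOP_N plays the role of NUM_BRANDS
TOP_N = 2
toy = pd.DataFrame({"brand_name": ["Nike", "Nike", "Apple", "Apple", "Razer", None]})

toy["brand_name"] = toy["brand_name"].fillna("missing")

# Rank brands by frequency, ignore the 'missing' placeholder, keep the top N
counts = toy["brand_name"].value_counts()
popular = counts[counts.index != "missing"].index[:TOP_N]
toy.loc[~toy["brand_name"].isin(popular), "brand_name"] = "missing"

# The category dtype stores each distinct value once plus integer codes
toy["brand_name"] = toy["brand_name"].astype("category")
print(toy["brand_name"].cat.categories.tolist())
```

Rare brands are folded into 'missing', which keeps the number of one-hot columns bounded when the column is binarized later.
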

Looking at the counts per price, some items surprisingly have a price of 0, so those rows need to be removed.

df['price'].value_counts().reset_index().sort_values(by='index').head()

	index	price
25	3.0	18703
28	4.0	16139
17	5.0	31502
261	5.5	33
16	6.0	32260

df=df[df['price']!=0].reset_index(drop=True)

df.head()

	train_id	name	item_condition_id	category_name	brand_name	price	shipping	item_description
0	0	MLB Cincinnati Reds T Shirt Size XL	3	Men/Tops/T-shirts	NaN	10.0	1	No description yet
1	1	Razer BlackWidow Chroma Keyboard	3	Electronics/Computers & Tablets/Components & P...	Razer	52.0	0	This keyboard is in great condition and works ...
2	2	AVA-VIV Blouse	1	Women/Tops & Blouses/Blouse	Target	10.0	1	Adorable top with a hint of lace and a key hol...
3	3	Leather Horse Statues	1	Home/Home Décor/Home Décor Accents	NaN	35.0	1	New with tags. Leather horses. Retail for [rm]...
4	4	24K GOLD plated rose	1	Women/Jewelry/Necklaces	NaN	44.0	0	Complete with certificate of authenticity

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelBinarizer
import lightgbm as lgb
from scipy.sparse import csr_matrix, hstack  # for building and combining sparse matrices
# reference: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.hstack.html
import gc
import time
from sklearn.linear_model import Ridge
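
Before the full pipeline, a toy sketch may help show how these pieces fit together: each column is turned into its own sparse matrix, and `hstack` joins them column-wise. The rows below are hypothetical, and only two of the feature blocks are shown.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelBinarizer

# Hypothetical toy rows, for illustration only
toy = pd.DataFrame({
    "name": ["red shirt", "blue shirt", "gaming keyboard"],
    "brand_name": ["missing", "Target", "Razer"],
})

# Bag-of-words counts for the item name
X_name = CountVectorizer().fit_transform(toy["name"])
# One-hot encoding of the brand, kept sparse
X_brand = LabelBinarizer(sparse_output=True).fit_transform(toy["brand_name"])

# hstack concatenates column-wise; tocsr() allows efficient row slicing later
X = hstack((X_name, X_brand)).tocsr()
print(X.shape)
```

Row slicing on the CSR result is what lets the combined train and test matrix be split back apart by row count after all features are built.
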

def main():
    start_time = time.time()

    train = pd.read_table('input/train.tsv', engine='c')
    # train=train[train['price']!=0]
    test = pd.read_table('input/test_stg2.tsv', engine='c')
    print('[{}] Finished to load data'.format(time.time() - start_time))
    print('Train shape: ', train.shape)
    print('Test shape: ', test.shape)

    nrow_train = train.shape[0]
    y = np.log1p(train["price"])
    merge: pd.DataFrame = pd.concat([train, test])
    submission: pd.DataFrame = test[['test_id']].copy()  # copy, so adding the price column later does not write to a slice of test

    del train
    del test
    gc.collect()

    handling_missing_inplace(merge)
    print('[{}] Finished to handle missing'.format(time.time() - start_time))

    cutting(merge)
    print('[{}] Finished to cut'.format(time.time() - start_time))

    to_category(merge)
    print('[{}] Finished to convert categorical'.format(time.time() - start_time))

    cv = CountVectorizer(min_df=NAME_MIN_DF)
    X_name = cv.fit_transform(merge['name'])
    print('[{}] Finished count vectorize `name`'.format(time.time() - start_time))

    cv = CountVectorizer()
    X_category = cv.fit_transform(merge['category_name'])
    print('[{}] Finished count vectorize `category_name`'.format(time.time() - start_time))

    tv = TfidfVectorizer(max_features=MAX_FEATURES_ITEM_DESCRIPTION,
                         ngram_range=(1, 3),
                         stop_words='english')
    X_description = tv.fit_transform(merge['item_description'])
    print('[{}] Finished TFIDF vectorize `item_description`'.format(time.time() - start_time))

    lb = LabelBinarizer(sparse_output=True)
    X_brand = lb.fit_transform(merge['brand_name'])
    print('[{}] Finished label binarize `brand_name`'.format(time.time() - start_time))

    X_dummies = csr_matrix(pd.get_dummies(merge[['item_condition_id', 'shipping']],
                                          sparse=True).values)
    print('[{}] Finished to get dummies on `item_condition_id` and `shipping`'.format(time.time() - start_time))

    sparse_merge = hstack((X_dummies, X_description, X_brand, X_category, X_name)).tocsr()
    print('[{}] Finished to create sparse merge'.format(time.time() - start_time))

    X = sparse_merge[:nrow_train]
    X_test = sparse_merge[nrow_train:]
    
    #train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size = 0.1, random_state = 144) 
    d_train = lgb.Dataset(X, label=y)
    #d_valid = lgb.Dataset(valid_X, label=valid_y, max_bin=8192)
    #watchlist = [d_train, d_valid]
    
    params = {
        'learning_rate': 0.73,
        'application': 'regression',
        'max_depth': 3,
        'num_leaves': 100,
        'verbosity': -1,
        'metric': 'RMSE',
    }


    # verbose_eval is dropped: it has no effect without a validation set and was removed in LightGBM 4.0
    model = lgb.train(params, train_set=d_train, num_boost_round=3000)
    preds = 0.56*model.predict(X_test)


    model = Ridge(solver="sag", fit_intercept=True, random_state=42)
    model.fit(X, y)
    print('[{}] Finished to train ridge'.format(time.time() - start_time))
    preds += 0.44*model.predict(X=X_test)
    print('[{}] Finished to predict ridge'.format(time.time() - start_time))


    submission['price'] = np.expm1(preds)
    submission.loc[submission['price'] < 0.0, 'price'] = 0.0
    submission.to_csv("sample_submission_stg2.csv", index=False)

if __name__ == '__main__':
    main()
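
The script blends the two models with fixed weights of 0.56 and 0.44 in log space and never scores the result locally. The sketch below shows how such a blend could be checked on a hold-out set. All numbers are hypothetical, and `blend_log_predictions` is a helper invented for this example, not part of the original script.

```python
import numpy as np

def blend_log_predictions(pred_a, pred_b, weight_a=0.56):
    """Weighted average of two predictions made in log1p(price) space."""
    return weight_a * np.asarray(pred_a) + (1.0 - weight_a) * np.asarray(pred_b)

# Hypothetical hold-out targets and model outputs, all in log1p(price) space
y_valid = np.log1p(np.array([10.0, 52.0, 35.0, 44.0]))
pred_lgb = y_valid + np.array([0.10, -0.20, 0.05, 0.00])
pred_ridge = y_valid + np.array([-0.10, 0.10, 0.05, -0.10])

blended = blend_log_predictions(pred_lgb, pred_ridge)

# RMSE in log1p space equals RMSLE on the original price scale
rmsle_blend = np.sqrt(np.mean((blended - y_valid) ** 2))
# Back to prices, floored at zero as in the submission step
prices = np.clip(np.expm1(blended), 0.0, None)
print(rmsle_blend, prices)
```

This is also why training on log1p(price) with the RMSE metric optimises the competition score directly. With a real hold-out split, the blend weight could be tuned instead of fixed.
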

Original article: https://www.cnblogs.com/onemorepoint/p/11087155.html