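I recently spent a month on the Tencent advertising algorithm competition. This post summarizes the code I used.
Tencent provided two baseline scripts.
1. A baseline based on the average conversion rate per group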
# -*- coding: utf-8 -*-
"""
baseline 1: history pCVR of creativeID/adID/camgaignID/advertiserID/appID/appPlatform
"""

import zipfile
import numpy as np
import pandas as pd

# load data
data_root = "."
dfTrain = pd.read_csv("%s/train.csv"%data_root)
dfTest = pd.read_csv("%s/test.csv"%data_root)
dfAd = pd.read_csv("%s/ad.csv"%data_root)

# process data
dfTrain = pd.merge(dfTrain, dfAd, on="creativeID")
dfTest = pd.merge(dfTest, dfAd, on="creativeID")
y_train = dfTrain["label"].values

# model building: mean label (conversion rate) per value of `key`
key = "appID"
dfCvr = dfTrain.groupby(key).apply(lambda df: np.mean(df["label"])).reset_index()
dfCvr.columns = [key, "avg_cvr"]
dfTest = pd.merge(dfTest, dfCvr, how="left", on=key)
# keys unseen in training fall back to the global conversion rate
dfTest["avg_cvr"] = dfTest["avg_cvr"].fillna(np.mean(dfTrain["label"]))
proba_test = dfTest["avg_cvr"].values

# submission
df = pd.DataFrame({"instanceID": dfTest["instanceID"].values, "proba": proba_test})
df.sort_values("instanceID", inplace=True)
df.to_csv("submission.csv", index=False)
with zipfile.ZipFile("submission.zip", "w") as fout:
    fout.write("submission.csv", compress_type=zipfile.ZIP_DEFLATED)
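To make the idea behind the first baseline concrete, here is a minimal sketch on made-up data. The column names follow the baseline, but the values are invented for illustration and do not come from the competition files. It computes the historical conversion rate per appID and falls back to the global rate for an appID that never appeared in training.

```python
import pandas as pd

# Toy data: column names mirror the baseline, values are made up.
train = pd.DataFrame({
    "appID": [1, 1, 1, 2, 2, 3],
    "label": [1, 0, 0, 1, 1, 0],
})
test = pd.DataFrame({"instanceID": [1, 2, 3, 4], "appID": [1, 2, 3, 9]})

key = "appID"
# Historical conversion rate per group: the mean of the 0/1 label.
cvr = train.groupby(key)["label"].mean().rename("avg_cvr").reset_index()
# Global conversion rate, used when a test key was never seen in training.
global_cvr = train["label"].mean()

# A left merge keeps every test row; unseen keys get NaN, then the fallback.
scored = test.merge(cvr, how="left", on=key)
scored["avg_cvr"] = scored["avg_cvr"].fillna(global_cvr)
print(scored)
```

Changing `key` to another ID column (for example creativeID) is the only knob this baseline has, which is why it serves mainly as a reference score.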
2. A baseline that one-hot encodes the features and predicts with logistic regression
# -*- coding: utf-8 -*-
"""
baseline 2: ad.csv (creativeID/adID/camgaignID/advertiserID/appID/appPlatform) + lr
"""

import zipfile
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

# load data
data_root = "."
dfTrain = pd.read_csv("%s/train.csv"%data_root)
dfTest = pd.read_csv("%s/test.csv"%data_root)
dfAd = pd.read_csv("%s/ad.csv"%data_root)

# process data
dfTrain = pd.merge(dfTrain, dfAd, on="creativeID")
dfTest = pd.merge(dfTest, dfAd, on="creativeID")
y_train = dfTrain["label"].values

# feature engineering/encoding
enc = OneHotEncoder()
feats = ["creativeID", "adID", "camgaignID", "advertiserID", "appID", "appPlatform"]
for i,feat in enumerate(feats):
    x_train = enc.fit_transform(dfTrain[feat].values.reshape(-1, 1))
    x_test = enc.transform(dfTest[feat].values.reshape(-1, 1))
    if i == 0:
        X_train, X_test = x_train, x_test
    else:
        X_train, X_test = sparse.hstack((X_train, x_train)), sparse.hstack((X_test, x_test))

# model training
lr = LogisticRegression()
lr.fit(X_train, y_train)
proba_test = lr.predict_proba(X_test)[:,1]

# submission
df = pd.DataFrame({"instanceID": dfTest["instanceID"].values, "proba": proba_test})
df.sort_values("instanceID", inplace=True)
df.to_csv("submission.csv", index=False)
with zipfile.ZipFile("submission.zip", "w") as fout:
    fout.write("submission.csv", compress_type=zipfile.ZIP_DEFLATED)
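The encode-then-fit step of the second baseline can be tried on a tiny example. The sketch below uses made-up data, and it differs from the baseline in two ways that are my own choices rather than anything Tencent provided: all columns are encoded in a single OneHotEncoder call instead of a per-column loop, and handle_unknown="ignore" is set so that a category seen only in the test set is encoded as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Toy data: column names mirror the baseline, values are made up.
train = pd.DataFrame({
    "appID":       [1, 1, 2, 2, 3, 3, 1, 2],
    "appPlatform": [1, 2, 1, 2, 1, 2, 1, 1],
    "label":       [0, 0, 1, 1, 0, 1, 0, 1],
})
# appID 9 never appears in the training data.
test = pd.DataFrame({"appID": [1, 2, 9], "appPlatform": [1, 2, 1]})
feats = ["appID", "appPlatform"]

# One encoder for all columns; the result is a sparse matrix with one
# column per (feature, category) pair seen in training.
enc = OneHotEncoder(handle_unknown="ignore")
X_train = enc.fit_transform(train[feats])
X_test = enc.transform(test[feats])

lr = LogisticRegression()
lr.fit(X_train, train["label"])
# Column 1 of predict_proba is the probability of label 1 (a conversion).
proba_test = lr.predict_proba(X_test)[:, 1]
print(X_train.shape, X_test.shape)
print(proba_test)
```

Encoding every column in one call produces the same kind of stacked sparse matrix as the loop with sparse.hstack, so it is a convenience rather than a modelling change.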
I then tried different feature selections on top of the second baseline. The following version gave the best result in the end.
# -*- coding: utf-8 -*-
"""
based on baseline 2: ad.csv + app_categories.csv + position.csv features + lr
"""

import zipfile
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

# load data
data_root = "."
dfTrain = pd.read_csv("%s/train.csv"%data_root)
dfTest = pd.read_csv("%s/test.csv"%data_root)
dfAd = pd.read_csv("%s/ad.csv"%data_root)
dfapp_categories = pd.read_csv("%s/app_categories.csv"%data_root)
dfuser = pd.read_csv("%s/user.csv"%data_root)  # loaded but not merged below
dfposition = pd.read_csv("%s/position.csv"%data_root)

# process data
dfad = pd.merge(dfAd, dfapp_categories, on="appID")
dfTrain = pd.merge(dfTrain, dfad, on="creativeID")
dfTrain = pd.merge(dfTrain, dfposition, how='left', on="positionID")
dfTest = pd.merge(dfTest, dfad, on="creativeID")
dfTest = pd.merge(dfTest, dfposition, how='left', on="positionID")
y_train = dfTrain["label"].values
dfTest.sort_values("instanceID", inplace=True)
# save the merged tables for inspection
dfTrain.to_csv("strain.csv", index=False)
dfTest.to_csv("stest.csv", index=False)

# feature engineering/encoding
enc = OneHotEncoder()
feats = ["creativeID", "adID", "camgaignID", "advertiserID", "appPlatform", "appID",
         "appCategory", "positionID", "sitesetID", "positionType", "userID",
         "connectionType", "telecomsOperator"]
for i,feat in enumerate(feats):
    x_train = enc.fit_transform(dfTrain[feat].values.reshape(-1, 1))
    x_test = enc.transform(dfTest[feat].values.reshape(-1, 1))
    if i == 0:
        X_train, X_test = x_train, x_test
    else:
        X_train, X_test = sparse.hstack((X_train, x_train)), sparse.hstack((X_test, x_test))

# model training
lr = LogisticRegression()
lr.fit(X_train, y_train)
proba_test = lr.predict_proba(X_test)[:,1]

# submission
df = pd.DataFrame({"instanceID": dfTest["instanceID"].values, "proba": proba_test})
df.sort_values("instanceID", inplace=True)
df.to_csv("submission.csv", index=False)
with zipfile.ZipFile("submission.zip", "w") as fout:
    fout.write("submission.csv", compress_type=zipfile.ZIP_DEFLATED)
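Different feature sets give different results, so for future algorithm competitions the work that matters most is still parameter tuning and feature selection.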
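One way to compare feature sets without spending a submission each time is to score them on a hold-out split. The sketch below is only an illustration under several assumptions: the data is synthetic, noiseID is a made-up column with no signal, holdout_logloss is a hypothetical helper of mine, and log loss is assumed as the evaluation metric. The C argument of LogisticRegression is included as an example of a parameter that could be tuned in the same loop.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Synthetic data: the conversion probability depends on appID and
# positionID only; noiseID is an uninformative, made-up column.
rng = np.random.default_rng(0)
n = 6000
df = pd.DataFrame({
    "appID": rng.integers(0, 5, n),
    "positionID": rng.integers(0, 8, n),
    "noiseID": rng.integers(0, 20, n),
})
p = 0.02 + 0.10 * df["appID"] + 0.05 * df["positionID"]
df["label"] = (rng.random(n) < p).astype(int)


def holdout_logloss(frame, feats, C=1.0, seed=0):
    """Fit one-hot + logistic regression on 75% of rows, score the other 25%."""
    tr, va = train_test_split(
        frame, test_size=0.25, random_state=seed, stratify=frame["label"]
    )
    enc = OneHotEncoder(handle_unknown="ignore")
    X_tr = enc.fit_transform(tr[feats])
    X_va = enc.transform(va[feats])
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_tr, tr["label"])
    return log_loss(va["label"], model.predict_proba(X_va)[:, 1])


candidates = {
    "noiseID": ["noiseID"],
    "appID": ["appID"],
    "appID+positionID": ["appID", "positionID"],
    "all": ["appID", "positionID", "noiseID"],
}
scores = {name: holdout_logloss(df, feats) for name, feats in candidates.items()}
# Lower log loss is better.
for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: {score:.4f}")
```

On the real data the same loop would run over the merged training table, and a time-based split may be a better fit than a random one if the test set comes from later days.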
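PS:
I used some of the machine learning algorithms in Python's scikit-learn package. The advantage of Python is that it is concise, and many machine learning algorithms are already bundled into packages, so you can simply call them instead of writing the algorithm code yourself.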
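PSS: My results this time were not great; my best rank was only in the 300s. I need to study machine learning in Python in more depth to build a foundation for future work.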
PSSS: The dataset for this competition is available here.
Link: http://pan.baidu.com/s/1nv7cf49 Password: gqz1