• Conversion Rate Algorithm Contest


    I recently spent a month on the Tencent advertising algorithm contest. Here is a summary of the code I used.

     

    Tencent provided two baseline scripts.

    1. A baseline based on the average conversion rate per group

     

    # -*- coding: utf-8 -*-
    """
    baseline 1: history pCVR of creativeID/adID/camgaignID/advertiserID/appID/appPlatform
    """
    
    import zipfile
    import numpy as np
    import pandas as pd
    
    # load data
    data_root = "."
    dfTrain = pd.read_csv("%s/train.csv"%data_root)
    dfTest = pd.read_csv("%s/test.csv"%data_root)
    dfAd = pd.read_csv("%s/ad.csv"%data_root)
    
    # process data
    dfTrain = pd.merge(dfTrain, dfAd, on="creativeID")
    dfTest = pd.merge(dfTest, dfAd, on="creativeID")
    y_train = dfTrain["label"].values
    
    # model building
    key = "appID"
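    # average historical conversion rate per value of `key`
    dfCvr = dfTrain.groupby(key)["label"].mean().reset_index()
    dfCvr.columns = [key, "avg_cvr"]
    dfTest = pd.merge(dfTest, dfCvr, how="left", on=key)
    # keys unseen in training fall back to the overall conversion rate;
    # assign the result back, since in-place fillna on a selected column may not update dfTest in newer pandas
    dfTest["avg_cvr"] = dfTest["avg_cvr"].fillna(np.mean(dfTrain["label"]))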
    proba_test = dfTest["avg_cvr"].values
    
    # submission
    df = pd.DataFrame({"instanceID": dfTest["instanceID"].values, "proba": proba_test})
    df.sort_values("instanceID", inplace=True)
    df.to_csv("submission.csv", index=False)
    with zipfile.ZipFile("submission.zip", "w") as fout:
        fout.write("submission.csv", compress_type=zipfile.ZIP_DEFLATED)
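    To make the grouping idea easier to follow, here is a minimal sketch on toy data. The values are made up for illustration; only the column names (`appID`, `label`, `instanceID`) come from the baseline above. It computes the historical conversion rate per `appID` and falls back to the overall rate for an `appID` that never appears in training.

```python
import pandas as pd

# Toy data (hypothetical values); column names follow the baseline above.
train = pd.DataFrame({
    "appID": [1, 1, 1, 2, 2, 3],
    "label": [1, 0, 0, 1, 1, 0],
})
test = pd.DataFrame({"instanceID": [1, 2, 3], "appID": [1, 2, 9]})

key = "appID"
# Historical conversion rate of each appID in the training data.
cvr = train.groupby(key)["label"].mean().rename("avg_cvr").reset_index()

# Left join keeps every test row; appIDs never seen in training get NaN.
pred = test.merge(cvr, how="left", on=key)

# Fall back to the overall conversion rate for those unseen appIDs.
global_cvr = train["label"].mean()
pred["avg_cvr"] = pred["avg_cvr"].fillna(global_cvr)

print(pred)
```

    Changing `key` to another ID column switches the grouping level, which is the only knob this baseline has.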

    2. A baseline that one-hot encodes the features and then predicts with logistic regression

    # -*- coding: utf-8 -*-
    """
    baseline 2: ad.csv (creativeID/adID/camgaignID/advertiserID/appID/appPlatform) + lr
    """
    
    import zipfile
    import pandas as pd
    from scipy import sparse
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.linear_model import LogisticRegression
    
    # load data
    data_root = "."
    dfTrain = pd.read_csv("%s/train.csv"%data_root)
    dfTest = pd.read_csv("%s/test.csv"%data_root)
    dfAd = pd.read_csv("%s/ad.csv"%data_root)
    
    # process data
    dfTrain = pd.merge(dfTrain, dfAd, on="creativeID")
    dfTest = pd.merge(dfTest, dfAd, on="creativeID")
    y_train = dfTrain["label"].values
    
    # feature engineering/encoding
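    # ignore categories that appear only in the test set (encoded as all zeros) instead of raising an error
    enc = OneHotEncoder(handle_unknown="ignore")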
    feats = ["creativeID", "adID", "camgaignID", "advertiserID", "appID", "appPlatform"]
    for i,feat in enumerate(feats):
        x_train = enc.fit_transform(dfTrain[feat].values.reshape(-1, 1))
        x_test = enc.transform(dfTest[feat].values.reshape(-1, 1))
        if i == 0:
            X_train, X_test = x_train, x_test
        else:
            X_train, X_test = sparse.hstack((X_train, x_train)), sparse.hstack((X_test, x_test))
    
    # model training
    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    proba_test = lr.predict_proba(X_test)[:,1]
    
    # submission
    df = pd.DataFrame({"instanceID": dfTest["instanceID"].values, "proba": proba_test})
    df.sort_values("instanceID", inplace=True)
    df.to_csv("submission.csv", index=False)
    with zipfile.ZipFile("submission.zip", "w") as fout:
        fout.write("submission.csv", compress_type=zipfile.ZIP_DEFLATED)
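    The encoding step is the part of this baseline that is easiest to get wrong, so here is a small sketch of it on toy data. The values are hypothetical; only the column names come from the baseline. It assumes a scikit-learn version where `OneHotEncoder` accepts `handle_unknown="ignore"`, so a category that appears only in the test set is encoded as all zeros rather than raising an error.

```python
import pandas as pd
from scipy import sparse
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical values); column names follow the baseline above.
train = pd.DataFrame({
    "appID":       [1, 1, 2, 2, 3, 3, 1, 2],
    "appPlatform": [1, 2, 1, 2, 1, 2, 1, 1],
    "label":       [0, 0, 1, 1, 0, 1, 0, 1],
})
test = pd.DataFrame({"appID": [2, 9], "appPlatform": [1, 2]})  # appID 9 is unseen

feats = ["appID", "appPlatform"]
blocks_train, blocks_test = [], []
for feat in feats:
    # One encoder per column; unseen test categories become all-zero rows.
    enc = OneHotEncoder(handle_unknown="ignore")
    blocks_train.append(enc.fit_transform(train[[feat]]))
    blocks_test.append(enc.transform(test[[feat]]))

# Stack the per-column blocks side by side and convert to CSR for the model.
X_train = sparse.hstack(blocks_train).tocsr()
X_test = sparse.hstack(blocks_test).tocsr()

lr = LogisticRegression()
lr.fit(X_train, train["label"].values)
proba_test = lr.predict_proba(X_test)[:, 1]

print(X_train.shape)  # 3 appID columns + 2 appPlatform columns
print(proba_test)
```

    Each distinct ID becomes its own column, which is why the matrices are kept sparse: with columns such as `creativeID` the encoded width grows with the number of distinct values.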

    I then tried different feature selections on top of the second baseline. The version below worked best.

    # -*- coding: utf-8 -*-
    """
    extended baseline 2: ad, app category, position, user and connection features + lr
    """
    
    import zipfile
    import pandas as pd
    from scipy import sparse
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.linear_model import LogisticRegression
    
    # load data
    data_root = "."
    dfTrain = pd.read_csv("%s/train.csv"%data_root)
    dfTest = pd.read_csv("%s/test.csv"%data_root)
    dfAd = pd.read_csv("%s/ad.csv"%data_root)
    dfapp_categories = pd.read_csv("%s/app_categories.csv"%data_root)
    dfuser = pd.read_csv("%s/user.csv"%data_root)  # loaded but not used below
    dfposition = pd.read_csv("%s/position.csv"%data_root)
    
    # process data
    # attach app categories to the ad table, then join ad and position info onto train/test
    dfad = pd.merge(dfAd, dfapp_categories, on="appID")
    dfTrain = pd.merge(dfTrain, dfad, on="creativeID")
    dfTrain = pd.merge(dfTrain, dfposition, how="left", on="positionID")
    dfTest = pd.merge(dfTest, dfad, on="creativeID")
    dfTest = pd.merge(dfTest, dfposition, how="left", on="positionID")
    y_train = dfTrain["label"].values
    dfTest.sort_values("instanceID", inplace=True)
    # save the merged tables
    dfTrain.to_csv("strain.csv", index=False)
    dfTest.to_csv("stest.csv", index=False)
    
    # feature engineering/encoding
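    # ignore categories that appear only in the test set (encoded as all zeros) instead of raising an error
    enc = OneHotEncoder(handle_unknown="ignore")
    feats = [
        "creativeID", "adID", "camgaignID", "advertiserID", "appPlatform", "appID", "appCategory",
        "positionID", "sitesetID", "positionType",
        "userID", "connectionType", "telecomsOperator",
    ]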
    for i,feat in enumerate(feats):
        x_train = enc.fit_transform(dfTrain[feat].values.reshape(-1, 1))
        x_test = enc.transform(dfTest[feat].values.reshape(-1, 1))
        if i == 0:
            X_train, X_test = x_train, x_test
        else:
            X_train, X_test = sparse.hstack((X_train, x_train)), sparse.hstack((X_test, x_test))
    
    
    
    # model training
    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    proba_test = lr.predict_proba(X_test)[:,1]
    
    # submission
    df = pd.DataFrame({"instanceID": dfTest["instanceID"].values, "proba": proba_test})
    df.sort_values("instanceID", inplace=True)
          
    df.to_csv("submission.csv", index=False)
    with zipfile.ZipFile("submission.zip", "w") as fout:
        fout.write("submission.csv", compress_type=zipfile.ZIP_DEFLATED)

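    Different features give different results, so in future algorithm contests the work that matters is still parameter tuning and feature selection.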
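    One way to compare feature sets without spending a submission each time is to score them on a held-out split. The sketch below is only an illustration under stated assumptions: the data is synthetic (not the contest data), the candidate subsets and the helper `validation_logloss` are hypothetical, and log loss is assumed as the metric. In this toy setup only `connectionType` drives the label, so the subset that includes it should come out ahead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the merged training table (not contest data).
rng = np.random.default_rng(0)
n = 4000
df = pd.DataFrame({
    "appID": rng.integers(0, 20, n),
    "positionID": rng.integers(0, 50, n),
    "connectionType": rng.integers(0, 5, n),
})
# In this toy setup only connectionType influences the label.
p = 0.05 + 0.2 * df["connectionType"].to_numpy()
df["label"] = (rng.random(n) < p).astype(int)

train, valid = train_test_split(df, test_size=0.25, random_state=0)

def validation_logloss(feats):
    """Fit one-hot + LR on `feats` and return log loss on the held-out split."""
    enc = OneHotEncoder(handle_unknown="ignore")
    X_train = enc.fit_transform(train[feats])
    X_valid = enc.transform(valid[feats])
    lr = LogisticRegression(max_iter=1000)
    lr.fit(X_train, train["label"])
    return log_loss(valid["label"], lr.predict_proba(X_valid)[:, 1])

# Hypothetical candidate feature subsets to compare.
candidates = {
    "ad only": ["appID"],
    "ad + position": ["appID", "positionID"],
    "ad + position + connection": ["appID", "positionID", "connectionType"],
}
scores = {name: validation_logloss(feats) for name, feats in candidates.items()}

# Lower log loss is better, so print the best subset first.
for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: {score:.4f}")
```

    The same loop can be reused for parameter tuning by varying, for example, the `C` argument of `LogisticRegression` instead of the feature list.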

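    PS:

    I used some of the machine learning algorithms in Python's scikit-learn package. Python's advantage is its conciseness, and many machine learning algorithms are already built into packages, so you can simply call them instead of writing the algorithm code yourself.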

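    PSS: The result this time was not great. My best rank was only somewhere past 300th. I still need to study machine learning in Python more deeply to lay a foundation for future work.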

    PSSS: The dataset for this contest is available here:

    Link: http://pan.baidu.com/s/1nv7cf49 Password: gqz1

  • Original article: https://www.cnblogs.com/b-l-java/p/6973754.html