• NLP (27): Using the DeepCTR Framework


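    1. Algorithm Overview

    DeepFM combines a deep network (on the left) with an FM (on the right), hence the name.

    [Figure: DeepFM]

    It consists of two parts:

    • Part 1: FM (Factorization Machines)

    [Figure: FM]

    On top of a conventional first-order linear regression, FM adds a second-order term that captures the pairwise interactions between features.

    [Figure: feature interactions]

    The formula can be rewritten to reduce the amount of computation; the figure below is taken from the web.

    [Figure: FM]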
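
Since the original figures are not available, the second-order FM model and the rewritten form that this simplification usually refers to are sketched here. The notation ($n$ features, embedding size $k$, weights $w_i$, latent vectors $v_i$) follows the standard Factorization Machines formulation and is an assumption, not taken from the post.

```latex
\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i
           + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j

\sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j
  = \frac{1}{2} \sum_{f=1}^{k}
    \left[ \left( \sum_{i=1}^{n} v_{i,f} \, x_i \right)^{2}
         - \sum_{i=1}^{n} v_{i,f}^{2} \, x_i^{2} \right]
```

Evaluating the double sum directly costs on the order of $k n^2$ operations, while the right-hand side needs only on the order of $k n$.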

    • Part 2: Deep component

    The deep component is a multi-layer DNN.

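    2. Implementation

    For the implementation, the articles "用Keras实现一个DeepFM" (Implementing a DeepFM with Keras) and ·清尘·'s 《FM、FMM、DeepFM整理(pytorch)》 (FM, FMM, DeepFM Notes, PyTorch) explain things fairly clearly. The Keras implementation is used here for illustration.

    Overall network structure:

    [Figure: network structure]

    Feature encoding

    Features fall into three types:

    • Continuous fields, e.g. numeric features
    • Single-valued categorical features, e.g. gender, which is either male or female
    • Multi-valued categorical features, e.g. tag, which can hold several values at once

    Continuous fields can be concatenated together into dense data.

    After one-hot encoding the single-valued and multi-valued fields, the vector for a single-valued categorical field has exactly one position set to 1, while the vector for a multi-valued categorical field has more than one position set to 1, meaning the field can take several feature values at the same time.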

    label  shop_score  gender=m  gender=f  interest=f  interest=c
    0      0.2         1         0         1           1
    1      0.8         0         1         0           1
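
The two rows of the table can be reproduced with a small encoding helper. This is a minimal sketch; the function `multi_hot` and the vocabularies are hypothetical names introduced for illustration only.

```python
import numpy as np


def multi_hot(values, vocabulary):
    """Return a 0/1 vector with a 1 at the position of every value present."""
    index = {v: i for i, v in enumerate(vocabulary)}
    vec = np.zeros(len(vocabulary), dtype=int)
    for v in values:
        vec[index[v]] = 1
    return vec


gender_vocab = ["m", "f"]      # single-valued field: exactly one 1
interest_vocab = ["f", "c"]    # multi-valued field: possibly several 1s

row0 = np.concatenate([multi_hot(["m"], gender_vocab),
                       multi_hot(["f", "c"], interest_vocab)])
row1 = np.concatenate([multi_hot(["f"], gender_vocab),
                       multi_hot(["c"], interest_vocab)])

print(row0)  # [1 0 1 1]
print(row1)  # [0 1 0 1]
```

A single-valued field is simply the special case where the list of values has length one.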

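    FM component

    [Figure: FM]

    Consider the formula:
    [Figure: FM formula]

    First compute the FM first-order term:

    • Continuous fields can be implemented with a Dense(1) layer.
    • Single-valued categorical fields use Embedding(n, 1), where n is the number of distinct values in the category.
    • Multi-valued categorical fields can take several feature values at once, so the samples must be zero-padded for batch training. They can also be implemented with Embedding; because several embeddings are looked up, their mean can be taken.

    [Figure: first-order term]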
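
The three cases above can be sketched with plain NumPy lookups instead of Keras layers. All shapes, field names and weights here are made-up assumptions; the point is only to show that Embedding(n, 1) is a table of scalar weights and that the multi-valued case needs a mean that ignores padding.

```python
import numpy as np

rng = np.random.default_rng(0)

# Continuous field: Dense(1) without bias is one weight per input feature.
dense_x = np.array([[0.2], [0.8]])            # shop_score, shape (batch, 1)
dense_w = rng.normal(size=(1, 1))
dense_term = dense_x @ dense_w                # (batch, 1)

# Single-valued field: Embedding(n, 1) is a lookup table of n scalar weights.
gender_ids = np.array([0, 1])                 # (batch,)
gender_table = rng.normal(size=(2, 1))        # n = 2
gender_term = gender_table[gender_ids]        # (batch, 1)

# Multi-valued field: ids start at 1, id 0 is reserved for padding.
interest_ids = np.array([[1, 2], [2, 0]])     # (batch, max_len)
interest_table = rng.normal(size=(3, 1))      # row 0 belongs to the padding id
mask = (interest_ids > 0).astype(float)[..., None]      # (batch, max_len, 1)
looked_up = interest_table[interest_ids]                 # (batch, max_len, 1)
# Mean over the real (non-padding) positions only.
interest_term = (looked_up * mask).sum(axis=1) / mask.sum(axis=1)

first_order = dense_term + gender_term + interest_term  # (batch, 1)
print(first_order.shape)
```

Masking matters: without it, the padding row of the table would be averaged in for every sample shorter than max_len.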

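    Next comes the FM second-order term, which is a little harder to follow.

    ·清尘·'s 《FM、FMM、DeepFM整理(pytorch)》 explains this process clearly and accessibly, and is worth reading.

    For the concrete implementation, the explanation in "DeepFM模型CTR预估理论与实战" (DeepFM for CTR Prediction: Theory and Practice) is easier to understand.

    [Figure: FM formula]

    Suppose there are only the two categorical features C1 and C2 from before, with vocabulary sizes 3 and 2. Suppose the input is again C1=2, C2=2 (indices start at 1). C2=2 then corresponds to global index 3+2=5, so after the embedding lookup we get V2=[e21,e22,e23,e24] and V5=[e51,e52,e53,e54].

    A term only needs to be computed when xi and xj are both non-zero, so the only case in the formula above that has to be evaluated is i=2 and j=5. Therefore:

    [Figure: FM]

    Extending this to more fields, e.g. C1, C2, C3, requires the inner product of every pair of embedding vectors.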
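
Written out for the example above, with $x_2 = x_5 = 1$ and every other $x_i = 0$, the second-order term collapses to a single inner product. The three-field case uses hypothetical labels $V_a, V_b, V_c$ for the embeddings selected by C1, C2 and C3.

```latex
\sum_{i<j} \langle V_i, V_j \rangle \, x_i x_j
  = \langle V_2, V_5 \rangle
  = e_{21} e_{51} + e_{22} e_{52} + e_{23} e_{53} + e_{24} e_{54}

% three active fields: one inner product per pair
\langle V_a, V_b \rangle + \langle V_a, V_c \rangle + \langle V_b, V_c \rangle
```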

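    How can all of this be computed in one pass with matrix operations? Consider the following:

    [Figure]

    The corresponding code is:

    # concated_embeds_value has shape (batch_size, field_size, embedding_size);
    # reduce_sum is the helper used by DeepCTR's FM layer (note the keep_dims argument)
    square_of_sum = tf.square(reduce_sum(
        concated_embeds_value, axis=1, keep_dims=True))
    sum_of_square = reduce_sum(
        concated_embeds_value * concated_embeds_value, axis=1, keep_dims=True)
    cross_term = square_of_sum - sum_of_square
    cross_term = 0.5 * reduce_sum(cross_term, axis=2, keep_dims=False)

    Here concated_embeds_value is the concatenation of the per-field embedding values.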
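
To see that the square-of-sum minus sum-of-squares form really equals the sum of all pairwise inner products, the two can be compared numerically. This is a NumPy sketch with random embeddings; the batch size, field count and embedding size are arbitrary assumptions.

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(0)
batch, fields, dim = 2, 3, 4
embeds = rng.normal(size=(batch, fields, dim))   # stands in for concated_embeds_value

# Naive version: one inner product per pair of fields.
naive = np.zeros((batch, 1))
for i, j in combinations(range(fields), 2):
    naive[:, 0] += (embeds[:, i] * embeds[:, j]).sum(axis=1)

# Vectorized version, mirroring the TensorFlow snippet step by step.
square_of_sum = np.square(embeds.sum(axis=1, keepdims=True))   # (batch, 1, dim)
sum_of_square = (embeds * embeds).sum(axis=1, keepdims=True)   # (batch, 1, dim)
cross_term = 0.5 * (square_of_sum - sum_of_square).sum(axis=2) # (batch, 1)

print(np.allclose(naive, cross_term))  # True
```

The naive loop grows quadratically with the number of fields, while the vectorized form touches each embedding only once.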

    Deep component

    The DNN is fairly simple; the FM and the DNN take the same group_embedding_dict as input.
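
The idea of sharing one set of embeddings between the two branches can be sketched as follows. This is an illustrative NumPy forward pass under assumed shapes and layer sizes (two hidden layers of 8 units, random weights, first-order term omitted), not DeepCTR's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, fields, dim = 2, 3, 4
embeds = rng.normal(size=(batch, fields, dim))   # shared by the FM and the DNN


def relu(x):
    return np.maximum(x, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


# FM branch: second-order interactions on the shared embeddings.
fm_logit = 0.5 * (np.square(embeds.sum(axis=1))
                  - np.square(embeds).sum(axis=1)).sum(axis=1, keepdims=True)

# Deep branch: flatten the same embeddings and feed them through an MLP.
hidden = embeds.reshape(batch, fields * dim)
for in_units, out_units in [(fields * dim, 8), (8, 8)]:   # hypothetical layer sizes
    weights = rng.normal(scale=0.1, size=(in_units, out_units))
    hidden = relu(hidden @ weights)
deep_logit = hidden @ rng.normal(scale=0.1, size=(8, 1))

# The two logits are added before the final sigmoid.
prob = sigmoid(fm_logit + deep_logit)
print(prob.shape)
```

Because both branches read the same embedding table, gradients from the FM and the DNN update the same parameters during training.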

    3. Testing with MovieLens

    Download the ml-100k dataset:

    wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
    unzip ml-100k.zip
    

    Install the required packages: scikit-learn (imported as sklearn) and deepctr.

    Import the packages:

    import pandas
    import pandas as pd
    import sklearn
    from sklearn.metrics import log_loss, roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder
    from tensorflow.python.keras.preprocessing.sequence import pad_sequences
    import tensorflow as tf
    from tqdm import tqdm
    
    from deepctr.models import DeepFM
    from deepctr.feature_column import SparseFeat, VarLenSparseFeat, get_feature_names
    import numpy as np
    

    Read the ratings data:

    # u.data is tab-separated: user id, movie id, rating, timestamp
    u_data = pd.read_csv("ml-100k/u.data", sep='\t', header=None)
    u_data.columns = ['user_id', 'movie_id', 'rating', 'timestamp']
    

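    Rated (user, movie) pairs are labeled 1; negatives are sampled at random from the movies each user has not rated:

    def neg_sample(u_data, neg_rate=1):
        # Global random sampling over all movie ids
        item_ids = u_data['movie_id'].unique()
        print('start neg sample')
        neg_data = []
        # Negative sampling, user by user
        for user_id, hist in tqdm(u_data.groupby('user_id')):
            # Movies the current user has rated
            rated_movie_list = hist['movie_id'].tolist()
            candidate_set = list(set(item_ids) - set(rated_movie_list))
            neg_list_id = np.random.choice(candidate_set, size=len(rated_movie_list) * neg_rate, replace=True)
            for movie_id in neg_list_id:
                # rating = -1 marks a negative sample; the timestamp is a dummy 0
                neg_data.append([user_id, movie_id, -1, 0])
        u_data_neg = pd.DataFrame(neg_data, columns=['user_id', 'movie_id', 'rating', 'timestamp'])
        u_data = pd.concat([u_data, u_data_neg], ignore_index=True)
        print('end neg sample')
        return u_data


    # Apply the sampling so that the later "rating >= 0" test yields both classes
    u_data = neg_sample(u_data)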
    

    Read the item data:

    # u.item is pipe-separated and contains non-UTF-8 characters, hence encoding='latin-1'.
    # on_bad_lines='skip' needs pandas >= 1.3; older versions used error_bad_lines=False.
    u_item = pd.read_csv("ml-100k/u.item", sep='|', header=None,
                         encoding='latin-1', on_bad_lines='skip')
    genres_columns = ['Action', 'Adventure',
                      'Animation',
                      'Children', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
                      'Film_Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi',
                      'Thriller', 'War', 'Western']

    u_item.columns = ['movie_id', 'title', 'release_date', 'video_date', 'url', 'unknown'] + genres_columns
    
    

    Collect each movie's genres into one '|'-joined string and drop the individual genre columns:

    genres_list = []
    for index, row in u_item.iterrows():
        # Keep the names of the genre flags that are set for this movie
        genres = [item for item in genres_columns if row[item]]
        genres_list.append('|'.join(genres))

    u_item['genres'] = genres_list
    u_item = u_item.drop(columns=genres_columns)
    

    Read the user information:

    # user id | age | gender | occupation | zip code (region)
    u_user = pd.read_csv("ml-100k/u.user", sep='|', header=None)
    u_user.columns = ['user_id', 'age', 'gender', 'occupation', 'zip']
    

    Join everything together:

    data = pd.merge(u_data, u_item, on="movie_id", how='left')
    data = pd.merge(data, u_user, on="user_id", how='left')
    data.to_csv('ml-100k/data.csv', index=False)
    

    Prepare the features:

    sparse_features = ["movie_id", "user_id",
                       "gender", "age", "occupation", "zip"]

    data[sparse_features] = data[sparse_features].astype(str)
    target = ['rating']

    # Binarize the rating: real ratings (>= 0) become 1, sampled negatives (-1) become 0
    data['rating'] = [1 if int(x) >= 0 else 0 for x in data['rating']]
    

    First, label-encode the sparse features:

    for feat in sparse_features:
        lbe = LabelEncoder()
        data[feat] = lbe.fit_transform(data[feat])
    

    Next, handle the genres feature. A movie can have several genres, so split the string first and then encode each genre as an integer, starting from 1. Because the number of genres differs from movie to movie, compute the maximum length and pad the shorter sequences with trailing zeros (pad_sequences with padding='post').

    key2index = {}


    def split(x):
        key_ans = x.split('|')
        for key in key_ans:
            if key not in key2index:
                # Notice: input value 0 is a special "padding", so we do not use 0 to encode valid features for sequence input
                key2index[key] = len(key2index) + 1
        return list(map(lambda x: key2index[x], key_ans))


    genres_list = list(map(split, data['genres'].values))
    genres_length = np.array(list(map(len, genres_list)))
    max_len = max(genres_length)
    # Notice: padding='post'
    genres_list = pad_sequences(genres_list, maxlen=max_len, padding='post')
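
The encode-then-pad step can be followed on a toy input without TensorFlow. The helper `encode_and_pad` and the sample genre strings are hypothetical; the padding loop is a plain-Python stand-in for what padding='post' does.

```python
def encode_and_pad(genre_strings):
    """Encode genres as ids starting from 1, then right-pad with 0."""
    key2index = {}
    encoded = []
    for genres in genre_strings:
        ids = []
        for key in genres.split('|'):
            if key not in key2index:
                key2index[key] = len(key2index) + 1   # 0 is reserved for padding
            ids.append(key2index[key])
        encoded.append(ids)
    max_len = max(len(ids) for ids in encoded)
    padded = [ids + [0] * (max_len - len(ids)) for ids in encoded]
    return padded, key2index


padded, key2index = encode_and_pad(["Action|Comedy", "Drama", "Comedy|Drama|War"])
print(padded)     # [[1, 2, 0], [3, 0, 0], [2, 3, 4]]
print(key2index)  # {'Action': 1, 'Comedy': 2, 'Drama': 3, 'War': 4}
```

With four distinct genres the vocabulary size passed to the embedding has to be 5, one more than len(key2index), to leave room for the padding id.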
    

    Build the DeepCTR feature columns. There are two main kinds: fixed-length SparseFeat columns for sparse categorical features, and variable-length VarLenSparseFeat columns for features such as genres that hold several values.

    fixlen_feature_columns = [SparseFeat(feat, data[feat].nunique(), embedding_dim=4)
                              for feat in sparse_features]

    use_weighted_sequence = False
    if use_weighted_sequence:
        varlen_feature_columns = [VarLenSparseFeat(
            SparseFeat('genres', vocabulary_size=len(key2index) + 1, embedding_dim=4),
            maxlen=max_len, combiner='mean',
            weight_name='genres_weight')]  # Notice: value 0 is for padding for sequence input feature
    else:
        varlen_feature_columns = [VarLenSparseFeat(
            SparseFeat('genres', vocabulary_size=len(key2index) + 1, embedding_dim=4),
            maxlen=max_len, combiner='mean',
            weight_name=None)]  # Notice: value 0 is for padding for sequence input feature

    linear_feature_columns = fixlen_feature_columns + varlen_feature_columns
    dnn_feature_columns = fixlen_feature_columns + varlen_feature_columns

    feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns)
    

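    Package the training data: shuffle the rows first, then build the dict input.

    # Shuffle data and genres_list together so each padded genre row stays aligned with its sample
    data, genres_list = sklearn.utils.shuffle(data, genres_list)
    train_model_input = {name: data[name] for name in sparse_features}
    train_model_input["genres"] = genres_list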
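
Because genres_list is a separate array built from data before shuffling, it has to be reordered with the same permutation as the rows of data, otherwise each sample ends up with another sample's genres. A minimal NumPy sketch of an aligned shuffle, on made-up ids:

```python
import numpy as np

rng = np.random.default_rng(0)

movie_ids = np.array([10, 20, 30, 40])               # stands in for the DataFrame rows
genres = np.array([[1, 0], [2, 3], [4, 0], [1, 2]])  # padded genres, one row per sample

# Draw one permutation and apply it to both arrays.
perm = rng.permutation(len(movie_ids))
movie_ids_shuffled = movie_ids[perm]
genres_shuffled = genres[perm]

for movie_id, genre_row in zip(movie_ids_shuffled, genres_shuffled):
    print(movie_id, genre_row)
```

Every printed pair is one of the original (movie id, genres) pairs; only their order has changed.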
    

    Build the DeepFM model. Since the target is 0/1, the task is 'binary' and the loss function is binary_crossentropy.

    model = DeepFM(linear_feature_columns, dnn_feature_columns, task='binary')
    
    model.compile(optimizer=tf.keras.optimizers.Adam(), loss='binary_crossentropy',
                  metrics=['AUC', 'Precision', 'Recall'])
    model.summary()
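
For intuition about the loss value reported during training, binary cross-entropy can be computed by hand. This is a plain NumPy sketch; the clipping constant `eps` is an assumption made here to avoid log(0), and the labels and predictions are made up.

```python
import numpy as np


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(losses))


confident = binary_crossentropy([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3])
uncertain = binary_crossentropy([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
print(confident, uncertain)
```

Predicting 0.5 for everything gives log(2), about 0.693, so a trained model should land well below that.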
    

    Train the model:

    model.fit(train_model_input, data[target].values,
              batch_size=256, epochs=20, verbose=2,
              validation_split=0.2)
    

    Training output:

    Epoch 1/20
    625/625 - 3s - loss: 0.5081 - auc: 0.8279 - precision: 0.7419 - recall: 0.7695 - val_loss: 0.4745 - val_auc: 0.8513 - val_precision: 0.7563 - val_recall: 0.7936
    Epoch 2/20
    625/625 - 2s - loss: 0.4695 - auc: 0.8538 - precision: 0.7494 - recall: 0.8105 - val_loss: 0.4708 - val_auc: 0.8539 - val_precision: 0.7498 - val_recall: 0.8127
    Epoch 3/20
    625/625 - 2s - loss: 0.4652 - auc: 0.8564 - precision: 0.7513 - recall: 0.8139 - val_loss: 0.4704 - val_auc: 0.8545 - val_precision: 0.7561 - val_recall: 0.8017
    Epoch 4/20
    625/625 - 2s - loss: 0.4624 - auc: 0.8579 - precision: 0.7516 - recall: 0.8146 - val_loss: 0.4724 - val_auc: 0.8542 - val_precision: 0.7296 - val_recall: 0.8526
    Epoch 5/20
    625/625 - 2s - loss: 0.4607 - auc: 0.8590 - precision: 0.7521 - recall: 0.8173 - val_loss: 0.4699 - val_auc: 0.8550 - val_precision: 0.7511 - val_recall: 0.8141
    Epoch 6/20
    625/625 - 2s - loss: 0.4588 - auc: 0.8602 - precision: 0.7545 - recall: 0.8165 - val_loss: 0.4717 - val_auc: 0.8542 - val_precision: 0.7421 - val_recall: 0.8265
    Epoch 7/20
    625/625 - 2s - loss: 0.4574 - auc: 0.8610 - precision: 0.7535 - recall: 0.8192 - val_loss: 0.4722 - val_auc: 0.8547 - val_precision: 0.7549 - val_recall: 0.8023
    Epoch 8/20
    625/625 - 2s - loss: 0.4561 - auc: 0.8619 - precision: 0.7543 - recall: 0.8201 - val_loss: 0.4717 - val_auc: 0.8548 - val_precision: 0.7480 - val_recall: 0.8185
    Epoch 9/20
    625/625 - 2s - loss: 0.4531 - auc: 0.8643 - precision: 0.7573 - recall: 0.8210 - val_loss: 0.4696 - val_auc: 0.8583 - val_precision: 0.7598 - val_recall: 0.8103
    Epoch 10/20
    625/625 - 2s - loss: 0.4355 - auc: 0.8768 - precision: 0.7787 - recall: 0.8166 - val_loss: 0.4435 - val_auc: 0.8769 - val_precision: 0.7756 - val_recall: 0.8293
    Epoch 11/20
    625/625 - 2s - loss: 0.4093 - auc: 0.8923 - precision: 0.7915 - recall: 0.8373 - val_loss: 0.4301 - val_auc: 0.8840 - val_precision: 0.7806 - val_recall: 0.8390
    Epoch 12/20
    625/625 - 2s - loss: 0.3970 - auc: 0.8988 - precision: 0.7953 - recall: 0.8497 - val_loss: 0.4286 - val_auc: 0.8867 - val_precision: 0.7903 - val_recall: 0.8299
    Epoch 13/20
    625/625 - 2s - loss: 0.3896 - auc: 0.9029 - precision: 0.8001 - recall: 0.8542 - val_loss: 0.4253 - val_auc: 0.8888 - val_precision: 0.7913 - val_recall: 0.8322
    Epoch 14/20
    625/625 - 2s - loss: 0.3825 - auc: 0.9067 - precision: 0.8038 - recall: 0.8584 - val_loss: 0.4205 - val_auc: 0.8917 - val_precision: 0.7885 - val_recall: 0.8506
    Epoch 15/20
    625/625 - 2s - loss: 0.3755 - auc: 0.9102 - precision: 0.8074 - recall: 0.8624 - val_loss: 0.4204 - val_auc: 0.8940 - val_precision: 0.7868 - val_recall: 0.8607
    Epoch 16/20
    625/625 - 2s - loss: 0.3687 - auc: 0.9136 - precision: 0.8117 - recall: 0.8653 - val_loss: 0.4176 - val_auc: 0.8956 - val_precision: 0.8097 - val_recall: 0.8236
    Epoch 17/20
    625/625 - 2s - loss: 0.3617 - auc: 0.9170 - precision: 0.8155 - recall: 0.8682 - val_loss: 0.4166 - val_auc: 0.8966 - val_precision: 0.8056 - val_recall: 0.8354
    Epoch 18/20
    625/625 - 2s - loss: 0.3553 - auc: 0.9201 - precision: 0.8188 - recall: 0.8716 - val_loss: 0.4168 - val_auc: 0.8977 - val_precision: 0.7996 - val_recall: 0.8492
    Epoch 19/20
    625/625 - 2s - loss: 0.3497 - auc: 0.9227 - precision: 0.8214 - recall: 0.8741 - val_loss: 0.4187 - val_auc: 0.8973 - val_precision: 0.8079 - val_recall: 0.8358
    Epoch 20/20
    625/625 - 2s - loss: 0.3451 - auc: 0.9248 - precision: 0.8244 - recall: 0.8753 - val_loss: 0.4210 - val_auc: 0.8982 - val_precision: 0.7945 - val_recall: 0.8617
    

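    Finally, run a quick prediction check (here on the training inputs):

    pred_ans = model.predict(train_model_input, batch_size=256)
    # Print the first 11 (prediction, label) pairs
    for pred, label in zip(pred_ans[:11], data['rating'].values[:11]):
        print(pred, label)

    The output is as follows:

    [Figure: prediction output]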

     

    4. References

    5. Hands-on Example

    import json

    import pandas as pd
    from sklearn.metrics import log_loss, roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder, MinMaxScaler
    from tensorflow.python.keras.callbacks import EarlyStopping
    from tensorflow.python.keras.models import save_model, load_model

    from deepctr.models import DeepFM
    from deepctr.feature_column import SparseFeat, DenseFeat, get_feature_names
    from deepctr.layers import custom_objects


    class DeepFmModel(object):
        def __init__(self):
            self.data = pd.read_csv('../data/train.txt', sep="\t")
            # Categorical features (16)
            fixlen_category_columns = ['m_sex', 'm_access_frequencies', 'm_twoA', 'm_twoB', 'm_twoC',
                                       'm_twoD', 'm_twoE', 'm_categoryA', 'm_categoryB', 'm_categoryC',
                                       'm_categoryD', 'm_categoryE', 'm_num_interest_topic',
                                       'num_topic_attention_intersection',
                                       'q_num_topic_words',
                                       'num_topic_interest_intersection'
                                       ]
            # Numerical features (7)
            fixlen_number_columns = ['m_salt_score', 'm_num_atten_topic', 'q_num_title_chars_words',
                                     'q_num_desc_chars_words', 'q_num_desc_words', 'q_num_title_words',
                                     'days_to_invite'
                                     ]
            # Sparse (categorical) features
            self.sparse_features = fixlen_category_columns
            # Dense (numerical) features
            self.dense_features = fixlen_number_columns
            self.target = ['label']

            # Preprocessing: fill NaN with '-1' for sparse features and 0 for dense features
            self.data[self.sparse_features] = self.data[self.sparse_features].fillna('-1')
            self.data[self.dense_features] = self.data[self.dense_features].fillna(0)
            # Label-encode each categorical feature as integer ids
            for feat in self.sparse_features:
                lbe = LabelEncoder()
                self.data[feat] = lbe.fit_transform(self.data[feat])
            # Scale the numerical features to [0, 1]
            mms = MinMaxScaler(feature_range=(0, 1))
            self.data[self.dense_features] = mms.fit_transform(self.data[self.dense_features])

            # Build the feature columns for both feature types
            fixlen_feature_columns = (
                [SparseFeat(feat, vocabulary_size=self.data[feat].max() + 1, embedding_dim=4)
                 for feat in self.sparse_features]
                + [DenseFeat(feat, 1) for feat in self.dense_features]
            )
            self.dnn_feature_columns = fixlen_feature_columns
            self.linear_feature_columns = fixlen_feature_columns

            # Names of the model inputs
            feature_names = get_feature_names(self.linear_feature_columns + self.dnn_feature_columns)

            self.train, self.test = train_test_split(self.data, test_size=0.1, random_state=2020)
            self.train_model_input = {name: self.train[name] for name in feature_names}
            self.test_model_input = {name: self.test[name] for name in feature_names}
    
        def train_model(self):
            # Define the model, then train it
            model = DeepFM(self.linear_feature_columns, self.dnn_feature_columns, task='binary')
            model.compile("adam", "binary_crossentropy",
                          metrics=['binary_crossentropy', "accuracy"])
            # Stop early when the monitored metric stops improving
            es = EarlyStopping(monitor='val_binary_crossentropy', patience=3)
            hi = model.fit(self.train_model_input, self.train[self.target].values,
                           batch_size=256, epochs=100, verbose=2, validation_split=0.2, callbacks=[es])

            save_model(model, '../checkpoint/DeepFM.cpt')  # save_model, same as before
            # Write one line per metric: name, tab, JSON list of per-epoch values
            with open("../logs/deep_fm_train.txt", "w", encoding="utf8") as f:
                for k, v in hi.history.items():
                    f.write(k + "\t" + json.dumps(v) + "\n")
    
        def predictData(self):
            import numpy as np
            import matplotlib.pyplot as plt
            import seaborn as sns
            from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                                         precision_score, recall_score)

            model = load_model('../checkpoint/DeepFM.cpt', custom_objects)  # load_model, just add a parameter
            pred_ans = model.predict(self.test_model_input, batch_size=256)

            with open("../logs/deep_fm_test.txt", "w", encoding="utf8") as f:
                f.write("test LogLoss" + "\t" + str(round(log_loss(self.test[self.target].values, pred_ans), 4)) + "\n")
                f.write("test AUC" + "\t" + str(round(roc_auc_score(self.test[self.target].values, pred_ans), 4)) + "\n")
                # Round the predicted probabilities to hard 0/1 labels
                label = np.squeeze(self.test[self.target].values).astype(int)
                pre = np.around(np.squeeze(pred_ans)).astype(int)
                f.write("accuracy_score" + "\t" + str(accuracy_score(label, pre)) + "\n")
                f.write("recall_score" + "\t" + str(recall_score(label, pre)) + "\n")
                f.write("precision_score" + "\t" + str(precision_score(label, pre)) + "\n")
                f.write("f1_score" + "\t" + str(f1_score(label, pre)) + "\n")
                # Plot the confusion matrix as a heatmap
                confusion = confusion_matrix(label, pre)
                sns.heatmap(confusion, annot=True, xticklabels=['Negative', 'Positive'], yticklabels=['Negative', 'Positive'], cmap=plt.cm.Blues)
                plt.ylabel("Label")
                plt.xlabel("Predicted")
                plt.savefig("../logs/deep_fm.png")
                plt.show()
    
    
    if __name__ == '__main__':
        deep_fm = DeepFmModel()  # use a new name so the class itself is not overwritten
        deep_fm.train_model()
        deep_fm.predictData()
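
The scores written by predictData all derive from the four cells of the confusion matrix. The sketch below recomputes them by hand on made-up labels and probabilities; `binary_metrics` is a hypothetical helper, and it thresholds with >= 0.5, which differs from np.around only for a probability of exactly 0.5 (np.around rounds halves to the nearest even number, so 0.5 becomes 0).

```python
import numpy as np


def binary_metrics(labels, probs, threshold=0.5):
    """Accuracy, precision, recall and F1 from thresholded probabilities."""
    labels = np.asarray(labels).astype(int)
    preds = (np.asarray(probs) >= threshold).astype(int)
    tp = int(((preds == 1) & (labels == 1)).sum())
    fp = int(((preds == 1) & (labels == 0)).sum())
    fn = int(((preds == 0) & (labels == 1)).sum())
    tn = int(((preds == 0) & (labels == 0)).sum())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(labels), "precision": precision,
            "recall": recall, "f1": f1, "confusion": [[tn, fp], [fn, tp]]}


metrics = binary_metrics([1, 0, 1, 1, 0], [0.9, 0.4, 0.3, 0.8, 0.6])
print(metrics)
```

The nested list uses the same layout as sklearn's confusion_matrix for binary labels: rows are true classes, columns are predicted classes.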
  • Original article: https://www.cnblogs.com/zhangxianrong/p/14971653.html