Embedding-based logistic regression: neural-network logistic regression in TensorFlow

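--- Inspiration --- I have recently been working on RNN-based NLP, where every kind of cell, whether LSTM, GRU, or even a CNN, operates on embedding representations of words. A word embedding represents each word as a vector, and the values of those vectors are learned through backpropagation. I found the idea appealing, so I tried applying it to logistic regression.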

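--- Problem --- In logistic regression many input variables are categorical. What do you do with a variable that has 1000 categories? Convert it into a 1000*1 one-hot vector? The approach here: use an embedding, giving each category a 10-dimensional vector, and then feed those vectors into conventional regression or a neural network.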
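To make the one-hot versus embedding comparison concrete, here is a minimal NumPy sketch (not the post's TensorFlow code). It checks that multiplying a one-hot row by a weight matrix gives the same result as simply indexing that row of the matrix, which is why an embedding lookup can replace the 1000-wide one-hot input. The table values, the batch of IDs, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_categories, embed_dim = 1000, 10  # sizes taken from the example in the text

# Hypothetical embedding table: one 10-dimensional row per category.
embedding_table = rng.normal(size=(num_categories, embed_dim))

category_ids = np.array([3, 999, 3])  # a small illustrative batch of category indices

# Route 1: build explicit one-hot rows and multiply by the table.
one_hot = np.zeros((len(category_ids), num_categories))
one_hot[np.arange(len(category_ids)), category_ids] = 1.0
via_one_hot = one_hot @ embedding_table

# Route 2: index the table directly, which is what an embedding lookup does.
via_lookup = embedding_table[category_ids]

same_result = np.allclose(via_one_hot, via_lookup)
print(via_lookup.shape, same_result)
```

The lookup route never materializes the 1000-wide rows, so its cost depends on the embedding size rather than on the number of categories.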

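--- Experiment ---
1: The data. The data comes from Kaggle, from the Red Hat competition; take a look if you are interested.
2: The method. The title says logistic regression, but in essence this is still a neural network doing classification. The problem is traditionally solved with logistic regression, because the data contains many categorical variables and the label is 0 or 1. Running a logistic regression is easy. The difficulty here is that the data has a group variable and a people variable: group has roughly 3k+ categories (the code sizes the group table at 35000 rows) and people has roughly 180K+, so converting them to dummy variables before running logistic regression is clearly impractical. Following the idea of word embeddings, I built two dictionaries in TensorFlow, one for people and one for group. During training each dictionary is looked up to return a 10-dimensional real-valued vector, and these two vectors serve as the people and group features. After that I added a few fully connected layers and activation functions more or less arbitrarily. It worked well and quickly converged to above 90% accuracy.
3: Results. I originally only wanted to use this data to experiment with reading tfrecords data in batches under tf.Session(), because tfrecords data can be read without loading the whole dataset into memory. I had always read tfrecords through the estimator approach, but there seems to be no equally good solution once you switch to a session. The results are decent, and the main takeaway is that problems with many categories can be handled with embeddings from now on.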
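The method described above can be sketched end to end in plain NumPy: build one dictionary per categorical variable, map raw IDs to integer indices, look up a 10-dimensional vector for each, concatenate, and apply a dense layer with a sigmoid. This is only an illustration under stated assumptions, not the post's training code: the raw ID strings, the `build_index` helper, the random initialization, and the single dense layer are all hypothetical.

```python
import numpy as np

def build_index(raw_ids):
    """Map each distinct raw ID to a consecutive integer, in first-seen order."""
    index = {}
    for raw in raw_ids:
        if raw not in index:
            index[raw] = len(index)
    return index

# Hypothetical raw IDs; the real dataset has far more distinct values.
people_raw = ["ppl_100", "ppl_200", "ppl_100", "ppl_300"]
group_raw = ["group 1", "group 1", "group 2", "group 2"]

people_index = build_index(people_raw)
group_index = build_index(group_raw)

# Integer IDs are what gets fed to the embedding lookup.
people_ids = np.array([people_index[p] for p in people_raw])
group_ids = np.array([group_index[g] for g in group_raw])

rng = np.random.default_rng(0)
embed_dim = 10
people_table = rng.normal(scale=0.1, size=(len(people_index), embed_dim))
group_table = rng.normal(scale=0.1, size=(len(group_index), embed_dim))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Look up one vector per row, concatenate, then apply one dense layer.
features = np.concatenate([people_table[people_ids], group_table[group_ids]], axis=1)
weights = rng.normal(scale=0.1, size=(2 * embed_dim, 1))
probabilities = sigmoid(features @ weights)

print(people_ids, group_ids, features.shape, probabilities.shape)
```

In the real model the two tables and the dense weights would all be trained together by backpropagation; here they are left at their random initial values.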

#encoding=utf-8
import numpy as np 
import tensorflow as tf 
import pickle
import random 
model_dir = '/home/yanjianfeng/kaggle/data/model_dir/'


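# Load the pickled people/group dictionaries and the column data (binary mode, as pickle expects).
people_dic, group_dic, dic = pickle.load(open('/home/yanjianfeng/kaggle/data/data.dump', 'rb'))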
def create_train_op(loss):
    train_op = tf.contrib.layers.optimize_loss(loss = loss, 
        global_step = tf.contrib.framework.get_global_step(), 
        learning_rate = 0.1, 
        clip_gradients = 10.0, 
        optimizer = "Adam")
    return train_op 

def create_input():
    # Pick a random start offset and return a contiguous window of 2048 rows from every column.
    random_id = random.randint(0, len(dic['outcome'])-2049)
    keys = dic.keys() 
    data = {}
    for k in keys:
        data[k] = dic[k][random_id: random_id+2048]
    return data


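# It is better not to wrap the main graph in a function, since that makes it hard to pull out a specific value.
# Alternatively, build the main part directly inside tf.Session, which is easier; this is roughly the pattern.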


global_step = tf.Variable(0, name = 'global_step', trainable=False)

people_id = tf.placeholder("int64", [None])
group = tf.placeholder('int64', [None])
time = tf.placeholder('int64', [None])
peofea = tf.placeholder('int64', [None, 262])
rowfea = tf.placeholder('int64', [None, 174])
outcome = tf.placeholder("int64", [None])

# Embedding tables: one trainable 10-dimensional vector per person and per group,
# looked up by integer ID.
name_embed = tf.get_variable('names', shape = [189120, 10])
group_embed = tf.get_variable('groups', shape = [35000, 10])
name_ = tf.nn.embedding_lookup(name_embed, people_id)
group_ = tf.nn.embedding_lookup(group_embed, group)

name_w = tf.get_variable('name_w', shape = [10, 2])
group_w = tf.get_variable('group_w', shape = [10, 5])

name_outcome = tf.matmul(name_, name_w)
group_outcome = tf.matmul(group_, group_w)

w_1 = tf.get_variable('w_1', shape = [262, 10])
w_2 = tf.get_variable('w_2', shape = [174, 10])
w_3 = tf.get_variable('w_3', shape = [1])

peofea_outcome = tf.matmul(tf.to_float(peofea), w_1)
rowfea_outcome = tf.matmul(tf.to_float(rowfea), w_2)

time_outcome = tf.mul(tf.to_float(time), w_3)
time_outcome = tf.expand_dims(time_outcome, -1)

name_outcome = tf.sigmoid(name_outcome)
group_outcome = tf.sigmoid(group_outcome)
peofea_outcome = tf.sigmoid(peofea_outcome)
rowfea_outcome = tf.sigmoid(rowfea_outcome)
time_outcome = tf.sigmoid(time_outcome)

# Concatenate along the feature axis: 2 + 5 + 10 + 10 + 1 = 28 columns.
x = tf.concat(1, [name_outcome, group_outcome, peofea_outcome, rowfea_outcome, time_outcome])

w_f = tf.get_variable('w_f', shape = [28, 28])
b = tf.get_variable('b', shape = [1])
w_f_2 = tf.get_variable('w_f_2', shape = [28, 1])

pred = tf.sigmoid(tf.matmul(x, w_f)) + b 
pred = tf.matmul(pred, w_f_2)

y = tf.expand_dims(tf.to_float(outcome), -1)

prob = tf.sigmoid(pred)
prob = tf.to_float(tf.greater(prob, 0.5))
c = tf.reduce_mean(tf.to_float(tf.equal(prob, y)))

loss = tf.nn.sigmoid_cross_entropy_with_logits(pred, y)
loss = tf.reduce_mean(loss)
train_op = create_train_op(loss)



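# Order matters here: a default Saver only covers variables that already exist when it is constructed, so create it after the graph is built.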
saver = tf.train.Saver()
with tf.Session() as sess:

    sess.run(tf.initialize_all_variables())
    ckpt = tf.train.get_checkpoint_state(model_dir)
    if ckpt and ckpt.model_checkpoint_path:
        print('the model being restored is ')
        print(ckpt.model_checkpoint_path)
        saver.restore(sess, ckpt.model_checkpoint_path)
        print('successfully restored the session')

    count = global_step.eval()

    for i in range(0, 10000):
        data = create_input()
        l, _ , c_ = sess.run([loss, train_op, c], feed_dict = {people_id: data['people_id'],
            group: data['group'],
            time: data['time'],
            peofea: data['people_features'],
            rowfea: data['row_features'],
            outcome: data['outcome']})
        print('the loss\t' + str(l) + '\t\tthe accuracy\t' + str(c_))
        global_step.assign(count).eval()
        saver.save(sess, model_dir + 'model.ckpt', global_step = global_step)
        count += 1
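
The script above trains on `tf.nn.sigmoid_cross_entropy_with_logits` and reports the fraction of thresholded predictions that match the labels. As a rough NumPy sketch of what those two quantities compute, the block below uses the numerically stable form of the loss, max(x, 0) - x * z + log(1 + exp(-|x|)), and compares it with the textbook formula. The logits and labels are made-up values, and the helper names are assumptions for illustration.

```python
import numpy as np

def sigmoid_cross_entropy_with_logits(logits, labels):
    """Stable per-example loss: max(x, 0) - x * z + log(1 + exp(-|x|))."""
    return np.maximum(logits, 0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))

def accuracy(logits, labels):
    """Fraction of examples where the thresholded prediction equals the label."""
    # sigmoid(x) > 0.5 exactly when x > 0, so the logits can be thresholded directly.
    predicted = (logits > 0).astype(np.float64)
    return float(np.mean(predicted == labels))

# Made-up logits and 0/1 labels, shaped [batch, 1] like `pred` and `y` in the script.
logits = np.array([[2.0], [-1.0], [0.5], [-3.0]])
labels = np.array([[1.0], [0.0], [0.0], [0.0]])

per_example = sigmoid_cross_entropy_with_logits(logits, labels)
mean_loss = float(per_example.mean())
batch_accuracy = accuracy(logits, labels)

# Textbook form, for comparison: -(z * log(p) + (1 - z) * log(1 - p)).
p = 1.0 / (1.0 + np.exp(-logits))
naive = -(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p))

print(mean_loss, batch_accuracy, np.allclose(per_example, naive))
```

The stable form matters for large logits: the textbook formula takes log(0) once the sigmoid saturates, while the rewritten one stays finite.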

Original article: https://www.cnblogs.com/LarryGates/p/6560839.html