TensorFlow for Text Processing: Doc2Vec Sentiment Analysis


    Code source: TensorFlow Machine Learning Cookbook (Chinese edition translated by 曾益强, September 2017), Chapter 7: Natural Language Processing

    Code repository: https://github.com/nfmcclure/tensorflow-cookbook

    Data: http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz

    The word2vec method models the context relationships between words, but it does not consider the relationship between a word and the document that contains it.

    One extension of word2vec is doc2vec, which also takes the document into account.

    Algorithm idea

    Introduce document embeddings that, together with the word embeddings, help predict the sentiment of a document.

    How are the two combined?

    1. Add the document embedding to the word embeddings (requires the document embedding to be the same size as the word embeddings)

    2. Concatenate the document embedding after the word embeddings (increases the number of variables)

    In general, adding the two embeddings is considered the better choice for small datasets; a toy sketch of both options is shown below.
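
    To make the two options concrete, here is a toy NumPy sketch (the sizes are purely illustrative; this recipe uses option 2, concatenation, as the code below shows):

    import numpy as np

    # Toy sizes only; the recipe below uses embedding_size = 200 and doc_embedding_size = 100
    word_window = np.random.randn(3, 50)    # embeddings of the 3 words in the window
    doc_vec_same = np.random.randn(50)      # document embedding with the same size as a word embedding

    # Option 1: add the document embedding to the summed word embeddings
    combined_add = word_window.sum(axis=0) + doc_vec_same                         # shape (50,)

    # Option 2: concatenate a (possibly smaller) document embedding after the word embeddings
    doc_vec_small = np.random.randn(20)
    combined_concat = np.concatenate([word_window.sum(axis=0), doc_vec_small])    # shape (70,)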

    Steps

    • Data preparation and processing
    • Compute the document and word embeddings
    • Train a logistic regression model

    1. Data preparation and processing

    1.1 Import the necessary packages

    1.2 Load the data

    1.3 Declare model parameters

    1.4 Normalize the movie reviews and keep only reviews longer than the specified window size

    1.5 Create the word dictionary (no document dictionary is needed, since each document already has a unique index)

    import tensorflow as tf
    import matplotlib.pyplot as plt
    import numpy as np
    import random
    import os
    import pickle
    import string
    import requests
    import collections
    import io
    import tarfile
    import urllib.request
    import text_helpers
    from nltk.corpus import stopwords
    from tensorflow.python.framework import ops
    ops.reset_default_graph()
    
    os.chdir(os.path.dirname(os.path.realpath(__file__)))
    
    # Make a saving directory if it doesn't exist
    data_folder_name = 'temp'
    if not os.path.exists(data_folder_name):
        os.makedirs(data_folder_name)
    
    # Start a graph session
    sess = tf.Session()
    
    # Declare model parameters
    batch_size = 500
    vocabulary_size = 7500
    generations = 100000
    model_learning_rate = 0.001
    
    embedding_size = 200   # Word embedding size
    doc_embedding_size = 100   # Document embedding size
    concatenated_size = embedding_size + doc_embedding_size
    
    num_sampled = int(batch_size/2)    # Number of negative examples to sample.
    window_size = 3       # How many words to consider to the left.
    
    # Add checkpoints to training
    save_embeddings_every = 5000
    print_valid_every = 5000
    print_loss_every = 100
    
    # Declare stop words
    stops = stopwords.words('english')
    # We pick a few test words for validation.
    valid_words = ['love', 'hate', 'happy', 'sad', 'man', 'woman']
    # Later we will have to transform these into indices
    
    # Load the movie review data
    print('Loading Data')
    texts, target = text_helpers.load_movie_data(data_folder_name)
    
    # Normalize text
    print('Normalizing Text Data')
    texts = text_helpers.normalize_text(texts, stops)
    
    # Texts must contain more than window_size (3) words
    target = [target[ix] for ix, x in enumerate(texts) if len(x.split()) > window_size]
    texts = [x for x in texts if len(x.split()) > window_size]    
    assert(len(target)==len(texts))
    
    # Build our data set and dictionaries
    print('Creating Dictionary')
    word_dictionary = text_helpers.build_dictionary(texts, vocabulary_size)
    word_dictionary_rev = dict(zip(word_dictionary.values(), word_dictionary.keys()))
    text_data = text_helpers.text_to_numbers(texts, word_dictionary)
    
    # Get the dictionary keys (indices) of the validation words
    valid_examples = [word_dictionary[x] for x in valid_words]
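
    The text_helpers module itself is not listed in this post (it lives in the cookbook repository). As a rough idea of what the dictionary helpers do, a hypothetical sketch might look like the following; the 'RARE' bucket at index 0 is an assumption for illustration, not the actual text_helpers code:

    import collections

    def build_dictionary_sketch(sentences, vocabulary_size):
        # Map the most frequent words to indices; reserve index 0 for rare/unknown words
        words = [w for s in sentences for w in s.split()]
        counts = [('RARE', -1)]
        counts.extend(collections.Counter(words).most_common(vocabulary_size - 1))
        return {word: ix for ix, (word, _) in enumerate(counts)}

    def text_to_numbers_sketch(sentences, word_dict):
        # Replace each word by its dictionary index; unknown words map to 0 ('RARE')
        return [[word_dict.get(w, 0) for w in s.split()] for s in sentences]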

    2. Compute the document and word embeddings

    2.1 Define the document and word embeddings

    print('Creating Model')
    # Define Embeddings:
    embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    doc_embeddings = tf.Variable(tf.random_uniform([len(texts), doc_embedding_size], -1.0, 1.0))
    
    # NCE loss parameters
    nce_weights = tf.Variable(tf.truncated_normal([vocabulary_size, concatenated_size], stddev=1.0 / np.sqrt(concatenated_size)))
    nce_biases = tf.Variable(tf.zeros([vocabulary_size])) 

    2.2 Declare placeholders for the doc2vec indices and the target word indices (the input index size is the window size plus 1, because of the extra document index)

    # Create data/target placeholders
    x_inputs = tf.placeholder(tf.int32, shape=[None, window_size + 1]) # plus 1 for doc index
    y_target = tf.placeholder(tf.int32, shape=[None, 1])
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

    2.3 How the batch data gets an extra document-index column

    Reference: TensorFlow for Text Processing: helper functions

    # Same as for skip-gram and CBOW
    >>> rand_sentence = [2520, 1421, 146, 1215, 5, 468, 12, 14, 18, 20]
    >>> window_size = 3
    >>> window_sequences = [rand_sentence[max((ix-window_size),0):(ix+window_size+1)] for ix, x in enumerate(rand_sentence)]
    >>> label_indices = [ix if ix<window_size else window_size for ix,x in enumerate(window_sequences)]
    >>> window_sequences
    [[2520, 1421, 146, 1215], [2520, 1421, 146, 1215, 5], [2520, 1421, 146, 1215, 5, 468], [2520, 1421, 146, 1215, 5, 468, 12], [1421, 146, 1215, 5, 468, 12, 14], [146, 1215, 5, 468, 12, 14, 18], [1215, 5, 468, 12, 14, 18, 20], [5, 468, 12, 14, 18, 20], [468, 12, 14, 18, 20], [12, 14, 18, 20]]
    >>> label_indices
    [0, 1, 2, 3, 3, 3, 3, 3, 3, 3]
    # Left-window words and the target word that follows them
    >>> batch_and_labels = [(rand_sentence[i:i+window_size], rand_sentence[i+window_size]) for i in range(0, len(rand_sentence)-window_size)]
    >>> batch_and_labels
    [([2520, 1421, 146], 1215), ([1421, 146, 1215], 5), ([146, 1215, 5], 468), ([1215, 5, 468], 12), ([5, 468, 12], 14), ([468, 12, 14], 18), ([12, 14, 18], 20)]
    >>> batch, labels = [list(x) for x in zip(*batch_and_labels)]
    >>> batch
    [[2520, 1421, 146], [1421, 146, 1215], [146, 1215, 5], [1215, 5, 468], [5, 468, 12], [468, 12, 14], [12, 14, 18]]
    >>> labels
    [1215, 5, 468, 12, 14, 18, 20]
    # 155 stands for rand_sentence_ix, the index of the current document; append it to every window
    >>> batch = [x + [155] for x in batch]
    >>> batch
    [[2520, 1421, 146, 155], [1421, 146, 1215, 155], [146, 1215, 5, 155], [1215, 5, 468, 155], [5, 468, 12, 155], [468, 12, 14, 155], [12, 14, 18, 155]]
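
    Based on the transcript above, the doc2vec branch of text_helpers.generate_batch_data can be sketched roughly as follows (a hypothetical simplification for illustration, not the repository's exact code):

    import numpy as np

    def generate_doc2vec_batch_sketch(sentences, batch_size, window_size):
        # Each example is [w_i, ..., w_(i+window_size-1), doc_ix]; the label is the word after the window
        batch_data, label_data = [], []
        while len(batch_data) < batch_size:
            rand_sentence_ix = int(np.random.choice(len(sentences)))
            rand_sentence = sentences[rand_sentence_ix]
            if len(rand_sentence) <= window_size:
                continue   # reviews shorter than the window were filtered out earlier
            # Left-window words and the word that follows each window
            batch_and_labels = [(rand_sentence[i:i+window_size], rand_sentence[i+window_size])
                                for i in range(0, len(rand_sentence)-window_size)]
            batch, labels = [list(x) for x in zip(*batch_and_labels)]
            # Append the document index to every window
            batch = [x + [rand_sentence_ix] for x in batch]
            batch_data.extend(batch[:batch_size])
            label_data.extend(labels[:batch_size])
        batch_data = np.array(batch_data[:batch_size])
        label_data = np.transpose(np.array([label_data[:batch_size]]))
        return batch_data, label_data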

    2.4 Create the embedding function: sum the word embeddings over the window, then concatenate the document embedding

    # Sum the word embeddings over the window, then concatenate the document embedding
    embed = tf.zeros([batch_size, embedding_size])
    for element in range(window_size):
        embed += tf.nn.embedding_lookup(embeddings, x_inputs[:, element])
    
    doc_indices = tf.slice(x_inputs, [0,window_size],[batch_size,1])
    doc_embed = tf.nn.embedding_lookup(doc_embeddings,doc_indices)
    
    # concatenate embeddings
    final_embed = tf.concat(axis=1, values=[embed, tf.squeeze(doc_embed)])

    Function references: tf.slice, tf.concat, tf.squeeze

    A worked example of the computation above:

    In [1]: import tensorflow as tf
    
    In [2]: batch_size=4
    
    In [3]: embedding_size=2
    
    In [4]: window_size=3
    
    In [5]: vocabulary_size=5
    
    In [6]: embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    
    In [7]: x_inputs = tf.constant([[1,2,3,4],[2,1,3,4],[0,1,2,4],[0,2,1,4]])
    
    In [8]: doc_embedding_size=6
    
    In [9]: doc_embeddings=tf.Variable(tf.random_uniform([5, doc_embedding_size], -1.0, 1.0))
    
    # Enter the embed, doc_indices and doc_embed operations from above here
    
    In [17]: init = tf.global_variables_initializer()
    
    In [18]: sess.run(init)
    
    In [19]: sess.run(embeddings)
    Out[19]: 
    array([[ 0.98145223, -0.83131742],
           [-0.52276659, -0.29783845],
           [ 0.46809649,  0.044348  ],
           [ 0.31654406, -0.46324134],
           [ 0.75828886, -0.99072552]], dtype=float32)
    
    In [20]: sess.run(doc_embeddings)
    Out[20]: 
    array([[ 0.9781754 , -0.06880736, -0.2043252 ,  0.26498127, -0.5746839 ,
             0.98072433],
           [-0.38463736,  0.79673767,  0.43830776,  0.85574818,  0.56155062,
            -0.18029761],
           [-0.72975278,  0.39068675, -0.72180915, -0.04128242,  0.00453711,
             0.35815096],
           [ 0.66498852,  0.07083178, -0.42824841, -0.12427211,  0.35502028,
             0.92180991],
           [ 0.64659524, -0.57290649,  0.76603436, -0.20811057, -0.10866618,
             0.52539349]], dtype=float32)
    
    In [22]: sess.run(x_inputs)
    Out[22]: 
    array([[1, 2, 3, 4],
           [2, 1, 3, 4],
           [0, 1, 2, 4],
           [0, 2, 1, 4]], dtype=int32)
    
    In [23]: sess.run(embed)
    Out[23]: 
    array([[ 0.26187396, -0.71673179],
           [ 0.26187396, -0.71673179],
           [ 0.92678213, -1.08480787],
           [ 0.92678213, -1.08480787]], dtype=float32)
    
    In [24]: sess.run(doc_indices)
    Out[24]: 
    array([[4],
           [4],
           [4],
           [4]], dtype=int32)
    
    In [25]: sess.run(doc_embed)
    Out[25]: 
    array([[[ 0.64659524, -0.57290649,  0.76603436, -0.20811057, -0.10866618,
              0.52539349]],
    
           [[ 0.64659524, -0.57290649,  0.76603436, -0.20811057, -0.10866618,
              0.52539349]],
    
           [[ 0.64659524, -0.57290649,  0.76603436, -0.20811057, -0.10866618,
              0.52539349]],
    
           [[ 0.64659524, -0.57290649,  0.76603436, -0.20811057, -0.10866618,
              0.52539349]]], dtype=float32)

    2.5 Declare the loss function and optimizer

    # Get loss from prediction
    loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weights,
                                         biases=nce_biases,
                                         labels=y_target,
                                         inputs=final_embed,
                                         num_sampled=num_sampled,
                                         num_classes=vocabulary_size))
                                         
    # Create optimizer
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=model_learning_rate)
    train_step = optimizer.minimize(loss)

    2.6 Declare cosine similarity for the validation word set

    # Cosine similarity between words
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
    normalized_embeddings = embeddings / norm
    valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
    similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)

    2.7 Save the embeddings & train

    # Create model saving operation
    saver = tf.train.Saver({"embeddings": embeddings, "doc_embeddings": doc_embeddings})
    
    # Add variable initializer.
    init = tf.global_variables_initializer()
    sess.run(init)
    
    # Run the doc2vec model.
    print('Starting Training')
    loss_vec = []
    loss_x_vec = []
    for i in range(generations):
        batch_inputs, batch_labels = text_helpers.generate_batch_data(text_data, batch_size, window_size, method='doc2vec')
        feed_dict = {x_inputs : batch_inputs, y_target : batch_labels}
    
        # Run the train step
        sess.run(train_step, feed_dict=feed_dict)
    
        # Return the loss
        if (i+1) % print_loss_every == 0:
            loss_val = sess.run(loss, feed_dict=feed_dict)
            loss_vec.append(loss_val)
            loss_x_vec.append(i+1)
            print('Loss at step {} : {}'.format(i+1, loss_val))
          
        # Validation: Print some random words and top 5 related words
        if (i+1) % print_valid_every == 0:
            sim = sess.run(similarity, feed_dict=feed_dict)
            for j in range(len(valid_words)):
                valid_word = word_dictionary_rev[valid_examples[j]]
                top_k = 5 # number of nearest neighbors
                nearest = (-sim[j, :]).argsort()[1:top_k+1]
                log_str = "Nearest to {}:".format(valid_word)
                for k in range(top_k):
                    close_word = word_dictionary_rev[nearest[k]]
                    log_str = '{} {},'.format(log_str, close_word)
                print(log_str)
                
        # Save dictionary + embeddings
        if (i+1) % save_embeddings_every == 0:
            # Save vocabulary dictionary
            with open(os.path.join(data_folder_name,'movie_vocab.pkl'), 'wb') as f:
                pickle.dump(word_dictionary, f)
            
            # Save embeddings
            model_checkpoint_path = os.path.join(os.getcwd(),data_folder_name,'doc2vec_movie_embeddings.ckpt')
            save_path = saver.save(sess, model_checkpoint_path)
            print('Model saved in file: {}'.format(save_path)) 

    Results:


    3. Train the logistic regression model

    3.1 Parameters

    # Start logistic model-------------------------
    max_words = 20
    logistic_batch_size = 500

    3.2 Split the dataset into train and test sets

    # Split dataset into train and test sets
    # Need to keep the indices sorted to keep track of document index
    train_indices = np.sort(np.random.choice(len(target), round(0.8*len(target)), replace=False))
    test_indices = np.sort(np.array(list(set(range(len(target))) - set(train_indices))))
    texts_train = [x for ix, x in enumerate(texts) if ix in train_indices]
    texts_test = [x for ix, x in enumerate(texts) if ix in test_indices]
    target_train = np.array([x for ix, x in enumerate(target) if ix in train_indices])
    target_test = np.array([x for ix, x in enumerate(target) if ix in test_indices])

    3.3 Represent each movie review with 20 words (pad or crop to max_words)

    # Convert texts to lists of indices
    text_data_train = np.array(text_helpers.text_to_numbers(texts_train, word_dictionary))
    text_data_test = np.array(text_helpers.text_to_numbers(texts_test, word_dictionary))
    
    # Pad/crop movie reviews to specific length
    text_data_train = np.array([x[0:max_words] for x in [y+[0]*max_words for y in text_data_train]])
    text_data_test = np.array([x[0:max_words] for x in [y+[0]*max_words for y in text_data_test]])
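
    As a quick sanity check of the pad/crop line, this is what it does to two toy reviews (the word indices are made up): short reviews are padded with index 0 and long reviews are cropped to the first max_words words.

    >>> max_words = 5
    >>> reviews = [[7, 3, 12], [4, 9, 1, 6, 2, 8, 5]]
    >>> [x[0:max_words] for x in [y + [0]*max_words for y in reviews]]
    [[7, 3, 12, 0, 0], [4, 9, 1, 6, 2]]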

    3.4 Create another embedding function. The earlier one combined a three-word window with a document index to predict the target word; this one is similar, except that it operates on the 20-word review (log_x_inputs is the placeholder declared in section 3.5 below).

    # Define logistic embedding lookup (needed if we have two different batch sizes)
    # Add together element embeddings in window:
    log_embed = tf.zeros([logistic_batch_size, embedding_size])
    for element in range(max_words):
        log_embed += tf.nn.embedding_lookup(embeddings, log_x_inputs[:, element])
    
    log_doc_indices = tf.slice(log_x_inputs, [0,max_words],[logistic_batch_size,1])
    log_doc_embed = tf.nn.embedding_lookup(doc_embeddings,log_doc_indices)
    
    # concatenate embeddings
    log_final_embed = tf.concat(axis=1, values=[log_embed, tf.squeeze(log_doc_embed)])

    3.5 Build the training graph for the logistic regression model

    # Define Logistic placeholders
    log_x_inputs = tf.placeholder(tf.int32, shape=[None, max_words + 1]) # plus 1 for doc index
    log_y_target = tf.placeholder(tf.int32, shape=[None, 1])
    
    # Define model:
    # Create variables for logistic regression
    A = tf.Variable(tf.random_normal(shape=[concatenated_size,1]))
    b = tf.Variable(tf.random_normal(shape=[1,1]))
    
    # Declare logistic model (sigmoid in loss function)
    model_output = tf.add(tf.matmul(log_final_embed, A), b)
    
    # Declare loss function (Cross Entropy loss)
    logistic_loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=model_output, labels=tf.cast(log_y_target, tf.float32)))
    
    # Actual Prediction
    prediction = tf.round(tf.sigmoid(model_output))
    predictions_correct = tf.cast(tf.equal(prediction, tf.cast(log_y_target, tf.float32)), tf.float32)
    accuracy = tf.reduce_mean(predictions_correct)
    
    # Declare optimizer
    logistic_opt = tf.train.GradientDescentOptimizer(learning_rate=0.01)
    logistic_train_step = logistic_opt.minimize(logistic_loss, var_list=[A, b])

    3.6 Training

    # Initialize variables
    init = tf.global_variables_initializer()
    sess.run(init)
    
    # Start Logistic Regression
    print('Starting Logistic Doc2Vec Model Training')
    train_loss = []
    test_loss = []
    train_acc = []
    test_acc = []
    i_data = []
    for i in range(10000):
        rand_index = np.random.choice(text_data_train.shape[0], size=logistic_batch_size)
        rand_x = text_data_train[rand_index]
        # Append review index at the end of text data
        rand_x_doc_indices = train_indices[rand_index]
        rand_x = np.hstack((rand_x, np.transpose([rand_x_doc_indices])))
        rand_y = np.transpose([target_train[rand_index]])
        
        feed_dict = {log_x_inputs : rand_x, log_y_target : rand_y}
        sess.run(logistic_train_step, feed_dict=feed_dict)
        
        # Only record loss and accuracy every 100 generations
        if (i+1)%100==0:
            rand_index_test = np.random.choice(text_data_test.shape[0], size=logistic_batch_size)
            rand_x_test = text_data_test[rand_index_test]
            # Append review index at the end of text data
            rand_x_doc_indices_test = test_indices[rand_index_test]
            rand_x_test = np.hstack((rand_x_test, np.transpose([rand_x_doc_indices_test])))
            rand_y_test = np.transpose([target_test[rand_index_test]])
            
            test_feed_dict = {log_x_inputs: rand_x_test, log_y_target: rand_y_test}
            
            i_data.append(i+1)
    
            train_loss_temp = sess.run(logistic_loss, feed_dict=feed_dict)
            train_loss.append(train_loss_temp)
            
            test_loss_temp = sess.run(logistic_loss, feed_dict=test_feed_dict)
            test_loss.append(test_loss_temp)
            
            train_acc_temp = sess.run(accuracy, feed_dict=feed_dict)
            train_acc.append(train_acc_temp)
        
            test_acc_temp = sess.run(accuracy, feed_dict=test_feed_dict)
            test_acc.append(test_acc_temp)
        if (i+1)%500==0:
            acc_and_loss = [i+1, train_loss_temp, test_loss_temp, train_acc_temp, test_acc_temp]
            acc_and_loss = [np.round(x,2) for x in acc_and_loss]
            print('Generation # {}. Train Loss (Test Loss): {:.2f} ({:.2f}). Train Acc (Test Acc): {:.2f} ({:.2f})'.format(*acc_and_loss))

    Results:

    For the visualization code, see: TensorFlow for Text Processing: Word2Vec prediction
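
    For reference, a minimal sketch of that visualization using the lists recorded during training above (the referenced post contains the original version):

    # matplotlib.pyplot was already imported as plt at the top of the script
    # Plot loss over time
    plt.plot(i_data, train_loss, 'k-', label='Train Loss')
    plt.plot(i_data, test_loss, 'r--', label='Test Loss', linewidth=4)
    plt.title('Cross Entropy Loss per Generation')
    plt.xlabel('Generation')
    plt.ylabel('Cross Entropy Loss')
    plt.legend(loc='upper right')
    plt.show()

    # Plot train and test accuracy
    plt.plot(i_data, train_acc, 'k-', label='Train Set Accuracy')
    plt.plot(i_data, test_acc, 'r--', label='Test Set Accuracy', linewidth=4)
    plt.title('Train and Test Accuracy')
    plt.xlabel('Generation')
    plt.ylabel('Accuracy')
    plt.legend(loc='lower right')
    plt.show()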

    Original article: https://www.cnblogs.com/helloworld0604/p/9009885.html