

    Topic Model LDA

    Principle

    LDA, short for Latent Dirichlet Allocation, is used to identify topics: it decomposes the document–word matrix into a document–topic matrix (distribution) and a topic–word matrix (distribution).

    Document Generation Process

    • Choose a document $d_{i}$ according to the prior probability $P(d_{i})$
    • Sample the topic distribution $\theta_{i}$ of document $i$ from a Dirichlet distribution with hyperparameter $\alpha$; in other words, the topic distribution $\theta_{i}$ is generated by a Dirichlet distribution parameterized by $\alpha$
    • Sample the topic $z_{i,j}$ of the $j$-th word in document $i$ from the multinomial topic distribution $\theta_{i}$
    • Sample the word distribution $\phi_{z_{i,j}}$ corresponding to topic $z_{i,j}$ from a Dirichlet distribution with parameter $\beta$; in other words, the word distribution $\phi_{z_{i,j}}$ is generated by a Dirichlet distribution parameterized by $\beta$
    • Sample the final word $w_{i,j}$ from the multinomial word distribution $\phi_{z_{i,j}}$ (a minimal simulation of this process is sketched below)
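
    The sketch below simulates this generative process for a single toy document using numpy; the vocabulary size, topic count, document length, and symmetric hyperparameters are illustrative assumptions, not values from the text.

        import numpy as np

        V, K, doc_len = 1000, 10, 50    # toy vocabulary size, topic count, document length
        alpha, beta = 0.1, 0.01         # symmetric Dirichlet hyperparameters

        rng = np.random.default_rng(0)
        phi = rng.dirichlet([beta] * V, size=K)   # topic-word distributions, phi_k ~ Dir(beta)
        theta = rng.dirichlet([alpha] * K)        # document-topic distribution, theta_i ~ Dir(alpha)

        words = []
        for _ in range(doc_len):
            z = rng.choice(K, p=theta)       # topic z_{i,j} sampled from theta_i
            w = rng.choice(V, p=phi[z])      # word w_{i,j} sampled from phi_{z_{i,j}}
            words.append(w)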

    Conjugate Prior Distributions

    The Dirichlet distribution is the conjugate prior of the multinomial distribution. If the posterior $P(\theta|x)$ and the prior $p(\theta)$ belong to the same distribution family, the prior and the posterior are called conjugate distributions, and the prior is called the conjugate prior of the likelihood function.
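
    Concretely, for the Dirichlet–multinomial pair used in LDA (a standard result, stated here for completeness): if the prior is $\mathrm{Dir}(\theta|\alpha)$ and the observed counts are $n=(n_{1},\dots,n_{K})$, the posterior stays in the same family,

    $$p(\theta|n)=\mathrm{Dir}(\theta|\alpha+n)$$

    which is exactly why Dirichlet priors are placed on the topic distribution $\theta$ and the word distribution $\phi$.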

    LDA Parameter Estimation

    LDA's parameters are estimated with Gibbs sampling. Learning LDA amounts to estimating the two unknown parameters: the topic distribution $\theta$ and the word distribution $\phi$. Since LDA is a generative model, the goal is to obtain the joint distribution $p(w,z)$, conditioned on the hyperparameters $\alpha$ and $\beta$, by integrating over the latent variables $\theta$ and $\phi$:

    $$p(z,w|\alpha,\beta)=p(w|z,\beta)\,p(z|\alpha)$$

    $$p(w|z,\beta)=\int p(w|z,\phi)\,p(\phi|\beta)\,d\phi$$

    $$p(z|\alpha)=\int p(z|\theta)\,p(\theta|\alpha)\,d\theta$$

    Once the joint distribution is obtained, the topic distribution $\theta$ and the word distribution $\phi$ can be estimated from the current documents.
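
    For reference, the collapsed Gibbs sampler (a standard derivation, not spelled out in the original text, and assuming symmetric priors) integrates out $\theta$ and $\phi$ and resamples the topic of each word token from the conditional

    $$p(z_{i}=k|z_{\neg i},w)\propto \frac{n_{k,\neg i}^{(w_{i})}+\beta}{\sum_{v=1}^{V} n_{k,\neg i}^{(v)}+V\beta}\cdot\left(n_{d,\neg i}^{(k)}+\alpha\right)$$

    where $n_{k,\neg i}^{(w_{i})}$ is the number of times word $w_{i}$ is assigned to topic $k$ and $n_{d,\neg i}^{(k)}$ is the number of times topic $k$ occurs in document $d$, both excluding the current token; once sampling converges, $\theta$ and $\phi$ are estimated from these counts.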

    Code Implementation

    Model Training

        import json
        from gensim import corpora, models
        from gensim.corpora import Dictionary

        # Load cn_software_data (loaded here but not used further in this snippet)
        with open(r'./data/data_specification/cn_software_data.json', 'r', encoding='utf8') as f:
            cn_software_data = json.load(f)

        # Load the tokenized documents used to train the LDA model
        with open(r'./data/LDA_data/LDA_text.json', 'r', encoding='utf8') as f:
            LDA_texts = json.load(f)

        # Build the dictionary and the bag-of-words corpus, and persist both
        # so the evaluation script below can reload them
        LDA_dict = Dictionary(LDA_texts)
        LDA_dict.save(r'./data/LDA_data/LDA_dict')
        LDA_corpus = [LDA_dict.doc2bow(text) for text in LDA_texts]
        corpora.MmCorpus.serialize(r'./data/LDA_data/LDA_corpus', LDA_corpus)

        # LDA training parameters
        num_topics = 500
        iterations = 1000
        workers = 3

        # Multi-process LDA training
        lda = models.ldamulticore.LdaMulticore(LDA_corpus, id2word=LDA_dict, num_topics=num_topics, iterations=iterations, workers=workers, batch=True)
        lda.save(r'./LDA_model/' + 'lda_%s_%s.model' % (num_topics, iterations))
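
    After training, the learned topics can be inspected directly. The snippet below is a small usage sketch that continues from the script above; the numbers of topics and words shown are arbitrary choices.

        # Print the top words of the first five learned topics
        for topic_id, topic in lda.print_topics(num_topics=5, num_words=10):
            print(topic_id, topic)

        # Infer the topic distribution of one training document
        bow = LDA_dict.doc2bow(LDA_texts[0])
        print(lda.get_document_topics(bow, minimum_probability=0.01))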

    Computing Perplexity

        # -*- coding: utf-8 -*-
        from gensim import corpora, models
        import logging
        import math

        logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

        def perplexity(ldamodel, testset, dictionary, size_dictionary, num_topics):
            """Calculate the perplexity of an LDA model on a held-out test set."""
            # dictionary maps ids to tokens, e.g. {7822: 'deferment', 1841: 'circuitry', 19202: 'fabianism', ...}
            print('the info of this ldamodel:\n')
            print('num of testset: %s; size_dictionary: %s; num of topics: %s' % (len(testset), size_dictionary, num_topics))
            prob_doc_sum = 0.0
            # Cache each topic's word distribution as a dict: word -> p(w|z)
            topic_word_list = []
            for topic_id in range(num_topics):
                topic_word = ldamodel.show_topic(topic_id, size_dictionary)
                dic = {}
                for word, probability in topic_word:
                    dic[word] = probability
                topic_word_list.append(dic)
            # Topic distribution p(z|d) of every test document
            doc_topics_list = []
            for doc in testset:
                doc_topics_list.append(ldamodel.get_document_topics(doc, minimum_probability=0))
            testset_word_num = 0
            for i in range(len(testset)):
                prob_doc = 0.0  # log-probability of the doc
                doc = testset[i]
                doc_word_num = 0  # number of word tokens in the doc
                for word_id, num in doc:
                    prob_word = 0.0  # probability of the word
                    doc_word_num += num
                    word = dictionary[word_id]
                    for topic_id in range(num_topics):
                        # p(w|d) = sum_z p(z|d) * p(w|z)
                        prob_topic = doc_topics_list[i][topic_id][1]
                        prob_topic_word = topic_word_list[topic_id][word]
                        prob_word += prob_topic * prob_topic_word
                    prob_doc += num * math.log(prob_word)  # log p(d) = sum over tokens of log p(w)
                prob_doc_sum += prob_doc
                testset_word_num += doc_word_num
            prep = math.exp(-prob_doc_sum / testset_word_num)  # perplexity = exp(-sum(log p(d)) / sum(N_d))
            print("the perplexity of this ldamodel is : %s" % prep)
            return prep

        if __name__ == '__main__':
            dictionary_path = r'./data/LDA_data/LDA_dict'
            corpus_path = r'./data/LDA_data/LDA_corpus'
            num_topics = 500
            ldamodel_path = './LDA_model/lda_{}_1000.model'.format(num_topics)
            dictionary = corpora.Dictionary.load(dictionary_path)
            corpus = corpora.MmCorpus(corpus_path)
            lda_multi = models.ldamodel.LdaModel.load(ldamodel_path)

            # Take roughly 1/300 of the corpus as the held-out test set
            testset = []
            for i in range(int(corpus.num_docs / 300)):
                testset.append(corpus[i * 300])
            prep = perplexity(lda_multi, testset, dictionary, len(dictionary.keys()), num_topics)
            with open('./LDA_model/lda_{}.txt'.format(num_topics), 'a', encoding='utf8') as f:
                f.write("the perplexity of K={} ldamodel is : {}\n".format(num_topics, prep))
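
    As a cross-check, gensim's LdaModel also exposes log_perplexity, which computes a per-word likelihood bound on a held-out chunk (gensim reports the corresponding perplexity as 2 raised to the negative bound). Below is a brief sketch, usable inside the __main__ block above; this variational bound is not directly comparable to the hand-computed value.

        # Per-word likelihood bound from gensim's own evaluation routine
        bound = lda_multi.log_perplexity(testset)
        print('per-word bound: %.3f, perplexity estimate: %.1f' % (bound, 2 ** (-bound)))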

    Interview Questions

    The Relationship Between pLSA and LDA

    Both pLSA and LDA look for the topic distribution and the word distribution; the difference lies in how they estimate these two unknown parameters. pLSA finds the single set of parameters (distributions) that best fits the text and treats that point estimate as the true parameters. LDA, by contrast, holds that we can never fully solve for the exact topic and word distributions; we can only treat them as random variables and make them more "definite" by shrinking their variance (variability). In other words, instead of solving for specific values of the topic and word distributions, we use the observations these distributions generate (the actual text) to infer the plausible range of their parameters, i.e., which ranges are likely and which are not. This is essentially Bayesian analysis: although the true values cannot be given exactly, we can assign, based on experience, a reasonable prior distribution over them and then derive the posterior from that prior.

    References

    https://blog.csdn.net/v_july_v/article/details/41209515

    https://www.jianshu.com/p/b7033e792718

    Further Reading

    ABAE model: https://www.jianshu.com/p/241cb238e21f
