Python: Extracting Topics from Hotel Reviews with an LDA Topic Model

1. Introduction to the LDA topic model

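The core idea of a topic model is that every word in a document is generated in two steps:

First, the document picks a topic with some probability.
Then a word is picked from that topic with some probability.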

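$p(\text{word} \mid \text{document}) = \sum_{\text{topic}} p(\text{word} \mid \text{topic}) \, p(\text{topic} \mid \text{document})$
The two factors on the right-hand side are:
$p(\text{topic} \mid \text{document})$: the topic distribution of a document.

For example, take a document d whose topic distribution is shown as a red bar chart. This document is most likely a sports or news article.
[Figure: red bar chart of the topic distribution of document d]
$p(\text{word} \mid \text{topic})$: the word distribution of a topic.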
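The mixture formula above can be checked by hand on a toy case. The following is a minimal sketch, not part of the original post: the two topics, the four words, and all probabilities are hypothetical values chosen only so that the arithmetic is easy to follow.

```python
# Hypothetical distributions, chosen only to illustrate the formula.
# p(topic | document): the weights sum to 1.
p_topic_given_doc = {"sports": 0.8, "cars": 0.2}

# p(word | topic): each inner distribution sums to 1.
p_word_given_topic = {
    "sports": {"coach": 0.5, "match": 0.4, "SUV": 0.05, "sedan": 0.05},
    "cars": {"coach": 0.05, "match": 0.05, "SUV": 0.5, "sedan": 0.4},
}


def p_word_given_doc(word):
    """Sum p(word | topic) * p(topic | document) over all topics."""
    return sum(
        p_word_given_topic[topic][word] * p_topic
        for topic, p_topic in p_topic_given_doc.items()
    )


vocabulary = ["coach", "match", "SUV", "sedan"]
result = {word: p_word_given_doc(word) for word in vocabulary}
for word, prob in result.items():
    print(word, round(prob, 2))
```

Because the document leans heavily toward the hypothetical "sports" topic, sports words end up with most of the probability, and the four values still sum to 1.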

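In essence, a topic model is a story about how a document comes into being:
1. First choose the topics of the document.
2. Then choose words that fit those topics and put them together.
Notice how crude this is: the model writes without any regard for how words connect or whether the grammar holds. A topic model is still a bag-of-words model and ignores higher-level features such as word order, syntax, and semantics. That does not stop it from producing surprisingly useful results.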
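The two-step story can be sketched as a tiny sampler. This is an illustration under stated assumptions, not the model gensim trains: the topics, words, and weights are hypothetical, and the document's topic weights are fixed by hand, whereas full LDA draws them from a Dirichlet prior.

```python
import random

# Hypothetical topic weights for one document and word weights per topic.
topic_weights = {"sports": 0.8, "cars": 0.2}
topic_words = {
    "sports": {"coach": 0.5, "match": 0.4, "SUV": 0.05, "sedan": 0.05},
    "cars": {"coach": 0.05, "match": 0.05, "SUV": 0.5, "sedan": 0.4},
}


def generate_document(n_words, rng):
    """Generate a bag of words: pick a topic, then pick a word from it."""
    words = []
    for _ in range(n_words):
        # Step 1: choose a topic according to the document's topic weights.
        topic = rng.choices(
            list(topic_weights), weights=list(topic_weights.values())
        )[0]
        # Step 2: choose a word according to that topic's word weights.
        vocab = topic_words[topic]
        word = rng.choices(list(vocab), weights=list(vocab.values()))[0]
        words.append(word)
    return words


rng = random.Random(0)  # fixed seed so repeated runs give the same draw
doc = generate_document(8, rng)
print(doc)
```

The output is an unordered pile of words, which is exactly the bag-of-words limitation described above: nothing in the sampler knows about word order or grammar.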

2. Hands-on LDA with the gensim package

(1) A minimal introductory example

from gensim import corpora, models
import jieba
import jieba.posseg as jp
# Text collection
texts = [
    '美国教练坦言，没输给中国女排，是输给了郎平',
    '美国无缘四强，听听主教练的评价',
    '中国女排晋级世锦赛四强，全面解析主教练郎平的执教艺术',
    '为什么越来越多的人买MPV，而放弃SUV？跑一趟长途就知道了',
    '跑了长途才知道，SUV和轿车之间的差距',
    '家用的轿车买什么好']
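# Filters applied during word segmentation
jieba.add_word('四强', 9, 'n')  # register '四强' (semifinals) as a noun with frequency 9
flags = ('n', 'nr', 'ns', 'nt', 'eng', 'v', 'd')  # part-of-speech tags to keep
stopwords = ('没', '就', '知道', '是', '才', '听听', '坦言', '全面', '越来越', '评价', '放弃', '人')  # stop words
# Word segmentation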
words_ls = []
for text in texts:
    words = [w.word for w in jp.cut(text) if w.flag in flags and w.word not in stopwords]
    words_ls.append(words)

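print(words_ls)
# Build the dictionary that maps each distinct word to an integer id
dictionary = corpora.Dictionary(words_ls)
# Use the dictionary to turn each word list into a sparse bag-of-words vector; the list of vectors is the corpus
corpus = [dictionary.doc2bow(words) for words in words_ls]
print(corpus)
# LDA model; num_topics sets the number of topics
lda = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=2)
# Print all topics, showing 5 words per topic
for topic in lda.print_topics(num_words=5):
    print(topic)
# Topic inference: returns a (gamma, sstats) tuple; gamma holds the per-document topic weights and sstats is None unless requested
print(lda.inference(corpus))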
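The `doc2bow` step in the script above turns each word list into sparse `(id, count)` pairs. The sketch below is a pure-Python imitation of that mapping, written only to make the printed corpus readable. It is not gensim's implementation. The function name `build_bow_corpus` is hypothetical, and the rule that unseen words receive ids in sorted order is an assumption that happens to reproduce the ids printed for this example.

```python
from collections import Counter


def build_bow_corpus(docs):
    """Map words to integer ids and return (token2id, corpus)."""
    token2id = {}
    corpus = []
    for tokens in docs:
        counts = Counter(tokens)
        # Assumption: unseen words get the next free ids in sorted order.
        for token in sorted(t for t in counts if t not in token2id):
            token2id[token] = len(token2id)
        # One sparse vector per document: (word id, count), sorted by id.
        corpus.append(sorted((token2id[t], n) for t, n in counts.items()))
    return token2id, corpus


# The segmented word lists, copied from the console output of the script.
words_ls = [
    ['美国', '输给', '中国女排', '输给', '郎平'],
    ['美国', '无缘', '四强', '主教练'],
    ['中国女排', '晋级', '世锦赛', '四强', '主教练', '郎平', '执教', '艺术'],
    ['买', 'MPV', 'SUV', '跑', '长途'],
    ['跑', '长途', 'SUV', '轿车', '差距'],
    ['家用', '轿车', '买'],
]

token2id, corpus = build_bow_corpus(words_ls)
print(corpus[0])  # [(0, 1), (1, 1), (2, 2), (3, 1)]
print(token2id['输给'])  # 2
```

In the first vector, the pair `(2, 2)` records that '输给' (id 2) appears twice in the first headline. Every other word appears once.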

Console output:

D:softwaretoolsanacondapython.exe D:/pycharmprojects/hoteltest01/hoteltest01/review_LDA.py
D:softwaretoolsanacondalibsite-packagesgensimutils.py:1197: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
Building prefix dict from the default dictionary ...
Loading model from cache C:UsersluckyAppDataLocalTempjieba.cache
Loading model cost 0.621 seconds.
Prefix dict has been built succesfully.
[['美国', '输给', '中国女排', '输给', '郎平'], ['美国', '无缘', '四强', '主教练'], ['中国女排', '晋级', '世锦赛', '四强', '主教练', '郎平', '执教', '艺术'], ['买', 'MPV', 'SUV', '跑', '长途'], ['跑', '长途', 'SUV', '轿车', '差距'], ['家用', '轿车', '买']]
[[(0, 1), (1, 1), (2, 2), (3, 1)], [(1, 1), (4, 1), (5, 1), (6, 1)], [(0, 1), (3, 1), (4, 1), (5, 1), (7, 1), (8, 1), (9, 1), (10, 1)], [(11, 1), (12, 1), (13, 1), (14, 1), (15, 1)], [(12, 1), (14, 1), (15, 1), (16, 1), (17, 1)], [(13, 1), (17, 1), (18, 1)]]
(0, '0.082*"SUV" + 0.081*"跑" + 0.080*"长途" + 0.079*"轿车" + 0.057*"差距"')
(1, '0.090*"美国" + 0.088*"输给" + 0.078*"四强" + 0.070*"主教练" + 0.068*"郎平"')
(array([[0.59499687, 5.405003  ],
       [0.57637316, 4.423627  ],
       [5.029799  , 3.9702005 ],
       [5.3376627 , 0.6623374 ],
       [5.4332952 , 0.56670463],
       [2.2165096 , 1.7834904 ]], dtype=float32), None)

Process finished with exit code 0
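The array printed by `lda.inference(corpus)` has one row per headline and one column per topic, and the rows are not normalized. A minimal sketch, assuming the usual reading of these values as unnormalized topic weights: divide each row by its sum to get topic proportions, then take the largest one. The numbers are copied from the output above. Another run may print different values because the script does not fix a random seed.

```python
# Per-document topic weights copied from the printed inference result.
gamma = [
    [0.59499687, 5.405003],
    [0.57637316, 4.423627],
    [5.029799, 3.9702005],
    [5.3376627, 0.6623374],
    [5.4332952, 0.56670463],
    [2.2165096, 1.7834904],
]


def topic_proportions(row):
    """Normalize one row of weights so that it sums to 1."""
    total = sum(row)
    return [weight / total for weight in row]


dominant = []
for doc_id, row in enumerate(gamma):
    props = topic_proportions(row)
    # Index of the topic with the largest proportion.
    best = max(range(len(props)), key=props.__getitem__)
    dominant.append(best)
    print(doc_id, [round(p, 3) for p in props], "-> topic", best)
```

In this run the first two volleyball headlines fall in topic 1 and the three car headlines in topic 0. The third volleyball headline lands in topic 0 by a narrow margin, a reminder that six short headlines are very little data for the model.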

(2) Hotel reviews

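from gensim import corpora, models
words_ls = []
review_split_txt_path = 'split_result_txt/split_txt_9.txt'
# Open the txt file that holds the word-segmentation results
with open(review_split_txt_path, 'r', encoding='utf-8') as f:
    for line in f:
        # Each line of the file is one review
        # Example line: 酒店/干净/床品/采光/周边/早上/小跑/西湖/晨练/跑步/风景/景区/没/吃/早餐
        # Strip the trailing newline before splitting so the last word is not stored as e.g. '早餐\n',
        # and drop tokens that are empty or whitespace-only
        review_words = [w for w in line.strip().split("/") if w.strip()]
        if review_words:
            words_ls.append(review_words)  # each review becomes one list element of words_ls

print(words_ls)
# Build the dictionary that maps each distinct word to an integer id
dictionary = corpora.Dictionary(words_ls)
# Use the dictionary to turn each word list into a sparse bag-of-words vector; the list of vectors is the corpus
corpus = [dictionary.doc2bow(words) for words in words_ls]
# print(corpus)
# LDA model; num_topics sets the number of topics
lda = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20)
# Print all topics, showing 10 words per topic
for topic in lda.print_topics(num_words=10):
    print(topic)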

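Console output. Some tokens carry a trailing line break or are a lone space, because the run that produced this output split each raw line on "/" without stripping or filtering whitespace first: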

(0, '0.018*"风" + 0.015*"设计" + 0.011*"酒店" + 0.011*"独特" + 0.009*"房间" + 0.009*"空调" + 0.009*"感觉" + 0.008*"年代" + 0.008*"民国" + 0.008*"送"')
(1, '0.030*"酒店\n" + 0.019*"停车场" + 0.018*"酒店" + 0.016*"早餐" + 0.016*"房间" + 0.016*"免费" + 0.015*"停车" + 0.014*"很好" + 0.013*"一家" + 0.013*"馀"')
(2, '0.024*"近" + 0.019*"房间" + 0.018*"酒店" + 0.017*"东站" + 0.015*"西湖" + 0.011*"杭州" + 0.011*"入住" + 0.011*"风格" + 0.010*"装修" + 0.010*"湖边"')
(3, '0.042*"房间" + 0.037*"闹中取静" + 0.032*"环境" + 0.030*" " + 0.027*"酒店" + 0.019*"不错" + 0.015*"很大" + 0.014*"早餐" + 0.012*"特色" + 0.011*"味道\n"')
(4, '0.013*"客厅" + 0.010*"房间" + 0.009*"说" + 0.009*"服务" + 0.009*"吃" + 0.009*"浴室" + 0.008*"历史" + 0.008*"酒店" + 0.008*"客人" + 0.007*"早饭"')
(5, '0.036*"酒店" + 0.033*"服务" + 0.017*"西湖" + 0.017*"设施" + 0.015*"近" + 0.013*"房间" + 0.012*"早餐" + 0.012*"位置" + 0.012*"地理位置" + 0.011*"人员"')
(6, '0.056*"\n" + 0.032*"酒店" + 0.028*"住" + 0.027*"很棒\n" + 0.022*"服务" + 0.022*" " + 0.016*"满意\n" + 0.014*"精品" + 0.013*"环境" + 0.012*"这家"')
(7, '0.043*"服务" + 0.042*"升级" + 0.038*"免费" + 0.036*"酒店" + 0.030*"前台" + 0.019*"不错" + 0.018*"帮" + 0.017*"房间" + 0.014*"朋友" + 0.014*" "')
(8, '0.048*"房间" + 0.043*"酒店" + 0.037*"不错" + 0.032*"干净" + 0.027*"性价比" + 0.024*"高" + 0.019*"位置" + 0.015*"设施" + 0.015*"安静" + 0.015*"舒适"')
(9, '0.031*"不错" + 0.025*"位置" + 0.024*"酒店" + 0.023*"楼下" + 0.018*"房间" + 0.013*"出行\n" + 0.013*"卫生" + 0.011*"品牌" + 0.011*"更好\n" + 0.011*"早餐"')
(10, '0.034*"太" + 0.029*"早餐" + 0.016*"房间" + 0.015*"服务" + 0.015*"酒店" + 0.014*"差" + 0.012*"西湖" + 0.011*"不错" + 0.009*"" + 0.009*"开"')
(11, '0.036*"服务" + 0.036*"酒店" + 0.034*"房间" + 0.022*"前台" + 0.021*"不错" + 0.019*"干净" + 0.018*"早餐" + 0.013*"卫生" + 0.011*"入住" + 0.010*"交通"')
(12, '0.028*"带" + 0.020*"孩子" + 0.020*"舒服\n" + 0.016*"感觉\n" + 0.016*"家庭" + 0.015*"适合" + 0.014*"完美\n" + 0.012*"酒店" + 0.009*"早餐" + 0.009*"民国"')
(13, '0.055*"下次" + 0.034*"服务" + 0.030*"酒店" + 0.030*"住\n" + 0.028*"不错" + 0.022*"房间" + 0.020*"入住\n" + 0.019*"前台" + 0.018*"入住" + 0.016*"选择\n"')
(14, '0.026*"管家" + 0.020*"早餐" + 0.016*"房间" + 0.016*"建筑" + 0.014*"酒店" + 0.009*"东站" + 0.009*"阳台" + 0.009*"近" + 0.009*"便利" + 0.009*"不错"')
(15, '0.067*"服务\n" + 0.035*"酒店" + 0.026*"不错" + 0.023*"装修" + 0.022*"前台" + 0.021*"挺" + 0.020*"风格" + 0.016*"房间" + 0.016*"喜欢\n" + 0.012*"住"')
(16, '0.060*"西湖" + 0.047*"酒店" + 0.038*"不错" + 0.033*"早餐" + 0.019*"小张" + 0.017*"杭州" + 0.016*"周边" + 0.015*"位置" + 0.013*"性价比" + 0.012*"步行"')
(17, '0.023*"地铁站" + 0.020*"酒店" + 0.019*"早餐" + 0.015*"房间" + 0.013*"服务" + 0.013*"前台" + 0.011*"带" + 0.010*"房子" + 0.010*"服务员" + 0.009*"不错"')
(18, '0.086*"不错\n" + 0.064*"酒店" + 0.039*"服务" + 0.032*"前台" + 0.028*"老" + 0.027*" " + 0.023*"早餐" + 0.021*"不错" + 0.015*"位置" + 0.013*"环境"')
(19, '0.020*"近" + 0.019*"酒店" + 0.015*"西式" + 0.014*"三层" + 0.014*"西湖" + 0.013*"走路" + 0.013*"高铁" + 0.012*"南宋" + 0.011*"御街" + 0.011*"房间"')
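Generic words such as 酒店, 房间, and 不错 appear in most of the 20 topics above, which makes the topics hard to tell apart by eye. One way to read them, sketched here as an assumption and not as the original author's method, is to parse a printed topic string into `(word, weight)` pairs and set aside a hand-picked list of generic words. The helper `parse_topic` and the `GENERIC` set are hypothetical. When the model object is still available, gensim's `show_topic` method returns such pairs directly, so no string parsing is needed.

```python
import re

# Topic 8, copied from the console output above.
topic_8 = ('0.048*"房间" + 0.043*"酒店" + 0.037*"不错" + 0.032*"干净" + '
           '0.027*"性价比" + 0.024*"高" + 0.019*"位置" + 0.015*"设施" + '
           '0.015*"安静" + 0.015*"舒适"')

# Each term looks like 0.048*"word".
TERM = re.compile(r'([0-9.]+)\*"([^"]*)"')


def parse_topic(topic_string):
    """Return the (word, weight) pairs contained in a printed topic string."""
    return [(word, float(weight)) for weight, word in TERM.findall(topic_string)]


# Hand-picked words that occur in most topics and say little about any one.
GENERIC = {"酒店", "房间", "不错"}

terms = parse_topic(topic_8)
distinctive = [(word, weight) for word, weight in terms if word not in GENERIC]
print(distinctive[:3])  # [('干净', 0.032), ('性价比', 0.027), ('高', 0.024)]
```

With the generic words set aside, topic 8 reads as a cleanliness and value-for-money topic (干净, 性价比, 高).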


Original post: https://www.cnblogs.com/luckyplj/p/13200006.html