Python: word-frequency statistics for a Chinese text file, plus a Chinese word cloud

1. Word-frequency statistics:

import jieba

# Read the whole novel as UTF-8 text
with open("threekingdoms3.txt", "r", encoding="utf-8") as f:
    txt = f.read()

words = jieba.lcut(txt)  # segment the text into a list of words
counts = {}
for word in words:
    if len(word) == 1:  # skip single-character tokens
        continue
    counts[word] = counts.get(word, 0) + 1

# Sort by frequency, highest first, and print the top 15
items = sorted(counts.items(), key=lambda x: x[1], reverse=True)
for word, count in items[:15]:
    print("{0:<10}{1:>5}".format(word, count))

The result is:

曹操 946
孔明 737
将军 622
玄德 585
却说 534
关公 509
荆州 413
二人 410
丞相 405
玄德曰 390
不可 387
孔明曰 374
张飞 358
如此 320
不能 318
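
The counting loop above can also be written with collections.Counter from the standard library. The following is a minimal sketch, not the code used in this post: the helper name top_words and the sample token list are made up for illustration, and in practice the tokens would come from jieba.lcut(txt).

```python
from collections import Counter


def top_words(tokens, n=15, min_len=2):
    """Count tokens of at least min_len characters and return the n most common."""
    counts = Counter(t for t in tokens if len(t) >= min_len)
    return counts.most_common(n)


# Hypothetical pre-segmented sample standing in for jieba.lcut(txt)
sample = ["曹操", "曰", "孔明", "曹操", "将军", "孔明", "曹操"]
for word, count in top_words(sample, n=3):
    print("{0:<10}{1:>5}".format(word, count))
```

Counter.most_common(n) does the sorting and the top-n slice in one call, so the manual list(...) and sort(...) steps are no longer needed.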

As a further refinement, I only want to count how often each character appears. The code is as follows:

import jieba

with open("threekingdoms3.txt", "r", encoding="utf-8") as f:
    txt = f.read()

# Only these characters are reported
names = {'曹操','孔明','刘备','关羽','张飞','吕布','赵云','孙权','周瑜','袁绍','黄忠','魏延'}
words = jieba.lcut(txt)
counts = {}
for word in words:
    # Skip single characters, then merge aliases and "X said" tokens
    # into one canonical name per character
    if len(word) == 1:
        continue
    elif word in ("诸葛亮", "孔明曰"):
        rword = "孔明"
    elif word in ("关公", "云长"):
        rword = "关羽"
    elif word in ("玄德", "玄德曰"):
        rword = "刘备"
    elif word in ("孟德", "丞相"):
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1

items = sorted(counts.items(), key=lambda x: x[1], reverse=True)
# Look only at the 40 most frequent words and print those that are character names
for word, count in items[:40]:
    if word in names:
        print("{0:<10}{1:>5}".format(word, count))

The output is:

曹操 1358
孔明 1265
刘备 1251
关羽 783
张飞 358
吕布 300
赵云 278
孙权 257
周瑜 217
袁绍 191
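
The elif chain grows with every new alias. One alternative is to keep the aliases in a dictionary and look each word up. The sketch below assumes the same alias pairs as the listing above; the names ALIASES and count_names and the sample token list are made up for illustration. Unlike the listing above, it reports every name that occurs at all rather than only those among the 40 most frequent words.

```python
from collections import Counter

# Alias -> canonical name, copied from the elif chain in the listing above
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}


def count_names(tokens, names, aliases=ALIASES):
    """Count multi-character tokens after alias merging; keep only known names."""
    counts = Counter(aliases.get(t, t) for t in tokens if len(t) > 1)
    return [(w, c) for w, c in counts.most_common() if w in names]


# Hypothetical pre-segmented sample standing in for jieba.lcut(txt)
tokens = ["玄德", "曰", "孔明曰", "诸葛亮", "丞相", "刘备", "却说", "关公"]
names = {"刘备", "孔明", "曹操", "关羽"}
for word, count in count_names(tokens, names):
    print("{0:<10}{1:>5}".format(word, count))
```

Adding a new alias then means adding one dictionary entry instead of another branch.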

Going one step further, generate a word cloud image:

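import os

import jieba
import wordcloud


def getText(file):
    # Read the file and return its words joined by spaces, so that
    # WordCloud tokenizes on the word boundaries found by jieba
    with open(file, 'r', encoding='UTF-8') as f:
        txt = f.read()
    return ' '.join(jieba.lcut(txt))


filename = input()  # file name without the .txt extension
txt = getText(filename + '.txt')
# For Chinese text, also pass font_path=<path to a font with CJK glyphs>
wordclouds = wordcloud.WordCloud(width=1000, height=800, margin=2).generate(txt)
wordclouds.to_file('{}.png'.format(filename))

# Open the image with the default viewer (this only works on Windows)
os.system('{}.png'.format(filename))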

The names can be cleaned up further by merging aliases; see the second code listing above.

By default, the wordcloud library renders Chinese text as garbled characters. For a fix, see https://blog.csdn.net/Dick633/article/details/80261233
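
The usual cause is that the default font has no Chinese glyphs, and the usual remedy is to pass a font that does through the font_path parameter of WordCloud. Below is a minimal sketch under that assumption. The helper names find_cjk_font and make_cloud are made up for illustration, and the font paths are only guesses at common install locations, so replace them with a font that exists on your machine.

```python
import os

# Hypothetical font locations (assumptions); adjust for your own system
CANDIDATE_FONTS = [
    r"C:\Windows\Fonts\msyh.ttc",
    "/System/Library/Fonts/PingFang.ttc",
    "/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc",
]


def find_cjk_font(candidates=CANDIDATE_FONTS):
    """Return the first candidate path that exists as a file, or None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def make_cloud(segmented_text, out_png, font_path):
    """Render space-separated text to a PNG using the given font."""
    from wordcloud import WordCloud  # imported here so the font lookup works without it

    cloud = WordCloud(font_path=font_path, width=1000, height=800, margin=2)
    cloud.generate(segmented_text)
    cloud.to_file(out_png)
```

A possible call would be make_cloud(txt, 'threekingdoms3.png', find_cjk_font()), where txt is the space-joined jieba output. If find_cjk_font() returns None, pass the path to a Chinese font explicitly.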

Reference: https://blog.csdn.net/weixin_44521703/article/details/93058003

Original post: https://www.cnblogs.com/116970u/p/11611821.html