中文词频统计

下载一中文长篇小说，并转换成UTF-8编码。
使用jieba库，进行中文词频统计，输出TOP20的词及出现次数。
排除一些无意义词、合并同一词。

 1 # _*_coding:utf-8_*_
 2 import jieba
 3 # 实例：词频统计
 4 # 打开文件
 5 fr = open('tridebody.txt','r',encoding= 'utf-8')
 6 str = fr.read()
 7 fr.close()
 8 # 排除元数的集合
 9 words = jieba.cut(str)
10 words = list(words)
11 print('{0:-^50}'.format('分词解析成功！'))
12 # 定义一个空字典
13 di = {}
14 # 用循环，写入字典
15 exc = {'“','”','。','，','！','？','‘','’','：','：','……','的','地','得','你','我','他','啊','哦','嗯','呼','也','尔','吾'}
16 disc = set(words)
17 disc = disc -exc
18 print('分词统计中。。。。')
19 for i in disc:
20     di[i] = words.count(i)
21 wc = list(di.items())
22 # print(wc)
23 wc.sort(key = lambda x:x[1],reverse=True)
24 # print(wc)
25 print('{0:-^50}'.format('词频统计结果前10'))
26 for i in range(len(disc)):
27     print('{0} = {1}'.format(wc[i][0],wc[i][1]))

相关阅读:
Topic for paper reading
Github
APPIUM+Python+HTMLTestRunner（转）
PyCharm 2016.3.2 汉化
APPIUM 常用API（转）
Python IDE PyCharm2016.3.2（转）
APPIUM笔记
将博客搬至CSDN
碎碎念
关于set或map的key使用自定义类型的问题

原文地址：https://www.cnblogs.com/alliancehacker/p/7609636.html