Comprehensive Exercise: Word Frequency Counting


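Word frequency preprocessing

Download the lyrics of an English song, or an English article.

Replace all separators such as , . ? ! ' " : with spaces
# news holds the downloaded text as a str
sep=''',.?!'":'''
for a in sep:
    news = news.replace(a,' ')

print(news)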
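
The loop above calls `replace` once per separator. As an alternative sketch (not the approach used in this exercise), `str.maketrans` and `str.translate` from the standard library can map every separator to a space in a single pass. The helper name `strip_separators` and the sample sentence are assumptions made up for illustration.

```python
# Sketch: replace every separator with a space in a single pass.
# `strip_separators` and the sample text are illustrative only.
SEP = ''',.?!'":'''

def strip_separators(text, sep=SEP):
    # Build a table that maps each separator character to a space.
    table = str.maketrans(sep, ' ' * len(sep))
    return text.translate(table)

sample = "Hello, world! It's fine: yes?"
print(strip_separators(sample))
```

Mapping to a space rather than deleting keeps words such as `end.Start` from being fused together before `split()` is applied.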

Convert all uppercase letters to lowercase

sep=''',.?!'":'''
news = news.lower()  # lowercase once, before replacing separators
for a in sep:
    news = news.replace(a,' ')

print(news)

Generate the word list

sep=''',.?!'":'''
for a in sep:
    news = news.replace(a,' ')
wordList=news.lower().split()  # split on any run of whitespace
for w in wordList:
    print(w)

Generate the word frequency counts
```
sep=''',.?!'":'''
for a in sep:
    news = news.replace(a,' ')
wordList=news.lower().split()

# count each distinct word once
wordDict = {}
wordSet = set(wordList)
for w in wordSet:
    wordDict[w] = wordList.count(w)

for w in wordDict:
    print(w, wordDict[w])
```
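
Calling `wordList.count(w)` for every distinct word rescans the whole list each time, which gets slow on long texts. One possible alternative, sketched here on a made-up sentence, is `collections.Counter` from the standard library, which tallies every word in a single pass.

```python
from collections import Counter

# Sketch: count words in one pass with collections.Counter.
# The sample sentence is made up for illustration.
sample = "the cat and the dog and the bird"
word_list = sample.lower().split()

word_counts = Counter(word_list)  # maps each word to its number of occurrences

for word, n in word_counts.items():
    print(word, n)
```

A `Counter` is a `dict` subclass, so it can be passed to the sorting step in the same way as `wordDict`.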
Sort
```
sep=''',.?!'":'''
for a in sep:
    news = news.replace(a,' ')
wordList=news.lower().split()

wordDict = {}
wordSet = set(wordList)
for w in wordSet:
    wordDict[w] = wordList.count(w)

# sort (word, count) pairs by count, highest first
dictList = list(wordDict.items())
dictList.sort(key=lambda x:x[1],reverse=True)
print(dictList)
```
Exclude function words: pronouns, articles, conjunctions
```
exclude = {'the','and','of','to'}
sep=''',.?!'":'''
for a in sep:
    news = news.replace(a,' ')
wordList=news.lower().split()

wordDict = {}
wordSet = set(wordList)-exclude  # drop the excluded words before counting
for w in wordSet:
    wordDict[w] = wordList.count(w)

dictList = list(wordDict.items())
dictList.sort(key=lambda x:x[1],reverse=True)
print(dictList)
```
Output the TOP 20 most frequent words
```
sep=''',.?!'":'''
for a in sep:
    news = news.replace(a,' ')
wordList=news.lower().split()

wordDict = {}
wordSet = set(wordList)
for w in wordSet:
    wordDict[w] = wordList.count(w)

dictList = list(wordDict.items())
dictList.sort(key=lambda x:x[1],reverse=True)

# slicing is safe even when there are fewer than 20 distinct words
for item in dictList[:20]:
    print(item)
```
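
When only the top few entries are needed, sorting the full list does more work than necessary. As one possible alternative, `heapq.nlargest` from the standard library selects the n highest counts directly and also copes with texts that have fewer than 20 distinct words. The helper name `top_n` and the sample counts are hypothetical.

```python
import heapq

# Sketch: pick the n most frequent entries without sorting the whole list.
# `top_n` and the sample counts are hypothetical, for illustration only.
def top_n(word_dict, n=20):
    # nlargest returns at most n items, so short inputs do not raise IndexError.
    return heapq.nlargest(n, word_dict.items(), key=lambda item: item[1])

sample_counts = {'love': 5, 'you': 3, 'me': 1}
for word, count in top_n(sample_counts, 2):
    print(word, count)
```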
Save the text to be analyzed as a UTF-8 encoded file, and load the content for word frequency analysis by reading that file.
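
A minimal sketch of this step: write some text to a UTF-8 file, then read it back into `news` for analysis. The file name `news.txt`, the sample text, and the use of a temporary folder are assumptions for illustration; in practice the file would already exist next to the script.

```python
import os
import tempfile

# Sketch: save text as UTF-8, then read it back for analysis.
# The file name and sample text are assumptions for illustration.
sample = "Hello, world! Hello again."

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'news.txt')

    with open(path, 'w', encoding='utf-8') as f:
        f.write(sample)

    # Passing encoding explicitly avoids depending on the platform default.
    with open(path, 'r', encoding='utf-8') as f:
        news = f.read()

print(news)
```

The `with` statement closes the file automatically, even if reading raises an exception.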

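2. Chinese word frequency counting

Download a long Chinese article.

Read the text to analyze from a file.

with open('hongluomeng.txt','r', encoding='utf-8') as f:
    text = f.read()

Install jieba and use it for Chinese word segmentation.

import jieba  # install with: pip install jieba

# text holds the content read from the file
g = '''，。‘’“”：；（）！？、'''  # Chinese punctuation to remove
for i in g:
    text = text.replace(i, '')
b = jieba.lcut(text)  # lcut returns a list of tokens
print(b)

Generate the word frequency counts

Sort

Exclude function words: pronouns, articles, conjunctions

Output the TOP 20 most frequent words (or save the result to a file)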

import jieba

# read the text to analyze
with open('hongluomeng.txt','r', encoding='utf-8') as f:
    text = f.read()

g = '''，。‘’“”：；（）！？、'''  # punctuation to remove
a = {
    '的', ' ',
    '曰', '之', '不', '人', '一', '大', '马', '来', '有', '于', '下', '此',
}  # words to exclude
for i in g:
    text = text.replace(i, '')
b = jieba.lcut(text)  # lcut already returns a list
print(b)

# count tokens in one pass, skipping excluded words and whitespace
# (text.count(word) would also match the word inside longer words)
count = {}
for w in b:
    if w not in a and not w.isspace():
        count[w] = count.get(w, 0) + 1

r = list(count.items())
r.sort(key=lambda x: x[1], reverse=True)
print(r)

# save the TOP 20; slicing is safe when there are fewer than 20 words
with open('hlmCount.txt', 'w', encoding='utf-8') as f:
    for word, n in r[:20]:
        f.write(word + ':' + str(n) + ' ')
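
The counting and saving steps can be tried without jieba or the novel at hand. The sketch below assumes a hand-written `tokens` list standing in for the output of `jieba.lcut(text)`; the tokens, the `stop` set, and the output file name are all made up for illustration. It writes one `word:count` pair per line, which is easier to read back than a single space-separated line.

```python
import os
import tempfile
from collections import Counter

# Sketch: count tokens and save the top entries, one per line.
# `tokens` stands in for the list returned by jieba.lcut(text); the sample
# tokens, stop set and file name are made up for illustration.
tokens = ['宝玉', '笑', '道', '的', '黛玉', '宝玉', '的', '\n', '宝玉']
stop = {'的', '道'}

# Skip excluded words and whitespace-only tokens such as newlines.
counts = Counter(t for t in tokens if t not in stop and not t.isspace())
top = counts.most_common(20)  # returns fewer items if there are fewer words

path = os.path.join(tempfile.gettempdir(), 'hlmCount_demo.txt')
with open(path, 'w', encoding='utf-8') as f:
    for word, n in top:
        f.write(f'{word}:{n}\n')
```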
Original article: https://www.cnblogs.com/Brilliance-pan/p/8666659.html