Comprehensive exercise
Word frequency statistics: preprocessing
Download the lyrics of an English song or an English article
Replace all separators such as , . ? ! ' " : with spaces
# news is assumed to hold the downloaded text as a string.
sep = ''',.?!'":'''
for a in sep:
    news = news.replace(a, ' ')
print(news)
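The same replacement can also be done in a single pass with str.maketrans and str.translate. The following is only a minimal sketch: the helper name replace_separators and the sample sentence are assumptions made for illustration and are not part of the original exercise.

```python
# Separator characters to turn into spaces (same set as in the step above).
SEP = ''',.?!'":'''

def replace_separators(text, sep=SEP):
    """Hypothetical helper: map every separator character to one space."""
    table = str.maketrans(sep, ' ' * len(sep))
    return text.translate(table)

# Made-up sample text, standing in for the downloaded lyrics.
sample = "Hello, world! It's fine."
cleaned = replace_separators(sample)
print(cleaned)
```

Because each separator becomes exactly one space, the cleaned string keeps the length of the input, and a later split() discards the extra whitespace.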
Convert all uppercase letters to lowercase
sep = ''',.?!'":'''
# Lowercase once, then replace the separators.
news = news.lower()
for a in sep:
    news = news.replace(a, ' ')
print(news)
Generate the word list
sep = ''',.?!'":'''
for a in sep:
    news = news.replace(a, ' ')
# split() with no argument splits on any run of whitespace.
wordList = news.lower().split()
for w in wordList:
    print(w)
Generate the word frequency statistics
sep = ''',.?!'":'''
for a in sep:
    news = news.replace(a, ' ')
wordList = news.lower().split()
# Count each distinct word once by iterating over the set of words.
wordDict = {}
wordSet = set(wordList)
for w in wordSet:
    wordDict[w] = wordList.count(w)
for w in wordDict:
    print(w, wordDict[w])
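Calling wordList.count(w) for every distinct word rescans the whole list each time. As an alternative to the set-and-count loop, the standard library's collections.Counter tallies all words in one pass. This is a hedged sketch, not the method used in the exercise; the sample text and the variable names word_list and word_counts are made up for illustration.

```python
from collections import Counter

# Made-up sample text, standing in for the downloaded lyrics.
news = "Let it be, let it be. Whisper words of wisdom, let it be!"
sep = ''',.?!'":'''
for ch in sep:
    news = news.replace(ch, ' ')
word_list = news.lower().split()

# Counter builds the word -> count mapping in a single pass over the list.
word_counts = Counter(word_list)

# most_common(n) returns the n most frequent (word, count) pairs.
for word, n in word_counts.most_common(3):
    print(word, n)
```

Counter is a dict subclass, so word_counts[w] and word_counts.items() work the same way as with wordDict.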
Sort
sep = ''',.?!'":'''
for a in sep:
    news = news.replace(a, ' ')
wordList = news.lower().split()
wordDict = {}
wordSet = set(wordList)
for w in wordSet:
    wordDict[w] = wordList.count(w)
# Sort the (word, count) pairs by count, highest first.
dictList = list(wordDict.items())
dictList.sort(key=lambda x: x[1], reverse=True)
print(dictList)
Exclude grammatical words: pronouns, articles, conjunctions
exclude = {'the', 'and', 'of', 'to'}
sep = ''',.?!'":'''
for a in sep:
    news = news.replace(a, ' ')
wordList = news.lower().split()
wordDict = {}
# Set difference removes the excluded words before counting.
wordSet = set(wordList) - exclude
for w in wordSet:
    wordDict[w] = wordList.count(w)
dictList = list(wordDict.items())
dictList.sort(key=lambda x: x[1], reverse=True)
print(dictList)
Output the 20 most frequent words (TOP 20)
sep = ''',.?!'":'''
for a in sep:
    news = news.replace(a, ' ')
wordList = news.lower().split()
wordDict = {}
wordSet = set(wordList)
for w in wordSet:
    wordDict[w] = wordList.count(w)
dictList = list(wordDict.items())
dictList.sort(key=lambda x: x[1], reverse=True)
# Slicing avoids an IndexError when there are fewer than 20 distinct words.
for item in dictList[:20]:
    print(item)
Save the text to be analyzed as a UTF-8 encoded file, and load the content for word frequency analysis by reading that file.
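A minimal sketch of this step follows. The file name news.txt, the helper name read_text, and the sample sentence are assumptions for illustration; the sketch first writes a small UTF-8 file into a temporary directory so that it can run on its own.

```python
import os
import tempfile

def read_text(path):
    """Hypothetical helper: return the whole content of a UTF-8 text file."""
    # The with-statement closes the file even if reading raises an error.
    with open(path, 'r', encoding='utf-8') as f:
        return f.read()

# For this sketch only: create a small UTF-8 file to read back.
path = os.path.join(tempfile.mkdtemp(), 'news.txt')  # hypothetical file name
with open(path, 'w', encoding='utf-8') as f:
    f.write("Let it be, let it be.")

news = read_text(path)
print(news)
```

In the actual exercise the file would already exist on disk, so only the read_text call is needed before the preprocessing steps.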
2. Chinese word frequency statistics
Download a long Chinese article.
Read the text to be analyzed from a file.
f = open('hongluomeng.txt','r', encoding='utf-8')
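Install jieba and use it for Chinese word segmentation.
# g is the string of punctuation marks to strip; text is the content read from the file.
for i in g:
    text = text.replace(i, '')
# jieba.cut returns a generator; jieba.lcut returns a list directly.
print(list(jieba.cut(text)))
b = jieba.lcut(text)
print(b)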
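Generate the word frequency statistics
Sort
Exclude grammatical words: pronouns, articles, conjunctions
Output the 20 most frequent words (or save the result to a file)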
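import jieba
# Read the text to analyze from a UTF-8 file.
with open('hongluomeng.txt', 'r', encoding='utf-8') as f:
    text = f.read()
# g: punctuation to strip; a: words (and the newline) to exclude from the statistics.
g = ''',。‘’“”:;()!?、'''
a = {
    '的', '\n',
    '曰', '之', '不', '人', '一', '大', '马', '来', '有', '于', '下', '此',
}
for i in g:
    text = text.replace(i, '')
# jieba.lcut returns the segmented words as a list.
b = jieba.lcut(text)
print(b)
# Count how often each remaining word occurs in the segmented list.
count = {}
q = list(set(b) - a)
print(q)
for w in q:
    count[w] = b.count(w)
# Sort by frequency, highest first.
r = list(count.items())
r.sort(key=lambda x: x[1], reverse=True)
print(r)
# Append the top 20 entries to a UTF-8 file, one "word:count" per line.
with open('hlmCount.txt', 'a', encoding='utf-8') as f:
    for word, n in r[:20]:
        f.write(word + ':' + str(n) + '\n')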
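The counting, exclusion, and saving steps can also be packaged as two small functions. This is only a sketch under stated assumptions: the function names top_words and save_counts are hypothetical, the token list is made up and stands in for the output of jieba.lcut(text), and collections.Counter replaces the manual counting loop. The output goes to a temporary directory so the sketch runs without any input file.

```python
import os
import tempfile
from collections import Counter

def top_words(tokens, exclude, n=20):
    """Hypothetical helper: return up to n (word, count) pairs, skipping excluded words."""
    counts = Counter(t for t in tokens if t not in exclude)
    return counts.most_common(n)

def save_counts(pairs, path):
    """Hypothetical helper: write one 'word:count' line per pair."""
    # Mode 'w' overwrites the file, so re-running does not pile up duplicate lines.
    with open(path, 'w', encoding='utf-8') as f:
        for word, n in pairs:
            f.write(f'{word}:{n}\n')

# Made-up tokens standing in for the result of jieba.lcut(text).
tokens = ['宝玉', '的', '黛玉', '宝玉', '的', '宝玉', '黛玉', '贾母']
result = top_words(tokens, exclude={'的'}, n=2)

out_path = os.path.join(tempfile.mkdtemp(), 'hlmCount.txt')
save_counts(result, out_path)
print(result)
```

Opening the output file with mode 'w' is a design choice of this sketch; append mode 'a' keeps the results of earlier runs in the same file.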