• Python: word-frequency statistics for a Chinese text file + a Chinese word cloud

    1. Word-frequency count:

        import jieba

        # Read the whole novel as UTF-8 text
        with open("threekingdoms3.txt", "r", encoding="utf-8") as f:
            txt = f.read()
        words = jieba.lcut(txt)  # segment the text into a list of words
        counts = {}
        for word in words:
            if len(word) == 1:  # skip single-character tokens (particles, punctuation)
                continue
            counts[word] = counts.get(word, 0) + 1
        items = list(counts.items())
        items.sort(key=lambda x: x[1], reverse=True)  # most frequent first
        for word, count in items[:15]:  # print the top 15
            print("{0:<10}{1:>5}".format(word, count))


    曹操 946
    孔明 737
    将军 622
    玄德 585
    却说 534
    关公 509
    荆州 413
    二人 410
    丞相 405
    玄德曰 390
    不可 387
    孔明曰 374
    张飞 358
    如此 320
    不能 318
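
    The counting step above can also be written with collections.Counter from the standard library. The following is only a minimal sketch: top_words, its parameters, and the small sample_tokens list are names made up for illustration. In the real script the tokens would come from jieba.lcut(txt).

```python
from collections import Counter


def top_words(tokens, n=15, min_len=2):
    """Return the n most common tokens that have at least min_len characters."""
    counts = Counter(t for t in tokens if len(t) >= min_len)
    return counts.most_common(n)


# Hypothetical pre-segmented sample standing in for jieba.lcut(txt)
sample_tokens = ["曹操", "曰", "孔明", "曹操", "，", "将军", "孔明", "曹操"]

for word, count in top_words(sample_tokens, n=3):
    print("{0:<10}{1:>5}".format(word, count))
```

    Like the sort in the original loop, most_common() orders by count in descending order, and slicing or asking for the top n does not fail when the text has fewer than n distinct words.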

    A further refinement: I only want appearance counts for the characters. The code is as follows:

        import jieba

        with open("threekingdoms3.txt", "r", encoding="utf-8") as f:
            txt = f.read()
        # Only these characters are reported
        names = {'曹操','孔明','刘备','关羽','张飞','吕布','赵云','孙权','周瑜','袁绍','黄忠','魏延'}
        words = jieba.lcut(txt)
        counts = {}
        for word in words:
            # Fold aliases and "X曰" ("X said") tokens into one canonical name.
            # Note: 丞相 ('chancellor') is attributed to 曹操 here, although the
            # novel also uses that title for other characters.
            if len(word) == 1:
                continue
            elif word == "诸葛亮" or word == "孔明曰":
                rword = "孔明"
            elif word == "关公" or word == "云长":
                rword = "关羽"
            elif word == "玄德" or word == "玄德曰":
                rword = "刘备"
            elif word == "孟德" or word == "丞相":
                rword = "曹操"
            else:
                rword = word
            counts[rword] = counts.get(rword, 0) + 1
        items = list(counts.items())
        items.sort(key=lambda x: x[1], reverse=True)
        # Of the 40 most frequent words, print only those that are character names
        for word, count in items[:40]:
            if word in names:
                print("{0:<10}{1:>5}".format(word, count))


    曹操 1358
    孔明 1265
    刘备 1251
    关羽 783
    张飞 358
    吕布 300
    赵云 278
    孙权 257
    周瑜 217
    袁绍 191
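
    The chain of elif branches can also be expressed as a lookup table, which makes adding another alias a one-line change. This is a sketch only: ALIASES, NAMES, count_characters, and the sample tokens are illustrative names, with the alias pairs copied from the code above. Unlike the original, it counts only the listed names directly instead of filtering the 40 most frequent words, so less frequent characters such as 黄忠 or 魏延 would also show up if they occur.

```python
from collections import Counter

# Alias -> canonical name, copied from the elif chain in the article
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}
NAMES = {"曹操", "孔明", "刘备", "关羽", "张飞", "吕布",
         "赵云", "孙权", "周瑜", "袁绍", "黄忠", "魏延"}


def count_characters(tokens, names=NAMES, aliases=ALIASES):
    """Count appearances per character after folding aliases into canonical names."""
    counts = Counter()
    for token in tokens:
        name = aliases.get(token, token)  # unknown tokens map to themselves
        if name in names:                 # anything that is not a listed name is ignored
            counts[name] += 1
    return counts.most_common()


# Hypothetical pre-segmented sample standing in for jieba.lcut(txt)
sample_tokens = ["玄德曰", "孔明", "丞相", "关公", "玄德", "诸葛亮", "曹操", "云长", "刘备", "曰"]

for name, count in count_characters(sample_tokens):
    print("{0:<10}{1:>5}".format(name, count))
```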


    2. Word cloud:

        import jieba
        import os
        import wordcloud

        def getText(file):
            with open(file, 'r', encoding='UTF-8') as f:
                txt = f.read()
            # wordcloud does not segment Chinese itself, so return the
            # jieba tokens joined by spaces instead of the raw text
            return ' '.join(jieba.lcut(txt))

        filename = input()  # file name without the .txt extension
        txt = getText(filename + '.txt')
        wordclouds = wordcloud.WordCloud(width=1000, height=800, margin=2).generate(txt)
        wordclouds.to_file('{}.png'.format(filename))
        # Open the image in the default viewer (this form works on Windows)
        os.system('{}.png'.format(filename))


    By default the wordcloud library renders Chinese text as garbled characters; for a fix, see https://blog.csdn.net/Dick633/article/details/80261233
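
    One common way to avoid the garbled output is to give WordCloud a font that contains Chinese glyphs through its font_path parameter. The sketch below shows that idea and is not necessarily the method described at the link. The font paths in CANDIDATE_FONTS are assumptions about typical installations, and pick_font and make_chinese_cloud are hypothetical helper names.

```python
import os

# Assumed locations of fonts with Chinese glyphs; adjust for your own system.
CANDIDATE_FONTS = [
    r"C:\Windows\Fonts\msyh.ttc",                              # Windows (Microsoft YaHei)
    "/System/Library/Fonts/PingFang.ttc",                      # macOS
    "/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc",  # Linux with Noto CJK installed
]


def pick_font(candidates=CANDIDATE_FONTS):
    """Return the first candidate path that exists as a file, or None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def make_chinese_cloud(text, out_png, font_path):
    """Segment text with jieba and render a word cloud using the given font."""
    if font_path is None:
        raise ValueError("a font containing Chinese glyphs is required")
    # Imported here so the font lookup above works without these packages
    import jieba
    import wordcloud

    tokens = " ".join(jieba.lcut(text))
    cloud = wordcloud.WordCloud(font_path=font_path, width=1000,
                                height=800, margin=2)
    cloud.generate(tokens)
    cloud.to_file(out_png)
```

    A call might look like make_chinese_cloud(txt, 'threekingdoms3.png', pick_font()), where txt is the raw file contents. If none of the assumed paths exists, pick_font() returns None and the function raises an error instead of silently producing empty boxes.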


  • Original article: https://www.cnblogs.com/116970u/p/11611821.html
