• Python: word frequency statistics for a Chinese text file + Chinese word cloud


    1. Word frequency statistics:

    import jieba

    # Read the whole novel as UTF-8 text.
    with open("threekingdoms3.txt", "r", encoding="utf-8") as f:
        txt = f.read()

    # Segment the text into a list of words.
    words = jieba.lcut(txt)

    # Count every word, skipping single-character tokens.
    counts = {}
    for word in words:
        if len(word) == 1:
            continue
        counts[word] = counts.get(word, 0) + 1

    # Sort by count, descending, and print the 15 most frequent words.
    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)
    for word, count in items[:15]:
        print("{0:<10}{1:>5}".format(word, count))

    Result:

    曹操 946
    孔明 737
    将军 622
    玄德 585
    却说 534
    关公 509
    荆州 413
    二人 410
    丞相 405
    玄德曰 390
    不可 387
    孔明曰 374
    张飞 358
    如此 320
    不能 318
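
    The counting loop can also be written with collections.Counter from the standard library. The following is only a sketch: top_words is a hypothetical helper name, and the sample list stands in for the output of jieba.lcut(txt) so that the snippet runs without the novel or jieba installed.

```python
from collections import Counter

def top_words(words, n=15):
    """Return the n most frequent words as (word, count) pairs,
    skipping single-character tokens as in the listing above."""
    counts = Counter(w for w in words if len(w) != 1)
    return counts.most_common(n)

# Hypothetical sample standing in for jieba.lcut(txt).
sample = ["曹操", "曰", "孔明", "曹操", "，", "将军", "孔明", "曹操"]
for word, count in top_words(sample, 3):
    print("{0:<10}{1:>5}".format(word, count))
```

    most_common already returns the pairs sorted by count in descending order, so the separate sort step is not needed.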

    As a further refinement, I only want appearance counts for the characters. The code is as follows:

    import jieba

    with open("threekingdoms3.txt", "r", encoding="utf-8") as f:
        txt = f.read()

    # Characters whose counts should be reported.
    names = {'曹操','孔明','刘备','关羽','张飞','吕布','赵云','孙权','周瑜','袁绍','黄忠','魏延'}

    # Map alternative names, and tokens such as "孔明曰", to one canonical name.
    aliases = {
        "诸葛亮": "孔明", "孔明曰": "孔明",
        "关公": "关羽", "云长": "关羽",
        "玄德": "刘备", "玄德曰": "刘备",
        "孟德": "曹操", "丞相": "曹操",
    }

    words = jieba.lcut(txt)
    counts = {}
    for word in words:
        if len(word) == 1:
            continue
        rword = aliases.get(word, word)
        counts[rword] = counts.get(rword, 0) + 1

    # Disabled: excludes is not defined in this listing.
    # for word in excludes:
    #     del counts[word]
    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)
    # Only the 40 most frequent entries are checked against names.
    for word, count in items[:40]:
        if word in names:
            print("{0:<10}{1:>5}".format(word, count))

    Output:

    曹操 1358
    孔明 1265
    刘备 1251
    关羽 783
    张飞 358
    吕布 300
    赵云 278
    孙权 257
    周瑜 217
    袁绍 191
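
    The listing only checks the 40 most frequent entries, which is presumably why 黄忠 and 魏延 are in names but missing from the output. A minimal sketch of looking up every name directly instead, whatever its rank; character_counts is a hypothetical helper, and the demo counts are made up for illustration and are not taken from the novel.

```python
def character_counts(counts, names):
    """Return (name, count) pairs for every name in names, most frequent first.
    Names that never occur are reported with a count of 0."""
    pairs = [(name, counts.get(name, 0)) for name in names]
    pairs.sort(key=lambda x: x[1], reverse=True)
    return pairs

# Made-up counts for illustration only.
demo_counts = {"曹操": 5, "却说": 9, "孔明": 4, "不可": 7}
demo_names = {"曹操", "孔明", "魏延"}
for name, count in character_counts(demo_counts, demo_names):
    print("{0:<10}{1:>5}".format(name, count))
```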

    Going further, generate a word cloud image:

    import jieba
    import os
    import wordcloud

    def getText(file):
        # Read the file and return its words joined by spaces. WordCloud's
        # default tokenizer does not segment Chinese, so the words produced
        # by jieba must be separated explicitly.
        with open(file, 'r', encoding='UTF-8') as f:
            txt = f.read()
        return ' '.join(jieba.lcut(txt))

    filename = input()  # base name of the text file, without ".txt"
    txt = getText(filename + '.txt')
    wordclouds = wordcloud.WordCloud(width=1000, height=800, margin=2).generate(txt)
    wordclouds.to_file('{}.png'.format(filename))

    # Open the image in the default viewer (this form works on Windows).
    os.system('{}.png'.format(filename))

    The character names can be normalized further before drawing the cloud; see the code in the second part.
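
    One possible way to combine the two parts is to merge the alias counts first and then draw the cloud from the merged frequencies. This is a sketch under assumptions: merge_aliases and cloud_from_counts are hypothetical helpers, and font_path must point to a font that contains Chinese glyphs. WordCloud.generate_from_frequencies takes a word-to-count mapping instead of raw text.

```python
def merge_aliases(counts, aliases):
    """Fold the counts of alias words into their canonical names."""
    merged = {}
    for word, count in counts.items():
        name = aliases.get(word, word)
        merged[name] = merged.get(name, 0) + count
    return merged

def cloud_from_counts(counts, out_png, font_path=None):
    """Draw a word cloud from a word -> count mapping.
    Needs the third-party wordcloud package."""
    import wordcloud  # imported here so merge_aliases works without it
    wc = wordcloud.WordCloud(width=1000, height=800, margin=2, font_path=font_path)
    wc.generate_from_frequencies(counts).to_file(out_png)

# Made-up counts for illustration only.
demo = merge_aliases(
    {"孔明": 3, "诸葛亮": 2, "孔明曰": 1, "曹操": 4},
    {"诸葛亮": "孔明", "孔明曰": "孔明"},
)
print(demo)
```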

    By default the wordcloud library renders Chinese text as garbled characters; for a fix, see https://blog.csdn.net/Dick633/article/details/80261233
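
    The linked post's fix is not reproduced here. A commonly used approach is to pass a font that contains Chinese glyphs through WordCloud's font_path parameter, since the bundled default font does not cover them. The sketch below assumes a few typical font locations (all of them are assumptions and may not exist on a given machine); find_font and chinese_wordcloud are hypothetical helper names.

```python
import os

# Assumed locations of fonts with Chinese glyphs; adjust for your system.
CANDIDATE_FONTS = [
    "C:/Windows/Fonts/msyh.ttc",
    "C:/Windows/Fonts/simhei.ttf",
    "/System/Library/Fonts/PingFang.ttc",
    "/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc",
]

def find_font(candidates=CANDIDATE_FONTS):
    """Return the first candidate path that exists as a file, or None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None

def chinese_wordcloud(text, out_png):
    """Render space-separated Chinese words to out_png.
    Needs the third-party wordcloud package."""
    import wordcloud
    font = find_font()
    if font is None:
        raise FileNotFoundError("no Chinese-capable font found; edit CANDIDATE_FONTS")
    wc = wordcloud.WordCloud(font_path=font, width=1000, height=800, margin=2)
    wc.generate(text).to_file(out_png)
```

    Calling chinese_wordcloud(txt, 'out.png') with the space-joined output of jieba.lcut would then draw the words with the selected font.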

    Reference: https://blog.csdn.net/weixin_44521703/article/details/93058003

  • Original article: https://www.cnblogs.com/116970u/p/11611821.html