• Comprehensive exercise: word frequency statistics


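    Comprehensive exercise

    1. English word frequency statistics (preprocessing)

    Download the lyrics of an English song or an English article.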

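    Replace every separator character such as , . ? ! ' " : with a space.

    # Placeholder: news holds the downloaded lyrics or article text
    news = '''Paste the downloaded lyrics or article here.'''
    sep = ''',.?!'":'''
    for a in sep:
        news = news.replace(a, ' ')

    print(news)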
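The loop above calls replace once per separator character. As a minimal sketch of an alternative, the same substitution can be done in a single pass with str.maketrans and str.translate. The function name strip_separators and the sample sentence are illustrative assumptions, not part of the original exercise.

```python
# Separator characters taken from the exercise text
SEPARATORS = ''',.?!'":'''


def strip_separators(text, separators=SEPARATORS):
    """Return text with every separator character replaced by a space."""
    # Build a translation table mapping each separator to a single space
    table = str.maketrans(separators, ' ' * len(separators))
    return text.translate(table)


# Hypothetical sample sentence, used only for illustration
sample = "Hello, world! It's fine: yes?"
cleaned = strip_separators(sample)
print(cleaned.split())
```

Because each separator becomes exactly one space, the cleaned string has the same length as the input, and split() then discards the extra whitespace.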

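    Convert all uppercase letters to lowercase.

    # news: the text loaded in the previous step
    sep = ''',.?!'":'''
    news = news.lower()  # lowercase once, before the loop
    for a in sep:
        news = news.replace(a, ' ')

    print(news)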

    Generate the word list.

    # news: the text loaded in the first step
    sep = ''',.?!'":'''
    for a in sep:
        news = news.replace(a, ' ')
    wordList = news.lower().split()  # split on any run of whitespace
    for w in wordList:
        print(w)

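    Generate the word frequency counts.

    # news: the text loaded in the first step
    sep = ''',.?!'":'''
    for a in sep:
        news = news.replace(a, ' ')
    wordList = news.lower().split()
    wordDict = {}
    wordSet = set(wordList)  # unique words
    for w in wordSet:
        wordDict[w] = wordList.count(w)  # occurrences of w in the full list
    for w in wordDict:
        print(w, wordDict[w])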
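Calling wordList.count(w) rescans the whole list for every distinct word, which gets slow on long texts. A minimal sketch of a single-pass alternative uses collections.Counter from the standard library. The function name count_words and the sample sentence are assumptions made for illustration.

```python
from collections import Counter


def count_words(text, separators=''',.?!'":'''):
    """Return a Counter mapping each lowercase word to its frequency."""
    for ch in separators:
        text = text.replace(ch, ' ')
    # Counter tallies every word in one pass over the list
    return Counter(text.lower().split())


# Hypothetical sample sentence, used only for illustration
sample = "The cat saw the dog, and the dog saw the cat!"
freq = count_words(sample)
print(freq['the'], freq['cat'], freq['and'])  # 4 2 1
```

A Counter behaves like a dict, so the printing loop from the listing above works on it unchanged; looking up a word that never occurred returns 0 instead of raising KeyError.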

    Sort by frequency.

    # news: the text loaded in the first step
    sep = ''',.?!'":'''
    for a in sep:
        news = news.replace(a, ' ')
    wordList = news.lower().split()
    wordDict = {}
    wordSet = set(wordList)
    for w in wordSet:
        wordDict[w] = wordList.count(w)
    dictList = list(wordDict.items())
    dictList.sort(key=lambda x: x[1], reverse=True)  # highest count first
    print(dictList)
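Sorting on the count alone leaves words with equal counts in whatever order the set happened to produce, so the output can differ between runs. A small sketch of a deterministic ordering, assuming alphabetical order is an acceptable tie-break, sorts on the pair (negative count, word). The dictionary below is made-up sample data.

```python
# Made-up counts, used only to illustrate the tie-break
word_counts = {'let': 3, 'wisdom': 1, 'be': 3, 'of': 1, 'it': 3}

# Sort by descending count, then alphabetically among equal counts
ranked = sorted(word_counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranked)
# [('be', 3), ('it', 3), ('let', 3), ('of', 1), ('wisdom', 1)]
```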

    Exclude function words such as pronouns, articles, and conjunctions.

    # news: the text loaded in the first step
    exclude = {'the', 'and', 'of', 'to'}
    sep = ''',.?!'":'''
    for a in sep:
        news = news.replace(a, ' ')
    wordList = news.lower().split()
    wordDict = {}
    wordSet = set(wordList) - exclude  # drop the excluded words
    for w in wordSet:
        wordDict[w] = wordList.count(w)
    dictList = list(wordDict.items())
    dictList.sort(key=lambda x: x[1], reverse=True)
    print(dictList)

    Output the 20 most frequent words.

    # news: the text loaded in the first step
    sep = ''',.?!'":'''
    for a in sep:
        news = news.replace(a, ' ')
    wordList = news.lower().split()
    wordDict = {}
    wordSet = set(wordList)
    for w in wordSet:
        wordDict[w] = wordList.count(w)
    dictList = list(wordDict.items())
    dictList.sort(key=lambda x: x[1], reverse=True)
    for item in dictList[:20]:  # slicing is safe even with fewer than 20 words
        print(item)
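The steps so far (replace separators, lowercase, split, exclude, count, sort, take the top entries) can be gathered into one function. The following is a sketch under assumptions, not the exercise's reference solution: the name top_words, its parameters, and the sample sentence are all hypothetical.

```python
from collections import Counter


def top_words(text, n=20, exclude=frozenset(), separators=''',.?!'":'''):
    """Return up to n (word, count) pairs, most frequent first."""
    for ch in separators:
        text = text.replace(ch, ' ')
    # Lowercase, split on whitespace, and drop excluded words
    words = [w for w in text.lower().split() if w not in exclude]
    # most_common(n) returns fewer than n pairs when the text is short
    return Counter(words).most_common(n)


# Hypothetical sample sentence, used only for illustration
sample = "The dog saw the dog, and the cat saw the dog!"
result = top_words(sample, n=2, exclude={'the', 'and'})
print(result)  # [('dog', 3), ('saw', 2)]
```

Asking for more words than the text contains simply returns everything available, so no index check is needed.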

    Save the text to be analyzed as a UTF-8 encoded file, and load it for the word frequency analysis by reading that file.
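A minimal sketch of the file-based variant follows. The helper name read_text and the file name news.txt are assumptions; the demo writes a small sample file into a temporary directory so that it runs without any downloaded text.

```python
import os
import tempfile


def read_text(path):
    """Read an entire UTF-8 encoded text file into a string."""
    # The with statement closes the file even if reading fails
    with open(path, 'r', encoding='utf-8') as f:
        return f.read()


# Demo only: create a small sample file, then read it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'news.txt')  # hypothetical file name
    with open(path, 'w', encoding='utf-8') as f:
        f.write("Hello, world! Hello again.")
    news = read_text(path)

print(news)
```

Passing encoding='utf-8' explicitly avoids depending on the platform default encoding, which differs between systems.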

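    2. Chinese word frequency statistics

    Download a long Chinese article.

    Read the text to be analyzed from a file.

    with open('hongluomeng.txt', 'r', encoding='utf-8') as f:
        text = f.read()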

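    Install jieba and use it for Chinese word segmentation.

    # Install with: pip install jieba
    import jieba

    # text: the string read from the file in the previous step
    g = ''',。‘’“”:;()!?、'''  # punctuation to strip
    for i in g:
        text = text.replace(i, '')
    b = jieba.lcut(text)  # lcut returns the tokens as a list
    print(b)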

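    Generate the word frequency counts.

    Sort by frequency.

    Exclude function words such as pronouns, articles, and conjunctions.

    Output the 20 most frequent words (or save the result to a file).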

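    import jieba

    # Read the whole text as UTF-8
    with open('hongluomeng.txt', 'r', encoding='utf-8') as f:
        text = f.read()

    # Punctuation to strip (ASCII and full-width forms)
    g = ''',。‘’“”:;()!?、,:;()!?'''
    # Function words and other tokens to exclude from the ranking
    a = {
        '的', ' ',
        '曰', '之', '不', '人', '一', '大', '马', '来', '有', '于', '下', '此',
    }
    for i in g:
        text = text.replace(i, '')
    b = jieba.lcut(text)  # segment the text into a list of tokens
    print(b)

    # Count tokens in one pass over the segmented list.
    # text.count(w) would also match w inside longer words and overcount.
    count = {}
    for w in b:
        if w not in a and w.strip():  # skip excluded and whitespace-only tokens
            count[w] = count.get(w, 0) + 1

    r = list(count.items())
    r.sort(key=lambda x: x[1], reverse=True)
    print(r)

    # Append the top 20 to a UTF-8 encoded result file
    with open('hlmCount.txt', 'a', encoding='utf-8') as f:
        for word, n in r[:20]:
            f.write(word + ':' + str(n) + ' ')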
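Counting should be done on the segmented tokens rather than on substrings of the raw text, because a short word also appears inside longer words. The sketch below illustrates this with a hand-written token list that stands in for the output of jieba.lcut; the list, the stop words, and the function name count_tokens are assumptions, and jieba itself is not imported so that the example runs on its own.

```python
from collections import Counter


def count_tokens(tokens, stop_words=frozenset()):
    """Count tokens, skipping stop words and whitespace-only tokens."""
    return Counter(t for t in tokens if t.strip() and t not in stop_words)


# Hand-written tokens standing in for a jieba.lcut(text) result (assumption)
tokens = ['宝玉', '笑', '道', '\n', '宝玉', '的', '黛玉', '笑', '宝玉']
counts = count_tokens(tokens, stop_words={'的', '道'})
top = counts.most_common(2)
print(top)  # [('宝玉', 3), ('笑', 2)]
```

By contrast, a substring count such as '大观园'.count('大') returns 1 even though '大' is not a token of the list ['大观园'], which is how per-character stop words can inflate a ranking built on the raw text.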

  • Original article: https://www.cnblogs.com/Brilliance-pan/p/8666659.html