• Word Frequency Counting in Text


    Word frequency: the number of times each word occurs in a text.

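    # Count English words:
    # Adjust the path to wherever your copy of the text is stored
    with open('F:实习pythonhamlet','r',encoding='utf8') as f:
        data = f.read().lower()
    # split() with no argument splits on any whitespace (spaces, newlines, tabs)
    data_split = data.split()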
    
    count_dict = {}
    for word in data_split:
        if word not in count_dict:
            count_dict[word] = 1
        else:
            count_dict[word] += 1
    
    def func(i):
        # Sort key: the count in a (word, count) pair
        return i[1]
    lt = list(count_dict.items())
    lt.sort(key = func)     # ascending by count
    
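    lt.reverse()            # order the results from largest to smallest
    for i in lt[0:10]:
        # ':^7' and ':^5' center each field; the colon before '^5' is required
        print(f'{i[0]:^7}{i[1]:^5}')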
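    The English count above can also be written with the standard library's collections.Counter. The following is only a minimal sketch, not the original author's method: the helper name top_words and the sample sentence are made up for illustration, and the regular expression (which keeps runs of letters and apostrophes) is one assumed way to keep punctuation from sticking to words.

```python
import re
from collections import Counter

def top_words(text, n=10):
    """Return the n most frequent words in text as (word, count) pairs."""
    # Lowercase, then keep runs of letters and apostrophes so that
    # punctuation such as "," or "." is not glued onto a word.
    words = re.findall(r"[a-z']+", text.lower())
    # Counter tallies the words; most_common(n) sorts by count, descending.
    return Counter(words).most_common(n)

# Hypothetical sample text standing in for the contents of the file
sample = "To be, or not to be: that is the question."
for word, count in top_words(sample, 3):
    print(f'{word:^7}{count:^5}')
```

    To use it on a file, read the file as in the code above and pass the resulting string to top_words.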
    
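    # Count Chinese words:
    import jieba            # import the jieba library, used for word segmentation
    # Adjust the path to wherever your copy of the text is stored
    with open(r'F:实习python719\threekingdoms','r',encoding='utf8') as f:
        data = f.read()
    data_jieba = jieba.lcut(data)   # lcut returns the segmented words as a list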
    
    count_dict = {}
    for word in data_jieba:
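        if len(word) == 1:          # skip single-character tokens
            continue
        # skip common words that should be excluded from the count
        if word in {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}:
            continue
        if '曰' in word:            # drop the speech marker 曰 ("said"), e.g. 孔明曰 -> 孔明
            word = word.replace('曰','')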
        if word not in count_dict:
            count_dict[word] = 1
        else:
            count_dict[word] += 1
    
    def func(i):
        return i[1]
    data_list = list(count_dict.items())
    data_list.sort(key = func)
    
    data_list.reverse()
    print(data_list)
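    The filtering and sorting steps of the Chinese count can be collected into one function. This is a sketch under stated assumptions: count_tokens, its parameters, and the hand-written token list are hypothetical names introduced here, and the list stands in for the output of jieba.lcut(data) so the example runs without jieba installed. Unlike the code above, it strips 曰 before filtering, so a token that shrinks to a single character is also dropped. It uses sorted(..., reverse=True) in place of sort() followed by reverse().

```python
def count_tokens(tokens, stopwords=frozenset(), min_len=2):
    """Count tokens, skipping short tokens and anything in stopwords."""
    counts = {}
    for token in tokens:
        token = token.replace('曰', '')   # strip the "said" marker first
        if len(token) < min_len or token in stopwords:
            continue
        counts[token] = counts.get(token, 0) + 1
    # Sort the (word, count) pairs by count, largest first, in one step
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

# Hand-written tokens standing in for the output of jieba.lcut(data)
tokens = ['曹操', '孔明曰', '将军', '曹操', '曰', '孔明', '曹操']
ranked = count_tokens(tokens, stopwords={'将军'})
for word, count in ranked[:10]:
    print(f'{word}\t{count}')
```

    Printing only the first ten pairs keeps the output readable; print(data_list) in the code above prints every word in the text.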
    
  • Original article: https://www.cnblogs.com/yushan1/p/11213414.html