• NLP相似度之tf-idf计算


    当然,在学习过程中也是参考了很多其他的资料,代码都是一行一行敲出来的。

    一、将多个文件合并成一个文件,避免频繁的打开和关闭

     1 import sys
     2 
     3 for line in sys.stdin:
     4     ss = line.strip().split('	')
     5     file_name = ss[0].strip()
     6     file_context = ss[1].strip()
     7     word_list = file_context.split(' ')
     8 
     9     word_set = set()
    10     for word in word_list:
    11         word_set.add(word)
    12 
    13     for word in word_set:
    14         print '	'.join([word, '1'])

    执行命令:就可以得到合并后的文件啦!!!

    python convert.py input_tfidf_dir/ > merge_files.data 

    tf-idf计算流程图:

    二 、计算IDF的值:

    map阶段:读取每一行

     1 import sys
     2 
     3 for line in sys.stdin:
     4     ss = line.strip().split('	')
     5     file_name = ss[0].strip()
     6     file_context = ss[1].strip()
     7     word_list = file_context.split(' ')
     8 
     9     word_set = set()
    10     for word in word_list:
    11         word_set.add(word)
    12 
    13     for word in word_set:
    14         print '	'.join([word, '1'])

    reduce阶段:

     1 import sys
     2 import math
     3 
     4 current_word = None
     5 doc_cnt = 508
     6 count_pool = []
     7 sum = 0
     8 
     9 for line in sys.stdin:
    10     ss = line.strip().split('	')
    11     if len(ss) != 2:
    12         continue
    13 
    14     word, val = ss
    15     if current_word == None:
    16         current_word = word
    17     if current_word != word:
    18         for count in count_pool:
    19             sum += count
    20 
    21         idf_score = math.log(float(doc_cnt) / (float(sum) + 1))
    22         print '	'.join([current_word, str(idf_score)])
    23 
    24         current_word = word
    25         count_pool = []
    26         sum = 0
    27 
    28     count_pool.append((int(val)))
    29 
    30 for count in count_pool:
    31     sum += count
    32 
    33 idf_score = math.log(float(doc_cnt) / (float(sum) + 1))
    34 print '	'.join([current_word, str(idf_score)])

    三、计算TF的值:

     1 # 计算tf
     2 # 读取合并后的数据
     3 # 执行命令 cat merge_files.data | python map_tf.py mapper_func idf.data
     4 
     5 import sys
     6 
     7 word_dict = {}
     8 idf_dict = {}
     9 
    10 # 读取计算的idf数据文件
    11 def read_idf_file_func(idf_file_fd):
    12     with open() as fd:
    13         for line in fd:
    14             ss = line.strip().split('	')
    15             if len(ss) != 2:
    16                 continue
    17             token = ss[0].strip()
    18             idf_score = ss[1].strip()
    19             idf_dict[token] = float(idf_score)
    20     return idf_dict
    21 
    22 # cat merge_files.data | python map_tf.py mapper_func
    23 def mapper_func(idf_file_fd):
    24     idf_dict = read_idf_file_func(idf_file_fd)
    25     # 标准输入
    26     for line in sys.stdin:
    27         ss = line.strip().split('	')
    28         file_name = ss[0].strip()
    29         file_context = ss[1].strip()
    30         word_list = file_context.split(' ')
    31 
    32         for word in word_list:
    33             if word not in word_dict:
    34                 word_dict[word] = 1
    35             else:
    36                 word_dict[word] += 1
    37 
    38         for k,v in word_dict.item():
    39             if k not in idf_dict:
    40                 continue
    41             print(file_name,k,v,idf_file_fd[k])
    42             print(k,v)
    43 
    44 if __name__ == "__main__":
    45     module = sys.modules[__name__]
    46     func = getattr(module, sys.argv[1])
    47     args = None
    48     if len(sys.argv) > 1:
    49         args = sys.argv[2:]
    50     func(*args)
  • 相关阅读:
    LyX使用中的一些问题
    Mac OS apache php配置
    MySQL utf8mb4 字符集:支持 emoji 表情符号
    java.util.NoSuchElementException: Timeout waiting for idle object
    MyEclipse 2014跟2015破解
    No row with the given identifier exists:
    Android启动icon切图大小
    Android接入百度自动更新SDK
    Android自定义spinner下拉框实现的实现
    android给View设置边框 填充颜色 弧度
  • 原文地址:https://www.cnblogs.com/ssqq5200936/p/10744284.html
Copyright © 2020-2023  润新知