jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
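– sentence: the text to extract keywords from
– topK: how many of the keywords with the highest TF-IDF weights to return; defaults to 20
– withWeight: whether to return each keyword's weight along with the keyword; defaults to False
– allowPOS: keep only words with the specified parts of speech; defaults to empty, meaning no filtering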
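
To make topK and withWeight concrete, here is a minimal pure-Python sketch of ranking words by TF-IDF weight. It is not jieba's implementation: the function name `top_tfidf`, the toy `idf` table, and the `default_idf` fallback are assumptions made for illustration, and the input is assumed to be an already segmented word list.

```python
from collections import Counter


def top_tfidf(words, idf, topK=20, withWeight=False, default_idf=1.0):
    """Rank words by term frequency * IDF (hypothetical helper, not jieba's API)."""
    counts = Counter(words)
    total = sum(counts.values())
    if total == 0:
        return []
    # Term frequency is the word's share of all tokens; IDF comes from the table.
    weights = {w: (c / total) * idf.get(w, default_idf) for w, c in counts.items()}
    # Sort by descending weight, breaking ties alphabetically for stable output.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:topK]
    return ranked if withWeight else [w for w, _ in ranked]


# Toy data: "the" is frequent but has a low IDF, so it ranks last.
words = ["python", "data", "python", "the", "the", "the", "pandas"]
idf = {"python": 3.0, "data": 2.0, "pandas": 4.0, "the": 0.1}

print(top_tfidf(words, idf, topK=2))                   # ['python', 'pandas']
print(top_tfidf(words, idf, topK=2, withWeight=True))  # (word, weight) pairs
```

With withWeight=False only the words come back; with withWeight=True each entry is a (word, weight) pair, which mirrors the two return shapes described above.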
Modules: os, codecs, pandas, jieba
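import os
import codecs

import pandas
import jieba
import jieba.analyse

filePaths = []
contents = []
tag1s = []
tag2s = []
tag3s = []
tag4s = []
tag5s = []

# Walk the sample corpus. A raw string keeps the Windows backslashes literal.
for root, dirs, files in os.walk(r"D:\PDM\2.6\SogouC.mini\Sample"):
    for name in files:
        filePath = os.path.join(root, name)
        # Read the whole file as UTF-8; the with-block closes it automatically.
        with codecs.open(filePath, 'r', 'utf-8') as f:
            content = f.read().strip()

        # Extract the 5 keywords with the highest TF-IDF weights.
        tags = jieba.analyse.extract_tags(content, topK=5)
        # Pad with None so very short documents do not raise IndexError below.
        tags = tags + [None] * (5 - len(tags))

        filePaths.append(filePath)
        contents.append(content)
        tag1s.append(tags[0])
        tag2s.append(tags[1])
        tag3s.append(tags[2])
        tag4s.append(tags[3])
        tag5s.append(tags[4])

# One row per file: its path, its content, and its top five keywords.
tagDF = pandas.DataFrame({
    'filePath': filePaths,
    'content': contents,
    'tag1': tag1s,
    'tag2': tag2s,
    'tag3': tag3s,
    'tag4': tag4s,
    'tag5': tag5s
})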
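
The same walk-and-collect loop can also be written so that the number of keywords is not hard-coded into seven parallel lists. The sketch below is one possible arrangement, not the original author's code: `collect_keywords` and its `extract` parameter are hypothetical names, and the extractor is passed in as a function so the loop itself depends only on the standard library.

```python
import os


def collect_keywords(root_dir, extract, top_k=5):
    """Return one dict per file under root_dir (hypothetical helper).

    `extract` is any callable taking (text, top_k) and returning keywords.
    """
    rows = []
    for root, _dirs, files in os.walk(root_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            with open(path, "r", encoding="utf-8") as f:
                content = f.read().strip()
            tags = list(extract(content, top_k))[:top_k]
            # Pad with None so every row has the same tag1..tagN columns.
            tags += [None] * (top_k - len(tags))
            row = {"filePath": path, "content": content}
            for i, tag in enumerate(tags, start=1):
                row["tag%d" % i] = tag
            rows.append(row)
    return rows
```

With jieba installed, `extract` could be `lambda text, k: jieba.analyse.extract_tags(text, topK=k)`, and the returned list of dicts can be passed directly to `pandas.DataFrame(rows)`.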