I have recently read a number of blog posts on this topic, so rather than reinvent the wheel, I decided to build on existing work.

Some references:
https://spaces.ac.cn/archives/4176
https://mp.weixin.qq.com/s?__biz=MzUyMDY0OTg3Nw%3D%3D&idx=1&mid=2247483824&scene=21&sn=831ed590670e2de5f2e29d5b6072df31#wechat_redirect
https://mp.weixin.qq.com/s?__biz=MzUyMDY0OTg3Nw==&mid=2247483803&idx=1&sn=95318832711e96ff31c18ac21f7e29c2&scene=21#wechat_redirect
My steps:

Step 1: acquire seed vocabulary (from Wikipedia entries)
```python
import re
import codecs

import bz2file
from tqdm import tqdm
from opencc import OpenCC
from gensim.corpora.wikicorpus import extract_pages

openCC = OpenCC('t2s')  # Traditional-to-Simplified Chinese conversion
# The conversion can also be changed later, e.g. openCC.set_conversion('s2tw')

wiki = extract_pages(bz2file.open('resources/zhwiki-20210420-pages-articles-multistream.xml.bz2'))

i = 0
f = codecs.open('resources/wiki.txt', 'w', encoding='utf-8')
w = tqdm(wiki, desc=u'0 articles fetched')
for d in w:
    # Skip namespace pages (title like 'Wikipedia:...') and redirect pages (body starts with '#')
    if not re.findall('^[a-zA-Z]+:', d[0]) and d[0] and not re.findall(u'^#', d[1]):
        s = openCC.convert(d[0])  # only the title (vocabulary entry) is kept, not the article body
        f.write(s + ' ')
        i += 1
        if i % 100 == 0:
            w.set_description(u'%s articles fetched' % i)
f.close()
```
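The page filter above hinges on two regular expressions: titles with a namespace prefix such as `Wikipedia:` or `Category:` are skipped, as are redirect stubs whose body starts with `#`. A minimal, stdlib-only sketch of that filtering logic (the sample pages below are made up for illustration):

```python
import re

def is_article(title, body):
    """Keep only real articles: skip namespace pages (title like
    'Wikipedia:...') and redirect stubs whose body starts with '#'."""
    if re.findall(r'^[a-zA-Z]+:', title):  # namespace prefix, e.g. 'Category:'
        return False
    if not title:
        return False
    if re.findall(r'^#', body):            # '#REDIRECT ...' pages
        return False
    return True

# Hypothetical sample pages for illustration
pages = [
    ('深度学习', '深度学习是机器学习的分支……'),
    ('Wikipedia:About', 'meta page'),
    ('机器学习', '#REDIRECT [[Machine learning]]'),
]
kept = [t for t, b in pages if is_article(t, b)]
print(kept)  # ['深度学习']
```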
Step 2: automatic phrase mining with AutoPhraseX

```python
import tqdm
import pandas as pd
from autophrasex import *

# Build the AutoPhrase miner
autophrase = AutoPhrase(
    reader=DefaultCorpusReader(tokenizer=JiebaTokenizer()),
    selector=DefaultPhraseSelector(),
    extractors=[
        NgramsExtractor(N=4),
        IDFExtractor(),
        EntropyExtractor(),
    ],
)

# Run the mining
predictions = autophrase.mine(
    corpus_files=['./data/query_words_200w.txt'],
    quality_phrase_files='./resources/wiki.txt',
    callbacks=[
        LoggingCallback(),
        ConstantThresholdScheduler(),
        EarlyStopping(patience=2, min_delta=3),
    ],
)

# Save the mined phrases together with their quality scores
words = []
prob = []
for pred in tqdm.tqdm(predictions):
    words.append(pred[0])
    prob.append(pred[-1])
df = pd.DataFrame({"words": words, "prob": prob})
df.to_csv('./data/mining_phrase.csv', index=None)
print('ok')
```
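After mining, the scored phrases can be inspected by filtering on `prob` and sorting. A small sketch with a toy DataFrame (the phrases, scores, and the 0.5 threshold below are made up for illustration; in practice you would read `./data/mining_phrase.csv`):

```python
import pandas as pd

# Toy stand-in for the mining output; the real data comes from mining_phrase.csv
df = pd.DataFrame({
    "words": ["机器学习", "深度学习", "的一个"],
    "prob":  [0.92, 0.88, 0.15],
})

# Keep only phrases above a quality threshold, highest score first
top = df[df["prob"] >= 0.5].sort_values("prob", ascending=False)
print(top["words"].tolist())  # ['机器学习', '深度学习']
```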
This approach delegates phrase mining to a third-party package. If the top-k results it produces are unsatisfactory, you can instead build your own model with distant supervision over wiki data; the DXY (丁香园) implementation is a useful reference.

A few caveats when using AutoPhraseX: corpus_files is your own corpus, with one text sample per line; quality_phrase_files is the high-quality wiki vocabulary, and this list must be large and comprehensive enough, otherwise the program will throw an error.
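Since corpus_files expects exactly one sample per line, it is worth normalizing your corpus before mining, e.g. stripping embedded newlines. A minimal sketch (the file name and sample texts are made up for illustration):

```python
# Each line of the corpus file must hold exactly one text sample.
samples = ["今天天气不错", "机器学习入门教程", "深度学习框架对比"]

with open("corpus.txt", "w", encoding="utf-8") as f:
    for s in samples:
        f.write(s.replace("\n", " ") + "\n")  # strip embedded newlines

# Sanity check: one line per sample
with open("corpus.txt", encoding="utf-8") as f:
    lines = f.read().splitlines()
print(len(lines))  # 3
```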
Hope this helps!