• Playing with gensim, the Python topic-model library


    gensim is a topic-model library for Python that is extremely easy to pick up; its homepage is http://radimrehurek.com/gensim/index.html

    Installation is fairly involved; follow the steps at http://radimrehurek.com/gensim/install.html.

    My machine runs Python 2.7. You first need setuptools or pip, then use one of them to install numpy and scipy, because gensim depends on scientific/numerical computing. scipy in turn needs the BLAS and LAPACK packages, so the dependency chain is quite long. When building LAPACK, remember to edit the compiler flags in make.inc and add -fPIC, otherwise the scipy install will fail. I will skip the other detours and focus on what to do with gensim once it is installed.
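    To confirm the whole chain installed correctly before moving on, a quick sanity check like the following is enough (a minimal sketch; your version numbers will differ):

    import numpy
    import scipy
    import gensim

    # if any of these imports fails, the corresponding dependency is broken
    print "numpy", numpy.__version__
    print "scipy", scipy.__version__
    print "gensim", gensim.__version__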

    gensim provides models such as LSI, LDA, and TF-IDF. Since topic models need a corpus to train on, you can lean on NLTK's rich corpora for getting started while prototyping, as in the sketch below.
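    For example, sentences from NLTK's Brown corpus can stand in for real documents (a sketch, assuming the 'brown' corpus has already been fetched via nltk.download()):

    import nltk
    import gensim

    # treat each Brown-corpus sentence as one tokenized "document"
    texts = [[word.lower() for word in sent if word.isalpha()]
             for sent in nltk.corpus.brown.sents()[:100]]

    dictionary = gensim.corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    print len(corpus), "documents,", len(dictionary), "unique tokens"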

    Taking LSI as an example, the full pipeline covers data reading/preprocessing, training, and prediction. Reading and preprocessing use gensim's corpora package and its Dictionary for serialization; the corpus is converted to TF-IDF form and then handed to models.LsiModel for training. The code is as follows:

    • Main program
    '''
    Created on 2013-6-13
    
    @author: william.hw
    '''
    import gensim, aequerycluster, logging, nltk, sys
    
    def notpurepunc(word):
        # keep a word only if it contains at least one alphanumeric
        # character, i.e. it is not made up of punctuation alone
        for ch in word:
            if ch.isalnum():
                return True
        return False
    
    if __name__ == '__main__':
        logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
        logger = logging.getLogger('aequerycluster')
        
        dictpath = 'aequerycluster.dict'
        modelpath = 'aequerycluster.model'
        logger.info("start...")
        
        hivefile = sys.argv[1]
        uniqlevel = sys.argv[2]
        
        texts = []
        count = 10  # while prototyping, only keep the first 10 docs
        with open(hivefile) as finput:
            for line in finput:
                # each line looks like "doc_id","token token ..."; strip the
                # leading quote and trailing quote/newline, then split on ","
                fields = line[1:-2].split('","')
                if len(fields) != 2:
                    continue
                count -= 1
                if count < 0:
                    break
                texts.append(fields[1].split())
        logger.info("finish make texts")
        
        # filter tokens: cache the NLTK English stopword list as a set once,
        # instead of re-reading it for every word
        stopwords = set(nltk.corpus.stopwords.words('english'))
        if uniqlevel == "full":
            # additionally drop tokens that occur only once in the whole corpus
            all_tokens = sum(texts, [])
            tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
            texts = [[word for word in text if word not in tokens_once and len(word) > 1 and word.lower() not in stopwords and notpurepunc(word)] for text in texts]
        else:
            texts = [[word for word in text if len(word) > 1 and word.lower() not in stopwords and notpurepunc(word)] for text in texts]
        logger.info("finish filter texts")
        
        # save=True so train_tfidf can load the dictionary back from dictpath
        serdictionary = aequerycluster.ClusterUtil.serializeDictionary(texts, dictpath, save=True)
    
        aqc = aequerycluster.AeQueryCluster()
        mycorpus = [serdictionary.doc2bow(text) for text in texts]
        mytfidf = gensim.models.TfidfModel(mycorpus)
        corpus_tfidf = mytfidf[mycorpus]
        lsimodel = aqc.train_tfidf(corpus_tfidf, dictpath, 10)
        logger.info("finish train gensim")
    
        lsimodel.print_topics(10, 5)  # topics are emitted through logging, not stdout
        lsimodel.save(modelpath)
        
        # index all training docs in LSI space for cosine-similarity queries
        index = gensim.similarities.MatrixSimilarity(lsimodel[corpus_tfidf])
        
        newdoc = "Human computer interaction"
        vec_bow = serdictionary.doc2bow(newdoc.lower().split())
        vec_lsi = lsimodel[vec_bow]
        sims = index[vec_lsi]
        print list(enumerate(sims))
        
        
    • Wrapper classes
    '''
    Created on 2013-6-13
    
    @author: william.hw
    '''
    
    import gensim
    
    class ClusterUtil(object):    
        @staticmethod
        def deserializedCorpus(serpath, serformat = "MM"):
            if serformat == "SVM":
                return gensim.corpora.SvmLightCorpus(serpath)
            else:
                return gensim.corpora.MmCorpus(serpath)
        
        @staticmethod
        def serializeCorpus(corpus, serpath, serformat = "MM"):
            if serformat == "SVM":
                gensim.corpora.SvmLightCorpus.serialize(serpath, corpus)
            else:
                gensim.corpora.MmCorpus.serialize(serpath, corpus)
        
        @staticmethod
        def deserializedDictionary(serpath):
            return gensim.corpora.Dictionary.load(serpath)
        
        @staticmethod
        def serializeDictionary(texts, dictpath, save=False):
            dictionary = gensim.corpora.Dictionary(texts)
            if save:
                dictionary.save(dictpath)
            return dictionary
    
    class AeQueryCluster(object):
        def __init__(self):
            self._model = None
             
        def train(self, corpus, serdictionary, num_topics=2):
            tfidf = gensim.models.TfidfModel(corpus)
            corpus_tfidf = tfidf[corpus]
            # pass the dictionary path along; self._dictionary does not
            # exist until train_tfidf has loaded it
            self._model = self.train_tfidf(corpus_tfidf, serdictionary, num_topics)
            return self._model
        
        def train_tfidf(self, corpus_tfidf, serdictionary, num_topics=2):
            self._dictionary = ClusterUtil.deserializedDictionary(serdictionary)
            self._model = gensim.models.LsiModel(corpus_tfidf, num_topics=num_topics, id2word=self._dictionary)
            return self._model
        
        def train_onedoc(self, corpus_tfidf):
            self._model.add_documents(corpus_tfidf)
            return self._model
        
        
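    A usage note on ClusterUtil: serializeCorpus writes a corpus to disk in Matrix Market (or SVMlight) format and deserializedCorpus streams it back, so the doc2bow step does not have to be redone on every run. A minimal round-trip sketch (the two-document toy corpus here is made up):

    import aequerycluster

    # a made-up bag-of-words corpus: each doc is a list of (token_id, count)
    toycorpus = [[(0, 1), (1, 2)], [(1, 1), (2, 3)]]

    aequerycluster.ClusterUtil.serializeCorpus(toycorpus, 'toy.mm')     # writes toy.mm + index
    restored = aequerycluster.ClusterUtil.deserializedCorpus('toy.mm')  # streamed back lazily
    for doc in restored:
        print doc   # counts come back as floats in Matrix Market format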

    First, remember to initialize logging, because the LSI training results are only visible through the logging output. hivefile holds the docs to be clustered, one per line; after tokenization, punctuation and stopwords are filtered out (using NLTK's English stopword list). The texts are then serialized into a Dictionary, and each text is flattened into a bag of words (doc2bow) to produce the raw corpus.
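    To make the doc2bow step concrete: the Dictionary maps every token to an integer id, and doc2bow turns a token list into sparse (id, count) pairs, silently dropping tokens the dictionary has never seen (a toy sketch; the exact ids may differ):

    import gensim

    texts = [['dual', 'core', 'phone'], ['tibetan', 'silver', 'jewelry']]
    dictionary = gensim.corpora.Dictionary(texts)

    print dictionary.token2id                      # e.g. {'core': 0, 'dual': 1, ...}
    print dictionary.doc2bow(['phone', 'phone', 'wifi'])
    # 'phone' is counted twice; 'wifi' is unknown and silently dropped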

    Once the corpus is built, it has to be converted to TF-IDF, because the topic model takes each token's TF and IDF as training features (my own understanding). The term frequencies and inverse document frequencies then feed the LSI model as training input. With the number of topics set to 10, the model is built, and print_topics(10, 5) prints each topic with its 5 most significant tokens:

    2013-06-19 17:00:43,409 : INFO : topic #0(1.087): 0.210*"phone" + 0.187*"Windows" + 0.183*"Dual" + 0.183*"3G" + 0.171*"core"
    2013-06-19 17:00:43,410 : INFO : topic #1(1.056): 0.441*"silver" + 0.289*"accessories" + 0.289*"tibetan" + 0.289*"jewelry" + 0.171*"pure"
    2013-06-19 17:00:43,410 : INFO : topic #2(1.054): 0.419*"hair" + 0.222*"wave" + 0.222*"product,queen" + 0.222*"brazilian" + 0.222*"body"
    2013-06-19 17:00:43,411 : INFO : topic #3(1.024): -0.239*"phone" + -0.164*"arrival" + -0.164*"many" + -0.164*"new" + -0.164*"wifi"
    2013-06-19 17:00:43,412 : INFO : topic #4(1.000): 0.557*"Led" + 0.371*"Indoor" + 0.371*"Display" + 0.186*"Module" + 0.186*"Unit"
    2013-06-19 17:00:43,412 : INFO : topic #5(1.000): -0.459*"capacity" + -0.459*"4g" + -0.459*"card" + -0.229*"memory" + -0.229*"4gb"
    2013-06-19 17:00:43,413 : INFO : topic #6(0.960): 0.204*"Windows" + -0.155*"Leather" + -0.155*"X2" + -0.155*"x2" + -0.155*"Case"
    2013-06-19 17:00:43,413 : INFO : topic #7(0.944): -0.217*"wave" + -0.217*"product,queen" + -0.217*"brazilian" + -0.217*"body" + -0.217*"3pcs/lot,queen"
    2013-06-19 17:00:43,414 : INFO : topic #8(0.937): 0.263*"accessories" + 0.263*"tibetan" + 0.263*"jewelry" + -0.164*"s999" + -0.164*"pure"
    2013-06-19 17:00:43,415 : INFO : topic #9(0.923): -0.219*"Windows" + 0.185*"smartphones" + 0.185*"WiFi" + 0.185*"S5" + 0.185*"Phone"

    The weight in front of each token measures that token's influence on and contribution to the topic. Note that some weights are negative, meaning the token pulls in the opposite direction of the topic.
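    If you would rather inspect these weights as data than parse the log, LsiModel also exposes show_topic (a sketch reusing the lsimodel from the main program; be aware that the order within each pair, word-first or weight-first, has varied across gensim releases, so check your version):

    # the 5 strongest tokens of topic #0, as pairs instead of log text
    for pair in lsimodel.show_topic(0, 5):
        print pair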

    If you need prediction, first index the model output with similarities.MatrixSimilarity. A new doc to be predicted is preprocessed into the same bag-of-words form (doc2bow) and then matched against the index for similarity, computed as the cosine of the angle between vectors, with values in [-1, 1]:

    sims = index[vec_lsi]
    print list(enumerate(sims)) 
    [(0, -2.2351742e-08), (1, 3.7252903e-09), (2, 0.99415344), (3, 7.4505806e-09), (4, -9.3132257e-09), (5, -1.4901161e-08), (6, 0.0), (7, -1.8626451e-09), (8, 3.7252903e-09), (9, 0.0)]

    As you can see, this doc is most similar to the document at index 2, with a similarity of 0.99415344.
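    The raw sims array comes back in corpus order; to rank the neighbours, sort by the similarity score (a small sketch reusing the sims variable from the main program):

    # rank indexed documents by cosine similarity to the new doc, best first
    ranked = sorted(enumerate(sims), key=lambda item: -item[1])
    for docno, score in ranked[:3]:
        print docno, score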

    The one shortcoming: although gensim itself supports distributed computation, I have not found a good way to plug it directly into Hadoop MapReduce for cloud-scale computation. I hope interested readers will discuss it with me and work out a solution.
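    For completeness, gensim's built-in distributed mode (which spreads LSI across worker processes over Pyro, not over Hadoop) is enabled by a single constructor flag. A sketch reusing corpus_tfidf and serdictionary from the main program, assuming the Pyro name server, lsi_worker and lsi_dispatcher processes are already running as described in gensim's distributed-computing documentation:

    import gensim

    # distributed=True farms the LSI projection out to registered workers;
    # gensim raises an error if no dispatcher can be reached
    lsimodel = gensim.models.LsiModel(corpus_tfidf, num_topics=10,
                                      id2word=serdictionary, distributed=True)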
