• How to use TaggedDocument in gensim


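    I have two directories whose text files I want to read and label, but I don't know how to do this with TaggedDocument. I assumed it would work as TaggedDocument([strings], [labels]), but that evidently does not work.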

    from gensim import models
    from gensim.models.doc2vec import TaggedDocument
    import utilities as util
    import os
    from sklearn import svm
    from nltk.tokenize import sent_tokenize
    CogPath = "./FixedCog/"
    NotCogPath = "./FixedNotCog/"
    SamplePath ="./Sample/"
    docs = []
    tags = []
    CogList = [p for p in os.listdir(CogPath) if p.endswith('.txt')]
    NotCogList = [p for p in os.listdir(NotCogPath) if p.endswith('.txt')]
    SampleList = [p for p in os.listdir(SamplePath) if p.endswith('.txt')]
    for doc in CogList:
         str = open(CogPath+doc,'r').read().decode("utf-8")
         docs.append(str)
         print docs
         tags.append(doc)
         print "###########"
         print tags
         print "!!!!!!!!!!!"
    for doc in NotCogList:
         str = open(NotCogPath+doc,'r').read().decode("utf-8")
         docs.append(str)
         tags.append(doc)
    for doc in SampleList:
         str = open(SamplePath + doc, 'r').read().decode("utf-8")
         docs.append(str)
         tags.append(doc)
    
    T = TaggedDocument(docs,tags)
    
    model = models.Doc2Vec(T,alpha=.025, min_alpha=.025, min_count=1,size=50)

    Error:

    Traceback (most recent call last):
      File "/home/farhood/PycharmProjects/word2vec_prj/doc2vec.py", line 34, in <module>
        model = models.Doc2Vec(T,alpha=.025, min_alpha=.025, min_count=1,size=50)
      File "/home/farhood/anaconda2/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 635, in __init__
        self.build_vocab(documents, trim_rule=trim_rule)
      File "/home/farhood/anaconda2/lib/python2.7/site-packages/gensim/models/word2vec.py", line 544, in build_vocab
        self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule)  # initial survey
      File "/home/farhood/anaconda2/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 674, in scan_vocab
        if isinstance(document.words, string_types):
    AttributeError: 'list' object has no attribute 'words'

    So I did some experimenting and found this on GitHub:

    class TaggedDocument(namedtuple('TaggedDocument', 'words tags')):
        """
        A single document, made up of `words` (a list of unicode string tokens)
        and `tags` (a list of tokens). Tags may be one or more unicode string
        tokens, but typical practice (which will also be most memory-efficient) is
        for the tags list to include a unique integer id as the only tag.
    
        Replaces "sentence as a list of words" from Word2Vec.
        """
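
    The namedtuple definition above also accounts for the traceback. Doc2Vec iterates over whatever corpus it is given, and iterating over a single TaggedDocument yields its two fields, which are plain lists with no `words` attribute. The sketch below reproduces that with a stand-in namedtuple (an assumption so it runs without gensim; the texts and file names are made up for illustration):

```python
from collections import namedtuple

# Stand-in with the same fields as the class quoted above, so the sketch runs
# without gensim; real code would import it from gensim.models.doc2vec.
TaggedDocument = namedtuple('TaggedDocument', 'words tags')

texts = ['first file text', 'second file text']  # hypothetical file contents
names = ['a.txt', 'b.txt']                       # hypothetical file names

# What the question did: ONE TaggedDocument holding all texts and all tags.
wrong = TaggedDocument(texts, names)
# Iterating over it yields its two fields, which are plain lists.
wrong_items = [type(item).__name__ for item in wrong]

# What Doc2Vec expects: an iterable of TaggedDocument objects, one per text.
right = [TaggedDocument(text.split(), [name])
         for text, name in zip(texts, names)]
right_ok = all(hasattr(doc, 'words') and hasattr(doc, 'tags') for doc in right)

print(wrong_items)  # ['list', 'list']
print(right_ok)     # True
```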

    So I changed how I use TaggedDocument: I now create one TaggedDocument per document, and the important point is that the tags must be passed as a list.

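    for doc in CogList:
        # Read the file as UTF-8 text (Python 3) and split it into word tokens
        with open(CogPath + doc, encoding="utf-8") as f:
            words = f.read().split()
        # One TaggedDocument per file; the tag must be wrapped in a list
        docs.append(TaggedDocument(words, [doc]))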
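
    The same loop applies to the other directories in the question. A minimal sketch that covers all of them might look like the following; the helper name `tagged_docs_from_dirs` is hypothetical, it assumes file names are unique across directories (each file name is used as the tag), and the namedtuple fallback exists only so the sketch runs where gensim is not installed:

```python
import os
from collections import namedtuple

try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Fallback with the same fields, only so the sketch runs without gensim.
    TaggedDocument = namedtuple('TaggedDocument', 'words tags')


def tagged_docs_from_dirs(dir_paths):
    """Yield one TaggedDocument per .txt file, tagged with its file name."""
    for dir_path in dir_paths:
        for name in sorted(os.listdir(dir_path)):
            if not name.endswith('.txt'):
                continue
            with open(os.path.join(dir_path, name), encoding='utf-8') as f:
                words = f.read().split()
            # The tag is passed as a one-element list.
            yield TaggedDocument(words, [name])


# Usage with the paths from the question (uncomment where they exist):
# docs = list(tagged_docs_from_dirs(["./FixedCog/", "./FixedNotCog/", "./Sample/"]))
```

    Materializing the generator with `list(...)` matters because training makes several passes over the corpus, and a bare generator would be exhausted after the first pass.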

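    The input to a Doc2Vec model should be a list of TaggedDocument(['list', 'of', 'word'], [tag_]). A good practice is to use the index of the sentence as its tag. For example, to train a Doc2Vec model on two sentences (i.e. documents or paragraphs):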

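    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    s1 = 'the quick fox brown fox jumps over the lazy dog'
    s1_tag = '001'
    s2 = 'i want to burn a zero-day'
    s2_tag = '002'

    docs = []
    docs.append(TaggedDocument(words=s1.split(), tags=[s1_tag]))
    docs.append(TaggedDocument(words=s2.split(), tags=[s2_tag]))

    # min_count=1 because no word in this toy corpus occurs 5 times;
    # a higher threshold would leave the vocabulary empty.
    model = Doc2Vec(vector_size=300, window=5, min_count=1, workers=4, epochs=20)
    model.build_vocab(docs)

    print('Start training process...')
    model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)

    # Save the model (model_path is a placeholder; point it at your own location)
    model_path = 'doc2vec.model'
    model.save(model_path)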

    You can use Gensim's common_texts as an example:

    from gensim.test.utils import common_texts
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    
    documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(common_texts)]
    model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)
  • Original source: https://www.cnblogs.com/blogpro/p/11343819.html