• word2vec notes



    word '\xe8\xb6\x85\xe8\x87\xaa\xe7\x84\xb6\xe7\x8e\xb0\xe8\xb1\xa1' not in vocabulary
    (the escaped bytes are the UTF-8 encoding of u'超自然现象')

    Sample format after word segmentation:
    英雄联盟,疾风剑豪-亚索,五杀,精彩操作
    长安外传,街头采访,神回复
    日本料理,蛋包饭
    滑板运动,极限达人,城会玩
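    The samples above are comma-separated, while gensim's LineSentence (noted further down) expects words separated by whitespace. A minimal Python 3 sketch of the conversion, assuming ASCII commas as the delimiter; the helper name to_whitespace_line is made up for illustration:

```python
def to_whitespace_line(line, sep=","):
    """Turn one comma-separated sample into a whitespace-separated line."""
    tokens = [tok.strip() for tok in line.split(sep)]
    # Drop empty tokens left behind by stray or trailing separators.
    return " ".join(tok for tok in tokens if tok)


samples = [
    "英雄联盟,疾风剑豪-亚索,五杀,精彩操作",
    "日本料理,蛋包饭",
]
converted = [to_whitespace_line(s) for s in samples]
for line in converted:
    print(line)
```

    A token that itself contains a space would be split into several words by a whitespace tokenizer, so such tokens would need separate handling.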

    LineSentence

    u'王者荣耀'
    print(model[u'王者荣耀'])
    print(model[u'超自然现象'])


    Saving NumPy data in Python
    numpy.savetxt("result.txt", numpy_data)
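    To check that the saved file can be read back, a small round trip with numpy.loadtxt can help. This is only a sketch: the array values and the temporary path are made up for illustration.

```python
import os
import tempfile

import numpy as np

# Made-up data standing in for numpy_data.
numpy_data = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])

path = os.path.join(tempfile.mkdtemp(), "result.txt")
np.savetxt(path, numpy_data)   # plain text, one array row per line
restored = np.loadtxt(path)    # parse the text file back into an array

print(restored.shape)
```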

    Saving a list in Python
    # str() writes the list's repr, e.g. "[1, 2, 3]"
    with open('data.txt', 'w') as f:
        f.write(str(list_data))

    Writing a list to a txt file
    ipTable = ['158.59.194.213', '18.9.14.13', '58.59.14.21']
    fileObject = open('sampleList.txt', 'w')
    for ip in ipTable:
        fileObject.write(ip)
        fileObject.write(' ')  # separate entries with a space
    fileObject.close()
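    A variant of the same idea, sketched with a with block so the file is closed even if a write fails. It assumes one entry per line (rather than space-separated) to make reading back simple; the temporary path is made up for illustration.

```python
import os
import tempfile

ip_table = ['158.59.194.213', '18.9.14.13', '58.59.14.21']
path = os.path.join(tempfile.mkdtemp(), "sampleList.txt")

# Write one entry per line.
with open(path, "w") as f:
    for ip in ip_table:
        f.write(ip + "\n")

# Read the entries back, skipping blank lines.
with open(path) as f:
    loaded = [line.strip() for line in f if line.strip()]

print(loaded)
```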


    Writing a dict to a JSON file: convert the dict to a string, then write the string to the file
    import json
    dictObj = {
        'andy': {
            'age': 23,
            'city': 'shanghai',
            'skill': 'python'
        },
        'william': {
            'age': 33,
            'city': 'hangzhou',
            'skill': 'js'
        }
    }
    jsObj = json.dumps(dictObj)
    fileObject = open('jsonFile.json', 'w')
    fileObject.write(jsObj)
    fileObject.close()
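    The same result can be sketched with json.dump, which writes straight to the file object, and json.load to read it back. This assumes Python 3 (for the encoding argument); ensure_ascii=False is optional and only matters when the dict holds non-ASCII text. The temporary path is made up for illustration.

```python
import json
import os
import tempfile

dict_obj = {
    'andy': {'age': 23, 'city': 'shanghai', 'skill': 'python'},
    'william': {'age': 33, 'city': 'hangzhou', 'skill': 'js'},
}
path = os.path.join(tempfile.mkdtemp(), "jsonFile.json")

# Serialize directly into the file; indent=2 keeps it human-readable.
with open(path, "w", encoding="utf-8") as f:
    json.dump(dict_obj, f, ensure_ascii=False, indent=2)

# Parse the file back into a dict.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded['andy']['city'])
```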

    The first parameter passed to gensim.models.Word2Vec is an iterable of sentences.
    Each sentence is itself a list of words.


    gensim.models.word2vec.LineSentence
    Simple format: one sentence = one line; words already preprocessed and separated by whitespace.



    Useful references
    http://wetest.qq.com/lab/view/30.html
    http://lxbwk.njournal.sdu.edu.cn/fileup/HTML/2017-7-66.htm
    http://jacoxu.com/%E7%A8%80%E7%96%8F%E7%9A%84%E7%9F%AD%E6%96%87%E6%9C%AC/
    http://www.jianshu.com/p/d34d61188ab5
    https://radimrehurek.com/gensim/models/doc2vec.html
    https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
    https://github.com/RaRe-Technologies/gensim/blob/b0f80a6ff3b4e58c55b6162b3b621af71225761a/docs/notebooks/doc2vec-IMDB.ipynb

    https://stackoverflow.com/questions/31321209/doc2vec-how-to-get-document-vectors


    >>> from gensim.models.doc2vec import TaggedDocument
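    This works.
    The following does not, because "import a.b.c" requires c to be a module, and TaggedDocument is a class: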
    >>> import gensim.models.doc2vec.TaggedDocument
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    ImportError: No module named TaggedDocument


    >>> from gensim.models.doc2vec import Doc2Vec,LabeledSentence
    >>> import gensim.models.doc2vec.Doc2Vec
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    ImportError: No module named Doc2Vec
    >>> from gensim.models.doc2vec import Doc2Vec
    >>>


    Input file format for LabeledSentence: each line is <labels, words>, where there may be several labels separated by tabs, and the words are separated by spaces, e.g. <id  category  I like my cat demon>.
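    A minimal sketch of parsing one such line, assuming the last tab-separated field holds the space-separated words and every earlier field is a label. Labeled is a hypothetical stand-in with the same two fields (words, tags) as gensim's tagged documents; parse_labeled_line is likewise a made-up name.

```python
from collections import namedtuple

# Hypothetical stand-in mirroring the (words, tags) fields used by gensim.
Labeled = namedtuple("Labeled", ["words", "tags"])


def parse_labeled_line(line):
    """Split 'label<TAB>label<TAB>w1 w2 ...' into words and tags."""
    fields = line.rstrip("\n").split("\t")
    tags = fields[:-1]           # every field before the last is a label
    words = fields[-1].split()   # the last field holds the words
    return Labeled(words=words, tags=tags)


doc = parse_labeled_line("id\tcategory\tI like my cat demon")
print(doc)
```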

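    The output is a vector representation for every word in the vocabulary, so the vectors of an item's labels (its id and its category) can be concatenated and used as the vector representation of the item.
    See http://www.360doc.com/content/17/0814/15/17572791_679139034.shtml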
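    The concatenation step can be sketched with NumPy. The label names and the 3-dimensional vectors below are made up for illustration; in practice they would come from the trained model.

```python
import numpy as np

# Hypothetical learned vectors for an item id and a category label.
label_vectors = {
    "item_42": np.array([0.1, 0.2, 0.3]),
    "category_food": np.array([0.4, 0.5, 0.6]),
}


def item_vector(item_id, category):
    """Concatenate the id vector and the category vector into one vector."""
    return np.concatenate([label_vectors[item_id], label_vectors[category]])


vec = item_vector("item_42", "category_food")
print(vec.shape)
```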

    >>> from gensim.models.doc2vec import LabeledSentence
    >>> documents = LabeledSentence(words=[u'some', u'words', u'here'], labels=[u'SENT_1'])
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    TypeError: __new__() got an unexpected keyword argument 'labels'
    >>> documents = LabeledSentence(words=[u'some', u'words', u'here'], tags=[u'SENT_1'])
    >>> print(documents)
    LabeledSentence([u'some', u'words', u'here'], [u'SENT_1'])

    >>> from gensim.models.doc2vec import Doc2Vec
    >>> model =Doc2Vec(documents, size = 100, window = 5, min_count = 1, workers=4)
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/usr/lib64/python2.7/site-packages/gensim/models/doc2vec.py", line 641, in __init__
    self.build_vocab(documents, trim_rule=trim_rule)
    File "/usr/lib64/python2.7/site-packages/gensim/models/word2vec.py", line 577, in build_vocab
    self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule) # initial survey
    File "/usr/lib64/python2.7/site-packages/gensim/models/doc2vec.py", line 680, in scan_vocab
    if isinstance(document.words, string_types):
    AttributeError: 'list' object has no attribute 'words'

    The input to gensim.models.doc2vec.Doc2Vec should be an iterable of LabeledSentence objects (for example, a list). Try:
    >>> model =Doc2Vec([documents], size = 100, window = 5, min_count = 1, workers=4)
    >>> print model
    Doc2Vec(dm/m,d100,n5,w5,s0.001,t4)
    >>>

    https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb

    >>> print(model.infer_vector([u'some',u'here']))
    [ 3.02350195e-03 -2.47021206e-03 -4.23655838e-05 1.06619455e-05
    -2.07307865e-03 1.52201334e-03 -2.68392172e-03 4.86029405e-03
    -3.07570468e-03 -1.27961146e-04 3.59600926e-05 5.56750805e-04
    -1.86618324e-03 -2.78112385e-03 -3.24939704e-03 -4.69824160e-03
    -1.94230478e-03 3.41035030e-03 -1.96390250e-03 -3.12410085e-03
    2.32424913e-03 4.13724314e-03 -3.76667455e-03 4.44490695e-03
    4.86690132e-03 -1.01872580e-03 -4.15571406e-03 4.93804645e-03
    2.08313856e-03 -2.49790330e-03 2.88306503e-03 -2.11228104e-03
    -7.48132443e-05 -2.86692451e-03 1.31704379e-03 -3.49374721e-03
    2.85517215e-03 1.55686424e-03 2.88037118e-03 2.10905354e-03
    -8.35062645e-04 1.03656796e-03 3.66695994e-03 3.16017168e-03
    3.91360372e-03 1.89097866e-03 -4.97946097e-03 -1.25238323e-03
    -1.44126080e-03 3.26181017e-03 -6.02229848e-05 2.08685431e-03
    4.63444972e-03 2.12231209e-03 2.76103779e-03 -4.06579726e-04
    6.27412752e-04 3.08081333e-04 -3.25262197e-03 -4.00892925e-03
    3.97314038e-03 4.02647816e-03 1.02536182e-03 2.09628342e-04
    1.93663652e-03 -2.59007933e-03 2.82125012e-03 -4.11406020e-03
    8.89573072e-04 -2.25311797e-03 -2.08429853e-03 1.73660505e-04
    2.08250736e-03 1.53203832e-03 7.52889435e-04 -1.24395418e-03
    -3.14715598e-03 -4.88714431e-04 -3.19321570e-03 -1.17522234e-03
    3.58190737e-03 3.01620923e-03 -3.71830584e-03 -2.14487920e-03
    3.48089077e-03 1.65970484e-03 3.03952186e-03 1.13033829e-03
    2.58382503e-03 -4.09777975e-03 -8.57007224e-04 -2.81002838e-03
    -1.20109224e-04 3.29560786e-03 4.00114199e-03 -1.00307877e-03
    -3.04128020e-03 -3.20556248e-03 -3.60509683e-03 -3.22059076e-03]

  • Original article: https://www.cnblogs.com/vincentqliu/p/7806841.html