• 1. Using word2vec


    1. Handling short sentences

    from gensim.models import Word2Vec
    sentences = [["Python", "深度学习", "机器学习"], ["NLP", "深度学习", "机器学习"]]
    model = Word2Vec(sentences, min_count=1)

    Note: passing a built-in Python list as input is convenient, but when the input is large it consumes a lot of memory.
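    Once trained, the learned vectors can be read straight off the model (a minimal sketch against the gensim 3.x API used throughout this post):

    print(model.wv["Python"])                        # the learned vector for "Python"
    print(model.wv.most_similar("深度学习", topn=1))  # nearest neighbour in this tiny corpus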

    2. When the corpus is a file

    1) Gensim only needs an iterable of sentences as input; it does not have to be an in-memory list, so there is no need to keep everything in memory: load one sentence, process it, forget it, and load the next.

    2) Usually the corpus is stored in a file. First, make sure every line of the file holds exactly one sentence, already segmented and separated by spaces; a streaming sketch follows below.
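    A minimal sketch of such a streaming input for a single file (corpus.txt is a placeholder name). Note that a plain generator is not enough: Word2Vec iterates over the corpus more than once (once to build the vocabulary, then once per training epoch), so the input must be restartable, which is why it is wrapped in a class with __iter__:

    from gensim.models import Word2Vec

    class FileSentences(object):
        def __init__(self, fname):
            self.fname = fname

        def __iter__(self):
            # One line = one pre-segmented sentence; yield it and move on,
            # so the whole corpus never sits in memory at once.
            with open(self.fname, encoding="utf8") as f:
                for line in f:
                    yield line.split()

    model = Word2Vec(FileSentences("corpus.txt"), min_count=1)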

    3. Working over every file in a directory

    Suppose the files are already segmented. If the words still need further preprocessing, such as removing digits or extracting named entities, all of that can happen inside the MySentences iterator, so that word2vec receives a fully processed stream.

    import os
    import gensim

    class MySentences(object):
        def __init__(self, dirname):
            self.dirname = dirname

        def __iter__(self):
            # Walk every file in the directory; one line = one sentence.
            for fname in os.listdir(self.dirname):
                for line in open(os.path.join(self.dirname, fname)):
                    yield line.split()

    sentences = MySentences('/some/directory')  # a memory-friendly iterator
    model = gensim.models.Word2Vec(sentences)

    4. For a single file

    class: gensim.models.word2vec.LineSentence

    When each line corresponds to one sentence (already segmented, space-separated), LineSentence converts a txt file into the required format directly.

    What LineSentence does: "Iterate over a file that contains sentences: one line = one sentence. Words must be already preprocessed and separated by whitespace."

    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence
    from gensim.test.utils import get_tmpfile

    # inp is the input corpus: one segmented, space-separated sentence per line
    inp = 'wiki.zh.text.jian.seg.txt'
    sentences = LineSentence(inp)
    path = get_tmpfile("word2vec.model")  # create a temporary file for the model
    model = Word2Vec(sentences, size=100, window=5, min_count=1)
    model.save(path)
     

    gensim.models.word2vec.LineSentence(source, max_sentence_length=10000, limit=None)

    This preprocessing class caps the maximum sentence length and, via limit, the number of lines read from the file.
    Given a segmented file, a typical NLP pipeline would next remove stop words. But the word2vec algorithm relies on context, and that context may itself consist of stop words, so for word2vec stop-word removal can be skipped.
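    For example, to sanity-check settings on only part of a large corpus, limit caps how many lines are read (a sketch reusing the segmented wiki file from above):

    from gensim.models.word2vec import LineSentence

    # Read only the first 10000 lines; lines longer than
    # max_sentence_length tokens are split into several sentences.
    sentences = LineSentence('wiki.zh.text.jian.seg.txt',
                             max_sentence_length=10000, limit=10000)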

    5. Obtaining a corpus

    1) A sample corpus (the novel used in the code below): https://files-cdn.cnblogs.com/files/pinard/in_the_name_of_people.zip

    or

    class gensim.models.word2vec.Text8Corpus(fname, max_sentence_length=10000)
    Bases: object
    Iterate over sentences from the "text8" corpus, unzipped from http://mattmahoney.net/dc/text8.zip. The max_sentence_length parameter caps the length of each yielded sentence.
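    A minimal usage sketch, assuming text8 has been downloaded and unzipped into the working directory:

    from gensim.models import Word2Vec
    from gensim.models.word2vec import Text8Corpus

    sentences = Text8Corpus("text8")       # streams space-separated English words
    model = Word2Vec(sentences, size=100)  # gensim 3.x parameter name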

    2) Code

    import jieba
    import gensim
    from gensim.models import Word2Vec

    # Segment the novel with jieba and write the result as space-separated tokens.
    with open("in_the_name_of_people.txt", encoding="utf8") as f:
        document = f.read()
        document_cut = jieba.cut(document)
        result = " ".join(document_cut)
        with open("segment.txt", "w", encoding="utf8") as fout:
            fout.write(result)

    sentences = gensim.models.word2vec.LineSentence("segment.txt")
    model = Word2Vec(sentences, hs=0, min_count=5, window=5, size=100)
    # context window size: window=5
    # ignore low-frequency terms: min_count=5
    # CBOW or skip-gram? sg=0 selects CBOW
    # hierarchical softmax or negative sampling? hs=0 selects negative sampling
    # number of negative samples: negative=5 (usually 5-20)
    # smoothing exponent for the negative-sampling distribution: ns_exponent=0.75
    # downsampling threshold for high-frequency words: sample=0.001
    model.save("word2vec.model")
    model = Word2Vec.load("word2vec.model")
    for key in model.wv.similar_by_word('检察院', topn=10):
        print(key)
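    To produce a .bin file like the one loaded in the next section, the trained vectors can also be exported in the original C binary format (a sketch reusing the model above; people.bin is a placeholder name):

    from gensim.models import KeyedVectors

    # Save only the word vectors, without the full training state ...
    model.wv.save_word2vec_format("people.bin", binary=True)
    # ... and load them back later as read-only KeyedVectors.
    wv = KeyedVectors.load_word2vec_format("people.bin", binary=True)
    print(wv.similar_by_word('检察院', topn=3))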

     Loading a model from a .bin file:

    import jieba
    import torch
    import gensim

    # Load pre-trained word vectors stored in the binary word2vec (C) format.
    # The .bin filename below comes from the original post and must exist locally.
    model = gensim.models.KeyedVectors.load_word2vec_format('baike_26g_news_13g_novel_229g.bin', binary=True)
    sentence1 = "北京是中华人民共和国的首都"
    sentence2 = "人民民主"
    cut1 = jieba.cut(sentence1)
    cut2 = jieba.cut(sentence2)

    def getNumPyVec(list_cut):
        # Look up each token's vector (tokens must be in the vocabulary)
        # and stack them into a (num_tokens, vector_dim) torch tensor.
        vecList = []
        for x in list_cut:
            vecList.append(model[x])
        torch_list = torch.tensor(vecList)
        print(torch_list.shape)
        return torch_list

    l1 = getNumPyVec(cut1)
    l2 = getNumPyVec(cut2)
    
    The resulting word-vector tensors can then be fed to an LSTM; a quick shape check with random data:

    import torch
    import torch.nn as nn

    # input_size = feature size of the input, hidden_size = feature size of the
    # hidden state, num_layers = number of stacked LSTM layers
    lstm = nn.LSTM(128, 20, 4)        # (input_size, hidden_size, num_layers)
    h0 = torch.randn(4, 3, 20)        # (num_layers * 1, batch_size, hidden_size)
    c0 = torch.randn(4, 3, 20)        # (num_layers * 1, batch_size, hidden_size)
    inputs = torch.randn(10, 3, 128)  # (seq_len, batch_size, input_size)
    output, (hn, cn) = lstm(inputs, (h0, c0))