• Comparison of FastText and Word2Vec



     

    Facebook Research open-sourced a great project yesterday - fastText, a fast (no surprise) and effective method to learn word representations and perform text classification. I was curious how these embeddings compare to other commonly used embeddings, and word2vec seemed like the obvious choice, especially since fastText embeddings build on word2vec.

     

    Download data

    In [ ]:
    import nltk
    nltk.download('brown')
    # Only the brown corpus is needed; this fetches it in case you don't have it.
    # alternately, you can simply download the pretrained models below if you wish to avoid downloading and training
    
    # Generate brown corpus text file
    with open('brown_corp.txt', 'w+') as f:
        for word in nltk.corpus.brown.words():
            f.write('{word} '.format(word=word))
    
    In [ ]:
    # download the text8 corpus (a 100 MB sample of cleaned wikipedia text)
    # alternately, you can simply download the pretrained models below if you wish to avoid downloading and training
    !wget http://mattmahoney.net/dc/text8.zip
    # the archive contains a single file named text8
    !unzip text8.zip
    
    In [ ]:
    # download the file questions-words.txt to be used for comparing word embeddings
    !wget https://raw.githubusercontent.com/arfon/word2vec/master/questions-words.txt
    
     

    Train models

     

    If you wish to avoid training, you can download pre-trained models instead in the next section. For training the fastText models yourself, you'll have to follow the setup instructions for fastText and run the training with -

    In [ ]:
    !./fasttext skipgram -input brown_corp.txt -output brown_ft
    !./fasttext skipgram -input text8 -output text8_ft
    
     

    For training the gensim models -

    In [ ]:
    from nltk.corpus import brown
    from gensim.models import Word2Vec
    from gensim.models.word2vec import Text8Corpus
    import logging
    
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
    logging.root.setLevel(level=logging.INFO)
    
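    import os
    
    MODELS_DIR = 'models/'
    # make sure the output directory exists before saving into it
    if not os.path.isdir(MODELS_DIR):
        os.makedirs(MODELS_DIR)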
    
    brown_gs = Word2Vec(brown.sents())
    brown_gs.save_word2vec_format(MODELS_DIR + 'brown_gs.vec')
    
    text8_gs = Word2Vec(Text8Corpus('text8'))
    text8_gs.save_word2vec_format(MODELS_DIR + 'text8_gs.vec')
    
     

    Download models

    In case you wish to avoid downloading the corpus and training the models, you can download pretrained models with -

    In [ ]:
    # download the fastText and gensim models trained on the brown corpus and text8 corpus
    !wget https://www.dropbox.com/s/4kray3epy439gca/models.tar.gz?dl=1 -O models.tar.gz
    
     

    Once you have downloaded or trained the models (make sure they're in the models/ directory, or that you've appropriately changed MODELS_DIR) and downloaded questions-words.txt, you're ready to run the comparison.
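
    Before running the comparison, it can be worth checking that each .vec file is readable. The sketch below assumes the files are in the word2vec text format, whose first line holds the vocabulary size and the vector dimension. read_vec_header is a hypothetical helper, not part of gensim or fastText, and the tiny file it is run on here is hand-written toy data rather than real model output.

```python
import os
import tempfile

def read_vec_header(path):
    """Return (vocab_size, dimensions) from the first line of a word2vec-format text file."""
    with open(path, encoding='utf-8') as f:
        vocab_size, dims = f.readline().split()
    return int(vocab_size), int(dims)

# Self-contained demonstration on a tiny hand-written file (toy vectors only)
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, 'toy.vec')
    with open(toy_path, 'w', encoding='utf-8') as f:
        f.write('2 3\n')                # header: 2 words, 3 dimensions
        f.write('calm 0.1 0.2 0.3\n')
        f.write('calmly 0.2 0.1 0.4\n')
    toy_header = read_vec_header(toy_path)
    print(toy_header)
```

    For the real files you would call something like read_vec_header(MODELS_DIR + 'brown_ft.vec') and check that the reported dimension matches what you trained with.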

     

    Comparisons

    In [1]:
    from gensim.models import Word2Vec
    
    def print_accuracy(model, questions_file):
        print('Evaluating...\n')
        # acc is a list of per-section results; the last entry is the overall 'total'
        acc = model.accuracy(questions_file)
        for section in acc:
            correct = len(section['correct'])
            total = len(section['correct']) + len(section['incorrect'])
            # sections with no in-vocabulary questions are reported as 0/1 to avoid dividing by zero
            total = total if total else 1
            accuracy = 100*float(correct)/total
            print('{:d}/{:d}, {:.2f}%, Section: {:s}'.format(correct, total, accuracy, section['section']))
        
        # the first five sections of questions-words.txt are the semantic analogies
        sem_correct = sum((len(acc[i]['correct']) for i in range(5)))
        sem_total = sum((len(acc[i]['correct']) + len(acc[i]['incorrect'])) for i in range(5))
        print('\nSemantic: {:d}/{:d}, Accuracy: {:.2f}%'.format(sem_correct, sem_total, 100*float(sem_correct)/sem_total))
        
        # the remaining sections, excluding the final 'total' entry, are the syntactic analogies
        syn_correct = sum((len(acc[i]['correct']) for i in range(5, len(acc)-1)))
        syn_total = sum((len(acc[i]['correct']) + len(acc[i]['incorrect'])) for i in range(5,len(acc)-1))
        print('Syntactic: {:d}/{:d}, Accuracy: {:.2f}%\n'.format(syn_correct, syn_total, 100*float(syn_correct)/syn_total))
    
    MODELS_DIR = 'models/'
    
    word_analogies_file = 'questions-words.txt'
    print('\nLoading FastText embeddings')
    ft_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'brown_ft.vec')
    print('Accuracy for FastText:')
    print_accuracy(ft_model, word_analogies_file)
    
    print('\nLoading Gensim embeddings')
    gs_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'brown_gs.vec')
    print('Accuracy for word2vec:')
    print_accuracy(gs_model, word_analogies_file)
    
     
    Loading FastText embeddings
    Accuracy for FastText:
    Evaluating...
    
    0/1, 0.00%, Section: capital-common-countries
    0/1, 0.00%, Section: capital-world
    0/1, 0.00%, Section: currency
    0/1, 0.00%, Section: city-in-state
    27/182, 14.84%, Section: family
    539/702, 76.78%, Section: gram1-adjective-to-adverb
    106/132, 80.30%, Section: gram2-opposite
    656/1056, 62.12%, Section: gram3-comparative
    136/210, 64.76%, Section: gram4-superlative
    439/650, 67.54%, Section: gram5-present-participle
    0/1, 0.00%, Section: gram6-nationality-adjective
    165/1260, 13.10%, Section: gram7-past-tense
    327/552, 59.24%, Section: gram8-plural
    245/342, 71.64%, Section: gram9-plural-verbs
    2640/5086, 51.91%, Section: total
    
    Semantic: 27/182, Accuracy: 14.84%
    Syntactic: 2613/4904, Accuracy: 53.28%
    
    
    Loading Gensim embeddings
    Accuracy for word2vec:
    Evaluating...
    
    0/1, 0.00%, Section: capital-common-countries
    0/1, 0.00%, Section: capital-world
    0/1, 0.00%, Section: currency
    0/1, 0.00%, Section: city-in-state
    53/182, 29.12%, Section: family
    8/702, 1.14%, Section: gram1-adjective-to-adverb
    0/132, 0.00%, Section: gram2-opposite
    75/1056, 7.10%, Section: gram3-comparative
    0/210, 0.00%, Section: gram4-superlative
    16/650, 2.46%, Section: gram5-present-participle
    0/1, 0.00%, Section: gram6-nationality-adjective
    30/1260, 2.38%, Section: gram7-past-tense
    4/552, 0.72%, Section: gram8-plural
    8/342, 2.34%, Section: gram9-plural-verbs
    194/5086, 3.81%, Section: total
    
    Semantic: 53/182, Accuracy: 29.12%
    Syntactic: 141/4904, Accuracy: 2.88%
    
    
     

    On the brown corpus, word2vec embeddings do better than fastText embeddings at the semantic tasks (29.12% vs 14.84%), although only the family section had any questions within the vocabulary, so that comparison rests on just 182 questions. The fastText embeddings do significantly better on the syntactic analogies (53.28% vs 2.88%). This makes sense, since fastText embeddings are trained to capture morphological nuances, and most of the syntactic analogies are morphology based.

    Let me explain that better.

    According to the paper [1], the embedding of a word is the sum of the embeddings of its character n-grams. This is meant to be useful for morphologically rich languages - so theoretically, the embedding for apparently would include information from both character n-grams apparent and ly (as well as other n-grams), and the n-grams would combine in a simple, linear manner. Most of our syntactic tasks have the same structure: a fixed suffix is added to or removed from a stem.
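
    A minimal sketch of that idea, assuming n-grams of length 3 to 6 taken from the word wrapped in < and > boundary markers (the fastText defaults as I understand them). With these bounds the suffix shows up as ly> rather than a bare ly. toy_vector is a hypothetical stand-in for learned n-gram embeddings, and real fastText additionally hashes n-grams into buckets and includes a vector for the whole word, both of which are left out here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = '<' + word + '>'
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

def toy_vector(ngram, dims=4):
    """Deterministic stand-in for a learned n-gram embedding (illustration only)."""
    seed = sum(ord(c) for c in ngram)
    return [((seed * (i + 1)) % 7) / 7.0 for i in range(dims)]

def word_vector(word, dims=4):
    """Sum the vectors of all n-grams of the word, component by component."""
    total = [0.0] * dims
    for gram in char_ngrams(word):
        for i, value in enumerate(toy_vector(gram, dims)):
            total[i] += value
    return total

grams = char_ngrams('apparently')
print(len(grams), grams[:5])
print(word_vector('apparently'))
```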

    Example analogy:

    amazing amazingly calm calmly

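    This analogy is marked correct if calmly is the word whose embedding is closest to embedding(amazingly) - embedding(amazing) + embedding(calm), which is the case when:

    embedding(amazing) - embedding(amazingly) ≈ embedding(calm) - embedding(calmly)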

    Both these subtractions would result in a very similar set of remaining ngrams. No surprise the fastText embeddings do extremely well on this.
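
    To make the subtraction concrete, the sketch below lists the n-grams that each -ly form gains over its stem and the ones the two forms have in common. It assumes the same 3 to 6 character n-grams with < and > boundary markers as fastText's defaults; ngram_set and added_ngrams are hypothetical helper names, and the sketch ignores the learned vectors entirely.

```python
def ngram_set(word, min_n=3, max_n=6):
    """Set of character n-grams of a word wrapped in '<' and '>' markers."""
    wrapped = '<' + word + '>'
    return {wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)}

def added_ngrams(base, derived):
    """N-grams present in the derived form but not in the base form."""
    return ngram_set(derived) - ngram_set(base)

added_amazing = added_ngrams('amazing', 'amazingly')
added_calm = added_ngrams('calm', 'calmly')
shared = added_amazing & added_calm

print(sorted(added_amazing))
print(sorted(added_calm))
print(sorted(shared))  # ['ly>']
```

    Under these assumptions the shared part is the suffix n-gram ly>; the other added n-grams straddle the stem and differ between the two words.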

    A brief note on hyperparameters - the Gensim word2vec implementation and the fastText word embedding implementation use largely the same defaults (dim_size = 100, window_size = 5, num_epochs = 5). Of course, they are two completely different models (albeit with a few similarities).

    Let's try with a larger corpus now - text8 (collection of wiki articles). I'm especially curious about the impact on semantic accuracy - for models trained on the brown corpus, the difference in the semantic accuracy and the accuracy values themselves are too small to be conclusive. Hopefully a larger corpus helps, and the text8 corpus likely has a lot more information about capitals, currencies, cities etc, which should be relevant to the semantic tasks.

    In [2]:
    print('Loading FastText embeddings')
    ft_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'text8_ft.vec')
    print('Accuracy for FastText:')
    print_accuracy(ft_model, word_analogies_file)
    
    print('Loading Gensim embeddings')
    gs_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'text8_gs.vec')
    print('Accuracy for word2vec:')
    print_accuracy(gs_model, word_analogies_file)
    
     
    Loading FastText embeddings
    Accuracy for FastText:
    Evaluating...
    
    298/506, 58.89%, Section: capital-common-countries
    625/1452, 43.04%, Section: capital-world
    37/268, 13.81%, Section: currency
    291/1511, 19.26%, Section: city-in-state
    151/306, 49.35%, Section: family
    567/756, 75.00%, Section: gram1-adjective-to-adverb
    188/306, 61.44%, Section: gram2-opposite
    809/1260, 64.21%, Section: gram3-comparative
    303/506, 59.88%, Section: gram4-superlative
    528/992, 53.23%, Section: gram5-present-participle
    1291/1371, 94.16%, Section: gram6-nationality-adjective
    451/1332, 33.86%, Section: gram7-past-tense
    853/992, 85.99%, Section: gram8-plural
    360/650, 55.38%, Section: gram9-plural-verbs
    6752/12208, 55.31%, Section: total
    
    Semantic: 1402/4043, Accuracy: 34.68%
    Syntactic: 5350/8165, Accuracy: 65.52%
    
    Loading Gensim embeddings
    Accuracy for word2vec:
    Evaluating...
    
    138/506, 27.27%, Section: capital-common-countries
    248/1452, 17.08%, Section: capital-world
    28/268, 10.45%, Section: currency
    158/1571, 10.06%, Section: city-in-state
    227/306, 74.18%, Section: family
    85/756, 11.24%, Section: gram1-adjective-to-adverb
    54/306, 17.65%, Section: gram2-opposite
    739/1260, 58.65%, Section: gram3-comparative
    178/506, 35.18%, Section: gram4-superlative
    297/992, 29.94%, Section: gram5-present-participle
    718/1371, 52.37%, Section: gram6-nationality-adjective
    325/1332, 24.40%, Section: gram7-past-tense
    389/992, 39.21%, Section: gram8-plural
    200/650, 30.77%, Section: gram9-plural-verbs
    3784/12268, 30.84%, Section: total
    
    Semantic: 799/4103, Accuracy: 19.47%
    Syntactic: 2985/8165, Accuracy: 36.56%
    
    
     

    With the text8 corpus, the semantic accuracy of the fastText model increases significantly (from 14.84% to 34.68%), and fastText surpasses word2vec on both the semantic (34.68% vs 19.47%) and syntactic (65.52% vs 36.56%) analogies. However, the gain in syntactic accuracy from the larger corpus is much higher for word2vec (2.88% to 36.56%) than for fastText (53.28% to 65.52%).
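
    The change between the two corpora can be worked out directly from the counts in the evaluation output above. This is only a sketch of the arithmetic: the numbers are copied from those outputs, and because the number of in-vocabulary questions differs between corpora (and slightly between models on text8), the differences are indicative rather than a like-for-like comparison.

```python
# (correct, total) counts copied from the 'Semantic' and 'Syntactic' summary lines above
results = {
    'fastText': {'brown': {'semantic': (27, 182), 'syntactic': (2613, 4904)},
                 'text8': {'semantic': (1402, 4043), 'syntactic': (5350, 8165)}},
    'word2vec': {'brown': {'semantic': (53, 182), 'syntactic': (141, 4904)},
                 'text8': {'semantic': (799, 4103), 'syntactic': (2985, 8165)}},
}

def pct(correct, total):
    """Accuracy as a percentage."""
    return 100.0 * correct / total

def gain(model, task):
    """Change in accuracy, in percentage points, going from brown to text8."""
    return pct(*results[model]['text8'][task]) - pct(*results[model]['brown'][task])

for model in ('fastText', 'word2vec'):
    for task in ('semantic', 'syntactic'):
        print('{} {}: {:+.2f} points'.format(model, task, gain(model, task)))
```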

    These preliminary results seem to indicate fastText embeddings might be better than word2vec at encoding semantic and especially syntactic information. It'd be interesting to see how transferable these embeddings are by comparing their performance in a downstream supervised task.

     

    References

  • Original article: https://www.cnblogs.com/timssd/p/7163853.html