Comparison of FastText and Word2Vec

Facebook Research open sourced a great project yesterday - fastText, a fast (no surprise) and effective method to learn word representations and perform text classification. I was curious about comparing these embeddings to other commonly used embeddings, so word2vec seemed like the obvious choice, especially considering fastText embeddings are based upon word2vec.

Download data

In [ ]:

import nltk
nltk.download() 
# Only the brown corpus is needed in case you don't have it.
# alternately, you can simply download the pretrained models below if you wish to avoid downloading and training

# Generate brown corpus text file
with open('brown_corp.txt', 'w+') as f:
    for word in nltk.corpus.brown.words():
        f.write('{word} '.format(word=word))

In [ ]:

# download the text8 corpus (a 100 MB sample of cleaned wikipedia text)
# alternately, you can simply download the pretrained models below if you wish to avoid downloading and training
!wget http://mattmahoney.net/dc/text8.zip

In [ ]:

# download the file questions-words.txt to be used for comparing word embeddings
!wget https://raw.githubusercontent.com/arfon/word2vec/master/questions-words.txt

Train models

If you wish to avoid training, you can download pre-trained models instead in the next section. For training the fastText models yourself, you'll have to follow the setup instructions for fastText and run the training with -

In [ ]:

!./fasttext skipgram -input brown_corp.txt -output brown_ft
!./fasttext skipgram -input text8.txt -output text8_ft

For training the gensim models -

In [ ]:

from nltk.corpus import brown
from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
logging.root.setLevel(level=logging.INFO)

MODELS_DIR = 'models/'

brown_gs = Word2Vec(brown.sents())
brown_gs.save_word2vec_format(MODELS_DIR + 'brown_gs.vec')

text8_gs = Word2Vec(Text8Corpus('text8'))
text8_gs.save_word2vec_format(MODELS_DIR + 'text8_gs.vec')

Download models

In case you wish to avoid downloading the corpus and training the models, you can download pretrained models with -

In [ ]:

# download the fastText and gensim models trained on the brown corpus and text8 corpus
!wget https://www.dropbox.com/s/4kray3epy439gca/models.tar.gz?dl=1 -O models.tar.gz

Once you have downloaded or trained the models (make sure they're in the models/ directory, or that you've appropriately changed MODELS_DIR) and downloaded questions-words.txt, you're ready to run the comparison.

Comparisons

In [1]:

from gensim.models import Word2Vec

def print_accuracy(model, questions_file):
    print('Evaluating...
')
    acc = model.accuracy(questions_file)
    for section in acc:
        correct = len(section['correct'])
        total = len(section['correct']) + len(section['incorrect'])
        total = total if total else 1
        accuracy = 100*float(correct)/total
        print('{:d}/{:d}, {:.2f}%, Section: {:s}'.format(correct, total, accuracy, section['section']))
    sem_correct = sum((len(acc[i]['correct']) for i in range(5)))
    sem_total = sum((len(acc[i]['correct']) + len(acc[i]['incorrect'])) for i in range(5))
    print('
Semantic: {:d}/{:d}, Accuracy: {:.2f}%'.format(sem_correct, sem_total, 100*float(sem_correct)/sem_total))
    
    syn_correct = sum((len(acc[i]['correct']) for i in range(5, len(acc)-1)))
    syn_total = sum((len(acc[i]['correct']) + len(acc[i]['incorrect'])) for i in range(5,len(acc)-1))
    print('Syntactic: {:d}/{:d}, Accuracy: {:.2f}%
'.format(syn_correct, syn_total, 100*float(syn_correct)/syn_total))

MODELS_DIR = 'models/'

word_analogies_file = 'questions-words.txt'
print('
Loading FastText embeddings')
ft_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'brown_ft.vec')
print('Accuracy for FastText:')
print_accuracy(ft_model, word_analogies_file)

print('
Loading Gensim embeddings')
gs_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'brown_gs.vec')
print('Accuracy for word2vec:')
print_accuracy(gs_model, word_analogies_file)

Loading FastText embeddings
Accuracy for FastText:
Evaluating...

0/1, 0.00%, Section: capital-common-countries
0/1, 0.00%, Section: capital-world
0/1, 0.00%, Section: currency
0/1, 0.00%, Section: city-in-state
27/182, 14.84%, Section: family
539/702, 76.78%, Section: gram1-adjective-to-adverb
106/132, 80.30%, Section: gram2-opposite
656/1056, 62.12%, Section: gram3-comparative
136/210, 64.76%, Section: gram4-superlative
439/650, 67.54%, Section: gram5-present-participle
0/1, 0.00%, Section: gram6-nationality-adjective
165/1260, 13.10%, Section: gram7-past-tense
327/552, 59.24%, Section: gram8-plural
245/342, 71.64%, Section: gram9-plural-verbs
2640/5086, 51.91%, Section: total

Semantic: 27/182, Accuracy: 14.84%
Syntactic: 2613/4904, Accuracy: 53.28%


Loading Gensim embeddings
Accuracy for word2vec:
Evaluating...

0/1, 0.00%, Section: capital-common-countries
0/1, 0.00%, Section: capital-world
0/1, 0.00%, Section: currency
0/1, 0.00%, Section: city-in-state
53/182, 29.12%, Section: family
8/702, 1.14%, Section: gram1-adjective-to-adverb
0/132, 0.00%, Section: gram2-opposite
75/1056, 7.10%, Section: gram3-comparative
0/210, 0.00%, Section: gram4-superlative
16/650, 2.46%, Section: gram5-present-participle
0/1, 0.00%, Section: gram6-nationality-adjective
30/1260, 2.38%, Section: gram7-past-tense
4/552, 0.72%, Section: gram8-plural
8/342, 2.34%, Section: gram9-plural-verbs
194/5086, 3.81%, Section: total

Semantic: 53/182, Accuracy: 29.12%
Syntactic: 141/4904, Accuracy: 2.88%

Word2vec embeddings seem to be slightly better than fastText embeddings at the semantic tasks, while the fastText embeddings do significantly better on the syntactic analogies. Makes sense, since fastText embeddings are trained for understanding morphological nuances, and most of the syntactic analogies are morphology based.

Let me explain that better.

According to the paper [1], embeddings for words are represented by the sum of their n-gram embeddings. This is meant to be useful for morphologically rich languages - so theoretically, the embedding for apparently would include information from both character n-grams apparent and ly (as well as other n-grams), and the n-grams would combine in a simple, linear manner. This is very similar to what most of our syntactic tasks look like.

Example analogy:

amazing amazingly calm calmly

This analogy is marked correct if:

embedding(amazing) - embedding(amazingly) = embedding(calm) - embedding(calmly)

Both these subtractions would result in a very similar set of remaining ngrams. No surprise the fastText embeddings do extremely well on this.

A brief note on hyperparameters - the Gensim word2vec implementation and the fastText word embedding implementation use largely the same defaults (dim_size = 100, window_size = 5, num_epochs = 5). Of course, they are two completely different models (albeit, with a few similarities).

Let's try with a larger corpus now - text8 (collection of wiki articles). I'm especially curious about the impact on semantic accuracy - for models trained on the brown corpus, the difference in the semantic accuracy and the accuracy values themselves are too small to be conclusive. Hopefully a larger corpus helps, and the text8 corpus likely has a lot more information about capitals, currencies, cities etc, which should be relevant to the semantic tasks.

In [2]:

print('Loading FastText embeddings')
ft_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'text8_ft.vec')
print('Accuracy for FastText:')
print_accuracy(ft_model, word_analogies_file)

print('Loading Gensim embeddings')
gs_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'text8_gs.vec')
print('Accuracy for word2vec:')
print_accuracy(gs_model, word_analogies_file)

Loading FastText embeddings
Accuracy for FastText:
Evaluating...

298/506, 58.89%, Section: capital-common-countries
625/1452, 43.04%, Section: capital-world
37/268, 13.81%, Section: currency
291/1511, 19.26%, Section: city-in-state
151/306, 49.35%, Section: family
567/756, 75.00%, Section: gram1-adjective-to-adverb
188/306, 61.44%, Section: gram2-opposite
809/1260, 64.21%, Section: gram3-comparative
303/506, 59.88%, Section: gram4-superlative
528/992, 53.23%, Section: gram5-present-participle
1291/1371, 94.16%, Section: gram6-nationality-adjective
451/1332, 33.86%, Section: gram7-past-tense
853/992, 85.99%, Section: gram8-plural
360/650, 55.38%, Section: gram9-plural-verbs
6752/12208, 55.31%, Section: total

Semantic: 1402/4043, Accuracy: 34.68%
Syntactic: 5350/8165, Accuracy: 65.52%

Loading Gensim embeddings
Accuracy for word2vec:
Evaluating...

138/506, 27.27%, Section: capital-common-countries
248/1452, 17.08%, Section: capital-world
28/268, 10.45%, Section: currency
158/1571, 10.06%, Section: city-in-state
227/306, 74.18%, Section: family
85/756, 11.24%, Section: gram1-adjective-to-adverb
54/306, 17.65%, Section: gram2-opposite
739/1260, 58.65%, Section: gram3-comparative
178/506, 35.18%, Section: gram4-superlative
297/992, 29.94%, Section: gram5-present-participle
718/1371, 52.37%, Section: gram6-nationality-adjective
325/1332, 24.40%, Section: gram7-past-tense
389/992, 39.21%, Section: gram8-plural
200/650, 30.77%, Section: gram9-plural-verbs
3784/12268, 30.84%, Section: total

Semantic: 799/4103, Accuracy: 19.47%
Syntactic: 2985/8165, Accuracy: 36.56%

With the text8 corpus, the semantic accuracy for the fastText model increases significantly, and it surpasses word2vec on accuracies for both semantic and syntactical analogies. However, the increase in syntactic accuracy from the increase in corpus size is much higher for word2vec

These preliminary results seem to indicate fastText embeddings might be better than word2vec at encoding semantic and especially syntactic information. It'd be interesting to see how transferable these embeddings are by comparing their performance in a downstream supervised task.

References

[1] Enriching Word Vectors with Subword Information

[2] Efficient Estimation of Word Representations in Vector Space

相关阅读:
python进阶学习chapter04（字符串相关）
python进阶学习chapter03（迭代相关）
python学习笔记之collections模块的使用
 python进阶学习chapter02（列表、字典、集合操作）
python接口测试之json模块的使用
 python接口测试之如何发送邮件
 python接口测试之如何操作excel
python unittest库的入门学习
 python requests库学习笔记
 重建二叉树*
原文地址：https://www.cnblogs.com/timssd/p/7163853.html