• How fastText works


     The model's optimization objective is to minimize:

     $$ -\frac{1}{N}\sum_{n=1}^{N} y_n \log\big(f(B A x_n)\big) $$

    Here $\langle x_n, y_n\rangle$ is one training sample: $y_n$ is the target label and $x_n$ is a normalized bag of features. The parameter matrix $A$ is a word-level look-up table, i.e. the rows of $A$ are the word embedding vectors. The matrix product $Ax_n$ amounts to looking up the embedding vectors of the words in the sample and summing (or averaging) them, which yields the hidden vector. The parameter matrix $B$ belongs to the function $f$; since this is a multi-class problem, $f(BAx_n)$ is a linear multi-class classifier (a softmax). The objective is to make the likelihood of this multi-class problem as large as possible.
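    The pipeline just described (look up embeddings via $A$, average them into a hidden vector, classify with $B$ followed by a softmax) can be sketched in a few lines of numpy. This is an illustrative sketch, not the actual fastText implementation; the shapes and random weights are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim, num_classes = 100, 16, 4

# A: word look-up table (each row is a word embedding).
# B: weights of the linear classifier f.
A = rng.normal(size=(vocab_size, embed_dim))
B = rng.normal(size=(num_classes, embed_dim))

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def predict(word_ids):
    """Average the word embeddings (A x_n), then apply the linear
    classifier and softmax (f(B A x_n))."""
    hidden = A[word_ids].mean(axis=0)   # the hidden vector
    return softmax(B @ hidden)          # class probabilities

probs = predict([3, 17, 42])
print(probs.shape)   # (4,)
```

    Training would then adjust $A$ and $B$ by gradient descent on the negative log-likelihood above.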
    Expressed as a graph model, the objective corresponds to: input word features → embedding look-up ($A$) → averaged hidden vector → linear classifier ($B$) → softmax output. (The original figure illustrating this architecture is not preserved here.)

    Comparison with word2vec:

    Similarities:

    1. The model structures are very similar: both use embedding vectors to obtain a latent representation of each word.
    2. Both use many of the same optimization techniques, e.g. hierarchical softmax to speed up the scoring step during training and prediction.
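    To illustrate why hierarchical softmax speeds up scoring: instead of normalizing over all $K$ classes, the probability of a class is a product of binary (sigmoid) decisions along its path in a binary tree, costing $O(\log K)$ dot products instead of $O(K)$. The toy tree, paths, and weights below are invented for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
embed_dim = 8

# Toy binary tree over 4 classes: 3 internal nodes, each with a weight vector.
node_w = rng.normal(size=(3, embed_dim))

# Path of each class (leaf) from the root: (internal node index, branch sign).
# +1 = left branch, -1 = right branch, in this toy layout.
paths = {
    0: [(0, +1), (1, +1)],
    1: [(0, +1), (1, -1)],
    2: [(0, -1), (2, +1)],
    3: [(0, -1), (2, -1)],
}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def class_prob(hidden, label):
    """P(label | hidden) as a product of binary decisions along the path."""
    p = 1.0
    for node, sign in paths[label]:
        p *= sigmoid(sign * node_w[node] @ hidden)
    return p

hidden = rng.normal(size=embed_dim)
total = sum(class_prob(hidden, k) for k in range(4))
print(abs(total - 1.0) < 1e-9)   # leaf probabilities sum to 1
```

    Because $\sigma(x) + \sigma(-x) = 1$ at every internal node, the probabilities over all leaves sum to exactly 1 without ever computing a full softmax.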

    Differences:

    1. word2vec is an unsupervised algorithm, while fastText (as a classifier) is supervised. word2vec's training target is the surrounding (skipped) words, whereas fastText's training target is the human-annotated class label.

    2. word2vec treats each word in the corpus as an atomic entity and generates one vector per word. fastText (essentially an extension of the word2vec model) treats each word as composed of character n-grams: each word produces multiple character n-grams (with a configurable length range), each n-gram gets its own vector, and the word's vector is the sum of its n-gram vectors. For example, the vector for the word "apple" is a sum of the vectors of the n-grams "<ap", "app", "appl", "apple", "apple>", "ppl", "pple", "pple>", "ple", "ple>", "le>" (assuming the hyperparameter for the smallest n-gram [minn] is 3 and for the largest n-gram [maxn] is 6). This difference manifests as follows.

    1. fastText generates better embeddings for rare words: even if a word is rare, its character n-grams are still shared with other words, so its embedding can still be good.
    2. Out-of-vocabulary words: even a word that never appears in the training corpus can be assigned a vector, constructed from its character n-grams.
    3. From a practical usage standpoint, the choice of hyperparameters for generating fastText embeddings becomes key: since training operates at the character n-gram level, generating fastText embeddings takes longer than word2vec, and the hyperparameters controlling the minimum and maximum n-gram sizes have a direct bearing on this time.
    4. Using character embeddings (individual characters, as opposed to n-grams) for downstream tasks has recently been shown to boost the performance of those tasks compared to word embeddings such as word2vec or GloVe.
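    The character n-gram decomposition described above can be sketched as follows. This is a simplified illustration: real fastText additionally hashes n-grams into a fixed number of buckets, which is omitted here.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams of a word with '<'/'>' boundary markers,
    plus the whole bracketed word itself as its own feature."""
    w = "<" + word + ">"
    grams = set()
    for n in range(minn, maxn + 1):
        for i in range(len(w) - n + 1):
            grams.add(w[i:i + n])
    grams.add(w)  # the full "<apple>" token is also kept
    return grams

grams = char_ngrams("apple")
print(len(grams))  # 15 distinct n-grams for "apple" with minn=3, maxn=6
```

    The word's embedding is then the sum of the vectors of these n-grams, which is what makes out-of-vocabulary words representable: `char_ngrams` works on any string, seen in training or not.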

    https://heleifz.github.io/14732610572844.html

    https://arxiv.org/pdf/1607.04606v1.pdf

    http://www.jianshu.com/p/b7ede4e842f1

    https://www.quora.com/What-is-the-main-difference-between-word2vec-and-fastText

  • Original post: https://www.cnblogs.com/ljygoodgoodstudydaydayup/p/7220459.html