【NLP】How to Generate Embeddings?


    How to represent words.

    0.

    Naive representation: one-hot vectors

    Dimension: |all words| (the size of the vocabulary)

    (too large, and unable to express semantic similarity)
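    A tiny illustration of why one-hot vectors cannot express semantic similarity (made-up three-word vocabulary):

```python
import numpy as np

# Toy vocabulary: each word is a |V|-dimensional one-hot vector.
vocab = ["cat", "dog", "car"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Distinct one-hot vectors are always orthogonal, so their dot product
# is 0 regardless of meaning: "cat" is no closer to "dog" than to "car".
print(one_hot["cat"] @ one_hot["dog"])  # 0.0
print(one_hot["cat"] @ one_hot["car"])  # 0.0
```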

    Idea: produce dense vector representations based on the context/use of words

    So, there are three main approaches:

    1.

    Count-based methods

    (1) Define a basis vocabulary C of context words (its dimension lower than the full vocabulary; stop words such as the, a, of, … are excluded)

    (2) Define a word window size W

    (3) Count the basis vocabulary words occurring within W words to the left or right of each instance of a target word in the corpus

    (4) Form a vector representation of the target word based on these counts

    Example:

    [figure: example count-based word vectors]
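    A minimal NumPy sketch of steps (1)-(4), with a made-up toy corpus and basis vocabulary:

```python
import numpy as np
from collections import Counter

# Toy corpus and basis vocabulary C (made up for illustration).
corpus = "he drinks tea she drinks coffee he likes tea she likes coffee".split()
basis = ["drinks", "likes", "tea", "coffee"]
W = 2  # window size: W words to the left and W to the right

def count_vector(target):
    """Steps (3)-(4): count basis words occurring within W words of each
    instance of `target` and form its count-based vector."""
    counts = Counter()
    for i, w in enumerate(corpus):
        if w == target:
            window = corpus[max(0, i - W):i] + corpus[i + 1:i + 1 + W]
            counts.update(c for c in window if c in basis)
    return np.array([counts[c] for c in basis], dtype=float)

print(count_vector("he"))   # [2. 1. 2. 1.] over [drinks, likes, tea, coffee]
print(count_vector("she"))  # [2. 2. 2. 2.]
```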

    We can calculate the similarity of two words using the inner product or cosine similarity.

    For instance:

    [figure: similarity between two count-based word vectors]
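    As an illustration, a minimal cosine-similarity helper (the vectors below are made-up count vectors, not the ones from the figure):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: inner product normalised by vector lengths."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical count vectors for two words over the same basis vocabulary.
u = np.array([2.0, 1.0, 2.0, 1.0])
v = np.array([2.0, 2.0, 2.0, 2.0])
print(u @ v)         # 12.0, the raw inner product
print(cosine(u, v))  # ≈ 0.95, length-normalised similarity in [-1, 1]
```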

    2.

    Neural Embedding Models (Main Idea)

    Goal: generate an embedding matrix E ∈ R^(|all words| × |context words|), which looks like:

    [figure: the embedding matrix, shown as a matrix of count-based vectors]

    Rows are word vectors.

    We can retrieve a particular word's vector by multiplying its one-hot vector with E.

    [figure: retrieving a word vector with a one-hot vector]
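    For example (toy sizes, random E):

```python
import numpy as np

V, d = 5, 3                      # |V| words, d-dimensional embeddings (toy sizes)
E = np.random.randn(V, d)        # embedding matrix: rows are word vectors

word_index = 2
one_hot = np.eye(V)[word_index]  # one-hot vector for the word

# Multiplying the one-hot vector with E selects the corresponding row,
# i.e. the word's embedding; in practice this is just a row lookup.
assert np.allclose(one_hot @ E, E[word_index])
```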

    One generic idea behind embedding learning:

    (1) Collect instances ti ∈ inst(t) of a word t in vocabulary V

    (2) For each instance, collect its context words c(ti) (e.g. a k-word window)

    (3) Define a score function score(ti, c(ti); θ, E) with an upper bound on its output

    (4) Define a loss, summed over all words and all of their instances:

    L(θ, E) = Σ_{t ∈ V} Σ_{ti ∈ inst(t)} (1 − score(ti, c(ti); θ, E))

    (5) Estimate the parameters by minimizing the loss:

    (θ̂, Ê) = argmin_{θ, E} L(θ, E)

    (6) Use the estimated E as the embedding matrix

    Note:

    The scoring function estimates whether a sentence (or the target word together with its context) is something a person would naturally say or write, so the higher the score, the more likely the usage is.
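    A minimal sketch of this recipe, assuming a sigmoid-squashed dot product as the bounded score function (the recipe above leaves score(...) abstract):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def score(target_idx, context_idxs, E):
    """Hypothetical score bounded above by 1: squash the dot product of the
    target word's vector and the mean of its context vectors."""
    return sigmoid(E[target_idx] @ E[context_idxs].mean(axis=0))

def loss(instances, E):
    """L = sum over instances of (1 - score): high scores give low loss."""
    return sum(1.0 - score(t, c, E) for t, c in instances)

E = np.random.randn(5, 3)               # toy embedding matrix
instances = [(0, [1, 2]), (3, [2, 4])]  # (target index, context indices)
print(loss(instances, E))
# Minimising this loss w.r.t. E (e.g. by SGD) yields the embedding matrix.
```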

    3.

    C&W

    [figure: C&W model architecture]

    First, we embed all the words in a sentence with E.

    Then the sentence (w1, w2, w3, w4, w5) goes through a convolution layer (possibly just a simple fully-connected layer).

    Then it goes through a simple MLP.

    Finally, it goes through the 'scorer' layer, which outputs the final score.

    Minimize the loss function, and use the parameter matrix of the input (embedding) layer as the embedding matrix E.

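    A minimal PyTorch sketch of the C&W idea, scoring real windows above corrupted ones (centre word replaced) by a margin; the layer sizes are illustrative, not the paper's:

```python
import torch
import torch.nn as nn

class CWScorer(nn.Module):
    """Embed a window of words, concatenate, pass through a small MLP,
    and output a single scalar score for the window."""
    def __init__(self, vocab_size, dim=50, window=5, hidden=100):
        super().__init__()
        self.E = nn.Embedding(vocab_size, dim)
        self.mlp = nn.Sequential(
            nn.Linear(window * dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, window_ids):        # (batch, window)
        vecs = self.E(window_ids)         # (batch, window, dim)
        return self.mlp(vecs.flatten(1))  # (batch, 1)

model = CWScorer(vocab_size=1000)
good = torch.randint(0, 1000, (8, 5))     # real windows from the corpus
bad = good.clone()
bad[:, 2] = torch.randint(0, 1000, (8,))  # corrupt the centre word
# Ranking loss: real windows should outscore corrupted ones by a margin.
loss = torch.clamp(1 - model(good) + model(bad), min=0).mean()
loss.backward()
# After training, model.E.weight serves as the embedding matrix E.
```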

    4. Word2Vec

    1) CBoW (Continuous Bag of Words): predict the target word from the sum/average of its context words' embeddings; a sketch follows the figure.

    [figure: CBoW architecture]
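    A minimal PyTorch sketch of CBoW with a full softmax (real word2vec speeds this up with hierarchical softmax or negative sampling):

```python
import torch
import torch.nn as nn

class CBoW(nn.Module):
    """Average the context word embeddings (order is ignored, hence
    'bag of words') and predict the centre word over the vocabulary."""
    def __init__(self, vocab_size, dim=100):
        super().__init__()
        self.E = nn.Embedding(vocab_size, dim)  # input embeddings
        self.out = nn.Linear(dim, vocab_size)   # output projection

    def forward(self, context_ids):             # (batch, 2k) context words
        h = self.E(context_ids).mean(dim=1)     # combine the context bag
        return self.out(h)                      # logits over the vocabulary

model = CBoW(vocab_size=1000)
context = torch.randint(0, 1000, (8, 4))        # k=2 words on each side
target = torch.randint(0, 1000, (8,))
loss = nn.functional.cross_entropy(model(context), target)
loss.backward()
```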

    2) Skip-gram: predict each context word from the target word (the mirror image of CBoW); a sketch follows the figure.

    [figure: Skip-gram architecture]
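    A matching PyTorch sketch of Skip-gram, again with a full softmax for clarity:

```python
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    """Predict a context word from the centre word: one (target, context)
    pair per training example."""
    def __init__(self, vocab_size, dim=100):
        super().__init__()
        self.E = nn.Embedding(vocab_size, dim)  # input embeddings
        self.out = nn.Linear(dim, vocab_size)   # output projection

    def forward(self, target_ids):              # (batch,) centre words
        return self.out(self.E(target_ids))     # logits over the vocabulary

model = SkipGram(vocab_size=1000)
target = torch.randint(0, 1000, (8,))
context = torch.randint(0, 1000, (8,))          # one context word per target
loss = nn.functional.cross_entropy(model(target), context)
loss.backward()
# As before, model.E.weight is the learned embedding matrix.
```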
