    Beam Search

    Greedy decoding: generate (or “decode”) the target sentence by taking the argmax on each step of the decoder (sketched in code after the example below)

    Problem with greedy decoding:

    • Greedy decoding has no way to undo decisions!
      • Input: il a m’entarté (he hit me with a pie)
      • → he ____
      • → he hit ____
      • → he hit a ____ (whoops! no going back now…)
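
    To make the failure mode concrete, here is a minimal sketch of greedy decoding. The `model.step(prefix, x)` method and `model.vocab` list are hypothetical stand-ins for whatever decoder API you have; the point is that each argmax decision is committed immediately.

    ```python
    import numpy as np

    def greedy_decode(model, x, max_len=50, end_token="<END>"):
        """Pick the single most probable token at every step; never backtrack."""
        prefix = []
        for _ in range(max_len):
            log_probs = model.step(prefix, x)   # log P(y_t | y_1..y_{t-1}, x), shape (V,)
            token = model.vocab[int(np.argmax(log_probs))]
            if token == end_token:
                break
            prefix.append(token)                # committed: no going back now
        return prefix
    ```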

    Exhaustive search decoding

    Ideally, we want to find a (length T) translation y that maximizes:

    \[\begin{aligned} P(y | x) &= P\left(y_{1} | x\right) P\left(y_{2} | y_{1}, x\right) P\left(y_{3} | y_{1}, y_{2}, x\right) \ldots P\left(y_{T} | y_{1}, \ldots, y_{T-1}, x\right) \\ &= \prod_{t=1}^{T} P\left(y_{t} | y_{1}, \ldots, y_{t-1}, x\right) \end{aligned}\]

    We could try computing all possible sequences y:

    • This means that on each step t of the decoder, we’re tracking \(V^t\) possible partial translations, where \(V\) is the vocab size
    • This \(O(V^T)\) complexity is far too expensive, as the quick calculation below shows!
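
    As a quick back-of-the-envelope check (the vocab size and target length here are illustrative assumptions, not figures from the text):

    ```python
    V, T = 50_000, 20   # assumed vocab size and target length, for illustration
    print(V ** T)       # ~9.5e93 candidate sequences: hopeless to enumerate
    ```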

    Beam search decoding

    • Core idea: On each step of the decoder, keep track of the k most probable partial translations (which we call hypotheses), where \(k\) is the beam size (in practice around 5 to 10)

    • A hypothesis \(y_1, \ldots, y_t\) has a score which is its log probability:

      \[\operatorname{score}\left(y_{1}, \ldots, y_{t}\right) = \log P_{\mathrm{LM}}\left(y_{1}, \ldots, y_{t} | x\right) = \sum_{i=1}^{t} \log P_{\mathrm{LM}}\left(y_{i} | y_{1}, \ldots, y_{i-1}, x\right)\]

      • Scores are all negative, and higher score is better
      • We search for high-scoring hypotheses, tracking the top \(k\) on each step
    • Beam search is not guaranteed to find the optimal solution

    • But it is much more efficient than exhaustive search! (A single-step sketch follows.)
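
    As referenced above, here is a sketch of a single beam-search step, reusing the hypothetical `model.step`/`model.vocab` interface from the greedy sketch. Each hypothesis is a `(score, tokens)` pair, where the score is the running sum of log probabilities:

    ```python
    import heapq

    def beam_step(model, x, beam, k):
        """Expand every hypothesis by every vocab token; keep the k best."""
        candidates = []
        for score, tokens in beam:
            log_probs = model.step(tokens, x)   # log P(y_t | y_1..y_{t-1}, x), shape (V,)
            for i, lp in enumerate(log_probs):
                candidates.append((score + lp, tokens + [model.vocab[i]]))
        return heapq.nlargest(k, candidates, key=lambda c: c[0])
    ```

    In practice you would keep only the top k continuations of each hypothesis before merging (k² candidates instead of k·V), but the brute-force version keeps the scoring transparent.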

    Beam search decoding: stopping criterion

    • In greedy decoding, usually we decode until the model produces an <END> token
    • In beam search decoding, different hypotheses may produce <END> tokens on different timesteps
      • When a hypothesis produces <END>, that hypothesis is complete.
      • Place it aside and continue exploring other hypotheses via beam search.
    • Usually we continue beam search until:
      • We reach timestep T (where T is some pre-defined cutoff), or
      • We have at least n completed hypotheses (where n is a pre-defined cutoff); a loop sketch follows this list
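
    A sketch of the outer loop with both stopping criteria, building on `beam_step` above (`max_steps` and `n_complete` play the roles of T and n):

    ```python
    def beam_search(model, x, k=5, max_steps=50, n_complete=5, end_token="<END>"):
        """Run beam search until the timestep cutoff or enough finished hypotheses."""
        beam = [(0.0, [])]                  # start with one empty hypothesis, score 0
        completed = []
        for _ in range(max_steps):          # timestep cutoff T
            expanded = beam_step(model, x, beam, k)
            beam = []
            for score, tokens in expanded:
                if tokens[-1] == end_token:
                    completed.append((score, tokens))   # complete: set aside
                else:
                    beam.append((score, tokens))        # keep exploring
            if len(completed) >= n_complete or not beam:
                break
        return completed or beam            # fall back if nothing finished in time
    ```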

    Beam search decoding: finishing up

    • We have our list of completed hypotheses.
    • How do we select the top one, i.e. the hypothesis with the highest score?
    • Each hypothesis \(y_1, \ldots, y_t\) on our list has a score

      \[\operatorname{score}\left(y_{1}, \ldots, y_{t}\right) = \log P_{\mathrm{LM}}\left(y_{1}, \ldots, y_{t} | x\right) = \sum_{i=1}^{t} \log P_{\mathrm{LM}}\left(y_{i} | y_{1}, \ldots, y_{i-1}, x\right)\]

    Problem with this: longer hypotheses have lower scores, since each additional term in the sum is a negative log probability

    Fix: normalize by length, and use this to select the top one instead (a selection sketch follows the formula):

    \[\frac{1}{t} \sum_{i=1}^{t} \log P_{\mathrm{LM}}\left(y_{i} | y_{1}, \ldots, y_{i-1}, x\right)\]
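
    Continuing the sketch, selecting the final output from the completed hypotheses with the length-normalized score:

    ```python
    def select_best(completed):
        """Pick the hypothesis maximizing (1/t) * sum of per-token log probs."""
        return max(completed, key=lambda h: h[0] / len(h[1]))   # h = (score, tokens)
    ```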
