Various Sequence To Sequence Architectures
Basic Models
Sequence to sequence model
Image captioning
use a CNN (e.g. AlexNet) first to get a 4096-dimensional feature vector, then feed it into an RNN that generates the caption word by word
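A minimal sketch of that pipeline; all shapes, weight names, and the greedy decoding loop below are illustrative stand-ins, not the course's implementation:

```python
import numpy as np

# Illustrative encoder-decoder sketch: the 4096-d CNN feature conditions an
# RNN language model that emits the caption one word at a time.
V, n_a = 10_000, 256                       # vocabulary size, RNN state size (assumed)
rng = np.random.default_rng(0)
W_img = rng.normal(0, 0.01, (n_a, 4096))   # maps the CNN feature to the initial RNN state
W_aa = rng.normal(0, 0.01, (n_a, n_a))
W_ax = rng.normal(0, 0.01, (n_a, V))
W_ya = rng.normal(0, 0.01, (V, n_a))

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

cnn_feature = rng.normal(size=4096)        # stand-in for AlexNet's 4096-d activation
a = np.tanh(W_img @ cnn_feature)           # the image conditions the language model
x = np.zeros(V); x[0] = 1.0                # start token (index 0, say)
for _ in range(10):                        # greedily emit up to 10 caption words
    a = np.tanh(W_aa @ a + W_ax @ x)       # RNN cell
    y_hat = softmax(W_ya @ a)              # distribution over the next word
    x = np.zeros(V); x[int(np.argmax(y_hat))] = 1.0
```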
Picking the Most Likely Sentence
translate a French sentence (x) into the most likely English sentence (y).
it's to find (\arg\max_{y^{<1>}, \dots, y^{<T_y>}} P(y^{<1>}, \dots, y^{<T_y>} \mid x))
Why not a greedy search?
(i.e. picking the most likely word one step at a time.) Because maximizing each word separately does not maximize the joint probability (P(y \mid x)); greedy choices tend to favor common words, giving verbose, longer, and less optimal translations.
Beam Search
- Set the beam width (B = 3) and keep the (B) most likely first words (instead of just one).
- For each of those candidates, consider every possible second word and keep the (B) most likely two-word prefixes overall.
- Repeat for later words until (<EOS>) is generated.
if (B = 1), it's just greedy search.
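A minimal sketch of the search loop, assuming a hypothetical `step_log_probs(prefix)` callable that returns the decoder's log-probabilities for the next word given the input sentence (x) and the partial output:

```python
import numpy as np

def beam_search(step_log_probs, B=3, eos_id=0, max_len=20):
    """Keep the B highest-scoring partial sentences at every step.

    `step_log_probs(prefix)` stands in for the decoder RNN conditioned on x:
    it returns a 1-D array of log P(next word | x, prefix) over the vocabulary.
    """
    beams = [([], 0.0)]                              # (word ids so far, summed log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_p = step_log_probs(prefix)           # shape (|V|,)
            for w in np.argsort(log_p)[-B:]:         # B best next words for this prefix
                candidates.append((prefix + [int(w)], score + float(log_p[w])))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:B]:         # keep the B best prefixes overall
            (finished if prefix[-1] == eos_id else beams).append((prefix, score))
        if not beams:                                # every beam has ended in <EOS>
            break
    return max(finished + beams, key=lambda c: c[1])
```

With `B = 1` the same loop reduces to greedy search; the length normalization below would be applied to the scores before the final `max`.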
Refinements to beam search
Length normalization
Since (P(y \mid x)) is a product of many probabilities, each much less than (1), it quickly becomes tiny (close to (0)), so take the (\log) and maximize the sum of log-probabilities instead.
Even then, the objective tends to favor short sentences, since every extra word adds another negative log term.
So you can normalize by the output length raised to a power (\alpha) ((\alpha) is a hyperparameter between (0) and (1))
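The resulting length-normalized objective (with (T_y) the output length) is:

```latex
\arg\max_{y}\; \frac{1}{T_y^{\alpha}} \sum_{t=1}^{T_y} \log P\left(y^{<t>} \mid x, y^{<1>}, \dots, y^{<t-1>}\right)
```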
Beam search discussion
- large (B) : better result, slower
- small (B) : worse result, faster
Error Analysis in Beam Search
let (y^*) be the high-quality human translation and (\hat y) be the algorithm's output; compare their probabilities under the model (a sketch of the rule follows below).
- (P(y^* \mid x) > P(\hat y \mid x)) : beam search is at fault (it failed to find the better sentence)
- (P(y^* \mid x) \le P(\hat y \mid x)) : the RNN model is at fault (it prefers the worse sentence)
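A minimal sketch of that decision rule; the function and variable names are illustrative, not from the course code:

```python
def attribute_error(log_p_human, log_p_beam):
    """Decide which component caused a translation error.

    log_p_human : log P(y* | x)     for the human translation y*
    log_p_beam  : log P(y_hat | x)  for the beam-search output y_hat
    Both are computed with the same trained RNN.
    """
    if log_p_human > log_p_beam:
        return "beam search at fault: it missed a higher-probability sentence"
    return "RNN at fault: the model prefers the worse sentence"
```

Tallying this verdict over many dev-set errors tells you whether to increase (B) or to keep improving the model itself.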
Bleu(bilingual evaluation understudy) Score
Given one or more good human reference translations, it measures how close the machine output is to them.
Bleu details
calculate it with (BP \cdot \exp\left(\frac{1}{4} \sum_{n = 1}^{4} \log p_n\right)), where (p_n) is the modified (n)-gram precision
BP = brevity penalty: it penalizes translations that are shorter than the references, because we don't want the system to score well just by outputting very short, high-precision translations.
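A minimal from-scratch sketch of sentence-level BLEU under these definitions; it uses no smoothing and takes the shortest reference length for BP, which simplifies the original corpus-level definition:

```python
import math
from collections import Counter

def sentence_bleu(candidate, references, N=4):
    """Modified n-gram precisions combined with a brevity penalty (sketch)."""
    def ngrams(words, n):
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

    log_p_sum = 0.0
    for n in range(1, N + 1):
        cand = ngrams(candidate, n)
        max_ref = Counter()
        for ref in references:                        # clip counts by the references
            for g, c in ngrams(ref, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
        p_n = clipped / max(sum(cand.values()), 1)
        log_p_sum += math.log(p_n) if p_n > 0 else float("-inf")

    c_len = max(len(candidate), 1)
    r_len = min(len(ref) for ref in references)
    bp = 1.0 if c_len > r_len else math.exp(1 - r_len / c_len)   # brevity penalty
    return bp * math.exp(log_p_sum / N)

print(sentence_bleu("the cat is on the mat".split(),
                    ["the cat sat on the mat".split()], N=2))    # ~0.71
```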
Attention Model Intuition
for long sentences, it is hard for the encoder network to memorize the whole sentence in a single fixed-length vector, so translation quality drops as sentences get longer.
Instead, compute attention weights so that each output word is predicted from a weighted context over the relevant parts of the input.
Attention Model
Use a BiRNN or BiLSTM as the encoder, so each input position (t') has an activation (a^{<t'>}) summarizing its context.
Computing attention
train a very small network (e.g. a single hidden layer) to compute the score (e^{<t, t'>}) from the previous decoder state (s^{<t-1>}) and the encoder activation (a^{<t'>}); a softmax over (t') then gives the attention weights (\alpha^{<t, t'>})
the complexity is (\mathcal O(T_x T_y)), i.e. quadratic in the input and output lengths, which is costly for long sequences
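A minimal numpy sketch of one attention step under these definitions; the parameter shapes and the exact form of the small scoring network are assumptions:

```python
import numpy as np

def attention_context(a, s_prev, W, v):
    """One decoder step of attention: returns the context vector c^<t>.

    a      : (T_x, 2n) encoder (BiRNN) activations a^<t'>
    s_prev : (m,)      previous decoder state s^<t-1>
    W, v   : parameters of the small scoring network, here a single tanh
             hidden layer: e^<t,t'> = v . tanh(W [s_prev; a^<t'>])
    """
    T_x = a.shape[0]
    e = np.array([v @ np.tanh(W @ np.concatenate([s_prev, a[t]]))
                  for t in range(T_x)])              # scores, shape (T_x,)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                             # softmax over input positions t'
    return alpha @ a                                 # c^<t> = sum_t' alpha^<t,t'> a^<t'>

# Example shapes: T_x = 7 input words, encoder size 2n = 64, decoder size m = 32
rng = np.random.default_rng(0)
c = attention_context(rng.normal(size=(7, 64)), rng.normal(size=32),
                      rng.normal(size=(10, 96)), rng.normal(size=10))
```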
Speech Recognition - Audio Data
Speech recognition
(x (\text{audio clip}) \to y (\text{transcript}))
Attention model for speech recognition
generate character by character
CTC cost for speech recognition
CTC (Connectionist Temporal Classification)
"ttt_h_eee___ ____qqq(\dots)" (\rightarrow) "the quick brown fox"
Basic rule: collapse repeated characters not separated by "blank"
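A minimal sketch of that collapsing rule (decoding only; training with the CTC loss itself is not shown):

```python
def ctc_collapse(raw, blank="_"):
    """Merge repeated characters not separated by a blank, then drop the blanks."""
    out, prev = [], None
    for ch in raw:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

print(ctc_collapse("ttt_h_eee___ ___qqq"))   # -> "the q"
```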
Trigger Word Detection
label the audio so that the output is (1) for a short stretch right after the trigger word is spoken and (0) everywhere else
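A minimal sketch of how such labels could be built; the sequence length and the 50-step run of 1s are illustrative choices, not fixed by the method:

```python
import numpy as np

def trigger_labels(T_y, trigger_end_steps, ones_len=50):
    """Target sequence for trigger word detection: 0 everywhere, except a
    short run of 1s right after each time step where the trigger word ends."""
    y = np.zeros(T_y, dtype=np.float32)
    for t in trigger_end_steps:
        y[t:t + ones_len] = 1.0            # numpy clips the slice at T_y
    return y

# e.g. a 1375-step output with trigger words ending at steps 400 and 900
y = trigger_labels(1375, [400, 900])
```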