Various Sequence To Sequence Architectures
Basic Models
Sequence to sequence model
Image captioning
use a CNN (e.g. AlexNet) first to get a 4096-dimensional feature vector, then feed it into an RNN that generates the caption word by word
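A minimal sketch of that pipeline; all shapes, weight names, and the greedy decoding loop below are illustrative stand-ins, not the course's implementation:

```python
import numpy as np

# Illustrative encoder-decoder sketch: the 4096-d CNN feature conditions an
# RNN language model that emits the caption one word at a time.
V, n_a = 10_000, 256                       # vocabulary size, RNN state size (assumed)
rng = np.random.default_rng(0)
W_img = rng.normal(0, 0.01, (n_a, 4096))   # maps the CNN feature to the initial RNN state
W_aa = rng.normal(0, 0.01, (n_a, n_a))
W_ax = rng.normal(0, 0.01, (n_a, V))
W_ya = rng.normal(0, 0.01, (V, n_a))

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

cnn_feature = rng.normal(size=4096)        # stand-in for AlexNet's 4096-d activation
a = np.tanh(W_img @ cnn_feature)           # the image conditions the language model
x = np.zeros(V); x[0] = 1.0                # start token (index 0, say)
for _ in range(10):                        # greedily emit up to 10 caption words
    a = np.tanh(W_aa @ a + W_ax @ x)       # RNN cell
    y_hat = softmax(W_ya @ a)              # distribution over the next word
    x = np.zeros(V); x[int(np.argmax(y_hat))] = 1.0
```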
Picking the Most Likely Sentence
translate a French sentence (x) into the most likely English sentence (y).
it's to find (\arg\max_{y^{<1>}, \dots, y^{<T_y>}} P(y^{<1>}, \dots, y^{<T_y>} \mid x))
Why not a greedy search?
(i.e. picking the most likely word one step at a time.) Because maximizing each word separately does not maximize the joint probability (P(y \mid x)); greedy choices tend to favor common words, giving verbose, longer, and less optimal translations.
Beam Search
- Set the beam width (B = 3) and keep the (B) most likely first words (instead of just one).
- For each of those candidates, consider every possible second word and keep the (B) most likely two-word prefixes overall.
- Repeat for later words until (<EOS>) is generated.
if (B = 1), it's just greedy search.
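A minimal sketch of the search loop, assuming a hypothetical `step_log_probs(prefix)` callable that returns the decoder's log-probabilities for the next word given the input sentence (x) and the partial output:

```python
import numpy as np

def beam_search(step_log_probs, B=3, eos_id=0, max_len=20):
    """Keep the B highest-scoring partial sentences at every step.

    `step_log_probs(prefix)` stands in for the decoder RNN conditioned on x:
    it returns a 1-D array of log P(next word | x, prefix) over the vocabulary.
    """
    beams = [([], 0.0)]                              # (word ids so far, summed log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_p = step_log_probs(prefix)           # shape (|V|,)
            for w in np.argsort(log_p)[-B:]:         # B best next words for this prefix
                candidates.append((prefix + [int(w)], score + float(log_p[w])))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:B]:         # keep the B best prefixes overall
            (finished if prefix[-1] == eos_id else beams).append((prefix, score))
        if not beams:                                # every beam has ended in <EOS>
            break
    return max(finished + beams, key=lambda c: c[1])
```

With `B = 1` the same loop reduces to greedy search; the length normalization below would be applied to the scores before the final `max`.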
Refinements to beam search
Length normalization
Since (P(y \mid x)) is a product of many probabilities, each much less than (1), it quickly becomes tiny (close to (0)), so take the (\log) and maximize the sum of log-probabilities instead.
Even then, the objective tends to favor short sentences, since every extra word adds another negative log term.
So you can normalize by the output length raised to a power (\alpha) ((\alpha) is a hyperparameter between (0) and (1))
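The resulting length-normalized objective (with (T_y) the output length) is:

```latex
\arg\max_{y}\; \frac{1}{T_y^{\alpha}} \sum_{t=1}^{T_y} \log P\left(y^{<t>} \mid x, y^{<1>}, \dots, y^{<t-1>}\right)
```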
Beam search discussion
- large (B) : better result, slower
- small (B) : worse result, faster
Error Analysis in Beam Search
let (y^*) be the high-quality human translation and (\hat y) be the algorithm's output; compare their probabilities under the model (a sketch of the rule follows below).
- (P(y^* \mid x) > P(\hat y \mid x)) : beam search is at fault (it failed to find the better sentence)
- (P(y^* \mid x) \le P(\hat y \mid x)) : the RNN model is at fault (it prefers the worse sentence)
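A minimal sketch of that decision rule; the function and variable names are illustrative, not from the course code:

```python
def attribute_error(log_p_human, log_p_beam):
    """Decide which component caused a translation error.

    log_p_human : log P(y* | x)     for the human translation y*
    log_p_beam  : log P(y_hat | x)  for the beam-search output y_hat
    Both are computed with the same trained RNN.
    """
    if log_p_human > log_p_beam:
        return "beam search at fault: it missed a higher-probability sentence"
    return "RNN at fault: the model prefers the worse sentence"
```

Tallying this verdict over many dev-set errors tells you whether to increase (B) or to keep improving the model itself.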
Bleu(bilingual evaluation understudy) Score
Given one or more good human reference translations, it measures how close the machine output is to them.
Bleu details
calculate it with (BP \cdot \exp\left(\frac{1}{4} \sum_{n = 1}^{4} \log p_n\right)), where (p_n) is the modified (n)-gram precision
BP = brevity penalty: it penalizes translations that are shorter than the references, because we don't want the system to score well just by outputting very short, high-precision translations.
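A minimal from-scratch sketch of sentence-level BLEU under these definitions; it uses no smoothing and takes the shortest reference length for BP, which simplifies the original corpus-level definition:

```python
import math
from collections import Counter

def sentence_bleu(candidate, references, N=4):
    """Modified n-gram precisions combined with a brevity penalty (sketch)."""
    def ngrams(words, n):
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

    log_p_sum = 0.0
    for n in range(1, N + 1):
        cand = ngrams(candidate, n)
        max_ref = Counter()
        for ref in references:                        # clip counts by the references
            for g, c in ngrams(ref, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
        p_n = clipped / max(sum(cand.values()), 1)
        log_p_sum += math.log(p_n) if p_n > 0 else float("-inf")

    c_len = max(len(candidate), 1)
    r_len = min(len(ref) for ref in references)
    bp = 1.0 if c_len > r_len else math.exp(1 - r_len / c_len)   # brevity penalty
    return bp * math.exp(log_p_sum / N)

print(sentence_bleu("the cat is on the mat".split(),
                    ["the cat sat on the mat".split()], N=2))    # ~0.71
```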
Attention Model Intuition
for long sentences, it is hard for the encoder network to memorize the whole sentence in a single fixed-length vector, so translation quality drops as sentences get longer.
Instead, compute attention weights so that each output word is predicted from a weighted context over the relevant parts of the input.
Attention Model
Use a BiRNN or BiLSTM as the encoder, so each input position (t') has an activation (a^{<t'>}) summarizing its context.
Computing attention
train a very small network (e.g. a single hidden layer) to compute the score (e^{<t, t'>}) from the previous decoder state (s^{<t-1>}) and the encoder activation (a^{<t'>}); a softmax over (t') then gives the attention weights (\alpha^{<t, t'>})
the complexity is (\mathcal O(T_x T_y)), i.e. quadratic in the input and output lengths, which is costly for long sequences
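A minimal numpy sketch of one attention step under these definitions; the parameter shapes and the exact form of the small scoring network are assumptions:

```python
import numpy as np

def attention_context(a, s_prev, W, v):
    """One decoder step of attention: returns the context vector c^<t>.

    a      : (T_x, 2n) encoder (BiRNN) activations a^<t'>
    s_prev : (m,)      previous decoder state s^<t-1>
    W, v   : parameters of the small scoring network, here a single tanh
             hidden layer: e^<t,t'> = v . tanh(W [s_prev; a^<t'>])
    """
    T_x = a.shape[0]
    e = np.array([v @ np.tanh(W @ np.concatenate([s_prev, a[t]]))
                  for t in range(T_x)])              # scores, shape (T_x,)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                             # softmax over input positions t'
    return alpha @ a                                 # c^<t> = sum_t' alpha^<t,t'> a^<t'>

# Example shapes: T_x = 7 input words, encoder size 2n = 64, decoder size m = 32
rng = np.random.default_rng(0)
c = attention_context(rng.normal(size=(7, 64)), rng.normal(size=32),
                      rng.normal(size=(10, 96)), rng.normal(size=10))
```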
Speech Recognition - Audio Data
Speech recognition
(x (\text{audio clip}) \to y (\text{transcript}))
Attention model for speech recognition
generate character by character
CTC cost for speech recognition
CTC (Connectionist Temporal Classification)
"ttt_h_eee___ ____qqq(\dots)" (\rightarrow) "the quick brown fox"
Basic rule: collapse repeated characters not separated by "blank"
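A minimal sketch of that collapsing rule (decoding only; training with the CTC loss itself is not shown):

```python
def ctc_collapse(raw, blank="_"):
    """Merge repeated characters not separated by a blank, then drop the blanks."""
    out, prev = [], None
    for ch in raw:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

print(ctc_collapse("ttt_h_eee___ ___qqq"))   # -> "the q"
```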
Trigger Word Detection
label the audio so that the output is (1) for a short stretch right after the trigger word is spoken and (0) everywhere else
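A minimal sketch of how such labels could be built; the sequence length and the 50-step run of 1s are illustrative choices, not fixed by the method:

```python
import numpy as np

def trigger_labels(T_y, trigger_end_steps, ones_len=50):
    """Target sequence for trigger word detection: 0 everywhere, except a
    short run of 1s right after each time step where the trigger word ends."""
    y = np.zeros(T_y, dtype=np.float32)
    for t in trigger_end_steps:
        y[t:t + ones_len] = 1.0            # numpy clips the slice at T_y
    return y

# e.g. a 1375-step output with trigger words ending at steps 400 and 900
y = trigger_labels(1375, [400, 900])
```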