    Show and Tell: A Neural Image Caption Generator (Reading Notes)

    Notes should not simply copy the paper's content; they must include one's own thinking and understanding.

    I. Basic Information

    Title: Show and Tell: A Neural Image Caption Generator
    Authors: Oriol Vinyals (vinyals@google.com), Alexander Toshev (toshev@google.com), Samy Bengio (bengio@google.com), Dumitru Erhan (dumitru@google.com)
    Affiliations: Google DeepMind (Vinyals), Google (Toshev), Google Brain (Bengio, Erhan)
    Venue / Year: CVPR 2015

    II. Purpose of Reading This Paper

    To learn about relatively early neural-network-based research results on image captioning.

    III. Scene and Problem

    Scene: the intersection of computer vision and natural language processing, on natural images.

    Problem: automatically describing the content of an image using properly formed English sentences.

    IV. Research Goal

    Present a single joint model that is more accurate, both qualitatively and quantitatively: it takes an image \(I\) as input and is trained to maximize the likelihood \(p(S|I)\) of producing a target sequence of words \(S = \{S_1, S_2, \ldots\}\), where each word \(S_t\) comes from a given dictionary, that describes the image adequately.

    V. Main Idea / Innovations

    Main inspiration:

    Recent advances in machine translation: transform a sentence \(S\) written in a source language into its translation \(T\) in the target language by maximizing \(p(T|S)\). An "encoder" RNN reads the source sentence and transforms it into a rich fixed-length vector representation, which in turn is used as the initial hidden state of a "decoder" RNN that generates the target sentence.

    Main innovation:

    Replace the encoder RNN with a deep CNN (using a CNN as an image "encoder"). A CNN can produce a rich representation of the input image by embedding it into a fixed-length vector. The CNN is pre-trained on an image classification task, and its last hidden layer is used as input to the RNN decoder that generates sentences.

    Training uses stochastic gradient descent.
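A minimal NumPy sketch of this encoder/decoder wiring, with the pretrained CNN faked as a random linear projection and a single vanilla-RNN step standing in for the LSTM (all names, shapes, and weights here are illustrative assumptions, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim, vocab = 8, 8, 20

def cnn_encode(image):
    # stand-in for the pretrained CNN: project the flattened image
    # into the same embed_dim space as the word embeddings
    W_img = rng.standard_normal((embed_dim, image.size))
    return W_img @ image.ravel()

def rnn_step(h, x, W_h, W_x):
    # simplified recurrence h_{t+1} = f(h_t, x_t); the paper uses an LSTM
    return np.tanh(W_h @ h + W_x @ x)

W_h = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
W_x = rng.standard_normal((hidden_dim, embed_dim)) * 0.1
W_e = rng.standard_normal((embed_dim, vocab)) * 0.1   # word embedding matrix

image = rng.standard_normal((4, 4))
h = np.zeros(hidden_dim)
h = rnn_step(h, cnn_encode(image), W_h, W_x)   # image fed once, as x_{-1}
h = rnn_step(h, W_e[:, 3], W_h, W_x)           # then word embeddings follow
```

The key design choice the sketch illustrates is that the image embedding and the word embeddings live in the same space, so the same recurrence consumes both.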

    Model name:

    An end-to-end system: Neural Image Caption--NIC.

    VI. Core Algorithm

    1. Directly maximize the probability of the correct description given the image:

    \[ \theta^* = \arg\max_{\theta} \sum_{(I,S)} \log p(S \mid I; \theta) \]

    where \(\theta\) are the parameters of the model, \(I\) is an image, and \(S\) is its correct transcription.

    With \(N\) the length of the sentence, the chain rule gives \( \log p(S \mid I) = \sum_{t=0}^{N} \log p(S_t \mid I, S_0, \ldots, S_{t-1}) \).

    2. Model \( p(S_t \mid I, S_0, \ldots, S_{t-1}) \) with an RNN: the variable number of words conditioned upon, up to \(t-1\), is expressed by a fixed-length hidden state (memory) \(h_t\), which is updated after seeing a new input \(x_t\) by a non-linear function \(f\): \( h_{t+1} = f(h_t, x_t) \).
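The chain-rule factorization can be sketched numerically; `step_probs` below is a hypothetical stand-in for the RNN's per-step outputs \( p(S_t \mid I, S_0, \ldots, S_{t-1}) \), not the paper's model:

```python
import math

def log_likelihood(step_probs):
    """Chain-rule decomposition: log p(S|I) = sum_t log p(S_t | I, S_0..S_{t-1}).

    step_probs[t] is the model's probability of the correct word S_t
    given the image and all previous words.
    """
    return sum(math.log(p) for p in step_probs)

# toy example: a 3-word sentence whose correct words received
# probabilities 0.5, 0.25, 0.8 at their respective steps
ll = log_likelihood([0.5, 0.25, 0.8])
```

Summing log-probabilities is numerically safer than multiplying raw probabilities, which is why the objective is written in log form.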

    3. The CNN uses a novel approach to batch normalization and yields the current best performance on the ILSVRC 2014 classification competition; \(f\) is a Long Short-Term Memory (LSTM) network.

    4. Implementation details of the LSTM model are given in the next section.

    VII. Model Implementation Details

    The unrolling procedure:

    \[ x_{-1} = \mathrm{CNN}(I) \]
    \[ x_t = W_e S_t, \quad t \in \{0, \ldots, N-1\} \]
    \[ p_{t+1} = \mathrm{LSTM}(x_t), \quad t \in \{0, \ldots, N-1\} \]

    1. Each word is represented as a one-hot vector \(S_t\) of dimension equal to the size of the dictionary.

    2. \(S_0\) is a special start word and \(S_N\) a special stop word, designating the beginning and end of the sentence.

    3. Both the image and the words are mapped to the same space: the image by the vision CNN, the words by the word embedding \(W_e\).

    4. The image is only input once, at \(t = -1\), to inform the LSTM about the image contents.

    5. The loss is the sum of the negative log likelihood of the correct word at each step: \( L(I, S) = -\sum_{t=1}^{N} \log p_t(S_t) \).

    6. Two sentence-generation approaches:

    Sampling: sample the first word according to \(p_1\), feed the corresponding embedding back as input and sample from \(p_2\), continuing until the special end-of-sentence token is sampled or some maximum length is reached.

    BeamSearch: iteratively consider the \(k\) best sentences up to time \(t\) as candidates to generate sentences of length \(t+1\), and keep only the best \(k\) of them. This better approximates \( S = \arg\max_{S'} p(S' \mid I) \).

    The paper uses BeamSearch in its experiments, with a beam size of 20; the authors verified that a beam size of 1 degraded results by 2 BLEU points on average.
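The BeamSearch procedure can be illustrated on a toy next-word table, a stand-in for the NIC LSTM's \(p_{t+1}\); the words and probabilities below are invented for this sketch:

```python
import heapq

# hypothetical next-word model: maps the last word of a partial
# sentence to {next_word: probability} (not the paper's LSTM)
TABLE = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.7, "</s>": 0.3},
    "the": {"dog": 0.5, "</s>": 0.5},
    "dog": {"</s>": 1.0},
}

def beam_search(k, max_len=5):
    beams = [(1.0, ["<s>"])]             # (probability, partial sentence)
    for _ in range(max_len):
        candidates = []
        for p, sent in beams:
            if sent[-1] == "</s>":        # finished sentences carry over
                candidates.append((p, sent))
                continue
            for w, pw in TABLE[sent[-1]].items():
                candidates.append((p * pw, sent + [w]))
        # keep only the k most probable extended sentences
        beams = heapq.nlargest(k, candidates, key=lambda c: c[0])
        if all(s[-1] == "</s>" for _, s in beams):
            break
    return beams

best_prob, best_sent = beam_search(k=2)[0]
```

With `k=1` this reduces to greedy decoding, which matches the note's observation that a beam size of 1 performs worse: the greedy choice can lock in a locally likely word that leads to a globally less likely sentence.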

    VIII. Evaluation Metrics & Datasets

    1. Amazon Mechanical Turk: the most reliable metric. Raters give a subjective score on the usefulness of each description given the image; each image was rated by 2 workers, and in case of disagreement the scores were simply averaged.

    2. Automatically computed metrics: BLEU, METEOR, CIDEr.

    3. Datasets: PASCAL (VOC 2008), Flickr8k, Flickr30k, MSCOCO, and SBU (the datasets that appear in the transfer-learning experiments).

    IX. Experiment Details

    Training Details:

    1. Overfitting: even high-quality datasets have fewer than 100,000 images. Countermeasures:

    -- Initialize the weights of the CNN component of the system to a pretrained model.

    -- Initialize the word embeddings \(W_e\) from a large news corpus; no significant gains were observed, so they were left uninitialized for simplicity.

    -- Model-level overfitting-avoidance techniques: dropout, ensembling models, and exploring the model size by trading off the number of hidden units versus depth (dropout and ensembling gave a few BLEU points of improvement).

    -- 512 dimensions were used for both the embeddings and the LSTM memory.

    X. Questions Investigated & Results

    Question 1:

    Whether a model could be transferred to a different dataset, and how much the domain mismatch would be compensated by, e.g., higher-quality labels or more training data.

    Transfer learning -- data size:

    1. Flickr30k vs. Flickr8k (Flickr30k has about 4 times more training data): the Flickr30k model obtained results 4 BLEU points better.

    2. MSCOCO vs. Flickr30k (MSCOCO has 5 times more training data): larger differences in vocabulary and a larger domain mismatch; all BLEU scores degraded by 10 points.

    3. PASCAL: transfer learning from Flickr30k yielded worse results than from MSCOCO, with BLEU-1 at 53 (cf. 59).

    4. SBU (its labels are user captions, not human-generated descriptions): the task is much harder, with a much larger and noisier vocabulary; transferring the MSCOCO model to SBU degraded performance from 28 down to 16.

    Question 2:

    Whether the model generates novel captions, and whether the generated captions are both diverse and high quality.

    1. The agreement in BLEU score among the top-15 generated sentences is 58.

    2. About 80% of the best candidate sentences are present in the training set, because of the small amount of training data.

    3. When analyzing the top-15 generated sentences, a completely novel description appears about half the time.

    Question 3:

    How much the learned representations have captured semantics from the statistics of the language.

    Hypothesis: even with only a few examples of a class (e.g., "unicorn"), its proximity to other word embeddings (e.g., "horse") should provide a lot more information, which would be completely lost with more traditional bag-of-words-based approaches.
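This hypothesis can be illustrated with toy vectors: if "unicorn" embeds near "horse", a similarity measure recovers that relation, whereas one-hot bag-of-words vectors treat every word pair as equally unrelated. The embeddings below are invented for illustration:

```python
import math

# hypothetical toy embeddings: "unicorn" is rare in training data,
# but its vector sits near "horse", so knowledge can transfer
EMB = {
    "horse":   [0.9, 0.1, 0.0],
    "unicorn": [0.8, 0.2, 0.1],
    "pizza":   [0.0, 0.1, 0.9],
}

def cosine(u, v):
    # cosine similarity: dot product of the vectors over their norms
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

sim_horse = cosine(EMB["unicorn"], EMB["horse"])   # high: related words
sim_pizza = cosine(EMB["unicorn"], EMB["pizza"])   # low: unrelated words
```

Under a one-hot encoding, the cosine between any two distinct words is exactly 0, which is precisely the information loss the hypothesis refers to.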

    Original post: https://www.cnblogs.com/phoenixash/p/12335193.html