Show, Attend and Tell [Kor]
Xu et al. / Show, Attend and Tell - Neural Image Caption Generation with Visual Attention / ICML 2015
English version of this article is available.
1. Problem definition
This model performs the image captioning task by combining an encoder-decoder architecture with an attention mechanism!
What is image captioning?
Simply put, image captioning is the task of feeding an image to a model and having the model produce a caption describing that image. To do this, the model must first be able to identify which objects appear in the image, and then connect those objects, represented in visual form, to the language we use, i.e., natural language.
Data can take various forms such as pictures (visual), text, and sound (auditory); training a model with several such data types (modes) is called multi-modal learning. The model covered in this classic paper can therefore be viewed as multi-modal learning, in that it performs image captioning by mapping visual information (images) to descriptive language (natural language).

This model adopts an encoder-decoder architecture together with an attention-based approach, which the Related work section explains in detail. Briefly, the model in the paper
(1) takes a 2-D image as input,
(2) performs feature extraction with a CNN to obtain feature vectors corresponding to the input image,
(3) applies an attention mechanism over the feature vectors inside an LSTM, and
(4) captions the image by generating words one at a time.
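As a shape-level sketch of the four steps (a toy numpy stand-in: the CNN features and attention weights are random numbers here, not a real encoder or trained decoder):

```python
import numpy as np

rng = np.random.default_rng(0)

# (1) A 2-D RGB input image (224 x 224, the usual VGG input size).
image = rng.random((224, 224, 3))

# (2) CNN feature extraction -> L = 196 annotation vectors of dimension D = 512.
#     (Faked with random numbers; a real encoder would be a VGG conv layer.)
annotations = rng.standard_normal((196, 512))

# (3) + (4) At each step, attend over the annotations and emit one word.
caption = []
for t in range(5):                          # generate a 5-word toy caption
    alpha = rng.dirichlet(np.ones(196))     # attention weights (stand-in)
    context = alpha @ annotations           # context vector for this step
    word_id = int(context.argmax()) % 1000  # stand-in for the word softmax
    caption.append(word_id)
```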
2. Motivation
Related work
Image captioning before neural networks
Before neural networks were applied to the image captioning task, there were two main lines of work. Both have largely fallen out of use since neural-network-based methods took over.
Perform object detection and attribute discovery first, then fill in a caption template
References: Kulkarni et al. (2013), Li et al. (2011), Yang et al. (2011)
Retrieve already-captioned images similar to the image to be captioned from a database, then modify the retrieved captions to fit the query image
References: Kuznetsova et al. (2012; 2014)
The encoder-decoder framework (sequence-to-sequence training with neural networks for machine translation)
This framework became mainstream in machine translation, where sequence-to-sequence training was at the time mostly carried out with RNN-based architectures. Because image captioning is analogous to 'translating' an image into text, the encoder-decoder framework for machine translation of Cho et al. (2014) was expected to work well here too.
Reference: Cho et al. (2014)
Attention mechanism
Show and Tell: A Neural Image Caption Generator
Show, Attend, and Tell can be regarded as the successor to this paper. The two are similar in using an LSTM for image captioning, but in Show and Tell the model sees the image only once, when the image is first fed into the LSTM. As the sequence grows longer, the model therefore gradually forgets its earlier parts, a well-known limitation of sequential models such as RNNs and LSTMs.
Idea
As the name suggests, the Show, Attend, and Tell model adds an attention mechanism to the generator introduced in Show and Tell. As mentioned above, because of the nature of sequential models such as RNNs and LSTMs, the model gradually forgets the earlier parts of a long sequence.
By adding visual attention to the decoder, Show, Attend, and Tell
lets the model retain access to every part of the sequence even as it grows longer,
guarantees interpretability, since we can see which part of the image the model attends to while captioning each word, and
achieves state-of-the-art performance.
3. Method

Encoder: Convolutional features
The encoder, a CNN, takes a 2-D input image and outputs feature vectors. Features are extracted from a convolutional layer whose activations form L spatial locations with D channels each. The result of feature extraction is therefore a set of L annotation vectors a_i, each a D-dimensional vector.
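As a minimal sketch (assuming a 14 × 14 × 512 conv feature map, which gives L = 196 and D = 512 as in the paper's VGG setting), extracting the annotation vectors is just flattening the spatial grid:

```python
import numpy as np

# Hypothetical conv feature map from a VGG-style encoder: 14 x 14 spatial
# locations, 512 channels (so L = 196, D = 512 as in the paper).
feature_map = np.random.rand(14, 14, 512)

# Flatten the spatial grid into L annotation vectors a_i, each of dimension D.
L, D = 14 * 14, 512
annotations = feature_map.reshape(L, D)  # shape (196, 512)
```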
Decoder: LSTM with attention over the image
The decoder is an LSTM. In broad terms, at each time step t = 1, ..., C it outputs one word y_t, an element of the caption sequence y, and the key reason for adopting the LSTM architecture is that it takes three quantities from earlier steps as input. In other words, it is an autoregressive, sequential model: it takes the previous step's results as input, produces this step's output, and thereby generates the words one after another.
Each input and output carries its own meaning:
output y_t = the word produced at the current time step
input h_{t-1} = the hidden state of the immediately preceding step (t-1)
input Ey_{t-1} = the word y_{t-1} produced at the previous time step, embedded by multiplying with the embedding matrix E
input ẑ_t = the context vector computed from the CNN encoder outputs a_i and the previous hidden state h_{t-1}
Matching these inputs and outputs to the LSTM gates gives the following.
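For reference, the gate equations from the paper, each gate conditioned on the previous word embedding Ey_{t-1}, the previous hidden state h_{t-1}, and the context vector ẑ_t:

```latex
\begin{aligned}
i_t &= \sigma\!\left(W_i E y_{t-1} + U_i h_{t-1} + Z_i \hat{z}_t + b_i\right) \\
f_t &= \sigma\!\left(W_f E y_{t-1} + U_f h_{t-1} + Z_f \hat{z}_t + b_f\right) \\
o_t &= \sigma\!\left(W_o E y_{t-1} + U_o h_{t-1} + Z_o \hat{z}_t + b_o\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\!\left(W_c E y_{t-1} + U_c h_{t-1} + Z_c \hat{z}_t + b_c\right) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

Here i_t, f_t, and o_t are the input, forget, and output gates, c_t the memory cell, and h_t the hidden state.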
Attention
The vector determined through the attention mechanism is the context vector ẑ_t. As mentioned above, ẑ_t is computed from the CNN encoder outputs a_i and the previous hidden state h_{t-1}.
Walking through the computation of the context vector step by step:
Feed the CNN encoder outputs a_i and the previous hidden state h_{t-1} into a function f_att to obtain the scores e_{ti} (i = 1, ..., L).
Here f_att is the attention model that computes the weight vector; it comes in two variants, hard attention and soft attention, explained further below.
Passing the e_{ti} (i = 1, ..., L) through a softmax layer yields the weights α_{ti}.
In the end, α_t is the vector that decides which of the a_i to weight, i.e., where to attend.
Passing the α_{ti} and a_i thus obtained through a function φ yields the context vector ẑ_t.
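The whole chain above can be sketched in a few lines of numpy (the single-layer additive f_att here is a simplified stand-in for the paper's attention MLP, and all weights are random for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
L, D, H = 196, 512, 100           # number of annotations, feature dim, hidden dim

a = rng.standard_normal((L, D))   # annotation vectors a_i from the CNN encoder
h_prev = rng.standard_normal(H)   # previous LSTM hidden state h_{t-1}

# f_att: a simple additive scoring function (a stand-in for the paper's MLP).
W_a = rng.standard_normal((D, 1))
W_h = rng.standard_normal((H, 1))
e = (a @ W_a + h_prev @ W_h).ravel()          # scores e_{ti}, one per location

# Softmax over the L locations gives the attention weights alpha_{ti}.
alpha = np.exp(e - e.max())
alpha /= alpha.sum()                          # sums to 1

# Soft attention: phi is the expectation, a weighted sum of the a_i.
z = alpha @ a                                 # context vector, shape (D,)
```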
Note: Attention - Stochastic hard vs. deterministic soft
Attention models divide broadly into hard attention and soft attention. The two are much like the distinction between hard labels, which mark a categorical yes/no with 0 and 1, and soft labels, which use real numbers in [0, 1]. When the model uses a sum-to-one vector to decide where to attend, it can either use hard attention, which attends to a single location with a one-hot (0/1) choice, or soft attention, which spreads the weight of 1 over several parts. The crucial difference this creates is whether the function computing the weights from the hidden state is differentiable.
Accordingly,
soft attention trains the model by differentiating the cost with respect to the encoder outputs and hidden states, letting the gradient flow back through the attention mechanism;
hard attention, on the other hand, randomly samples the location the caption model should focus on at every time step of training, which makes the model stochastic, so the function computing the weights from the hidden states is not differentiable.
If the weight function is non-differentiable, the model cannot be trained end-to-end in one go, and one has the burden of approximating the gradient flow instead. For this reason, soft attention, whose gradient can be computed directly so that an end-to-end model can be used, is the more common choice today.
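The contrast can be sketched as follows (soft attention takes a differentiable expectation over locations, while hard attention samples a single location, a discrete choice that gradients cannot flow through):

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 196, 512
a = rng.standard_normal((L, D))          # annotation vectors
alpha = rng.dirichlet(np.ones(L))        # some sum-to-1 attention weights

# Soft attention (deterministic): context is the expectation over locations.
z_soft = alpha @ a

# Hard attention (stochastic): sample one location s_t ~ Multinoulli(alpha)
# and use its annotation alone; this sampling step is non-differentiable.
s_t = rng.choice(L, p=alpha)
z_hard = a[s_t]
```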

The figure above visualizes the hard and soft attention cases nicely: the top rows show soft attention and the bottom rows hard attention. Attending with each word of the caption below (A, bird, flying, over, ...) as the target, soft attention spreads its weight and thus also attends to features relatively unrelated to the caption. Hard attention, being computed through sampling, does not pinpoint caption-related features exclusively either, but compared with soft attention it focuses on far fewer features, picking out and attending to the high-mass parts of the density.
4. Experiment & Result
Experimental setup
Dataset: Flickr8k, Flickr30k, and MS COCO
Flickr8k/30k: datasets that pair each image with sentence-based image descriptions. Flickr8k contains about 8,000 images and Flickr30k about 30,000, with 5 captions per image.
MS COCO: a dataset built for tasks such as object detection, segmentation, and keypoint detection.
Baselines: Google NIC, Log Bilinear, CMU/MS Research, MS Research, BRNN
Evaluation metric: BLEU-1,2,3,4/METEOR metrics
BLEU (Bilingual Evaluation Understudy) score: an n-gram-based metric used widely in translation tasks. It consists of three main ingredients.
Precision: first measure how many n-grams overlap between the reference and the prediction.
Clipping: corrects the precision when the same word occurs several times. However often a word repeats in the prediction, it is counted at most as many times as it occurs in the reference.
Brevity Penalty: a sentence made of, say, a single word can score very high precision even though it is not a proper sentence; a penalty based on the ratio of prediction length to reference length corrects this bias toward short outputs.
METEOR (Metric for Evaluation of Translation with Explicit ORdering) score: a metric devised to complement BLEU. It is computed as the harmonic mean of unigram precision and recall and, unlike BLEU's purely exact n-gram matching, also matches words through stemming and synonymy. It differs from BLEU, which operates at the corpus level, in showing high correlation with human judgment at the sentence and segment level.
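The three BLEU ingredients above combine as in this toy sketch (single reference, up to bigrams; real BLEU-4 uses n = 1..4 and is computed over a whole corpus):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(prediction, reference, max_n=2):
    """Toy BLEU for one prediction/reference pair (tokens as word lists)."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        pred_counts = Counter(ngrams(prediction, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clipping: a prediction n-gram counts at most as often as in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in pred_counts.items())
        log_prec += math.log(max(overlap, 1e-9) / max(len(ngrams(prediction, n)), 1))
    # Brevity penalty: punish predictions shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(prediction)))
    return bp * math.exp(log_prec / max_n)

ref = "a bird flying over a body of water".split()
perfect = bleu(ref, ref)                    # identical sentences score 1.0
repetitive = bleu("a a a a".split(), ref)   # clipping drives this far down
```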
Training setup
encoder CNN: Oxford VGGnet pretrained on ImageNet without finetuning.
Optimization: stochastic gradient descent with adaptive learning rates.
Flickr8k: RMSProp
Flickr30k/MS COCO: the Adam algorithm
Result

On every dataset, the attention-based approach scored markedly higher BLEU and METEOR than the previous models.

By visualizing which part of the image the caption generation model attends to as it produces each word, the authors give interpretability to the captioning process.
5. Conclusion
Until the Show and Tell paper appeared, image captioning was mostly learned on top of object detection: detect the objects in a given image and link them directly to natural language.
The Show and Tell paper broke away from the existing methods and performed image captioning end-to-end: encode the image with a CNN to obtain a representation vector, and decode the caption with an LSTM, which improved performance substantially.
The Show, Attend, and Tell model adds an attention mechanism to the architecture adopted in Show and Tell. Rather than looking at every image region equally, it distributes weights indicating which region of the image each caption word corresponds to.
In other words, through attention it
alleviates the vanishing-gradient (forgetting) problem of sequential models, and
grants interpretability, in that we can see with our own eyes which parts are attended to.
Take home message
Show, Attend, and Tell was an early attempt to bring visual attention into vision tasks, and this lineage continues to this day!
Looking up the classic papers that might well appear in textbooks is... good to do from time to time.
Author / Reviewer information
Author
ģ“ėÆ¼ģ¬ (Lee Min Jae)
M.S. student, KAIST AI
https://github.com/mjbooo slalektm@gmail.com
Reviewer
ģģģ: M.S. course, KAIST Graduate School of AI
ė°ģ¬ģ : M.S. course, KAIST Graduate School of AI
ģ¤ģģ¤: Ph.D. course, KAIST Department of Mechanical Engineering
Reference & Additional materials
Show, Attend, and Tell paper
https://arxiv.org/abs/1502.03044
On the 'show, attend, and tell' model
http://sanghyukchun.github.io/93/
https://hulk89.github.io/neural%20machine%20translation/2017/04/04/attention-mechanism/
https://jomuljomul.tistory.com/entry/Deep-Learning-Attention-Mechanism-%EC%96%B4%ED%85%90%EC%85%98
https://ahjeong.tistory.com/8
An implementation code with Pytorch (unofficial)
https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning
On attention part
https://github.com/Kyushik/Attention
On MS COCO dataset
https://ndb796.tistory.com/667
On BLEU SCORE
https://wikidocs.net/31695
https://donghwa-kim.github.io/BLEU.html