Show, Attend and Tell [Kor]

Xu et al. / Show, Attend and Tell: Neural Image Caption Generation with Visual Attention / ICML 2015

An English version of this article is available.

1. Problem definition

ģ“ ėŖØėøģ€ Encoder-Decoder 구씰와 attention mechanismģ„ ė”ķ•˜ģ—¬ image captioning task넼 ģˆ˜ķ–‰ķ•©ė‹ˆė‹¤!

  • What is image captioning?

    ź°„ė‹Øķžˆ ė§ķ•˜ė©“, image captioningģ€ image넼 ėŖØėøģ— ģž…ė „ģœ¼ė”œ ė„£ģ—ˆģ„ ė•Œ ėŖØėøģ“ captionģ„ 달아 image넼 ģ„¤ėŖ…ķ•˜ėŠ” task넼 ė§ķ•©ė‹ˆė‹¤. ģ“ėŸ° ģž‘ģ—…ģ„ ķ•˜źø° ģœ„ķ•“ģ„œėŠ” ģ¼ė‹Ø image ģ•ˆģ— 묓슨 objectź°€ ģžˆėŠ”ģ§€ ķŒė³„ķ•  수 ģžˆģ–“ģ•¼ ķ•˜ź³ , ź·ø image ķ˜•ģ‹ģœ¼ė”œ ķ‘œķ˜„ėœ object넼 ģš°ė¦¬ź°€ ģ‚¬ģš©ķ•˜ėŠ” ģ–øģ–“, 즉 Natural language에 ģ—°ź²°ķ•  수 ģžˆģ–“ģ•¼ ķ•©ė‹ˆė‹¤.

    ė°ģ“ķ„°ėŠ” 그림(visual), ė¬øģž(text), ģŒģ„±(auditory) 등 ė‹¤ģ–‘ķ•œ ķ˜•ķƒœė”œ ķ‘œķ˜„ė  수 ģžˆėŠ”ė°, ģ“ė ‡ė“Æ ģ—¬ėŸ¬ ė°ģ“ķ„° type (mode)넼 ģ‚¬ģš©ķ•“ģ„œ ėŖØėøģ„ ķ•™ģŠµģ‹œķ‚¤ėŠ” ķ˜•ķƒœė„¼ Multi-modal learningģ“ė¼ź³  ķ•©ė‹ˆė‹¤. ė”°ė¼ģ„œ ģ“ ź³ ģ „ģ ģø ė…¼ė¬øģ—ģ„œ 다룰 ėŖØėøė„ visual information (image)ź³¼ descriptive language (natural language)넼 mapķ•˜ģ—¬ image captioningģ„ ģˆ˜ķ–‰ķ•œė‹¤ėŠ” ė©“ģ—ģ„œ multi-modal learningģ“ė¼ź³  ķ•  수 ģžˆģŠµė‹ˆė‹¤.

(Figure: overall model architecture)
  • ģ“ ėŖØėøģ€ encoder-decoder 구씰와 attention based approach넼 ģ°Øģš©ķ•©ė‹ˆė‹¤. ģ“ģ— ėŒ€ķ•“ģ„œėŠ” Related work ģ„¹ģ…˜ģ—ģ„œ ģžģ„øķžˆ ģ„¤ėŖ…ķ• ź²Œģš”. ė…¼ė¬øģ—ģ„œ 다룬 ėŖØėøģ“ ģˆ˜ķ–‰ķ•˜ėŠ” task넼 ź°„ė‹Øķžˆ ģ„¤ėŖ…ķ•˜ė©“,

    ​ (1) 2-D image넼 input으딜 받아,

    ​ (2) 그에 ėŒ€ķ•“ CNN으딜 feature extractionģ„ ģˆ˜ķ–‰ķ•˜ģ—¬ input image에 ėŒ€ģ‘ė˜ėŠ” feature vector넼 얻고,

    ​ (3) LSTMģ—ģ„œ feature vector와 attention mechanismģ„ ģ‚¬ģš©ķ•˜ģ—¬ ,

    ​ (4) word넼 generationķ•˜ėŠ” ė°©ģ‹ģœ¼ė”œ image넼 captioningķ•©ė‹ˆė‹¤.
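Below is a toy, self-contained PyTorch sketch of this data flow. All dimensions ($L=196$, $D=512$, embedding size 100, vocabulary size 1000) and names are illustrative assumptions, not the paper's code: it shows the wiring of steps (1)-(4) with random weights, not a trained model.

```python
import torch
import torch.nn as nn

L, D, m, K, n = 196, 512, 100, 1000, 512  # locations, feature dim, embed dim, vocab size, LSTM size

a = torch.randn(1, L, D)                  # (2) stand-in for the CNN's annotation vectors
E = nn.Embedding(K, m)                    # word embedding matrix E
lstm = nn.LSTMCell(m + D, n)              # (3) LSTM consuming [E y_{t-1}; z_hat_t]
f_att = nn.Linear(D + n, 1)               # minimal stand-in for the attention model f_att
out = nn.Linear(n, K)                     # (4) maps hidden state to word logits

h, c = torch.zeros(1, n), torch.zeros(1, n)
word = torch.tensor([0])                  # assumed <start> token id
for t in range(5):                        # greedily emit a few words
    e = f_att(torch.cat([a, h.unsqueeze(1).expand(-1, L, -1)], dim=-1)).squeeze(-1)
    alpha = e.softmax(dim=-1)             # attention weights over the L locations
    z_hat = (alpha.unsqueeze(-1) * a).sum(dim=1)  # context vector
    h, c = lstm(torch.cat([E(word), z_hat], dim=-1), (h, c))
    word = out(h).argmax(dim=-1)          # next word id
```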

2. Motivation

  • Neural net ģ“ģ „ image captioning 방법들

    Neural netģ„ image captioning task에 ģ‚¬ģš©ķ•˜źø° ģ „ź¹Œģ§€, 크게 두 ķė¦„ģ“ ģžˆģ—ˆģŠµė‹ˆė‹¤. ź·øėŸ¬ė‚˜ ģ“ėÆø Neural Netģ„ ģ°Øģš©ķ•˜ėŠ” ė°©ģ‹ģ— 크게 밀려 ė” ģ“ģƒ ģ‚¬ģš©ķ•˜ģ§€ ģ•ŠėŠ”ė‹¤ź³  ķ•©ė‹ˆė‹¤.

    • Methods that first run object detection and attribute discovery, and then generate a caption from a template

      See: Kulkarni et al. (2013), Li et al. (2011), Yang et al. (2011)

    • Methods that retrieve already-captioned images similar to the image to be captioned from a database, and then modify the retrieved images' captions to fit the query image

      See: Kuznetsova et al. (2012; 2014)

  • The encoder-decoder framework (sequence-to-sequence training with neural networks for machine translation)

    This became the mainstream approach in machine translation; at the time, seq-to-seq training was mostly done with RNN-based architectures. Because image captioning is analogous to 'translating' a picture into a sentence, the authors argued that the encoder-decoder framework for machine translation of Cho et al. (2014) would be effective here as well.

    Reference: Cho et al. (2014).

  • Attention mechanism

  • Show and Tell: A Neural Image Caption Generator

    Show, Attend, and Tellģ˜ ģ „ ė²„ģ ¼ģ“ė¼ź³  ķ•  수 ģžˆģŠµė‹ˆė‹¤. LSTMģ„ ģ‚¬ģš©ķ•˜ģ—¬ image captioningģ„ ķ•œė‹¤ėŠ” ģ ģ—ģ„œ ė¹„ģŠ·ķ•˜ģ§€ė§Œ, ėŖØėøģ—ź²Œ image넼 ķ•˜ėŠ” ģ‹œģ ģ€ LSTM에 imageź°€ ģž…ė „ė  ė•Œ ķ•œė²ˆ ėæģž…ė‹ˆė‹¤. 그렇기 ė•Œė¬øģ— sequenceź°€ 길얓지멓 ėŖØėøģ€ sequenceģ˜ ģ•žė¶€ė¶„ģ„ 점차 ģžŠģ–“ė²„ė¦¬ź²Œ ė˜ėŠ”ė°, ģ“ėŠ” RNN, LSTM 등 sequential ėŖØėøģ˜ ź³ ģ§ˆģ ģø ė¬øģ œģ ģ“ė¼ź³  ķ•  수 ģžˆģŠµė‹ˆė‹¤.

Idea

ģ“ė¦„ģ—ģ„œė„ ģ•Œ 수 ģžˆė“Æģ“, Show, Attend, and Tellģ˜ ėŖØėøģ€ Show and Tell에 ė‚˜ģ˜Ø Generator에 attention mechanismģ„ ė”ķ•œ 구씰딜 ė˜ģ–“ģžˆģŠµė‹ˆė‹¤. ģ•žģ„œ ģ–øźø‰ķ–ˆė“Æģ“, RNN, LSTM 등 sequential ėŖØėøģ˜ ķŠ¹ģ„±ģƒ, sequenceź°€ 길얓지멓 ėŖØėøģ€ sequenceģ˜ ģ•žė¶€ė¶„ģ„ 점차 ģžŠģ–“ė²„ė¦½ė‹ˆė‹¤.

Show, Attend, and Tellģ—ģ„œėŠ” Decoder에 visual attentionģ„ ģ¶”ź°€ķ•Øģœ¼ė”œģØ

  • lets the model retain access to every part of the sequence even as it grows longer,

  • ėŖØėøģ“ ź·øė¦¼ģ˜ ģ–“ėŠ 부분에 주목(attention)ķ•˜ģ—¬ 단얓넼 captioningķ–ˆėŠ”ģ§€ ģ•Œ 수 ģžˆģ–“, ķ•“ģ„ź°€ėŠ„ģ„±(Interpretability)ģ„ ė³“ģž„ķ•˜ģ˜€ģœ¼ė©°,

  • achieved state-of-the-art performance.

3. Method

(Figure: overall model architecture)
  1. Encoder: Convolutional feature

    CNN으딜 źµ¬ģ„±ėœ Encoder ėŠ” 2D input image넼 받아 aaė¼ėŠ” feature vector넼 ģ¶œė „ķ•©ė‹ˆė‹¤. CNNģ˜ ė§ˆģ§€ė§‰ layerź°€ D개 neuron, Lź°œģ˜ channel딜 ģ“ė£Øģ–“ģ øģžˆģŠµė‹ˆė‹¤. ė”°ė¼ģ„œ feature extractionģ„ ģˆ˜ķ–‰ķ•œ ź²°ź³¼ėŠ” 각 aia_iėŠ” D차원 범터가 되고, ģ“ėŸ¬ķ•œ ė²”ķ„°ė“¤ģ“ ģ“ L개 ģžˆėŠ” ķ˜•ķƒœź°€ ė©ė‹ˆė‹¤.

    (Figure: CNN encoder producing the annotation vectors)
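As a concrete illustration, the paper's $L=196$, $D=512$ shape can be reproduced with an off-the-shelf VGG by keeping the 14Ɨ14Ɨ512 feature map in front of the final max-pool. This is a hedged sketch under that assumption, not the authors' exact extraction code.

```python
import torch
import torchvision.models as models

# VGG-19 conv stack minus its last MaxPool2d yields a 14x14x512 map for a
# 224x224 input; flattening the spatial grid gives L=196 vectors of size D=512.
vgg = models.vgg19(weights=None).features[:-1]
image = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    fmap = vgg(image)                    # (1, 512, 14, 14)
a = fmap.flatten(2).transpose(1, 2)      # (1, L, D) = (1, 196, 512)
```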

  2. Decoder: LSTM with attention over the image

    (Figure: LSTM decoder with attention)

    The decoder is an LSTM. In broad strokes, at each time step $t = 1, \dots, C$ it outputs one word $y_t$, an element of the caption vector $y$, and it takes three ingredients, $h_{t-1}$, $\hat{z}_t$, and $Ey_{t-1}$, as input. Feeding the previous step's results back in to produce the current step's output is the key reason for adopting the LSTM structure: it is a sequential, autoregressive model that generates the words one after another.

    ģ“ė•Œ inputź³¼ outputģ€ ź°ź°ģ˜ ģ˜ėÆøė„¼ 가지고 ģžˆģŠµė‹ˆė‹¤.

    • output $y_t$ = the word to be produced at the current time step

    • input $h_{t-1}$ = the hidden state at the immediately preceding time step $t-1$

    • input $Ey_{t-1}$ = the word $y_{t-1}$ produced at the previous time step, embedded by multiplying it with the embedding matrix $E \in \mathbb{R}^{m \times K}$

    • input $\hat{z}_t$ = the context vector computed from the CNN encoder output $a$ and the previous hidden state $h_{t-1}$.

    Matching these inputs and outputs to the LSTM gates gives the following:

$$
\begin{aligned}
i_t &= \sigma\left(W_i Ey_{t-1} + U_i h_{t-1} + Z_i \hat{z}_t + b_i\right)\\
f_t &= \sigma\left(W_f Ey_{t-1} + U_f h_{t-1} + Z_f \hat{z}_t + b_f\right)\\
o_t &= \sigma\left(W_o Ey_{t-1} + U_o h_{t-1} + Z_o \hat{z}_t + b_o\right)\\
g_t &= \tanh\left(W_g Ey_{t-1} + U_g h_{t-1} + Z_g \hat{z}_t + b_g\right)\\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$
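Here is a sketch of one decoder step with the gates written out explicitly, so it is visible where $Ey_{t-1}$, $h_{t-1}$, and $\hat{z}_t$ enter each gate; the dimensions and the fused 4-gate affine maps are assumptions made for brevity.

```python
import torch
import torch.nn as nn

m, D, n = 100, 512, 512                   # embedding dim, context dim, LSTM size
W = nn.Linear(m, 4 * n)                   # acts on E y_{t-1}
U = nn.Linear(n, 4 * n)                   # acts on h_{t-1}
Z = nn.Linear(D, 4 * n)                   # acts on z_hat_t

def lstm_step(Ey_prev, h_prev, c_prev, z_hat):
    pre = W(Ey_prev) + U(h_prev) + Z(z_hat)   # pre-activations of all four gates
    i, f, o, g = pre.chunk(4, dim=-1)
    i, f, o, g = i.sigmoid(), f.sigmoid(), o.sigmoid(), g.tanh()
    c = f * c_prev + i * g                    # memory cell update
    h = o * c.tanh()                          # new hidden state
    return h, c

h, c = lstm_step(torch.randn(1, m), torch.zeros(1, n), torch.zeros(1, n), torch.randn(1, D))
```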

  3. Attention

    attention mechanismģ„ 통핓 ź²°ģ •ė˜ėŠ” vectorėŠ” context vector z^\hat{z} ģž…ė‹ˆė‹¤. ģœ„ģ—ģ„œ ģ–øźø‰ķ•œėŒ€ė”œ, CNN encoder output aa 와 직전 hidden stateģø htāˆ’1h_{t-1}ģ„ ģ“ģš©ķ•“ context vector넼 계산할 수 ģžˆģŠµė‹ˆė‹¤.

    Walking through how the context vector is obtained, step by step:

    • Feed the CNN encoder outputs $a_i$ and the previous hidden state $h_{t-1}$ into a function $f_{att}$ to obtain $e_{ti}$ ($i = 1, \dots, L$).

      ģ“ė•Œ fattf_{att}ėŠ” weight vector넼 ź³„ģ‚°ķ•˜źø° ģœ„ķ•œ attention modelģ“ė©°, hard attentionź³¼ soft attention으딜 ė‚˜ė‰©ė‹ˆė‹¤. ģ“ėŠ” ė’¤ģ—ģ„œ ė‹¤ģ‹œ ģ„¤ėŖ…ķ•©ė‹ˆė‹¤ .

      $$e_{ti} = f_{att}(a_i, h_{t-1})$$

      Passing $e_{ti}$ ($i = 1, \dots, L$) through a softmax layer yields $\alpha_{ti}$.

      $$\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{L}\exp(e_{tk})}$$

      In the end, $\alpha_t = (\alpha_{t1}, \dots, \alpha_{tL})$ is the vector that decides where among $a_1, a_2, \dots, a_L$ to place weight, i.e., where to attend.

      The $a_i$ and $\alpha_{ti}$ obtained this way are passed through $\phi$ to become the context vector $\hat{z}_t$.

      $$\hat{z}_t = \phi\left(\{a_i\}, \{\alpha_{ti}\}\right)$$
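The three formulas above can be sketched directly. Here $f_{att}$ is assumed to be a small MLP and $\phi$ is the soft (weighted-sum) variant; all dimensions are illustrative.

```python
import torch
import torch.nn as nn

L, D, n = 196, 512, 512
f_att = nn.Sequential(nn.Linear(D + n, 256), nn.ReLU(), nn.Linear(256, 1))

def soft_attention(a, h_prev):            # a: (B, L, D), h_prev: (B, n)
    h_rep = h_prev.unsqueeze(1).expand(-1, a.size(1), -1)
    e = f_att(torch.cat([a, h_rep], dim=-1)).squeeze(-1)  # e_{ti}, shape (B, L)
    alpha = e.softmax(dim=-1)                             # alpha_t sums to 1 over L
    z_hat = torch.bmm(alpha.unsqueeze(1), a).squeeze(1)   # phi: weighted sum, (B, D)
    return z_hat, alpha

z_hat, alpha = soft_attention(torch.randn(2, L, D), torch.randn(2, n))
```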

Note: Attention - stochastic hard vs. deterministic soft

The attention model $f_{att}$ comes in two main flavors: hard attention and soft attention. The distinction is much like the difference between hard labels, which mark a qualitative presence/absence with 0 and 1, and soft labels, which use real values (for example, values in [0, 1]). When the model uses a sum-to-1 weight vector to decide which parts to attend to, it can use hard attention, which commits all the weight to a single location as a one-hot 0/1 choice, or soft attention, which distributes the 1 across several parts. The difference this creates is whether the function that computes the weights over the hidden states is differentiable.

ė”°ė¼ģ„œ,

  • Soft Attentionģ€ Encoderģ˜ hidden state넼 ėÆøė¶„ķ•˜ģ—¬ cost넼 źµ¬ķ•˜ź³  attention mechanismģ„ 통핓 gradientź°€ ķ˜ė ¤ė³“ė‚“ėŠ” ė°©ģ‹ģœ¼ė”œ ėŖØėøģ„ ķ•™ģŠµģ‹œķ‚µė‹ˆė‹¤.

  • Hard attention, on the other hand, randomly samples the location the caption model should focus on at every time step of training. This introduces stochasticity into the model, so the function that computes the weights over the hidden states is not differentiable.

If the weighting function is non-differentiable, the model cannot be trained end-to-end in one go, and the gradient flow has to be approximated along the way (the paper does this with a sampling-based, REINFORCE-like estimate of a variational lower bound), which is cumbersome. For this reason, soft attention, whose gradients can be computed directly in an end-to-end model, is the more widely used of the two today.

(Figure: soft attention (top row) vs. hard attention (bottom row), visualized per generated word)

ģœ„ figureģ—ģ„œ hard/soft attentionģ˜ 경우넼 ģž˜ visualizationķ•“ģ¤ė‹ˆė‹¤. ģœ—ģ¤„ģ€ soft attention, ģ•„ėž«ģ¤„ģ€ hard attentionģ˜ ź²½ģš°ģøė°ģš”. ķ•˜ė‹Øģ˜ caption (A, bird, flying, over, ...)ģ„ targetķ•˜ģ—¬ attendķ•  ė•Œ, soft attentionģ˜ 경우 ģƒėŒ€ģ ģœ¼ė”œ captionź³¼ ė¬“ź“€ķ•œ featureź¹Œģ§€ attendķ•˜ź³  ģžˆģŠµė‹ˆė‹¤(non-deterministicķ•˜ėÆ€ė”œ). hard attentionģ˜ ź²½ģš°ė„ ģƒ˜ķ”Œė§ģ„ 통핓 ź³„ģ‚°ė˜ėÆ€ė”œ, ģ˜¤ė”Æģ“ captionģ˜ feature만 targetķ•˜ź³  ģžˆėŠ” ź²ƒģ€ ģ•„ė‹ˆģ§€ė§Œ, soft attention에 비핓 훨씬 ģ ģ€ featureė§Œģ„ focusingķ•“ģ„œ density function 중 ė§Žģ€ ė¶€ė¶„ģ„ ķ• ģ• ķ•˜ģ—¬ attendķ•˜ź³  ģžˆģŠµė‹ˆė‹¤.

4. Experiment & Result

Experimental setup

  • Datasets: Flickr8k, Flickr30k, and MS COCO

    • Flickr8k/30k: datasets that pair each image with sentence-based image descriptions. Flickr8k contains about 8,000 images and Flickr30k about 30,000 images, with 5 captions per image.

    • MS COCO: a dataset built for tasks such as object detection, segmentation, and keypoint detection.

  • Baselines: Google NIC, Log Bilinear, CMU/MS Research, MS Research, BRNN

  • Evaluation metrics: BLEU-1/2/3/4 and METEOR

    • BLEU (Bilingual Evaluation Understudy) score: an n-gram-based metric commonly used in translation tasks. It is built from three main ingredients (a minimal sketch follows after this setup list):

      • Precision: first, measure how many n-grams overlap between the reference and the prediction.

      • Clipping: corrects the precision when the same word occurs several times. When counting a repeated word in the prediction, it is counted at most as many times as it occurs in the reference, no matter how often it appears.

      • Brevity penalty: a sentence made of a single word, say, is not a proper sentence, yet it can score very high precision. The prediction's length is therefore compared against the reference's length, correcting this over-scoring of short sentences.

    • METEOR (Metric for Evaluation of Translation with Explicit ORdering) score: a metric introduced to complement BLEU. It is computed as the harmonic mean of unigram precision and recall, with matching that goes beyond exact word forms (stems, synonyms). It differs from BLEU, which operates at the corpus level, in showing high correlation with human judgment at the sentence and segment level.

  • Training setup

    • encoder CNN: Oxford VGGnet pretrained on ImageNet without finetuning.

    • Stochastic gradient descent with adaptive learning rates:

      • For the Flickr8k dataset: RMSProp

      • Flickr30k/MS COCO dataset: Adam algorithm
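As promised above, here is a minimal BLEU-1 sketch with a single reference, showing clipped precision and the brevity penalty; the assumed simplification is unigrams only, whereas real BLEU combines n-gram precisions up to n = 4 over a whole corpus.

```python
import math
from collections import Counter

def bleu1(prediction: str, reference: str) -> float:
    pred, ref = prediction.split(), reference.split()
    ref_counts = Counter(ref)
    # clipped precision: a predicted word counts at most as often as it
    # appears in the reference
    clipped = sum(min(c, ref_counts[w]) for w, c in Counter(pred).items())
    precision = clipped / len(pred)
    # brevity penalty: down-weight predictions shorter than the reference
    bp = 1.0 if len(pred) > len(ref) else math.exp(1 - len(ref) / len(pred))
    return bp * precision

print(bleu1("a bird flying over water", "a bird flying over a body of water"))  # ~0.55
```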

Result

(Table: BLEU and METEOR scores on Flickr8k, Flickr30k, and MS COCO)

ėŖØė“  ė°ģ“ķ„° ģ…‹ģ—ģ„œ 기씓 ėŖØėøė“¤ė³“ė‹¤ attention based approach넼 ģ¼ģ„ ė•Œ BLEU, METEOR scoreź°€ 훨씬 ė†’ģ•˜ģŠµė‹ˆė‹¤.

(Figure: attended image regions for each generated word)

By showing which part of the image the caption generation model attended to while producing each word, the model lends interpretability to the captioning process.

5. Conclusion

Show and Tell ė…¼ė¬øģ“ ė°œķ‘œė˜źø° ģ“ģ „ź¹Œģ§€ image captioningģ€ 주딜 object detectionģ„ 기반으딜 ķ–ˆģŠµė‹ˆė‹¤. 주얓진 ģ“ėÆøģ§€ģ—ģ„œ 물첓넼 detectķ•˜ź³  ģ“ė„¼ 직접 ģžģ—°ģ–“ė”œ ģ—°ź²°ķ•˜ėŠ” ė°©ģ‹ģ„ ķƒķ•œ ź²ƒģž…ė‹ˆė‹¤.

Show and Tell ė…¼ė¬øģ€ 기씓 ė°©ė²•ģ„ ķƒˆķ”¼ķ•˜ģ—¬, end-to-end ė°©ģ‹ģœ¼ė”œ image captioningģ„ ģˆ˜ķ–‰ķ–ˆģŠµė‹ˆė‹¤. ģ“ėÆøģ§€ė„¼ CNN으딜 ģøģ½”ė”©ķ•˜ģ—¬ representation vector넼 얻고, captionģ„ LSTM으딜 ė””ģ½”ė”©ķ•˜ģ—¬ ģ„±ėŠ„ģ„ 크게 ķ–„ģƒ ģ‹œģ¼°ģŠµė‹ˆė‹¤.

Show, Attend, and Tell ėŖØėøģ€, Show and tellģ—ģ„œ ģ°Øģš©ķ•œ 구씰에 Attention mechanismģ„ ģ¶”ź°€ķ•œ 것과 ź°™ģŠµė‹ˆė‹¤. ėŖØė“  ģ“ėÆøģ§€ė„¼ ź· ė“±ķ•˜ź²Œ 볓지 ģ•Šź³ , 핓당 captionģ“ ģ–“ėŠ ģ“ėÆøģ§€ģ— ķ•“ė‹¹ķ•˜ėŠ”ģ§€ ź°€ģ¤‘ģ¹˜ė„¼ ė¶„ė°°ķ•˜ģ—¬ ķ•“ģ„ķ•œ ź²ƒģž…ė‹ˆė‹¤.

ė‹¤ģ‹œė§ķ•“, Attentionģ„ 통핓

  • mitigates the vanishing-gradient (long-range forgetting) problem of sequential models, and

  • gains interpretability, in that we can visually confirm which parts it attends to.

Take home message (today's lesson)

  1. Show, Attend, and Tell was an early attempt to introduce visual attention into vision tasks, and its legacy continues to this day!

  2. ģˆ˜ģ—… ģžė£Œģ— ķ¬ķ•Øė˜ģ–“ ģžˆģ„ ė²•ķ•œ ź³ ģ „ ė…¼ė¬øģ„ ģ°¾ģ•„ė³“ėŠ” ź²ƒė„... ź°€ė”ģ€ 좋다.

Author / Reviewer information

Author

ģ“ėÆ¼ģž¬ (Lee Min Jae)

  • M.S. student, KAIST AI

  • https://github.com/mjbooo slalektm@gmail.com

Reviewer

  1. ģ–‘ģ†Œģ˜: ģ¹“ģ“ģŠ¤ķŠø AI ėŒ€ķ•™ģ› ģ„ģ‚¬ź³¼ģ •

  2. 박여정: M.S. student, KAIST Graduate School of AI

  3. 오상윤: Ph.D. student, KAIST Mechanical Engineering

Reference & Additional materials

Show, Attend, and Tell paper

https://arxiv.org/abs/1502.03044

On the 'show, attend, and tell' model

http://sanghyukchun.github.io/93/

https://hulk89.github.io/neural%20machine%20translation/2017/04/04/attention-mechanism/

https://jomuljomul.tistory.com/entry/Deep-Learning-Attention-Mechanism-%EC%96%B4%ED%85%90%EC%85%98

https://ahjeong.tistory.com/8

A PyTorch implementation (unofficial)

https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning

On the attention mechanism

https://github.com/Kyushik/Attention

On the MS COCO dataset

https://ndb796.tistory.com/667

On BLEU score

https://wikidocs.net/31695

https://donghwa-kim.github.io/BLEU.html
