BEIT [Eng]

Bao et al. / BEiT: BERT Pre-Training of Image Transformers / ICLR 2022 Oral

1. Problem definition

์ด ๋…ผ๋ฌธ์€ self-supervised pre-training์„ ํ†ตํ•ด ์ด๋ฏธ์ง€์˜ representation learning์„ ์ˆ˜ํ–‰ํ•˜๋Š” ์—ฐ๊ตฌ๋ฅผ ์ง„ํ–‰ํ–ˆ์Šต๋‹ˆ๋‹ค. ๊ธฐ์กด์— ๋น„์ „ ์˜์—ญ์—์„œ ํ•œ ์ด๋ฏธ์ง€์— ์„œ๋กœ ๋‹ค๋ฅธ perturbation์„ ์ ์šฉํ•œ ๋’ค ์ด๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ representation learning์„ ์ง„ํ–‰ํ•˜๋˜ SimCLR๋‚˜ BYOL ๋“ฑ๊ณผ๋Š” ๋‹ค๋ฅด๊ฒŒ, NLP ์˜์—ญ์—์„œ ํฐ ์„ฑ๊ณผ๋ฅผ ๊ฑฐ๋‘” BERT์˜ Masked Language Modeling(MLM)์„ ์ด๋ฏธ์ง€์— ์ ์šฉ์‹œํ‚จ ๊ฒƒ์ด ์ด ๋…ผ๋ฌธ์˜ ์ฃผ๋œ contribution์ด๋ผ ํ•  ์ˆ˜ ์žˆ๊ฒ ์Šต๋‹ˆ๋‹ค.

pre-training์œผ๋กœ ํ•™์Šต๋œ representation์˜ ์„ฑ๋Šฅ์„ ๊ฒ€์ฆํ•˜๊ธฐ ์œ„ํ•œ fine-tuning task (ํ˜น์€ downstream task)๋กœ๋Š” ์ด๋ฏธ์ง€ ๋ถ„๋ฅ˜(image classification)์™€ semantic segmentation์„ ์ง„ํ–‰ํ–ˆ์Šต๋‹ˆ๋‹ค.

2. Motivation

Self-supervised Representation Learning

๋น„์ „ ์˜์—ญ์—์„œ ์ฃผ๋กœ ์ด๋ฃจ์–ด์ง„ representation learning ์ค‘ ๋Œ€ํ‘œ์ ์ธ ์—ฐ๊ตฌ๋ฅผ ๊ผฝ์œผ๋ผ๋ฉด SimCLR (Chen et al.)๋ฅผ ๋นผ๋†“์„ ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค. ์ด ์—ฐ๊ตฌ๋Š” contrastive learning์„ ํ†ตํ•ด ์ด๋ฏธ์ง€์˜ representation learning์„ ์ง„ํ–‰ํ•œ ์—ฐ๊ตฌ์ธ๋ฐ์š”, contrastive learning์˜ ๊ธฐ๋ณธ ๊ฐœ๋…๊ณผ ํ•จ๊ป˜ ๊ฐ„๋‹จํžˆ ์„ค๋ช…๋“œ๋ฆฌ๊ฒ ์Šต๋‹ˆ๋‹ค.

Contrastive learning์€ ๊ฐ ์ด๋ฏธ์ง€๊ฐ€ ๋ชจ๋ธ์„ ํ†ต๊ณผํ•ด์„œ ๋‚˜์˜จ latent vector (ํ˜น์€ representation vector)๊ฐ€ ์กด์žฌํ•˜๋Š” latent space ์ƒ์—์„œ, positive pair๋“ค์˜ latent vector๋“ค๋ผ๋ฆฌ๋Š” ๊ฐ€๊น๊ฒŒ negative pair๋“ค์˜ latent vector๋“ค๋ผ๋ฆฌ๋Š” ๋ฉ€๊ฒŒ ํ•™์Šต์‹œํ‚ค๋Š” ๋ฐฉ์‹์„ ๋งํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ์ˆ˜ํ–‰ํ•˜๊ธฐ ์œ„ํ•ด ์ฃผ๋กœ InfoNCE๋ผ๊ณ  ๋ถˆ๋ฆฌ๋Š”, ์•„๋ž˜์˜ loss ํ•จ์ˆ˜๋ฅผ ํ†ตํ•ด ๋ชจ๋ธ์„ ์ตœ์ ํ™”์‹œํ‚ค๋Š”๋ฐ์š”, ์ง๊ด€์ ์œผ๋กœ ์„ค๋ช…๋“œ๋ฆฌ์ž๋ฉด positive pair ํ˜น์€ negative pair์—์„œ ๋‚˜์˜จ latent vector pair๋“ค๋ผ๋ฆฌ์˜ similarity๋ฅผ ๊ณ„์‚ฐํ•œ ๋’ค ์ด๋ฅผ ๋ถ„๋ฅ˜ ๋ฌธ์ œ์—์„œ์˜ logit ๊ฐ’์œผ๋กœ ์ทจ๊ธ‰ํ•˜์—ฌ cross entropy๋กœ ํ•™์Šต์‹œํ‚จ๋‹ค๊ณ  ๋ณด์‹œ๋ฉด ๋ฉ๋‹ˆ๋‹ค. ์ด ๋•Œ, cross entropy ํ…€์˜ Ground Truth label์€ positive pair๊ฐ€ ๋˜๊ธฐ ๋•Œ๋ฌธ์— ํ•™์Šต ๊ณผ์ •์—์„œ positive pair์˜ similarity๋Š” ๋†’์ด๊ณ  negative pair์˜ similarity๋Š” ๋‚ฎ์•„์ง€๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

L_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j))}{\sum_{k=1}^{N} \mathbf{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k))}

์—ฌ๊ธฐ์„œ (i, j) pair๋Š” positive pair์˜ ์ด๋ฏธ์ง€ ์ธ๋ฑ์Šค๋ฅผ ์˜๋ฏธํ•˜๊ณ , ziz_i๋Š” ii๋ฒˆ์งธ ์ด๋ฏธ์ง€๊ฐ€ ๋ชจ๋ธ์„ ํ†ต๊ณผํ•˜์—ฌ ๋‚˜์˜จ latent vector์ž…๋‹ˆ๋‹ค. ๋˜ํ•œ ๋‘ ๋ฒกํ„ฐ ์‚ฌ์ด์˜ similarity๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” sim(โ‹…)sim(\cdot) ํ•จ์ˆ˜๋Š” ๋‚ด์  ํ˜น์€ cosine similarity๋ฅผ ์‚ฌ์šฉํ•˜๊ณค ํ•ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ contrastive learning์˜ ๊ณ ์งˆ์ ์ธ ๋ฌธ์ œ ์ค‘ ํ•˜๋‚˜๋Š” ๋ชจ๋ธ์ด ํ•™์Šตํ•˜๋Š” ๊ณผ์ •์—์„œ collapse๊ฐ€ ์ผ์–ด๋‚œ๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค. ๋ฌด์Šจ ๋ง์ด๋ƒ๋ฉด, ์šฐ๋ฆฌ๊ฐ€ ๊ธฐ๋Œ€ํ•˜๊ธฐ๋กœ๋Š” ๋ชจ๋ธ์ด positive pair์ธ ์ด๋ฏธ์ง€๋“ค๋ผ๋ฆฌ๋Š” ๋ฉ€๊ฒŒ ํ•˜๊ณ  negative pair์ธ ์ด๋ฏธ์ง€๋“ค๋ผ๋ฆฌ๋Š” ๋ฉ€๊ฒŒ ํ•ด์„œ latent space ์ƒ์— ๋‹ค์–‘ํ•œ ์ด๋ฏธ์ง€๋“ค์˜ representatation vector๋ฅผ ์—ฌ๊ธฐ์ €๊ธฐ ํฉ๋ฟŒ๋ ค์ค„ ์ค„ ์•Œ์•˜๋Š”๋ฐ, ์‹ค์ œ๋กœ ํ•ด๋ณด๋‹ˆ๊นŒ ๊ทธ๋ ‡๊ฒŒ ๋˜์ง€ ์•Š๊ณ  ์ด๋ฏธ์ง€๋“ค์˜ representation vector๋“ค์ด latent space์˜ ์•„์ฃผ ์ž‘์€ ๋ถ€๋ถ„ ์•ˆ์—์„œ๋งŒ ๋†€๊ณ  ์žˆ๋”๋ผ๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋ชจ๋ธ์ด ์ด๋ ‡๊ฒŒ ํ•™์Šต๋˜๋Š” ํ•ต์‹ฌ์ ์ธ ์›์ธ์€ ์œ„์—์„œ ์ •์˜๋œ loss ํ•จ์ˆ˜์™€ ๊ด€๋ จ์ด ์žˆ๋Š”๋ฐ์š”, ์ž์„ธํžˆ ๋ณด์‹œ๋ฉด positive pair ๊ฐ„์˜ similarity๊ฐ€ ๋งค์šฐ ๋†’๊ธฐ๋งŒ ํ•˜๋ฉด loss ๊ฐ’์ด ๋–จ์–ด์งˆ ๊ฒƒ์ด๋ผ๋Š” ์ƒ๊ฐ์„ ํ•˜์‹ค ์ˆ˜ ์žˆ์„ ๊ฒ๋‹ˆ๋‹ค. ๋ฌผ๋ก  negative pair ๊ฐ„์˜ similarity์— ๋น„ํ•ด์„œ positive pair ๊ฐ„์˜ similarity๊ฐ€ ๋†’์•„์•ผ ํ•˜๊ฒ ์ง€๋งŒ (์ด๋ ‡๊ฒŒ ํ•™์Šต๋˜๊ธธ ๊ธฐ๋Œ€ํ•œ ๊ฒƒ์ด๊ธฐ๋„ ํ•˜๊ณ ์š”), ๋ชจ๋ธ์ด ํ•ด๋‹น loss ํ•จ์ˆ˜๋ฅผ ์ตœ์ ํ™”ํ•˜๋Š” ๊ณผ์ •์—์„œ ๊ทธ๋ƒฅ ๋ชจ๋“  ์ด๋ฏธ์ง€๋“ค์„ ๋น„์Šทํ•œ representation vector๋กœ ๋งŒ๋“ค์–ด๋ฒ„๋ฆฌ๋Š” ๊ฒƒ์ด negative pair๋“ค ๊ฐ„์˜ similarity๋„ ํ•จ๊ป˜ ๊ณ ๋ คํ•˜๋Š” ๊ฒƒ๋ณด๋‹ค loss๋ฅผ ๋–จ์–ด๋œจ๋ฆฌ๊ธฐ ์ˆ˜์›”ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์ด๋Ÿฐ ์ผ์ด ๋ฐœ์ƒํ–ˆ์„ ๊ฐ€๋Šฅ์„ฑ์ด ์žˆ์Šต๋‹ˆ๋‹ค. ๊ฒฐ๊ตญ ๋ชจ๋ธ์ด pre-training ๋‹จ๊ณ„์—์„œ ์ด๋ ‡๊ฒŒ ๋ชจ๋“  ์ด๋ฏธ์ง€๋“ค์„ ๋น„์Šทํ•œ representation vector๋กœ ๋งŒ๋“ค์–ด๋ฒ„๋ฆฌ๋ฉด ๋‹น์—ฐํžˆ fine-tuning ๋‹จ๊ณ„์—์„œ ์ด๋ฏธ์ง€ ๋ถ„๋ฅ˜๋‚˜ semantic segmentation ๋“ฑ์˜ downstream task๋ฅผ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ฒƒ์„ ์˜คํžˆ๋ ค ๋” ์–ด๋ ต๊ฒŒ ๋งŒ๋“ญ๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๊ณ ์งˆ์ ์ธ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•จ๊ณผ ๋™์‹œ์— contrastive learning์„ ํ†ตํ•ด ์˜๋ฏธ์žˆ๋Š” ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ๋ณด์—ฌ์ค€ ๋Œ€ํ‘œ์ ์ธ ๋น„์ „ ํŽ˜์ดํผ๊ฐ€ ๋ฐ”๋กœ SimCLR (Chen et al.)์ž…๋‹ˆ๋‹ค. ์ด ๋…ผ๋ฌธ์˜ ์ฃผ์š” contribution๊ณผ ๊ทธ ๋ฐฉ๋ฒ•์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ •๋ฆฌํ•ด๋ณผ ์ˆ˜ ์žˆ๊ฒ ์Šต๋‹ˆ๋‹ค.

  • ๋ชจ๋ธ์„ ํ•™์Šต์‹œํ‚ฌ ๋•Œ ๋ฐฐ์น˜ ์‚ฌ์ด์ฆˆ๋ฅผ ํฌ๊ฒŒ ๋Š˜๋ฆฌ๋Š” ๊ฒƒ๊ณผ ๋™์‹œ์— ๋ฐฐ์น˜ ๋‚ด์—์„œ ์ž์‹ ์˜ positive sample์„ ์ œ์™ธํ•œ ๋ชจ๋“  ๋‹ค๋ฅธ ์ด๋ฏธ์ง€๋ฅผ negative sample๋กœ ์ทจ๊ธ‰ํ•จ์œผ๋กœ์จ ์—„์ฒญ๋‚˜๊ฒŒ ๋งŽ์€ negative pair๋ฅผ ํ†ตํ•ด ์œ„์—์„œ ๋ง์”€๋“œ๋ฆฐ model collapse๋ฅผ ๋ฐฉ์ง€ํ–ˆ์Šต๋‹ˆ๋‹ค.

  • Contrastive learning์˜ ํ•ต์‹ฌ์ด๋ผ๊ณ  ํ•  ์ˆ˜ ์žˆ๋Š” positive pair๋ฅผ ํ•œ ์ด๋ฏธ์ง€์— ์„œ๋กœ ๋‹ค๋ฅธ transformation (ํ˜น์€ perturbation)์„ ์ ์šฉํ•˜์—ฌ ๋‚˜์˜จ ๋‘ ์ด๋ฏธ์ง€๋กœ ์ •์˜ํ•˜๋Š” ๊ฒƒ์„ ์ œ์•ˆํ–ˆ์œผ๋ฉฐ, ์ด๋ฅผ ํ†ตํ•ด downstream task์— ๋Œ€ํ•ด ์˜๋ฏธ ์žˆ๋Š” ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ๋ณด์—ฌ์ฃผ์—ˆ์Šต๋‹ˆ๋‹ค.

  • ์œ„์™€ ๊ฐ™์ด positive pair๋ฅผ ๊ตฌ์„ฑํ•  ๋•Œ ํšจ๊ณผ์ ์ธ transformation์˜ ์กฐํ•ฉ์„ ์ œ์‹œํ–ˆ์Šต๋‹ˆ๋‹ค.

ํ•œํŽธ, representation learning์€ ์œ„์ฒ˜๋Ÿผ contrastive learning์„ ํ†ตํ•ด์„œ๋งŒ ํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒƒ์€ ์•„๋‹™๋‹ˆ๋‹ค. ์˜คํžˆ๋ ค contrastive task๋Š” pre-training ๋‹จ๊ณ„์—์„œ ํ•  ์ˆ˜ ์žˆ๋Š” pretext task์˜ ์ผ๋ถ€์ผ ๋ฟ์ธ๋ฐ์š”, ์ด ๋…ผ๋ฌธ์—์„œ ์ด๋ฏธ์ง€์˜ representation learning์„ ์œ„ํ•ด ์‚ฌ์šฉํ•˜๋Š” ๋ฐฉ์‹ ๋˜ํ•œ contrastive learning์ด ์•„๋‹Œ Masked Image Modeling(MIM)์œผ๋กœ, BERT (Devlin et al.)์—์„œ NLP๋ฅผ ์œ„ํ•ด ์‚ฌ์šฉ๋œ MLM์„ ์ด๋ฏธ์ง€ ๋„๋ฉ”์ธ์— ๋งž๊ฒŒ ๋ณ€ํ˜•์‹œํ‚จ ํ˜•ํƒœ์ž…๋‹ˆ๋‹ค. ์ด ๋…ผ๋ฌธ์˜ ํ•ต์‹ฌ์ด๋ผ๊ณ  ํ•ด๋‹นํ•  ์ˆ˜ ์žˆ๋Š” ์•„์ด๋””์–ด๊ฐ€ BERT์—์„œ motivated๋œ ๋งŒํผ, BERT์— ๋Œ€ํ•ด์„œ๋„ ๊ฐ„๋‹จํžˆ ์„ค๋ช…๋“œ๋ฆฌ๊ฒ ์Šต๋‹ˆ๋‹ค. BERT๋Š” ์ž์—ฐ์–ด ๋„๋ฉ”์ธ์—์„œ pre-training ๋ฐฉ๋ฒ•๋ก ์„ ์ œ์‹œํ–ˆ๊ณ , ์ด๋ฅผ Transformer Encoder์— ์ ์šฉํ•˜์—ฌ NLP ์˜์—ญ์—์„œ ๋Œ€๋ถ€๋ถ„์˜ downstream task์— ๋Œ€ํ•ด ์ผ๊ด€์ ์œผ๋กœ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ๋ณด์—ฌ์ค€ ๋…ผ๋ฌธ์ž…๋‹ˆ๋‹ค. BERT์—์„œ ์ œ์‹œํ•œ pre-training ๋ฐฉ๋ฒ•์—๋Š” ์ด ๋‘ ๊ฐ€์ง€, MLM๊ณผ Next Sentence Prediction(NSP)์ด ์žˆ์Šต๋‹ˆ๋‹ค.

  • Masked Language Modeling MLM selects some random tokens from the input sequence, replaces them with a special [MASK] token, passes the sequence through the model, and attaches a linear layer to the output at each masked position to predict what the original token was before masking (a minimal sketch follows this list).

  • Next Sentence Prediction In addition to MLM, BERT is simultaneously trained with NSP: given an input sequence made of two sentences, the model predicts whether the two sentences are actually consecutive or not.
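As referenced above, here is a minimal, self-contained sketch of the MLM objective (toy BERT-Base-like sizes; 103 is BERT's [MASK] token id, and a single encoder layer stands in for the full 12-layer model):

import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden, mask_id = 30522, 768, 103
embed = nn.Embedding(vocab_size, hidden)
encoder = nn.TransformerEncoderLayer(hidden, nhead=12, batch_first=True)
mlm_head = nn.Linear(hidden, vocab_size)         # hidden state -> vocab logits

tokens = torch.randint(0, vocab_size, (1, 128))  # a dummy input sequence
mask = torch.rand(tokens.shape) < 0.15           # select ~15% of positions
inputs = tokens.masked_fill(mask, mask_id)       # replace them with [MASK]
h = encoder(embed(inputs))                       # contextualized hidden states
loss = F.cross_entropy(mlm_head(h)[mask],        # classify masked positions
                       tokens[mask])             # back to their original ids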

Transformers for Visual Tasks

NLP ์˜์—ญ์—์„œ Transformer ์•„ํ‚คํ…์ณ๋Š” ๊ธฐ์กด์˜ RNN์— ๋น„ํ•ด ๋†€๋ผ์šธ๋งŒํผ์˜ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ๋ณด์—ฌ์ฃผ์—ˆ์Šต๋‹ˆ๋‹ค. ๊ทธ ํ๋ฆ„์— ๋”ฐ๋ผ ๋น„์ „ ์˜์—ญ์—์„œ๋„ Transformer๋ฅผ ํ™œ์šฉํ•˜๋Š” ์—ฐ๊ตฌ๋“ค์ด ์ด๋ฃจ์–ด์กŒ๋Š”๋ฐ, ๋Œ€ํ‘œ์ ์œผ๋กœ fully-Transformer-based ์•„ํ‚คํ…์ณ๋ฅผ ํ†ตํ•ด ์ด๋ฏธ์ง€ ๋ถ„๋ฅ˜ ํƒœ์Šคํฌ๋ฅผ ์ง„ํ–‰ํ•œ vision transformer (Dosovitskiy et al.) (ViT)์™€, semantic segmentation๊ณผ ๊ฐ™์€ scene understanding ํƒœ์Šคํฌ๋ฅผ ์ง„ํ–‰ํ•œ swin transformer (Liu et al.)๋ฅผ ๋“ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ๋ชจ๋ธ๋“ค์€ ํ˜„์žฌ ๋‹ค์–‘ํ•œ SOTA ๋ชจ๋ธ๋“ค์˜ backbone ์•„ํ‚คํ…์ณ๋กœ ์“ฐ์ด๊ณ  ์žˆ๋Š”๋ฐ์š”, Transformer์˜ ํ•ต์‹ฌ ์•„์ด๋””์–ด์ธ self-attention mechanism ํƒ“์— ์ถฉ๋ถ„ํ•œ computational resource์™€ training time์ด ํ•„์š”ํ•˜๋‹ค๋Š” ๋‹จ์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ Transformer ์•„ํ‚คํ…์ณ๋Š” ์ผ๋ฐ˜์ ์œผ๋กœ CNN-based ๋ชจ๋ธ๋ณด๋‹ค ๋” ๋งŽ์€ ํ•™์Šต ๋ฐ์ดํ„ฐ๋ฅผ ์š”๊ตฌํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์„ฑ๋Šฅ์€ ๋ณด์žฅ๋˜์ง€๋งŒ ํ•™์Šต์‹œํ‚ค๊ธฐ ์œ„ํ•œ ์กฐ๊ฑด์ด ๊นŒ๋‹ค๋กญ๋‹ค๋Š” ์ ์ด ๋Œ€ํ‘œ์ ์ธ ๋ฌธ์ œ๋ผ๊ณ  ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Idea

์ด ๋…ผ๋ฌธ์˜ ํ•ต์‹ฌ ์•„์ด๋””์–ด๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์š”์•ฝํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

  • Self-supervised Learning์„ ํ†ตํ•ด Vision Transformer๊ฐ€ CNN-based model์— ๋น„ํ•ด ๋” ๋งŽ์€ ํ•™์Šต ๋ฐ์ดํ„ฐ๋ฅผ ์š”๊ตฌํ•œ๋‹ค๋Š” ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐ

  • ์ด ๋•Œ, self-supervised learning์˜ ๋ฐฉ๋ฒ•๋ก ์œผ๋กœ์„œ ๊ธฐ์กด์— NLP ์˜์—ญ์—์„œ ์—„์ฒญ๋‚œ ์„ฑ๊ณผ๋ฅผ ๋ณด์—ฌ์ค€ BERT์˜ MLM์„ ์ด๋ฏธ์ง€ ๋„๋ฉ”์ธ์— ์ ์šฉํ•  ์ˆ˜ ์žˆ๋„๋ก ๋ณ€ํ˜•ํ•˜์—ฌ ์ œ์‹œ

3. Method

Image Patch

๋จผ์ €, input์œผ๋กœ ์‚ฌ์šฉ๋˜๋Š” 224 x 224 ์ด๋ฏธ์ง€๋ฅผ 16 x 16์˜ ์ž‘์€ patch๋“ค๋กœ ์ชผ๊ฐญ๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ์ด (224 / 16) x (224 / 16) = 14 x 14๊ฐœ์˜ patch๋กœ ์ชผ๊ฐœ์ง€๋ฉฐ, ์ขŒ์ƒ๋ถ€ํ„ฐ ์šฐํ•˜๊นŒ์ง€ ์ˆœ์„œ๋Œ€๋กœ Vision Transformer์˜ input sequence๋ฅผ ๊ตฌ์„ฑํ•ฉ๋‹ˆ๋‹ค.

Visual Token

BERT์ฒ˜๋Ÿผ input sequence์˜ ์ผ๋ถ€๋ฅผ maskingํ•œ ๋’ค mask๋˜๊ธฐ ์ „์˜ token์„ ์˜ˆ์ธกํ•˜๋Š” MLM์„ ํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š”, input patch์— ๋Œ€ํ•œ discretization์ด ์ด๋ฃจ์–ด์ ธ์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๋ฌผ๋ก , mask๋œ token์ด ํ†ต๊ณผํ•ด์„œ ๋‚˜์˜จ hidden token์„ mask๋˜๊ธฐ ์ „์˜ ์›๋ณธ ์ด๋ฏธ์ง€๋กœ ๋ณต์›์‹œํ‚ค๋Š” ์ผ์ข…์˜ regression task๋ฅผ ์ง„ํ–‰ํ•  ์ˆ˜๋Š” ์žˆ๊ฒ ์ง€๋งŒ, ์ €์ž์˜ ๋ง์— ๋”ฐ๋ฅด๋ฉด ์ด๋Š” ์ ์ ˆํ•œ ๋ฐฉ๋ฒ•์ด ์•„๋‹ˆ๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

However, such pixel-level recovery task tends to waste modeling capability on pre-training short-range dependencies and high-frequency details.

๋”ฐ๋ผ์„œ, ์šฐ๋ฆฌ๋Š” MLM์ฒ˜๋Ÿผ discrimination (classification) task, ์ฆ‰ hidden token์„ ํ†ตํ•ด mask๋˜๊ธฐ ์ „์˜ ์›๋ณธ์„ ์˜ˆ์ธกํ•˜๊ฒŒ๋” ํ•˜์—ฌ ๋ชจ๋ธ์„ ํ•™์Šต์‹œํ‚ฌ ๊ฒ๋‹ˆ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด์„œ๋Š” ์•ž์„œ ์–ธ๊ธ‰ํ•œ๋Œ€๋กœ input patch์— ๋Œ€ํ•œ discretization์ด ํ•„์š”ํ•˜๋‹ค๊ณ  ํ–ˆ๋Š”๋ฐ์š”, ์ด๊ฒŒ ๋ฌด์—‡์„ ์˜๋ฏธํ•˜๋Š” ๊ฑธ๊นŒ์š”?

์œ„ ๊ทธ๋ฆผ์„ ๋ณด์‹œ์ฃ . ๋งŒ์•ฝ ์šฐ๋ฆฌ๊ฐ€ MLM์„ ํ•œ๋‹ค๋ฉด ์œ„์™€ ๊ฐ™์ด mask๋œ token์ด ๋ชจ๋ธ์„ ํ†ต๊ณผํ•˜๊ณ  ๋‚˜์˜จ hidden token์—์„œ ๋ฏธ๋ฆฌ ์ •์˜๋œ vocabulary์— ์žˆ๋Š” ๋‹จ์–ด๋“ค ์ค‘ ํ•˜๋‚˜๋กœ classification์„ ์ง„ํ–‰ํ•  ๊ฒ๋‹ˆ๋‹ค. Classification ๊ฒฐ๊ณผ ๊ฐ€์žฅ ๋†’์€ ํ™•๋ฅ ๊ฐ’์„ ๊ฐ€์ง„ ๋‹จ์–ด๊ฐ€ ๋ชจ๋ธ์ด ์˜ˆ์ธกํ•˜๋Š” ๊ฐ€์žฅ ๊ทธ๋Ÿด๋“ฏํ•œ ๋‹จ์–ด๊ฒ ์ฃ . ๊ทธ๋Ÿฐ๋ฐ ์šฐ๋ฆฌ๋Š” ์ž์—ฐ์–ด๋ฅผ input์œผ๋กœ ํ†ต๊ณผ์‹œํ‚ฌ ๊ฒƒ์ด ์•„๋‹ˆ๋ผ image patch๋“ค์„ ํ†ต๊ณผ์‹œํ‚ฌ ๊ฑด๋ฐ, ์—ฌ๊ธฐ์„œ ๋ฌธ์ œ๊ฐ€ ์ƒ๊น๋‹ˆ๋‹ค. ์šฐ๋ฆฌ๊ฐ€ ์ด๋ฏธ์ง€์— ๋Œ€ํ•œ vocabulary๊ฐ€ ๋”ฐ๋กœ ์žˆ์ง€๊ฐ€ ์•Š์€๋ฐ, ๋Œ€์ฒด ์–ด๋–ป๊ฒŒ masked hidden token์— ๋Œ€ํ•œ classification์„ ์ˆ˜ํ–‰ํ•  ๊ฒƒ์ด๋ƒ๋Š” ๊ฒ๋‹ˆ๋‹ค. ๋ฐ”๋กœ ์ด ๋ฌธ์ œ๋กœ ์ธํ•ด์„œ ์šฐ๋ฆฌ๋Š” image patch๋“ค์„ ๋งˆ์น˜ ์ž์—ฐ์–ด๊ฐ€ ๊ทธ๋ ‡๊ฒŒ ํ•˜๋“ฏ์ด tokenize (discretize)ํ•˜๊ณ , token๋“ค์˜ set์ธ vocabulary๋ฅผ ์ •์˜ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ฆ‰ discretize๋ผ๋Š” ๊ฒƒ์€, continuousํ•œ RGB๊ฐ’๋“ค๋กœ ์ด๋ฃจ์–ด์ง„ image patch๋ฅผ discreteํ•œ ๋‹จ์œ„๋กœ ์ชผ๊ฐฌ์œผ๋กœ์จ ์›๋ณธ ์ด๋ฏธ์ง€ ์˜ˆ์ธก์„ classification task๋กœ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๊ฒŒ๋” ํ•˜๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.

์ด๋ฅผ ์œ„ํ•ด์„œ ์šฐ๋ฆฌ๋Š” ๋ฏธ๋ฆฌ ์ •์˜๋œ vocabulary์™€ tokenizer๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ์ด ๋…ผ๋ฌธ์—์„œ๋Š” DALL-E (Ramesh et al.)์—์„œ ๊ณต๊ฐœํ•œ tokenizer๋ฅผ ์‚ฌ์šฉํ–ˆ๋Š”๋ฐ, ์ด๋Š” discrete variational auto-encoder์ธ VQ-VAE (Vector quantized-Variational AutoEncoder) (Oord et al.)๋กœ ์ด๋ฃจ์–ด์กŒ์Šต๋‹ˆ๋‹ค. ์ฆ‰, VQ-VAE์˜ codebook์ด vocabulary๊ฐ€ ๋˜๋ฉฐ vector quantization์„ ํ†ตํ•ด ๊ฐ image patch๊ฐ€ codebook ์ƒ์˜ ํŠน์ • vector๋กœ tokenize ๋ฉ๋‹ˆ๋‹ค. quantization ๊ณผ์ •์— ๋Œ€ํ•œ ์„ค๋ช…์€ ๋ณธ ๋ฆฌ๋ทฐ์˜ scope๋ฅผ ๋ฒ—์–ด๋‚˜๊ธฐ ๋•Œ๋ฌธ์— ์ž์„ธํ•œ ๋‚ด์šฉ์€ ์„œ์ˆ ํ•˜์ง€ ์•Š๊ฒ ์ง€๋งŒ, ์ค‘์š”ํ•œ ๊ฒƒ์€ ์œ„ ๋ชจ๋“ˆ์„ ํ†ตํ•ด์„œ ๊ฐ image patch๊ฐ€ discreteํ•œ visual token์ด ๋˜๊ณ , ์ด๋ฅผ ํ†ตํ•ด MLM๊ณผ ๊ฐ™์€ task๋ฅผ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ์„ ํŒŒ์•…ํ•˜์…จ๊ธฐ๋ฅผ ๋ฐ”๋ž๋‹ˆ๋‹ค.

Masked Image Modeling

์—ฌ๊ธฐ๊นŒ์ง€ ์™”์œผ๋ฉด ์ด์ œ ๋‚จ์€๊ฑด (1) input์œผ๋กœ ๋“ค์–ด๊ฐˆ image patch์˜ ์ผ๋ถ€๋ฅผ maskingํ•œ ๋’ค (2) Transformer Encoder์— ํ†ต๊ณผ์‹œํ‚ค๊ณ , (3) mask๋œ ์œ„์น˜์—์„œ ๋‚˜์˜จ hidden output token์„ ๊ฐ€์ง€๊ณ  mask๋˜๊ธฐ ์ „์˜ ์›๋ณธ image patch์˜ code๋กœ classificationํ•˜๋Š”, ์ด๋ฅธ๋ฐ” Masked Image Modeling (MIM)์„ ์ง„ํ–‰ํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค. ๊ฐ ๋‹จ๊ณ„๋ฅผ ์ฐจ๊ทผ์ฐจ๊ทผ ๋”ฐ๋ผ๊ฐ€๋ด…์‹œ๋‹ค.

(1) Blockwise Masking

์ด ๋…ผ๋ฌธ์—์„œ๋Š” (14 x 14)๊ฐœ์˜ image patch๋“ค ์ค‘ ์•ฝ 40% ์ •๋„๋ฅผ maskingํ•˜๊ธฐ๋กœ ํ–ˆ์Šต๋‹ˆ๋‹ค. ์šฐ๋ฆฌ๊ฐ€ ์ƒ๊ฐํ•˜๊ธฐ์— ๊ฐ€์žฅ naiveํ•œ ๋ฐฉ์‹์€ (14 x 14)๊ฐœ์˜ patch๋“ค์„ ๊ฐ๊ฐ 40% ํ™•๋ฅ ๋กœ masking๋ ์ง€ ์•ˆ๋ ์ง€ ์„ ํƒํ•˜๊ฒŒ๋” ํ•˜๋ฉด ๋  ํ…Œ์ง€๋งŒ, ์ด ๋…ผ๋ฌธ์—์„œ๋Š” ๋‹ค๋ฅธ ๋ฐฉ์‹์˜ masking ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ–ˆ์Šต๋‹ˆ๋‹ค. ๋ฐ”๋กœ blockwise masking์ธ๋ฐ์š”, ์‰ฝ๊ฒŒ ๋ง์”€๋“œ๋ฆฌ์ž๋ฉด ๊ฐ image patch๋ฅผ ๋…๋ฆฝ์ ์œผ๋กœ maskingํ•˜๋Š” ๊ฒŒ ์•„๋‹ˆ๋ผ ์—ฐ์†๋œ image patch๋ฅผ ๊ณจ๋ผ์„œ block ๋‹จ์œ„๋กœ maskingํ•˜์ž๋Š” ๊ฒ๋‹ˆ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด์„œ span masking์ฒ˜๋Ÿผ transformer์— input์œผ๋กœ ๋„ฃ๊ธฐ ์œ„ํ•ด ํ•œ ์ค„๋กœ ์„ธ์šด sequence์—์„œ ์—ฐ์†๋œ token์„ maskingํ•  ์ˆ˜๋„ ์žˆ๊ฒ ์ง€๋งŒ, ๊ทธ๊ฒƒ๋ณด๋‹ค๋Š” image์˜ ํŠน์„ฑ์„ ์‚ด๋ ค์„œ ํ•œ ์ค„๋กœ ์„ธ์šฐ๊ธฐ ์ „์˜ image patch๋“ค์„ block ๋‹จ์œ„๋กœ ๋ฌถ๋Š” ๊ฒƒ์„ ์ œ์•ˆํ–ˆ์Šต๋‹ˆ๋‹ค. ์œ„ Figure 3์˜ ์™ผ์ชฝ์— blockwise masking์ด ๋˜๋Š” ๋ถ€๋ถ„์„ ๋ณด์‹œ๋ฉด 2 x 2์˜ patch๊ฐ€ masking๋˜๋Š” ๊ฒƒ์„ ํ™•์ธํ•˜์‹ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ตฌ์ฒด์ ์œผ๋กœ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ masking๋  image patch๋“ค์ด ์„ ํƒ๋ฉ๋‹ˆ๋‹ค. ์ด๋ ‡๊ฒŒ ์„ ํƒ๋œ image patch๋“ค์€ learnableํ•œ special token์œผ๋กœ ๋Œ€์ฒด๋ฉ๋‹ˆ๋‹ค.

# Runnable Python version of the blockwise masking algorithm
# Input: an h x w grid of N = h * w image patches
# Output: the set M of masked positions
import math
import random

def blockwise_masking(h, w, mask_ratio=0.4, min_patches=16):
    N = h * w
    M = set()
    # Repeat until more than 40% of all image patches are masked
    while len(M) <= mask_ratio * N:
        # Sample the number of patches to mask in this block (at least 16)
        s = random.uniform(min_patches, max(min_patches, mask_ratio * N - len(M)))
        # Sample the aspect ratio (height:width) between 0.3 and 1/0.3
        r = random.uniform(0.3, 1 / 0.3)
        # Height (a) and width (b) of the patch block, clamped to the grid
        a = min(h, max(1, round(math.sqrt(s * r))))
        b = min(w, max(1, round(math.sqrt(s / r))))
        # Sample the top-left corner (t, l) where the block is placed
        t = random.randint(0, h - a)
        l = random.randint(0, w - b)
        # Add every image patch inside the block to the masked set
        for i in range(t, t + a):
            for j in range(l, l + b):
                M.add((i, j))
    return M
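For example, calling this on BEIT's 14 x 14 patch grid masks just over 40% of the 196 positions (the exact count varies with the final block drawn):

masked = blockwise_masking(h=14, w=14)
print(len(masked))   # e.g. 79-85 masked positions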

(2) Forwarding to Transformer Encoder

์ด๋ ‡๊ฒŒ ํ•ด์„œ masking์ด ์™„๋ฃŒ๋˜๊ณ  ํ•œ ์ค„๋กœ ์„ธ์›Œ์ง„ input sequence์—์„œ ๋งจ ์•ž์— "start of sequence"๋ฅผ ๋œปํ•˜๋Š” ๋˜ ๋‹ค๋ฅธ special token์„ ์ถ”๊ฐ€ํ•ด์ฃผ๋ฉด, input์œผ๋กœ ๋“ค์–ด๊ฐˆ ์• ๋“ค์€ ๋‹ค ์ •ํ•ด์กŒ์Šต๋‹ˆ๋‹ค. ๋‹ค๋งŒ, mask token๊ณผ sos token์ด ์•„๋‹Œ image patch๋“ค์€ ์•„์ง rawํ•œ RGB ํ”ฝ์…€ ๊ฐ’๋“ค๋กœ ์ด๋ฃจ์–ด์กŒ๊ธฐ ๋•Œ๋ฌธ์— transformer์— ํƒœ์šฐ๊ธฐ ์œ„ํ•ด์„œ image patch๋“ค์„ ๊ฐ๊ฐ vector๋กœ ํ‘œํ˜„ํ•ด์ค˜์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ๊ฐ„๋‹จํ•˜๊ฒŒ image patch๋ฅผ linear layer๋ฅผ ํ†ตํ•ด ์ •ํ•ด์ง„ ์ฐจ์› (์•„๋งˆ 768์ด๊ฒ ์ฃ ?)์œผ๋กœ projectionํ•จ์œผ๋กœ์จ ์ด๋ฃจ์–ด์ง‘๋‹ˆ๋‹ค. ์ž, ๊ทธ๋Ÿผ ์ •๋ง ์ตœ์ข…์ ์œผ๋กœ transformer encoder์— ๋“ค์–ด๊ฐˆ input์ด ๊ฒฐ์ •๋์Šต๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์— patch๋“ค ๊ฐ„์— ์ˆœ์„œ ์ •๋ณด ์—ญํ• ์„ ํ•ด์ค„ learnableํ•œ position embedding์ด ๋”ํ•ด์ง€๋ฉด, ๊ทธ๋Œ€๋กœ transformer encoder๋ฅผ ํ†ต๊ณผํ•ด์„œ ๊ฐ input token๋งˆ๋‹ค hidden output token์ด ๋‚˜์˜ค๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

(3) Masked Image Modeling

์ด์ œ ์šฐ๋ฆฌ๋Š” Masked Image Modeling์„ ํ•  ๊ฒƒ์ด๊ธฐ ๋•Œ๋ฌธ์— hidden output token๋“ค ์ค‘์—์„œ mask๊ฐ€ ๋˜์—ˆ๋˜ ์œ„์น˜์˜ output token๋งŒ์„ ์‚ฌ์šฉํ•  ๊ฒ๋‹ˆ๋‹ค. ํ•ด๋‹น hidden vector๋“ค์„, VQ-VAE๋ฅผ ํ•™์Šต์‹œํ‚ด์œผ๋กœ์จ ๋ฏธ๋ฆฌ ๊ตฌํ•ด๋†จ๋˜ codebook์˜ size๋กœ projection (์ €์ž๋“ค์€ ์ด Linear layer๋ฅผ Masked Image Modeling Head๋ผ๊ณ  ๋ถ€๋ฆ…๋‹ˆ๋‹ค.) ํ•ด์„œ ์›๋ณธ image patch์˜ code๋กœ classification์„ ์ง„ํ–‰ํ•˜๋ฉด ๋์ž…๋‹ˆ๋‹ค.

4. Experiment & Result

Experimental Setup

Pre-training Setup

  • Dataset ์ด ๋…ผ๋ฌธ์—์„œ pre-training์„ ์œ„ํ•ด ์‚ฌ์šฉํ•œ ๋ฐ์ดํ„ฐ์…‹์€ ImageNet-1K์˜ training set์ž…๋‹ˆ๋‹ค. ์ด 1.2๋ฐฑ๋งŒ๊ฐœ์˜ 224 x 224์˜ resolution ์ด๋ฏธ์ง€๋“ค์„ ์‚ฌ์šฉํ–ˆ๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ data augmentation์œผ๋กœ random resized cropping, horizontal flipping, color jittering์„ ์ ์šฉํ–ˆ์œผ๋ฉฐ mask ratio๋Š” ์œ„์—์„œ ์–ธ๊ธ‰ํ•œ๋Œ€๋กœ ์ด 40%, ๊ฐœ์ˆ˜๋กœ ์น˜๋ฉด 14 x 14 = 196๊ฐœ์˜ image patch ์ค‘ ์ตœ๋Œ€ 75๊ฐœ์˜ image patch๋ฅผ maskingํ–ˆ๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

  • Model Architecture The model architecture is identical to BERT-Base (or ViT-Base): 12 Transformer encoder layers, encoder hidden dimension 768, feed-forward dimension 3072, and 12 attention heads. Each image patch is 16 x 16.

  • Training Hyperparameters The model was trained for 500K steps (= 800 epochs) with a batch size of 2K, which took 5 days on 16 Nvidia Tesla V100 32GB GPUs.

  • Baseline ์„ฑ๋Šฅ ๋น„๊ต๋ฅผ ์œ„ํ•ด contrastive learning ๋ฐฉ์‹์˜ SSL ๋ชจ๋ธ์ธ MoCo v3์™€ self-distillation ๋ฐฉ์‹์˜ DINO ๋ชจ๋ธ์„ baseline์œผ๋กœ ์‚ผ์•˜์Šต๋‹ˆ๋‹ค.

Fine-tuning Setup

  • Fine-tuning Task Image classification and semantic segmentation are the downstream tasks used to verify the pre-training. The authors also perform what they call intermediate fine-tuning, which, as far as I can tell, means that after self-supervised pre-training the model is first fine-tuned on the pre-training dataset with the downstream objective, and only then fine-tuned on the target dataset.

  • Dataset For image classification, CIFAR-100 and ImageNet-1K (the dataset used for pre-training) were used. For semantic segmentation, ADE20K and ImageNet-1K were used.

  • Evaluation Metric Top-1 accuracy is used to evaluate image classification, and the mIoU metric is used to evaluate semantic segmentation.

Result

Image Classification

Image classification์˜ ๊ฒฐ๊ณผ๋Š” ์œ„์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค. Baseline์ด ์ข€ ๋งŽ์•„์„œ ๋ณด๊ธฐ์— ์ข€ ๋‚œ์žกํ•  ์ˆ˜ ์žˆ๋Š”๋ฐ์š”, ๋ณด์‹ค๋งŒํ•œ ๋ถ€๋ถ„๋งŒ ์ง‘์–ด์„œ ๋ง์”€๋“œ๋ฆฌ๊ฒ ์Šต๋‹ˆ๋‹ค. ์šฐ์„  MoCo v3์™€ ๋น„๊ตํ–ˆ์„ ๋•Œ BEIT์˜ ์„ฑ๋Šฅ์ด ๋” ๋›ฐ์–ด๋‚œ ๊ฒƒ์„ ํ™•์ธํ•˜์‹ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. Intermediate fine-tuning์„ ๊ฑฐ์ณค์„ ๋•Œ๋Š” DINO๋ณด๋‹ค BEIT์˜ ์„ฑ๋Šฅ์ด ์•„์ฃผ ์•ฝ๊ฐ„ ํ–ฅ์ƒ๋˜๋Š” ๊ฒƒ๋„ ํ™•์ธํ•˜์‹ค ์ˆ˜ ์žˆ์œผ์‹ค ๊ฒ๋‹ˆ๋‹ค. ํ•œํŽธ, ์„ฑ๋Šฅ๊ณผ๋Š” ๋ณ„๊ฐœ๋กœ BEIT์˜ ํ•™์Šต ์ˆ˜๋ ด ์†๋„๊ฐ€ random initialization๋œ DeiT์˜ ํ•™์Šต ์ˆ˜๋ ด ์†๋„๋ณด๋‹ค ๋น ๋ฅด๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค. ํ•ด๋‹น ์ž๋ฃŒ๋Š” ์•„๋ž˜์—์„œ ํ™•์ธํ•˜์‹ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Semantic Segmentation

Semantic Segmentation์˜ ๊ฒฐ๊ณผ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

Semantic segmentation task์—์„œ๋„ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ BEIT๊ฐ€ DINO๋ณด๋‹ค ์„ฑ๋Šฅ์ด ์ข‹์•˜์œผ๋ฉฐ, intermediate fine-tuning์„ ํ•  ๊ฒฝ์šฐ ์„ฑ๋Šฅ์ด ๋” ์ข‹์•„์ง€๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ, self-attention map์„ ๋ณด์‹œ๋ฉด pre-training ๋‹จ๊ณ„์—์„œ semantic ๊ฒฝ๊ณ„์™€ ๊ด€๋ จ๋œ ์•„๋ฌด๋Ÿฐ annotation ์—†์ด ํ•™์Šต๋์Œ์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  pre-training๋œ self-attention map์ด ์ด๋ฏธ์ง€ ๋‚ด์˜ object์˜ semanticํ•œ ๊ฒฝ๊ณ„๋ฅผ ์ž˜ ๊ฒ€์ถœํ•˜๋Š” ๊ฒƒ์„ ํ™•์ธํ•˜์‹ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

5. Conclusion

๊ฒฐ๋ก ์ ์œผ๋กœ ์ด ๋…ผ๋ฌธ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ •๋ฆฌํ•  ์ˆ˜ ์žˆ์„ ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค.

  • Vision Transformer๋ฅผ self-supervision์œผ๋กœ pre-trainingํ•จ์œผ๋กœ์จ image classification, semantic segmentation ๋“ฑ์˜ downstream task์— ๋Œ€ํ•ด ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ค๋Š” ๋ฐฉ๋ฒ•๋ก ์„ ์ œ์‹œ.

  • ๊ธฐ์กด์˜ BERT์ฒ˜๋Ÿผ MLM ๋ฐฉ์‹๋Œ€๋กœ pre-trainingํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ด๋ฏธ์ง€ ๋„๋ฉ”์ธ์— ๋งž๊ฒŒ ๋ณ€ํ˜•์‹œ์ผœ ์ ์šฉํ•จ.

Take home message

์‚ฌ์‹ค ์ด ๋…ผ๋ฌธ์€ ์™„์ „ํžˆ ์ƒˆ๋กœ์šด ๋ฐฉ๋ฒ•๋ก ์„ ์ œ์‹œํ–ˆ๋‹ค๊ธฐ ๋ณด๋‹ค๋Š” ๊ธฐ์กด์˜ ๋ฐฉ๋ฒ•๋ก ๋“ค์„ ViT์— ์ ์šฉํ•ด๋ณธ ๊ฒƒ์— ๋ถˆ๊ณผํ•ฉ๋‹ˆ๋‹ค.

BERT์˜ MLM ํ•™์Šต๋ฐฉ์‹์„ ๊ฑฐ์˜ ๊ทธ๋Œ€๋กœ ๋”ฐ๋ผ๊ฐ”๊ณ , ์ด๋ฅผ ์œ„ํ•œ image tokenizer๋„ DALL-E์—์„œ ๊ณต๊ฐœํ•œ tokenizer๋ฅผ ๊ทธ๋Œ€๋กœ ์‚ฌ์šฉํ–ˆ์„ ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ Backbone ์•„ํ‚คํ…์ณ๋„ ViT๋ฅผ ๊ทธ๋Œ€๋กœ ์‚ฌ์šฉํ–ˆ์ฃ .

๊ทธ๋Ÿผ์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  ICLR 2022, ๊ทธ๊ฒƒ๋„ Oral๋กœ ๋ถ™์€ ๊ฒƒ์„ ๋ณด๋ฉด, ๊ธฐ์กด์˜ ๋ฐฉ๋ฒ•๋ก ๋“ค์„ ์ž˜ ์ฃผ๋ฌผ๋Ÿฌ์„œ ์ƒˆ๋กœ์šด ๋„๋ฉ”์ธ์ด๋‚˜ ์ปจ์…‰์— ์ ์šฉํ•˜๋Š” ๊ฒƒ๋„ ๊ดœ์ฐฎ์€ ์—ฐ๊ตฌ์ฃผ์ œ๊ฐ€ ๋  ๊ฒƒ์ด๋ผ๋Š” ์ƒ๊ฐ์ด ๋“ญ๋‹ˆ๋‹ค.

Author / Reviewer information

Author

์˜ค์ •์šฐ (Jungwoo Oh)

Reviewer

  1. Korean name (English name): Affiliation / Contact information

  2. Korean name (English name): Affiliation / Contact information

  3. ...

Reference & Additional materials
