MAE [Kor]

Kaiming He, Xinlei Chen, et al. / Masked Autoencoders Are Scalable Vision Learners / Facebook AI Research (FAIR), 2021

1. Problem definition

In computer vision, obtaining even tens of thousands of labeled images is difficult. At the same time, advances in hardware have made it possible to train large models, which has raised interest in self-supervised learning, that is, learning from images without labels. Self-supervised learning has been actively studied in NLP; the most famous examples, GPT [1] and BERT [2], are pre-trained by removing a portion of the data and predicting the removed content. This paper applies this masked-modeling approach, so far used mainly in NLP, to computer vision.

2. Motivation

Masked language modeling

In NLP, BERT [2] and GPT [1] are representative models built on masked (BERT) or autoregressive (GPT) language modeling. They are pre-trained by removing a portion of the input sequence and predicting the missing content. Applying the pre-trained models to downstream tasks has been shown to produce strong results.

Autoencoding

Autoencoding [3] is a classical method for learning representations. It consists of an encoder that maps the input to a latent representation and a decoder that reconstructs the input. Denoising autoencoders (DAE) [4] corrupt the input signal and learn to recover the original signal. The MAE of this paper is a form of denoising autoencoding, but it differs from the classical DAE in several ways.

Masked image encoding

์ด ๋ฐฉ์‹์€ image๊ฐ€ masking์— ์˜ํ•ด ๋ถ•๊ดด๋˜์—ˆ์„ ๋•Œ ์ด์— ๋Œ€ํ•œ representation์„ ๋ฐฐ์šฐ๋Š” ๊ฒƒ์ด๋‹ค. DAE์—์„œ๋Š” masking์„ noise type์œผ๋กœ ๋ณด์—ฌ์กŒ๋‹ค. Context Encoder์—์„œ๋Š” CNN์„ ํ†ตํ•ด ์‚ฌ๋ผ์ง„ ๋ถ€๋ถ„์„ ์ฐพ๊ณ ์ž ํ•˜์˜€๋‹ค. ์ตœ๊ทผ์— NLP ๋ถ„์•ผ์—์„œ๋Š” Transformers[5]๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ encodingํ•œ ๊ฒƒ์—์„œ ์ฐฉ์•ˆํ•˜์—ฌ, iGPT[6]๋Š” unknown pixel์„ transformer๋กœ ์˜ˆ์ธกํ•˜๊ณ ์ž ํ•˜์˜€๋‹ค. ๋” ์ตœ๊ทผ์—๋Š” BEiT๋Š” discrete tokens์„ ์˜ˆ์ธกํ•˜๋Š” ๊ฒƒ์„ ์ œ์‹œํ•˜์˜€๋‹ค.

Self-supervised learning

์ด ๋ฐฉ์‹์€ ์ตœ๊ทผ์— ์ปดํ“จํ„ฐ๋น„์ „์—์„œ ๋งŽ์ด ์—ฐ๊ตฌ๋˜๊ณ  ์žˆ์œผ๋ฉฐ pre-training์— ๋Œ€ํ•ด ๋‹ค๋ฅธ pretext tasks๋“ค์„ ์—ฐ๊ตฌํ•˜๊ณ  ์žˆ๋‹ค. ๊ทธ ์ค‘์—๋Š” image์˜ ์œ ์‚ฌ์„ฑ๊ณผ ๋น„์œ ์‚ฌ์„ฑ์„ ํ•™์Šตํ•˜๋Š” contrastive learning[7],[8],[9]๊ฐ€ ์žˆ๋‹ค. ์ด๋Š” data augmentation์— ์˜์กดํ•˜๊ณ  ์žˆ๋‹ค.

Idea

์ด ๋…ผ๋ฌธ์˜ MAE๋Š” masked๋œ input image๋ฅผ encoder๋ฅผ ํ†ตํ•ด latent representation์— mappingํ•˜๊ณ  decoder๋ฅผ ํ†ตํ•ด ์›๋ž˜์˜ ์‹ ํ˜ธ๋กœ ๋ณต์›ํ•˜๊ณ ์ž ํ•œ๋‹ค. NLP์—๋งŒ ์‚ฌ์šฉ๋œ masked autoencoding์„ vision์—๋„ ์‚ฌ์šฉํ•˜๊ธฐ ์œ„ํ•ด language์™€ vision์˜ ์งˆ๋ฌธ๋“ค์„ ํ•ด๊ฒฐํ•˜์˜€๋‹ค.

  1. In vision, CNNs have been the dominant architecture, but they make it awkward to integrate mask tokens and positional embeddings. This limitation has been resolved by the Vision Transformer (ViT) [10].

  2. Information density differs between vision and language. Language is human-generated, so it is highly semantic and information-dense; predicting a few missing words requires an understanding of the whole sentence. Images, in contrast, are natural signals: even when some information is missing, a patch can often be predicted from its neighbors without any global understanding of the scene. To overcome this low-level image redundancy, the authors mask a very high proportion of random patches, which makes the self-supervised task harder.

  3. The decoder serves a different purpose for text and for images. For text, the decoder predicts missing words, which carry rich semantic information. An image decoder, however, reconstructs pixels, a lower semantic level than the recognition-like task of a text decoder. The decoder design therefore plays an important role in determining the semantic level of the learned representation.

To address these three questions, the authors study a simple and efficient masked autoencoder (MAE). The model masks random patches of the input image and reconstructs the missing parts with a decoder. The encoder and decoder of MAE have an asymmetric design (Figure 1).

Figure 1. Masked Autoencoder architecture

For an image whose patches are 75% masked, only the visible patches are fed to the encoder to produce a latent representation. The latent representation is then passed to a small decoder together with mask tokens to reconstruct the missing parts. Because the encoder processes only a small portion of the patches, pre-training time and memory consumption are reduced.

3. Method

MAE uses an encoder and a decoder with an asymmetric design (Figure 1). The encoder sees only the visible patches of the masked input, and a lightweight decoder predicts the missing parts.

Masking

To mask the input image, it is divided into non-overlapping patches, and random patches are sampled according to a uniform distribution. Masking a high proportion of patches largely prevents the missing parts from being predicted by simple extrapolation from neighboring patches, and uniform sampling prevents the masking from concentrating around the image center. Both choices also enable an efficient encoder.
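
Below is a minimal sketch of this step, assuming 224×224 RGB inputs and 16×16 patches; the function and variable names are illustrative, not taken from the paper's code.

```python
import torch

def patchify(imgs: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """Split (B, C, H, W) images into non-overlapping flattened patches."""
    B, C, H, W = imgs.shape
    h, w = H // patch_size, W // patch_size
    x = imgs.reshape(B, C, h, patch_size, w, patch_size)
    x = x.permute(0, 2, 4, 3, 5, 1)                   # (B, h, w, p, p, C)
    return x.reshape(B, h * w, patch_size ** 2 * C)   # (B, N, p*p*C)

imgs = torch.randn(2, 3, 224, 224)
patches = patchify(imgs)                              # (2, 196, 768)

# Uniform random sampling: keep 25% of patch indices per image (75% masked).
mask_ratio = 0.75
num_keep = int(patches.shape[1] * (1 - mask_ratio))
noise = torch.rand(patches.shape[0], patches.shape[1])
ids_keep = torch.argsort(noise, dim=1)[:, :num_keep]  # visible patch indices
```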

MAE encoder

Whereas a standard ViT linearly projects and processes all patches, the encoder here operates only on the visible patches; the masked patches are removed entirely. Applying the encoder to only a fraction of the patches reduces both time and memory usage.
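
A minimal sketch of this idea follows; the two-layer `nn.TransformerEncoder` is only a stand-in for the 24-block ViT-L encoder, and positional embeddings and the class token are omitted for brevity.

```python
import torch
import torch.nn as nn

D = 1024                          # ViT-L embedding width
encoder = nn.TransformerEncoder(  # stand-in for the ViT encoder blocks
    nn.TransformerEncoderLayer(d_model=D, nhead=16, batch_first=True),
    num_layers=2)

# With 75% masking, only 49 of the 196 patches of a 224x224 image survive.
visible = torch.randn(2, 49, D)   # already-projected visible tokens
latent = encoder(visible)         # self-attention over 49 tokens, not 196
print(latent.shape)               # torch.Size([2, 49, 1024])
```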

MAE decoder

Figure 1์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋“ฏ์ด decoder์— ๋“ค์–ด๊ฐ€๋Š” input์€ encoded visible patches์™€ mask tokens๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ๋‹ค. ๊ทธ ํ›„ ๊ฐ token์— positional embedding์„ ๋”ํ•˜์—ฌ image์˜ ์œ„์น˜ ์ •๋ณด๋ฅผ ๋”ํ•˜๋„๋ก ํ•˜์˜€๋‹ค. MAE decoder์˜ ๊ฒฝ์šฐ์—๋Š” recognition task๋ฅผ ํ•˜๋Š” text decoder์™€ ๋‹ฌ๋ฆฌ reconstruction task์—๋งŒ ์‚ฌ์šฉ๋˜๊ธฐ ๋•Œ๋ฌธ์— encoder design๊ณผ ๋…๋ฆฝ์ ์œผ๋กœ design ๋  ์ˆ˜ ์žˆ๋‹ค. ์ด๋Š” encoder์— ๋น„ํ•ด ๋” ์ž‘๊ณ  narrowerํ•œ decoder์˜ ์‚ฌ์šฉ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ–ˆ๋‹ค.

Reconstruction target

The decoder predicts the pixel values of each masked patch, so the number of output channels of the decoder equals the number of pixel values in a patch. The output is then reshaped into the reconstructed image. The reconstructed image is compared with the original image using the mean squared error (MSE). The paper goes one step further and trains with each target patch normalized by its own statistics; experiments show that this normalization improves the results.
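
A sketch of this loss under those assumptions (per-patch MSE, averaged over masked patches only, with optional per-patch target normalization):

```python
import torch

def mae_loss(pred, target, mask, norm_pix: bool = True):
    """pred/target: (B, N, p*p*C); mask: (B, N) with 1 for masked patches."""
    if norm_pix:                                    # normalize each target patch
        mean = target.mean(dim=-1, keepdim=True)
        var = target.var(dim=-1, keepdim=True)
        target = (target - mean) / (var + 1e-6).sqrt()
    loss = ((pred - target) ** 2).mean(dim=-1)      # MSE per patch
    return (loss * mask).sum() / mask.sum()         # masked patches only

B, N, P = 2, 196, 768
pred, target = torch.randn(B, N, P), torch.randn(B, N, P)
mask = (torch.rand(B, N) < 0.75).float()
print(mae_loss(pred, target, mask))
```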

Simple implementation

MAE pre-training proceeds as follows. First, a token is generated for every input patch. The tokens are then shuffled randomly, and a portion of them is removed according to the masking ratio; feeding only the remaining tokens to the encoder is equivalent to masking the input. After encoding, mask tokens are appended to the list of encoded patches, the full list is unshuffled to restore the original order, and positional embeddings are added before it is fed to the decoder. This scheme runs fast because it needs no sparse operations.
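
A minimal sketch of this shuffle-and-remove step, written in the spirit of the description above (names are illustrative):

```python
import torch

def random_masking(x: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random subset of tokens; a single argsort replaces any
    sparse operation. x: (B, N, D) token sequence."""
    B, N, D = x.shape
    len_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=x.device)
    ids_shuffle = torch.argsort(noise, dim=1)        # random permutation
    ids_restore = torch.argsort(ids_shuffle, dim=1)  # inverse, for unshuffling
    ids_keep = ids_shuffle[:, :len_keep]             # "remove" the rest
    x_visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=x.device)         # 1 = masked patch
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)        # back to original order
    return x_visible, mask, ids_restore

tokens = torch.randn(2, 196, 1024)
visible, mask, ids_restore = random_masking(tokens)
print(visible.shape, mask.sum(dim=1))                # (2, 49, 1024), 147 masked
```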

4. Experiment & Result

Experimental setup

Dataset

์ด ๋…ผ๋ฌธ์—์„œ๋Š” self-supervised pre-training์„ ์œ„ํ•ด ImageNet-1K(IN1K) training set์„ ์‚ฌ์šฉํ•˜์˜€๋‹ค.

Evaluation

Pre-trained๋œ model์— ๋Œ€ํ•ด supervised training์„ ํ•˜์—ฌ (1)end-to-end fine tuning (2)linear probing์— ๋Œ€ํ•ด evaluation์„ ํ•˜์˜€๋‹ค. ์ด ๋•Œ, 224*224 crop๋œ image์— ๋Œ€ํ•ด top-1 validation accuracy๋ฅผ ๋„์ถœํ•˜์˜€๋‹ค.

Baseline

์ด ๋…ผ๋ฌธ์—์„œ๋Š” ViT-Large (ViT-L/16)์„ backbone์œผ๋กœ ์‚ฌ์šฉํ•˜์˜€๋‹ค. ์ด ๋…ผ๋ฌธ์—์„œ๋Š” scratch๋ถ€ํ„ฐ ViT-Large๋ฅผ ์‚ฌ์šฉํ–ˆ์„ ๋•Œ์— ๋น„ํ•ด baseline MAE๋กœ๋ถ€ํ„ฐ fine-tuned ํ•˜์˜€์„ ๋•Œ ๋” ๋†’์€ accuracy๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ์Œ์„ ๋ฐํ˜€๋ƒˆ๋‹ค.

Result

Main Properties

Table 1. Experiment results

Masking ratio

Figure 2. Accuracy as a function of masking ratio

Figure 2 shows the influence of the masking ratio. Unlike BERT, which sets the masking ratio to 15%, MAE obtains its best results at a 75% masking ratio. Figure 2 also shows different trends for fine-tuning and linear probing: fine-tuning produces similar results across masking ratios of 40-80%, whereas with linear probing a 75% ratio yields roughly 20% higher accuracy than a 10% ratio. Subsequent experiments therefore pre-train with 75% masking.

Decoder design

์ด์ „์—๋„ ๋งํ–ˆ๋“ฏ์ด decoder๋Š” reconstuction task์—๋งŒ ์‚ฌ์šฉ๋˜๊ธฐ ๋•Œ๋ฌธ์— ์ž์œ ๋กญ๊ฒŒ design ๋  ์ˆ˜ ์žˆ๋‹ค. Table1-(a)์—์„œ๋Š” decoder depth์— ๋”ฐ๋ฅธ ์ •ํ™•๋„์˜ ๋ณ€ํ™”๋ฅผ ๋ณด์—ฌ์ฃผ๊ณ  ์žˆ๋‹ค. ์ด๋•Œ deep decoder๋Š” linear probing์— ๋” ๋งŽ์€ ์˜ํ–ฅ์„ ๋ผ์น˜๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ๊ทธ ์ด์œ ๋Š” autoencoder์—์„œ์˜ ๋งˆ์ง€๋ง‰ ๋ถ€๋ถ„์— ์žˆ๋Š” layer๋“ค์€ recognition๋ณด๋‹ค๋Š” reconstuction์— ๋” ํŠนํ™”๋˜์–ด ์žˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ์ด๋Š” deep decoder๋ฅผ ์‚ฌ์šฉํ• ์ˆ˜๋ก reconstuction์— ๋” ํŠนํ™”๋œ ๋‹ค๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•œ๋‹ค. ๊ทธ๋ž˜์„œ deep decoder๋ฅผ ์‚ฌ์šฉํ• ์ˆ˜๋ก ์ค‘๊ฐ„์˜ layer๋“ค์ด recognition์— ๋” ํŠนํ™”๋˜๊ธฐ ๋•Œ๋ฌธ์— linear probing์˜ ๊ฒฝ์šฐ ๋” ์ข‹์€ accuracy๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ๋‹ค. ์ด๋Š” Table1์˜ (a)์—์„œ 8%์˜ ์ •ํ™•๋„ ํ–ฅ์ƒ์ด ๋„์ถœ๋˜๋Š” ๊ฒƒ์„ ํ†ตํ•ด ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ fine-tuning์˜ ๊ฒฝ์šฐ์—๋Š” ๋งˆ์ง€๋ง‰ layer๊นŒ์ง€ ๋ชจ๋‘ ์‚ฌ์šฉํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋งˆ์ง€๋ง‰ layer๋ฅผ recognition์— ๋งž๊ฒŒ fine-tuning ํ•  ์ˆ˜ ์žˆ๋‹ค. ๊ทธ๋ž˜์„œ fine-tuning์˜ ๊ฒฝ์šฐ decoder depth์— ๊ด€๊ณ„์—†์ด accuracy๊ฐ€ ๊ฐ™๊ฒŒ ๋„์ถœ๋˜๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ์ด๋•Œ, fine-tuning์„ ์‚ฌ์šฉํ•œ๋‹ค๋ฉด 1๊ฐœ์˜ decoder block์œผ๋กœ๋„ 84.8%์˜ ์ •ํ™•๋„๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— 1๊ฐœ์˜ decoder block์„ speed-up pre-training์— ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค.

Table 1(b) shows accuracy as a function of decoder width. A width of 512 produces good accuracy for both fine-tuning and linear probing, so subsequent experiments use a decoder with 8 blocks of width 512.

Mask token

Method์—์„œ๋Š” mask token์„ encoder์— ๋„ฃ์ง€ ์•Š๊ณ  decoder์—๋งŒ ๋„ฃ๊ธฐ๋กœ ํ•˜์˜€๋‹ค. Table1-(c)์—์„œ๋Š” ๊ทธ๊ฒƒ์— ๊ด€ํ•œ ์‹คํ—˜์„ ํ•ด๋ณด๊ธฐ๋กœ ํ•˜์˜€๋‹ค. mask token์„ encoder์— ๋„ฃ์–ด์„œ ์‹คํ—˜ํ•ด๋ณด๋ฉด linear probing์˜ ๊ฒฝ์šฐ 14%์˜ accuracy๊ฐ€ ๋–จ์–ด์ง€๋Š” ๊ฒƒ์œผ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ์ด๋Ÿฐ accuracy์˜ ๊ฐ์†Œ๋Š” pre-training๊ณผ deployment์‚ฌ์ด์˜ ์ฐจ์ด๋กœ ์ธํ•ด ๋ฐœ์ƒํ•œ๋‹ค. pre-training ์‹œ์—๋Š” mask token์ด ๋”ํ•ด์ง€์ง€๋งŒ deployment๋Š” input image์—์„œ ๋ถ•๊ดด๋œ ๋ถ€๋ถ„์ด ์—†๊ธฐ ๋•Œ๋ฌธ์— ๊ทธ ์ฐจ์ด๋กœ ์ธํ•ด accuracy๊ฐ€ ๊ฐ์†Œํ•œ๋‹ค. ๊ทธ๋ž˜์„œ ๋ถ•๊ดด๋œ ๋ถ€๋ถ„์— ๋Œ€ํ•œ mask token์€ decoder์—์„œ๋งŒ ์‚ฌ์šฉํ•˜๊ธฐ๋กœ ํ•œ๋‹ค.

Table 2. MAE training time by baseline model

Table 2 also shows that using only the visible input patches in the encoder reduces training time. The speed-up grows as the encoder gets larger (ViT-H) and the decoder gets shallower. Because self-attention complexity grows quadratically with the number of tokens, encoding only 25% of the patches cuts the attention computation to roughly (0.25)^2 ≈ 1/16, so the time saving is more than linear.

Reconstruction target

Table 1(d) shows how accuracy differs with the reconstruction target. Applying per-patch normalization to the target yields higher accuracy than not applying it, while predicting PCA coefficients decreases accuracy. This indicates that better results are obtained when the high-frequency components within each patch are preserved.

The paper also measures the accuracy difference when tokenization is used instead of normalization. The DALLE pre-trained dVAE [11] serves as a tokenizer, and the decoder predicts tokens. Compared with unnormalized pixels, however, this increases accuracy only slightly or even decreases it, and tokenization additionally requires pre-training the dVAE, which takes more time. Patch-wise normalization is therefore more efficient.

Data augmentation

Table 1(e) shows how accuracy changes with data augmentation. Cropping improves accuracy, while color jittering actually decreases it. The notable point is that good accuracy can be obtained even without any data augmentation. This contrasts sharply with contrastive-learning methods such as BYOL [12] and SimCLR [13], which depend on a rich set of augmentations. Instead of data augmentation, MAE injects randomness through random masking.

Mask sampling strategy

Figure 3. Mask sampling strategies

Figure3์—์„œ๋Š” ๋‹ค๋ฅธ mask sampling ์ „๋žต์„ ๋ณด์—ฌ์ฃผ๊ณ , ์ด์— ๋Œ€ํ•œ accuracy ์ฐจ์ด๋ฅผ Table1-(f)์—์„œ ๋ณด์—ฌ์ฃผ๊ณ  ์žˆ๋‹ค. Figure5์˜ ์ค‘๊ฐ„ ๊ทธ๋ฆผ์ฒ˜๋Ÿผ block-wise๋กœ masking์„ ํ–ˆ์„ ๋•Œ 50%๋งŒ degrading ํ–ˆ์Œ์—๋„ random sampling์— ๋น„ํ•ด ๋” ๋†’์€ training loss์™€ blurringํ•œ ๊ฒฐ๊ณผ๋ฅผ ์–ป์—ˆ๋‹ค. ๋˜ํ•œ, Figure5์˜ ์˜ค๋ฅธ์ชฝ ๊ทธ๋ฆผ์ฒ˜๋Ÿผ grid masking์„ ํ–ˆ์„ ๋•Œ์—๋Š” ๋” ๋‚ฎ์€ training loss์™€ sharperํ•œ reconstuction ๊ทธ๋ฆผ์„ ์–ป์—ˆ์ง€๋งŒ, ์ค‘๊ฐ„์ค‘๊ฐ„ grid ํ˜•ํƒœ๊ฐ€ ๋ณด์ด๋Š” ์ข‹์ง€ ๋ชปํ•œ ๊ทธ๋ฆผ์„ ๋„์ถœํ•จ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด higher masking ratio๋ฅผ ๊ฐ€์ง„ random sampling์ด ๊ฐ€์žฅ ์ข‹์€ reconstruction๊ณผ accuracy๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ์Œ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

Training schedule

Figure 4. Accuracy as a function of training epochs

Figure4์—์„œ๋Š” Epcoh์— ๋”ฐ๋ฅธ accuracy์˜ ๋ณ€ํ™”๋ฅผ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ๋‘ ๊ฒฝ์šฐ์˜ ๋ชจ๋‘ epoch์— ๋”ฐ๋ผ accuracy๊ฐ€ steadilyํ•˜๊ฒŒ ์ฆ๊ฐ€ํ•˜๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ์ด๋Š” 300epoch ์ดํ›„์—๋Š” ๋”์ด์ƒ accuracy๊ฐ€ ์ฆ๊ฐ€ํ•˜์ง€ ์•Š๋Š” contrastive learning๊ณผ๋Š” ๋‹ค๋ฅด๋‹ค. ์ด๋Š” ํ•œ epoch๋‹น ๋ณด๋Š” patch์˜ ์ˆ˜๊ฐ€ MAE์— ๋น„ํ•ด contrasitve learning์˜ ๊ฒฝ์šฐ ํ›จ์”ฌ ๋งŽ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ๋˜ํ•œ, MAE์˜ ๊ฒฝ์šฐ ์ ์€ ์ˆ˜์˜ patch๊ฐ€ randomํ•˜๊ฒŒ ๋“ค์–ด์˜ค๊ธฐ ๋•Œ๋ฌธ์— accuracy๊ฐ€ ๊ณ„์† ์ฆ๊ฐ€ํ•  ์ˆ˜ ์žˆ๋‹ค.

Comparisons with Previous Results

Table 3. Results by method on ImageNet-1K

Table3์—์„œ๋Š” ImageNet-1K์— ๋Œ€ํ•œ self-supervised method์™€ MAE๋ฅผ ๋น„๊ตํ•œ ๊ฒฐ๊ณผ์— ๋Œ€ํ•ด ์ œ์‹œํ•˜๊ณ  ์žˆ๋‹ค. Figure6์—์„œ ์•Œ ์ˆ˜ ์žˆ๋“ฏ์ด ๋‹ค๋ฅธ self-supervised learning์— ๋น„ํ•ด MSE๊ฐ€ ๋” ๋†’์€ accuracy๋ฅผ ๋„์ถœํ•จ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๋” ํฐ ๋ชจ๋ธ์ธ ViT-H๋ฅผ ์‚ฌ์šฉํ• ์ˆ˜๋ก ๋” ๋†’์€ accuracy๋ฅผ ๋„์ถœํ•œ๋‹ค. ๋˜ํ•œ, BEiT[2]์™€ ๋น„๊ตํ•ด๋ดค์„ ๋•Œ์—๋„ MAE๊ฐ€ ๋” ๋†’์€ accuracy๋ฅผ ๋„์ถœํ•œ๋‹ค. ์—ฌ๊ธฐ์—์„œ ์ค‘์š”ํ•œ ์ ์€ MAE๊ฐ€ ๋” ๋น ๋ฅด๊ณ  ๊ฐ„๋‹จํ•˜๊ฒŒ pre-training ๋œ๋‹ค๋Š” ์ ์ด๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ MAE๋Š” ๋น ๋ฅด๊ฒŒ pre-trained๋˜๊ธฐ ๋•Œ๋ฌธ์— 1600 epoch์œผ๋กœ ํ•™์Šตํ•  ๋•Œ์˜ ์‹œ๊ฐ„์ด MoCo v3๋ฅผ 300 epcoh์œผ๋กœ ํ•™์Šตํ–ˆ์„ ๋•Œ ์‹œ๊ฐ„๋ณด๋‹ค ๋” ์ ๋‹ค.

Partial Fine-tuning

Figure 5. Partial fine-tuning

Figure5์—์„œ๋Š” Fine-tuningํ•˜๋Š” block์˜ ๊ฐฏ์ˆ˜์— ๋”ฐ๋ฅธ accuracy์˜ ๋ณ€ํ™”๋ฅผ ๋ณด์—ฌ์ฃผ๊ณ  ์žˆ๋‹ค. ์ด ๋•Œ, 0 block fine-tuning์€ linear probing, 24 block fine-tuning์€ full fine-tuning์„ ์˜๋ฏธํ•œ๋‹ค. linear probing์˜ ๊ฒฝ์šฐ feature layer๋ฅผ ์‚ฌ์šฉํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋‹ค๋ฅธ feature๋“ค์„ ์‚ฌ์šฉํ•  ๊ธฐํšŒ๋ฅผ ์žƒ๊ฒŒ ๋œ๋‹ค. ๊ทธ๋ž˜์„œ partial fine-tuning์„ ์ ์šฉํ•˜๊ณ ์ž ํ•˜์˜€๊ณ , 1๊ฐœ์˜ partial fine-tuning์„ ์ ์šฉํ•˜์˜€์„ ๋•Œ 73.5%์—์„œ 81%๋กœ accuracy๊ฐ€ ํฌ๊ฒŒ ์ฆ๊ฐ€ํ•˜๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ๋˜ํ•œ, ์•ฝ๊ฐ„์˜ fine-tuning๋งŒ ์ ์šฉํ•ด๋„ full fine-tuning๋งŒํผ ์ข‹์€ accuracy๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ๋Š” ๊ฒƒ์œผ๋กœ ๋ณด์•„ partial fine-tuning์ด MAE์— ํšจ์œจ์ ์ž„์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

Figure 5 also compares against MoCo v3 [14], which uses contrastive learning: with partial fine-tuning, MAE achieves much higher accuracy.
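
A minimal sketch of partial fine-tuning, assuming a 24-block ViT-L (here a torchvision model as a stand-in for the MAE encoder): freeze everything, then unfreeze only the last k Transformer blocks, so k = 0 reduces to linear probing and k = 24 to full fine-tuning.

```python
import torch.nn as nn
import torchvision.models as models

def partial_finetune(backbone: nn.Module, k: int) -> None:
    """Unfreeze only the last k Transformer blocks of a torchvision ViT."""
    for p in backbone.parameters():
        p.requires_grad = False
    blocks = list(backbone.encoder.layers)    # 24 blocks in ViT-L/16
    for block in blocks[len(blocks) - k:]:
        for p in block.parameters():
            p.requires_grad = True

vit = models.vit_l_16(weights=None)           # stand-in for the MAE encoder
partial_finetune(vit, k=1)                    # tune only the last block
```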

Transfer Learning Experiments

Table 4. COCO object detection and segmentation

Table 4 evaluates downstream tasks using the pre-trained model. On COCO object detection and segmentation, MAE pre-training yields higher scores than supervised pre-training with labels (50.3 vs. 47.9 / 53.3 vs. 49.3). Similarly, on semantic segmentation and classification tasks, models pre-trained with MAE achieve higher accuracy than supervised learning.

5. Conclusion

์ด ๋…ผ๋ฌธ์—์„œ๋Š” self-supervised learning์„ computer vision์— ์ ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•์— ๋Œ€ํ•ด์„œ ์„ค๋ช…ํ•˜๊ณ  ์žˆ๋‹ค. Masked Autoencoder ๋ฐฉ์‹์„ ํ™œ์šฉํ•˜์—ฌ label์„ ์ด์šฉํ•œ supervised learning์ด ์•„๋‹Œ, input์˜ ์‚ฌ๋ผ์ง„ ๋ถ€๋ถ„์„ ๋ณต์›ํ•˜๋ฉด์„œ self-supervised learning์„ ํ•˜๊ณ  ์žˆ๋‹ค. ์ด ๋•Œ, object๋ฅผ ์ œ๊ฑฐํ•˜๋Š” ๋“ฑ์˜ semantic ํ•˜๊ฒŒ ์ง€์šฐ๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ pixel์„ randomํ•˜๊ฒŒ ์ œ๊ฑฐํ•˜์—ฌ ์ด๋ฅผ ๋ณต์›ํ•˜๋„๋ก ํ•˜๊ณ  ์žˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด supervised learning๋ณด๋‹ค ๋” ๋†’์€ accuracy๋ฅผ ๋„์ถœํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ฃผ๊ณ  ์žˆ๋‹ค.

It was interesting to see self-supervised learning applied to computer vision in this way, and reducing pre-training time and memory usage with the masked autoencoder also seems to be a major contribution. The approach could be applied to tasks with little data, or to images whose content is partially lost.

์•„์‰ฌ์šด ์ ์€ ์ด ๋…ผ๋ฌธ์—์„œ ์ฃผ๋กœ fine-tuning, linear probing์„ ์ด์šฉํ•œ accuracy์— ๋Œ€ํ•œ ๊ฒฐ๊ณผ๋งŒ ์ œ์‹œํ•  ๋ฟ ๋ณต์›๋œ ์ด๋ฏธ์ง€ ์ž์ฒด์— ๋Œ€ํ•œ ์ด์•ผ๊ธฐ๋Š” ์ ์€ ๊ฒƒ ๊ฐ™๋‹ค. ๋…ผ๋ฌธ์— ๋”ฐ๋ฅด๋ฉด ๋ณต์›๋œ ์ด๋ฏธ์ง€๊ฐ€ original ์ด๋ฏธ์ง€์™€ ๋น„๊ตํ•ด ๋ณด์•˜์„ ๋•Œ blurring ํ•œ ๊ฒƒ์„ ์ œ์™ธํ•˜๋ฉด ๋Œ€๋ถ€๋ถ„ ์ž˜ ๋ณต์› ํ•˜๋Š” ๊ฒƒ์œผ๋กœ ๋ณด์—ฌ์ง„๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์ œ๋Œ€๋กœ ๋ณต์›๋˜์ง€ ๋ชปํ•œ ์ด๋ฏธ์ง€์— ๋Œ€ํ•ด์„œ๋Š” ์ œ์‹œํ•˜๊ณ  ์žˆ์ง€ ์•Š๋‹ค. ๊ทธ๋Ÿฐ ๊ฒฝ์šฐ๋ฅผ ์ œ์‹œํ•˜์—ฌ ๋ชจ๋ธ์ด ๋ฌด์—‡๊ณผ ํ˜ผ๋™์„ ํ–ˆ๊ณ  ์™œ ๊ทธ๋Ÿฐ ๊ฒฐ๊ณผ๊ฐ€ ๋‚˜์™”๋Š”์ง€์— ๋Œ€ํ•œ ๋ถ„์„์ด ์กฐ๊ธˆ ๋” ์žˆ์—ˆ์œผ๋ฉด ์ข‹์•˜์„ ๊ฒƒ ๊ฐ™๋‹ค. ๋˜ํ•œ, original image์™€ reconstuction image๋ฅผ ๋น„๊ตํ•  ๋•Œ MSE Loss๋ฅผ ์‚ฌ์šฉํ–ˆ๋‹ค๊ณ  ๋งํ•˜๊ณ  ์žˆ๋‹ค. MSE Loss ์ด์™ธ์—๋„ ๋‹ค๋ฅธ Loss๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ reconstuction image์˜ resolution์„ ๋†’ํžˆ๋Š” ๊ฒƒ์— ๋Œ€ํ•œ ์—ฐ๊ตฌ๊ฐ€ ๋” ์žˆ์—ˆ๋‹ค๋ฉด ์ข‹์•˜์„ ๊ฒƒ ๊ฐ™๋‹ค.

Take home message

Self-supervised learning, applied to images through MAE, achieves high accuracy.

This suggests that there is no hard boundary between language and vision.

More research on such original approaches will be needed.

Author / Reviewer information

Author

๊น€์„ธํฌ (Sehui Kim)

  • Affiliation (KAIST AI)

  • Contact information (sae0919@kaist.ac.kr)

Reviewer

  1. Korean name (English name): Affiliation / Contact information

  2. Korean name (English name): Affiliation / Contact information

  3. ...

Reference & Additional materials

[1] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In NeurIPS, 2020.

[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.

[3] Geoffrey E Hinton and Richard S Zemel. Autoencoders, minimum description length, and Helmholtz free energy. In NeurIPS, 1994.

[4] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.

[5] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.

[6] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In ICML, 2020.

[7] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020.

[8] Xinlei Chen and Kaiming He. Exploring simple Siamese representation learning. In CVPR, 2021.

[9] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent - a new approach to self-supervised learning. In NeurIPS, 2020.

[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.

[11] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, 2021.

[12] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent - a new approach to self-supervised learning. In NeurIPS, 2020.

[13] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020.

[14] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised Vision Transformers. In ICCV, 2021.
