IPT [Kor]

Chen et al. / Pre-Trained Image Processing Transformer / CVPR 2021

1. Problem definition

Image processing is one of the low-level components of a more global image analysis or computer vision system. Its results can strongly influence the subsequent high-level stages that recognize and understand image data. Recently, the hardware computing power available to deep learning has grown rapidly thanks to GPUs, and deep learning with pre-trained models and large-scale datasets has outperformed conventional methods. These deep learning techniques are now widely applied to low-level vision tasks such as image super-resolution, inpainting, deraining, and colorization. However, there is little work that uses pre-training to generalize across multiple image processing tasks.

๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” Pre-Trained Deep Learning Model์ธ IPT(image processing transformer)๋ฅผ ํ†ตํ•ด ๋…ธ์ด์ฆˆ ์ œ๊ฑฐ, ์ดˆ๊ณ ํ•ด์ƒ๋„ ๋ฐ ๋””๋ ˆ์ด๋‹์™€ ๊ฐ™์€ low-level ์ปดํ“จํ„ฐ ๋น„์ „ Task์— ๋Œ€ํ•ด ์ผ๋ฐ˜ํ™”ํ•˜๊ณ  ํ˜„ state-of-the-art ์ด์ƒ์˜ ๊ฒฐ๊ณผ(์„ฑ๋Šฅ)๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ๋˜ํ•œ ๋งŽ์€ ์–‘์˜ ์†์ƒ๋œ ์ด๋ฏธ์ง€ ์Œ์„ ์‹คํ—˜์— ์‚ฌ์šฉํ•˜๊ธฐ ์œ„ํ•ด ์ž˜ ์•Œ๋ ค์ง„ ImageNet ๋ฒค์น˜๋งˆํฌ๋ฅผ ํ™œ์šฉํ•ฉ๋‹ˆ๋‹ค.

2. Motivation

1. Image processing

Image processing covers image manipulation tasks including super-resolution (increasing the resolution), denoising (removing noise), dehazing (removing atmospheric particles such as haze or fog), deraining (removing rain-streak artifacts), and deblurring (removing blur).

  • (Dong et al.) proposed SRCNN for super-resolution, a pioneering work that introduced an end-to-end model reconstructing high-resolution images from low-resolution inputs.

  • (Kim et al.) built on this work with a deeper convolutional network, increasing the capacity of the model.

  • (Ahn et al. & Lim et al.) introduced residual blocks into the super-resolution (SR) task.

  • (Zhang et al. & Anwar & Barnes) leveraged the power of attention for the SR task.

์ด์™ธ์—๋„ ๋‹ค๋ฅธ Task๋“ค์— ๋Œ€ํ•œ ์—ฐ๊ตฌ๋„ ๋งŽ์ด ์žˆ์Šต๋‹ˆ๋‹ค.

  • (Tian et al., and 5 more papers) studied denoising.

  • (Cai et al., and 4 more papers) studied dehazing.

  • (Hu et al., and 6 more papers) studied deraining.

  • (Tao et al., and 4 more papers) studied deblurring.

Idea 1. The studies above each use a separate, task-specific method; this paper instead uses one large pre-trained model and a large amount of data, experiments on multiple image processing tasks, and shows strong results.

2. Transformer

  • (Vaswani et al.) The Transformer has proven successful as a powerful unsupervised or self-supervised pre-training framework in a variety of natural language processing tasks.

  • (Radford et al.) GPT models are pre-trained autoregressively, predicting the next word on huge text datasets.

  • (Devlin et al.) BERT learns from data without explicit supervision, predicting masked words from their context.

  • (Colin et al.) proposed a universal pre-training framework for several downstream tasks.

NLP ๋ถ„์•ผ์—์„œ Transformer ๊ธฐ๋ฐ˜ ๋ชจ๋ธ์˜ ์„ฑ๊ณต์œผ๋กœ ์ธํ•ด ์ปดํ“จํ„ฐ ๋น„์ „ ๋ถ„์•ผ์—์„œ๋„ Transformer ๊ธฐ๋ฐ˜ ๋ชจ๋ธ์„ ํ™œ์šฉํ•˜๋ ค๋Š” ์—ฐ๊ตฌ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.

  • (Yuan et al.) introduced spatial attention for image segmentation.

  • (Fu et al.) proposed DANET, which exploits context information by combining spatial attention and channel attention.

  • (Kolesnikov et al.) performed image classification with Transformer blocks (replacing the convolutional neural network with self-attention blocks).

  • (Wu et al. & Zhao et al.) proposed pre-training methods for Transformer-based models for image recognition tasks.

  • (Jiang et al.) proposed TransGAN to generate images using Transformers.

Idea 2. Although there is plenty of work on image processing and on applying Transformers to computer vision, there is almost no work that focuses on low-level vision tasks such as image processing using a pre-trained Transformer-style model. This paper therefore explores a universal pre-training approach for image processing tasks.

3. Method

A. Image Processing Transformer (IPT)

IPT์˜ ์ „์ฒด ์•„ํ‚คํ…์ฒ˜๋Š” 4๊ฐ€์ง€ ๊ตฌ์„ฑ ์š”์†Œ๋กœ ๊ตฌ์„ฑ๋ฉ๋‹ˆ๋‹ค. (Heads - Incoder - Decoder - Tails) ์†์ƒ๋œ Input Image(๋…ธ์ด์ฆˆ๊ฐ€ ์žˆ๋Š” ์ด๋ฏธ์ง€ ๋ฐ ์ €ํ•ด์ƒ๋„ ์ด๋ฏธ์ง€)์—์„œ Feature์„ ์ถ”์ถœํ•˜๊ธฐ ์œ„ํ•œ Head Input Data์—์„œ ์†Œ์‹ค๋œ ์ •๋ณด๋ฅผ ๋ณต๊ตฌํ•˜๊ธฐ ์œ„ํ•œ ์ธ์ฝ”๋” - ๋””์ฝ”๋” Transformer ๋””์ฝ”๋”์—์„œ ๋‚˜์˜จ representation๋“ค์„ ์ ์ ˆํ•˜๊ฒŒ ์ด๋ฏธ์ง€๋กœ ๋ณต์›ํ•˜๋Š” Tails image

1. Heads

๋‹ค๋ฅธ ์ด๋ฏธ์ง€ ์ฒ˜๋ฆฌ Task์„ ์กฐ์ •ํ•˜๊ธฐ ์œ„ํ•ด ๋‹ค์ค‘ ํ—ค๋“œ ์•„ํ‚คํ…์ฒ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ Task๋ฅผ ๊ฐœ๋ณ„์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค. ๊ฐ Head๋Š” 3๊ฐœ์˜ ์ปจ๋ณผ๋ฃจ์…˜ ๋ ˆ์ด์–ด๋กœ ๊ตฌ์„ฑ๋ฉ๋‹ˆ๋‹ค. ์ž…๋ ฅ ์ด๋ฏธ์ง€๋ฅผ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ํ‘œ์‹œํ•ฉ๋‹ˆ๋‹ค. xโˆˆR3ร—Hร—Wx โˆˆ R^{3ร—Hร—W} (3 means R, G, and B) , ํ—ค๋“œ๋Š” C(๋ณดํ†ต 64)๊ฐœ์˜ ์ฑ„๋„์„ ๊ฐ€์ง„ feature map fHโˆˆRCร—Hร—Wf_{H} โˆˆ R^{Cร—Hร—W} ์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. ๊ณต์‹ํ™”ํ•˜์ž๋ฉด fH=Hi(x)f_{H} = H^{i}(x) ์ด๋ฉฐ, ์—ฌ๊ธฐ์„œ Hi(i=1,...,Nt)H^{i} (i = {1, ... , N_{t}}) ๋Š” i๋ฒˆ์งธ Task์˜ ํ—ค๋“œ, NiN_{i} ๋Š” task์˜ ์ˆ˜๋กœ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.

2. Transformer encoder

Before the input features are fed into the Transformer body, they are split into patches so that each patch can be treated like a "word". Specifically, the feature map $f_H \in \mathbb{R}^{C \times H \times W}$ is reshaped into a sequence of patches $f_{p^i} \in \mathbb{R}^{P^2 \times C}$, $i = 1, \dots, N$, where $N = HW/P^2$ is the number of patches (the sequence length) and $P$ is the patch size. To preserve the position information of each patch, a learnable position encoding $E_{p^i} \in \mathbb{R}^{P^2 \times C}$ is added to each patch feature $f_{p^i}$, and $E_{p^i} + f_{p^i}$ becomes the input to the Transformer encoder. Each encoder layer consists of a multi-head self-attention module and a feed forward network, as in the original Transformer. The input and output of the encoder have the same size, and the computation is:

$$y_0 = [E_{p^1} + f_{p^1}, \dots, E_{p^N} + f_{p^N}]$$
$$q_i = k_i = v_i = LN(y_{i-1})$$
$$y'_i = MSA(q_i, k_i, v_i) + y_{i-1}$$
$$y_i = FFN(LN(y'_i)) + y'_i, \quad i = 1, \dots, l$$
$$[f_{E^1}, \dots, f_{E^N}] = y_l$$

Here $l$ is the number of encoder layers, MSA is the multi-head self-attention module, LN is layer normalization, and FFN is a feed forward network containing two fully connected layers.
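The reshaping of a C×H×W feature map into N = HW/P² patch "words" is just shape arithmetic, which can be sketched in plain Python (`patchify` is an illustrative name; real implementations use tensor operations):

```python
def patchify(f, P):
    """Split a CxHxW feature map (nested lists) into N = H*W/P^2 patches,
    each flattened to length P*P*C -- the "word" sequence fed to the encoder."""
    C, H, W = len(f), len(f[0]), len(f[0][0])
    assert H % P == 0 and W % P == 0, "H and W must be divisible by P"
    patches = []
    for py in range(0, H, P):              # top-left corner of each patch
        for px in range(0, W, P):
            flat = [f[c][py + dy][px + dx]
                    for dy in range(P) for dx in range(P) for c in range(C)]
            patches.append(flat)
    return patches

C, H, W, P = 4, 6, 6, 3
f = [[[c * 100 + y * 10 + x for x in range(W)] for y in range(H)] for c in range(C)]
seq = patchify(f, P)
print(len(seq), len(seq[0]))   # N = 6*6/3^2 = 4 patches, each of length P*P*C = 36
```

A learnable position encoding of the same length would then be added element-wise to each flattened patch before it enters the encoder.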

3. Transformer decoder

๋””์ฝ”๋” ๋˜ํ•œ ๊ธฐ์กด Transformer์™€ ๋™์ผํ•œ ์•„ํ‚คํ…์ฒ˜๋ฅผ ๋”ฐ๋ฅด๋ฉฐ, 2๊ฐœ์˜ MSA ๋ ˆ์ด์–ด์™€ 1๊ฐœ์˜ FFN ๋ ˆ์ด์–ด๋กœ ๊ตฌ์„ฑ๋ฉ๋‹ˆ๋‹ค. ํ•œ๊ฐ€์ง€ ์ฐจ์ด์ ์ด ์žˆ๋‹ค๋ฉด, Task๋ณ„ ์ž„๋ฒ ๋”ฉ์„ ๋””์ฝ”๋”์˜ Input์œผ๋กœ ์ถ”๊ฐ€ ํ™œ์šฉํ•œ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. Task๋ณ„ ์ž„๋ฒ ๋”ฉ์˜ ๊ฒฝ์šฐ EtiโˆˆRP2ร—C,i=1,...,NtE^{i}_{t} โˆˆ R^{P^{2}ร—C} , i = {1, ... , N_{t}} ์œผ๋กœ ๋‚˜ํƒ€๋‚ด๋ฉฐ, ๊ฐ๊ฐ ๋‹ค๋ฅธ Task ๋ณ„๋กœ feature๋ฅผ decode ํ•ฉ๋‹ˆ๋‹ค. ๋””์ฝ”๋”์˜ ๊ฒฝ์šฐ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๊ณต์‹์„ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. image ์—ฌ๊ธฐ์„œ, FDiโˆˆRP2ร—CF_{D_{i}} โˆˆR^{P^{2}ร—C} ๋Š” ๋””์ฝ”๋”์˜ outputs์ด๊ณ , decode๋œ P2ร—CP^{2}ร—C size์˜ N๊ฐœ์˜ ํŒจ์น˜ feature์˜ ๊ฒฝ์šฐ Cร—Hร—WC ร— H ร— W size๋ฅผ ๊ฐ–๋Š” fDf_{D} feature๋กœ ์žฌ๊ตฌ์„ฑ ๋ฉ๋‹ˆ๋‹ค.

4. Tails

Tails์˜ ๊ฒฝ์šฐ Heads์˜ ์†์„ฑ๊ณผ ๋™์ผํ•˜๋ฉฐ multi tails๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ๊ฐ ๋‹ค๋ฅธ Task๋ณ„๋กœ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค. ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๊ณต์‹ํ™” ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. fT=Ti(fD)f_{T} = T^{i}(f_{D}) ์—ฌ๊ธฐ์„œ Ti(i=1,...,Nt)T^{i} (i = {1, ... , N_{t}}) ๋Š” i๋ฒˆ์งธ Task์˜ Head๋ฅผ ๋‚˜ํƒ€๋‚ด๋ฉฐ, NtN_{t} ๋Š” task์˜ ๊ฐฏ์ˆ˜์ž…๋‹ˆ๋‹ค. output ftf_{t} ๋Š” ํŠน์ • task์— ์˜ํ•ด ๊ฒฐ์ •๋œ 3ร—Hโ€ฒร—Wโ€ฒ3 ร— H' ร— W' ์ด๋ฏธ์ง€ ์‚ฌ์ด์ฆˆ๊ฐ€ ๋ฉ๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, Hโ€ฒ=2H,Wโ€ฒ=2HH' = 2H, W' = 2H ๋ผ๋ฉด 2๋ฐฐ ํ™•๋Œ€ํ•œ super-resolution task(๊ณ ํ•ด์ƒ๋„ ์ž‘์—…)์ด ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

B. Pre-training on ImageNet

Transformer ์ž์ฒด์˜ ์•„ํ‚คํ…์ฒ˜ ์™ธ์—๋„ ์„ฑ๊ณต์ ์ธ ํ•™์Šต์˜ ํ•ต์‹ฌ ์š”์†Œ ์ค‘ ํ•˜๋‚˜๋Š” ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ์ž˜ ํ™œ์šฉํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ, ํ•™์Šต์„ ์œ„ํ•ด์„œ๋Š” ์ •์ƒ ์ด๋ฏธ์ง€์™€ ์†์ƒ๋œ ์ด๋ฏธ์ง€๊ฐ€ ์‚ฌ์šฉ๋˜๋ฏ€๋กœ ์ด์— ๋งž๋Š” ๋ฐ์ดํ„ฐ ์„ธํŠธ๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ImageNet ๋ฒค์น˜๋งˆํฌ์˜ ์ด๋ฏธ์ง€๋Š” ์งˆ๊ฐ ๋ฐ ์ƒ‰์ƒ์ด ํ’๋ถ€ํ•œ 100๋งŒ ๊ฐœ ์ด์ƒ์˜ nature ์ด๋ฏธ์ง€๊ฐ€ ํฌํ•จ๋˜์–ด์žˆ๊ณ  1000๊ฐœ ์ด์ƒ์˜ ๋‹ค์–‘ํ•œ ์นดํ…Œ๊ณ ๋ฆฌ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ๋ ˆ์ด๋ธ”์„ ์ œ๊ฑฐํ•˜๊ณ  ๋‹ค์–‘ํ•œ Task์— ๋งž๊ฒŒ ์‚ฌ์šฉ๋  ์ˆ˜ ์žˆ๋„๋ก ์ด๋ฏธ์ง€๋ฅผ ์ €ํ•˜ ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•˜์—ฌ ์ˆ˜๋™์œผ๋กœ ๋‹ค์Œ ๊ณต์‹๊ณผ ๊ฐ™์ด ์†์ƒ์‹œ์ผœ ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ์ค€๋น„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. Icorrupted=f(Iclean)I_{corrupted} = f(I_{clean}) ์—ฌ๊ธฐ์„œ, f ๋Š” ์ €ํ•˜(์†์ƒ) ๋ณ€ํ™˜์ด๋ผ ํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ Task์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์ง‘๋‹ˆ๋‹ค. ์ง€๋„ ๋ฐฉ์‹์œผ๋กœ IPT๋ฅผ ํ•™์Šตํ•˜๊ธฐ ์œ„ํ•œ ์†์‹ค ํ•จ์ˆ˜๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๊ณต์‹ํ™”ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. Lsupervised=sumi=1NtL1(IPT(Icorruptedi),Iclean)L_{supervised} = sum _{i=1} ^{N_{t}} L1(IPT(I_{corrupted}^{i}), I_{clean}) ์—ฌ๊ธฐ์„œ L1์€ ๊ธฐ์กด L1 ์†์‹ค์„ ๋‚˜ํƒ€๋‚ด๊ณ  ํ”„๋ ˆ์ž„์›Œํฌ๊ฐ€ ์—ฌ๋Ÿฌ ์ด๋ฏธ์ง€ ์ฒ˜๋ฆฌ ์ž‘์—…์œผ๋กœ ๋™์‹œ์— ํ›ˆ๋ จ๋˜์—ˆ์Œ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค. IPT ๋ชจ๋ธ์„ pre-trainingํ•œ ํ›„์—๋Š” ๋‹ค์–‘ํ•œ ์ด๋ฏธ์ง€ ์ฒ˜๋ฆฌ task์— ๋Œ€ํ•œ ๊ณ ์œ ํ•œ feature๊ณผ ๋ณ€ํ™˜์„ ์บก์ฒ˜(weight๋ฅผ ์ €์žฅ)ํ•˜๋ฏ€๋กœ ์ƒˆ๋กœ ์ œ๊ณต๋œ ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์›ํ•˜๋Š” ์ž‘์—…์— ์ ์šฉํ•˜๋„๋ก ๋”์šฑ Fine-tuningํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋•Œ, ๊ณ„์‚ฐ ๋น„์šฉ์„ ์ ˆ์•ฝํ•˜๊ธฐ ์œ„ํ•ด ๋‹ค๋ฅธ Heads์™€ Tails๋Š” ์‚ญ์ œ๋˜๊ณ  ๋‚จ์€ Heads์™€ Tails ๋ฐ Transformer body์˜ ๋งค๊ฐœ๋ณ€์ˆ˜๋Š” ์—ญ์ „ํŒŒ์— ๋”ฐ๋ผ ์—…๋ฐ์ดํŠธ ๋ฉ๋‹ˆ๋‹ค.

๋‹ค์–‘ํ•œ ๋ฐ์ดํ„ฐ ํ’ˆ์งˆ ์ €ํ•˜ ๋ชจ๋ธ์ด ์žˆ๊ณ  ๋ชจ๋“  ์ด๋ฏธ์ง€ ์ฒ˜๋ฆฌ task์— ์ ์šฉ์‹œํ‚ฌ ์ˆ˜ ์—†๊ธฐ์— IPT์˜ ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ์ด ๋”์šฑ ์ข‹์•„์•ผ ํ•ฉ๋‹ˆ๋‹ค. NLP์—์„œ์˜ Word ์ฒ˜๋Ÿผ Patch๋ผ๋ฆฌ์˜ ๊ด€๊ณ„๋„ ์ค‘์š”ํ•˜๊ธฐ์— ๋™์ผํ•œ feature map์—์„œ ์ž˜๋ฆฐ patch๋Š” ์œ ์‚ฌํ•œ ์œ„์น˜์— ํฌํ•จ๋˜์–ด์•ผํ•ฉ๋‹ˆ๋‹ค. ๋Œ€์กฐํ•™์Šต(contrastive learning)์„ ํ†ตํ•ด ๋ณดํŽธ์ ์ธ features๋ฅผ ํ•™์Šตํ•˜์—ฌ unseen tasks์— ๋Œ€ํ•ด์„œ๋„ IPT๋ชจ๋ธ์ด ํ™œ์šฉ๋  ์ˆ˜ ์žˆ๋„๋ก ํ–ˆ์Šต๋‹ˆ๋‹ค. ๊ฐ™์€ ์ด๋ฏธ์ง€์˜ ํŒจ์น˜ feature ์‚ฌ์ด์˜ ๊ฑฐ๋ฆฌ๋ฅผ ์ตœ์†Œํ™”ํ•˜๋ฉฐ ๋‹ค๋ฅธ ์ด๋ฏธ์ง€์˜ ํŒจ์น˜ feature ์‚ฌ์ด์˜ ๊ฑฐ๋ฆฌ๋Š” ์ตœ๋Œ€ํ™”ํ•˜๋„๋ก ํ•˜์˜€์Šต๋‹ˆ๋‹ค. ๋Œ€์กฐํ•™์Šต์˜ Loss Function์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค. image ๋˜ํ•œ, supervised ๋ฐ self-supervised ์ •๋ณด๋ฅผ ์™„์ „ํžˆ ํ™œ์šฉํ•˜๊ธฐ ์œ„ํ•ด IPT์˜ ์ตœ์ข… ๋ชฉ์  ํ•จ์ˆ˜๋ฅผ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๊ณต์‹ํ™” ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. image

4. Experiment & Result

A. Experimental Setup

1. DataSet

1๋ฐฑ๋งŒ ๊ฐœ ์ด์ƒ์˜ ์ปฌ๋Ÿฌ ์ด๋ฏธ์ง€ ImageNet ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ์‚ฌ์šฉํ•˜๋ฉฐ 3์ฑ„๋„ 48X48 ํŒจ์น˜๋“ค๋กœ crop๋ฉ๋‹ˆ๋‹ค. (1์ฒœ๋งŒ ๊ฐœ ์ด์ƒ์˜ ํŒจ์น˜) ์†์ƒ๋œ ๋ฐ์ดํ„ฐ๋Š” 6๊ฐ€์ง€(2๋ฐฐ, 3๋ฐฐ, 4๋ฐฐ bicubic interpolation, 30, 50 level ๊ฐ€์šฐ์‹œ์•ˆ ๋…ธ์ด์ฆˆ, rain streaks(๋น„ ๋‚ด๋ฆฌ๋Š” ๋…ธ์ด์ฆˆ))๋กœ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. ๊ณต์ •ํ•œ ๋น„๊ต๋ฅผ ์œ„ํ•ด CNN ๊ธฐ๋ฐ˜ ๋ชจ๋ธ์—๋„ ๋™์ผํ•œ ํ…Œ์ŠคํŠธ ์ „๋žต์ด ์ ์šฉ๋˜์—ˆ์œผ๋ฉฐ CNN ๋ชจ๋ธ์˜ ๊ฒฐ๊ณผ PSNR ๊ฐ’์€ ๊ธฐ์ค€์„ ์˜ ๊ฐ’๊ณผ ๋™์ผํ•ฉ๋‹ˆ๋‹ค.

2. Training & Fine-tuning.

NVIDIA V100 32์žฅ์„ ์‚ฌ์šฉํ•˜์—ฌ Adam optimizer ฮฒ1 = 0.9, ฮฒ2 = 0.999๋กœ 300์—ํญ ์ˆ˜์ •๋œ ImageNet dataset์„ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค. Learning rate๋Š” 5eโˆ’55e^{-5} ๋ถ€ํ„ฐ 2eโˆ’52e^{-5} ๊นŒ์ง€ 256 ๋ฐฐ์น˜ ํฌ๊ธฐ๋กœ 200 ์—ํญ ๋™์•ˆ ์ค„์–ด๋“ญ๋‹ˆ๋‹ค. ํ›ˆ๋ จ ์„ธํŠธ๋Š” ์„œ๋กœ ๋‹ค๋ฅธ ์ž‘์—…์œผ๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ์–ด ๋‹จ์ผ ๋ฐฐ์น˜์— ๋ฉ”๋ชจ๋ฆฌ ํ•œ๊ณ„๋กœ ๋ชจ๋“  input์„ ํƒœ์šธ ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ๊ฐ ๋ฐ˜๋ณต์—์„œ ๋ฌด์ž‘์œ„๋กœ ์„ ํƒ๋œ ์ž‘์—…์˜ ์ด๋ฏธ์ง€ ๋ฐฐ์น˜๋ฅผ ์Œ“์Šต๋‹ˆ๋‹ค. IPT Model์„ pre-training ํ•œ ์ดํ›„ ์›ํ•˜๋Š” task(e.g., 3๋ฐฐ super-resolution)๋ฅผ 2eโˆ’52e^{-5} learning rate๋กœ 30 ์—ํญ ๋™์•ˆ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค. SRCNN ๋ฐฉ์‹ ๋˜ํ•œ ImageNet ํ•™์Šต๋ฐฉ์‹์„ ์‚ฌ์šฉํ•˜๋ฉด super-resolution task์˜ ์„ฑ๋Šฅ์ด ๊ฐœ์„ ๋จ์„ ๋ณด์—ฌ์คฌ์Šต๋‹ˆ๋‹ค.

B. Result

์ดˆํ•ด์ƒ๋„์™€ ์˜์ƒ ์žก์Œ ์ œ๊ฑฐ๋ฅผ ํฌํ•จํ•œ ๋‹ค์–‘ํ•œ image processing tasks ์—์„œ pre-trained๋œ IPT์˜ ์„ฑ๋Šฅ์€ state-of-the-art๋ฅผ ๋Šฅ๊ฐ€ํ–ˆ์Šต๋‹ˆ๋‹ค.

1. Super-resolution

IPT Model์„ ๋ช‡๋ช‡์˜ state-of-the-art CNN-based SR ๋ฐฉ์‹๊ณผ ๋น„๊ตํ–ˆ๊ณ  Table 1์—์„œ์™€ ๊ฐ™์ด ๋ชจ๋“  ๋ฐ์ดํ„ฐ์…‹์—์„œ ร—2, ร—3, ร—4 scale ์„ฑ๋Šฅ์ด ๊ฐ€์žฅ ์ข‹์•˜๊ณ  ร—2 scale Urban100 dataset์—์„œ 33.76dB PSNR์„ ๋‹ฌ์„ฑํ•จ์„ ๊ฐ•์กฐํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด์ „ ๋ชจ๋ธ๋“ค์ด ์ด์ „ SOTA๋ณด๋‹ค <0.2dB ์”ฉ ๊ฐœ์„ ๋˜์—ˆ์—ˆ์ง€๋งŒ ์ด๋ฒˆ ๋ชจ๋ธ์€ ~0.4dB์ด๋‚˜ ๊ฐœ์„ ๋˜์–ด ๋Œ€๊ทœ๋ชจ pre-trained Model์˜ ์šฐ์ˆ˜์„ฑ์„ ๋‚˜ํƒ€๋ƒˆ์Šต๋‹ˆ๋‹ค.

2. Denoising

ํ•™์Šต ๋ฐ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ๋Š” ๊นจ๋—ํ•œ ์ด๋ฏธ์ง€์—์„œ ฯƒ = 30, 50 level์˜ ๊ฐ€์šฐ์Šค ์žก์Œ์„ ์ถ”๊ฐ€ํ•˜์—ฌ ์ƒ์„ฑ๋˜์—ˆ๊ณ  SOTA Model๊ณผ ๋น„๊ตํ–ˆ์Šต๋‹ˆ๋‹ค. Table 2๋Š” BSD68 ๋ฐ Urban100 ๋ฐ์ดํ„ฐ ์„ธํŠธ์— ๋Œ€ํ•œ ์ปฌ๋Ÿฌ ์ด๋ฏธ์ง€ ๋…ธ์ด์ฆˆ ์ œ๊ฑฐ ๊ฒฐ๊ณผ์ด๋ฉฐ, IPT ๋ชจ๋ธ์ด ๋‹ค์–‘ํ•œ ๊ฐ€์šฐ์Šค ๋…ธ์ด์ฆˆ ๋ ˆ๋ฒจ์—์„œ ์ตœ์ƒ์˜ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. Urban100 ๋ฐ์ดํ„ฐ์…‹์—์„œ๋Š” โˆผ2dB ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ๋ณด์—ฌ์ฃผ๊ณ , Pre-training ๋ฐฉ์‹, Transformer ๊ธฐ๋ฐ˜ ๋ชจ๋ธ์˜ ์šฐ์ˆ˜์„ฑ์„ ๋‚˜ํƒ€๋‚ด์—ˆ์Šต๋‹ˆ๋‹ค. image

๊ธฐ์กด ๋ฐฉ์‹์œผ๋กœ๋Š” ๋…ธ์ด์ฆˆ ์ด๋ฏธ์ง€์—์„œ ๊นจ๋—ํ•œ ์ด๋ฏธ์ง€๋กœ์˜ ๋ณต๊ตฌ๊ฐ€ ์–ด๋ ค์› ๊ณ  ์ถฉ๋ถ„ํ•œ ๋””ํ…Œ์ผ์„ ์žฌ๊ตฌ์„ฑํ•˜์ง€ ๋ชปํ•ด ๋น„์ •์ƒ์ ์ธ ํ”ฝ์…€์„ ์ƒ์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค. IPT์˜ ๊ฒฝ์šฐ ๋จธ๋ฆฌ์นด๋ฝ์˜ ๋ช‡ ๊ฐ€์ง€ ๋””ํ…Œ์ผ๊นŒ์ง€ ์ž˜ ๋ณต๊ตฌํ•˜๋ฉฐ ์‹œ๊ฐ์ ์ธ ํ’ˆ์งˆ์ด ์ด์ „ ๋ชจ๋ธ์„ ๋Šฅ๊ฐ€ํ–ˆ์Šต๋‹ˆ๋‹ค. image image

3. Generalization Ability

๋‹ค์–‘ํ•œ ์†์ƒ๋œ ์ด๋ฏธ์ง€ ์ƒ์„ฑ์€ ๊ฐ€๋Šฅํ•ด๋„, ์ž์—ฐ์ ์ธ ์ด๋ฏธ์ง€๋Š” ๋ณต์žก๋„๊ฐ€ ๋†’๊ณ  transformer์˜ pre-training์„ ์œ„ํ•ด ๋ชจ๋“  ์ด๋ฏธ์ง€ ๋ฐ์ดํ„ฐ์…‹์„ ํ•ฉ์„ฑ(์ƒ์„ฑ)ํ•  ์ˆ˜ ์—†๋Š” ํ•œ๊ณ„๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ IPT ๋ชจ๋ธ์ด Vision task๋ฅผ ๋„˜์–ด NLP๋ถ„์•ผ์—์„œ๊นŒ์ง€ ์—ฌ๋Ÿฌ task๋ฅผ ์ž˜ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋Š” ๋Šฅ๋ ฅ์ด ์žˆ์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ์ผ๋ฐ˜ํ™” ๋Šฅ๋ ฅ์„ ๊ฒ€์ฆํ•˜๊ณ ์ž ImageNet ์ด์™ธ์— ์†์ƒ๋œ ์ด๋ฏธ์ง€(๋…ธ์ด์ฆˆ 10 & 70 level)์˜ ๋…ธ์ด์ฆˆ ์ œ๊ฑฐ ํ…Œ์ŠคํŠธ๋ฅผ ์ง„ํ–‰ํ–ˆ์Šต๋‹ˆ๋‹ค. IPT ๋ชจ๋ธ์€ CNN ๋ฐ ๋‹ค๋ฅธ ๋ชจ๋ธ๋ณด๋‹ค ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ฃผ์—ˆ์Šต๋‹ˆ๋‹ค. image

4. Impact of data percentage

๋ฐ์ดํ„ฐ ๋ฐฑ๋ถ„์œจ์ด Transformer ๋ฐ CNN ๋ชจ๋ธ์˜ pre-training ์„ฑ๋Šฅ์— ์–ด๋– ํ•œ ์˜ํ–ฅ์„ ์ฃผ๋Š”์ง€ ์‹คํ—˜ํ•ฉ๋‹ˆ๋‹ค. ImageNet ๋ฐ์ดํ„ฐ ์„ธํŠธ์˜ 20%, 40%, 60%, 80% ๋ฐ 100%์„ ์‚ฌ์šฉํ•˜์—ฌ Figure 6๊ณผ ๊ฐ™์ด ๊ฒฐ๊ณผ๋ฅผ ํ™•์ธํ•˜์˜€์Šต๋‹ˆ๋‹ค. ๋ชจ๋ธ์ด pre-trainingํ•˜์ง€ ์•Š๊ฑฐ๋‚˜ ์†Œ๋Ÿ‰ ํ•™์Šต๋˜๋Š” ๊ฒฝ์šฐ CNN ๋ชจ๋ธ์ด ๋” ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ฃผ์ง€๋งŒ, ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ์—์„  transformer ๊ธฐ๋ฐ˜ pre-trained ๋ชจ๋ธ(IPT)์ด ์„ฑ๋Šฅ์„ ์••๋„ํ•ฉ๋‹ˆ๋‹ค.

5. Impact of contrastive learning

Pre-trained model์˜ ์„ฑ๋Šฅ์„ ๊ฐœ์„ ์‹œํ‚ค๊ณ ์ž ร—2 scale super-resolution task์—์„œ Set4 ๋ฐ์ดํ„ฐ์…‹์„ ํ™œ์šฉํ•ด ฮป ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์‹คํ—˜ํ•ฉ๋‹ˆ๋‹ค. ฮป=0 ์—์„œ๋ณด๋‹ค ฮป = 0.1 ์—์„œ 0.1dB ๋†’์€ 38.37dB PSNR ๊ฐ’์ด ๋‚˜์™”๊ณ  ์ตœ์ ์˜ ฮป ๋งค๊ฐœ๋ณ€์ˆ˜ ๊ฐ’์„ ์ฐพ์•˜์Šต๋‹ˆ๋‹ค. image

5. Conclusion

์ด ๋…ผ๋ฌธ์—์„œ๋Š” NLP ๋ถ„์•ผ์—์„œ ๊ทธ๋ฆฌ๊ณ  ์ปดํ“จํ„ฐ ๋น„์ „ ๋ถ„์•ผ๊นŒ์ง€ ๋ฐœ์ „๋˜๊ณ  ์žˆ๋Š” Transformer ๊ธฐ๋ฐ˜ Pre-training ๊ธฐ๋ฒ•์„ ์‚ฌ์šฉํ•˜์—ฌ IPT๋ชจ๋ธ์„ ๊ฐœ๋ฐœํ•˜๊ณ  ๋‹ค์–‘ํ•œ ์ด๋ฏธ์ง€ ์ฒ˜๋ฆฌ ๋ฌธ์ œ์—์„œ ์ตœ์‹  SOTA ์ด์ƒ์˜ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ฃผ์—ˆ์Šต๋‹ˆ๋‹ค. ์›๋ณธ ์ด๋ฏธ์ง€์™€ ์†์ƒ๋œ ์ด๋ฏธ์ง€ ๋ฐ์ดํ„ฐ ์Œ์„ ํ†ตํ•ด IPT ๋ชจ๋ธ์„ ์‚ฌ์ „ ํ•™์Šตํ•˜์—ฌ ๊ฐ ์ด๋ฏธ์ง€ ์ฒ˜๋ฆฌ task์— ๋”ฐ๋ผ ์‹ ์†ํ•˜๊ฒŒ ๋ฏธ์„ธ ์กฐ์ •ํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•ฉ๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ํ•˜๋‚˜์˜ ๋ชจ๋ธ๋กœ๋„ ๋‹ค์–‘ํ•œ Task์— ์ ์šฉํ•  ์ˆ˜ ์žˆ๊ณ  ์ผ๋ฐ˜ํ™” ๋  ์ˆ˜ ์žˆ๋Š” ๋Šฅ๋ ฅ์„ ์ž…์ฆํ–ˆ์Šต๋‹ˆ๋‹ค. ํŠนํžˆ ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ์…‹์—์„œ ์••๋„์ ์ธ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ฃผ์—ˆ๊ณ  ๋ฐ์ดํ„ฐ์˜ ๋น„๋ก€ํ•˜์—ฌ ์„ฑ๋Šฅ์ด ๋†’์•„์งˆ ๊ฒƒ์ด๋ผ๊ณ  ํŒ๋‹จ๋ฉ๋‹ˆ๋‹ค.

A. Take home message

  1. In image processing tasks as well, pre-training and fine-tuning a Transformer-based model on a large-scale dataset proved highly effective, and performance scales with the amount of data.

  2. NLP์˜ Word์™€ ๊ฐ™์ด ์ด๋ฏธ์ง€ input ๋ฐ์ดํ„ฐ๋ฅผ Patch๋กœ ๋ณ€ํ™˜ํ•˜์—ฌ Transformer ๊ธฐ๋ฐ˜์˜ ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

  3. IPT ๋ชจ๋ธ์„ ์‚ฌ์ „ ํ•™์Šตํ•œ ํ›„ ๊ฐ Task์— ๋งž๋Š” ๊ณ ์œ  Feature๋“ค๊ณผ ๋ณ€ํ™˜์„ ์บก์ณํ•˜์—ฌ Fine-tuning ์‹œ ์›ํ•˜๋Š” Task์— ๋งž๊ฒŒ ํ•„์š”์—†๋Š” ๋งค๊ฐœ๋ณ€์ˆ˜๋Š” ์‚ญ์ œํ•˜์—ฌ ๋น„์šฉ์ ์ธ ์ธก๋ฉด์—์„œ๋„ ์œ ๋ฆฌํ•ด๋ณด์˜€์Šต๋‹ˆ๋‹ค.

Author / Reviewer information

Author

๋ฐ•์ค€ํ˜• (Junhyung Park)

  • Affiliation (KAIST AI / NAVER)

  • Machine Learning Engineer @ NAVER Shopping AI Team

Reviewer

  1. Korean name (English name): Affiliation / Contact information

  2. Korean name (English name): Affiliation / Contact information

  3. โ€ฆ

Reference & Additional materials
