IPT [Kor]
Chen et al. / Pre-Trained Image Processing Transformer / CVPR 2021
1. Problem definition
Image processing is one of the low-level components of a broader image analysis or computer vision system. The results of image processing can largely influence the subsequent high-level stages that perform recognition and understanding of image data. Recently, hardware computing power with deep learning on GPUs has grown rapidly, and pre-trained deep learning models trained on large-scale datasets have shown better results than conventional methods. Moreover, these deep learning techniques are widely applied to low-level vision tasks such as image super-resolution, inpainting, deraining, and colorization. However, few studies have generalized across multiple image processing tasks through pre-training.
This paper presents IPT (Image Processing Transformer), a pre-trained deep learning model that generalizes over low-level computer vision tasks such as denoising, super-resolution, and deraining, and achieves results at or beyond the state of the art. The authors also utilize the well-known ImageNet benchmark to obtain the many pairs of corrupted images used in the experiments.
2. Motivation
A. Related work
1. Image processing
Image processing consists of manipulations of images, including super-resolution (increasing the resolution), denoising (noise removal), dehazing (removing atmospheric noise such as fog and haze from fine particles), deraining (removing rain-streak noise), and deblurring (blur removal).
(Dong et al.) proposed SRCNN for super-resolution, a pioneering work that introduced an end-to-end model to reconstruct high-resolution images from their low-resolution counterparts.
(Kim et al.) scaled up the size of the deep neural network, using a deeper convolutional network than the work above.
(Ahn et al. & Lim et al.) added the concept of residual blocks to the super-resolution (SR) task.
(Zhang et al. & Anwar & Barnes) brought the power of attention to the SR task.
There is also a large body of work on the other tasks.
(Tian et al. and others, 5 papers) studied denoising (noise removal).
(Cai et al. and others, 4 papers) studied dehazing.
(Hu et al. and others, 6 papers) studied deraining.
(Tao et al. and others, 4 papers) studied deblurring.
Idea 1. Prior works each studied a single task with separate, individual methods; in contrast, this paper trains one large pre-trained model on a large volume of data, runs experiments across multiple image processing tasks, and shows strong results.
2. Transformer
(Vaswani et al.) The Transformer has proven its success as a powerful unsupervised or self-supervised pretraining framework across various natural language processing tasks.
(Radford et al.) GPTs are pre-trained autoregressively to predict the next word on huge text datasets.
(Devlin et al.) BERT learns from data without explicit supervision and predicts masked words based on context.
(Colin et al.) proposed a universal pre-training framework for several downstream tasks.
Motivated by the success of Transformer-based models in NLP, there is also research on exploiting Transformer-based models in the computer vision field.
(Yuan et al.) introduced spatial attention for image segmentation.
(Fu et al.) proposed DANET, which exploits context information by combining spatial attention and channel attention.
(Kolesnikov et al.) performed image classification with Transformer blocks (replacing the convolutional neural network with self-attention blocks).
(Wu et al. & Zhao et al.) proposed pre-training methods for Transformer-based models on image recognition tasks.
(Jiang et al.) proposed TransGAN to generate images using Transformers.
Idea 2. While there has been plenty of research on image processing and on applying Transformers to computer vision, little related work has focused on applying pre-trained models such as the Transformer to low-level vision tasks like image processing. This paper therefore explores a universal pre-training approach for image processing tasks.
3. Method
A. Image Processing Transformer (IPT)
The overall architecture of IPT consists of four components (Heads - Encoder - Decoder - Tails):
- Heads to extract features from the input images (noisy images or low-resolution images)
- An encoder-decoder Transformer to recover the missing information in the input data
- Tails to restore the representations from the decoder into appropriately recovered images
1. Heads
To handle the different image processing tasks, a multi-head architecture is used so that each task is processed by its own head. Each head consists of 3 convolutional layers. Denoting the input image as $x \in \mathbb{R}^{3 \times H \times W}$ (3 means R, G, and B), the head generates a feature map $f_H \in \mathbb{R}^{C \times H \times W}$ with $C$ (usually 64) channels. Formally, $f_H = H^i(x)$, where $H^i\ (i = \{1, \dots, N_t\})$ denotes the head for the $i$-th task and $N_t$ denotes the number of tasks.
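As a rough illustration, here is a minimal PyTorch sketch of this multi-head design; the exact layer configuration (kernel sizes, activations) is an assumption, since the text only fixes 3 convolutional layers per head and $C = 64$ output channels.

```python
import torch
import torch.nn as nn

class Head(nn.Module):
    """One task-specific head: 3 convolutional layers mapping RGB to C channels."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)  # (B, 3, H, W) -> (B, C, H, W)

# One head per task: H^i is selected by the task index i.
heads = nn.ModuleList([Head(64) for _ in range(6)])  # e.g., N_t = 6 tasks
f_H = heads[0](torch.randn(1, 3, 48, 48))            # feature map for task 0
```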
2. Transformer encoder
Before the input features are fed into the Transformer body, they are split into **patches** so that each patch can be regarded as a "word".
Specifically, the feature map $f_H \in \mathbb{R}^{C \times H \times W}$ is reshaped into a sequence of patches $f_{p_i} \in \mathbb{R}^{P^2 \times C},\ i = \{1, \dots, N\}$, where $N = HW/P^2$ is the number of patches (i.e., the length of the sequence) and $P$ is the patch size.
To maintain the position information of each patch, a learnable position encoding $E_{p_i} \in \mathbb{R}^{P^2 \times C}$ is added to the feature $f_{p_i}$ of each patch. Then, $E_{p_i} + f_{p_i}$ becomes the input to the Transformer encoder.
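A minimal sketch of this patchify step, assuming non-overlapping $P \times P$ windows and, for code simplicity, flattening each $P^2 \times C$ patch into a single token vector:

```python
import torch
import torch.nn.functional as F

def patchify(f_H: torch.Tensor, P: int) -> torch.Tensor:
    """Reshape a (B, C, H, W) feature map into N = HW/P^2 patch "words",
    each flattened into a C*P*P-dimensional token."""
    patches = F.unfold(f_H, kernel_size=P, stride=P)  # (B, C*P*P, N)
    return patches.transpose(1, 2)                    # (B, N, C*P*P)

f_H = torch.randn(1, 64, 48, 48)
tokens = patchify(f_H, P=4)        # (1, 144, 1024): N = 144 patch tokens
E_p = torch.zeros(tokens.shape[1], tokens.shape[2], requires_grad=True)
encoder_input = tokens + E_p       # learnable position encoding per patch
```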
Each encoder layer has the structure of the original Transformer, consisting of a multi-head self-attention module and a feed forward network. The input and output of the encoder have the same size, and the encoder can be formulated as follows:

$$
\begin{aligned}
y_0 &= [E_{p_1} + f_{p_1}, E_{p_2} + f_{p_2}, \dots, E_{p_N} + f_{p_N}] \\
q_i = k_i = v_i &= \mathrm{LN}(y_{i-1}) \\
y'_i &= \mathrm{MSA}(q_i, k_i, v_i) + y_{i-1} \\
y_i &= \mathrm{FFN}(\mathrm{LN}(y'_i)) + y'_i, \qquad i = 1, \dots, l \\
[f_{E_1}, f_{E_2}, \dots, f_{E_N}] &= y_l
\end{aligned}
$$

where $l$ denotes the number of layers in the encoder, MSA denotes the multi-head self-attention module, LN denotes layer normalization, and FFN denotes the feed forward network, which contains two fully connected layers.
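Under these equations, a pre-norm encoder layer could be sketched as follows (hidden sizes and head counts are assumptions):

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Pre-norm encoder layer: y' = MSA(LN(y)) + y, then y = FFN(LN(y')) + y'."""
    def __init__(self, dim: int, num_heads: int = 8, ffn_dim: int = 2048):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(  # FFN with two fully connected layers
            nn.Linear(dim, ffn_dim), nn.ReLU(inplace=True), nn.Linear(ffn_dim, dim)
        )

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        h = self.ln1(y)                                    # q = k = v = LN(y)
        y = self.msa(h, h, h, need_weights=False)[0] + y   # residual connection
        return self.ffn(self.ln2(y)) + y
```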
3. Transformer decoder
The decoder also follows the same architecture as the original Transformer, consisting of two MSA layers and one FFN layer. The one difference is that a task-specific embedding is additionally used as an input to the decoder. These task-specific embeddings $E_t^i \in \mathbb{R}^{P^2 \times C},\ i = \{1, \dots, N_t\}$ are learned to decode features for each different task.
The decoder can be formulated as follows:

$$
\begin{aligned}
z_0 &= [f_{E_1}, f_{E_2}, \dots, f_{E_N}] \\
q_i = k_i &= \mathrm{LN}(z_{i-1}) + E_t, \quad v_i = \mathrm{LN}(z_{i-1}) \\
z'_i &= \mathrm{MSA}(q_i, k_i, v_i) + z_{i-1} \\
q'_i &= \mathrm{LN}(z'_i) + E_t, \quad k'_i = v'_i = \mathrm{LN}(z_0) \\
z''_i &= \mathrm{MSA}(q'_i, k'_i, v'_i) + z'_i \\
z_i &= \mathrm{FFN}(\mathrm{LN}(z''_i)) + z''_i, \qquad i = 1, \dots, l \\
[f_{D_1}, f_{D_2}, \dots, f_{D_N}] &= z_l
\end{aligned}
$$

where $f_{D_i} \in \mathbb{R}^{P^2 \times C}$ denotes the decoder outputs; the $N$ decoded patch features of size $P^2 \times C$ are reshaped back into a feature $f_D$ of size $C \times H \times W$.
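A corresponding sketch of one decoder layer under the formulation above (again, sizes are assumptions):

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Decoder layer with two MSA blocks and one FFN; the task embedding E_t is
    added to the queries and keys, following the equations above."""
    def __init__(self, dim: int, num_heads: int = 8, ffn_dim: int = 2048):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)
        self.ln_kv = nn.LayerNorm(dim)
        self.ln3 = nn.LayerNorm(dim)
        self.msa1 = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.msa2 = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, ffn_dim), nn.ReLU(inplace=True), nn.Linear(ffn_dim, dim)
        )

    def forward(self, z, z0, E_t):
        h = self.ln1(z)                # q = k = LN(z) + E_t, v = LN(z)
        z = self.msa1(h + E_t, h + E_t, h, need_weights=False)[0] + z
        q = self.ln2(z) + E_t          # q' = LN(z') + E_t
        kv = self.ln_kv(z0)            # k' = v' = LN(z_0)
        z = self.msa2(q, kv, kv, need_weights=False)[0] + z
        return self.ffn(self.ln3(z)) + z
```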
4. Tails
The properties of the tails are the same as those of the heads, and multiple tails are used so that each task is processed separately. This can be formulated as $f_T = T^i(f_D)$, where $T^i\ (i = \{1, \dots, N_t\})$ denotes the tail for the $i$-th task and $N_t$ denotes the number of tasks. The output $f_T$ has a size of $3 \times H' \times W'$ determined by the specific task. For example, $H' = 2H,\ W' = 2W$ for a 2Ć upscaling super-resolution task.
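Tying the four components together, a hypothetical end-to-end forward pass built from the sketches above (the fold call is the inverse of the earlier patchify):

```python
import torch
import torch.nn.functional as F

def ipt_forward(x, task_id, heads, tails, enc_layers, dec_layers, E_p, E_t, P=4):
    """Hypothetical glue code: Head -> patchify -> encoder -> decoder -> Tail."""
    f_H = heads[task_id](x)                 # (B, C, H, W)
    B, C, H, W = f_H.shape
    y = patchify(f_H, P) + E_p              # (B, N, C*P*P) with position encoding
    for layer in enc_layers:                # Transformer encoder
        y = layer(y)
    z = y
    for layer in dec_layers:                # decoder, conditioned on E_t[task_id]
        z = layer(z, y, E_t[task_id])
    f_D = F.fold(z.transpose(1, 2), (H, W), kernel_size=P, stride=P)  # (B, C, H, W)
    return tails[task_id](f_D)              # task-specific restored image
```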
B. Pre-training on ImageNet
Beyond the Transformer architecture itself, one key ingredient of successful training is making good use of a large-scale dataset. Training also requires pairs of clean and corrupted images, so a dataset suited to this is needed. The ImageNet benchmark contains over 1 million natural images with abundant texture and color, covering more than 1,000 diverse categories. The labels are therefore removed, and corrupted counterparts are generated automatically with degradation models so that the images can serve the various tasks, preparing the paired dataset as

$$I_{corrupted} = f(I_{clean})$$

where $f$ denotes the degradation transform, which depends on the task. The loss function for training IPT in a supervised fashion can be formulated as

$$\mathcal{L}_{supervised} = \sum_{i=1}^{N_t} L_1\big(\mathrm{IPT}(I_{corrupted}^i),\ I_{clean}\big)$$

where $L_1$ denotes the conventional L1 loss and the formulation implies that the framework is trained on multiple image processing tasks simultaneously. After pre-training, the IPT model captures intrinsic features and transformations for a wide variety of image processing tasks (the weights are saved), so it can be further fine-tuned on a newly provided dataset to apply it to the desired task. At that point, to save computation cost, the other heads and tails are dropped, and the parameters of the remaining head, tail, and Transformer body are updated by back-propagation.
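To make the data preparation and the supervised objective concrete, here is a small sketch; the degradation functions are simplified stand-ins for the bicubic downsampling, Gaussian noise, and rain-streak corruptions described in the experiments:

```python
import torch
import torch.nn.functional as F

def degrade(clean: torch.Tensor, task: str) -> torch.Tensor:
    """f(I_clean): task-dependent corruption (simplified stand-ins)."""
    if task == "sr_x2":       # bicubic downsampling for 2x super-resolution
        return F.interpolate(clean, scale_factor=0.5, mode="bicubic")
    if task == "denoise_30":  # additive Gaussian noise, level 30 (pixels in [0, 255])
        return clean + 30.0 * torch.randn_like(clean)
    raise ValueError(f"unknown task: {task}")

def supervised_loss(ipt, clean: torch.Tensor, tasks: list) -> torch.Tensor:
    """L_supervised = sum_i L1(IPT(I_corrupted^i), I_clean)."""
    return sum(
        F.l1_loss(ipt(degrade(clean, t), task_id=i), clean)
        for i, t in enumerate(tasks)
    )
```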
Since there are degradation models for a wide variety of corruptions and the model should apply to every image processing task, the generalization ability of IPT needs to be especially strong.
As with words in NLP, the relationships between patches matter, so patches cropped from the same feature map should fall into similar positions in the feature space.
Through contrastive learning, universal features are learned so that the IPT model can also be applied to unseen tasks.
The distance between patch features from the same image is minimized, while the distance between patch features from different images is maximized.
The loss function for contrastive learning is formulated as:

$$
l\big(f_{D_{i_1}}^j, f_{D_{i_2}}^j\big) = -\log \frac{\exp\big(d(f_{D_{i_1}}^j, f_{D_{i_2}}^j)\big)}{\sum_{k=1}^{B} \mathbb{1}_{k \neq j} \exp\big(d(f_{D_{i_1}}^j, f_{D_{i_2}}^k)\big)}
$$

$$
\mathcal{L}_{contrastive} = \frac{1}{BN^2} \sum_{i_1=1}^{N} \sum_{i_2=1}^{N} \sum_{j=1}^{B} l\big(f_{D_{i_1}}^j, f_{D_{i_2}}^j\big)
$$

where $d(a, b) = \frac{a^T b}{\|a\|\,\|b\|}$ denotes cosine similarity, $f_{D_i}^j$ denotes the decoded feature of the $i$-th patch of the $j$-th image, and $B$ is the batch size.
Furthermore, to fully utilize both the supervised and self-supervised information, the final objective function of IPT can be formulated as:

$$
\mathcal{L}_{IPT} = \lambda \cdot \mathcal{L}_{contrastive} + \mathcal{L}_{supervised}
$$
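Under these definitions, the contrastive term could be sketched as follows, with cosine similarity as $d$ over decoder patch features of shape (B, N, dim); any temperature scaling is omitted, matching the formulation above:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(f_D: torch.Tensor) -> torch.Tensor:
    """Patch-level contrastive loss: patches from the same image are positives,
    patches from other images in the batch are negatives. Assumes B > 1."""
    B, N, _ = f_D.shape
    f = F.normalize(f_D, dim=-1)               # dot product == cosine similarity d
    sim = torch.einsum("jnd,kmd->jnkm", f, f)  # sim[j, i1, k, i2]
    exp_sim = sim.exp()
    pos = exp_sim[torch.arange(B), :, torch.arange(B), :]  # k == j pairs, (B, N, N)
    neg = exp_sim.sum(dim=2) - pos                         # sum over k != j
    return -(pos / neg).log().mean()           # average over the B * N^2 pairs

# final objective: L_IPT = lam * contrastive_loss(f_D) + supervised_loss(...)
```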
4. Experiment & Result
A. Experimental Setup
1. Dataset
The ImageNet dataset with over 1 million color images is used, cropped into 3-channel 48Ć48 patches (over 10 million patches). Corrupted data are generated in 6 ways: 2Ć, 3Ć, and 4Ć bicubic interpolation, Gaussian noise at levels 30 and 50, and rain streaks. For a fair comparison, the same testing strategy is applied to the CNN-based models, and the resulting PSNR values of the CNN models are consistent with their original baselines.
2. Training & Fine-tuning.
The model is trained on the corrupted ImageNet dataset for 300 epochs using 32 NVIDIA V100 GPUs, with the Adam optimizer ($\beta_1$ = 0.9, $\beta_2$ = 0.999). The learning rate decays from $5 \times 10^{-5}$ to $2 \times 10^{-5}$ over 200 epochs with a batch size of 256. Since the training set consists of different tasks, not every input fits in a single batch within the memory limit, so each iteration stacks a batch of images from a randomly selected task. After pre-training, the IPT model is trained on the desired task (e.g., 3Ć super-resolution) for 30 epochs with a learning rate of $2 \times 10^{-5}$. Applying the same ImageNet training scheme to SRCNN also showed improved performance on the super-resolution task.
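A rough sketch of this optimizer and schedule setup (the linear decay shape is an assumption; the text only fixes the start and end rates over 200 epochs):

```python
import torch
import torch.nn as nn

params = nn.Linear(8, 8).parameters()  # stand-in for the IPT parameters
optimizer = torch.optim.Adam(params, lr=5e-5, betas=(0.9, 0.999))
# decay from 5e-5 to 2e-5 over 200 epochs, then hold (0.4 = 2e-5 / 5e-5)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda e: max(0.4, 1.0 - 0.6 * e / 200)
)
for epoch in range(300):
    # ... one epoch over batches, each drawn from a randomly selected task ...
    scheduler.step()
```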
B. Result
The performance of the pre-trained IPT surpasses the state of the art on various image processing tasks, including super-resolution and image denoising.
1. Super-resolution
The IPT model is compared with several state-of-the-art CNN-based SR methods. As shown in Table 1, it achieves the best Ć2, Ć3, and Ć4 scale performance on all datasets, highlighted by 33.76 dB PSNR on the Ć2 scale Urban100 dataset. Whereas previous models improved over the prior SOTA by <0.2 dB at a time, this model improves by ~0.4 dB, demonstrating the superiority of the large-scale pre-trained model.
2. Denoising
The training and test data are generated by adding Gaussian noise of levels Ļ = 30 and 50 to clean images, and IPT is compared with SOTA models.
Table 2 shows the color image denoising results on the BSD68 and Urban100 datasets; the IPT model achieves the best performance across the various Gaussian noise levels. On the Urban100 dataset it shows a gain of about 2 dB, demonstrating the superiority of the pre-training approach and the Transformer-based model.
Conventional methods struggle to recover clean images from noisy ones; they fail to reconstruct enough detail and produce abnormal pixels. IPT recovers even a few details of hair, with visual quality surpassing previous models.
3. Generalization Ability
Although various kinds of corrupted images can be synthesized, natural images are highly complex, and there is a limit to synthesizing every possible image degradation for pre-training the Transformer. Therefore, the IPT model should be able to handle a variety of tasks well, as pre-trained models do even beyond vision, in the NLP field. To verify this generalization ability, denoising tests were conducted with corrupted images at noise levels (10 and 70) not included in the ImageNet pre-training.
The IPT model showed better performance than CNN and other models.
4. Impact of data percentage
This experiment examines how the data percentage affects the pre-training performance of the Transformer and CNN models, using 20%, 40%, 60%, 80%, and 100% of the ImageNet dataset, with the results shown in Figure 6. When models are not pre-trained or are pre-trained on a small amount of data, the CNN models show better performance, but on large-scale data the Transformer-based pre-trained model (IPT) overtakes them.
5. Impact of contrastive learning
To improve the performance of the pre-trained model, the λ parameter weighting the contrastive loss is tuned on the Set5 dataset for the Ć2 scale super-resolution task. The PSNR with λ = 0.1 reaches 38.37 dB, 0.1 dB higher than with λ = 0, identifying the optimal value of the λ parameter.
5. Conclusion
This paper develops the IPT model using Transformer-based pre-training, a technique that has advanced from NLP into computer vision, and shows performance at or beyond the latest SOTA on various image processing problems. The IPT model is pre-trained on pairs of original and corrupted images so that it can be quickly fine-tuned for each image processing task; a single model can thus be applied to a variety of tasks, demonstrating its ability to generalize. In particular, it shows overwhelming performance on large-scale data, and performance is expected to increase in proportion to the data.
A. Take home message
For image processing tasks, the pre-training & fine-tuning recipe with Transformer-based models on large-scale data proved very effective. Moreover, the more data there is, the better the performance, scaling in proportion.
As with words in NLP, image input data can be converted into patches for use with Transformer-based models.
After pre-training, the IPT model captures the intrinsic features and transformations for each task, and for fine-tuning to a desired task only the necessary parameters are kept (the unused heads and tails are dropped) and updated, which also looked favorable in terms of cost.
Author / Reviewer information
Author
ė°ģ¤ķ (Junhyung Park)
Affiliation (KAIST AI / NAVER)
Machine Learning Engineer @ NAVER Shopping AI Team
Reviewer
Korean name (English name): Affiliation / Contact information
Korean name (English name): Affiliation / Contact information
ā¦
Reference & Additional materials