Swin Transformer [kor]
Liu Z. et al. / Swin Transformer: Hierarchical Vision Transformer using Shifted Windows / arXiv preprint 2021
1. Problem definition
Recently, many studies have applied self-attention and the Transformer architecture, which achieved great success in natural language processing (NLP), to general vision tasks. Among them, the Vision Transformer (ViT) [3] showed excellent performance, including state-of-the-art results in classification, and many follow-up studies have built on it. In this review we look at how Swin Transformer, one of these follow-ups, applies the transformer architecture to general vision tasks.
2. Motivation
As mentioned above, this paper studies how to apply the Transformer architecture to general vision tasks. Following the Vision Transformer (ViT), a related work that targets classification, it proposes a new architecture that can be applied to more general vision tasks. The authors state that this makes joint modeling of vision and language features easier and that it could benefit both fields.
Related work
CNN and variants:
This part covers convolutional neural networks, the approach most commonly used for vision tasks and one that most readers will already know. Starting with AlexNet, deeper and more effective architectures have been proposed, and the paper also mentions methods that improve the convolution layer itself. This is only a short survey of existing CNN models and is not central to the paper, so the individual model names are not listed here. The key point is that the authors emphasize the potential of transformers for unified modeling between vision and language and hope to contribute to a shift in modeling.
self-attention based backbone architectures:
These are studies that replace some or all of the convolution layers with self-attention; the main examples are the Stand-alone self-attention model [4] and Local Relation Networks [5]. Local Relation Networks compute self-attention within a local window around each pixel and showed that this can improve performance on existing vision tasks. However, because a sliding-window approach is used, latency is said to rise sharply as the computation grows. Instead of sliding windows, this paper proposes shifted windows between consecutive layers, a much more efficient method, to address this problem.
self-attention/Transformers to complement CNNs:
These methods combine a standard CNN architecture with self-attention or Transformers. Self-attention layers are known to complement backbones or head networks by encoding distant dependencies. More recent work also applies encoder-decoder transformers to object detection and instance segmentation. This paper instead uses the transformer for basic visual feature extraction, and the authors note that it is complementary to these existing works.
Transformer based vision backbones:
These methods apply the transformer architecture directly to vision tasks and include the Vision Transformer (ViT) and its follow-ups. They split the image into fixed-size patches and use those patches as tokens. They reach accuracy comparable to CNN-based methods at higher speed. The paper notes that although ViT's classification results look effective, the architecture is not suitable as a general-purpose backbone because of its low-resolution feature map and because its computation grows rapidly with image size, and it proposes a way to improve on this.
Idea
Because of its low-resolution feature map, the original ViT is not suitable as a general-purpose backbone. This paper modifies it and proposes a hierarchical architecture that merges patches as the layers get deeper. The original ViT also has the drawback that its computation grows very quickly as the image gets larger. The paper mitigates this with shifted window based self-attention, which computes self-attention only within each local window, and it proposes a feature pyramid structure so that the hierarchical information needed by other vision tasks can be used as well.
3. Method

Figure 1 shows the hierarchical feature maps of Swin Transformer next to the feature map of the original ViT. ViT produces a single low-resolution feature map, whereas Swin Transformer produces hierarchical feature maps: as the layers get deeper it merges patches, so each window covers a progressively larger region of the image.
ViT uses a fixed patch size of $16 \times 16$, so the resolution of its output feature map is $\frac{1}{16}$ of the input image size. Swin Transformer instead starts with small patches and enlarges them step by step, so it can extract hierarchical feature maps ranging from relatively high resolution to low resolution.
These hierarchical feature maps make it easy to apply techniques that are common with CNNs, such as feature pyramid networks and U-Net. They also let the model draw features flexibly from several scales. (This seems similar to the role of the receptive field in CNNs. Taking detection as an example, I think larger patches help detect large objects and smaller patches help detect small objects.)
3.1. Shifted Window based Self-Attention
For efficient modeling, the paper changes ViT's approach of computing self-attention between one token (patch) and every other token (patch), and instead computes it only within a single local window. This is called window based multi-head self attention (W-MSA). Assuming each window contains $M \times M$ patches, the computational complexities of multi-head self attention (MSA) and window based multi-head self attention (W-MSA) on a feature map of $h \times w$ patches with channel dimension $C$ are as follows.
$\Omega(\text{MSA}) = 4hwC^2 + 2(hw)^2C$
$\Omega(\text{W-MSA}) = 4hwC^2 + 2M^2hwC$
As the equations show, standard MSA is quadratic in the number of patches hw and is therefore unsuitable for large images (large hw), whereas the proposed method is linear in hw for a fixed window size and is therefore scalable.
The FLOPs comparison between ViT and Swin Transformer in the Result section below makes this easier to see.
However, if self-attention is computed only inside local windows, the connections between windows that existed before are lost, which can degrade the model's performance. To solve this, the paper uses the shifted window method.
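To make the difference concrete, the two complexity formulas from the paper can be evaluated directly. The sketch below is only an illustration: the function names are hypothetical, and the values C = 96 and M = 7 are assumptions taken from the paper's default Swin-T configuration rather than from the text above.

```python
def msa_flops(h, w, C):
    """Omega(MSA) = 4*h*w*C^2 + 2*(h*w)^2*C, global self-attention over h*w tokens."""
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C


def wmsa_flops(h, w, C, M):
    """Omega(W-MSA) = 4*h*w*C^2 + 2*M^2*h*w*C, attention inside M x M windows."""
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C


# Assumed illustration values: C = 96 channels, M = 7 window size.
C, M = 96, 7
for side in (56, 112, 224):
    ratio = msa_flops(side, side, C) / wmsa_flops(side, side, C, M)
    print(f"{side} x {side} tokens: MSA costs {ratio:.1f}x W-MSA")
```

Doubling the side of the token grid multiplies the W-MSA cost by exactly 4, while the MSA cost grows faster because of its $(hw)^2$ term.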

Figure 2 shows the shifted window method. The first module uses a regular window partitioning strategy: starting from the top-left corner, the $8 \times 8$ feature map is partitioned into $2 \times 2$ windows of size $4 \times 4$ ($M = 4$). In the next layer, the windows are displaced by $(\lfloor M/2 \rfloor, \lfloor M/2 \rfloor)$ from the regular partition.
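The two partitioning schemes can be sketched by labeling every position of the feature map with the window it falls into. This is a minimal illustration, assuming an 8 x 8 feature map and M = 4 (the setting drawn in the paper's Figure 2); `window_ids` and `window_sizes` are hypothetical helper names.

```python
from collections import Counter


def window_ids(h, w, M, shift=0):
    """Label each position of an h x w map with the (row, col) index of its window.

    shift = 0 gives the regular partition; shift = M // 2 gives the partition
    whose window grid is displaced by `shift` positions along both axes.
    """
    off = (M - shift) % M
    return [[((i + off) // M, (j + off) // M) for j in range(w)] for i in range(h)]


def window_sizes(ids):
    """Number of positions that fall into each window."""
    return Counter(wid for row in ids for wid in row)


regular = window_ids(8, 8, M=4)
shifted = window_ids(8, 8, M=4, shift=2)

print(len(window_sizes(regular)))  # 4 windows, all of size 4 x 4
print(len(window_sizes(shifted)))  # 9 windows, some smaller than 4 x 4

# Positions (3, 3) and (4, 4) are in different regular windows,
# but the shifted partition puts them in the same window.
print(regular[3][3] == regular[4][4], shifted[3][3] == shifted[4][4])  # False True
```

Alternating the two partitions in consecutive layers is what lets information cross the boundaries of the regular windows.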
With the shifted window scheme, some windows can become smaller than $M \times M$. The authors note that handling this with padding would increase the computational cost, and they propose a more efficient approach, the cyclic shift.

Figure 4 illustrates the cyclic shift method. With this method a batched window may consist of several sub-windows that are not adjacent in the feature map, so a masking mechanism restricts self-attention to be computed within each sub-window. The number of batched windows stays the same as with regular window partitioning, which the authors explain makes it more efficient than padding.
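The cyclic shift and the mask can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses an 8 x 8 map with M = 4, a boolean mask instead of an additive one, and hypothetical function names. In practice a mask like this would typically be applied by adding a large negative value to the masked attention logits before the softmax.

```python
def cyclic_shift(grid, s):
    """Roll a 2-D grid by s positions toward the top-left, wrapping around."""
    h, w = len(grid), len(grid[0])
    return [[grid[(i + s) % h][(j + s) % w] for j in range(w)] for i in range(h)]


def partition(grid, M):
    """Split a grid into non-overlapping M x M windows, each flattened to a list."""
    h, w = len(grid), len(grid[0])
    return [
        [grid[i][j] for i in range(r, r + M) for j in range(c, c + M)]
        for r in range(0, h, M)
        for c in range(0, w, M)
    ]


def attention_masks(h, w, M):
    """One boolean mask per batched window; mask[a][b] is True when token a
    may attend to token b, i.e. both come from the same shifted sub-window."""
    s = M // 2
    off = (M - s) % M
    # Label every position with the shifted window it belongs to before rolling.
    region = [[((i + off) // M, (j + off) // M) for j in range(w)] for i in range(h)]
    windows = partition(cyclic_shift(region, s), M)
    return [[[ra == rb for rb in win] for ra in win] for win in windows]


masks = attention_masks(8, 8, M=4)
print(len(masks))                        # 4 batched windows, same as the regular partition
print(all(all(row) for row in masks[0]))  # True: the first window is one intact sub-window
print(sum(masks[3][0]))                  # 4: in the last window a token sees only its 2 x 2 sub-window
```

The point of the design is visible in the first printed value: the shifted layer still processes 4 windows of size M x M instead of 9 windows of mixed sizes.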
3.2. Overall Architectures

Figure 3 shows the architecture of the tiny version of Swin Transformer. Swin Transformer takes an image as input (of size $H \times W \times 3$ in the figure). In the patch partitioning step the image is split into patches, as in ViT. The resulting patches are then used as tokens, which form the input to the transformer.
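Patch partitioning can be sketched as cutting the image into non-overlapping squares and flattening each one into a token. The patch size of 4 x 4 (giving tokens of length 4 * 4 * 3 = 48) is an assumption taken from the paper, and `patch_partition` is a hypothetical helper operating on nested lists rather than tensors.

```python
def patch_partition(image, P=4):
    """Split an H x W x C image (nested lists) into non-overlapping P x P patches
    and flatten each patch into one token of length P * P * C."""
    H, W = len(image), len(image[0])
    assert H % P == 0 and W % P == 0, "H and W are assumed to be multiples of P"
    tokens = []
    for r in range(0, H, P):
        row_tokens = []
        for c in range(0, W, P):
            token = [
                value
                for i in range(r, r + P)
                for j in range(c, c + P)
                for value in image[i][j]
            ]
            row_tokens.append(token)
        tokens.append(row_tokens)
    return tokens


# Toy 8 x 8 RGB image whose pixel at (i, j) stores [i, j, 0].
image = [[[i, j, 0] for j in range(8)] for i in range(8)]
tokens = patch_partition(image)
print(len(tokens), len(tokens[0]), len(tokens[0][0]))  # 2 2 48
```

In the paper a linear embedding layer then projects each raw token to the model dimension; that step is left out here.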
After that, each stage applies patch merging, which combines neighboring patches and so enlarges the image region that each window covers. As a result, each stage produces features at a different scale, which provides the hierarchical information that vision tasks can use.
The Swin Transformer block consists of the W-MSA and SW-MSA described above. To produce a hierarchical representation, the number of tokens is reduced each time the features pass through a patch merging layer: every merge reduces the number of tokens by a factor of 4 and doubles the output dimension. The output resolution of the stages therefore starts at $\frac{H}{4} \times \frac{W}{4}$ and shrinks stage by stage down to $\frac{H}{32} \times \frac{W}{32}$, as shown in the figure. These feature map resolutions are the same as those of typical convolutional networks such as VGG [6] and ResNet [7], so the authors say the model can easily replace existing CNN backbones.
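The effect of patch merging on the feature map shape can be sketched as follows. The input size 224 x 224, the initial patch size 4, and C = 96 are assumptions taken from the paper's Swin-T setting; `merge_2x2` only shows the concatenation of each 2 x 2 group of neighboring tokens (C to 4C), and the linear layer that the paper applies afterwards (4C to 2C) is omitted. Both function names are hypothetical.

```python
def stage_shapes(H, W, C, num_stages=4, patch=4):
    """Token-grid shape (height, width, channels) at the output of each stage."""
    h, w, c = H // patch, W // patch, C
    shapes = [(h, w, c)]
    for _ in range(num_stages - 1):
        # Patch merging: 4x fewer tokens, 2x more channels.
        h, w, c = h // 2, w // 2, c * 2
        shapes.append((h, w, c))
    return shapes


def merge_2x2(grid):
    """Concatenate the feature lists of each 2 x 2 group of neighboring tokens."""
    h, w = len(grid), len(grid[0])
    return [
        [grid[i][j] + grid[i + 1][j] + grid[i][j + 1] + grid[i + 1][j + 1]
         for j in range(0, w, 2)]
        for i in range(0, h, 2)
    ]


print(stage_shapes(224, 224, 96))
# [(56, 56, 96), (28, 28, 192), (14, 14, 384), (7, 7, 768)]

toy = [[[4 * i + j] for j in range(4)] for i in range(4)]  # 4 x 4 tokens, 1 channel
print(merge_2x2(toy)[0][0])  # [0, 4, 1, 5]
```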
W-MSA is the window based multi-head self attention described above, which reduces the computation. SW-MSA is the shifted window based self-attention, which shifts the patches before attention in order to restore the lost connections between windows. You can think of SW-MSA as running attention once more on the patches used by W-MSA after shifting them.
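Written out, two consecutive blocks can be expressed as below, following the formulation in the paper. Here $z^{l}$ denotes the output features of block $l$, LN is layer normalization, and MLP is the feed-forward sublayer. The LN and MLP sublayers and the residual connections are not described in the text above; they are taken from the paper's block diagram.

```latex
\begin{aligned}
\hat{z}^{l}   &= \text{W-MSA}\big(\text{LN}(z^{l-1})\big) + z^{l-1}, \\
z^{l}         &= \text{MLP}\big(\text{LN}(\hat{z}^{l})\big) + \hat{z}^{l}, \\
\hat{z}^{l+1} &= \text{SW-MSA}\big(\text{LN}(z^{l})\big) + z^{l}, \\
z^{l+1}       &= \text{MLP}\big(\text{LN}(\hat{z}^{l+1})\big) + \hat{z}^{l+1}.
\end{aligned}
```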
4. Experiment & Result
Experimental setup
To evaluate the model on different vision tasks, the paper runs experiments on three tasks: classification, object detection, and semantic segmentation. For each task, the state-of-the-art models for that task are used as the comparison baselines.
Dataset
The dataset used for each task is as follows.
Image Classification : ImageNet-1K image classification [8]
Object Detection : COCO object detection [9]
Semantic Segmentation : ADE20K semantic segmentation [10]
Training step
Image Classification on ImageNet-1K
Regular ImageNet-1K training
An AdamW optimizer and a cosine decay learning rate scheduler were used; the model was trained for 300 epochs with cosine decay, with 20 epochs of linear warm-up.
The batch size was 1024, the initial learning rate 0.001, and the weight decay 0.05.
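The schedule described above can be sketched as a function of the epoch. Several details are assumptions made for illustration and are not stated in the text: the warm-up is counted inside the 300 epochs, the rate is updated once per epoch, and the cosine decays to zero. `lr_at_epoch` is a hypothetical name.

```python
import math


def lr_at_epoch(epoch, base_lr=1e-3, total_epochs=300, warmup_epochs=20):
    """Linear warm-up to base_lr, then cosine decay (assumed to end at zero)."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))


for epoch in (0, 19, 160, 299):
    print(epoch, f"{lr_at_epoch(epoch):.6f}")
```

Halfway through the decay phase (epoch 160 under these assumptions) the rate is half of the initial learning rate.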
Pre-training on ImageNet-22K and fine-tuning on ImageNet-1K
Pre-training used an AdamW optimizer and a linear decay learning rate scheduler for 90 epochs, with 5 epochs of linear warm-up.
The batch size was 4096, the initial learning rate 0.001, and the weight decay 0.01.
Fine-tuning used a batch size of 1024, a learning rate of $10^{-5}$, and a weight decay of $10^{-8}$.
Object Detection on COCO
Multi-scale training was used: images are resized so that the shorter side is between 480 and 800 and the longer side is at most 1333.
An AdamW optimizer was used with an initial learning rate of 0.0001, a weight decay of 0.05, a batch size of 16, and 36 epochs, with the learning rate reduced by 10x at epochs 27 and 33.
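The resizing rule and the step schedule can be sketched as two small functions. This is an illustration under assumptions: the resize keeps the aspect ratio and lets the long-side cap take priority (a common convention, not confirmed by the text above), epochs are counted from 0, and both function names are hypothetical.

```python
import random


def multiscale_size(h, w, short_range=(480, 800), max_long=1333, rng=random):
    """Sample a target short side, then scale (keeping aspect ratio) so the short
    side matches it unless that would push the long side past max_long."""
    target = rng.randint(*short_range)
    scale = min(target / min(h, w), max_long / max(h, w))
    return round(h * scale), round(w * scale)


def lr_multiplier(epoch, milestones=(27, 33), gamma=0.1):
    """Factor applied to the initial learning rate: 10x smaller after each milestone."""
    return gamma ** sum(epoch >= m for m in milestones)


print(multiscale_size(600, 1200, rng=random.Random(0)))
print([lr_multiplier(e) for e in (0, 27, 33)])
```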
Semantic segmentation on ADE20K
An AdamW optimizer was used with an initial learning rate of $6 \times 10^{-5}$, a weight decay of 0.01, and a linear warm-up of 1,500 iterations; the model was trained for 160K iterations.
Augmentations such as flipping, random re-scaling, and random photometric distortion were also used.
Evaluation metrics
Image Classification : param, FLOPs, throughput, top-1 acc.
Object Detection : AP, param, FLOPs
Semantic Segmentation : mIoU, param, FLOPs, FPS
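For readers less familiar with two of these metrics, top-1 accuracy and mIoU can be sketched in a few lines. This is a simplified illustration with hypothetical function names: it works on small Python lists, and the mIoU here is computed over a single flat list of labels, whereas benchmark implementations typically accumulate intersections and unions over the whole dataset.

```python
def top1_accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class equals the label."""
    correct = sum(
        max(range(len(row)), key=row.__getitem__) == label
        for row, label in zip(scores, labels)
    )
    return correct / len(labels)


def mean_iou(pred, target, num_classes):
    """Mean over classes of intersection / union on flat label lists.
    Classes absent from both prediction and target are skipped."""
    ious = []
    for c in range(num_classes):
        inter = sum(p == c and t == c for p, t in zip(pred, target))
        union = sum(p == c or t == c for p, t in zip(pred, target))
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious)


print(top1_accuracy([[0.1, 0.9], [0.8, 0.2]], [1, 1]))  # 0.5
print(round(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2), 4))  # 0.5833
```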
Result
The table below compares the numerical results for Image Classification, Object Detection, and Semantic Segmentation.

From left to right, the results are for Image Classification, Object Detection, and Semantic Segmentation. For Image Classification, the table compares Swin Transformer with the previous state-of-the-art models and with ViT, and the paper reports performance similar to EfficientNet-B7. It also shows that, compared with the ViT models, Swin Transformer reaches higher accuracy with fewer parameters.
For Object Detection and Semantic Segmentation, performance was compared by swapping the backbone of existing models. When the backbone of an existing method is replaced with Swin Transformer, it surpasses the previous performance in almost all cases.
5. Conclusion
This paper proposes a new transformer architecture that produces hierarchical feature representations and has low computational complexity relative to image size. It solves the computation problem of ViT's multi-head self-attention with window based self-attention, and it solves the resulting loss of connections between windows with the shifted window scheme. It analyzes what vision tasks other than classification need and proposes a hierarchical structure that merges patches to provide multiple scales. The proposed model achieves state-of-the-art results in Object Detection and Semantic Segmentation. I found it a notable paper for its careful analysis of the problems of the original Vision Transformer and for its analysis and model design aimed at vision tasks beyond classification.
Take home message (today's lesson)
I think it is important to analyze and improve on the weaknesses of existing methods, and to focus on the task at hand and ask what really matters for it.
Author / Reviewer information
Author
이현수 (Hyeonsu Lee)
Affiliation (KAIST AI / NAVER)
Machine Learning Engineer @ NAVER Papago team
Reference & Additional materials
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. International Journal of Computer Vision, 2018.