Swin Transformer [kor]

Liu Z et al. / Swin Transformer: Hierarchical Vision Transformer using Shifted Windows / arXiv preprint 2021

1. Problem definition

최근 natural language processing (NLP) ģ—ģ„œ 큰 ģ„±ź³µģ„ ź±°ė‘” self-attention, Transformer 구씰넼 general vision task에 ģ ģš©ģ‹œķ‚¤ėŠ” 연구가 ė§Žģ“ ģ§„ķ–‰ė˜ź³  ģžˆģŠµė‹ˆė‹¤. ź·øģ¤‘ģ—ģ„œė„ Vision Transformer (ViT) [3] ėŠ” classificationģ—ģ„œ sota넼 ė‹¬ģ„±ķ•˜ėŠ” 등 ģš°ģˆ˜ķ•œ ģ„±ėŠ„ģ„ ė³“ģ—¬ģ£¼ģ—ˆģœ¼ė©° ViT넼 ģž‡ėŠ” ķ›„ģ† ģ—°źµ¬ė“¤ģ“ ė§Žģ“ ģ§„ķ–‰ė˜ź³  ģžˆģŠµė‹ˆė‹¤. ģ“ėŸ¬ķ•œ 연구들 중 ķ•˜ė‚˜ģø Swin TransformerėŠ” ģ–“ė– ķ•œ ė°©ė²•ģœ¼ė”œ general vision task에 transformer 구씰넼 ģ ģš©ģ‹œķ‚¤ė ¤ ķ•˜ģ˜€ėŠ”ģ§€ ģ†Œź°œķ•“ ė³“ė„ė” ķ•˜ź² ģŠµė‹ˆė‹¤.

2. Motivation

As mentioned above, this paper studies how to apply the Transformer architecture to general vision tasks. Following Vision Transformer (ViT), which was applied to classification, it proposes a new architecture that can be applied to more general vision tasks. The authors state that this could enable joint modeling of vision and language features and benefit both fields.

CNN and variants:

  • This part covers convolutional neural networks, the well-known approach used for most existing vision tasks. Starting from AlexNet, deeper and more effective architectures have been proposed, and the paper also mentions methods that improve the convolution layer itself. It is only a brief survey of existing CNN models and is not central to the paper, so specific model names are omitted here. The key point is that the authors emphasize the potential of Transformers for unified modeling across vision and language and hope to contribute to a shift in modeling.

self-attention based backbone architectures:

  • These works replace some or all convolution layers with self-attention. The main examples are the Stand-alone self-attention model [4] and Local Relation Networks [5]. In Local Relation Networks, self-attention is computed within a local window around each pixel, and this was shown to improve performance on existing vision tasks. However, because a sliding-window scheme is used, latency is said to increase severely as the amount of computation grows. To address this, the paper proposes shifted windows between consecutive layers in place of the sliding window, which it presents as a much more efficient approach.

self-attention/Transformers to complement CNNs:

  • These methods combine self-attention or Transformers with a standard CNN architecture. Self-attention layers are known to complement backbones or head networks by encoding distant dependencies. More recent work also applies encoder-decoder Transformers to object detection and instance segmentation. This paper instead uses the Transformer for basic visual feature extraction, and the authors note that this is complementary to the existing related work.

Transformer based vision backbones:

  • These methods apply the Transformer architecture to vision tasks and correspond to Vision Transformer (ViT) and its follow-up papers. They split an image into fixed-size patches and use those patches as tokens. They reach accuracy comparable to CNN methods while running faster. The paper notes that ViT is effective for classification, but that such an architecture is not suitable as a general-purpose backbone because of its low-resolution feature map and because its computation grows quadratically with image size, and it proposes a method that improves on both points.

Idea

This paper modifies the ViT approach, which is ill-suited as a general-purpose backbone because of its low-resolution feature map, and proposes a hierarchical architecture that merges patches as the layers get deeper. ViT also has the drawback that its computation grows sharply as the image gets larger. The paper mitigates this with shifted window based self-attention, which computes self-attention only within each local window, and its feature pyramid structure is said to provide hierarchical information that can be used for other vision tasks as well.

3. Method

Figure 1 compares the hierarchical feature maps of Swin Transformer with the feature map of ViT. ViT produces a single low-resolution feature map, whereas Swin Transformer builds hierarchical feature maps by merging patches in deeper layers, so that each window covers a progressively larger region of the image.

ViT uses a fixed patch size ($16 \times 16$), so the resolution of its output feature map is $1/16$ of the input image size. Swin Transformer, in contrast, starts from small patches and gradually enlarges them, so it can extract hierarchical feature maps ranging from relatively high resolution to low resolution.

These hierarchical feature maps make it straightforward to apply techniques commonly used with CNNs, such as feature pyramid networks and U-Net. They also let the model flexibly extract feature maps at multiple scales. (This seems similar to the role of the receptive field in CNNs. Taking detection as an example, I think the idea is that a larger patch size helps detect large objects, while a smaller one helps detect small objects.)

3.1. Shifted Window based Self-Attention

For efficient modeling, the paper modifies the approach used in ViT, which computes self-attention between each token (patch) and every other token (patch), so that self-attention is computed only within a local window. This is called window based multi-head self attention (W-MSA). Assuming each window contains $M \times M$ patches, the computational complexities of multi-head self attention (MSA) and window based multi-head self attention (W-MSA) are as follows.

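$$\Omega(\text{MSA}) = 4hwC^2 + 2(hw)^2C$$

$$\Omega(\text{W-MSA}) = 4hwC^2 + 2M^2hwC$$

Here $h \times w$ is the number of patches in the feature map and $C$ is the channel dimension.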

As the complexity shows, standard MSA grows quadratically with $hw$, so it is ill-suited to large images (large $hw$), whereas the proposed window based method grows linearly with $hw$ for a fixed $M$ and is therefore scalable ($hw \gg M$).
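To make the scaling argument concrete, here is a minimal sketch that evaluates both complexity expressions. The W-MSA expression, $4hwC^2 + 2M^2hwC$, is the one given in the Swin Transformer paper; the function names and the sample values ($C = 96$, $M = 7$, square token grids) are assumptions chosen only for illustration.

```python
def msa_flops(h: int, w: int, C: int) -> int:
    """Omega(MSA) = 4hwC^2 + 2(hw)^2 C, global self-attention over all tokens."""
    hw = h * w
    return 4 * hw * C**2 + 2 * hw**2 * C


def wmsa_flops(h: int, w: int, C: int, M: int) -> int:
    """Omega(W-MSA) = 4hwC^2 + 2 M^2 hw C, attention inside M x M windows."""
    hw = h * w
    return 4 * hw * C**2 + 2 * M**2 * hw * C


# Illustrative values (assumptions): C = 96 channels, window size M = 7.
for side in (56, 112, 224):
    ratio = msa_flops(side, side, 96) / wmsa_flops(side, side, 96, 7)
    print(f"{side} x {side} tokens: MSA / W-MSA = {ratio:.1f}x")
```

Doubling the side of the token grid multiplies the W-MSA cost by exactly 4 (linear in $hw$), while the MSA cost grows faster because of the $(hw)^2$ term.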

The FLOPs (computational cost) comparison between ViT and Swin Transformer in the Result section below should make this easier to see.

However, computing self-attention only inside local windows removes the connections between windows that global attention provides, and this can degrade the model's performance. To address this, the paper uses the shifted window approach.

Figure 2 illustrates the shifted window approach. The first module uses a regular window partitioning strategy that starts from the top-left: for example, an $8 \times 8$ feature map is evenly partitioned into $2 \times 2$ windows of size $4 \times 4$ ($M = 4$). The next layer then displaces the windows of the previous layer by $(\lfloor M/2 \rfloor, \lfloor M/2 \rfloor)$ patches.
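The effect of displacing the windows can be sketched with a small index calculation. This is only an illustration under assumed values (an 8 x 8 token grid with $M = 4$); `window_index` is a hypothetical helper, not a function from the authors' code.

```python
def window_index(r, c, M, shift=0):
    """Return the (row, col) index of the window that contains token (r, c).

    shift=0 gives the regular partition that starts at the top-left corner;
    shift=M // 2 gives the partition displaced by (M // 2, M // 2).
    """
    offset = (M - shift) % M
    return ((r + offset) // M, (c + offset) // M)


H = W = 8  # feature map size in tokens (illustrative assumption)
M = 4      # window size (illustrative assumption)

regular = {window_index(r, c, M) for r in range(H) for c in range(W)}
shifted = {window_index(r, c, M, shift=M // 2) for r in range(H) for c in range(W)}
print(len(regular), "regular windows,", len(shifted), "shifted windows")

# Tokens (3, 3) and (4, 4) are neighbours that fall into different regular
# windows but share one shifted window, so the shifted layer connects them.
print(window_index(3, 3, M), window_index(4, 4, M))
print(window_index(3, 3, M, shift=2), window_index(4, 4, M, shift=2))
```

Alternating the two partitions in consecutive layers is what lets information cross the window boundaries. The shifted partition also yields more windows, some of them smaller than $M \times M$.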

With the shifted window scheme, some windows can end up smaller than $M \times M$. The authors note that handling this with padding would increase the computational cost, and they propose a more efficient alternative, the cyclic shift.

Figure 4 illustrates the cyclic shift approach. With this method, a batched window may consist of several sub-windows that are not adjacent in the feature map, so a masking mechanism restricts self-attention to be computed within each sub-window. The number of batched windows stays the same as in regular window partitioning, which the authors explain makes this more efficient than padding.
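The cyclic shift and the mask can be sketched on a grid of labels. This is a minimal pure-Python illustration under assumed values (8 x 8 tokens, $M = 4$), not the authors' implementation; all helper names are hypothetical. Each token is labelled with the shifted window it belongs to, the grid is rolled toward the top-left, and the regular partition is then applied.

```python
def region_id(r, c, M, s):
    """Id of the shifted window that token (r, c) belongs to before rolling."""
    return ((r + M - s) // M, (c + M - s) // M)


def cyclic_shift(grid, s):
    """Roll a 2D grid by s cells toward the top-left, wrapping around."""
    rows = grid[s:] + grid[:s]
    return [row[s:] + row[:s] for row in rows]


def window_partition(grid, M):
    """Split a grid into non-overlapping M x M windows in row-major order."""
    return [
        [row[left:left + M] for row in grid[top:top + M]]
        for top in range(0, len(grid), M)
        for left in range(0, len(grid[0]), M)
    ]


def attention_mask(window):
    """mask[i][j] is True when tokens i and j share a sub-window."""
    tokens = [cell for row in window for cell in row]
    return [[a == b for b in tokens] for a in tokens]


H = W = 8   # illustrative feature map size in tokens
M = 4       # illustrative window size
s = M // 2  # shift amount

labels = [[region_id(r, c, M, s) for c in range(W)] for r in range(H)]
batched = window_partition(cyclic_shift(labels, s), M)
sub_window_counts = [len({cell for row in win for cell in row}) for win in batched]
print(len(batched), "batched windows; sub-windows per window:", sub_window_counts)
```

The number of batched windows equals that of the regular partition, and only windows that mix several sub-windows need a non-trivial mask. In a real attention layer the mask would typically be applied by adding a large negative value to the disallowed attention logits before the softmax.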

3.2. Overall Architectures

Figure 3 shows the architecture of the tiny version of Swin Transformer. Swin Transformer takes an image as input. In the patch partitioning step, the image is split into patches as in ViT, and the resulting patches are then used as tokens fed to the Transformer.

Then, at each stage, patch merging combines neighboring patches so that each window covers a larger region. As a result, each stage produces features at a different scale, which provides hierarchical information that can be used for vision tasks.

A Swin Transformer block is built from the W-MSA and SW-MSA modules described above. To produce a hierarchical representation, the number of tokens is reduced each time they pass through a patch merging layer: each merge reduces the number of tokens by 4x and doubles the output dimension. Accordingly, as the figure shows, the output resolution of each stage starts at $\frac{H}{4} \times \frac{W}{4}$ and shrinks to $\frac{H}{32} \times \frac{W}{32}$. These feature map resolutions match those of typical convolutional networks such as VGG [6] and ResNet [7], so the authors say the model can easily replace existing CNN backbones.
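The shape bookkeeping behind patch merging can be sketched as follows. The 4 x 4 initial patch size, the 224 x 224 input, and $C = 96$ follow the Swin-T configuration in the paper and are assumptions as far as this review is concerned; `stage_shapes` is a hypothetical helper.

```python
def stage_shapes(H, W, C, patch=4, num_stages=4):
    """Return (height, width, channels) of the token map after each stage.

    Stage 1 works on H/patch x W/patch tokens with C channels. Every later
    stage begins with patch merging, which groups 2 x 2 neighbouring tokens
    (4x fewer tokens) and doubles the channel dimension.
    """
    h, w, c = H // patch, W // patch, C
    shapes = []
    for stage in range(num_stages):
        if stage > 0:
            h, w, c = h // 2, w // 2, c * 2
        shapes.append((h, w, c))
    return shapes


for i, (h, w, c) in enumerate(stage_shapes(224, 224, 96), start=1):
    print(f"stage {i}: {h} x {w} tokens ({h * w} total), {c} channels")
```

The strides relative to the input are 4, 8, 16, and 32, which is the same pyramid that CNN backbones expose to detection and segmentation heads.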

W-MSA is the window based multi-head self attention described above, which reduces the amount of computation, and SW-MSA is the shifted window based self-attention, which shifts the window partition to recover the connections lost between windows. You can think of SW-MSA as running the attention once more after shifting the partition used in W-MSA.

4. Experiment & Result

Experimental setup

To evaluate on different vision tasks, the paper runs experiments on three tasks: classification, object detection, and semantic segmentation. The baselines are the state-of-the-art models for each of these tasks.

Dataset

The dataset for each task is as follows.

  • Image Classification : ImageNet-1K image classification [8]

  • Object Detection : COCO object detection [9]

  • Semantic Segmentation : ADE20K semantic segmentation [10]

Training step

  • Image Classification on ImageNet-1K

    • Regular ImageNet-1K training

      The AdamW optimizer and a cosine decay learning rate scheduler are used, training for 300 epochs with 20 epochs of linear warm-up.

      The batch size is 1024, the initial learning rate is 0.001, and the weight decay is 0.05.

    • Pre-training on ImageNet-22K and fine-tuning on ImageNet-1K

      Pre-training uses the AdamW optimizer and a linear decay learning rate scheduler for 90 epochs, with 5 epochs of linear warm-up.

      The batch size is 4096, the initial learning rate is 0.001, and the weight decay is 0.01.

      Fine-tuning uses a batch size of 1024, a learning rate of $10^{-5}$, and a weight decay of $10^{-8}$.

  • Object Detection on COCO

    Multi-scale training is used: the shorter side of the image is resized to between 480 and 800, and the longer side is at most 1333.

    The AdamW optimizer is used with an initial learning rate of 0.0001, weight decay 0.05, batch size 16, and 36 epochs, and the learning rate is reduced by 10x at epochs 27 and 33.

  • Semantic segmentation on ADE20K

    The AdamW optimizer is used with an initial learning rate of $6 \times 10^{-5}$, weight decay 0.01, and 1,500 iterations of linear warm-up, and the model is trained for 160K iterations.

In addition, augmentations such as flipping, random re-scaling, and random photometric distortion are reported to be used.
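As an illustration of the regular ImageNet-1K schedule listed above (300 epochs, 20 epochs of linear warm-up, initial learning rate 0.001, cosine decay), here is a minimal sketch. It assumes the learning rate is updated once per epoch, that the warm-up epochs count toward the 300, and that the cosine decays to zero; none of these details are stated in this review, and `lr_at_epoch` is a hypothetical helper.

```python
import math


def lr_at_epoch(epoch, base_lr=1e-3, warmup_epochs=20, total_epochs=300):
    """Learning rate for a 0-based epoch: linear warm-up, then cosine decay."""
    if epoch < warmup_epochs:
        # Linear warm-up: reaches base_lr at the last warm-up epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine decay from base_lr down to 0 over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 10, 19, 20, 160, 299):
    print(f"epoch {epoch:3d}: lr = {lr_at_epoch(epoch):.6f}")
```

Training frameworks usually update the rate per iteration rather than per epoch; the per-epoch version is used here only to keep the sketch short.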

Evaluation metrics

  • Image Classification : param, FLOPs, throughput, top-1 acc.

  • Object Detection : AP, param, FLOPs

  • Semantic Segmentation : mIoU, param, FLOPs, FPS
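Of the metrics above, mIoU is probably the least self-explanatory, so here is a minimal sketch of how it can be computed for one flattened prediction. The function name and the toy labels are assumptions for illustration. Benchmark evaluations typically accumulate intersections and unions over the whole dataset before averaging, which this sketch does not do.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for flat label lists."""
    ious = []
    for k in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == k and t == k)
        union = sum(1 for p, t in zip(pred, target) if p == k or t == k)
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Toy example: four pixels, two classes.
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
print(f"mIoU = {mean_iou(pred, target, num_classes=2):.4f}")
```

In the toy example class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.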

Result

The tables compare the numerical performance on Image Classification, Object Detection, and Semantic Segmentation.

왼쪽부터 Image Classification, Object Detection, Semantic Segmentation에 ķ•“ė‹¹ķ•˜ė©° Image Classificationģ˜ 경우 기씓 state-of-the-art와 classification에 ģ‚¬ģš©ėœ ViTģ™€ģ˜ ģ„±ėŠ„ģ„ ė¹„źµķ•œ ģžė£Œė”œ EfficientNet-B7ź³¼ ė¹„ģŠ·ķ•œ ģ„±ėŠ„ģ„ ė³“ģøė‹¤ź³  ķ•©ė‹ˆė‹¤. ė˜ķ•œ ViT ėŖØėøė“¤ģ˜ 경우 기씓볓다 ģ ģ€ parameter수딜 ė” ė†’ģ€ ģ„±ėŠ„ģ„ ė‹¬ģ„±ķ–ˆė‹¤ėŠ” ź²ƒģ„ ė³“ģ—¬ģ¤ė‹ˆė‹¤.

For Object Detection and Semantic Segmentation, performance is compared by swapping the backbone of existing models. When the backbone of existing methods is replaced with Swin Transformer, it is reported to exceed the original performance in almost all cases.

5. Conclusion

This paper proposes a new Transformer architecture that produces hierarchical feature representations and has low computational complexity relative to image size. It addresses the computational cost of ViT's multi-head self-attention with window based self-attention, and the missing connections between windows with the shifted window scheme. It analyzes what vision tasks other than classification need and proposes a hierarchical architecture that merges patches to handle multiple scales. The proposed model achieves state-of-the-art results on Object Detection and Semantic Segmentation. What stands out in this paper is its careful analysis of the problems of the existing Vision Transformer, and the analysis and model design aimed at vision tasks beyond classification.

Take home message (today's lesson)

기씓 ė°©ė²•ģ˜ ė‹Øģ ģ„ ė¶„ģ„ķ•˜ź³  ź°œģ„ ķ•˜ėŠ” 것과 ģˆ˜ķ–‰ķ•“ģ•¼ķ•  task에 ģ§‘ģ¤‘ķ•˜ģ—¬ ģ¤‘ģš”ķ•œ ź²ƒģ“ ė¬“ģ—‡ģøģ§€ ģƒź°ķ•“ ė³“ėŠ”ź²ƒģ“ ģ¤‘ģš”ķ•˜ė‹¤ź³  ģƒź°ķ•©ė‹ˆė‹¤.

Author / Reviewer information

Author

ģ“ķ˜„ģˆ˜ (Hyeonsu Lee)

  • Affiliation (KAIST AI / NAVER)

  • Machine Learning Engineer @ NAVER Papago team


Reference & Additional materials

  8. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

  9. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr DollƔr, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

  10. Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. International Journal of Computer Vision, 2018.
