Swin Transformer

Liu Z. et al. / Swin Transformer: Hierarchical Vision Transformer using Shifted Windows / arXiv preprint 2021

1. Problem definition

Recently, much research has applied self-attention and the Transformer architecture, which achieved great success in natural language processing (NLP), to general vision tasks. Among these works, Vision Transformer (ViT) [3] showed excellent performance, including state-of-the-art results in classification, and many follow-up studies are under way. This post introduces how Swin Transformer, one of those follow-ups, applies the Transformer architecture to general vision tasks.

2. Motivation

As noted above, this paper studies how to apply the Transformer architecture to general vision tasks. Following Vision Transformer (ViT), a related work applied to classification, it proposes a new architecture that can be used for more general vision tasks. The authors state that this could enable joint modeling of vision and language features and benefit both fields.

CNN and variants:

  • This part covers convolutional neural networks (CNNs), the familiar approach that dominates existing vision tasks. Starting from AlexNet, deeper and more effective architectures have been proposed, and the paper also mentions methods that improve the convolution layer itself. Since this is only a survey of existing CNN models and is not central to the paper, specific model names are omitted here. The key point is that the paper emphasizes the potential of Transformers for modeling across vision and language and hopes to contribute to a shift in modeling.

self-attention based backbone architectures:

  • These works replace some or all convolution layers with self-attention; the main examples are the Stand-alone self-attention model [4] and Local Relation Networks [5]. In Local Relation Networks, self-attention is computed within a local window around each pixel, and the work showed that this can improve performance on existing vision tasks. However, because it uses a sliding-window scheme, latency is said to increase severely as the amount of computation grows. Instead of sliding windows, this paper proposes shifting windows between consecutive layers, a much more efficient approach, to address this.

self-attention/Transformers to complement CNNs:

  • These methods combine a standard CNN architecture with self-attention or Transformers. Self-attention layers are known to complement backbones or head networks by encoding distant dependencies. More recent works also apply encoder-decoder Transformers to object detection and instance segmentation. This paper instead uses the Transformer for basic visual feature extraction, which the authors describe as complementary to these existing works.

Transformer based vision backbones:

  • These methods apply the Transformer architecture to vision tasks and include Vision Transformer (ViT) and its follow-ups. They split an image into fixed-size patches and use those patches as tokens. They reach accuracy similar to CNN methods while running faster. The paper notes that although ViT is effective for classification, its architecture is not suitable as a general-purpose backbone because of its low-resolution feature maps and because its computation grows with image size, and it proposes a way to improve on this.

Idea

ViT is not suitable as a general-purpose backbone because of its low-resolution feature map, so this paper modifies it and proposes a hierarchical architecture that merges patches as the layers get deeper. ViT also has the drawback that its computation grows rapidly as the image gets larger. The paper mitigates this by proposing shifted window based self-attention, which computes self-attention only within each local window, and by proposing a feature pyramid structure it makes hierarchical information available for other vision tasks as well.

3. Method

Figure 1 compares the hierarchical feature maps of Swin Transformer with the feature map of ViT. ViT produces a single low-resolution feature map, whereas Swin Transformer builds hierarchical feature maps by merging patches in deeper layers, so that each window covers a progressively larger region of the image.

ViT uses a fixed patch size ($16 \times 16$), so the resolution of its output feature map is $1/16$ of the input image size. Swin Transformer, in contrast, starts from small patches and gradually enlarges them, so it can extract hierarchical feature maps ranging from relatively high resolution to low resolution.

These hierarchical feature maps make it straightforward to apply techniques commonly used with CNNs, such as feature pyramid networks and U-Net. They also let the model flexibly extract feature maps at multiple scales. (I think this plays a role similar to the receptive field in CNNs. Taking detection as an example, my understanding is that larger patches help detect large objects, while smaller patches help detect small objects.)

3.1. Shifted Window based Self-Attention

For efficient modeling, the paper modifies ViT's scheme, which computes self-attention between each token (patch) and every other token, so that self-attention is computed only within a local window. This is called window based multi-head self attention (W-MSA). Assuming each window contains $M \times M$ patches, the computational complexities of multi-head self attention (MSA) and window based multi-head self attention (W-MSA) are as follows.

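$$\Omega(\text{MSA}) = 4hwC^2 + 2(hw)^2C$$

$$\Omega(\text{W-MSA}) = 4hwC^2 + 2M^2hwC$$

Here $h \times w$ is the number of patches in the feature map and $C$ is the channel dimension. As the formulas show, standard MSA is quadratic in $hw$ and therefore unsuitable for large images, i.e. when $hw$ is large, whereas the proposed W-MSA is linear in $hw$ for a fixed $M$ and is therefore scalable ($hw \gg M$).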
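To make the scaling difference concrete, the two complexity expressions can be evaluated for a few feature map sizes. The sketch below is only an illustration: the W-MSA expression $4hwC^2 + 2M^2hwC$ is the one given in the Swin Transformer paper, and the function names as well as the values C = 96 and M = 7 are assumptions chosen for the example.

```python
def msa_flops(h: int, w: int, C: int) -> int:
    """Omega(MSA) = 4hwC^2 + 2(hw)^2 C: every token attends to all h*w tokens."""
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C


def wmsa_flops(h: int, w: int, C: int, M: int) -> int:
    """Omega(W-MSA) = 4hwC^2 + 2 M^2 hw C: attention restricted to M x M windows."""
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C


if __name__ == "__main__":
    C, M = 96, 7  # illustrative values, not taken from this review
    for side in (56, 112, 224):
        ratio = msa_flops(side, side, C) / wmsa_flops(side, side, C, M)
        print(f"{side}x{side} patches: MSA costs {ratio:.1f}x as much as W-MSA")
```

Doubling the side length multiplies the W-MSA cost by 4 (linear in $hw$), while the attention term of MSA grows by 16.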

The FLOPs comparison between ViT and Swin Transformer in the Result section below should make this easier to see.

However, when self-attention is computed only inside local windows, the connections between windows are lost, unlike in the original scheme, and this can degrade the model's performance. To address this, the paper uses a shifted window approach.

Figure 2 illustrates the shifted window approach. The first module uses a regular window partitioning strategy that starts from the top-left and evenly partitions the feature map into non-overlapping windows of size $M \times M$. In the next layer, the windows are displaced by $(\lfloor \frac{M}{2} \rfloor, \lfloor \frac{M}{2} \rfloor)$ relative to the previous partitioning.
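The two partitioning schemes can be sketched by assigning a window id to every position of a small feature map. This is a minimal illustration, not the authors' implementation: the helper names are hypothetical, and the 8 x 8 map with M = 4 is an assumed toy size.

```python
from collections import Counter


def window_ids(H, W, M, shift=0):
    """Map each (row, col) position to the id of the window that contains it.

    shift=0 is the regular partition starting at the top-left corner;
    shift=M // 2 displaces the window boundaries by floor(M / 2) in both axes.
    """
    return [[((i + shift) // M, (j + shift) // M) for j in range(W)]
            for i in range(H)]


def window_sizes(ids):
    """Return the sorted number of positions in each window."""
    counts = Counter(wid for row in ids for wid in row)
    return sorted(counts.values())


if __name__ == "__main__":
    M = 4
    regular = window_ids(8, 8, M)
    shifted = window_ids(8, 8, M, shift=M // 2)
    print("regular window sizes:", window_sizes(regular))
    print("shifted window sizes:", window_sizes(shifted))
    # Positions (3, 3) and (4, 4) are neighbours split by the regular partition
    # but share a window once the partition is shifted.
    print(regular[3][3] == regular[4][4], shifted[3][3] == shifted[4][4])
```

In this toy case the shifted partition yields more windows than the regular one, and several of them are smaller than M x M.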

With the shifted window scheme, some windows can end up smaller than $M \times M$. The authors note that handling this with padding would increase the computational cost, and they propose a more efficient alternative, the cyclic shift.

Figure 4 illustrates the cyclic shift. With this method, a batched window may consist of several sub-windows that are not adjacent in the feature map, and a masking mechanism restricts self-attention to within each sub-window. The number of batched windows stays the same as in regular window partitioning, which the authors explain makes it more efficient than padding.
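The cyclic shift and the mask can be sketched on a toy grid. The code below is an assumption-laden illustration rather than the paper's code: the function names are hypothetical, the 8 x 8 map with M = 4 is a toy size, and the mask is expressed as booleans instead of the additive bias a real attention layer would typically use.

```python
def cyclic_shift(grid, s):
    """Roll a 2D grid toward the top-left by s rows and s columns, wrapping around."""
    rolled_rows = grid[s:] + grid[:s]
    return [row[s:] + row[:s] for row in rolled_rows]


def batched_windows(H, W, M):
    """Group positions into regular M x M windows after a cyclic shift.

    Each position carries the id of the shifted window (sub-window) it came from,
    so the result maps batched-window id -> list of sub-window ids.
    """
    s = M // 2
    sub_ids = [[((i + s) // M, (j + s) // M) for j in range(W)] for i in range(H)]
    rolled = cyclic_shift(sub_ids, s)
    windows = {}
    for i in range(H):
        for j in range(W):
            windows.setdefault((i // M, j // M), []).append(rolled[i][j])
    return windows


def attention_mask(sub_window_ids):
    """mask[a][b] is True only when tokens a and b belong to the same sub-window."""
    return [[a == b for b in sub_window_ids] for a in sub_window_ids]


if __name__ == "__main__":
    windows = batched_windows(8, 8, 4)
    print("batched windows:", len(windows))
    for wid, ids in sorted(windows.items()):
        print(wid, "contains", len(set(ids)), "sub-window(s)")
```

Every batched window still holds exactly M x M tokens, so the attention computation keeps the same shape as in the regular partition; only the mask differs per window.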

3.2. Overall Architectures

Figure 3 shows the architecture of the tiny version of Swin Transformer. Swin Transformer takes an image as input. In the patch partitioning step, the image is split into patches as in ViT, and the resulting patches are then used as tokens, i.e. as the input to the Transformer.

After that, at each stage, patch merging combines neighboring patches so that each window covers a larger region. As a result, each stage has features at a different scale, which provides hierarchical information usable for vision tasks.

A Swin Transformer block consists of the W-MSA and SW-MSA described above. To produce a hierarchical representation, the number of tokens decreases as the features pass through the patch merging layers: each merging layer reduces the number of tokens by a factor of 4 and doubles the output dimension. Consequently, as the figure shows, the output resolution of each stage starts at $\frac{H}{4} \times \frac{W}{4}$ and decreases to $\frac{H}{32} \times \frac{W}{32}$. These feature map resolutions are the same as those of typical convolutional networks such as VGG [6] and ResNet [7], so the authors say Swin Transformer can easily replace existing CNN models.
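The bookkeeping of patch merging can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the 4 x 4 initial patch size and C = 96 are assumed from the Swin-T configuration in the paper rather than stated in this review, and the linear layer that maps the concatenated 4C features down to 2C is omitted.

```python
def merge_2x2(x):
    """Concatenate each 2 x 2 group of neighbouring tokens.

    x is a nested list of shape (H, W, C); the result has shape (H/2, W/2, 4C).
    In the paper a linear layer then reduces 4C to 2C (not implemented here).
    """
    H, W = len(x), len(x[0])
    return [[x[i][j] + x[i + 1][j] + x[i][j + 1] + x[i + 1][j + 1]
             for j in range(0, W, 2)]
            for i in range(0, H, 2)]


def stage_shapes(H, W, patch=4, C=96, num_stages=4):
    """Return (height, width, channels) of the feature map after each stage."""
    h, w, c = H // patch, W // patch, C
    shapes = []
    for stage in range(num_stages):
        if stage > 0:
            # Patch merging: 4x fewer tokens, 2x more channels.
            h, w, c = h // 2, w // 2, c * 2
        shapes.append((h, w, c))
    return shapes


if __name__ == "__main__":
    for stage, shape in enumerate(stage_shapes(224, 224), start=1):
        print(f"stage {stage}: {shape}")
```

Halving each spatial side is what reduces the token count by a factor of 4 per merging layer.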

W-MSA is the window based multi-head self attention described above, which reduces the amount of computation, and SW-MSA is the shifted window based self-attention, which shifts the windows to recover the lost connections between them. You can think of SW-MSA as running W-MSA once more after shifting the windows.

4. Experiment & Result

Experimental setup

To evaluate on a range of vision tasks, the paper runs experiments on three tasks: classification, object detection, and semantic segmentation. For each task, the state-of-the-art models for that task are used as the baselines.

Dataset

The datasets are as follows.

  • Image Classification : ImageNet-1K image classification [8]

  • Object Detection : COCO object detection [9]

  • Semantic Segmentation : ADE20K semantic segmentation [10]

Training step

  • Image Classification on ImageNet-1K

    • Regular ImageNet-1K training

      The AdamW optimizer and a cosine decay learning rate scheduler are used; the model is trained for 300 epochs with cosine decay, with 20 epochs of linear warm-up.

      The batch size is 1024, the initial learning rate is 0.001, and the weight decay is 0.05.

    • Pre-training on ImageNet-22K and fine-tuning on ImageNet-1K

      For pre-training, the AdamW optimizer and a linear decay learning rate scheduler are used; the model is trained for 90 epochs, with 5 epochs of linear warm-up.

      The batch size is 4096, the initial learning rate is 0.001, and the weight decay is 0.01.

      For fine-tuning, a batch size of 1024, a learning rate of $10^{-5}$, and a weight decay of $10^{-8}$ are used.

  • Object Detection on COCO

    Multi-scale training is used: the shorter side of the image is resized to between 480 and 800, and the longer side to at most 1333.

    The AdamW optimizer is used with an initial learning rate of 0.00001, a weight decay of 0.05, a batch size of 16, and 36 epochs, and the learning rate is reduced by 10x at epochs 27 and 33.

  • Semantic segmentation on ADE20K

    The AdamW optimizer is used with an initial learning rate of $6 \times 10^{-5}$, a weight decay of 0.01, and a linear warm-up of 1,500 iterations; the model is trained for 160K iterations.

In addition, augmentations such as flipping, random re-scaling, and random photometric distortion are used.
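The regular ImageNet-1K schedule described above (linear warm-up followed by cosine decay) can be sketched as a per-epoch function. This is a hedged sketch: the function name, the per-epoch granularity, the final learning rate of 0, and the choice to count the 20 warm-up epochs inside the 300 total epochs are all assumptions, not details confirmed by this review.

```python
import math


def lr_at_epoch(epoch, base_lr=1e-3, warmup_epochs=20, total_epochs=300, min_lr=0.0):
    """Learning rate for a 0-indexed epoch: linear warm-up, then cosine decay."""
    if epoch < warmup_epochs:
        # Ramp linearly so the last warm-up epoch reaches base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for epoch in (0, 10, 19, 20, 160, 299):
        print(f"epoch {epoch:3d}: lr = {lr_at_epoch(epoch):.6f}")
```

Actual training code usually updates the rate every iteration rather than every epoch; the per-epoch form is used here only to keep the example short.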

Evaluation metrics

  • Image Classification : param, FLOPs, throughput, top-1 acc.

  • Object Detection : AP, param, FLOPs

  • Semantic Segmentation : mIoU, param, FLOPs, FPS
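Two of the metrics listed above, top-1 accuracy and mIoU, can be sketched on flat label lists. This is a simplified illustration with hypothetical function names: real benchmarks typically accumulate statistics over the whole dataset, and the handling of classes that never appear (skipped here) is an assumption.

```python
def top1_accuracy(pred, target):
    """Fraction of samples whose predicted label equals the ground-truth label."""
    correct = sum(1 for p, t in zip(pred, target) if p == t)
    return correct / len(target)


def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes that appear in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    pred = [0, 0, 1, 1]
    target = [0, 1, 1, 1]
    print("top-1 acc:", top1_accuracy(pred, target))
    print("mIoU:", mean_iou(pred, target, num_classes=2))
```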

Result

The tables compare performance numerically on image classification, object detection, and semantic segmentation.

From left to right, they correspond to image classification, object detection, and semantic segmentation. For image classification, the table compares Swin Transformer with the previous state of the art and with ViT, which was designed for classification, and it is reported to perform comparably to EfficientNet-B7. It also shows that Swin Transformer achieves higher accuracy than the ViT models with fewer parameters.

For object detection and semantic segmentation, performance is compared by changing the backbone of existing models. When the backbone of existing methods is replaced with Swin Transformer, it is reported to outperform the original in almost all cases.

5. Conclusion

This paper proposes a new Transformer architecture that produces hierarchical feature representations and whose computational complexity scales linearly with image size. It addresses the computational cost of ViT's multi-head self-attention with window based self-attention, and it addresses the loss of connections between windows with the shifted window scheme. It analyzes what vision tasks other than classification require and proposes a hierarchical architecture that merges patches to obtain multiple scales. The proposed model achieves state-of-the-art results in object detection and semantic segmentation. I found this to be a paper that stands out for its careful analysis of the problems of the existing Vision Transformer and for its analysis and model design aimed at vision tasks beyond classification.

Take home message (today's lesson)

I think it is important to analyze and improve on the shortcomings of existing methods, and to focus on the task at hand and think about what really matters for it.

Author / Reviewer information

Author

이현수 (Hyeonsu Lee)

  • Affiliation (KAIST AI / NAVER)

  • Machine Learning Engineer @ NAVER Papago team

Reviewer

  1. Korean name (English name): Affiliation / Contact information

  2. Korean name (English name): Affiliation / Contact information

  3. ..

Reference & Additional materials

  8. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

  9. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

  10. Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. International Journal of Computer Vision, 2018.
