Squeeze-and-Attention Networks for Semantic segmentation [Kor]

Zhong et al. / Squeeze-and-Attention Networks for Semantic segmentation/ CVPR 2020

1. Problem Definition

The problem is to build a network that assigns a class label to every pixel of an RGB image (semantic segmentation). Semantic segmentation partitions the objects in an image into semantically meaningful units and is used in autonomous driving and a wide range of vision software. The paper evaluates on two datasets: PASCAL Context (59 classes, 4998 training images, 5105 test images) and PASCAL VOC (20 classes, 10582 training images, 1449 validation images, 1456 test images). Both are widely used benchmarks for evaluating semantic segmentation networks.

Formally, the network takes an RGB color image (height x width x 3) as input and outputs a semantic label map (height x width x 1).
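As a shape-level illustration of this formulation, the sketch below uses NumPy with random class scores standing in for a real network output. The function `segment` and all array names are hypothetical and do not come from the paper.

```python
import numpy as np

def segment(scores: np.ndarray) -> np.ndarray:
    """Turn per-pixel class scores (H, W, num_classes) into a label map (H, W, 1)."""
    labels = np.argmax(scores, axis=-1)           # most likely class for each pixel
    return labels[..., np.newaxis].astype(np.int64)

# Hypothetical example: a 4x6 RGB image and 59 classes (as in PASCAL Context).
rng = np.random.default_rng(0)
image = rng.random((4, 6, 3))                     # height x width x 3
scores = rng.random((4, 6, 59))                   # placeholder for a network's output
label_map = segment(scores)                       # height x width x 1
```

The label map keeps the spatial size of the input and stores one class index per pixel.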

2. Motivation

Multi-scale context :

  • Earlier work combined multi-scale features in a Laplacian pyramid structure, and the multi-path RefineNet extracted and fused features from multi-scale inputs. Building on these, the paper exploits multi-scale context by merging the multi-scale dense predictions produced at several stages of a residual network.

  • A Laplacian pyramid structure progressively reduces the image scale and combines the information obtained at every scale to produce the final result.

  • Multi-path RefineNet merges the outputs of its networks (paths) from the smallest scale up to the largest, restoring the result to the largest image scale. It is called multi-path because each rescaled image is fed as input to its own path.
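The Laplacian pyramid idea described above can be sketched on a plain 2-D array. This is a minimal illustration, assuming 2x2 average pooling in place of the usual Gaussian filter and nearest-neighbour upsampling. All function names are hypothetical.

```python
import numpy as np

def downsample(x):
    """Halve resolution by 2x2 average pooling (a box filter, not a Gaussian)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(x):
    """Double resolution by nearest-neighbour repetition."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def build_laplacian_pyramid(image, levels):
    """Return [detail_0, ..., detail_{levels-1}, coarsest_image]."""
    pyramid, current = [], image
    for _ in range(levels):
        smaller = downsample(current)
        pyramid.append(current - upsample(smaller))  # detail lost at this scale
        current = smaller
    pyramid.append(current)
    return pyramid

def reconstruct(pyramid):
    """Combine the information from every scale, coarse to fine."""
    current = pyramid[-1]
    for detail in reversed(pyramid[:-1]):
        current = upsample(current) + detail
    return current

image = np.arange(64, dtype=float).reshape(8, 8)
pyramid = build_laplacian_pyramid(image, levels=2)
restored = reconstruct(pyramid)
```

Summing the per-scale details back onto the coarsest image recovers the original array, which is the sense in which the structure "combines the information obtained at every scale".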

Channel-wise attention :

  • Assigning a different weight to each channel of a feature map helps the network emphasize the more informative features. A representative example is the Squeeze-and-Excitation (SE) module, which the paper develops further into the Squeeze-and-Attention (SA) module.

Pixel-group attention :

  • This form of attention emphasizes the connectivity between the pixels of a channel based on their similarity. Whereas earlier methods were designed around pixel-level performance alone, the paper develops a network that also uses pixel grouping.

Idea

The authors argue that segmentation can be viewed as two independent dimensions entangled with each other: pixel-wise prediction and pixel grouping. Pixel-wise prediction decides which object each pixel belongs to, while pixel grouping emphasizes the connectivity between pixels based on their similarity.

Whereas earlier papers relied mainly on pixel-level ideas, this paper develops a network that also focuses on pixel grouping.

fig1.PNG

The work the network performs can be divided into two tasks.

  • Task 1: Pixel-wise prediction classifies every pixel, so it requires precise pixel-wise annotation and spatial constraints. Recent segmentation models obtain good results by aggregating contextual features through pyramid pooling and dilated convolution layers. However, the grid structure of the kernels restricts the shape of the spatial features. This can improve pixel-wise prediction, but it leaves the global understanding of the image lacking.

  • Task 2: Pixel grouping directly encourages pixels to belong to the same group without spatial restrictions. It involves converting an image sampled over the full range into pixel groups defined by a semantic spectrum. The paper introduces the SA module for this purpose, motivated by the need to relax the local constraints of convolution kernels. Because the SA module downsamples the feature map without fully squeezing it, it can generate local spatial attention efficiently. Another difference from the earlier SE module is a head unit that integrates spatial attention and merges information from multiple stages, which further improves performance.

To sum up the two tasks, the paper designs SANet with four SA modules, and SANet carries out both tasks described above. By learning multi-scale spatial features and non-local spectral features, it overcomes a limitation inherent to convolution (spatial relationships between distant pixels are hard to learn), and it uses dilated ResNet and EfficientNet backbones to maximize efficiency. Finally, it merges the non-local features of multiple stages to improve performance.

3. Method

The modules shown in Figure 2 are, in order, (a) the residual block, (b) the Squeeze-and-Excitation (SE) module, and (c) the Squeeze-and-Attention (SA) module. The SE module is built on the residual block, and the SA module is built on the idea of the SE module.

fig2.PNG

The SE module improves the residual block by re-calibrating the channels of the feature map. As the figure shows, the input feature map is squeezed by average pooling into a 1x1 vector with one value per channel, and two fully connected layers with weights W1 and W2 then produce the excitation weights. These weights are multiplied with X_in to apply attention, and the result is added to the X_res tensor to form the final output. This is expressed as follows.

[Equation images: SE module output and excitation weight]
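As a rough illustration of the SE computation described above, the NumPy sketch below squeezes a (C, H, W) feature map by global average pooling, passes it through two fully connected layers, and reweights the channels. Several details are assumptions made for the sketch rather than details taken from the paper: ReLU after the first layer, a sigmoid after the second, the reduction ratio `r`, and the random tensors standing in for X_in and X_res.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_module(x_in, x_res, w1, w2):
    """SE-style recalibration for a (C, H, W) feature map.

    x_res stands in for the output of the residual branch.
    """
    squeezed = x_in.mean(axis=(1, 2))              # squeeze: global average pool -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)        # first FC layer + ReLU -> (C // r,)
    weights = sigmoid(w2 @ hidden)                 # second FC layer + sigmoid -> (C,)
    return weights[:, None, None] * x_in + x_res   # reweight channels, add residual

rng = np.random.default_rng(0)
C, H, W, r = 8, 5, 5, 4                            # r: hypothetical reduction ratio
x_in = rng.standard_normal((C, H, W))
x_res = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
x_out = se_module(x_in, x_res, w1, w2)
```

Each channel receives a single scalar weight, so the attention carries no spatial information. This is the point the SA module changes.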

Instead of squeezing all the way down to a 1x1 vector as the SE module does, the SA module uses a not-fully-squeezed operation to produce an attention map that retains richer spatial information. The attention map is also multiplied directly with X_res, so the output reflects both local and global features. This is expressed as follows.

[Equation images: SA module output and attention map]
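The SA computation can be sketched in the same style. In this minimal NumPy version, the attention branch pools with a small window instead of pooling globally, mixes channels with a single 1x1 convolution (a stand-in for the attention convolution layers), and is upsampled back to the input size. The ReLU, the pooling factor `k`, and the additive `x_attn` term are assumptions not stated in the text above, and all names are hypothetical.

```python
import numpy as np

def avg_pool(x, k):
    """Non-overlapping k x k average pooling on a (C, H, W) map."""
    c, h, w = x.shape
    return x.reshape(c, h // k, k, w // k, k).mean(axis=(2, 4))

def upsample(x, k):
    """Nearest-neighbour upsampling by a factor k."""
    return x.repeat(k, axis=1).repeat(k, axis=2)

def sa_module(x_in, x_res, w_attn, k=2):
    pooled = avg_pool(x_in, k)                         # (C, H/k, W/k): not fully squeezed
    mixed = np.einsum("oc,chw->ohw", w_attn, pooled)   # 1x1 conv stand-in
    x_attn = upsample(np.maximum(mixed, 0.0), k)       # ReLU, then back to (C, H, W)
    return x_attn * x_res + x_attn, x_attn             # attention applied to X_res

rng = np.random.default_rng(0)
x_in = rng.standard_normal((4, 6, 6))
x_res = rng.standard_normal((4, 6, 6))
w_attn = rng.standard_normal((4, 4))
x_out, x_attn = sa_module(x_in, x_res, w_attn, k=2)
```

Unlike the SE weights, `x_attn` varies across spatial positions (one value per k x k block here), which is what lets it act as a spatial attention map.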
fig3.PNG

The overall architecture of the SA network is shown in Fig3. SA modules extract features from four backbone stages, and the loss is a weighted sum of three losses.

The three losses are summarized below.

  • Mask loss : Measures how well the pixels belonging to each class are selected. As the figure shows, each channel is a mask for one class, and the error of this mask is expressed as a loss.

  • Categorical loss : Measures how well the masked channels are classified into their classes. Each channel is assigned to a class, and the error against the ground-truth label is expressed as a loss.

  • Dense loss : The pixel-wise loss computed after the results are merged into a single semantic segmentation image.

[Equation image: total loss as a weighted sum of the mask, categorical, and dense losses]
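One way to write the weighted sum described above is given below. The weights α and β are the hyper-parameters tuned in the experiments. Their assignment to the categorical and dense terms, and the subscript names, are assumptions made for this sketch.

```latex
\mathcal{L}_{\text{SANet}}
  = \mathcal{L}_{\text{mask}}
  + \alpha \, \mathcal{L}_{\text{cat}}
  + \beta \, \mathcal{L}_{\text{den}}
```

Under this form, setting α = 0.2 and β = 0.8 would weight the dense pixel-wise term four times as heavily as the categorical term.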

Supplementary explanation of pixel-group attention

The authors split segmentation into two tasks. Because pixel grouping is the less familiar of the two and may cause confusion, a supplementary explanation follows.

When a network is built from plain convolutions alone, the limited spatial extent of convolution makes it hard to learn the relationship between two pixels that are far apart. Networks for natural language processing face the same issue, since they must learn the relationship between words that are far apart, and they add various attention modules to help the network learn it. There have been attempts to apply this idea to segmentation. A representative one uses a self-attention mask built from the correlation between the pixels of an image.

In this paper, pixel grouping can be understood as "a mechanism that lets the network learn the relationship between pixels of the same class without spatial restrictions". Drawing on the existing Squeeze-and-Excitation (SE) module, the authors adopt the Squeeze-and-Attention (SA) module as an efficient pixel-grouping method. The SA module is described above.
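The self-attention mask based on pixel correlation mentioned above can be sketched as follows. This is a generic scaled dot-product self-attention over pixels, not the SA module, and it omits the learned query, key, and value projections a real layer would have. All names are hypothetical.

```python
import numpy as np

def pixel_self_attention(features):
    """features: (C, H, W). Every pixel attends to every other pixel."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w).T              # (N, C), one row per pixel
    scores = flat @ flat.T / np.sqrt(c)              # pixel-to-pixel correlation (N, N)
    scores -= scores.max(axis=1, keepdims=True)      # for numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)          # softmax over all pixels
    out = attn @ flat                                # aggregate features by similarity
    return out.T.reshape(c, h, w), attn

rng = np.random.default_rng(0)
features = rng.standard_normal((3, 4, 4))
attended, attn = pixel_self_attention(features)
```

The attention matrix has one row and one column per pixel, so its size grows quadratically with the image area. This cost is one reason a cheaper grouping mechanism such as the SA module is attractive.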

4. Experiment & Result

Experimental setup

  • Dataset : PASCAL Context , PASCAL VOC

  • Baselines : ResNet50, ResNet101

  • Training setup :

    • Learning rate : 0.001(PASCAL Context), 0.0001(PASCAL VOC)

    • Optimizer : SGD with a poly learning rate annealing schedule

    • Training method :

      • PASCAL Context : 80 epochs

      • PASCAL VOC : COCO pretrained + 50 epochs on the validation set

    • Batch size : 16
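For reference, a poly learning rate schedule is commonly written as the base rate multiplied by (1 - iteration / max_iterations) raised to a power. The sketch below assumes that common form. The exponent 0.9 and the iteration counts are illustrative assumptions, not values stated in this review.

```python
def poly_lr(base_lr, iteration, max_iterations, power=0.9):
    """Polynomially decay the learning rate from base_lr down to zero."""
    return base_lr * (1.0 - iteration / max_iterations) ** power

# Hypothetical run of 1000 iterations starting from the PASCAL Context rate.
schedule = [poly_lr(0.001, it, 1000) for it in (0, 250, 500, 750, 1000)]
```

The rate starts at the base value and decreases monotonically to zero at the final iteration.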

Result

  1. The first result comes from the experiment that searches for the best values of alpha and beta in the loss. Alpha and beta are hyper-parameters that balance the individual losses and must be tuned for the best performance. Comparing performance across different values, the highest accuracy was obtained with alpha = 0.2 and beta = 0.8.

    fig4.PNG

  2. The second result compares SANet with other recent models. Table 2 shows that it outperforms recent networks, and the SANet built by adding SA modules to EffNet-b7 reaches an mIoU of 54.4, setting a new best result on the PASCAL Context dataset.

    table2.PNG

  3. The third result shows that the SA module outperforms the SE module. Since the SA module was developed from the SE module, as noted above, the size of this gain is where the paper's novelty lies. As Table 3 shows, accuracy increases by 4.1% and 4.5%, respectively, over the SE module.

    table3.PNG

  4. The next result is a qualitative comparison with the baseline network. In Fig5, (a) is the raw input, (b) is the ground truth, (c) is the baseline, and (d) is SANet. The baseline is a dilated ResNet50 FCN, and SANet is the baseline network with SA modules added.

    The first row of Fig5 is a case where the object boundaries and layout are relatively simple, and the bottom row is a case where the composition of objects is relatively complex. In both cases SANet produces results closer to the ground truth than the baseline. Overall SANet is better than the baseline, but complex cases such as the last image still leave considerable room for improvement.

fig5.PNG
  5. Finally, the output of a plain convolution and the output with the SA module added are compared qualitatively in terms of global attention. (A plain convolution also extracts spatial features, so the comparison is meant to show how much the SA module adds.) To see the role of the SA module's attention map at each stage, images from the head1 and head4 modules are extracted and compared. In the figure, (b), (c), and (d) each select a different class, and the regions shown in red are the activated ones. Comparing the heads shows that the roles at the low level and the high level differ.

    • low-level : attn has a wide field of view, while main concentrates on extracting local features that preserve object boundaries.

    • high-level : attn focuses mainly on the region surrounding the selected point, while main produces more homogeneous results with clearer semantic meaning than at the low level.

    In other words, the attention map assists the main channel and helps it produce sharper and more accurate output.

[Figure: attention (attn) and main-channel feature maps from head1 and head4]
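The mIoU metric used in the comparisons above averages the intersection-over-union of the predicted and ground-truth regions across classes. The sketch below computes it for a single pair of label maps. Benchmark evaluations typically accumulate intersections and unions over the whole dataset, so treat this per-image version and its names as illustrative assumptions.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over the classes present in either map."""
    ious = []
    for cls in range(num_classes):
        pred_c, target_c = pred == cls, target == cls
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:                     # class absent from both maps: skip it
            continue
        ious.append(np.logical_and(pred_c, target_c).sum() / union)
    return float(np.mean(ious))

target = np.array([[0, 0, 1], [1, 2, 2]])
pred = np.array([[0, 1, 1], [1, 2, 0]])
score = mean_iou(pred, target, num_classes=3)
```

In this toy example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5.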

5. Conclusion

  • The authors designed the network from the new view that semantic segmentation consists of two independent dimensions (pixel-wise prediction and pixel grouping). ("Independent" here does not mean that the two tasks are considered in complete isolation. It means that both dimensions have to be taken into account.)

  • The proposed SA module improves pixel-wise dense prediction and makes pixel grouping straightforward to apply.

  • The network performs well on two challenging benchmark datasets.

  • The authors hope that the simple but effective SA module will prove useful in other research.

Take home message

  • Many papers give the impression of trying various things and, once the results turn out well, fitting an explanation to them afterwards. It was impressive that this paper instead looks at how a network learns an image from a different angle, then implements and applies that view.

  • The SA module is not very different from the SE module, yet the performance gain is remarkably large. I still have much to learn about network design, but seeing such a small change lead to a large result made me feel the need to master the fundamentals thoroughly.

Author / Reviewer information

Author

정구일 (Guil Jung)

  • M.S. student, Electrical Engineering Department, KAIST

  • Interested in Biomedical Imaging

  • jgl97123@kaist.ac.kr

Reviewer

  1. Korean name (English name): Affiliation / Contact information

  2. Korean name (English name): Affiliation / Contact information

  3. ...
