RegSeg [Eng]

(Description) Roland Gao / Rethink Dilated Convolution for Real-time Semantic Segmentation / arXiv 2021

1. Problem definition

This paper addresses the problems caused by the ImageNet backbones used in real-time scene segmentation. In the ImageNet backbones adopted by earlier real-time segmentation papers, the final convolution layers produce an excessive number of channels: ResNet18 ends with 512 channels and ResNet50 with 2048. This places a heavy computational burden on real-time settings. In addition, ImageNet models take 224 x 224 input images, whereas semantic segmentation datasets use much larger images such as 1024 x 2048. As a result, the field-of-view of ImageNet models is insufficient for encoding large images. RegSeg proposes an architecture that reduces computation and secures a sufficient field-of-view without sacrificing accuracy.
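
As a rough illustration of the two mismatches described above, the following sketch counts the weights of a square k x k convolution at different channel widths and compares the pixel counts of the two input sizes. The helper name `conv_weights` is made up for this example, and the layers are generic ones, not the exact layers of ResNet18 or ResNet50.

```python
def conv_weights(c_in: int, c_out: int, k: int = 3) -> int:
    """Weight count of a k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


# Weight count grows quadratically with channel width.
for width in (128, 512, 2048):
    print(width, conv_weights(width, width))

# Pixel-count ratio between a 1024 x 2048 segmentation image
# and a 224 x 224 ImageNet crop.
pixel_ratio = (1024 * 2048) / (224 * 224)
print(round(pixel_ratio, 1))
```

Quadrupling the channel width multiplies the weight count by sixteen, which is why very wide final stages are costly at segmentation resolutions.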

2. Motivation

This section briefly reviews prior work that aims to improve both accuracy and inference speed in segmentation.

  • Semantic segmentation

    • Fully Convolutional Networks: replaced every fc layer with a conv layer so that classification models could be applied to segmentation.

    • DeepLabv3: added dilated convolutions with various dilation rates to an ImageNet model to enlarge the receptive field.

    • PSPNet: introduced the Pyramid Pooling Module, which places several layers with different pooling rates in parallel, so that the network can learn global context information.

    • DeepLabv3+: added a decoder and a 1 x 1 convolution to DeepLabv3 to stabilize training.

  • Real-time semantic segmentation

    • BiSeNetV2: builds two branches, a Spatial Path and a Context Path, and merges them, achieving good performance without a pretrained ImageNet model.

    • STDC: removes the Spatial Path of BiSeNet and uses a single path, which makes the network run faster.

    • DDRNet-23: adds mutual fusion between its two branches and appends the Deep Aggregation Pyramid Pooling Module (DAPPM) to the end of the backbone, achieving SOTA performance on the Cityscapes dataset.

  • Designing Network Design Spaces: as the number of choices in network design grew, manual network design became difficult. Many good networks could be found, but the principles behind them were not, so the authors ran a large number of experiments and simulations and proposed the block-based RegNetY as a new network design paradigm.

Idea

In moving away from the ImageNet models used by earlier semantic segmentation work, real-time semantic segmentation studies ended up with a large increase in computation. DDRNet-23, for example, uses 20.0M parameters. To reduce computation while enlarging the receptive field, this paper takes the RegNet block as a reference, proposes a block structure that applies dilated convolutions, and stacks it repeatedly.

3. Method

Dilated block

The author replaces the 3 x 3 conv step of the RegNet Y block with dilated convolutions split into two branches. This block is named the Dilated Block (D block) and is repeated 18 times in total with varying dilation rates. The difference between the Y block and the D block is shown below. When all dilation rates are 1, the D block is identical to the Y block.

figure 1
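
To make the split concrete, here is a minimal pure-Python sketch of the idea in 1D: the channels are divided into two halves and each half is convolved with its own dilation rate before the results are concatenated. It is an illustration only. The function names are invented here, each channel is filtered independently with one shared odd-length kernel, and the 1 x 1 convolutions, normalization, and other parts of the real block are left out.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Zero-padded 'same' 1D convolution with gaps of `dilation` between taps."""
    k = len(kernel)
    pad = (k // 2) * dilation
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(kernel[j] * padded[i + j * dilation] for j in range(k))
        for i in range(len(signal))
    ]


def d_block_conv(channels, kernel, dilations=(1, 4)):
    """Split channels into equal groups, one dilation rate per group, then concatenate."""
    groups = len(dilations)
    assert len(channels) % groups == 0, "channels must split evenly"
    size = len(channels) // groups
    out = []
    for g, rate in enumerate(dilations):
        for ch in channels[g * size:(g + 1) * size]:
            out.append(dilated_conv1d(ch, kernel, rate))
    return out


# Two identical impulse channels: the first branch sees neighbours at
# distance 1, the second at distance 4.
impulse = [0.0] * 5 + [1.0] + [0.0] * 5
near, far = d_block_conv([impulse, impulse], kernel=[1.0, 1.0, 1.0])
print([i for i, v in enumerate(near) if v])  # taps of the rate-1 branch
print([i for i, v in enumerate(far) if v])   # taps of the rate-4 branch
```

With `dilations=(1, 1)` both groups use an ordinary convolution, which mirrors the statement that the D block reduces to the Y block when all rates are 1.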

The D block with a stride of 2 is shown below.

figure 2

๊ฐ D๋ธ”๋ก์—์„œ์˜ dilated rate์™€ stride๋Š” ๋‹ค์Œ ํ‘œ์—์„œ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ฐ dilated rate๋ฅผ ๋‹ฌ๋ฆฌํ•˜๋ฉด์„œ multi-scale feature๋ฅผ ์ถ”์ถœํ•  ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค.

figure 3

A backbone built by repeating D blocks in this way resembles the RegNet style, and the dilation rates of each block were chosen experimentally. Using four dilation branches did not give better results than using two, so the block is split into only two branches.
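
How quickly stacked dilated blocks widen the field-of-view can be estimated with the standard receptive-field recurrence. The sketch below is a back-of-the-envelope calculation under assumptions: the layer lists are hypothetical and are not the rates from the table above, and only the larger-rate branch of each block is tracked.

```python
def receptive_field(layers):
    """Receptive field (in input pixels, along one axis) of stacked convs.

    `layers` is a sequence of (kernel_size, stride, dilation) tuples.
    """
    rf, jump = 1, 1  # jump = spacing of adjacent outputs measured in input pixels
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf


# Hypothetical stacks: four stride-2 3x3 convs down to 1/16 resolution,
# followed by six 3x3 convs with or without dilation.
downsample = [(3, 2, 1)] * 4
plain = downsample + [(3, 1, 1)] * 6
dilated = downsample + [(3, 1, 4)] * 6

print(receptive_field(plain))
print(receptive_field(dilated))
```

Each layer adds `(kernel - 1) * dilation * jump` pixels, so a dilation rate applied at 1/16 resolution widens the field-of-view far more than the same rate applied near the input.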

Decoder

์œ„์˜ backbone์—์„œ ์†Œ์‹ค๋œ local deatils์„ ๋ณต๊ตฌํ•˜๊ธฐ ์œ„ํ•ด ๋””์ฝ”๋”๋ฅผ ์ถ”๊ฐ€ํ•˜์˜€์Šต๋‹ˆ๋‹ค. Backbone์œผ๋กœ๋ถ€ํ„ฐ 1/4, 1/8, ๊ทธ๋ฆฌ๊ณ  1/16 ํฌ๊ธฐ์˜ feature maps์„ ์ž…๋ ฅ๋ฐ›์•„ 1 x 1 conv์™€ upsampling์„ ๊ฑฐ์ณ ํ•ฉ์ณ์ง‘๋‹ˆ๋‹ค. ๋””์ฝ”๋”์˜ ๋‹จ์ˆœํ•œ ๊ตฌ์กฐ๋Š” ์—ฐ์‚ฐ๋Ÿ‰์„ ํฌ๊ฒŒ ๋Š˜๋ฆฌ์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

figure 4
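
The coarse-to-fine merging can be sketched on plain nested lists. This is only a schematic under assumptions: it uses a single channel, nearest-neighbour upsampling, and element-wise addition as the merge, and it skips the 1 x 1 convolutions. The function names are made up for the example.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D list (rows of numbers)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add(a, b):
    """Element-wise sum of two equally sized 2D lists."""
    assert len(a) == len(b) and len(a[0]) == len(b[0]), "shape mismatch"
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def fuse(f4, f8, f16):
    """Merge the 1/16 map into the 1/8 map, then the result into the 1/4 map."""
    x = add(upsample2x(f16), f8)
    return add(upsample2x(x), f4)


# Toy maps for a 16 x 32 input: 4 x 8 (1/4), 2 x 4 (1/8), 1 x 2 (1/16).
f4 = [[1] * 8 for _ in range(4)]
f8 = [[10] * 4 for _ in range(2)]
f16 = [[100, 200]]
out = fuse(f4, f8, f16)
print(len(out), len(out[0]))
print(out[0])
```

The output has the resolution of the 1/4 map, so the coarse context from the 1/16 map reaches every fine-grained location.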

4. Experiment & Result

Experimental setup

The paper compares RegSeg with state-of-the-art models, including DDRNet-23, on Cityscapes and CamVid. The training setup for Cityscapes is as follows.

  • SGD with momentum 0.9

  • initial learning rate: 0.05

  • weight decay: 0.0001

  • random scaling [400, 1600]

  • random cropping 768 x 768

  • class uniform sampling at 0.5%

  • batch size = 8, 1000 epochs

For CamVid, a Cityscapes-pretrained model is used, and the setup differs from the Cityscapes one as follows.

  • random horizontal flipping

  • random scaling of [288, 1152]

  • batch 12, 200 epochs

  • class uniform sampling is not used
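
The two setups can be written down as one base configuration plus overrides. The sketch below is a hypothetical layout, not the author's training code: the class and field names are invented, the class uniform sampling value is stored exactly as written in the list, and fields that are not listed as CamVid differences are simply carried over.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class TrainConfig:
    optimizer: str = "SGD"
    momentum: float = 0.9
    lr: float = 0.05
    weight_decay: float = 0.0001
    scale_range: Tuple[int, int] = (400, 1600)
    crop_size: Tuple[int, int] = (768, 768)
    # Value as written in the text ("0.5%"); None means sampling is disabled.
    class_uniform: Optional[str] = "0.5%"
    horizontal_flip: bool = False  # assumption: only listed for CamVid in the text
    batch_size: int = 8
    epochs: int = 1000


cityscapes = TrainConfig()

# CamVid overrides only the fields the text lists as differences.
camvid = replace(
    cityscapes,
    horizontal_flip=True,
    scale_range=(288, 1152),
    batch_size=12,
    epochs=200,
    class_uniform=None,
)
print(camvid)
```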

Result

Cityscapes

The results on Cityscapes are as follows.

figure 5

๋ชจ๋ธ ๊ฐ„์˜ FPS๋Š” ์ง์ ‘ ๋น„๊ตํ•  ์ˆ˜ ์—†์ง€๋งŒ, RegSeg๋Š” ์ถ”๊ฐ€์ ์ธ ๋ฐ์ดํ„ฐ ์—†๋Š” SOTA ๋ชจ๋ธ์ธ HardDNet๋ณด๋‹ค 1.5%p ๋” ๋†’๊ณ , ํ”ผ์–ด ๋ฆฌ๋ทฐ ๊ฒฐ๊ณผ๊ฐ€ ๊ฐ€์žฅ ์šฐ์ˆ˜ํ•œ SFNet์„ 0.5%p ๋Šฅ๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.

figure 6

On the Cityscapes test set, RegSeg offers the best balance between accuracy and parameter count.

Ablation Studies

์ž‘์€ dilation rates๋ฅผ ์•ž์—์„œ ์‚ฌ์šฉํ•˜๊ณ  ํฐ dilateion rates๋ฅผ ๋’ค์—์„œ ์‚ฌ์šฉํ•˜๋˜ ๋ฌด์ž‘์ • field-of-view๋ฅผ ๋Š˜๋ฆฌ๋Š” ๊ฒƒ์ด ์ •ํ™•๋„ ํ–ฅ์ƒ์„ ์ด๋Œ์–ด๋‚ด์ง€ ์•Š๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

figure 7

5. Conclusion

  • RegSeg did not manage to reduce the parameter count while matching the accuracy of DDRNet-23, but it still performs well in real-time segmentation thanks to a very favorable trade-off.

  • Dilated convolutions for enlarging the field-of-view have been used since DeepLab, but reducing the number of branches to two was effective in reducing the number of parameters.

  • The paper contributes efficient dilation rates and an efficient structure, found through a very large number of experiments.

Take home message

Keeping the number of dilated conv branches to a minimum and stacking the blocks deeply is efficient.

Blindly enlarging the field-of-view does not necessarily improve accuracy.

Author

이명석 (MyeongSeok Lee)

Reference & Additional materials

  1. Gao, R. (2021). Rethink Dilated Convolution for Real-time Semantic Segmentation. arXiv preprint arXiv:2111.09957.

  2. Radosavovic, I., Kosaraju, R. P., Girshick, R., He, K., & Dollár, P. (2020). Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10428-10436).
