YOLOX [Kor]

Ge et al. / YOLOX: Exceeding YOLO Series in 2021 / arXiv 2021

1. Problem definition

Figure 1. Example of YOLOX in use

Real-time object detection is the task of detecting objects quickly, in real time, while maintaining a baseline level of accuracy; it demands far higher processing speed than conventional object detection methods. The most representative model in this area is YOLO (You Only Look Once, CVPR 2016), which reinterpreted object detection, previously defined as a multi-task problem of image classification and localization, as a single regression problem handled by a single neural network. Later models in the YOLO series have each been designed to reach the best speed/accuracy trade-off for real-time image processing. YOLOv5, for example, achieves 48.2% AP at 13.7 ms. The YOLOX model proposed in this paper is likewise a high-performance object detection model that can be used for real-time object detection. In particular, YOLOX-L drew considerable attention by winning first place in the Streaming Perception Challenge (Workshop on Autonomous Driving at CVPR 2021) with a single model.

2. Motivation

The YOLO (You Only Look Once) model began with version 1, released by Joseph Redmon in 2015, and has since progressed through version 5. The core idea of YOLO is to avoid splitting classification and localization into separate tasks and instead treat detection as a single regression problem, so that a convolutional neural network can be applied in real time. As the name suggests, the algorithm needs only a single forward pass of the network to detect objects. The basic principles of the YOLO algorithm consist of three parts.

  1. Residual blocks (grid cells): The image is divided into grid cells of equal size, and each grid cell detects the objects that appear inside it. For example, if the center of an object falls within a particular grid cell, that cell is responsible for detecting it.

  2. Bounding box regression: A bounding box is an outline that highlights an object in the image. It consists of a width ($b_w$), a height ($b_h$), a class ($c$), and a bounding box center ($b_x, b_y$). YOLO uses bounding box regression to predict the width, height, class, and center of each object, together with the probability that an object appears in the image.

  3. Intersection over union (IoU): IoU describes how much two bounding boxes overlap. YOLO uses IoU to produce an output box that encloses each object as tightly as possible.
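The first and third principles above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `responsible_cell` and `iou`, the 7x7 grid, and the corner-format boxes `(x1, y1, x2, y2)` are assumptions made for the example and are not taken from the paper.

```python
def responsible_cell(cx, cy, img_w, img_h, grid_size):
    """Return the (row, col) of the grid cell that contains the object center."""
    col = min(int(cx / img_w * grid_size), grid_size - 1)
    row = min(int(cy / img_h * grid_size), grid_size - 1)
    return row, col


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (if any).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# An object centered at (320, 240) in a 640x480 image, with a hypothetical 7x7 grid.
print(responsible_cell(320, 240, 640, 480, 7))
# Two 2x2 boxes that share a 1x1 corner region.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

An IoU of 1 means the two boxes coincide and 0 means they do not overlap at all, which is why it serves as a measure of how well a predicted box fits the object.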

The lineage and key points of the main YOLO series are as follows.

  • YOLOv3

    • Released in April 2018. The last YOLO model published by Joseph Redmon, built on Darknet-53.

  • YOLOv4

    • Released in April 2020. The lead researcher changed to Alexey Bochkovskiy. Using a range of deep learning techniques (WRC, CSP ...), it improved AP and FPS over v3 by 10% and 12%, respectively. It uses a CSPNet-based backbone (CSPDarkNet53). It is an anchor-based model, and the anchor-based approach has drawbacks: clustered anchors are domain-specific and hard to generalize, and the detection head becomes complex.

  • YOLOv5

    • Released in June 2020 by Glenn Jocher. It uses a CSPNet-based backbone like v4; accuracy is similar, but it is better in terms of model size and speed. However, it was released only as PyTorch code without an official paper, so whether it should officially be called v5 is disputed. It is also optimized around the anchor-based approach.

  • PP-YOLO

    • Released in July 2020 by Xiang Long et al. It is more accurate and faster than v4. It is based on the v3 model, but replaces the Darknet-53 backbone with ResNet and is built on PaddlePaddle, an open-source machine learning framework.

Idea

In recent years, academia has proposed a variety of new object detection techniques, such as anchor-free detectors, advanced label assignment strategies, and end-to-end (NMS-free) detectors, but these had not been applied to the existing YOLO series. This paper applies them to the existing YOLO model and proposes the resulting improved model, YOLOX. Because the YOLOv4 and YOLOv5 pipelines are optimized mainly for the anchor-based setting, the authors judged that their more general performance could actually suffer, and therefore chose YOLOv3-SPP with DarkNet53 as the baseline. On top of this baseline they applied a decoupled head, an anchor-free design, multi positives, and SimOTA, bringing recent object detection techniques into YOLO and improving its performance.

3. Method


YOLOX follows the basic structure Input - Backbone - Neck - Dense Prediction. Feature maps are extracted by the Darknet53 backbone, and an SPP (Spatial Pyramid Pooling) layer is used to improve performance. The neck uses an FPN to obtain multi-scale feature maps, so that large objects are detected on the low-resolution feature maps and small objects on the high-resolution feature maps. For the head, unlike YOLOv3 through YOLOv5, YOLOX uses a decoupled head.
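To make the multi-scale idea concrete, the following sketch computes the feature map size at each pyramid level. The 640x640 input and the strides 8, 16, and 32 are assumptions chosen for illustration; the text above does not state them, and `feature_map_sizes` is a hypothetical helper.

```python
def feature_map_sizes(input_size, strides=(8, 16, 32)):
    """Map each stride to the side length of its (square) feature map."""
    return {stride: input_size // stride for stride in strides}


sizes = feature_map_sizes(640)
for stride, side in sizes.items():
    # A larger stride gives a coarser map, which is used for larger objects.
    print(f"stride {stride:2d}: {side}x{side} locations")

total_locations = sum(side * side for side in sizes.values())
print("total prediction locations:", total_locations)
```

Under these assumptions, the finest map (stride 8) has the most locations and handles small objects, while the coarsest map (stride 32) handles large ones.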

Decoupled Head

YOLOv3 performs classification and localization together in a single head, but later studies showed that the classification and regression tasks conflict with each other in object detection. According to Wu et al. (2020), a fully connected head is better suited to classification, whereas a convolution head is better suited to localization, so sharing one coupled head between the two tasks degrades performance. YOLOX therefore replaces the coupled head with a decoupled head that handles classification and localization in separate branches, which improves performance.
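The structural idea of a decoupled head can be illustrated with a toy, dependency-free sketch. It is not the YOLOX implementation: the layers here are plain per-location linear maps (what a 1x1 convolution computes at one spatial position) with random weights, and the names `DecoupledHead`, `linear`, and all dimensions are assumptions made for the example. The point is only that the classification output and the box/objectness outputs come from separate branches after a shared stem.

```python
import random


def linear(x, weights):
    """Apply a weight matrix (list of rows) to one feature vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x):
    return [max(0.0, v) for v in x]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class DecoupledHead:
    """Toy head: shared stem, then separate classification and regression branches."""

    def __init__(self, in_dim, hidden_dim, num_classes, seed=0):
        rng = random.Random(seed)
        self.stem = random_matrix(hidden_dim, in_dim, rng)        # shared channel reduction
        self.cls_branch = random_matrix(hidden_dim, hidden_dim, rng)
        self.reg_branch = random_matrix(hidden_dim, hidden_dim, rng)
        self.cls_out = random_matrix(num_classes, hidden_dim, rng)  # class scores
        self.box_out = random_matrix(4, hidden_dim, rng)            # box values
        self.obj_out = random_matrix(1, hidden_dim, rng)            # objectness

    def __call__(self, feature):
        shared = relu(linear(feature, self.stem))
        cls_feat = relu(linear(shared, self.cls_branch))
        reg_feat = relu(linear(shared, self.reg_branch))
        return {
            "cls": linear(cls_feat, self.cls_out),
            "box": linear(reg_feat, self.box_out),
            "obj": linear(reg_feat, self.obj_out),
        }


head = DecoupledHead(in_dim=16, hidden_dim=8, num_classes=80)
out = head([0.5] * 16)  # one feature vector at one spatial location
print({name: len(values) for name, values in out.items()})
```

In a coupled head, all of these outputs would be produced by one shared set of final weights; separating the branches lets each task learn its own features.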

Anchor-free

An anchor box is one of a set of bounding boxes defined in advance for detecting objects in an input image or video; anchors are centered on each pixel and generated with different sizes and aspect ratios. Many object detection models are anchor-based: they predict a category and adjust coordinates starting from a large number of preset anchors. Recently, however, the emergence of FPN and Focal Loss has spurred research on anchor-free detectors. There are two kinds of anchor-free detector: keypoint-based methods, which predict the location of an object from keypoints, and center-based methods, which predict the center of an object and, for positive locations, the distances to the object boundary. Anchor-based detectors can perform very well, but developers have to tune them heuristically by hand. Moreover, anchor sizes tuned this way are tied to a specific task, so general performance suffers. Anchor-free detectors do not need the many anchor-related hyperparameters to be tuned, yet achieve performance comparable to anchor-based detectors, so they are regarded as having the potential to perform more generally across object detection.
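As a rough illustration of the center-based, anchor-free idea, the sketch below decodes one raw prediction into a box without any preset anchor sizes. The parameterization (an offset added to the grid cell index, and an exponential of the predicted log-size, both scaled by the stride) is an assumption made for this example; `decode_prediction` is a hypothetical helper, not an API from the paper.

```python
import math


def decode_prediction(raw, grid_x, grid_y, stride):
    """Turn (dx, dy, log_w, log_h) at grid cell (grid_x, grid_y) into
    a box (cx, cy, w, h) in input-image pixels. No anchor box is involved."""
    dx, dy, log_w, log_h = raw
    cx = (grid_x + dx) * stride      # center = cell index plus predicted offset
    cy = (grid_y + dy) * stride
    w = math.exp(log_w) * stride     # size is predicted directly, in stride units
    h = math.exp(log_h) * stride
    return cx, cy, w, h


# One prediction at cell (10, 5) of a stride-8 feature map.
box = decode_prediction((0.5, 0.5, 0.0, 0.0), grid_x=10, grid_y=5, stride=8)
print(box)
```

Because each location yields a single box directly, there are no anchor sizes or aspect ratios to cluster or tune for a particular dataset.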

Multi positives

If the assignment rule of YOLOv3 were kept as is, only the single center location would be selected as a positive sample, which effectively discards all the other predictions around it. YOLOX therefore assigns the whole 3x3 area around the center location as positive samples, so that these high-quality predictions can also contribute (the center sampling technique of FCOS). Increasing the number of positive samples in this way partly offsets the severe class imbalance between positives and negatives.
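A minimal sketch of the 3x3 positive region follows. The function name `center_positive_cells`, the stride, and the grid size are assumptions for illustration; cells that would fall outside the feature map are simply dropped.

```python
def center_positive_cells(gt_cx, gt_cy, stride, grid_w, grid_h, radius=1):
    """Return the (row, col) cells in the (2*radius+1)^2 window around
    the cell that contains the ground-truth center."""
    col = int(gt_cx // stride)
    row = int(gt_cy // stride)
    cells = []
    for r in range(row - radius, row + radius + 1):
        for c in range(col - radius, col + radius + 1):
            if 0 <= r < grid_h and 0 <= c < grid_w:  # stay inside the map
                cells.append((r, c))
    return cells


# Ground-truth center at pixel (100, 60) on a hypothetical stride-8, 80x80 map.
positives = center_positive_cells(100, 60, stride=8, grid_w=80, grid_h=80)
print(len(positives), positives)
```

With `radius=0` the function reduces to the single-positive rule described for YOLOv3, which makes the difference between the two assignment schemes easy to compare.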

SimOTA


Label assignment decides which samples are positive and which are negative with respect to each ground truth object. YOLOX improves label assignment in object detection by assigning positive or negative to each prediction location. Anchor-free methods treat the central part of a ground truth box as positive, but a problem arises when several labels overlap at the same location. In such cases the assignment needs to be made globally rather than point by point, and the authors use SimOTA to optimize it. OTA (Optimal Transport Assignment) finds the optimal assignment with methods such as Sinkhorn-Knopp iteration, but this iteration adds about 25% extra training time. That is a considerable overhead for YOLOX, which trains for about 300 epochs, so the authors apply Simple OTA (SimOTA), which performs the assignment without iteration; it raises AP from 45.0% to 47.3%. The cost function between a ground truth and a prediction is as follows.

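$$c_{ij} = L_{ij}^{cls} + \lambda L_{ij}^{reg}$$

Here $L_{ij}^{cls}$ and $L_{ij}^{reg}$ are the classification loss and regression loss between ground truth $g_i$ and prediction $p_j$, and $\lambda$ is a balancing coefficient.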
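The cost-based assignment described in this section can be sketched as follows. This is a simplified illustration, not the authors' implementation: the value `lam=3.0`, the use of `-log(IoU)` as the regression term, the dynamic k taken from the sum of the ten largest IoUs, and the tie-breaking rule for predictions claimed by two ground truths are all assumptions made for this example, and the restriction of candidates to a center region is omitted. `simota_assign` is a hypothetical name.

```python
import math


def simota_assign(cls_cost, ious, lam=3.0, topk_for_k=10):
    """cls_cost[g][p] and ious[g][p] are per ground-truth, per prediction.
    Returns {prediction index: ground-truth index} for positive samples."""
    num_gt, num_pred = len(ious), len(ious[0])
    eps = 1e-8
    # Pairwise cost: classification term plus weighted regression term.
    cost = [
        [cls_cost[g][p] + lam * -math.log(ious[g][p] + eps) for p in range(num_pred)]
        for g in range(num_gt)
    ]
    assignment = {}
    for g in range(num_gt):
        # Dynamic k: ground truths with many well-overlapping predictions get more positives.
        top_ious = sorted(ious[g], reverse=True)[:topk_for_k]
        dynamic_k = max(1, int(sum(top_ious)))
        ranked = sorted(range(num_pred), key=lambda p: cost[g][p])
        for p in ranked[:dynamic_k]:
            # If two ground truths claim one prediction, keep the cheaper match.
            if p not in assignment or cost[g][p] < cost[assignment[p]][p]:
                assignment[p] = g
    return assignment


# Two ground truths, four candidate predictions (made-up numbers).
ious = [[0.90, 0.80, 0.10, 0.05],
        [0.10, 0.20, 0.85, 0.30]]
cls_cost = [[0.20, 0.30, 2.00, 2.50],
            [2.00, 1.80, 0.25, 1.00]]
result = simota_assign(cls_cost, ious)
print(result)
```

No iterative solver is involved: each ground truth simply takes its k lowest-cost predictions, which is what removes the iteration overhead mentioned above.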

4. Experiment & Result

Experimental setup

  • Dataset

    • COCO train2017

  • Baselines

    • YOLOv3-SPP + DarkNet53

  • Training setup

    • Learning rate: lr × BatchSize / 64, with initial lr = 0.01

    • batch size: 128

    • weight decay: 0.0005, SGD momentum: 0.9
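The learning rate rule in the training setup above is a linear scaling with batch size. A small worked example, using the listed values (the helper name `scaled_lr` is hypothetical):

```python
def scaled_lr(base_lr, batch_size, reference_batch=64):
    """Linear scaling rule: lr x BatchSize / 64."""
    return base_lr * batch_size / reference_batch


effective_lr = scaled_lr(base_lr=0.01, batch_size=128)
print(effective_lr)  # 0.01 * 128 / 64
```

With a batch size of 128, the effective learning rate is twice the initial value.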

Result


YOLOX is a state-of-the-art model that won first place in the Streaming Perception Challenge (WAD at CVPR 2021) using a single model, and it achieves a higher AP than all previous models in the YOLO series. As with earlier YOLO models, there is a trade-off between speed and accuracy, but compared with other models YOLOX attains both high accuracy and high FPS.

5. Conclusion

The authors introduce YOLOX, which applies recent object detection techniques to YOLO. Building on recent research such as the decoupled head, multi positives, SimOTA, and strong augmentation, they effectively improve a YOLOv3-based model, and the same techniques also yield meaningful gains when applied to YOLOv5. The anchor-free design supports more general performance and is significant in that it lets practitioners train the model easily, without tuning the many anchor-related hyperparameters.

Take home message

YOLOX effectively improves a YOLOv3-based model by drawing on recent research such as the decoupled head, anchor-free detection, multi positives, SimOTA, and strong augmentation, and the same techniques also yield meaningful gains when applied to YOLOv5.

Author / Reviewer information

Author

Jiyun Park (박지윤)

  • Affiliation: KAIST Graduate School of Culture & Technology

  • Contact : june@kaist.ac.kr

Reference & Additional materials

  1. Citation

    • Ge, Z., Liu, S., Wang, F., Li, Z., & Sun, J. (2021). Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430.

    • Wu, Y., Chen, Y., Yuan, L., Liu, Z., Wang, L., Li, H., & Fu, Y. (2020). Rethinking classification and localization for object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10186-10195).

    • Redmon, J., & Farhadi, A. (2018). Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767.

    • Zhang, S., Chi, C., Yao, Y., Lei, Z., & Li, S. Z. (2020). Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9759-9768).

  2. References

    • https://danaing.github.io/computer-vision/2021/08/26/YOLOX.html
