MM-TTA [Kor]

Shin et al. / MM-TTA / CVPR 2022

Title & Description

MM-TTA: Multi-Modal Test-Time Adaptation for 3D Semantic Segmentation [Eng]

English version of this article is available.

1. Problem definition

Domain adaptation is the task of adapting a model trained on source data so that it fits target data.

Because source data is not always accessible, test-time adaptation has been attempted.

Methods designed for uni-modal semantic segmentation cannot be applied directly to the multi-modal setting.

This paper proposes a method that makes the most of the advantages of a multi-modality task.

2. Motivation

Test-time adaptation performs domain adaptation without access to source data. Test-time training updates model parameters through a proxy task, but it requires training samples, and finding the optimal proxy task is difficult. TENT is the first method that updates batch norm parameters without a proxy task, and it is simple and effective. However, because TENT works by minimizing entropy, it tends to raise the confidence of wrong predictions. S4T regularizes pseudo-labels, but it is limited to tasks where spatial augmentation is possible.

3D semantic segmentation is an important task that classifies each LiDAR point based on an understanding of the 3D scene. Methods that project 3D points onto the 2D image plane, voxelize the point cloud, or use SparseConvNet do not use 2D contextual information, yet this 2D context is very important for understanding complex semantics.

Idea

To address these shortcomings, multi-modal 3D segmentation is being studied. In multi-modal semantic segmentation it is important to fuse the two sources of information well: RGB carries contextual information, while the point cloud carries geometric information. Dataset bias appears as a style distribution in 2D data and as a point distribution in 3D data, which makes domain adaptation of a multi-modality model more difficult. This paper studies how the two modality models of multi-modal 3D semantic segmentation can learn jointly in the test-time adaptation setting.

3. Method

Intra-modal pseudo label generation

The paper proposes a module called Intra-PG, which produces reliable online pseudo-labels within each modality. It uses two models that are updated at different paces: the Fast model updates its batch normalization statistics directly, and the Slow model is momentum-updated from the Fast model (Eq. 6). The Fast model adapts aggressively and the Slow model gradually, so together they provide stable and complementary supervisory signals. The two models are fused by averaging their logits. At inference time, only the Slow model is used.
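As a rough illustration of the slow/fast idea in Intra-PG, the following sketch shows a momentum update of the slow model's statistics and the averaging of the two models' logits. It is not the authors' implementation: the function names, the momentum value of 0.1, and the representation of statistics as flat lists are all assumptions made for the example.

```python
def momentum_update(slow_stats, fast_stats, momentum=0.1):
    """Move each slow-model statistic a fraction `momentum` toward the fast model.

    `momentum=0.1` is a hypothetical value chosen for illustration.
    """
    return [(1.0 - momentum) * s + momentum * f
            for s, f in zip(slow_stats, fast_stats)]


def fuse_logits(slow_logits, fast_logits):
    """Average the per-class logits of the slow and fast models."""
    return [(s + f) / 2.0 for s, f in zip(slow_logits, fast_logits)]


# Toy batch-norm means for two channels (made-up numbers).
slow_mean = [0.0, 1.0]
fast_mean = [1.0, 3.0]
slow_mean = momentum_update(slow_mean, fast_mean, momentum=0.1)

# Toy per-class logits for one point from each model.
fused = fuse_logits([2.0, 0.0, -1.0], [4.0, 1.0, -1.0])
print(slow_mean, fused)
```

With a small momentum the slow model changes only slightly per step, which is what makes it the more stable of the two.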

Inter-modal pseudo label refinement

The paper proposes a module called Inter-PR, which improves the pseudo-labels through cross-modal fusion. It uses the consistency between the two models that are updated at different paces to decide which modality's output to take as the pseudo-label. There are two ways to choose the modality: hard selection takes the output of the modality whose two models are more consistent as is, while soft selection computes the pseudo-label from a weighted sum of the two modalities' outputs. Consistency is measured as the inverse of the KL divergence. If the consistency of both modalities is below a given threshold, the pseudo-label is ignored. The loss function is then defined on these refined pseudo-labels.
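The selection rule in Inter-PR can be sketched as follows for a single point. This is a minimal sketch under stated assumptions, not the paper's code: the KL direction (slow relative to fast), the `eps` smoothing, the normalization of the soft weights by their sum, and the names `consistency` and `refine_pseudo_label` are all illustrative choices.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def consistency(slow_logits, fast_logits, eps=1e-8):
    """Inverse KL divergence between slow and fast predictions of one modality."""
    p, q = softmax(slow_logits), softmax(fast_logits)
    kl = sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    return 1.0 / (max(kl, 0.0) + eps)


def refine_pseudo_label(logits_2d, logits_3d, w_2d, w_3d, threshold, mode="hard"):
    """Return a class index, or None when both modalities are too inconsistent."""
    if max(w_2d, w_3d) < threshold:
        return None  # pseudo-label is ignored
    if mode == "hard":
        # Take the more consistent modality as is.
        chosen = logits_2d if w_2d >= w_3d else logits_3d
    else:
        # Soft selection: consistency-weighted sum of the two outputs.
        total = w_2d + w_3d
        chosen = [(w_2d * a + w_3d * b) / total
                  for a, b in zip(logits_2d, logits_3d)]
    return max(range(len(chosen)), key=lambda c: chosen[c])


# Toy point: the 2D branch is self-consistent, the 3D branch is not.
slow_2d, fast_2d = [2.0, 0.0], [2.1, 0.0]
slow_3d, fast_3d = [0.0, 3.0], [1.5, 0.0]
w_2d = consistency(slow_2d, fast_2d)
w_3d = consistency(slow_3d, fast_3d)
logits_2d = [(s + f) / 2.0 for s, f in zip(slow_2d, fast_2d)]
logits_3d = [(s + f) / 2.0 for s, f in zip(slow_3d, fast_3d)]
hard = refine_pseudo_label(logits_2d, logits_3d, w_2d, w_3d, threshold=1.0, mode="hard")
soft = refine_pseudo_label(logits_2d, logits_3d, w_2d, w_3d, threshold=1.0, mode="soft")
ignored = refine_pseudo_label(logits_2d, logits_3d, w_2d, w_3d, threshold=1e6)
print(hard, soft, ignored)
```

In this toy case the 2D branch dominates both the hard and the soft selection because its slow and fast predictions barely differ.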

Q. In the hard selection of Inter-PR, the modality with the higher consistency between the two models is taken. Are the two models the fast model and the slow model? That is, is the consistency between the fast and slow models computed for each modality, and the more consistent modality selected? If so, I am curious why the consistency of the two models was chosen as the criterion. If the two models are not consistent, is the modality regarded as unstable?

A. I think the paper considers the consistency between the fast and slow models to compensate for the TTA setting, in which source data is not accessible. In UDA, the model keeps training on source data as well, which prevents it from drifting toward merely reducing the loss on the test set instead of learning the overall structure of the task. Since this is impossible in TTA, the method appears to adapt to the test data without deviating far from the predictions of the model trained on source data.

4. Experiment & Result

Experimental setup

Dataset

The A2D2 dataset was collected with a 2.3 megapixel camera and a 16-channel LiDAR, and SemanticKITTI with a 0.7 megapixel camera and a 64-channel LiDAR. nuScenes is used for the real-world case: images collected during the day have clearly different lighting conditions from those collected at night. Synthia-to-SemanticKITTI is used to evaluate test-time adaptation between synthetic and real data.

Baselines

Self-learning with entropy was proposed in TENT and works by reducing the entropy of the model's predictions. Only the Fast model is used in this experiment. Because this loss merely sharpens the predicted distribution, it can reinforce wrong predictions, and it does not take cross-modal consistency into account.
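To see why an entropy objective cannot tell a right prediction from a wrong one, consider the small sketch below. It only computes the entropy of a softmax prediction; the helper names and the toy logits are assumptions for illustration and are not taken from TENT's code.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs, eps=1e-12):
    """Shannon entropy of a predicted class distribution (in nats)."""
    return -sum(p * math.log(p + eps) for p in probs)


# A uniform prediction and a sharp prediction over three classes.
uncertain = softmax([1.0, 1.0, 1.0])
confident = softmax([5.0, 0.0, 0.0])

h_uncertain = entropy(uncertain)
h_confident = entropy(confident)
print(h_uncertain, h_confident)
```

The sharp prediction has low entropy whether or not class 0 is the true label, since the label never enters the computation. Minimizing this quantity therefore rewards confidence, not correctness.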

Q. The text says that the consistency between the two modalities cannot be computed. Is cross-modal consistency preserved even without a separate penalty? Or is cross-modal consistency unimportant here because this work selects the more consistent of the two modalities?

A. The reason the consistency between the two modalities cannot be measured properly is, again, that source data is not accessible. A good example is when both modalities make the same wrong prediction: because the predictions are consistent, they are not penalized even though they are wrong. Therefore, rather than considering the consensus between the two modalities' predictions, the paper takes the output of the more consistent modality as the pseudo-label, as in the answer to the first question above, so that both modalities are trained toward the same prediction.

Q. Among the baselines, can I understand the self-learning models based on entropy, consistency, and pseudo-labels as TENT, xMUDA, and MM-TTA, respectively? I am confused about which baseline model corresponds to each category.

A. It is correct that TENT is the entropy-based method and xMUDA is the consistency-based method, but xMUDA also has a setting that uses pseudo-labels: in addition to cross-modal consistency, it performs self-training with pseudo-labels within each modality. The core of MM-TTA, the method proposed in this paper, is pseudo-label generation through the interaction between the two modalities.

Self-learning with consistency performs multi-modal test-time adaptation by increasing the consistency between the two modality models. Unlike methods such as xMUDA, which can regularize with source data, MM-TTA assumes that source data is not accessible. As a result, when wrong predictions occur, the consistency between the two modalities may not be measured properly.

Self-learning with pseudo-labels trains with a segmentation loss. The pseudo-labels are obtained by thresholding the predictions, as in Eq. 4. Because this baseline only trains the batch normalization statistics and performs no refinement between the pseudo-labels of the two modalities, it is not an optimal training method.
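The pseudo-label baseline can be sketched as thresholding each prediction and computing a cross-entropy loss only on the points that pass. This is a simplified sketch: the threshold of 0.9, the `IGNORE` index of -1, and the function names are hypothetical, and the exact form of Eq. 4 in the paper may differ.

```python
import math

IGNORE = -1  # hypothetical index for points without a pseudo-label


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(logits, threshold=0.9):
    """Keep the argmax class only if its probability reaches the threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda c: probs[c])
    return best if probs[best] >= threshold else IGNORE


def segmentation_loss(batch_logits, labels):
    """Mean cross-entropy over points whose pseudo-label is not ignored."""
    losses = [-math.log(softmax(logits)[y])
              for logits, y in zip(batch_logits, labels) if y != IGNORE]
    return sum(losses) / len(losses) if losses else 0.0


# Two toy points: one confident prediction, one ambiguous prediction.
points = [[4.0, 0.0, 0.0], [0.5, 0.4, 0.3]]
labels = [pseudo_label(logits) for logits in points]
loss = segmentation_loss(points, labels)
print(labels, loss)
```

Only the confident point contributes to the loss, and its label comes from a single modality with no cross-modal check, which is the weakness the paragraph above points out.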

Training Setup

The paper follows the setup of xMUDA, a two-stream multi-modal framework. The 2D branch is a U-Net with a ResNet34 encoder. The 3D branch uses SparseConvNet or MinkowskiNet: the voxelized point cloud input is passed through a U-Net built with sparse convolutions.

For SparseConvNet, the official pre-trained xMUDA model is used for a fair comparison. For MinkowskiNet, the model is trained from scratch on source data.

TTA updates only the batch norm affine parameters, and performance is reported after one epoch of adaptation.

Evaluation metric

The authors use mIoU as the evaluation metric. mIoU is a metric commonly used in semantic segmentation. Computing it requires a confusion matrix, which counts how often each category pair occurs. Here a category pair is a combination of ground truth and prediction, so there are #class * #class combinations. For each class, the diagonal entry of the confusion matrix is the intersection, and all entries on the cross through it (its row and its column) form the union. Averaging the IoU over all classes gives the mIoU.
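The computation described above can be written out in a few lines. This is a minimal sketch: the layout `confusion[g][p]` (ground truth row, prediction column) and the choice to skip classes that never appear are assumptions of the example, and evaluation code for a given benchmark may handle absent classes differently.

```python
def mean_iou(confusion):
    """Compute mIoU from a square confusion matrix.

    confusion[g][p] is the number of points with ground truth g predicted as p.
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        intersection = confusion[c][c]
        row = sum(confusion[c])                       # all points labeled c
        col = sum(confusion[r][c] for r in range(n))  # all points predicted c
        union = row + col - intersection              # the "cross" through (c, c)
        if union > 0:  # skip classes absent from both labels and predictions
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy two-class example.
confusion = [
    [3, 1],  # ground truth 0: 3 correct, 1 predicted as class 1
    [1, 5],  # ground truth 1: 1 predicted as class 0, 5 correct
]
miou = mean_iou(confusion)
print(miou)
```

Here class 0 has IoU 3/5 and class 1 has IoU 5/7, and the mIoU is their average.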

Result

For UDA, the comparison uses the xMUDA framework, with a consistency loss and self-training on offline pseudo-labels. For the TTA baselines, TENT, xMUDA, and xMUDA_PL are evaluated. TENT is also extended to two modalities by minimizing the entropy of the ensemble of the 2D and 3D logits.

MM-TTA outperforms all baselines, and the methods based on entropy and pseudo-labels outperform those based on the consistency loss. This is because capturing the consistency between modalities is difficult for A2D2-to-SemanticKITTI and Synthia-to-SemanticKITTI. Some TTA baselines improve the 2D and 3D performance individually, yet their ensemble result is worse than that of the source-only model. This is because they are not designed for the two modality outputs to learn jointly.

In nuScenes Day-to-Night, the domain gap is larger for RGB than for LiDAR, so how much the 2D branch improves is what matters. Inter-PR contributes here, and the results show its effect.

$xMUDA$ : consistency between the two modalities

$xMUDA_{PL}$ : consistency between the two modalities + intra-modal pseudo-labels

$TENT$ : self-training with entropy

$TENT_{ENS}$ : self-training with entropy, where entropy is minimized on the ensemble of the two modalities' logits

$\text{MM-TTA}$ : self-training with pseudo-labels generated through the interaction between the two modalities

5. Conclusion

This paper defines the problem of test-time adaptation for multi-modal 3D semantic segmentation. Rather than adopting existing methods with their limitations as they are, it proposes a novel approach that refines pseudo-labels both within and across modalities. The method does not analyze the characteristics of the 3D semantic segmentation task in depth, so there is room for further improvement. At the same time, the approach is applicable to any task that uses multi-modal supervisory signals.

Q. Test-time adaptation is being studied as a practical way to handle the unseen data that real scenarios require, and MM-TTA in particular seems to use a variety of sensor inputs in the multi-modal setting. When fusing data from different sensors, how exactly are differences in input rate and lack of synchronization handled?

A. This paper does not fuse the representations of the two modalities. Each modality makes its own prediction, and the prediction of the modality with the higher confidence is taken. The work puts more weight on the interaction between the two modalities than on actual synchronization in real time. How to synchronize inputs from different sensors seems to be a very good topic for future research.

Take home message

Test-time adaptation suits real-world scenarios, so it has attracted much recent research. Starting from this work, the community can expect performance gains from refining features so that they fit the task or the modality. The framework could also be used in other fields. As an early work on test-time adaptation, it should offer valuable insight to the wider machine learning community.

Author / Reviewer information

Author

**류형곤 (Hyeonggon Ryu)**

  • Affiliation (KAIST)

  • Contact information (gonhy.ryu@kaist.ac.kr)

Reference & Additional materials

  1. Inkyu Shin, Yi-Hsuan Tsai, Bingbing Zhuang, Samuel Schulter, Buyu Liu, Sparsh Garg, In So Kweon, Kuk-Jin Yoon. MM-TTA: Multi-Modal Test-Time Adaptation for 3D Semantic Segmentation. In CVPR, 2022.

  2. Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. In ICLR, 2021.

  3. Maximilian Jaritz, Tuan-Hung Vu, Raoul de Charette, Émilie Wirbel, and Patrick Pérez. xMUDA: Cross-modal unsupervised domain adaptation for 3D semantic segmentation. In CVPR, 2020.

  4. Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In CVPR, 2019.

  5. Benjamin Graham, Martin Engelcke, and Laurens Van Der Maaten. 3d semantic segmentation with submanifold sparse convolutional networks. In CVPR, 2018.
