NeRF [Kor]

1. Problem definition

The view synthesis problem that NeRF addresses is the following: given images of an object captured from a variety of camera angles, generate (predict) an image of that object as seen from a new angle. The figure below shows an example.

NeRF formulates this problem as follows. It takes a spatial coordinate $X = (x, y, z)$ and a viewing direction $d = (\theta, \phi)$ as input (a 5D coordinate in total), outputs the volume density and emitted color at that point, and then produces a 2D image from these outputs with classical rendering techniques. The rendered image is compared with the ground truth to compute a loss, and because every step is differentiable, the whole model is trained end to end.

Figure 1. The view synthesis problem

2. Motivation

- Neural 3D shape representations

Many recent studies aim to obtain a representation of 3D objects (a 3D shape representation). Typical approaches train a neural network that maps a 3D position $(x, y, z)$ to a signed distance function or an occupancy field. These methods, however, require ground-truth 3D geometry, which is very costly to obtain. To overcome this, methods that train using only 2D images have attracted attention, with the work of Niemeyer et al. and Sitzmann et al. as representative examples.

These methods achieved efficient and fairly accurate rendering using only 2D images. However, they were applied only to objects with relatively simple structure, and they tend to oversmooth objects with complex surface geometry. NeRF instead designs a neural network that encodes a 5D radiance field, which makes it possible to obtain photorealistic views even for high-resolution objects with complex structure.

- View synthesis and image-based rendering

Earlier methods rendered novel views from images densely sampled over many angles, whereas recent work mostly studies rendering from far fewer images (captured from only a handful of angles). The most popular approach uses a mesh-based representation. Differentiable rasterizers and path tracers make it possible to optimize a mesh representation directly with gradient descent. However, gradient-based mesh optimization based on image reprojection is very difficult: it easily gets stuck in local minima, and it requires a template mesh for initialization before optimization starts, which is usually unavailable in real-world settings.

Another family of methods is volumetric. These approaches work very well with gradient-based optimization and have therefore outperformed mesh-based methods by a wide margin. Representative methods use CNNs, but they have a very large number of parameters and either do not scale or perform poorly on objects with complex structure or high resolution. NeRF produces a continuous volume representation from a trained MLP while also greatly reducing the number of model parameters.

2.2. Main Idea

  • Whereas earlier methods tried to obtain a 3D object representation from the 3D position $(x, y, z)$ alone, NeRF adds a 2D viewing direction to the 3D position and uses the resulting 5D vector as input.

  • In the rendering step, NeRF does not integrate over a fixed discrete grid. It uses a stratified sampling approach, and a stronger variant called hierarchical volume sampling, to improve rendering quality.

  • Earlier models were biased toward learning low-resolution or structurally simple objects, so their performance dropped sharply on high-resolution, structurally complex objects. NeRF uses positional encoding to map the input into a higher-dimensional space, which lets it keep strong performance on such objects.

  • Finally, NeRF uses only an MLP instead of a CNN, which has the advantage of requiring far fewer parameters.

Each of these is described in detail below.

3. Method

3.1. Neural Radiance Field Scene Representation

First, NeRF learns an MLP $F_{\Theta} : (X, d) \rightarrow (c, \sigma)$ that takes a 3D position $X = (x, y, z)$ and a 2D viewing direction $d = (\theta, \phi)$ as input and outputs a color $c = (r, g, b)$ and a volume density $\sigma$.

The architecture of $F_{\Theta}$ is shown in the figure below. Input vectors are shown in green, intermediate hidden layers in blue, and output vectors in red. All layers are fully connected. Black arrows indicate layers with ReLU activation, orange arrows indicate layers without an activation function, and dashed black arrows indicate layers with sigmoid activation.

Figure 3. Neural network architecture

So that the object is represented well from every angle rather than only from particular viewpoints (multiview consistency), NeRF is designed as follows. First, the volume density $\sigma$ is predicted from the position $X$ alone: only $X$ is passed through the first 8 layers, which output the volume density. Next, the color is predicted from both the position and the viewing direction. The viewing direction $d$ is concatenated with the feature vector from the step that produced the volume density, and the result is passed through one more layer to obtain the view-dependent RGB color.
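The split between a position-only density branch and a direction-aware color branch can be sketched in NumPy as below. This is a minimal illustration under several assumptions that the text above does not state: the layer widths (256 and 128), the random initialization, the ReLU on the density output, and the viewing direction being passed as a 3D unit vector instead of $(\theta, \phi)$. The names `make_params` and `nerf_mlp` are hypothetical, and no training is performed.

```python
import numpy as np

def init_layer(rng, n_in, n_out):
    # Random weights and zero bias for one fully-connected layer (illustration only).
    w = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
    return w, np.zeros(n_out)

def make_params(rng, x_dim=3, d_dim=3, width=256, color_width=128):
    # width and color_width are assumed values, not taken from this review.
    dims = [x_dim] + [width] * 8
    return {
        "trunk": [init_layer(rng, dims[i], dims[i + 1]) for i in range(8)],
        "sigma": init_layer(rng, width, 1),
        "feature": init_layer(rng, width, width),
        "color": init_layer(rng, width + d_dim, color_width),
        "rgb": init_layer(rng, color_width, 3),
    }

def dense(layer, h):
    w, b = layer
    return h @ w + b

def nerf_mlp(params, x, d):
    # x: (N, 3) positions, d: (N, 3) unit viewing directions (assumed format).
    h = x
    for layer in params["trunk"]:            # 8 ReLU layers that see only the position
        h = np.maximum(dense(layer, h), 0.0)
    # Density depends on the position only; ReLU keeps it non-negative (assumption).
    sigma = np.maximum(dense(params["sigma"], h), 0.0)[..., 0]
    feat = dense(params["feature"], h)       # feature vector, no activation
    h = np.concatenate([feat, d], axis=-1)   # the viewing direction enters here
    h = np.maximum(dense(params["color"], h), 0.0)
    rgb = 1.0 / (1.0 + np.exp(-dense(params["rgb"], h)))  # sigmoid keeps RGB in [0, 1]
    return rgb, sigma
```

Because `d` is concatenated only after the density has been computed, calling `nerf_mlp` with the same positions and two different directions returns identical `sigma` values but different colors, which is the multiview-consistency property described above.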

3.2. Volume Rendering with Radiance Field

NeRF renders images with classical volume rendering. For a camera ray $r(t) = o + td$ that passes through the scene from the near bound $t_n$ to the far bound $t_f$, the expected color $C(r)$ is given by the following integral.

$$C(r) = \int_{t_n}^{t_f} T(t)\,\sigma(r(t))\,c(r(t), d)\,dt, \qquad T(t) = \exp\left(-\int_{t_n}^{t} \sigma(r(s))\,ds\right)$$

Here $T(t)$ is the accumulated transmittance along the ray from $t_n$ to $t$, that is, the probability that the ray travels that far without hitting anything.

One way to evaluate the integral for $C(r)$ is deterministic quadrature, which is commonly used when rendering discretized voxel grids. However, with deterministic quadrature the MLP would only ever be queried at a fixed, discrete set of locations, which limits the resolution of the representation and hurts quality. To address this, NeRF uses the stratified sampling approach below.

$$t_i \sim \mathcal{U}\left[t_n + \frac{i-1}{N}(t_f - t_n),\ t_n + \frac{i}{N}(t_f - t_n)\right]$$

The stratified sampling approach splits the integration range from $t_n$ to $t_f$ into $N$ bins, draws one sample uniformly at random from each bin, and uses these samples as the points at which the integral is evaluated. Although discrete samples are still used to approximate the integral, the MLP is evaluated at continuously varying positions over the course of optimization, so a continuous scene representation of the object can be learned.
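The sampling step, together with a discrete estimate of $C(r)$, can be sketched as follows. The compositing rule used here (per-sample opacity $1 - \exp(-\sigma_i \delta_i)$ weighted by the accumulated transmittance) is the standard quadrature for this kind of integral. The function names are hypothetical, and closing the last interval at `t_far` is an assumption made for this sketch.

```python
import numpy as np

def stratified_samples(rng, t_near, t_far, n_bins):
    # Split [t_near, t_far] into n_bins equal bins and draw one uniform sample per bin.
    edges = np.linspace(t_near, t_far, n_bins + 1)
    lower, upper = edges[:-1], edges[1:]
    return lower + (upper - lower) * rng.uniform(size=n_bins)

def composite(sigma, rgb, t, t_far):
    # sigma: (N,) densities, rgb: (N, 3) colors, t: (N,) sorted sample positions.
    delta = np.diff(t, append=t_far)           # spacing between adjacent samples
    alpha = 1.0 - np.exp(-sigma * delta)       # opacity contributed by each segment
    # Transmittance: probability that the ray reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    color = (weights[:, None] * rgb).sum(axis=0)
    return color, weights
```

With a very high constant density the first samples absorb nearly all of the weight and the ray takes their color, while zero density everywhere yields zero weights. The returned `weights` are also what hierarchical sampling reuses to decide where to place additional samples.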

3.3. Optimizing a Neural Radiance Field

Sections 3.1 and 3.2 covered the core components of NeRF. However, the authors report that these two components alone do not give satisfactory results (this is confirmed later in the Ablation Study). They therefore propose two additional techniques so that NeRF performs well on high-resolution, complex objects.

- Positional Encoding

The first is positional encoding. Although a neural network can in theory approximate any function, training $F_{\Theta}$ directly on the raw input performs poorly in regions of high resolution and complexity. This is because MLPs are biased toward regions of low resolution and complexity. (Here, high resolution and complexity is referred to as high-frequency, and low resolution and complexity as low-frequency.)

NeRF therefore maps the 5D input into a higher-dimensional space with high-frequency functions and feeds the result to the network. Composing $F_{\Theta}$ with the mapping $\gamma$ below, which lifts the input to a higher dimension, improved performance substantially.

$$\gamma(p) = \left(\sin(2^0 \pi p), \cos(2^0 \pi p), \ldots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p)\right)$$
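A sinusoidal encoding of this kind can be sketched in a few lines of NumPy. The function name `positional_encoding` and the choice to apply the mapping independently to every input coordinate are assumptions of this sketch; the number of frequencies is left as a parameter because the text above does not fix it.

```python
import numpy as np

def positional_encoding(p, num_freqs):
    # p: (..., D) input coordinates; returns (..., D * 2 * num_freqs).
    p = np.asarray(p, dtype=np.float64)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi   # 2^0 * pi, ..., 2^(L-1) * pi
    angles = p[..., None] * freqs                   # (..., D, L)
    # For each coordinate, interleave sin and cos at every frequency.
    enc = np.stack([np.sin(angles), np.cos(angles)], axis=-1)  # (..., D, L, 2)
    return enc.reshape(*p.shape[:-1], -1)
```

For example, a batch of 3D positions encoded with 10 frequencies becomes a batch of 60-dimensional vectors, and it is this lifted vector rather than the raw coordinate that the MLP receives.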

- Hierarchical volume sampling

The second is hierarchical volume sampling, which makes stratified sampling more efficient. In practice, creating $N$ bins and sampling uniformly from each bin to evaluate the integral is inefficient and does not perform well, because many samples fall in empty or occluded regions that do not contribute to the image. To compensate, NeRF trains two neural networks, a coarse network and a fine network. The coarse network is evaluated exactly as in the stratified sampling described above (with $N_c$ bins). The output of the coarse network tells us where along the ray the object is more likely to be. Using inverse transform sampling on that distribution, $N_f$ additional samples are drawn, and the fine network is evaluated on these together with the coarse samples, so that $C_f(r)$ is computed from $N_c + N_f$ samples in total. In short, rendering quality improves because more samples are allocated to the intervals where the object is likely to exist.
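The inverse transform sampling step can be sketched as below. The sketch assumes that the coarse pass has already produced one non-negative weight per bin (for instance, the compositing weights of the coarse samples) and that the bin edges of the stratified sampling are reused directly. Both the function name `sample_pdf` and this choice of bins are simplifications for illustration, not necessarily what the authors implement.

```python
import numpy as np

def sample_pdf(rng, bin_edges, weights, n_samples):
    # bin_edges: (N + 1,) edges along the ray, weights: (N,) non-negative bin weights.
    w = np.asarray(weights, dtype=np.float64) + 1e-5     # avoid an all-zero PDF
    pdf = w / w.sum()                                     # piecewise-constant PDF
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])         # (N + 1,)
    u = rng.uniform(size=n_samples)                       # uniform samples in [0, 1)
    # Find the bin whose CDF interval contains each u (inverse transform sampling).
    idx = np.clip(np.searchsorted(cdf, u, side="right") - 1, 0, len(w) - 1)
    frac = (u - cdf[idx]) / (cdf[idx + 1] - cdf[idx])     # position inside the bin
    return bin_edges[idx] + frac * (bin_edges[idx + 1] - bin_edges[idx])
```

The fine network would then be evaluated at the sorted union of the coarse and new samples, for example `np.sort(np.concatenate([t_coarse, t_new]))`, which gives the $N_c + N_f$ points used for $C_f(r)$.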

Overall, NeRF can be summarized as in the figure below.

Figure 2. Overview of NeRF

4. Experiment & Result

4.1. Experimental setup

- Dataset

  • Synthetic rendering datasets: Diffuse Synthetic 360°, Realistic Synthetic 360°

  • DeepVoxels dataset

- Baselines

  • Neural Volumes (NV)

  • Scene Representation Networks (SRN)

  • Local Light Field Fusion (LLFF)

- Training setup

  • batch size = 4096

  • Adam optimizer (lr = 5e-4, exponentially-decaying to 5e-5)

  • ํ•˜๋‚˜์˜ scene์— ๋Œ€ํ•ด 100-300k ์ •๋„์˜ iteration

  • single NVIDIA V100 GPU (ํ•˜๋ฃจ์—์„œ ์ดํ‹€ ์ •๋„ ๊ฑธ๋ฆผ)
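The learning-rate schedule in the list above can be sketched as a simple exponential interpolation between the two stated endpoints. The list only gives the start and end values, so decaying smoothly over the full run and the `total_steps` parameter are assumptions of this sketch, as is the name `learning_rate`.

```python
def learning_rate(step, total_steps, lr_init=5e-4, lr_final=5e-5):
    # Exponential decay from lr_init at step 0 to lr_final at total_steps.
    frac = min(max(step / total_steps, 0.0), 1.0)
    return lr_init * (lr_final / lr_init) ** frac
```

With this form the rate falls by the same factor over equal numbers of steps, and halfway through training it equals the geometric mean of the two endpoints.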

- Evaluation metric

  • PSNR

  • SSIM

  • LPIPS
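Of the three metrics, PSNR is simple enough to compute directly from the mean squared error, as in the sketch below. It assumes images stored as floating-point arrays with values in $[0, 1]$, and the function name `psnr` is hypothetical. SSIM and LPIPS need windowed statistics and a pretrained network respectively, so they are not sketched here.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    # Peak signal-to-noise ratio in dB; higher means closer to the target image.
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0.0:
        return float("inf")   # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

As a quick check, an image that differs from its target by 0.1 at every pixel has a mean squared error of 0.01 and therefore a PSNR of about 20 dB.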

4.2. Result

Across the 3 datasets, NeRF outperformed all of the existing baselines on every metric except one. In the rendered images as well, the other baselines were often over-smoothed, whereas NeRF's renderings were close to the ground truth.

Table 1. NeRF performance
Figure 1. NeRF performance

4.3. Ablation Study

The ablation study was conducted on Realistic Synthetic 360°. It confirms that positional encoding, view dependence, and hierarchical sampling are each effective.

Table 2. Ablation Study

5. Conclusion

NeRF trains an MLP that computes color and volume density from a 5D vector, formed by adding a 2D viewing direction to a 3D position, and then applies an existing rendering method to the MLP outputs to produce a 2D image of the object as seen from any desired viewpoint. In the rendering step, it avoids integrating over a fixed discrete grid and instead uses a stratified sampling approach and its stronger variant, hierarchical volume sampling, to improve rendering quality. It also uses positional encoding to map the input into a higher-dimensional space. With these techniques NeRF outperformed the existing baselines and kept strong performance on high-resolution, structurally complex objects, and it achieved this with a small number of parameters using only an MLP. Many follow-up papers that build on NeRF have appeared recently, and it will be interesting to see which techniques they add.

Take home message (Today's lesson)

To do well in AI, mathematics is very important.

Digging into a problem persistently can lead to good results.

Author / Reviewer information

Author

유태형 (Taehyung Yu)

  • KAIST AI

  • KAIST Data Mining Lab.

  • taehyung.yu@kaist.ac.kr


Reference & Additional materials

  1. Mildenhall, Ben, et al. "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis." European Conference on Computer Vision (ECCV). Springer, Cham, 2020.

