GIRAFFE

Niemeyer et al. / GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields / CVPR 2021 (oral, best paper award)


1. Problem definition

Through GANs (Generative Adversarial Networks), we have become able to generate realistic images at random, and even to control individual attributes (hair color, facial features, etc.) independently. These models, however, hit a limit because they represent the 3D world in 2D, and recent research has focused on modeling 3D representations effectively. The most prominent approach is the implicit neural representation introduced in Section 2, but work so far has performed well only on images containing a single object or little complexity. This paper proposes a generative model that treats each object as a separate component of the 3D representation, and thereby performs well even on complex images containing multiple objects.

2. Motivation

Implicit Neural Representation (INR)

๊ธฐ์กด ์ธ๊ณต์‹ ๊ฒฝ๋ง(neural network) ์€ ์ถ”์ •(ex. image classification) ๊ณผ ์ƒ์„ฑ(ex. generative models) ์˜ ์—ญํ• ์„ ์ˆ˜ํ–‰ํ•˜์˜€๋‹ค. ์ด์— ๋ฐ˜ํ•ด Implicit representation ์€ ํ‘œํ˜„์˜ ๊ธฐ๋Šฅ์„ ์ˆ˜ํ–‰ํ•˜์—ฌ, network parameter ์ž์ฒด๊ฐ€ ์ด๋ฏธ์ง€ ์ •๋ณด๋ฅผ ์˜๋ฏธํ•˜๊ฒŒ ๋œ๋‹ค. ๊ทธ๋ž˜์„œ ๋„คํŠธ์›Œํฌ์˜ ํฌ๊ธฐ๋Š” ์ •๋ณด์˜ ๋ณต์žก๋„์— ๋น„๋ก€ํ•˜๊ฒŒ ๋œ๋‹ค (๋‹จ์ˆœํ•œ ์›๋ณด๋‹ค ๋ฒŒ์˜ ์‚ฌ์ง„์„ ๋‚˜ํƒ€๋‚ด๋Š” ๋ชจ๋ธ์ด ๋” ๋ณต์žกํ•˜๋‹ค). ๋” ๋‚˜์•„๊ฐ€ NeRF ์—์„œ ์ฒ˜๋Ÿผ ์ขŒํ‘œ๊ฐ€ ์ž…๋ ฅ๊ฐ’์œผ๋กœ ๋“ค์–ด์™”์„ ๋•Œ RGB ๊ฐ’์„ ์‚ฐ์ถœํ•˜๋Š” ์—ฐ์†์ ์ธ ํ•จ์ˆ˜๋ฅผ ํ•™์Šตํ•จ์œผ๋กœ์จ ์—ฐ์†์ ์ธ ํ‘œํ˜„๋„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ๋˜์—ˆ๋‹ค.

  • NeRF : Neural Radiance Field $f_{\theta}: R^{L_x}\times R^{L_d}\to R^+ \times R^3, \quad (\gamma(x),\gamma(d)) \mapsto (\sigma, c)$

    ํ•˜๋‚˜์˜ ์žฅ๋ฉด์€ 5D ์ขŒํ‘œ (3d ์œ„์น˜์™€ ๋ฐฉํ–ฅ) ์— ๋Œ€ํ•œ RGB ๊ฐ’๊ณผ ๋ถ€ํ”ผ intensity ์„ ์‚ฐ์ถœํ•˜๋Š” fully connected layer ๋กœ ํ‘œํ˜„๋œ๋‹ค. ์ด๋•Œ ๋” ๋†’์€ ์ฐจ์›์˜ ์ •๋ณด๋ฅผ ์–ป๊ธฐ ์œ„ํ•ด 5D ์ž…๋ ฅ๊ฐ’์€ positional encoding ฮณ(x)\gamma(x) ์„ ๊ฑฐ์น˜๊ฒŒ ๋œ๋‹ค. ํŠน์ • ๋ฐฉํ–ฅ์—์„œ ๋น›์„ ์˜์•˜์„ ๋•Œ ์ƒ๊ธฐ๋Š” camera ray ๋‚ด์˜ ์ ์„ n ๊ฐœ ์ƒ˜ํ”Œ๋งํ•œ ํ›„, ๊ฐ๊ฐ์˜ color ์™€ density ๊ฐ’์„ volume rendering technique (3์žฅ Methods ์— ์„ค๋ช…) ์„ ํ†ตํ•ด ํ•ฉ์นจ์œผ๋กœ์จ ์ด๋ฏธ์ง€ pixel ์˜ ๊ฐ’์„ ์˜ˆ์ธกํ•œ๋‹ค. ํ•™์Šต์€ GT(ground truth) posed ์ด๋ฏธ์ง€์™€ ์˜ˆ์ธก๋œ volume rendered ์ด๋ฏธ์ง€ ๊ฐ„์˜ ์ฐจ์ด๋ฅผ ์ค„์ด๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ์ด๋ฃจ์–ด์ง„๋‹ค.

Figure 1: NeRF architecture
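To make this pipeline concrete, here is a minimal sketch, not the authors' code, of stratified point sampling along one camera ray; `field_fn` is a toy stand-in for the trained MLP $f_\theta$, and the near/far bounds are arbitrary:

```python
import numpy as np

def sample_along_ray(origin, direction, near=0.5, far=6.0, n=64):
    """Stratified sampling of n points along a camera ray, as in NeRF."""
    bins = np.linspace(near, far, n + 1)[:-1]                      # lower bin edges
    t = bins + np.random.uniform(0.0, 1.0, n) * (far - near) / n   # jitter inside each bin
    points = origin[None, :] + t[:, None] * direction[None, :]     # (n, 3) sample positions
    return points, t

def field_fn(x):
    """Toy stand-in for f_theta: the density of a unit sphere, constant color."""
    sigma = np.maximum(0.0, 1.0 - np.linalg.norm(x, axis=-1))      # (n,) densities
    rgb = np.tile([0.8, 0.2, 0.2], (x.shape[0], 1))                # (n, 3) colors
    return sigma, rgb

points, t = sample_along_ray(np.zeros(3), np.array([0.0, 0.0, 1.0]))
sigma, rgb = field_fn(points)   # these get alpha-composited into one pixel (Section 3)
```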
  • GRAF : Generative Radiance Field $f_{\theta}: R^{L_x}\times R^{L_d}\times R^{M_s}\times R^{M_a} \to R^+\times R^3, \quad (\gamma(x),\gamma(d),z_s,z_a) \mapsto (\sigma, c)$

    ๋ณธ ๋…ผ๋ฌธ์€ NeRF ์™€ ๋‹ฌ๋ฆฌ unposed image ๋ฅผ ํ™œ์šฉํ•˜์—ฌ 3D representation ์„ ํ•™์Šตํ•œ๋‹ค. Input ์œผ๋กœ๋Š” sampling ๋œ camera pose ฮต (์œ„์ชฝ ๋ฐ˜๊ตฌ์—์„œ ์ค‘์‹ฌ์„ ๋ฐ”๋ผ๋ณด๋Š” ๋ฐฉํ–ฅ ์ค‘์—์„œ uniform ํ•˜๊ฒŒ sample) ๊ณผ sampling ๋œ K x K patch (unposed image ์—์„œ ์ค‘์‹ฌ์ด (u,v) ์ด๊ณ  scale ์ด s ์ธ K x K ์ด๋ฏธ์ง€) ๋ฅผ ๊ฐ€์ง„๋‹ค. ์ถ”๊ฐ€๋กœ, shape zsz_s ์™€ appearance zaz_a ์ฝ”๋“œ๋ฅผ condition ์œผ๋กœ ๋„ฃ์–ด์ฃผ์–ด, patch ์˜ pixel ๊ฐ’์„ ์˜ˆ์ธกํ•˜๊ณ , discriminator ์—์„œ predicted patch ๋Š” fake, ์ด๋ฏธ์ง€ ๋ถ„ํฌ์—์„œ sampling ๋œ image ์˜ ์‹ค์ œ K x K patch ๋Š” real ๋กœ ๋ถ„๋ฅ˜ํ•˜๋Š” ํ•™์Šต์„ ์ง„ํ–‰ํ•œ๋‹ค.

Figure 2: GRAF architecture
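A minimal sketch of the K x K patch sampling described above; the function name and the scale range are illustrative, and only the grid of pixel coordinates (at which rays would be cast) is computed:

```python
import numpy as np

def sample_patch_coords(H, W, K=32, s_range=(0.5, 1.0)):
    """Sample a K x K patch with random center (u, v) and scale s, GRAF-style.
    Returns the (K, K, 2) pixel coordinates of the patch in an H x W image."""
    s = np.random.uniform(*s_range)                      # patch scale
    span = s * (K - 1)                                   # patch extent in pixels
    u = np.random.uniform(span / 2, W - 1 - span / 2)    # patch center, x
    v = np.random.uniform(span / 2, H - 1 - span / 2)    # patch center, y
    xs = u + s * (np.arange(K) - (K - 1) / 2)            # K columns around u
    ys = v + s * (np.arange(K) - (K - 1) / 2)            # K rows around v
    return np.stack(np.meshgrid(xs, ys), axis=-1)        # (K, K, 2)

coords = sample_patch_coords(H=128, W=128)   # a real patch at such coords is "real"
```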

Idea

While GRAF achieves controllable, high-resolution image synthesis, it remains limited to relatively simple imagery containing a single object. To address this, GIRAFFE proposes a neural representation in which individual objects are distinguished and can be translated and rotated independently.

3. Method

Figure 3: GIRAFFE architecture
  • Neural Feature Field : Similar to the GRAF formulation, except that instead of a 3D color it outputs an $M_f$-dimensional feature:

    $h_{\theta}: R^{L_x} \times R^{L_d} \times R^{M_s} \times R^{M_a} \to R^+ \times R^{M_f}$

    Object Representation. In NeRF and GRAF, the entire scene is represented by a single model. To handle each object independently, this paper proposes representing every object with its own feature field. By additionally sampling the parameters of an affine transformation $T=\{s,t,R\}$ ($s$: scale, $t$: translation, $R$: rotation) from a dataset-dependent distribution, pose, shape, and appearance can all be controlled:

    $k(x)=R\cdot\begin{bmatrix} s_1 & & \\ & s_2 & \\ & & s_3 \end{bmatrix}\cdot x + t$

    Each field is evaluated in its object's canonical space: the discretely sampled scene points and viewing directions are mapped back through $k^{-1}$ before the field is queried:

    $(\sigma,f)=h_{\theta}(\gamma(k^{-1}(x)),\gamma(k^{-1}(d)),z_s,z_a)$

    Composition Operator. Each scene is defined by N entities (N-1 objects, 1 background). To combine the densities and features of all entities, a density-weighted mean is used (a code sketch follows the equation):

    $C(x,d)=\left(\sigma, \frac{1}{\sigma}\sum_{i=1}^{N}\sigma_i f_i\right), \quad \text{where} \quad \sigma=\sum_{i=1}^{N}\sigma_i$
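A minimal sketch of the object transformation $k(x)$ and the composition operator above, with toy densities and features; none of this is the authors' implementation:

```python
import numpy as np

def k(x, R, s, t):
    """Object-to-scene affine transform: k(x) = R @ diag(s) @ x + t."""
    return R @ np.diag(s) @ x + t

def k_inv(x, R, s, t):
    """Inverse transform; each field is queried at k^{-1}(x) in object space."""
    return np.diag(1.0 / s) @ R.T @ (x - t)

def compose(sigmas, feats, eps=1e-8):
    """Density-weighted mean over N entities:
    C(x, d) = (sum_i sigma_i, sum_i sigma_i f_i / sum_i sigma_i)."""
    sigma = float(np.sum(sigmas))
    f = (sigmas[:, None] * feats).sum(axis=0) / max(sigma, eps)
    return sigma, f

# Toy usage: N = 2 entities with M_f = 4 features at a single point.
sigmas = np.array([0.3, 1.2])
feats = np.random.randn(2, 4)
sigma, f = compose(sigmas, feats)
```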

    3D Volume Rendering. Whereas earlier work volume-renders RGB values, this paper renders the $M_f$-dimensional feature vector $f$. The $N_s$ points sampled along a camera ray $d$ are reduced to a final feature vector $f$ by the operator

    $\pi_{vol} : (R^+ \times R^{M_f})^{N_s} \to R^{M_f}$

    which performs the same numerical integration as NeRF:

    $f=\sum_{j=1}^{N_s}\tau_j\alpha_j f_j, \quad \tau_j=\prod_{k=1}^{j-1}(1-\alpha_k), \quad \alpha_j=1-e^{-\sigma_j\delta_j}$

    Here $\delta_j=\lVert x_{j+1}-x_j \rVert_2$ is the distance between neighboring sample points, and together with the density $\sigma_j$ it defines the alpha value $\alpha_j$. Accumulating the alpha values gives the transmittance $\tau_j$, and the final feature image is obtained by evaluating $\pi_{vol}$ for every pixel. For computational efficiency, the feature map is rendered at a resolution of $16^2$, well below the actual image resolutions of $64^2$ or $256^2$.
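The integration above maps directly to code; a sketch of $\pi_{vol}$ for a single ray, assuming per-ray arrays of composited densities $\sigma_j$, features $f_j$, and sample positions $x_j$:

```python
import numpy as np

def pi_vol(sigmas, feats, xs):
    """Volume-render N_s samples along one ray into a single feature vector:
    alpha_j = 1 - exp(-sigma_j * delta_j), tau_j = prod_{k<j}(1 - alpha_k),
    f = sum_j tau_j * alpha_j * f_j."""
    deltas = np.linalg.norm(np.diff(xs, axis=0), axis=-1)  # delta_j = ||x_{j+1} - x_j||_2
    deltas = np.append(deltas, 1e10)                       # conventional large last bin
    alphas = 1.0 - np.exp(-sigmas * deltas)                # alpha values
    taus = np.cumprod(np.append(1.0, 1.0 - alphas[:-1]))   # transmittance tau_j
    return ((taus * alphas)[:, None] * feats).sum(axis=0)  # (M_f,) feature

# Toy usage: 64 samples along a ray, M_f = 128 features each.
xs = np.linspace(0.5, 6.0, 64)[:, None] * np.array([0.0, 0.0, 1.0])
f = pi_vol(np.random.rand(64), np.random.randn(64, 128), xs)
```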

  • 2D neural rendering : To upsample the feature map to the full image resolution, 2D neural rendering is performed as shown in the figure below (a code sketch follows Figure 4).

    $\pi_\theta^{neural}: R^{H_v \times W_v \times M_f} \to R^{H \times W \times 3}$

Figure 4: 2D neural rendering architecture
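A simplified PyTorch stand-in for $\pi_\theta^{neural}$, assuming nearest-neighbor upsampling and 3x3 convolutions with leaky ReLU; the paper's renderer additionally uses per-resolution skip connections to RGB, which are omitted here for brevity:

```python
import torch
import torch.nn as nn

class NeuralRenderer(nn.Module):
    """Upsample an (M_f, 16, 16) feature image to an RGB image; each block
    doubles the spatial resolution and halves the channel count."""
    def __init__(self, m_f=128, n_up=2):
        super().__init__()
        layers, ch = [], m_f
        for _ in range(n_up):
            layers += [nn.Upsample(scale_factor=2, mode="nearest"),
                       nn.Conv2d(ch, ch // 2, 3, padding=1),
                       nn.LeakyReLU(0.2)]
            ch //= 2
        layers += [nn.Conv2d(ch, 3, 3, padding=1), nn.Sigmoid()]  # final RGB in [0, 1]
        self.net = nn.Sequential(*layers)

    def forward(self, feat):              # feat: (B, M_f, 16, 16)
        return self.net(feat)             # (B, 3, 64, 64) for n_up=2

img = NeuralRenderer()(torch.randn(1, 128, 16, 16))
```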
  • Training

    • Generator

    $G_\theta(\{z_s^i, z_a^i, T_i\}_{i=1}^N, \epsilon)=\pi_\theta^{neural}(I_v), \quad \text{where} \quad I_v=\{\pi_{vol}(\{C(x_{jk}, d_k)\}_{j=1}^{N_s})\}_{k=1}^{H_v \times W_v}$

    • Discriminator : CNN with leaky ReLU

    • Loss Function = non-saturating GAN loss + R1 regularization

    $V(\theta,\phi)=E_{z_s^i, z_a^i \sim N,\ \epsilon \sim p_\epsilon,\ T_i \sim p_T}\left[f(D_\phi(G_\theta(\{z_s^i,z_a^i,T_i\}_i,\epsilon)))\right] + E_{I\sim p_D}\left[f(-D_\phi(I)) - \lambda \lVert \nabla D_\phi(I) \rVert^2\right]$

    $\text{where} \quad f(t)=-\log(1+\exp(-t)), \quad \lambda=10$
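Since $f(t)=-\log(1+e^{-t})$ is `-softplus(-t)`, the objective reduces to the familiar non-saturating GAN losses plus an R1 penalty on real images. A sketch, where `D`, `real`, and `fake` are hypothetical modules/tensors:

```python
import torch
import torch.nn.functional as F

def d_loss(D, real, fake, lam=10.0):
    """Discriminator step: non-saturating loss + R1 gradient penalty (lambda = 10)."""
    real = real.detach().requires_grad_(True)
    d_real, d_fake = D(real), D(fake.detach())
    loss = F.softplus(-d_real).mean() + F.softplus(d_fake).mean()
    grad, = torch.autograd.grad(d_real.sum(), real, create_graph=True)
    r1 = grad.reshape(grad.shape[0], -1).pow(2).sum(dim=1).mean()  # ||grad D(I)||^2
    return loss + lam * r1

def g_loss(D, fake):
    """Generator step: maximize f(D(G(z))), i.e. minimize softplus(-D(fake))."""
    return F.softplus(-D(fake)).mean()
```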

4. Experiment & Result

Experimental setup

  • DataSet

    • commonly used single-object datasets: Chairs, Cats, CelebA, CelebA-HQ

    • single-object datasets known to be challenging: CompCars, LSUN Churches, FFHQ

    • multi-object scenes: Clevr-N, Clevr-2345

  • Baseline

    • voxel-based PlatonicGAN, BlockGAN, HoloGAN

    • radiance field-based GRAF

  • Training setup

    • Number of entities in a scene $N \sim p_N$; latent codes $z_s^i, z_a^i \sim N(0, I)$

    • Camera pose $\epsilon \sim p_{\epsilon}$ and transformations $T_i \sim p_T$ ⇒ $p_{\epsilon}$ and $p_T$ are assumed to be uniform distributions (over dataset-dependent ranges of camera elevation and object transformation)

    • All object fields share their MLP weights and use ReLU activations (objects use an 8-layer MLP with hidden dimension 128 and $M_f=128$; the background uses half of each)

    • $L_x = 2\cdot3\cdot10$ and $L_d = 2\cdot3\cdot4$ are used as the positional encoding parameters

    • ๊ฐ ray ๋”ฐ๋ผ 64 points๋ฅผ sample ํ•˜๊ณ  image ๋ณ„๋กœ 16216^2 pixels ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ณ„์‚ฐ ํšจ์œจ์„ฑ์„ ์–ป๋Š”๋‹ค.

  • Evaluation Metric

    • Frechet Inception Distance (FID) score computed with 20,000 real and fake samples
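FID fits a Gaussian to Inception features of real and generated images and measures the Frechet distance between the two fits. A minimal sketch, assuming the (N, D) feature matrices have already been extracted with an Inception network:

```python
import numpy as np
from scipy import linalg

def fid(feats_real, feats_fake):
    """||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^{1/2})."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    c_r = np.cov(feats_real, rowvar=False)
    c_f = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(c_r @ c_f).real        # drop tiny imaginary parts
    return float(((mu_r - mu_f) ** 2).sum() + np.trace(c_r + c_f - 2 * covmean))
```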

Result

  • disentangled scene generation : both separation from the background and disentanglement between individual factors work well

Figure 5: disentanglement
  • comparison to baseline methods

Figure 6: qualitative comparison
  • ablation studies

    • importance of 2D neural rendering and its individual components

      Figure 7: neural rendering architecture ablation

      The biggest difference from GRAF is that neural rendering is used together with volume rendering. This improves the model's expressiveness and lets it handle more complex real scenes. It also cuts rendering time relative to GRAF: from 110.1 ms to 4.8 ms for $64^2$-pixel images, and from 1595.0 ms to 5.9 ms at $256^2$ pixels.

    • positional encoding (a code sketch follows Figure 8)

      $r(t,L) = (\sin(2^0 t\pi), \cos(2^0 t\pi), \ldots, \sin(2^L t\pi), \cos(2^L t\pi))$

      Figure 8: positional encoding
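The encoding above in a minimal numpy sketch; each scalar coordinate is mapped to $2(L+1)$ values (interleaved sine/cosine pairs):

```python
import numpy as np

def positional_encoding(t, L):
    """r(t, L) = (sin(2^0 pi t), cos(2^0 pi t), ..., sin(2^L pi t), cos(2^L pi t))."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    freqs = (2.0 ** np.arange(L + 1)) * np.pi                  # 2^0 pi ... 2^L pi
    angles = np.outer(t, freqs)                                # (len(t), L+1)
    enc = np.stack([np.sin(angles), np.cos(angles)], axis=-1)  # pair sin/cos per octave
    return enc.reshape(t.shape[0], -1)                         # (len(t), 2(L+1))

x = positional_encoding([0.25, 0.5], L=9)   # each coordinate -> 20-dim embedding
```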

  • limitations

    • ๋ฐ์ดํ„ฐ ๋‚ด์— inherent bias ๊ฐ€ ์žˆ์œผ๋ฉด ๊ฐ™์ด ๋ณ€ํ™”ํ•ด์•ผํ•˜๋Š” factor ๋“ค์ด ๊ณ ์ •๋˜๋Š” ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•œ๋‹ค. (ex. ๋ˆˆ๊ณผ ํ—ค์–ด rotation)

    • Camera poses and object-level transformations are assumed to follow uniform distributions, which does not hold in practice and causes disentanglement failures such as the one below.

    Figure 9: limitation (disentanglement failure)

5. Conclusion

โ‡’ ํ•œ ์žฅ๋ฉด์„ compositional generative neural feature field ๋กœ ๋‚˜ํƒ€๋ƒ„์œผ๋กœ์จ, ๊ฐœ๋ณ„ object ๋ฅผ background ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ shape ๊ณผ appenarance ๋กœ๋ถ€ํ„ฐ disentangle ํ•˜์˜€๊ณ , ๋ณ„๋‹ค๋ฅธ supervision ์—†์ด ์ด๋ฅผ ๋…๋ฆฝ์ ์œผ๋กœ control ํ•  ์ˆ˜ ์žˆ๋‹ค.

โ‡’ Future work

  • Can the distributions of per-object transformations and camera poses be learned from data?

  • Leveraging easily obtainable supervision, such as object masks, should make it possible to handle more complex multi-object scenes.

Take home message

  • 3D scene representation with implicit neural representations is an approach that has recently been attracting a great deal of attention.

  • ๊ฐ๊ฐ์˜ entity ๋ฅผ ๊ฐœ๋ณ„ feature field ๋กœ ๋‚˜ํƒ€๋‚ด๋Š” ๊ฒƒ์€ ๊ทธ๋“ค์˜ movement ๋ฅผ disentangle ํ•˜๋Š”๋ฐ ๋„์›€์ด ๋œ๋‹ค.

  • ๊ฐ feature ๋ฅผ ์›๋ž˜ dimension ๊ทธ๋Œ€๋กœ๋กœ ์‚ฌ์šฉํ•˜๊ธฐ ๋ณด๋‹ค๋Š” positional encoding ์ด๋‚˜ neural rendering ์„ ํ†ตํ•ด ๋” high dimensional space ๋กœ embedding ํ•˜์—ฌ ํ™œ์šฉํ•˜๋ฉด ๋” ํ’๋ถ€ํ•œ ์ •๋ณด๋ฅผ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋œ๋‹ค.

Author information

๊น€์†Œํฌ(Sohee Kim)

  • KAIST AI

  • Contact: joyhee@kaist.ac.kr

