GIRAFFE [Eng]

Niemeyer et al. / GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields / CVPR 2021 (oral, best paper award)

To read this review written in Korean, click here.

1. Problem definition

Through Generative Adversarial Networks (GANs), people have succeeded in generating highly realistic images and even in learning disentangled representations without explicit supervision. However, operating purely in the 2D domain has limitations, because the world these images depict is three-dimensional. Recent investigations have started to incorporate 3D representations such as voxels or radiance fields, but they were restricted to single-object scenes and showed less consistent results at high resolution and on more complex images.

Therefore, this paper proposes incorporating a compositional 3D scene representation into the generative model, which leads to more controllable image synthesis.

2. Motivation

Implicit Neural Representation (INR)

Conventional neural networks (NNs) play a role in prediction tasks (e.g., image classification) and in generation (e.g., generative models). In an INR, however, the network parameters contain the information of the data itself, so the network size is proportional to the complexity of the data, which is especially beneficial for representing 3D scenes. In addition, since we learn a function, for example one that maps a coordinate to the RGB value of a single image, the data can be represented in a continuous form.
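As a toy illustration of this idea, the sketch below fits a single image as a function from a continuous 2D coordinate to an RGB color. It assumes PyTorch; the network width and the `ImageINR` name are arbitrary choices for the example, not something from the paper.

```python
# Minimal sketch of an implicit neural representation (INR) for a single image:
# an MLP that maps a continuous 2D coordinate (x, y) to an RGB value.
# Layer sizes and details are illustrative assumptions, not taken from the paper.
import torch
import torch.nn as nn

class ImageINR(nn.Module):
    def __init__(self, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 3), nn.Sigmoid(),  # RGB in [0, 1]
        )

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: (N, 2) continuous pixel coordinates, e.g. normalized to [-1, 1]
        return self.net(coords)

# Because the input is continuous, the image can be queried at any resolution.
model = ImageINR()
coords = torch.rand(1024, 2) * 2 - 1   # random query points in [-1, 1]^2
rgb = model(coords)                    # (1024, 3) predicted colors
```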

  • NeRF: Neural Radiance Field, $f_{\theta}: R^{L_x}\times R^{L_d}\to R^+ \times R^3, \quad (\gamma(x),\gamma(d)) \mapsto (\sigma, c)$

    A scene is represented by a fully-connected network whose input is a single continuous 5D coordinate (position + direction) passed through a positional encoding $\gamma(x)$ for higher-dimensional information (a small sketch of this encoding is given right after this list), and whose output is the volume density and the view-dependent RGB value (radiance). 5D coordinates are sampled along each camera ray $r(t)$, and the produced colors $c(r(t),d)$ and densities $\sigma(r(t))$ are composited into an image through the volume rendering technique (explained in Section 3). The loss function is the difference between the volume-rendered image and the ground-truth posed image.

  • GRAF: Generative Radiance Field, $f_{\theta}: R^{L_x}\times R^{L_d}\times R^{M_s}\times R^{M_a} \to R^+\times R^3, \quad (\gamma(x),\gamma(d),z_s,z_a) \mapsto (\sigma, c)$

    GRAF proposes an adversarial generative model for conditional radiance fields. Its input is a camera pose $\epsilon$ (sampled uniformly from the upper hemisphere facing the origin) and a $K \times K$ patch with center $(u,v)$ and scale $s$, sampled from the unposed input image. As conditions, a shape code $z_s$ and an appearance code $z_a$ are added, and the model (a fully-connected network with ReLU activations) outputs a predicted patch, just like the NeRF model. Next, the discriminator (a convolutional neural network) is trained to distinguish between the predicted patch and a real patch sampled from an image of the data distribution.
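The positional encoding $\gamma$ used by NeRF, GRAF, and GIRAFFE can be sketched as follows. The helper name and the choice of $L$ are illustrative, and whether the highest octave is $2^{L-1}$ or $2^L$ is a convention detail that varies between write-ups.

```python
# Sketch of the positional encoding gamma: each scalar coordinate t is mapped to
# sines and cosines at exponentially growing frequencies, giving the MLP access
# to high-frequency detail. The number of octaves L below is illustrative
# (the paper uses different values for positions and viewing directions).
import torch

def positional_encoding(t: torch.Tensor, L: int) -> torch.Tensor:
    # t: (..., D) coordinates; returns (..., D * 2 * L) encoded features
    freqs = (2.0 ** torch.arange(L)) * torch.pi        # 2^0 pi, ..., 2^(L-1) pi
    angles = t.unsqueeze(-1) * freqs                   # (..., D, L)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(start_dim=-2)                   # (..., D * 2L)

x = torch.rand(1024, 3)                 # 3D sample points along camera rays
gamma_x = positional_encoding(x, L=10)  # (1024, 60) input to the radiance/feature field
```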

Idea

GRAF achieves controllable image synthesis at high resolution, but it is restricted to single-object scenes, and the results tend to degrade on more complex imagery. Therefore, this paper proposes a model that is able to disentangle individual objects and allows translation and rotation of objects as well as changes of the camera pose.

3. Method

  • Neural Feature Field: replaces GRAF's 3D color output $c$ with an $M_f$-dimensional feature $f$: $h_{\theta}: R^{L_x} \times R^{L_d} \times R^{M_s} \times R^{M_a} \to R^+ \times R^{M_f}, \quad (\gamma(x),\gamma(d),z_s,z_a) \mapsto (\sigma, f)$

    Object Representation: In NeRF and GRAF, the entire scene is represented by a single model. In order to disentangle different entities in the scene, GIRAFFE represents each object in the scene with a separate feature field in combination with an affine transformation $k(x)=R\cdot\begin{bmatrix} s_1 & & \\ & s_2 & \\ & & s_3 \end{bmatrix}\cdot x + t,$ whose parameters $T=\{s,t,R\}$ ($s$: scale, $t$: translation, $R$: rotation) are sampled from a dataset-dependent distribution. Each feature field is evaluated in its own object space, $(\sigma,f)=h_{\theta}(\gamma(k^{-1}(x)),\gamma(k^{-1}(d)),z_s,z_a),$ so the model gains control over the pose, shape, and appearance of individual objects (a small sketch of this transformation is given at the end of this section). A 2D projection of the discretely sampled 3D scene is then created through volume rendering.

    Composition Operator: A scene is described as a composition of $N$ entities ($N-1$ objects and one background). The model uses a density-weighted mean to combine all features at $(x,d)$ (see the sketch at the end of this section):

    $C(x,d)=\big(\sigma, {1\over\sigma} \sum_{i=1}^{N}\sigma_i f_i\big), \quad \text{where} \quad \sigma = \sum_{i=1}^N\sigma_i$

    3D volume rendering: Unlike previous models that volume-render an RGB color value, GIRAFFE renders an $M_f$-dimensional feature vector $f$. Along a camera ray $d$, the model samples $N_s$ points, and the operator $\pi_{vol} : (R^+ \times R^{M_f})^{N_s} \to R^{M_f}$ maps them to the final feature vector $f$, using the same numerical integration method as in NeRF: $f=\sum_{j=1}^{N_s}\tau_j\alpha_j f_j, \quad \tau_j=\prod_{k=1}^{j-1}(1-\alpha_k), \quad \alpha_j=1-e^{-\sigma_j\delta_j}.$ Here $\delta_j=\lVert x_{j+1} - x_j \rVert_2$ is the distance between neighboring sample points and, together with the density $\sigma_j$, it defines the alpha value $\alpha_j$; accumulating the alpha values gives the transmittance $\tau_j$. The entire feature image is obtained by evaluating $\pi_{vol}$ at every pixel. For efficiency, the feature map is obtained at a $16^2$ resolution, which is lower than the output image resolution ($64^2$ or $256^2$ pixels).

  • 2D neural rendering: In order to upsample the feature map to a higher-resolution image, the paper uses a 2D neural rendering network, as shown in the figure below (a simplified sketch is given at the end of this section): $\pi_\theta^{neural}: R^{H_v \times W_v \times M_f} \to R^{H \times W \times 3}$

  • Training

    • Generator

    $G_\theta(\{z_s^i,z_a^i,T_i\}_{i=1}^N,\epsilon)=\pi_\theta^{neural}(I_v), \quad \text{where} \quad I_v=\{\pi_{vol}(\{C(x_{jk},d_k)\}_{j=1}^{N_s})\}_{k=1}^{H_v \times W_v}$

    • Discriminator : CNN with leaky ReLU

    • Loss Function = non-saturating GAN loss + R1-regularization

    $V(\theta,\phi)=E_{z_s^i,z_a^i \sim N,\, \epsilon \sim p_{\epsilon},\, T_i \sim p_T} \big[f(D_\phi(G_\theta(\{z_s^i,z_a^i,T_i\}_i,\epsilon)))\big] + E_{I\sim p_D}\big[f(-D_\phi(I))- \lambda \lVert \nabla D_\phi (I) \rVert^2 \big], \quad \text{where} \quad f(t)=-\log(1+e^{-t}), \quad \lambda=10$
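Below is a minimal sketch of the per-object affine transformation $k(x)$ and its inverse, as referenced above. It assumes PyTorch, and the way the rotation and the example values are specified is purely illustrative, not the paper's implementation.

```python
# Sketch of the per-object affine transformation k(x) = R diag(s) x + t and its
# inverse, which GIRAFFE uses to place each object's feature field in the scene.
# Shapes and the way R is provided (a plain rotation matrix) are assumptions.
import torch

def transform_to_scene(x: torch.Tensor, s: torch.Tensor, t: torch.Tensor, R: torch.Tensor) -> torch.Tensor:
    # x: (N, 3) points in the object's canonical space -> scene space
    return (R @ (s * x).unsqueeze(-1)).squeeze(-1) + t

def transform_to_object(x: torch.Tensor, s: torch.Tensor, t: torch.Tensor, R: torch.Tensor) -> torch.Tensor:
    # k^{-1}(x): map scene-space points (e.g. ray samples) back into the object's
    # canonical space before evaluating its feature field h_theta.
    return (R.transpose(-1, -2) @ (x - t).unsqueeze(-1)).squeeze(-1) / s

x_scene = torch.rand(1024, 3)                  # sample points along camera rays
s = torch.tensor([0.5, 0.5, 0.5])              # scale
t = torch.tensor([0.2, 0.0, 0.1])              # translation
R = torch.eye(3)                               # rotation (identity for the example)
x_obj = transform_to_object(x_scene, s, t, R)  # feed gamma(x_obj) into h_theta
```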
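The composition operator and the feature-based volume rendering can be illustrated with the following sketch. The tensor shapes (number of entities, rays, samples, feature dimension) are assumptions for the example, and `compose` / `volume_render` are hypothetical helpers, not the authors' code.

```python
# Sketch of GIRAFFE's composition operator (density-weighted mean of the N
# entities' features) followed by volume rendering of the composited feature
# along each ray, using the alpha/transmittance formulation quoted above.
import torch

def compose(sigmas: torch.Tensor, feats: torch.Tensor):
    # sigmas: (N, R, S)      densities of N entities at S samples on R rays
    # feats:  (N, R, S, M_f) corresponding feature vectors
    sigma = sigmas.sum(dim=0)                                         # (R, S)
    f = (sigmas.unsqueeze(-1) * feats).sum(dim=0) / (sigma.unsqueeze(-1) + 1e-10)
    return sigma, f

def volume_render(sigma: torch.Tensor, f: torch.Tensor, deltas: torch.Tensor) -> torch.Tensor:
    # alpha_j = 1 - exp(-sigma_j * delta_j); tau_j = prod_{k<j} (1 - alpha_k)
    alphas = 1.0 - torch.exp(-sigma * deltas)                         # (R, S)
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=-1)               # inclusive product
    trans = torch.cat([torch.ones_like(trans[..., :1]), trans[..., :-1]], dim=-1)
    weights = trans * alphas                                          # tau_j * alpha_j
    return (weights.unsqueeze(-1) * f).sum(dim=-2)                    # (R, M_f)

N, R, S, Mf = 3, 16 * 16, 64, 128       # entities, rays (pixels), samples, feature dim
sigmas, feats = torch.rand(N, R, S), torch.rand(N, R, S, Mf)
deltas = torch.full((R, S), 0.05)       # distances between neighboring samples
sigma, f = compose(sigmas, feats)
feature_image = volume_render(sigma, f, deltas)   # (R, Mf) -> reshape to 16 x 16 x Mf
```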
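A stripped-down sketch of the 2D neural rendering operator $\pi_\theta^{neural}$ is shown next. The real renderer in the paper uses upsampling blocks with skip connections to the RGB output, so this `NeuralRenderer` module only conveys the general idea of mapping a $16^2$ feature map to a higher-resolution image; it is not the paper's exact architecture.

```python
# Simplified sketch of pi_neural: upsample the low-resolution M_f-channel feature
# image to a higher-resolution RGB image with learned convolutions.
import torch
import torch.nn as nn

class NeuralRenderer(nn.Module):
    def __init__(self, feat_dim: int = 128, n_up: int = 2):
        super().__init__()
        layers, c = [], feat_dim
        for _ in range(n_up):                      # each block doubles the resolution
            layers += [
                nn.Upsample(scale_factor=2, mode="nearest"),
                nn.Conv2d(c, c // 2, kernel_size=3, padding=1),
                nn.LeakyReLU(0.2),
            ]
            c //= 2
        layers += [nn.Conv2d(c, 3, kernel_size=3, padding=1), nn.Sigmoid()]
        self.net = nn.Sequential(*layers)

    def forward(self, feat_img: torch.Tensor) -> torch.Tensor:
        # feat_img: (B, M_f, H_v, W_v) -> (B, 3, H_v * 2^n_up, W_v * 2^n_up)
        return self.net(feat_img)

renderer = NeuralRenderer(feat_dim=128, n_up=2)
feature_image = torch.rand(1, 128, 16, 16)   # 16x16 feature map from volume rendering
rgb = renderer(feature_image)                # (1, 3, 64, 64) rendered image
```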
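Finally, the training objective (non-saturating GAN loss with R1 gradient penalty on the real images, $\lambda = 10$) might be written as below. The sign conventions follow the usual non-saturating formulation rather than the value function $V$ literally, the toy discriminator is only there to make the snippet runnable, and whether the penalty carries an extra factor of $1/2$ depends on the implementation.

```python
# Sketch of the non-saturating GAN loss with R1 regularization on real images.
import torch
import torch.nn.functional as F

def generator_loss(d_fake: torch.Tensor) -> torch.Tensor:
    # maximize f(D(G(z)))  <=>  minimize softplus(-D(G(z)))
    return F.softplus(-d_fake).mean()

def discriminator_loss(d_real: torch.Tensor, d_fake: torch.Tensor,
                       real_images: torch.Tensor, lam: float = 10.0) -> torch.Tensor:
    loss = F.softplus(-d_real).mean() + F.softplus(d_fake).mean()
    # R1 regularization: penalize the gradient of D at real images
    grad = torch.autograd.grad(d_real.sum(), real_images, create_graph=True)[0]
    return loss + lam * grad.flatten(1).pow(2).sum(dim=1).mean()

# Toy usage with a dummy discriminator (a single linear layer over flattened pixels).
D = torch.nn.Linear(3 * 64 * 64, 1)
real = torch.rand(4, 3, 64, 64, requires_grad=True)   # real batch must require grad for R1
fake = torch.rand(4, 3, 64, 64)                       # stands in for G(z)
d_loss = discriminator_loss(D(real.flatten(1)), D(fake.flatten(1)), real)
g_loss = generator_loss(D(fake.flatten(1)))
```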

4. Experiment & Result

Experimental setup

  • DataSet

    • commonly used single-object datasets: Chairs, Cats, CelebA, CelebA-HQ

    • more challenging single-object datasets: CompCars, LSUN Churches, FFHQ

    • testing on multi-object scenes: Clevr-N, Clevr-2345

  • Baseline

    • voxel-based PlatonicGAN, BlockGAN, HoloGAN

    • radiance field-based GRAF

  • Training setup

    • number of entities in the scene $N \sim p_N$, latent codes $z_s^i, z_a^i \sim N(0,I)$

    • camera pose $\epsilon \sim p_{\epsilon}$, transformations $T_i \sim p_T$ ⇒ In practice, $p_{\epsilon}$ and $p_T$ are uniform distributions over data-dependent camera elevation angles and valid object transformations, respectively (a sketch of this sampling follows after this list)

    • All object fields share their weights and are parameterized as MLPs with ReLU activations (8 layers with a hidden dimension of 128 and $M_f=128$ for objects; half the layers and half the hidden dimension for the background field)

    • $L_x = 2\cdot 3\cdot 10$ and $L_d = 2\cdot 3\cdot 4$ output dimensions for the positional encodings of $x$ and $d$

    • sample 64 points along each ray and render feature images at $16^2$ pixels

  • Evaluation Metric

    • Frechet Inception Distance (FID) score with 20,000 real and fake samples
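For concreteness, drawing one set of generator inputs according to the training setup above might look like the sketch below. All concrete ranges and the latent dimension are placeholders rather than the paper's dataset-dependent values, and `sample_generator_inputs` is a hypothetical helper.

```python
# Sketch of sampling generator inputs: number of entities N, per-entity latent
# codes, camera pose, and object transformations, all drawn from simple priors.
import torch

def sample_generator_inputs(max_objects: int = 4, latent_dim: int = 256):
    n_obj = int(torch.randint(1, max_objects + 1, (1,)))        # N ~ p_N (background handled separately)
    z_s = torch.randn(n_obj, latent_dim)                        # shape codes ~ N(0, I)
    z_a = torch.randn(n_obj, latent_dim)                        # appearance codes ~ N(0, I)
    elevation = torch.empty(1).uniform_(0.1, 0.5)               # camera pose ~ uniform over valid elevations
    scales = torch.empty(n_obj, 3).uniform_(0.2, 0.5)           # T_i = {s, t, R} ~ uniform over valid transformations
    translations = torch.empty(n_obj, 3).uniform_(-0.5, 0.5)
    rotations = torch.empty(n_obj).uniform_(0.0, 2 * torch.pi)  # rotation about the vertical axis
    return n_obj, z_s, z_a, elevation, (scales, translations, rotations)
```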

Result

  • disentangled scene generation

  • comparison to baseline methods

  • ablation studies

    • importance of 2D neural rendering and its individual components

    • positional encoding

      $r(t,L) = (\sin(2^0 t\pi), \cos(2^0 t\pi), \ldots, \sin(2^L t\pi), \cos(2^L t\pi))$

  • limitations

    • struggles to disentangle factors of variation if there is an inherent bias in the data (e.g., eye and hair translation)

    • disentanglement failures due to mismatches between the assumed uniform distributions (over camera poses and object-level transformations) and their real distributions

5. Conclusion

⇒ By representing scenes as compositional generative neural feature fields, they disentangle individual objects from the background, as well as their shapes and appearances, without explicit supervision.

⇒ Future work

  • Investigate how the distributions over object-level transformations and camera poses can be learned from data

  • Incorporate supervision that is easy to obtain (e.g., object masks) in order to scale to more complex, multi-object scenes

Take home message

  • 3D scene representation through implicit neural representations is a recent trend that shows superior results.

  • Using an individual feature field for each entity helps disentangle their movements.

  • Rather than limiting the features to their original sizes (coordinate: 3, RGB: 3), using positional encoding or neural rendering helps represent the information more richly.

Author / Reviewer information

Author

김소희(Sohee Kim)

  • KAIST AI

  • Contact: joyhee@kaist.ac.kr

Reviewer

  1. Korean name (English name): Affiliation / Contact information

  2. Korean name (English name): Affiliation / Contact information

  3. ...

Reference & Additional materials
