GIRAFFE [Eng]

Niemeyer et al. / GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields / CVPR 2021 (oral, best paper award)

To read this review in Korean, please click here.

1. Problem definition

Through Generative Adversarial Networks (GANs), people have succeeded in generating highly realistic images and even in learning disentangled representations without explicit supervision. However, operating purely in the 2D domain has limitations due to the 3-dimensional nature of the world. Recent work has started to incorporate 3D representations such as voxels or radiance fields, but it has been restricted to single-object scenes and shows less consistent results at high resolutions and on more complex images.

This paper therefore proposes incorporating a compositional 3D scene representation into the generative model, leading to more controllable image synthesis.

2. Motivation

Related work

Implicit Neural Representation (INR)

Existing neural networks (NNs) have mainly been used for prediction tasks (e.g. image classification) and generation (e.g. generative models). In an INR, by contrast, the network parameters themselves contain the information of the data, so the network size is proportional to the complexity of the data, which is especially beneficial for representing 3D scenes. In addition, because we learn a function, for example one that maps a coordinate to the RGB value of a single image, we can represent the data in a continuous form.
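To make the continuous-representation idea concrete, below is a minimal PyTorch sketch (not from the paper; the class and function names are illustrative) of an INR that fits a single image: the network weights store the image, and afterwards the network can be queried at arbitrary continuous coordinates rather than only at the original pixel grid.

```python
import torch
import torch.nn as nn

class ImplicitImage(nn.Module):
    """Toy INR: maps a continuous 2D coordinate (x, y) in [-1, 1]^2 to an RGB value."""
    def __init__(self, hidden=128, layers=4):
        super().__init__()
        dims = [2] + [hidden] * layers
        blocks = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            blocks += [nn.Linear(d_in, d_out), nn.ReLU()]
        blocks += [nn.Linear(hidden, 3), nn.Sigmoid()]   # RGB in [0, 1]
        self.net = nn.Sequential(*blocks)

    def forward(self, coords):           # coords: (N, 2)
        return self.net(coords)          # (N, 3)

def fit_image(model, image, steps=2000, lr=1e-3):
    """image: (H, W, 3) tensor in [0, 1]. The single image itself is the training set."""
    H, W, _ = image.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
    target = image.reshape(-1, 3)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((model(coords) - target) ** 2).mean()
        loss.backward()
        opt.step()
    return model
```

Once fitted, the model can be evaluated at coordinates that never lay on the pixel grid, which is exactly the continuity property mentioned above.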

  • NeRF : Neural Radiance Field $\quad f_{\theta}: R^{L_x}\times R^{L_d}\to R^{+} \times R^{3}, \quad (\gamma(x),\gamma(d)) \to (\sigma, c)$

    A scene is represented using a fully connected network whose input is a single continuous 5D coordinate (position + direction), passed through a positional encoding $\gamma(x)$ for higher-dimensional information, and whose output is the volume density and the view-dependent RGB value (radiance). 5D coordinates are sampled along each camera ray $r(t)$, and the predicted color $c(r(t),d)$ and density $\sigma(r(t))$ are composited into an image through the volume rendering technique (explained in Section 3). As a loss function, the difference between the volume-rendered image and the ground-truth posed image is used.

  • GRAF : Generative Radiance Field $\quad f_{\theta}: R^{L_x}\times R^{L_d}\times R^{M_s}\times R^{M_a} \to R^{+}\times R^{3}, \quad (\gamma(x),\gamma(d),z_s,z_a) \to (\sigma, c)$

    It proposes an adversarial generative model for conditional radiance fields. The generator's inputs are a camera pose $\epsilon$ (sampled uniformly from the upper hemisphere facing the origin) and a K x K patch, with center $(u,v)$ and scale $s$, sampled from the unposed input image. As a condition, a shape code $z_s$ and an appearance code $z_a$ are added, and the model (a fully connected network with ReLU activations) outputs a predicted patch, just like the NeRF model (a schematic sketch of this conditional mapping is given below). The discriminator (a convolutional neural network) is then trained to distinguish the predicted patch from a real patch sampled from an image in the data distribution.
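Below is a schematic PyTorch sketch of the conditional radiance-field mapping $(\gamma(x),\gamma(d),z_s,z_a) \to (\sigma, c)$ described above. It is only meant to show the input/output structure; the layer counts and widths are illustrative and do not reproduce the authors' architectures.

```python
import torch
import torch.nn as nn

class ConditionalRadianceField(nn.Module):
    """Schematic GRAF-style mapping (γ(x), γ(d), z_s, z_a) -> (σ, c)."""
    def __init__(self, pos_dim, dir_dim, shape_dim, app_dim, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim + shape_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)            # volume density σ
        self.color_head = nn.Sequential(                  # view- and appearance-dependent color
            nn.Linear(hidden + dir_dim + app_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, x_enc, d_enc, z_s, z_a):
        h = self.trunk(torch.cat([x_enc, z_s], dim=-1))
        sigma = torch.relu(self.sigma_head(h))            # keep the density non-negative
        c = torch.sigmoid(self.color_head(torch.cat([h, d_enc, z_a], dim=-1)))
        return sigma, c
```

GIRAFFE (Section 3) keeps exactly this structure but replaces the 3-channel color head with an $M_f$-dimensional feature head.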

Idea

GRAF achieves controllable image synthesis at high resolution, but it is restricted to single-object scenes, and the results tend to degrade on more complex imagery. This paper therefore proposes a model that is able to disentangle individual objects and allows translation and rotation of each object as well as changing the camera pose.

3. Method

  • Neural Feature Field : Replaces GRAF's formulation for the 3D color output $c$ with an $M_f$-dimensional feature: $\quad h_{\theta}: R^{L_x} \times R^{L_d} \times R^{M_s} \times R^{M_a} \to R^{+} \times R^{M_f}, \quad (\gamma(x),\gamma(d),z_s,z_a) \to (\sigma, f)$

    Object Representation: In NeRF and GRAF, the entire scene is represented by a single model. In order to disentangle different entities in the scene, GIRAFFE represents each object in the scene with a separate feature field in combination with an affine transformation. The parameters $T=\{s,t,R\}$ ($s$: scale, $t$: translation, $R$: rotation) are sampled from a dataset-dependent distribution:

    $k(x)=R\cdot\begin{bmatrix} s_1 & & \\ & s_2 & \\ & & s_3 \end{bmatrix}\cdot x + t$

    The model therefore gains control over the pose, shape, and appearance of individual objects. Through volume rendering, a 2D projection of the discretely sampled 3D scene can then be created:

    $(\sigma,f)=h_{\theta}(\gamma(k^{-1}(x)),\gamma(k^{-1}(d)),z_s,z_a)$

    Composition Operator: A scene is described as a composition of N entities (N-1 objects and 1 background). The model uses a density-weighted mean to combine all features at $(x,d)$:

    $C(x,d)=\left(\sigma,\ \frac{1}{\sigma}\sum_{i=1}^{N}\sigma_i f_i\right), \quad \text{where} \quad \sigma = \sum_{i=1}^{N}\sigma_i$

    3D Volume Rendering: Unlike previous models that volume-render an RGB color value, GIRAFFE renders an $M_f$-dimensional feature vector $f$. Along a camera ray $d$, the model samples $N_s$ points, and the operator $\pi_{vol}$ maps them to the final feature vector $f$:

    $\pi_{vol} : (R^{+} \times R^{M_f})^{N_s} \to R^{M_f}$

    It uses the same numerical integration method as NeRF:

    $f=\sum_{j=1}^{N_s}\tau_j\alpha_j f_j, \quad \tau_j=\prod_{k=1}^{j-1}(1-\alpha_k), \quad \alpha_j=1-e^{-\sigma_j\delta_j}$

    where $\delta_j=||x_{j+1} - x_j||_2$ is the distance between neighboring sample points and, together with the density $\sigma_j$, defines the alpha value $\alpha_j$. By accumulating the alpha values, we can compute the transmittance $\tau_j$. The entire feature image is obtained by evaluating $\pi_{vol}$ at every pixel. For efficiency, the model renders a feature map at $16^2$ resolution, which is lower than the output resolution ($64^2$ or $256^2$ pixels). A short code sketch of the composition operator and the feature volume rendering is given at the end of this section.

  • 2D Neural Rendering: To upsample the feature map to a higher-resolution image, the paper uses a 2D neural rendering network (Figure 4): $\pi_\theta^{neural}: R^{H_v \times W_v \times M_f} \to R^{H \times W \times 3}$

  • Training

    • Generator

    $G_\theta(\{z_s^i,z_a^i,T_i\}_{i=1}^N,\epsilon)=\pi_\theta^{neural}(I_v), \quad \text{where} \quad I_v=\{\pi_{vol}(\{C(x_{jk},d_k)\}_{j=1}^{N_s})\}_{k=1}^{H_v \times W_v}$

    • Discriminator : CNN with leaky ReLU

    • Loss Function = non-saturating GAN loss + R1-regularization

    $V(\theta,\phi)=E_{z_s^i,z_a^i \sim N,\ \epsilon \sim p_\epsilon,\ T_i \sim p_T}\left[f\big(D_\phi(G_\theta(\{z_s^i,z_a^i,T_i\}_i,\epsilon))\big)\right] + E_{I\sim p_D}\left[f(-D_\phi(I))- \lambda\,||\nabla D_\phi(I)||^2\right], \quad \text{where} \quad f(t)=-\log(1+\exp(-t)), \quad \lambda=10$
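The two core operations of the method, the density-weighted composition across entities and the alpha-composited feature rendering along each ray, can be written in a few lines. The sketch below is illustrative only: it assumes that per-entity densities and features have already been evaluated at the ray sample points, and the function names (compose, volume_render_features) are made up for this example.

```python
import torch

def compose(sigmas, feats):
    """Density-weighted mean over N entities.

    sigmas: (N, R, S)      densities σ_i at S samples on R rays, for N entities
    feats : (N, R, S, Mf)  features  f_i at the same points
    returns total density (R, S) and composed feature (R, S, Mf)
    """
    sigma = sigmas.sum(dim=0)                                   # σ = Σ_i σ_i
    f = (sigmas.unsqueeze(-1) * feats).sum(dim=0) / (sigma.unsqueeze(-1) + 1e-10)
    return sigma, f

def volume_render_features(sigma, f, deltas):
    """Alpha-composite the composed features along each ray (same integration as NeRF).

    sigma : (R, S)      composed density at the S sample points of each ray
    f     : (R, S, Mf)  composed features
    deltas: (R, S)      distances δ_j = ||x_{j+1} - x_j|| between neighboring samples
    returns one feature vector per ray, (R, Mf)
    """
    alpha = 1.0 - torch.exp(-sigma * deltas)                    # α_j = 1 - exp(-σ_j δ_j)
    # transmittance τ_j = Π_{k<j} (1 - α_k); prepend 1 for the first sample
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha[:, :-1]], dim=1), dim=1)
    weights = trans * alpha                                     # τ_j α_j
    return (weights.unsqueeze(-1) * f).sum(dim=1)               # Σ_j τ_j α_j f_j

# Illustrative usage: 2 entities, a 16x16 feature map (256 rays), 64 samples, Mf = 128
N, R, S, Mf = 2, 16 * 16, 64, 128
sigmas = torch.rand(N, R, S)
feats = torch.rand(N, R, S, Mf)
deltas = torch.full((R, S), 0.05)
sigma, f = compose(sigmas, feats)
feature_map = volume_render_features(sigma, f, deltas)          # (256, 128), later upsampled by the 2D neural renderer
```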

4. Experiment & Result

Experimental setup

  • DataSet

    • commonly used single-object datasets: Chairs, Cats, CelebA, CelebA-HQ

    • challenging single-object datasets: CompCars, LSUN Churches, FFHQ

    • testing on multi-object scenes: Clevr-N, Clevr-2345

  • Baseline

    • voxel-based PlatonicGAN, BlockGAN, HoloGAN

    • radiance field-based GRAF

  • Training setup

    • number of entities in the scene $N \sim p_N$, latent codes $z_s^i, z_a^i \sim N(0,I)$

    • camera pose $\epsilon \sim p_{\epsilon}$, transformations $T_i \sim p_T$ ⇒ In practice, $p_{\epsilon}$ and $p_T$ are uniform distributions over data-dependent camera elevation angles and valid object transformations, respectively (see the sampling sketch after this list)

    • All object fields share their weights and are parameterized as MLPs with ReLU activations (8 layers with a hidden dimension of 128 and $M_f=128$ for objects; half the layers and hidden dimension for the background field)

    • $L_x=2,3,10$ and $L_d=2,3,4$ for positional encoding

    • sample 64 points along each ray and render feature images at $16^2$ pixels

  • Evaluation Metric

    • Frechet Inception Distance (FID) score with 20,000 real and fake samples
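For concreteness, the sketch below shows how one scene's latent variables might be sampled following the setup above. The concrete ranges, dimensions, and the function name are illustrative assumptions for this review, not the authors' exact configuration.

```python
import math
import torch

def sample_scene_latents(max_objects=4, shape_dim=128, app_dim=128,
                         elevation_range=(0.0, 0.5), scale_range=(0.8, 1.2),
                         translation_range=(-0.5, 0.5)):
    """Sample one scene's generative latent variables (illustrative ranges)."""
    N = torch.randint(2, max_objects + 1, (1,)).item()       # N ~ p_N (objects + background)
    z_shape = torch.randn(N, shape_dim)                       # z_s^i ~ N(0, I)
    z_app = torch.randn(N, app_dim)                           # z_a^i ~ N(0, I)
    # T_i ~ p_T: uniform over valid scales / translations; rotation about the up-axis
    scale = torch.empty(N, 3).uniform_(*scale_range)
    translation = torch.empty(N, 3).uniform_(*translation_range)
    rotation = torch.empty(N).uniform_(0.0, 2 * math.pi)
    # camera pose ε ~ p_ε: uniform over data-dependent elevation angles
    elevation = torch.empty(1).uniform_(*elevation_range)
    return N, z_shape, z_app, (scale, translation, rotation), elevation
```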

Result

  • disentangled scene generation

  • comparison to baseline methods

  • ablation studies

    • importance of 2D neural rendering and its individual components

    • positional encoding

      $r(t,L) = \left(\sin(2^0 t\pi), \cos(2^0 t\pi), \ldots, \sin(2^L t\pi), \cos(2^L t\pi)\right)$ (a small code sketch of this encoding is given after this list)

  • limitations

    • struggles to disentangle factors of variation if there is an inherent bias in the data (e.g. eye and hair translation)

    • disentanglement failures due to mismatches between the assumed uniform distributions (over camera poses and object-level transformations) and their real distributions
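Returning to the positional encoding used in the ablation above, a direct implementation of $r(t,L)$ (a small helper written for this review, not taken from the official code) could look like:

```python
import math
import torch

def positional_encoding(t, L):
    """r(t, L) = (sin(2^0 π t), cos(2^0 π t), ..., sin(2^L π t), cos(2^L π t)).

    t: tensor of shape (..., D); returns shape (..., D * 2 * (L + 1))."""
    freqs = 2.0 ** torch.arange(L + 1, dtype=t.dtype, device=t.device) * math.pi
    angles = t.unsqueeze(-1) * freqs                              # (..., D, L+1)
    enc = torch.stack([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(start_dim=-3)                              # sin/cos per frequency per dimension

# e.g. encode 3D positions with L = 10 frequency bands
x = torch.rand(1024, 3)
gamma_x = positional_encoding(x, L=10)                            # shape (1024, 3 * 2 * 11)
```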

5. Conclusion

⇒ By representing scenes as compositional generative neural feature fields, the authors disentangle individual objects from the background, as well as their shapes and appearances, without explicit supervision.

⇒ Future work

  • Investigate how the distributions over object-level transformations and camera poses can be learned from data

  • Incorporate supervision that is easy to obtain (e.g. object masks) → scale to more complex, multi-object scenes

Take home message (오늘의 교훈)

  • 3D scene representation through Implicit Neural Representation is a recent trend that shows superior results.

  • Using an individual feature field for each entity helps disentangle their movements.

  • Rather than limiting the features to their original sizes (coordinate: 3, RGB: 3), using positional encoding or neural rendering helps represent the information more richly.

Author / Reviewer information

Author

김소희(Sohee Kim)

  • KAIST AI

  • Contact: joyhee@kaist.ac.kr

Reviewer

  1. Korean name (English name): Affiliation / Contact information

  2. Korean name (English name): Affiliation / Contact information

  3. ...

Reference & Additional materials

A key difference from GRAF is that GIRAFFE combines volume rendering with neural rendering. This makes the model more expressive and better at handling complex real scenes. Furthermore, rendering is much faster than in GRAF: the total rendering time is reduced from 110.1 ms to 4.8 ms for $64^2$ pixels and from 1595.0 ms to 5.9 ms for $256^2$ pixels.

GIRAFFE paper
GIRAFFE supplementary material
GIRAFFE - Github
INR explanation
NeRF paper
GRAF paper
Figure 1: NeRF architecture
Figure 2: GRAF architecture
Figure 3: GIRAFFE architecture
Figure 4: 2d neural rendering architecture
Figure 5: disentanglement
Figure 6: qualitative comparison
Figure 7: neural rendering architecture ablation
Figure 8: positional encoding
Figure 9: limitation_disentangle failure