# GIRAFFE \[Eng]

To read this review in Korean, click [**here**](/awesome-reviews/paper-review/2022-spring-paper-review/cvpr-2021-giraffe-kor.md).

## 1. Problem definition

Generative Adversarial Networks (GANs) can generate highly realistic images and even learn disentangled representations without explicit supervision. However, operating purely in the 2D domain is limiting because the world is three-dimensional. Recent work has started to incorporate 3D representations such as voxels or radiance fields, but these approaches were restricted to single-object scenes and produced less consistent results on high-resolution and more complex images.

So this paper suggests incorporating a **compositional** 3D scene representation into the **generative** models, leading to more controllable image synthesis.

## 2. Motivation

### Related work

**Implicit Neural Representation (INR)**

Conventional neural networks (NNs) are typically used for prediction tasks (e.g. image classification) and generation (e.g. generative models). In INR, however, the network parameters themselves contain the information of the data, so the network size is proportional to the complexity of the data, which is especially beneficial for representing 3D scenes. In addition, because we learn a function (for example, one that maps a pixel coordinate to an RGB value for a single image), the data can be represented in a continuous form.

* **NeRF : Neural Radiance Field** $$\quad$$ $$f\_{\theta}:R^{L\_x}\times R^{L\_d}\to R^+ \times R^3$$ $$(\gamma(x),\gamma(d)) \to (\sigma, c)$$

  A scene is represented by a fully connected network. Its input is a single continuous 5D coordinate (position + viewing direction), which passes through a positional encoding $$\gamma$$ to lift it into a higher-dimensional space, and its output is the volume density and the view-dependent RGB value (radiance). Points are sampled along each camera ray $$r(t)$$ with direction $$d$$, and the predicted colors $$c(r(t),d)$$ and densities $$\sigma(r(t))$$ are composited into an image by volume rendering (explained in Section 3). The loss is the difference between the volume-rendered image and the ground-truth *posed* image.

![Figure 1: NeRF architecture](/files/xM74aQM6nRvS6lD2iYMU)

* **GRAF : Generative Radiance Field** $$\quad$$ $$f\_{\theta}:R^{L\_x}\times R^{L\_d}\times R^{M\_s}\times R^{M\_a} \to R^+\times R^3$$ $$(\gamma(x),\gamma(d),z\_s,z\_a) \to (\sigma, c)$$

  It proposes an adversarial **generative** model for **conditional** radiance fields. Its input is a camera pose ε (sampled uniformly from the upper hemisphere, facing the origin) and a sampled K x K patch, with center $$(u,v)$$ and scale $$s$$, from the *unposed* input image. As conditions, a shape code $$z\_s$$ and an appearance code $$z\_a$$ are added, and the model (a fully connected network with ReLU activations) outputs a predicted patch, rendered just as in NeRF. A discriminator (a convolutional neural network) is then trained to distinguish the predicted patch from a real patch sampled from an image in the data distribution.

![Figure 2: GRAF architecture](/files/5KqqshFNmH0iCM8YcpZA)
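The positional encoding $$\gamma$$ that NeRF and GRAF apply to positions and directions can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `positional_encoding` and the parameter `num_freqs` are hypothetical, and the sketch assumes frequencies $$2^0,\dots,2^{\text{num\_freqs}-1}$$ applied independently to each coordinate.

```python
import math


def positional_encoding(coords, num_freqs):
    """Map each scalar t to (sin(2^k*pi*t), cos(2^k*pi*t)) for k = 0..num_freqs-1.

    `coords` is a flat list of floats (e.g. a 3D point); the result is a flat
    list of length 2 * len(coords) * num_freqs. Names are illustrative only.
    """
    encoded = []
    for t in coords:
        for k in range(num_freqs):
            angle = (2.0 ** k) * math.pi * t
            encoded.append(math.sin(angle))
            encoded.append(math.cos(angle))
    return encoded


point = [0.5, -0.25, 1.0]  # an arbitrary 3D position
gamma_x = positional_encoding(point, num_freqs=10)
print(len(gamma_x))  # 60 values: 2 functions * 3 coordinates * 10 frequencies
```

Under these assumptions, a 3D point with 10 frequencies becomes a 60-dimensional input, which gives the network access to high-frequency detail that a raw 3-vector would not expose.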

### Idea

GRAF achieves controllable image synthesis at high resolution, but it is restricted to single-object scenes, and its results tend to degrade on more complex imagery. This paper therefore proposes a model that disentangles individual objects and allows translating and rotating them as well as changing the camera pose.

## 3. Method

![Figure 3: GIRAFFE architecture](/files/1hDiE7ZHiBOG4NWRJomF)

* **Neural Feature Field** : Replaces GRAF's 3D color output $$c$$ with an $$M\_f$$-dimensional feature $$f$$\
  $$h\_{\theta}:R^{L\_x} \times R^{L\_d} \times R^{M\_s} \times R^{M\_a} \to R^+ \times R^{M\_f}$$ $$(\gamma(x),\gamma(d),z\_s,z\_a) \to (\sigma, f)$$

  **Object Representation** In NeRF and GRAF, the entire scene is represented by a single model. To disentangle the different entities in the scene, GIRAFFE instead represents each object with a separate feature field combined with an affine transformation. The parameters $$T=\lbrace s,t,R\rbrace$$ ($$s$$: scale, $$t$$: translation, $$R$$: rotation) are sampled from a dataset-dependent distribution and map a point from object space to scene space:\
  $$k(x)=R\cdot\begin{bmatrix} s\_1 & & \\ & s\_2 & \\ & & s\_3 \end{bmatrix}\cdot x + t$$\
  Each feature field is evaluated in its own object space, so scene-space points and directions are first mapped back with $$k^{-1}$$:\
  $$(\sigma,f)=h\_{\theta}(\gamma(k^{-1}(x)),\gamma(k^{-1}(d)),z\_s,z\_a)$$\
  This gives control over the pose, shape and appearance of individual objects. Volume rendering then projects the discretely sampled 3D points onto a 2D image.\
  \
  **Composition Operator** A scene is described as a composition of N entities (N-1 objects and 1 background). The model uses a density-weighted mean to combine all features at $$(x,d)$$:<br>

  $$C(x,d)=(\sigma,{1\over\sigma} \sum\_{i=1}^{N}\sigma\_if\_i), \quad where \quad \sigma = \sum\_{i=1}^N\sigma\_i$$\
  **3D volume rendering** Unlike previous models, which volume render an RGB color value, GIRAFFE renders an $$M\_f$$-dimensional feature vector $$f$$. Along a camera ray $$d$$, the model samples $$N\_s$$ points, and the operator $$\pi\_{vol}$$ maps them to the final feature vector $$f$$.\
  $$\pi\_{vol} : (R^+ \times R^{M\_f})^{N\_s} \to R^{M\_f}$$\
  They use the same numerical integration method as in NeRF.\
  $$f=\sum\_{j=1}^{N\_s}\tau\_j\alpha\_j f\_j \quad \tau\_j=\prod\_{k=1}^{j-1}(1-\alpha\_k) \quad \alpha\_j=1-e^{-\sigma\_j\delta\_j}$$\
  Here, $$\delta\_j=||x\_{j+1} - x\_j ||\_2$$ is the distance between neighboring sample points; together with the density $$\sigma\_j$$, it defines the alpha value $$\alpha\_j$$. Accumulating the alpha values gives the transmittance $$\tau\_j$$. The entire feature image is obtained by evaluating $$\pi\_{vol}$$ at every pixel.\
  For efficiency, they obtain a feature map of $$16^2$$ resolution, which is lower than the input resolution($$64^2$$ or $$256^2$$ pixels).
* **2D neural rendering** To upsample the feature map to a higher-resolution image, the paper uses 2D neural rendering, as shown in the figure below.\
  $$\pi\_\theta^{neural}:R^{H\_v \times W\_v \times M\_f} \to R^{H \times W \times 3}$$

![Figure 4: 2d neural rendering architecture](/files/MUuxEcvb0wFHt60QVHEE)
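The object transformation, composition operator, and volume rendering steps described above can be sketched together in plain Python. This is a rough illustration under stated assumptions, not the GIRAFFE implementation: all function names are hypothetical, $$R$$ is assumed to be an orthonormal 3x3 rotation matrix (so its inverse is its transpose), and the zero-density fallback in `compose` is a choice made here only to avoid division by zero.

```python
import math


def apply_affine(x, scale, rotation, translation):
    """k(x) = R * diag(s) * x + t for a 3D point x (object -> scene space)."""
    scaled = [s * xi for s, xi in zip(scale, x)]
    rotated = [sum(r * v for r, v in zip(row, scaled)) for row in rotation]
    return [rv + t for rv, t in zip(rotated, translation)]


def apply_inverse_affine(x, scale, rotation, translation):
    """k^{-1}(x) = diag(1/s) * R^T * (x - t); assumes R is orthonormal."""
    shifted = [xi - t for xi, t in zip(x, translation)]
    unrotated = [sum(rotation[i][j] * shifted[i] for i in range(3)) for j in range(3)]
    return [u / s for u, s in zip(unrotated, scale)]


def compose(densities, features):
    """Density-weighted mean of the per-entity features at one sample point."""
    sigma = sum(densities)
    dim = len(features[0])
    if sigma == 0.0:
        # Assumption of this sketch: empty space contributes a zero feature.
        return 0.0, [0.0] * dim
    mean = [sum(d * f[c] for d, f in zip(densities, features)) / sigma for c in range(dim)]
    return sigma, mean


def volume_render(sigmas, feats, deltas):
    """Accumulate transmittance * alpha * feature over the samples of one ray."""
    out = [0.0] * len(feats[0])
    transmittance = 1.0  # product of (1 - alpha_k) over earlier samples
    for sigma, feat, delta in zip(sigmas, feats, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        out = [o + weight * c for o, c in zip(out, feat)]
        transmittance *= 1.0 - alpha
    return out


# Round trip: object space -> scene space -> object space.
rot_z = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # 90 degrees about z
p_scene = apply_affine([1.0, 2.0, 3.0], [2.0, 2.0, 2.0], rot_z, [0.5, 0.0, -1.0])
p_object = apply_inverse_affine(p_scene, [2.0, 2.0, 2.0], rot_z, [0.5, 0.0, -1.0])
print(p_scene, p_object)

# Two entities at one point, then two such points along a ray.
sigma, feature = compose([1.0, 3.0], [[1.0, 0.0], [0.0, 1.0]])
print(sigma, feature)
print(volume_render([0.0, sigma], [[0.0, 0.0], feature], [0.1, 0.1]))
```

Because the weights $$\tau_j\alpha_j$$ sum to at most one, the rendered feature is a convex-like blend of the sampled features, with nearer dense samples occluding those behind them.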

* **Training**

  * Generator

  $$G\_\theta(\lbrace z\_s^i,z\_a^i,T\_i\rbrace\_{i=1}^N,\epsilon)=\pi\_\theta^{neural}(I\_v),\quad \text{where} \quad I\_v=\lbrace\pi\_{vol}(\lbrace C(x\_{jk},d\_k)\rbrace\_{j=1}^{N\_s})\rbrace\_{k=1}^{H\_v \times W\_v}$$

  * Discriminator : CNN with leaky ReLU
  * Loss Function = non-saturating GAN loss + R1-regularization

  $$V(\theta,\phi)=E\_{z\_s^i,z\_a^i \sim N, \epsilon \sim p\_\epsilon, T\_i \sim p\_T} \left[f(D\_\phi(G\_\theta(\lbrace z\_s^i,z\_a^i,T\_i\rbrace\_i,\epsilon)))\right] + E\_{I\sim p\_D}\left[f(-D\_\phi(I))- \lambda \vert\vert \nabla D\_\phi (I) \vert\vert^2 \right], \quad \text{where} \quad f(t)=-\log(1+\exp(-t)), \quad \lambda=10$$
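As a rough numerical illustration of the objective $$V(\theta,\phi)$$ above, the sketch below evaluates a Monte Carlo estimate from discriminator outputs. It is not the authors' training code: the names `f` and `objective` are hypothetical, and the squared gradient norms $$\vert\vert\nabla D\_\phi(I)\vert\vert^2$$ are assumed to be supplied by an automatic differentiation framework that is outside this sketch.

```python
import math


def f(t):
    """f(t) = -log(1 + exp(-t)), evaluated without overflow for large |t|."""
    if t >= 0.0:
        return -math.log1p(math.exp(-t))
    # For t < 0: -log(1 + exp(-t)) = t - log(1 + exp(t)).
    return t - math.log1p(math.exp(t))


def objective(d_fake, d_real, grad_sq_norms, lam=10.0):
    """Monte Carlo estimate of V from discriminator outputs.

    d_fake:        D(G(...)) for generated images
    d_real:        D(I) for real images
    grad_sq_norms: ||grad_I D(I)||^2 for the same real images (assumed given)
    """
    fake_term = sum(f(d) for d in d_fake) / len(d_fake)
    real_term = sum(f(-d) - lam * g for d, g in zip(d_real, grad_sq_norms)) / len(d_real)
    return fake_term + real_term


print(objective(d_fake=[0.3, -1.2], d_real=[0.8, 1.5], grad_sq_norms=[0.02, 0.01]))
```

The R1 term only touches real images, so it penalizes a discriminator whose output changes sharply around the data distribution.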

## 4. Experiment & Result

### Experimental setup

* Datasets
  * commonly used single-object datasets: Chairs, Cats, CelebA, CelebA-HQ
  * more challenging single-object datasets: CompCars, LSUN Churches, FFHQ
  * multi-object scenes: Clevr-N, Clevr-2345
* Baseline
  * voxel-based PlatonicGAN, BlockGAN, HoloGAN
  * radiance field-based GRAF
* Training setup
  * number of entities in the scene $$N \sim p\_N$$, latent codes $$z\_s^i,z\_a^i \sim N(0,I)$$
  * camera pose $$\epsilon \sim p\_{\epsilon}$$, transformations $$T\_i \sim p\_T$$\
    ⇒ In practice, $$p\_{\epsilon}$$ and $$p\_T$$ are uniform distributions over data-dependent camera elevation angles and valid object transformations, respectively.
  * All object fields share their weights and are parameterized as MLPs with ReLU activations (8 layers with a hidden dimension of 128 and $$M\_f=128$$ for objects; half the layers and half the hidden dimension for the background)
  * $$L\_x=2\cdot3\cdot10$$ and $$L\_d=2\cdot3\cdot4$$ for positional encoding
  * sample 64 points along each ray and render feature images at $$16^2$$ pixels
* Evaluation Metric
  * Frechet Inception Distance (FID) score with 20,000 real and fake samples
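The sampling step in the training setup (number of entities, latent codes, camera pose, and per-object transformations) might look like the sketch below. Every name, range, and dimension here is a hypothetical placeholder chosen for illustration; the actual distributions are dataset-dependent and are not specified in this review.

```python
import math
import random


def sample_scene_inputs(rng, entity_counts=(2, 3, 4), latent_dim=4,
                        elevation_range=(0.0, 0.5), scale_range=(0.8, 1.2),
                        translation_range=(-1.0, 1.0)):
    """Draw one set of generator inputs; all ranges are illustrative only."""
    num_entities = rng.choice(entity_counts)  # N ~ p_N
    entities = []
    for _ in range(num_entities):
        entities.append({
            # Shape and appearance codes z_s, z_a ~ N(0, I)
            "z_s": [rng.gauss(0.0, 1.0) for _ in range(latent_dim)],
            "z_a": [rng.gauss(0.0, 1.0) for _ in range(latent_dim)],
            # Transformation T = {s, t, R}, uniform over assumed valid ranges
            "scale": [rng.uniform(*scale_range) for _ in range(3)],
            "translation": [rng.uniform(*translation_range) for _ in range(3)],
            "rotation_angle": rng.uniform(0.0, 2.0 * math.pi),
        })
    # Camera pose: uniform over an assumed range of elevation angles
    camera_elevation = rng.uniform(*elevation_range)
    return {"entities": entities, "camera_elevation": camera_elevation}


scene = sample_scene_inputs(random.Random(0))
print(len(scene["entities"]), scene["camera_elevation"])
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which is convenient when comparing how one latent code or transformation changes the rendered scene.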

### Result

* disentangled scene generation

![Figure 5: disentanglement](/files/LbNMkDyNMDsO1ht9yAwX)

* comparison to baseline methods

![Figure 6: qualitative comparison](/files/GoES8xwhuQmweaO1B2vp)

* ablation studies

  * importance of 2D neural rendering and its individual components

  ![Figure 7: neural rendering architecture ablation](/files/SUZF1NgqAzanJ41n2tWW)\
  The key difference from GRAF is that GIRAFFE combines volume rendering with neural rendering. This makes the model more expressive and better able to handle complex real scenes. Rendering is also faster than with GRAF: total rendering time drops from 110.1ms to 4.8ms at $$64^2$$ pixels and from 1595.0ms to 5.9ms at $$256^2$$ pixels (roughly 23x and 270x faster, respectively).

  * positional encoding

    $$\gamma(t,L) = (\sin(2^0t\pi), \cos(2^0t\pi),...,\sin(2^Lt\pi),\cos(2^Lt\pi))$$

    <img src="/files/hMpbeKtLd6FeozewD37a" alt="Figure 8: positional encoding" data-size="original">
* limitations

  * struggles to disentangle factors of variation if there is an inherent bias in the data (e.g. eye and hair translation)
  * disentanglement failures caused by mismatches between the assumed uniform distributions (over camera poses and object-level transformations) and their real distributions

  <img src="/files/0VZgBH5R2slzW1xQImQ7" alt="Figure 9: limitation_disentangle failure" data-size="original">

## 5. Conclusion

⇒ By representing scenes as compositional generative neural feature fields, they disentangle individual objects from the background, as well as their shape and appearance, without explicit supervision.

⇒ Future work

* Investigate how the distributions over object level transformations and camera poses can be learned from data
* Incorporate supervision that is easy to obtain (e.g. object masks) in order to scale to more complex, multi-object scenes

### Take home message

* 3D scene representation through Implicit Neural Representation is a recent trend that shows superior results.
* Using an individual feature field for each entity helps disentangle their movements.
* Rather than limiting the features to their original size (coordinate: 3, RGB: 3), using positional encoding or neural rendering helps represent the information more richly.

## Author / Reviewer information

### Author

**김소희(Sohee Kim)**

* KAIST AI
* Contact: <joyhee@kaist.ac.kr>

## Reference & Additional materials

1. [GIRAFFE paper](https://arxiv.org/abs/2011.12100)
2. [GIRAFFE supplementary material](http://www.cvlibs.net/publications/Niemeyer2021CVPR_supplementary.pdf)
3. [GIRAFFE - Github](https://github.com/autonomousvision/giraffe)
4. [INR explanation](https://www.notion.so/Implicit-Representation-Using-Neural-Network-c6aac62e0bf044ebbe70abcdb9cc3dd1)
5. [NeRF paper](https://arxiv.org/abs/2003.08934)
6. [GRAF paper](https://arxiv.org/abs/2007.02442)

