GIRAFFE [Eng]
Niemeyer et al. / GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields / CVPR 2021 (oral, best paper award)
To read this review written in Korean, click here.
Through Generative Adversarial Networks (GANs), people have succeeded in generating highly realistic images and even learning disentangled representations without explicit supervision. However, operating in the 2D domain has limitations due to the 3-dimensional nature of the world. Recent investigations have started to incorporate 3D representations such as voxels or radiance fields, but they have been restricted to single-object scenes and show less consistent results on higher-resolution and more complex images.
This paper therefore proposes incorporating a compositional 3D scene representation into the generative model, leading to more controllable image synthesis.
Implicit Neural Representation (INR)
Existing neural networks (NNs) have mostly played a role in prediction tasks (e.g., image classification) and generation (e.g., generative models). In an INR, however, the network parameters contain the information of the data itself, so the network size scales with the complexity of the data, which is especially beneficial for representing 3D scenes. In addition, since we learn a function, for example one that maps a coordinate to an RGB value for a single image, we can represent the data in a continuous form.
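As a toy illustration of this idea (the architecture and layer sizes below are arbitrary choices for illustration, not taken from any particular paper), an MLP can store a single image in its weights by mapping a continuous 2D coordinate to an RGB value:

```python
import torch
import torch.nn as nn

# Minimal implicit representation of a single image: the MLP maps a
# continuous (x, y) coordinate to an RGB value, so the image itself is
# stored in the network weights and can be queried at any resolution.
class ImplicitImage(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),   # RGB in [0, 1]
        )

    def forward(self, coords):                    # coords: (..., 2) in [-1, 1]
        return self.net(coords)

model = ImplicitImage()
coords = torch.rand(1024, 2) * 2 - 1              # random continuous query points
rgb = model(coords)                               # (1024, 3)
```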
NeRF : Neural Radiance Field
A scene is represented using a fully-connected network whose input is a single continuous 5D coordinate (position + viewing direction), passed through a positional encoding for higher-dimensional information, and whose outputs are the volume density and the view-dependent RGB value (radiance). 5D coordinates are sampled along each camera ray, and the produced colors and densities are composited into an image through a volume rendering technique (explained in section 3). The loss function is the difference between the volume-rendered image and the ground-truth posed image.
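A minimal sketch of the positional encoding step, assuming the standard NeRF-style frequency encoding (the function name and shapes are illustrative):

```python
import math
import torch

def positional_encoding(x, L=10):
    """Map each coordinate to sin/cos features at L frequency octaves."""
    freqs = (2.0 ** torch.arange(L)) * math.pi                # (L,)
    angles = x[..., None] * freqs                              # (..., D, L)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(start_dim=-2)                           # (..., D * 2L)

xyz = torch.rand(4, 3)                                         # sampled 3D positions
print(positional_encoding(xyz, L=10).shape)                    # torch.Size([4, 60])
```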
GRAF achieves controllable image synthesis at high resolution, but it is restricted to single-object scenes and its results tend to degrade on more complex imagery. This paper therefore proposes a model that can disentangle individual objects and allows translating and rotating them as well as changing the camera pose.
Training
Generator
Discriminator : CNN with leaky ReLU
Loss Function = non-saturating GAN loss + R1 regularization (a sketch of both terms follows below)
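A hedged sketch of these two loss terms as generic PyTorch functions (the names `d_real_logits`, `d_fake_logits`, and `x_real` are placeholders, not the paper's code):

```python
import torch
import torch.nn.functional as F

def generator_loss(d_fake_logits):
    # Non-saturating GAN loss: the generator maximizes log D(G(z)).
    return F.softplus(-d_fake_logits).mean()

def discriminator_loss(d_real_logits, d_fake_logits):
    # Standard discriminator loss on real and generated images.
    return F.softplus(-d_real_logits).mean() + F.softplus(d_fake_logits).mean()

def r1_penalty(d_real_logits, x_real):
    # R1 regularization: gradient penalty on real images only.
    # Note: x_real must have requires_grad=True before the forward pass.
    grad = torch.autograd.grad(
        outputs=d_real_logits.sum(), inputs=x_real, create_graph=True
    )[0]
    return grad.pow(2).flatten(1).sum(1).mean()
```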
DataSet
commonly used single-object datasets: Chairs, Cats, CelebA, CelebA-HQ
challenging single-object datasets: CompCars, LSUN Churches, FFHQ
testing on multi-object scenes: Clevr-N, Clevr-2345
Baseline
voxel-based PlatonicGAN, BlockGAN, HoloGAN
radiance field-based GRAF
Training setup
Evaluation Metric
Frechet Inception Distance (FID) score with 20,000 real and fake samples
disentangled scene generation
comparison to baseline methods
ablation studies
importance of 2D neural rendering and its individual components
positional encoding
limitations
struggles to disentangle factors of variation if there is an inherent bias in the data (e.g., eye and hair translation)
disentanglement failures due to mismatches between the assumed uniform distributions (over camera poses and object-level transformations) and their real distributions
⇒ By representing scenes as compositional generative neural feature fields, they disentangle individual objects from the background, as well as their shapes and appearances, without explicit supervision
⇒ Future work
Investigate how the distributions over object-level transformations and camera poses can be learned from data
Incorporate supervision that is easy to obtain (e.g., object masks) -> scale to more complex, multi-object scenes
3D scene representation through Implicit Neural Representations is a recent trend that shows superior results.
Using an individual feature field for each entity helps disentangle their movements.
Rather than limiting the features to their original sizes (coordinate: 3, RGB: 3), using positional encoding or neural rendering helps represent the information more richly.
김소희(Sohee Kim)
KAIST AI
Contact: joyhee@kaist.ac.kr
GRAF : Generative Radiance Fields
It proposes an adversarial generative model for conditional radiance fields, whose inputs are a camera pose $\boldsymbol{\xi}$ (sampled uniformly from the upper hemisphere facing the origin) and a $K \times K$ patch with center $\mathbf{u}$ and scale $s$, sampled from the unposed input image. Shape and appearance codes are added as a condition, and the model (a fully-connected network with ReLU activations) outputs a predicted patch, just like the NeRF model. Next, the discriminator (a convolutional neural network) is trained to distinguish between the predicted patch and a real patch sampled from an image in the image distribution.
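A rough sketch of the patch-sampling idea under simplified assumptions (the helper name and sampling ranges are illustrative, not GRAF's exact implementation):

```python
import torch

def sample_patch_coords(K, img_size, center=None, scale=None):
    """Continuous pixel coordinates of a K x K patch with center u and scale s;
    the same coordinates select generator rays and (via interpolation) real pixels."""
    if center is None:                                        # patch center u
        center = torch.rand(2) * img_size
    if scale is None:                                         # patch scale s
        scale = torch.empty(1).uniform_(1.0, img_size / K)
    offsets = (torch.arange(K) - K / 2) * scale               # (K,)
    grid_y, grid_x = torch.meshgrid(center[1] + offsets, center[0] + offsets, indexing="ij")
    return torch.stack([grid_x, grid_y], dim=-1)              # (K, K, 2)

coords = sample_patch_coords(K=32, img_size=128)
```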
Neural Feature Field : Replaces GRAF's formulation for the 3D color output $\mathbf{c} \in \mathbb{R}^3$ with an $M_f$-dimensional feature $\mathbf{f} \in \mathbb{R}^{M_f}$
Object Representation In NeRF and GRAF, the entire scene is represented by a single model, but in order to disentangle different entities in the scene, GIRAFFE represents each object (and the background) using a separate feature field in combination with an affine transformation $T = \{\mathbf{s}, \mathbf{t}, \mathbf{R}\}$ ($\mathbf{s}$: scale, $\mathbf{t}$: translation, $\mathbf{R}$: rotation). These parameters are sampled from dataset-dependent distributions, so the model gains control over the pose, shape, and appearance of individual objects. Then, through volume rendering, we can create a 2D projection of the discretely sampled 3D scene.

Composition Operator A scene is described as a composition of $N$ entities ($N-1$ objects and 1 background). At a given point $\mathbf{x}$ and viewing direction $\mathbf{d}$, the model sums the densities and uses the density-weighted mean to combine the features of all entities:

$$C(\mathbf{x}, \mathbf{d}) = \left( \sigma, \frac{1}{\sigma} \sum_{i=1}^{N} \sigma_i \mathbf{f}_i \right), \qquad \sigma = \sum_{i=1}^{N} \sigma_i$$
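A simplified sketch of these two operations under the notation above (helper names are illustrative; the composition implements the density-weighted mean described here):

```python
import torch

def to_object_space(x, s, t, R):
    """Evaluate an object's field in its canonical frame:
    k^{-1}(x) = diag(1/s) R^T (x - t), with k(x) = R diag(s) x + t."""
    return (R.T @ (x - t)) / s

def composite(sigmas, feats):
    """Composition operator at one 3D point: sum the densities and take the
    density-weighted mean of the features. sigmas: (N,), feats: (N, M_f)."""
    sigma = sigmas.sum()
    f = (sigmas[:, None] * feats).sum(dim=0) / (sigma + 1e-8)
    return sigma, f

# Example: 3 entities (2 objects + background) with M_f = 128 features.
sigma, f = composite(torch.rand(3), torch.randn(3, 128))
```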
3D volume rendering Unlike previous models that volume render an RGB color value, GIRAFFE renders an $M_f$-dimensional feature vector $\mathbf{f}$. Along a camera ray $\mathbf{d}$, the model samples $N_s$ points $\{\mathbf{x}_j\}_{j=1}^{N_s}$, and the rendering operator $\pi_{\text{vol}}$ maps their densities and features to the final feature vector $\mathbf{f} = \sum_{j=1}^{N_s} \tau_j \alpha_j \mathbf{f}_j$, using the same numerical integration method as in NeRF. Here, with $\delta_j = \lVert \mathbf{x}_{j+1} - \mathbf{x}_j \rVert_2$ the distance between neighboring sample points and $\sigma_j$ the density at $\mathbf{x}_j$, the alpha value is defined as $\alpha_j = 1 - e^{-\sigma_j \delta_j}$, and accumulating the alpha values gives the transmittance $\tau_j = \prod_{k=1}^{j-1}(1 - \alpha_k)$. The entire feature image is obtained by evaluating every pixel. For efficiency, they render the feature map at $16^2$ resolution, which is lower than the output image resolution ($64^2$ or $256^2$ pixels).
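A minimal sketch of this numerical integration along a single ray (shapes and the helper name are assumptions for illustration):

```python
import torch

def volume_render_features(sigma, feats, deltas):
    """Alpha-composite M_f-dimensional features along one ray.
    sigma: (N_s,), feats: (N_s, M_f), deltas: (N_s,) sample spacings."""
    alpha = 1.0 - torch.exp(-sigma * deltas)                   # alpha_j
    # Transmittance tau_j = prod_{k<j} (1 - alpha_k)
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha[:-1]]), dim=0)
    weights = trans * alpha                                    # tau_j * alpha_j
    return (weights[:, None] * feats).sum(dim=0)               # final feature f

sigma = torch.rand(64)                # 64 samples along the ray
feats = torch.randn(64, 128)          # M_f = 128 features per sample
deltas = torch.full((64,), 0.05)      # distances between neighboring samples
f = volume_render_features(sigma, feats, deltas)               # (128,)
```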
2D neural rendering In order to upsample the low-resolution feature map to the final higher-resolution image, the paper uses a 2D neural rendering network.
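The neural renderer can be pictured as a small upsampling CNN like the sketch below; the depth and channel sizes here are placeholders rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

# Illustrative 2D neural renderer: nearest-neighbor upsampling + 3x3 convs
# map a low-resolution feature image (e.g. 16x16 x M_f) to a higher-resolution
# RGB image. Depth and channel widths are placeholder choices.
class NeuralRenderer(nn.Module):
    def __init__(self, feat_dim=128, n_up=2):
        super().__init__()
        blocks, ch = [], feat_dim
        for _ in range(n_up):
            blocks += [
                nn.Upsample(scale_factor=2, mode="nearest"),
                nn.Conv2d(ch, ch // 2, kernel_size=3, padding=1),
                nn.LeakyReLU(0.2),
            ]
            ch //= 2
        self.blocks = nn.Sequential(*blocks)
        self.to_rgb = nn.Sequential(nn.Conv2d(ch, 3, 3, padding=1), nn.Sigmoid())

    def forward(self, feat_img):                  # (B, M_f, H_V, W_V)
        return self.to_rgb(self.blocks(feat_img))

renderer = NeuralRenderer(feat_dim=128, n_up=2)
img = renderer(torch.randn(1, 128, 16, 16))       # -> (1, 3, 64, 64)
```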
number of entities in the scene $N \sim p_N$, latent codes $\mathbf{z}_s^i, \mathbf{z}_a^i \sim \mathcal{N}(0, I)$
camera pose $\boldsymbol{\xi} \sim p_{\xi}$, transformations $T_i \sim p_T$ ⇒ In practice, $p_{\xi}$ and $p_T$ are uniform distributions over data-dependent camera elevation angles and valid object transformations, respectively (a toy sampling sketch follows after this list)
All object fields share their weights and are parameterized as MLPs with ReLU activations (8 layers with a hidden dimension of 128 for objects, and half the layers and hidden dimension for the background feature field)
$L = 10$ for the 3D point $\mathbf{x}$ and $L = 4$ for the viewing direction $\mathbf{d}$ in the positional encoding
sample 64 points along each ray and render feature images at $16^2$ pixels
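Putting the sampled generator inputs above together, here is a toy sketch of drawing one scene; all ranges and distributions are simplified stand-ins for the dataset-dependent ones:

```python
import math
import torch

def sample_scene(max_objects=2, latent_dim=256):
    """Sample generator inputs for one scene (ranges are illustrative only)."""
    n_obj = torch.randint(1, max_objects + 1, (1,)).item()
    N = n_obj + 1                                             # objects + background
    z_shape = torch.randn(N, latent_dim)                      # shape codes z_s
    z_app = torch.randn(N, latent_dim)                        # appearance codes z_a
    elevation = torch.empty(1).uniform_(0.2, 0.45) * math.pi  # camera elevation angle
    scales = torch.empty(n_obj, 3).uniform_(0.5, 1.0)         # per-object scale s
    translations = torch.empty(n_obj, 3).uniform_(-0.5, 0.5)  # translation t
    rotations = torch.empty(n_obj).uniform_(0.0, 2 * math.pi) # rotation angle
    return N, z_shape, z_app, elevation, scales, translations, rotations
```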
The key difference from GRAF is that GIRAFFE combines volume rendering with neural rendering. This makes the model more expressive and better able to handle complex real scenes. Furthermore, rendering is much faster than in GRAF: the total rendering time is reduced from 110.1 ms to 4.8 ms at $64^2$ pixels, and from 1595.0 ms to 5.9 ms at $256^2$ pixels.