Self-Calibrating Neural Radiance Fields [Eng]
Jeong et al. / Self-Calibrating Neural Radiance Fields / ICCV 2021
Click here to read this review in Korean.
Given a set of images of a scene, the proposed method, dubbed SCNeRF, jointly learns the geometry of the scene and the accurate camera parameters without any calibration objects. This task can be expressed as the following equation.
Find $\{\pi, \theta\}$, when

$$\mathbf{r} = (\mathbf{r}_o, \mathbf{r}_d) = f(\pi), \qquad \hat{\mathbf{C}}(\mathbf{r}) = g_\theta(\mathbf{r})$$

where $\mathbf{r}$ is a ray, $\mathbf{r}_o$ is the ray origin, $\mathbf{r}_d$ is the ray direction, $f$ is a function that generates a ray from camera parameters, $\pi$ denotes the camera parameters, $\hat{\mathbf{C}}(\mathbf{r})$ is the estimated color of the ray $\mathbf{r}$, $\theta$ is the parameter set of the NeRF model, and $g$ is a function that estimates the color of a ray using the NeRF parameters.
Generally, scene geometry is learned with known camera parameters, or camera parameters are estimated without improving or learning scene geometry.
Unlike the previous approach, the purpose of this paper is to learn camera parameters and NeRF model parameters jointly.
Because of its simplicity and generality, traditional 3D vision tasks often assume that the camera model is a simple pinhole model. However, as camera models have developed, various models have been introduced, including fisheye models and per-pixel generic models. A basic pinhole camera model is not expressive enough to represent these kinds of complex camera models.
Self-Calibration is a research topic that calibrates camera parameters without an external calibration object (e.g., a checkerboard pattern) in the scene.
In many cases, calibration objects are not readily available. Thus, calibrating camera parameters without any external objects has been an important research topic.
However, conventional self-calibration methods solely rely on the geometric loss or constraints based on the epipolar geometry that only uses a set of sparse correspondences extracted from a non-differentiable process. This could lead to diverging results with extreme sensitivity to noise when a scene does not have enough interest points. Lastly, conventional self-calibration methods use an off-the-shelf non-differentiable feature matching algorithm and do not improve or learn the geometry, though it is well known that the better we know the geometry of the scene, the more accurate the camera model gets.
NeRF is a work that synthesizes a novel view of the scene by optimizing a separate neural continuous volume representation network for each scene.
At the time NeRF was published, it achieved state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views.
However, this requires not only a dataset of captured RGB images of the scene but also the corresponding camera poses and intrinsic parameters, which are not always available.
Pinhole camera model parameters, fourth-order radial distortion parameters, and generic noise model parameters that can learn arbitrary non-linear camera distortions are included to overcome the limitation of the pinhole camera model.
To overcome the limitation of geometric loss used in the previous self-calibration methods, additional photometric consistency is used.
To get a more accurate camera model using improved geometry of the scene, the geometry represented using Neural Radiance Fields is learned jointly.
The pinhole camera model maps a 4-vector homogeneous coordinate $\mathbf{P}_{4\times 1}$ in 3D space to a 3-vector $\mathbf{P}'_{3\times 1}$ in the image plane:

$$\mathbf{P}'_{3\times 1} = \mathbf{K}\,[\mathbf{R}\;|\;\mathbf{t}]\;\mathbf{P}_{4\times 1}$$

where $\mathbf{K}$ is the intrinsic matrix, $\mathbf{R}$ is the rotation matrix, and $\mathbf{t}$ is the translation vector.
First, the camera intrinsic matrix is decomposed into the initialization $\mathbf{K}_0$ and the residual parameter matrix $\Delta\mathbf{K}$ (i.e., $\mathbf{K} = \mathbf{K}_0 + \Delta\mathbf{K}$). This is because directly optimizing the intrinsics is highly non-convex and has a lot of local minima.
Similarly, the extrinsic parameters are decomposed into initial values and their residual parameters to represent the camera rotation $\mathbf{R}$ and translation $\mathbf{t}$. However, directly learning a rotation offset for each element of a rotation matrix would break the orthogonality of the rotation matrix. Thus, the 6-vector representation, which uses the unnormalized first two columns of a rotation matrix, is utilized to represent a 3D rotation:

$$f\!\left(\begin{bmatrix}\mathbf{a}_1 & \mathbf{a}_2\end{bmatrix}\right) = \begin{bmatrix}\mathbf{b}_1 & \mathbf{b}_2 & \mathbf{b}_3\end{bmatrix}, \quad \mathbf{b}_1 = N(\mathbf{a}_1),\;\; \mathbf{b}_2 = N\big(\mathbf{a}_2 - (\mathbf{b}_1\cdot\mathbf{a}_2)\,\mathbf{b}_1\big),\;\; \mathbf{b}_3 = \mathbf{b}_1\times\mathbf{b}_2$$

What $f$ does is quite similar to the Gram-Schmidt process. To make it clear, I made a conceptual image as follows. Here, $N(\cdot)$ is vector normalization.
As we can see in the figure, from the two unnormalized vectors $\mathbf{a}_1$ and $\mathbf{a}_2$, the orthonormal vectors $\mathbf{b}_1$, $\mathbf{b}_2$, and $\mathbf{b}_3$ can be obtained.
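To make the 6-vector representation concrete, here is a minimal NumPy sketch of the mapping from the two unnormalized columns to an orthonormal rotation matrix; the function name and the residual setup are illustrative, not taken from the official code.

```python
import numpy as np

def rotation_from_6d(a1, a2):
    """Map the unnormalized first two columns (a1, a2) of a rotation
    matrix to an orthonormal rotation matrix, Gram-Schmidt style."""
    b1 = a1 / np.linalg.norm(a1)              # normalize the first column
    a2_proj = a2 - np.dot(b1, a2) * b1        # remove the b1 component from a2
    b2 = a2_proj / np.linalg.norm(a2_proj)    # normalize the orthogonalized column
    b3 = np.cross(b1, b2)                     # third column from the cross product
    return np.stack([b1, b2, b3], axis=1)     # columns form the rotation matrix

# usage: perturb an initial rotation's first two columns and re-orthonormalize
R0 = np.eye(3)
delta = 0.01 * np.random.randn(3, 2)          # would be the learnable residual in practice
R = rotation_from_6d(R0[:, 0] + delta[:, 0], R0[:, 1] + delta[:, 1])
assert np.allclose(R @ R.T, np.eye(3), atol=1e-6)
```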
Since commercial lenses deviate from an ideal lens with a single focal length, they create a number of aberrations. The most common one is referred to as "radial distortion".
The camera model of SCNeRF is extended to incorporate such radial distortions. A widely used 4th order radial distortion model is deployed to express this radial distortion.
The undistorted normalized pixel coordinate $\mathbf{n}'$ converted from a pixel coordinate $\mathbf{p}$ can be obtained using the 4th-order radial distortion model, as you can see in the following equations:

$$(n_x, n_y, 1)^\top = \mathbf{K}^{-1}\mathbf{p}, \qquad r^2 = n_x^2 + n_y^2$$

$$\mathbf{n}' = \big(n_x(1 + k_1 r^2 + k_2 r^4),\; n_y(1 + k_1 r^2 + k_2 r^4),\; 1\big)^\top$$

where the initial radial distortion parameters are denoted as $(k_1^0, k_2^0)$ and the residuals are denoted as $(\Delta k_1, \Delta k_2)$, i.e., $k_i = k_i^0 + \Delta k_i$.
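As a small, self-contained illustration, the sketch below applies the 4th-order radial distortion to a normalized pixel coordinate; the names (`apply_radial_distortion`, `k1`, `k2`) are mine, and the exact composition in the official code may differ.

```python
import numpy as np

def apply_radial_distortion(n, k1, k2):
    """Apply the 4th-order radial distortion model to a normalized
    pixel coordinate n = (nx, ny)."""
    nx, ny = n
    r2 = nx * nx + ny * ny                    # squared radius from the principal axis
    scale = 1.0 + k1 * r2 + k2 * r2 * r2      # 4th-order polynomial in r
    return np.array([nx * scale, ny * scale, 1.0])

# usage: k1, k2 would be k0 + delta_k, with only the residual delta_k optimized
n_prime = apply_radial_distortion(np.array([0.3, -0.1]), k1=0.05, k2=-0.01)
```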
Using the Pinhole Camera Model and the Fourth Order Radial Distortion, the ray direction and ray origin in the world coordinate can be expressed as follows:

$$\mathbf{r}_d = N\!\left(\mathbf{R}^{-1}\mathbf{n}'\right), \qquad \mathbf{r}_o = -\mathbf{R}^{-1}\mathbf{t}$$

where $N(\cdot)$ is vector normalization. For those who may be confused about why $-\mathbf{R}^{-1}\mathbf{t}$ equals the ray origin (i.e., the camera center) in the world coordinate, I made a conceptual image that shows the geometric meaning of the vector $\mathbf{t}$.
Since these ray parameters $\mathbf{r}_o$ and $\mathbf{r}_d$ are functions of the intrinsic, extrinsic, and distortion parameter residuals ($\Delta\mathbf{K}$, $\Delta\mathbf{R}$, $\Delta\mathbf{t}$, $\Delta k$), we can pass gradients from the rays to the residuals to optimize the parameters. Note that $\mathbf{K}_0$, $\mathbf{R}_0$, $\mathbf{t}_0$, and $k^0$ are the initial values of each parameter and are not optimized.
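Putting the pieces together, below is a hedged sketch of how a world-space ray could be generated from the intrinsics, extrinsics, and radial distortion (reusing `apply_radial_distortion` from the sketch above); the function names are illustrative, not the authors' API.

```python
import numpy as np

def generate_ray(p, K, R, t, k1, k2):
    """Generate a world-space ray (origin, direction) for pixel p = (px, py)."""
    # back-project the pixel through the intrinsics to a normalized coordinate
    n = np.linalg.inv(K) @ np.array([p[0], p[1], 1.0])
    n_prime = apply_radial_distortion(n[:2], k1, k2)   # 4th-order radial distortion
    # rotate into world coordinates and normalize to get the ray direction
    d_world = np.linalg.inv(R) @ n_prime
    r_d = d_world / np.linalg.norm(d_world)
    # the camera center in world coordinates serves as the ray origin
    r_o = -np.linalg.inv(R) @ t
    return r_o, r_d
```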
Complex optical aberrations in real lenses cannot be modeled using a parametric camera model. For such noise, a generic non-linear aberration model is used. Specifically, local ray parameter residuals $\Delta\mathbf{r}_o(\mathbf{p})$ and $\Delta\mathbf{r}_d(\mathbf{p})$ are used, where $\mathbf{p}$ is the image coordinate.
Bilinear interpolation is used to extract continuous ray distortion parameters:

$$\Delta\mathbf{r}_d(\mathbf{p}) = \sum_{x\in\{\lfloor p_x\rfloor,\,\lceil p_x\rceil\}} \;\sum_{y\in\{\lfloor p_y\rfloor,\,\lceil p_y\rceil\}} \big(1 - |p_x - x|\big)\big(1 - |p_y - y|\big)\,\mathbf{z}_d[x, y]$$

where $\mathbf{z}_d[x, y]$ indicates the ray direction offset at a control point in discrete 2D coordinate $(x, y)$; $\mathbf{z}_d$ is learned at discrete locations only.
The dual, the ray origin offset $\Delta\mathbf{r}_o(\mathbf{p})$, comes for free by applying the same interpolation to its own grid of control points.
To help your understanding, the conceptual image of a generic non-linear aberration model is attached below.
For those who want to learn more about this generic camera model, please refer to the paper "Why having 10,000 parameters in your camera model is better than twelve" in the Reference & Additional materials section.
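As a small illustration of the interpolation step above, the following sketch bilinearly interpolates a grid of learned ray offsets at a continuous pixel coordinate; the grid shape and names are assumptions, not the official implementation.

```python
import numpy as np

def interpolate_offset(z, p):
    """Bilinearly interpolate a ray offset at continuous pixel p = (px, py)
    from a grid z of offsets learned at discrete control points,
    where z has shape (H, W, 3)."""
    px, py = p
    x0, y0 = int(np.floor(px)), int(np.floor(py))
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = px - x0, py - y0                 # fractional distances to the lower corner
    return ((1 - wx) * (1 - wy) * z[y0, x0] +
            (1 - wx) * wy       * z[y1, x0] +
            wx       * (1 - wy) * z[y0, x1] +
            wx       * wy       * z[y1, x1])

# usage: separate grids would hold the direction offsets and origin offsets
z_d = np.zeros((480, 640, 3))                 # learned ray-direction offsets (toy values)
delta_rd = interpolate_offset(z_d, (123.4, 56.7))
```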
From the Pinhole Camera Model, the Fourth Order Radial Distortion, and the Generic Non-Linear Camera Distortion, the final ray direction and ray origin can be expressed as shown in the following graph.
To optimize calibration parameters, both geometric consistency loss and photometric consistency loss are exploited.
The geometric consistency loss is shown in the figure above. Let's break it down into pieces.
First, let $\mathbf{p}_A$ and $\mathbf{p}_B$ be a correspondence on camera A and camera B, respectively. When all the camera parameters are calibrated, the corresponding rays $\mathbf{r}_A$ and $\mathbf{r}_B$ should intersect at a single 3D point. However, when there is a misalignment due to an error in the camera parameters, the two rays do not meet at a single point, and we can measure the deviation by computing the shortest distance between the corresponding rays.
Let a point on $\mathbf{r}_A$ be $\mathbf{x}_A(t_A) = \mathbf{r}_{o,A} + t_A\,\mathbf{r}_{d,A}$, and a point on $\mathbf{r}_B$ be $\mathbf{x}_B(t_B) = \mathbf{r}_{o,B} + t_B\,\mathbf{r}_{d,B}$. The distance $d$ between the ray $\mathbf{r}_A$ and a point on the ray $\mathbf{r}_B$ is as we can see in the above figure.
Solving $\partial d / \partial t_B = 0$ gives us the $\hat{t}_B$ that makes the distance between $\mathbf{r}_A$ and $\mathbf{x}_B(t_B)$ minimum.
From this, we can find the point $\hat{\mathbf{x}}_B$ on ray B that has the shortest distance to ray A.
The dual, the point $\hat{\mathbf{x}}_A$ on ray A that is closest to ray B, comes for free by symmetry.
In the above figure and its equation, some symbols are written in a simplified form.
After projecting the points onto the image planes and computing the distance on the image planes, the geometric consistency loss (the projected ray distance) can be obtained using a projection function that maps a 3D point to image-plane coordinates.
Note a point far from the cameras would have a large deviation, while a point close to the cameras would have a small deviation. Thus, the distance between the two points is computed on the image plane, not the 3D space, to remove this depth sensitivity.
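To make the projected ray distance concrete, here is a minimal NumPy sketch that finds the mutually closest points on two rays and measures their distance on the image planes; the closed form is standard ray-ray geometry, and the projection callables are placeholders for the actual camera projections.

```python
import numpy as np

def closest_points_between_rays(oA, dA, oB, dB):
    """Return the points on ray A and ray B that are closest to each other.
    Directions dA, dB are assumed to be unit vectors."""
    c = np.dot(dA, dB)
    w = oB - oA
    denom = 1.0 - c * c                       # zero only for parallel rays
    tA = (np.dot(w, dA) - c * np.dot(w, dB)) / denom
    tB = (c * np.dot(w, dA) - np.dot(w, dB)) / denom
    return oA + tA * dA, oB + tB * dB

def projected_ray_distance(xA, xB, project_A, project_B):
    """Project both closest points into each image and average the pixel distances."""
    dA = np.linalg.norm(project_A(xA) - project_A(xB))
    dB = np.linalg.norm(project_B(xA) - project_B(xB))
    return 0.5 * (dA + dB)

# usage with toy rays; project_A / project_B would be the per-camera projections
oA, dA = np.array([0., 0., 0.]), np.array([0., 0., 1.])
oB, dB = np.array([1., 0., 0.]), np.array([0., 1., 1.]) / np.sqrt(2)
xA, xB = closest_points_between_rays(oA, dA, oB, dB)
```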
Photometric consistency loss is defined as follows:

$$\mathcal{L} = \sum_{\mathbf{p}\in\mathcal{I}} \big\lVert \mathbf{C}(\mathbf{p}) - \hat{\mathbf{C}}(\mathbf{r}(\mathbf{p})) \big\rVert_2^2$$

where $\mathbf{p}$ is a pixel coordinate, $\mathcal{I}$ is the set of pixel coordinates in an image, $\hat{\mathbf{C}}(\mathbf{r}(\mathbf{p}))$ is the output of the volumetric rendering using the ray of the corresponding pixel $\mathbf{p}$, and $\mathbf{C}(\mathbf{p})$ is the ground-truth color.
HOW TO ESTIMATE $\hat{\mathbf{C}}(\mathbf{r})$? What is Volume Rendering?
The color value of a ray can be represented as an integral of all colors weighted by the opaqueness along the ray, or it can be approximated as the weighted sum of colors at N points along the ray as follows:

$$\hat{\mathbf{C}}(\mathbf{r}) \approx \sum_{i=1}^{N} T_i\big(1 - \exp(-\sigma_i\delta_i)\big)\,\mathbf{c}_i, \qquad T_i = \exp\!\Big(-\sum_{j=1}^{i-1}\sigma_j\delta_j\Big)$$

where $T_i$ is the transparency (accumulated transmittance), $\mathbf{c}_i$ is the color at the $i$-th sample, $\sigma_i$ is the density, and $\delta_i$ is the distance between adjacent samples.
For those who want to learn more about this equation, please refer to the "NeRF" paper in the Reference & Additional materials section.
Note that photometric consistency loss is differentiable with respect to the learnable camera parameters. From this, we can define gradients for the camera parameters and be able to calibrate the cameras.
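As a small illustration of the weighted sum above, here is a minimal NumPy sketch of discrete volume rendering along one ray; the array names (sigma, c, delta) are assumptions for this example.

```python
import numpy as np

def render_ray_color(sigma, c, delta):
    """Approximate the color of a ray as a weighted sum of N sampled colors.
    sigma: (N,) densities, c: (N, 3) colors, delta: (N,) distances between samples."""
    alpha = 1.0 - np.exp(-sigma * delta)                  # opacity of each segment
    # transmittance: probability the ray reaches sample i without being absorbed
    T = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = T * alpha
    return (weights[:, None] * c).sum(axis=0)             # estimated ray color

# usage with toy samples along one ray
color = render_ray_color(np.full(64, 0.1), np.ones((64, 3)), np.full(64, 0.05))
```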
It is impossible to learn accurate camera parameters when the geometry is unknown or too coarse for self-calibration. Thus, curriculum learning is adopted: geometry and a linear camera model first and complex camera model parameters later.
First, the NeRF network is trained while initializing the camera focal lengths and the principal point to half of the image width and height. Learning coarse geometry first is crucial since it initializes the networks to a more favorable local optimum for learning better camera parameters.
Next, camera parameters for the linear camera model, radial distortion, nonlinear noise of ray direction, and ray origin are sequentially added to the learning.
Following is the final learning algorithm. The curriculum function returns the set of parameters to be learned at the current stage, progressively adding complexity to the camera model.
Next, the model is additionally trained with the projected ray distance by selecting, at random, a target image that has sufficient correspondences.
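To make the curriculum concrete, here is a minimal sketch of a schedule function that progressively unlocks parameter groups; the iteration thresholds and group names are my own placeholders, not the paper's actual schedule.

```python
def curriculum_params(iteration):
    """Return which parameter groups are optimized at a given iteration.
    The thresholds below are placeholders, not the paper's schedule."""
    groups = ["nerf"]                                                  # coarse geometry first
    if iteration > 20_000:
        groups += ["intrinsics_residual", "extrinsics_residual"]       # linear camera model
    if iteration > 40_000:
        groups += ["radial_distortion_residual"]                       # 4th-order radial distortion
    if iteration > 60_000:
        groups += ["ray_direction_offsets", "ray_origin_offsets"]      # generic non-linear noise
    return groups

# usage: at each training step, build the optimizer's parameter list from these groups
print(curriculum_params(10_000))   # ['nerf']
print(curriculum_params(70_000))   # all groups unlocked
```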
Here, not all but some representative experimental results will be covered.
Dataset
LLFF
8 scenes
Tanks and Temples
4 scenes
Custom data collected by the authors
6 scenes
fish-eye camera
Experiments
Improve over NeRF
Improve over NeRF++
Fish-eye Lens Reconstruction
Ablation Study
In this article, only the dataset and experiment highlighted in red will be covered.
Table 1 reports the qualities of the rendered images in the training set. Although the SCNeRF model does not adopt calibrated camera information, it shows a reliable rendering performance.
The SCNeRF model shows better rendering quality than NeRF when COLMAP initializes the camera information. Table 2 reports the rendering quality of NeRF and SCNeRF; SCNeRF consistently outperforms the original NeRF.
Following is the visualization of the rendered images.
Here, (a) is the NeRF result without COLMAP output, (b) is the NeRF result with COLMAP output, (c) is the SCNeRF result without COLMAP output, and (d) is the SCNeRF result with COLMAP output.
To check the effects of the proposed components, an ablation study is conducted. Each phase is trained for 200K iterations. The experiment shows that extending the SCNeRF model with learnable intrinsic and extrinsic parameters (IE), non-linear distortion (OD), and the projected ray distance loss (PRD) makes the model more capable of rendering clearer images. However, for some scenes, adopting the projected ray distance increases the overall projected ray distance, even though this is not stated in the table.
Following is the visualization of the ablation study.
SCNeRF proposes a self-calibration algorithm that learns geometry and camera parameters jointly. The camera model consists of a pinhole model, radial distortion, and non-linear distortion, which capture real noises in lenses. The SCNeRF also proposes projected ray distance to improve accuracy, which allows the SCNeRF model to learn fine-grained correspondences. SCNeRF model learns geometry and camera parameters from scratch when the poses are not given, and improves NeRF to be more robust when camera poses are given.
From my perspective, this paper is valuable because it shows a way to calibrate camera parameters and neural radiance fields jointly.
I wonder why the result in the paper reports training set accuracy instead of val/test set accuracy.
I am a little disappointed that the unfavorable results are only stated in the text, not in the tables (e.g., the ablation study).
I noticed some errors in the equations and corrected them as I think they should be. Please feel free to comment if you find any errors in the equations used in this article.
SCNeRF learns geometry and camera parameters from scratch w/o poses
SCNeRF uses the camera model consisting of a pinhole model, radial distortion, and non-linear distortion
SCNeRF proposed projected ray distance to improve accuracy
김민정(Min-Jung Kim)
KAIST AI
Contact Information
email: emjay73@naver.com
Citation of this paper
Jeong, Yoonwoo, et al. "Self-calibrating neural radiance fields." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
Official Project Page : https://postech-cvlab.github.io/SCNeRF/
Official GitHub repository : https://github.com/POSTECH-CVLab/SCNeRF
Citation of related work
Mildenhall, Ben, et al. "Nerf: Representing scenes as neural radiance fields for view synthesis." European conference on computer vision. Springer, Cham, 2020.
Schops, Thomas, et al. "Why having 10,000 parameters in your camera model is better than twelve." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
Zhou, Yi, et al. "On the continuity of rotation representations in neural networks." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
Other useful materials
Lens Aberrations : https://www.nao.org/wp-content/uploads/2020/04/Lens-Aberrations.pdf
Camera models : https://cvgl.stanford.edu/teaching/cs231a_winter1415/lecture/lecture2_camera_models_note.pdf