Self-Calibrating Neural Radiance Fields [Eng]

Jeong et al. / Self-Calibrating Neural Radiance Fields / ICCV 2021


Click here to read this review in Korean.

1. Problem definition

Given a set of images of a scene, the proposed method, dubbed SCNeRF, jointly learns the geometry of the scene and the accurate camera parameters without any calibration objects. This task can be expressed as the following equation.

Find $K, R, t, k, r_o, r_d, \theta$ such that

$$\mathbf{r}=(\mathbf{r}_o, \mathbf{r}_d)=f_{cam}(K, R, t, k, r_o, r_d)$$

$$\hat{\mathbf{C}}(\mathbf{r})=f_{nerf}(\mathbf{r};\theta)$$

where $\mathbf{r}$ is a ray, $\mathbf{r}_o$ is the ray origin, $\mathbf{r}_d$ is the ray direction, $f_{cam}$ is the function that generates a ray from the camera parameters, $(K, R, t, k, r_o, r_d)$ are the camera parameters, $\hat{\mathbf{C}}(\mathbf{r})$ is the estimated color of the ray $\mathbf{r}$, $\theta$ is the parameter set of the NeRF model, and $f_{nerf}$ is the function that estimates the color of a ray using the NeRF parameters.

Generally, scene geometry is learned with known camera parameters, or camera parameters are estimated without improving or learning scene geometry.

Unlike these previous approaches, the goal of this paper is to learn the camera parameters $(K, R, t, k, r_o, r_d)$ and the NeRF model parameters $\theta$ jointly.

2. Motivation

Camera Model

Because of its simplicity and generality, traditional 3D vision tasks often assume a simple pinhole camera model. However, with the development of cameras, various camera models have been introduced, including fisheye models and per-pixel generic models. A basic pinhole camera model is not expressive enough to represent these more complex cameras.

Camera Self-Calibration

Self-Calibration is a research topic that calibrates camera parameters without an external calibration object (e.g., a checkerboard pattern) in the scene.

In many cases, calibration objects are not readily available. Thus, calibrating camera parameters without any external objects has been an important research topic.

However, conventional self-calibration methods solely rely on the geometric loss or constraints based on the epipolar geometry that only uses a set of sparse correspondences extracted from a non-differentiable process. This could lead to diverging results with extreme sensitivity to noise when a scene does not have enough interest points. Lastly, conventional self-calibration methods use an off-the-shelf non-differentiable feature matching algorithm and do not improve or learn the geometry, though it is well known that the better we know the geometry of the scene, the more accurate the camera model gets.

Neural Radiance Fields (NeRF) for Novel View Synthesis

NeRF is a work that synthesizes a novel view of the scene by optimizing a separate neural continuous volume representation network for each scene.

At the time NeRF was published, it achieved state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views.

However, this requires not only a dataset of captured RGB images of the scene but also the corresponding camera poses and intrinsic parameters, which are not always available.

Idea

  • Pinhole camera model parameters, fourth-order radial distortion parameters, and generic noise model parameters that can learn arbitrary non-linear camera distortions are combined to overcome the limitations of the pinhole camera model.

  • To overcome the limitation of geometric loss used in the previous self-calibration methods, additional photometric consistency is used.

  • To get a more accurate camera model using improved geometry of the scene, the geometry represented using Neural Radiance Fields is learned jointly.

3. Method

Differentiable Camera Model

Pinhole Camera Model

The pinhole camera model maps a homogeneous 4-vector $P_{4 \times 1}$ in 3D space to a 3-vector $P'_{3 \times 1}$ on the image plane.

$$P'_{3\times1} = M_{3\times4}P=K_{3\times3}\left[R\; T\right]_{3\times 4} P_{4\times 1}$$

where $K$ is the intrinsic matrix, $R$ is the rotation matrix, and $T$ is the translation vector.

First, the camera intrinsics are decomposed into an initialization $K_0$ and a residual parameter matrix $\Delta K$ ($= z_K$). This decomposition is used because optimizing the intrinsics directly is a highly non-convex problem with many local minima.

$$K=\begin{bmatrix} f_x+\Delta f_x & 0 & c_x + \Delta c_x \\ 0 & f_y + \Delta f_y & c_y + \Delta c_y \\ 0 & 0 & 1 \end{bmatrix} = K_0 + \Delta K \in \mathbb{R}^{3\times 3}$$

Similarly, the extrinsic parameters are decomposed into initial values and their residual parameters to represent the camera rotation R and translation t. However, directly learning the rotation offset for each element of a rotation matrix would break the orthogonality of the rotation matrix. Thus, the 6-vector representation which uses the unnormalized first two columns of a rotation matrix is utilized to represent a 3D rotation:

$$\mathbf{t} = \mathbf{t}_0 + \Delta \mathbf{t},\qquad R=f(\mathbf{a}_0+\Delta \mathbf{a})$$

$$f\left(\begin{bmatrix} | & | \\ \mathbf{a}_1 & \mathbf{a}_2 \\ | & | \end{bmatrix}\right) = \begin{bmatrix} | & | & | \\ \mathbf{b}_1 & \mathbf{b}_2 & \mathbf{b}_3 \\ | & | & | \end{bmatrix}_{3 \times 3}$$

What $f$ does is quite similar to the Gram-Schmidt process. To make this clear, I made the following conceptual image. Here, $N(\cdot)$ denotes $L_2$ normalization.

As shown in the figure, the orthonormal vectors $\mathbf{b}_1, \mathbf{b}_2, \mathbf{b}_3$ are obtained from the two unnormalized vectors $\mathbf{a}_1$ and $\mathbf{a}_2$. A minimal code sketch of this construction is given below.
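The following NumPy sketch implements $f$ as I understand it from the figure; the function name and column-wise layout are my own choices for illustration, not the authors' implementation.

```python
import numpy as np

def six_vector_to_rotation(a1, a2):
    """Gram-Schmidt-style map from the unnormalized 6-vector representation
    (two 3D columns a1, a2) to a rotation matrix, following Zhou et al. (2019).
    Illustrative sketch only, not the authors' implementation."""
    b1 = a1 / np.linalg.norm(a1)            # b1 = N(a1)
    b2 = a2 - np.dot(b1, a2) * b1           # remove the b1 component of a2
    b2 = b2 / np.linalg.norm(b2)            # b2 = N(a2 - (b1·a2) b1)
    b3 = np.cross(b1, b2)                   # b3 = b1 × b2, already unit length
    return np.stack([b1, b2, b3], axis=1)   # columns are b1, b2, b3
```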

Fourth Order Radial Distortion

Since commercial lenses deviate from an ideal single-focal-length lens, they introduce a number of aberrations. The most common one is radial distortion.

The camera model of SCNeRF is extended to incorporate such radial distortions. A widely used 4th order radial distortion model is deployed to express this radial distortion.

The undistorted normalized pixel coordinate $(n'_x, n'_y)$ corresponding to the pixel coordinate $(p_x, p_y)$ is obtained with the 4th-order radial distortion model as follows.

$$(n_x, n_y) = \left(\frac{p_x-c_x}{f_x},\ \frac{p_y-c_y}{f_y}\right)$$

$$r=\sqrt{n^2_x+n^2_y}$$

$$\left[n'_x, n'_y, 1 \right]^T = K^{-1} \left[p_x\left(1+(k_1+z_{k_1}) r^2 + (k_2+z_{k_2}) r^4\right),\ p_y\left(1+(k_1+z_{k_1}) r^2 + (k_2+z_{k_2}) r^4\right),\ 1 \right]^T$$

where $(k_1, k_2)$ are the initial radial distortion parameters, denoted $k_0$, and $(z_{k_1}, z_{k_2})$ are their residuals, denoted $z_k$. A small numerical sketch of this mapping is given below.
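The NumPy sketch below implements the equations above; the argument names ($p$ for the pixel, $K$ for the intrinsics, $k_0$ and $z_k$ for the distortion parameters and residuals) are assumptions for illustration.

```python
import numpy as np

def undistort_normalized(p, K, k0, z_k):
    """Apply the 4th-order radial distortion model of the text to a pixel p = (px, py).
    K is the 3x3 intrinsic matrix, k0 = (k1, k2) the initial distortion parameters,
    z_k = (zk1, zk2) their residuals. Sketch only, with assumed argument names."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    nx, ny = (p[0] - cx) / fx, (p[1] - cy) / fy                   # normalized coordinates
    r2 = nx ** 2 + ny ** 2                                        # r^2
    d = 1 + (k0[0] + z_k[0]) * r2 + (k0[1] + z_k[1]) * r2 ** 2    # 1 + k1 r^2 + k2 r^4
    undist = np.linalg.inv(K) @ np.array([p[0] * d, p[1] * d, 1.0])
    return undist[0], undist[1]                                   # (n'_x, n'_y)
```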

Ray Direction & Origin

Using the pinhole camera model and the fourth-order radial distortion above, the ray direction $\mathbf{r}_d$ and ray origin $\mathbf{r}_o$ in world coordinates can be expressed as follows.

$$\mathbf{r}_d = N\left(R \left[n'_x, n'_y, 1 \right]^T\right)$$

$$\mathbf{r}_o=\mathbf{t}$$

where $N(\cdot)$ is vector normalization. For those who may be confused about why $\mathbf{t}$ equals the ray origin $\mathbf{r}_o$ (i.e., the camera center) in world coordinates, I made a conceptual image that shows the geometric meaning of the vector $\mathbf{t}$.

Since the ray parameters $\mathbf{r}_d$ and $\mathbf{r}_o$ are functions of the intrinsic, extrinsic, and distortion parameter residuals ($\Delta f, \Delta c, \Delta a, \Delta t, \Delta k$), we can pass gradients from the rays back to the residuals to optimize these parameters. Note that $K_0, R_0, t_0, k_0$ are the initial values of each parameter and are not optimized.

Generic Non-Linear Camera Distortion

Complex optical aberrations in real lenses cannot be modeled with a parametric camera. For such noise, a generic non-linear aberration model is used. Specifically, local ray parameter residuals $\mathbf{z}_d = \Delta \mathbf{r}_d(\mathbf{p})$ and $\mathbf{z}_o = \Delta \mathbf{r}_o(\mathbf{p})$ are used, where $\mathbf{p}$ is the image coordinate.

$$\mathbf{r}'_d = \mathbf{r}_d + \mathbf{z}_d$$

$$\mathbf{r}'_o=\mathbf{r}_o+\mathbf{z}_o$$

Bilinear interpolation is used to extract continuous ray distortion parameters.

$$\mathbf{z}_d(\mathbf{p}) = \sum_{x=\lfloor\mathbf{p}_x\rfloor}^{\lfloor\mathbf{p}_x\rfloor+1}\sum_{y=\lfloor\mathbf{p}_y\rfloor}^{\lfloor\mathbf{p}_y\rfloor+1} \left(1-|x-\mathbf{p}_x|\right)\left(1-|y-\mathbf{p}_y|\right)\mathbf{z}_d\left[x,y\right]$$

where $\mathbf{z}_d[x, y]$ indicates the ray direction offset at the control point with discrete 2D coordinate $(x, y)$. $\mathbf{z}_d[x, y]$ is learned at discrete locations only.

Dual comes for free.

$$\mathbf{z}_o(\mathbf{p}) = \sum_{x=\lfloor\mathbf{p}_x\rfloor}^{\lfloor\mathbf{p}_x\rfloor+1}\sum_{y=\lfloor\mathbf{p}_y\rfloor}^{\lfloor\mathbf{p}_y\rfloor+1} \left(1-|x-\mathbf{p}_x|\right)\left(1-|y-\mathbf{p}_y|\right)\mathbf{z}_o\left[x,y\right]$$
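Both equations are standard bilinear interpolation; a NumPy sketch applicable to either $\mathbf{z}_d$ or $\mathbf{z}_o$ follows, with an assumed grid layout (the grid shape and indexing are my illustration, not the official code).

```python
import numpy as np

def interp_offset(z_grid, p):
    """Bilinearly interpolate a learned offset field (z_d or z_o) at a continuous
    pixel coordinate p = (p_x, p_y). z_grid[x, y] stores the 3D offset at the
    discrete control point (x, y); the layout is an assumption for illustration."""
    x0, y0 = int(np.floor(p[0])), int(np.floor(p[1]))
    out = np.zeros(3)
    for x in (x0, x0 + 1):
        for y in (y0, y0 + 1):
            w = (1 - abs(x - p[0])) * (1 - abs(y - p[1]))  # bilinear weight
            out += w * z_grid[x, y]
    return out
```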

To help your understanding, the conceptual image of a generic non-linear aberration model is attached below.

For those who want to learn more about this generic camera model, please refer to the paper "Why having 10,000 parameters in your camera model is better than twelve" in the Reference & Additional materials section.

Computational Graph of Ray Direction & Origin

Combining the pinhole camera model, the fourth-order radial distortion, and the generic non-linear camera distortion, the final ray direction and ray origin can be expressed with the following computational graph. A code sketch of this composition follows.
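Since the graph figure is not reproduced here, the sketch below composes the pieces into a ray in code, reusing the helper functions sketched in the previous subsections (six_vector_to_rotation, undistort_normalized, interp_offset). The function signature and parameter names are illustrative assumptions, not SCNeRF's actual API.

```python
import numpy as np

def generate_ray(p, K0, dK, a0, da, t0, dt, k0, z_k, z_d_grid, z_o_grid):
    """Compose intrinsics, radial distortion, extrinsics, and generic offsets into a ray.
    Assumes the helper functions sketched above; all Δ/z terms are the learnable residuals."""
    K = K0 + dK                                   # K = K0 + ΔK
    nx, ny = undistort_normalized(p, K, k0, z_k)  # 4th-order radial distortion
    a = a0 + da                                   # 6-vector rotation parameters
    R = six_vector_to_rotation(a[:3], a[3:])      # R = f(a0 + Δa)
    t = t0 + dt                                   # t = t0 + Δt
    d = R @ np.array([nx, ny, 1.0])
    r_d = d / np.linalg.norm(d)                   # r_d = N(R [n'_x, n'_y, 1]^T)
    r_o = t                                       # r_o = t
    r_d = r_d + interp_offset(z_d_grid, p)        # r'_d = r_d + z_d(p)
    r_o = r_o + interp_offset(z_o_grid, p)        # r'_o = r_o + z_o(p)
    return r_o, r_d
```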

Loss

To optimize calibration parameters, both geometric consistency loss and photometric consistency loss are exploited.

Geometric Consistency Loss

The geometric consistency loss is $d_\pi$ in the figure above. Let's break this down into pieces.

First, let $(\mathbf{p}_A \leftrightarrow \mathbf{p}_B)$ be a correspondence on cameras A and B, respectively. When all camera parameters are calibrated, the rays $\mathbf{r}_A$ and $\mathbf{r}_B$ should intersect at a single 3D point. However, when there is a misalignment due to errors in the camera parameters, the two rays do not meet at a single point, and we can measure the deviation by computing the shortest distance between the corresponding rays.

Let a point on $\mathbf{r}_A$ be $\mathbf{x}_A(t_A) = \mathbf{r}_{o,A} + t_A\mathbf{r}_{d,A}$, and a point on $\mathbf{r}_B$ be $\mathbf{x}_B(t_B) = \mathbf{r}_{o,B} + t_B\mathbf{r}_{d,B}$. The distance between $\mathbf{r}_A$ and a point on $\mathbf{r}_B$ is $d$, as shown in the figure above.

Solving $\frac{\mathrm{d}(d^2)}{\mathrm{d}t_B}\Big|_{\hat{t}_B}=0$ gives us the $\hat{t}_B$ that minimizes the distance between $\mathbf{x}_B$ and $\mathbf{r}_A$.

$$\hat{t}_B = \frac{\left(\left(\mathbf{r}_{A,o}-\mathbf{r}_{B,o}\right) \times \mathbf{r}_{A,d}\right)\cdot \left( \mathbf{r}_{A,d} \times \mathbf{r}_{B,d}\right)}{\left\|\mathbf{r}_{A,d}\times\mathbf{r}_{B,d}\right\|^2}$$

From this, we can find a point on ray B that has the shortest distance to ray A.

$$\hat{\mathbf{x}}_B = \mathbf{x}_B(\hat{t}_B)$$

Dual comes for free.

$$\hat{\mathbf{x}}_A = \mathbf{x}_A(\hat{t}_A)$$

In the figure above, in the equation for $d_{\pi}$, $\hat{\mathbf{x}}_A$ and $\hat{\mathbf{x}}_B$ are written as $\mathbf{x}_A$ and $\mathbf{x}_B$ for simplicity.

After projecting these points onto the image planes and computing the distance on the image planes, the geometric consistency loss $d_\pi$ is obtained, where $\pi(\cdot)$ is the projection function.

Note that a point far from the cameras would have a large deviation, while a point close to the cameras would have a small deviation. Thus, to remove this depth sensitivity, the distance between the two points is computed on the image plane rather than in 3D space. A minimal sketch of this computation is given below.
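Since the figure with the exact $d_\pi$ formula is not reproduced here, the sketch below follows the textual description: find the mutually closest points on the two rays, project both into each image plane, and measure the pixel-space distance. Rather than the closed form above, the closest points are computed by solving the 2×2 normal equations of the squared distance, which yields the same minimizers; project_a and project_b stand in for the projection function $\pi(\cdot)$ of cameras A and B and are assumptions.

```python
import numpy as np

def closest_points(o_a, d_a, o_b, d_b):
    """Mutually closest points on rays x(t) = o + t d (unit directions assumed).
    Solves the 2x2 normal equations that minimize |x_A(t_A) - x_B(t_B)|^2."""
    b = np.dot(d_a, d_b)
    w0 = o_a - o_b
    A = np.array([[1.0, -b], [b, -1.0]])                 # from d/dt_A = 0 and d/dt_B = 0
    rhs = -np.array([np.dot(d_a, w0), np.dot(d_b, w0)])
    t_a, t_b = np.linalg.solve(A, rhs)                   # assumes the rays are not parallel
    return o_a + t_a * d_a, o_b + t_b * d_b

def projected_ray_distance(x_a_hat, x_b_hat, project_a, project_b):
    """Project both closest points onto each image plane with the assumed
    projection functions and average the pixel-space distances."""
    d_on_a = np.linalg.norm(project_a(x_a_hat) - project_a(x_b_hat))
    d_on_b = np.linalg.norm(project_b(x_a_hat) - project_b(x_b_hat))
    return 0.5 * (d_on_a + d_on_b)
```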

Photometric Consistency Loss

Photometric consistency loss is defined as the following.

$$\mathcal{L} = \sum_{\mathbf{p}\in\mathcal{I}}\left\|C(\mathbf{p})-\hat{C}(\mathbf{r}(\mathbf{p}))\right\|^2_2$$

where $\mathbf{p}$ is a pixel coordinate, $\mathcal{I}$ is the set of pixel coordinates in an image, $\hat{C}(\mathbf{r})$ is the output of volumetric rendering along the ray $\mathbf{r}$ of the corresponding pixel $\mathbf{p}$, and $C(\mathbf{p})$ is the ground-truth color.

HOW TO ESTIMATE $\hat{C}(\mathbf{r})$? What is Volume Rendering?

The color value $\mathbf{C}$ of a ray can be represented as an integral of all colors weighted by the opaqueness along the ray, or approximated as a weighted sum of the colors at $N$ points along the ray, as follows.

$$\hat{\mathbf{C}} \approx \sum_{i=1}^N\left( \prod_{j=1}^{i-1}\alpha \left(\mathbf{r}(t_j), \Delta_j\right) \right)\left( 1-\alpha\left(\mathbf{r}(t_i), \Delta_i\right) \right) \mathbf{c}\left( \mathbf{r}(t_i), \mathbf{v} \right)$$

where $\alpha(\cdot)$ is the transparency, $\mathbf{c}(\cdot)$ is the color, and $\Delta_i = t_{i+1}-t_{i}$ is the distance between adjacent samples.

For those who want to learn more about this equation, please refer to the NeRF paper in the Reference & Additional materials section. A short numerical sketch of this weighted sum is given below.
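The sketch assumes, as in NeRF, that the per-segment transparency is $\alpha(\mathbf{r}(t_i), \Delta_i) = \exp(-\sigma_i \Delta_i)$ with $\sigma_i$ the predicted density; this is an illustration, not the official implementation.

```python
import numpy as np

def composite_color(sigmas, colors, deltas):
    """Approximate the volume-rendering integral by a weighted sum over N samples.
    sigmas: (N,) densities, colors: (N, 3) RGB values, deltas: (N,) sample spacings."""
    alphas = np.exp(-sigmas * deltas)                          # per-segment transparency α_i
    trans = np.cumprod(np.concatenate([[1.0], alphas[:-1]]))   # Π_{j<i} α_j
    weights = trans * (1.0 - alphas)                           # w_i = (Π_{j<i} α_j)(1 - α_i)
    return (weights[:, None] * colors).sum(axis=0)             # Σ_i w_i c_i
```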

Note that the photometric consistency loss is differentiable with respect to the learnable camera parameters, so we can define gradients for the camera parameters and calibrate the cameras. The sketch below illustrates this joint update.
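This is a hedged PyTorch-style sketch of a single joint update in which the camera residuals (e.g., $\Delta K, \Delta a, \Delta t, z_k$) are ordinary tensors created with requires_grad=True; nerf, render_fn, ray_fn, and residuals are placeholder names, not SCNeRF's actual API.

```python
import torch

def joint_calibration_step(nerf, render_fn, ray_fn, residuals, pixels, rgb_gt, optimizer):
    """One gradient step that updates the NeRF weights and the camera residuals together.
    residuals is assumed to be a dict of tensors created with requires_grad=True."""
    rays_o, rays_d = ray_fn(pixels, residuals)        # differentiable ray generation
    rgb_pred = render_fn(nerf, rays_o, rays_d)        # differentiable volume rendering
    loss = ((rgb_pred - rgb_gt) ** 2).sum()           # photometric consistency loss
    optimizer.zero_grad()
    loss.backward()                                   # gradients reach both θ and the residuals
    optimizer.step()
    return loss.item()
```

The optimizer would then be constructed over both parameter groups, for example torch.optim.Adam(list(nerf.parameters()) + list(residuals.values())).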

Curriculum Learning

It is impossible to learn accurate camera parameters when the geometry is unknown or too coarse for self-calibration. Thus, curriculum learning is adopted: geometry and a linear camera model first and complex camera model parameters later.

First, the NeRF network is trained with the camera focal lengths and the principal point initialized to half the image width and height. Learning coarse geometry first is crucial since it initializes the networks to a more favorable local optimum for learning better camera parameters.

Next, camera parameters for the linear camera model, radial distortion, nonlinear noise of ray direction, and ray origin are sequentially added to the learning.

The following is the final learning algorithm. The `get_params` function returns the set of parameters used at each stage of the curriculum, which progressively adds complexity to the camera model.

Next, the model is trained with the projected ray distance by randomly selecting a target image that has sufficient correspondences. A sketch of the curriculum schedule is given below.
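Since the algorithm figure is not reproduced here, the following sketch conveys the idea of `get_params`: parameter groups are enabled progressively as training proceeds. The phase boundaries and attribute names are illustrative assumptions, not the paper's exact schedule.

```python
def get_params(phase, nerf_params, camera):
    """Return the parameter groups to optimize at a given curriculum phase.
    camera is assumed to expose lists of tensors for each parameter group;
    the ordering follows the text, the exact thresholds are illustrative."""
    params = list(nerf_params)                    # scene geometry (NeRF weights) is always trained
    if phase >= 1:
        params += camera.intrinsics_extrinsics    # linear pinhole model: ΔK, Δa, Δt
    if phase >= 2:
        params += camera.radial_distortion        # 4th-order radial distortion residuals z_k
    if phase >= 3:
        params += camera.ray_direction_noise      # generic non-linear noise z_d
    if phase >= 4:
        params += camera.ray_origin_noise         # generic non-linear noise z_o
    return params
```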

4. Experiment & Result

Here, not all but some representative experimental results will be covered.

Experimental Setup

  • Dataset

    • LLFF

      • 8 scenes

    • Tanks and Temples

      • 4 scenes

    • Custom data collected by the author

      • 6 scenes

      • fish-eye camera

  • Experiments

    • Improve over NeRF

    • Improve over NeRF++

    • Fish-eye Lens Reconstruction

    • Ablation Study

In this article, only a subset of the datasets and experiments, namely the improvement over NeRF and the ablation study, will be covered.

Improvement over NeRF

Table 1 reports the quality of the rendered images on the training set. Although the SCNeRF model does not use calibrated camera information, it shows reliable rendering performance.

When the camera information is initialized with COLMAP, SCNeRF shows better rendering quality than NeRF. Table 2 reports the rendering quality of NeRF and SCNeRF; SCNeRF is consistently better than the original NeRF.

Following is the visualization of the rendered images.

Here, (a) is the NeRF result without COLMAP output, (b) is the NeRF result with COLMAP output, (c) is the SCNeRF result without COLMAP output, and (d) is the SCNeRF result with COLMAP output.

Ablation Study

To check the effects of the proposed components, an ablation study is conducted. Each phase is trained for 200K iterations. The experiment shows that extending the SCNeRF model with learnable intrinsic and extrinsic parameters (IE), non-linear distortion (OD), and the projected ray distance loss (PRD) has greater potential for rendering clearer images. However, for some scenes, adopting the projected ray distance increases the overall projected ray distance, even though this is not stated in the table.

Following is the visualization of the ablation study.

5. Conclusion

Summary

SCNeRF proposes a self-calibration algorithm that learns the scene geometry and camera parameters jointly. The camera model consists of a pinhole model, radial distortion, and non-linear distortion, which capture the noise of real lenses. SCNeRF also proposes the projected ray distance to improve accuracy, which allows the model to learn fine-grained correspondences. SCNeRF learns geometry and camera parameters from scratch when poses are not given, and makes NeRF more robust when camera poses are given.

Personal Opinion

  • From my perspective, this paper is valuable because it shows a way to calibrate camera parameters and learn neural radiance fields jointly.

  • I wonder why the result in the paper reports training set accuracy instead of val/test set accuracy.

  • I was a little disappointed that the unfavorable results are only stated in the text, not in the table (e.g., the Ablation Study).

  • I noticed some errors in the equations and corrected them as I think they should be. Please feel free to comment if you find any errors in the equations used in this article.

Take home message

SCNeRF learns geometry and camera parameters from scratch w/o poses

SCNeRF uses the camera model consisting of a pinhole model, radial distortion, and non-linear distortion

SCNeRF proposes the projected ray distance to improve accuracy

Reviewer information

김민정(Min-Jung Kim)

  • KAIST AI

  • Contact Information

    • email: emjay73@naver.com

Reference & Additional materials

  1. Citation of this paper

    1. Jeong, Yoonwoo, et al. "Self-calibrating neural radiance fields." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.

  2. Official GitHub repository : https://github.com/POSTECH-CVLab/SCNeRF

  3. Citation of related work

    1. Mildenhall, Ben, et al. "Nerf: Representing scenes as neural radiance fields for view synthesis." European conference on computer vision. Springer, Cham, 2020.

    2. Schops, Thomas, et al. "Why having 10,000 parameters in your camera model is better than twelve." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.

    3. Zhou, Yi, et al. "On the continuity of rotation representations in neural networks." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.

  4. Other useful materials

    1. Lens Aberrations : https://www.nao.org/wp-content/uploads/2020/04/Lens-Aberrations.pdf

    2. Camera models : https://cvgl.stanford.edu/teaching/cs231a_winter1415/lecture/lecture2_camera_models_note.pdf
