Self-Calibrating Neural Radiance Fields [Eng]

Jeong et al. / Self-Calibrating Neural Radiance Fields / ICCV 2021


To read this review in Korean, please see the [Kor] version of this page.

1. Problem definition

Given a set of images of a scene, the proposed method, dubbed SCNeRF, jointly learns the geometry of the scene and the accurate camera parameters without any calibration objects. This task can be expressed with the following equations.

Find $K, R, t, k, r_o, r_d, \theta$, when

$$\mathbf{r}=(\mathbf{r}_o, \mathbf{r}_d)=f_{cam}(K, R, t, k, r_o, r_d)$$

$$\hat{\mathbf{C}}(\mathbf{r})=f_{nerf}(\mathbf{r};\theta)$$

where $\mathbf{r}$ is a ray, $\mathbf{r}_o$ is the ray origin, $\mathbf{r}_d$ is the ray direction, $f_{cam}$ is a function that generates a ray from camera parameters, $(K, R, t, k, r_o, r_d)$ are the camera parameters, $\hat{\mathbf{C}}(\mathbf{r})$ is the estimated color of the ray $\mathbf{r}$, $\theta$ is the parameter set of the NeRF model, and $f_{nerf}$ is a function that estimates the color of a ray using the NeRF parameters.

Generally, scene geometry is learned with known camera parameters, or camera parameters are estimated without improving or learning scene geometry.

Unlike these previous approaches, the purpose of this paper is to learn the camera parameters $(K, R, t, k, r_o, r_d)$ and the NeRF model parameters $\theta$ jointly.

2. Motivation

Related work

Camera Model

Because of its simplicity and generality, traditional 3D vision tasks often assume that the camera model is a simple pinhole model. However, as camera models have developed, various models have been introduced, including fisheye models and per-pixel generic models. A basic pinhole camera model is not expressive enough to represent these kinds of complex cameras.

Camera Self-Calibration

Self-Calibration is a research topic that calibrates camera parameters without an external calibration object (e.g., a checkerboard pattern) in the scene.

In many cases, calibration objects are not readily available. Thus, calibrating camera parameters without any external objects has been an important research topic.

However, conventional self-calibration methods solely rely on the geometric loss or constraints based on the epipolar geometry that only uses a set of sparse correspondences extracted from a non-differentiable process. This could lead to diverging results with extreme sensitivity to noise when a scene does not have enough interest points. Lastly, conventional self-calibration methods use an off-the-shelf non-differentiable feature matching algorithm and do not improve or learn the geometry, though it is well known that the better we know the geometry of the scene, the more accurate the camera model gets.

Neural Radiance Fields (NeRF) for Novel View Synthesis

NeRF is a work that synthesizes a novel view of the scene by optimizing a separate neural continuous volume representation network for each scene.

When NeRF was published, it achieved state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views.

However, this requires not only a dataset of captured RGB images of the scene but also the corresponding camera poses and intrinsic parameters, which are not always available.

Idea

  • Pinhole camera model parameters, fourth-order radial distortion parameters, and generic noise model parameters that can learn arbitrary non-linear camera distortions are included to overcome the limitation of the pinhole camera model.

  • To overcome the limitation of geometric loss used in the previous self-calibration methods, additional photometric consistency is used.

  • To get a more accurate camera model using improved geometry of the scene, the geometry represented using Neural Radiance Fields is learned jointly.

3. Method

Differentiable Camera Model

Pinhole Camera Model

The pinhole camera model maps a 4-vector homogeneous coordinate in 3D space $P_{4\times 1}$ to a 3-vector in the image plane $P'_{3\times 1}$.

$$P'_{3\times1} = M_{3\times4}P = K_{3\times3}\left[R \;\; T\right]_{3\times 4} P_{4\times 1}$$

where $K$ is the intrinsic matrix, $R$ is the rotation matrix, and $T$ is the translation vector.

First, the camera intrinsic parameters are decomposed into the initialization $K_0$ and the residual parameter matrix $\Delta K$ ($= z_K$). This is due to the highly non-convex nature of the intrinsic matrix, which has a lot of local minima.

$$K=\begin{bmatrix} f_x+\Delta f_x & 0 & c_x + \Delta c_x \\ 0 & f_y + \Delta f_y & c_y + \Delta c_y \\ 0 & 0 & 1 \end{bmatrix} = K_0 + \Delta K \in \mathbb{R}^{3\times 3}$$

Similarly, the extrinsic parameters are decomposed into initial values and their residual parameters to represent the camera rotation R and translation t. However, directly learning the rotation offset for each element of a rotation matrix would break the orthogonality of the rotation matrix. Thus, the 6-vector representation which uses the unnormalized first two columns of a rotation matrix is utilized to represent a 3D rotation:

$$\mathbf{t} = \mathbf{t}_0 + \Delta \mathbf{t}$$

$$R=f(\mathbf{a}_0+\Delta \mathbf{a})$$

$$f\left(\begin{bmatrix} | & | \\ \mathbf{a}_1 & \mathbf{a}_2 \\ | & | \end{bmatrix}\right) = \begin{bmatrix} | & | & | \\ \mathbf{b}_1 & \mathbf{b}_2 & \mathbf{b}_3 \\ | & | & | \end{bmatrix}_{3 \times 3}$$

What $f$ does is quite similar to the Gram-Schmidt process. To make it clear, I made a conceptual image as follows. Here, $N(\cdot)$ is $L_2$ normalization.

As we can see in the figure, from the two unnormalized vectors $\mathbf{a}_1$ and $\mathbf{a}_2$, the orthonormal vectors $\mathbf{b}_1, \mathbf{b}_2, \mathbf{b}_3$ can be obtained.
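To make the rotation parameterization concrete, below is a minimal NumPy sketch of what a function like $f$ could compute, following the Gram-Schmidt-like reading above. This is my own illustration, not the authors' code; `rotation_from_6d`, `a0`, and `delta_a` are hypothetical names.

```python
import numpy as np

def rotation_from_6d(a1, a2):
    # Normalize the first column, orthogonalize and normalize the second,
    # and obtain the third via the cross product (Gram-Schmidt-like).
    b1 = a1 / np.linalg.norm(a1)
    b2 = a2 - np.dot(b1, a2) * b1
    b2 = b2 / np.linalg.norm(b2)
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=1)  # columns are b1, b2, b3

# The rotation is parameterized as R = f(a0 + delta_a), where delta_a is the learned residual.
a0 = np.array([[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0]])           # initial 6D parameters (a1, a2)
delta_a = np.zeros_like(a0)                # residual, optimized by gradient descent
R = rotation_from_6d(a0[0] + delta_a[0], a0[1] + delta_a[1])
```

Because every step above is differentiable (except at degenerate inputs), gradients can flow from the resulting rotation back to the residual $\Delta \mathbf{a}$, which is why this 6-vector representation is preferred over directly perturbing the rotation matrix entries.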

Fourth Order Radial Distortion

Since commercial lenses deviate from an ideal lens with a single focal length, they introduce a number of aberrations. The most common one is referred to as "radial distortion".

The camera model of SCNeRF is extended to incorporate such radial distortions. A widely used 4th order radial distortion model is deployed to express this radial distortion.

The undistorted normalized pixel coordinate $(n'_x, n'_y)$, converted from the pixel coordinate $(p_x, p_y)$, can be obtained using the 4th-order radial distortion model as in the following equations.

$$(n_x, n_y) = \left(\frac{p_x-c_x}{f_x},\ \frac{p_y-c_y}{f_y}\right)$$

$$r=\sqrt{n^2_x+n^2_y}$$

$$\left[n'_x,\ n'_y,\ 1 \right]^T = K^{-1} \left[p_x\left(1+(k_1+z_{k_1}) r^2 + (k_2+z_{k_2}) r^4\right),\ p_y\left(1+(k_1+z_{k_1}) r^2 + (k_2+z_{k_2}) r^4\right),\ 1 \right]^T$$

where $(k_1, k_2)$ are the initial radial distortion parameters, denoted as $k_0$, and $(z_{k_1}, z_{k_2})$ are their residuals, denoted as $z_k$.
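As a sanity check of the equations above, here is a hedged NumPy sketch that maps a pixel coordinate to the undistorted normalized coordinate $[n'_x, n'_y, 1]^T$. It is my own illustration; `undistort_pixel` and the argument layout are hypothetical.

```python
import numpy as np

def undistort_pixel(px, py, K, k0, z_k):
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    nx, ny = (px - cx) / fx, (py - cy) / fy      # normalized coordinates (n_x, n_y)
    r2 = nx ** 2 + ny ** 2                       # r^2
    k1, k2 = k0[0] + z_k[0], k0[1] + z_k[1]      # distortion = initial + residual
    scale = 1.0 + k1 * r2 + k2 * r2 ** 2         # 1 + k1 r^2 + k2 r^4
    distorted = np.array([px * scale, py * scale, 1.0])
    return np.linalg.inv(K) @ distorted          # [n'_x, n'_y, 1]^T
```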

Ray Direction & Origin

Using the Pinhole Camera Model and Fourth Order Radial Distortion above, the ray direction $\mathbf{r}_d$ and ray origin $\mathbf{r}_o$ in the world coordinate system can be expressed as follows.

$$\mathbf{r}_d = N\left(R \cdot \left[n'_x,\ n'_y,\ 1 \right]^T\right)$$

$$\mathbf{r}_o=\mathbf{t}$$

where $N(\cdot)$ is vector normalization. For those who may be confused about why $\mathbf{t}$ equals the ray origin $\mathbf{r}_o$ (i.e., the camera center) in the world coordinate system, I made a conceptual image that shows the geometric meaning of the vector $\mathbf{t}$.

Since these ray parameters $\mathbf{r}_d$ and $\mathbf{r}_o$ are functions of the intrinsic, extrinsic, and distortion parameter residuals $(\Delta f, \Delta c, \Delta \mathbf{a}, \Delta \mathbf{t}, \Delta k)$, we can pass gradients from the rays to the residuals to optimize these parameters. Note that $K_0, R_0, \mathbf{t}_0, k_0$ are the initial values of each parameter and are not optimized.
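Putting the previous pieces together, here is a hedged sketch of how a world-space ray could be generated from the learnable parameters. It reuses the hypothetical `undistort_pixel` and `rotation_from_6d` from the earlier snippets and is only an illustration of the equations above.

```python
import numpy as np

def generate_ray(px, py, K, k0, z_k, R, t):
    n = undistort_pixel(px, py, K, k0, z_k)   # [n'_x, n'_y, 1]^T
    r_d = R @ n
    r_d = r_d / np.linalg.norm(r_d)           # r_d = N(R [n'_x, n'_y, 1]^T)
    r_o = t                                   # ray origin is the camera center t
    return r_o, r_d
```

In the actual method these operations would be carried out with autograd tensors rather than plain NumPy arrays, so that gradients can flow from the rays back to the residuals $\Delta f, \Delta c, \Delta \mathbf{a}, \Delta \mathbf{t}, \Delta k$.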

Generic Non-Linear Camera Distortion

Complex optical aberrations in real lenses cannot be modeled with a parametric camera. For such noise, a generic non-linear aberration model is used. Specifically, local ray parameter residuals $\mathbf{z}_d = \Delta \mathbf{r}_d(\mathbf{p})$ and $\mathbf{z}_o = \Delta \mathbf{r}_o(\mathbf{p})$ are used, where $\mathbf{p}$ is the image coordinate.

$$\mathbf{r}'_d = \mathbf{r}_d + \mathbf{z}_d$$

$$\mathbf{r}'_o = \mathbf{r}_o + \mathbf{z}_o$$

Bilinear interpolation is used to extract continuous ray distortion parameters.

$$\mathbf{z}_d(\mathbf{p}) = \sum_{x=\lfloor\mathbf{p}_x\rfloor}^{\lfloor\mathbf{p}_x\rfloor+1}\sum_{y=\lfloor\mathbf{p}_y\rfloor}^{\lfloor\mathbf{p}_y\rfloor+1} \left(1-|x-\mathbf{p}_x|\right)\left(1-|y-\mathbf{p}_y|\right)\mathbf{z}_d\left[x,y\right]$$

where $\mathbf{z}_d[x, y]$ indicates the ray direction offset at a control point at the discrete 2D coordinate $(x, y)$. $\mathbf{z}_d[x, y]$ is learned at discrete locations only.

Dual comes for free.

$$\mathbf{z}_o(\mathbf{p}) = \sum_{x=\lfloor\mathbf{p}_x\rfloor}^{\lfloor\mathbf{p}_x\rfloor+1}\sum_{y=\lfloor\mathbf{p}_y\rfloor}^{\lfloor\mathbf{p}_y\rfloor+1} \left(1-|x-\mathbf{p}_x|\right)\left(1-|y-\mathbf{p}_y|\right)\mathbf{z}_o\left[x,y\right]$$

To help your understanding, the conceptual image of a generic non-linear aberration model is attached below.

For those who want to learn more about this generic camera model, please refer to the paper "Why having 10,000 parameters in your camera model is better than twelve" in the Reference & Additional materials section.
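Below is a small NumPy sketch of the bilinear interpolation of the learned ray offsets at a continuous pixel coordinate, matching the sums above. It is my own illustration; the `(H+1, W+1, 3)` grid layout of `z_grid` is an assumption.

```python
import numpy as np

def interpolate_offset(z_grid, px, py):
    # z_grid[x, y] stores a 3D offset (z_d or z_o) at each discrete control point.
    x0, y0 = int(np.floor(px)), int(np.floor(py))
    offset = np.zeros(3)
    for x in (x0, x0 + 1):
        for y in (y0, y0 + 1):
            w = (1 - abs(x - px)) * (1 - abs(y - py))   # bilinear weight
            offset += w * z_grid[x, y]
    return offset

# r_d' = r_d + z_d(p),  r_o' = r_o + z_o(p)
# r_d_prime = r_d + interpolate_offset(z_d_grid, px, py)
# r_o_prime = r_o + interpolate_offset(z_o_grid, px, py)
```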

Computational Graph of Ray Direction & Origin

Combining the Pinhole Camera Model, Fourth Order Radial Distortion, and Generic Non-Linear Camera Distortion, the final ray direction and ray origin can be expressed using the following computational graph.

Loss

To optimize calibration parameters, both geometric consistency loss and photometric consistency loss are exploited.

Geometric Consistency Loss

The geometric consistency loss is $d_\pi$ in the above figure. Let's break this down into pieces.

First, let $(\mathbf{p}_A \leftrightarrow \mathbf{p}_B)$ be a correspondence on camera A and camera B, respectively. When all the camera parameters are calibrated, the rays $\mathbf{r}_A$ and $\mathbf{r}_B$ should intersect at a single 3D point. However, when there is a misalignment due to an error in the camera parameters, the two rays do not meet at a single point, and we can measure the deviation by computing the shortest distance between the corresponding rays.

Let a point on $\mathbf{r}_A$ be $\mathbf{x}_A(t_A) = \mathbf{r}_{o,A} + t_A\mathbf{r}_{d,A}$, and a point on $\mathbf{r}_B$ be $\mathbf{x}_B(t_B) = \mathbf{r}_{o,B} + t_B\mathbf{r}_{d,B}$. The distance between $\mathbf{r}_A$ and a point on $\mathbf{r}_B$ is $d$, as we can see in the above figure.

Solving $\left.\frac{\mathrm{d}\, d^2}{\mathrm{d} t_B}\right|_{\hat{t}_B}=0$ gives us the $\hat{t}_B$ that minimizes the distance between $\mathbf{x}_B$ and $\mathbf{r}_A$.

$$\hat{t}_B = \frac{\left(\mathbf{r}_{A,o}-\mathbf{r}_{B,o}\right) \times \mathbf{r}_{A,d}\cdot \left( \mathbf{r}_{A,d} \times \mathbf{r}_{B,d}\right)}{\left(\mathbf{r}_{A,d}\times\mathbf{r}_{B,d}\right)^2}$$

From this, we can find a point on ray B that has the shortest distance to ray A.

$$\hat{\mathbf{x}}_B = \mathbf{x}_B(\hat{t}_B)$$

Dual comes for free.

$$\hat{\mathbf{x}}_A = \mathbf{x}_A(\hat{t}_A)$$

In the equation for $d_{\pi}$ in the above figure, $\hat{\mathbf{x}}_A$ and $\hat{\mathbf{x}}_B$ are written as $\mathbf{x}_A$ and $\mathbf{x}_B$ for simplicity.

After projecting these points onto the image planes and computing the distance on the image planes, the geometric consistency loss $d_\pi$ can be obtained, where $\pi(\cdot)$ is a projection function.

Note that a point far from the cameras would have a large deviation, while a point close to the cameras would have a small deviation. Thus, the distance between the two points is computed on the image plane, not in 3D space, to remove this depth sensitivity.
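To tie the pieces of the geometric consistency loss together, here is a hedged sketch for a single correspondence $(\mathbf{p}_A \leftrightarrow \mathbf{p}_B)$. The `project_A` / `project_B` functions (mapping a 3D point to pixel coordinates on each image plane) are hypothetical, and the averaging of the two projected distances follows my reading of the description above rather than a verbatim reproduction of the paper's loss.

```python
import numpy as np

def closest_point_on_second_ray(r1_o, r1_d, r2_o, r2_d):
    # Point on ray 2 that is closest to ray 1, using the t_hat formula above.
    cross = np.cross(r1_d, r2_d)
    t_hat = np.dot(np.cross(r1_o - r2_o, r1_d), cross) / np.dot(cross, cross)
    return r2_o + t_hat * r2_d

def projected_ray_distance(rA_o, rA_d, rB_o, rB_d, pA, pB, project_A, project_B):
    xB_hat = closest_point_on_second_ray(rA_o, rA_d, rB_o, rB_d)  # on ray B
    xA_hat = closest_point_on_second_ray(rB_o, rB_d, rA_o, rA_d)  # on ray A (dual)
    # Project each estimated 3D point onto an image plane and compare it with the
    # observed correspondence; this removes the depth sensitivity noted above.
    dA = np.linalg.norm(project_A(xB_hat) - pA)
    dB = np.linalg.norm(project_B(xA_hat) - pB)
    return 0.5 * (dA + dB)
```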

Photometric Consistency Loss

Photometric consistency loss is defined as the following.

$$\mathcal{L} = \sum_{\mathbf{p}\in\mathcal{I}}\left\|C(\mathbf{p})-\hat{C}(\mathbf{r}(\mathbf{p}))\right\|^2_2$$

where $\mathbf{p}$ is a pixel coordinate, $\mathcal{I}$ is the set of pixel coordinates in an image, $\hat{C}(\mathbf{r})$ is the output of volumetric rendering using the ray $\mathbf{r}$ of the corresponding pixel $\mathbf{p}$, and $C(\mathbf{p})$ is the ground-truth color.

How to estimate $\hat{C}(\mathbf{r})$? What is volume rendering?

The color value $\mathbf{C}$ of a ray can be represented as an integral of all colors weighted by the opaqueness along the ray, or approximated as the weighted sum of the colors at $N$ points along the ray, as follows.

$$\hat{\mathbf{C}} \approx \sum_{i=1}^{N}\left( \prod_{j=1}^{i-1}\alpha\left(\mathbf{r}(t_j), \Delta_j\right) \right)\left( 1-\alpha\left(\mathbf{r}(t_i), \Delta_i\right) \right) \mathbf{c}\left( \mathbf{r}(t_i), \mathbf{v} \right)$$

where $\alpha(\cdot)$ is transparency, $\mathbf{c}(\cdot)$ is color, and $\Delta_i = t_{i+1}-t_{i}$.

For those who want to learn more about this equation, please refer to the "NeRF" paper in Reference & Additional materials.
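A minimal sketch of the weighted-sum approximation above, plus the photometric loss it feeds into. This is illustrative only; `alphas`, `colors`, and the batching are hypothetical and not the paper's implementation.

```python
import numpy as np

def render_ray_color(alphas, colors):
    # alphas[i]: transparency alpha(r(t_i), Delta_i) of the i-th sample along the ray
    # colors[i]: RGB color c(r(t_i), v) of the i-th sample
    C_hat = np.zeros(3)
    transmittance = 1.0                         # product of alphas of earlier samples
    for a, c in zip(alphas, colors):
        C_hat += transmittance * (1.0 - a) * c  # weight = transmittance * opacity
        transmittance *= a
    return C_hat

def photometric_loss(rendered, ground_truth):
    # Sum of squared color errors over the sampled pixels of an image.
    return np.sum((ground_truth - rendered) ** 2)
```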

Note that photometric consistency loss is differentiable with respect to the learnable camera parameters. From this, we can define gradients for the camera parameters and be able to calibrate the cameras.

Curriculum Learning

It is impossible to learn accurate camera parameters when the geometry is unknown or too coarse for self-calibration. Thus, curriculum learning is adopted: the geometry and a linear camera model are learned first, and the more complex camera model parameters are added later.

First, the NeRF network is trained with the camera focal lengths and the principal point initialized to half the image width and height. Learning coarse geometry first is crucial since it initializes the networks to a more favorable local optimum for learning better camera parameters.

Next, camera parameters for the linear camera model, radial distortion, nonlinear noise of ray direction, and ray origin are sequentially added to the learning.

The following is the final learning algorithm. The `get_params` function returns the set of parameters for curriculum learning, which progressively adds complexity to the camera model.

Next, the model is trained with the projected ray distance by selecting a target image at random with sufficient correspondences.
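Here is a hedged sketch of what a `get_params`-style curriculum schedule might look like. The step thresholds and attribute names are illustrative assumptions, not values from the paper.

```python
def get_params(step, model):
    # Scene geometry (NeRF parameters) is always trained.
    params = list(model.nerf_params)
    if step > 10_000:
        params += [model.delta_f, model.delta_c,   # linear (pinhole) camera model
                   model.delta_a, model.delta_t]
    if step > 20_000:
        params += [model.z_k]                      # 4th-order radial distortion residuals
    if step > 30_000:
        params += [model.z_d, model.z_o]           # generic non-linear ray offsets
    return params
```

At each stage the optimizer is rebuilt over `get_params(step, model)`, so newly unlocked residuals start from their initial (zero) values and only become learnable once the coarser model has stabilized.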

4. Experiment & Result

Here, not all but some representative experimental results will be covered.

Experimental Setup

  • Dataset

    • LLFF

      • 8 scenes

    • Tanks and Temples

      • 4 scenes

    • Custom data collected by the author

      • 6 scenes

      • fish-eye camera

  • Experiments

    • Improvement over NeRF

    • Improvement over NeRF++

    • Fish-eye Lens Reconstruction

    • Ablation Study

In this article, only the dataset and experiment highlighted in red will be covered.

Improvement over NeRF

Table 1 reports the quality of the rendered images on the training set. Although the SCNeRF model does not use calibrated camera information, it shows reliable rendering performance.

The SCNeRF model shows better rendering quality than NeRF when COLMAP initializes the camera information. Table 2 reports the rendering quality of NeRF and SCNeRF; SCNeRF consistently shows better rendering quality than the original NeRF.

Following is the visualization of the rendered images.

Here, (a) is the NeRF result without COLMAP output, (b) is the NeRF result with COLMAP output, (c) is the SCNeRF result without COLMAP output, and (d) is the SCNeRF result with COLMAP output.

Ablation Study

To check the effects of the proposed components, an ablation study is conducted. Each phase is trained for 200K iterations. This experiment shows that extending the SCNeRF model with learnable intrinsic and extrinsic parameters (IE), non-linear distortion (OD), and the projected ray distance loss (PRD) has more potential for rendering clearer images. However, for some scenes, adopting the projected ray distance increases the overall projected ray distance, even though this is not stated in the table.

Following is the visualization of the ablation study.

5. Conclusion

Summary

SCNeRF proposes a self-calibration algorithm that learns geometry and camera parameters jointly. The camera model consists of a pinhole model, radial distortion, and non-linear distortion, which capture real noise in lenses. SCNeRF also proposes the projected ray distance to improve accuracy, which allows the model to learn fine-grained correspondences. SCNeRF learns geometry and camera parameters from scratch when the poses are not given, and makes NeRF more robust when camera poses are given.

Personal Opinion

  • From my perspective, this paper is valuable because it shows a way to calibrate camera parameters and neural radiance fields jointly.

  • I wonder why the result in the paper reports training set accuracy instead of val/test set accuracy.

  • I noticed some errors in the equations and corrected them as I think they should be. Please feel free to comment if you find any errors in the equations used in this article.

  • I am a little disappointed that the less favorable results are only stated in the text, not in the table (for example, the ablation cases where adopting the projected ray distance increases the overall projected ray distance).

Take home message

SCNeRF learns geometry and camera parameters from scratch without poses.

SCNeRF uses a camera model consisting of a pinhole model, radial distortion, and non-linear distortion.

SCNeRF proposes the projected ray distance to improve accuracy.

Reviewer information

김민정(Min-Jung Kim)

  • KAIST AI

  • Contact Information

    • email: emjay73@naver.com

Reference & Additional materials

  1. Citation of this paper

    1. Jeong, Yoonwoo, et al. "Self-calibrating neural radiance fields." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.

  2. Citation of related work

    1. Mildenhall, Ben, et al. "Nerf: Representing scenes as neural radiance fields for view synthesis." European conference on computer vision. Springer, Cham, 2020.

    2. Schops, Thomas, et al. "Why having 10,000 parameters in your camera model is better than twelve." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.

    3. Zhou, Yi, et al. "On the continuity of rotation representations in neural networks." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.

  3. Other useful materials

    1. Lens Aberrations: https://www.nao.org/wp-content/uploads/2020/04/Lens-Aberrations.pdf

    2. Camera models: https://cvgl.stanford.edu/teaching/cs231a_winter1415/lecture/lecture2_camera_models_note.pdf

  4. Links

    1. Paper: https://arxiv.org/abs/2108.13826

    2. Official Project Page: https://postech-cvlab.github.io/SCNeRF/

    3. Official GitHub repository: https://github.com/POSTECH-CVLab/SCNeRF