S. Mahdi H. Miangoleh et al. / Boosting Monocular Depth Estimation Models to High-Resolution via Content-Adaptive Multi-Resolution Merging / CVPR 2021

Boosting Monocular Depth Estimation Models to High-Resolution via Content-Adaptive Multi-Resolution Merging [Eng]

TL;DR: Generate a High-Resolution Depth Map Using a Pretrained Monocular Depth Estimation Network

1. Problem definition

The performance of monocular depth estimation networks is sub-optimal in dense scenes, with current state-of-the-art (SoTA) models generating coarse outputs. Because the resolution of the input image affects the performance of these networks, this paper presents a two-stage mechanism: the first stage extracts depth maps at multiple resolutions, and the second stage merges them to generate a fine-grained depth map from a single image.

Preliminaries - Monocular Depth Estimation (MDE)

MDE extracts structural information from a single image by relying on occlusion boundaries and perspective-related cues. The current SoTA is achieved by convolutional neural networks that learn the mapping from a color image to its depth map, typically by proposing different architectural mechanisms or supervision strategies.

2. Motivation

When the resolution of the input image is similar to that of the training images, the underlying CNN produces a structurally consistent estimate of the whole scene but misses the high-frequency details of small structures within the image. However, when a high-resolution version of the same image is fed to the same network, smaller objects are captured accurately but the overall structure of the scene gets distorted. This happens because, relative to the image content, the convolutional layers of the fixed CNN now cover a smaller receptive field than they do for a low-resolution input, where the large effective receptive field preserves the structure of large objects.

Secondly, the authors observed that when depth cues lie farther apart than the receptive field of the underlying CNN, the output becomes structurally inconsistent. Hence, the authors propose to feed the CNN patches of the input image whose size is matched to the local depth-cue density. These patch-wise predictions are subsequently merged into a fine-grained depth estimate, yielding a new SoTA without retraining the depth estimation CNN.

2.a Behaviour of Typical MDE CNNs

The majority of depth estimation approaches are trained at a pre-defined, low input resolution. Although these models are fully convolutional and can in theory accommodate arbitrary input sizes, running them at their training resolution yields consistent global depth while missing finer details, and simply feeding a high-resolution input recovers those details at the cost of global consistency.

Since monocular depth estimation networks rely on occlusion and perspective-related cues, when these cues lie farther apart than the receptive field, the network cannot generate a coherent depth estimate around pixels that do not receive enough information. Hence, the network's limited ability to 'see' information within its receptive field is the limiting factor for consistent high-resolution depth estimation.

2.b Searching the best patch resolution

For any given image, the resolution that will yield an accurate depth estimate has to be found, which is done using a simple edge map. The authors introduce two quantities, $R_0$ and $R_{20}$: $R_0$ is the resolution at which every pixel is at most half a receptive field away from context edges, and $R_{20}$ is the resolution at which 20% of the pixels do not receive any context; both depend on the image content. Estimations at resolutions above $R_0$ lose structural consistency but contain richer high-frequency content.
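
Below is a minimal sketch of how such a resolution search could be implemented, assuming OpenCV and NumPy; the receptive-field size, scale range, Canny thresholds, and function names are illustrative assumptions rather than the authors' implementation, and the binary-dilation test follows the description given in Section 2.c below.

```python
import cv2
import numpy as np

def fraction_without_context(edge_map, target_size, receptive_field=384):
    """Fraction of pixels farther than half a receptive field away from any context
    edge when the edge map is rescaled to target_size = (width, height)."""
    edges = cv2.resize(edge_map, target_size, interpolation=cv2.INTER_NEAREST)
    kernel = np.ones((receptive_field, receptive_field), np.uint8)
    dilated = cv2.dilate((edges > 0).astype(np.uint8), kernel)  # 1 = pixel receives context
    return 1.0 - float(dilated.mean())

def search_resolutions(image_bgr, receptive_field=384, scales=np.arange(1.0, 6.0, 0.25)):
    """Return (R_0, R_20) as scale factors: the largest scales at which 0% / at most
    20% of the pixels lack contextual cues (an edge map is used as the cue proxy)."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edge_map = cv2.Canny(gray, 100, 200)
    h, w = gray.shape
    r0 = r20 = float(scales[0])
    for s in scales:
        frac = fraction_without_context(edge_map, (int(w * s), int(h * s)), receptive_field)
        if frac == 0.0:
            r0 = float(s)
        if frac <= 0.20:
            r20 = float(s)
    return r0, r20
```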

2.c Merging Multiple Patches for Consistent Depth

To generate a high-resolution depth estimate, the authors need to merge a low-resolution depth map, obtained by feeding the network a smaller-resolution input, with a higher-resolution depth map of the same image (or patch) that has better accuracy around depth discontinuities but suffers from low-frequency artifacts. For this, they propose a double-estimation framework and cast the merging step as a standard image-to-image translation problem, using a pix2pix [3] framework with a 10-layer U-Net [4] generator to reach a resolution of 1024 x 1024. As this network must handle a wide range of input resolutions, it is trained to transfer the fine-grained details of the high-resolution estimate onto the structurally consistent low-resolution estimate.
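
As a rough illustration of how such a merging generator might be wired, here is a heavily simplified PyTorch stand-in (the actual model is a 10-layer U-Net generator trained in a pix2pix setup; the layer count, channel widths, and class name below are assumptions made only for this sketch):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMergeNet(nn.Module):
    """Toy stand-in for the merging generator: takes the (upsampled) low-resolution
    estimate and the detail-rich high-resolution estimate as two input channels and
    predicts a single merged depth map."""

    def __init__(self, base=32):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(2, base, 3, padding=1), nn.ReLU(inplace=True))
        self.enc2 = nn.Sequential(nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.dec1 = nn.Sequential(nn.Conv2d(base * 2, base, 3, padding=1), nn.ReLU(inplace=True))
        self.out = nn.Conv2d(base * 2, 1, 3, padding=1)

    def forward(self, low_res_est, high_res_est):
        x = torch.cat([low_res_est, high_res_est], dim=1)   # 2-channel input
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        d1 = F.interpolate(self.dec1(e2), size=e1.shape[-2:], mode="bilinear", align_corners=False)
        return self.out(torch.cat([d1, e1], dim=1))          # U-Net-style skip connection

# Both estimates are resized to the merge resolution (1024 x 1024 in the paper;
# 256 x 256 here to keep the example light) before being fed to the network.
net = TinyMergeNet()
low = torch.rand(1, 1, 256, 256)    # structurally consistent low-resolution estimate, upsampled
high = torch.rand(1, 1, 256, 256)   # detail-rich high-resolution estimate
merged = net(low, high)             # -> (1, 1, 256, 256) merged depth map
```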

However, obtaining consistent high-resolution ground truth for training this merging network is an issue. The authors therefore empirically picked 672 x 672 pixels as the input resolution to the network. To ensure that the ground truth and the higher-resolution patch estimate contain the same amount of fine-grained detail, the authors apply a guided filter to the patch estimate, using the ground-truth estimate as guidance.
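
The guided-filtering step could look roughly like the snippet below; it is a sketch that assumes `opencv-contrib-python` is installed (for `cv2.ximgproc.guidedFilter`), and the radius and eps values are arbitrary placeholders rather than the paper's settings.

```python
import cv2
import numpy as np

# Dummy stand-ins for the two estimates (in practice these come from the base network).
patch_estimate = np.random.rand(672, 672).astype(np.float32)  # higher-resolution patch estimate
gt_estimate = np.random.rand(672, 672).astype(np.float32)     # "ground-truth" estimate at 672 x 672

# Filter the patch estimate using the ground-truth estimate as guidance so that both
# carry a comparable amount of fine-grained detail.
filtered_patch = cv2.ximgproc.guidedFilter(
    guide=gt_estimate, src=patch_estimate, radius=8, eps=1e-3
)
```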

The maximum resolution at which the network can generate a consistent structure depends on the distribution of contextual cues in the image. Using an edge map as a proxy for these cues, the authors determine this maximum resolution by requiring that no pixel is farther away from contextual cues than half of the receptive field size. For this purpose, they apply binary dilation to the edge map with a receptive-field-sized kernel at different resolutions. The resolution at which the dilated edge map stops being all-ones is then the maximum resolution at which every pixel still receives context information in a forward pass.

As the merging network is lightweight, its forward pass is orders of magnitude faster than that of the monocular depth estimation networks. The overall running time therefore depends mainly on the number of times the base network is invoked within the pipeline.

2.d Patch Estimates for Local Boosting

Because regions without nearby contextual cues constrain the maximum resolution of the whole-image estimate, regions that do contain high-frequency details cannot benefit from higher-resolution estimation. The authors therefore present a patch-selection mechanism that estimates depth at different resolutions for different regions and subsequently merges these estimates into a consistent depth map.

Since this is a data-driven issue and current datasets do not provide such high-resolution training images, the authors follow a recursive mechanism: a base estimate is first generated with the double-estimation framework at a fixed resolution of $R_{20}$. Patch selection is then performed by tiling the image at the base resolution with a tile size equal to the receptive field and 1/3 overlap. To select tiles with high-frequency detail, the edge density of each tile is computed, and tiles whose edge density is lower than that of the whole image are discarded. Next, the resolution of each remaining patch is expanded until its edge density matches that of the image. Finally, these patch estimates are merged with the base estimate to obtain the high-resolution result (a rough sketch of this loop follows the step list below).

Step 1: Tile and discard
Step 2: Expand
Step 3: Merge
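
The following sketch shows one possible reading of the tile-discard-expand loop above, assuming OpenCV/NumPy, a Canny edge map as the detail proxy, and an illustrative growth factor; it returns patch coordinates only, since the per-patch estimation and merging reuse the base network and the merging network described earlier.

```python
import cv2
import numpy as np

def select_and_expand_patches(image_bgr, receptive_field=384):
    """Tile the image with 1/3 overlap, keep tiles whose edge density exceeds the
    whole-image density, then grow each kept tile until its density matches it."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = (cv2.Canny(gray, 100, 200) > 0).astype(np.float32)
    image_density = float(edges.mean())

    h, w = edges.shape
    step = receptive_field * 2 // 3                      # neighbouring tiles overlap by 1/3
    patches = []
    for y in range(0, max(h - receptive_field, 1), step):
        for x in range(0, max(w - receptive_field, 1), step):
            x0, y0, size = x, y, receptive_field
            density = float(edges[y0:y0 + size, x0:x0 + size].mean())
            if density < image_density:                  # Step 1: discard low-detail tiles
                continue
            while density > image_density and size < min(h, w):   # Step 2: expand the patch
                size = int(size * 1.1)
                y0 = max(0, min(y - (size - receptive_field) // 2, h - size))
                x0 = max(0, min(x - (size - receptive_field) // 2, w - size))
                density = float(edges[y0:y0 + size, x0:x0 + size].mean())
            patches.append((x0, y0, size))               # Step 3: estimate and merge each patch
    return patches
```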

3. Experimental setup

Since this approach is primarily a post-processing step that requires a pretrained depth estimation model, the authors evaluate on the Middlebury 2014 and IBims-1 datasets. The pretrained depth estimation models considered are MiDaS and SGR, which represent the SoTA in monocular depth estimation.

To evaluate performance, the paper uses a set of standard depth-evaluation metrics following recent works [6, 8], including the root mean squared error in disparity space (RMSE), the percentage of pixels within an accuracy threshold (δ), and the ordinal error (ORD) from [6] in depth space.
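
As a rough sketch of how two of these metrics can be computed (NumPy assumed; the δ threshold of 1.25 is the usual convention and an assumption here, and the ordinal error, which compares sampled point pairs, is omitted for brevity):

```python
import numpy as np

def rmse_disparity(pred_disp, gt_disp, mask=None):
    """Root mean squared error computed in disparity space (lower is better)."""
    if mask is None:
        mask = np.isfinite(gt_disp)
    diff = pred_disp[mask] - gt_disp[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

def delta_accuracy(pred_depth, gt_depth, threshold=1.25):
    """Percentage of pixels whose ratio max(pred/gt, gt/pred) stays below the threshold."""
    ratio = np.maximum(pred_depth / gt_depth, gt_depth / pred_depth)
    return float((ratio < threshold).mean() * 100.0)
```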

4. Result

The quantitative and qualitative results of the proposed mechanism are summarized in Fig. 7 and Fig. 8, respectively, demonstrating its efficacy in improving the performance of pretrained depth estimation models.

5. Ablation Studies

The authors quantitatively justify choosing $R_{20}$ as the high-resolution estimate in the double-estimation framework. They show that using a higher resolution, $R_{30}$, degrades performance because the high-resolution results contain heavy artifacts as the number of pixels without contextual information increases.

6. Limitations

Finally, the authors highlight that since their work builds on pretrained monocular depth estimation networks, it inherits their limitations and produces relative, ordinal depth estimates rather than absolute depth values. In addition, they observed the performance of the base models to degrade with noise, so the current method cannot provide a meaningful improvement for noisy images.

7. Conclusion

This paper demonstrates the feasibility of generating a high-resolution depth map from a single image using pretrained models. While previous work has been limited to sub-megapixel resolutions, this paper generates high-resolution depth maps with existing pretrained depth estimation algorithms.

Take home message

Monocular Depth Estimation Models lack fine-grained predictions.

Instead of devising a new algorithm, the authors focused on boosting the performance of pretrained models by merging estimations at different resolutions.

To ensure consistent merging of monocular depth information across different resolutions, a lightweight CNN is used to perform the merging step.

Author / Reviewer information

Author

**Pranjay Shyam**

  • KAIST-ME (PhD)

  • pranjayshyam@kaist.ac.kr

Reviewer

  1. Ngoc Quang Nguyen

  2. 박나현

  3. MUHAMMAD ADI NUGROHO

Reference & Additional materials

  1. Citation of this paper

@INPROCEEDINGS{Miangoleh2021Boosting,
  author={S. Mahdi H. Miangoleh and Sebastian Dille and Long Mai and Sylvain Paris and Ya\u{g}{\i}z Aksoy},
  title={Boosting Monocular Depth Estimation Models to High-Resolution via Content-Adaptive Multi-Resolution Merging},
  booktitle={Proc. CVPR},
  year={2021},
}
@inproceedings{isola2017image,
  title={Image-to-image translation with conditional adversarial networks},
  author={Isola, Phillip and Zhu, Jun-Yan and Zhou, Tinghui and Efros, Alexei A},
  booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
  pages={1125--1134},
  year={2017}
}
@inproceedings{ronneberger2015u,
  title={U-net: Convolutional networks for biomedical image segmentation},
  author={Ronneberger, Olaf and Fischer, Philipp and Brox, Thomas},
  booktitle={International Conference on Medical image computing and computer-assisted intervention},
  pages={234--241},
  year={2015},
  organization={Springer}
}
@article{Ranftl2020,
	author    = {Ren\'{e} Ranftl and Katrin Lasinger and David Hafner and Konrad Schindler and Vladlen Koltun},
	title     = {Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer},
	journal   = {IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)},
	year      = {2020},
}
@InProceedings{Xian_2020_CVPR,
author = {Xian, Ke and Zhang, Jianming and Wang, Oliver and Mai, Long and Lin, Zhe and Cao, Zhiguo},
title = {Structure-Guided Ranking Loss for Single Image Depth Prediction},
booktitle = {The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2020}
}
@inproceedings{scharstein2014high,
  title={High-resolution stereo datasets with subpixel-accurate ground truth},
  author={Scharstein, Daniel and Hirschm{\"u}ller, Heiko and Kitajima, York and Krathwohl, Greg and Ne{\v{s}}i{\'c}, Nera and Wang, Xi and Westling, Porter},
  booktitle={German conference on pattern recognition},
  pages={31--42},
  year={2014},
  organization={Springer}
}
@inproceedings{koch2018evaluation,
  title={Evaluation of cnn-based single-image depth estimation methods},
  author={Koch, Tobias and Liebel, Lukas and Fraundorfer, Friedrich and Korner, Marco},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV) Workshops},
  pages={0--0},
  year={2018}
}

  2. Citation of Pix2Pix

  3. Citation of U-Net

  4. Citation of MiDaS (Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer)

  5. Citation of SGR (Structure-Guided Ranking Loss for Single Image Depth Prediction)

  6. Citation of Middlebury 2014

  7. Citation of IBims-1

  8. Official GitHub repository

Figure captions (figures not reproduced here):

Figure 1: Overview of results generated by the framework in the paper.
Figure 2: Performance of a pretrained monocular depth estimation network at different resolutions; fine-grained details are missing at lower resolutions, while global structure degrades as the input resolution increases.
Figure 3: Depth prediction using a fixed depth estimation network with varying input resolution.
Figure 4: Overview of the mechanism to identify the best resolution.
Figure 5: Overview of the double-estimation methodology.
Figure 6: Working of the patch-selection-based local boosting [1].
Figure 7: Quantitative performance results following the paper, compared with the SoTA.
Figure 8: Qualitative performance results using the paper's methodology, compared with the SoTA.
Figure 9: Ablation over different values of R_x.