Boosting Monocular Depth Estimation [Eng]
S. Mahdi H. Miangoleh et al./ Boosting Monocular Depth Estimation Models to High-Resolution via Content-Adaptive Multi-Resolution Merging / CVPR 2021
Boosting Monocular Depth Estimation Models to High-Resolution via Content-Adaptive Multi-Resolution Merging [Eng]
Monocular depth estimation networks perform sub-optimally on dense scenes, where current state-of-the-art (SoTA) models generate coarse outputs. Because the resolution of the input image affects the performance of these networks, this paper presents a two-stage mechanism: the first stage extracts depth maps at multiple resolutions, and the second stage combines them into a fine-grained depth map from a single image.
Monocular depth estimation (MDE) extracts structural information by relying on occlusion boundaries and pose information. The current SoTA is achieved by convolutional neural network (CNN) based algorithms that learn the mapping from a color image to its depth map, and that differ mainly in their architectural mechanisms or supervision strategies.
When the resolution of the input image is similar to that of the training images, the underlying CNN produces a structurally consistent estimate of the whole scene but misses the high-frequency details of small structures. When a higher-resolution version of the same image is fed to the same network, smaller objects are captured accurately but the overall structure of the scene becomes distorted. This occurs because the receptive field of the fixed CNN covers a smaller portion of the scene at high resolution, whereas at low resolution it covers a larger portion of the scene, which preserves the structure of large objects.
Secondly, the authors observed that when depth cues lie further apart than the receptive field of the underlying CNN, the output becomes structurally inconsistent. Hence, the authors propose to feed the CNN patches of the input image whose resolution is chosen according to the local depth cue density. These patch-wise predictions are subsequently merged into a fine-grained depth estimate, setting a new SoTA without retraining the depth estimation CNN.
The majority of depth estimation approaches are trained at a pre-defined low input resolution. Since these models are fully convolutional, they can in theory accommodate arbitrary input sizes. In practice, at the training resolution the finer details are missed while the global depth stays consistent, and feeding the same model a high-resolution input recovers those details at the cost of global consistency.
As monocular depth estimation networks rely on occlusion and perspective-related cues, when these cues are further apart in the image than the receptive field, the network cannot generate a coherent depth estimate around pixels that do not receive enough information. Hence, the network's ability to 'see' only the information within its receptive field is the limiting factor for consistent high-resolution depth estimation.
The double estimation framework merges two depth maps of the same image or patch: a low-resolution depth map obtained from a smaller-resolution input to the network, and a higher-resolution depth map that has better accuracy around depth discontinuities but suffers from low-frequency artifacts. The authors treat this merging as a standard image-to-image translation task and use a pix2pix [3] framework with a 10-layer U-Net [4] as the generator, operating at a resolution of 1024 x 1024. As this network is used for merging a wide range of input resolutions, it is trained to transfer the fine-grained details from the high-resolution input to the low-resolution input.
However, the difficulty lies in obtaining consistent high-resolution ground truth. The authors empirically picked 672 x 672 pixels as the input resolution to the network. To ensure that the ground truth and the higher-resolution patch estimation have the same amount of fine-grained detail, the authors apply a guided filter on the patch estimation using the ground truth estimation as guidance.
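The guided filtering step mentioned above can be illustrated with a minimal NumPy sketch of a grayscale guided filter. The function names, the `radius` and `eps` defaults, and the edge-padded box filter are assumptions made for illustration; the source does not state which implementation or parameters the authors used.

```python
import numpy as np


def box_mean(x, radius):
    """Mean of x over a (2*radius+1)^2 window, using edge padding and an integral image."""
    k = 2 * radius + 1
    padded = np.pad(x, radius, mode="edge")
    # Integral image with a leading row and column of zeros.
    integral = np.pad(np.cumsum(np.cumsum(padded, axis=0), axis=1), ((1, 0), (1, 0)))
    h, w = x.shape
    total = (integral[k:k + h, k:k + w] - integral[:h, k:k + w]
             - integral[k:k + h, :w] + integral[:h, :w])
    return total / (k * k)


def guided_filter(guide, src, radius=4, eps=1e-3):
    """Filter `src` so that its edges follow those of `guide` (both 2D arrays).

    radius and eps are illustrative defaults, not values from the paper.
    """
    guide = np.asarray(guide, dtype=np.float64)
    src = np.asarray(src, dtype=np.float64)
    mean_g = box_mean(guide, radius)
    mean_s = box_mean(src, radius)
    var_g = box_mean(guide * guide, radius) - mean_g * mean_g
    cov_gs = box_mean(guide * src, radius) - mean_g * mean_s
    # Per-window linear model: src ~ a * guide + b.
    a = cov_gs / (var_g + eps)
    b = mean_s - a * mean_g
    # Average the coefficients of all windows that cover each pixel.
    return box_mean(a, radius) * guide + box_mean(b, radius)
```

In the setting described above, `guide` would be the ground truth estimation and `src` the higher-resolution patch estimation.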
The maximum resolution at which the network can generate a consistent structure depends on the distribution of contextual cues in the image. Using an edge map as a proxy for contextual cues, the authors determine this maximum resolution by ensuring that no pixel is further from a contextual cue than half of the receptive field size. For this purpose, they apply binary dilation to the edge map with a receptive-field-sized kernel at different resolutions. The resolution at which the dilated edge map stops producing an all-one result is the maximum resolution at which every pixel receives context information in a forward pass.
As the merging network is lightweight, a forward pass through it takes orders of magnitude less time than one through the monocular depth estimation network. The running time of the method therefore depends mainly on the number of times the base network is used within the pipeline.
Because the estimation resolution for the whole image is determined by the number of pixels without nearby contextual cues, these regions constrain the maximum resolution. However, other regions within the image contain high-frequency details and would thus benefit from higher-resolution estimations. The authors therefore present a patch-selection mechanism that generates depth at different resolutions for different regions and subsequently merges them to ensure consistent depth.
As this approach is primarily a post-processing step that requires a pretrained depth estimation method, the authors evaluate on the Middlebury 2014 and IBims-1 datasets. The pretrained depth estimation models considered are MiDaS and SGR, which represented the SoTA in monocular depth estimation.
To evaluate the performance, this paper uses a set of standard depth evaluation metrics following recent works [6, 8], including root mean squared error in disparity space (RMSE), percentage of pixels with δ and ordinal error (ORD) from [6] in depth space.
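Two of the metrics named above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, the threshold of 1.25 is the value commonly used for δ-style metrics and is an assumption here because the text does not state it, and the ordinal error (ORD) from [6] is not reproduced.

```python
import numpy as np


def rmse(pred, target):
    """Root mean squared error between two arrays of equal shape."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    return float(np.sqrt(np.mean((pred - target) ** 2)))


def delta_error(pred, target, threshold=1.25):
    """Fraction of pixels whose ratio max(pred/target, target/pred) exceeds `threshold`.

    threshold=1.25 is an assumed, commonly used value. Inputs must be strictly positive.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    ratio = np.maximum(pred / target, target / pred)
    return float(np.mean(ratio > threshold))
```

To follow the text, `rmse` would be called on disparity maps, while `delta_error` would be called on depth maps.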
The quantitative and visual results of the proposed mechanism are summarized in Fig. 5 and Fig. 6, respectively, demonstrating its efficacy in improving the performance of pretrained depth estimation models.
Finally, the authors highlight that since their work utilizes pretrained monocular depth estimation networks, it inherits their limitations and generates relative, ordinal depth estimates rather than absolute depth values. In addition, they observed that the performance of the base models degrades with noise, and that the current method is unable to provide meaningful improvement for noisy images.
This paper demonstrates the feasibility of generating a high-resolution depth map from a single image using pretrained models. While previous work is limited to sub-megapixel resolutions, this paper generates high-resolution depth maps with pretrained depth estimation algorithms.
Monocular Depth Estimation Models lack fine-grained predictions.
Instead of devising a new algorithm, the authors focused on increasing the performance of pretrained models by merging estimations at different resolutions.
A CNN is used to ensure consistent merging of monocular depth information across different resolutions.
**Pranjay Shyam**
KAIST-ME (PhD)
pranjayshyam@kaist.ac.kr
Ngoc Quang Nguyen
박나현
MUHAMMAD ADI NUGROHO
Citation of this paper
Citation of UNet paper
Citation of Middlebury 2014 dataset
Citation of IBims-1 paper
For any given image, the resolution that yields an accurate depth estimate is determined using a simple edge map. The authors introduce two terms, R0 and R20: the resolution at which every pixel is at most half a receptive field size away from context edges is called R0, and the resolution at which 20% of the pixels do not receive any context is called R20, with R0 and R20 depending on image content. Estimations at resolutions above R0 lose structural consistency but have richer high-frequency content.
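The edge-map test behind these two resolutions (R0: every pixel has an edge within half a receptive field; R20: 20% of the pixels receive no context) can be sketched as follows. This is a simplified illustration under stated assumptions: instead of resizing the edge map to each candidate resolution, it keeps the edge map fixed and rescales the dilation kernel, and the function names, the `candidates` list and the `tolerance` parameter are hypothetical.

```python
import numpy as np


def dilate(edges, k):
    """Binary dilation with a k-by-k square kernel (zero padding), via an integral image."""
    lo = k // 2
    hi = k - 1 - lo
    padded = np.pad(np.asarray(edges).astype(np.int64), ((lo, hi), (lo, hi)))
    integral = np.pad(np.cumsum(np.cumsum(padded, axis=0), axis=1), ((1, 0), (1, 0)))
    h, w = np.asarray(edges).shape
    count = (integral[k:k + h, k:k + w] - integral[:h, k:k + w]
             - integral[k:k + h, :w] + integral[:h, :w])
    return count > 0


def uncovered_fraction(edges, resolution, receptive_field):
    """Fraction of pixels with no edge inside a receptive field at `resolution`.

    Assumption: the receptive field (in network-input pixels) is mapped back to
    edge-map pixels by the ratio of the edge map's long side to `resolution`.
    """
    native = max(np.asarray(edges).shape)
    k = max(1, int(round(receptive_field * native / resolution)))
    return 1.0 - float(dilate(edges, k).mean())


def max_resolution(edges, receptive_field, tolerance, candidates):
    """Largest candidate resolution whose uncovered fraction stays within `tolerance`."""
    best = None
    for res in sorted(candidates):
        if uncovered_fraction(edges, res, receptive_field) > tolerance:
            break
        best = res
    return best
```

With this sketch, `tolerance=0.0` plays the role of R0 and `tolerance=0.2` the role of R20 for a given binary edge map.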
As this is a data-driven issue and current datasets do not provide such high-resolution training images, the authors follow a recursive mechanism in which a base estimate is generated using double estimation at a fixed resolution of R20. Subsequently, patch selection is performed by tiling the image at the base resolution, with a tile size equal to the receptive field size and a 1/3 overlap. To select tiles with high-frequency details, the edge density of each tile is computed, and tiles with an edge density lower than that of the whole image are discarded. Next, each remaining patch is expanded until its edge density matches that of the image. Finally, the patch estimates are merged with the base estimate to generate the high-resolution result.
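The tiling and edge-density filtering described above can be sketched as follows. This is a rough illustration, not the authors' implementation: the function names, the symmetric one-pixel growth step, and the clipping of patches at the image border are assumptions, since the text does not specify how a patch is expanded.

```python
import numpy as np


def grow_patch(edges, y, x, size, target, step=1):
    """Enlarge a square tile until its edge density no longer exceeds `target`.

    The tile grows by `step` pixels on every side and is clipped to the image,
    so the result may be non-square near borders (an illustrative simplification).
    """
    h, w = edges.shape
    y0, x0, y1, x1 = y, x, y + size, x + size
    while edges[y0:y1, x0:x1].mean() > target:
        grown = (max(0, y0 - step), max(0, x0 - step), min(h, y1 + step), min(w, x1 + step))
        if grown == (y0, x0, y1, x1):
            break  # the patch already covers the whole image
        y0, x0, y1, x1 = grown
    return (y0, x0, y1, x1)


def select_patches(edges, tile, overlap=1.0 / 3.0):
    """Tile the edge map, discard low-density tiles, and grow the remaining ones."""
    edges = np.asarray(edges, dtype=np.float64)
    h, w = edges.shape
    stride = max(1, int(round(tile * (1.0 - overlap))))
    image_density = edges.mean()
    patches = []
    for y in range(0, h - tile + 1, stride):
        for x in range(0, w - tile + 1, stride):
            if edges[y:y + tile, x:x + tile].mean() < image_density:
                continue  # fewer edges than the image as a whole: discard
            patches.append(grow_patch(edges, y, x, tile, image_density))
    return patches
```

Each returned tuple is `(y0, x0, y1, x1)` in edge-map coordinates; in the pipeline described above, `tile` would correspond to the receptive field size at the base resolution.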
The authors quantitatively justify choosing R20 as the high-resolution estimate in the double-estimation framework. They highlight that using a higher resolution causes a performance drop, because high-resolution results contain heavy artifacts as the number of pixels without contextual information increases.