We introduce Marigold-SSD, a single-step, late-fusion depth completion framework that leverages strong diffusion priors while eliminating the costly test-time optimization typically associated with diffusion-based methods. By shifting computational burden from inference to finetuning, our approach enables efficient and robust 3D perception under real-world latency constraints.
Marigold-SSD achieves significantly faster inference than iterative diffusion-based depth completion, at a training cost of only 4.5 GPU days. We evaluate our method on four indoor and two outdoor benchmarks, demonstrating strong cross-domain generalization and zero-shot performance compared to existing depth completion approaches. Our approach significantly narrows the efficiency gap between diffusion-based and discriminative models. Finally, we challenge common evaluation protocols by analyzing performance under varying input sparsity levels.
Marigold-SSD builds on the generative prior of Marigold and adopts a single-step diffusion formulation with end-to-end fine-tuning. Rather than relying on expensive iterative denoising at test time, computation is shifted to a one-time fine-tuning stage, enabling deterministic single-step inference.
Marigold-SSD: single-step inference with late-fusion conditional decoder (ours)
Marigold-DC: iterative test-time optimization with guided denoising (50 steps + ensembling)
Sparse depth measurements are injected after the UNet denoiser via a conditional decoder that mirrors the multi-scale structure of the frozen VAE decoder. A trainable condition feature extractor processes the sparse input and fuses its features at each scale using 1×1 convolution layers initialized as zero convolutions (inspired by ControlNet). This preserves the original VAE behavior at initialization and allows the conditioning path to gradually contribute during fine-tuning.
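The effect of zero-initialized 1×1 convolutions can be illustrated with a minimal pure-Python sketch. It is not the released implementation: the additive fusion, the toy feature shapes, and the helper names `conv1x1` and `fuse` are assumptions made for illustration. It only shows that, with all weights and biases at zero, the fused features at one scale equal the unmodified decoder features.

```python
def conv1x1(feature, weight, bias):
    """Apply a 1x1 convolution to a feature map stored as [channel][row][col].

    weight has shape [out_channels][in_channels]; bias has length out_channels.
    A 1x1 convolution is a per-pixel linear map over the channel dimension.
    """
    in_ch = len(feature)
    height, width = len(feature[0]), len(feature[0][0])
    out = []
    for o, w_row in enumerate(weight):
        plane = [[bias[o] + sum(w_row[c] * feature[c][y][x] for c in range(in_ch))
                  for x in range(width)]
                 for y in range(height)]
        out.append(plane)
    return out


def fuse(decoder_feature, condition_feature, weight, bias):
    """Hypothetical late fusion: add projected condition features to decoder features."""
    projected = conv1x1(condition_feature, weight, bias)
    return [[[d + p for d, p in zip(d_row, p_row)]
             for d_row, p_row in zip(d_plane, p_plane)]
            for d_plane, p_plane in zip(decoder_feature, projected)]


# Toy features at one decoder scale: 2 channels, 2x2 pixels (values invented).
decoder_feature = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
condition_feature = [[[0.5, 0.0], [0.0, 0.25]], [[1.0, 1.0], [1.0, 1.0]]]

# Zero initialization: every weight and bias starts at 0.
zero_weight = [[0.0, 0.0], [0.0, 0.0]]
zero_bias = [0.0, 0.0]
fused_at_init = fuse(decoder_feature, condition_feature, zero_weight, zero_bias)

# After some fine-tuning the weights are no longer zero (values invented here),
# so the sparse-depth path starts to change the decoder features.
trained_weight = [[0.1, 0.0], [0.0, -0.2]]
fused_trained = fuse(decoder_feature, condition_feature, trained_weight, zero_bias)
```

At initialization the conditioning branch contributes nothing, so the decoder reproduces the frozen VAE decoder output. Once the weights move away from zero, the sparse-depth features begin to influence the result.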
The conditional decoder fuses multi-scale sparse depth features with the predicted depth latent, producing a dense metric depth map via global scale-and-shift alignment.
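One common way to realize a global scale-and-shift alignment is a closed-form least-squares fit between the predicted depth at the sparse pixel locations and the corresponding metric measurements. The sketch below assumes that formulation; the function name `fit_scale_shift` and all numbers are illustrative and not taken from the paper.

```python
def fit_scale_shift(pred, target):
    """Least-squares (scale, shift) minimizing sum((scale * p + shift - d)^2)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_d = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0.0:
        raise ValueError("predictions are constant; scale is undetermined")
    cov = sum((p - mean_p) * (d - mean_d) for p, d in zip(pred, target))
    scale = cov / var_p
    shift = mean_d - scale * mean_p
    return scale, shift


# Relative (affine-invariant) predictions at the pixels that have a sparse measurement.
pred_at_sparse = [0.1, 0.4, 0.7, 0.9]
# Toy metric measurements generated with scale 2.0 and shift 0.5.
sparse_metric = [2.0 * p + 0.5 for p in pred_at_sparse]

scale, shift = fit_scale_shift(pred_at_sparse, sparse_metric)

# Apply the single global transform to every pixel of the dense prediction.
dense_relative = [[0.0, 0.5], [1.0, 0.25]]
dense_metric = [[scale * v + shift for v in row] for row in dense_relative]
```

Because a single scale and a single shift are shared by all pixels, the fit needs only two or more measurements with distinct predicted values, however sparse the input is.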
Qualitative comparison of Marigold-SSD vs. Marigold-DC across indoor (NYUv2, ScanNet, iBims-1, VOID) and outdoor (DDAD, KITTI) benchmarks. Each row shows RGB input, Marigold-SSD prediction, and Marigold-DC prediction.
Zero-shot comparison on six benchmarks (four indoor, two outdoor). Best and second-best results are shown in bold and underlined, respectively. Marigold-DC with ensembling is excluded from this comparison because it uses about 10× more inference compute. Rank is a method's average position across metrics and datasets.
| Type | Method | ScanNet MAE↓ | ScanNet RMSE↓ | iBims-1 MAE↓ | iBims-1 RMSE↓ | VOID MAE↓ | VOID RMSE↓ | NYUv2 MAE↓ | NYUv2 RMSE↓ | KITTI MAE↓ | KITTI RMSE↓ | DDAD MAE↓ | DDAD RMSE↓ | Average MAE↓ | Average RMSE↓ | Rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Discriminative | NLSPN (ECCV '20) | 0.036 | 0.127 | 0.049 | 0.191 | 0.210 | 0.668 | 0.440 | 0.716 | 1.335 | 2.076 | 2.498 | 9.231 | 0.761 | 2.168 | 8.50 |
| | CFormer (CVPR '23) | 0.120 | 0.232 | 0.058 | 0.206 | 0.216 | 0.726 | 0.186 | 0.374 | 0.952 | 1.935 | 2.518 | 9.471 | 0.675 | 2.157 | 10.00 |
| | SpAgNet (WACV '23) | — | — | — | — | 0.244 | 0.706 | 0.158 | 0.292 | 0.518 | 1.788 | 4.578 | 13.236 | — | — | 10.13 |
| | BP-Net (CVPR '24) | 0.122 | 0.212 | 0.078 | 0.289 | 0.270 | 0.742 | — | — | — | — | 2.270 | 8.344 | — | — | 11.75 |
| | VPP4DC (3DV '24) | 0.023 | 0.076 | 0.062 | 0.228 | 0.148 | 0.543 | 0.077 | 0.247 | 0.413 | 1.609 | 1.344 | 6.781 | 0.344 | 1.581 | 4.00 |
| | OGNI-DC (ECCV '24) | 0.029 | 0.094 | 0.059 | 0.186 | 0.175 | 0.593 | — | — | — | — | 1.867 | 6.876 | — | — | 4.88 |
| | DepthLab (arXiv '24) | 0.051 | 0.081 | 0.098 | 0.198 | 0.214 | 0.602 | 0.184 | 0.276 | 0.921 | 2.171 | 4.498 | 8.379 | 0.994 | 1.951 | 8.58 |
| | Prompt Depth Anything (CVPR '25) | 0.042 | 0.079 | 0.088 | 0.196 | 0.191 | 0.605 | 0.110 | 0.233 | 0.934 | 2.803 | 2.107 | 7.494 | 0.579 | 1.902 | 6.83 |
| | DMD³C (CVPR '25) | 0.210 | 0.101 | — | — | 0.225 | 0.676 | — | — | — | — | 2.498 | 7.766 | — | — | 10.00 |
| | GBPN (arXiv '26) | — | — | — | — | 0.220 | 0.680 | — | — | — | — | — | — | — | — | 12.50 |
| Diffusion | Marigold + optim (CVPR '24) | 0.091 | 0.141 | 0.167 | 0.300 | 0.261 | 0.652 | 0.194 | 0.309 | 1.765 | 3.361 | 22.872 | 32.661 | 4.225 | 6.237 | 13.25 |
| | Marigold + LS (CVPR '24) | 0.083 | 0.129 | 0.154 | 0.286 | 0.238 | 0.628 | 0.190 | 0.294 | 1.709 | 3.305 | 8.217 | 14.728 | 1.765 | 3.228 | 12.08 |
| | Marigold-E2E + LS (WACV '25) | 0.073 | 0.116 | 0.143 | 0.275 | 0.233 | 0.623 | 0.134 | 0.224 | 1.591 | 3.214 | 7.901 | 14.231 | 1.679 | 3.114 | 10.42 |
| | Marigold-DC ensemble (ICCV '25) | 0.017 | 0.057 | 0.045 | 0.166 | 0.152 | 0.551 | 0.048 | 0.124 | 0.434 | 1.465 | 2.364 | 6.449 | 0.510 | 1.469 | 1.75* |
| | Marigold-DC (ICCV '25) | 0.020 | 0.063 | 0.062 | 0.205 | 0.157 | 0.557 | 0.057 | 0.142 | 0.558 | 1.676 | 2.985 | 7.905 | 0.640 | 1.758 | 5.08 |
| | Marigold-SSD★ (ours) | 0.022 | 0.062 | 0.056 | 0.182 | 0.177 | 0.588 | 0.045 | 0.128 | 2.443 | 4.070 | 3.855 | 7.840 | 1.100 | 2.145 | 5.33 |
| | Marigold-SSD (ours) | 0.027 | 0.068 | 0.060 | 0.185 | 0.182 | 0.590 | 0.052 | 0.134 | 0.454 | 1.496 | 2.065 | 6.522 | 0.473 | 1.499 | 3.75 |
* Marigold-DC ensemble is excluded from ranking — it uses 10-sample ensembling, increasing runtime by ~10×. ★ denotes the variant trained on lower density levels. Bold = best, underline = second best.
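An average-rank score of this kind can be computed as sketched below. This is a hedged illustration: it assumes that methods are ranked separately for each dataset and metric pair, that methods without a reported value are skipped for that pair, and that ties are not treated specially. The function name `average_ranks` is hypothetical. The example uses only a small subset of the table, so it does not reproduce the Rank column.

```python
def average_ranks(scores):
    """scores maps (dataset, metric) -> {method: error}; lower error is better.

    Returns {method: mean position over the (dataset, metric) pairs where the
    method has a reported value}.
    """
    positions = {}
    for column in scores.values():
        ordered = sorted(column, key=column.get)  # best (lowest error) first
        for position, method in enumerate(ordered, start=1):
            positions.setdefault(method, []).append(position)
    return {method: sum(p) / len(p) for method, p in positions.items()}


# Subset of the reported numbers; SpAgNet has no ScanNet entry and is skipped there.
scores = {
    ("ScanNet", "MAE"): {"VPP4DC": 0.023, "Marigold-DC": 0.020, "Marigold-SSD": 0.027},
    ("ScanNet", "RMSE"): {"VPP4DC": 0.076, "Marigold-DC": 0.063, "Marigold-SSD": 0.068},
    ("VOID", "MAE"): {"VPP4DC": 0.148, "Marigold-DC": 0.157,
                      "Marigold-SSD": 0.182, "SpAgNet": 0.244},
}

ranks = average_ranks(scores)
```

Averaging only over the pairs where a method was evaluated keeps missing entries from counting as worst-case results, but it also means that methods evaluated on few datasets are ranked on less evidence.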
We evaluate Marigold-SSD across a range of input depth densities. At the commonly used evaluation sparsity of 5,000 points on the DDAD dataset, we observe that even sophisticated models can be outperformed by simple barycentric interpolation within a Delaunay triangulation. However, at lower input densities the pretrained diffusion prior becomes more beneficial, and our approach outperforms both Marigold-DC and the interpolation.
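For reference, the interpolation baseline reduces to a simple per-triangle computation once the sparse points have been triangulated. The sketch below shows barycentric interpolation inside a single triangle in pure Python. The triangulation itself is assumed to come from elsewhere (for example `scipy.spatial.Delaunay`), and the helper names and coordinates are illustrative.

```python
def barycentric_weights(p, a, b, c):
    """Barycentric coordinates of 2D point p with respect to triangle (a, b, c)."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    if det == 0:
        raise ValueError("degenerate triangle")
    wa = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    wb = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    return wa, wb, 1.0 - wa - wb


def interpolate_depth(p, vertices, depths, eps=1e-12):
    """Interpolate depth at p from three sparse samples; None if p is outside."""
    wa, wb, wc = barycentric_weights(p, *vertices)
    if min(wa, wb, wc) < -eps:
        return None  # p is not covered by this triangle
    return wa * depths[0] + wb * depths[1] + wc * depths[2]


# Three sparse depth samples (pixel coordinates and depths are invented).
vertices = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
depths = [10.0, 20.0, 30.0]

inside = interpolate_depth((1.0, 1.0), vertices, depths)
at_vertex = interpolate_depth((4.0, 0.0), vertices, depths)
outside = interpolate_depth((5.0, 5.0), vertices, depths)
```

The interpolant is piecewise linear and reproduces each measurement exactly at its own pixel. This is part of why it becomes a strong baseline when samples are dense and loses accuracy as triangles grow with increasing sparsity.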
Evaluation under multiple levels of depth density on NYUv2, ScanNet, VOID, iBims-1, and DDAD. Density is expressed as the number of depth samples (#). For indoor datasets we plot only RMSE, as the MAE trends are the same.
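A density sweep of this kind needs two ingredients: a way to subsample the available depth to a target number of points, and the MAE and RMSE metrics. The sketch below is a minimal illustration, not the paper's protocol. It assumes uniform random sampling and a validity convention where a depth of 0 marks a missing pixel. The helper names are hypothetical.

```python
import math
import random


def mae_rmse(pred, gt):
    """MAE and RMSE over pixels with valid ground truth (assumed: gt > 0)."""
    errors = [p - g for p, g in zip(pred, gt) if g > 0]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse


def subsample(depth, num_samples, seed=0):
    """Keep at most num_samples valid depth values; set the rest to 0 (missing)."""
    valid = [i for i, d in enumerate(depth) if d > 0]
    rng = random.Random(seed)
    keep = set(rng.sample(valid, min(num_samples, len(valid))))
    return [d if i in keep else 0.0 for i, d in enumerate(depth)]


# Flattened toy depth maps (values invented); 0.0 marks a pixel without ground truth.
gt = [1.0, 0.0, 4.0, 3.0]
pred = [1.0, 2.0, 3.0, 5.0]

mae, rmse = mae_rmse(pred, gt)   # errors on valid pixels: 0, -1, 2
sparse_input = subsample(gt, 2)  # simulated sparse input with 2 samples
```

Note that MAE can never exceed RMSE on the same set of pixels, which is a quick sanity check for any reported pair of values.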
If you find this work useful, please cite:
@InProceedings{Gregorek_2026_CVPR,
author = {Gregorek, Jakub and Pegios, Paraskevas and Metzger, Nando and Schindler, Konrad and Kontogianni, Theodora and Nalpantidis, Lazaros},
title = {Need for Speed: Zero-Shot Depth Completion with Single-Step Diffusion},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
month = {June},
year = {2026},
pages = {1861-1872}
}