We introduce Marigold-SSD, a single-step, late-fusion depth completion framework that leverages strong diffusion priors while eliminating the costly test-time optimization typically associated with diffusion-based methods. By shifting computational burden from inference to finetuning, our approach enables efficient and robust 3D perception under real-world latency constraints.
Marigold-SSD achieves significantly faster inference with a training cost of only 4.5 GPU days. We evaluate our method across four indoor and two outdoor benchmarks, demonstrating strong cross-domain generalization and zero-shot performance compared to existing depth completion approaches. Our approach significantly narrows the efficiency gap between diffusion-based and discriminative models. Finally, we challenge common evaluation protocols by analyzing performance under varying input sparsity levels.
Marigold-SSD builds on the generative prior of Marigold and adopts a single-step diffusion formulation with end-to-end fine-tuning. Rather than relying on expensive iterative denoising at test time, computation is shifted to a one-time fine-tuning stage, enabling deterministic single-step inference.
Marigold-SSD: single-step inference with late-fusion conditional decoder (ours)
Marigold-DC: iterative test-time optimization with guided denoising (50 steps + ensembling)
Sparse depth measurements are injected after the UNet denoiser via a conditional decoder that mirrors the multi-scale structure of the frozen VAE decoder. A trainable condition feature extractor processes the sparse input and fuses its features at each scale using 1×1 convolution layers initialized as zero convolutions (inspired by ControlNet). This preserves the original VAE behavior at initialization and allows the conditioning path to gradually contribute during fine-tuning.
The conditional decoder fuses multi-scale sparse depth features with the predicted depth latent, producing a dense metric depth map via global scale-and-shift alignment.
Qualitative comparison of Marigold-SSD vs. Marigold-DC across indoor (NYUv2, ScanNet, IBims-1, VOID) and outdoor (DDAD, KITTI) benchmarks. Each row shows RGB input, Marigold-SSD prediction, and Marigold-DC prediction.
Zero-shot comparison on six benchmarks (four indoor, two outdoor). Best and second-best results are bold and underlined, excluding the Marigold-DC with ensembling which uses 10× more inference compute. Rank expresses the average position per metric and dataset.
| Type | Method | ScanNet | iBims-1 | VOID | NYUv2 | KITTI | DDAD | Average | Rank | |||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MAE↓ | RMSE↓ | MAE↓ | RMSE↓ | MAE↓ | RMSE↓ | MAE↓ | RMSE↓ | MAE↓ | RMSE↓ | MAE↓ | RMSE↓ | MAE↓ | RMSE↓ | |||
| Discriminative | NLSPN ECCV '20 | 0.036 | 0.127 | 0.049 | 0.191 | 0.210 | 0.668 | 0.440 | 0.716 | 1.335 | 2.076 | 2.498 | 9.231 | 0.761 | 2.168 | 8.50 |
| CFormer CVPR '23 | 0.120 | 0.232 | 0.058 | 0.206 | 0.216 | 0.726 | 0.186 | 0.374 | 0.952 | 1.935 | 2.518 | 9.471 | 0.675 | 2.157 | 10.00 | |
| SpAgNet WACV '23 | — | — | — | — | 0.244 | 0.706 | 0.158 | 0.292 | 0.518 | 1.788 | 4.578 | 13.236 | — | — | 10.13 | |
| BP-Net CVPR '24 | 0.122 | 0.212 | 0.078 | 0.289 | 0.270 | 0.742 | — | — | — | — | 2.270 | 8.344 | — | — | 11.75 | |
| VPP4DC 3DV '24 | 0.023 | 0.076 | 0.062 | 0.228 | 0.148 | 0.543 | 0.077 | 0.247 | 0.413 | 1.609 | 1.344 | 6.781 | 0.344 | 1.581 | 4.00 | |
| OGNI-DC ECCV '24 | 0.029 | 0.094 | 0.059 | 0.186 | 0.175 | 0.593 | — | — | — | — | 1.867 | 6.876 | — | — | 4.88 | |
| DepthLab arXiv '24 | 0.051 | 0.081 | 0.098 | 0.198 | 0.214 | 0.602 | 0.184 | 0.276 | 0.921 | 2.171 | 4.498 | 8.379 | 0.994 | 1.951 | 8.58 | |
| Prompt Depth Anything CVPR '25 | 0.042 | 0.079 | 0.088 | 0.196 | 0.191 | 0.605 | 0.110 | 0.233 | 0.934 | 2.803 | 2.107 | 7.494 | 0.579 | 1.902 | 6.83 | |
| DMD³C CVPR '25 | 0.210 | 0.101 | — | — | 0.225 | 0.676 | — | — | — | — | 2.498 | 7.766 | — | — | 10.00 | |
| GBPN arXiv '26 | — | — | — | — | 0.220 | 0.680 | — | — | — | — | — | — | — | — | 12.50 | |
| Diffusion | Marigold + optim CVPR '24 | 0.091 | 0.141 | 0.167 | 0.300 | 0.261 | 0.652 | 0.194 | 0.309 | 1.765 | 3.361 | 22.872 | 32.661 | 4.225 | 6.237 | 13.25 |
| Marigold + LS CVPR '24 | 0.083 | 0.129 | 0.154 | 0.286 | 0.238 | 0.628 | 0.190 | 0.294 | 1.709 | 3.305 | 8.217 | 14.728 | 1.765 | 3.228 | 12.08 | |
| Marigold-E2E + LS WACV '25 | 0.073 | 0.116 | 0.143 | 0.275 | 0.233 | 0.623 | 0.134 | 0.224 | 1.591 | 3.214 | 7.901 | 14.231 | 1.679 | 3.114 | 10.42 | |
| Marigold-DC ensemble ICCV '25 | 0.017 | 0.057 | 0.045 | 0.166 | 0.152 | 0.551 | 0.048 | 0.124 | 0.434 | 1.465 | 2.364 | 6.449 | 0.510 | 1.469 | 1.75* | |
| Marigold-DC ICCV '25 | 0.020 | 0.063 | 0.062 | 0.205 | 0.157 | 0.557 | 0.057 | 0.142 | 0.558 | 1.676 | 2.985 | 7.905 | 0.640 | 1.758 | 5.08 | |
| Marigold-SSD★ Ours | 0.022 | 0.062 | 0.056 | 0.182 | 0.177 | 0.588 | 0.045 | 0.128 | 2.443 | 4.070 | 3.855 | 7.840 | 1.100 | 2.145 | 5.33 | |
| Marigold-SSD Ours | 0.027 | 0.068 | 0.060 | 0.185 | 0.182 | 0.590 | 0.052 | 0.134 | 0.454 | 1.496 | 2.065 | 6.522 | 0.473 | 1.499 | 3.75 | |
* Marigold-DC ensemble is excluded from ranking — it uses 10-sample ensembling, increasing runtime by ~10×. ★ denotes the variant trained on lower density levels. Bold = best, underline = second best.
If you find this work useful, please cite:
@article{gregorek2026depth,
title = {Need for Speed: Zero-Shot Depth Completion with Single-Step Diffusion},
author = {Gregorek, Jakub and Pegios, Paraskevas and Metzger, Nando and
Schindler, Konrad and Kontogianni, Theodora and Nalpantidis, Lazaros},
journal = {arXiv preprint arXiv:2603.10584},
year = {2026}
}