Preprint  ·  arXiv 2026

Need for Speed: Zero-Shot Depth Completion
with Single-Step Diffusion

Jakub Gregorek¹,²    Paraskevas Pegios¹,²    Nando Metzger³    Konrad Schindler³
Theodora Kontogianni¹,²    Lazaros Nalpantidis¹,²
¹DTU – Technical University of Denmark   ²Pioneer Centre for AI   ³ETH Zürich

Abstract

We introduce Marigold-SSD, a single-step, late-fusion depth completion framework that leverages strong diffusion priors while eliminating the costly test-time optimization typically associated with diffusion-based methods. By shifting the computational burden from inference to fine-tuning, our approach enables efficient and robust 3D perception under real-world latency constraints.

Marigold-SSD runs inference 66× faster than Marigold-DC in a single run, at a fine-tuning cost of only 4.5 GPU days on a single NVIDIA H100. We evaluate our method across four indoor and two outdoor benchmarks, demonstrating strong cross-domain generalization and zero-shot performance compared to existing depth completion approaches. Our approach significantly narrows the efficiency gap between diffusion-based and discriminative models. Finally, we challenge common evaluation protocols by analyzing performance under varying input sparsity levels.

Key Results

66× faster inference than Marigold-DC (single run)
660× speedup vs. Marigold-DC with 10-sample ensembling
1.499 RMSE (ours) vs. 1.758 for Marigold-DC
4.5 GPU days to fine-tune (single NVIDIA H100)
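The headline numbers above are related by simple arithmetic, sketched below. The sketch assumes that a 10-sample ensemble costs ten single runs, and the relative RMSE reduction is a derived quantity that the page itself does not report; all variable names are illustrative.

```python
# Figures quoted in the key results.
speedup_single_run = 66          # Marigold-SSD vs. one Marigold-DC run
ensemble_size = 10               # samples in the Marigold-DC ensemble
rmse_ssd, rmse_dc = 1.499, 1.758

# Assumption: ensembling multiplies Marigold-DC runtime by the ensemble size.
speedup_vs_ensemble = speedup_single_run * ensemble_size

# Derived quantity (not stated in the source): relative RMSE reduction.
relative_rmse_reduction = (rmse_dc - rmse_ssd) / rmse_dc

print(f"speedup vs. ensemble: {speedup_vs_ensemble}x")
print(f"relative RMSE reduction: {relative_rmse_reduction:.1%}")
```

Under that assumption the 660× figure is the 66× single-run speedup times ten, and the RMSE gap corresponds to a reduction of roughly 14.7% relative to Marigold-DC.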

Contributions

Method

Marigold-SSD builds on the generative prior of Marigold and adopts a single-step diffusion formulation with end-to-end fine-tuning. Rather than relying on expensive iterative denoising at test time, computation is shifted to a one-time fine-tuning stage, enabling deterministic single-step inference.

Architecture Comparison

Marigold-SSD: single-step inference with late-fusion conditional decoder (ours)

Marigold-DC: iterative test-time optimization with guided denoising (50 steps + ensembling)

Late-Fusion Conditional Decoder

Sparse depth measurements are injected after the UNet denoiser via a conditional decoder that mirrors the multi-scale structure of the frozen VAE decoder. A trainable condition feature extractor processes the sparse input and fuses its features at each scale using 1×1 convolution layers initialized as zero convolutions (inspired by ControlNet). This preserves the original VAE behavior at initialization and allows the conditioning path to gradually contribute during fine-tuning.
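A minimal sketch of why zero-initialized 1×1 convolutions leave the decoder unchanged at the start of fine-tuning. It uses plain Python lists rather than a deep learning framework, and it assumes the projected condition features are added to the decoder features (the additive form is an assumption borrowed from ControlNet-style fusion; the function names `conv1x1`, `zero_conv_params`, and `fuse` are hypothetical).

```python
def conv1x1(features, weight, bias):
    """1x1 convolution: a per-pixel linear map across channels.

    features: list of C_in channel maps, each an H x W nested list
    weight:   C_out x C_in matrix, bias: list of C_out values
    """
    n_in = len(features)
    height, width = len(features[0]), len(features[0][0])
    return [
        [[bias[o] + sum(weight[o][c] * features[c][y][x] for c in range(n_in))
          for x in range(width)]
         for y in range(height)]
        for o in range(len(weight))
    ]


def zero_conv_params(c_out, c_in):
    """Weights and bias of a zero convolution (everything starts at 0)."""
    return [[0.0] * c_in for _ in range(c_out)], [0.0] * c_out


def fuse(decoder_feat, cond_feat, weight, bias):
    """Assumed additive fusion of condition features at one decoder scale."""
    projected = conv1x1(cond_feat, weight, bias)
    return [[[d + p for d, p in zip(d_row, p_row)]
             for d_row, p_row in zip(d_ch, p_ch)]
            for d_ch, p_ch in zip(decoder_feat, projected)]


# Toy features: 2 decoder channels and 1 condition channel on a 2x2 grid.
decoder_feat = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
cond_feat = [[[0.5, 0.0], [0.0, 2.0]]]

weight, bias = zero_conv_params(c_out=2, c_in=1)
fused_at_init = fuse(decoder_feat, cond_feat, weight, bias)

# After training moves the weights away from zero, the condition contributes.
fused_after = fuse(decoder_feat, cond_feat, [[1.0], [0.0]], bias)
```

With all-zero parameters `fused_at_init` equals `decoder_feat` exactly, so the pretrained decoder output is reproduced; once the weights are nonzero, the sparse-depth features start to alter the result.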

The conditional decoder fuses multi-scale sparse depth features with the predicted depth latent, producing a dense metric depth map via global scale-and-shift alignment.
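The page names global scale-and-shift alignment without giving a formula. One common way to realize it, assumed here rather than taken from the source, is a closed-form least-squares fit of a single scale and shift against the sparse metric measurements. In this sketch, pixels are flattened into lists, `None` marks pixels without a measurement, and `fit_scale_shift` is a hypothetical helper name.

```python
def fit_scale_shift(pred, sparse):
    """Least-squares scale s and shift t minimizing sum((s * p + t - d)^2).

    pred:   predicted (relative) depth per pixel
    sparse: metric depth per pixel, or None where no measurement exists
    """
    pairs = [(p, d) for p, d in zip(pred, sparse) if d is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two sparse measurements")
    mean_p = sum(p for p, _ in pairs) / n
    mean_d = sum(d for _, d in pairs) / n
    var_p = sum((p - mean_p) ** 2 for p, _ in pairs)
    if var_p == 0:
        raise ValueError("prediction is constant at the measured pixels")
    cov = sum((p - mean_p) * (d - mean_d) for p, d in pairs)
    scale = cov / var_p
    shift = mean_d - scale * mean_p
    return scale, shift


# Toy example: the measurements follow metric = 2 * pred + 1.
pred = [0.0, 0.25, 0.5, 1.0]
sparse = [1.0, None, 2.0, 3.0]

scale, shift = fit_scale_shift(pred, sparse)
dense_metric = [scale * p + shift for p in pred]
```

Because one scale and one shift are shared by all pixels, the fit needs only two or more sparse points, and the unmeasured pixel in the toy example is mapped to a metric value by the same affine transform.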

Qualitative Results

[Figure: nine scene rows (Corridor, Lab, Meeting room, Scene 28, Scene 86, Scene 0000000708, Scene 0000001837, KITTI scene 1, KITTI scene 2), each with RGB, Marigold-SSD, and Marigold-DC panels.]

Qualitative comparison of Marigold-SSD vs. Marigold-DC across indoor (NYUv2, ScanNet, iBims-1, VOID) and outdoor (DDAD, KITTI) benchmarks. Each row shows the RGB input, the Marigold-SSD prediction, and the Marigold-DC prediction.

Quantitative Results

Zero-shot comparison on six benchmarks (four indoor, two outdoor). Best and second-best results are bold and underlined, respectively, excluding Marigold-DC with ensembling, which uses 10× more inference compute. Rank is a method's position averaged over all metrics and datasets.

Type / Method (venue): ScanNet  iBims-1  VOID  NYUv2  KITTI  DDAD  Average  Rank
Each dataset and Average entry is MAE↓/RMSE↓.

Discriminative
NLSPN (ECCV '20): 0.036/0.127  0.049/0.191  0.210/0.668  0.440/0.716  1.335/2.076  2.498/9.231  0.761/2.168  Rank 8.50
CFormer (CVPR '23): 0.120/0.232  0.058/0.206  0.216/0.726  0.186/0.374  0.952/1.935  2.518/9.471  0.675/2.157  Rank 10.00
SpAgNet (WACV '23): 0.244/0.706  0.158/0.292  0.518/1.788  4.578/13.236  Rank 10.13
BP-Net (CVPR '24): 0.122/0.212  0.078/0.289  0.270/0.742  2.270/8.344  Rank 11.75
VPP4DC (3DV '24): 0.023/0.076  0.062/0.228  0.148/0.543  0.077/0.247  0.413/1.609  1.344/6.781  0.344/1.581  Rank 4.00
OGNI-DC (ECCV '24): 0.029/0.094  0.059/0.186  0.175/0.593  1.867/6.876  Rank 4.88
DepthLab (arXiv '24): 0.051/0.081  0.098/0.198  0.214/0.602  0.184/0.276  0.921/2.171  4.498/8.379  0.994/1.951  Rank 8.58
Prompt Depth Anything (CVPR '25): 0.042/0.079  0.088/0.196  0.191/0.605  0.110/0.233  0.934/2.803  2.107/7.494  0.579/1.902  Rank 6.83
DMD³C (CVPR '25): 0.210/0.101  0.225/0.676  2.498/7.766  Rank 10.00
GBPN (arXiv '26): 0.220/0.680  Rank 12.50

Diffusion
Marigold + optim (CVPR '24): 0.091/0.141  0.167/0.300  0.261/0.652  0.194/0.309  1.765/3.361  22.872/32.661  4.225/6.237  Rank 13.25
Marigold + LS (CVPR '24): 0.083/0.129  0.154/0.286  0.238/0.628  0.190/0.294  1.709/3.305  8.217/14.728  1.765/3.228  Rank 12.08
Marigold-E2E + LS (WACV '25): 0.073/0.116  0.143/0.275  0.233/0.623  0.134/0.224  1.591/3.214  7.901/14.231  1.679/3.114  Rank 10.42
Marigold-DC ensemble (ICCV '25): 0.017/0.057  0.045/0.166  0.152/0.551  0.048/0.124  0.434/1.465  2.364/6.449  0.510/1.469  Rank 1.75*
Marigold-DC (ICCV '25): 0.020/0.063  0.062/0.205  0.157/0.557  0.057/0.142  0.558/1.676  2.985/7.905  0.640/1.758  Rank 5.08
Marigold-SSD★ (Ours): 0.022/0.062  0.056/0.182  0.177/0.588  0.045/0.128  2.443/4.070  3.855/7.840  1.100/2.145  Rank 5.33
Marigold-SSD (Ours): 0.027/0.068  0.060/0.185  0.182/0.590  0.052/0.134  0.454/1.496  2.065/6.522  0.473/1.499  Rank 3.75

* Marigold-DC ensemble is excluded from ranking because it uses 10-sample ensembling, which increases runtime by ~10×. ★ denotes the variant trained on lower density levels. Bold = best, underline = second best.
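For reference, the two metrics reported in the table follow their standard definitions: MAE is the mean absolute error and RMSE is the root of the mean squared error. The sketch below assumes that only pixels with valid ground truth are scored and that invalid pixels are marked with `None`; that masking convention and the function name `mae_rmse` are illustrative, not taken from the source.

```python
import math


def mae_rmse(pred, gt):
    """Return (MAE, RMSE) over pixels that have ground-truth depth.

    pred: predicted depth per pixel
    gt:   ground-truth depth per pixel, or None where it is missing
    """
    errors = [p - g for p, g in zip(pred, gt) if g is not None]
    if not errors:
        raise ValueError("no valid ground-truth pixels")
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse


# Toy example: the last pixel has no ground truth and is ignored.
pred = [1.0, 2.0, 4.0, 9.0]
gt = [1.0, 3.0, 2.0, None]

mae, rmse = mae_rmse(pred, gt)
```

RMSE weights large errors more heavily than MAE and is never smaller than it, so each RMSE entry is expected to be at least as large as the matching MAE entry.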

BibTeX

If you find this work useful, please cite:

@article{gregorek2026depth,
  title   = {Need for Speed: Zero-Shot Depth Completion with Single-Step Diffusion},
  author  = {Gregorek, Jakub and Pegios, Paraskevas and Metzger, Nando and
             Schindler, Konrad and Kontogianni, Theodora and Nalpantidis, Lazaros},
  journal = {arXiv preprint arXiv:2603.10584},
  year    = {2026}
}