Need for Speed: Zero-Shot Depth Completion with Single-Step Diffusion

Abstract

We introduce Marigold-SSD, a single-step, late-fusion depth completion framework that leverages strong diffusion priors while eliminating the costly test-time optimization typically associated with diffusion-based methods. By shifting computational burden from inference to finetuning, our approach enables efficient and robust 3D perception under real-world latency constraints.

Marigold-SSD achieves significantly faster inference than diffusion-based baselines at a training cost of only 4.5 GPU days. We evaluate our method on four indoor and two outdoor benchmarks, demonstrating strong cross-domain generalization and zero-shot performance relative to existing depth completion approaches. Our approach significantly narrows the efficiency gap between diffusion-based and discriminative models. Finally, we challenge common evaluation protocols by analyzing performance under varying input sparsity levels.

Contributions

1. First single-step diffusion method for depth completion. Marigold-SSD is significantly faster than diffusion baselines while delivering better average performance, and it remains competitive even when baselines employ ensembling at substantially higher computational cost.
2. Late-fusion conditional decoder. A simple yet effective strategy for conditioning on sparse measurements; ablation studies validate its advantage over early-fusion alternatives.
3. Comprehensive zero-shot evaluation. Marigold-SSD is robust to varying condition sparsity levels across indoor and outdoor datasets, and the evaluation reveals limitations of existing evaluation benchmarks.

Method

Marigold-SSD builds on the generative prior of Marigold and adopts a single-step diffusion formulation with end-to-end fine-tuning. Rather than relying on expensive iterative denoising at test time, computation is shifted to a one-time fine-tuning stage, enabling deterministic single-step inference.

Architecture Comparison

Marigold-SSD: single-step inference with late-fusion conditional decoder (ours)

Marigold-DC: iterative test-time optimization with guided denoising (50 steps + ensembling)

Late-Fusion Conditional Decoder

Sparse depth measurements are injected after the UNet denoiser via a conditional decoder that mirrors the multi-scale structure of the frozen VAE decoder. A trainable condition feature extractor processes the sparse input and fuses its features at each scale using 1×1 convolution layers initialized as zero convolutions (inspired by ControlNet). This preserves the original VAE behavior at initialization and allows the conditioning path to gradually contribute during fine-tuning.
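The zero-initialization property described above can be illustrated with a minimal, framework-free sketch. It assumes additive fusion at a single scale, and the names `ZeroConv1x1` and `fuse` are hypothetical rather than taken from the Marigold-SSD code; the actual decoder works on multi-scale tensors inside the VAE decoder.

```python
from typing import List

Vector = List[float]               # channel values at one pixel
FeatureMap = List[List[Vector]]    # H x W grid of per-pixel channel vectors


class ZeroConv1x1:
    """A 1x1 convolution: the same linear map applied to every pixel's channels."""

    def __init__(self, in_channels: int, out_channels: int) -> None:
        # Zero initialization: the layer outputs exactly 0 until it is trained.
        self.weight = [[0.0] * in_channels for _ in range(out_channels)]
        self.bias = [0.0] * out_channels

    def __call__(self, x: FeatureMap) -> FeatureMap:
        return [
            [
                [sum(w * v for w, v in zip(row, pixel)) + b
                 for row, b in zip(self.weight, self.bias)]
                for pixel in line
            ]
            for line in x
        ]


def fuse(decoder_feat: FeatureMap, cond_feat: FeatureMap,
         zero_conv: ZeroConv1x1) -> FeatureMap:
    """Late fusion at one scale (assumed additive): decoder + projected condition."""
    projected = zero_conv(cond_feat)
    return [
        [[d + p for d, p in zip(d_pix, p_pix)]
         for d_pix, p_pix in zip(d_line, p_line)]
        for d_line, p_line in zip(decoder_feat, projected)
    ]


decoder_feat = [[[1.0, 2.0], [3.0, 4.0]]]   # 1 x 2 pixels, 2 channels
cond_feat = [[[0.5], [0.0]]]                # 1 x 2 pixels, 1 channel
conv = ZeroConv1x1(in_channels=1, out_channels=2)

# At initialization the conditioning path contributes nothing.
fused_at_init = fuse(decoder_feat, cond_feat, conv)

# Stand-in for a weight learned during fine-tuning.
conv.weight[0][0] = 2.0
fused_after_update = fuse(decoder_feat, cond_feat, conv)
```

With all weights at zero, `fused_at_init` equals `decoder_feat`, so the frozen decoder's behavior is unchanged; once a weight becomes non-zero, only pixels with a non-zero condition feature are affected.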

The conditional decoder fuses multi-scale sparse depth features with the predicted depth latent, producing a dense metric depth map via global scale-and-shift alignment.
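A global scale-and-shift alignment can be sketched as follows. The page does not state how the two parameters are estimated, so this example assumes an ordinary least-squares fit of the prediction to the sparse measurements; `fit_scale_shift` is a hypothetical helper, depth maps are flattened to lists, and `None` marks pixels without a measurement.

```python
from typing import List, Optional, Tuple


def fit_scale_shift(pred: List[float],
                    sparse: List[Optional[float]]) -> Tuple[float, float]:
    """Least-squares scale s and shift t minimizing sum((s * pred + t - sparse)^2)
    over the pixels that carry a sparse measurement (assumed estimator)."""
    pairs = [(p, d) for p, d in zip(pred, sparse) if d is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two sparse measurements")
    mean_p = sum(p for p, _ in pairs) / n
    mean_d = sum(d for _, d in pairs) / n
    var_p = sum((p - mean_p) ** 2 for p, _ in pairs)
    if var_p == 0.0:
        raise ValueError("prediction is constant at the measured pixels")
    cov = sum((p - mean_p) * (d - mean_d) for p, d in pairs)
    scale = cov / var_p
    shift = mean_d - scale * mean_p
    return scale, shift


# Relative prediction and three sparse metric measurements (in metres).
pred = [0.0, 0.25, 0.5, 1.0]
sparse = [1.0, None, 2.0, 3.0]

scale, shift = fit_scale_shift(pred, sparse)
metric_depth = [scale * p + shift for p in pred]
```

The same two parameters are applied to every pixel, which is what makes the alignment global: the unmeasured pixel at index 1 receives a metric value purely from the fitted scale and shift.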

Quantitative Results

Zero-shot comparison on six benchmarks (four indoor, two outdoor). Best and second-best results are bold and underlined, excluding Marigold-DC with ensembling, which uses 10× more inference compute. Rank is the average position across all metric and dataset combinations.

Type	Method	ScanNet		iBims-1		VOID		NYUv2		KITTI		DDAD		Average		Rank
Type	Method	MAE↓	RMSE↓	MAE↓	RMSE↓	MAE↓	RMSE↓	MAE↓	RMSE↓	MAE↓	RMSE↓	MAE↓	RMSE↓	MAE↓	RMSE↓	Rank
Discriminative	NLSPN ECCV '20	0.036	0.127	0.049	0.191	0.210	0.668	0.440	0.716	1.335	2.076	2.498	9.231	0.761	2.168	8.50
	CFormer CVPR '23	0.120	0.232	0.058	0.206	0.216	0.726	0.186	0.374	0.952	1.935	2.518	9.471	0.675	2.157	10.00
	SpAgNet WACV '23	—	—	—	—	0.244	0.706	0.158	0.292	0.518	1.788	4.578	13.236	—	—	10.13
	BP-Net CVPR '24	0.122	0.212	0.078	0.289	0.270	0.742	—	—	—	—	2.270	8.344	—	—	11.75
	VPP4DC 3DV '24	0.023	0.076	0.062	0.228	0.148	0.543	0.077	0.247	0.413	1.609	1.344	6.781	0.344	1.581	4.00
	OGNI-DC ECCV '24	0.029	0.094	0.059	0.186	0.175	0.593	—	—	—	—	1.867	6.876	—	—	4.88
	DepthLab arXiv '24	0.051	0.081	0.098	0.198	0.214	0.602	0.184	0.276	0.921	2.171	4.498	8.379	0.994	1.951	8.58
	Prompt Depth Anything CVPR '25	0.042	0.079	0.088	0.196	0.191	0.605	0.110	0.233	0.934	2.803	2.107	7.494	0.579	1.902	6.83
	DMD³C CVPR '25	0.210	0.101	—	—	0.225	0.676	—	—	—	—	2.498	7.766	—	—	10.00
	GBPN arXiv '26	—	—	—	—	0.220	0.680	—	—	—	—	—	—	—	—	12.50

Diffusion	Marigold + optim CVPR '24	0.091	0.141	0.167	0.300	0.261	0.652	0.194	0.309	1.765	3.361	22.872	32.661	4.225	6.237	13.25
	Marigold + LS CVPR '24	0.083	0.129	0.154	0.286	0.238	0.628	0.190	0.294	1.709	3.305	8.217	14.728	1.765	3.228	12.08
	Marigold-E2E + LS WACV '25	0.073	0.116	0.143	0.275	0.233	0.623	0.134	0.224	1.591	3.214	7.901	14.231	1.679	3.114	10.42
	Marigold-DC ensemble ICCV '25	0.017	0.057	0.045	0.166	0.152	0.551	0.048	0.124	0.434	1.465	2.364	6.449	0.510	1.469	1.75*
	Marigold-DC ICCV '25	0.020	0.063	0.062	0.205	0.157	0.557	0.057	0.142	0.558	1.676	2.985	7.905	0.640	1.758	5.08
	Marigold-SSD★ Ours	0.022	0.062	0.056	0.182	0.177	0.588	0.045	0.128	2.443	4.070	3.855	7.840	1.100	2.145	5.33
	Marigold-SSD Ours	0.027	0.068	0.060	0.185	0.182	0.590	0.052	0.134	0.454	1.496	2.065	6.522	0.473	1.499	3.75

* Marigold-DC ensemble is excluded from ranking — it uses 10-sample ensembling, increasing runtime by ~10×. ★ denotes the variant trained on lower density levels. Bold = best, underline = second best.
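The Rank column can be read as an average of per-column positions. The sketch below shows one way to compute it on a toy subset of the table (the ScanNet MAE and RMSE of three methods); how the authors treat ties and missing entries is not stated here, so `average_rank` is a hypothetical helper that assumes complete columns and breaks ties by input order.

```python
from typing import Dict, List


def average_rank(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """scores[method] holds one error per (dataset, metric) column; lower is better.
    Returns each method's mean position over all columns."""
    methods = list(scores)
    n_cols = len(next(iter(scores.values())))
    totals = {m: 0.0 for m in methods}
    for col in range(n_cols):
        # Position 1 goes to the lowest error in this column.
        ordered = sorted(methods, key=lambda m: scores[m][col])
        for position, m in enumerate(ordered, start=1):
            totals[m] += position
    return {m: totals[m] / n_cols for m in methods}


# ScanNet [MAE, RMSE] values copied from the table above.
scannet = {
    "VPP4DC": [0.023, 0.076],
    "Marigold-DC": [0.020, 0.063],
    "Marigold-SSD": [0.027, 0.068],
}
ranks = average_rank(scannet)
```

On this subset Marigold-DC is first in both columns, while VPP4DC and Marigold-SSD swap second and third place between MAE and RMSE. The ranks reported in the table are computed over all listed methods and all six datasets, so they differ from this toy result.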

Sparsity Levels & Interpolation

We evaluate Marigold-SSD across a range of input depth densities. At the commonly used evaluation sparsity of 5,000 points on the DDAD dataset, we observe that even sophisticated models can be outperformed by simple barycentric interpolation within a Delaunay triangulation. At lower input densities, however, the pretrained diffusion prior becomes more beneficial, and our approach outperforms both Marigold-DC and the interpolation baseline.
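For reference, barycentric interpolation inside a single triangle can be sketched as below. The helper names are hypothetical, and the triangulation step is omitted: in practice the triangles come from a Delaunay triangulation of the sparse sample locations (SciPy's `scipy.interpolate.LinearNDInterpolator` combines both steps).

```python
from typing import Tuple

Point = Tuple[float, float]


def barycentric_weights(p: Point, a: Point, b: Point,
                        c: Point) -> Tuple[float, float, float]:
    """Weights (wa, wb, wc) with wa + wb + wc = 1 and p = wa*a + wb*b + wc*c."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    if det == 0.0:
        raise ValueError("degenerate triangle")
    wa = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    wb = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    return wa, wb, 1.0 - wa - wb


def interpolate_depth(p: Point, vertices: Tuple[Point, Point, Point],
                      depths: Tuple[float, float, float]) -> float:
    """Depth at pixel p as the weighted mean of the three vertex depths."""
    weights = barycentric_weights(p, *vertices)
    return sum(w * d for w, d in zip(weights, depths))


# Three sparse depth samples (pixel coordinates) and their depths in metres.
triangle = ((0.0, 0.0), (4.0, 0.0), (0.0, 4.0))
depths = (2.0, 4.0, 6.0)

inside = interpolate_depth((1.0, 1.0), triangle, depths)      # weights 0.5, 0.25, 0.25
at_vertex = interpolate_depth((4.0, 0.0), triangle, depths)   # reproduces the sample
```

The interpolant reproduces every sparse measurement exactly and is linear in between, which is why it becomes a strong baseline once the samples are dense enough for small triangles.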

Legend: Marigold-SSD, Marigold-DC, Barycentric interpolation

Evaluation under multiple levels of depth density on NYUv2, ScanNet, VOID, iBims-1, and DDAD. Density is expressed as the number of depth samples (#). For indoor datasets we plot only RMSE, as MAE follows the same trends.
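The two reported metrics can be computed as in the following sketch. It assumes flattened depth maps and uses `None` for pixels without ground truth; `mae_rmse` is a hypothetical helper, and dataset-specific details such as depth range clipping are not covered.

```python
import math
from typing import List, Optional, Tuple


def mae_rmse(pred: List[float],
             gt: List[Optional[float]]) -> Tuple[float, float]:
    """Mean absolute error and root mean squared error over valid pixels only."""
    errors = [p - g for p, g in zip(pred, gt) if g is not None]
    if not errors:
        raise ValueError("no valid ground-truth pixels")
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return mae, rmse


# Four pixels, one of them without ground truth.
pred = [1.0, 2.0, 3.0, 4.0]
gt = [1.0, None, 5.0, 2.0]

mae, rmse = mae_rmse(pred, gt)
```

RMSE weights large errors more heavily than MAE, and for any set of errors MAE never exceeds RMSE, which is a quick sanity check when reading MAE and RMSE columns side by side.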

BibTeX

If you find this work useful, please cite:

@InProceedings{Gregorek_2026_CVPR,
    author    = {Gregorek, Jakub and Pegios, Paraskevas and Metzger, Nando and Schindler, Konrad and Kontogianni, Theodora and Nalpantidis, Lazaros},
    title     = {Need for Speed: Zero-Shot Depth Completion with Single-Step Diffusion},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2026},
    pages     = {1861-1872}
}
