Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think

¹RWTH Aachen University   ²Eindhoven University of Technology
Teaser Image
Repurposing diffusion models for geometry estimation is as simple as end-to-end fine-tuning. Left: Depth and normal predictions of our method on in-the-wild images. Right: A simple fix for the DDIM scheduler enables single-step inference for recent diffusion-based depth estimators, and simple end-to-end fine-tuning outperforms more complex diffusion baselines in both speed and accuracy.

Abstract

Recent work showed that large diffusion models can be reused as highly precise monocular depth estimators by casting depth estimation as an image-conditional image generation task. While the proposed model achieved state-of-the-art results, high computational demands due to multi-step inference limited its use in many scenarios. In this paper, we show that the perceived inefficiency was caused by a flaw in the inference pipeline that has so far gone unnoticed. The fixed model performs comparably to the best previously reported configuration while being more than 200× faster. To optimize for downstream task performance, we perform end-to-end fine-tuning on top of the single-step model with task-specific losses and obtain a deterministic model that outperforms all other diffusion-based depth and normal estimation models on common zero-shot benchmarks. Surprisingly, we find that this fine-tuning protocol also works directly on Stable Diffusion and achieves performance comparable to current state-of-the-art diffusion-based depth and normal estimation models, calling into question some of the conclusions drawn from prior works.
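The pipeline flaw can be illustrated with a simplified reimplementation of the DDIM scheduler's timestep-spacing logic (modeled on the `timestep_spacing` options in the diffusers `DDIMScheduler`; the exact offsets here are assumptions for illustration). With the common "leading" spacing, single-step inference is asked to denoise from a timestep near t=0, even though the input latent is treated as pure noise; "trailing" spacing starts the schedule at the final training timestep, as single-step inference requires.

```python
import numpy as np

def ddim_timesteps(num_train_timesteps, num_inference_steps, spacing):
    """Return the DDIM inference timesteps for the given spacing strategy.

    Simplified sketch of the two spacing strategies found in common
    diffusion schedulers; offsets follow typical Stable Diffusion configs.
    """
    if spacing == "leading":
        # Flawed for few-step inference: schedule starts near t=0.
        step_ratio = num_train_timesteps // num_inference_steps
        ts = (np.arange(0, num_inference_steps) * step_ratio)[::-1].astype(np.int64)
        ts = ts + 1  # steps_offset=1, as in Stable Diffusion configs
    elif spacing == "trailing":
        # Fixed: schedule ends exactly at the last training timestep.
        step_ratio = num_train_timesteps / num_inference_steps
        ts = np.round(np.arange(num_train_timesteps, 0, -step_ratio)).astype(np.int64) - 1
    else:
        raise ValueError(f"unknown spacing: {spacing}")
    return ts

print(ddim_timesteps(1000, 1, "leading"))   # [1]   -> denoises "from t=1": almost a no-op
print(ddim_timesteps(1000, 1, "trailing"))  # [999] -> correctly treats the input as pure noise
```

With a single inference step, the flawed spacing tells the model the latent is nearly noise-free, which explains the poor single-step results reported by prior work; the one-line spacing change recovers sensible single-step behavior.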

Inference Pipeline

End-to-end fine-tuned (E2E FT) single-step deterministic depth and normal estimator. The model is given the mean noise as latent (i.e., zeros), and the timestep is fixed to t=1000.
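The single-step pipeline above can be sketched as follows. This is a toy illustration, not the released implementation: the convolutional `unet` and `vae_decode` modules are hypothetical stand-ins for the fine-tuned Stable Diffusion UNet and VAE decoder, and the noise schedule is a generic linear one. It shows the key idea: feed the mean of the noise distribution (zeros) as the latent, run one UNet evaluation at the final timestep, and convert the predicted noise to a clean-sample estimate.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the fine-tuned UNet and the VAE decoder.
unet = nn.Conv2d(8, 4, 3, padding=1)        # input: [image latent | depth latent], output: eps
vae_decode = nn.Conv2d(4, 3, 3, padding=1)  # toy decoder back to image space

# Generic linear noise schedule (illustrative, not the exact SD schedule).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

image_latent = torch.randn(1, 4, 32, 32)  # VAE-encoded input image (toy tensor)
z = torch.zeros_like(image_latent)        # mean noise as the latent, i.e. zeros
t = T - 1                                 # fixed final timestep (t=1000, zero-indexed 999)

with torch.no_grad():
    eps = unet(torch.cat([image_latent, z], dim=1))  # single UNet evaluation
    # DDIM x0-prediction: x0 = (z_t - sqrt(1 - abar_t) * eps) / sqrt(abar_t)
    x0 = (z - (1.0 - alpha_bar[t]).sqrt() * eps) / alpha_bar[t].sqrt()
    depth = vae_decode(x0)                           # decode to the final prediction

print(depth.shape)  # one deterministic forward pass, no iterative sampling
```

Because the latent is fixed to zeros and no noise is sampled, the whole pipeline is deterministic: the same input image always yields the same depth or normal prediction.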

Zero-shot Relative Depth Estimation

[Results table: zero-shot relative depth estimation]

Metric3D v2 was trained on ScanNet, so zero-shot evaluation on this dataset is not possible. We gray out results that were not reproducible with the released code and models.

Zero-shot Surface Normals Estimation

[Results table: zero-shot surface normals estimation]

Metric3D v2 was trained on ScanNet, so zero-shot evaluation on this dataset is not possible. We gray out results that were not reproducible with the released code and models.

BibTeX

@article{martingarcia2024diffusione2eft,
  title   = {Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think},
  author  = {Gonzalo Martin Garcia and Karim Abou Zeid and Christian Schmidt and Daan de Geus and Alexander Hermans and Bastian Leibe},
  journal = {arXiv preprint arXiv:2409.11355},
  year    = {2024}
}