Elastic3D: Controllable Stereo Video Conversion with Guided Latent Decoding
TL;DR:
We turn 2D videos into 3D stereo videos using a controllable end-to-end model with a novel guided Decoder. No warping, high sharpness, minimal binocular rivalry, and adjustable 3D strength.
The growing demand for immersive 3D content calls for automated monocular-to-stereo video conversion. We present a controllable, direct end-to-end method for upgrading a conventional video to a binocular one. Our approach, based on (conditional) latent diffusion, avoids the artifacts caused by explicit depth estimation and warping. The key to its high-quality output is a novel, guided VAE decoder that ensures sharp and epipolar-consistent stereo video. Moreover, our method gives the user control over the strength of the stereo effect (that is, the disparity range) at inference time, via an intuitive scalar tuning knob. Experiments on real-world stereo videos show that our method outperforms both traditional and recent baselines.
Method
Our guided latent decoding strategy enables high-fidelity stereo generation. A frozen VAE Encoder computes the latent code of the input video. The synthesis network then generates the right-view latent, conditioned on a 3D strength control token. Finally, our Guided Decoder renders the high-fidelity output using both the generated latent and the original video as guidance.
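The data flow described above can be sketched as a small function. This is only an illustration of the three stages and of which inputs each stage receives; the function name `convert_to_stereo` and the three callables `encode`, `synthesize`, and `guided_decode` are hypothetical placeholders, not the released interface.

```python
from typing import Any, Callable, Tuple


def convert_to_stereo(
    left_video: Any,
    strength: float,
    encode: Callable[[Any], Any],               # hypothetical: frozen VAE encoder
    synthesize: Callable[[Any, float], Any],    # hypothetical: synthesis network
    guided_decode: Callable[[Any, Any], Any],   # hypothetical: guided decoder
) -> Tuple[Any, Any]:
    """Return (left, right) views following the three stages in the text."""
    # 1. The frozen encoder maps the input video to its latent code.
    left_latent = encode(left_video)
    # 2. The synthesis network produces the right-view latent,
    #    conditioned on the scalar 3D strength.
    right_latent = synthesize(left_latent, strength)
    # 3. The decoder sees both the generated latent and the original video.
    right_video = guided_decode(right_latent, left_video)
    return left_video, right_video
```

The point of the sketch is the third step: unlike a standard VAE decoder, which receives only a latent, the guided decoder also takes the original input video as an argument.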
Features
We can control the median disparity of the generation via a 3D strength control token.
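To make the controlled quantity concrete, here is a minimal sketch of how a median disparity could be measured on a disparity map. The helper names `median_disparity` and `rescale_disparity`, the normalization by image width, and the nested-list representation are assumptions made for illustration; the page does not specify how the control token is computed.

```python
from statistics import median


def median_disparity(disparity_map, image_width):
    """Median horizontal disparity, expressed as a fraction of image width.

    disparity_map: rows of per-pixel disparities in pixels (nested lists).
    Normalizing by width is an assumption, chosen so the value does not
    depend on resolution.
    """
    values = [d for row in disparity_map for d in row]
    return median(values) / image_width


def rescale_disparity(disparity_map, factor):
    """Scale every disparity by the same factor (a stronger or weaker 3D effect)."""
    return [[factor * d for d in row] for row in disparity_map]


disparity = [[2.0, 4.0], [6.0, 8.0]]          # toy 2x2 disparity map, in pixels
weak = median_disparity(disparity, image_width=100)
strong = median_disparity(rescale_disparity(disparity, 2.0), image_width=100)
```

Because the median scales linearly with a uniform disparity scaling, doubling every disparity doubles the measured value, which is what makes a single scalar a natural handle on 3D strength.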
Epipolar Cross-Attention (Structured Skip Connections) in the VAE Decoder recovers details and mitigates binocular rivalry. Our Guided Decoder produces better results than the Stable Diffusion VAE Decoder.
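The idea behind attention along epipolar lines can be illustrated with a small sketch. It assumes a rectified stereo pair, so that the epipolar line of each pixel is the image row it lies on, and it uses plain Python lists to stay dependency-free. The function `epipolar_cross_attention` is a hypothetical illustration of row-restricted cross-attention, not the layer used in the paper, which may differ in projections, heads, and where it sits in the decoder.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def epipolar_cross_attention(queries, keys, values):
    """Row-wise cross-attention between two rectified views.

    queries: H x W x C features of the view being decoded (nested lists).
    keys, values: H x W' x C and H x W' x D features of the guiding view.
    Each query pixel attends only to guide pixels in the same row.
    """
    scale = 1.0 / math.sqrt(len(queries[0][0]))
    channels = len(values[0][0])
    output = []
    for q_row, k_row, v_row in zip(queries, keys, values):
        out_row = []
        for q in q_row:
            # Scaled dot-product scores against every pixel of the same row.
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_row]
            weights = softmax(scores)
            # Weighted average of the guide features along the row.
            out_row.append([
                sum(w * v[c] for w, v in zip(weights, v_row))
                for c in range(channels)
            ])
        output.append(out_row)
    return output
```

Restricting attention to one row reduces the cost per query from all H×W guide pixels to W, and it builds the stereo geometry into the layer: detail can only be copied from positions that are geometrically plausible matches.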
High Speed Performance
Our method processes a 512×512 patch of 16 frames in 1.7 s, while competing methods are at least about 3× slower.
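For a rough sense of scale, the reported runtime can be converted into throughput. The snippet below only rearranges the numbers stated above; the hardware is not given on this page, and the baseline figure simply applies the stated 3× lower bound rather than any measured value.

```python
frames, height, width = 16, 512, 512
seconds = 1.7  # reported runtime for one 16-frame patch

frames_per_second = frames / seconds                    # about 9.4 frames/s
pixels_per_second = frames * height * width / seconds   # patch pixels per second
baseline_seconds = 3 * seconds                          # about 5.1 s if 3x slower
```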
Qualitative Results
Comparison of generated stereo videos.
Quantitative Results
Performance on the Spatial Video Dataset (AVP) benchmark.
Citation
@article{metzger2025controllable,
title={Controllable Stereo Video Conversion with Guided Latent Decoding},
author={Metzger, Nando and Truong, Prune and Bhat, Goutam and Schindler, Konrad and Tombari, Federico},
journal={arXiv preprint},
year={2025}
}