Elastic3D: Controllable Stereo Video Conversion with Guided Latent Decoding

¹ETH Zurich, ²Google, ³TU Munich

TL;DR:

We turn 2D videos into 3D stereo videos using a controllable end-to-end model with a novel Guided Decoder. No warping, high sharpness, minimal binocular rivalry, and adjustable 3D strength.

The growing demand for immersive 3D content calls for automated monocular-to-stereo video conversion. We present a controllable, direct end-to-end method for upgrading a conventional video to a binocular one. Our approach, based on (conditional) latent diffusion, avoids the artifacts caused by explicit depth estimation and warping. The key to its high-quality output is a novel, guided VAE decoder that ensures sharp and epipolar-consistent stereo video. Moreover, our method gives the user control over the strength of the stereo effect (that is, the disparity range) at inference time, via an intuitive, scalar tuning knob. Experiments on real-world stereo videos show that our method outperforms both traditional and recent baselines.

Teaser: Controllable Stereo Video Conversion

Method

Our guided latent decoding strategy enables high-fidelity stereo generation. A frozen VAE Encoder computes the latent code of the input video. The synthesis network then generates the right-view latent, conditioned on a 3D strength control token. Finally, our Guided Decoder renders the high-fidelity output using both the generated latent and the original video as guidance.
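
The data flow described above can be sketched as three composed stages. This is only a structural illustration: the class `StereoPipeline`, its field names, and the toy stand-in functions are hypothetical and are not taken from the Elastic3D code; the real stages are neural networks.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical type aliases; real videos and latents would be tensors.
Video = Any
Latent = Any

@dataclass
class StereoPipeline:
    encode: Callable[[Video], Latent]                # frozen VAE encoder
    synthesize: Callable[[Latent, float], Latent]    # input latent + 3D strength -> right-view latent
    guided_decode: Callable[[Latent, Video], Video]  # right-view latent + original video -> right view

    def convert(self, left_video: Video, strength: float) -> tuple[Video, Video]:
        left_latent = self.encode(left_video)
        right_latent = self.synthesize(left_latent, strength)
        # The decoder sees the original video as guidance, not only the latent.
        right_video = self.guided_decode(right_latent, left_video)
        return left_video, right_video

# Toy stand-ins on lists of numbers, only to show how the stages connect.
pipeline = StereoPipeline(
    encode=lambda video: [0.5 * x for x in video],
    synthesize=lambda latent, strength: [z + strength for z in latent],
    guided_decode=lambda latent, guide: [(2.0 * z + g) / 2 for z, g in zip(latent, guide)],
)
left, right = pipeline.convert([1.0, 2.0], strength=0.25)
```

The point of the sketch is the last stage: the decoder receives two inputs, the generated latent and the original video, rather than the latent alone.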

Guided VAE Architecture

Features

We can control the median disparity of the generated stereo video via a 3D strength control token.
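
To make the controlled quantity concrete, here is a minimal sketch of how a median disparity could be measured from a per-pixel disparity map. The function name `median_disparity`, the normalization by image width, and the toy numbers are assumptions for illustration; how Elastic3D computes or encodes the value into its control token is not described on this page.

```python
from statistics import median

def median_disparity(disparity_map, image_width):
    """Median horizontal disparity, expressed as a fraction of image width."""
    values = [d for row in disparity_map for d in row]
    return median(values) / image_width

# Toy 2x3 disparity map in pixels for a 100-pixel-wide frame (made-up numbers).
disparity_map = [[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]]
weak = median_disparity(disparity_map, image_width=100)

# Doubling every disparity doubles the median, i.e. a stronger 3D effect.
stronger_map = [[2.0 * d for d in row] for row in disparity_map]
strong = median_disparity(stronger_map, image_width=100)
```

A single scalar of this kind is what makes the control intuitive: raising it asks for larger horizontal offsets between the two views, lowering it flattens the scene.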

Disparity Conditioning Mechanism

Epipolar Cross-Attention (Structured Skip Connections) in the VAE Decoder recovers details and mitigates binocular rivalry. Our Guided Decoder produces better results than the Stable Diffusion VAE Decoder.
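
The idea of attention restricted to epipolar lines can be sketched in a few lines of NumPy. The sketch assumes a rectified stereo setup, in which epipolar lines coincide with image rows, so each pixel of the generated view only attends to guidance pixels in the same row. It omits learned query/key/value projections and multiple heads, and the function name `epipolar_cross_attention` is hypothetical; it is not the authors' implementation.

```python
import numpy as np

def epipolar_cross_attention(query, guide):
    """Row-wise cross-attention between two feature maps.

    query: (H, W, C) decoder features of the view being generated.
    guide: (H, W, C) features of the guidance (input) view.
    Each query pixel attends only to guide pixels in the same row.
    """
    channels = query.shape[-1]
    # scores[h, i, j]: similarity of query pixel (h, i) and guide pixel (h, j)
    scores = np.einsum("hic,hjc->hij", query, guide) / np.sqrt(channels)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax along the row
    # Weighted sum of guide features along each row.
    return np.einsum("hij,hjc->hic", weights, guide)

rng = np.random.default_rng(0)
query = rng.normal(size=(4, 6, 8))
guide = rng.normal(size=(4, 6, 8))
out = epipolar_cross_attention(query, guide)

# Changing the guide in row 1 leaves the output in row 0 untouched.
guide_changed = guide.copy()
guide_changed[1] += 10.0
out_changed = epipolar_cross_attention(query, guide_changed)
```

Restricting attention to one row reduces the cost per pixel from H×W candidates to W candidates, and it matches the geometry: in a rectified pair, corresponding points differ only by a horizontal shift.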

VAE Compression Comparison

High Speed Performance

Our method processes a 512×512 clip of 16 frames in 1.7 s, while competing methods are at least about 3× slower.

Speed Comparison Chart

Qualitative Results

Comparison of generated stereo videos.

Quantitative Results

Performance on the Spatial Video Dataset (AVP) benchmark.

Headset User Study Results
We conducted a user study on a headset and found that our method is preferred over M2SVid in most cases. Against Eye2Eye, users were undecided 37.5% of the time and preferred Elastic3D 50% of the time.
Quantitative Performance Metrics
Performance metrics for the Spatial Video Dataset (AVP) benchmark.

Citation

@article{metzger2025controllable,
    title={Controllable Stereo Video Conversion with Guided Latent Decoding},
    author={Metzger, Nando and Truong, Prune and Bhat, Goutam and Schindler, Konrad and Tombari, Federico},
    journal={arXiv preprint},
    year={2025}
}