Long-horizon visual MPC under partial observability

ELVIS: Ensemble-Calibrated Latent Imagination for Long-Horizon Visual MPC

ELVIS combines recurrent world-model memory, multimodal latent MPPI, and critic-ensemble uncertainty to make deep visual imagination practical for long-horizon control.

KU Leuven · Flanders Make at KU Leuven

Teaser video: visual MPC for sand spraying under severe occlusion.

The task is to spread material evenly despite noisy, partially missing observations.

Abstract

Reliable long-horizon planning from pixels

Visual model-based RL is powerful, but long latent rollouts can branch into multiple plausible futures and accumulate model error, especially when visual evidence is corrupted by occlusions. ELVIS addresses these issues with a Dreamer-style recurrent state-space model, a Gaussian-mixture MPPI planner that preserves multiple long-horizon action hypotheses, and an ensemble-UCB-gated $\lambda_t$-return that softly truncates unreliable imagination.

The same uncertainty-aware return is used for imagined actor-critic learning and online MPPI scoring, aligning the learned policy prior with the planner. Across fourteen DeepMind Control visual tasks and a real-world sand-spraying setup with severe occlusions, ELVIS improves robustness, data efficiency, and zero-shot transfer.

Method

Memory, mixture planning, and calibrated imagination

ELVIS turns the current visual history into a compact latent belief, plans over multiple action-sequence hypotheses, then scores rollouts with the same uncertainty-aware return used to train its actor and critic.

World model

RSSM belief for occluded visual control

The encoder infers stochastic latents from observations while a recurrent deterministic state carries memory forward. Future planning uses prior latent transitions when observations are unavailable or unreliable.

RSSM world model diagram with deterministic and stochastic latent states.
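The belief update above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the learned encoder, GRU, and prior/posterior networks are replaced by hypothetical linear maps (`W_h`, `W_prior`, `W_post`), and all dimensions are illustrative. The key mechanic is the branch on `obs`: with an observation the latent is sampled from the posterior, and under occlusion the model falls back to the prior transition while the deterministic state keeps carrying memory.

```python
import numpy as np

rng = np.random.default_rng(0)
D_DET, D_STOCH, D_OBS = 8, 4, 6  # illustrative sizes, not the paper's

# Hypothetical linear parameters standing in for the learned networks.
W_h = rng.normal(scale=0.1, size=(D_DET, D_DET + D_STOCH))
W_prior = rng.normal(scale=0.1, size=(2 * D_STOCH, D_DET))
W_post = rng.normal(scale=0.1, size=(2 * D_STOCH, D_DET + D_OBS))

def rssm_step(h, z, obs=None):
    """One RSSM step: advance deterministic memory, then sample a latent.

    With an observation, sample from the posterior (filtering); without
    one (occluded frame or pure imagination), use the prior transition.
    """
    h = np.tanh(W_h @ np.concatenate([h, z]))      # recurrent deterministic state
    if obs is not None:
        stats = W_post @ np.concatenate([h, obs])  # posterior q(z | h, o)
    else:
        stats = W_prior @ h                        # prior p(z | h)
    mean, log_std = stats[:D_STOCH], stats[D_STOCH:]
    z = mean + np.exp(log_std) * rng.normal(size=D_STOCH)
    return h, z

h, z = np.zeros(D_DET), np.zeros(D_STOCH)
# None marks an occluded frame where only the prior is available.
for obs in [rng.normal(size=D_OBS), None, rng.normal(size=D_OBS)]:
    h, z = rssm_step(h, z, obs)
```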
Planner

GMM-MPPI preserves branches

Instead of a single Gaussian proposal, ELVIS keeps $M$ Gaussian action-sequence modes and updates each mode by weighted moment matching.

Gaussian mixture MPPI diagram showing multiple rollout modes.
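A minimal sketch of the mixture update, under simplifying assumptions: one-dimensional actions, a toy quadratic stand-in for the model-based rollout score (`toy_return`), and illustrative values for the mode count, sample budget, and temperature. Each of the $M$ Gaussian modes samples its own candidates, weights them MPPI-style by exponentiated return, and is refit to its own weighted first and second moments, so distinct long-horizon hypotheses survive rather than collapsing into one proposal.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, H = 3, 64, 10     # modes, samples per mode, horizon (illustrative)
temperature = 0.5       # MPPI-style temperature (assumed value)

# Gaussian-mixture proposal over length-H action sequences.
means = rng.normal(size=(M, H))
stds = np.ones((M, H))

def toy_return(actions):
    # Stand-in for scoring imagined rollouts in the latent world model.
    return -np.sum((actions - 0.7) ** 2, axis=-1)

for _ in range(5):  # planner refinement iterations
    samples = means[:, None] + stds[:, None] * rng.normal(size=(M, K, H))
    scores = toy_return(samples)                              # (M, K)
    w = np.exp((scores - scores.max(axis=1, keepdims=True)) / temperature)
    w /= w.sum(axis=1, keepdims=True)                         # per-mode weights
    # Weighted moment matching: refit each mode to its own weighted samples.
    means = np.einsum('mk,mkh->mh', w, samples)
    var = np.einsum('mk,mkh->mh', w, (samples - means[:, None]) ** 2)
    stds = np.sqrt(var + 1e-6)

best = means[np.argmax(toy_return(means))]  # executed mode's mean sequence
```

On this toy objective every mode contracts toward the optimum at 0.7; with a multimodal return, the per-mode refit is what lets different modes settle on different optima.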
Return

Ensemble-UCB gates $\lambda_t$

High-UCB states trigger stronger bootstrapping; low-UCB states allow deeper look-ahead. This softly reduces reliance on unreliable distant imagination.

Imagined TD learning with UCB-gated $\lambda$-returns.
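As one concrete (assumed) instantiation of the gate, the UCB signal can come from a critic ensemble's mean plus scaled standard deviation, normalized over the imagined rollout. Everything here is a sketch: the normalization (`min`/`max` over the rollout), the exploration weight `kappa`, and the gate range are illustrative choices, not the paper's exact ones.

```python
import numpy as np

def gated_lambda(ens_values, kappa=1.0, lam_min=0.5, lam_max=0.95):
    """Map critic-ensemble values (E, T+1) to per-step lambda_t gates.

    High UCB (large mean + disagreement) pushes lambda_t toward lam_min,
    i.e. earlier bootstrapping; low UCB allows deeper look-ahead.
    """
    ucb = ens_values.mean(axis=0) + kappa * ens_values.std(axis=0)
    norm = (ucb - ucb.min()) / (ucb.max() - ucb.min() + 1e-8)  # in [0, 1]
    return lam_max - (lam_max - lam_min) * norm

rng = np.random.default_rng(0)
lam = gated_lambda(rng.normal(size=(5, 16)))  # 5 critics, rollout length 15
```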
Shared scoring rule

One return for learning and planning

$$\lambda_t = \lambda_{\max} - (\lambda_{\max}-\lambda_{\min})\,\mathrm{norm}(\mathrm{UCB}(\hat{s}_t))$$ $$G_t = \hat{r}_t + \gamma\left[(1-\lambda_t)\mu_{t+1} + \lambda_t G_{t+1}\right]$$

$G_0$ trains the imagined actor-critic prior and scores candidate trajectories in GMM-MPPI, so policy learning and online planning optimize the same uncertainty-calibrated objective.
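The two equations above can be evaluated with a single backward recursion. This is a minimal NumPy sketch under stated assumptions: `ens_values` holds critic-ensemble values along an imagined rollout, the UCB normalization is taken over that rollout, and `kappa` and the gate range are illustrative. Per the recursion, $G_t$ blends the ensemble-mean bootstrap $\mu_{t+1}$ against the deeper return $G_{t+1}$ using $\lambda_t$, starting from $G_T = \mu_T$ at the horizon.

```python
import numpy as np

def ucb_lambda_return(rewards, ens_values, kappa=1.0,
                      lam_min=0.5, lam_max=0.95, gamma=0.99):
    """Backward recursion for the UCB-gated lambda-return.

    rewards: (T,) imagined rewards; ens_values: (E, T+1) ensemble values.
    High UCB lowers lambda_t, so the return bootstraps on mu_{t+1} earlier
    instead of trusting distant imagination.
    """
    mu = ens_values.mean(axis=0)                 # ensemble-mean value mu_t
    ucb = mu + kappa * ens_values.std(axis=0)    # optimistic uncertainty signal
    norm = (ucb - ucb.min()) / (ucb.max() - ucb.min() + 1e-8)
    lam = lam_max - (lam_max - lam_min) * norm   # gate: high UCB -> low lambda
    G = mu[-1]                                   # G_T = mu_T at the horizon
    returns = np.empty(len(rewards))
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * ((1 - lam[t]) * mu[t + 1] + lam[t] * G)
        returns[t] = G
    return returns

rng = np.random.default_rng(0)
T, E = 15, 5
G = ucb_lambda_return(rng.normal(size=T), rng.normal(size=(E, T + 1)))
```

`G[0]` plays the role of $G_0$: the same scalar can serve as the actor-critic learning target and as the score of a candidate trajectory in the planner.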

DeepMind Control Suite

Stronger visual control across 14 tasks

ELVIS ranks first or second on all 14 DMC visual tasks reported in the paper and achieves the strongest aggregate return over training.

Learning curves for 14 DeepMind Control visual tasks comparing ELVIS, TD-MPC2, and DreamerV3.
Per-task DMC visual control learning curves and aggregate score. Shaded bands show 95% confidence intervals over five seeds.

Main takeaway

ELVIS improves sample efficiency and final performance while remaining consistently competitive across all tasks, showing that recurrent memory, long-horizon planning, and uncertainty-aware return shaping help not only in heavily occluded settings but also on standard visual-control benchmarks.

Zero-shot sim-to-real

Real-world sand-spraying results

The qualitative comparison below shows how ELVIS maintains more uniform coverage under heavy dust and partial observability after transfer from simulation.

Qualitative zero-shot sand-spraying results comparing ELVIS, TD-MPC2, and DreamerV3.
For each method, the top row shows grayscale scene images for visualization only and the bottom row shows the corresponding heightmaps used for evaluation. ELVIS achieves the most uniform final deposition under real-world occlusions.

Citation

BibTeX

Replace the venue field once the camera-ready metadata is finalized.

@inproceedings{du2026elvis,
  title     = {ELVIS: Ensemble-Calibrated Latent Imagination for Long-Horizon Visual MPC},
  author    = {Du, Yurui and Song, Pinhao and Hu, Yutong and Detry, Renaud},
  booktitle = {Robotics: Science and Systems},
  year      = {2026}
}