State-of-the-art model-based Reinforcement Learning (RL) approaches use gradient-free, population-based methods for planning, learned policy networks, or a combination of policy networks and planning. Hybrid approaches that combine Model Predictive Control (MPC) with a learned model and a policy prior, thereby efficiently leveraging the benefits of both paradigms, have shown promising results. However, these hybrid approaches typically rely on gradient-free optimization methods, which can be computationally expensive for high-dimensional control tasks.
While gradient-based optimization is a promising alternative, recent works have empirically shown that it often performs worse than its gradient-free counterparts, as it can converge to suboptimal local optima and is prone to exploding or vanishing gradients. We propose Dream-MPC, a novel approach that generates a few candidate trajectories from a rolled-out policy and optimizes each trajectory by gradient ascent using a learned world model.
We incorporate uncertainty regularization directly into the optimization objective and amortize optimization iterations over time by reusing previously optimized actions. We evaluate our method on multiple continuous control tasks from the DeepMind Control Suite, Meta-World and HumanoidBench and show that gradient-based MPC can significantly improve the performance of the underlying policy and can outperform gradient-free MPC and state-of-the-art baselines.
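As a concrete illustration, the sketch below shows one plausible form of such an uncertainty-regularized trajectory objective for a latent world model. The model interfaces (`dynamics`, `reward`, `value_ensemble`, `policy`), the use of value-ensemble disagreement as the uncertainty estimate, and the regularization weight `lam` are illustrative assumptions, not the exact formulation used by Dream-MPC.

```python
# Minimal sketch (assumed interfaces, not the authors' code): an
# uncertainty-regularized trajectory objective for a latent world model.
import torch

def trajectory_objective(model, z0, actions, gamma=0.99, lam=0.1):
    """J(actions): predicted discounted return minus an uncertainty penalty.

    z0      -- latent state at the current time step, shape (latent_dim,)
    actions -- action sequence, shape (horizon, action_dim), requires_grad=True
    """
    z, ret, penalty, discount = z0, 0.0, 0.0, 1.0
    for a in actions:
        ret = ret + discount * model.reward(z, a)           # predicted one-step reward
        z = model.dynamics(z, a)                            # latent rollout
        values = model.value_ensemble(z, model.policy(z))   # (n_heads,) value predictions
        penalty = penalty + discount * values.std()         # ensemble disagreement as uncertainty proxy
        discount = discount * gamma
    ret = ret + discount * values.mean()                    # bootstrap with the terminal value estimate
    return ret - lam * penalty                              # maximized by gradient ascent on `actions`
```

Amortizing optimization over time then amounts to warm-starting the next step's candidate sequences with the previously optimized plan, shifted by one step, so that only a few additional gradient steps are needed at each time step.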
We perform local trajectory optimization with a latent dynamics model. Instead of sampling hundreds or thousands of action sequences at each step, as done by gradient-free, sampling-based methods such as the Cross-Entropy Method (CEM) or Model Predictive Path Integral (MPPI), Dream-MPC considers a few samples from the policy prior and optimizes each action sequence using gradient ascent to maximize the objective J. The first action of the sequence with the highest predicted return is applied, and the procedure is repeated at the next time step.
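The planning loop itself could then look roughly as follows. Again, the interfaces (`model.encode`, `model.policy`, `model.dynamics`), the candidate count, horizon, number of ascent steps, and the use of Adam as the ascent optimizer are hypothetical choices for illustration, not the paper's implementation or hyperparameters; `objective_fn` could be the uncertainty-regularized objective sketched above.

```python
# Minimal sketch of the gradient-based planning loop (assumed interfaces).
import torch

@torch.no_grad()
def sample_policy_rollouts(model, z0, num_candidates, horizon):
    """Roll out the policy prior in the latent model to obtain initial action sequences."""
    plans = []
    for _ in range(num_candidates):
        z, actions = z0, []
        for _ in range(horizon):
            a = model.policy(z)                    # sample an action from the policy prior
            actions.append(a)
            z = model.dynamics(z, a)
        plans.append(torch.stack(actions))         # (horizon, action_dim)
    return plans

def plan_action(model, obs, objective_fn, num_candidates=4, horizon=5,
                grad_steps=3, lr=0.05):
    """One MPC step: refine each candidate by gradient ascent on J, return the best first action."""
    z0 = model.encode(obs).detach()                # optimize the actions only, not the model
    best_action, best_value = None, -float("inf")
    for plan in sample_policy_rollouts(model, z0, num_candidates, horizon):
        actions = plan.clone().requires_grad_(True)
        optimizer = torch.optim.Adam([actions], lr=lr)
        for _ in range(grad_steps):                # gradient ascent: minimize -J
            optimizer.zero_grad()
            (-objective_fn(model, z0, actions)).backward()
            optimizer.step()
        with torch.no_grad():
            value = objective_fn(model, z0, actions)
        if value > best_value:
            best_value, best_action = value, actions[0].detach().clone()
    return best_action                             # applied to the environment; replanning repeats next step
```

In contrast to CEM or MPPI, which resample and reweight large populations of action sequences at every step, such a loop refines only a handful of policy-proposed sequences, so its cost scales with the number of gradient steps rather than the population size.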
For our experiments, we integrate Dream-MPC into TD-MPC2 and BMPC and evaluate it on 24 environments from the DeepMind Control Suite, Meta-World and HumanoidBench.
We integrate our method into BMPC by replacing the MPPI planner with Dream-MPC during inference. As the aggregated performance metrics across all 24 tasks show, Dream-MPC can improve the performance of the underlying policy and outperform MPPI when using BMPC as a basis. While our method can also improve the policy's performance when using TD-MPC2 as a basis, it cannot consistently match the performance of MPPI. This highlights the need for a good initial proposal in gradient-based MPC.
Furthermore, we show that our method also works well with image-based observations and can outperform MPPI.
Additionally, we evaluate the performance when gradient-based planning is already enabled during training, using TD-MPC2 as a basis.
In addition to the quantitative evaluation, we provide videos of the evaluation episodes for the different environments when Dream-MPC is used to optimize actions at inference.
We further integrate our method into Dreamer and also find improvements in sample efficiency and asymptotic performance when enabling planning during training.