Dream-MPC: Gradient-Based Model Predictive Control with Latent Imagination

Anonymous Authors
Under Review

Dream-MPC realizes gradient-based MPC by performing imagined rollouts of a policy network in the latent space of a learned world model.

Abstract

State-of-the-art model-based Reinforcement Learning (RL) approaches use gradient-free, population-based methods for planning, learned policy networks, or a combination of policy networks and planning. Hybrid approaches that combine Model Predictive Control (MPC) with a learned model and a policy prior, thereby efficiently leveraging the benefits of both paradigms, have shown promising results. However, these hybrid approaches typically rely on gradient-free optimization methods, which can be computationally expensive for high-dimensional control tasks.

While gradient-based methods are a promising alternative, recent works have empirically shown that they often perform worse than their gradient-free counterparts, as they can converge to suboptimal local optima and are prone to exploding or vanishing gradients. We propose Dream-MPC, a novel approach that generates a few candidate trajectories by rolling out a policy and optimizes each trajectory by gradient ascent using a learned world model.

We incorporate uncertainty regularization directly into the optimization objective and amortize optimization iterations over time by reusing previously optimized actions. We evaluate our method on multiple continuous control tasks from the DeepMind Control Suite, Meta-World and HumanoidBench and show that gradient-based MPC can significantly improve the performance of the underlying policy and can outperform gradient-free MPC and state-of-the-art baselines.

Gradient-Based Model Predictive Control

We perform local trajectory optimization with a latent dynamics model. Instead of sampling hundreds or thousands of action sequences at each step, as done by gradient-free, sampling-based methods such as the Cross-Entropy Method (CEM) or Model Predictive Path Integral (MPPI), Dream-MPC draws a few samples from the policy prior and optimizes each action sequence by gradient ascent to maximize the objective J. The first action of the trajectory with the highest predicted return is applied, and the procedure is repeated at the next time step.
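To make the planning loop concrete, below is a minimal PyTorch sketch. The `encode`, `dynamics`, `reward`, and `policy_prior` interfaces, the plain sum-of-rewards objective, and all hyperparameter values are illustrative assumptions, not the actual Dream-MPC implementation; the real objective additionally incorporates the uncertainty regularization mentioned in the abstract and further details of the full method are omitted.

```python
import torch


def imagined_return(z0, actions, world_model):
    """Predicted return of each candidate action sequence under the latent model."""
    z = z0.expand(actions.shape[0], -1)
    ret = torch.zeros(actions.shape[0], device=actions.device)
    for t in range(actions.shape[1]):
        ret = ret + world_model.reward(z, actions[:, t]).squeeze(-1)
        z = world_model.dynamics(z, actions[:, t])
    return ret


def dream_mpc_plan(obs, world_model, policy_prior,
                   num_samples=8, horizon=16, iters=4, lr=0.1):
    # Encode the observation into a latent state; detach so planning gradients
    # do not flow into the world-model parameters.
    z0 = world_model.encode(obs).detach()              # (1, latent_dim)

    # 1) Sample a few candidate action sequences by rolling out the policy prior.
    with torch.no_grad():
        z = z0.expand(num_samples, -1)
        seq = []
        for _ in range(horizon):
            a = policy_prior(z).sample()               # (num_samples, act_dim)
            seq.append(a)
            z = world_model.dynamics(z, a)
        actions = torch.stack(seq, dim=1)              # (num_samples, horizon, act_dim)

    # 2) Refine all candidates in parallel by gradient ascent on the predicted return J.
    actions = actions.clone().requires_grad_(True)
    optim = torch.optim.Adam([actions], lr=lr)
    for _ in range(iters):
        optim.zero_grad()
        (-imagined_return(z0, actions, world_model).sum()).backward()
        optim.step()

    # 3) Execute the first action of the best trajectory and replan at the next step.
    with torch.no_grad():
        best = imagined_return(z0, actions, world_model).argmax()
    return actions[best, 0].detach()
```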


Dream-MPC algorithm

Comparison to Baselines

For our experiments, we integrate Dream-MPC into TD-MPC2 and BMPC and evaluate it on 24 environments from the DeepMind Control Suite, Meta-World and HumanoidBench.


Planning at Test-Time

We integrate our method into BMPC by replacing the MPPI planner with Dream-MPC during inference. As the aggregated performance metrics across all 24 tasks show, Dream-MPC can improve the performance of the underlying policy and outperform MPPI when built on BMPC. With TD-MPC2 as a basis, our method can also improve the performance of the policy, but it cannot consistently match MPPI. This highlights the importance of a good initial proposal for gradient-based MPC.

Dream-MPC results

Visual Observations

Furthermore, we show that our method also works well with image-based observations and can outperform MPPI.

Dream-MPC results for visual observations

Planning During Training

Additionally, we evaluate performance when gradient-based planning is also used during training, with TD-MPC2 as a basis.

Dream-MPC learning curves

Qualitative Results

In addition to the quantitative evaluation, we provide videos of evaluation episodes for the different environments, with Dream-MPC used to optimize the actions at inference.

DeepMind Control Suite

Acrobot Swingup
Cartpole Swingup Sparse
Dog Run
Dog Walk
Fish Swim
Hopper Hop
Humanoid Run
Humanoid Walk

Meta-World

Assembly
Disassemble
Button Press
Lever Pull
Pick Place Wall
Push Back
Shelf Place
Window Open

HumanoidBench

Balance Hard
Balance Simple
Hurdle
Maze
Reach
Run
Slide
Stair

Additional Results

We further integrate our method into Dreamer and also find improvements in sample efficiency and asymptotic performance when enabling planning during training.

Dreamer results

Takeaways

  • Amortizing optimization iterations over time and incorporating uncertainty into the objective are key to overcoming the limitations of previously proposed gradient-based MPC methods, namely worse performance than their gradient-free counterparts and high computational cost (a warm-start sketch follows this list).
  • Our results highlight that gradient-based MPC with learned world models is a promising direction to further improve the performance of model-based RL algorithms.
  • While we use a single set of planning parameters across all experiments, the results suggest that dynamically adapting parameters such as the action optimization step size or the number of iterations could further improve performance.
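As a companion to the first takeaway, here is a minimal sketch of how the amortization could look: the previously optimized action sequence is shifted by one step and reused as the initialization for the next planning step, so only a few gradient iterations are needed per step. The interface names (`world_model.dynamics`, `policy_prior`) are the same illustrative assumptions as in the planning sketch above, not the actual Dream-MPC implementation.

```python
import torch


def warm_start(prev_actions, z0, world_model, policy_prior):
    """Reuse the previously optimized action sequence, shifted by one step,
    as the initialization for the next planning step (amortized optimization).

    prev_actions: (num_samples, horizon, act_dim) from the previous time step.
    z0: current latent state, shape (1, latent_dim).
    """
    with torch.no_grad():
        shifted = prev_actions[:, 1:]                  # drop the action just executed
        # Imagine forward to the latent state at the end of the shifted sequence
        # and let the policy prior propose the final action of the new horizon.
        z = z0.expand(shifted.shape[0], -1)
        for t in range(shifted.shape[1]):
            z = world_model.dynamics(z, shifted[:, t])
        last = policy_prior(z).sample().unsqueeze(1)   # (num_samples, 1, act_dim)
        return torch.cat([shifted, last], dim=1)       # back to the full horizon
```

In the planning sketch above, the result of `warm_start` would replace the policy-prior rollout in step 1, so the gradient refinement starts from actions that were already optimized at the previous time step.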