Official implementation of Metis
Jingyu Li1,2,*, Zhe Liu3,*, Dongnan Hu4,2, Junjie Wu5, Zipei Ma1,2, Wenxiao Wu6,2, Chao Han5, Zhihui Hao5, Zhikang Liu5, Kun Zhan5, Jiankang Deng7, Xiatian Zhu8, Li Zhang1,2,†
1Fudan University, 2Shanghai Innovation Institute, 3The University of Hong Kong, 4Tongji University, 5Li Auto Inc., 6Huazhong University of Science and Technology, 7Imperial College London, 8University of Surrey
*Equal contribution. †Corresponding author.
Equal contribution: Jingyu Li and Zhe Liu. Corresponding author: Li Zhang.
- 2026.06.14: Metis project page is live. Source code, pretrained checkpoints, and detailed documentation are planned for release in August 2026.
- World-action modeling for both autonomous driving and urban navigation. Metis is evaluated on NAVSIM-v2, NAVSIM-v1, CityWalker, and zero-shot real-robot deployment.
- Joint training, decoupled inference. Future video generation is used as a training signal, while inference directly predicts actions without explicitly generating future frames.
- Mixture-of-Transformers architecture. Metis separates a Video Generation Expert (VGE) and an Action Expert (AE), preserving the different distributions of visual generation and low-dimensional control.
- Masked asymmetric attention. Future video tokens can attend to future action tokens during training, while action tokens are masked from future video tokens and remain conditioned on the current observation, language instruction, and ego state. This lets the action branch benefit from world modeling without inheriting inference-time generation noise.
- Efficient planning. With 2 denoising steps on a single RTX 4090, action-only inference reaches 0.17s with 89.2 EPDMS on NAVSIM-v2
navtest, and is up to 8x faster than explicit future-video inference in the reported setting.
Metis is an end-to-end World-Action Model (WAM) for action planning in dynamic outdoor environments. Existing WAMs often require autoregressive future observation generation or tightly couple future video synthesis with action prediction. These designs introduce high latency and can pollute the action space with high-dimensional generation noise.
Metis addresses this with a simple principle:
Learn future world dynamics during training, but do not generate future videos when planning at test time.
The model contains two specialized experts:
- Video Generation Expert (VGE): built from a pretrained video generation backbone, using a video VAE and text conditioning to model future scene dynamics.
- Action Expert (AE): a diffusion transformer branch for low-dimensional action chunk prediction.
During training, the two experts interact in a shared latent space under a masked asymmetric attention pattern. During inference, the video generation path is bypassed, so Metis predicts the action chunk from the current observation, instruction, and ego state in a low-latency forward denoising process.
Given a current visual observation o_t, language instruction l, and ego state, Metis predicts an action chunk a_{t:t+H}. In autonomous driving, each waypoint is represented as (x, y, theta). In urban navigation, each waypoint is represented as (x, y).
Future observations are used only as training supervision. At inference time, Metis performs:
current observation + instruction + ego state -> action chunk
without sampling future video frames.
- Backbone VGE: Wan2.2-5B video generation expert in the main implementation setting.
- Action branch: approximately 1B parameters with hidden dimension
d_a = 1024. - Total model size: approximately 6B parameters.
- Conditioning: front-view image, text instruction, ego state.
- Training objective: joint flow matching loss for action prediction and future video generation.
- Inference: action-only denoising; the explicit future-video branch is skipped.
Metis studies three interaction patterns between future video tokens and action tokens:
- Joint attention: future video and action tokens are fully coupled.
- Isolated attention: the two branches are completely separated.
- Asymmetric attention (ours): video tokens can attend to action tokens, but action tokens do not attend to future video tokens.
This asymmetric design keeps training and inference consistent: the action branch never depends on generated future frames, yet still receives useful gradients from the world modeling task during training.
- NAVSIM-v2: Metis is trained on
navtrainwith 1,192 scenarios and evaluated onnavhardandnavtest.navhardcontains 244 safety-critical real-world scenarios in Stage 1 and 4,164 corresponding 3DGS-generated synthetic scenarios in Stage 2.navtestcontains 12,146 real-world scenarios for generalization evaluation. The main metric is EPDMS. - NAVSIM-v1: Metis is additionally evaluated on
navtestwith PDMS for comparison with prior planning methods. - CityWalker: Metis is evaluated on 15 hours of teleoperation data from diverse urban areas in New York City, with 6 hours for fine-tuning and 9 hours for testing. The benchmark reports L2 distance and Maximum Average Orientation Error (MAOE).
- Real-world navigation: Metis is deployed zero-shot on a Unitree Go2 quadruped for indoor and outdoor closed-loop obstacle-avoidance tests.
Metis is evaluated on NAVSIM-v2 navhard, which includes Stage 1 real-world safety-critical scenarios and Stage 2 3DGS-generated synthetic counterparts.
| Method | Sensors | Stage 1 | Stage 2 | EPDMS |
|---|---|---|---|---|
| DiffusionDrive | 3xC+L | 66.7 | 40.5 | 27.5 |
| ReCogDrive | 1xC | 67.7 | 37.6 | 25.7 |
| SGDrive | 1xC | 71.1 | 35.2 | 25.5 |
| Metis | 1xC | 75.8 | 41.7 | 32.2 |
Metis achieves state-of-the-art EPDMS among learning-based planners under the reported fair-comparison setting, using only the front camera and without multi-stage training, reinforcement learning, or auxiliary datasets.
| Method | Sensors | NC | DAC | EP | LK | EC | EPDMS |
|---|---|---|---|---|---|---|---|
| SGDrive | 1xC | 98.6 | 94.3 | 86.0 | 96.1 | 85.9 | 86.2 |
| Vega | 1xC | 98.9 | 95.3 | 87.0 | 96.1 | 76.3 | 86.9 |
| DriveFine | 1xC | 98.7 | 97.3 | 88.2 | 97.7 | 84.7 | 87.1 |
| Epona | 1xC | 97.1 | 95.7 | 88.6 | 97.0 | 67.8 | 85.1 |
| DriveVLA-W0 | 1xC | 98.5 | 99.1 | 86.4 | 93.2 | 58.9 | 86.1 |
| Metis | 1xC | 98.4 | 97.2 | 87.8 | 97.8 | 88.0 | 89.5 |
| Metis, best-of-6 | 1xC | 98.5 | 97.5 | 87.9 | 98.0 | 90.0 | 90.3 |
Metis is also evaluated for urban navigation on the CityWalker benchmark, using L2 distance and Maximum Average Orientation Error (MAOE). ABot-N0 is a pretrained method using large-scale navigation data and does not report L2; Metis achieves the best L2 and MAOE among the fine-tuned methods.
| Method | Setting | Mean L2 (m) | Mean MAOE (deg) | All L2 (m) | All MAOE (deg) |
|---|---|---|---|---|---|
| ABot-N0 | Pretrained | - | 11.2 | - | 7.6 |
| GNM | Fine-tuned | 1.22 | 16.2 | 0.74 | 12.1 |
| ViNT | Fine-tuned | 1.30 | 16.5 | 0.70 | 12.6 |
| NoMaD | Fine-tuned | 1.39 | 19.1 | 0.74 | 12.1 |
| CityWalker | Fine-tuned | 1.11 | 15.2 | 1.07 | 11.5 |
| Metis | Fine-tuned | 0.71 | 11.8 | 0.64 | 9.8 |
Qualitative comparison on CityWalker. Metis tracks the ground-truth trajectory far more closely than the zero-shot Epona baseline.
| Method | Sensors | DAC | EP | PDMS |
|---|---|---|---|---|
| UniWorldVLA | 1xC | 96.7 | 83.2 | 89.4 |
| DriveLAW | 1xC | 97.1 | 81.3 | 89.1 |
| Metis | 1xC | 97.1 | 83.4 | 89.1 |
| Metis, best-of-6 | 1xC | 97.5 | 84.0 | 89.7 |
All latency numbers below are reported in the paper for single-GPU inference.
| Setting | Steps | navtest PDMS | navtest EPDMS | navhard EPDMS | RTX 4090 | H200 |
|---|---|---|---|---|---|---|
| Action-only | 1 | 87.0 | 87.2 | 30.4 | 110 ms | 100 ms |
| Action-only | 2 | 88.9 | 89.2 | 31.2 | 147 ms | 140 ms |
| Action-only | 5 | 89.0 | 89.4 | 31.4 | 280 ms | 240 ms |
| Action-only | 10 | 89.1 | 89.5 | 32.2 | 480 ms | 430 ms |
Compared with explicit future-video inference:
| Method | PDMS | EPDMS | Latency |
|---|---|---|---|
| Epona | 86.2 | 85.1 | 0.32s |
| PWM | 87.3 | - | 0.57s |
| PWM, with video | 88.1 | - | 0.83s |
| Metis, with video | 89.0 | 89.5 | 1.38s |
| Metis, action-only (2 steps) | 88.9 | 89.2 | 0.17s |
| Image Size | Variant | navtest DAC | navtest LK | navtest EPDMS | navhard EPDMS |
|---|---|---|---|---|---|
| 320x384 | Joint | 96.5 | 97.0 | 87.4 | 28.0 |
| 320x384 | Isolated | 96.5 | 97.5 | 88.3 | 29.4 |
| 320x384 | Asymmetric | 97.0 | 97.6 | 88.8 | 31.6 |
| 640x768 | Asymmetric | 97.5 | 98.0 | 89.5 | 32.2 |
The rows below reproduce the VGE/AE ablation labels reported in Table 6 of the paper.
| VGE | AE | navtest PDMS | navtest EPDMS | navhard EPDMS |
|---|---|---|---|---|
| Wan2.1-1.3B | ~0.24B | 88.5 | 88.8 | 28.8 |
| Wan2.2-14B | ~0.21B | 88.2 | 88.2 | 31.2 |
| Wan2.2-14B | ~1.04B | 89.1 | 89.5 | 32.2 |
| Training Paradigm | PDMS | EPDMS |
|---|---|---|
| Without video co-training | 87.4 | 87.9 |
| With video co-training | 89.1 | 89.5 |
The paper reports the following experimental setup:
| Item | NAVSIM-v2 | CityWalker |
|---|---|---|
| Input resolution | 640x768 | 384x384 |
| Action horizon | 8 waypoints, 4s, 0.5s interval | 5 waypoints |
| Action format | (x, y, theta) |
(x, y) |
| Camera input | Front-facing camera only | Front-facing camera only |
| Training epochs | 60 | 30 |
| Batch size | 64 | 64 |
| Optimizer | AdamW | AdamW |
| Learning rate | 1e-4 |
1e-4 |
| Weight decay | 0.01 |
0.01 |
| Inference | 10 denoising steps, CFG = 1.0 | 10 denoising steps, CFG = 1.0 |
| Hardware | 8 NVIDIA H200 GPUs | 8 NVIDIA H200 GPUs |
Metis is deployed zero-shot on a Unitree Go2 quadruped for closed-loop obstacle avoidance. The reported real-world tests cover indoor daytime and outdoor nighttime environments. A PD controller adapted from prior navigation work converts predicted trajectories into velocity commands for the robot.
Zero-shot outdoor deployment (no task-specific training): the quadruped plans collision-free paths around a parked motorcycle and a rock over a 0–9s rollout.
The public release will include additional deployment notes where possible.
This repository is under active development. The full open-source release is planned for August 2026.
- Paper PDF
- Framework figure
- Inference code
- Training code
- Evaluation scripts
- NAVSIM-v2 data preparation guide
- CityWalker data preparation guide
| Model | Dataset | Input | Status |
|---|---|---|---|
| Metis-NAVSIM | NAVSIM-v2 / NAVSIM-v1 | 1x front camera, ego state, instruction | Coming in August 2026 |
| Metis-CityWalker | CityWalker | 1x front camera, ego state, instruction | Coming in August 2026 |
If you find Metis useful for your research, please consider citing:
@misc{li2026metis,
title = {Metis: A Generalizable and Efficient World-Action Model for Autonomous Driving and Urban Navigation},
author = {Li, Jingyu and Liu, Zhe and Hu, Dongnan and Wu, Junjie and Ma, Zipei and Wu, Wenxiao and Han, Chao and Hao, Zhihui and Liu, Zhikang and Zhan, Kun and Deng, Jiankang and Zhu, Xiatian and Zhang, Li},
year = {2026},
note = {Preprint}
}The citation entry will be updated after the paper receives an official arXiv or venue record.
Metis builds on progress in video generation, world models, vision-language-action models, autonomous driving planning, and urban navigation. We thank the maintainers of the NAVSIM and CityWalker benchmarks, and the broader open-source robotics and autonomous driving communities.
This project is released under the Apache-2.0 License.
