DeMaVLA is a Vision-Language-Action (VLA) foundation model for generalizable deformable manipulation. It targets real-world bimanual household folding, where robots must handle garments from random initial states across different categories, geometries, materials, and scenes.
The model combines a Qwen3-VL backbone, a layer-aligned pruned action expert, flow-matching action generation, training-time real-time chunking (RTC), and human-in-the-loop DAgger. DeMaVLA is first pre-trained on about 5,000 hours of selected real-world dual-arm demonstrations, then post-trained on mixed folding demonstrations and corrective trajectories collected from real-robot failures.
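As a rough illustration of what flow-matching action generation means at inference time, the toy sketch below integrates a velocity field from Gaussian noise toward an action chunk with fixed Euler steps. It is not DeMaVLA's implementation: `oracle_velocity` is a hypothetical stand-in for the learned action expert, the time convention (noise at t = 0, actions at t = 1) is an assumption, and the action chunk is flattened to a plain list.

```python
import random


def oracle_velocity(x, t, target):
    """Hypothetical stand-in for the learned velocity network.

    Returns the velocity that moves x along a straight line so that it
    reaches `target` at t = 1. A real model would predict this from
    images, language, and robot state instead of knowing the target.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_chunk(target, num_steps=10, seed=0):
    """Euler-integrate from Gaussian noise (t = 0) to an action chunk (t = 1)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # noisy initial chunk
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = oracle_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]  # one Euler step
    return x


# Flattened toy "action chunk" (values are illustrative only).
chunk = sample_chunk([0.5, -0.25, 1.0, 0.0])
```

With the oracle field the sampler lands on the target up to floating-point error; with a learned field, the number of Euler steps trades inference latency against accuracy.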
Average success rate (%) over 50 bimanual simulation tasks:
| Method | Clean | Randomized |
|---|---|---|
| pi_0 | 65.92 | 58.40 |
| pi_0.5 | 82.74 | 76.76 |
| X-VLA | 72.80 | 72.84 |
| ABot-M0 | 80.42 | 81.16 |
| LingBot-VLA | 86.50 | 85.34 |
| DeMaVLA | 88.42 | 86.78 |
Success rate and average completion time (min:sec) over four real-world folding tasks:
| Method | Shirt | Skirt | Pants | Towel | Average |
|---|---|---|---|---|---|
| pi_0 | 90.0%, 1:55 | 95.0%, 1:03 | 65.0%, 3:01 | 55.0%, 3:44 | 76.3%, 2:26 |
| DeMaVLA | 95.0%, 2:15 | 100.0%, 1:30 | 75.0%, 3:01 | 100.0%, 2:26 | 92.5%, 2:18 |
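The Average column can be reproduced from the per-task entries. The sketch below assumes the times are in minutes:seconds and that the average is an unweighted mean over the four tasks; both are inferences from the table, not statements from the authors.

```python
def to_seconds(mmss):
    """Convert a 'm:ss' string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def to_mmss(total_seconds):
    """Convert seconds to a 'm:ss' string, rounding to the nearest second."""
    total = round(total_seconds)
    return f"{total // 60}:{total % 60:02d}"


def summarize(results):
    """Return (mean success rate in %, mean completion time as 'm:ss')."""
    rates = [rate for rate, _ in results]
    times = [to_seconds(t) for _, t in results]
    return sum(rates) / len(rates), to_mmss(sum(times) / len(times))


# (success rate %, completion time) for Shirt, Skirt, Pant, Towel, from the table.
pi0 = [(90.0, "1:55"), (95.0, "1:03"), (65.0, "3:01"), (55.0, "3:44")]
demavla = [(95.0, "2:15"), (100.0, "1:30"), (75.0, "3:01"), (100.0, "2:26")]

print(summarize(pi0))      # (76.25, '2:26')
print(summarize(demavla))  # (92.5, '2:18')
```

The pi_0 mean of 76.25 appears in the table rounded up to 76.3%.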
conda create -n demavla python=3.10
conda activate demavla
pip install -r requirements.txt

The released checkpoint is hosted on Hugging Face:
https://huggingface.co/Midea-AIRC/DeMaVLA
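One way to fetch the checkpoint locally is the `huggingface_hub` client, sketched below. The repository ID comes from the URL above; the `checkpoints/DeMaVLA` target directory is a hypothetical choice, and the repository's own instructions take precedence if they differ.

```python
REPO_ID = "Midea-AIRC/DeMaVLA"  # taken from the Hugging Face URL above


def download_checkpoint(local_dir="checkpoints/DeMaVLA"):
    """Download all files of the released checkpoint and return the local path.

    `local_dir` is a hypothetical location; point it wherever you keep weights.
    Requires: pip install huggingface_hub
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


# Usage (needs network access):
#   path = download_checkpoint()
#   print(path)
```

The returned path is the kind of value you would then pass as `--pretrain_vla_path` when fine-tuning.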
The training script expects a Qwen3-VL backbone path through QWEN3_VL_PATH or the --pretrain_vlm_backbone_path argument.
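The sketch below shows one plausible way such a path could be resolved: take the command-line argument if given, otherwise fall back to the environment variable. The precedence order and the `resolve_backbone_path` helper are assumptions made for illustration; the actual training script may wire the two together differently.

```python
import argparse
import os


def resolve_backbone_path(argv=None, environ=None):
    """Return the Qwen3-VL backbone path from the CLI flag or QWEN3_VL_PATH.

    Assumed precedence: --pretrain_vlm_backbone_path first, then the
    QWEN3_VL_PATH environment variable. Exits if neither is set.
    """
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    parser.add_argument("--pretrain_vlm_backbone_path", default=None)
    # parse_known_args ignores the other training flags in this sketch.
    args, _ = parser.parse_known_args(argv)
    path = args.pretrain_vlm_backbone_path or environ.get("QWEN3_VL_PATH")
    if not path:
        raise SystemExit("Set QWEN3_VL_PATH or pass --pretrain_vlm_backbone_path")
    return path
```

Failing early with a clear message is cheaper than discovering a missing backbone after the dataset has been loaded.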
The training pipeline reads datasets in LeRobot v3 format. Dataset paths are configured in YAML files under yaml/.
For the provided RoboTwin example, edit:
yaml/cfg_stage2_robotwin.yaml
Before launching training, update the paths in:
scripts/aloha/train_DeMaVLA_robotwin_stage2.sh
yaml/cfg_stage2_robotwin.yaml
Typical fields to adjust:
- `QWEN3_VL_PATH`: local path to the Qwen3-VL backbone.
- `OUTPUT`: output directory for checkpoints and logs.
- `--pretrain_vla_path`: path to an existing DeMaVLA/VLA checkpoint when fine-tuning.
- `--data_cfg_path`: dataset YAML path.
- `dataset.root` and `dataset.global_stats_json`: dataset and normalization paths.
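Before launching a long run, it can help to check that these paths exist. The sketch below takes an already parsed YAML mapping and reports which entries are unset or missing on disk. It assumes that `root` and `global_stats_json` are nested under a `dataset` key, as the dotted names suggest; `missing_paths` is a hypothetical helper, not part of the repository.

```python
import os


def missing_paths(cfg, environ=None):
    """Return the names of required paths that are unset or do not exist.

    `cfg` is the parsed dataset YAML as a dict, for example from
    yaml.safe_load(open("yaml/cfg_stage2_robotwin.yaml")).
    The nesting under "dataset" is an assumption about the file layout.
    """
    environ = os.environ if environ is None else environ
    dataset = cfg.get("dataset") or {}
    candidates = {
        "QWEN3_VL_PATH": environ.get("QWEN3_VL_PATH"),
        "dataset.root": dataset.get("root"),
        "dataset.global_stats_json": dataset.get("global_stats_json"),
    }
    return [
        name
        for name, path in candidates.items()
        if not path or not os.path.exists(path)
    ]
```

An empty list means all three locations resolve; anything else names the field to fix before running the launch script.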
Launch the provided RoboTwin stage-2 training example:
bash scripts/aloha/train_DeMaVLA_robotwin_stage2.sh

@article{su2026demavla,
title={DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation},
author={Su, Taiyi and Zhu, Jian and Wang, Tianjian and He, Youzhang and Huang, Zitai and Zhang, Jianjun and Ma, Chong and Wang, Hanyang and Zhang, Tianjiao and Yin, Munan and Ding, Weihao and Xu, Yi},
year={2026},
url={https://arxiv.org/pdf/2605.31286}
}

This project is released under the Apache License 2.0. See LICENSE for details.

