midea-ai/DeMaVLA

DeMaVLA

Paper | Model | Project Page

DeMaVLA is a Vision-Language-Action (VLA) foundation model for generalizable deformable manipulation. It targets real-world bimanual household folding, where robots must handle garments from random initial states across different categories, geometries, materials, and scenes.

The model combines a Qwen3-VL backbone, a layer-aligned pruned action expert, flow-matching action generation, training-time real-time chunking (RTC), and human-in-the-loop DAgger. DeMaVLA is first pre-trained on about 5,000 hours of selected real-world dual-arm demonstrations, then post-trained on mixed folding demonstrations and corrective trajectories collected from real-robot failures.

[Figure: DeMaVLA overview]

Results

RoboTwin 2.0

Average success rate over 50 bimanual simulation tasks:

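| Method      | Clean (%) | Randomized (%) |
|-------------|-----------|----------------|
| pi_0        | 65.92     | 58.40          |
| pi_0.5      | 82.74     | 76.76          |
| X-VLA       | 72.80     | 72.84          |
| ABot-M0     | 80.42     | 81.16          |
| LingBot-VLA | 86.50     | 85.34          |
| DeMaVLA     | 88.42     | 86.78          |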

Real-World Household Folding

Success rate and average completion time (min:sec) on four real-world folding tasks:

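| Method  | Shirt       | Skirt        | Pant        | Towel        | Average     |
|---------|-------------|--------------|-------------|--------------|-------------|
| pi_0    | 90.0%, 1:55 | 95.0%, 1:03  | 65.0%, 3:01 | 55.0%, 3:44  | 76.3%, 2:26 |
| DeMaVLA | 95.0%, 2:15 | 100.0%, 1:30 | 75.0%, 3:01 | 100.0%, 2:26 | 92.5%, 2:18 |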
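
The Average column appears to be the unweighted mean of the four per-task values, with times read as minutes:seconds. Both are assumptions, since the README does not state how the averages were computed, but the reported numbers are consistent with them. The helper names below are illustrative and not part of the repository. A minimal sketch of the arithmetic:

```python
def to_seconds(mmss: str) -> int:
    """Convert an 'm:ss' string such as '1:55' to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def to_mmss(total_seconds: float) -> str:
    """Format seconds as 'm:ss', rounding to the nearest second (half up)."""
    total = int(total_seconds + 0.5)
    return f"{total // 60}:{total % 60:02d}"


def average_row(rates, times):
    """Unweighted mean of success rates (%) and of completion times."""
    mean_rate = sum(rates) / len(rates)
    mean_time = sum(to_seconds(t) for t in times) / len(times)
    return mean_rate, to_mmss(mean_time)


# Per-task values copied from the table (Shirt, Skirt, Pant, Towel).
pi0 = average_row([90.0, 95.0, 65.0, 55.0], ["1:55", "1:03", "3:01", "3:44"])
demavla = average_row([95.0, 100.0, 75.0, 100.0], ["2:15", "1:30", "3:01", "2:26"])

print(pi0)      # (76.25, '2:26'), reported in the table as 76.3%, 2:26
print(demavla)  # (92.5, '2:18')
```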

Installation

conda create -n demavla python=3.10
conda activate demavla
pip install -r requirements.txt

Model Checkpoint

The released checkpoint is hosted on Hugging Face:

https://huggingface.co/Midea-AIRC/DeMaVLA

The training script expects a Qwen3-VL backbone path through QWEN3_VL_PATH or the --pretrain_vlm_backbone_path argument.
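
One way to picture the two options is a small resolver that reads the command-line flag first and falls back to the environment variable. This is a hypothetical sketch, not the repository's code: the function name `resolve_backbone_path` and the precedence order (flag over environment variable) are assumptions, and the actual training script may wire the two together differently.

```python
import argparse
import os


def resolve_backbone_path(argv=None, env=None):
    """Return the Qwen3-VL backbone path from the CLI flag or the environment.

    Assumed precedence (not confirmed by the README): the
    --pretrain_vlm_backbone_path flag wins over QWEN3_VL_PATH.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--pretrain_vlm_backbone_path", default=None)
    # parse_known_args ignores any other flags meant for the training script.
    args, _unknown = parser.parse_known_args(argv)
    path = args.pretrain_vlm_backbone_path or env.get("QWEN3_VL_PATH")
    if not path:
        raise ValueError(
            "Set QWEN3_VL_PATH or pass --pretrain_vlm_backbone_path."
        )
    return path


# Example with explicit inputs; the paths are placeholders.
print(resolve_backbone_path(
    ["--pretrain_vlm_backbone_path", "/path/to/Qwen3-VL"],
    env={"QWEN3_VL_PATH": "/another/path"},
))
```

Failing early with a clear message when neither source is set avoids a less obvious error later, when the backbone is loaded.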

Data

The training pipeline reads datasets in LeRobot v3 format. Dataset paths are configured in YAML files under yaml/.

For the provided RoboTwin example, edit:

yaml/cfg_stage2_robotwin.yaml

Training

Before launching training, update the paths in:

scripts/aloha/train_DeMaVLA_robotwin_stage2.sh
yaml/cfg_stage2_robotwin.yaml

Typical fields to adjust:

  • QWEN3_VL_PATH: local path to the Qwen3-VL backbone.
  • OUTPUT: output directory for checkpoints and logs.
  • --pretrain_vla_path: path to an existing DeMaVLA/VLA checkpoint when fine-tuning.
  • --data_cfg_path: dataset YAML path.
  • dataset.root and dataset.global_stats_json: dataset and normalization paths.
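
Because several of these fields are local paths, it can help to check that the input paths exist before starting a long run. The sketch below is a hypothetical pre-flight helper, not part of the repository. It takes the values you have already filled in as a plain dictionary, and every path shown is a placeholder. OUTPUT is left out on the assumption that the training script creates the output directory itself.

```python
from pathlib import Path


def find_missing_paths(paths: dict[str, str]) -> list[str]:
    """Return the names of fields whose path is empty or does not exist."""
    return [
        name
        for name, value in paths.items()
        if not value or not Path(value).exists()
    ]


# Placeholder values; replace them with the ones from your script and YAML.
fields = {
    "QWEN3_VL_PATH": "/path/to/Qwen3-VL",
    "--pretrain_vla_path": "/path/to/DeMaVLA/checkpoint",
    "--data_cfg_path": "yaml/cfg_stage2_robotwin.yaml",
    "dataset.root": "/path/to/lerobot_dataset",
    "dataset.global_stats_json": "/path/to/global_stats.json",
}

for name in find_missing_paths(fields):
    print(f"Missing or unset: {name}")
```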

Launch the provided RoboTwin stage-2 training example:

bash scripts/aloha/train_DeMaVLA_robotwin_stage2.sh

Citation

@article{su2026demavla,
  title={DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation},
  author={Su, Taiyi and Zhu, Jian and Wang, Tianjian and He, Youzhang and Huang, Zitai and Zhang, Jianjun and Ma, Chong and Wang, Hanyang and Zhang, Tianjiao and Yin, Munan and Ding, Weihao and Xu, Yi},
  year={2026},
  url={https://arxiv.org/pdf/2605.31286}
}

License

This project is released under the Apache License 2.0. See LICENSE for details.
