DeMaVLA

DeMaVLA is a Vision-Language-Action (VLA) foundation model for generalizable deformable manipulation. It targets real-world bimanual household folding, where robots must handle garments from random initial states across different categories, geometries, materials, and scenes.

The model combines a Qwen3-VL backbone, a layer-aligned pruned action expert, flow-matching action generation, training-time real-time chunking (RTC), and human-in-the-loop DAgger. DeMaVLA is first pre-trained on about 5,000 hours of selected real-world dual-arm demonstrations, then post-trained on mixed folding demonstrations and corrective trajectories collected from real-robot failures.
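
To make the flow-matching action generation step concrete, here is a minimal sketch of sampling an action chunk by integrating a velocity field from noise. It is not DeMaVLA's implementation: `toy_velocity`, `sample_action_chunk`, the chunk horizon, the action dimension, and the number of Euler steps are all illustrative assumptions. In the real model, the learned action expert conditioned on Qwen3-VL features would presumably take the place of `toy_velocity`.

```python
import random
from typing import Callable, List

Chunk = List[List[float]]  # (horizon, action_dim) as nested lists


def toy_velocity(x: Chunk, t: float, target: Chunk) -> Chunk:
    """Stand-in for the learned action expert (assumption, not the real model).

    For the straight-line path x_t = (1 - t) * noise + t * target, the velocity
    that reaches `target` at t = 1 is (target - x_t) / (1 - t).
    """
    return [[(tv - xv) / (1.0 - t) for xv, tv in zip(xr, tr)]
            for xr, tr in zip(x, target)]


def sample_action_chunk(velocity: Callable[[Chunk, float], Chunk],
                        horizon: int, action_dim: int,
                        num_steps: int = 10, seed: int = 0) -> Chunk:
    """Integrate dx/dt = velocity(x, t) from t = 0 (noise) to t = 1 with Euler steps."""
    rng = random.Random(seed)
    x = [[rng.gauss(0.0, 1.0) for _ in range(action_dim)] for _ in range(horizon)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so toy_velocity never divides by zero
        v = velocity(x, t)
        x = [[xv + dt * vv for xv, vv in zip(xr, vr)] for xr, vr in zip(x, v)]
    return x


# Hypothetical target chunk: 4 timesteps of a 2-DoF action.
target = [[0.1 * i, -0.1 * i] for i in range(4)]
chunk = sample_action_chunk(lambda x, t: toy_velocity(x, t, target),
                            horizon=4, action_dim=2)
```

Because the toy path is a straight line, the Euler integration lands on `target` up to floating-point error. A learned velocity field is only approximate, so the number of integration steps becomes a trade-off between accuracy and latency.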

Results

RoboTwin 2.0

Average success rate (%) over 50 bimanual simulation tasks:

Method	Clean	Randomized
pi_0	65.92	58.40
pi_0.5	82.74	76.76
X-VLA	72.80	72.84
ABot-M0	80.42	81.16
LingBot-VLA	86.50	85.34
DeMaVLA	88.42	86.78

Real-World Household Folding

Success rate and average completion time (min:sec) over four real-world folding tasks:

Method	Shirt	Skirt	Pant	Towel	Average
pi_0	90.0%, 1:55	95.0%, 1:03	65.0%, 3:01	55.0%, 3:44	76.3%, 2:26
DeMaVLA	95.0%, 2:15	100.0%, 1:30	75.0%, 3:01	100.0%, 2:26	92.5%, 2:18
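
The Average column can be recomputed from the per-task cells, which also confirms that the times are in minutes:seconds. The sketch below uses only the values from the table above. The helper names `to_seconds` and `summarize` are illustrative, and rounding the mean time to the nearest second is an assumption about how the column was produced.

```python
def to_seconds(mmss: str) -> int:
    """Convert a 'minutes:seconds' string such as '1:55' to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def summarize(cells):
    """Average success rate (%) and completion time over 'rate%, m:ss' cells."""
    rates, times = [], []
    for cell in cells:
        rate, mmss = cell.split(",")
        rates.append(float(rate.strip().rstrip("%")))
        times.append(to_seconds(mmss.strip()))
    mean_rate = sum(rates) / len(rates)
    mean_secs = round(sum(times) / len(times))  # assumed: nearest second
    return mean_rate, f"{mean_secs // 60}:{mean_secs % 60:02d}"


pi0 = ["90.0%, 1:55", "95.0%, 1:03", "65.0%, 3:01", "55.0%, 3:44"]
demavla = ["95.0%, 2:15", "100.0%, 1:30", "75.0%, 3:01", "100.0%, 2:26"]

pi0_rate, pi0_time = summarize(pi0)
demavla_rate, demavla_time = summarize(demavla)
```

With the table's values this gives 76.25% and 2:26 for pi_0 (reported as 76.3%) and 92.5% and 2:18 for DeMaVLA, matching the Average column.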

Installation

conda create -n demavla python=3.10
conda activate demavla
pip install -r requirements.txt

Model Checkpoint

The released checkpoint is hosted on Hugging Face:

https://huggingface.co/Midea-AIRC/DeMaVLA
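
One way to fetch the checkpoint programmatically is through the `huggingface_hub` package, sketched below. This assumes `huggingface_hub` is installed, which this README does not confirm is covered by requirements.txt. The helper names `repo_id_from_url` and `download_checkpoint` and the target directory are hypothetical. Only the repository URL comes from this README.

```python
from urllib.parse import urlparse


def repo_id_from_url(url: str) -> str:
    """Extract '<owner>/<name>' from a Hugging Face model page URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"not a model repository URL: {url}")
    return "/".join(parts[:2])


def download_checkpoint(local_dir: str,
                        url: str = "https://huggingface.co/Midea-AIRC/DeMaVLA") -> str:
    """Download the whole repository snapshot and return its local path.

    Assumes `pip install huggingface_hub`; imported lazily so that the helper
    above stays usable without it.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id_from_url(url), local_dir=local_dir)
```

Calling `download_checkpoint("./checkpoints/DeMaVLA")` (a placeholder directory) would download the repository files into that folder. The returned path could then be passed as the checkpoint path when fine-tuning.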

The training script expects the path to a Qwen3-VL backbone, supplied either through QWEN3_VL_PATH or through the --pretrain_vlm_backbone_path argument.
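
The two ways of supplying the backbone path could be reconciled as in the sketch below. This is not the repository's actual argument handling. It treats QWEN3_VL_PATH as an environment variable and lets the command-line flag take precedence, and both choices are assumptions. The function name `resolve_backbone_path` is hypothetical.

```python
import argparse
import os
from typing import List, Optional


def resolve_backbone_path(argv: Optional[List[str]] = None,
                          env: Optional[dict] = None) -> str:
    """Return the Qwen3-VL path from the CLI flag, else from QWEN3_VL_PATH."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("--pretrain_vlm_backbone_path", default=None)
    # parse_known_args ignores the other training flags this sketch does not model.
    args, _ = parser.parse_known_args(argv)
    path = args.pretrain_vlm_backbone_path or env.get("QWEN3_VL_PATH")
    if not path:
        raise SystemExit("Set QWEN3_VL_PATH or pass --pretrain_vlm_backbone_path")
    return path
```

Failing early with a clear message when neither source is set avoids a less obvious error later, when the backbone weights are loaded.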

Data

The training pipeline reads datasets in LeRobot v3 format. Dataset paths are configured in YAML files under yaml/.

For the provided RoboTwin example, edit:

yaml/cfg_stage2_robotwin.yaml

Training

Before launching training, update the paths in:

scripts/aloha/train_DeMaVLA_robotwin_stage2.sh
yaml/cfg_stage2_robotwin.yaml

Typical fields to adjust:

QWEN3_VL_PATH: local path to the Qwen3-VL backbone.
OUTPUT: output directory for checkpoints and logs.
--pretrain_vla_path: path to an existing DeMaVLA/VLA checkpoint when fine-tuning.
--data_cfg_path: dataset YAML path.
dataset.root and dataset.global_stats_json: dataset and normalization paths.
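
Since a wrong path usually only surfaces once training starts, a small pre-flight check on the dataset fields can save a failed launch. The sketch below assumes the YAML file has already been loaded into a dictionary (for example with a YAML parser). It also assumes the dotted names above correspond to a nested `dataset` section, which is inferred from the naming and not confirmed by the repository. `check_dataset_cfg` and the example paths are hypothetical.

```python
import os
from typing import Dict, List

REQUIRED_DATASET_KEYS = ("root", "global_stats_json")


def check_dataset_cfg(cfg: Dict) -> List[str]:
    """Return a list of problems; an empty list means the paths look usable."""
    problems = []
    dataset = cfg.get("dataset")
    if not isinstance(dataset, dict):
        return ["missing 'dataset' section"]
    for key in REQUIRED_DATASET_KEYS:
        value = dataset.get(key)
        if not value:
            problems.append(f"dataset.{key} is not set")
        elif not os.path.exists(value):
            problems.append(f"dataset.{key} does not exist: {value}")
    return problems


# Hypothetical config, as it might look after loading the YAML file into a dict.
example_cfg = {"dataset": {"root": "/path/to/lerobot_dataset",
                           "global_stats_json": ""}}
issues = check_dataset_cfg(example_cfg)
```

Each entry in `issues` names the offending field, so the fix can be made directly in the YAML file before launching.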

Launch the provided RoboTwin stage-2 training example:

bash scripts/aloha/train_DeMaVLA_robotwin_stage2.sh

Citation

@article{su2026demavla,
  title={DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation},
  author={Su, Taiyi and Zhu, Jian and Wang, Tianjian and He, Youzhang and Huang, Zitai and Zhang, Jianjun and Ma, Chong and Wang, Hanyang and Zhang, Tianjiao and Yin, Munan and Ding, Weihao and Xu, Yi},
  year={2026},
  url={https://arxiv.org/pdf/2605.31286}
}

License

This project is released under the Apache License 2.0. See LICENSE for details.
