midea-ai/TinyVLA

TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation

📰 Authors

  • Junjie Wen 1,3; Yichen Zhu 2; Jinming Li 3,6; Minjie Zhu 1,3; Zhibin Tang 2; Kun Wu 4; Zhiyuan Xu 5; Ning Liu 5; Ran Cheng 2; Chaomin Shen 1; Yaxin Peng 6; Feifei Feng 2; and Jian Tang 5
* 1 Junjie Wen, Minjie Zhu, and Chaomin Shen are with East China Normal University, Shanghai 200042, China. {jjwen,mjzhu}@stu.ecnu.edu.cn, cmshen@cs.ecnu.edu.cn
* 2 Yichen Zhu, Ran Cheng, Zhibin Tang, and Feifei Feng are with Midea Group, AI Lab, Shanghai 201700, China. {zhuyc25, tangzb, ningliu22, chengran, feifei.feng}@midea.com
* 3 Junjie Wen, Minjie Zhu, and Jinming Li were interns at Midea Group, AI Lab, Shanghai 201700, China.
* 4 Kun Wu is with Syracuse University, New York 13244, USA. kwu102@syr.edu
* 5 Zhiyuan Xu, Ning Liu, and Jian Tang are with Beijing Innovation Center of Humanoid Robotics, Beijing 102676, China. {eric.xu, neil.liu, jian.tang}@x-humanoid.com
* 6 Jinming Li and Yaxin Peng are with Shanghai University, Shanghai 201900, China. {ljm2022, yaxin.peng}@shu.edu.cn
Junjie Wen and Yichen Zhu are co-first authors. Yichen Zhu and Chaomin Shen are the corresponding authors.

📰 News

  • Feb. 17th, 2025: 🔥🔥🔥Our code is released!
  • Feb. 9th, 2025: 🔥🔥🔥TinyVLA is accepted by IEEE Robotics and Automation Letters (RA-L) 2025!
  • Nov. 19th, 2024: TinyVLA is out! The paper can be found here. The project website can be found here.

Install

  1. Clone this repository and navigate to the TinyVLA folder
git clone https://github.com/liyaxuanliyaxuan/TinyVLA
cd TinyVLA
  2. Install the packages
conda create -n tinyvla python=3.10 -y
conda activate tinyvla
pip install --upgrade pip
pip install -r requirements.txt
cd policy_heads
pip install -e . 
# install llava-pythia
cd ../llava-pythia
pip install -e . 

Data Preparation

  1. Our data format is the same as that of ACT, so you need to convert your data to HDF5 (h5py) format. See rlds_to_h5py.py, which converts data from RLDS format to HDF5.
# h5 data structure
root
  |-action (100,10)
  |-language_raw (1,)
  |-observations
      |-images # multi-view
          |-left (100,480,640,3)
          |-right (100,480,640,3)
          |-wrist (100,480,640,3)
      |-joint_positions (100,7)
      |-qpos (100,7)
      |-qvel (100,7)
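
As a quick sanity check on a converted episode, the layout above can be compared against the dataset shapes actually stored in a file. The sketch below is illustrative and not part of the repository: `PER_STEP_KEYS`, `check_episode`, and `read_shapes` are hypothetical names, the dimensions are copied from the listing (adjust them for your own cameras and action space), and the rule that all per-step arrays share one episode length is an assumption inferred from the structure shown.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]

# Dataset paths from the listing above, mapped to their per-step (trailing) dimensions.
# Assumption: the leading dimension of each array is the episode length T (100 above).
PER_STEP_KEYS: Dict[str, Shape] = {
    "action": (10,),
    "observations/images/left": (480, 640, 3),
    "observations/images/right": (480, 640, 3),
    "observations/images/wrist": (480, 640, 3),
    "observations/joint_positions": (7,),
    "observations/qpos": (7,),
    "observations/qvel": (7,),
}


def check_episode(shapes: Dict[str, Shape]) -> List[str]:
    """Return a list of problems; an empty list means the layout matches."""
    problems: List[str] = []
    if shapes.get("language_raw") != (1,):
        problems.append("language_raw should have shape (1,)")
    lengths = set()
    for key, tail in PER_STEP_KEYS.items():
        shape = shapes.get(key)
        if shape is None:
            problems.append(f"missing dataset: {key}")
        elif len(shape) != len(tail) + 1 or tuple(shape[1:]) != tail:
            expected = ", ".join(str(d) for d in tail)
            problems.append(f"{key}: expected (T, {expected}), got {shape}")
        else:
            lengths.add(shape[0])
    if len(lengths) > 1:
        problems.append(f"inconsistent episode lengths: {sorted(lengths)}")
    return problems


def read_shapes(path: str) -> Dict[str, Shape]:
    """Collect {dataset path: shape} from an HDF5 file (requires the h5py package)."""
    import h5py  # imported lazily so check_episode works without h5py installed

    shapes: Dict[str, Shape] = {}

    def visit(name, obj):
        # visititems calls this for every group and dataset; keep datasets only.
        if isinstance(obj, h5py.Dataset):
            shapes[name] = tuple(obj.shape)

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return shapes
```

Calling `check_episode(read_shapes("episode_0.hdf5"))` on a converted file (the file name here is a placeholder) returns an empty list when the file follows the layout.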
  2. Add an entry to constants.py that specifies the path to your data, as follows.
    'your_task_name':{
        'dataset_dir': DATA_DIR + '/your_task_path', # define the path of the dataset
        'episode_len': 1000, # maximum length of an episode
        'camera_names': ['front', 'wrist'] # define the camera names which are used as the key when reading data
    }
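
To make the entry concrete, here is a minimal, self-contained sketch of how such a table might be defined and looked up. The `TASK_CONFIGS` name, the `DATA_DIR` value, `REQUIRED_FIELDS`, and the `get_task_config` helper are hypothetical stand-ins for illustration; check constants.py for the variable names the repository actually uses.

```python
DATA_DIR = "/path/to/data"  # hypothetical dataset root; set this to your own location

# Hypothetical table of task entries, mirroring the entry format shown above.
TASK_CONFIGS = {
    "your_task_name": {
        "dataset_dir": DATA_DIR + "/your_task_path",  # where the HDF5 episodes live
        "episode_len": 1000,                          # maximum length of an episode
        "camera_names": ["front", "wrist"],           # keys used when reading images
    },
}

REQUIRED_FIELDS = ("dataset_dir", "episode_len", "camera_names")


def get_task_config(task_name, configs=TASK_CONFIGS):
    """Look up a task entry and fail early if it is unknown or incomplete."""
    if task_name not in configs:
        raise KeyError(f"unknown task {task_name!r}; known tasks: {sorted(configs)}")
    config = configs[task_name]
    missing = [field for field in REQUIRED_FIELDS if field not in config]
    if missing:
        raise ValueError(f"task {task_name!r} is missing fields: {missing}")
    return config
```

Because the camera names are used as keys when reading data, they should match the dataset names stored under observations/images in your HDF5 files.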

Download Pretrained VLM

We construct the VLM backbone by integrating a series of tiny LLMs (Pythia) into the Llava framework. We follow the standard training pipeline and data provided by Llava. The weights of all VLMs used in our paper are listed below:

Model                 | Usage         | Link
Llava-Pythia (~400M)  | For TinyVLA-S | huggingface
Llava-Pythia (~700M)  | For TinyVLA-B | huggingface
Llava-Pythia (~1.3B)  | For TinyVLA-H | huggingface

Train

The training script is "scripts/train.sh". You need to change the following parameters:

  1. OUTPUT: the save directory for training. It must include the keyword "llava_pythia"; if LoRA training is used, it must also include "lora" (e.g., "llava_pythia_lora").
  2. task_name: the task used for training, which should correspond to "your_task_name" in aloha_scripts/constants.py
  3. model_name_or_path: path to the pretrained VLM weights
  4. Other hyperparameters such as "batch_size" and "save_steps" can be customized according to your computational resources.
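
The naming rule for OUTPUT is easy to get wrong, so it can help to check the path before launching a long run. The helper below is a hypothetical sketch, not a script shipped with the repository; `output_name_problems` and its `use_lora` flag are illustrative names, and it assumes the keywords may appear anywhere in the path string.

```python
from typing import List


def output_name_problems(output: str, use_lora: bool) -> List[str]:
    """Return the naming-rule violations for a training output path (empty if none)."""
    problems: List[str] = []
    # Rule 1: the save directory must include the keyword "llava_pythia".
    if "llava_pythia" not in output:
        problems.append('OUTPUT must include "llava_pythia"')
    # Rule 2: with LoRA training, the name must also include "lora".
    if use_lora and "lora" not in output:
        problems.append('OUTPUT must include "lora" when LoRA training is used')
    return problems


if __name__ == "__main__":
    for path, lora in [("/ckpt/llava_pythia_lora", True), ("/ckpt/run1", True)]:
        issues = output_name_problems(path, lora)
        print(path, "OK" if not issues else issues)
```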

Start training with the following command:

./scripts/train.sh

Evaluation

Before evaluation, run the post-processing script we provide to generate smaller, usable weights. The script is "scripts/process_ckpts.sh". You need to change the following parameters:

  1. source_dir: path to the trained VLA directory (the same as OUTPUT in train.sh)
  2. target_dir: path where the processed VLA weights are saved

You can refer to our evaluation script eval_real_franka.py.

Acknowledgement

We build our project based on:

  • LLaVA: an amazing open-source project for vision-language assistants
  • act-plus-plus: an amazing open-source project for robotics visuomotor learning
  • Miphi: an amazing open-source project for tiny vision-language models

Citation

If you find TinyVLA useful for your research and applications, please cite using this BibTeX:

@inproceedings{wen2024tinyvla,
    title={Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation},
    author={Wen, Junjie and Zhu, Yichen and Li, Jinming and Zhu, Minjie and Wu, Kun and Xu, Zhiyuan and Liu, Ning and Cheng, Ran and Shen, Chaomin and Peng, Yaxin and others},
    booktitle={IEEE Robotics and Automation Letters (RA-L)},
    year={2025}
}
