TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation

IEEE RAL

📰 Authors

Junjie Wen 1,3; Yichen Zhu 2; Jinming Li 3,6; Minjie Zhu 1,3; Zhibin Tang 2; Kun Wu 4; Zhiyuan Xu 5; Ning Liu 5; Ran Cheng 2; Chaomin Shen 1; Yaxin Peng 6; Feifei Feng 2; and Jian Tang 5

* 1 Junjie Wen, Minjie Zhu, and Chaomin Shen are with East China Normal University, Shanghai 200042, China. {jjwen,mjzhu}@stu.ecnu.edu.cn, cmshen@cs.ecnu.edu.cn
* 2 Yichen Zhu, Ran Cheng, Zhibin Tang, and Feifei Feng are with Midea Group, AI Lab, Shanghai 201700, China. {zhuyc25, tangzb, ningliu22, chengran, feifei.feng}@midea.com
* 3 Junjie Wen, Minjie Zhu, and Jinming Li are interns at Midea Group, AI Lab, Shanghai 201700, China.
* 4 Kun Wu is with Syracuse University, New York 13244, USA. kwu102@syr.edu
* 5 Zhiyuan Xu, Ning Liu, and Jian Tang are with Beijing Innovation Center of Humanoid Robotics, Beijing 102676, China. {eric.xu, neil.liu, jian.tang}@x-humanoid.com
* 6 Jinming Li and Yaxin Peng are with Shanghai University, Shanghai 201900, China. {ljm2022, yaxin.peng}@shu.edu.cn
Junjie Wen and Yichen Zhu are co-first authors. Yichen Zhu and Chaomin Shen are the corresponding authors.

📰 News

Feb. 17th, 2025: 🔥🔥🔥Our code is released!
Feb. 9th, 2025: 🔥🔥🔥TinyVLA is accepted by IEEE Robotics and Automation Letters (RA-L) 2025!
Nov. 19th, 2024: TinyVLA is out! The paper can be found here. The project website can be found here.

Install

Clone this repository and navigate to the TinyVLA folder

git clone https://github.com/liyaxuanliyaxuan/TinyVLA
cd TinyVLA

Install Package

conda create -n tinyvla python=3.10 -y
conda activate tinyvla
pip install --upgrade pip
pip install -r requirements.txt
# install the policy heads
cd policy_heads
pip install -e .
# install llava-pythia
cd ../llava-pythia
pip install -e .

Data Preparation

Our data format is the same as ACT, so you need to convert your data into HDF5 (h5py) format. See rlds_to_h5py.py, which converts data from RLDS format to HDF5, for reference.

# h5 data structure
root
  |-action (100,10)
  |-language_raw (1,)
  |-observations
      |-images # multi-view
          |-left (100,480,640,3)
          |-right (100,480,640,3)
          |-wrist (100,480,640,3)
      |-joint_positions (100,7)
      |-qpos (100,7)
      |-qvel (100,7)
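
Before training, it can help to confirm that a converted episode follows this layout. The sketch below is a hypothetical helper, not part of the repository: it takes a mapping from dataset path to shape and checks that every per-timestep array shares the same leading length. The name `check_episode_shapes` and the choice of required keys are assumptions based only on the tree above.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]

# Datasets taken from the tree above; treating these two as required is an
# assumption of this sketch, not a rule stated by the repository.
REQUIRED_KEYS = ("action", "observations/qpos")


def check_episode_shapes(shapes: Dict[str, Shape]) -> List[str]:
    """Return a list of problems found in one episode (empty means it looks fine)."""
    problems = [f"missing dataset: {key}" for key in REQUIRED_KEYS if key not in shapes]
    if "action" not in shapes:
        return problems
    steps = shapes["action"][0]  # number of timesteps in the episode
    for name, shape in shapes.items():
        if name == "language_raw":
            continue  # one instruction per episode, not one per timestep
        if not shape or shape[0] != steps:
            problems.append(f"{name}: leading dimension does not match {steps} timesteps")
    if not any(name.startswith("observations/images/") for name in shapes):
        problems.append("no camera views under observations/images")
    return problems


# Shapes copied from the example layout above.
example = {
    "action": (100, 10),
    "language_raw": (1,),
    "observations/images/left": (100, 480, 640, 3),
    "observations/images/right": (100, 480, 640, 3),
    "observations/images/wrist": (100, 480, 640, 3),
    "observations/joint_positions": (100, 7),
    "observations/qpos": (100, 7),
    "observations/qvel": (100, 7),
}
print(check_episode_shapes(example))  # prints []
```

With h5py installed, the shape mapping could be collected from a real file by walking it with `File.visititems` and recording `obj.shape` for each `h5py.Dataset`.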

Add an entry to constants.py that specifies the path to your data, as follows.

    'your_task_name':{
        'dataset_dir': DATA_DIR + '/your_task_path', # path to the dataset
        'episode_len': 1000, # maximum length of an episode
        'camera_names': ['front', 'wrist'] # camera names, used as keys when reading image data
    }
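
Because the camera names are used as keys when reading data, each name should match a view stored under `observations/images` in your HDF5 files. The following is a minimal sketch of such a check. `TASK_CONFIGS`, the placeholder value of `DATA_DIR`, and `missing_cameras` are illustrative names for this example and may not match what the repository's constants.py actually defines.

```python
from typing import Dict, Iterable, List

DATA_DIR = "/path/to/data"  # placeholder; set this to your own data root

# Hypothetical container for task entries; the real dictionary name in
# constants.py may differ.
TASK_CONFIGS: Dict[str, dict] = {
    "your_task_name": {
        "dataset_dir": DATA_DIR + "/your_task_path",
        "episode_len": 1000,
        "camera_names": ["front", "wrist"],
    },
}


def missing_cameras(config: dict, available_views: Iterable[str]) -> List[str]:
    """Return configured camera names that have no matching image dataset."""
    available = set(available_views)
    return [name for name in config["camera_names"] if name not in available]


# Views present in the example HDF5 layout shown earlier in this section.
views_in_file = ["left", "right", "wrist"]
print(missing_cameras(TASK_CONFIGS["your_task_name"], views_in_file))  # prints ['front']
```

Here `front` is reported because the example layout stores `left`, `right`, and `wrist` views. Either list the stored view names in `camera_names` or save the images under the configured names.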

Download Pretrained VLM

We construct the VLM backbone by integrating a series of tiny LLMs (Pythia) into the LLaVA framework, following the standard training pipeline and data provided by LLaVA. The weights of all VLMs used in our paper are listed below:

Model	Usage	Link
Llava-Pythia(~400M)	For TinyVLA-S	huggingface
Llava-Pythia(~700M)	For TinyVLA-B	huggingface
Llava-Pythia(~1.3B)	For TinyVLA-H	huggingface

Train

The training script is "scripts/train.sh". You need to change the following parameters:

OUTPUT: the save directory for training. Its name must include the keyword "llava_pythia"; if LoRA training is used, it must also include "lora" (e.g., "llava_pythia_lora").
task_name: the task used for training, which must correspond to "your_task_name" in aloha_scripts/constants.py
model_name_or_path: path to the pretrained VLM weights
Other hyperparameters such as "batch_size" and "save_steps" can be adjusted to fit your compute resources.
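
The naming rules for OUTPUT are easy to get wrong, so a quick check before launching a long run may save time. This is a minimal sketch, assuming the keywords are matched as plain case-sensitive substrings; `check_output_name` is a hypothetical helper and is not part of the repository.

```python
from typing import List


def check_output_name(output_dir: str, use_lora: bool) -> List[str]:
    """Return the naming rules that output_dir violates (empty means it passes)."""
    problems = []
    # Rule 1: the directory name must contain the keyword "llava_pythia".
    if "llava_pythia" not in output_dir:
        problems.append('OUTPUT must contain "llava_pythia"')
    # Rule 2: with LoRA training, the name must also contain "lora".
    if use_lora and "lora" not in output_dir:
        problems.append('LoRA training requires "lora" in OUTPUT')
    return problems


print(check_output_name("/ckpts/llava_pythia_lora_task", use_lora=True))  # prints []
print(check_output_name("/ckpts/tinyvla_run1", use_lora=True))
# prints ['OUTPUT must contain "llava_pythia"', 'LoRA training requires "lora" in OUTPUT']
```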

Start training with the following command:

./scripts/train.sh

Evaluation

Before evaluation, run the provided post-processing script to generate smaller, ready-to-use weights. The script is "scripts/process_ckpts.sh". You need to change the following parameters:

source_dir: path to the trained VLA directory (the same as OUTPUT in train.sh)
target_dir: path where the processed VLA weights are saved

You can refer to our evaluation script eval_real_franka.py.

Acknowledgement

We build our project based on:

LLaVA: an amazing open-source project for vision-language assistants
act-plus-plus: an amazing open-source project for robotics visuomotor learning
Miphi: an amazing open-source project for tiny vision-language models

Citation

If you find TinyVLA useful for your research and applications, please cite it using this BibTeX:

@article{wen2024tinyvla,
    title={Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation},
    author={Wen, Junjie and Zhu, Yichen and Li, Jinming and Zhu, Minjie and Wu, Kun and Xu, Zhiyuan and Liu, Ning and Cheng, Ran and Shen, Chaomin and Peng, Yaxin and others},
    journal={IEEE Robotics and Automation Letters (RA-L)},
    year={2025}
}
