Release Blog  |  Hugging Face Model  |  Deployment (via UI-TARS)  |  Running on your own computer (via UI-TARS Desktop)
GLADOS-1 is the first computer-use agent (CUA) model post-trained on collective, crowd-sourced trajectories from the PANGO dataset.
Heavily inspired by the Qwen-2VL-Finetune repository, this project provides a framework for training vision-language models on GUI interaction data. While this repository provides sample code for post-training ByteDance Seed's UI-TARS-7B-SFT, it can be trivially adapted to any model based on the Qwen2-VL architecture.
The PANGO (Productivity Applications with Natural GUI Observations and trajectories) dataset contains real user interactions with web interfaces, converted into training conversations for multimodal models.
Each session in the PANGO dataset contains:
- Screenshots: GUI state images at different timestamps
- Actions: User interactions (clicks, drags, typing, etc.)
- Metadata: Session IDs, timestamps, and other inputs
The dataset supports various GUI interaction types:
Supported Actions:
- `click` - Single left mouse clicks
- `left_double` - Double left mouse clicks
- `right_single` - Right mouse clicks
- `drag` - Mouse drag operations (converted from `drag_start`/`drag_end` pairs)
- `key_press` - Keyboard key presses
- `input` - Text input actions
- `scroll` - Scroll wheel actions
Ignored Actions:
- `mouseover_start`/`mouseover_end` - Mouse hover events
- `drag_start`/`drag_end` - Individual drag events (converted to a single `drag`)
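Since `drag_start`/`drag_end` arrive as separate events but train as a single `drag`, a converter has to pair them up while dropping hover events. A minimal sketch of that merge (event field names are assumptions):

```python
# Sketch: collapse drag_start/drag_end event pairs into single drag actions
# and drop mouseover events. Field names are assumptions, not the PANGO schema.
def merge_drags(events: list[dict]) -> list[dict]:
    merged, pending_start = [], None
    for ev in events:
        if ev["type"] == "drag_start":
            pending_start = ev
        elif ev["type"] == "drag_end" and pending_start is not None:
            merged.append({
                "type": "drag",
                "start": (pending_start["x"], pending_start["y"]),
                "end": (ev["x"], ev["y"]),
            })
            pending_start = None
        elif ev["type"] not in ("mouseover_start", "mouseover_end"):
            merged.append(ev)  # pass through all other action types unchanged
    return merged

events = [
    {"type": "drag_start", "x": 10, "y": 20},
    {"type": "drag_end", "x": 110, "y": 220},
    {"type": "click", "x": 5, "y": 5},
]
print(merge_drags(events))
```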
Converters transform raw PANGO data into training conversations. Each converter implements a specific training purpose:

Grounding:
- Input: Single screenshot and instruction
- Output: Action prediction
- Use Case: Instruction-following GUI automation

State Transition:
- Input: Before and after screenshots
- Output: Action prediction
- Use Case: Reverse engineering user interactions

Multi-Turn Conversation:
- Input: Conversational history containing screenshots and actions
- Output: Action prediction
- Use Case: Multi-turn conversation training
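Whatever the converter, the output is a list of conversation frames with a `loss_mask` flag (described in the `generate_conversation` docstring later in this README). A hypothetical grounding-style sample might look like the following; the exact prompt and action string formats are assumptions, not the repo's actual templates:

```python
# Hypothetical grounding-style training sample. The prompt wording and the
# "click(x, y)" action format are illustrative assumptions.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "frames/000.png"},
            {"type": "text", "text": "Click the Submit button."},
        ],
        "loss_mask": 0,  # prompt tokens: excluded from the loss
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "click(512, 730)"}],
        "loss_mask": 1,  # target action: trained on
    },
]
```

Masking the user turn with `loss_mask = 0` means the model is only penalized for the action it predicts, not for reproducing the prompt.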
```bash
# Install uv package manager
brew install uv

# Install dependencies
make install

# Train with grounding dataset
make train

# Train with state transition dataset
make train_state_transition
```

During setup, the image_downloader script will download all images to the STORAGE_DIR directory. Estimated storage requirements are 15 GB for the pango-sample dataset and 265 GB for the full pango dataset. Note: the image downloader script has a hardcoded 50 GB storage buffer; adjust it and rebuild if this is an issue.
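To avoid a failed download partway through, it can help to check free space against the dataset size plus the buffer up front. A sketch (the constants mirror the estimates above; the environment variable lookup is an assumption):

```python
import os
import shutil

# Sketch: verify STORAGE_DIR has room before downloading images.
# REQUIRED_GB mirrors the estimates above; BUFFER_GB mirrors the
# downloader's hardcoded 50 GB buffer.
REQUIRED_GB = 15   # pango-sample; use 265 for the full pango dataset
BUFFER_GB = 50

def has_room(storage_dir: str) -> bool:
    free_gb = shutil.disk_usage(storage_dir).free / 1e9
    return free_gb >= REQUIRED_GB + BUFFER_GB

print(has_room(os.environ.get("STORAGE_DIR", ".")))
```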
To create a new converter:
- Inherit from `BasePangoConverter`:

```python
from code.converters.base_pango_converter import BasePangoConverter

class MyConverter(BasePangoConverter):
    def __init__(self, dataset_path: str, prompt: str, **kwargs):
        super().__init__(dataset_path, actions_to_ignore=[...], **kwargs)
        self.prompt = prompt
```

- Implement required methods:
```python
def generate_conversation(self, *args, **kwargs) -> list:
    """Convert actions to training conversation format."""
    # Return a list of conversation frames with:
    # - role: "user" or "assistant"
    # - content: text/image content
    # - loss_mask: 0 (ignore) or 1 (train on)
    pass

def generate_indices(self, n: int, pct_train: float) -> tuple[list, list]:
    """Generate train/test indices for the dataset."""
    # Return (train_indices, test_indices)
    pass
```

- Add action handling (as needed):
```python
def _handle_custom_action(self, action: dict, original_dims, scaled_dims):
    """Handle new action types."""
    # Convert a pango action into the model's action format
    return action_content
```

- Create corresponding dataset class:
```python
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, indices: list[int], converter: MyConverter, processor):
        self.indices = indices
        self.converter = converter
        self.processor = processor

    def __getitem__(self, idx):
        # Convert an index into a training sample
        return {"input_ids": ..., "attention_mask": ..., "labels": ...}
```

Key implementation details:

- Coordinate Scaling: Actions use standardized coordinates (0-1000 range)
- Image Processing: Screenshots are resized and processed using `fetch_image` from `qwen-vl-utils`
- Error Handling: Use `_handle_error()` and `_handle_malformatted_action()` for graceful failures
- Lazy Loading: Images are loaded on demand during training by the `__getitem__` method on the dataset class
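The 0-1000 coordinate convention amounts to a simple rescale from screenshot pixels. A sketch of the round trip (the repo's actual scaling helpers may differ):

```python
# Sketch: map pixel coordinates into the standardized 0-1000 range and back.
# The repo's actual scaling helpers may differ in naming and rounding.
def to_model_coords(x: int, y: int, width: int, height: int) -> tuple[int, int]:
    return round(x / width * 1000), round(y / height * 1000)

def to_pixel_coords(mx: int, my: int, width: int, height: int) -> tuple[int, int]:
    return round(mx / 1000 * width), round(my / 1000 * height)

# A click at (960, 540) on a 1920x1080 screenshot lands at the center:
print(to_model_coords(960, 540, 1920, 1080))  # -> (500, 500)
```

Standardizing coordinates this way makes action predictions independent of the original screenshot resolution.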
```
code/
├── converters/     # Data conversion logic
├── datasets/       # PyTorch dataset implementations
├── training/       # Training scripts and utilities
├── train.py        # Main training entry point
├── utils.py        # Utility functions
├── consts.py       # Constants
├── exceptions.py   # Custom exceptions
└── tests/          # Test files
```
```bibtex
@misc{chakralabs2025glados-1,
  author = {Chakra Labs},
  title = {GLADOS-1},
  url = {https://github.com/Chakra-Network/GLADOS-1},
  year = {2025}
}
```