- Introduction
- Installation
- Supported Tasks
- Supported Models
- Benchmark Results
- Data Conversion
- Advanced Usage
- Citation
We introduce GUI-360°, a large-scale, comprehensive dataset and benchmark suite designed to advance computer-using agents (CUAs). CUAs present unique challenges and are constrained by three persistent gaps:
- Scarcity of real-world CUA tasks
- Lack of automated collection-and-annotation pipelines for multi-modal trajectories
- Absence of a unified benchmark that jointly evaluates GUI grounding, screen parsing, and action prediction
GUI-360° addresses these gaps with a large-scale automated pipeline for:
- Query sourcing
- Environment-template construction
- Task instantiation
- Batched execution
- LLM-driven quality filtering
The released corpus contains:
- 1.2M+ executed action steps across thousands of trajectories
- Popular Windows office applications (Word, Excel, PowerPoint)
- Full-resolution screenshots
- Accessibility metadata (when available)
- Instantiated goals and reasoning traces
- Both successful and failed trajectories
| Task | Description | Input | Output |
|---|---|---|---|
| GUI Grounding | Locate UI elements by text | Screenshot + description | Coordinates |
| Screen Parsing | Extract UI control information | Screenshot | Structured elements |
| Action Prediction | Predict next action | Screenshot + goal + history | Action + args |
The dataset supports a hybrid GUI+API action space that reflects modern agent designs. Benchmarking state-of-the-art vision-language models on GUI-360° reveals substantial out-of-the-box shortcomings in grounding and action prediction; supervised fine-tuning yields significant gains.
First, clone the repository and install the required dependencies:
# Clone the repository
git clone <repository_url>
cd GUI360°
# Install Python dependencies
pip install -r requirements.txt
Key dependencies include:
| Package | Version | Purpose |
|---|---|---|
| PyTorch | >=2.0.0, <2.5.0 | Deep learning framework |
| Transformers | >=4.37.0 | Model inference |
| Pillow | >=8.0.0 | Image processing |
| Azure Storage SDK | latest | Data access |
| Sentence-Transformers | latest | Screen parsing evaluation |
| NumPy, einops, etc. | latest | ML utilities |
Download the GUI-360° dataset (use the test folder for evaluation) and organize it in the following structure:
test/
├── data/
│   └── <domain>/            # e.g., word, excel, ppt
│       └── <category>/
│           └── success/
│               └── *.jsonl  # Trajectory files
└── image/
    └── <domain>/
        └── <category>/
            └── success/
                └── *.png    # Screenshots
For detailed dataset structure, see GUI-360 Dataset.
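As a rough illustration of how the directory layout above could be traversed, the following sketch walks `data/<domain>/<category>/success/*.jsonl` and yields one parsed JSON object per line. The function name `iter_trajectory_steps` is hypothetical, and no assumption is made here about which fields each step contains.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_trajectory_steps(root_dir: str) -> Iterator[tuple[str, str, dict]]:
    """Yield (domain, category, step) for every JSONL line under data/."""
    data_dir = Path(root_dir) / "data"
    # Layout assumed from the tree above: data/<domain>/<category>/success/*.jsonl
    for path in sorted(data_dir.glob("*/*/success/*.jsonl")):
        # parts[-1] is the file, parts[-2] is "success"
        domain, category = path.parts[-4], path.parts[-3]
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines
                    yield domain, category, json.loads(line)
```

Calling `list(iter_trajectory_steps("./test"))` would load every step into memory; iterating lazily is usually preferable for a corpus of this size.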
If you're using a custom model deployment, set up your model API endpoint. The framework supports:
- OpenAI API-compatible endpoints (for GPT models)
- Custom model servers (for open-source models)
Example: Deploy Qwen2.5-VL-7B
# Deploy your model server using vLLM
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-VL-7B-Instruct \
--port 19806
For GPT models, configure your OpenAI API credentials by setting environment variables:
# Required: Your OpenAI API key
export OPENAI_API_KEY="sk-your_openai_api_key_here"
# Optional: Custom API base URL (defaults to OpenAI official API)
export OPENAI_BASE_URL="https://api.openai.com/v1"
Verify Configuration:
# Test your API configuration
python -c "
import os
from openai import OpenAI
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
response = client.chat.completions.create(
model='gpt-4o-mini',
messages=[{'role': 'user', 'content': 'Hello!'}],
max_tokens=10
)
print('OpenAI API configuration successful!')
print(f'Response: {response.choices[0].message.content}')
"
Run the evaluation framework with the desired task and model:
GUI Grounding with GPT-4o
python evaluation.py \
--root_dir ./test \
--type grounding \
--model_type gpt \
--model_name gpt-4o \
--threads 5 \
--output_dir results/grounding
Action Prediction with Qwen2.5-VL-7B
python evaluation.py \
--root_dir ./test \
--type action_prediction \
--model_type qwen2.5_vl_7b \
--model_name Qwen/Qwen2.5-VL-7B-Instruct \
--api_url http://localhost:19806/v1 \
--threads 5 \
--output_dir results/action_prediction
Screen Parsing
python evaluation.py \
--root_dir ./test \
--type screen_parsing \
--model_type gpt \
--model_name gpt-4o \
--threads 5 \
--output_dir results/screen_parsing
If evaluation is interrupted, you can resume from error cases:
python evaluation.py \
--root_dir ./test \
--type action_prediction \
--model_type gpt \
--model_name gpt-4o \
--resume_from results/evaluation_results_20241231_120000.json \
--threads 5
| Argument | Type | Default | Description |
|---|---|---|---|
| --root_dir | str | required | Path to dataset root directory |
| --type | str | grounding | Evaluation type: grounding, action_prediction, action_prediction_a11y, screen_parsing |
| --model_type | str | gpt | Model type: qwen2.5_vl_7b, gpt, gui_actor, uground, ui_tars, aguvis, omniparser, mock |
| --model_name | str | varies | Model name or path |
| --api_url | str | None | API URL for model inference (e.g., http://localhost:19806/v1) |
| --max_samples | int | None | Maximum number of samples to evaluate (default: all) |
| --threads | int | 5 | Number of threads for parallel evaluation |
| --output_dir | str | results | Output directory for results |
| --resume_from | str | None | Path to previous results file to resume from |
| --no_save | flag | False | Do not save detailed results |
| --log_level | str | INFO | Logging level: DEBUG, INFO, WARNING, ERROR |
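To make the argument table concrete, here is a minimal `argparse` sketch that mirrors the documented names, types, and defaults. It is an illustration only: the actual parser in `evaluation.py` is not shown in this document and may differ, and `build_parser` is a hypothetical name. The default for `--model_name` is listed as "varies", so it is left as `None` here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser matching the argument table (not the real one)."""
    parser = argparse.ArgumentParser(description="GUI-360 evaluation (illustrative)")
    parser.add_argument("--root_dir", type=str, required=True)
    parser.add_argument("--type", type=str, default="grounding",
                        choices=["grounding", "action_prediction",
                                 "action_prediction_a11y", "screen_parsing"])
    parser.add_argument("--model_type", type=str, default="gpt")
    parser.add_argument("--model_name", type=str, default=None)  # "varies" in the table
    parser.add_argument("--api_url", type=str, default=None)
    parser.add_argument("--max_samples", type=int, default=None)
    parser.add_argument("--threads", type=int, default=5)
    parser.add_argument("--output_dir", type=str, default="results")
    parser.add_argument("--resume_from", type=str, default=None)
    parser.add_argument("--no_save", action="store_true")
    parser.add_argument("--log_level", type=str, default="INFO",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR"])
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--root_dir", "./test", "--threads", "8"])
```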
Objective: Locate the precise coordinates of a UI element on screen based on a textual description.
Input:
- screenshot_clean: Path to the screenshot image (PNG)
- thought: Text description of the target UI element
- resolution: Screen resolution as (width, height)
Output:
- coordinates: An [x, y] tuple representing pixel coordinates of the target element
| Metric | Description |
|---|---|
| Success Rate | % of samples where predicted coordinates fall within ground truth bounding box |
| Avg. Execution Time | Mean time per sample (seconds) |
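The success criterion in the table can be sketched as a point-in-box test. This is a hedged illustration, not the framework's code: it assumes the ground-truth box uses the `[left, top, right, bottom]` convention and that points on the box edge count as hits. Both function names are hypothetical.

```python
from typing import Sequence


def is_grounding_hit(point: Sequence[float], rect: Sequence[float]) -> bool:
    """Return True if (x, y) lies inside [left, top, right, bottom], edges included."""
    x, y = point
    left, top, right, bottom = rect
    return left <= x <= right and top <= y <= bottom


def success_rate(predictions, ground_truth_rects) -> float:
    """Percentage of predictions that land inside their ground-truth box."""
    if not predictions:
        return 0.0
    hits = sum(is_grounding_hit(p, r) for p, r in zip(predictions, ground_truth_rects))
    return 100.0 * hits / len(predictions)
```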
{
"thoughts": "The user wants to click the 'Bold' button in the toolbar",
"coordinates": [450, 120]
}
Objective: Extract structured control information from a screenshot, identifying all UI elements and their properties.
Input:
- screenshot_clean: Path to the screenshot image (PNG)
- resolution: Screen resolution as (width, height)
Output:
- control_infos: A list of dictionaries, each containing:
  - control_text: Text content of the control
  - control_rect: Bounding box as [left, top, right, bottom]
  - control_type: Type of control (optional)
| Metric | Description |
|---|---|
| Recall | Ratio of correctly identified controls to total ground truth controls |
| Precision | Ratio of correctly identified controls to total predicted controls |
| F1 Score | Harmonic mean of recall and precision |
| Text Similarity | Average semantic similarity of control text (using Sentence-BERT) |
| IoU Accuracy | Average Intersection over Union for matched controls |
| Avg. Execution Time | Mean time per sample (seconds) |
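The box-based metrics above can be sketched as follows. The matching rule is an assumption made for illustration: each predicted box is greedily paired with the best unmatched ground-truth box and counts as correct when their IoU reaches a hypothetical threshold of 0.5. The framework's actual matching procedure is not described here, and the Sentence-BERT text similarity is omitted.

```python
def iou(a, b):
    """Intersection over Union of two [left, top, right, bottom] boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_controls(pred_rects, gt_rects, iou_threshold=0.5):
    """Greedy one-to-one matching; returns (recall, precision, f1, mean_iou)."""
    unmatched_gt = list(range(len(gt_rects)))
    matched_ious = []
    for pred in pred_rects:
        best_idx, best_iou = None, 0.0
        for idx in unmatched_gt:
            score = iou(pred, gt_rects[idx])
            if score > best_iou:
                best_idx, best_iou = idx, score
        # Assumed rule: a match needs IoU at or above the threshold.
        if best_idx is not None and best_iou >= iou_threshold:
            unmatched_gt.remove(best_idx)
            matched_ious.append(best_iou)
    tp = len(matched_ious)
    recall = tp / len(gt_rects) if gt_rects else 0.0
    precision = tp / len(pred_rects) if pred_rects else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    mean_iou = sum(matched_ious) / tp if tp else 0.0
    return recall, precision, f1, mean_iou
```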
[
{
"control_text": "Save",
"control_rect": [100, 50, 150, 80],
"control_type": "Button"
},
{
"control_text": "Document Title",
"control_rect": [200, 100, 600, 130],
"control_type": "TextBox"
}
]
Objective: Predict the next action to take given a user instruction, current screenshot, and action history.
Input:
- screenshot_clean: Path to the screenshot image (PNG)
- request: User's high-level goal/instruction
- previous_actions: List of previous action descriptions
- resolution: Screen resolution as (width, height)
Output:
A structured action with:
- function: Action type (click, type, drag, scroll, hotkey, wait)
- args: Arguments for the action (varies by function)
- status: Execution status (CONTINUE, FINISH)
Supported Actions:
| Function | Arguments | Example |
|---|---|---|
| click | coordinate, button | {"coordinate": [x, y], "button": "left"} |
| type | coordinate, keys | {"coordinate": [x, y], "keys": "text"} |
| drag | start_coordinate, end_coordinate | {"start_coordinate": [x1, y1], "end_coordinate": [x2, y2]} |
| scroll | coordinate, scroll_direction, scroll_amount | {"coordinate": [x, y], "scroll_direction": "down", "scroll_amount": 3} |
| hotkey | keys | {"keys": "ctrl+c"} |
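One way to use the table is as a schema for sanity-checking model output before scoring. The sketch below treats every listed argument as required, which is an assumption. `wait` appears in the function list but not in the table, so it is assumed here to take no arguments. `REQUIRED_ARGS` and `missing_args` are hypothetical names.

```python
REQUIRED_ARGS = {
    "click": {"coordinate", "button"},
    "type": {"coordinate", "keys"},
    "drag": {"start_coordinate", "end_coordinate"},
    "scroll": {"coordinate", "scroll_direction", "scroll_amount"},
    "hotkey": {"keys"},
    "wait": set(),  # assumption: not in the table, treated as argument-free
}


def missing_args(function: str, args: dict) -> set:
    """Return the argument names from the table that are absent in args."""
    if function not in REQUIRED_ARGS:
        raise ValueError(f"unknown function: {function}")
    return REQUIRED_ARGS[function] - set(args)
```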
| Metric | Description |
|---|---|
| Success Rate | % of samples where function, arguments, and status all match |
| Function Match Rate | % of correct function predictions |
| Args Match Rate | % of correct argument predictions (with coordinate tolerance) |
| Status Match Rate | % of correct status predictions |
| Avg. Execution Time | Mean time per sample (seconds) |
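The per-sample checks behind these metrics might look like the sketch below. The table only states that arguments are compared "with coordinate tolerance"; the tolerance used here (a Euclidean distance of 25 pixels) and the rule that any key ending in `coordinate` is compared by distance are assumptions for illustration, and the function names are hypothetical.

```python
import math


def coords_close(a, b, tolerance: float = 25.0) -> bool:
    """True if two [x, y] points are within `tolerance` pixels of each other."""
    return math.dist(a, b) <= tolerance


def args_match(pred: dict, truth: dict, tolerance: float = 25.0) -> bool:
    """Compare argument dicts; coordinate-like keys use the distance tolerance."""
    if pred.keys() != truth.keys():
        return False
    for key, expected in truth.items():
        if key.endswith("coordinate"):
            if not coords_close(pred[key], expected, tolerance):
                return False
        elif pred[key] != expected:
            return False
    return True


def score_action(pred: dict, truth: dict, tolerance: float = 25.0) -> dict:
    """Return the function/args/status flags and the combined success flag."""
    function_ok = pred.get("function") == truth.get("function")
    status_ok = pred.get("status") == truth.get("status")
    args_ok = function_ok and args_match(
        pred.get("args", {}), truth.get("args", {}), tolerance
    )
    return {
        "function": function_ok,
        "args": args_ok,
        "status": status_ok,
        "success": function_ok and args_ok and status_ok,
    }
```

Averaging each flag over all samples would give the four rates listed in the table.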
{
"thoughts": "I need to click the Bold button to make the text bold",
"tool_call": {
"function": "click",
"args": {
"coordinate": [450, 120],
"button": "left"
},
"status": "CONTINUE"
}
}
Objective: Similar to action prediction, but leverages accessibility tree information for more informed decisions.
Input:
- Same as Action Prediction, plus:
  - control_infos: Structured accessibility information (UI element metadata)
Output:
- Same as Action Prediction
Metrics:
- Same as Action Prediction
The framework supports multiple vision-language models through a unified interface. Each model can be evaluated on different tasks depending on its capabilities.
| Model | Grounding | Screen Parsing | Action Prediction | A11y Action |
|---|---|---|---|---|
| GPT-4o | β | β | β | β |
| GPT-4V | β | β | β | β |
| Qwen2.5-VL-7B | β | β | β | β |
| GUI-Actor | β | β | β | β |
| UGround | β | β | β | β |
| UI-TARS | β | β | β | β |
| Aguvis-7B-720P | β | β | β | β |
| OmniParser | β | β | β | β |
Below are the benchmark results for various models on the GUI-360° dataset.
| Model | Success Rate (%) |
|---|---|
| GPT-4o | 9.38 |
| GPT-4.1 | 11.44 |
| Qwen2.5-VL-7B | 35.78 |
| GUI-Actor | 54.50 |
| UGround-7B | 53.85 |
| UI-TARS-1.5 7B | 62.27 |
| Aguvis-7B | 50.50 |
| Qwen2.5-VL-7B-SFT | 82.30 |
| UI-TARS-1.5 7B-SFT | 82.49 |
Metric Definitions
- Success Rate: Percentage of samples where predicted coordinates fall within ground truth bounding box
| Model | Recall (%) | Precision (%) | F1 Score (%) | Text Sim. (%) | IoU Acc. (%) |
|---|---|---|---|---|---|
| GPT-4o | 1.4 | 3.4 | 1.9 | 14.7 | 22.9 |
| GPT-4.1 | 5.7 | 9.8 | 6.7 | 30.6 | 50.5 |
| o3 | 11.4 | 16.0 | 12.8 | 45.6 | 57.8 |
| GPT-5 | 8.0 | 11.1 | 8.9 | 30.4 | 56.9 |
| Qwen2.5-VL-7B | 1.0 | 18.1 | 1.5 | 11.3 | 21.1 |
| OmniParser | 45.9 | 41.1 | 40.6 | 56.5 | 73.1 |
| OmniParser v2 | 46.2 | 41.3 | 40.8 | 56.8 | 73.5 |
Metric Definitions
- Recall: Ratio of correctly identified controls to total ground truth controls
- Precision: Ratio of correctly identified controls to total predicted controls
- F1 Score: Harmonic mean of recall and precision
- Text Similarity: Average semantic similarity of control text (using Sentence-BERT)
- IoU Accuracy: Average Intersection over Union for matched controls
| Model | Success Rate (%) |
|---|---|
| GPT-4o | 3.12 |
| GPT-4.1 | 2.82 |
| GPT-o3 | 17.92 |
| GPT-5 | 8.59 |
| Qwen2.5-VL-7B | 17.52 |
| Qwen2.5-VL-7B-SFT | 50.08 |
Metric Definitions
- Success Rate: Percentage of samples where the predicted action matches the ground truth action (considering function, arguments, and execution status)
| Model | Success Rate (%) |
|---|---|
| GPT-4o | 36.71 |
| GPT-4.1 | 39.19 |
| GPT-o3 | 46.72 |
| GPT-5 | 34.86 |
| Qwen2.5-VL-7B | 14.18 |
| Qwen2.5-VL-7B-SFT | 25.78 |
Metric Definitions
- Success Rate: Same metric as Visual-only Action Prediction
- A11y Advantage: This variant provides accessibility tree information as additional input. GPT models improve substantially with A11y information (for example, GPT-4o rises from 3.12% to 36.71%), while Qwen2.5-VL-7B-SFT performs better in visual-only mode (50.08% versus 25.78%).
Step 1: Configure API Access
Set your OpenAI API credentials:
# Required: Your OpenAI API key
export OPENAI_API_KEY="sk-your_openai_api_key_here"
# Optional: Custom API base URL (defaults to OpenAI official API)
export OPENAI_BASE_URL="https://api.openai.com/v1"
Step 2: Run Evaluation Commands
# GPT-4o (recommended for best performance)
python evaluation.py \
--root_dir /path/to/dataset \
--type grounding \
--model_type gpt \
--model_name gpt-4o \
--threads 5
# GPT-4o-mini (faster and cheaper)
python evaluation.py \
--root_dir /path/to/dataset \
--type action_prediction \
--model_type gpt \
--model_name gpt-4o-mini \
--threads 10
Requirements:
- OpenAI API key configured via environment variable OPENAI_API_KEY
- Internet connection for API access
- Valid OpenAI subscription with sufficient credits
Step 1: Deploy Model Server
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-VL-7B-Instruct \
--port 19806
Step 2: Run Evaluation
python evaluation.py \
--root_dir /path/to/dataset \
--type action_prediction \
--model_type qwen2.5_vl_7b \
--model_name Qwen/Qwen2.5-VL-7B-Instruct \
--api_url http://localhost:19806/v1 \
--threads 5
Requirements:
- vLLM or compatible inference server
- GPU with sufficient memory (recommended: 16GB+ VRAM)
Important: Before using OmniParser, you need to configure the model paths in evaluation.py.
Step 1: Edit Configuration in evaluation.py (lines 136-144)
config = {
'som_model_path': '../../weights/icon_detect/model.pt', # Update this path
'caption_model_name': 'florence2',
'caption_model_path': '../../weights/icon_caption_florence', # Update this path
'device': 'cuda', # Or 'cpu' if no GPU available
'BOX_TRESHOLD': 0.05,
'host': host,
'port': port
}
Step 2: Download OmniParser Model Weights
- SOM (Set-of-Mark) model: Place at the path specified in som_model_path
- Caption model (Florence-2): Place at the path specified in caption_model_path
Step 3: Deploy the OmniParser API Server
python omniparser_server.py \
--som_model_path /path/to/weights/icon_detect/model.pt \
--caption_model_path /path/to/weights/icon_caption_florence \
--host 0.0.0.0 \
--port 7861
Step 4: Run Evaluation
python evaluation.py \
--root_dir /path/to/dataset \
--type screen_parsing \
--model_type omniparser \
--model_name omniparser-screen-parsing \
--api_url http://localhost:7861 \
--threads 5
Requirements:
- OmniParser model weights (SOM and Florence-2 caption model)
- OmniParser API server running
- Model paths configured in the evaluation.py ModelFactory class
After evaluation completes, results are saved to the specified output directory:
results/
├── evaluation_results_<timestamp>.json  # Detailed results for all samples
└── evaluation_summary_<timestamp>.json  # Summary statistics
Summary Statistics Include:
- Total samples evaluated
- Success count and rate
- Error count and rate
- Average execution time
- Domain and category breakdowns
- Task-specific metrics (e.g., recall, precision for screen parsing)
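For quick inspection, a summary file can be loaded and reduced to a per-domain report. The sketch below assumes the key names used in this section's sample summary (`total_samples`, `success_rate`, `domain_stats` with `total` and `success`); the helper names are hypothetical and other keys are ignored.

```python
import json


def load_summary(path: str) -> dict:
    """Read an evaluation_summary_<timestamp>.json file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def domain_report(summary: dict) -> list[str]:
    """Format the overall rate plus one line per domain."""
    lines = [
        f"overall: {summary['success_rate']:.1f}% of {summary['total_samples']} samples"
    ]
    for domain, stats in sorted(summary.get("domain_stats", {}).items()):
        # Recompute the rate from counts rather than trusting the stored value.
        rate = 100.0 * stats["success"] / stats["total"] if stats["total"] else 0.0
        lines.append(f"{domain}: {stats['success']}/{stats['total']} ({rate:.1f}%)")
    return lines
```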
{
"total_samples": 1000,
"success_count": 850,
"success_rate": 85.0,
"error_count": 10,
"error_rate": 1.0,
"avg_execution_time": 2.35,
"model_name": "gpt-4o",
"model_type": "gpt",
"evaluation_time": "2024-12-31 12:00:00",
"domain_stats": {
"word": {
"total": 400,
"success": 340,
"success_rate": 85.0
},
"excel": {
"total": 300,
"success": 255,
"success_rate": 85.0
},
"ppt": {
"total": 300,
"success": 255,
"success_rate": 85.0
}
}
}
We provide tools to convert the GUI-360° raw dataset into task-specific training formats suitable for vision-language model fine-tuning.
| Task | Description | Use Case |
|---|---|---|
| Action Prediction | Multi-turn conversations with action sequences | Train agents to predict next GUI actions |
| Action Prediction + A11y | Action prediction with accessibility information | Leverage accessibility trees for better performance |
| Screen Parsing | Extract UI elements from screenshots | Train models to understand GUI structure |
| GUI Grounding | Locate UI elements by text descriptions | Train models for GUI element localization |
# Navigate to converter directory
cd convertor
# Convert to action prediction format with image optimization
python convert_to_train.py \
--root_dir /path/to/GUI360_dataset \
--output_dir /path/to/training_data \
--type action_prediction \
--resize \
--max_pixels 500000
# Convert to screen parsing format
python convert_to_train.py \
--root_dir /path/to/GUI360_dataset \
--output_dir /path/to/training_data \
--type screen_parsing \
--resize
For detailed instructions, examples, and advanced options, see convertor/README.md
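The `--resize` and `--max_pixels` options in the commands above suggest that screenshots are downscaled to a pixel budget. The converter's actual resizing logic is not shown in this document, so the following is only a sketch of one plausible approach: scale both sides by the same factor so the area stays at or below the budget. The function name `fit_within_max_pixels` is hypothetical.

```python
import math


def fit_within_max_pixels(width: int, height: int, max_pixels: int = 500_000) -> tuple[int, int]:
    """Return a size with the same aspect ratio and at most max_pixels pixels."""
    if width * height <= max_pixels:
        return width, height  # already within budget, leave untouched
    # Area scales with the square of the linear factor, hence the sqrt.
    scale = math.sqrt(max_pixels / (width * height))
    new_w = max(1, math.floor(width * scale))
    new_h = max(1, math.floor(height * scale))
    return new_w, new_h
```

Note that if images are resized this way, any pixel coordinates in the annotations would need to be scaled by the same factor.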
To integrate a custom model:
- Create a new model class in models/ inheriting from BaseModel
- Implement required methods:
  - predict(system_prompt, user_prompt, image_path, ...)
  - construct_<task>_prompt(...) for each supported task
  - parse_<task>(response) for each supported task
- Add model factory entry in evaluation.py:
@staticmethod
def _load_custom_model(model_name: str, api_url: str = None):
    from models.custom_model import CustomModel
    return CustomModel(model_name=model_name, api_url=api_url)
- Update the create_model method to include the new model type
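The steps above can be sketched for the grounding task as follows. This is a hedged skeleton, not the framework's interface: the real `BaseModel` signature is not shown in this document, so a stand-in class is defined to keep the example runnable, and the prompt wording, the stubbed `predict`, and the exact method parameters are assumptions.

```python
import json
from typing import Optional


class BaseModel:
    """Stand-in for the framework's BaseModel so this sketch runs on its own."""

    def __init__(self, model_name: str, api_url: Optional[str] = None):
        self.model_name = model_name
        self.api_url = api_url


class CustomModel(BaseModel):
    def predict(self, system_prompt: str, user_prompt: str, image_path: str) -> str:
        # Stub: replace with a real call to your model server at self.api_url.
        return '{"thoughts": "stub", "coordinates": [0, 0]}'

    def construct_grounding_prompt(self, thought: str, resolution: tuple) -> tuple:
        """Build (system_prompt, user_prompt) for the grounding task."""
        system_prompt = "Return JSON with 'thoughts' and 'coordinates' ([x, y])."
        user_prompt = f"Target: {thought}\nResolution: {resolution[0]}x{resolution[1]}"
        return system_prompt, user_prompt

    def parse_grounding(self, response: str) -> Optional[list]:
        """Extract [x, y] from a JSON response, or None if it is malformed."""
        try:
            coords = json.loads(response).get("coordinates")
        except (json.JSONDecodeError, AttributeError):
            return None
        if isinstance(coords, list) and len(coords) == 2:
            return coords
        return None
```

Returning `None` for unparseable output lets the caller count the sample as an error instead of crashing the run.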
The framework supports multi-threaded evaluation for faster processing:
python evaluation.py \
--root_dir /path/to/dataset \
--type action_prediction \
--model_type gpt \
--model_name gpt-4o \
--threads 20  # Increase for faster evaluation
Choose the thread count based on:
- API rate limits
- Available system resources
- Model server capacity
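Thread-based parallelism suits this workload because each sample mostly waits on a model API call. The sketch below shows the general pattern with `concurrent.futures`; it is not the framework's implementation, and `evaluate_parallel` and `evaluate_one` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def evaluate_parallel(samples, evaluate_one, threads: int = 5):
    """Run evaluate_one on every sample using a pool of worker threads."""
    results = [None] * len(samples)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Map each future back to its sample index so order is preserved.
        futures = {pool.submit(evaluate_one, s): i for i, s in enumerate(samples)}
        for future in as_completed(futures):
            idx = futures[future]
            try:
                results[idx] = {"success": bool(future.result()), "error": None}
            except Exception as exc:  # record the failure, keep the run going
                results[idx] = {"success": False, "error": str(exc)}
    return results
```

Capturing exceptions per sample keeps one failed request from aborting the batch, which fits with resuming from error cases later.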
To evaluate on a limited number of samples (useful for quick testing):
python evaluation.py \
--root_dir /path/to/dataset \
--type grounding \
--model_type gpt \
--model_name gpt-4o \
--max_samples 100  # Only evaluate 100 samples
If you use GUI-360° in your research, please cite:
@article{mu2025gui,
title={GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using Agents},
author={Mu, Jian and Zhang, Chaoyun and Ni, Chiming and Wang, Lu and Qiao, Bo and Mathur, Kartik and Wu, Qianhui and Xie, Yuhang and Ma, Xiaojun and Zhou, Mengyu and others},
journal={arXiv preprint arXiv:2511.04307},
year={2025}
}
This project is released under the MIT License. See the LICENSE file for details.
