Skip to content

2020-qqtcg/GUI-360

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

10 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

🌐 GUI-360°

A Comprehensive Dataset And Benchmark For Computer-Using Agents

Paper Dataset License

πŸ“„ Paper β€’ πŸ€— Dataset


πŸ“‹ Table of Contents


GUI-360Β° Pipeline

🎯 Introduction

We introduce GUI-360Β°, a large-scale, comprehensive dataset and benchmark suite designed to advance computer-using agents (CUAs). CUAs present unique challenges and are constrained by three persistent gaps:

  1. πŸ” Scarcity of real-world CUA tasks
  2. πŸ”„ Lack of automated collection-and-annotation pipelines for multi-modal trajectories
  3. πŸ“Š Absence of unified benchmark that jointly evaluates GUI grounding, screen parsing, and action prediction

🌟 Key Features

GUI-360Β° addresses these gaps with a large-scale automated pipeline for:

  • ✨ Query sourcing
  • πŸ—οΈ Environment-template construction
  • πŸ”§ Task instantiation
  • ⚑ Batched execution
  • πŸ€– LLM-driven quality filtering

πŸ“¦ Dataset Statistics

The released corpus contains:

  • πŸ“Š 1.2M+ executed action steps across thousands of trajectories
  • πŸ’» Popular Windows office applications (Word, Excel, PowerPoint)
  • πŸ–ΌοΈ Full-resolution screenshots
  • β™Ώ Accessibility metadata (when available)
  • 🎯 Instantiated goals and reasoning traces
  • βœ… Both successful and failed trajectories

🎨 Supported Tasks

Task Description Input Output
🎯 GUI Grounding Locate UI elements by text Screenshot + description Coordinates
πŸ” Screen Parsing Extract UI control information Screenshot Structured elements
πŸ€– Action Prediction Predict next action Screenshot + goal + history Action + args

The dataset supports a hybrid GUI+API action space that reflects modern agent designs. Benchmarking state-of-the-art vision-language models on GUI-360Β° reveals substantial out-of-the-box shortcomings in grounding and action prediction; supervised fine-tuning yields significant gains.


πŸš€ Installation and Quick Start

1. πŸ“₯ Install Dependencies

First, clone the repository and install the required dependencies:

# Clone the repository
git clone <repository_url>
cd GUI360Β°

# Install Python dependencies
pip install -r requirements.txt

πŸ“¦ Key dependencies include:

Package Version Purpose
PyTorch >=2.0.0, <2.5.0 Deep learning framework
Transformers >=4.37.0 Model inference
Pillow >=8.0.0 Image processing
Azure Storage SDK latest Data access
Sentence-Transformers latest Screen parsing evaluation
NumPy, einops, etc. latest ML utilities

2. πŸ“ Prepare the Dataset

Download the GUI-360Β° dataset (use test folder for evaluation) and organize it in the following structure:

test /
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ <domain>/                # e.g., word, excel, ppt
β”‚   β”‚   β”œβ”€β”€ <category>/
β”‚   β”‚   β”‚   └── success/
β”‚   β”‚   β”‚       └── *.jsonl      # Trajectory files
└── image/
    β”œβ”€β”€ <domain>/
    β”‚   β”œβ”€β”€ <category>/
    β”‚   β”‚   └── success/
    β”‚   β”‚       └── *.png          # Screenshots

For detailed dataset structure, see GUI-360 Dataset.

3. πŸ”§ Deploy Your Model (Optional)

If you're using a custom model deployment, set up your model API endpoint. The framework supports:

  • βœ… OpenAI API-compatible endpoints (for GPT models)
  • βœ… Custom model servers (for open-source models)

Example: Deploy Qwen2.5-VL-7B

# Deploy your model server using vLLM
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-VL-7B-Instruct \
    --port 19806

4. πŸ”‘ Configure API Access

For GPT models, configure your OpenAI API credentials by setting environment variables:

# Required: Your OpenAI API key
export OPENAI_API_KEY="sk-your_openai_api_key_here"

# Optional: Custom API base URL (defaults to OpenAI official API)
export OPENAI_BASE_URL="https://api.openai.com/v1"

πŸ” Verify Configuration:

# Test your API configuration
python -c "
import os
from openai import OpenAI
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
response = client.chat.completions.create(
    model='gpt-4o-mini',
    messages=[{'role': 'user', 'content': 'Hello!'}],
    max_tokens=10
)
print('βœ… OpenAI API configuration successful!')
print(f'Response: {response.choices[0].message.content}')
"

5. ▢️ Run Evaluation

Run the evaluation framework with the desired task and model:

πŸ“ GUI Grounding with GPT-4o

python evaluation.py \
    --root_dir ./test \
    --type grounding \
    --model_type gpt \
    --model_name gpt-4o \
    --threads 5 \
    --output_dir results/grounding

πŸ€– Action Prediction with Qwen2.5-VL-7B

python evaluation.py \
    --root_dir ./test \
    --type action_prediction \
    --model_type qwen2.5_vl_7b \
    --model_name Qwen/Qwen2.5-VL-7B-Instruct \
    --api_url http://localhost:19806/v1 \
    --threads 5 \
    --output_dir results/action_prediction

πŸ” Screen Parsing

python evaluation.py \
    --root_dir ./test \
    --type screen_parsing \
    --model_type gpt \
    --model_name gpt-4o \
    --threads 5 \
    --output_dir results/screen_parsing

6. πŸ”„ Resume from Previous Results

If evaluation is interrupted, you can resume from error cases:

python evaluation.py \
    --root_dir ./test \
    --type action_prediction \
    --model_type gpt \
    --model_name gpt-4o \
    --resume_from results/evaluation_results_20241231_120000.json \
    --threads 5

βš™οΈ Evaluation Configuration

πŸŽ›οΈ Command-Line Arguments

Argument Type Default Description
--root_dir str required Path to dataset root directory
--type str grounding Evaluation type: grounding, action_prediction, action_prediction_a11y, screen_parsing
--model_type str gpt Model type: qwen2.5_vl_7b, gpt, gui_actor, uground, ui_tars, aguvis, omniparser, mock
--model_name str varies Model name or path
--api_url str None API URL for model inference (e.g., http://localhost:19806/v1)
--max_samples int None Maximum number of samples to evaluate (default: all)
--threads int 5 Number of threads for parallel evaluation
--output_dir str results Output directory for results
--resume_from str None Path to previous results file to resume from
--no_save flag False Do not save detailed results
--log_level str INFO Logging level: DEBUG, INFO, WARNING, ERROR

🎯 Supported Tasks

1. 🎯 GUI Grounding

Objective: Locate the precise coordinates of a UI element on screen based on a textual description.

πŸ“₯ Input

  • screenshot_clean: Path to the screenshot image (PNG)
  • thought: Text description of the target UI element
  • resolution: Screen resolution as (width, height)

πŸ“€ Expected Output

  • coordinates: A [x, y] tuple representing pixel coordinates of the target element

πŸ“Š Evaluation Metrics

Metric Description
Success Rate % of samples where predicted coordinates fall within ground truth bounding box
Avg. Execution Time Mean time per sample (seconds)

πŸ’‘ Example

{
  "thoughts": "The user wants to click the 'Bold' button in the toolbar",
  "coordinates": [450, 120]
}

2. πŸ” Screen Parsing

Objective: Extract structured control information from a screenshot, identifying all UI elements and their properties.

πŸ“₯ Input

  • screenshot_clean: Path to the screenshot image (PNG)
  • resolution: Screen resolution as (width, height)

πŸ“€ Expected Output

  • control_infos: A list of dictionaries, each containing:
    • control_text: Text content of the control
    • control_rect: Bounding box as [left, top, right, bottom]
    • control_type: Type of control (optional)

πŸ“Š Evaluation Metrics

Metric Description
Recall Ratio of correctly identified controls to total ground truth controls
Precision Ratio of correctly identified controls to total predicted controls
F1 Score Harmonic mean of recall and precision
Text Similarity Average semantic similarity of control text (using Sentence-BERT)
IoU Accuracy Average Intersection over Union for matched controls
Avg. Execution Time Mean time per sample (seconds)

πŸ’‘ Example

[
  {
    "control_text": "Save",
    "control_rect": [100, 50, 150, 80],
    "control_type": "Button"
  },
  {
    "control_text": "Document Title",
    "control_rect": [200, 100, 600, 130],
    "control_type": "TextBox"
  }
]

3. πŸ€– Action Prediction

Objective: Predict the next action to take given a user instruction, current screenshot, and action history.

πŸ“₯ Input

  • screenshot_clean: Path to the screenshot image (PNG)
  • request: User's high-level goal/instruction
  • previous_actions: List of previous action descriptions
  • resolution: Screen resolution as (width, height)

πŸ“€ Expected Output

A structured action with:

  • function: Action type (click, type, drag, scroll, hotkey, wait)
  • args: Arguments for the action (varies by function)
  • status: Execution status (CONTINUE, FINISH)

Supported Actions:

Function Arguments Example
click coordinate, button {"coordinate": [x, y], "button": "left"}
type coordinate, keys {"coordinate": [x, y], "keys": "text"}
drag start_coordinate, end_coordinate {"start_coordinate": [x1, y1], "end_coordinate": [x2, y2]}
scroll coordinate, scroll_direction, scroll_amount {"coordinate": [x, y], "scroll_direction": "down", "scroll_amount": 3}
hotkey keys {"keys": "ctrl+c"}

πŸ“Š Evaluation Metrics

Metric Description
Success Rate % of samples where function, arguments, and status all match
Function Match Rate % of correct function predictions
Args Match Rate % of correct argument predictions (with coordinate tolerance)
Status Match Rate % of correct status predictions
Avg. Execution Time Mean time per sample (seconds)

πŸ’‘ Example

{
  "thoughts": "I need to click the Bold button to make the text bold",
  "tool_call": {
    "function": "click",
    "args": {
      "coordinate": [450, 120],
      "button": "left"
    },
    "status": "CONTINUE"
  }
}

4. β™Ώ Action Prediction with Accessibility (A11y)

Objective: Similar to action prediction, but leverages accessibility tree information for more informed decisions.

πŸ“₯ Input

  • Same as Action Prediction, plus:
  • control_infos: Structured accessibility information (UI element metadata)

πŸ“€ Expected Output

  • Same as Action Prediction

πŸ“Š Evaluation Metrics

  • Same as Action Prediction

πŸ€– Supported Models

The framework supports multiple vision-language models through a unified interface. Each model can be evaluated on different tasks depending on its capabilities.

πŸ“Š Model Capabilities Matrix

Model 🎯 Grounding πŸ” Screen Parsing πŸ€– Action Prediction β™Ώ A11y Action
GPT-4o βœ… βœ… βœ… βœ…
GPT-4V βœ… βœ… βœ… βœ…
Qwen2.5-VL-7B βœ… βœ… βœ… βœ…
GUI-Actor βœ… ❌ βœ… ❌
UGround βœ… ❌ ❌ ❌
UI-TARS βœ… ❌ βœ… ❌
Aguvis-7B-720P βœ… ❌ βœ… ❌
OmniParser ❌ βœ… ❌ ❌

πŸ“ˆ Benchmark Results

Below are the benchmark results for various models on the GUI-360Β° dataset.

1. 🎯 GUI Grounding

Model Success Rate (%)
GPT-4o 9.38
GPT-4.1 11.44
Qwen2.5-VL-7B 35.78
GUI-Actor 54.50
UGround-7B 53.85
UI-TARS-1.5 7B 62.27
Aguvis-7B 50.50
Qwen2.5-VL-7B-SFT 82.30
UI-TARS-1.5 7B-SFT 82.49
πŸ“Œ Metric Definitions
  • Success Rate: Percentage of samples where predicted coordinates fall within ground truth bounding box

2. πŸ” Screen Parsing

Model Recall (%) Precision (%) F1 Score (%) Text Sim. (%) IoU Acc. (%)
GPT-4o 1.4 3.4 1.9 14.7 22.9
GPT-4.1 5.7 9.8 6.7 30.6 50.5
o3 11.4 16.0 12.8 45.6 57.8
GPT-5 8.0 11.1 8.9 30.4 56.9
Qwen2.5-VL-7B 1.0 18.1 1.5 11.3 21.1
OmniParser 45.9 41.1 40.6 56.5 73.1
OmniParser v2 46.2 41.3 40.8 56.8 73.5
πŸ“Œ Metric Definitions
  • Recall: Ratio of correctly identified controls to total ground truth controls
  • Precision: Ratio of correctly identified controls to total predicted controls
  • F1 Score: Harmonic mean of recall and precision
  • Text Similarity: Average semantic similarity of control text (using Sentence-BERT)
  • IoU Accuracy: Average Intersection over Union for matched controls

3. πŸ€– Action Prediction (Visual-only)

Model Success Rate (%)
GPT-4o 3.12
GPT-4.1 2.82
GPT-o3 17.92
GPT-5 8.59
Qwen2.5-VL-7B 17.52
Qwen2.5-VL-7B-SFT 50.08
πŸ“Œ Metric Definitions
  • Success Rate: Percentage of samples where the predicted action matches the ground truth action (considering function, arguments, and execution status)

4. β™Ώ Action Prediction with Accessibility (Visual+A11y)

Model Success Rate (%)
GPT-4o 36.71
GPT-4.1 39.19
GPT-o3 46.72
GPT-5 34.86
Qwen2.5-VL-7B 14.18
Qwen2.5-VL-7B-SFT 25.78
πŸ“Œ Metric Definitions
  • Success Rate: Same metric as Visual-only Action Prediction
  • A11y Advantage: This variant provides accessibility tree information as additional input. GPT models show significant improvement with A11y information, while Qwen2.5-VL-7B-SFT performs better in Visual-only mode.

πŸ”§ Model Configuration Guide

🌟 GPT Models (OpenAI API)

πŸ”§ Step 1: Configure API Access

Set your OpenAI API credentials:

# Required: Your OpenAI API key
export OPENAI_API_KEY="sk-your_openai_api_key_here"

# Optional: Custom API base URL (defaults to OpenAI official API)
export OPENAI_BASE_URL="https://api.openai.com/v1"

πŸš€ Step 2: Run Evaluation Commands

# GPT-4o (recommended for best performance)
python evaluation.py \
    --root_dir /path/to/dataset \
    --type grounding \
    --model_type gpt \
    --model_name gpt-4o \
    --threads 5

# GPT-4o-mini (faster and cheaper)
python evaluation.py \
    --root_dir /path/to/dataset \
    --type action_prediction \
    --model_type gpt \
    --model_name gpt-4o-mini \
    --threads 10

βœ… Requirements:

  • OpenAI API key configured via environment variable OPENAI_API_KEY
  • Internet connection for API access
  • Valid OpenAI subscription with sufficient credits

🎨 Qwen2.5-VL-7B

Step 1: Deploy Model Server

python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-VL-7B-Instruct \
    --port 19806

Step 2: Run Evaluation

python evaluation.py \
    --root_dir /path/to/dataset \
    --type action_prediction \
    --model_type qwen2.5_vl_7b \
    --model_name Qwen/Qwen2.5-VL-7B-Instruct \
    --api_url http://localhost:19806/v1 \
    --threads 5

βœ… Requirements:

  • vLLM or compatible inference server
  • GPU with sufficient memory (recommended: 16GB+ VRAM)

πŸ” OmniParser

⚠️ Important: Before using OmniParser, you need to configure the model paths in evaluation.py.

Step 1: Edit Configuration in evaluation.py (lines 136-144)

config = {
    'som_model_path': '../../weights/icon_detect/model.pt',  # Update this path
    'caption_model_name': 'florence2',
    'caption_model_path': '../../weights/icon_caption_florence',  # Update this path
    'device': 'cuda',  # Or 'cpu' if no GPU available
    'BOX_TRESHOLD': 0.05,
    'host': host,
    'port': port
}

Step 2: Download OmniParser Model Weights

  • SOM (Set-of-Mark) model: Place at the path specified in som_model_path
  • Caption model (Florence-2): Place at the path specified in caption_model_path

Step 3: Deploy the OmniParser API Server

python omniparser_server.py \
    --som_model_path /path/to/weights/icon_detect/model.pt \
    --caption_model_path /path/to/weights/icon_caption_florence \
    --host 0.0.0.0 \
    --port 7861

Step 4: Run Evaluation

python evaluation.py \
    --root_dir /path/to/dataset \
    --type screen_parsing \
    --model_type omniparser \
    --model_name omniparser-screen-parsing \
    --api_url http://localhost:7861 \
    --threads 5

βœ… Requirements:

  • OmniParser model weights (SOM and Florence-2 caption model)
  • OmniParser API server running
  • Model paths configured in evaluation.py ModelFactory class

πŸ“Š Output Format

πŸ“ Evaluation Results

After evaluation completes, results are saved to the specified output directory:

results/
β”œβ”€β”€ evaluation_results_<timestamp>.json    # Detailed results for all samples
└── evaluation_summary_<timestamp>.json    # Summary statistics

πŸ“ˆ Summary Statistics Include:

  • βœ… Total samples evaluated
  • 🎯 Success count and rate
  • ❌ Error count and rate
  • ⏱️ Average execution time
  • πŸ“‚ Domain and category breakdowns
  • πŸ“Š Task-specific metrics (e.g., recall, precision for screen parsing)

πŸ’‘ Example Summary Output

{
  "total_samples": 1000,
  "success_count": 850,
  "success_rate": 85.0,
  "error_count": 10,
  "error_rate": 1.0,
  "avg_execution_time": 2.35,
  "model_name": "gpt-4o",
  "model_type": "gpt",
  "evaluation_time": "2024-12-31 12:00:00",
  "domain_stats": {
    "word": {
      "total": 400,
      "success": 340,
      "success_rate": 85.0
    },
    "excel": {
      "total": 300,
      "success": 255,
      "success_rate": 85.0
    },
    "ppt": {
      "total": 300,
      "success": 255,
      "success_rate": 85.0
    }
  }
}

πŸ”„ Data Conversion for Training

We provide tools to convert the GUI-360Β° raw dataset into task-specific training formats suitable for vision-language model fine-tuning.

🎯 Supported Training Tasks

Task Description Use Case
Action Prediction Multi-turn conversations with action sequences Train agents to predict next GUI actions
Action Prediction + A11y Action prediction with accessibility information Leverage accessibility trees for better performance
Screen Parsing Extract UI elements from screenshots Train models to understand GUI structure
GUI Grounding Locate UI elements by text descriptions Train models for GUI element localization

πŸš€ Quick Start

# Navigate to converter directory
cd convertor

# Convert to action prediction format with image optimization
python convert_to_train.py \
    --root_dir /path/to/GUI360_dataset \
    --output_dir /path/to/training_data \
    --type action_prediction \
    --resize \
    --max_pixels 500000

# Convert to screen parsing format
python convert_to_train.py \
    --root_dir /path/to/GUI360_dataset \
    --output_dir /path/to/training_data \
    --type screen_parsing \
    --resize

πŸ“– For detailed instructions, examples, and advanced options, see convertor/README.md


πŸ› οΈ Advanced Usage

πŸ”Œ Custom Model Integration

To integrate a custom model:

  1. Create a new model class in models/ inheriting from BaseModel
  2. Implement required methods:
    • predict(system_prompt, user_prompt, image_path, ...)
    • construct_<task>_prompt(...) for each supported task
    • parse_<task>(response) for each supported task
  3. Add model factory entry in evaluation.py:
@staticmethod
def _load_custom_model(model_name: str, api_url: str = None):
    from models.custom_model import CustomModel
    return CustomModel(model_name=model_name, api_url=api_url)
  1. Update the create_model method to include the new model type

⚑ Parallel Evaluation

The framework supports multi-threaded evaluation for faster processing:

python evaluation.py \
    --root_dir /path/to/dataset \
    --type action_prediction \
    --model_type gpt \
    --model_name gpt-4o \
    --threads 20  # Increase for faster evaluation

⚠️ Note: Adjust thread count based on:

  • 🚦 API rate limits
  • πŸ’» Available system resources
  • πŸ–₯️ Model server capacity

πŸ§ͺ Evaluation on Subset

To evaluate on a limited number of samples (useful for quick testing):

python evaluation.py \
    --root_dir /path/to/dataset \
    --type grounding \
    --model_type gpt \
    --model_name gpt-4o \
    --max_samples 100  # Only evaluate 100 samples

πŸ“ Citation

If you use GUI-360Β° in your research, please cite:

@article{mu2025gui,
  title={GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using Agents},
  author={Mu, Jian and Zhang, Chaoyun and Ni, Chiming and Wang, Lu and Qiao, Bo and Mathur, Kartik and Wu, Qianhui and Xie, Yuhang and Ma, Xiaojun and Zhou, Mengyu and others},
  journal={arXiv preprint arXiv:2511.04307},
  year={2025}
}

πŸ“„ License

This project is released under the MIT License. See the LICENSE file for details.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages