Graphic-Design-Bench: A Comprehensive Benchmark for Evaluating AI on Graphic Design Tasks

lica-bench is a structured evaluation suite for measuring how well vision-language models understand, edit, and generate graphic design artifacts. It covers layout reasoning, typography, visual hierarchy, SVG/vector understanding, template variants, animation, and more.

Benchmarks use the Lica dataset (1,148 graphic design layouts). We release design-benchmarks.zip containing all task-specific evaluation data organized under benchmarks// (including manifests, JSON specifications, and prepared assets), along with model outputs for each task.

Benchmarks

Each task is one of two types: understanding (answer a question or edit an artifact) or generation (produce a new artifact). The 45 tasks span seven domains and are grouped into 39 benchmarks:

Domain	Tasks	Benchmarks	Description
category	2	2	Design category classification and user intent prediction
layout	8	8	Spatial reasoning over design canvases (aspect ratio, element counting, component type and detection), layout generation (intent-to-layout, partial completion, aspect-ratio adaptation), and layer-aware object insertion (`layout-8`, reference- or description-guided per sample)
lottie	2	2	Lottie animation generation from text and image
svg	8	8	SVG reasoning and editing (perceptual and semantic Q/A, bug fixing, optimization, style editing) and generation (text-to-SVG, image-to-SVG, combined input)
template	5	5	Template matching, retrieval, clustering, and generation (style completion, color transfer)
temporal	8	6	Keyframe ordering; motion type classification; video duration, component duration, and start-time estimation (`temporal-3`, with motion type / speed / direction in the same benchmark); generation (animation parameters, motion trajectory, short-form video)†
typography	12	8	Font family, color, size / weight / alignment / letter spacing / line height* (single benchmark), style ranges, curvature, rotation, and generation (styled text element, styled text rendering to layout)

* typography-3 (Text Params Estimation) expects one JSON object with five fields: font_size, font_weight, text_align, letter_spacing, and line_height.
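As an illustration of that output shape, the sketch below checks that a model response parses to one JSON object containing those five fields. The helper name and the sample values are hypothetical; the value types and units that the benchmark's scorer expects are not stated here and are not assumed.

```python
import json

# Field names taken from the typography-3 note above.
REQUIRED_FIELDS = {"font_size", "font_weight", "text_align", "letter_spacing", "line_height"}


def missing_text_params(raw: str) -> set[str]:
    """Return the required fields absent from a JSON response (hypothetical helper)."""
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError("expected a single JSON object")
    return REQUIRED_FIELDS - set(parsed)


# Sample values are made up for illustration only.
sample = (
    '{"font_size": 24, "font_weight": 700, "text_align": "center", '
    '"letter_spacing": 0, "line_height": 1.2}'
)
print(missing_text_params(sample))  # empty set when all five fields are present
```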

† Temporal (8 tasks, 6 benchmarks): the five understanding tasks are temporal-1, temporal-2, and three timing tasks within temporal-3 (clip/video duration, per-component duration, and start time, which is counted separately from both duration tasks). The three generation tasks are temporal-4 through temporal-6.
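The totals quoted above can be checked directly against the table. This small sketch only re-adds the per-domain counts copied from it.

```python
# (tasks, benchmarks) per domain, copied from the table above.
DOMAINS = {
    "category": (2, 2),
    "layout": (8, 8),
    "lottie": (2, 2),
    "svg": (8, 8),
    "template": (5, 5),
    "temporal": (8, 6),
    "typography": (12, 8),
}

total_tasks = sum(tasks for tasks, _ in DOMAINS.values())
total_benchmarks = sum(benchmarks for _, benchmarks in DOMAINS.values())
print(len(DOMAINS), total_tasks, total_benchmarks)  # 7 45 39
```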

Getting started

1. Install

git clone https://github.com/purvanshi/lica-bench.git
cd lica-bench
python3 -m venv .venv && source .venv/bin/activate
pip install -e .

# Add extras you need (pick any combination)
pip install -e ".[metrics]"          # scipy, sklearn, Pillow, cairosvg, etc.
pip install -e ".[openai]"           # OpenAI provider
pip install -e ".[gemini]"           # Gemini provider
pip install -e ".[anthropic]"        # Anthropic provider
pip install -e ".[svg-metrics]"      # Full SVG eval (metrics + LPIPS, CLIP)
pip install -e ".[lottie-metrics]"   # Lottie frame-level eval (rlottie-python)
pip install -e ".[layout-metrics]"   # Layout/image metrics (Linux + Python<3.12 recommended)
pip install -e ".[dev]"              # ruff linter

The PyPI/setuptools distribution is lica-bench; import the library as design_benchmarks.
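Because the distribution name and the import name differ, a quick check after installing can look up each one separately. This is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper, not part of lica-bench.

```python
from importlib import metadata, util


def installed_version(dist_name: str):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# The distribution is looked up by its packaging name ...
print("distribution lica-bench:", installed_version("lica-bench"))
# ... while the importable package uses the module name.
print("design_benchmarks importable:", util.find_spec("design_benchmarks") is not None)
```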

2. Verify installation (no data, no API keys)

python scripts/run_benchmarks.py --list                     # enumerate tasks and readiness

3. Download data

python scripts/download_data.py                              # → data/lica-benchmarks-dataset/

--dataset-root is the bundle root (contains lica-data/ and benchmarks/). Task data is read from benchmarks/ using each benchmark's metadata. Use --data to point at a specific directory.
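Before a run, it can help to confirm that the directory passed as --dataset-root really is the bundle root. The sketch below only tests for the two subdirectories named above; `missing_bundle_dirs` is a hypothetical helper and not something the CLI is known to expose.

```python
from pathlib import Path


def missing_bundle_dirs(dataset_root) -> list[str]:
    """List which of the two expected subdirectories are absent under dataset_root."""
    root = Path(dataset_root)
    return [name for name in ("lica-data", "benchmarks") if not (root / name).is_dir()]


# An empty list means both lica-data/ and benchmarks/ were found.
print(missing_bundle_dirs("data/lica-benchmarks-dataset"))
```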

4. Run benchmarks

# Stub model (no API keys; validates load_data + build_model_input on real data)
python scripts/run_benchmarks.py --stub-model --benchmarks category-1 \
    --dataset-root data/lica-benchmarks-dataset --n 5

# Real model
python scripts/run_benchmarks.py --benchmarks svg-1 \
    --provider openai --model-id gpt-5.4 \
    --dataset-root data/lica-benchmarks-dataset

# Temporal benchmarks (video-based)
python scripts/run_benchmarks.py --benchmarks temporal-1 \
    --provider gemini \
    --dataset-root data/lica-benchmarks-dataset

# User custom python model entrypoint
python scripts/run_benchmarks.py --benchmarks svg-1 \
    --provider custom --custom-entry my_models.wrapper:build_model \
    --custom-init-kwargs '{"checkpoint":"/models/foo"}' \
    --dataset-root data/lica-benchmarks-dataset

# Local default VLM/LLM (defaults now use Qwen3-VL-4B-Instruct)
python scripts/run_benchmarks.py --benchmarks svg-1 \
    --provider hf --device auto \
    --dataset-root data/lica-benchmarks-dataset

# Diffusion / image generation (defaults now use FLUX.2 klein 4B)
python scripts/run_benchmarks.py --benchmarks layout-1 \
    --provider diffusion \
    --dataset-root data/lica-benchmarks-dataset

# Image-generation / editing task with a custom wrapper
python scripts/run_benchmarks.py --benchmarks typography-7 \
    --provider custom --custom-entry my_models.image_wrapper:build_model \
    --custom-modality image_generation \
    --dataset-root data/lica-benchmarks-dataset

# Official FLUX.2 wrapper via the existing custom provider
python -m pip install --no-deps --ignore-requires-python \
    "git+https://github.com/black-forest-labs/flux2.git"
python scripts/run_benchmarks.py --benchmarks layout-1 layout-3 layout-8 typography-7 typography-8 \
    --provider custom \
    --custom-entry design_benchmarks.models.local_models:Flux2Model \
    --custom-init-kwargs '{"model_name":"flux.2-klein-4b"}' \
    --custom-modality image_generation \
    --dataset-root data/lica-benchmarks-dataset

--custom-entry must point to an importable Python module attribute. In practice, that means your wrapper is either installed in the environment or reachable via PYTHONPATH.
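To make "importable module attribute" concrete, the sketch below shows one common way a `module:attr` string is resolved in Python. It is an assumption about the general pattern, not the loader that run_benchmarks.py actually uses; `resolve_entry` is a hypothetical name, and a standard-library target stands in for a real wrapper.

```python
import importlib


def resolve_entry(entry: str):
    """Resolve 'package.module:attr' to the object it names (illustrative only)."""
    module_name, sep, attr_path = entry.partition(":")
    if not sep or not module_name or not attr_path:
        raise ValueError(f"expected 'module:attr', got {entry!r}")
    # import_module raises ModuleNotFoundError if the module is not installed
    # or not reachable via PYTHONPATH.
    obj = importlib.import_module(module_name)
    for part in attr_path.split("."):
        obj = getattr(obj, part)
    return obj


# Standard-library stand-in for something like my_models.wrapper:build_model.
dumps = resolve_entry("json:dumps")
print(dumps({"ok": True}))
```

If the import step fails for your own wrapper, the module is most likely neither installed nor on PYTHONPATH.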

For image-output tasks, prefer --custom-modality image_generation. If your wrapper uses source images or masks, expose capability attributes such as supports_image_output, supports_image_input, and supports_mask_editing so preflight warnings can distinguish text-to-image from image-edit/inpaint models.
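A wrapper might declare those capability attributes as plain class attributes. In this sketch only the three attribute names come from the text above; the class names, the `describe_capabilities` helper, and its decision order are hypothetical and do not reproduce the repository's preflight logic or its required wrapper interface.

```python
class TextToImageWrapper:
    """Hypothetical wrapper that only generates images from text."""

    supports_image_output = True
    supports_image_input = False
    supports_mask_editing = False


class InpaintWrapper:
    """Hypothetical wrapper that edits a source image under a mask."""

    supports_image_output = True
    supports_image_input = True
    supports_mask_editing = True


def describe_capabilities(model) -> str:
    """Classify a wrapper from its capability flags (assumed logic, for illustration)."""
    if not getattr(model, "supports_image_output", False):
        return "no image output"
    if getattr(model, "supports_mask_editing", False):
        return "image edit / inpaint"
    if getattr(model, "supports_image_input", False):
        return "image-conditioned generation"
    return "text-to-image"


print(describe_capabilities(TextToImageWrapper()))
print(describe_capabilities(InpaintWrapper()))
```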

design_benchmarks.models.local_models:Flux2Model keeps using the existing custom provider, so no extra CLI provider is required. The official FLUX.2 weights currently also require access to the gated black-forest-labs/FLUX.2-dev autoencoder (ae.safetensors); export HF_TOKEN / HF_HUB_TOKEN before running, or set AE_MODEL_PATH to a local copy of that file after approval.
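A small pre-run check can report which of those settings is present. The environment variable names come from the paragraph above; the helper name and the precedence (local file first, then token) are assumptions made for this sketch, not documented behavior of the FLUX.2 wrapper.

```python
import os


def flux2_autoencoder_source(env=None) -> str:
    """Report how the gated autoencoder could be obtained (assumed precedence)."""
    env = os.environ if env is None else env
    if env.get("AE_MODEL_PATH"):
        return "local file"
    if env.get("HF_TOKEN") or env.get("HF_HUB_TOKEN"):
        return "hub token"
    return "missing"


print(flux2_autoencoder_source())
```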

The default local model ID for both hf and vllm is now Qwen/Qwen3-VL-4B-Instruct, which can cover both text-only and image-input benchmarks. The default diffusion model ID is now flux.2-klein-4b.

Use the same --dataset-root (Lica bundle root) for stub runs, API runs, and --batch-submit so paths inside CSVs/JSON resolve correctly.

See scripts/README.md for batch submit/collect, vLLM, HuggingFace, user model entrypoints, multi-model config files, and all CLI flags.

5. API keys

Set whichever provider(s) you need:

export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export GOOGLE_API_KEY=...            # Gemini (Google AI Studio / google-genai API key)

For Gemini on Vertex AI (service account), pass a JSON key file instead of relying on GOOGLE_API_KEY:

python scripts/run_benchmarks.py --benchmarks svg-1 --provider gemini \
    --credentials /path/to/service-account.json \
    --dataset-root data/lica-benchmarks-dataset

The file must be either a service account key (type: service_account) or JSON containing an api_key field.
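The two accepted shapes can be told apart by inspecting the parsed JSON. This is a sketch of that rule only; `credentials_kind` is a hypothetical helper, and the sample file contents are placeholders rather than a complete service account key.

```python
import json
import tempfile
from pathlib import Path


def credentials_kind(path) -> str:
    """Classify a credentials file as 'service_account' or 'api_key'."""
    data = json.loads(Path(path).read_text())
    if isinstance(data, dict) and data.get("type") == "service_account":
        return "service_account"
    if isinstance(data, dict) and "api_key" in data:
        return "api_key"
    raise ValueError("expected a service account key or JSON with an api_key field")


# Demo with a throwaway file; the key value is a placeholder.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "creds.json"
    demo.write_text(json.dumps({"api_key": "placeholder"}))
    print(credentials_kind(demo))
```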

Batch submit for Gemini also needs a GCS bucket (--bucket or DESIGN_BENCHMARKS_GCS_BUCKET); see scripts/README.md.

Benchmark dataset layout

Everything lives under one root directory lica-benchmarks-dataset/ (e.g. data/lica-benchmarks-dataset/ after download_data.py):

lica-benchmarks-dataset/
├── lica-data/                    # core Lica release (layouts, renders, metadata)
│   ├── metadata.csv              # one row per layout
│   ├── layouts/<template_id>/<layout_id>.json
│   ├── images/<template_id>/<layout_id>.{png,jpg,webp,mp4}
│   └── annotations/…             # optional
│
└── benchmarks/                   # evaluation inputs per domain
    ├── category/                 #   CategoryClassification/, UserIntentPrediction/
    ├── image/
    ├── layout/
    ├── lottie/
    ├── svg/
    ├── template/
    ├── temporal/                 #   KeyframeOrdering/, MotionTypeClassification/, etc.
    └── typography/

Using this bundle: Set --dataset-root to this directory. CSV image_path and template data_root entries resolve relative to --dataset-root.

What the two trees are: lica-data/ is the shared Lica corpus (layout JSON, renders, metadata.csv). benchmarks/ holds evaluation payloads per domain (CSVs, JSON, manifests, copied assets). Exact filenames differ by task; see the module under src/design_benchmarks/tasks/<domain>.py or docs/CONTRIBUTING.md when adding or packaging data.
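The path conventions above can be expressed with pathlib. Both helpers are hypothetical, and the IDs and the relative path in the example are made up; only the directory pattern (layouts/<template_id>/<layout_id>.json under lica-data/) and the "relative to --dataset-root" rule come from this section.

```python
from pathlib import Path


def layout_json_path(dataset_root, template_id: str, layout_id: str) -> Path:
    """Build the expected location of one layout JSON inside lica-data/."""
    return Path(dataset_root) / "lica-data" / "layouts" / template_id / f"{layout_id}.json"


def resolve_from_root(dataset_root, relative_path: str) -> Path:
    """Resolve a CSV image_path-style entry against the bundle root.

    Note: pathlib discards dataset_root if relative_path is absolute.
    """
    return Path(dataset_root) / relative_path


root = "data/lica-benchmarks-dataset"
print(layout_json_path(root, "tmpl_001", "layout_042"))  # made-up IDs
print(resolve_from_root(root, "benchmarks/svg/example.png"))  # made-up relative path
```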

Project structure

lica-bench/
├── src/design_benchmarks/
│   ├── tasks/              # @benchmark classes — one file per domain
│   │   ├── category.py     #   category-1, category-2
│   │   ├── layout.py       #   layout-1 … layout-8
│   │   ├── lottie.py       #   lottie-1, lottie-2
│   │   ├── svg.py          #   svg-1 … svg-8
│   │   ├── template.py     #   template-1 … template-5
│   │   ├── temporal.py     #   temporal-1 … temporal-6
│   │   └── typography.py   #   typography-1 … typography-8
│   ├── models/             # Provider wrappers (OpenAI, Anthropic, Gemini, HF, vLLM)
│   ├── metrics/            # Reusable metric functions (IoU, FID, SSIM, LPIPS, edit distance)
│   ├── evaluation/
│   │   ├── tracker.py      # Per-sample JSONL logger
│   │   └── reporting.py    # BenchmarkResult / RunReport (CSV + JSON)
│   ├── inference/          # Batch API runners, GCS helpers
│   ├── utils/              # Shared helpers (image, text, layout path resolution)
│   ├── base.py             # BaseBenchmark, BenchmarkMeta, TaskType, @benchmark
│   ├── registry.py         # Auto-discovery via pkgutil.walk_packages
│   └── runner.py           # BenchmarkRunner orchestration
├── scripts/
│   ├── download_data.py    # Fetch + unpack into lica-benchmarks-dataset/
│   └── run_benchmarks.py   # Unified CLI for list, stub, real, and batch runs
├── docs/
│   └── CONTRIBUTING.md     # How to add tasks and domains
└── pyproject.toml

Quick start (Python API)

from pathlib import Path
from design_benchmarks import BenchmarkRegistry, BenchmarkRunner
from design_benchmarks.models import load_model

root = Path("data/lica-benchmarks-dataset")
registry = BenchmarkRegistry()
registry.discover()

runner = BenchmarkRunner(registry)
models = {"openai": load_model("openai", model_id="gpt-5.4")}
report = runner.run(
    benchmark_ids=["svg-1"],
    models=models,
    dataset_root=root,
    n=5,
)
print(report.summary())
report.save("outputs/report.json")
runner.tracker.save("outputs/tracker.jsonl")

RunReport includes both metric scores and reliability counters per benchmark/model: count, success_count, failure_count, and failure_rate. This makes partial-run failures visible in terminal summaries and saved JSON/CSV reports.
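The relationship between those counters is presumably the usual one. The sketch below assumes that failure_count is count minus success_count and that failure_rate is failure_count divided by count; the `ReliabilityCounters` class is hypothetical and is not the RunReport implementation.

```python
from dataclasses import dataclass


@dataclass
class ReliabilityCounters:
    """Illustrative stand-in for the per-benchmark/model reliability fields."""

    count: int
    success_count: int

    @property
    def failure_count(self) -> int:
        # Assumption: every sample is either a success or a failure.
        return self.count - self.success_count

    @property
    def failure_rate(self) -> float:
        # Guard against an empty run to avoid division by zero.
        return self.failure_count / self.count if self.count else 0.0


counters = ReliabilityCounters(count=4, success_count=3)
print(counters.failure_count, counters.failure_rate)  # 1 0.25
```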

Contributing

See docs/CONTRIBUTING.md for:

How to add a benchmark task to an existing domain
How to create a new domain module
Where benchmark inputs live in the Lica release and the PR checklist

Limitations

Some metrics (LPIPS, CLIP score, SSIM, CIEDE2000) need heavier extras (.[svg-metrics], .[lottie-metrics], .[layout-metrics]). The full .[layout-metrics] stack is enabled on Linux with Python < 3.12. Metrics whose dependencies are unavailable are omitted from the output (with a logged warning).
--provider picks which backend runs the model (OpenAI, Gemini, Anthropic, etc.); --model-id is only the catalog string for that backend (it does not select the provider). If you omit --model-id, the default for the chosen provider is used (see DEFAULT_MODEL_IDS in scripts/run_benchmarks.py). With --multi-models, each entry is provider:model_id so both are explicit. Use a --model-id your account actually exposes (README examples may name newer IDs such as gpt-5.4).
For local/open-source models, --model-id may be either a hub ID or a local checkpoint directory if the underlying backend supports it. If the model name/path does not make its modality obvious, set --model-modality text or --model-modality text_and_image explicitly.
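To illustrate the provider:model_id form mentioned for --multi-models, the sketch below splits one such entry on its first colon. `parse_model_spec` is a hypothetical helper; how the CLI separates multiple entries and validates providers is not shown in this README and is not assumed here.

```python
def parse_model_spec(spec: str) -> tuple[str, str]:
    """Split 'provider:model_id' on the first colon (illustrative only)."""
    provider, sep, model_id = spec.partition(":")
    if not sep or not provider or not model_id:
        raise ValueError(f"expected 'provider:model_id', got {spec!r}")
    return provider, model_id


# Hub IDs containing '/' pass through unchanged.
print(parse_model_spec("hf:Qwen/Qwen3-VL-4B-Instruct"))
print(parse_model_spec("openai:gpt-5.4"))
```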

Models

Provider	Install extra	CLI flag
OpenAI	`.[openai]`	`--provider openai`
Anthropic	`.[anthropic]`	`--provider anthropic`
Gemini	`.[gemini]`	`--provider gemini`
HuggingFace	(torch)	`--provider hf --device auto`
vLLM	`.[vllm]`	`--provider vllm`
Diffusion	`.[vllm-omni]`	`--provider diffusion`
OpenAI Image	`.[openai]`	`--provider openai_image`
Custom Entrypoint	(your code)	`--provider custom --custom-entry module:attr`

Evaluation extras

Extra	Contents	Used by
`.[metrics]`	scipy, sklearn, scikit-image, Pillow, cairosvg	All implemented tasks (clustering, color, SVG rendering)
`.[svg-metrics]`	metrics + torch, transformers, lpips	SVG generation (LPIPS, CLIP score)
`.[lottie-metrics]`	metrics + rlottie-python	Lottie generation (frame MSE, frame SSIM)
`.[layout-metrics]`	torch, transformers (+ Linux/Python<3.12: pyiqa, hpsv2, hpsv3, dreamsim, image-reward)	Layout / image generation (FID, HPSv2/v3, DreamSim)
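Since metrics with unavailable dependencies are skipped, it can be useful to see in advance which modules an extra would need. This sketch probes for importable modules without importing them. The mapping from package names to import names (for example `sklearn`, `skimage`, `PIL`, `rlottie_python`) is an assumption based on the usual names of those packages, and `missing_modules` is a hypothetical helper.

```python
from importlib import util

# Extras and packages from the table above; import names are assumed.
EXTRA_IMPORTS = {
    "metrics": ["scipy", "sklearn", "skimage", "PIL", "cairosvg"],
    "svg-metrics": ["torch", "transformers", "lpips"],
    "lottie-metrics": ["rlottie_python"],
}


def missing_modules(extra: str) -> list[str]:
    """Return the modules for an extra that cannot be found in this environment."""
    return [name for name in EXTRA_IMPORTS[extra] if util.find_spec(name) is None]


for extra in EXTRA_IMPORTS:
    print(extra, "missing:", missing_modules(extra))
```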

Dataset

The Lica dataset underpins the initial benchmark release:

1,148 graphic design layouts across 9 design categories
Structured JSON annotations (components, positions, styles, descriptions)
Rendered images (PNG) and animations (MP4)
Download: python scripts/download_data.py

Citation

If you use this benchmark, please cite the original LICA dataset:

@misc{lica-dataset,
  author = {Mehta, Purvanshi and others},
  title  = {LICA: Open-Source Graphic Design Layout Dataset},
  year   = {2025},
  url    = {https://github.com/purvanshi/lica-dataset}
}
