Benchmark Caliper

A framework for assessing whether AI benchmarks developed in one cultural context can be validly applied to another.

Benchmark Caliper evaluates a benchmark across 6 validity dimensions (Input/Output × Ontology/Content/Form) and produces a scored, evidence-backed report for a specific deployment context.

How it works

You provide a benchmark paper (PDF) and a description of where and how the benchmark would be deployed. Benchmark Caliper then:

Extracts the benchmark's metadata and methodology from the paper.
Asks a few targeted questions about your deployment context and synthesizes your answers.
Builds a structured profile of the benchmark and the target region.
Scores all 6 validity dimensions and produces a report with per-dimension scores, reasoning, and supporting evidence.

The analysis runs on the Anthropic API with cost-routed model selection: lighter models handle extraction and synthesis, and the strongest model handles the final scoring step.
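
The routing idea can be sketched as a lookup from pipeline step to model tier. This is a minimal illustration under stated assumptions, not the contents of client.py: the step names, the tier labels, and the `pick_tier` helper are all hypothetical, and concrete model identifiers are deliberately left out.

```python
# Hypothetical sketch of cost routing; names are illustrative, not taken from client.py.
STEP_TIERS = {
    "extraction": "light",     # cheaper model tier assumed for paper extraction
    "synthesis": "light",      # cheaper model tier assumed for answer synthesis
    "scoring": "strongest",    # strongest model tier reserved for final scoring
}


def pick_tier(step: str) -> str:
    """Return the model tier assumed for a pipeline step."""
    try:
        return STEP_TIERS[step]
    except KeyError:
        raise ValueError(f"unknown pipeline step: {step!r}") from None


if __name__ == "__main__":
    for step in STEP_TIERS:
        print(f"{step}: {pick_tier(step)}")
```

Keeping the mapping in one table means a change in routing policy touches a single place rather than every call site.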

Components

Component	Description
`anthropic_api_package_release/`	The validity-analysis pipeline. Run it from the command line to inspect the assessments from the paper, reproduce one, or analyze your own benchmark. See its README.
`website/`	A web interface to the pipeline: upload a paper, describe a deployment, and receive a validity report. See its README.

Repository structure

benchmark-caliper/
├── anthropic_api_package_release/   # Validity-analysis pipeline
│   ├── run_pipeline.py              # CLI entry point and orchestrator
│   ├── run_expert_stage1.py         # Batch runner for expert assessments
│   ├── run_comparative.py           # Comparative (regional vs. reference) runs
│   ├── client.py                    # Anthropic API wrapper with cost routing
│   ├── framework.yaml               # Validity dimensions, checklists, scoring rubric
│   ├── prompt_template.md           # Evaluation prompt template
│   ├── prompts/                     # LLM prompt per pipeline step
│   ├── scripts/                     # Deterministic helpers (PDF, parsing, reports)
│   ├── benchmarks/                  # Benchmark example YAMLs (ICL references)
│   ├── regions/                     # Region template YAMLs (ICL references)
│   ├── templates/                   # Input templates for new assessments
│   ├── assessments/                 # Pipeline outputs for the paper's assessments
│   ├── papers/                      # Benchmark papers and extracted summaries
│   ├── tests/                       # Pipeline test suite
│   └── README.md                    # Pipeline setup and usage
├── website/                         # Web interface to the pipeline
│   ├── server/                      # FastAPI backend
│   ├── client/                      # Vite + React + TypeScript frontend
│   ├── DESIGN.md                    # Architecture and security posture
│   ├── DEPLOYMENT.md                # Hosted setup
│   └── README.md                    # Local setup and usage
├── Dockerfile                       # Builds the website + pipeline image
├── render.yaml                      # Render deployment blueprint
└── entrypoint.sh                    # Container entrypoint

Getting started

To analyze a benchmark through the web interface, see website/README.md.
To run the pipeline from the command line (to inspect the assessments from the paper, reproduce one, or analyze your own benchmark), see anthropic_api_package_release/README.md.

Deployment

The website (with the pipeline bundled) deploys as a single Docker service. Dockerfile and render.yaml define the build; see website/DEPLOYMENT.md for the hosted setup.

Validity dimensions

Dimension	Category	What it assesses
Input Ontology	Input × Ontology	Test case categories cover regional deployment needs
Input Content	Input × Content	Datapoint instances are culturally appropriate
Input Form	Input × Form	Signal encoding matches regional infrastructure
Output Ontology	Output × Ontology	Label categories reflect regional values
Output Content	Output × Content	Ground-truth labels align with regional perspectives
Output Form	Output × Form	Output modality matches regional usage patterns
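
The six dimensions in the table are the cross product of two axes. A short sketch of that grid follows; the variable names are illustrative and are not taken from framework.yaml.

```python
from itertools import product

# The two axes named in the table's Category column.
SIDES = ("Input", "Output")
ASPECTS = ("Ontology", "Content", "Form")

# Cross the axes to enumerate all six dimensions, in the same order as the table.
DIMENSIONS = [f"{side} {aspect}" for side, aspect in product(SIDES, ASPECTS)]

if __name__ == "__main__":
    for name in DIMENSIONS:
        print(name)
```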

Each dimension is scored 1–5:

Score	Meaning
1	Major validity violations; fundamentally misaligned with target context
2	Significant concerns; multiple concrete violations or gaps
3	Partially addressed; mixed evidence
4	Well addressed; minor concerns; documentation shows awareness
5	No concerns; explicit validity-preserving practices demonstrated
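
One way to represent a per-dimension result is a small record that rejects scores outside the 1–5 scale. This is a sketch only: the `DimensionScore` class and its fields are hypothetical and do not describe the pipeline's actual report format, and the labels abbreviate the table's wording.

```python
from dataclasses import dataclass

# Short labels abbreviating the score table above.
SCORE_LABELS = {
    1: "major validity violations",
    2: "significant concerns",
    3: "partially addressed",
    4: "well addressed",
    5: "no concerns",
}


@dataclass(frozen=True)
class DimensionScore:
    """Hypothetical record for one scored validity dimension."""

    dimension: str
    score: int
    reasoning: str = ""

    def __post_init__(self) -> None:
        # Reject anything that is not an integer on the 1-5 scale.
        if not isinstance(self.score, int) or self.score not in SCORE_LABELS:
            raise ValueError(f"score must be an integer from 1 to 5, got {self.score!r}")

    @property
    def label(self) -> str:
        return SCORE_LABELS[self.score]


if __name__ == "__main__":
    result = DimensionScore("Input Form", 4, "Minor concerns only.")
    print(f"{result.dimension}: {result.score} ({result.label})")
```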

Ground-truth validation

Three benchmark–region pairs from the paper were scored by domain experts and serve as ground truth for validating the framework's scores. The mean absolute error (MAE) of the framework's scores is measured against the expert average:

Benchmark	Region	Expert avg	MAE
HELM	Southeast Asia	1.7	0.00
SEA-HELM	Southeast Asia	4.3	0.67
IberBench	Iberian	3.7	0.00
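
For reference, MAE is the average of the absolute differences between two equally long lists of scores. The sketch below assumes one framework score and one expert-average score per dimension, which the text above does not spell out. The numbers are made up for illustration and are not results from the paper.

```python
def mean_absolute_error(predicted, reference):
    """Average absolute difference between paired scores."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


if __name__ == "__main__":
    # Made-up per-dimension scores, for illustration only.
    framework_scores = [4, 3, 5, 2, 4, 3]
    expert_averages = [4.0, 3.5, 5.0, 2.0, 3.0, 3.0]
    print(mean_absolute_error(framework_scores, expert_averages))  # 0.25
```

An MAE of 0.00 means the framework's scores matched the expert averages exactly; larger values indicate larger average disagreement on the 1–5 scale.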
