multimodal evaluation by jason718 · Pull Request #209 · allenai/olmo-eval

jason718 · 2026-06-13T06:44:15Z

Description

Merges the vision branch into main, adding multimodal (image) evaluation support to olmo-eval. This brings the full Molmo2 image-QA benchmark suite and PixMo dense-caption eval, ported from the mm_olmo reference implementation so models can be scored through the standard olmo-eval task/scorer/metric stack.

Tasks (12 total) — the 11 Molmo2 image-QA benchmarks plus dense caption:

Image-QA: AI2D, ChartQA, CountBench QA, DocVQA, InfoVQA, MathVista, MMMU, PixMo Count, RealWorldQA, TextVQA, VQA v2
Captioning: PixMo-Cap dense caption (GPT-judged)

Supporting components:

evals/tasks/common/image_qa_base.py — shared base for the image-QA tasks
common/scorers/image_qa.py, common/scorers/dense_caption_judge.py — scorers/metrics
common/image_qa/ — parsing/normalization helpers vendored verbatim from mm_olmo (VQA answer normalization, count parsing, MMMU parsing, MathVista offline scoring, prompt templates)
pyproject.toml — ruff E501 ignores scoped to the vendored tables/templates and the scorer test

27 files, +4559 lines.

Type of Change

New feature (non-breaking change that adds functionality)

(Additive only — all changes are new files under image-QA/vision; no existing tasks or behavior are modified.)

Testing

Automated checks run locally (at `778a4ac`, via `uv run`)

| Command | Result |
| --- | --- |
| `ruff format --check src/ tests/` | ✅ 457 files already formatted |
| `ruff check src/ tests/` | ✅ All checks passed |
| `ty check src/` | ✅ All checks passed |
| `pytest .../test_image_qa_scorers.py .../test_dense_caption_judge.py .../test_image_qa_pipeline.py` | ✅ 102 passed, 1 skipped (1.45s) |

The 102 passing tests cover the image-QA scorers (VQA / ANLS / EM / relaxed-correctness / MMMU / RealWorldQA / MathVista-offline / point-count / AI2D), the dense-caption GPT-judge (mocked — no network), and the offline image-QA pipeline. The single skip is the dump-parity suite (below).
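For readers unfamiliar with ANLS, one of the scorer families listed above, here is a minimal sketch of the per-example score as it is commonly defined for document-style VQA (threshold 0.5). It is not the vendored mm_olmo code; the function names and the lowercase/strip normalization are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insert, delete, and substitute each cost 1."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def anls_score(prediction: str, answers: list[str], threshold: float = 0.5) -> float:
    """Best similarity over the reference answers; 0 once the distance reaches the threshold."""
    # Assumed normalization: strip whitespace and lowercase both sides.
    pred = prediction.strip().lower()
    best = 0.0
    for answer in answers:
        ref = answer.strip().lower()
        longest = max(len(pred), len(ref))
        distance = levenshtein(pred, ref) / longest if longest else 0.0
        similarity = 1.0 - distance if distance < threshold else 0.0
        best = max(best, similarity)
    return best
```

The benchmark-level number would then be the mean of `anls_score` over all examples.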

Dump-parity — not run here (requires the released dumps)

tests/evals/tasks/test_image_qa_dump_parity.py re-scores the released mm_olmo Molmo2-4B prediction dumps and asserts:

Prompt parity — reconstructed user-turn text exactly matches each saved dump, and
Metric parity — recomputed metrics match the reference metrics.json within tolerance (2e-4 default, 2e-3 for MMMU).

It is gated behind RUN_DUMP_PARITY_TESTS=1 (+ the prediction dumps and HF_DATASETS_CACHE), so it skips in CI and in my run:

SKIPPED [1] tests/evals/tasks/test_image_qa_dump_parity.py:39: Set RUN_DUMP_PARITY_TESTS=1
(and HF_DATASETS_CACHE for the HF-hub tasks) to run dump-parity tests
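The gating and tolerance rules described above can be sketched roughly as follows. This is an illustration rather than the contents of test_image_qa_dump_parity.py: the helper names and the `"mmmu"` task key are hypothetical, and treating the tolerance as an absolute difference is an assumption.

```python
import math
import os

# Tolerances quoted in this PR description; the dict layout is hypothetical.
DEFAULT_TOL = 2e-4
TASK_TOL = {"mmmu": 2e-3}


def dump_parity_enabled(env=None):
    """Return True only when RUN_DUMP_PARITY_TESTS is set to "1"."""
    env = os.environ if env is None else env
    return env.get("RUN_DUMP_PARITY_TESTS") == "1"


def metric_mismatches(task, recomputed, reference):
    """List reference metric names that are missing or outside tolerance."""
    # Assumption: tolerance is an absolute difference, not a relative one.
    tol = TASK_TOL.get(task, DEFAULT_TOL)
    bad = []
    for name, ref_value in reference.items():
        value = recomputed.get(name)
        if value is None or not math.isclose(
            value, ref_value, rel_tol=0.0, abs_tol=tol
        ):
            bad.append(name)
    return bad
```

A test would skip itself when `dump_parity_enabled()` is false and otherwise assert that `metric_mismatches(...)` returns an empty list for each task.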

Unit tests pass locally (pytest tests/ --ignore=tests/integration/) — ran the multimodal/image-QA subset above (102 passed, 1 skipped); did not run the full repo suite
Integration tests pass (if applicable)
New tests added for new functionality

Checklist

My code follows the project's style guidelines
I have performed a self-review of my code
I have added/updated documentation as needed
My changes generate no new warnings
Any dependent changes have been merged and published

Lint (ruff):

- Remove unused imports (asyncio/json/Path in test_dense_caption_judge, abc.abstractmethod in dense_caption) and fix import ordering.
- Replace if/else with a ternary (SIM108) and wrap two over-length logger.warning calls in dense_caption_judge.

Type check (ty):

- Declare metric/scorer name and scorer as plain dataclass fields instead of ClassVar, matching the Metric/Scorer base classes (fixes the invalid-attribute-override errors across image_qa, dense_caption, and dense_caption_judge).
- vqa_normalization: type _argmin as Sequence[float] (list invariance), guard the regex match / float parse against None before use.
- image_qa: _default_gpt_cache_dir always returns str.
- countbench_qa: coerce question to str.
- vqa2: ignore the np RandomState.shuffle ArrayLike mismatch (the seeded list shuffle must match mm_olmo for dump parity).

No behavior change; offline scorer/caption/pipeline tests still pass.
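As a rough illustration of the ClassVar change in this commit: when a base dataclass declares `name` as an ordinary field, a subclass that redeclares it as a `ClassVar` changes the kind of attribute, which is the sort of mismatch a type checker reports. The classes below are hypothetical stand-ins, not the actual olmo-eval Metric/Scorer definitions.

```python
from dataclasses import dataclass, fields


@dataclass
class Metric:
    # Hypothetical base class: `name` is a plain instance field.
    name: str = "metric"


@dataclass
class VqaScoreMetric(Metric):
    # Same kind of declaration as the base (plain field, new default),
    # rather than `name: ClassVar[str] = "vqa_score"`.
    name: str = "vqa_score"


metric = VqaScoreMetric()
field_names = [f.name for f in fields(VqaScoreMetric)]
```

Here `name` stays a regular dataclass field on the subclass, so it still appears in `fields(...)` and can be passed to the constructor.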

jason718 changed the title ~~Vision~~ multimodal test Jun 13, 2026

jason718 changed the title ~~multimodal test~~ multimodal evaluation Jun 16, 2026

jason718 marked this pull request as ready for review June 16, 2026 08:16

new vision tasks

1529433

jason718 force-pushed the vision branch from 525e3a8 to 1529433 Compare June 16, 2026 08:52

jason718 wants to merge 2 commits into main from vision
