multimodal evaluation #209

Open
jason718 wants to merge 2 commits into main from vision

@jason718 jason718 commented Jun 13, 2026

Description

Merges the vision branch into main, adding multimodal (image) evaluation support to olmo-eval. This brings the full Molmo2 image-QA benchmark suite and PixMo dense-caption eval, ported from the mm_olmo reference implementation so models can be scored through the standard olmo-eval task/scorer/metric stack.

Tasks (12 total) — the 11 Molmo2 image-QA benchmarks plus dense caption:

  • Image-QA: AI2D, ChartQA, CountBench QA, DocVQA, InfoVQA, MathVista, MMMU, PixMo Count, RealWorldQA, TextVQA, VQA v2
  • Captioning: PixMo-Cap dense caption (GPT-judged)

Supporting components:

  • evals/tasks/common/image_qa_base.py — shared base for the image-QA tasks
  • common/scorers/image_qa.py, common/scorers/dense_caption_judge.py — scorers/metrics
  • common/image_qa/ — parsing/normalization helpers vendored verbatim from mm_olmo (VQA answer normalization, count parsing, MMMU parsing, MathVista offline scoring, prompt templates)
  • pyproject.toml — ruff E501 ignores scoped to the vendored tables/templates and the scorer test

27 files, +4559 lines.

Type of Change

  • New feature (non-breaking change that adds functionality)

(Additive only: apart from the scoped ruff E501 ignores in pyproject.toml, all changes are new files under image-QA/vision; no existing tasks or behavior are modified.)

Testing

Automated checks run locally (at 778a4ac, via uv run)

| Command | Result |
| --- | --- |
| ruff format --check src/ tests/ | ✅ 457 files already formatted |
| ruff check src/ tests/ | ✅ All checks passed |
| ty check src/ | ✅ All checks passed |
| pytest .../test_image_qa_scorers.py .../test_dense_caption_judge.py .../test_image_qa_pipeline.py | 102 passed, 1 skipped (1.45s) |

The 102 passing tests cover the image-QA scorers (VQA / ANLS / EM / relaxed-correctness / MMMU / RealWorldQA / MathVista-offline / point-count / AI2D), the dense-caption GPT-judge (mocked — no network), and the offline image-QA pipeline. The single skip is the dump-parity suite (below).
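
For reviewers unfamiliar with two of the metrics named above, here is a minimal sketch of ANLS and relaxed correctness as they are commonly defined for DocVQA-style and ChartQA-style scoring. It is not the code vendored from mm_olmo: the function names, the lowercase/strip normalization, the 0.5 ANLS threshold, and the 5% relative tolerance are assumptions based on the usual definitions, and the scorers in this PR may differ in detail.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls(prediction: str, references: list[str], threshold: float = 0.5) -> float:
    """Best (1 - normalized edit distance) over references, zeroed past the threshold."""
    pred = prediction.strip().lower()  # assumed normalization
    best = 0.0
    for ref in references:
        ref_norm = ref.strip().lower()
        longest = max(len(pred), len(ref_norm))
        distance = levenshtein(pred, ref_norm) / longest if longest else 0.0
        score = 1.0 - distance if distance < threshold else 0.0
        best = max(best, score)
    return best


def _to_float(text: str):
    """Parse a number, treating a trailing % as a percentage; None if not numeric."""
    try:
        if text.endswith("%"):
            return float(text.rstrip("%")) / 100.0
        return float(text)
    except ValueError:
        return None


def relaxed_correctness(prediction: str, target: str, max_relative_change: float = 0.05) -> bool:
    """Numeric answers match within a relative tolerance; others need an exact match."""
    pred, tgt = prediction.strip(), target.strip()
    pred_num, tgt_num = _to_float(pred), _to_float(tgt)
    if pred_num is not None and tgt_num:  # a zero target falls back to string match
        return abs(pred_num - tgt_num) / abs(tgt_num) <= max_relative_change
    return pred.lower() == tgt.lower()
```

Under these assumptions, a near-miss string answer earns partial ANLS credit, while relaxed correctness is all-or-nothing per example.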

Dump-parity — not run here (requires the released dumps)

tests/evals/tasks/test_image_qa_dump_parity.py re-scores the released mm_olmo Molmo2-4B prediction dumps and asserts:

  1. Prompt parity — reconstructed user-turn text exactly matches each saved dump, and
  2. Metric parity — recomputed metrics match the reference metrics.json within tolerance (2e-4 default, 2e-3 for MMMU).
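
The metric-parity check can be sketched roughly as follows. This is an illustration only: the helper name, the task keys, and the choice of an absolute (rather than relative) tolerance are assumptions, since the description above states only the tolerance values.

```python
import math

# Tolerances quoted in the PR description; the dict keys are hypothetical task names.
DEFAULT_TOLERANCE = 2e-4
TASK_TOLERANCE = {"mmmu": 2e-3}


def metric_mismatches(task: str, recomputed: dict, reference: dict) -> list[str]:
    """Return a description of every reference metric that is missing or out of tolerance."""
    tol = TASK_TOLERANCE.get(task, DEFAULT_TOLERANCE)
    problems = []
    for name, expected in reference.items():
        actual = recomputed.get(name)
        if actual is None:
            problems.append(f"{task}/{name}: missing from recomputed metrics")
        elif not math.isclose(actual, expected, rel_tol=0.0, abs_tol=tol):
            problems.append(f"{task}/{name}: {actual} vs reference {expected} (tol {tol})")
    return problems
```

A parity test would then assert that the returned list is empty, so a failure reports every drifting metric at once.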

It is gated behind RUN_DUMP_PARITY_TESTS=1 (+ the prediction dumps and HF_DATASETS_CACHE), so it skips in CI and in my run:

SKIPPED [1] tests/evals/tasks/test_image_qa_dump_parity.py:39: Set RUN_DUMP_PARITY_TESTS=1
(and HF_DATASETS_CACHE for the HF-hub tasks) to run dump-parity tests
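
The gating pattern looks roughly like the sketch below. To stay self-contained it uses the standard-library unittest module; the PR's tests use pytest, where pytest.mark.skipif plays the same role. The helper and class names are hypothetical, and only the RUN_DUMP_PARITY_TESTS flag is checked here, not the presence of the dumps or HF_DATASETS_CACHE.

```python
import os
import unittest

RUN_FLAG = "RUN_DUMP_PARITY_TESTS"  # environment variable named in the PR description


def dump_parity_enabled(environ=None) -> bool:
    """True only when the opt-in flag is set to exactly "1"."""
    env = os.environ if environ is None else environ
    return env.get(RUN_FLAG) == "1"


@unittest.skipUnless(dump_parity_enabled(), f"Set {RUN_FLAG}=1 to run dump-parity tests")
class DumpParityPlaceholder(unittest.TestCase):
    """Stand-in for the real suite; skipped unless the flag is set."""

    def test_placeholder(self):
        # The real tests would load the prediction dumps and re-score them here.
        self.assertTrue(dump_parity_enabled())
```

Because the flag defaults to unset, CI and ordinary local runs report the suite as skipped rather than failing on missing data.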
  • Unit tests pass locally (pytest tests/ --ignore=tests/integration/) — ran the multimodal/image-QA subset above (102 passed, 1 skipped); did not run the full repo suite
  • Integration tests pass (if applicable)
  • New tests added for new functionality

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have added/updated documentation as needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

@jason718 jason718 changed the title from "Vision" to "multimodal test" Jun 13, 2026
@jason718 jason718 changed the title from "multimodal test" to "multimodal evaluation" Jun 16, 2026
@jason718 jason718 marked this pull request as ready for review June 16, 2026 08:16
Lint (ruff):
- Remove unused imports (asyncio/json/Path in test_dense_caption_judge,
  abc.abstractmethod in dense_caption) and fix import ordering.
- Replace if/else with a ternary (SIM108) and wrap two over-length
  logger.warning calls in dense_caption_judge.

Type check (ty):
- Declare metric/scorer name and scorer as plain dataclass fields instead
  of ClassVar, matching the Metric/Scorer base classes (fixes the
  invalid-attribute-override errors across image_qa, dense_caption, and
  dense_caption_judge).
- vqa_normalization: type _argmin as Sequence[float] (list invariance),
  guard the regex match / float parse against None before use.
- image_qa: _default_gpt_cache_dir always returns str.
- countbench_qa: coerce question to str.
- vqa2: ignore the np RandomState.shuffle ArrayLike mismatch (the seeded
  list shuffle must match mm_olmo for dump parity).

No behavior change; offline scorer/caption/pipeline tests still pass.
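
As an illustration of the "guard the regex match against None" fix listed above, the pattern is roughly the following. This is a hypothetical helper, not the vendored vqa_normalization code; the regex and function name are assumptions.

```python
import re
from typing import Optional

# Hypothetical pattern: an optionally signed integer or decimal number.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def first_number(text: str) -> Optional[float]:
    """Return the first number found in text, or None when there is none."""
    match = _NUMBER.search(text)
    if match is None:  # re.search returns None on no match; check before .group()
        return None
    return float(match.group(0))
```

Checking the match explicitly is what lets a type checker accept the later .group() call, since re.search is typed as returning an optional match.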