Lint (ruff):
- Remove unused imports (asyncio/json/Path in test_dense_caption_judge, abc.abstractmethod in dense_caption) and fix import ordering.
- Replace if/else with a ternary (SIM108) and wrap two over-length logger.warning calls in dense_caption_judge.

Type check (ty):
- Declare metric/scorer name and scorer as plain dataclass fields instead of ClassVar, matching the Metric/Scorer base classes (fixes the invalid-attribute-override errors across image_qa, dense_caption, and dense_caption_judge).
- vqa_normalization: type _argmin as Sequence[float] (list invariance), and guard the regex match / float parse against None before use.
- image_qa: _default_gpt_cache_dir always returns str.
- countbench_qa: coerce question to str.
- vqa2: ignore the np RandomState.shuffle ArrayLike mismatch (the seeded list shuffle must match mm_olmo for dump parity).

No behavior change; offline scorer/caption/pipeline tests still pass.
Description
Merges the `vision` branch into `main`, adding multimodal (image) evaluation support to olmo-eval. This brings the full Molmo2 image-QA benchmark suite and PixMo dense-caption eval, ported from the `mm_olmo` reference implementation so models can be scored through the standard olmo-eval task/scorer/metric stack.

Tasks (12 total): the 11 Molmo2 image-QA benchmarks plus dense caption.
Supporting components:
- `evals/tasks/common/image_qa_base.py`: shared base for the image-QA tasks
- `common/scorers/image_qa.py`, `common/scorers/dense_caption_judge.py`: scorers/metrics
- `common/image_qa/`: parsing/normalization helpers vendored verbatim from `mm_olmo` (VQA answer normalization, count parsing, MMMU parsing, MathVista offline scoring, prompt templates)
- `pyproject.toml`: ruff `E501` ignores scoped to the vendored tables/templates and the scorer test

27 files, +4559 lines.
Type of Change
(Additive only — all changes are new files under image-QA/vision; no existing tasks or behavior are modified.)
Testing
Automated checks run locally (at `778a4ac`, via `uv run`):
- `ruff format --check src/ tests/`
- `ruff check src/ tests/`
- `ty check src/`
- `pytest .../test_image_qa_scorers.py .../test_dense_caption_judge.py .../test_image_qa_pipeline.py`

The 102 passing tests cover the image-QA scorers (VQA / ANLS / EM / relaxed-correctness / MMMU / RealWorldQA / MathVista-offline / point-count / AI2D), the dense-caption GPT-judge (mocked, no network), and the offline image-QA pipeline. The single skip is the dump-parity suite (below).
Dump-parity — not run here (requires the released dumps)
`tests/evals/tasks/test_image_qa_dump_parity.py` re-scores the released mm_olmo Molmo2-4B prediction dumps and asserts that the re-scored metrics match `metrics.json` within tolerance (`2e-4` default, `2e-3` for MMMU).

It is gated behind `RUN_DUMP_PARITY_TESTS=1` (plus the prediction dumps and `HF_DATASETS_CACHE`), so it is skipped in CI and was skipped in my run.

Full suite (`pytest tests/ --ignore=tests/integration/`): ran only the multimodal/image-QA subset above (102 passed, 1 skipped); did not run the full repo suite.

Checklist
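For reference, the dump-parity gating and per-task tolerance described in the Testing section could be sketched as follows. This is an illustration only: the helper names, the `"mmmu"` key, and the choice of an absolute (rather than relative) tolerance are assumptions, and only the environment variable name and the two tolerance values come from the description.

```python
import math
import os

# Tolerances quoted in the description: 2e-4 by default, 2e-3 for MMMU.
DEFAULT_TOLERANCE = 2e-4
TASK_TOLERANCES = {"mmmu": 2e-3}  # task key is an assumption


def dump_parity_enabled(env=None) -> bool:
    # The suite only runs when RUN_DUMP_PARITY_TESTS=1 is set.
    env = os.environ if env is None else env
    return env.get("RUN_DUMP_PARITY_TESTS") == "1"


def within_tolerance(task: str, expected: float, actual: float) -> bool:
    # Compare a re-scored metric to the reference value using an
    # absolute tolerance (assumed; the real test may compare differently).
    tol = TASK_TOLERANCES.get(task, DEFAULT_TOLERANCE)
    return math.isclose(expected, actual, rel_tol=0.0, abs_tol=tol)
```

In a pytest suite, a flag like this would typically feed `pytest.mark.skipif(not dump_parity_enabled(), reason=...)`, so the test is reported as skipped rather than failed when the dumps are unavailable.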