how to use add a serialized task walkthru by IanMagnusson · Pull Request #121 · allenai/olmo-eval

IanMagnusson · 2026-04-08T03:29:43Z

Summary

Serialized tasks (added in #89) let people add evaluations to olmo-eval without writing a new Task class -- they provide a JSONL file with pre-formatted prompts and instance metadata, register a variant, and pick their metrics. This PR adds documentation and worked examples that present serialized tasks as the simplest entry point for anyone who already has evaluation data in a static format for a straightforward task (e.g. migrated from another framework or generated by a custom script).

This PR adds:

README section (### Serialized Tasks): a concise overview with a schema snippet and pointer to the detailed guide.
docs/serialized_tasks.md: full walkthrough organized around two examples -- migrating a task from oe-eval-internal and creating a new serialized task from scratch. Covers the JSONL schema, what is/isn't serialized, metadata passthrough, and limitations.
examples/serialize_task_example.py: standalone script that serializes the SciQ dataset (4-choice science QA) into the JSONL format. Does not depend on oe-eval.
serialized:sciq_mc variant: registration using LogprobMCAccuracyMetric, verified end-to-end on Beaker (66% accuracy with OLMo-2-0425-1B, well above the 25% random baseline for 4 choices).
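For readers skimming the PR, here is a minimal sketch of what "serializing" a 4-choice QA dataset into JSONL can look like. It is not the contents of examples/serialize_task_example.py. The field names (`prompt`, `choices`, `gold_index`, `metadata`), the prompt template, and the two sample rows are assumptions made for illustration; the actual schema is the one described in docs/serialized_tasks.md.

```python
import json
import random

LETTERS = "ABCD"

# Hypothetical sample rows standing in for a real dataset such as SciQ.
SAMPLE_ROWS = [
    {
        "id": "ex-0",
        "question": "What gas do plants absorb during photosynthesis?",
        "correct_answer": "carbon dioxide",
        "distractors": ["oxygen", "nitrogen", "helium"],
    },
    {
        "id": "ex-1",
        "question": "What is the center of an atom called?",
        "correct_answer": "nucleus",
        "distractors": ["electron", "shell", "orbital"],
    },
]


def to_instance(row, rng):
    """Turn one raw QA row into a pre-formatted multiple-choice instance."""
    options = [row["correct_answer"], *row["distractors"]]
    rng.shuffle(options)  # avoid always placing the gold answer first
    lines = [f"Question: {row['question']}"]
    lines += [f" {letter}. {opt}" for letter, opt in zip(LETTERS, options)]
    lines.append("Answer:")
    return {
        "prompt": "\n".join(lines),  # assumed field name
        "choices": [f" {letter}" for letter in LETTERS[: len(options)]],
        "gold_index": options.index(row["correct_answer"]),
        "metadata": {"id": row["id"]},  # passed through untouched
    }


def serialize_rows(rows, seed=0):
    """Return one JSON string per instance, ready to be written as JSONL."""
    rng = random.Random(seed)  # fixed seed keeps the file reproducible
    return [json.dumps(to_instance(row, rng)) for row in rows]


def write_jsonl(rows, path, seed=0):
    """Write the serialized instances to `path`, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for line in serialize_rows(rows, seed):
            f.write(line + "\n")
```

Calling `write_jsonl(SAMPLE_ROWS, "sciq_mc.jsonl")` would produce a two-line file. Everything the model sees is fixed at serialization time, which is why no Task class is needed to format prompts later.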

Test plan

make lint passes (ruff check + ruff format)
Serialization script runs and produces valid JSONL
serialized:sciq_mc evaluated successfully on Beaker with OLMo-2-0425-1B (experiment: 01KNNFZ080T04S3MKQS89KYQGD, accuracy: 0.66)
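As context for the accuracy number above, the following is a rough sketch of how logprob-based multiple-choice accuracy is commonly computed: score each answer choice by the model's log-probability and count an instance as correct when the highest-scoring choice is the gold one. This is not the implementation of LogprobMCAccuracyMetric, which may differ (for example in normalization). The `choice_logprobs` and `gold_index` field names and the toy numbers are assumptions.

```python
def predict(choice_logprobs):
    """Return the index of the choice with the highest log-probability."""
    return max(range(len(choice_logprobs)), key=choice_logprobs.__getitem__)


def mc_accuracy(instances):
    """Fraction of instances whose top-scoring choice is the gold choice."""
    correct = sum(
        predict(inst["choice_logprobs"]) == inst["gold_index"]
        for inst in instances
    )
    return correct / len(instances)


def random_baseline(num_choices):
    """Expected accuracy of uniform random guessing over `num_choices`."""
    return 1.0 / num_choices


# Toy scored instances (made-up numbers, not real model output).
TOY_INSTANCES = [
    {"choice_logprobs": [-1.2, -0.3, -2.0, -4.1], "gold_index": 1},
    {"choice_logprobs": [-0.5, -0.9, -3.0, -2.2], "gold_index": 0},
    {"choice_logprobs": [-2.5, -1.1, -0.7, -3.3], "gold_index": 3},
]
```

With four choices per question, `random_baseline(4)` is 0.25, which is the baseline the 0.66 result is compared against.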

rlebras

A couple of comments. LGTM! Thanks!

rlebras · 2026-04-08T17:30:33Z

+### Step 1: Serialize the data
+
+The script
+[`oe_eval/serialize_benchmark.py`](https://github.com/allenai/oe-eval-internal/blob/ianm-serialize-bench-data-vllm-schema/oe_eval/serialize_benchmark.py)


Let's point this to main when merged.

rlebras · 2026-04-08T17:44:50Z

+# Example: SciQ multiple-choice (from examples/serialize_task_example.py)
+# =============================================================================
+
+register_variant(


This will register sciq_mc on every import of serialized.py. Should we comment it out?
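To make the concern concrete, here is a toy sketch of import-time registration and one possible way to defer it. The registry, the `toy_register_variant` helper, and its keyword arguments are hypothetical stand-ins, not olmo-eval's `register_variant` API.

```python
# Toy registry standing in for the real variant registry (hypothetical).
VARIANTS = {}


def toy_register_variant(name, **config):
    """Record a variant under `name`, refusing duplicate registrations."""
    if name in VARIANTS:
        raise ValueError(f"variant {name!r} is already registered")
    VARIANTS[name] = config


# A call placed at module level would run every time the module is
# imported, so every user of the module would see the example variant:
#
#     toy_register_variant("serialized:sciq_mc", ...)
#
# Wrapping the call in a function makes registration opt-in instead.
def register_examples():
    """Register example variants only when explicitly requested."""
    toy_register_variant(
        "serialized:sciq_mc",
        data_path="sciq_mc.jsonl",  # assumed argument name
        metric="logprob_mc_accuracy",  # assumed argument name
    )
```

Commenting the call out, as suggested, has the same effect as never calling `register_examples()`; moving the example into its own module would be another option.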

rlebras · 2026-04-08T18:29:57Z

+scored by logprob:
+
+```python
+from olmo_eval.common.metrics import LogprobMCAccuracyMetric


Also, this import is tied to the sciq_mc example. If we move sciq_mc to its own file, this import should move with it.

IanM added 3 commits April 7, 2026 18:47

example first draft

6a779ab

fix metric

bd33919

edits

487ca10

IanMagnusson requested review from rlebras and undfined April 8, 2026 03:35

IanMagnusson changed the title ~~Add serialized task documentation, SciQ example, and metric fix~~ how to use add a serialized task walkthru Apr 8, 2026

rlebras approved these changes Apr 8, 2026


rlebras reviewed Apr 8, 2026

IanMagnusson wants to merge 3 commits into main from serialized-example
