How to add a serialized task walkthrough #121
Open
IanMagnusson wants to merge 3 commits into
added 3 commits
April 7, 2026 18:47
rlebras
approved these changes
Apr 8, 2026
rlebras
left a comment
Contributor
A couple of comments. LGTM! Thanks!
### Step 1: Serialize the data

The script
[`oe_eval/serialize_benchmark.py`](https://github.com/allenai/oe-eval-internal/blob/ianm-serialize-bench-data-vllm-schema/oe_eval/serialize_benchmark.py)
Contributor
Let's point this to main when merged.
# Example: SciQ multiple-choice (from examples/serialize_task_example.py)
# =============================================================================

register_variant(
Contributor
This will register sciq_mc on every import of serialized.py. Should we comment it out?
rlebras
reviewed
Apr 8, 2026
scored by logprob:

```python
from olmo_eval.common.metrics import LogprobMCAccuracyMetric
```
Contributor
Also, this import is tied to the sciq_mc example. If we move sciq_mc to its own file, this import should move with it.
Summary
Serialized tasks (added in #89) let people add evaluations to olmo-eval without writing a new Task class -- they just provide a JSONL file with pre-formatted prompts and instance metadata, register a variant, and pick their metrics. This PR adds documentation and clear examples of how to use serialized tasks as the simplest entry point for anyone who already has evaluation data in a static format for a straightforward task (e.g. migrated from another framework or generated by a custom script).
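To make the "JSONL file with pre-formatted prompts and instance metadata" idea concrete, here is a minimal sketch of writing and reading such a file. Every field name (`id`, `prompt`, `choices`, `label`, `metadata`), the prompt layout, and the helper names are assumptions made for illustration; the actual schema is the one described in `docs/serialized_tasks.md`, and this sketch does not use any olmo-eval API.

```python
import json
import tempfile
from pathlib import Path

LETTERS = "ABCD"


def make_instance(doc_id, question, choices, answer_index):
    """Build one pre-formatted instance. All field names are hypothetical."""
    lines = [f"Question: {question}"]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)]
    lines.append("Answer:")
    return {
        "id": doc_id,
        "prompt": "\n".join(lines),  # prompt is fully formatted ahead of time
        "choices": [f" {letter}" for letter, _ in zip(LETTERS, choices)],
        "label": answer_index,  # index of the gold choice
        "metadata": {"source": "toy-example"},  # arbitrary per-instance metadata
    }


def write_jsonl(instances, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for inst in instances:
            f.write(json.dumps(inst, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the file back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


instances = [
    make_instance(
        "toy-0",
        "What gas do plants absorb?",
        ["Carbon dioxide", "Oxygen", "Nitrogen", "Helium"],
        0,
    )
]

# Round-trip through a temporary file to check the format is lossless.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_task.jsonl"
    write_jsonl(instances, path)
    loaded = read_jsonl(path)
```

Because each line is an independent JSON object, a file like this can be produced by any script or framework and inspected with ordinary text tools before it is registered as a variant.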
This PR adds:
- `### Serialized Tasks` section: a concise overview with a schema snippet and pointer to the detailed guide.
- `docs/serialized_tasks.md`: full walkthrough organized around two examples -- migrating a task from oe-eval-internal and creating a new serialized task from scratch. Covers the JSONL schema, what is/isn't serialized, metadata passthrough, and limitations.
- `examples/serialize_task_example.py`: standalone script that serializes the SciQ dataset (4-choice science QA) into the JSONL format. Does not depend on oe-eval.
- `serialized:sciq_mc` variant: registration using `LogprobMCAccuracyMetric`, verified end-to-end on Beaker (66% accuracy with OLMo-2-0425-1B, well above 25% random baseline).

Test plan
- `make lint` passes (ruff check + ruff format)
- `serialized:sciq_mc` evaluated successfully on Beaker with OLMo-2-0425-1B (experiment: `01KNNFZ080T04S3MKQS89KYQGD`, accuracy: 0.66)
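
For readers comparing the reported 0.66 accuracy against the 25% random baseline, the following is a rough sketch of how logprob-scored multiple-choice accuracy is typically computed. It is not the implementation of `LogprobMCAccuracyMetric`, which may normalize or aggregate scores differently; the function names and the log-probability values are made up for illustration.

```python
def logprob_mc_accuracy(choice_logprobs, labels):
    """Fraction of instances whose highest-logprob choice is the gold label."""
    if not labels or len(choice_logprobs) != len(labels):
        raise ValueError("need one non-empty list of choice scores per label")
    correct = 0
    for scores, gold in zip(choice_logprobs, labels):
        # Predicted choice is the index with the largest log-probability.
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == gold)
    return correct / len(labels)


def random_baseline(num_choices):
    """Expected accuracy of uniform random guessing."""
    return 1.0 / num_choices


# Illustrative (made-up) per-choice log-probabilities for three instances.
scores = [
    [-1.2, -3.4, -2.8, -4.0],  # argmax is choice 0
    [-2.5, -0.9, -3.1, -2.2],  # argmax is choice 1
    [-1.1, -2.0, -1.5, -3.3],  # argmax is choice 0
]
labels = [0, 1, 2]

accuracy = logprob_mc_accuracy(scores, labels)  # 2 of 3 correct
baseline = random_baseline(4)  # 4-choice task
```

With four answer choices the uniform-guess baseline is 1/4, which is the 25% figure the description compares against.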