how to use add a serialized task walkthru #121

Open: IanMagnusson wants to merge 3 commits into main from serialized-example
Conversation

IanMagnusson (Contributor) commented Apr 8, 2026

Summary

Serialized tasks (added in #89) let people add evaluations to olmo-eval without writing a new Task class: they provide a JSONL file with pre-formatted prompts and instance metadata, register a variant, and pick their metrics. This PR adds documentation and examples showing how to use serialized tasks, which are the simplest entry point for anyone who already has evaluation data in a static format for a straightforward task (e.g. migrated from another framework or generated by a custom script).

This PR adds:

  • README section (### Serialized Tasks): a concise overview with a schema snippet and pointer to the detailed guide.
  • docs/serialized_tasks.md: full walkthrough organized around two examples -- migrating a task from oe-eval-internal and creating a new serialized task from scratch. Covers the JSONL schema, what is/isn't serialized, metadata passthrough, and limitations.
  • examples/serialize_task_example.py: standalone script that serializes the SciQ dataset (4-choice science QA) into the JSONL format. Does not depend on oe-eval.
  • serialized:sciq_mc variant: registration using LogprobMCAccuracyMetric, verified end-to-end on Beaker (66% accuracy with OLMo-2-0425-1B, well above the 25% random baseline for 4-choice questions).
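
For orientation, turning one multiple-choice item into a single JSONL record might look roughly like the sketch below. The field names (`prompt`, `choices`, `gold_index`, `metadata`) and the helper `serialize_item` are hypothetical stand-ins, not the actual schema documented in docs/serialized_tasks.md, and the sample question is made up rather than taken from SciQ.

```python
import json

LABELS = "ABCD"  # assumption: at most four answer options, as in a 4-choice task


def serialize_item(question: str, choices: list[str], gold_index: int, item_id: str) -> str:
    """Return one JSONL line with a pre-formatted prompt and instance metadata.

    The record layout here is illustrative only; the real schema lives in
    docs/serialized_tasks.md.
    """
    if len(choices) > len(LABELS):
        raise ValueError("too many choices for the available labels")
    if not 0 <= gold_index < len(choices):
        raise ValueError("gold_index out of range")
    options = "\n".join(f" {LABELS[i]}. {text}" for i, text in enumerate(choices))
    record = {
        "prompt": f"Question: {question}\n{options}\nAnswer:",
        # Candidate continuations that a logprob-based metric would score.
        "choices": [f" {LABELS[i]}" for i in range(len(choices))],
        "gold_index": gold_index,
        "metadata": {"id": item_id},
    }
    # json.dumps escapes embedded newlines, so the record stays on one line.
    return json.dumps(record, ensure_ascii=False)


line = serialize_item(
    "What gas do plants absorb from the air?",
    ["Oxygen", "Carbon dioxide", "Nitrogen", "Helium"],
    gold_index=1,
    item_id="demo-0",
)
```

Writing one such line per instance to a file yields the kind of static JSONL input described above; the prompt formatting happens once at serialization time rather than inside a Task class.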

Test plan

  • make lint passes (ruff check + ruff format)
  • Serialization script runs and produces valid JSONL
  • serialized:sciq_mc evaluated successfully on Beaker with OLMo-2-0425-1B (experiment: 01KNNFZ080T04S3MKQS89KYQGD, accuracy: 0.66)
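
The "produces valid JSONL" check could be automated along the lines of the sketch below. This is not the check that was actually run for this PR; `validate_jsonl` and the required key names passed to it are hypothetical.

```python
import json


def validate_jsonl(text: str, required_keys: set[str]) -> list[str]:
    """Return a list of problems found; an empty list means every line is usable."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines, e.g. a trailing newline
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {lineno}: expected a JSON object")
            continue
        missing = required_keys - set(record)
        if missing:
            problems.append(f"line {lineno}: missing keys {sorted(missing)}")
    return problems
```

Calling `validate_jsonl(open(path).read(), {"prompt"})` on the script's output and asserting the result is empty would turn the manual check into a repeatable one.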

IanMagnusson requested review from rlebras and undfined on April 8, 2026, 03:35
IanMagnusson changed the title from "Add serialized task documentation, SciQ example, and metric fix" to "how to use add a serialized task walkthru" on Apr 8, 2026

rlebras (Contributor) left a comment

A couple of comments. LGTM! Thanks!

Comment thread docs/serialized_tasks.md
### Step 1: Serialize the data

The script
[`oe_eval/serialize_benchmark.py`](https://github.com/allenai/oe-eval-internal/blob/ianm-serialize-bench-data-vllm-schema/oe_eval/serialize_benchmark.py)

Let's point this to main when merged.

# Example: SciQ multiple-choice (from examples/serialize_task_example.py)
# =============================================================================

register_variant(

This will register sciq_mc on every import of serialized.py. Should we comment it out?

Comment thread docs/serialized_tasks.md
scored by logprob:

```python
from olmo_eval.common.metrics import LogprobMCAccuracyMetric

Also, this import is tied to the sciq_mc example. If we move sciq_mc to its own file, this import should move with it.

2 participants