42 changes: 42 additions & 0 deletions README.md
@@ -625,6 +625,48 @@ register_regime("my_task", "3shot", num_fewshot=3)

Usage: `olmo-eval run -t my_task:bpb:3shot`

### Serialized Tasks

If you already have evaluation data in a static format -- for example,
migrated from another framework or generated by a custom script -- you can
add a task without writing a new Task class. The built-in `SerializedTask`
loads a JSONL file where each line carries the fields needed to build both
an `Instance` (for scoring) and an `LMRequest` (for inference). You just
register a variant pointing at your file and choose your metrics:

```python
register_variant(
    "serialized", "my_task_name",
    data_source=DataSource(path="s3://my-bucket/my_task.jsonl"),
    metrics=(MyMetric(),),
)
```

Each JSONL record looks like:

```jsonc
{
  "doc_id": 0,                                    // unique int id
  "question": "Write a function ...",             // raw question text
  "gold_answers": ["def foo(): ..."],             // correct answer(s)
  "choices": null,                                // MC answer choices, or null
  "metadata": {"task_name": "my_task", "id": 1},  // arbitrary scorer-visible metadata
  "request_type": "loglikelihood",                // "completion", "loglikelihood", or "chat"
  "prompt": "You are an expert ...\n\n",          // formatted prompt string
  "messages": null,                               // chat messages list, or null
  "continuations": ["def foo(): ..."]             // strings to score (loglikelihood)
}
```

The serialized file captures what goes into the model (the fully formatted
prompt, including baked-in few-shot examples and chat templates). It does
**not** capture SamplingParams, Scorers, Metrics, or answer extraction
logic -- those are configured on the olmo-eval side.

For a full walkthrough -- including how to migrate a task from oe-eval and
how to create a new serialized task from scratch -- see
[docs/serialized_tasks.md](docs/serialized_tasks.md).

## Tool-Augmented Evaluation

olmo-eval supports evaluating models with tool use through the **Harness** abstraction. This enables comparing baseline model performance against tool-augmented performance on the same tasks.
329 changes: 329 additions & 0 deletions docs/serialized_tasks.md
@@ -0,0 +1,329 @@
# Serialized Tasks

A serialized task lets you add an evaluation to olmo-eval without
implementing a new Task class. You provide a JSONL file containing
pre-formatted instances and model requests; the built-in `SerializedTask`
class handles loading and plumbing. All you need to do is register a variant
that points at your data and specifies the metrics you want.

The goal is to keep this path as simple as possible for common task shapes
(completion, loglikelihood, chat). More exotic task types that need custom
runtime behavior (per-instance sampling params, dynamic prompt construction,
multi-request-per-instance patterns) may still require a full Task subclass,
but we aim to grow `SerializedTask` over time to cover the common cases.

## JSONL schema at a glance

Each line is a JSON object with two groups of fields -- one that maps to an
`Instance` (used for scoring) and one that maps to an `LMRequest` (sent to
the model):

**Loglikelihood / BPB example** (single continuation scored against the prompt):

```json
{
  "doc_id": 0,
  "question": "Write a function to find the maximum element in a list.",
  "gold_answers": ["def max_element(lst):\n return max(lst)"],
  "choices": null,
  "metadata": {"task_name": "mbpp:3shot:bpb", "id": 601, "test": "assert max_element([1,2,3]) == 3"},
  "request_type": "loglikelihood",
  "prompt": "You are an expert Python programmer...\n\n",
  "messages": null,
  "continuations": ["def max_element(lst):\n return max(lst)"]
}
```

**Multiple-choice example** (one continuation per choice, scored by logprob):

```json
{
  "doc_id": 0,
  "question": "What is the powerhouse of the cell?",
  "gold_answers": ["Mitochondria"],
  "choices": ["Nucleus", "Mitochondria", "Ribosome", "Golgi apparatus"],
  "metadata": {"task_name": "sciq", "gold_idx": 1},
  "request_type": "loglikelihood",
  "prompt": "What is the powerhouse of the cell?\n A. Nucleus\n B. Mitochondria\n C. Ribosome\n D. Golgi apparatus\nAnswer:",
  "messages": null,
  "continuations": [" A", " B", " C", " D"]
}
```

**Chat example** (generation request with message list):

```json
{
  "doc_id": 0,
  "question": "What is the capital of France?",
  "gold_answers": ["Paris"],
  "choices": null,
  "metadata": {"task_name": "trivia_chat"},
  "request_type": "chat",
  "prompt": null,
  "messages": [
    {"role": "system", "content": "Answer concisely."},
    {"role": "user", "content": "What is the capital of France?"}
  ],
  "continuations": null
}
```

**Instance-level fields** (for scoring):

| Field | Type | Description |
|-------|------|-------------|
| `doc_id` | `int` | Unique numeric ID within the file |
| `question` | `str` | Raw question text |
| `gold_answers` | `list[str]` | Valid gold answers; first element maps to `Instance.gold_answer` |
| `choices` | `list[str] \| null` | Answer choices for MC tasks, null otherwise |
| `metadata` | `dict` | Arbitrary per-instance metadata; always includes `task_name`, may include `gold_idx` for MC, plus any scorer-specific fields |

**LMRequest-level fields** (for inference):

| Field | Type | Description |
|-------|------|-------------|
| `request_type` | `str` | `"completion"`, `"loglikelihood"`, or `"chat"` |
| `prompt` | `str \| null` | Formatted prompt string (completion / loglikelihood) |
| `messages` | `list[dict] \| null` | Chat messages list (chat requests) |
| `continuations` | `list[str] \| null` | Continuation strings for loglikelihood scoring |
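
The two tables above can be turned into a quick pre-flight check before a file is uploaded. The following is a minimal sketch, not part of olmo-eval: `validate_record` and `validate_jsonl` are hypothetical helper names, and the sketch assumes that every field is always present (with `null` for unused ones), as in the examples above.

```python
import json

# Field names taken from the schema tables above.
REQUIRED_FIELDS = {
    "doc_id", "question", "gold_answers", "choices", "metadata",
    "request_type", "prompt", "messages", "continuations",
}
REQUEST_TYPES = {"completion", "loglikelihood", "chat"}


def validate_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record looks fine."""
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        return [f"missing fields: {sorted(missing)}"]

    problems = []
    if not isinstance(record["doc_id"], int):
        problems.append("doc_id must be an int")
    if not isinstance(record["gold_answers"], list):
        problems.append("gold_answers must be a list")
    if not isinstance(record["metadata"], dict):
        problems.append("metadata must be a dict")
    elif "task_name" not in record["metadata"]:
        problems.append("metadata should include task_name")

    rtype = record["request_type"]
    if rtype not in REQUEST_TYPES:
        problems.append(f"unknown request_type: {rtype!r}")
    elif rtype == "chat":
        if not record["messages"]:
            problems.append("chat requests need a non-empty messages list")
    else:
        if not isinstance(record["prompt"], str):
            problems.append(f"{rtype} requests need a prompt string")
        if rtype == "loglikelihood" and not record["continuations"]:
            problems.append("loglikelihood requests need continuations")
    return problems


def validate_jsonl(path: str) -> dict:
    """Map 1-based line numbers to problems, including duplicate doc_ids."""
    problems, seen = {}, set()
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            issues = validate_record(record)
            doc_id = record.get("doc_id")
            if doc_id in seen:
                issues.append(f"duplicate doc_id {doc_id}")
            seen.add(doc_id)
            if issues:
                problems[line_no] = issues
    return problems
```

An empty result from `validate_jsonl` only means the file matches the shape described here; it says nothing about whether the prompts or gold answers are correct.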

### What gets serialized and what does not

The JSONL captures **what goes into the model**: the fully formatted prompt
or chat messages, including baked-in few-shot examples, chat templates, and
continuation strings. It also captures Instance fields needed by scorers
(question, gold answers, choices, metadata).

The JSONL does **not** capture:

- **SamplingParams** (max_tokens, temperature, stop_sequences, etc.) --
these are runtime concerns configured via `TaskConfig` or CLI overrides.
- **Scorer and Metric definitions** -- these live in olmo-eval
task registrations, not in the data.
- **Answer extraction logic** -- implemented by scorers in olmo-eval.

---

## Example 1: Migrating a task from oe-eval-internal

This walkthrough covers how the `olmo3:base_easy:code:bpb` suite was
migrated from oe-eval-internal using serialized data. Use it as a reference
if you have a task that already exists in oe-eval.

### Step 1: Serialize the data

The script
[`oe_eval/serialize_benchmark.py`](https://github.com/allenai/oe-eval-internal/blob/ianm-serialize-bench-data-vllm-schema/oe_eval/serialize_benchmark.py)
converts oe-eval tasks into JSONL files. It loads a task through oe-eval's
own `load_task` / `build_all_requests` pipeline, then extracts the formatted
prompt and instance fields into `SerializedRecord` objects
(see lines 43-81 of that script for the schema dataclass).

> **Review comment (Contributor):** Let's point this to main when merged.

```bash
python -m oe_eval.serialize_benchmark \
    --task-suite olmo3:base_easy:code_bpb \
    --output-dir /tmp/serialized_benchmarks
```

This produces one JSONL file per leaf task plus a `manifest.json`. The
script handles collapsing multiple-choice request instances into a single
record with `choices` and `continuations` lists, and extracts metadata
fields like `test` and `entry_point` from the underlying dataset document.
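
To make the collapsing step concrete, here is a minimal sketch of the idea. It is not the code in `serialize_benchmark.py`: the function name `collapse_choice_requests` and the input shape (one dict per choice with `doc_id`, `idx`, `prompt`, `choice`, and `continuation` keys) are assumptions made for illustration, and the output carries only the request-related fields of a record.

```python
from collections import defaultdict


def collapse_choice_requests(requests):
    """Merge per-choice request dicts into one record per doc_id.

    Each input dict is assumed (hypothetically) to hold: doc_id, idx (the
    choice position), prompt, choice, and continuation.
    """
    grouped = defaultdict(list)
    for req in requests:
        grouped[req["doc_id"]].append(req)

    records = []
    for doc_id in sorted(grouped):
        # Restore the original choice order regardless of input order.
        reqs = sorted(grouped[doc_id], key=lambda r: r["idx"])
        # All choices of one document must share the same prompt.
        if len({r["prompt"] for r in reqs}) != 1:
            raise ValueError(f"doc {doc_id}: choices have different prompts")
        records.append({
            "doc_id": doc_id,
            "request_type": "loglikelihood",
            "prompt": reqs[0]["prompt"],
            "choices": [r["choice"] for r in reqs],
            "continuations": [r["continuation"] for r in reqs],
        })
    return records
```

Keeping `choices` and `continuations` in the same order matters, because a `gold_idx` stored in `metadata` indexes into both lists.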

Note that this script does not necessarily cover every task type in oe-eval
out of the box. It was written to migrate the code BPB suite and serves as
an example of how to approach serialization for other tasks. Tasks with
unusual request types (e.g. `generate_until_and_loglikelihood`) or
non-standard `doc_to_target()` implementations may need adjustments -- see
the [Limitations](#limitations-and-gotchas) section below.

### Step 2: Upload the data

Upload the JSONL files to a location accessible by `DataLoader` (S3, GCS,
or a local path):

```bash
aws s3 cp /tmp/serialized_benchmarks/ s3://my-bucket/serialized/ --recursive
```

### Step 3: Register variants in olmo-eval

Each JSONL file becomes a variant of the `serialized` base task. Add
registrations in
[`src/olmo_eval/evals/tasks/serialized.py`](../src/olmo_eval/evals/tasks/serialized.py).
Here is what the code BPB migration looks like (simplified):

```python
from olmo_eval.common.metrics import BPBMetricInstanceAvg
from olmo_eval.data import DataSource
from olmo_eval.evals.tasks.common import register_variant

_S3_BASE = "s3://ai2-llm/ianm/oe-eval-serialized/olmo3_base_easy_code_bpb"
_BPB_METRICS = (BPBMetricInstanceAvg(),)

register_variant(
    "serialized",
    "codex_humaneval_3shot_bpb",
    data_source=DataSource(path=f"{_S3_BASE}/codex_humaneval_3shot_bpb__none.jsonl"),
    metrics=_BPB_METRICS,
)

register_variant(
    "serialized",
    "mbpp_3shot_bpb",
    data_source=DataSource(path=f"{_S3_BASE}/mbpp_3shot_bpb__none.jsonl"),
    metrics=_BPB_METRICS,
    limit=500,
)
```

Each call overrides `TaskConfig` fields on the `serialized` base task.
You can pass any `TaskConfig` field as a keyword argument: `data_source`,
`metrics`, `limit`, `primary_metric`, `sampling_params`, etc.

### Step 4: Define a suite

Group related serialized tasks into a suite in
[`src/olmo_eval/evals/suites/code.py`](../src/olmo_eval/evals/suites/code.py):

```python
OLMO3_BASE_EASY_CODE_BPB_SERIALIZED = register(
    Suite(
        name="olmo3:base_easy:code:bpb:serialized",
        tasks=(
            "serialized:codex_humaneval_3shot_bpb",
            "serialized:mbpp_3shot_bpb",
            _SERIALIZED_MT_MBPP_V2FIX_BPB,
        ),
        aggregation=AggregationStrategy.AVERAGE_OF_AVERAGES,
        description="OLMo3 base_easy code BPB suite from serialized data",
    )
)
```

### Step 5: Run it

```bash
olmo-eval run -t serialized:codex_humaneval_3shot_bpb -m my-model
# or the whole suite:
olmo-eval run -s olmo3:base_easy:code:bpb:serialized -m my-model
```

---

## Example 2: Creating a new serialized task from scratch

This walkthrough covers creating a serialized task that does not come from
oe-eval -- you write a standalone script to produce the JSONL.

### Step 1: Write a serialization script

See
[`examples/serialize_task_example.py`](../examples/serialize_task_example.py)
for a complete, runnable example. It loads the SciQ dataset from
HuggingFace (a simple 4-choice science QA benchmark) and writes a JSONL
file. The key function is `serialize_doc`, which builds one record per
example:

```python
def serialize_doc(doc_id, doc):
    # ... shuffle choices, find gold_idx ...
    return {
        "doc_id": doc_id,
        "question": doc["question"],
        "gold_answers": [doc["correct_answer"]],
        "choices": choices,
        "metadata": {"task_name": "sciq_serialized", "gold_idx": gold_idx},
        "request_type": "loglikelihood",
        "prompt": make_prompt(doc, choices),
        "messages": None,
        "continuations": [f" {label}" for label in "ABCD"],
    }
```

Run it:

```bash
pip install datasets
python examples/serialize_task_example.py --output /tmp/sciq_serialized.jsonl
```
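
The excerpt above elides the choice shuffling and the prompt builder. One way those pieces could look is sketched below; this is an illustration, not the contents of `examples/serialize_task_example.py`. It assumes the SciQ document exposes `correct_answer`, `distractor1`, `distractor2`, and `distractor3` fields, `shuffle_choices` is a hypothetical helper name, and the prompt layout is copied from the multiple-choice record shown in the schema section.

```python
import random

LABELS = "ABCD"


def shuffle_choices(doc, seed):
    """Shuffle the four answers deterministically and locate the gold one."""
    choices = [
        doc["correct_answer"],
        doc["distractor1"],
        doc["distractor2"],
        doc["distractor3"],
    ]
    # A per-document seed (for example the doc_id) keeps the file reproducible.
    random.Random(seed).shuffle(choices)
    gold_idx = choices.index(doc["correct_answer"])
    return choices, gold_idx


def make_prompt(doc, choices):
    """Format the question, the lettered choices, and a trailing 'Answer:'."""
    lines = [doc["question"]]
    lines += [f" {label}. {choice}" for label, choice in zip(LABELS, choices)]
    lines.append("Answer:")
    return "\n".join(lines)
```

Seeding the shuffle from the document ID means that re-running the script produces the same JSONL, which keeps results comparable across runs.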

### Step 2: Register the variant

Add a registration pointing at your JSONL. For a multiple-choice task
scored by logprob:

```python
from olmo_eval.common.metrics import LogprobMCAccuracyMetric
from olmo_eval.data import DataSource
from olmo_eval.evals.tasks.common import register_variant

register_variant(
    "serialized",
    "sciq_mc",
    data_source=DataSource(path="/tmp/sciq_serialized.jsonl"),
    metrics=(LogprobMCAccuracyMetric(),),
)
```

> **Review comment (Contributor):** Also, the `LogprobMCAccuracyMetric` import is tied to the sciq_mc example. If we move sciq_mc to its own file, this import should move with it.

### Step 3: Run it

```bash
olmo-eval run -t serialized:sciq_mc -m my-model
```

---

## Passing task-specific metadata

Scorers and metrics sometimes need per-instance data beyond the basic
Instance fields. For example, code execution scoring needs `test` and
`entry_point`, and IFEval constraint checking needs `instruction_id_list`.

Put these fields in the `metadata` dict of each JSONL record:

```json
{"metadata": {"test": "assert foo(1) == 2", "entry_point": "foo"}, ...}
```

`SerializedTask` passes `metadata` through to `Instance.metadata`
unchanged, so these fields are available at scoring time, for example via
`instance.metadata["test"]`.
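
As a rough illustration of how a scorer might consume these fields, the sketch below runs a completion against the test stored in `metadata`. Everything here is hypothetical: the `Instance` dataclass is a stand-in rather than olmo-eval's class, and `passes_test` is not an olmo-eval scorer. It also calls `exec` directly, which a real code-execution scorer would replace with a sandbox.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Stand-in for olmo-eval's Instance; only the fields used below."""
    question: str
    gold_answer: str
    metadata: dict = field(default_factory=dict)


def passes_test(instance: Instance, completion: str) -> bool:
    """Run the completion, then the test assertion stored in metadata.

    WARNING: exec() on model output is unsafe outside a sandbox; this is
    only meant to show where the metadata fields are read.
    """
    namespace = {}
    try:
        exec(completion, namespace)  # defines the candidate function
        if instance.metadata["entry_point"] not in namespace:
            return False  # the expected function was never defined
        exec(instance.metadata["test"], namespace)  # raises on failure
    except Exception:
        return False
    return True


instance = Instance(
    question="Write foo(x) that returns x + 1.",
    gold_answer="def foo(x):\n    return x + 1",
    metadata={"test": "assert foo(1) == 2", "entry_point": "foo"},
)
ok = passes_test(instance, "def foo(x):\n    return x + 1")
```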

If you need behavior the default `SerializedTask` does not provide -- for
example, a custom `extract_answer` method, per-instance `SamplingParams`,
or reading `system_prompt` from metadata into `LMRequest` -- subclass
`SerializedTask` and register your subclass as a new base task.

---

## Limitations and gotchas

Serialization works well for straightforward completion, loglikelihood, and
chat tasks. More complex task types can introduce complications:

- **Baked-in formatting is model-specific.** If the serialized prompt includes
any chat template, FIM tokens, or context-window truncation applied during
serialization, a JSONL file produced for one model family may not be valid
for another.

- **Dual-mode tasks (generation + BPB).** Some tasks issue both a
`generate_until` and a `loglikelihood` request per instance. These need
two separate JSONL files -- one per request type -- registered as separate
variants with different metrics.

- **Multiple-choice PMI normalization.** Unconditioned-prompt requests (used
for `acc_norm`) are not serialized. If needed, the olmo-eval consumer
would construct them at runtime from the `continuations` list and a
task-level unconditioned prompt string.

- **Rolling / corpus perplexity.** Short-document perplexity can be
serialized as loglikelihood with `prompt=""` and a single continuation.
True sliding-window rolling for long documents requires olmo-eval
infrastructure that does not exist yet.

- **Gold answer reliability.** When serializing from oe-eval,
`doc_to_target()` extracts `gold_answers`, but this method is not always
the canonical answer source for every task. Verify the serialized answers
match what scoring expects before running at scale.
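
For the PMI normalization item above, the arithmetic a consumer would need at runtime is small. The sketch below assumes that normalization means subtracting each continuation's log-probability under the unconditioned prompt from its log-probability under the real prompt; the function names are hypothetical, and the numbers are made up to show how the prediction can change.

```python
def pmi_scores(conditioned, unconditioned):
    """Per-choice score: log p(choice | prompt) - log p(choice | uncond prompt)."""
    if len(conditioned) != len(unconditioned):
        raise ValueError("need one unconditioned log-prob per continuation")
    return [c - u for c, u in zip(conditioned, unconditioned)]


def predict(scores):
    """Index of the highest-scoring choice (first one wins on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)


# Illustrative log-probs for three continuations (not real model output).
conditioned = [-2.0, -1.0, -3.0]    # given the task prompt
unconditioned = [-5.0, -1.5, -3.5]  # given the unconditioned prompt

raw_choice = predict(conditioned)
pmi_choice = predict(pmi_scores(conditioned, unconditioned))
```

In this made-up case the raw log-probabilities favor the second choice, while the normalized scores favor the first, because the second continuation is likely regardless of the prompt.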