[WIP] Hard reasoning by rlebras · Pull Request #102 · allenai/olmo-eval

rlebras · 2026-03-25T20:11:11Z

Description

Add HardReasoning evaluation tasks

Adds evaluation support for the allenai/hard-reasoning dataset, a benchmark of NP-hard logic puzzles (knapsack, partition, graph problems, etc.) framed as natural-language scenarios.

Tasks

Nine tasks are registered, one per scenario subset:

hard_reasoning_{bringing_toys,classroom_assignment,dinner_party,
expense_splitting,printing_jobs,secret_santa,
social_gathering,wedding_planning,wedding_supplies}

Each task also has a :chat variant with extended context (max_tokens=32768).
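For reference, the full set of task names can be enumerated as in the sketch below. The scenario list and max_tokens value come from this description; the constant and function names are hypothetical, and appending ":chat" to the base name is an assumption about how the variant is spelled in the registry.

```python
# Scenario subsets listed in the PR description.
SCENARIOS = [
    "bringing_toys",
    "classroom_assignment",
    "dinner_party",
    "expense_splitting",
    "printing_jobs",
    "secret_santa",
    "social_gathering",
    "wedding_planning",
    "wedding_supplies",
]

# Extended generation budget quoted for the chat variants.
CHAT_MAX_TOKENS = 32768


def task_names(include_chat: bool = True) -> list:
    """Return base task names, optionally followed by their :chat variants."""
    names = [f"hard_reasoning_{scenario}" for scenario in SCENARIOS]
    if include_chat:
        # Assumption: the chat variant is the base name plus a ":chat" suffix.
        names += [f"{name}:chat" for name in names]
    return names


if __name__ == "__main__":
    for name in task_names():
        print(name)
```

With include_chat=True this yields 18 names: the nine base tasks followed by their nine chat variants.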

Scoring

Rather than comparing against a stored gold answer, scoring uses the check() verifier from the np-hard-reasoning package. This directly validates that the model's proposed solution satisfies the problem constraints (e.g. weight ≤ capacity, value ≥ target). Verifying constraints is more robust than string matching and avoids depending on a single canonical solution when a problem has multiple valid answers.
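To illustrate why constraint checking accepts multiple valid answers, here is a minimal sketch for a knapsack-style instance. It is not the check() function from np-hard-reasoning, whose signature is not shown in this PR; KnapsackInstance and check_knapsack are hypothetical names used only for this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class KnapsackInstance:
    """Hypothetical instance format: pick items within capacity, reaching target value."""

    weights: list[int]
    values: list[int]
    capacity: int
    target: int


def check_knapsack(instance: KnapsackInstance, chosen: list[int]) -> bool:
    """Return True if the chosen item indices satisfy both constraints."""
    # Reject duplicate or out-of-range indices before summing.
    if len(set(chosen)) != len(chosen):
        return False
    if any(i < 0 or i >= len(instance.weights) for i in chosen):
        return False
    total_weight = sum(instance.weights[i] for i in chosen)
    total_value = sum(instance.values[i] for i in chosen)
    return total_weight <= instance.capacity and total_value >= instance.target


if __name__ == "__main__":
    inst = KnapsackInstance(weights=[3, 4, 5], values=[4, 5, 6], capacity=8, target=9)
    # Two different selections both satisfy the constraints.
    print(check_knapsack(inst, [0, 1]), check_knapsack(inst, [0, 2]))
```

A gold-answer comparison would mark one of the two valid selections above as wrong; the verifier accepts both.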

The np-hard-reasoning package is an optional dependency; install it with .[hard-reasoning] when running these tasks.
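One common way to surface a missing optional dependency is to probe for it and raise an error that names the extra. The sketch below uses only the standard library; require_optional is a hypothetical helper, and the importable module name np_hard_reasoning is an assumption (only the distribution name np-hard-reasoning appears in this PR).

```python
import importlib.util


def require_optional(module_name: str, extra: str) -> None:
    """Raise ImportError with an install hint if a top-level module is missing."""
    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"'{module_name}' is required for this task. "
            f"Install it with: pip install '.[{extra}]'"
        )


if __name__ == "__main__":
    try:
        # Assumption: the package is importable as np_hard_reasoning.
        require_optional("np_hard_reasoning", "hard-reasoning")
    except ImportError as err:
        print(err)
```

Calling the helper at task construction time, rather than at module import, keeps the rest of the task registry importable when the extra is not installed.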

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Refactoring (no functional changes)

Testing

Unit tests pass locally (pytest tests/ --ignore=tests/integration/)
Integration tests pass (if applicable)
New tests added for new functionality

Checklist

My code follows the project's style guidelines
I have performed a self-review of my code
I have added/updated documentation as needed
My changes generate no new warnings
Any dependent changes have been merged and published

undfined

2 suggestions but otherwise looks great!

undfined · 2026-04-02T16:36:26Z

    "nvidia-ml-py>=12.560",
 ]
+hard-reasoning = [
+    "np-hard-reasoning @ git+https://github.com/allenai/np-hard-reasoning.git",


Let's pin this to a specific commit.
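For context, a git direct reference is pinned by appending @ and a commit SHA after the .git URL. A rough check for that form could look like the sketch below; is_pinned_to_commit is a hypothetical helper, and the regex is a heuristic that would also accept a branch or tag name made only of hex characters.

```python
import re

# Heuristic: a git URL ending in ".git@<7 to 40 hex characters>".
_PINNED = re.compile(r"@ git\+https://\S+?\.git@[0-9a-f]{7,40}$")


def is_pinned_to_commit(requirement: str) -> bool:
    """Return True if a direct-reference requirement ends in a commit-like SHA."""
    return bool(_PINNED.search(requirement.strip()))


if __name__ == "__main__":
    unpinned = "np-hard-reasoning @ git+https://github.com/allenai/np-hard-reasoning.git"
    # Prints False: the requirement in the diff has no revision suffix.
    print(is_pinned_to_commit(unpinned))
```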

undfined · 2026-04-02T16:37:10Z

+# ty overrides for hard_reasoning task (uses optional np-hard-reasoning dependency)
+[[tool.ty.overrides]]
+include = ["src/olmo_eval/evals/tasks/hard_reasoning.py"]
+[tool.ty.overrides.rules]
+unresolved-import = "ignore"
+


This is fine for now. I'm going to make a pass on better type handling for optional deps here soon.

undfined · 2026-04-02T16:38:20Z

+def _extract_last_complete_json(s: str) -> dict | None:
+    """Extract the last complete JSON object from a string."""
+    stack: list[int] = []
+    last_json_start: int | None = None
+    last_json_str: str | None = None
+    for i, char in enumerate(s):
+        if char == "{":
+            stack.append(i)
+            if last_json_start is None:
+                last_json_start = i
+        elif char == "}":
+            if stack:
+                stack.pop()
+                if not stack:
+                    last_json_str = s[last_json_start : i + 1]
+                    last_json_start = None
+    if last_json_str:
+        try:
+            return json.loads(last_json_str.replace("\n", ""))
+        except json.JSONDecodeError:
+            pass
+    return None


Might be worth adding a test for any custom extraction logic like this. It can be lightweight.

rlebras added 6 commits March 4, 2026 12:26

Init

23f9dcd

Longer context

2be29eb

Chat without backend

6b2102d

Merge branch 'main' of https://github.com/allenai/olmo-eval-internal into hard_reasoning

e1931dc

Merge branch 'main' of https://github.com/allenai/olmo-eval-internal into hard_reasoning

e6a84ad

HardReasoning scorer based on verifier instead of gold answer

92f2b46

undfined approved these changes Apr 2, 2026

rlebras added 21 commits April 7, 2026 16:31

merging conflicts

08b67f3

Adding unit tests for _extract_last_complete_json per PR review

eb38117

Pinning np hard reasoning to a specific commit, per PR review

996e276

add num_instances=4 directly to the model preset

3195b9a

reverting model preset

bd9ba5f

injecting HF token

16f1a98

streaming hf ds

f023041

hf hub

8f4e994

task dependencies

a25bc92

vLLM and chat

f613ef1

olmo-3-7b-instruct

52b8c3a

parse_rate

b716ef3

olmo 3.1 32B instruct preset

cfd2d43

qwen3 VL 32B instruct preset

455298c

qwen3 32B instruct preset

b8659f8

gemma, olmo3-think, r1-32B

e565f01

qwen3-32B

cdf35fb

r1-qwen3-8b

555bda9

gpt-oss-20b

3f9b903

task split

349b91c

new data format

1465f0b

rlebras wants to merge 27 commits into allenai:main from rlebras:hard_reasoning

2 participants
