[WIP] Hard reasoning#102
Open
rlebras wants to merge 27 commits into
Open
Conversation
…into hard_reasoning
…into hard_reasoning
undfined
approved these changes
Apr 2, 2026
undfined
left a comment
Collaborator
There was a problem hiding this comment.
2 suggestions but otherwise looks great!
| "nvidia-ml-py>=12.560", | ||
| ] | ||
| hard-reasoning = [ | ||
| "np-hard-reasoning @ git+https://github.com/allenai/np-hard-reasoning.git", |
Collaborator
There was a problem hiding this comment.
Let's pin this to a specific commit.
Comment on lines
+138
to
+143
| # ty overrides for hard_reasoning task (uses optional np-hard-reasoning dependency) | ||
| [[tool.ty.overrides]] | ||
| include = ["src/olmo_eval/evals/tasks/hard_reasoning.py"] | ||
| [tool.ty.overrides.rules] | ||
| unresolved-import = "ignore" | ||
|
|
Collaborator
There was a problem hiding this comment.
This is fine for now. I'm going to make a pass on better type handling for optional deps here soon.
Comment on lines
+39
to
+60
| def _extract_last_complete_json(s: str) -> dict | None: | ||
| """Extract the last complete JSON object from a string.""" | ||
| stack: list[int] = [] | ||
| last_json_start: int | None = None | ||
| last_json_str: str | None = None | ||
| for i, char in enumerate(s): | ||
| if char == "{": | ||
| stack.append(i) | ||
| if last_json_start is None: | ||
| last_json_start = i | ||
| elif char == "}": | ||
| if stack: | ||
| stack.pop() | ||
| if not stack: | ||
| last_json_str = s[last_json_start : i + 1] | ||
| last_json_start = None | ||
| if last_json_str: | ||
| try: | ||
| return json.loads(last_json_str.replace("\n", "")) | ||
| except json.JSONDecodeError: | ||
| pass | ||
| return None |
Collaborator
There was a problem hiding this comment.
Might be worth adding a test for any custom extraction logic like this. It can be lightweight.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Add HardReasoning evaluation tasks
Adds evaluation support for the allenai/hard-reasoning dataset — a benchmark of NP-hard logic puzzles (knapsack, partition, graph problems, etc.) framed as natural language scenarios.
Tasks
Nine tasks are registered, one per scenario subset:
hard_reasoning_{bringing_toys,classroom_assignment,dinner_party,
expense_splitting,printing_jobs,secret_santa,
social_gathering,wedding_planning,wedding_supplies}
Each task also has a
:chatvariant with extended context (max_tokens=32768).Scoring
Rather than comparing against a stored gold answer, scoring uses the
check()verifier from thenp-hard-reasoningpackage. This directly validates that the model's proposed solution satisfies the problem constraints (e.g. weight ≤ capacity, value ≥ target), which is both more robust and avoids dependence on a specific canonical solution for problems that may have multiple valid answers.The
np-hard-reasoningpackage is an optional dependency — install with.[hard-reasoning]when running these tasks.Type of Change
Testing
pytest tests/ --ignore=tests/integration/)Checklist