[WIP] Hard reasoning #102

Open
rlebras wants to merge 27 commits into allenai:main from rlebras:hard_reasoning

rlebras commented Mar 25, 2026

Contributor

Description

Add HardReasoning evaluation tasks

Adds evaluation support for the allenai/hard-reasoning dataset — a benchmark of NP-hard logic puzzles (knapsack, partition, graph problems, etc.) framed as natural language scenarios.

Tasks

Nine tasks are registered, one per scenario subset:

hard_reasoning_{bringing_toys,classroom_assignment,dinner_party,
expense_splitting,printing_jobs,secret_santa,
social_gathering,wedding_planning,wedding_supplies}

Each task also has a :chat variant with extended context (max_tokens=32768).
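As a rough illustration of the naming scheme above, the brace expansion and the :chat variants can be enumerated as follows. This is only a sketch of the names; how the PR actually registers tasks is not shown in this thread, and the SCENARIOS, BASE_TASKS, and CHAT_TASKS names are hypothetical.

```python
# Scenario subsets as listed in the PR description.
SCENARIOS = [
    "bringing_toys", "classroom_assignment", "dinner_party",
    "expense_splitting", "printing_jobs", "secret_santa",
    "social_gathering", "wedding_planning", "wedding_supplies",
]

# One base task per scenario, plus a ":chat" variant of each
# (assuming the variant is addressed as "<task>:chat").
BASE_TASKS = [f"hard_reasoning_{name}" for name in SCENARIOS]
CHAT_TASKS = [f"{task}:chat" for task in BASE_TASKS]

for task in BASE_TASKS + CHAT_TASKS:
    print(task)
```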

Scoring

Rather than comparing against a stored gold answer, scoring uses the check() verifier from the np-hard-reasoning package. The verifier directly validates that the model's proposed solution satisfies the problem constraints (e.g. weight ≤ capacity, value ≥ target). This is more robust than matching against a gold answer and does not depend on a single canonical solution, since these problems may have multiple valid answers.
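To make the idea concrete, here is a minimal sketch of constraint-based scoring for a knapsack-style instance. It is not the check() function from np-hard-reasoning, whose signature is not shown in this PR; Item and check_knapsack are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    name: str
    weight: int
    value: int


def check_knapsack(items, chosen, capacity, target):
    """Return True if the chosen item names satisfy both constraints."""
    by_name = {item.name: item for item in items}
    # Reject repeated or unknown item names before checking constraints.
    if len(set(chosen)) != len(chosen) or any(n not in by_name for n in chosen):
        return False
    total_weight = sum(by_name[n].weight for n in chosen)
    total_value = sum(by_name[n].value for n in chosen)
    return total_weight <= capacity and total_value >= target


items = [Item("ball", 3, 4), Item("robot", 4, 5), Item("puzzle", 2, 3)]

# Two different selections both pass, so neither has to be "the" gold answer.
assert check_knapsack(items, ["ball", "puzzle"], capacity=6, target=7)
assert check_knapsack(items, ["robot", "puzzle"], capacity=6, target=7)
# This selection exceeds the capacity (weight 7 > 6) and is rejected.
assert not check_knapsack(items, ["ball", "robot"], capacity=6, target=7)
```

A string comparison against one stored answer would mark one of the two valid selections as wrong; a verifier accepts both.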

The np-hard-reasoning package is an optional dependency; install the hard-reasoning extra (.[hard-reasoning]) when running these tasks.
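One common way to handle an optional dependency like this is to import it lazily and fail with an install hint. The sketch below assumes nothing about how the PR does it; require_optional is a hypothetical helper, and the importable module name of np-hard-reasoning is not stated in this thread, so it is left as a parameter.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency, or fail with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Chain the original error so the real cause stays visible.
        raise ImportError(
            f"'{module_name}' is required for this task; "
            f"install the '{extra}' extra, e.g. pip install '.[{extra}]'"
        ) from exc
```

Calling the helper inside the scoring function rather than at module import time would keep the rest of the task registry importable when the extra is not installed.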

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)

Testing

  • Unit tests pass locally (pytest tests/ --ignore=tests/integration/)
  • Integration tests pass (if applicable)
  • New tests added for new functionality

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have added/updated documentation as needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

undfined left a comment

Collaborator

2 suggestions but otherwise looks great!

Comment thread pyproject.toml Outdated
    "nvidia-ml-py>=12.560",
]
hard-reasoning = [
    "np-hard-reasoning @ git+https://github.com/allenai/np-hard-reasoning.git",

Collaborator

Let's pin this to a specific commit.

Comment thread pyproject.toml Outdated
Comment on lines +138 to +143
# ty overrides for hard_reasoning task (uses optional np-hard-reasoning dependency)
[[tool.ty.overrides]]
include = ["src/olmo_eval/evals/tasks/hard_reasoning.py"]
[tool.ty.overrides.rules]
unresolved-import = "ignore"

Collaborator

This is fine for now. I'm going to make a pass on better type handling for optional deps here soon.

Comment on lines +39 to +60
def _extract_last_complete_json(s: str) -> dict | None:
    """Extract the last complete JSON object from a string."""
    stack: list[int] = []
    last_json_start: int | None = None
    last_json_str: str | None = None
    for i, char in enumerate(s):
        if char == "{":
            stack.append(i)
            if last_json_start is None:
                last_json_start = i
        elif char == "}":
            if stack:
                stack.pop()
                if not stack:
                    last_json_str = s[last_json_start : i + 1]
                    last_json_start = None
    if last_json_str:
        try:
            return json.loads(last_json_str.replace("\n", ""))
        except json.JSONDecodeError:
            pass
    return None

Collaborator

Might be worth adding a test for any custom extraction logic like this. It can be lightweight.

2 participants