
Conversation

@finbarrtimbers
Collaborator

@finbarrtimbers finbarrtimbers commented Nov 24, 2025

This has a few advantages:

  1. Timing-wise, we won't block on slow reward functions. This setup should scale well, both horizontally and vertically. Horizontally: we can have more actors, and each will manage its own rewards. Vertically: because we call the reward function asynchronously, slower reward functions have minimal impact on overall throughput (see the sketch after this list).
  2. This is conceptually cleaner: once a completion is done, it carries all the information needed to process it.
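
A minimal sketch of the async pattern described above, with hypothetical names (`reward_fn`, `score_completions`); the real code builds its reward function via RewardConfig.build():

```python
import asyncio

# Hypothetical async reward function; the real one is built via RewardConfig.build().
async def reward_fn(response: str) -> float:
    await asyncio.sleep(1.0)  # stand-in for a slow verifier (e.g., a remote judge call)
    return 1.0 if "4" in response else 0.0

async def score_completions(responses: list[str]) -> list[float]:
    # Completions are scored concurrently, so one slow reward call
    # delays only its own result, not the whole batch.
    return await asyncio.gather(*(reward_fn(r) for r in responses))

if __name__ == "__main__":
    print(asyncio.run(score_completions(["2 + 2 = 4", "no idea"])))  # [1.0, 0.0]
```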

Runs:


Note

Moves reward calculation into the vLLM actors using an async reward function built from the new RewardConfig, attaching scores/metrics to GenerationResult and updating the training/eval pipeline to consume them.

  • Runtime/Actors:
    • Compute rewards inside LLMRayActor after generation; build async reward_fn from new RewardConfig and attach reward_scores/reward_metrics to GenerationResult.
    • create_vllm_engines/LLMRayActor now accept reward_config, train_dataset, eval_dataset to enable in-actor reward computation.
  • Training pipeline:
    • Remove reward_fn plumbing from grpo_fast; accumulate_inference_batches and eval now read result.reward_scores/reward_metrics and derive stats from them.
    • create_model_and_optimizer passes RewardConfig and datasets to engines.
  • Ground truth utilities:
    • Add apply_verifiable_reward and RewardConfig.build() producing the reward function; extend metrics (per-verifier averages and correct rates).
  • Data structures:
    • Extend GenerationResult with reward_scores and reward_metrics (a sketch follows this list).
  • Tests:
    • Update tests to construct GenerationResult with reward_scores and remove explicit reward_fn usage where applicable.
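
For reference, a sketch of the extended result shape (illustrative only; the real GenerationResult in open-instruct carries more fields):

```python
from dataclasses import dataclass, field

# Illustrative shape only, not the actual class definition.
@dataclass
class GenerationResult:
    responses: list[list[int]]  # token ids, one list per completion
    dataset_index: int          # row of the train/eval dataset this prompt came from
    reward_scores: list[float] = field(default_factory=list)  # one score per completion
    reward_metrics: dict = field(default_factory=dict)        # e.g., per-verifier averages/correct rates
```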

Written by Cursor Bugbot for commit eff4c59. This will update automatically on new commits.

@github-actions
Contributor

Documentation Changes Detected

📄 sitemap.xml
--- site-base/sitemap.xml	2025-11-25 22:49:39.612567452 +0000
+++ site-pr/sitemap.xml	2025-11-25 22:49:36.235470329 +0000
@@ -9,6 +9,10 @@
          <lastmod>2025-11-25</lastmod>
     </url>
     <url>
+         <loc>https://github.com/allenai/open-instruct/nccl_hang_investigation/</loc>
+         <lastmod>2025-11-25</lastmod>
+    </url>
+    <url>
📄 sitemap.xml.gz
Binary files site-base/sitemap.xml.gz and site-pr/sitemap.xml.gz differ

Showing first 10 lines of diff for each changed file (up to 5 files, excluding search indices).


@finbarrtimbers finbarrtimbers marked this pull request as ready for review December 1, 2025 22:02

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines 547 to 551
async def compute_rewards(
    actor: "LLMRayActor", result: GenerationResult, dataset: datasets.Dataset, is_eval: bool
) -> tuple[list[float], dict]:
    example = dataset[result.dataset_index]
    decoded_responses = actor.llm_engine.tokenizer.batch_decode(result.responses)


P1 Badge Decode rewards without stripping special tokens

compute_rewards now decodes completions via actor.llm_engine.tokenizer.batch_decode(result.responses) without skip_special_tokens=True, whereas reward computation previously decoded with special tokens removed (e.g., in accumulate_inference_batches). For tokenizers that inject BOS/EOS markers, those tokens are passed to the verifiers, so format and ground-truth checks see extra tokens and mis-classify otherwise correct responses, distorting reward signals for every request.
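
For concreteness, the difference comes down to one flag on the standard Hugging Face tokenizer API (gpt2 is just an illustrative tokenizer choice here):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative tokenizer choice
token_ids = tokenizer("2 + 2 = 4")["input_ids"] + [tokenizer.eos_token_id]

# Without the flag, special tokens leak into the decoded string...
print(tokenizer.decode(token_ids))                            # '2 + 2 = 4<|endoftext|>'
# ...so an exact-match ground-truth check against "4" at the end would fail.
print(tokenizer.decode(token_ids, skip_special_tokens=True))  # '2 + 2 = 4'
```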


Collaborator

@hamishivi hamishivi left a comment


I think this generally looks good, but one higher-level question: let's say I want to add extra information to my reward computation, e.g. an ongoing buffer of samples that the reward should depend on. How easy is that now that the reward_fn thread has moved into the LLM actors? (I think the move is correct at a high level; I'm just thinking about hackability.)
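
One pattern that would keep this hackable (a sketch of the idea only; `build_reward_fn` and the buffer are hypothetical, not what the PR implements): have RewardConfig.build() return a closure over whatever extra state the reward needs, so the actors never have to know about it.

```python
import asyncio
from collections import deque

def build_reward_fn(buffer_size: int = 128):
    # Hypothetical builder mirroring the RewardConfig.build() idea: the returned
    # closure carries mutable state (a rolling sample buffer) across calls.
    buffer: deque = deque(maxlen=buffer_size)

    async def reward_fn(response: str) -> float:
        # Example of a buffer-dependent reward: penalize recently seen samples.
        score = 0.0 if response in buffer else 1.0
        buffer.append(response)
        return score

    return reward_fn

async def main():
    fn = build_reward_fn()
    print(await fn("hello"))  # 1.0 (novel)
    print(await fn("hello"))  # 0.0 (seen recently)

asyncio.run(main())
```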

@finbarrtimbers
Copy link
Collaborator Author

Fixed the skip_special_tokens=True issue!

dataset = actor.eval_dataset if is_eval else actor.train_dataset
result.reward_scores, result.reward_metrics = await compute_rewards(actor, result, dataset, is_eval)
results_queue = actor.eval_results_queue if is_eval else actor.results_queue
results_queue.put(result)

Bug: Rewards computed before empty response EOS modification

The reward computation now happens in finalize_completed_request on original responses, but the empty response handling (appending EOS token when finish_reason == "stop" and response is empty) happens later in accumulate_inference_batches. Previously, the empty response modification occurred BEFORE reward computation. This means rewards are now computed on potentially empty [] responses, while training uses modified responses containing [eos_token_id]. The TODO comment at line 1790 acknowledges this needs to be moved to LLMRayActor, but in the current state there's a mismatch between what the reward function evaluates and what the model trains on.
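
A minimal sketch of one way to restore the old ordering (hypothetical helper name; not the PR's actual fix): apply the empty-response fix-up inside the actor before compute_rewards runs, so the verifier and the trainer see the same tokens.

```python
# Hypothetical helper mirroring the fix-up described above; running it before
# compute_rewards would keep verifier inputs and training targets identical.
def patch_empty_responses(
    responses: list[list[int]], finish_reasons: list[str], eos_token_id: int
) -> list[list[int]]:
    for i, (resp, reason) in enumerate(zip(responses, finish_reasons)):
        if reason == "stop" and len(resp) == 0:
            responses[i] = [eos_token_id]  # same substitution the trainer applies
    return responses

# patch_empty_responses([[], [101, 102]], ["stop", "length"], eos_token_id=2)
# -> [[2], [101, 102]]
```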


Collaborator

@hamishivi hamishivi left a comment


lgtm :)

actor.request_outputs[base_request_id]["outputs"].append(sub_request["request_output"])

if len(actor.request_outputs[base_request_id]["outputs"]) == expected_n:
    asyncio.run_coroutine_threadsafe(finalize_completed_request(actor, base_request_id), actor.loop)

Bug: Unhandled exceptions in async finalization cause silent failures

The Future returned by asyncio.run_coroutine_threadsafe(finalize_completed_request(...), actor.loop) is discarded, meaning exceptions in compute_rewards or elsewhere in the async function are silently swallowed. Combined with the fact that actor.request_outputs.pop(base_request_id) happens before the await compute_rewards(...) call, if reward computation fails the request data is already removed and the result never reaches the queue. The data preparation thread in accumulate_inference_batches would then hang waiting for results that will never arrive. The previous synchronous approach had exceptions bubble up visibly.
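
One standard way to surface these failures (a sketch, not the PR's fix): keep the concurrent.futures.Future that run_coroutine_threadsafe returns and attach a done-callback that logs any exception.

```python
import concurrent.futures
import logging

logger = logging.getLogger(__name__)

def _surface_failure(future: concurrent.futures.Future) -> None:
    # Runs once the coroutine finishes; logs instead of silently dropping errors.
    exc = future.exception()
    if exc is not None:
        logger.error("finalize_completed_request failed: %r", exc)

# Instead of discarding the Future:
#   future = asyncio.run_coroutine_threadsafe(
#       finalize_completed_request(actor, base_request_id), actor.loop
#   )
#   future.add_done_callback(_surface_failure)
```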


@finbarrtimbers finbarrtimbers added this pull request to the merge queue Dec 2, 2025
Merged via the queue into main with commit 0faba3c Dec 2, 2025
6 checks passed
finbarrtimbers added a commit that referenced this pull request Dec 3, 2025
After PR #1225 moved reward_fn to live inside LLMRayActor, these
references were left behind during the merge. This removes:
- reward_fn parameter from maybe_evaluate() and run_training()
- reward_fn from accumulate_inference_batches() calls
- reward_fn from create_model_and_optimizer return value unpacking
- Unused Callable import

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>