Now, reward_fn lives in LLMRayActor.
#1225
Conversation
Documentation Changes Detected 📄
c24fb39 to ccea424
💡 Codex Review
Here are some automated review suggestions for this pull request.
open_instruct/vllm_utils.py (Outdated)

```python
async def compute_rewards(
    actor: "LLMRayActor", result: GenerationResult, dataset: datasets.Dataset, is_eval: bool
) -> tuple[list[float], dict]:
    example = dataset[result.dataset_index]
    decoded_responses = actor.llm_engine.tokenizer.batch_decode(result.responses)
```
Decode rewards without stripping special tokens
compute_rewards now decodes completions via actor.llm_engine.tokenizer.batch_decode(result.responses) without skip_special_tokens=True, whereas reward computation previously decoded with special tokens removed (e.g., in accumulate_inference_batches). For tokenizers that inject BOS/EOS markers, those tokens are passed to the verifiers, so format and ground-truth checks see extra tokens and mis-classify otherwise correct responses, distorting reward signals for every request.
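For illustration, a minimal sketch of the suggested change; `decode_for_rewards` is an invented helper name, and a Hugging Face-style tokenizer is assumed:

```python
from typing import List


def decode_for_rewards(tokenizer, responses: List[List[int]]) -> List[str]:
    # Strip BOS/EOS and other special tokens so format and ground-truth
    # verifiers only see the model's actual text.
    return tokenizer.batch_decode(responses, skip_special_tokens=True)
```

Inside compute_rewards this would replace the plain batch_decode call, e.g. `decoded_responses = decode_for_rewards(actor.llm_engine.tokenizer, result.responses)`.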
hamishivi
left a comment
I think this generally looks good, but one higher-level question: let's say I want to add extra information into my reward computation, e.g. some ongoing buffer of samples that the reward computation should depend on. How easy is this now that the reward_fn thread has been moved into the LLM actors? (I think that move is correct at a high level; I'm just thinking about hackability.)
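For illustration, one hypothetical way such a buffer could be threaded through the actor. `RewardState`, `compute_rewards_with_state`, and `actor.reward_state` are invented names, and the wrapper assumes the PR's `compute_rewards`; this is a sketch of the hackability question, not code from the PR:

```python
from collections import deque


class RewardState:
    """Mutable state the reward computation can consult inside the actor."""

    def __init__(self, buffer_size: int = 1024):
        # Rolling buffer of recently seen responses (illustrative only).
        self.sample_buffer = deque(maxlen=buffer_size)


async def compute_rewards_with_state(actor, result, dataset, is_eval):
    # A custom verifier could inspect actor.reward_state.sample_buffer here,
    # e.g. to penalize responses that merely repeat recently seen samples,
    # before delegating to the PR's compute_rewards.
    scores, metrics = await compute_rewards(actor, result, dataset, is_eval)
    actor.reward_state.sample_buffer.extend(result.responses)
    return scores, metrics
```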
Fixed the
```python
dataset = actor.eval_dataset if is_eval else actor.train_dataset
result.reward_scores, result.reward_metrics = await compute_rewards(actor, result, dataset, is_eval)
results_queue = actor.eval_results_queue if is_eval else actor.results_queue
results_queue.put(result)
```
Bug: Rewards computed before empty response EOS modification
The reward computation now happens in finalize_completed_request on the original responses, but the empty-response handling (appending the EOS token when finish_reason == "stop" and the response is empty) happens later, in accumulate_inference_batches. Previously, the empty-response modification occurred BEFORE reward computation. As a result, rewards are now computed on potentially empty [] responses while training uses modified responses containing [eos_token_id]. The TODO comment at line 1790 acknowledges this needs to move into LLMRayActor, but in the current state there is a mismatch between what the reward function evaluates and what the model trains on.
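One way to address this, as a sketch only: normalize empty responses inside finalize_completed_request before rewards are computed. `apply_empty_response_fix` is an invented helper, and the finish_reason fields on the result are assumed here:

```python
def apply_empty_response_fix(responses, finish_reasons, eos_token_id):
    # If generation stopped with an empty completion, substitute [eos_token_id]
    # so rewards are computed on the same token sequence the trainer will see.
    fixed = []
    for tokens, finish_reason in zip(responses, finish_reasons):
        if finish_reason == "stop" and len(tokens) == 0:
            fixed.append([eos_token_id])
        else:
            fixed.append(tokens)
    return fixed


# In finalize_completed_request (sketch; field names assumed):
#   result.responses = apply_empty_response_fix(
#       result.responses, result.finish_reasons, actor.llm_engine.tokenizer.eos_token_id
#   )
#   result.reward_scores, result.reward_metrics = await compute_rewards(actor, result, dataset, is_eval)
```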
hamishivi
left a comment
lgtm :)
```python
actor.request_outputs[base_request_id]["outputs"].append(sub_request["request_output"])

if len(actor.request_outputs[base_request_id]["outputs"]) == expected_n:
    asyncio.run_coroutine_threadsafe(finalize_completed_request(actor, base_request_id), actor.loop)
```
Bug: Unhandled exceptions in async finalization cause silent failures
The Future returned by asyncio.run_coroutine_threadsafe(finalize_completed_request(...), actor.loop) is discarded, so exceptions raised in compute_rewards or elsewhere in the async function are silently swallowed. Because actor.request_outputs.pop(base_request_id) happens before the await compute_rewards(...) call, a failed reward computation means the request data is already removed and the result never reaches the queue. The data-preparation thread in accumulate_inference_batches would then hang waiting for results that will never arrive. The previous synchronous approach let exceptions bubble up visibly.
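A sketch of one way to surface those errors, assuming the same run_coroutine_threadsafe call site; `_log_finalize_failure` and the logging choice are illustrative, not part of the PR:

```python
import asyncio
import concurrent.futures
import logging

logger = logging.getLogger(__name__)


def _log_finalize_failure(future: concurrent.futures.Future, base_request_id: str) -> None:
    # Surface exceptions from the finalization coroutine instead of letting
    # the discarded Future swallow them silently.
    exc = future.exception()
    if exc is not None:
        logger.error("finalize_completed_request failed for %s", base_request_id, exc_info=exc)


# At the call site (sketch; mirrors the excerpt above):
#   future = asyncio.run_coroutine_threadsafe(finalize_completed_request(actor, base_request_id), actor.loop)
#   future.add_done_callback(lambda f: _log_finalize_failure(f, base_request_id))
```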
After PR #1225 moved reward_fn to live inside LLMRayActor, these references were left behind during the merge. This removes:
- the reward_fn parameter from maybe_evaluate() and run_training()
- reward_fn from accumulate_inference_batches() calls
- reward_fn from create_model_and_optimizer return value unpacking
- the unused Callable import

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
This has a few advantages:
Runs:
Note
Moves reward calculation into vLLM actors using an async RewardConfig-built function, attaching scores/metrics to GenerationResult and updating the training/eval pipeline to consume them.
- `LLMRayActor` after generation; build async `reward_fn` from new `RewardConfig` and attach `reward_scores`/`reward_metrics` to `GenerationResult`.
- `create_vllm_engines`/`LLMRayActor` now accept `reward_config`, `train_dataset`, `eval_dataset` to enable in-actor reward computation.
- `reward_fn` plumbing from `grpo_fast`; `accumulate_inference_batches` and eval now read `result.reward_scores`/`reward_metrics` and derive stats from them.
- `create_model_and_optimizer` passes `RewardConfig` and datasets to engines.
- `apply_verifiable_reward` and `RewardConfig.build()` producing the reward function; extend metrics (per-verifier averages and correct rates).
- `GenerationResult` with `reward_scores` and `reward_metrics`.
- `GenerationResult` with `reward_scores` and remove explicit `reward_fn` usage where applicable.

Written by Cursor Bugbot for commit eff4c59.
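As a rough sketch of the consumer side of this flow (queue and field names follow the excerpts above; the aggregation itself is an assumption, not code from the PR):

```python
# Illustrative only: after this PR, the data-preparation path reads rewards
# off the GenerationResult that the actor put on the results queue, instead
# of calling reward_fn itself.
def consume_result(results_queue):
    result = results_queue.get()
    scores = result.reward_scores    # attached in-actor by compute_rewards
    metrics = result.reward_metrics  # e.g. per-verifier averages and correct rates
    mean_reward = sum(scores) / max(len(scores), 1)
    return result, mean_reward, metrics
```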