Conversation

@finbarrtimbers (Collaborator) commented Nov 18, 2025

There's not a ton of point to this on its own, right now. But by making this change, we can subsequently:

  1. Move dpo_tune_cache.py over to use this, which will (eventually) let it use the Olmo-core trainer.
  2. Give each LLMRayActor its own data_loader, so the actors can pull prompts independently and not have to coordinate with main. This will let us remove add_prompt_to_generator and have them operate fully asynchronously, which will make switching the entire grpo_fast.py script over to the data_loader/trainer pattern easier.

Runs:


Note

Adds an olmo_core-compatible HF dataset loader and rewires GRPO, queues, and vLLM utils to use prompt_id-based requests with checkpointable reshuffling.

  • Data Loading:
    • HFDataLoader: New olmo_core DataLoaderBase wrapper for HuggingFace Dataset with sharding, shuffling/automatic reshuffle, exclusion, checkpointable state, and prompt_id emission.
  • GRPO Pipeline:
    • Replace ShufflingIterator and PendingQueriesMap with HFDataLoader-driven sampling and replenishment.
    • Update batching (accumulate_inference_batches) to fetch examples from prompt_dataset by dataset_index and handle no-resample/exclusion via loader.
    • Change add_prompt_to_generator to enqueue single-example PromptRequest using dataset_index and prompt_id.
    • Save/restore data loader state in checkpoints; create eval data loader; optional automatic reshuffle.
  • Queue Types & IDs:
    • PromptRequest/GenerationResult: require dataset_index, add prompt_id, remove epoch_number/training_step fields.
    • Request IDs are now formatted as train_{prompt_id} or eval_{prompt_id}.
  • vLLM Utils:
    • Update request ID creation, metadata, and result processing to use prompt_id.
  • Tests:
    • Add test_data_loader.py; refactor GRPO and vLLM utils tests to new loader/queue semantics and prompt-based flow.
  • Dependencies:
    • Add ai2-olmo-core==2.3.0 to integrate with DataLoaderBase.

Written by Cursor Bugbot for commit 3107302. This will update automatically on new commits.
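As a rough illustration of the loader semantics summarized above (per-epoch shuffling, exclusion, prompt_id emission, checkpointable state), here is a minimal, framework-free sketch. CheckpointableLoader and all of its field names are hypothetical; this is not the PR's actual HFDataLoader nor olmo_core's DataLoaderBase API:

```python
import random


class CheckpointableLoader:
    """Illustrative stand-in for the PR's HFDataLoader: shuffles per epoch,
    supports excluding retired prompts, emits prompt_ids, and can checkpoint
    and restore its position."""

    def __init__(self, examples, seed=0):
        self.examples = examples  # list of dicts with a "dataset_index" key
        self.seed = seed
        self.epoch = 0
        self.batches_processed = 0
        self.excluded = set()  # dataset_indices retired from sampling
        self._order = self._shuffled_order()

    def _shuffled_order(self):
        # Deterministic per-epoch shuffle over non-excluded examples.
        order = [i for i in range(len(self.examples))
                 if self.examples[i]["dataset_index"] not in self.excluded]
        random.Random(self.seed + self.epoch).shuffle(order)
        return order

    def exclude_index(self, dataset_index):
        # Takes effect at the next reshuffle, so the prompt is never resampled.
        self.excluded.add(dataset_index)

    def __next__(self):
        if self.batches_processed >= len(self._order):
            # Automatic reshuffle: roll over into a new epoch.
            self.epoch += 1
            self.batches_processed = 0
            self._order = self._shuffled_order()
        example = self.examples[self._order[self.batches_processed]]
        self.batches_processed += 1
        return example | {"prompt_id": f"{self.epoch}_{example['dataset_index']}"}

    def state_dict(self):
        return {"epoch": self.epoch,
                "batches_processed": self.batches_processed,
                "excluded": sorted(self.excluded)}

    def load_state_dict(self, state):
        self.epoch = state["epoch"]
        self.batches_processed = state["batches_processed"]
        self.excluded = set(state["excluded"])
        self._order = self._shuffled_order()
```

The prompt_id here follows the {epoch}_{dataset_index} scheme visible in the PR's quoted code below, which makes request IDs unique across reshuffles of the same example.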

@finbarrtimbers finbarrtimbers marked this pull request as ready for review November 26, 2025 17:17

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines 1678 to 1682

    percent_solved = np.mean(scores).item() / args.max_possible_score
    # Don't resample prompts that were solved at more than no_resampling_pass_rate
    if no_resampling_pass_rate is not None and percent_solved >= no_resampling_pass_rate:
        iter_dataloader.exclude_index(result.dataset_index)
        total_no_resampled += 1
        logging.debug(


P1: Enforce no_resampling_pass_rate when prompt is solved

The no_resampling_pass_rate flag is now effectively ignored. In accumulate_inference_batches, when a prompt's average score exceeds the threshold (lines 1678-1682) the code only increments total_no_resampled and logs, but no longer excludes that prompt from future sampling (the previous iter_dataloader.exclude_index(...) call was removed). With the flag set, solved prompts will keep being fed back through add_prompt_to_generator(next(data_loader), …) and resampled indefinitely instead of being retired, so training continues to spend compute on already-solved items.


@hamishivi (Collaborator) left a comment


Nice stuff! I like that this PR is actually reducing the code complexity a little. But a few questions before I'm happy with merging...


@property
def total_batches(self) -> int:
    """Return the total number of batches in an epoch."""

Is this docstring right? Isn't effective_size the number of samples in the dataset, from line 38?

@finbarrtimbers (Author):
yes, you're right. Fixed!

    percent_solved = np.mean(scores).item() / args.max_possible_score
    # Don't resample prompts that were solved at more than no_resampling_pass_rate
    if no_resampling_pass_rate is not None and percent_solved >= no_resampling_pass_rate:
        iter_dataloader.exclude_index(result.dataset_index)
was this change intentional?

@finbarrtimbers (Author):
Mistake! Fixed.

"""Return an iterable over all batches in the epoch."""
for i in range(self.batches_processed, self.effective_size):
example = self.dataset[i]
yield example | {"prompt_id": f"{self._epoch}_{example['dataset_index']}"}

Bug: batches_processed never incremented during iteration

The _iter_batches method yields items but never increments self.batches_processed. This breaks checkpointing because state_dict will always save batches_processed at its initial value (0 or whatever was restored). When resuming from a checkpoint, the data loader will restart from the beginning of the epoch instead of continuing from where it left off. The loop variable i should be used to update batches_processed after each yield, or batches_processed should be incremented in the __next__ method.


        self.reshuffle()
        self._current_iter = self._iter_batches()
        return next(self._current_iter)
    raise

Bug: Evaluation data loader exhausted after first use

When automatic_reshuffle=False (the default), after the first complete iteration through the dataset, _current_iter remains as an exhausted generator and is never reset. Subsequent calls to __next__ will immediately raise StopIteration without yielding any items. This breaks evaluation in grpo_fast.py where eval_data_loader is created without automatic_reshuffle=True and is iterated over multiple times during training. After the first evaluation completes, all subsequent evaluations will process zero examples because the iterator is exhausted.


@hamishivi (Collaborator) left a comment

I don't think the fix got pushed for two of the comments 😅 I think the new Cursor Bugbot comments are somewhat valid, and I left a comment on the automatic reshuffle!

"""Whether to filter out prompts with zero reward std (all samples have the same score)."""
no_resampling_pass_rate: float | None = None
"""If the response to a prompt is solved at a rate higher than this, do not resample this prompt again"""
automatic_reshuffle: bool = True

If this flag is false, then training just errors, right? Is there a reason to have it configurable?
