
Conversation


@sahgerlad sahgerlad commented Dec 8, 2025

What does this PR do?

Support datasets saved with save_to_disk in ResponseDataset. This fixes a KeyError raised when loading Arrow datasets that were saved using HuggingFace datasets' save_to_disk() method.

Summary by CodeRabbit

  • Bug Fixes
    • Improved dataset processing to gracefully handle datasets that already contain message columns, eliminating redundant transformations
    • Enhanced dataset loading capabilities to support Arrow datasets in additional formats with improved error handling for compatibility


@sahgerlad sahgerlad requested a review from a team as a code owner December 8, 2025 18:11
@sahgerlad sahgerlad changed the title Fix: Support datasets saved with save_to_disk in ResponseDataset fix: Support datasets saved with save_to_disk in ResponseDataset Dec 8, 2025
Contributor

coderabbitai bot commented Dec 8, 2025

📝 Walkthrough


Two utility functions were enhanced: the dataset message-key transformation is now idempotent, applying only when the "messages" column is absent; dataset loading gains fallback logic to handle Arrow datasets saved with save_to_disk.

Changes

  • Message Key Idempotency — nemo_rl/data/datasets/response_datasets/response_dataset.py:
    Made the add_messages_key application conditional, mapping train_ds and val_ds only when the "messages" column is absent; refined the val_ds check from truthiness to an explicit None comparison.
  • Dataset Loading Enhancement — nemo_rl/data/datasets/utils.py:
    Added the load_from_disk import; updated the load_dataset_from_path docstring to document Arrow dataset support; introduced fallback logic that attempts load_from_disk when load_dataset raises a ValueError containing "load_from_disk".

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Extra attention areas:
    • Verify the conditional logic in response_dataset.py correctly detects presence of "messages" column across different dataset types
    • Confirm the error handling in utils.py properly distinguishes between ValueError messages and doesn't mask unrelated errors

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 50.00%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (3 passed)
  • Description Check ✅ Passed — Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check ✅ Passed — The title "fix: Support datasets saved with save_to_disk in ResponseDataset" accurately and concisely describes the main change: enabling support for datasets saved with HuggingFace's save_to_disk() method.
  • Test Results For Major Changes ✅ Passed — The PR contains focused, targeted bug fixes for datasets saved with save_to_disk() rather than major changes; there are no new features, breaking changes, or significant refactoring.

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 64ab08d and b88bada.

📒 Files selected for processing (2)
  • nemo_rl/data/datasets/response_datasets/response_dataset.py (1 hunks)
  • nemo_rl/data/datasets/utils.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Conform code to Python 3.12+
Indent code with 4 spaces. Do not use tabs
Use snake_case for file names
Use PascalCase for class names
Use snake_case for function and method names
Use snake_case for local variables
Prefix variable names that start with a number with 'k' (e.g., k_99th_percentile)
Use upper snake_case with 'G' prefix for global variables (e.g., G_MY_GLOBAL)
Use upper snake_case for constants
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
Prefer docstrings over comments for interfaces that may be used outside a file
Reserve comments for code within a function or interfaces that are local to a file
If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx
Avoid using reflection when functionality can be easily achieved without reflection
When using try-except blocks, limit the except clause to the smallest set of specific errors possible
When using try-except blocks for duck-typing, keep the body of the try as small as possible and use the else block for logic
YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values
For required configuration attributes, access config directly and expect presence (e.g., policy_cfg['precision']) without hidden defaults
Use typing.NotRequired to mark optional attributes in TypedDict for configuration
When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml
Follow the Google Python Style Guide for Python code

Files:

  • nemo_rl/data/datasets/utils.py
  • nemo_rl/data/datasets/response_datasets/response_dataset.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

For any source file under nemo_rl/*.py that defines a class or function decorated with @ray.remote, add a coverage pragma (# pragma: no cover) because these run in separate Ray processes

Files:

  • nemo_rl/data/datasets/utils.py
  • nemo_rl/data/datasets/response_datasets/response_dataset.py
!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year

Files:

  • nemo_rl/data/datasets/utils.py
  • nemo_rl/data/datasets/response_datasets/response_dataset.py
**/*.{py,sh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)

Files:

  • nemo_rl/data/datasets/utils.py
  • nemo_rl/data/datasets/response_datasets/response_dataset.py
🧬 Code graph analysis (1)
nemo_rl/data/datasets/response_datasets/response_dataset.py (1)
nemo_rl/data/datasets/response_datasets/oai_format_dataset.py (2)
  • map (70-84)
  • add_messages_key (203-221)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (3)
nemo_rl/data/datasets/response_datasets/response_dataset.py (1)

59-63: Guarding add_messages_key avoids KeyError on preformatted datasets

Conditioning the mapping on "messages" not in *.column_names and checking val_ds is not None makes the transformation idempotent and prevents failures when loading datasets that already expose a messages column (e.g., Arrow datasets reloaded from disk). This looks correct and aligned with the intended fix; just ensure your tests cover both cases (with and without an existing messages column).

nemo_rl/data/datasets/utils.py (2)

20-20: Importing load_from_disk is consistent with new loading behavior

Adding load_from_disk to the datasets import cleanly supports the new fallback path in load_dataset_from_path; no issues spotted here. Please make sure your CI environment uses a datasets version that still exposes load_from_disk (it should, but worth confirming).


64-81: Fallback to load_from_disk correctly handles save_to_disk datasets

The updated load_dataset_from_path:

  • Keeps JSON handling unchanged.
  • Attempts load_dataset(data_path) once, with a minimal try body.
  • On ValueError, only falls back to load_from_disk(data_path) when the error message explicitly mentions load_from_disk, otherwise re‑raises.

This is a targeted and backward‑compatible way to support Arrow datasets saved via save_to_disk, while not masking unrelated errors. The docstring now accurately reflects the extended behavior. I don’t see changes needed here; just validate against a small round‑trip test (save via save_to_disk then reload through this helper) to ensure the error message still contains load_from_disk for your pinned datasets version.



Contributor

@terrykong terrykong left a comment


@yuki-97 to review

@terrykong terrykong requested a review from yuki-97 December 10, 2025 19:35
@sahgerlad sahgerlad force-pushed the fix/dataset-loading-save-to-disk branch from b88bada to 2472f66 Compare December 12, 2025 06:13
Contributor

@yuki-97 yuki-97 left a comment


@sahgerlad Thanks for supporting this! LGTM except one minor thing.

yuki-97 previously approved these changes Dec 12, 2025
Contributor

yuki-97 commented Dec 12, 2025

@sahgerlad Can you add a unit test at tests/unit/data/datasets/test_response_dataset.py for the new save_to_disk support?

e.g., you can load a dataset from HF, use save_to_disk to save it to a temporary directory, then use load_response_dataset to load it.

@sahgerlad sahgerlad requested a review from a team as a code owner December 12, 2025 18:26
@sahgerlad sahgerlad force-pushed the fix/dataset-loading-save-to-disk branch from 4c9e6e1 to 7ed83cc Compare December 12, 2025 18:29
sahgerlad and others added 3 commits December 12, 2025 13:35
…s fixes KeyError when loading Arrow datasets that were saved using

HuggingFace datasets' save_to_disk() method.

Signed-off-by: Sahger Lad <[email protected]>
Co-authored-by: Yuki Huang <[email protected]>
Signed-off-by: sahgerlad <[email protected]>
Signed-off-by: Sahger Lad <[email protected]>
Tests that ResponseDataset correctly handles datasets that already
have a 'messages' column, which is the case when loading Arrow
datasets saved with HuggingFace's save_to_disk() method.

Signed-off-by: Sahger Lad <[email protected]>
@sahgerlad sahgerlad force-pushed the fix/dataset-loading-save-to-disk branch from 7ed83cc to 4a4cd8d Compare December 12, 2025 18:35