
Conversation


@sahgerlad sahgerlad commented Dec 8, 2025

What does this PR do?

Support datasets saved with save_to_disk in ResponseDataset. This fixes a KeyError raised when loading Arrow datasets that were saved using HuggingFace datasets' save_to_disk() method.

Summary by CodeRabbit

  • Bug Fixes
    • Improved dataset processing to gracefully handle datasets that already contain message columns, eliminating redundant transformations
    • Enhanced dataset loading capabilities to support Arrow datasets in additional formats with improved error handling for compatibility


@sahgerlad sahgerlad requested a review from a team as a code owner December 8, 2025 18:11
@sahgerlad sahgerlad changed the title Fix: Support datasets saved with save_to_disk in ResponseDataset fix: Support datasets saved with save_to_disk in ResponseDataset Dec 8, 2025
Contributor

coderabbitai bot commented Dec 8, 2025

📝 Walkthrough


Two utility functions were enhanced: the dataset message-key transformation is now idempotent, applying only when the "messages" column is absent; dataset loading gains fallback logic to handle Arrow datasets saved with save_to_disk.

Changes

  • Message Key Idempotency — nemo_rl/data/datasets/response_datasets/response_dataset.py:
    Made the add_messages_key application conditional, mapping train_ds and val_ds only when the "messages" column is absent; refined the val_ds check from truthiness to an explicit None comparison.
  • Dataset Loading Enhancement — nemo_rl/data/datasets/utils.py:
    Added the load_from_disk import; updated the load_dataset_from_path docstring to document Arrow dataset support; introduced fallback logic that attempts load_from_disk when load_dataset raises a ValueError containing "load_from_disk".

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Extra attention areas:
    • Verify the conditional logic in response_dataset.py correctly detects presence of "messages" column across different dataset types
    • Confirm the error handling in utils.py properly distinguishes between ValueError messages and doesn't mask unrelated errors

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 50.00%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (3 passed)
  • Description Check ✅ Passed — Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check ✅ Passed — The title "fix: Support datasets saved with save_to_disk in ResponseDataset" accurately and concisely describes the main change: enabling support for datasets saved with HuggingFace's save_to_disk() method.
  • Test Results For Major Changes ✅ Passed — The PR contains focused, targeted bug fixes for datasets saved with save_to_disk() rather than major changes; there are no new features, breaking changes, or significant refactoring.

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 64ab08d and b88bada.

📒 Files selected for processing (2)
  • nemo_rl/data/datasets/response_datasets/response_dataset.py (1 hunks)
  • nemo_rl/data/datasets/utils.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Conform code to Python 3.12+
Indent code with 4 spaces. Do not use tabs
Use snake_case for file names
Use PascalCase for class names
Use snake_case for function and method names
Use snake_case for local variables
Prefix variable names that start with a number with 'k' (e.g., k_99th_percentile)
Use upper snake_case with 'G' prefix for global variables (e.g., G_MY_GLOBAL)
Use upper snake_case for constants
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
Prefer docstrings over comments for interfaces that may be used outside a file
Reserve comments for code within a function or interfaces that are local to a file
If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx
Avoid using reflection when functionality can be easily achieved without reflection
When using try-except blocks, limit the except clause to the smallest set of specific errors possible
When using try-except blocks for duck-typing, keep the body of the try as small as possible and use the else block for logic
YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values
For required configuration attributes, access config directly and expect presence (e.g., policy_cfg['precision']) without hidden defaults
Use typing.NotRequired to mark optional attributes in TypedDict for configuration
When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml
Follow the Google Python Style Guide for Python code

Files:

  • nemo_rl/data/datasets/utils.py
  • nemo_rl/data/datasets/response_datasets/response_dataset.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

For any source file under nemo_rl/*.py that defines a class or function decorated with @ray.remote, add a coverage pragma (# pragma: no cover) because these run in separate Ray processes

Files:

  • nemo_rl/data/datasets/utils.py
  • nemo_rl/data/datasets/response_datasets/response_dataset.py
!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year

Files:

  • nemo_rl/data/datasets/utils.py
  • nemo_rl/data/datasets/response_datasets/response_dataset.py
**/*.{py,sh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)

Files:

  • nemo_rl/data/datasets/utils.py
  • nemo_rl/data/datasets/response_datasets/response_dataset.py
🧬 Code graph analysis (1)
nemo_rl/data/datasets/response_datasets/response_dataset.py (1)
nemo_rl/data/datasets/response_datasets/oai_format_dataset.py (2)
  • map (70-84)
  • add_messages_key (203-221)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (3)
nemo_rl/data/datasets/response_datasets/response_dataset.py (1)

59-63: Guarding add_messages_key avoids KeyError on preformatted datasets

Conditioning the mapping on "messages" not in *.column_names and checking val_ds is not None makes the transformation idempotent and prevents failures when loading datasets that already expose a messages column (e.g., Arrow datasets reloaded from disk). This looks correct and aligned with the intended fix; just ensure your tests cover both cases (with and without an existing messages column).

nemo_rl/data/datasets/utils.py (2)

20-20: Importing load_from_disk is consistent with new loading behavior

Adding load_from_disk to the datasets import cleanly supports the new fallback path in load_dataset_from_path; no issues spotted here. Please make sure your CI environment uses a datasets version that still exposes load_from_disk (it should, but worth confirming).


64-81: Fallback to load_from_disk correctly handles save_to_disk datasets

The updated load_dataset_from_path:

  • Keeps JSON handling unchanged.
  • Attempts load_dataset(data_path) once, with a minimal try body.
  • On ValueError, only falls back to load_from_disk(data_path) when the error message explicitly mentions load_from_disk, otherwise re‑raises.

This is a targeted and backward‑compatible way to support Arrow datasets saved via save_to_disk, while not masking unrelated errors. The docstring now accurately reflects the extended behavior. I don’t see changes needed here; just validate against a small round‑trip test (save via save_to_disk then reload through this helper) to ensure the error message still contains load_from_disk for your pinned datasets version.



Contributor

@terrykong terrykong left a comment


@yuki-97 to review

@terrykong terrykong requested a review from yuki-97 December 10, 2025 19:35
@sahgerlad sahgerlad force-pushed the fix/dataset-loading-save-to-disk branch from b88bada to 2472f66 Compare December 12, 2025 06:13
Contributor

@yuki-97 yuki-97 left a comment


@sahgerlad Thanks for supporting this! LGTM except one minor thing.

yuki-97 previously approved these changes Dec 12, 2025
Contributor

yuki-97 commented Dec 12, 2025

@sahgerlad Can you add a unit test at tests/unit/data/datasets/test_response_dataset.py for the new save_to_disk support?

e.g., you can load a dataset from HF, use save_to_disk to save it to a temporary directory, then use load_response_dataset to load it.

@sahgerlad sahgerlad requested a review from a team as a code owner December 12, 2025 18:26
@sahgerlad sahgerlad force-pushed the fix/dataset-loading-save-to-disk branch from 4c9e6e1 to 7ed83cc Compare December 12, 2025 18:29
sahgerlad and others added 3 commits December 12, 2025 13:35
…s fixes KeyError when loading Arrow datasets that were saved using

HuggingFace datasets' save_to_disk() method.

Signed-off-by: Sahger Lad <[email protected]>
Co-authored-by: Yuki Huang <[email protected]>
Signed-off-by: sahgerlad <[email protected]>
Signed-off-by: Sahger Lad <[email protected]>
Tests that ResponseDataset correctly handles datasets that already
have a 'messages' column, which is the case when loading Arrow
datasets saved with HuggingFace's save_to_disk() method.

Signed-off-by: Sahger Lad <[email protected]>
@sahgerlad sahgerlad force-pushed the fix/dataset-loading-save-to-disk branch from 7ed83cc to 4a4cd8d Compare December 12, 2025 18:35