Add PyTorch DataLoader Evaluator plugin #6112

JanuszL · 2025-12-05T10:39:16Z

Introduces a lightweight diagnostic tool for identifying data loading
bottlenecks in PyTorch training pipelines.

- This change adds the Loader Evaluator to the PyTorch DALI plugin, along
  with a Jupyter notebook tutorial, a documentation page, and tests.
- The LoaderEvaluator class wraps a PyTorch DataLoader with performance
  monitoring.
- Two operation modes: 'log' (normal iteration with metrics) and 'replay'
  (cached batches for ideal performance simulation).
- PerformanceMetrics class for detailed performance tracking and bottleneck
  analysis.
- In-memory batch caching for replay mode to simulate ideal data loading.
- Comprehensive test suite and documentation with an example notebook.

The tool helps users compare real vs. ideal data loading performance and
identify optimization opportunities.
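As a rough illustration of the real-versus-ideal comparison, the sketch below estimates how much of an epoch is spent waiting on data, given the wall-clock time of one epoch in 'log' mode and one in 'replay' mode. The function name `data_loading_share` and the formula are assumptions made for illustration; they are not taken from the plugin.

```python
def data_loading_share(log_epoch_s: float, replay_epoch_s: float) -> float:
    """Estimate the fraction of epoch time attributable to data loading.

    log_epoch_s:    wall-clock time of an epoch with the real loader.
    replay_epoch_s: wall-clock time of the same epoch with cached batches.
    """
    if log_epoch_s <= 0:
        raise ValueError("log_epoch_s must be positive")
    # Replay can come out marginally slower due to noise; clamp to [0, 1].
    share = 1.0 - replay_epoch_s / log_epoch_s
    return min(1.0, max(0.0, share))


# Hypothetical timings: 120 s with the real loader, 90 s with cached batches.
print(f"{data_loading_share(120.0, 90.0):.0%} of the epoch is spent waiting on data")
```

A share close to zero suggests the model, not the loader, limits throughput; a large share suggests the loader is worth optimizing.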

Authored-by: Albert Wolant [email protected]

Category:

New feature (non-breaking change which adds functionality)

Description:

Additional information:

Affected modules and functionalities:

new module in the PyTorch plugin
new example
new test for it
new documentation page describing the overall idea

Key points relevant for the review:

overall idea and flow

Tests:

Checklist

Documentation

DALI team only

Requirements

Implements new requirements
Affects existing requirements
N/A

REQ IDs: N/A

JIRA TASK: DALI-4299

review-notebook-app · 2025-12-05T10:39:22Z

Check out this pull request on ReviewNB.

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

dali/test/python/test_pytorch_loader_evaluator.py

dali-automaton · 2025-12-05T10:45:29Z

CI MESSAGE: [39670512]: BUILD STARTED

JanuszL · 2025-12-05T10:47:30Z

!build

dali-automaton · 2025-12-05T10:50:27Z

CI MESSAGE: [39670654]: BUILD STARTED

greptile-apps · 2025-12-05T10:55:55Z

Greptile Overview

Greptile Summary

This PR introduces a new PyTorch DataLoader Evaluator plugin to help identify data loading bottlenecks in training pipelines. The tool wraps PyTorch DataLoader with two modes: "log" mode for collecting performance metrics during normal iteration, and "replay" mode for simulating ideal data loading by caching and replaying batches.

- Added LoaderEvaluator class in nvidia.dali.plugin.pytorch.loader_evaluator that provides in-memory batch caching, performance metrics collection (batch times, total time, throughput), and seamless integration with existing PyTorch training loops.
- Comprehensive test suite covering basic functionality, both operation modes, edge cases (empty dataloaders, single batches, invalid modes), and metrics collection.
- Documentation includes RST pages explaining the technical approach and comparison with other profiling tools (NSYS, PyTorch Profiler), plus a Jupyter notebook tutorial demonstrating practical usage.
- Test integration added to the PyTorch test suite in qa/TL0_python-self-test-core/test_body.sh.
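To make the "log" mode concrete, here is a minimal sketch of a wrapper that times how long each batch takes to arrive and reports simple totals. `TimedLoader` and the metric keys are hypothetical stand-ins, not the LoaderEvaluator implementation from this PR, and a plain list is used in place of a DataLoader so the snippet runs without PyTorch.

```python
import time
from typing import Any, Dict, Iterable, Iterator, List


class TimedLoader:
    """Hypothetical stand-in for a 'log'-mode wrapper; not the DALI class."""

    def __init__(self, loader: Iterable[Any]) -> None:
        self.loader = loader
        self.batch_times: List[float] = []

    def __iter__(self) -> Iterator[Any]:
        self.batch_times = []
        it = iter(self.loader)
        while True:
            start = time.perf_counter()
            try:
                batch = next(it)  # time spent waiting for the loader
            except StopIteration:
                return
            self.batch_times.append(time.perf_counter() - start)
            yield batch

    def get_metrics(self) -> Dict[str, float]:
        total = sum(self.batch_times)
        n = len(self.batch_times)
        return {
            "num_batches": n,
            "total_time": total,
            "avg_batch_time": total / n if n else 0.0,
        }


# Any iterable works here; a real DataLoader would be passed the same way.
loader = TimedLoader([[1, 2], [3, 4], [5, 6]])
batches = list(loader)
metrics = loader.get_metrics()
```

Only the time spent inside `next()` is recorded, so work done by the training loop between batches is excluded from the loader's numbers.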

Confidence Score: 5/5

This PR is safe to merge - it adds a new, self-contained diagnostic tool with no impact on existing functionality
The implementation is well-structured, follows existing patterns in the DALI PyTorch plugin, includes comprehensive tests, and is properly documented. The code is additive only with no modifications to existing functionality.
No files require special attention

Important Files Changed

File Analysis

Filename	Score	Overview
dali/python/nvidia/dali/plugin/pytorch/loader_evaluator/loader.py	4/5	Core LoaderEvaluator implementation providing DataLoader wrapping with performance metrics. Well-structured with log/replay modes for bottleneck detection.
dali/test/python/test_pytorch_loader_evaluator.py	5/5	Comprehensive test suite covering basic functionality, modes, methods, and edge cases. Good coverage of the LoaderEvaluator features.
docs/examples/frameworks/pytorch/loader_evaluator/pytorch_data_loader_evaluator.ipynb	5/5	Well-documented Jupyter notebook tutorial demonstrating data loading bottleneck detection with clear examples and explanations.
docs/plugins/pytorch_data_loader_evaluator.rst	5/5	Documentation page explaining the tool's purpose, technical approach, and comparison with other profiling tools.

Sequence Diagram

sequenceDiagram
    participant User as Training Loop
    participant LE as LoaderEvaluator
    participant DL as PyTorch DataLoader
    participant Cache as Batch Cache

    Note over User,Cache: Log Mode
    User->>LE: for batch in loader
    LE->>DL: iter(dataloader)
    loop Each batch
        LE->>DL: next(dataloader_iter)
        DL-->>LE: batch data
        LE->>LE: record batch_time
        LE-->>User: yield batch
    end
    User->>LE: get_metrics()
    LE-->>User: performance metrics

    Note over User,Cache: Replay Mode (Construction)
    User->>LE: LoaderEvaluator(dl, mode="replay")
    LE->>DL: iterate all batches
    DL-->>LE: batch data
    LE->>Cache: cache batches (up to num_cached_batches)

    Note over User,Cache: Replay Mode (Iteration)
    User->>LE: for batch in loader
    loop Each batch (original length)
        LE->>Cache: get cached_batches[i % cache_size]
        Cache-->>LE: cached batch
        LE->>LE: record batch_time
        LE-->>User: yield batch
    end
    User->>LE: get_metrics()
    LE-->>User: ideal performance metrics
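
The replay flow in the diagram can be sketched as follows: cache up to `num_cached_batches` batches at construction, then yield `cache[i % cache_size]` for as many steps as the original loader had batches. `ReplayLoader` is a hypothetical name, and the sketch assumes the wrapped loader supports both `len()` and iteration; the actual implementation in loader.py may differ.

```python
from itertools import islice
from typing import Any, Iterator, List


class ReplayLoader:
    """Hypothetical replay wrapper following the diagram; not the DALI class."""

    def __init__(self, loader: Any, num_cached_batches: int) -> None:
        # Assumption: the loader has a length, as a map-style DataLoader does.
        self.length = len(loader)
        # Keep at most num_cached_batches batches in memory.
        self.cache: List[Any] = list(islice(loader, num_cached_batches))

    def __len__(self) -> int:
        return self.length

    def __iter__(self) -> Iterator[Any]:
        if not self.cache:
            return  # nothing cached, so there is nothing to replay
        for i in range(self.length):
            # Cycle through the cache to cover the original epoch length.
            yield self.cache[i % len(self.cache)]


replay = ReplayLoader([["a"], ["b"], ["c"], ["d"], ["e"]], num_cached_batches=2)
replayed = list(replay)
```

Because cached batches repeat, replay mode is only meaningful for timing; the data it yields is not a faithful epoch.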

greptile-apps

Additional Comments (3)

dali/python/nvidia/dali/plugin/pytorch/loader_evaluator/metrics.py, line 137-145 (link)

logic: json.dump() will fail with TypeError because get_summary() returns numpy types (np.float64 from np.mean(), np.std(), etc.) which are not JSON serializable.

Consider converting numpy types to native Python types before serialization, e.g.:

def save_metrics(self, filename: str):
    """Save metrics to a JSON file."""
    metrics = self.get_summary()
    bottlenecks = self.identify_bottlenecks()

    # Convert numpy types to native Python types for JSON serialization
    def convert_numpy(obj):
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        elif isinstance(obj, (np.floating, np.integer)):
            return obj.item()
        elif isinstance(obj, dict):
            return {k: convert_numpy(v) for k, v in obj.items()}
        elif isinstance(obj, list):
            return [convert_numpy(i) for i in obj]
        return obj

    output = convert_numpy({"metrics": metrics, "bottlenecks": bottlenecks})

    with open(filename, "w") as f:
        json.dump(output, f, indent=2)

    print(f"Performance metrics saved to: {filename}")
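
An alternative to converting the whole structure up front is the `default` hook of `json.dumps`, which is called only for objects the encoder cannot handle. The sketch below is duck-typed so it runs without NumPy; `to_builtin` and `FakeScalar` are hypothetical names used for illustration. Note that `np.float64` subclasses Python's `float`, so the standard encoder should accept it as is, while types such as `np.float32`, NumPy integers, and arrays are the ones that need conversion.

```python
import json
from typing import Any


def to_builtin(obj: Any) -> Any:
    """Fallback for json.dumps: called only for objects json cannot encode."""
    if hasattr(obj, "tolist"):  # NumPy arrays and scalars expose tolist()
        return obj.tolist()
    if hasattr(obj, "item"):  # other scalar wrappers exposing item()
        return obj.item()
    raise TypeError(f"{type(obj).__name__} is not JSON serializable")


class FakeScalar:
    """Stand-in for a NumPy scalar such as np.float32 (hypothetical)."""

    def __init__(self, value: float) -> None:
        self._value = value

    def item(self) -> float:
        return self._value


summary = {"avg_batch_time": FakeScalar(0.5), "num_batches": 3}
encoded = json.dumps(summary, default=to_builtin)
```

The same hook can be passed to `json.dump` when writing to a file.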

dali/python/nvidia/dali/plugin/pytorch/loader_evaluator/metrics.py, line 100-102 (link)

style: Potential division by zero if all batch_times are 0 (edge case).
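
The referenced lines of metrics.py are not shown in this thread, so the following is only a sketch of how such a guard could look, assuming throughput is computed as the number of batches divided by the total time. The function name `summarize`, the metric keys, and the choice to report 0.0 when the total time is zero are all assumptions.

```python
from typing import Dict, Sequence


def summarize(batch_times: Sequence[float]) -> Dict[str, float]:
    """Summarize batch timings without dividing by zero."""
    total = sum(batch_times)
    n = len(batch_times)
    return {
        "total_time": total,
        # An empty list has no average; report 0.0 instead of raising.
        "avg_batch_time": total / n if n else 0.0,
        # total can be 0.0 for an empty list or when every timing is 0.
        "batches_per_second": n / total if total > 0 else 0.0,
    }


all_zero = summarize([0.0, 0.0])
normal = summarize([0.5, 0.5])
```

Reporting 0.0 keeps the result JSON-friendly; raising an error or returning `None` would be equally defensible choices.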

dali/python/nvidia/dali/plugin/pytorch/loader_evaluator/__init__.py, line 1 (link)

style: Missing Apache 2.0 license header that other files in this repository have. Same applies to loader.py and metrics.py.

9 files reviewed, 3 comments

Signed-off-by: Janusz Lisiecki <[email protected]>

JanuszL · 2025-12-05T12:11:10Z

@greptileai please review again.

greptile-apps

Additional Comments (1)

dali/python/nvidia/dali/plugin/pytorch/loader_evaluator/__init__.py, line 16 (link)

syntax: Missing metrics.py module. This import will raise ModuleNotFoundError: No module named 'nvidia.dali.plugin.pytorch.loader_evaluator.metrics' at runtime.

Either create the missing metrics.py file with the PerformanceMetrics class, or remove this import and the corresponding __all__ entry.

8 files reviewed, 1 comment

Signed-off-by: Janusz Lisiecki <[email protected]>

JanuszL · 2025-12-05T12:46:05Z

!build

dali-automaton · 2025-12-05T12:50:28Z

CI MESSAGE: [39675479]: BUILD STARTED

JanuszL · 2025-12-05T12:53:58Z

@greptileai please review again.

greptile-apps

Additional Comments (2)

dali/python/nvidia/dali/plugin/pytorch/loader_evaluator/loader.py, line 85 (link)

style: Type hint assumes specific batch structure but DataLoader can return any type. Should this support generic batch types instead of assuming Tuple[TorchTensor, TorchTensor]?
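
One way to avoid hard-coding the batch structure is to make the wrapper generic over the batch type, so the annotation follows whatever the wrapped loader yields. This is a sketch of that typing approach only; `PassThrough` and `BatchT` are hypothetical names and do not reflect the signature in loader.py.

```python
from typing import Generic, Iterable, Iterator, TypeVar

BatchT = TypeVar("BatchT")


class PassThrough(Generic[BatchT]):
    """Hypothetical wrapper whose batch type follows the wrapped loader."""

    def __init__(self, loader: Iterable[BatchT]) -> None:
        self.loader = loader

    def __iter__(self) -> Iterator[BatchT]:
        # Batches are forwarded unchanged, whatever their structure.
        yield from self.loader


# A type checker infers dict batches here and tuple batches below.
dict_batches = PassThrough([{"image": 1, "label": 0}])
tuple_batches = PassThrough([(1, 0)])
```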

dali/python/nvidia/dali/plugin/pytorch/loader_evaluator/loader.py, line 146-163 (link)

style: Double break logic could be simplified with a single loop condition
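
The loop at loader.py lines 146-163 is not reproduced in this thread, so the following is a generic illustration rather than a patch. It assumes a nested loop that collects batches from several passes until a cap is reached, which is the usual source of a double break, and shows how `itertools.islice` over a lazily flattened stream removes both breaks. The function names are hypothetical.

```python
from itertools import chain, islice
from typing import Any, Iterable, List


def fill_cache_nested(passes: Iterable[Iterable[Any]], limit: int) -> List[Any]:
    """Nested loops need two breaks to stop once the cache is full."""
    cache: List[Any] = []
    for one_pass in passes:
        for batch in one_pass:
            if len(cache) >= limit:
                break
            cache.append(batch)
        if len(cache) >= limit:
            break
    return cache


def fill_cache_flat(passes: Iterable[Iterable[Any]], limit: int) -> List[Any]:
    """Same result: flatten lazily, then take at most `limit` batches."""
    return list(islice(chain.from_iterable(passes), limit))


passes = [[1, 2, 3], [4, 5, 6]]
```

Both functions return the same list for a non-negative `limit`; the flat version states the stopping condition once.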

8 files reviewed, 2 comments

JanuszL · 2025-12-05T13:11:37Z

@greptileai please review again.

Signed-off-by: Janusz Lisiecki <[email protected]>

greptile-apps

Additional Comments (1)

dali/python/nvidia/dali/plugin/pytorch/loader_evaluator/loader.py, line 143 (link)

style: Unused variable batch_count - initialized but never used.

8 files reviewed, 1 comment

JanuszL · 2025-12-05T13:19:39Z

!build

dali-automaton · 2025-12-05T13:25:21Z

CI MESSAGE: [39676718]: BUILD STARTED

qa/TL0_python-self-test-core/test_body.sh

docs/plugins/pytorch_data_loader_evaluator.rst

mdabek-nvidia · 2025-12-05T14:51:10Z

docs/examples/frameworks/pytorch/loader_evaluator/pytorch_data_loader_evaluator.ipynb

Line #10. import sys
I think that sys is not used anywhere.

Reply via ReviewNB

mdabek-nvidia · 2025-12-05T14:51:10Z

docs/examples/frameworks/pytorch/loader_evaluator/pytorch_data_loader_evaluator.ipynb

Line #1. # Wrap your dataloader with LoaderEvaluator (this is the only change!)
Nitpick: there is another change below: dataloader became monitored_dataloader.

Reply via ReviewNB

Do you think it should be:

dataloader = LoaderEvaluator(dataloader, mode="replay", num_cached_batches=len(dataloader) // 10)

instead?

Yes, this would make it clear that we are just "updating" the dataloader.

Signed-off-by: Janusz Lisiecki <[email protected]>

dali-automaton · 2025-12-05T17:33:13Z

CI MESSAGE: [39670654]: BUILD FAILED

dali-automaton · 2025-12-05T17:50:19Z

CI MESSAGE: [39676718]: BUILD FAILED

JanuszL · 2025-12-05T18:53:11Z

!build

dali-automaton · 2025-12-05T18:55:46Z

CI MESSAGE: [39693133]: BUILD STARTED

dali-automaton · 2025-12-05T21:14:49Z

CI MESSAGE: [39693133]: BUILD FAILED

dali-automaton · 2025-12-07T11:42:22Z

CI MESSAGE: [39693133]: BUILD PASSED

greptile-apps

Additional Comments (1)

docs/examples/frameworks/pytorch/loader_evaluator/pytorch_data_loader_evaluator.ipynb, line 1 (link)

logic: Undefined variable monitored_dataloader in cell 12. In cell 10, the variable is assigned as dataloader, but cell 12 references monitored_dataloader which will cause a NameError at runtime.

Either change cell 10 to assign monitored_dataloader or change cell 12 to use dataloader.

8 files reviewed, 1 comment

Signed-off-by: Janusz Lisiecki <[email protected]>

greptile-apps

8 files reviewed, no comments

JanuszL · 2025-12-08T05:40:24Z

!build

dali-automaton · 2025-12-08T05:45:22Z

CI MESSAGE: [39788983]: BUILD STARTED

dali-automaton · 2025-12-08T14:00:35Z

CI MESSAGE: [39788983]: BUILD FAILED

dali-automaton · 2025-12-08T14:18:25Z

CI MESSAGE: [39788983]: BUILD PASSED

JanuszL marked this pull request as draft December 5, 2025 10:39

github-advanced-security bot found potential problems Dec 5, 2025

dali/test/python/test_pytorch_loader_evaluator.py (fixed)

JanuszL force-pushed the loader_Evaluator branch 2 times, most recently from 6766528 to a008b7c Compare December 5, 2025 10:45

greptile-apps bot reviewed Dec 5, 2025

mdabek-nvidia self-assigned this Dec 5, 2025

JanuszL force-pushed the loader_Evaluator branch from a008b7c to 7d47417 Compare December 5, 2025 11:12

greptile-apps bot reviewed Dec 5, 2025

JanuszL force-pushed the loader_Evaluator branch 3 times, most recently from c414911 to b7d6958 Compare December 5, 2025 12:43

Fix

41896c6

Signed-off-by: Janusz Lisiecki <[email protected]>

JanuszL force-pushed the loader_Evaluator branch from b7d6958 to 41896c6 Compare December 5, 2025 12:45

greptile-apps bot reviewed Dec 5, 2025

More fixes

f61d369

Signed-off-by: Janusz Lisiecki <[email protected]>

JanuszL force-pushed the loader_Evaluator branch from ff53b06 to f61d369 Compare December 5, 2025 13:13

greptile-apps bot reviewed Dec 5, 2025

mdabek-nvidia reviewed Dec 5, 2025

qa/TL0_python-self-test-core/test_body.sh (outdated, resolved)

docs/plugins/pytorch_data_loader_evaluator.rst (resolved)

mdabek-nvidia reviewed Dec 5, 2025

more fixes

d306b1b

Signed-off-by: Janusz Lisiecki <[email protected]>

JanuszL marked this pull request as ready for review December 7, 2025 11:37

greptile-apps bot reviewed Dec 7, 2025

Fix

69d250f

Signed-off-by: Janusz Lisiecki <[email protected]>

JanuszL force-pushed the loader_Evaluator branch from 1c41dac to 69d250f Compare December 7, 2025 12:53

greptile-apps bot reviewed Dec 7, 2025

dali-automaton assigned banasraf and jantonguirao Dec 8, 2025
