-
Notifications
You must be signed in to change notification settings - Fork 650
Add PyTorch DataLoader Evaluator plugin #6112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
6766528 to
a008b7c
Compare
|
CI MESSAGE: [39670512]: BUILD STARTED |
|
!build |
|
CI MESSAGE: [39670654]: BUILD STARTED |
Greptile OverviewGreptile SummaryThis PR introduces a new PyTorch DataLoader Evaluator plugin to help identify data loading bottlenecks in training pipelines. The tool wraps PyTorch DataLoader with two modes: "log" mode for collecting performance metrics during normal iteration, and "replay" mode for simulating ideal data loading by caching and replaying batches.
Confidence Score: 5/5
Important Files ChangedFile Analysis
Sequence DiagramsequenceDiagram
participant User as Training Loop
participant LE as LoaderEvaluator
participant DL as PyTorch DataLoader
participant Cache as Batch Cache
Note over User,Cache: Log Mode
User->>LE: for batch in loader
LE->>DL: iter(dataloader)
loop Each batch
LE->>DL: next(dataloader_iter)
DL-->>LE: batch data
LE->>LE: record batch_time
LE-->>User: yield batch
end
User->>LE: get_metrics()
LE-->>User: performance metrics
Note over User,Cache: Replay Mode (Construction)
User->>LE: LoaderEvaluator(dl, mode="replay")
LE->>DL: iterate all batches
DL-->>LE: batch data
LE->>Cache: cache batches (up to num_cached_batches)
Note over User,Cache: Replay Mode (Iteration)
User->>LE: for batch in loader
loop Each batch (original length)
LE->>Cache: get cached_batches[i % cache_size]
Cache-->>LE: cached batch
LE->>LE: record batch_time
LE-->>User: yield batch
end
User->>LE: get_metrics()
LE-->>User: ideal performance metrics
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (3)
-
dali/python/nvidia/dali/plugin/pytorch/loader_evaluator/metrics.py, line 137-145 (link)logic:
json.dump()will fail withTypeErrorbecauseget_summary()returns numpy types (np.float64fromnp.mean(),np.std(), etc.) which are not JSON serializable.Consider converting numpy types to native Python types before serialization, e.g.:
def save_metrics(self, filename: str): """Save metrics to a JSON file.""" metrics = self.get_summary() bottlenecks = self.identify_bottlenecks() # Convert numpy types to native Python types for JSON serialization def convert_numpy(obj): if isinstance(obj, np.ndarray): return obj.tolist() elif isinstance(obj, (np.floating, np.integer)): return obj.item() elif isinstance(obj, dict): return {k: convert_numpy(v) for k, v in obj.items()} elif isinstance(obj, list): return [convert_numpy(i) for i in obj] return obj output = convert_numpy({"metrics": metrics, "bottlenecks": bottlenecks}) with open(filename, "w") as f: json.dump(output, f, indent=2) print(f"Performance metrics saved to: {filename}")
-
dali/python/nvidia/dali/plugin/pytorch/loader_evaluator/metrics.py, line 100-102 (link)style: Potential division by zero if all
batch_timesare 0 (edge case). -
dali/python/nvidia/dali/plugin/pytorch/loader_evaluator/__init__.py, line 1 (link)style: Missing Apache 2.0 license header that other files in this repository have. Same applies to
loader.pyandmetrics.py.
9 files reviewed, 3 comments
- Introduces a lightweight diagnostic tool for identifying data loading bottlenecks in PyTorch training pipelines. - This change adds Loader Evaluator inside pytorch DALI plugin, a jupyter notebook tutorial, and a documentation page with tests - LoaderEvaluator class wraps PyTorch DataLoader with performance monitoring Two operation modes: 'log' (normal iteration with metrics) and 'replay' (cached batches for ideal performance simulation) - PerformanceMetrics class for detailed performance tracking and bottleneck analysis - In-memory batch caching for replay mode to simulate ideal data loading - Comprehensive test suite and documentation with example notebook - The tool helps users compare real vs. ideal data loading performance and identify optimization opportunities. Authored-by: Albert Wolant <[email protected]> Signed-off-by: Janusz Lisiecki <[email protected]>
a008b7c to
7d47417
Compare
|
@greptileai please review again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (1)
-
dali/python/nvidia/dali/plugin/pytorch/loader_evaluator/__init__.py, line 16 (link)syntax: Missing
metrics.pymodule. This import will raiseModuleNotFoundError: No module named 'nvidia.dali.plugin.pytorch.loader_evaluator.metrics'at runtime.Either create the missing
metrics.pyfile with thePerformanceMetricsclass, or remove this import and the corresponding__all__entry.
8 files reviewed, 1 comment
c414911 to
b7d6958
Compare
Signed-off-by: Janusz Lisiecki <[email protected]>
b7d6958 to
41896c6
Compare
|
!build |
|
CI MESSAGE: [39675479]: BUILD STARTED |
|
@greptileai please review again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (2)
-
dali/python/nvidia/dali/plugin/pytorch/loader_evaluator/loader.py, line 85 (link)style: Type hint assumes specific batch structure but DataLoader can return any type. Should this support generic batch types instead of assuming Tuple[TorchTensor, TorchTensor]?
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
-
dali/python/nvidia/dali/plugin/pytorch/loader_evaluator/loader.py, line 146-163 (link)style: Double break logic could be simplified with a single loop condition
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
8 files reviewed, 2 comments
|
@greptileai please review again. |
Signed-off-by: Janusz Lisiecki <[email protected]>
ff53b06 to
f61d369
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (1)
-
dali/python/nvidia/dali/plugin/pytorch/loader_evaluator/loader.py, line 143 (link)style: Unused variable
batch_count- initialized but never used.
8 files reviewed, 1 comment
|
!build |
|
CI MESSAGE: [39676718]: BUILD STARTED |
| @@ -0,0 +1,451 @@ | |||
| { | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
| @@ -0,0 +1,451 @@ | |||
| { | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line #1. # Wrap your dataloader with LoaderEvaluator (this is the only change!)
Nitpick: there is also a change below dataloader became monitored_dataloader
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it should be:
dataloader = LoaderEvaluator(
dataloader, mode="replay", num_cached_batches=len(dataloader) // 10
)
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this would be clear, that we are just "updating" the dataloader.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
Signed-off-by: Janusz Lisiecki <[email protected]>
|
CI MESSAGE: [39670654]: BUILD FAILED |
|
CI MESSAGE: [39676718]: BUILD FAILED |
|
!build |
|
CI MESSAGE: [39693133]: BUILD STARTED |
|
CI MESSAGE: [39693133]: BUILD FAILED |
|
CI MESSAGE: [39693133]: BUILD PASSED |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (1)
-
docs/examples/frameworks/pytorch/loader_evaluator/pytorch_data_loader_evaluator.ipynb, line 1 (link)logic: Undefined variable
monitored_dataloaderin cell 12. In cell 10, the variable is assigned asdataloader, but cell 12 referencesmonitored_dataloaderwhich will cause aNameErrorat runtime.Either change cell 10 to assign
monitored_dataloaderor change cell 12 to usedataloader.
8 files reviewed, 1 comment
Signed-off-by: Janusz Lisiecki <[email protected]>
1c41dac to
69d250f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
8 files reviewed, no comments
|
!build |
|
CI MESSAGE: [39788983]: BUILD STARTED |
|
CI MESSAGE: [39788983]: BUILD FAILED |
|
CI MESSAGE: [39788983]: BUILD PASSED |
bottlenecks in PyTorch training pipelines.
jupyter notebook tutorial, and a documentation page with tests
Two operation modes: 'log' (normal iteration with metrics) and 'replay'
(cached batches for ideal performance simulation)
analysis
identify optimization opportunities.
Authored-by: Albert Wolant [email protected]
Category:
New feature (non-breaking change which adds functionality)
Description:
bottlenecks in PyTorch training pipelines.
jupyter notebook tutorial, and a documentation page with tests
Two operation modes: 'log' (normal iteration with metrics) and 'replay'
(cached batches for ideal performance simulation)
analysis
identify optimization opportunities.
Additional information:
Affected modules and functionalities:
Key points relevant for the review:
Tests:
Checklist
Documentation
DALI team only
Requirements
REQ IDs: N/A
JIRA TASK: DALI-4299