Conversation

@mdabek-nvidia
Collaborator

Category:

New feature

Description:

Torchvision objective API.
This proof of concept is currently limited to a few selected operators and to composing them into a single pipeline.

Additional information:

Affected modules and functionalities:

Key points relevant for the review:

Tests:

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: DALI-4309

mdabek-nvidia and others added 30 commits October 2, 2025 19:56
Implemented:
- Compose,
- Resize,
- CenterCrop

Signed-off-by: Marek Dabek <[email protected]>
Signed-off-by: Marek Dabek <[email protected]>
- nvcc for CUDA 12.4 and newer uses a different allocator
  than the default one from glibc and this clashes with asan.
  In the asan build we install plugins that are compiled on
  the fly. We cannot turn off sanitizers as DALI is imported
  and the python signatures are generated on the fly.
  This change makes sure the extra env variable is passed only
  to the stub generation process during the video plugin installation
  test, so we can preload asan only for this part and not affect
  the nvcc invocations.

Signed-off-by: Janusz Lisiecki <[email protected]>
…y checks in Tensor and TensorList python bindings (NVIDIA#6054)

* Add layout dimensionality checks in Tensor and TensorList python bindings.
* Fix layouts in backend_impl tests.
* Fix layouts in FW iterator test.

-----
Signed-off-by: Michal Zientkiewicz <[email protected]>
…IDIA#6058)

- assumes that in the absence of a keyframe in the video,
  the first frame can be treated as such.
- when reusing the old index, the number of frames in
  the video is not set according to the index but inferred
  from the video, which may not match the index build
  (as it can skip frames with negative timestamps).

Signed-off-by: Janusz Lisiecki <[email protected]>
The Mode documentation main landing page

Signed-off-by: Marek Dabek <[email protected]>
This PR brings two classes that are foundational for the Dynamic Mode: Tensor and Batch.
Both classes can wrap either an actual object, backed by Tensor[List]<CPU|GPU> or InvocationResult or results of indexing or composition.
A batch can be:

* TensorList<CPU|GPU>
* a list of Tensor objects

A tensor can be:
* Tensor<CPU|GPU>
* a Batch and an index in the batch
* a TensorSlice object

There are also utility functions, tensor, as_tensor and batch, as_batch. The ones without as_ construct a tensor or batch, respectively, and perform a copy. The functions with as_ prefix avoid the copy, if possible.

Batch access:
as of this PR, Batch deliberately doesn't include __getitem__ or __len__. It can be iterated to yield tensors, but it cannot be indexed. To access the list of tensors, use .tensors property; to apply batched slicing, use .slice property. __getitem__ will be added later and will implement either tensor access or batched slicing, which remains to be decided.
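
The access rules above can be illustrated with a toy stand-in. `ToyBatch` and `is_rejected` below are hypothetical and are not the DALI classes; the sketch only mirrors the described surface: iteration and a `.tensors` property, with no `__getitem__` or `__len__`.

```python
class ToyBatch:
    """Hypothetical stand-in mirroring the access rules described above."""

    def __init__(self, tensors):
        self._tensors = list(tensors)

    def __iter__(self):
        # Iteration yields the individual tensors one by one.
        return iter(self._tensors)

    @property
    def tensors(self):
        # Explicit access to the list of tensors (a copy, to keep the toy simple).
        return list(self._tensors)


def is_rejected(operation):
    """Return True if calling `operation` raises TypeError."""
    try:
        operation()
    except TypeError:
        return True
    return False


batch = ToyBatch([[1, 2], [3, 4, 5]])
lengths = [len(t) for t in batch]   # iterating over the batch works
first = batch.tensors[0]            # indexing goes through .tensors
```

Because the class defines neither `__getitem__` nor `__len__`, both `batch[0]` and `len(batch)` raise `TypeError`, which leaves the later choice between tensor access and batched slicing open.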

------
Signed-off-by: Michal Zientkiewicz <[email protected]>
- Updates tensor shapes to include channel dimension (e.g., [3,5,6] -> [3,5,6,3])
- Changes layout specification from NHWC to HWC for non-batched tensors
- Adjusts test data shapes to match 3D tensor expectations

Signed-off-by: Janusz Lisiecki <[email protected]>
* Update deps 25/10
* Stopped conda build for Python 3.9 which is no longer maintained by conda

Signed-off-by: Rafal Banas <[email protected]>
Signed-off-by: Kamil Tokarski <[email protected]>
Signed-off-by: Janusz Lisiecki <[email protected]>
Co-authored-by: Kamil Tokarski <[email protected]>
Co-authored-by: Janusz Lisiecki <[email protected]>
Signed-off-by: Kamil Tokarski <[email protected]>
…IA#6060)

Operator generator:
- generate operator classes, constructors and __call__ functions
- generate operator functions for stateless operators
Tests:
- arithmetic op tests
- slicing tests
- conversion tests
- RN50 pipeline
Device:
- accept torch device and torch-style device names (cuda == gpu)
- operators now accept "gpu" device for "mixed" ops
Fixes:
- temporarily remove invocation result cache (leaky and not necessary without more sophisticated keys / CSE)
- fix copying tensors to host (backend impl)
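
A minimal sketch of what accepting torch-style device names could look like. The helper `normalize_device` is hypothetical and not taken from the PR; it only assumes that `cuda` is treated as an alias of `gpu` and that an optional `:index` suffix selects the device.

```python
def normalize_device(name):
    """Map a torch-style device string to a (device_type, device_id) pair.

    Hypothetical helper: 'cuda' is treated as an alias of 'gpu'.
    """
    device_type, _, index = name.partition(":")
    if device_type == "cuda":
        device_type = "gpu"
    if device_type not in ("cpu", "gpu"):
        raise ValueError(f"Unsupported device: {name!r}")
    # No suffix means "use the current/default device".
    device_id = int(index) if index else None
    return device_type, device_id
```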

---------

Signed-off-by: Michal Zientkiewicz <[email protected]>
- Replaces CUDA_VERSION with CUDA_SUBVERSION=13.0.96-1
- Pins CUDA package to specific version (13.0.96-1) - update 2

Signed-off-by: Janusz Lisiecki <[email protected]>
* Rename DALI2 to dynamic
* import dynamic as ndd

---------
Signed-off-by: Michał Zientkiewicz <[email protected]>
* Add math functions and tests.
* Move arithm op arguments to GPU if there's any GPU argument.
* Forbid mixing of DALI CPU and GPU tensors/batches in arithmetic ops and math functions.

---------
Signed-off-by: Michal Zientkiewicz <[email protected]>
- Dynamic mode augmentation gallery notebook
- Adds Dynamic Mode section to the examples section of the documentation
- Sphinx 8.x Compatibility 

Signed-off-by: Joaquin Anton Guirao <[email protected]>
* Update to FFmpeg 8.0

Signed-off-by: Janusz Lisiecki <[email protected]>
Signed-off-by: Kamil Tokarski <[email protected]>
Co-authored-by: Kamil Tokarski <[email protected]>
* Fix memory order scoping
* Adjust lambda captures
* Remove the usage of std::is_pod.
* Warning suppression.
* Fix incorrect usage of enums.
* Adjust compiler version in jupyter-conda test

---------
Signed-off-by: Michal Zientkiewicz <[email protected]>
Add bounding box rotation as a single op with options including how they should expand with the rotate, whether the image canvas was fixed when the image was rotated, and the box format (layout and normalization).

This op prunes boxes and labels if keep_size=True and the box is truncated to a fraction below remove_threshold.
---------

Signed-off-by: Bryce Ferenczi <[email protected]>
… pool. (NVIDIA#6072)

* Add a busy list to CUDAStreamPool. Don't return busy streams from thread pool.
* Extend stream pool tests.
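
The busy-list idea can be sketched in a few lines. `ToyStreamPool` is a hypothetical Python model, not the C++ `CUDAStreamPool`; the `is_busy` callback stands in for whatever check the real pool uses to tell whether a stream still has pending work.

```python
class ToyStreamPool:
    """Hypothetical pool: idle streams are reused, busy ones are parked."""

    def __init__(self, create_stream, is_busy):
        self._create = create_stream
        self._is_busy = is_busy
        self._free = []
        self._busy = []

    def put(self, stream):
        # Streams with pending work go to the busy list, not the free list.
        (self._busy if self._is_busy(stream) else self._free).append(stream)

    def get(self):
        # Move streams that have finished their work back to the free list.
        still_busy = []
        for stream in self._busy:
            (still_busy if self._is_busy(stream) else self._free).append(stream)
        self._busy = still_busy
        # Never hand out a busy stream; create a new one if none is idle.
        return self._free.pop() if self._free else self._create()


ids = iter(range(100))
pool = ToyStreamPool(lambda: {"id": next(ids), "pending": False},
                     lambda s: s["pending"])
a = pool.get()
a["pending"] = True       # simulate work still running on the stream
pool.put(a)
b = pool.get()            # `a` is busy, so a fresh stream is created
a["pending"] = False      # the work on `a` has finished
c = pool.get()            # `a` is idle again and gets reused
```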

---------
Signed-off-by: Michal Zientkiewicz <[email protected]>
mzient and others added 22 commits November 27, 2025 12:08
… bbox_rotate tests. (NVIDIA#6073)

* Fix usage of std::optional in bbox rotate. Fix conversion to numpy in bbox_rotate tests.
---------

Signed-off-by: Michał Zientkiewicz <[email protected]>
…ream 0. (NVIDIA#6071)

* Use proper stream in TensorList and Tensor copy.
* Fix usages of DynamicScratchpad.
* Set non-host stream when setting last input for repeat-last inputs.
* Fix C API tests.
* Refactor copy stream/device selection

Signed-off-by: Michał Zientkiewicz <[email protected]>

* Review: copy_to_external.
* WAR device selection for H2D and D2H copy and copy with special stream.

Known issues: D2D copy across devices doesn't work with buffers allocated with VMM API, regardless of permissions used in `cuMemSetAccess`.

---------
Signed-off-by: Michal Zientkiewicz <[email protected]>
…5988)

* Document executor flags StreamPolicy and OperatorConcurrency
---------
Signed-off-by: Michał Zientkiewicz <[email protected]>
- updates installation instructions to include --no-build-isolation flag
  for all nvidia-dali-tf-plugin pip install commands. This flag is necessary
  because pip's default build isolation prevents the installation process
  from detecting the installed TensorFlow version.
- adds warnings to TensorFlow and Video plugin
- adds --no-build-isolation to tests

Signed-off-by: Janusz Lisiecki <[email protected]>
* Add name generation for dynamic API.
* Limit the number of spelled-out arguments.
* Adjust argument type name for dynamic API (use Tensor/Batch instead of TensorList)
* Add docstrings for dynamic mode ops and functions.
* Fix __call__ documentation. Update tests to look for mention of graph.

Signed-off-by: Michal Zientkiewicz <[email protected]>
* Remove texture-based video processing in NvDecoder

- Removes TextureObject class and related texture management code
- Replaces texture-based convert_frame with direct VideoColorSpaceConversion
- Removes textures_ member and get_textures method
- Unifies the video reader with the experimental one by removing texture usage, which doesn't
  provide much gain in this use case
- fixes the usage of libswscale to avoid intermediate quantization in ycbcr->rgb video conversion

Signed-off-by: Janusz Lisiecki <[email protected]>
Signed-off-by: Michal Zientkiewicz <[email protected]>
Co-authored-by: Michal Zientkiewicz <[email protected]>
… Device and Readers. (NVIDIA#6080)

* Fill gaps in dynamic mode Tensor, Batch, Device, EvalContext and Readers documentation.
* Fix typos (and likewise typos found elsewhere in the code)
* Hide private functions (and remove their vestigial usages)

Signed-off-by: Michal Zientkiewicz <[email protected]>
Make sure that there's at least one block descriptor for each sample in the batch.

Signed-off-by: Michal Zientkiewicz <[email protected]>
* Add Philox32x4_10 generator for CPU.
* Test vs cuRAND
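
For orientation, here is a pure-Python sketch of a counter-based Philox generator with four 32-bit words and ten rounds. It assumes that `Philox32x4_10` refers to the published Philox-4x32-10 algorithm; the function name and layout are illustrative and do not reflect the DALI implementation.

```python
MASK32 = 0xFFFFFFFF
M0, M1 = 0xD2511F53, 0xCD9E8D57   # Philox-4x32 round multipliers
W0, W1 = 0x9E3779B9, 0xBB67AE85   # Weyl constants for the key schedule


def philox4x32_10(counter, key):
    """Return four 32-bit outputs for a 4-word counter and a 2-word key.

    All input words are assumed to be in the range [0, 2**32).
    """
    c0, c1, c2, c3 = counter
    k0, k1 = key
    for round_index in range(10):
        if round_index > 0:
            # Bump the key between rounds (nine bumps for ten rounds).
            k0 = (k0 + W0) & MASK32
            k1 = (k1 + W1) & MASK32
        p0 = M0 * c0              # full 64-bit products
        p1 = M1 * c2
        hi0, lo0 = p0 >> 32, p0 & MASK32
        hi1, lo1 = p1 >> 32, p1 & MASK32
        c0, c1, c2, c3 = hi1 ^ c1 ^ k0, lo1, hi0 ^ c3 ^ k1, lo0
    return (c0, c1, c2, c3)
```

The generator is stateless: the same counter and key always give the same four words, which is what makes a CPU implementation directly comparable against a GPU one.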

---------

Signed-off-by: Michal Zientkiewicz <[email protected]>
…VIDIA#6087)

Add support for passing random state to random operators in Dynamic Mode
through a new _random_state argument that is hidden from the Python API
and documentation.

Changes:
- Add OpSchema::AddRandomStateArg() function that creates a hidden
  optional argument accepting 1D uint32 tensor inputs
- The argument name starts with underscore (_random_state) and is
  explicitly marked as hidden to prevent it from appearing in
  documentation
- Apply AddRandomStateArg() to all random operators except readers:
  * random operators: Choice, Uniform, Normal, CoinFlip, BatchPermutation
  * noise operators: Gaussian, SaltAndPepper, Shot
  * segmentation operators: RandomMaskPixel, RandomObjectBBox
  * image operators: Jitter, RandomBBoxCrop, ROIRandomCrop
  * RandomCropAttr base schema
- Add comprehensive Python tests verifying:
  * Argument is hidden from GetArgumentNames()
  * Argument is configured as tensor argument
  * Operators accept uint32 tensor inputs
  * Reader operators don't have the argument

The _random_state argument enables the executor to pass RNG state to
operators in Dynamic Mode without exposing it in the public Python API.

Signed-off-by: Joaquin Anton Guirao <[email protected]>
…#6091)

- Replace hardcoded clang 19 path with find command to locate
  __clang_cuda_runtime_wrapper.h across different clang versions.
- This makes the Dockerfile compatible with various clang installations
  in manylinux builds.

Signed-off-by: Janusz Lisiecki <[email protected]>
This PR adds some random distributions usable on both host and device.
The distributions added in this PR:
* uniform real (float and double)
* uniform int (including up to full integer range)
* uniform real over unit range [0..1]
* normal standard (Box-Muller transform)
* normal (scaled normal standard)
* Bernoulli
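
As a reminder of how the Box-Muller transform turns uniform samples into normal ones, here is a short sketch. The function names are illustrative and not taken from the PR, which implements the distributions for both host and device.

```python
import math


def box_muller(u1, u2):
    """Map two uniforms (u1 in (0, 1], u2 in [0, 1)) to two standard normals."""
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)


def normal(u1, u2, mean, stddev):
    """Scaled standard normal: shift and scale one Box-Muller output."""
    z0, _ = box_muller(u1, u2)
    return mean + stddev * z0
```

Note that `u1` must be strictly positive, since `log(0)` is undefined; a generator feeding this transform has to exclude zero on that input.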

---------
Signed-off-by: Michal Zientkiewicz <[email protected]>
Operator's keyword arguments are converted to their types as advertised in OpSchema. Add tests that verify that the necessary conversions take place and the unnecessary conversions are skipped.
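
The convert-only-when-needed behaviour can be sketched as follows. `convert_kwargs` and the dictionary-based schema are hypothetical simplifications; the real code works against `OpSchema` and DALI types.

```python
def convert_kwargs(kwargs, schema):
    """Convert each keyword argument to its declared type.

    `schema` maps argument names to Python types (a hypothetical stand-in
    for the types advertised by the operator schema).
    """
    converted = {}
    for name, value in kwargs.items():
        target = schema[name]
        # Skip the conversion when the value already has the declared type.
        converted[name] = value if type(value) is target else target(value)
    return converted


schema = {"scale": float, "axis": int}
result = convert_kwargs({"scale": 2, "axis": 1}, schema)
```

Here `scale` is converted from `int` to `float`, while `axis` already matches its declared type and is passed through untouched.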

---------
Signed-off-by: Michal Zientkiewicz <[email protected]>
Generate documentation section for the Dynamic Mode.

Signed-off-by: Szymon Karpiński <[email protected]>
The distribution carries only the mean value and the actual distribution is selected at evaluation time.
The results are not consistent between CPU and GPU.
---------
Signed-off-by: Michal Zientkiewicz <[email protected]>
…VIDIA#6084)

* Remove the name "dynamic executor" from the docs and error messages
* Remove redundant notice from getting_started notebook
* Unify peek_image_shape and experimental.peek_image_shape docs

---------

Signed-off-by: Michał Szołucha <[email protected]>
@mdabek-nvidia mdabek-nvidia marked this pull request as draft December 4, 2025 10:52
# We need to enable conditionals by default
try:
    Pipeline.push_current(self)
    self._conditionals_enabled = True

Check warning

Code scanning / CodeQL

Overwriting attribute in super-class or sub-class Warning

Assignment overwrites attribute _conditionals_enabled, which was previously defined in superclass Pipeline.
try:
    Pipeline.push_current(self)
    self._conditionals_enabled = True
    self._condition_stack = _conditionals._ConditionStack()

Check warning

Code scanning / CodeQL

Overwriting attribute in super-class or sub-class Warning

Assignment overwrites attribute _condition_stack, which was previously defined in superclass Pipeline.
def test_invalid_type(size):
    with assert_raises(TypeError):
        td = Compose([CenterCrop(size=size)])
        del td

Check warning

Code scanning / CodeQL

Unnecessary delete statement in function Warning test

Unnecessary deletion of local variable td in function test_invalid_type.
def test_value_error(size):
    with assert_raises(ValueError):
        td = Compose([CenterCrop(size=size)])
        del td

Check warning

Code scanning / CodeQL

Unnecessary delete statement in function Warning test

Unnecessary deletion of local variable td in function test_value_error.
Comment on lines +64 to +68
def _infer_effective_size(
    self,
    size: Optional[Union[int, Sequence[int]]],
    max_size: Optional[int] = None,
) -> Tuple[int, int]:

Check notice

Code scanning / CodeQL

Explicit returns mixed with implicit (fall through) returns Note

Mixing implicit and explicit returns may indicate an error, as implicit returns always return None.
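
A generic illustration of this note, not the code from the PR: `pair_from_size` is a hypothetical helper showing how an explicit `raise` on the final path keeps a function from silently returning `None` despite a `Tuple[int, int]` annotation.

```python
from typing import Optional, Sequence, Tuple, Union


def pair_from_size(size: Optional[Union[int, Sequence[int]]]) -> Tuple[int, int]:
    """Hypothetical helper: every path either returns a tuple or raises."""
    if isinstance(size, int):
        return (size, size)
    if isinstance(size, (list, tuple)) and len(size) == 2:
        return (size[0], size[1])
    # Without this raise the function would fall through and return None,
    # contradicting the Tuple[int, int] annotation.
    raise ValueError(f"Unsupported size: {size!r}")
```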
[DEPRECATED but used]
"""

def __init__(self): ...

Check notice

Code scanning / CodeQL

Statement has no effect Note

This statement has no effect.
or (not isinstance(resize, int))
):
with assert_raises(ValueError):
td = Compose(

Check notice

Code scanning / CodeQL

Unused local variable Note test

Variable td is not used.
@greptile-apps

greptile-apps bot commented Dec 4, 2025

Greptile Overview

Greptile Summary

This PR introduces a torchvision-compatible API for DALI, implementing common image transforms (Resize, CenterCrop, RandomHorizontalFlip, RandomVerticalFlip, Pad, ToTensor) that can be composed using a Compose class that extends nvidia.dali.Pipeline.

Key changes:

  • Added new nvidia.dali.experimental.torchvision module with v2-style transforms
  • Compose inherits from Pipeline and enables chaining transforms with conditional support via autograph
  • Supports both PIL Image and torch.Tensor inputs, with automatic conversion back to input type
  • Modified _conditionals.py to include the new torchvision module in autograph conversion

Issues found:

  • Syntax error in compose.py:167 - mode="RGB" incorrectly passed to append() instead of Image.fromarray()
  • Ignored device parameter in flips.py:29 - hardcoded to "cpu"
  • Ignored device parameter in pad.py:294 - hardcoded to "cpu"
  • Unreachable code in pad.py:131-132 - early return prevents concatenation
  • Debug print statement left in resize.py:128
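
The first issue is easy to reproduce in isolation. The sketch below assumes the bug has the shape the review describes, a keyword argument landing on `append()` because of a misplaced parenthesis; `from_array` is a hypothetical stand-in for the image constructor, not the actual code in compose.py.

```python
def from_array(array, mode=None):
    """Hypothetical stand-in for an image constructor that accepts `mode`."""
    return {"mode": mode, "data": array}


pixels = [[0, 0, 0]]
images = []
failure = None

try:
    # Misplaced parenthesis: `mode` is passed to list.append,
    # which does not accept keyword arguments.
    images.append(from_array(pixels), mode="RGB")
except TypeError as error:
    failure = type(error).__name__

# Intended call: `mode` belongs to the constructor.
images.append(from_array(pixels, mode="RGB"))
```

The faulty line parses without complaint and only fails when it runs, so it surfaces as a `TypeError` on the code path that builds the output list.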

Confidence Score: 2/5

  • Contains a syntax error that will cause runtime failures and several logic bugs affecting GPU execution paths
  • The PR has a clear syntax error in compose.py line 167 that will cause a TypeError when processing batched PIL images. Additionally, the device parameter is ignored in multiple places (flips.py, pad.py), preventing GPU execution even when explicitly requested. There's also unreachable code in pad.py suggesting incomplete implementation.
  • compose.py (syntax error), flips.py (device param ignored), pad.py (device param ignored + unreachable code)

Important Files Changed

File Analysis

| Filename | Score | Overview |
|---|---|---|
| dali/python/nvidia/dali/experimental/torchvision/v2/compose.py | 2/5 | Core pipeline composition class with syntax error in Image.fromarray() call at line 167 - mode="RGB" passed to append() instead of fromarray() |
| dali/python/nvidia/dali/experimental/torchvision/v2/flips.py | 3/5 | Random flip implementations with device parameter being ignored - always hardcoded to "cpu" at line 29 |
| dali/python/nvidia/dali/experimental/torchvision/v2/pad.py | 3/5 | Padding implementations with device parameter ignored at line 294 and unreachable code after early return at line 131 |
| dali/python/nvidia/dali/experimental/torchvision/v2/resize.py | 3/5 | Resize implementation with debug print statement left at line 128 that should be removed |
| dali/python/nvidia/dali/experimental/torchvision/v2/centercrop.py | 5/5 | Center crop implementation with proper input validation and DALI fn.crop usage |
| dali/python/nvidia/dali/experimental/torchvision/v2/tensor.py | 4/5 | ToTensor implementation - simple conversion dividing by 255.0 |
| dali/python/nvidia/dali/experimental/torchvision/__init__.py | 5/5 | Package exports for torchvision-compatible transforms |
| dali/python/nvidia/dali/_conditionals.py | 5/5 | Added torchvision module to autograph convert_modules list |

Sequence Diagram

sequenceDiagram
    participant User
    participant Compose
    participant Pipeline
    participant Transform
    participant ExternalSource
    participant DALIOps

    User->>Compose: __call__(data_input)
    Compose->>Compose: Convert PIL/Tensor to torch.Tensor
    Compose->>Pipeline: build()
    Compose->>ExternalSource: feed_input(data)
    Compose->>Pipeline: run()
    
    loop For each transform in op_list
        Pipeline->>Transform: __call__(input_node)
        Transform->>DALIOps: fn.crop/fn.flip/fn.resize/fn.pad
        DALIOps-->>Transform: output_node
        Transform-->>Pipeline: output_node
    end
    
    Pipeline-->>Compose: output TensorList
    Compose->>Compose: to_torch_tensor(output)
    Compose->>Compose: Convert back to PIL if needed
    Compose-->>User: output (PIL/Tensor)

@greptile-apps greptile-apps bot left a comment


13 files reviewed, 5 comments


Signed-off-by: Marek Dabek <[email protected]>

9 participants