scaled_dot_product_attention support for additional Torch backends (Flash Attention)
#365
base: main
Conversation
…h NestedTensor
…ion backends. Added pytests that test each attention backend and compatible data type
…offsets instead of joffsets.
Pull request overview
This PR enables support for additional PyTorch attention backends (Flash Attention, Efficient Attention, Math) in the scaled_dot_product_attention function by wrapping JaggedTensor data in PyTorch nested tensors. The implementation intelligently selects between zero-copy views and tensor copies based on backend requirements to optimize performance.
Key Changes:
- Added backend-aware nested tensor creation logic in C++ that chooses between zero-copy views (make_nested_view) and tensor copies (make_nested_tensor) based on the selected attention backend (a rough Python sketch of this decision follows the list)
- Expanded test coverage to parameterize tests across all supported backends (Flash, Efficient, Math) with their compatible data types
- Adjusted dimension handling in tests to accommodate Flash Attention's requirement that q, k, v have matching last dimensions
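The following is a minimal, hypothetical Python sketch of that view-versus-copy decision, not fvdb's actual API: the real helpers (make_nested_view, make_nested_tensor) live in C++ in src/fvdb/FVDB.cpp, and which backends accept zero-copy views is an assumption here rather than something stated in this PR.

```python
import torch

def make_nested_for_sdpa(values: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: `values` is the flat jagged data buffer and
    # `offsets` the per-sequence start indices (int64, same device as values).
    # Assumption: the flash / memory-efficient CUDA backends can consume a
    # jagged nested-tensor *view* over the existing buffer, while the math
    # fallback gets a packed copy.
    if torch.backends.cuda.flash_sdp_enabled() or torch.backends.cuda.mem_efficient_sdp_enabled():
        # Zero-copy path: wrap the existing buffer using the jagged offsets.
        return torch.nested.nested_tensor_from_jagged(values, offsets=offsets)
    # Copy path: split the flat buffer into per-sequence tensors and repack them.
    lengths = (offsets[1:] - offsets[:-1]).tolist()
    return torch.nested.nested_tensor(list(values.split(lengths, dim=0)))
```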
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/unit/test_jagged_tensor.py | Parameterized SDPA tests to cover Flash, Efficient, and Math backends with appropriate dtypes; added Flash Attention dimension constraint |
| src/fvdb/FVDB.cpp | Implemented backend-aware nested tensor creation with two helper functions and runtime backend detection logic to optimize tensor creation |
… wrong creation function Use enums instead of ints
blackencino left a comment
I have three big questions after going through this more carefully.
- Why do we need to do any of it in C++? Could we do the same thing in Python, within the JaggedTensor frontend, and work towards the goal of having less and less C++ code where we can avoid it?
- We're deferring to the torch backend specification mechanism as the way users would go about selecting the attention machinery, which aligns with how they'd use torch (a minimal example follows this list). I like that the default torch behavior is to try to choose the best backend based on your usage. Does fVDB need or want to have any stronger opinions about which backends to use? I think we probably don't, but I get cautious around API decisions that expose expert-level algorithm internals.
- Our tests seem mostly like smoke tests at this point; we don't have a validation against what SDPA is supposed to mean, or any conceptual explanation. Plus, we're permuting inputs and outputs when comparing to torch, which makes it hard to say that what we're computing is what we expect to see. We should be wrapping the SDPA API so that we consume and produce the same dimensional ordering as torch, except where that's impossible.
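For reference on the second point, this is roughly what torch's backend selection mechanism looks like from the caller's side: a sketch based on the PyTorch docs, using dense placeholder tensors; nothing here is fvdb-specific.

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)

# Default behavior: torch picks the best available backend for these inputs.
out_default = F.scaled_dot_product_attention(q, k, v)

# Explicit behavior: the caller restricts SDPA to a set of allowed backends
# for the duration of this block.
with sdpa_kernel([SDPBackend.FLASH_ATTENTION, SDPBackend.MATH]):
    out_restricted = F.scaled_dot_product_attention(q, k, v)
```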
```python
# Torch -- For-loop approach (always use MATH for reference to ensure consistency)
out_jagged_torch_forloop_list = []
for b in range(batch_size):
    # From LHE to NHLE / SHV to NHSV
```
LHE, NHLE, SHV, NHSV - these are not meaningful abbreviations. This is potentially mission-critical code; we should be explaining our assumptions and what we're actually testing against.
Permuting our data to work with torch APIs, or to test the results, is something we worked hard to get rid of in the convolution code. Is there a way we can wrap our use of attention such that the dimensional ordering in and out matches what torch would consume or produce? These permute lines look very much like what we've just gotten rid of in convolution.
We absolutely could do this in Python. Is having less and less C++ code the goal? For the longest time the goal was to have as thin a Python layer as we could for binding and to keep all the meaningful logic in C++. One potential use-case of having this in C++ is portability to other systems (such as inference systems like ONNX) where perhaps the user needs a C++ equivalent to
On that one I don't really know if we have any stronger opinions; perhaps @heiwang1997 would have thoughts. I think keeping backend choice consistent with Torch is probably a good goal… if people are using the same configurations in a network that uses our operator and Torch's for different things, I think they'd expect to see the same attention backend selection used.
I looked around PyTorch's tests for their operator; they largely compare results for random inputs against reference implementations. For example, here they have a reference SDPA function implemented entirely from basic PyTorch operations and compare its outputs to the SDPA operators: https://github.com/pytorch/pytorch/blob/main/test/test_transformers.py#L1122 Not saying we shouldn't do as you suggest, just providing info on how PyTorch validates these operators.
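For concreteness, a reference SDPA in that style (basic PyTorch ops only, dense [batch, heads, seq, head_dim] inputs, no masking or dropout) is something like:

```python
import math
import torch

def sdpa_reference(q, k, v, scale=None):
    # softmax(q @ k^T / sqrt(d)) @ v, computed with plain tensor ops so the
    # fused backends have something simple to be compared against.
    scale = 1.0 / math.sqrt(q.size(-1)) if scale is None else scale
    attn = torch.softmax((q @ k.transpose(-2, -1)) * scale, dim=-1)
    return attn @ v
```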
This PR makes it possible to use other available Torch attention backends (flash, efficient, math) by wrapping the input JaggedTensors' data in Torch nested Tensors. The mechanism for selecting these other backends is the same as for Torch's built-in attention, sdpa_kernel (see examples at https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html).

Some backends allow us to just take views of the JaggedTensor data to create the nested Tensor, while others require copies; this PR tries to be smart by checking which backend PyTorch is configured to use and allows backends that can run on nested Tensors created from views to do so.

Also changed the SDPA tests to run all the backends with their available data types.
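As a usage sketch of the mechanism described above, written with plain torch calls rather than fvdb's entry point (the per-sequence packing and the flash backend's support for jagged layout are assumptions on my part):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Three sequences of different lengths, each [seq_len, heads, head_dim].
lengths = [5, 9, 3]

def make_jagged():
    parts = [torch.randn(n, 8, 64, device="cuda", dtype=torch.float16) for n in lengths]
    # Pack into a jagged-layout nested tensor and move heads before the
    # ragged sequence dim: [batch, heads, seq*, head_dim].
    return torch.nested.nested_tensor(parts, layout=torch.jagged).transpose(1, 2)

q, k, v = make_jagged(), make_jagged(), make_jagged()

# Backend selection works exactly as with dense tensors.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)  # nested output, same layout as q
```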
fixes #363