@blackencino commented Feb 11, 2026

Part of #458 — stacked on #463

Review note: This PR is stacked on PR1 (#463). The diff includes PR1's changes since both target main.
After PR1 merges, this PR will be rebased and the diff will show only PR2 changes.

Summary

  • Ported 6 additional ops to dispatch_table + new dispatch tools (14 dispatch tables total counting fwd+bwd):
  • ActiveVoxelsInBoundsMask — forEachActiveVoxel, device-only dispatch table
    • CoarseIjkForFineGrid — forEachActiveVoxel, device-only dispatch table (gains CPU support)
    • CoordsInGrid — dispatch::for_each + jagged_in (device x int_stype x contiguity)
    • IjkToInvIndex — dispatch::for_each + jagged_in (device x int_stype x contiguity)
    • DownsampleGridAvgPool (fwd+bwd) — forEachActiveVoxel over coarse grid with dual-grid accessor (device x float_stype x contiguity)
    • GridEdgeNetwork — forEachActiveVoxel, device-only, multi-output tensors
  • Removed FVDB_DISPATCH_KERNEL_DEVICE from 9 call sites:
    • GridBatch.cpp (3): coordsInGrid, ijkToInvIndex, gridEdgeNetwork
    • GridBatchImpl.cu (1): activeVoxelsInBoundsMask
    • AvgPoolGrid.cpp (2): forward + backward
    • BuildCoarseGridFromFine.cu (2): CUDA + PrivateUse1 specializations
    • BuildGridForConv.cu (1): conv ijk shortcut path
  • All op headers are now type-erased (no template <torch::DeviceType>)
  • All precondition checks now live inside the ops; the GridBatch calls to the ops are thin wrappers.

New patterns established

  • Device-only dispatch tables (1 axis) — ActiveVoxelsInBoundsMask, CoarseIjkForFineGrid, GridEdgeNetwork
  • forEachJaggedElementChannel* proven unnecessary — CoordsInGrid and IjkToInvIndex use for_each + jagged_in directly (old code used forEachJaggedElementChannelCUDA/CPU with numChannels=1)
  • Dual-grid voxel iteration — DownsampleGridAvgPool iterates the coarse grid, reads the fine grid via a separate accessor (mirrors UpsampleGridNearest)
  • Multi-output from forEachActiveVoxel — GridEdgeNetwork writes to 4 output tensors per voxel

Iteration pattern notes

  • DownsampleGridAvgPool replaces per-(voxel, channel) GPU thread parallelism with a per-voxel sequential channel loop (matching UpsampleGridNearest). This trades per-channel parallelism for fewer redundant NanoVDB tree traversals in the pooling window. See comments in the op.
  • ActiveVoxelsInBoundsMask loses a leaf-level hasOverlap(leaf.bbox()) early-out that the old code had. The old check was per-thread (not cooperative), and forEachActiveVoxel does not expose the leaf reference. Documented in comments; a future forEachLeaf-based variant could restore it.
  • All other ops have identical GPU memory access patterns to the old code.

Test plan

  • All gtests pass
  • Python tests pass
  • No old macros (FVDB_DISPATCH_KERNEL, AT_DISPATCH_V2) in any new/modified code

@blackencino blackencino requested a review from a team as a code owner February 11, 2026 07:42
Signed-off-by: Christopher Horvath <[email protected]>