Summary
Add support for SageAttention as a selectable backend in PhyAI's Diffusion Attention system.
Motivation
SageAttention provides optimized quantized attention kernels for diffusion models, with reported speedups over FlashAttention while preserving output quality. Adding SageAttention would give PhyAI users another high-performance attention backend for diffusion/DiT workloads, especially video and image generation models where attention cost is significant.
Proposed Scope
- Add a new diffusion attention backend option, e.g. SAGE_ATTN or sage.
- Lazily import sageattention so it remains an optional dependency.
- Provide a clear error message when the backend is selected but SageAttention is not installed or the GPU/CUDA environment is unsupported.
- Support the common diffusion attention path, especially non-causal self-attention/cross-attention.
- Preserve existing fallback behavior for unsupported shapes, dtypes, masks, or devices.
- Add unit tests and, if available, a small benchmark/validation path comparing against the existing SDPA/FlashAttention backend.
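A minimal Python sketch of how the lazy import, error reporting and fallback items above could fit together. Everything here is hypothetical: AttentionBackend, SageAttentionUnavailable, AttnCall, load_sageattn, check_environment, sage_supports and select_backend are illustrative names, not existing PhyAI APIs. The only external name it relies on is sageattn from the sageattention package. The minimum compute capability and the supported dtype and head-dim sets are placeholders that should be checked against the SageAttention documentation. To keep the sketch runnable without torch, per-call properties are passed as plain values; in a real integration they would come from the query tensor and torch.cuda.

```python
import importlib
import importlib.util
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Tuple


class AttentionBackend(Enum):
    # Hypothetical backend names; PhyAI's actual enum may differ.
    SDPA = "sdpa"
    SAGE_ATTN = "sage"


class SageAttentionUnavailable(RuntimeError):
    """Raised when 'sage' is explicitly selected but cannot run at all."""


# Assumptions: verify these against the SageAttention README before relying on them.
MIN_COMPUTE_CAPABILITY: Tuple[int, int] = (8, 0)
SAGE_DTYPES = {"float16", "bfloat16"}
SAGE_HEAD_DIMS = {64, 128}


def load_sageattn(module_name: str = "sageattention") -> Callable:
    """Import SageAttention lazily, only once the backend is actually selected."""
    if importlib.util.find_spec(module_name) is None:
        raise SageAttentionUnavailable(
            f"Attention backend 'sage' was selected, but the optional package "
            f"'{module_name}' is not installed. Install it or choose another backend."
        )
    return importlib.import_module(module_name).sageattn


def check_environment(cuda_available: bool,
                      capability: Optional[Tuple[int, int]]) -> None:
    """Fail loudly if the GPU/CUDA environment cannot run SageAttention.

    In real code these inputs would come from torch.cuda.is_available()
    and torch.cuda.get_device_capability().
    """
    if not cuda_available or capability is None:
        raise SageAttentionUnavailable(
            "Attention backend 'sage' requires a CUDA GPU, but none is available."
        )
    if capability < MIN_COMPUTE_CAPABILITY:
        major, minor = capability
        need_major, need_minor = MIN_COMPUTE_CAPABILITY
        raise SageAttentionUnavailable(
            f"Attention backend 'sage' needs compute capability >= "
            f"{need_major}.{need_minor}; this GPU reports {major}.{minor}."
        )


@dataclass(frozen=True)
class AttnCall:
    """Per-call properties that decide whether SageAttention can be used."""
    dtype: str       # e.g. "float16"
    device: str      # e.g. "cuda"
    head_dim: int
    has_mask: bool
    is_causal: bool


def sage_supports(call: AttnCall) -> bool:
    # First version targets only the non-causal, unmasked self/cross-attention path.
    return (
        call.device == "cuda"
        and call.dtype in SAGE_DTYPES
        and call.head_dim in SAGE_HEAD_DIMS
        and not call.has_mask
        and not call.is_causal
    )


def select_backend(requested: AttentionBackend, call: AttnCall) -> AttentionBackend:
    """Use SageAttention when the call is eligible; otherwise keep the SDPA fallback."""
    if requested is AttentionBackend.SAGE_ATTN and sage_supports(call):
        return AttentionBackend.SAGE_ATTN
    return AttentionBackend.SDPA
```

With this split, an explicit sage selection fails fast with an actionable message when the package or a suitable GPU is missing, while individual calls that this sketch treats as ineligible (masks, causal attention, fp32, CPU tensors, other head sizes) take the existing SDPA path. For example, select_backend(AttentionBackend.SAGE_ATTN, AttnCall("float16", "cuda", 128, False, False)) returns AttentionBackend.SAGE_ATTN, and the same call with has_mask=True returns AttentionBackend.SDPA.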