[Feat][kernel] Fused FlashAttention with causal mask, varlen packing, and exported attention LSE #171

## Why
Target long-context RL workloads whose batches pack variable-length sequences. Export the attention softmax LSE (the per-query log-sum-exp over attention scores) for the backward pass, diagnostics, and aligning attention between rollout and training. The exported LSE is attention-domain, not the vocab-logprob LSE taken over output logits.
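
For concreteness, one way to write the attention-domain LSE is sketched below. The symbols are assumptions for illustration and are not fixed by this issue: $s$ is the softmax scale, $q_i$ and $k_j$ are query and key rows, and $\mathcal{V}(i)$ is the set of keys visible to query $i$ (same packed sequence, position at or before $i$ under the causal mask). Whether the scale is folded into the exported value is left open here.

$$
\mathrm{LSE}_i = \log \sum_{j \in \mathcal{V}(i)} \exp\left(s\, q_i^\top k_j\right),
\qquad
P_{ij} = \exp\left(s\, q_i^\top k_j - \mathrm{LSE}_i\right)
$$

Under this convention the backward pass can recompute the attention probabilities $P_{ij}$ from the saved $\mathrm{LSE}_i$ instead of storing the full attention matrix. A vocab-logprob LSE would instead be $\log \sum_{v} \exp(z_v)$ over vocabulary logits $z_v$, which is a different quantity.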

## Todo
- CUDA: SM90 WGMMA + TMA path, SM80 mma.sync fallback, with varlen and LSE export.
- ROCm: MFMA-based kernel with 16x16x16-style tiling, comparison against Composable Kernel (CK), and RL-Kernel-specific LSE/varlen semantics.
- Triton: extend the existing dense fallback with LSE export and varlen support as the cross-platform semantic baseline.
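
The semantics all three backends would need to agree on can be sketched as a slow pure-Python reference. This is only an illustration under stated assumptions, not the proposed kernel: the function name `varlen_causal_attention_ref` is hypothetical, it handles a single head, it assumes sequences are packed along the token axis with cumulative offsets in `cu_seqlens`, and it assumes a default scale of `1/sqrt(head_dim)` with the scale folded into the exported LSE.

```python
import math

def varlen_causal_attention_ref(q, k, v, cu_seqlens, scale=None):
    """Hypothetical single-head reference for packed causal attention.

    q, k, v: lists of row vectors, one row per packed token.
    cu_seqlens: cumulative sequence offsets, e.g. [0, 2, 3] for lengths 2 and 1.
    Returns (out, lse) with one output row and one LSE value per packed token.
    """
    head_dim = len(q[0])
    scale = 1.0 / math.sqrt(head_dim) if scale is None else scale
    out, lse = [], []
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        for i in range(start, end):
            # Varlen + causal mask: token i sees keys start..i of its own sequence.
            scores = [
                scale * sum(a * b for a, b in zip(q[i], k[j]))
                for j in range(start, i + 1)
            ]
            # Numerically stable log-sum-exp over the visible scores.
            m = max(scores)
            lse_i = m + math.log(sum(math.exp(s - m) for s in scores))
            # Probabilities recomputed from the LSE, as a backward pass could do.
            probs = [math.exp(s - lse_i) for s in scores]
            row = [
                sum(p * v[start + t][d] for t, p in enumerate(probs))
                for d in range(len(v[0]))
            ]
            out.append(row)
            lse.append(lse_i)
    return out, lse

# Two packed sequences of lengths 2 and 1.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, lse = varlen_causal_attention_ref(q, k, v, cu_seqlens=[0, 2, 3])
```

In this example the first token of each sequence attends only to itself, so its output row equals its own value row, which is a quick check that nothing leaks across packed sequence boundaries. A fused kernel could be compared against a reference of this shape on both `out` and `lse`.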


