[Feat][kernel] Fused FlashAttention with causal mask, varlen packing, and exported attention LSE #171

Description

@KJLdefeated

Why

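Target long-context RL workloads with packed variable-length (varlen) batches. Export the attention softmax log-sum-exp (LSE) for the backward pass, diagnostics, and rollout/training attention alignment. The exported LSE is the attention-domain LSE (taken over key positions for each query), not the vocab-logprob LSE (taken over vocabulary logits).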
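
To make the attention-domain LSE concrete, here is a minimal pure-Python sketch for a single head and a single sequence. The function name `causal_attention_lse`, the natural-log base, and the default `1/sqrt(head_dim)` scale are assumptions for illustration; the issue does not fix these conventions, and a real kernel may differ (for example by working in base 2 internally).

```python
import math

def causal_attention_lse(q, k, scale=None):
    """Per-query log-sum-exp of scaled, causally masked attention scores.

    q, k: lists of float vectors, shape [seq_len][head_dim].
    Returns one LSE value per query position.
    Assumption: natural log and a 1/sqrt(head_dim) softmax scale by default.
    """
    head_dim = len(q[0])
    scale = 1.0 / math.sqrt(head_dim) if scale is None else scale
    lse = []
    for i, qi in enumerate(q):
        # Causal mask: query i only sees keys 0..i.
        scores = [scale * sum(a * b for a, b in zip(qi, k[j])) for j in range(i + 1)]
        m = max(scores)  # subtract the row max for numerical stability
        lse.append(m + math.log(sum(math.exp(s - m) for s in scores)))
    return lse
```

Given a row's LSE, the softmax probability for key `j` can be recomputed as `exp(score_ij - lse_i)` without storing the full attention matrix, which is why the value is useful for the backward pass.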

Todo

  • CUDA: SM90 WGMMA + TMA path, SM80 mma.sync fallback, with varlen and LSE export.
  • ROCm: MFMA-based kernel with 16x16x16-style tiling, CK comparison, and RL-Kernel-specific LSE/varlen semantics.
  • Triton: extend the existing dense fallback with LSE export and varlen support as the cross-platform semantic baseline.
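
As a rough picture of the semantics all three backends would need to agree on, the following is a slow reference sketch of causal attention over a packed varlen batch with LSE export. It assumes a `cu_seqlens` list of cumulative sequence offsets (a common packing convention, not something this issue specifies), a single head, natural-log LSE, and a `1/sqrt(head_dim)` scale; the name `varlen_causal_attention` is hypothetical.

```python
import math

def varlen_causal_attention(q, k, v, cu_seqlens):
    """Reference causal attention over packed sequences (single head).

    q, k, v: packed token lists, shape [total_tokens][head_dim].
    cu_seqlens: cumulative offsets, length batch + 1 (assumed convention),
                e.g. [0, 2, 3] packs one sequence of length 2 and one of length 1.
    Returns (out, lse): outputs [total_tokens][head_dim] and one LSE per token.
    """
    head_dim = len(q[0])
    scale = 1.0 / math.sqrt(head_dim)  # assumed softmax scale
    out, lse = [], []
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        for i in range(start, end):
            # Token i attends only to keys start..i of its own sequence,
            # so packed neighbours never leak into each other.
            scores = [scale * sum(a * b for a, b in zip(q[i], k[j]))
                      for j in range(start, i + 1)]
            m = max(scores)
            weights = [math.exp(s - m) for s in scores]
            denom = sum(weights)
            lse.append(m + math.log(denom))
            out.append([sum(w * v[start + t][c] for t, w in enumerate(weights)) / denom
                        for c in range(len(v[0]))])
    return out, lse
```

A fused kernel would tile this computation and never materialize `scores`, but its `out` and `lse` could be checked against a reference of this shape on small inputs.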
