Why
Target long-context RL workloads with packed variable-length batches. Export attention softmax LSE for backward, diagnostics, and rollout/training attention alignment. The exported LSE is attention-domain LSE, not vocab-logprob LSE.
Todo
- CUDA: SM90 WGMMA + TMA path, SM80 mma.sync fallback, with varlen and LSE export.
- ROCm: MFMA-based kernel with 16x16x16-style tiling, CK comparison, and RL-Kernel-specific LSE/varlen semantics.
- Triton: extend the existing dense fallback with LSE export and varlen support as the cross-platform semantic baseline.
Why
Target long-context RL workloads with packed variable-length batches. Export attention softmax LSE for backward, diagnostics, and rollout/training attention alignment. The exported LSE is attention-domain LSE, not vocab-logprob LSE.
Todo