Port mamba2 kernels and runtime from sglang#03c77dc #412
lightseek-bot left a comment
Please build this on components already available in tokenspeed, such as modules from Qwen 3.5, rather than adapting code from other projects.
d06d03e to 402ef26
Add the TokenSpeed Mamba2 runtime wrappers and Triton SSD kernels needed by hybrid Mamba2 models, with the local decode selective-state-update implementation omitted in favor of FlashInfer's maintained flashinfer.mamba.selective_state_update.

Provenance:
- Source repo: https://github.com/sgl-project/sglang
- Source commit: 03c77dc33d0a051aa15c1235407440d9d107b98f
- Source files adapted from SGLang:
  - python/sglang/srt/layers/attention/mamba/mamba.py
  - python/sglang/srt/layers/attention/mamba/mamba2_metadata.py
  - python/sglang/srt/layers/attention/mamba/mixer2_rms_norm_gated.py
  - python/sglang/srt/layers/attention/mamba/ops/ssd_bmm.py
  - python/sglang/srt/layers/attention/mamba/ops/ssd_chunk_scan.py
  - python/sglang/srt/layers/attention/mamba/ops/ssd_chunk_state.py
  - python/sglang/srt/layers/attention/mamba/ops/ssd_combined.py
  - python/sglang/srt/layers/attention/mamba/ops/ssd_state_passing.py

TokenSpeed adaptations:
- Use TokenSpeed Mapping, tensor-parallel helpers, linear layers, and weight loader hooks in the runtime mixer.
- Import Triton through tokenspeed_kernel._triton in the copied SSD kernels.
- Keep SGLang/vLLM/source-state comments in the copied kernel files.
- Use FlashInfer for selective_state_update instead of carrying SGLang's local mamba_ssm.py SSU implementation.

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
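For readers unfamiliar with the decode-time selective state update that the commit delegates to FlashInfer, the following is a minimal pure-Python sketch of the underlying recurrence for a single head and a single token. It is not the FlashInfer kernel or its signature: the function name `ssm_decode_step` and its argument layout are hypothetical, and it assumes the simplified Mamba2 form with a scalar `A` per head, omitting dt bias, softplus, and gating.

```python
import math


def ssm_decode_step(state, x, dt, A, B, C, D=0.0):
    """One recurrent decode step for a single head (illustrative only).

    state: list of rows, shape (head_dim, d_state); updated in place.
    x:     list of length head_dim (current token's input for this head).
    dt, A: scalars (step size and per-head decay parameter).
    B, C:  lists of length d_state (input and output projections).
    D:     scalar skip-connection weight.
    """
    decay = math.exp(dt * A)  # discretized state decay for this step
    y = []
    for p, x_p in enumerate(x):
        row = state[p]
        for n in range(len(row)):
            # h <- exp(dt * A) * h + dt * B * x
            row[n] = decay * row[n] + dt * B[n] * x_p
        # y = C . h + D * x
        y.append(sum(c * h for c, h in zip(C, row)) + D * x_p)
    return y


# Usage: carry `state` across decode steps for one sequence.
state = [[0.0, 0.0]]
out = ssm_decode_step(state, x=[2.0], dt=1.0, A=0.0, B=[1.0, 0.5], C=[1.0, 1.0])
```

A production kernel batches this over sequences and heads and fuses the extra terms; the sketch only shows why the per-sequence state has to persist between decode steps.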
402ef26 to 79ace81
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0079f0df82
import torch
import torch.nn as nn
from flashinfer.mamba import selective_state_update
Keep FlashInfer behind tokenspeed-kernel
In deployments where flashinfer-python is unavailable or unsupported (for example non-NVIDIA/runtime-only environments), importing this Mamba2 layer now raises before backend selection can fall back. The repo guidance says runtime code should use tokenspeed-kernel as the kernel boundary and keep third-party kernel libraries there, so this update path should be exposed through tokenspeed-kernel or optionalized instead of importing FlashInfer directly here.
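One way to read the "optionalized" suggestion is sketched below: resolve the third-party kernel lazily and fail only when it is actually called, so that importing the layer module never raises. The helper names `resolve_optional_kernel` and `selective_state_update_or_raise` are hypothetical and not part of tokenspeed or tokenspeed-kernel; only the `flashinfer.mamba.selective_state_update` path is taken from the PR.

```python
import importlib
from typing import Any, Callable, Optional


def resolve_optional_kernel(module_name: str, attr: str) -> Optional[Callable[..., Any]]:
    """Return `module_name.attr` if it can be imported, else None."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Dependency not installed: let the caller pick a fallback.
        return None
    return getattr(module, attr, None)


def selective_state_update_or_raise(*args: Any, **kwargs: Any) -> Any:
    """Hypothetical wrapper: defer the FlashInfer import until first use."""
    kernel = resolve_optional_kernel("flashinfer.mamba", "selective_state_update")
    if kernel is None:
        raise RuntimeError(
            "selective_state_update is unavailable: flashinfer is not installed"
        )
    return kernel(*args, **kwargs)
```

With this shape, backend selection can probe `resolve_optional_kernel(...) is None` and choose another path before any Mamba2 layer is constructed. Whether the lookup belongs in the runtime layer or inside tokenspeed-kernel is the design question the review raises; the sketch does not settle it.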
This PR has been inactive for 14 days and is marked as stale. It will be closed in 3 days if there is no further activity.
Groundwork for NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 support.
This ports the required Mamba2 Triton kernels, mixer, and metadata classes from SGLang for the follow-up NemotronH architecture PRs.