Port mamba2 kernels and runtime from sglang#03c77dc #412

Open
netanel-haber wants to merge 2 commits into lightseekorg:main from netanel-haber:feature/mamba2-triton-kernels

@netanel-haber netanel-haber commented Jun 10, 2026

Groundwork for NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 support.

This ports the required Mamba2 Triton kernels, mixer, and metadata classes from SGLang for the follow-up NemotronH architecture PRs.

@netanel-haber netanel-haber changed the title from "port mamba2 kernels from sglang#03c77dc" to "Port mamba2 kernels from sglang#03c77dc" Jun 10, 2026
@netanel-haber netanel-haber changed the title from "Port mamba2 kernels from sglang#03c77dc" to "Port mamba2 kernels and runtime from sglang#03c77dc" Jun 10, 2026

@lightseek-bot lightseek-bot left a comment

Contributor

Please build this on components already available in TokenSpeed, such as some of the Qwen 3.5 modules, rather than adapting code from other projects.

@netanel-haber netanel-haber force-pushed the feature/mamba2-triton-kernels branch 2 times, most recently from d06d03e to 402ef26 on June 11, 2026 13:32
Add the TokenSpeed Mamba2 runtime wrappers and Triton SSD kernels needed by hybrid Mamba2 models, with the local decode selective-state-update implementation omitted in favor of FlashInfer's maintained flashinfer.mamba.selective_state_update.

Provenance:

- Source repo: https://github.com/sgl-project/sglang

- Source commit: 03c77dc33d0a051aa15c1235407440d9d107b98f

- Source files adapted from SGLang:

  - python/sglang/srt/layers/attention/mamba/mamba.py

  - python/sglang/srt/layers/attention/mamba/mamba2_metadata.py

  - python/sglang/srt/layers/attention/mamba/mixer2_rms_norm_gated.py

  - python/sglang/srt/layers/attention/mamba/ops/ssd_bmm.py

  - python/sglang/srt/layers/attention/mamba/ops/ssd_chunk_scan.py

  - python/sglang/srt/layers/attention/mamba/ops/ssd_chunk_state.py

  - python/sglang/srt/layers/attention/mamba/ops/ssd_combined.py

  - python/sglang/srt/layers/attention/mamba/ops/ssd_state_passing.py

TokenSpeed adaptations:

- Use TokenSpeed Mapping, tensor-parallel helpers, linear layers, and weight loader hooks in the runtime mixer.

- Import Triton through tokenspeed_kernel._triton in the copied SSD kernels.

- Keep SGLang/vLLM/source-state comments in the copied kernel files.

- Use FlashInfer for selective_state_update instead of carrying SGLang's local mamba_ssm.py SSU implementation.

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
@netanel-haber netanel-haber force-pushed the feature/mamba2-triton-kernels branch from 402ef26 to 79ace81 on June 11, 2026 14:35
@netanel-haber netanel-haber marked this pull request as ready for review June 11, 2026 21:56
@netanel-haber netanel-haber requested a review from a team as a code owner June 11, 2026 21:56

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0079f0df82

import torch
import torch.nn as nn
from flashinfer.mamba import selective_state_update

P1: Keep FlashInfer behind tokenspeed-kernel

In deployments where flashinfer-python is unavailable or unsupported (for example non-NVIDIA/runtime-only environments), importing this Mamba2 layer now raises before backend selection can fall back. The repo guidance says runtime code should use tokenspeed-kernel as the kernel boundary and keep third-party kernel libraries there, so this update path should be exposed through tokenspeed-kernel or optionalized instead of importing FlashInfer directly here.
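
One way to optionalize the import, as suggested above, is to resolve FlashInfer lazily and fail only when the update path is actually called. The following is a minimal sketch under stated assumptions: `load_optional` and the wrapper `selective_state_update` are hypothetical helpers written for illustration, not part of tokenspeed-kernel or FlashInfer. The only name taken from the PR is the import target `flashinfer.mamba.selective_state_update`.

```python
import importlib
from typing import Any, Callable, Optional


def load_optional(module_name: str, attr: str) -> Optional[Callable[..., Any]]:
    """Return `attr` from `module_name`, or None if either is unavailable.

    Hypothetical helper: it only guards against ImportError, so a missing
    package degrades to None instead of raising at import time.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, attr, None)


# Resolved once when this module is imported. If flashinfer-python is not
# installed, this is None and importing the Mamba2 layer still succeeds.
_flashinfer_ssu = load_optional("flashinfer.mamba", "selective_state_update")


def selective_state_update(*args: Any, **kwargs: Any) -> Any:
    """Hypothetical wrapper: forward to FlashInfer if present, else fail at call time."""
    if _flashinfer_ssu is None:
        raise RuntimeError(
            "selective_state_update requires flashinfer-python; "
            "install it or select a backend that does not need it"
        )
    return _flashinfer_ssu(*args, **kwargs)
```

With this shape, backend selection can check whether `_flashinfer_ssu` is None before choosing a path, and the error surfaces only if the FlashInfer path is actually taken. Placing such a wrapper inside tokenspeed-kernel, rather than in the runtime layer, would also match the kernel-boundary guidance the comment refers to.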

@github-actions

This PR has been inactive for 14 days and is marked as stale. It will be closed in 3 days if there is no further activity.
