Add Triton sampling backends alongside FlashInfer #280

Open
FlamingoPg wants to merge 18 commits into lightseekorg:main from FlamingoPg:flamingo/sample

Conversation

FlamingoPg commented May 27, 2026

Contributor

Summary

  • Add TokenSpeed-native triton and triton_full sampling backends alongside the existing flashinfer / flashinfer_full probability-route backends.
  • Keep NVIDIA default sampling backend as flashinfer; this PR does not remove FlashInfer sampling or change the default route.
  • Adapt vLLM MRV2 sampling principles at the kernel/runtime boundary: logits-to-Gumbel-Max sampling, stateless in-kernel RNG, TokenSpeed pool-state indirection, and CUDA-graph-safe sampler variants.
  • Keep runtime dependencies behind the tokenspeed-kernel boundary; attention/MoE/quantization FlashInfer paths are untouched.
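The logits-to-Gumbel-Max idea with a stateless RNG can be sketched in plain Python. This is an illustrative reference only, not the PR's Triton kernel: the function names, the BLAKE2 hash standing in for a counter-based RNG, and the (seed, step, token_id) keying are all assumptions made for this sketch.

```python
import hashlib
import math


def uniform_from_counter(seed: int, step: int, token_id: int) -> float:
    """Stateless uniform in (0, 1): a pure function of (seed, step, token_id).

    Hypothetical stand-in for an in-kernel counter-based RNG; no generator
    state is carried between calls.
    """
    key = f"{seed}:{step}:{token_id}".encode()
    bits = int.from_bytes(hashlib.blake2b(key, digest_size=4).digest(), "little")
    # The +0.5 offset keeps the value strictly inside (0, 1).
    return (bits + 0.5) / 2**32


def gumbel_max_sample(logits, temperature: float, seed: int, step: int) -> int:
    """Return argmax_i(logits[i] / temperature + g_i) with g_i ~ Gumbel(0, 1).

    The argmax is distributed as softmax(logits / temperature), so no
    probability vector is ever built.
    """
    best_token, best_score = -1, -math.inf
    for token_id, logit in enumerate(logits):
        u = uniform_from_counter(seed, step, token_id)
        gumbel = -math.log(-math.log(u))
        score = logit / temperature + gumbel
        if score > best_score:
            best_token, best_score = token_id, score
    return best_token
```

Because the noise depends only on the key, replaying the same (seed, step) reproduces the same token, which is the property that makes this style of RNG convenient under CUDA graph replay.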

Changes

  • Add pool-aware Triton sampling ops for no-filter Gumbel, finite top-k, finite top-k + top-p, top-p-only, min-p/full sampling helpers, selected-token logprob, and verify-chain support.
  • Add triton and triton_full runtime backends with separate MRV2-style Gumbel routes.
  • Add a neutral PoolSamplingBackend so FlashInfer and Triton share TokenSpeed request-pool state without Triton inheriting FlashInfer probability/coin state.
  • Add CUDA graph sampler variants so graph replay can select the captured route for no-filter, top-k, top-k+top-p, top-p-only, verify, and full/min-p paths.
  • Keep FlashInfer/CUDA probability-route code available as a parallel backend and baseline.
  • Reorganize sampling kernel tests so existing sampling/CUDA tests stay in test_sampling.py, while Triton-specific coverage lives in test_sampling_triton.py and test_sampling_triton_full.py.
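The CUDA graph sampler variants listed above imply some rule that maps a batch's sampling parameters to one captured route. The sketch below shows one way such a rule could look. It is a guess for illustration: the function name, the returned route strings, the "finite top-k" test, and the batch-level "most general route wins" policy are assumptions, not the PR's implementation, and the verify route is left out.

```python
def select_sampler_variant(top_k, top_p, min_p, vocab_size: int) -> str:
    """Pick one route name for a whole batch (hypothetical rule set).

    top_k, top_p and min_p are per-row lists. A top_k value <= 0 or
    >= vocab_size is treated as "no top-k filter" (an assumption).
    """
    has_min_p = any(m > 0.0 for m in min_p)
    has_top_k = any(0 < k < vocab_size for k in top_k)
    has_top_p = any(p < 1.0 for p in top_p)

    if has_min_p:
        return "full"          # min-p needs the full sampling path
    if has_top_k and has_top_p:
        return "top_k_top_p"
    if has_top_k:
        return "top_k"
    if has_top_p:
        return "top_p"
    return "no_filter"         # plain Gumbel-Max over all logits
```

Selecting the route before replay keeps each captured graph free of data-dependent branching; only the choice of which graph to replay depends on the batch.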

Benchmark

These are the latest focused sampling-path results I could trace from the current branch. Source artifacts:

  • Normal sample focused benchmark: /tmp/tokenspeed_sampling_mr/focused_sampling_path_bench.csv
  • Normal sample focused benchmark, large vocab: /tmp/tokenspeed_sampling_mr/focused_sampling_path_bench_151936.csv
  • Full/min-p focused operator benchmark: /tmp/tokenspeed_sampling_mr/current_sampling_ops.csv

Environment from the benchmark logs: NVIDIA H100 80GB HBM3. Timing uses CUDA events. All latencies are in milliseconds, shown as median / p95. Speedup is the FlashInfer baseline median divided by the Triton median, so higher is better.
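For readers checking the tables, the reported statistics can be reproduced from raw per-iteration latencies along these lines. This is a sketch under assumptions: the helper names are hypothetical and the nearest-rank p95 definition is a guess, since the benchmark script is not part of this PR.

```python
import statistics


def summarize(latencies_ms):
    """Return (median, p95) for a list of per-iteration latencies in ms."""
    ordered = sorted(latencies_ms)
    median = statistics.median(ordered)
    # Nearest-rank p95 (assumed definition): integer ceil of 0.95 * n.
    rank = max(1, (95 * len(ordered) + 99) // 100)
    return median, ordered[rank - 1]


def speedup(baseline_median_ms: float, candidate_median_ms: float) -> float:
    """Baseline median divided by candidate median; above 1.0 means faster."""
    return baseline_median_ms / candidate_median_ms


# Medians taken from the no_filter / 32768 / bs=1 row of the first table.
ratio = speedup(0.087296, 0.042960)
print(f"{ratio:.2f}x")
```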

Important scope notes:

  • Normal sample rows compare against flashinfer.sample() core behavior. That route does not call fused_topk_topp_renorm; it uses softmax -> top_k_top_p_sampling_from_probs.
  • Full/min-p rows compare against flashinfer_full-style probability behavior. That route does call fused_topk_topp_renorm on NVIDIA before min_p_sampling_from_probs.
  • current_triton_pool_op is the latest optimized pool-aware Triton op path.
  • current_runtime_sample is the full TokenSpeed runtime sampling backend call and includes route/backend overhead.
  • These are focused sampler-path measurements, not full serving throughput claims.

A. Normal Sample: Triton vs flashinfer.sample()

Baseline route:

softmax(logits / temperature)
-> top_k_top_p_sampling_from_probs(...)
-> token

Candidate route:

logits + pool scalars
-> TokenSpeed Triton Gumbel/candidate sampler
-> token
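The two routes differ in how they get there, but for a top-k filter they target the same distribution. The sketch below checks this numerically in plain Python. It is a reference for the semantics only, with hypothetical function names: it does not reproduce FlashInfer's or the PR's kernels, and top-p is omitted for brevity.

```python
import math


def softmax(values):
    """Numerically stable softmax; -inf entries map to probability 0."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_indices(values, k: int):
    """Indices of the k largest values (ties broken by lower index)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return set(order[:k])


def prob_route_distribution(logits, temperature: float, k: int):
    """Probability route: softmax, keep top-k probabilities, renormalize."""
    probs = softmax([x / temperature for x in logits])
    keep = top_k_indices(probs, k)
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0 for i in range(len(probs))]


def logits_route_distribution(logits, temperature: float, k: int):
    """Logits route: mask non-top-k logits to -inf.

    Gumbel-Max over the masked logits samples from their softmax, which is
    what this function returns.
    """
    scaled = [x / temperature for x in logits]
    keep = top_k_indices(scaled, k)
    masked = [z if i in keep else -math.inf for i, z in enumerate(scaled)]
    return softmax(masked)
```

Both functions return the same vector up to floating-point rounding, which is why the logits route can skip materializing and renormalizing the full probability table.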

Current Pool-Aware Triton Op vs Old FlashInfer Core

| mode | vocab | bs | old FlashInfer core (ms, median / p95) | current Triton pool op (ms, median / p95) | speedup |
|---|---|---|---|---|---|
| no_filter | 32768 | 1 | 0.087296 / 0.102688 | 0.042960 / 0.047200 | 2.03x |
| no_filter | 32768 | 8 | 0.085504 / 0.102496 | 0.042240 / 0.046784 | 2.02x |
| no_filter | 32768 | 32 | 0.086384 / 0.099680 | 0.042304 / 0.047104 | 2.04x |
| top_k_top_p | 32768 | 1 | 0.103248 / 0.115648 | 0.060272 / 0.071072 | 1.71x |
| top_k_top_p | 32768 | 8 | 0.103840 / 0.118304 | 0.058672 / 0.071616 | 1.77x |
| top_k_top_p | 32768 | 32 | 0.106208 / 0.114304 | 0.058896 / 0.072384 | 1.80x |
| no_filter | 151936 | 1 | 0.141440 / 0.173440 | 0.044880 / 0.056096 | 3.15x |
| no_filter | 151936 | 8 | 0.147280 / 0.156704 | 0.045472 / 0.055040 | 3.24x |
| no_filter | 151936 | 32 | 0.167904 / 0.178080 | 0.052016 / 0.058080 | 3.23x |
| top_k_top_p | 151936 | 1 | 0.195264 / 0.215296 | 0.091920 / 0.100032 | 2.12x |
| top_k_top_p | 151936 | 8 | 0.203504 / 0.216416 | 0.096208 / 0.100128 | 2.12x |
| top_k_top_p | 151936 | 32 | 0.234256 / 0.243520 | 0.134336 / 0.139328 | 1.74x |

Full Runtime Sample Call vs Old FlashInfer Core

| mode | vocab | bs | old FlashInfer core (ms, median / p95) | current runtime sample (ms, median / p95) | speedup |
|---|---|---|---|---|---|
| no_filter | 32768 | 1 | 0.087296 / 0.102688 | 0.083232 / 0.093696 | 1.05x |
| no_filter | 32768 | 8 | 0.085504 / 0.102496 | 0.081232 / 0.089568 | 1.05x |
| no_filter | 32768 | 32 | 0.086384 / 0.099680 | 0.081504 / 0.090112 | 1.06x |
| top_k_top_p | 32768 | 1 | 0.103248 / 0.115648 | 0.077200 / 0.094624 | 1.34x |
| top_k_top_p | 32768 | 8 | 0.103840 / 0.118304 | 0.077120 / 0.090816 | 1.35x |
| top_k_top_p | 32768 | 32 | 0.106208 / 0.114304 | 0.077392 / 0.091232 | 1.37x |
| no_filter | 151936 | 1 | 0.141440 / 0.173440 | 0.100048 / 0.130112 | 1.41x |
| no_filter | 151936 | 8 | 0.147280 / 0.156704 | 0.100512 / 0.114112 | 1.47x |
| no_filter | 151936 | 32 | 0.167904 / 0.178080 | 0.100496 / 0.115904 | 1.67x |
| top_k_top_p | 151936 | 1 | 0.195264 / 0.215296 | 0.109248 / 0.122432 | 1.79x |
| top_k_top_p | 151936 | 8 | 0.203504 / 0.216416 | 0.112416 / 0.119168 | 1.81x |
| top_k_top_p | 151936 | 32 | 0.234256 / 0.243520 | 0.150912 / 0.156448 | 1.55x |

B. Full/min-p Path: Triton Full vs flashinfer_full

Baseline route on NVIDIA:

apply penalties/bias if enabled
-> softmax(logits / temperature)
-> fused_topk_topp_renorm(...)
-> min_p_sampling_from_probs(...)
-> token

Candidate route:

logits + full sampling pools
-> TokenSpeed Triton full/min-p Gumbel/rejection sampler
-> token

This table is the focused full/min-p operator comparison from current_sampling_ops.csv.

| mode | vocab | bs | FlashInfer full baseline (ms, median / p95) | Triton full op (ms, median / p95) | speedup |
|---|---|---|---|---|---|
| min_p | 32768 | 1 | 0.097856 / 0.106944 | 0.095472 / 0.108576 | 1.02x |
| min_p | 32768 | 8 | 0.182784 / 0.194080 | 0.095312 / 0.109440 | 1.92x |
| min_p | 32768 | 32 | 0.202912 / 0.217952 | 0.095696 / 0.112288 | 2.12x |
| top_k_top_p_min_p | 32768 | 1 | 0.108672 / 0.114976 | 0.096208 / 0.109472 | 1.13x |
| top_k_top_p_min_p | 32768 | 8 | 0.114528 / 0.120640 | 0.095312 / 0.107328 | 1.20x |
| top_k_top_p_min_p | 32768 | 32 | 0.126416 / 0.134528 | 0.097888 / 0.110592 | 1.29x |
| min_p | 151936 | 1 | 0.176240 / 0.180416 | 0.098080 / 0.113792 | 1.80x |
| min_p | 151936 | 8 | 0.230592 / 0.234496 | 0.096384 / 0.110080 | 2.39x |
| min_p | 151936 | 32 | 0.609088 / 0.632736 | 0.162848 / 0.172320 | 3.74x |
| top_k_top_p_min_p | 151936 | 1 | 0.206560 / 0.210592 | 0.097024 / 0.112128 | 2.13x |
| top_k_top_p_min_p | 151936 | 8 | 0.255536 / 0.260544 | 0.096496 / 0.110560 | 2.65x |
| top_k_top_p_min_p | 151936 | 32 | 0.346656 / 0.355264 | 0.162208 / 0.166272 | 2.14x |

Benchmark Interpretation

  • Normal sample and full/min-p are different baselines. Normal flashinfer.sample() does not use fused_topk_topp_renorm; flashinfer_full does.
  • The Triton op-level route is the clear win on normal sampling: 2.02x-3.24x on no-filter and 1.71x-2.12x on finite top-k + top-p. At vocab 151936, the top-k + top-p gain drops from 2.12x at bs1/bs8 to 1.74x at bs32.
  • The full runtime normal-sample call still wins, but the margin is smaller because route selection and backend scaffolding are included. The smallest win is 32768 no-filter at about 1.05x.
  • Against the flashinfer_full route that includes fused_topk_topp_renorm, Triton full also wins on median latency in these focused full/min-p operator rows. The tightest case is 32768/bs1 min_p, where the median improves slightly (1.02x) while p95 is marginally worse (0.108576 vs 0.106944 ms).
  • This PR does not claim full serving throughput speedup from these focused numbers; it keeps FlashInfer as the default backend while the Triton routes remain opt-in.

Validation

  • pre-commit run --all-files (passed)
  • python -m pytest test/runtime/test_sampling_backend_pool.py test/runtime/test_sampling_backend_registry.py test/runtime/test_cli_config_compat.py (90 passed, 18 warnings)
  • python -m pytest tokenspeed-kernel/test/ops/test_sampling.py tokenspeed-kernel/test/ops/test_sampling_triton.py tokenspeed-kernel/test/ops/test_sampling_triton_full.py (60 passed, 18 warnings)

Notes

  • PR is intentionally additive: flashinfer / flashinfer_full remain available, and NVIDIA default remains flashinfer.
  • Generated benchmark artifacts and migration scratch docs are not included.

@FlamingoPg FlamingoPg requested a review from a team as a code owner May 27, 2026 07:55
@@ -18,14 +18,48 @@
 # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 # SOFTWARE.

-"""Triton sampling helper kernels."""
+"""TokenSpeed-native Triton sampling kernels.

Contributor

Please add vLLM's copyright here

Contributor Author

Added SPDX-FileCopyrightText: Copyright contributors to the vLLM project to the Triton sampling kernel header in commit c3c25815.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c3c2581533

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread tokenspeed-kernel/python/tokenspeed_kernel/ops/sampling/triton.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 981f9400f2

Comment thread tokenspeed-kernel/python/tokenspeed_kernel/ops/sampling/triton.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 681396814e

Comment thread tokenspeed-kernel/python/tokenspeed_kernel/ops/sampling/triton.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 28fa045441

Comment thread python/tokenspeed/runtime/sampling/backends/triton.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7179e82b15

Comment thread python/tokenspeed/runtime/sampling/backends/triton.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d81cab179f

Comment thread tokenspeed-kernel/python/tokenspeed_kernel/ops/sampling/triton.py Outdated

@yweng0828 yweng0828 left a comment

Contributor

Could you please compare the performance of fused_topk_topp before we remove this code? #184

Thanks.

@FlamingoPg FlamingoPg marked this pull request as draft May 28, 2026 04:06
@FlamingoPg FlamingoPg changed the title from "Replace sampling FlashInfer backend with TokenSpeed Triton" to "Add Triton sampling backends alongside FlashInfer" May 30, 2026
@FlamingoPg FlamingoPg marked this pull request as ready for review May 30, 2026 17:33
@FlamingoPg FlamingoPg marked this pull request as draft May 30, 2026 17:34
@FlamingoPg FlamingoPg marked this pull request as ready for review May 30, 2026 17:34

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9a76e165f6

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: baab8fce85

Comment thread python/tokenspeed/runtime/sampling/backends/triton_full.py
Signed-off-by: FlamingoPg <1106310035@qq.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cefc7b737f

Comment thread python/tokenspeed/runtime/execution/cuda_graph_wrapper.py Outdated
Signed-off-by: FlamingoPg <1106310035@qq.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b5798a559a

@yweng0828 yweng0828 self-requested a review June 4, 2026 02:46

@yweng0828 yweng0828 left a comment

Contributor

Hi, I have some questions regarding the experiment B. Full/min-p Path: Triton Full vs flashinfer_full. Thanks for your attention.

  1. What is min_p?
  2. Does 0.097856 / 0.106944 refer to the time between the two experiments?
  3. The FlashInfer full baseline is slower than the Triton full op. Is this because the Triton full op combines all operators together, while the FlashInfer full baseline consists of several separate operators?

@yweng0828 yweng0828 dismissed their stale review June 4, 2026 02:53

Already updated.

Signed-off-by: FlamingoPg <1106310035@qq.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e913e9be6c

Comment thread python/tokenspeed/runtime/sampling/backends/triton_full.py
FlamingoPg commented Jun 4, 2026

Contributor Author

@yweng0828 Thanks for the questions. Let me clarify the comparison boundary here.

  1. min_p keeps tokens whose probability is at least min_p * max_prob for that row. In the probability route this is applied on probs followed by renormalization. In the Triton full route we apply the equivalent logits-space condition before Gumbel-Max, so we avoid materializing the full probability table.

  2. 0.097856 / 0.106944 means median / p95 latency in milliseconds for that benchmark case. It is not the time between two experiments. I should make the table header clearer.

  3. This is not intended as a single-kernel CUDA-vs-Triton microbenchmark. It is a focused operator-pipeline comparison for the full sampling route. The FlashInfer full baseline follows the existing probability route, while Triton full follows logits-space Gumbel-Max. So the speedup comes from changing the sampling formulation and avoiding full probability materialization/renormalization, not from claiming Triton is a faster implementation of the exact same fused CUDA kernel.
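The equivalence behind point 1 can be checked with a short plain-Python sketch. It assumes min_p is applied to temperature-scaled logits and that min_p > 0; the function names are hypothetical and this is not the Triton kernel. Since p_i / p_max = exp(z_i - z_max), the test p_i >= min_p * p_max is the same as z_i >= z_max + log(min_p).

```python
import math


def min_p_keep_from_probs(logits, temperature: float, min_p: float):
    """Probability route: keep token i when p_i >= min_p * max(p)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = min_p * max(probs)
    return [p >= threshold for p in probs]


def min_p_keep_from_logits(logits, temperature: float, min_p: float):
    """Logits route: keep token i when z_i >= max(z) + log(min_p).

    Only a max reduction is needed; no softmax or probability table is built.
    Requires min_p > 0 so that log(min_p) is defined.
    """
    scaled = [x / temperature for x in logits]
    threshold = max(scaled) + math.log(min_p)
    return [z >= threshold for z in scaled]
```

Tokens that fail the test would have their logits masked to -inf before the Gumbel-Max argmax. The two masks can differ only for tokens sitting exactly on the threshold, where floating-point rounding decides.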

Also, your fused_topk_topp CUDA kernel is a very strong implementation of the probability-renorm route. The AIR radix top-k/top-p design, CUDA-graph-safe launch sequence, and side-stream overlap make it a good baseline for the existing probs -> renorm -> sample pipeline. This PR is trying to evaluate a different MRV2-style formulation rather than replace that kernel with a like-for-like Triton version.

yweng0828 commented Jun 4, 2026

Contributor

Hi @FlamingoPg, thank you for your detailed reply! It's really interesting to know such an operator pipeline. Thanks for your great work!

In other words, the two routes ultimately yield the same sampling result (they are mathematically equivalent), but they differ in their computational processes. This leads to their different performance. Is that correct?

If the answer is yes, my last question will be: Under what circumstances can we completely replace the traditional probability route with the logits-space Gumbel-Max route, considering the latter has better performance?

One counterexample I can think of is: if the user needs the probability as part of the output, then we must use the original probability route, is it right?

Glad to hear more from you. Thanks :)
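On the probability-output question, the PR description lists selected-token logprob among the new Triton ops. A minimal sketch, assuming that logprob means the log-softmax of the temperature-scaled logits evaluated at the sampled token (the PR does not spell out the definition, and the function name is hypothetical), suggests that one token's log-probability needs only a log-sum-exp reduction rather than a full probability table:

```python
import math


def selected_token_logprob(logits, temperature: float, token_id: int) -> float:
    """log softmax(logits / temperature)[token_id] via a stable log-sum-exp."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    # Subtracting the max before exponentiating avoids overflow.
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return scaled[token_id] - log_sum_exp
```

A request that needs the full per-token probability vector, rather than the logprob of the chosen token, would still require materializing the distribution.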

@github-actions

This PR has been inactive for 14 days and is marked as stale. It will be closed in 3 days if there is no further activity.
