Add Triton sampling backends alongside FlashInfer by FlamingoPg · Pull Request #280 · lightseekorg/tokenspeed

FlamingoPg · 2026-05-27T07:55:49Z

Summary

Add TokenSpeed-native triton and triton_full sampling backends alongside the existing flashinfer / flashinfer_full probability-route backends.
Keep NVIDIA default sampling backend as flashinfer; this PR does not remove FlashInfer sampling or change the default route.
Adapt vLLM MRV2 sampling principles at the kernel/runtime boundary: logits-to-Gumbel-Max sampling, stateless in-kernel RNG, TokenSpeed pool-state indirection, and CUDA-graph-safe sampler variants.
Keep runtime dependencies behind the tokenspeed-kernel boundary; attention/MoE/quantization FlashInfer paths are untouched.
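
As a rough illustration of the logits-to-Gumbel-Max idea with a stateless RNG (not code from this PR), the sketch below samples a token by adding Gumbel noise to temperature-scaled logits and taking the argmax. The noise is derived from a hash of `(seed, step, token_id)`, so no generator state is carried between calls. The helper names `stateless_uniform` and `gumbel_max_sample`, and the use of a BLAKE2 hash in place of a GPU counter-based RNG, are assumptions made for this example only.

```python
import hashlib
import math


def stateless_uniform(seed: int, step: int, token_id: int) -> float:
    """Map (seed, step, token_id) to a uniform value strictly inside (0, 1).

    A hash stands in for a counter-based RNG here (an assumption for this
    sketch). The same inputs always produce the same value.
    """
    digest = hashlib.blake2b(
        f"{seed}:{step}:{token_id}".encode(), digest_size=8
    ).digest()
    # Keep 52 bits and offset by 0.5 so the result is never exactly 0 or 1.
    bits = int.from_bytes(digest, "big") >> 12
    return (bits + 0.5) / 2**52


def gumbel_max_sample(logits, temperature: float, seed: int, step: int) -> int:
    """Return argmax_i(logits[i] / temperature + Gumbel noise_i)."""
    best_id, best_score = -1, -math.inf
    for token_id, logit in enumerate(logits):
        u = stateless_uniform(seed, step, token_id)
        gumbel = -math.log(-math.log(u))
        score = logit / temperature + gumbel
        if score > best_score:
            best_id, best_score = token_id, score
    return best_id
```

The argmax over noisy logits is distributed like a draw from `softmax(logits / temperature)`, which is why this formulation does not need a normalized probability table.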

Changes

Add pool-aware Triton sampling ops for no-filter Gumbel, finite top-k, finite top-k + top-p, top-p-only, min-p/full sampling helpers, selected-token logprob, and verify-chain support.
Add triton and triton_full runtime backends with separate MRV2-style Gumbel routes.
Add a neutral PoolSamplingBackend so FlashInfer and Triton share TokenSpeed request-pool state without Triton inheriting FlashInfer probability/coin state.
Add CUDA graph sampler variants so graph replay can select the captured route for no-filter, top-k, top-k+top-p, top-p-only, verify, and full/min-p paths.
Keep FlashInfer/CUDA probability-route code available as a parallel backend and baseline.
Reorganize sampling kernel tests so existing sampling/CUDA tests stay in test_sampling.py, while Triton-specific coverage lives in test_sampling_triton.py and test_sampling_triton_full.py.
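
To make the pool-state indirection and per-route graph variants more concrete, here is a minimal sketch under stated assumptions. `SamplingPool`, `gather_temperature`, and `select_route` are hypothetical names invented for this example, the sentinel values (0 for top-k, 1.0 for top-p, 0.0 for min-p) are assumptions, and the verify route is left out. It is not the `PoolSamplingBackend` implementation.

```python
from dataclasses import dataclass, field


@dataclass
class SamplingPool:
    """Hypothetical per-slot sampling parameters shared across backends."""

    temperature: list[float] = field(default_factory=list)
    top_k: list[int] = field(default_factory=list)    # 0 means no top-k filter
    top_p: list[float] = field(default_factory=list)  # 1.0 means no top-p filter
    min_p: list[float] = field(default_factory=list)  # 0.0 means no min-p filter


def gather_temperature(pool: SamplingPool, slots: list[int]) -> list[float]:
    """Read each batch row's temperature through its pool slot index."""
    return [pool.temperature[s] for s in slots]


def select_route(pool: SamplingPool, slots: list[int]) -> str:
    """Pick one sampler variant for the batch from the pooled parameters."""
    use_top_k = any(pool.top_k[s] > 0 for s in slots)
    use_top_p = any(pool.top_p[s] < 1.0 for s in slots)
    use_min_p = any(pool.min_p[s] > 0.0 for s in slots)
    if use_min_p:
        return "full"
    if use_top_k and use_top_p:
        return "top_k_top_p"
    if use_top_k:
        return "top_k"
    if use_top_p:
        return "top_p_only"
    return "no_filter"
```

In this sketch the batch only carries slot indices, so the same captured sampler can be replayed for different requests as long as the selected route matches the one that was captured.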

Benchmark

These are the latest focused sampling-path results I could trace from the current branch. Source artifacts:

Normal sample focused benchmark: /tmp/tokenspeed_sampling_mr/focused_sampling_path_bench.csv
Normal sample focused benchmark, large vocab: /tmp/tokenspeed_sampling_mr/focused_sampling_path_bench_151936.csv
Full/min-p focused operator benchmark: /tmp/tokenspeed_sampling_mr/current_sampling_ops.csv

Environment from the benchmark logs: NVIDIA H100 80GB HBM3. Timing uses CUDA events. All numbers are milliseconds, shown as median / p95. Speedup is the FlashInfer baseline median divided by the Triton median, so higher is better.
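
For readers who want to reproduce the summary columns, the sketch below shows one way to turn a list of per-iteration latencies into median / p95 and a speedup figure. The nearest-rank percentile method and the function names `summarize` and `speedup` are assumptions for this example; the benchmark scripts may compute p95 differently.

```python
import statistics


def summarize(samples_ms):
    """Return (median, p95) for a list of latencies in milliseconds."""
    ordered = sorted(samples_ms)
    median = statistics.median(ordered)
    # Nearest-rank 95th percentile, using integer arithmetic for the rank.
    rank = max(1, (95 * len(ordered) + 99) // 100)
    return median, ordered[rank - 1]


def speedup(baseline_median_ms, candidate_median_ms):
    """Baseline median divided by candidate median; higher is better."""
    return baseline_median_ms / candidate_median_ms
```

With the medians from the first no_filter row below, `speedup(0.087296, 0.042960)` rounds to 2.03, which matches the reported value.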

Important scope notes:

Normal sample rows compare against flashinfer.sample() core behavior. That route does not call fused_topk_topp_renorm; it uses softmax -> top_k_top_p_sampling_from_probs.
Full/min-p rows compare against flashinfer_full-style probability behavior. That route does call fused_topk_topp_renorm on NVIDIA before min_p_sampling_from_probs.
current_triton_pool_op is the latest optimized pool-aware Triton op path.
current_runtime_sample is the full TokenSpeed runtime sampling backend call and includes route/backend overhead.
These are focused sampler-path measurements, not full serving throughput claims.

A. Normal Sample: Triton vs `flashinfer.sample()`

Baseline route:

softmax(logits / temperature)
-> top_k_top_p_sampling_from_probs(...)
-> token

Candidate route:

logits + pool scalars
-> TokenSpeed Triton Gumbel/candidate sampler
-> token
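
The two routes can be compared on a toy example. The sketch below is illustrative only and makes several assumptions: top-k is applied before top-p, top-p is evaluated on mass renormalized within the top-k set, and the kept set is computed the same way for both routes. It does not mirror how either FlashInfer or the Triton kernels find candidates; it only shows that, once the kept set is fixed, inverse-CDF sampling from renormalized probabilities and Gumbel-Max over the kept logits target the same distribution.

```python
import math
import random


def softmax(logits, temperature):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kept_token_ids(logits, temperature, top_k, top_p):
    """Token ids surviving top-k, then top-p on the renormalized top-k mass."""
    probs = softmax(logits, temperature)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    top_k_mass = sum(probs[i] for i in order)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i] / top_k_mass
        if mass >= top_p:
            break
    return kept


def sample_probability_route(logits, temperature, kept, rng):
    """Renormalize over the kept set, then sample by inverse CDF."""
    probs = softmax(logits, temperature)
    r = rng.random() * sum(probs[i] for i in kept)
    acc = 0.0
    for i in kept:
        acc += probs[i]
        if r < acc:
            return i
    return kept[-1]


def sample_gumbel_route(logits, temperature, kept, rng):
    """Gumbel-Max over kept logits; no normalized probabilities needed here."""
    best, best_score = kept[0], -math.inf
    for i in kept:
        u = (rng.getrandbits(52) + 0.5) / 2**52  # uniform strictly inside (0, 1)
        score = logits[i] / temperature - math.log(-math.log(u))
        if score > best_score:
            best, best_score = i, score
    return best
```

For `logits = [3.0, 2.0, 1.0, 0.0]` with `top_k=3` and `top_p=0.9`, the kept set is `[0, 1]`, and both samplers pick token 0 about 73% of the time.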

Current Pool-Aware Triton Op vs Old FlashInfer Core

mode	vocab	bs	old FlashInfer core (ms, median / p95)	current Triton pool op (ms, median / p95)	speedup
no_filter	32768	1	0.087296 / 0.102688	0.042960 / 0.047200	2.03x
no_filter	32768	8	0.085504 / 0.102496	0.042240 / 0.046784	2.02x
no_filter	32768	32	0.086384 / 0.099680	0.042304 / 0.047104	2.04x
top_k_top_p	32768	1	0.103248 / 0.115648	0.060272 / 0.071072	1.71x
top_k_top_p	32768	8	0.103840 / 0.118304	0.058672 / 0.071616	1.77x
top_k_top_p	32768	32	0.106208 / 0.114304	0.058896 / 0.072384	1.80x
no_filter	151936	1	0.141440 / 0.173440	0.044880 / 0.056096	3.15x
no_filter	151936	8	0.147280 / 0.156704	0.045472 / 0.055040	3.24x
no_filter	151936	32	0.167904 / 0.178080	0.052016 / 0.058080	3.23x
top_k_top_p	151936	1	0.195264 / 0.215296	0.091920 / 0.100032	2.12x
top_k_top_p	151936	8	0.203504 / 0.216416	0.096208 / 0.100128	2.12x
top_k_top_p	151936	32	0.234256 / 0.243520	0.134336 / 0.139328	1.74x

Full Runtime Sample Call vs Old FlashInfer Core

mode	vocab	bs	old FlashInfer core (ms, median / p95)	current runtime sample (ms, median / p95)	speedup
no_filter	32768	1	0.087296 / 0.102688	0.083232 / 0.093696	1.05x
no_filter	32768	8	0.085504 / 0.102496	0.081232 / 0.089568	1.05x
no_filter	32768	32	0.086384 / 0.099680	0.081504 / 0.090112	1.06x
top_k_top_p	32768	1	0.103248 / 0.115648	0.077200 / 0.094624	1.34x
top_k_top_p	32768	8	0.103840 / 0.118304	0.077120 / 0.090816	1.35x
top_k_top_p	32768	32	0.106208 / 0.114304	0.077392 / 0.091232	1.37x
no_filter	151936	1	0.141440 / 0.173440	0.100048 / 0.130112	1.41x
no_filter	151936	8	0.147280 / 0.156704	0.100512 / 0.114112	1.47x
no_filter	151936	32	0.167904 / 0.178080	0.100496 / 0.115904	1.67x
top_k_top_p	151936	1	0.195264 / 0.215296	0.109248 / 0.122432	1.79x
top_k_top_p	151936	8	0.203504 / 0.216416	0.112416 / 0.119168	1.81x
top_k_top_p	151936	32	0.234256 / 0.243520	0.150912 / 0.156448	1.55x

B. Full/min-p Path: Triton Full vs `flashinfer_full`

Baseline route on NVIDIA:

apply penalties/bias if enabled
-> softmax(logits / temperature)
-> fused_topk_topp_renorm(...)
-> min_p_sampling_from_probs(...)
-> token

Candidate route:

logits + full sampling pools
-> TokenSpeed Triton full/min-p Gumbel/rejection sampler
-> token

This table is the focused full/min-p operator comparison from current_sampling_ops.csv.

mode	vocab	bs	FlashInfer full baseline (ms, median / p95)	Triton full op (ms, median / p95)	speedup
min_p	32768	1	0.097856 / 0.106944	0.095472 / 0.108576	1.02x
min_p	32768	8	0.182784 / 0.194080	0.095312 / 0.109440	1.92x
min_p	32768	32	0.202912 / 0.217952	0.095696 / 0.112288	2.12x
top_k_top_p_min_p	32768	1	0.108672 / 0.114976	0.096208 / 0.109472	1.13x
top_k_top_p_min_p	32768	8	0.114528 / 0.120640	0.095312 / 0.107328	1.20x
top_k_top_p_min_p	32768	32	0.126416 / 0.134528	0.097888 / 0.110592	1.29x
min_p	151936	1	0.176240 / 0.180416	0.098080 / 0.113792	1.80x
min_p	151936	8	0.230592 / 0.234496	0.096384 / 0.110080	2.39x
min_p	151936	32	0.609088 / 0.632736	0.162848 / 0.172320	3.74x
top_k_top_p_min_p	151936	1	0.206560 / 0.210592	0.097024 / 0.112128	2.13x
top_k_top_p_min_p	151936	8	0.255536 / 0.260544	0.096496 / 0.110560	2.65x
top_k_top_p_min_p	151936	32	0.346656 / 0.355264	0.162208 / 0.166272	2.14x

Benchmark Interpretation

Normal sample and full/min-p are different baselines. Normal flashinfer.sample() does not use fused_topk_topp_renorm; flashinfer_full does.
The Triton op-level route is the clear win on normal sampling: 2.02x-3.24x on no-filter and 1.71x-2.12x on finite top-k + top-p. Within that range, the 151936/bs32 top-k + top-p case drops to 1.74x from the 2.12x seen at bs1 and bs8.
The full runtime normal-sample call still wins, but the margin is smaller because route selection and backend scaffolding are included. The smallest win is 32768 no-filter at about 1.05x.
Against the flashinfer_full route that includes fused_topk_topp_renorm, Triton full also wins on median in these focused full/min-p operator rows. The tightest case is 32768/bs1 min_p, where the median improves slightly (0.097856 -> 0.095472 ms, 1.02x) while p95 is marginally worse (0.106944 -> 0.108576 ms).
This PR does not claim full serving throughput speedup from these focused numbers; it keeps FlashInfer as the default backend while the Triton routes remain opt-in.

Validation

pre-commit run --all-files (passed)
python -m pytest test/runtime/test_sampling_backend_pool.py test/runtime/test_sampling_backend_registry.py test/runtime/test_cli_config_compat.py (90 passed, 18 warnings)
python -m pytest tokenspeed-kernel/test/ops/test_sampling.py tokenspeed-kernel/test/ops/test_sampling_triton.py tokenspeed-kernel/test/ops/test_sampling_triton_full.py (60 passed, 18 warnings)

Notes

PR is intentionally additive: flashinfer / flashinfer_full remain available, and NVIDIA default remains flashinfer.
Generated benchmark artifacts and migration scratch docs are not included.

lightseek-bot · 2026-05-27T07:57:45Z

@@ -18,14 +18,48 @@
 # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 # SOFTWARE.

-"""Triton sampling helper kernels."""
+"""TokenSpeed-native Triton sampling kernels.


Please add vLLM's copyright here

Added SPDX-FileCopyrightText: Copyright contributors to the vLLM project to the Triton sampling kernel header in commit c3c25815.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c3c2581533

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 981f9400f2

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 681396814e

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 28fa045441

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7179e82b15

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d81cab179f

yweng0828

Could you please compare the performance of fused_topk_topp before we remove this code? #184

Thanks.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9a76e165f6

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: baab8fce85

Signed-off-by: FlamingoPg <1106310035@qq.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cefc7b737f

Signed-off-by: FlamingoPg <1106310035@qq.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b5798a559a

yweng0828

Hi, I have some questions regarding the experiment B. Full/min-p Path: Triton Full vs flashinfer_full. Thanks for your attention.

1. What is min_p?
2. Does 0.097856 / 0.106944 refer to the time between the two experiments?
3. The FlashInfer full baseline is slower than the Triton full op. Is this because the Triton full op combines all operators together, while the FlashInfer full baseline consists of several separate operators?

Already updated.

Signed-off-by: FlamingoPg <1106310035@qq.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e913e9be6c

FlamingoPg · 2026-06-04T04:32:17Z

@yweng0828 Thanks for the questions. Let me clarify the comparison boundary here.

1. min_p keeps tokens whose probability is at least min_p * max_prob for that row. In the probability route this is applied on probs followed by renormalization. In the Triton full route we apply the equivalent logits-space condition before Gumbel-Max, so we avoid materializing the full probability table.
2. 0.097856 / 0.106944 means median / p95 latency in milliseconds for that benchmark case. It is not the time between two experiments. I should make the table header clearer.
3. This is not intended as a single-kernel CUDA-vs-Triton microbenchmark. It is a focused operator-pipeline comparison for the full sampling route. The FlashInfer full baseline follows the existing probability route, while Triton full follows logits-space Gumbel-Max. So the speedup comes from changing the sampling formulation and avoiding full probability materialization/renormalization, not from claiming Triton is a faster implementation of the exact same fused CUDA kernel.
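
The logits-space form of the min_p condition follows from taking logs: with z = logits / temperature, the test prob_i >= min_p * max_prob is the same as z_i >= max(z) + log(min_p), because the softmax normalizer cancels on both sides. The sketch below checks this on a small example. It is an illustration with hypothetical function names, not the kernel code, and it assumes min_p > 0 (a non-positive min_p is treated as "keep everything").

```python
import math


def min_p_keep_from_probs(logits, temperature, min_p):
    """Probability route: softmax first, then compare against min_p * max_prob."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = min_p * max(probs)
    return [i for i, p in enumerate(probs) if p >= threshold]


def min_p_keep_from_logits(logits, temperature, min_p):
    """Logits route: compare scaled logits against max + log(min_p)."""
    if min_p <= 0.0:
        return list(range(len(logits)))
    scaled = [x / temperature for x in logits]
    threshold = max(scaled) + math.log(min_p)
    return [i for i, z in enumerate(scaled) if z >= threshold]
```

For `logits = [2.0, 1.0, 0.0, -3.0]` and `min_p = 0.2`, both functions keep tokens 0 and 1 at temperature 1.0, and tokens 0, 1, and 2 at temperature 2.0. Values that sit exactly on the threshold could differ between the two forms because of floating-point rounding.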

Also, your fused_topk_topp CUDA kernel is a very strong implementation of the probability-renorm route. The AIR radix top-k/top-p design, CUDA-graph-safe launch sequence, and side-stream overlap make it a good baseline for the existing probs -> renorm -> sample pipeline. This PR is trying to evaluate a different MRV2-style formulation rather than replace that kernel with a like-for-like Triton version.

yweng0828 · 2026-06-04T07:54:55Z

Hi @FlamingoPg, thank you for your detailed reply! It's really interesting to know such an operator pipeline. Thanks for your great work!

In other words, the two routes ultimately yield the same sampling result (they are mathematically equivalent), but they differ in their computational processes. This leads to their different performance. Is that correct?

If the answer is yes, my last question will be: Under what circumstances can we completely replace the traditional probability route with the logits-space Gumbel-Max route, considering the latter has better performance?

One counterexample I can think of: if the user needs the probabilities as part of the output, then we must use the original probability route. Is that right?

Glad to hear more from you. Thanks :)

github-actions · 2026-06-27T00:27:55Z

This PR has been inactive for 14 days and is marked as stale. It will be closed in 3 days if there is no further activity.

FlamingoPg requested a review from a team as a code owner May 27, 2026 07:55

lightseek-bot reviewed May 27, 2026

FlamingoPg force-pushed the flamingo/sample branch from a76a3e9 to c3c2581 Compare May 27, 2026 08:02

chatgpt-codex-connector Bot reviewed May 27, 2026

Comment thread tokenspeed-kernel/python/tokenspeed_kernel/ops/sampling/triton.py Outdated

FlamingoPg force-pushed the flamingo/sample branch from c3c2581 to 981f940 Compare May 27, 2026 09:01

chatgpt-codex-connector Bot reviewed May 27, 2026

Comment thread tokenspeed-kernel/python/tokenspeed_kernel/ops/sampling/triton.py Outdated

FlamingoPg force-pushed the flamingo/sample branch from 981f940 to 6813968 Compare May 27, 2026 09:21

chatgpt-codex-connector Bot reviewed May 27, 2026

Comment thread tokenspeed-kernel/python/tokenspeed_kernel/ops/sampling/triton.py Outdated

FlamingoPg force-pushed the flamingo/sample branch from 6813968 to 28fa045 Compare May 27, 2026 09:42

chatgpt-codex-connector Bot reviewed May 27, 2026

Comment thread python/tokenspeed/runtime/sampling/backends/triton.py Outdated

chatgpt-codex-connector Bot reviewed May 27, 2026

Comment thread python/tokenspeed/runtime/sampling/backends/triton.py Outdated

chatgpt-codex-connector Bot reviewed May 27, 2026

Comment thread tokenspeed-kernel/python/tokenspeed_kernel/ops/sampling/triton.py Outdated

FlamingoPg force-pushed the flamingo/sample branch from 27b7316 to 288f13f Compare May 27, 2026 20:32

yweng0828 previously requested changes May 28, 2026

FlamingoPg marked this pull request as draft May 28, 2026 04:06

FlamingoPg force-pushed the flamingo/sample branch from dc66686 to c121645 Compare May 28, 2026 04:12

FlamingoPg changed the title from "Replace sampling FlashInfer backend with TokenSpeed Triton" to "Add Triton sampling backends alongside FlashInfer" May 30, 2026

FlamingoPg marked this pull request as ready for review May 30, 2026 17:33

FlamingoPg marked this pull request as draft May 30, 2026 17:34

FlamingoPg marked this pull request as ready for review May 30, 2026 17:34

chatgpt-codex-connector Bot reviewed May 30, 2026

Comment thread tokenspeed-kernel/python/tokenspeed_kernel/ops/sampling/triton/full.py

FlamingoPg force-pushed the flamingo/sample branch from 633932c to baab8fc Compare June 1, 2026 09:45

chatgpt-codex-connector Bot reviewed Jun 1, 2026

Comment thread tokenspeed-kernel/python/tokenspeed_kernel/ops/sampling/triton/topp.py

Comment thread python/tokenspeed/runtime/sampling/backends/triton_full.py

FlamingoPg added 7 commits June 2, 2026 14:53

Replace sampling FlashInfer backend with Triton

c8fcd0d

Signed-off-by: FlamingoPg <1106310035@qq.com>

fix(sampling): capture greedy triton graph variant

7c58f91

Signed-off-by: FlamingoPg <1106310035@qq.com>

fix(sampling): limit greedy graph variant to spec verify

104eb7a

Signed-off-by: FlamingoPg <1106310035@qq.com>

fix(sampling): randomize greedy triton ties

499bad5

Signed-off-by: FlamingoPg <1106310035@qq.com>

fix(sampling): keep triton greedy deterministic

e770b71

Signed-off-by: FlamingoPg <1106310035@qq.com>

fix(sampling): keep single-token greedy graph stable

8291aed

Signed-off-by: FlamingoPg <1106310035@qq.com>

fix(sampling): guard verify logits before argmax

3024d5a

Signed-off-by: FlamingoPg <1106310035@qq.com>

FlamingoPg added 5 commits June 2, 2026 14:54

test(sampling): gate verify nan guard to nvidia

521433e

Signed-off-by: FlamingoPg <1106310035@qq.com>

fix(sampling): reject invalid greedy verify rows

843a14b

Signed-off-by: FlamingoPg <1106310035@qq.com>

perf(sampling): specialize top-p rejection path

f66c286

Signed-off-by: FlamingoPg <1106310035@qq.com>

feat(sampling): add triton sampling backends

bd51e6e

Signed-off-by: FlamingoPg <1106310035@qq.com>

Refine Triton sampling implementation

d72e501

Signed-off-by: FlamingoPg <1106310035@qq.com>

FlamingoPg force-pushed the flamingo/sample branch from baab8fc to cefc7b7 Compare June 2, 2026 15:22

chatgpt-codex-connector Bot reviewed Jun 2, 2026

Comment thread python/tokenspeed/runtime/execution/cuda_graph_wrapper.py Outdated

FlamingoPg added 3 commits June 2, 2026 15:30

tmp

979d454

Signed-off-by: FlamingoPg <1106310035@qq.com>

chore(sampling): clean triton sampling diff

0831390

Signed-off-by: FlamingoPg <1106310035@qq.com>

fix(sampling): prepare cuda graph capture variant

afa34a3

Signed-off-by: FlamingoPg <1106310035@qq.com>

FlamingoPg force-pushed the flamingo/sample branch from cefc7b7 to afa34a3 Compare June 2, 2026 15:31

refactor(sampling): remove triton full re-export module

b5798a5

Signed-off-by: FlamingoPg <1106310035@qq.com>

chatgpt-codex-connector Bot reviewed Jun 3, 2026

Comment thread tokenspeed-kernel/python/tokenspeed_kernel/ops/sampling/triton/topk_topp.py

yweng0828 self-requested a review June 4, 2026 02:46

yweng0828 reviewed Jun 4, 2026

fix(sampling): mask invalid top-k tail candidates

e913e9b

Signed-off-by: FlamingoPg <1106310035@qq.com>

chatgpt-codex-connector Bot reviewed Jun 4, 2026

Comment thread python/tokenspeed/runtime/sampling/backends/triton_full.py

Merge branch 'main' into flamingo/sample

ea66ef5

github-actions Bot added the inactive label Jun 27, 2026
