Add Triton sampling backends alongside FlashInfer #280
@@ -18,14 +18,48 @@
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

"""Triton sampling helper kernels."""
"""TokenSpeed-native Triton sampling kernels.
Please add vLLM's copyright here
Added `SPDX-FileCopyrightText: Copyright contributors to the vLLM project` to the Triton sampling kernel header in commit c3c25815.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c3c2581533
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 981f9400f2
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 681396814e
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 28fa045441
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7179e82b15
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d81cab179f
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9a76e165f6
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: baab8fce85
Signed-off-by: FlamingoPg <1106310035@qq.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cefc7b737f
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b5798a559a
yweng0828 left a comment
Hi, I have some questions regarding experiment "B. Full/min-p Path: Triton Full vs `flashinfer_full`". Thanks for your attention.
- What is `min_p`?
- Does `0.097856 / 0.106944` refer to the time between the two experiments?
- The `FlashInfer full baseline` is slower than the `Triton full op`. Is this because the `Triton full op` combines all operators together, while the `FlashInfer full baseline` consists of several separate operators?
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e913e9be6c
@yweng0828 Thanks for the questions. Let me clarify the comparison boundary here.
Also, your […]
Hi @FlamingoPg, thank you for your detailed reply! It's really interesting to learn about such an operator pipeline. Thanks for your great work!

In other words, the two routes ultimately yield the same sampling result (they are mathematically equivalent), but they differ in their computational processes, which leads to their different performance. Is that correct?

If the answer is yes, my last question is: under what circumstances can we completely replace the traditional […]? One counterexample I can think of: if the user needs the probability as part of the output, then we must use the original […].

Glad to hear more from you. Thanks :)

This PR has been inactive for 14 days and is marked as stale. It will be closed in 3 days if there is no further activity.
Summary
- Add `triton` and `triton_full` sampling backends alongside the existing `flashinfer`/`flashinfer_full` probability-route backends.
- NVIDIA default remains `flashinfer`; this PR does not remove FlashInfer sampling or change the default route.
- […] `tokenspeed-kernel` boundary; attention/MoE/quantization FlashInfer paths are untouched.
Changes
- […] `triton` and `triton_full` runtime backends with separate MRV2-style Gumbel routes.
- […] `PoolSamplingBackend` so FlashInfer and Triton share TokenSpeed request-pool state without Triton inheriting FlashInfer probability/coin state.
- […] `test_sampling.py`, while Triton-specific coverage lives in `test_sampling_triton.py` and `test_sampling_triton_full.py`.
Benchmark
These are the latest focused sampling-path results I could trace from the current branch. Source artifacts:
- `/tmp/tokenspeed_sampling_mr/focused_sampling_path_bench.csv`
- `/tmp/tokenspeed_sampling_mr/focused_sampling_path_bench_151936.csv`
- `/tmp/tokenspeed_sampling_mr/current_sampling_ops.csv`
Environment from the benchmark logs:
NVIDIA H100 80GB HBM3. Timing uses CUDA events. All numbers are milliseconds, shown as `median / p95`. Speedup is `FlashInfer baseline / Triton`, so higher is better.
Important scope notes:
- […] `flashinfer.sample()` core behavior. That route does not call `fused_topk_topp_renorm`; it uses `softmax -> top_k_top_p_sampling_from_probs`.
- […] `flashinfer_full`-style probability behavior. That route does call `fused_topk_topp_renorm` on NVIDIA before `min_p_sampling_from_probs`.
- `current_triton_pool_op` is the latest optimized pool-aware Triton op path.
- `current_runtime_sample` is the full TokenSpeed runtime sampling backend call and includes route/backend overhead.
A. Normal Sample: Triton vs `flashinfer.sample()`
Baseline route:
Candidate route:
Current Pool-Aware Triton Op vs Old FlashInfer Core
Full Runtime Sample Call vs Old FlashInfer Core
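The two routes in section A differ in how they turn logits into a token: the baseline goes through `softmax` and samples from probabilities, while the Triton candidate is described as a Gumbel route. As a rough illustration only (not the kernel code from this PR), the textbook Gumbel-max trick below draws from the same categorical distribution as softmax sampling without materializing probabilities. All function names are hypothetical, and top-k/top-p filtering, temperature, and the kernels' RNG are left out.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_from_probs(logits, rng):
    """Probability route: softmax, then inverse-CDF sampling with one uniform draw."""
    probs = softmax(logits)
    u = rng.random()
    cumulative = 0.0
    for token_id, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return token_id
    return len(probs) - 1  # guard against floating-point round-off


def sample_gumbel_max(logits, rng):
    """Gumbel route: argmax of logits plus i.i.d. Gumbel(0, 1) noise, no softmax."""
    best_id, best_score = -1, -math.inf
    for token_id, logit in enumerate(logits):
        u = max(rng.random(), 1e-12)  # keep u in (0, 1) so both logs are finite
        score = logit - math.log(-math.log(u))  # logit + Gumbel(0, 1) sample
        if score > best_score:
            best_id, best_score = token_id, score
    return best_id


def empirical_freqs(sampler, logits, n, seed):
    """Draw n tokens with a seeded RNG and return per-token frequencies."""
    rng = random.Random(seed)
    counts = [0] * len(logits)
    for _ in range(n):
        counts[sampler(logits, rng)] += 1
    return [c / n for c in counts]


if __name__ == "__main__":
    example_logits = [2.0, 1.0, 0.0, -1.0]  # made-up values for illustration
    print("softmax      :", softmax(example_logits))
    print("probs route  :", empirical_freqs(sample_from_probs, example_logits, 20000, 0))
    print("gumbel route :", empirical_freqs(sample_gumbel_max, example_logits, 20000, 0))
```

Both samplers follow the same distribution, but they consume random numbers differently, so for a fixed seed they are not expected to return the same token on every draw.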
B. Full/min-p Path: Triton Full vs `flashinfer_full`
Baseline route on NVIDIA:
Candidate route:
This table is the focused full/min-p operator comparison from `current_sampling_ops.csv`.
Benchmark Interpretation
- `flashinfer.sample()` does not use `fused_topk_topp_renorm`; `flashinfer_full` does.
- […] `2.02x-3.24x` on no-filter and `1.71x-2.12x` on finite top-k + top-p except 151936/bs32, which is still `1.74x`.
- […] `1.05x`.
- […] `flashinfer_full` route that includes `fused_topk_topp_renorm`, Triton full also wins in these focused full/min-p operator rows. The tightest case is 32768/bs1 `min_p`, where median improves slightly while p95 is roughly comparable.
Validation
- `pre-commit run --all-files` (passed)
- `python -m pytest test/runtime/test_sampling_backend_pool.py test/runtime/test_sampling_backend_registry.py test/runtime/test_cli_config_compat.py` (`90 passed`, `18 warnings`)
- `python -m pytest tokenspeed-kernel/test/ops/test_sampling.py tokenspeed-kernel/test/ops/test_sampling_triton.py tokenspeed-kernel/test/ops/test_sampling_triton_full.py` (`60 passed`, `18 warnings`)
Notes
- `flashinfer`/`flashinfer_full` remain available, and NVIDIA default remains `flashinfer`.
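
For readers unfamiliar with the `min_p` term in the full/min-p path: in the commonly used definition, min-p keeps only tokens whose probability is at least `min_p` times the largest probability, then renormalizes the survivors. A minimal pure-Python sketch under that assumption follows; `min_p_filter` is a hypothetical helper, not a function from this PR or from FlashInfer, and the input probabilities are made up.

```python
def min_p_filter(probs, min_p):
    """Zero out tokens with prob < min_p * max(probs), then renormalize.

    Assumes `probs` is a non-empty list of non-negative floats summing to ~1.
    """
    if not 0.0 <= min_p <= 1.0:
        raise ValueError("min_p must be in [0, 1]")
    threshold = min_p * max(probs)
    # The most likely token always survives because min_p <= 1.
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


if __name__ == "__main__":
    example_probs = [0.5, 0.3, 0.15, 0.05]  # illustrative distribution
    print(min_p_filter(example_probs, 0.2))  # drops tokens below 0.2 * 0.5
```

With `min_p = 0` nothing is filtered, and with `min_p = 1` only tokens tied for the maximum probability remain.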