fix(deepseek-v2-lite): use true batched decode benchmark path #184
Merged
Code Review
This pull request implements a narrow same-prompt batched decode path for DeepSeek-V2-Lite, introducing batched operations such as gemm_per_token and argmax_batch_bf16_into along with host-side batched attention helpers. It also updates documentation, tests, and expected oracle hashes. Feedback on the changes highlights a critical correctness issue in sample_tokens where stream synchronization is performed before clone_dtoh instead of after, which could lead to a race condition when accessing the copied data on the host.
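The ordering issue raised in the review can be illustrated with a small host-only model. This is a sketch under stated assumptions, not the PR's code: `MockStream` is a hypothetical stand-in that queues work and runs it only on `synchronize`, and its `clone_dtoh` and the two `sample_tokens_*` functions merely borrow the names from the review. It assumes the device-to-host copy is asynchronous with respect to the host, as the review implies.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical stand-in for a GPU stream: work is queued and only
/// runs when `synchronize` is called, mimicking asynchronous execution.
struct MockStream {
    pending: RefCell<Vec<Box<dyn FnOnce()>>>,
}

impl MockStream {
    fn new() -> Self {
        Self { pending: RefCell::new(Vec::new()) }
    }

    /// Queue a device-to-host copy. The returned host buffer holds
    /// zeros until the stream has been synchronized.
    fn clone_dtoh(&self, device: &[u32]) -> Rc<RefCell<Vec<u32>>> {
        let host = Rc::new(RefCell::new(vec![0u32; device.len()]));
        let src = device.to_vec();
        let dst = Rc::clone(&host);
        self.pending
            .borrow_mut()
            .push(Box::new(move || dst.borrow_mut().copy_from_slice(&src)));
        host
    }

    /// Run all queued work, like waiting for the stream to drain.
    fn synchronize(&self) {
        let ops: Vec<Box<dyn FnOnce()>> = self.pending.borrow_mut().drain(..).collect();
        for op in ops {
            op();
        }
    }
}

/// Order flagged by the review: synchronize, then copy, then read.
fn sample_tokens_sync_before(stream: &MockStream, device_tokens: &[u32]) -> Vec<u32> {
    stream.synchronize();
    let host = stream.clone_dtoh(device_tokens);
    let out = host.borrow().clone(); // the queued copy has not run yet
    out
}

/// Safe order: copy, then synchronize, then read.
fn sample_tokens_sync_after(stream: &MockStream, device_tokens: &[u32]) -> Vec<u32> {
    let host = stream.clone_dtoh(device_tokens);
    stream.synchronize();
    let out = host.borrow().clone();
    out
}

fn main() {
    let stream = MockStream::new();
    let device_tokens = [11u32, 304, 608, 245];
    println!("sync before copy: {:?}", sample_tokens_sync_before(&stream, &device_tokens));
    stream.synchronize(); // drain the copy left pending by the first call
    println!("sync after copy:  {:?}", sample_tokens_sync_after(&stream, &device_tokens));
}
```

In this model the first call returns the zero-initialized buffer and the second returns the copied token ids. On real hardware the failure would be intermittent rather than deterministic, which is why the ordering matters even when tests pass.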
8038fb3 to 518575b
518575b to 4d61251
Summary
This PR fixes the DeepSeek-V2-Lite EP2 direct benchmark path so `bench_serving request --concurrency > 1` no longer measures a row loop over the single-row decode path.

The new path is deliberately narrow: same prompt, greedy decode, fixed output length, and batch sizes `1..=8`. `concurrency=1` remains the single-row control; `concurrency=4/8` now advance rows through a batch-shaped decode step and report TPOT from shared decode-step timing.

Scope
In scope:
- `gemm_per_token_cuda` and `argmax_batch_bf16_cuda` kernel paths.
- `KERNELS.md` for the newly exposed runtime-facing wrappers.

Out of scope:
Evidence
Local checks on the pushed commit:
- `cargo fmt --check` passed
- `git diff --check` passed
- `batched-benchmark-gate.md` reference scan passed

GPU/model evidence collected on the same DeepSeek-V2-Lite snapshot for this true-batch implementation:
- `all_token_text_exact`
- `[11, 304, 608, 245, 207, 16, 24, 1012, 1712, 5075, 473, 254, 7312, 13, 304, 608]`
- `, I am a 19 year old girl from the UK. I am`
- `4fb4c8825fe4d2c4a1d966da25c259abdf675f4de4548daa5d41aea7dfe302250eedf11429e9ac13bb799c31665c6e9f70a1ac4493a08a3f3da9ecf39c1ec347`

PegaInfer direct benchmark snapshot, 2x RTX 5090, same model snapshot:
Claim Boundary
This PR only fixes the measurement contract for the narrow DeepSeek-V2-Lite EP2 direct benchmark path. It does not claim a performance improvement, production batching support, or vLLM parity.
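As a rough illustration of the measurement contract described in the Summary, the sketch below shows a host-side greedy pick over a `[batch, vocab]` logits buffer and a TPOT figure derived from shared decode-step timing. All names here (`argmax_batch`, `tpot_from_shared_steps`) are hypothetical, `f32` is used instead of bf16 for simplicity, and the assumption that TPOT equals the mean shared step duration is ours, not confirmed by the PR.

```rust
use std::time::Duration;

/// Hypothetical host reference for a batched greedy pick: the index of
/// the largest logit in each row of a row-major `[batch, vocab]` buffer.
/// Ties resolve to the lowest index.
fn argmax_batch(logits: &[f32], vocab: usize) -> Vec<usize> {
    logits
        .chunks_exact(vocab)
        .map(|row| {
            let mut best = 0;
            for (i, &v) in row.iter().enumerate() {
                if v > row[best] {
                    best = i;
                }
            }
            best
        })
        .collect()
}

/// Assumed TPOT definition: each decode step advances every row by one
/// token, so the time per output token is the mean step duration.
fn tpot_from_shared_steps(step_times: &[Duration]) -> Option<Duration> {
    if step_times.is_empty() {
        return None;
    }
    let total: Duration = step_times.iter().sum();
    Some(total / step_times.len() as u32)
}

fn main() {
    // Two rows, vocabulary of three entries.
    let logits = [0.1f32, 0.9, 0.3, 0.7, 0.2, 0.1];
    println!("next tokens: {:?}", argmax_batch(&logits, 3));

    // Three shared decode steps, each timed once for the whole batch.
    let steps = [
        Duration::from_millis(10),
        Duration::from_millis(12),
        Duration::from_millis(14),
    ];
    println!("TPOT: {:?}", tpot_from_shared_steps(&steps));
}
```

Under this definition a row loop over the single-row path would time each row separately, so its per-step figure would grow with the number of rows rather than reflect one batch-shaped step.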
Refs #170