
fix(deepseek-v2-lite): use true batched decode benchmark path #184

Merged
xiaguan merged 1 commit into xiaguan:main from CAICAIIs:feat/dsv2-lite-true-batch-decode
May 28, 2026

Conversation

@CAICAIIs
Contributor

Summary

This PR fixes the DeepSeek-V2-Lite EP2 direct benchmark path so bench_serving request --concurrency > 1 no longer measures a row loop over the single-row decode path.

The new path is deliberately narrow: same prompt, greedy decode, fixed output length, and batch sizes 1..=8. concurrency=1 remains the single-row control; concurrency=4/8 now advance rows through a batch-shaped decode step and report TPOT from shared decode-step timing.
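As a rough illustration of "TPOT from shared decode-step timing": in a batch-shaped step every row gains one token per step, so the step duration can serve as each row's time per output token. The sketch below summarizes a list of step durations into p50 and average values. The names `TpotSummary` and `summarize_tpot` are hypothetical and are not taken from the PR; how the benchmark itself aggregates timings is not shown in this description.

```rust
use std::time::Duration;

/// Hypothetical summary of shared decode-step timings, in milliseconds.
#[derive(Debug, PartialEq)]
pub struct TpotSummary {
    pub p50_ms: f64,
    pub avg_ms: f64,
}

/// Summarizes per-step decode durations.
/// Assumption: one batched step emits one token for every row, so a step's
/// duration is the time per output token (TPOT) seen by each row.
/// Returns `None` when no steps were recorded.
pub fn summarize_tpot(step_times: &[Duration]) -> Option<TpotSummary> {
    if step_times.is_empty() {
        return None;
    }
    let mut ms: Vec<f64> = step_times
        .iter()
        .map(|d| d.as_secs_f64() * 1000.0)
        .collect();
    ms.sort_by(|a, b| a.total_cmp(b));

    // Median: middle element, or the mean of the two middle elements.
    let mid = ms.len() / 2;
    let p50_ms = if ms.len() % 2 == 0 {
        (ms[mid - 1] + ms[mid]) / 2.0
    } else {
        ms[mid]
    };
    let avg_ms = ms.iter().sum::<f64>() / ms.len() as f64;
    Some(TpotSummary { p50_ms, avg_ms })
}
```

Calling `summarize_tpot` with steps of 60 ms, 10 ms, and 20 ms would yield a p50 of 20 ms and an average of 30 ms. Timing the step once and sharing it across rows avoids attributing a serialized row loop's latency to individual rows.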

Scope

In scope:

  • Add a DeepSeek-V2-Lite EP2 same-prompt batched decode path for direct benchmark measurement.
  • Keep prefill row-by-row; only decode advances rows through a batch-shaped step.
  • Keep internal projection boundaries conservative for HF/hash parity before performance optimization.
  • Require same-prompt batch rows to produce identical token traces before reporting benchmark numbers.
  • Reuse the existing HF / host-staged / NCCL hash oracle.
  • Add shared wrappers for existing gemm_per_token_cuda and argmax_batch_bf16_cuda kernel paths.
  • Update KERNELS.md for the newly exposed runtime-facing wrappers.

Out of scope:

  • Production continuous batching.
  • vLLM parity or performance optimization claims.
  • Sparse dispatch, pegainfer-comm/NVLink, multi-node, or generic EP topology.
  • Changing the NCCL transport path beyond preserving parity with host-staged output.
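The in-scope requirement that same-prompt batch rows produce identical token traces can be pictured as a gate that runs before any benchmark number is reported. This is a minimal sketch under that reading; `check_identical_traces` is a hypothetical helper, and the representation of a trace as `Vec<u32>` is an assumption rather than the PR's actual type.

```rust
/// Hypothetical gate: every row in a same-prompt greedy batch should emit
/// the same token ids. Returns `Ok(())` when all rows match row 0, or
/// `Err(index)` with the index of the first diverging row.
pub fn check_identical_traces(rows: &[Vec<u32>]) -> Result<(), usize> {
    // An empty batch has nothing to compare.
    let Some((first, rest)) = rows.split_first() else {
        return Ok(());
    };
    match rest.iter().position(|row| row != first) {
        // `position` is relative to `rest`, so shift by one for the batch index.
        Some(i) => Err(i + 1),
        None => Ok(()),
    }
}
```

A caller would refuse to print TPOT or tok/s when this returns `Err`, since a diverging row means the batched step is not reproducing the single-row decode it is meant to stand in for.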

Evidence

Local checks on the pushed commit:

  • cargo fmt --check passed
  • git diff --check passed
  • Private host/path/credential scan over changed files passed
  • Stale batched-benchmark-gate.md reference scan passed

GPU/model evidence collected on the same DeepSeek-V2-Lite snapshot for this true-batch implementation:

  • HF / host-staged / NCCL comparison: all_token_text_exact
  • Generated ids: [11, 304, 608, 245, 207, 16, 24, 1012, 1712, 5075, 473, 254, 7312, 13, 304, 608]
  • Generated text: , I am a 19 year old girl from the UK. I am
  • Token SHA256: 4fb4c8825fe4d2c4a1d966da25c259abdf675f4de4548daa5d41aea7dfe30225
  • Text SHA256: 0eedf11429e9ac13bb799c31665c6e9f70a1ac4493a08a3f3da9ecf39c1ec347

PegaInfer direct benchmark snapshot, 2x RTX 5090, same model snapshot:

| Batch | Backend | Steady TPOT p50 (ms) | Steady TPOT avg (ms) | Decode tok/s | Trace |
|---|---|---|---|---|---|
| 1 | host-staged | 58.558 | 62.009 | 16.144 | HF trace |
| 1 | NCCL | 193.650 | 201.276 | 4.982 | HF trace |
| 4 | host-staged | 202.186 | 210.409 | 19.124 | HF trace |
| 4 | NCCL | 333.321 | 344.764 | 11.528 | HF trace |
| 8 | host-staged | 394.753 | 411.348 | 19.423 | HF trace |
| 8 | NCCL | 522.917 | 539.643 | 14.874 | HF trace |
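One way to sanity-check the snapshot is to estimate aggregate decode throughput as batch size divided by the average TPOT. The PR does not state how its decode tok/s column is derived, so this formula is an assumption; the function name `decode_tokens_per_second` is hypothetical. For the rows above, the estimate lands within about 1% of the reported values.

```rust
/// Estimated aggregate decode throughput in tokens per second.
/// Assumption: each batched step emits `batch` tokens (one per row) and
/// takes `tpot_ms` milliseconds.
pub fn decode_tokens_per_second(batch: usize, tpot_ms: f64) -> f64 {
    batch as f64 * 1000.0 / tpot_ms
}

/// Relative difference between an estimate and a reported value.
pub fn relative_diff(estimate: f64, reported: f64) -> f64 {
    (estimate - reported).abs() / reported
}
```

For example, `decode_tokens_per_second(8, 411.348)` gives roughly 19.45 tok/s against the reported 19.423 for host-staged batch 8. Going from batch 1 to batch 8, TPOT grows almost as fast as the batch size, which is why aggregate throughput rises only modestly in this snapshot.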

Claim Boundary

This PR only fixes the measurement contract for the narrow DeepSeek-V2-Lite EP2 direct benchmark path. It does not claim a performance improvement, production batching support, or vLLM parity.

Refs #170

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements a narrow same-prompt batched decode path for DeepSeek-V2-Lite, introducing batched operations such as gemm_per_token and argmax_batch_bf16_into along with host-side batched attention helpers. It also updates documentation, tests, and expected oracle hashes. Feedback on the changes highlights a critical correctness issue in sample_tokens where stream synchronization is performed before clone_dtoh instead of after, which could lead to a race condition when accessing the copied data on the host.
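The ordering concern raised in the review can be illustrated with a toy model. The sketch below does not use any real CUDA binding: `MockStream`, `enqueue_dtoh`, `synchronize`, and `read_host` are hypothetical stand-ins that model the reviewer's premise that the device-to-host copy is queued on the stream and completes only at synchronization. Whether the real `clone_dtoh` behaves this way depends on the library in use.

```rust
/// Hypothetical stand-in for an asynchronous GPU stream. Copies are queued
/// and only become visible in host memory once `synchronize` drains them.
pub struct MockStream {
    pending: Vec<(usize, Vec<u32>)>,
    host_slots: Vec<Vec<u32>>,
}

impl MockStream {
    pub fn new() -> Self {
        MockStream { pending: Vec::new(), host_slots: Vec::new() }
    }

    /// Queues a device-to-host copy and returns the host slot it will fill.
    pub fn enqueue_dtoh(&mut self, device_data: &[u32]) -> usize {
        self.host_slots.push(Vec::new());
        let slot = self.host_slots.len() - 1;
        self.pending.push((slot, device_data.to_vec()));
        slot
    }

    /// Completes every queued copy.
    pub fn synchronize(&mut self) {
        for (slot, data) in self.pending.drain(..) {
            self.host_slots[slot] = data;
        }
    }

    /// Reads whatever is currently in the host slot.
    pub fn read_host(&self, slot: usize) -> &[u32] {
        &self.host_slots[slot]
    }
}

/// Ordering flagged by the review: synchronize first, then copy, then read.
pub fn sample_sync_before_copy(stream: &mut MockStream, tokens: &[u32]) -> Vec<u32> {
    stream.synchronize();
    let slot = stream.enqueue_dtoh(tokens);
    stream.read_host(slot).to_vec() // copy has not completed yet
}

/// Suggested ordering: copy, then synchronize, then read.
pub fn sample_copy_then_sync(stream: &mut MockStream, tokens: &[u32]) -> Vec<u32> {
    let slot = stream.enqueue_dtoh(tokens);
    stream.synchronize();
    stream.read_host(slot).to_vec()
}
```

In this model the first function returns an empty buffer while the second returns the sampled tokens. On real hardware the failure would more likely be intermittent stale data rather than a deterministic empty read, which is what makes this class of bug easy to miss.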

Comment thread pegainfer-deepseek-v2-lite/src/runtime.rs
@CAICAIIs CAICAIIs force-pushed the feat/dsv2-lite-true-batch-decode branch from 8038fb3 to 518575b Compare May 27, 2026 17:04
@CAICAIIs CAICAIIs force-pushed the feat/dsv2-lite-true-batch-decode branch from 518575b to 4d61251 Compare May 27, 2026 17:20
@xiaguan xiaguan merged commit 4e76629 into xiaguan:main May 28, 2026
1 check passed