fix(deepseek-v2-lite): use true batched decode benchmark path by CAICAIIs · Pull Request #184 · xiaguan/pegainfer

CAICAIIs · 2026-05-27T16:19:58Z

Summary

This PR fixes the DeepSeek-V2-Lite EP2 direct benchmark path so that bench_serving request with --concurrency > 1 no longer measures a row loop over the single-row decode path.

The new path is deliberately narrow: same prompt, greedy decode, fixed output length, and batch sizes 1..=8. concurrency=1 remains the single-row control; concurrency=4/8 now advance rows through a batch-shaped decode step and report TPOT from shared decode-step timing.
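As a rough illustration of what "TPOT from shared decode-step timing" can mean, the sketch below treats each batch-shaped decode step as producing one token for every row, so the step's wall time is each row's time per output token. The `DecodeStats` type and `summarize` function are hypothetical names for this example and are not taken from the PR; the actual benchmark may define its steady-state window and aggregation differently.

```rust
use std::time::Duration;

/// Summary of a batched decode run (hypothetical type for illustration).
#[derive(Debug)]
struct DecodeStats {
    tpot_avg_ms: f64,
    tpot_p50_ms: f64,
    decode_tok_per_s: f64,
}

/// Each entry in `step_times` is the wall time of one batch-shaped decode step.
/// Every row receives one token per step, so the step time is the per-row TPOT
/// and the batch as a whole emits `batch_size` tokens per step.
fn summarize(step_times: &[Duration], batch_size: usize) -> Option<DecodeStats> {
    if step_times.is_empty() || batch_size == 0 {
        return None;
    }
    let mut ms: Vec<f64> = step_times.iter().map(|d| d.as_secs_f64() * 1e3).collect();
    ms.sort_by(|a, b| a.total_cmp(b));

    let total_ms: f64 = ms.iter().sum();
    let avg = total_ms / ms.len() as f64;

    // Median: the middle element, or the mean of the two middle elements.
    let mid = ms.len() / 2;
    let p50 = if ms.len() % 2 == 1 {
        ms[mid]
    } else {
        (ms[mid - 1] + ms[mid]) / 2.0
    };

    let tokens = (batch_size * ms.len()) as f64;
    Some(DecodeStats {
        tpot_avg_ms: avg,
        tpot_p50_ms: p50,
        decode_tok_per_s: tokens / (total_ms / 1e3),
    })
}

fn main() {
    let steps = [
        Duration::from_millis(200),
        Duration::from_millis(100),
        Duration::from_millis(300),
    ];
    let stats = summarize(&steps, 4).expect("non-empty input");
    println!("{stats:?}");
}
```

Under this accounting, a row loop over the single-row path would time each row separately, whereas a true batched step charges the same step time to every row in the batch.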

Scope

In scope:

Add a DeepSeek-V2-Lite EP2 same-prompt batched decode path for direct benchmark measurement.
Keep prefill row-by-row; only decode advances rows through a batch-shaped step.
Keep internal projection boundaries conservative for HF/hash parity before performance optimization.
Require same-prompt batch rows to produce identical token traces before reporting benchmark numbers.
Reuse the existing HF / host-staged / NCCL hash oracle.
Add shared wrappers for existing gemm_per_token_cuda and argmax_batch_bf16_cuda kernel paths.
Update KERNELS.md for the newly exposed runtime-facing wrappers.
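The greedy batched sampling and the identical-trace requirement listed above can be sketched on the host as follows. This is a CPU reference over f32 values, not the argmax_batch_bf16_cuda kernel or its wrapper; `argmax_rows` and `shared_trace` are hypothetical names, and the lowest-index tie-break is an assumption made for this example.

```rust
/// Greedy token per row for row-major logits of shape [batch, vocab].
/// Ties resolve to the lowest index (an assumption; a device kernel may differ).
fn argmax_rows(logits: &[f32], vocab: usize) -> Vec<u32> {
    assert!(
        vocab > 0 && logits.len() % vocab == 0,
        "logits must have shape [batch, vocab]"
    );
    logits
        .chunks_exact(vocab)
        .map(|row| {
            let mut best = 0usize;
            for (i, &v) in row.iter().enumerate() {
                if v > row[best] {
                    best = i;
                }
            }
            best as u32
        })
        .collect()
}

/// Returns the shared trace if every batch row generated the same token ids,
/// otherwise an error naming the first row that diverged from row 0.
fn shared_trace(traces: &[Vec<u32>]) -> Result<&[u32], String> {
    let (first, rest) = traces.split_first().ok_or("no rows in batch")?;
    for (i, trace) in rest.iter().enumerate() {
        if trace != first {
            return Err(format!("row {} diverged from row 0", i + 1));
        }
    }
    Ok(first.as_slice())
}

fn main() {
    // Two identical rows over a vocabulary of 4: both pick index 2.
    let logits = [0.1, 0.2, 0.9, 0.3, 0.1, 0.2, 0.9, 0.3];
    let step_tokens = argmax_rows(&logits, 4);
    println!("tokens this step: {step_tokens:?}");

    // Gate: only report benchmark numbers if all rows share one trace.
    let traces = vec![vec![11, 304, 608], vec![11, 304, 608]];
    match shared_trace(&traces) {
        Ok(trace) => println!("shared trace: {trace:?}"),
        Err(e) => eprintln!("refusing to report numbers: {e}"),
    }
}
```

Because all rows start from the same prompt and decode greedily, any divergence between rows points to a bug in the batched path rather than to sampling noise, which is what makes this check usable as a gate.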

Out of scope:

Production continuous batching.
vLLM parity or performance optimization claims.
Sparse dispatch, pegainfer-comm/NVLink, multi-node, or generic EP topology.
Changing the NCCL transport path beyond preserving parity with host-staged output.

Evidence

Local checks on the pushed commit:

cargo fmt --check passed
git diff --check passed
Private host/path/credential scan over changed files passed
Stale batched-benchmark-gate.md reference scan passed

GPU/model evidence collected on the same DeepSeek-V2-Lite snapshot for this true-batch implementation:

HF / host-staged / NCCL comparison: all_token_text_exact
Generated ids: [11, 304, 608, 245, 207, 16, 24, 1012, 1712, 5075, 473, 254, 7312, 13, 304, 608]
Generated text: , I am a 19 year old girl from the UK. I am
Token SHA256: 4fb4c8825fe4d2c4a1d966da25c259abdf675f4de4548daa5d41aea7dfe30225
Text SHA256: 0eedf11429e9ac13bb799c31665c6e9f70a1ac4493a08a3f3da9ecf39c1ec347

PegaInfer direct benchmark snapshot, 2x RTX 5090, same model snapshot:

| Batch | Backend | Steady TPOT p50 (ms) | Steady TPOT avg (ms) | Decode tok/s | Trace |
|---|---|---|---|---|---|
| 1 | host-staged | 58.558 | 62.009 | 16.144 | HF trace |
| 1 | NCCL | 193.650 | 201.276 | 4.982 | HF trace |
| 4 | host-staged | 202.186 | 210.409 | 19.124 | HF trace |
| 4 | NCCL | 333.321 | 344.764 | 11.528 | HF trace |
| 8 | host-staged | 394.753 | 411.348 | 19.423 | HF trace |
| 8 | NCCL | 522.917 | 539.643 | 14.874 | HF trace |

Claim Boundary

This PR only fixes the measurement contract for the narrow DeepSeek-V2-Lite EP2 direct benchmark path. It does not claim a performance improvement, production batching support, or vLLM parity.

Refs #170

gemini-code-assist

Code Review

This pull request implements a narrow same-prompt batched decode path for DeepSeek-V2-Lite, introducing batched operations such as gemm_per_token and argmax_batch_bf16_into along with host-side batched attention helpers. It also updates documentation, tests, and expected oracle hashes. Feedback on the changes highlights a critical correctness issue in sample_tokens where stream synchronization is performed before clone_dtoh instead of after, which could lead to a race condition when accessing the copied data on the host.
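The ordering issue raised in the review can be illustrated with a toy model. The sketch assumes, as the review comment implies, that the device-to-host copy is enqueued asynchronously on a stream and only completes once the stream is synchronized. `MockStream` and both sampling functions are hypothetical stand-ins; they do not reproduce the real `sample_tokens` or `clone_dtoh` signatures, and the mock makes deterministic what would be a nondeterministic race on real hardware.

```rust
/// Toy stand-in for a device stream: copies are queued and only land in host
/// memory when `synchronize` is called.
struct MockStream {
    pending: Vec<(usize, Vec<u32>)>,
    host: Vec<Vec<u32>>,
}

impl MockStream {
    fn new() -> Self {
        Self { pending: Vec::new(), host: Vec::new() }
    }

    /// Enqueue an asynchronous device-to-host copy and return a handle to the
    /// host buffer, which stays zero-filled until the stream is synchronized.
    fn copy_dtoh_async(&mut self, device: &[u32]) -> usize {
        let handle = self.host.len();
        self.host.push(vec![0; device.len()]);
        self.pending.push((handle, device.to_vec()));
        handle
    }

    /// Complete all queued work in submission order.
    fn synchronize(&mut self) {
        for (handle, data) in self.pending.drain(..) {
            self.host[handle] = data;
        }
    }

    fn read_host(&self, handle: usize) -> Vec<u32> {
        self.host[handle].clone()
    }
}

/// Flagged ordering: synchronize first, then enqueue the copy and read at once.
/// The copy has not completed when the host buffer is read.
fn sample_sync_before_copy(stream: &mut MockStream, device_tokens: &[u32]) -> Vec<u32> {
    stream.synchronize();
    let handle = stream.copy_dtoh_async(device_tokens);
    stream.read_host(handle)
}

/// Suggested ordering: enqueue the copy, synchronize, then read.
fn sample_sync_after_copy(stream: &mut MockStream, device_tokens: &[u32]) -> Vec<u32> {
    let handle = stream.copy_dtoh_async(device_tokens);
    stream.synchronize();
    stream.read_host(handle)
}

fn main() {
    let device_tokens = [11, 304, 608];
    let stale = sample_sync_before_copy(&mut MockStream::new(), &device_tokens);
    let fresh = sample_sync_after_copy(&mut MockStream::new(), &device_tokens);
    println!("sync before copy: {stale:?}");
    println!("sync after copy:  {fresh:?}");
}
```

In the mock, the sync-before-copy variant always reads the zero-filled buffer. On a real device the same ordering might often appear to work, which is why this kind of bug can survive casual testing.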

gemini-code-assist Bot reviewed May 27, 2026


Comment thread pegainfer-deepseek-v2-lite/src/runtime.rs

CAICAIIs force-pushed the feat/dsv2-lite-true-batch-decode branch from 8038fb3 to 518575b on May 27, 2026 17:04

fix(deepseek-v2-lite): use true batched decode benchmark path

4d61251

CAICAIIs force-pushed the feat/dsv2-lite-true-batch-decode branch from 518575b to 4d61251 on May 27, 2026 17:20

xiaguan merged commit 4e76629 into xiaguan:main May 28, 2026
1 check passed

xiaguan merged 1 commit into xiaguan:main from CAICAIIs:feat/dsv2-lite-true-batch-decode
