perf(deepseek-v4): vectorize read_deepseek_v4_indexer_fp8_cache#238
perf(deepseek-v4): vectorize read_deepseek_v4_indexer_fp8_cache#238yuanqingz wants to merge 3 commits into
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: efa0d11208
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| slots = slot_mapping.to(torch.int64) | ||
| valid = slots >= 0 | ||
| safe_slots = torch.where(valid, slots, torch.zeros_like(slots)) | ||
| pages = torch.div(safe_slots, block_size, rounding_mode="floor") |
There was a problem hiding this comment.
Move slot_mapping to cache device before offset math
This change keeps slots on whatever device slot_mapping already uses (slot_mapping.to(torch.int64)), but later combines value_base/scale_base with torch.arange(..., device=cache_2d.device). When cache_2d is on CUDA and slot_mapping is on CPU, this now raises a cross-device tensor error during offset construction, whereas the previous .tolist() loop accepted CPU mappings. Converting slot_mapping to cache_2d.device before computing pages/pos would preserve prior behavior.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch — fixed in 6bdc106. The original .tolist() loop was implicitly device-agnostic (Python ints moved between CPU/GPU via subsequent slicing). The vectorized version composes offsets with torch.arange(device=cache_2d.device), so I now explicitly slot_mapping.to(device=cache_2d.device, dtype=torch.int64) to preserve the prior contract.
efa0d11 to
6bdc106
Compare
|
@codex review |
|
Codex Review: Didn't find any major issues. Can't wait for the next one! ℹ️ About Codex in GitHubCodex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback". |
The original implementation iterated over `slot_mapping.tolist()` in
Python and performed GPU slicing + dtype-view + multiply per token. For
a 16-req x 1024-token prefill batch (~14338 tokens) across ~30 sparse
attention layers this is ~430K Python iterations per forward pass, each
with several GPU ops. The CPU sync from `.tolist()` also blocks any
hope of CUDA graph capture for the indexer path.
Replace with a batched torch-op implementation following the same
pattern already used by `read_deepseek_v4_indexer_mxfp4_cache` (same
file): one `gather` per dimension, dequantize on device. Output is
bit-identical to the reference loop for valid slots, zero for invalid
slots (slot < 0).
Measured impact on DeepSeek-V4-Flash with H20-3e TP=4, FP8 KV cache,
random ISL=1024 OSL=4 c=16:
TTFT (ms): 823,467 -> 18,197 (45x)
TPOT (ms): 2,067 -> 298 (7x)
16/16 bench duration: 1350s -> 19s (70x)
The vectorized implementation is also CUDA-graph-safe (no Python
branches, no `.tolist()` CPU sync), unblocking `--enforce-eager`
removal for V4-Flash's sparse indexer path.
Existing test `test_csa_indexer_cache_insert_fp8_path` continues to
pass; numerical equivalence with the original reference loop was
verified against the DeepSeek-V4-Flash bring-up smoke ('The capital of
France is Paris.') and a successful 16/16 random-prompt bench run.
Signed-off-by: Yuanqing Zhao <yuanqingz@nvidia.com>
9c4c4f6 to
5fe24d3
Compare
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5fe24d38bd
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| dtype=torch.int64, | ||
| )[None, :] | ||
| ) | ||
| values = flat_cache[value_offsets].view(torch.float8_e4m3fn).float() |
There was a problem hiding this comment.
Skip gathers when all slots are invalid
This vectorized path performs flat_cache[value_offsets] unconditionally, so when slot_mapping is entirely padding (<0) and the cache has zero pages, it still tries to read row 0 from an empty buffer and fails with an out-of-bounds index (on CUDA this can surface as a device-side assert). The previous loop-based implementation skipped invalid slots and returned zeros in this case, so this is a behavioral regression for padded/empty-cache inputs unless you short-circuit when valid.any() is false before computing gathers.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch — fixed in 1e29154. Added a shape-only short-circuit that returns zeros when slot_mapping.numel() == 0 or cache_2d.shape[0] == 0. Kept it shape-only (vs valid.any()) so the check stays CUDA-graph-capture-safe — no host sync. Also flipped the early-return from torch.empty to torch.zeros so the contract holds in the "cache has zero pages, all slots are padding" case the reference loop tolerated.
The sibling read_deepseek_v4_indexer_mxfp4_cache has the same latent issue with the same shape — happy to file a follow-up PR for it.
When cache_2d has zero pages (e.g., warmup batches before any FP8 indexer rows are cached) and slot_mapping is all-padding, the reference per-token loop iterated `slot_mapping.tolist()` and `continue`d on every `slot < 0`, returning the zero-initialized output tensor without touching `flat_cache`. The vectorized path uses `where(valid, slots, 0)` to keep offsets in-range, but the resulting row-0 gather still indexes into an empty `flat_cache` and raises an out-of-bounds error (on CUDA, surfaces as a device-side assert). Add a shape-only early return when `slot_mapping.numel() == 0` or `cache_2d.shape[0] == 0`. Shape-only so the check stays CUDA-graph-capture-safe (no `valid.any()` host sync). Switch the empty return tensor from `torch.empty` to `torch.zeros` to match the reference behavior in the cache-has-zero-pages case. Caught by codex review on PR lightseekorg#238. Signed-off-by: Yuanqing Zhao <yuanqingz@nvidia.com>
|
This PR has been inactive for 14 days and is marked as stale. It will be closed in 3 days if there is no further activity. |
|
@yuanqingz |
Summary
Replace the per-token Python loop in
read_deepseek_v4_indexer_fp8_cachewith a batched torch-op gather + dequant, matching the pattern already
used by
read_deepseek_v4_indexer_mxfp4_cachein the same file.The original loop iterates
slot_mapping.tolist()and performs severalGPU ops per token. For a 16-req × 1024-token prefill batch
(~14338 tokens) × ~30 sparse attention layers, that's ~430K Python
iterations per forward pass, each with a small GPU slice/view/multiply.
The
.tolist()host sync also blocks CUDA-graph capture of the indexerpath.
Measured impact
DeepSeek-V4-Flash on H20-3e TP=4, FP8 KV cache, random ISL=1024 c=16.
Eager mode (isolates this PR's change)
--enforce-eagerso only the indexer vectorization is exercised. Thewider CUDA-graph capture of the V4 indexer path is unrelated and was
made capture-safe on
mainby #242.OSL=4, NUM_PROMPTS=16:
Cumulative effect with CUDA graphs
Same hardware, OSL=256, NUM_PROMPTS=64, CUDA graphs enabled (no
--enforce-eager). This run also carries a separate capture-saferewrite of
_deepseek_v4_indexer_topk_from_cache_batchedthat isnot in this PR — semantically equivalent to what #242 has already
landed on
main, so reproducing this row on currentmainonlyrequires this PR. (Run id:
2299611; box: viking-prod-586.)For external context, the published vLLM HT reference at TP=4 c=128
on the same hardware reports mean TPOT 107 ms; this run is 1.5× faster
per token at c=16.
Correctness
slots; zero for
slot < 0.test/runtime/test_deepseek_v4_attention_ops.py::test_csa_indexer_cache_insert_fp8_pathpasses.
("The capital of France is Paris.") returns the expected completion.
complete successfully.
Re-test on current
main— blocked by a separate upstream bugI attempted to reproduce the cumulative-effect row on current
main(post-#242) after rebuilding
:smokewith the matchingtokenspeed-kernel-0.1.0.dev20260525. V4-Flash startup unconditionallyfails in
_fp8_act_quant_dequantatruntime/models/deepseek_v4.py:212:trtllm_fp8_quantize_1x128(...)returns a 1-D scale on this build, butthe call site expects 2-D. This blocks every V4-Flash TP=4 init on
Hopper. It is independent of the vectorize change in this PR — the
function being replaced here was not touched by #242, so this PR is
mechanically equivalent on old vs current
main. Happy to re-run thenumbers against a
mainonce the_fp8_act_quant_dequantshape issueis sorted (tracked as #246).