perf(deepseek-v4): vectorize read_deepseek_v4_indexer_fp8_cache by yuanqingz · Pull Request #238 · lightseekorg/tokenspeed

yuanqingz · 2026-05-24T08:58:34Z

Summary

Replace the per-token Python loop in read_deepseek_v4_indexer_fp8_cache
with a batched torch-op gather + dequant, matching the pattern already
used by read_deepseek_v4_indexer_mxfp4_cache in the same file.

The original loop iterates slot_mapping.tolist() and performs several
GPU ops per token. For a 16-req × 1024-token prefill batch
(~14338 tokens) × ~30 sparse attention layers, that's ~430K Python
iterations per forward pass, each with a small GPU slice/view/multiply.
The .tolist() host sync also blocks CUDA-graph capture of the indexer
path.
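
The iteration count quoted above can be checked with a few lines of arithmetic. This is only a sketch: it assumes the reader function is called once per sparse attention layer per forward pass, and it uses the approximate figures from the summary rather than measured counters.

```python
# Approximate figures quoted in the PR summary (not measured here).
tokens_per_prefill_batch = 14_338   # "~14338 tokens" for 16 requests at ISL=1024
sparse_attention_layers = 30        # "~30 sparse attention layers"

# Assumption: one call to the reader per layer, one loop iteration per token.
iterations_per_forward = tokens_per_prefill_batch * sparse_attention_layers

print(f"{iterations_per_forward:,} Python iterations per forward pass")
```

Each of those iterations issues a handful of tiny GPU ops, so launch and interpreter overhead dominate rather than the dequantization itself.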

Measured impact

DeepSeek-V4-Flash on H20-3e TP=4, FP8 KV cache, random ISL=1024 c=16.

Eager mode (isolates this PR's change)

Both runs use --enforce-eager, so only the indexer vectorization is
exercised. The wider CUDA-graph capture of the V4 indexer path is
unrelated and was made capture-safe on main by #242.

OSL=4, NUM_PROMPTS=16:

| Metric | Before | After | Speedup |
| --- | --- | --- | --- |
| 16/16 bench duration | 1350 s | 19 s | 70× |
| Mean TPOT | 2067 ms | 298 ms | 7× |
| TTFT | 823 s | 18 s | 45× |

Cumulative effect with CUDA graphs

Same hardware, OSL=256, NUM_PROMPTS=64, CUDA graphs enabled (no
--enforce-eager). This run also carries a separate capture-safe
rewrite of _deepseek_v4_indexer_topk_from_cache_batched that is
not in this PR — semantically equivalent to what #242 has already
landed on main, so reproducing this row on current main only
requires this PR. (Run id: 2299611; box: viking-prod-586.)

| Metric | Before (eager, no patch) | After (PR + cudagraph) | Speedup |
| --- | --- | --- | --- |
| Output throughput | 3.03 tok/s | 141.4 tok/s | 47× |
| Total throughput | 294 tok/s | 707 tok/s | 2.4× |
| Mean TPOT | 2067 ms | 68 ms | 30× |
| Bench duration | 1350 s | 116 s | 11.6× |
| Successful requests | 16 / 64 | 64 / 64 | n/a |

For external context, the published vLLM HT reference at TP=4 c=128
on the same hardware reports mean TPOT 107 ms; this run is 1.5× faster
per token at c=16.
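
For readers who want to re-derive the speedup column, the ratios can be recomputed from the quoted before/after values. This is a small sketch; the `speedup` helper and the dictionary layout are illustrative and not part of the benchmark tooling.

```python
def speedup(before, after, higher_is_better=False):
    """Ratio > 1 means the 'after' run is better."""
    return after / before if higher_is_better else before / after


# (before, after, higher_is_better), copied from the cumulative-effect table.
rows = {
    "output throughput (tok/s)": (3.03, 141.4, True),
    "total throughput (tok/s)": (294, 707, True),
    "mean TPOT (ms)": (2067, 68, False),
    "bench duration (s)": (1350, 116, False),
}

ratios = {name: speedup(b, a, hib) for name, (b, a, hib) in rows.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")

# Comparison against the quoted vLLM reference TPOT (107 ms at c=128).
vllm_ratio = 107 / 68
print(f"TPOT vs. vLLM reference: {vllm_ratio:.2f}x")
```

Note that the two TPOT figures in the last comparison come from different concurrency levels (c=128 vs. c=16), so the ratio is context rather than a like-for-like result.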

Correctness

- Vectorized output is bit-identical to the reference loop for valid
  slots; zero for slot < 0.
- Existing test
  test/runtime/test_deepseek_v4_attention_ops.py::test_csa_indexer_cache_insert_fp8_path
  passes.
- End-to-end V4-Flash bring-up SMOKE
  ("The capital of France is Paris.") returns the expected completion.
- All 16/16 (OSL=4) and 64/64 (OSL=256) random-prompt bench requests
  complete successfully.

Re-test on current `main` — blocked by a separate upstream bug

I attempted to reproduce the cumulative-effect row on current main
(post-#242) after rebuilding :smoke with the matching
tokenspeed-kernel-0.1.0.dev20260525. V4-Flash startup unconditionally
fails in _fp8_act_quant_dequant at runtime/models/deepseek_v4.py:212:

scale = scale.float().transpose(0, 1).contiguous()
IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

trtllm_fp8_quantize_1x128(...) returns a 1-D scale on this build, but
the call site expects 2-D. This blocks every V4-Flash TP=4 init on
Hopper. It is independent of the vectorize change in this PR — the
function being replaced here was not touched by #242, so this PR is
mechanically equivalent on old vs current main. Happy to re-run the
numbers against main once the _fp8_act_quant_dequant shape issue
is sorted (tracked as #246).
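
The failure mode described above is easy to reproduce in isolation. The sketch below does not call trtllm_fp8_quantize_1x128; it uses stand-in tensors (the shapes are hypothetical) to show that the quoted line works for a 2-D scale and raises the same IndexError for a 1-D one.

```python
import torch


def transpose_scale(scale):
    # Mirrors the failing line quoted in the traceback above.
    return scale.float().transpose(0, 1).contiguous()


scale_2d = torch.ones(4, 2)  # hypothetical 2-D scale: what the call site expects
scale_1d = torch.ones(4)     # hypothetical 1-D scale: what the new build returns

print(transpose_scale(scale_2d).shape)  # torch.Size([2, 4])

try:
    transpose_scale(scale_1d)
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print(error_message)
```

How the call site should treat a 1-D scale is a question for #246; this snippet only demonstrates the shape mismatch.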

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: efa0d11208

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-24T09:00:32Z

+    slots = slot_mapping.to(torch.int64)
+    valid = slots >= 0
+    safe_slots = torch.where(valid, slots, torch.zeros_like(slots))
+    pages = torch.div(safe_slots, block_size, rounding_mode="floor")


Move slot_mapping to cache device before offset math

This change keeps slots on whatever device slot_mapping already uses (slot_mapping.to(torch.int64)), but later combines value_base/scale_base with torch.arange(..., device=cache_2d.device). When cache_2d is on CUDA and slot_mapping is on CPU, this now raises a cross-device tensor error during offset construction, whereas the previous .tolist() loop accepted CPU mappings. Converting slot_mapping to cache_2d.device before computing pages/pos would preserve prior behavior.

Useful? React with 👍 / 👎.

Good catch, fixed in 6bdc106. The original .tolist() loop was implicitly device-agnostic: it turned the slots into Python ints, which can slice a tensor on any device. The vectorized version composes offsets with torch.arange(device=cache_2d.device), so I now explicitly call slot_mapping.to(device=cache_2d.device, dtype=torch.int64) to preserve the prior contract.
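
A minimal sketch of the slot decomposition with the device fix applied is shown here. The helper name `split_slots` and the `pos = slot % block_size` step are assumptions for illustration; only the `.to(...)`, `where`, and `torch.div` lines follow the diff quoted in the review.

```python
import torch


def split_slots(slot_mapping, cache_2d, block_size):
    """Turn flat slot ids into (valid mask, page index, in-page position)."""
    # Move to the cache's device first so later offset math never mixes devices.
    slots = slot_mapping.to(device=cache_2d.device, dtype=torch.int64)
    valid = slots >= 0
    # Clamp padding slots (< 0) to 0 so every index stays in range.
    safe_slots = torch.where(valid, slots, torch.zeros_like(slots))
    pages = torch.div(safe_slots, block_size, rounding_mode="floor")
    pos = safe_slots % block_size  # assumed in-page position
    return valid, pages, pos


cache_2d = torch.zeros(3, 8, dtype=torch.uint8)                 # toy cache
slot_mapping = torch.tensor([0, 5, -1, 130], dtype=torch.int32)  # toy slots
valid, pages, pos = split_slots(slot_mapping, cache_2d, block_size=64)
print(valid.tolist(), pages.tolist(), pos.tolist())
```

On a CPU-only machine both tensors already share a device, so the `.to(...)` call is a no-op there; it matters when `cache_2d` lives on CUDA and `slot_mapping` arrives on the CPU.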

yuanqingz · 2026-05-24T15:06:03Z

@codex review

chatgpt-codex-connector · 2026-05-24T15:08:36Z

Codex Review: Didn't find any major issues. Can't wait for the next one!


The original implementation iterated over `slot_mapping.tolist()` in Python and performed GPU slicing + dtype-view + multiply per token. For a 16-req x 1024-token prefill batch (~14338 tokens) across ~30 sparse attention layers this is ~430K Python iterations per forward pass, each with several GPU ops. The CPU sync from `.tolist()` also blocks any hope of CUDA graph capture for the indexer path.

Replace with a batched torch-op implementation following the same pattern already used by `read_deepseek_v4_indexer_mxfp4_cache` (same file): one `gather` per dimension, dequantize on device. Output is bit-identical to the reference loop for valid slots, zero for invalid slots (slot < 0).

Measured impact on DeepSeek-V4-Flash with H20-3e TP=4, FP8 KV cache, random ISL=1024 OSL=4 c=16:

TTFT (ms): 823,467 -> 18,197 (45x)
TPOT (ms): 2,067 -> 298 (7x)
16/16 bench duration: 1350s -> 19s (70x)

The vectorized implementation is also CUDA-graph-safe (no Python branches, no `.tolist()` CPU sync), unblocking `--enforce-eager` removal for V4-Flash's sparse indexer path.

Existing test `test_csa_indexer_cache_insert_fp8_path` continues to pass; numerical equivalence with the original reference loop was verified against the DeepSeek-V4-Flash bring-up smoke ('The capital of France is Paris.') and a successful 16/16 random-prompt bench run.

Signed-off-by: Yuanqing Zhao <yuanqingz@nvidia.com>

zhyncs · 2026-05-25T08:11:09Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5fe24d38bd


chatgpt-codex-connector · 2026-05-25T08:15:15Z

+            dtype=torch.int64,
+        )[None, :]
+    )
+    values = flat_cache[value_offsets].view(torch.float8_e4m3fn).float()


Skip gathers when all slots are invalid

This vectorized path performs flat_cache[value_offsets] unconditionally, so when slot_mapping is entirely padding (<0) and the cache has zero pages, it still tries to read row 0 from an empty buffer and fails with an out-of-bounds index (on CUDA this can surface as a device-side assert). The previous loop-based implementation skipped invalid slots and returned zeros in this case, so this is a behavioral regression for padded/empty-cache inputs unless you short-circuit when valid.any() is false before computing gathers.

Useful? React with 👍 / 👎.

Good catch — fixed in 1e29154. Added a shape-only short-circuit that returns zeros when slot_mapping.numel() == 0 or cache_2d.shape[0] == 0. Kept it shape-only (vs valid.any()) so the check stays CUDA-graph-capture-safe — no host sync. Also flipped the early-return from torch.empty to torch.zeros so the contract holds in the "cache has zero pages, all slots are padding" case the reference loop tolerated.

The sibling read_deepseek_v4_indexer_mxfp4_cache has the same latent issue with the same shape — happy to file a follow-up PR for it.

When cache_2d has zero pages (e.g., warmup batches before any FP8 indexer rows are cached) and slot_mapping is all-padding, the reference per-token loop iterated `slot_mapping.tolist()` and `continue`d on every `slot < 0`, returning the zero-initialized output tensor without touching `flat_cache`. The vectorized path uses `where(valid, slots, 0)` to keep offsets in-range, but the resulting row-0 gather still indexes into an empty `flat_cache` and raises an out-of-bounds error (on CUDA, surfaces as a device-side assert).

Add a shape-only early return when `slot_mapping.numel() == 0` or `cache_2d.shape[0] == 0`. Shape-only so the check stays CUDA-graph-capture-safe (no `valid.any()` host sync). Switch the empty return tensor from `torch.empty` to `torch.zeros` to match the reference behavior in the cache-has-zero-pages case.

Caught by codex review on PR lightseekorg#238.

Signed-off-by: Yuanqing Zhao <yuanqingz@nvidia.com>

github-actions · 2026-06-23T00:29:27Z

This PR has been inactive for 14 days and is marked as stale. It will be closed in 3 days if there is no further activity.

SimonCqk · 2026-06-24T02:00:41Z

@yuanqingz
Hi, thanks for the contribution! The before and after benchmark results do not seem directly comparable. Would you mind adding another benchmark run with cudagraph enabled (without --enforce-eager) and comparing the results before and after your change?

yuanqingz requested a review from a team as a code owner May 24, 2026 08:58

chatgpt-codex-connector Bot reviewed May 24, 2026


yuanqingz force-pushed the perf/vectorize-read-indexer-fp8-cache branch from efa0d11 to 6bdc106 Compare May 24, 2026 14:18

yuanqingz changed the title ~~perf(deepseek-v4): vectorize read_deepseek_v4_indexer_fp8_cache~~ perf(deepseek-v4): vectorize FP8 indexer read + make indexer topk batched CUDA-graph safe May 25, 2026

yuanqingz force-pushed the perf/vectorize-read-indexer-fp8-cache branch from 9c4c4f6 to 5fe24d3 Compare May 25, 2026 02:48

yuanqingz changed the title ~~perf(deepseek-v4): vectorize FP8 indexer read + make indexer topk batched CUDA-graph safe~~ perf(deepseek-v4): vectorize read_deepseek_v4_indexer_fp8_cache May 25, 2026

yuanqingz mentioned this pull request May 25, 2026

DeepSeek-V4-Flash startup fails on Hopper: _fp8_act_quant_dequant IndexError on scale.transpose(0, 1) #246

Closed

chatgpt-codex-connector Bot reviewed May 25, 2026


yuanqingz and others added 2 commits May 25, 2026 07:50

Merge branch 'main' into perf/vectorize-read-indexer-fp8-cache

9226ebf

github-actions Bot added the inactive label Jun 23, 2026

github-actions Bot removed the inactive label Jun 24, 2026

yuanqingz wants to merge 3 commits into lightseekorg:main from yuanqingz:perf/vectorize-read-indexer-fp8-cache


3 participants
