Skip to content

[Bug] iso3/planar3 KV prefill scales super-linearly and produces garbage on RTX 5090 (sm_120 / CUDA 12.8) #10

@TanveerKahlon

Description

@TanveerKahlon

[Bug] iso3 / planar3 KV kernels: super-linear prefill slowdown and garbage output on RTX 5090 (sm_120 / CUDA 12.8)

Summary

On a 2× RTX 5090 (Blackwell, sm_120a, CUDA 12.8 toolkit, driver 580.126.09) box, the rotorquant KV-cache types iso3 and planar3 — as exposed by the johndpope/llama-cpp-turboquant consumer fork at feature/planarquant-kv-cache@fc3d1b6 — show:

  • Super-linear prefill slowdown that scales with context length — 2× slower at 512 tokens, 7.5× at 4 K, ~24× at 16 K compared to f16/f16 on the same fork binary
  • Incoherent output at real long-context workloads — 37 K-token prompt produces I?????????-style garbage + 0.3 t/s decode

The fork's own f16/f16 baseline works correctly at every context size (156.8 t/s decode on Llama 3.1 8B, coherent English), so the CUDA / PTX / driver / CMake path is fine — the regression is localised to the rotorquant kernel path on sm_120.

Filing on this repo because the arithmetic authoring (Givens rotations, quaternion rotations, isometric projections) originates here and the README headlines RTX 5090 results. (Attempted to file a companion report on johndpope/llama-cpp-turboquant where the CUDA kernel code lives, but that repo has issues + discussions both disabled — happy to coordinate with either maintainer.)

Environment

GPUs 2× NVIDIA GeForce RTX 5090 (32 GiB each)
Compute capability 12.0 (sm_120a)
Driver 580.126.09
CUDA toolkit 12.8.1 (side-toolkit at /opt/cuda-12.8/)
OS Ubuntu 24.04, Linux 6.17.0-20-generic
GCC 13.3
Consumer fork johndpope/llama-cpp-turboquant @ feature/planarquant-kv-cache (fc3d1b6, tip as of 2026-04-20)
ggml version reported 0.9.8-dirty

Reproducer A — super-linear prefill slowdown (deterministic, via llama-bench)

Built from johndpope/llama-cpp-turboquant @ feature/planarquant-kv-cache with:

cmake -B build \
  -DCMAKE_CUDA_ARCHITECTURES=120 \
  -DGGML_CUDA=ON -DGGML_CUDA_MMQ=ON -DGGML_FLASH_ATTN=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

(Needs one-line ggml/include/ggml.h:184 patch for GCC 13.3 — filing separately on the fork.)

Run:

MODEL=llama-3.1-8b-instruct.Q4_K_M.gguf
./build/bin/llama-bench -m $MODEL -ngl 99 -p  512 -n 128 -ctk f16  -ctv f16
./build/bin/llama-bench -m $MODEL -ngl 99 -p 4096 -n  64 -ctk f16  -ctv f16
./build/bin/llama-bench -m $MODEL -ngl 99 -p   512 -n 128 -ctk iso3 -ctv iso3
./build/bin/llama-bench -m $MODEL -ngl 99 -p  4096 -n  64 -ctk iso3 -ctv iso3
./build/bin/llama-bench -m $MODEL -ngl 99 -p 16384 -n  64 -ctk iso3 -ctv iso3
./build/bin/llama-bench -m $MODEL -ngl 99 -p   512 -n 128 -ctk planar3 -ctv planar3
./build/bin/llama-bench -m $MODEL -ngl 99 -p   512 -n 128 -ctk planar3 -ctv f16

Results (tensor-split across both 5090s, default -sm layer):

Config pp512 (t/s) pp4096 (t/s) pp16384 (t/s) tg128/tg64 (t/s)
f16/f16 (baseline) 12 533 ± 1 035 16 263 ± 4 n/a 234 / 244
iso3/iso3 5 771 ± 298 (−54 %) 2 166 ± 2 (−87 %) 691 ± 1 179 / 186
planar3/planar3 7 586 ± 140 (−40 %) 184
planar3/f16 14 850 ± 534 (+18 %) 218

iso3 prefill throughput collapses from ~46 % of baseline at 512 tokens to ~13 % at 4 K to ~4 % at 16 K. Decode (tg) throughput is basically fine, so the regression is on the prefill / batch-eval path, not the single-token attention kernel.

planar3/f16 (K-only compression) is the only rotorquant config that avoids the collapse — it's actually faster than baseline at pp512. Consistent with it skipping more work rather than hitting the slow path.

Reproducer B — garbage output at ~37 K context (llama-server)

With Llama 3.1 8B Q4_K_M and a ~37 K-token Punjabi scripture prompt (real workload — long-form Gurmukhi extraction):

./build/bin/llama-server -m $MODEL -ngl 99 --jinja \
  --host 127.0.0.1 --port 8090 \
  -c 65536 -t 16 -ts 1,1 \
  -ctk iso3 -ctv iso3 -fa on
Config Decode (t/s) Output
-ctk iso3 -ctv iso3 -fa on 0.6 garbage: I?????????
-ctk planar3 -ctv f16 -fa on 0.4 garbage: I?????????
-ctk iso3 -ctv iso3 (no -fa) 0.3 garbage
-ctk f16 -ctv f16 -fa on (fork baseline) 156.8 coherent English

Raw data captured at ~/local_llm/.recon/install-gate-5-v2-rotorquant.json in our local repo — can upload if useful.

Given the prefill scaling in Reproducer A, my guess is these are the same root cause: a prefill kernel that scales wrong on sm_120 and past some threshold produces arithmetic that's corrupt as well as slow, rather than just slow.

What I've ruled out

  • Not a broken toolchain. Fork's own f16/f16 baseline runs at 156.8 t/s decode / 12.5 K t/s prefill on the same build, coherent output.
  • Not a caller-side -fa issue. Garbage reproduces with and without flash-attention.
  • Not sm_120a vs sm_120. CMAKE_CUDA_ARCHITECTURES=120 targets base sm_120 PTX; kernels using a-family accelerated ops still run.
  • Not CUDA version cached wrong. Built cleanly with /opt/cuda-12.8/ toolkit; libggml-cuda.so links libcudart.so.12.

Hypothesis

The rotorquant CUDA kernels landed in the consumer fork via:

  • 1ed0453 — "Add CUDA set_rows kernels for planar3/iso3/planar4/iso4"
  • 26c90d6 — "Add CUDA F16→quantized conversion kernels for planar3/4 and iso3/4"
  • a75b16f — "Add CUDA flash attention dequantize for planar3/iso3/planar4/iso4"
  • 9d4ece5 — "COMPRESSION WORKS: 5.1x K-cache + 200 tok/s decode on CUDA"

None of those commits mention cu128, WMMA, mma.sync, __CUDA_ARCH__ >= 1200, or any architecture gating. The "200 tok/s decode on CUDA" in 9d4ece5 was almost certainly measured on sm_89 Ada (RTX 4090 / A6000) since none of the commits adjust block/warp sizing or add Blackwell-specific codepaths.

Most-likely candidates for the sm_120 path going bad:

  1. Givens-rotation prefill kernel uses a block shape or shared-memory layout valid on sm_89 (128 KB dynamic smem) but overflowing / mis-aligning on sm_120 (228 KB dynamic smem with different bank layout)
  2. WMMA / mma.sync instruction chosen by template expansion compiles on sm_120 but produces wrong output for Givens-rotated tiles
  3. Constant-memory read pattern for rotation coefficients behaves differently on Blackwell (CC 12.0 changed constant cache behaviour)

Cross-reference: ggml-org/llama.cpp#21915 (gibberish with quantised KV cache on sm_120, open as of 2026-04-14) — same symptom family for non-rotorquant KV quant, suggests the sm_120 + quant-KV path is fragile upstream too.

Ask

  1. Which compute capabilities were the iso3 / planar3 kernels validated on in CI or manual testing before the RTX 5090 numbers landed in the README?
  2. Is there a suspected kernel (Givens rotation in prefill? FA dequantize?) where the sm_89 → sm_120 jump would plausibly regress? I can run compute-sanitizer / nsys profile and share output if it would help narrow this down.
  3. If this is a known-soft spot on Blackwell, would it be worth a note in the README until kernel paths are updated?

Happy to coordinate a debug session — this is blocking a real-world Gurmukhi scripture extraction workload for us (30 K–50 K-token chunks), so a fix here would be directly useful. The algorithm itself is genuinely encouraging — planar3/f16 showing K-only compression faster than baseline at pp512 is a nice result.


Filed 2026-04-20. The consumer fork johndpope/llama-cpp-turboquant has issues + discussions disabled, so there's no parallel ticket there — this is the single issue for this bug. Machine-readable repro data at our local .recon/install-gate-5-v2-rotorquant.json.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions