Skip to content

Per-GPU kernel autotune + cov-shuffle (shfl) variant + Eigen port#143

Open
zpzim wants to merge 63 commits into
masterfrom
feat/gpu-kernel-shfl-variant
Open

Per-GPU kernel autotune + cov-shuffle (shfl) variant + Eigen port#143
zpzim wants to merge 63 commits into
masterfrom
feat/gpu-kernel-shfl-variant

Conversation

@zpzim
Copy link
Copy Markdown
Owner

@zpzim zpzim commented May 22, 2026

Background

The cov-shuffle / warp-shuffle GPU kernel design has been on this repo's
wishlist since 2018 (experimental-warp-shuffle,
experimental-shared-memory) and was retried in late 2022 / early 2023
with an Eigen-port foundation (use-eigen-in-gpu-kernels,
eigen-gpu-warp-shuffle, eigen-gpu-cov-shuffle, eigen-tex-mem).
None of those branches reached a shippable state — the codegen
tradeoffs (compile time, kernel binary size, smem footprint per
variant) made it hard to keep the new kernel alongside the existing
sliding-window one without regressing builds for users who didn't need
it. This PR is the iteration that finally lands the work, on top of a
per-(profile, variant) build split + an autotune cache that picks per
device, so the new kernel can coexist with the legacy one in the same
binary.

Summary

Three threads of work that landed together because the autotune machinery
depends on the kernel-variant split, and the variant split is the thing
that lets the cov-shuffle variant live alongside the existing
sliding-window kernels without compile-time blow-up.

1. Per-GPU autotune

  • New --autotune CLI flag (and pyscamp.autotune() binding) that
    sweeps (variant geometry, blocksz) for every supported
    (profile, precision) target and writes the per-target winner to a
    plain-text cache file.
  • Three-tier lookup at kernel-launch time:
    process-wide override → user cache → built-in cache (embedded from
    data/autotune_cache.txt) → compile-time default. Each tier is
    independent; lookup falls through gracefully on miss.
  • Cache file location resolves $SCAMP_AUTOTUNE_CACHE → $XDG_CACHE_HOME
    $HOME/.cache/scamp/autotune.txt on POSIX or
    %LOCALAPPDATA%\scamp\autotune.txt on Windows.
  • One-shot stderr warning ("no autotune entry for device '<name>'…")
    on miss, silenceable via SCAMP_AUTOTUNE_QUIET. pyscamp sets this at
    module init so notebook users don't get uninvited stderr.
  • Default benchmark workload size is 256K elements
    (SCAMP_AUTOTUNE_INPUT_LENGTH=262144); measured against 64K/128K on
    an RTX 3080, 256K is the smallest size where launch overhead is
    reliably dominated by steady-state kernel cost — see the comparison
    table in docs/source/autotune.rst.
  • Built-in cache ships entries for RTX 3080 (sm_86) and P100 (sm_60).

2. Kernel-variant infrastructure

  • SCAMP_VARIANT_TUPLES in src/core/gpu_kernel/CMakeLists.txt is the
    single source of truth for which kernel geometries the binary
    supports. The foreach in that file regenerates the dispatch table,
    the kVariants[] constexpr array, and the per-(profile, variant)
    .cu files via configure_file.
  • Five active variants today: 3 sliding-window + 2 shfl, after a
    multi-device sweep retired the ones that never won.
  • Build split: 5 profiles × 5 variants = 25 generated .cu files,
    one do_tile<...> instantiation per file, so make/ninja can
    dispatch them in parallel. Full gpu_kernels rebuild on a 3080 dev
    box dropped from ~260s to ~74s; the dominant v3+v4 (DPT=8) TUs
    alone went 211s → 34s after dropping #pragma unroll on the shfl
    outer row loop (a __syncthreads()-terminated loop with up to 256
    iterations that nvcc was inlining 256× for no ILP gain).

3. cov-shuffle ("shfl") variant

  • New GPU kernel design where the per-row covariance value walks
    lane-to-lane via warp shuffles rather than being read from shared
    memory. Files: kernels_compute_shfl.h, kernels_impl_shfl.h,
    with an ASCII walkthrough generator in
    scripts/cov_shuffle_diagram.py.
  • Two enabled shfl tuples: DPT=4 / tile=64 and DPT=8 / tile=256. v3
    (DPT=4) is the compile-time default fallback; the autotuner picks
    per-device.

4. Eigen port of the sliding-window kernel

  • Sliding-window cov / dist arrays in kernels_compute.h +
    kernels_smem.h use Eigen::Array<T, N, 1> and Eigen::Map
    instead of hand-unrolled C arrays. Required bumping the Eigen
    submodule 3.4.0 → 5.0.1 for better nvcc support; device-callable
    under --expt-relaxed-constexpr (already set at the top level).

5. Platform / CUDA-arch matrix

  • Hardened cmake/SCAMPMacros.cmake set_cuda_architectures() for
    CUDA 11.0 through 13.0+, including Blackwell (sm_100, sm_103,
    sm_110, sm_120, sm_121) and the CCCL 2.8.x arch-token limit
    workarounds.
  • Honors -DCMAKE_CUDA_ARCHITECTURES=<list> /
    USER_SPECIFIED_CUDA_ARCHITECTURES for single-arch dev builds (huge
    compile-time win when you know you're targeting one GPU).
  • Windows MSVC compatibility for the autotune cache code:
    std::filesystem for atomic-rename + recursive mkdir,
    _putenv_s for the env-var-set test helpers, explicit lambda
    captures for MSVC C3493, LOCALAPPDATA / USERPROFILE env-var
    resolution.

6. Docs

  • New docs/source/autotune.rst covering lookup chain, cache
    locations (POSIX + Windows), the "no autotune entry" warning,
    clearing/resetting the cache, the upgrade compatibility contract,
    and per-N workload-size guidance.
  • New "Environment variables" section in docs/source/environment.rst
    listing every env var SCAMP / pyscamp / the distributed gRPC client
    reads.
  • pyscamp autodoc stub (docs/pyscamp.py) now surfaces autotune()
    and gpu_supported().

Test plan

  • 22 C++ unit tests in test/cpp/test_autotune_cache.cpp
    covering the cache lookup chain, cache-miss warning dedup, the
    SCAMP_AUTOTUNE_QUIET env-var contract, on-disk round-trip,
    atomic replace, recursive parent-dir creation, the full
    bench-driven autotune pipeline driven by a synthetic
    BenchmarkFn (no GPU needed), and four cache-upgrade scenarios
    (future version header, missing header, partially stale cache,
    user wipe with built-in fallback).
  • Tests build + pass under CUDA-toolchain-only CI on Linux
    (g++, clang++) and Windows (MSVC cl), with no GPU required.
    Verified locally against VS2022 MSVC 14.30 + CUDA 11.6.
  • SCAMP --autotune end-to-end on RTX 3080 (Linux + Windows)
    produces a cache that subsequent SCAMP runs consume without
    the cache-miss warning.
  • cmake --build . --config Release --parallel 2 (the exact
    invocation in build-cuda-cli) builds + runs test_autotune_cache
    via ctest --output-on-failure.

🤖 Generated with Claude Code

zpzim and others added 30 commits May 16, 2026 11:18
The CPU kernels already live under src/core/cpu_kernel/. Mirror that layout
on the GPU side so the next round of GPU work (per-arch knobs + autotuning,
issue #115) has a clear home and is not interleaved with the host-side
tile/profile orchestration.

Files moved (git rename, no behavior change):
  kernels.h, kernels.cu                 -- main kernel entry + dispatch
  kernels_compute.h, kernels_smem.h     -- per-profile compute strategies + smem
  kernel_gpu_utils.h, kernel_gpu_utils.cu -- launch config helpers
  qt_kernels.h, qt_kernels.cu           -- cuFFT sliding-dot-product kernels

Files staying at src/core/ (host-side or shared):
  kernel_common.h/.cpp -- shared with CPU kernels
  qt_helper.h/.cpp     -- host-side cuFFT context manager (has CPU fallback)
  tile.h/.cpp          -- tile orchestration

src/core/gpu_kernel/CMakeLists.txt now owns the qt_kernels, gpu_utils, and
gpu_kernels library targets and is only added when CMAKE_CUDA_COMPILER is
present, mirroring how cpu_kernel/ is structured.

Verified that build_cuda/ builds cleanly with -DFORCE_CUDA=1 and that the
1NN_INDEX self-join on a 2000-element series exactly matches
distance_matrix_np reference output (same vectors and indexes).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The do_tile<...> template was instantiated 45 times in a single .cu file
(5 profile types x 3 precisions x 3 row/col modes), serialising nvcc into
a ~100-second front-end for that one translation unit. Split it so each
profile type gets its own .cu file and the per-TU work shrinks to 9
instantiations.

  kernels_impl.h          -- shared template body for do_tile + LaunchDoTile
  kernels_dispatch.h      -- LaunchKernel_<PROFILE> entry-point declarations
  kernels.cu              -- slimmed: just the runtime dispatch +
                             gpu_kernel_self/ab_join_* + match_gpu_sort
  kernel_1nn.cu
  kernel_1nn_index.cu     -- one per profile type, each instantiates
  kernel_sum_thresh.cu       LaunchDoTile<...> bound to its PROFILE_TYPE
  kernel_matrix_summary.cu
  kernel_approx_all_neighbors.cu

Also extracted the plain launch constants (KERNEL_TILE_HEIGHT, BLOCKSZ_*,
BLOCKSPERSM, DIAGS_PER_THREAD, TILE_HEIGHT_*) into a new CUDA-free
kernel_constants.h. kernel_gpu_utils.h still re-exports them; host TUs
can now include the constants without dragging in nvcc-only device
intrinsics.

Measured on this 8-core box (RTX 3080, nvcc 12.2):
  baseline (single kernels.cu):     101s wall, 740MB peak rss
  split (6 .cu, --parallel 8):       35s wall, 418MB peak rss
A 2.9x wall-clock win and 43% lower peak memory for the gpu_kernels target.

GPU output is byte-identical: verified that 1NN_INDEX self-join on a 2000-
element series produces vectors+indexes matching distance_matrix_np
exactly. PTX is unchanged -- this is a pure refactor.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
)

The kernel launch path now consults a small per-(device, profile_type,
precision) configuration table to pick the right (blocksz, tile_height,
blocks_per_sm) tuple. The table is wired with three layers, checked in
priority order at GetKernelConfigForDevice():

  1. User override file ($SCAMP_AUTOTUNE_CACHE, $XDG_CACHE_HOME, or
     ~/.cache/scamp/autotune.txt). RunAutotune() writes here. Useful for
     developers iterating on a single machine.
  2. Built-in cache embedded into the binary at build time from
     data/autotune_cache.txt. This is what conda-forge / pip-wheel users
     benefit from since they cannot recompile to add kernel variants.
  3. GetDefaultKernelConfig() -- the compile-time default that matches
     the constants SCAMP has historically shipped with.

Workflow for shipping a new device's tuned config to end users:
  1. Build SCAMP from source on a machine with the target GPU.
  2. Run `SCAMP --autotune` (or `pyscamp.autotune()`) -- this writes the
     user override file.
  3. Merge the new device's lines into data/autotune_cache.txt and open
     a PR. The next release picks them up via configure_file embedding.

Files added:
  data/autotune_cache.txt              -- checked-in cache with header
  src/core/gpu_kernel/
    kernel_config.{h,cpp}              -- KernelConfig struct + default
    kernel_constants.h                 -- moved compile-time constants
                                          here (CUDA-free) so host TUs
                                          can include them
    device_props.{h,cpp}               -- cudaGetDeviceProperties wrapper
                                          producing a stable cache key
    autotune_cache.{h,cpp}             -- text-format cache I/O, with
                                          LoadFromString hook for the
                                          built-in
    builtin_autotune_cache.h           -- declares the embedded constant
    builtin_autotune_cache.cpp.in      -- configure_file template
    autotune.{h,cpp}                   -- public API: RunAutotune,
                                          GetKernelConfigForDevice
                                          (override + builtin lookup)

Files modified:
  src/core/gpu_kernel/CMakeLists.txt   -- builds the new sources;
                                          configure_file embeds the
                                          cache into the binary
  src/core/gpu_kernel/kernels.cu       -- compute_gpu_resources_and_launch
                                          calls GetKernelConfigForDevice
                                          for every launch
  src/main.cpp                         -- --autotune CLI flag
  src/python/CMakeLists.txt            -- pyscamp links gpu_utils +
                                          CUDA::cudart when CUDA is on
  src/python/SCAMP_python.cpp          -- pyscamp.autotune() binding

Today only the compile-time-default KernelConfig is template-instantiated
in the kernel, so IsSupportedKernelConfig() rejects any other tuple and
the runtime silently falls through to the default. The infrastructure
is in place for follow-up PRs to add alternative (blocksz, tile_height)
kernel variants, at which point the cache becomes load-bearing for
end users on tuned devices.

Verified:
  - CLI --autotune on RTX 3080 writes a 15-entry override file and
    prints the developer workflow instructions.
  - pyscamp.autotune() does the same from Python.
  - SCAMP_AUTOTUNE_CACHE=/tmp/empty_cache + GPU 1NN_INDEX self-join still
    matches distance_matrix_np exactly (falls through to default).
  - Full test/test_pyscamp.py suite passes on GPU.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Eigen 5.0.1 (released Nov 2025) is the first stable release with
~1474 commits of fixes since 3.4.0, including many that matter for
device-side use: added EIGEN_DEVICE_FUNC annotations, NVCC build
fixes for CUDA 10+, MSVC+NVCC pragma fixes, and bug fixes for inverse
evaluators / tridiagonalization / arg() under CUDA. This unblocks
using Eigen expressions in .cu code without a long tail of workarounds.

The CMake target name (Eigen3::Eigen) is unchanged; downstream link
lines are unaffected. Existing CPU-kernel Eigen usage (Eigen::Array
+ Eigen::Map of fixed/dynamic 1-D arrays, plus EIGEN_MAX_ALIGN_BYTES)
is API-compatible. Verified by building the full tree + running a
self-join smoke test on both CPU and GPU paths with bit-identical
output between them.

Note: in Eigen 5 the version macros are EIGEN_WORLD_VERSION=3,
EIGEN_MAJOR_VERSION=5, EIGEN_MINOR_VERSION=0 (the WORLD field is
frozen at 3 for source-compat with downstream version checks). CMake
package version is reported as 5.0.1, so find_package(Eigen3 5.0.0 ...)
is the right gate.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the build-system plumbing needed for the follow-up GPU kernel
rewrite to start using Eigen expressions in .cu code:

  - --expt-relaxed-constexpr added to CMAKE_CUDA_FLAGS so nvcc can
    call Eigen's constexpr host functions from __device__ contexts
    without warnings.
  - gpu_kernels now links Eigen3::Eigen. The target is defined by
    src/core/cpu_kernel/CMakeLists.txt, which the parent
    src/core/CMakeLists.txt processes before this subdirectory.
  - A throwaway __global__ smoke test in kernels.cu instantiates
    Eigen::Array<float, 4, 1> with arithmetic + reduction inside
    device code. It is never launched at runtime and will be removed
    when the real Eigen rewrite replaces it. Verified to compile for
    sm_86 on CUDA 12.2; full SCAMP build + CPU/GPU self-join produce
    bit-identical output.

No behavior change: the dispatcher, kernel templates, and runtime
paths are untouched; only the toolchain wiring is in place so the
next commit can begin replacing hand-unrolled accumulator arrays
with Eigen::Array.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… sliding-window inner loop

Heavy rewrite of the per-thread GPU compute path, drawing structure
from the earlier use-eigen-in-gpu-kernels experiment branch. The
changes are architectural: the kernel still computes the same output
(bit-identical to master on 1NN / 1NN_INDEX / SUM_THRESH self-joins
on RTX 3080 over randomlist128K, full pyscamp test suite passes
including small-m m=3/m=4 and MATRIX_SUMMARY / KNN / AB-join paths).

Highlights:

  - SCAMPSmem becomes templated on (tile_width, tile_height) and
    exposes each smem region as Eigen::Map<Eigen::Array<T, N, 1>>,
    so callees can write smem.df_col.segment<unrolled_diags>(offset)
    instead of hand-unrolled raw-pointer loads. The ctor uses
    placement new on nullptr-initialized Maps to re-seat them at
    the per-region smem offsets (Eigen::Map's only public way to
    rebind a pointer at construction time).

  - SCAMPSmem gains a `using DataType = DATA_TYPE` typedef so
    do_iteration_fast can declare its register-window arrays with
    the smem's scalar type, not the cov accumulator's. This matters
    for PRECISION_MIXED (cov is double, columns are float).

  - SCAMPThreadInfo.cov1..cov4 collapse to a single
    Eigen::Array<ACCUM_TYPE, DIAGS_PER_THREAD, 1> field. cov
    indexing in do_row / do_row_edge is now cov[i] instead of
    covN, paving the way for changing DIAGS_PER_THREAD without
    a kernel rewrite.

  - kernels_compute.h drops ~500 lines: the five per-profile-type
    overloads of update_row / merge_to_column / update_cols /
    reduce_edge / reduce_row collapse into single if-constexpr
    templates with Eigen::ArrayBase<Derived> parameters. Same for
    kernels_smem.h (init_smem, write_back).

  - New for_<N> constexpr-loop helper (std::index_sequence + lambda
    taking auto i) lets compile-time loop indices flow into
    template arguments cleanly, replacing #pragma unroll blocks for
    cases where the body uses the index as a template parameter.

  - do_iteration_fast now processes outer_unrolled_rows rows per
    call (vs DIAGS_PER_THREAD=4 in master) using a sliding column
    register window of width inner_unrolled_cols, refilling from
    smem after each inner row-batch. New compile-time knobs in
    kernel_constants.h: DIAGS_PER_THREAD=2, unrolled_rows=2,
    outer_unrolled_rows=16, inner_unrolled_cols=3, unrolled_cols=17,
    KERNEL_TILE_HEIGHT=256. These will be the variants the autotuner
    benchmarks once dispatch lands.

  - do_row_edge ports to Eigen-array form but keeps element-wise
    scalar math for the cov*inormc*inormr compute, since Eigen 5
    doesn't auto-promote across Arrays of different scalar types
    in cwise ops and PRECISION_MIXED needs the cross-type math.

  - PRECISION_MIXED preserved (the eigen branch dropped it).

  - Small-m fix (num_diags = args.n_x - args.exclusion_upper + 1)
    preserved (the eigen branch predates it).

  - kernels.cu smoke-test kernel removed; production paths now
    exercise Eigen device-side.

  - --expt-relaxed-constexpr already added to nvcc flags by the
    earlier scaffold commit. gpu_utils target now also links
    Eigen3::Eigen (kernel_gpu_utils.h declares Eigen::Map fields).

Perf on RTX 3080, randomlist1M window=200, 3 runs each:

  precision   master    eigen-port   delta
  double       25.7s     31.7s       +23%
  mixed        50.7s     56.4s       +11%
  single        5.3s      6.2s       +17%

This regression is expected and acknowledged. The eigen-branch
parameter values (DIAGS_PER_THREAD=2, outer_unrolled_rows=16, etc.)
were never benchmarked against master; they're a WIP starting
point. The follow-up "make the autotune cache load-bearing" PR
will pre-instantiate multiple (DIAGS_PER_THREAD, tile_height,
blocks_per_sm) variants and let RunAutotune benchmark + select the
fastest per device. End-user perf will recover (and likely exceed
master) once the autotuner can pick a config tuned for the device.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…IGHBORS

The SCAMPProfileType enum is not contiguous: PROFILE_TYPE_1NN=6,
PROFILE_TYPE_APPROX_ALL_NEIGHBORS=7, PROFILE_TYPE_MATRIX_SUMMARY=8 all
sit above the prior PROFILE_TYPE_1NN_MULTIDIM=5 that was the loop's
upper bound. Any cache entry naming one of those three profile types
failed to parse, throwing SCAMPException("Unknown profile type ...")
from AutotuneCache::Load(); GetKernelConfigForDevice silently caught
it and returned the compile-time default, defeating the cache lookup.

Existed since 4daa21a; only showed up now because the autotuner only
writes a record for profile types that ParseProfileTypeName can also
read back, so round-tripping its own output worked, but any
hand-edited cache entry for the three affected types was silently
ignored. Bumping the loop to PROFILE_TYPE_MATRIX_SUMMARY covers
every defined enum value; out-of-range integers fall through to
ProfileTypeName's default "INVALID" branch and harmlessly never
match user input.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…itch

Makes the autotune cache load-bearing at launch time. Previously
GetKernelConfigForDevice's return value was discarded (kernels.cu
had a literal "(void)cfg" comment) because LaunchDoTile only ever
called one pre-instantiated kernel variant. Now the cache result is
threaded through and the kernel actually dispatches to one of the
enumerated launch-geometry variants.

Mechanism:
  - kKernelVariants[] in kernel_config.cpp enumerates the supported
    (tile_height, blocks_per_sm) tuples. Index 0 is the canonical
    default; index 1 is the first alt (128, 4).
  - LaunchDoTile becomes a cfg-switching shim that calls the right
    pre-instantiated LaunchDoTileWithGeometry<tile_height, bps>.
    Each kernel_<X>.cu now instantiates both variants per
    (precision x row/col mode), doubling the do_tile template
    instantiation count from 45 to 90 (still parallelizable
    across the 5 per-profile .cu files).
  - IsSupportedKernelConfig walks the variant table instead of
    matching only the default, so cache entries naming any
    enumerated variant are now honored.
  - kernels.cu's compute_gpu_resources_and_launch threads cfg
    through to each LaunchKernel_<X>; blocksz now comes from
    cfg.blocksz, and get_smem takes cfg.tile_height so smem
    sizing matches the variant.
  - The launch-info print (gated by silent_mode) now includes
    tile_height + blocks_per_sm so users can see which variant
    fired without rebuilding.

The default config is still variant 0, so behavior is unchanged
for any cache that names the default or has no entry. Verified
with the existing pyscamp test suite (all pass) and a manual
test cache injecting variant 1 (128, 4) entries for every
(profile_type, precision) tuple, where output remains
bit-identical to the default-variant output on the 3080.

What's still missing (follow-up commits): RunAutotune today still
writes the default config for every (profile, precision) instead of
benchmarking each variant and picking the winner. And the variant
axes are limited to (tile_height, blocks_per_sm); the Eigen port
made diags_per_thread / unrolled_rows / outer_unrolled_rows /
kernel_tile_iters parameterizable too, but exposing them adds a
template-arg axis to do_tile/SCAMPThreadInfo that ripples further.
Both extensions land on top of this scaffold without redesign.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…olled_rows, kernel_tile_iters

The 2-axis (tile_height, blocks_per_sm) variant table was a placeholder
-- post-Eigen-port the inner-loop shape knobs are all
template-friendly, and DPT is the most consequential one for register
pressure / occupancy. Adds them as variant axes:

KernelConfig fields (replacing the redundant tile_height with the four
underlying knobs):
  - blocksz, blocks_per_sm                (as before; blocksz precision-tied)
  - diags_per_thread                      (cov array width; smem column stride)
  - unrolled_rows                         (inner row-batch in do_iteration_fast)
  - outer_unrolled_rows                   (rows per do_iteration_fast call)
  - kernel_tile_iters                     (do_iteration_fast calls per tile)
  - tile_height()                         (derived: kti * our)

Where the changes ripple:
  - SCAMPThreadInfo<ACCUM_TYPE, DiagsPerThread>: the cov accumulator's
    Eigen::Array compile-time size now comes from the template arg.
  - Every compute template (do_iteration_fast, do_row, do_row_edge,
    merge_to_row/column, update_rows/cols, reduce_row/edge) gains the
    inner-loop knob template params (DiagsPerThread on all; UnrolledRows
    + OuterUnrolledRows on do_iteration_fast). The derived constants
    inner_unrolled_cols = DPT + UR - 1, unrolled_cols = DPT + OUR - 1
    become local `constexpr int` inside do_iteration_fast.
  - do_tile gains 4 new int template params; tile_height = KTI * OUR
    derived inside.
  - LaunchDoTileWithGeometry takes the full 5-tuple and dispatches the
    9 precision x row/col instantiations via macro (replaces ~100 lines
    of duplicated switch arms).
  - LaunchDoTile picks among pre-instantiated geometries via a
    branched VARIANT_BRANCH macro keyed on the full 5-tuple.
  - get_smem takes diags_per_thread + tile_height args; GetTileHeight
    deleted (replaced by KernelConfig::tile_height()).
  - Host-side num_workers uses cfg.diags_per_thread.

Cache file format kept on V1 (no shipped production cache to migrate),
but the field count went 6 -> 9: blocksz | blocks_per_sm |
diags_per_thread | unrolled_rows | outer_unrolled_rows |
kernel_tile_iters. Stale 6-field caches trip SplitN, throw inside
GetKernelConfigForDevice, and are caught + fall through to the default
(same as any other parse error today). Re-running RunAutotune rewrites
in the new format.

Variant table seeded with 2 entries to validate the wider dispatch:
  v0: bps=2 dpt=2 ur=2 our=16 kti=16    (tile_height=256, current default)
  v1: bps=2 dpt=4 ur=2 our=4  kti=50    (tile_height=200, master-like DPT=4)

Verified on RTX 3080, randomlist1M, window=200, DOUBLE:
  variant 0: ~39.7s
  variant 1: ~31.9s   (20% faster, bit-identical output)
That perf delta is real proof the dispatch picks the right kernel --
identical-output alone wouldn't have shown it. Master baseline before
the Eigen port was ~25.7s on the same input; the autotuner can now at
least narrow the gap by picking variant 1 on this device, and further
gain is a matter of expanding kKernelVariants (which is the next
commit's job once the benchmark loop is in place).

Full pyscamp test suite still passes (1NN_INDEX, 1NN, SUM_THRESH,
MATRIX_SUMMARY, APPROX_ALL_NEIGHBORS, KNN, small-m m=3/m=4, self/ab
joins). CPU vs GPU bit-identical on randomlist128K self-join.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…XED, RTX 3080 cache

Five related changes that together turn the autotune cache from an unused
placeholder into a real per-device perf knob. Bundled into one commit
because splitting them would break the build mid-history.

1. Benchmark loop (autotune.{h,cpp}, autotune_bench.{h,cpp})
   - RunAutotuneWithBenchmark: for each (profile, precision) tuple, time
     every variant in kKernelVariants via a BenchmarkFn callback, write
     the fastest to the cache.
   - DefaultBenchmarkVariant: 65K x window=200 synthetic self-join, runs
     1 warmup + 3 timed runs, returns the min.
   - SetKernelConfigOverride / ClearKernelConfigOverride: thread-local
     override that GetKernelConfigForDevice consults first; the
     benchmark loop sets it around each timed do_SCAMP call to force the
     variant under test.
   - autotune_bench is a separate library so the dep on scamp_op stays
     out of gpu_utils (avoiding a scamp_op -> tile -> gpu_kernels ->
     gpu_utils -> scamp_op cycle). main.cpp links it for --autotune;
     legacy RunAutotune is preserved for pyscamp which hasn't been
     re-wired yet.

2. Per-(profile, variant) compile split
   - kernel_variant.cu.in: configure_file template. CMake foreach
     generates kernel_<profile>_v<N>.cu (5 profiles x 6 variants = 30
     TUs) each instantiating do_tile<...> for one variant geometry
     across 2 precisions x 3 row/col modes = 6 do_tile bodies.
   - kernels_variants.h: declarations for all 30 LaunchVariant_<X>_vN
     entry points + SCAMP_VARIANT_DISPATCH macro (the cfg switch).
   - kernel_<profile>.cu reduces to a one-liner that expands the macro.
     LaunchDoTile (the cfg switch as a template) is removed from
     kernels_impl.h.
   - Before: each kernel_<profile>.cu compiled all 6 variants x 6
     instantiations = 36 do_tile bodies serially. After: 30 small TUs
     compile in parallel under -j20+.

3. 6-variant kKernelVariants table (kernel_config.cpp + kernels_variants.h
   SCAMP_VARIANT_DISPATCH)
   - v0: bps=2 dpt=2 ur=2 our=16 kti=16  (tile=256, eigen-port default)
   - v1: bps=2 dpt=4 ur=2 our=4  kti=50  (tile=200, DPT=4 master-like)
   - v2: bps=2 dpt=4 ur=4 our=4  kti=50  (tile=200, matches master's
                                          4x4 hand-unroll exactly)
   - v3: bps=4 dpt=2 ur=2 our=8  kti=16  (tile=128, higher occupancy)
   - v4: bps=2 dpt=2 ur=2 our=8  kti=32  (tile=256, smaller outer-unroll)
   - v5: bps=1 dpt=4 ur=4 our=16 kti=16  (tile=256, low occupancy)

4. Drop PRECISION_MIXED
   - --mixed_precision CLI flag removed; pyscamp "mixed" kwarg removed;
     GetPrecisionType signature shrunk to (ultra, double, single).
   - kAutotuneTargets drops MIXED entries (15 -> 10 tuples).
   - LaunchDoTileWithGeometry's LAUNCH_FOR_ROWCOL_MODE macro drops the
     MIXED case.
   - SCAMPArgs::validate rejects PRECISION_MIXED with a clear error.
   - The enum value is kept in common.h for proto / distributed wire
     compat; the only code path that produced it is the now-removed CLI
     flag. Rationale: MIXED was uniformly slower than DOUBLE in practice
     (kept DP's accumulator cost without halving smem footprint), so no
     workload prefers it.
   - Stale MIXED-related comments in kernels_compute.h updated.

5. RTX 3080 (sm_86) entries baked into data/autotune_cache.txt
   - 10 lines, one per (profile, precision). Generated by `SCAMP
     --autotune` on this branch. Embedded into the binary via
     configure_file so conda-forge / pip-wheel users on an RTX 3080
     get tuned configs without rebuilding.

End-to-end on RTX 3080, randomlist1M, window=200, DOUBLE self-join, GPU:
  pre-Eigen-port baseline:   ~25.7s
  Eigen port (default cfg):  ~36.7s   (+43%, the regression we incurred
                                       to land the new architecture)
  autotuned (this commit):   ~25.3s   (parity with baseline)

So the autotune mechanism + 6-variant table closes the perf gap the
Eigen port opened. Future variant additions can extend further.

Full pyscamp test suite passes (1NN, 1NN_INDEX, SUM, MATRIX_SUMMARY
self/AB joins; small-m m=3/m=4; KNN self/AB), bit-identical CPU vs GPU
on randomlist128K.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…runtime

Before this change, do_iteration_fast used constexpr for_<N>(lambda)
loops that template-specialized do_row / merge_to_column / merge_to_row
/ update_cols / update_rows / reduce_edge for every (outer_row_iter,
row_iter, start_index, iter) value. With OuterUnrolledRows=16 (variants
v0 and v5) that meant 16 distinct do_row instantiations per
(profile, precision, row/col mode) tuple, which made those variants
~1.7x more expensive to compile than OUR=4 variants.

The template args were all only used as runtime offsets into Eigen
expressions like best_so_far.segment<unrolled_diags>(row_iter) -- the
*size* N is compile-time, but the *offset* can be runtime. So:

  - merge_to_row, update_rows, merge_to_column, update_cols, do_row,
    reduce_edge drop their index-only template parameters and take
    them as function args. The size-typed template parameters
    (num_to_update, DiagsPerThread, etc.) stay compile-time because
    they drive Eigen::Array sizes + #pragma unroll bounds.
  - The for_<OUR/UR>([&](auto j){...}) outer loop in do_iteration_fast,
    the for_<UR>([&](auto k){...}) inner loop, and the for_<DPT> loop
    in do_row_edge become plain #pragma-unrolled for loops. nvcc still
    fully unrolls them (compile-time-known trip counts) and folds j/k
    constants per iteration after unroll, so emitted PTX is unchanged
    on the deterministic-comparison cases (1M random self-join GPU
    output is bit-identical, 25.2s same as before).

Per-variant compile-time delta (mean across 5 profiles, 5-way parallel):

  variant     before    after     delta
  v0 OUR=16   112.8s    66.2s    -41%
  v1 OUR=4     72.7s    60.4s    -17%
  v2 OUR=4     64.8s    58.0s    -10%
  v3 OUR=8     86.6s    60.4s    -30%
  v4 OUR=8     86.2s    60.1s    -30%
  v5 OUR=16   113.7s    71.0s    -38%

Per-variant cost variance collapsed from 1.75x spread (65-114s) to
1.22x (58-71s). 5-way parallel wall-clock bottleneck dropped from 114s
to ~74s (-35%). Adding more variants (or variants with OUR>16) is now
much cheaper than before.

Correctness: pyscamp test suite passes 1NN_INDEX self/AB, SUM self/AB,
Matrix Summary self, m=3 and m=4 small-window, KNN self/AB on every
run. Matrix Summary AB join has a pre-existing ~20% flake rate from
np.allclose tolerance interacting with the atomic-merge ordering on
the transposed tile path -- unchanged by this refactor (the atomic
ops and FMA chains are byte-identical).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…counter

The Eigen-port (687a70b) refactor of write_back passed smem.profile_{a,b}_length
into write_back_value where the pre-Eigen code passed args.profile_{a,b}_length.
smem.profile_{a,b}_length is the smem-cached copy of the global counter that
do_tile's NeedsCheckIfDone early-exit consults; the match-output atomic-add
position must come from the *global* counter, which is what
Profile::CopyFromDevice reads back to size the match_value_unordered vector on
the host. With the atomic firing into smem instead, the global counter stayed at
zero, host saw length=0, and KNN/APPROX_ALL_NEIGHBORS returned an empty profile.

This only surfaced once the autotune cache could resolve APPROX_ALL_NEIGHBORS
entries (b471cdc); before that, the parser silently failed and the fallback
config happened to still produce zero matches via the same bug -- masked
because GetKernelConfigForDevice was returning the default cfg either way.
Test coverage gap: 687a70b's commit message claimed "KNN passes" but apparently
the run was against the old conda-installed pyscamp build rather than the
fresh tree.

Validation on RTX 3080:
  - test_pyscamp.py: all 11 cases pass (1NN/SUM/Matrix/KNN, self+ab, small-m)
  - run_tests.py --executable pyscamp: 2072/2072 tests pass

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…bench

The Eigen Map<Array<T,N,1>>::segment<>() pattern that the Eigen port introduced
compiles, under nvcc device codegen, to N scalar ld.shared.{f32,f64} per
segment -- the packet path is mostly inactive on the device side. The pre-Eigen
master used raw reinterpret_cast<float4*> to get ld.shared.v4.f32. This commit
reclaims that pattern via a vec_load<N, AlignBytes, T> helper that picks the
widest aligned PTX vector load supported by the (T, AlignBytes) tuple and
recursively peels the tail.

do_iteration_fast is wired to vec_load at three sites:
  - initial column window load: aligned to DPT * sizeof(T) (info.local_col
    is threadIdx.x * DPT).
  - row data load: aligned to UR * sizeof(T) (info.local_row is a multiple
    of OUR, OUR is a multiple of UR, j*UR is a multiple of UR).
  - sliding-window refill: inner_unrolled_cols padded from DPT+UR-1 to
    DPT+UR so the refill offset lands at local_col + j*UR + DPT, which is
    UR-aligned for every variant (all current variants satisfy DPT % UR == 0,
    enforced by static_assert). The unread trailing slot costs 3 registers
    per thread (dfc/dgc/inormc).

PTX inspection on RTX 3080 confirms ld.shared.v4.f32 / v2.f32 / v2.f64
emitted at the three sites. Variant 5 (OUR=16, KTI=16, tile_height=256)
for SP self-join needs ~50KB of dynamic smem, exceeding the sm_8.x default
48KB per-block cap; LAUNCH_PRECISION now wraps the launch with
cudaFuncSetAttribute(cudaFuncAttributeMaxDynamicSharedMemorySize, smem)
when smem > 48KB. sm_86 max is ~100KB so v5 fits.

Caught a separate latent autotune bug while wiring the smem opt-in: when a
variant kernel failed to launch (e.g. v5 SP self-join exceeding smem cap
before the opt-in), do_SCAMP swallowed the SCAMP_CUDA_ERROR return value
and the autotune benchmark recorded the fast "failure" path as the winning
time, writing a broken variant to the cache. TimeOneRun now calls
cudaDeviceSynchronize + cudaGetLastError after do_SCAMP and throws if
either reports a failure, so the RunAutotuneWithBenchmark catch path
records that variant as +inf time instead.

Re-autotuned data/autotune_cache.txt for RTX 3080 with the new code path
active. DP cells are within ~3% of each other (FP64 throughput-bound on
sm_86 at 1/64 DP rate), so the choice there is in the noise floor; SP
cells show 5-15% spread between variants.

Validation:
  - test_pyscamp.py: 11/11 pass
  - run_tests.py --executable pyscamp: 2072/2072 pass

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Prototype of a no-smem-column-buffer kernel that uses warp shuffles to
walk cov across lanes, freeing ~30 KB DP / 15 KB SP of smem for higher
occupancy on smem-bound configurations.

Algorithm (per warp, per row):
  - Lane T owns a FIXED DPT-wide column slice. Ownership rolls over every
    32*DPT rows via update_info_shfl, which reads the next 32*DPT-block
    directly from global memory (no smem column buffer).
  - cov[i] at row r tracks cov(r, lane_T_col + i) on diagonal
    lane_T_col + i - r.
  - Per row: compute dist + update cov in place (cov(r, c) -> cov(r+1, c+1)).
  - Per row: shift cov[i] right within the lane and shuffle in
    lane (T-1)'s post-update cov[DPT-1] into cov[0].

Cross-warp cov hand-off (the bit the earlier eigen-gpu-cov-shuffle
prototype got wrong):
  - Lane 31 of warp k publishes its post-update cov[DPT-1] into a tiny
    smem hand-off region (2 * warps_per_block scalars, double-buffered).
  - Lane 0 of warp k > 0 reads warp k-1's published value.
  - One __syncthreads() per row at end-of-row, ordered so the read
    happens-before the write into the same slot (no race).
  - Lane 0 of warp 0 has no predecessor; its cov[0] is junk after row 0
    and is masked from distc/distr updates via the slot_valid check.

Smem layout (SCAMPShflSmem): row data + profile data + cov_handoff
region. NO df_col/dg_col/inorm_col regions. ~6.7 KB total for DP var 6
vs ~33 KB for sliding-window v4.

Variant table: variant 6 entry (bps=2, dpt=2, ur=0, our=8, kti=8) with
ur=0 as sentinel for "shfl algorithm." CMakeLists routes ur=0 tuples
through kernel_variant_shfl.cu.in instead of kernel_variant.cu.in.

State of correctness:
  - SUM_THRESH self-join + AB-join PASS end-to-end (cleanest signal: the
    warp-reduce-then-atomicAdd path is robust to bugs that would surface
    elsewhere).
  - 1NN_INDEX / MATRIX_SUMMARY / KNN FAIL. Two likely root causes:
    (a) Slow path (do_row_edge equivalent) is a no-op placeholder. Tail
        tiles of each meta-diag produce NaN entries in the column
        profile (~87 NaN of 1801 at n_x=2000).
    (b) Possible off-by-one in merge_to_column_shfl's idxc assignment
        or in the slot_valid mask. Atomic-max amplifies indexing bugs
        that atomic-add masks; SUM_THRESH passing rules out cov drift
        in the bulk.

State of dispatch:
  - TEMPORARY: v0..v5 commented out in CMakeLists.txt SCAMP_VARIANT_TUPLES
    and in kernels_variants.h SCAMP_VARIANT_DISPATCH so the build is
    fast during shfl-variant development. ONLY variant 6 is instantiated.
    Force the cache to (2,2,0,8,8) to exercise it. Re-enable v0..v5
    before merge.

Smem footprint comparison on RTX 3080 (sm_86) for the 4 enabled profile
types:
  variant     smem (DP)   smem (SP)
  v4 (slide)    32.8 KB     17.2 KB
  v6 (shfl)      6.8 KB      4.5 KB

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
scripts/cov_shuffle_diagram.py renders the per-row state of the design-A
variants (variant 6 cross-warp via smem hand-off; variant 8 per-warp-
independent) so you can read off where covariance values flow between
threads.

Each frame for one row shows:
  - the BEFORE state: which block-local diagonal each lane's cov[i]
    slot is currently tracking;
  - the smem cov_handoff region (variant 6 only): what lane 31 of each
    warp writes and what lane 0 reads;
  - the AFTER state: where each value moved to via the intra-warp
    shuffle + cross-warp override, with masked-but-still-present junk
    cells shown in parens.

Configurable BLOCKSZ, DPT, warp size, row count, variant. Defaults are
tiny (BLOCKSZ=8 WARP=4 DPT=2 rows=3) so the diagrams fit on one screen;
pass --blocksz 256 --warp 32 for realistic hardware geometry (verbose).

Side-by-side, the two variants make the design tradeoff obvious:
  variant 6 -- one junk source (lane 0 of warp 0), needs smem + 1 sync
               per row, but only warp 0's lane 0 produces masked cells.
  variant 8 -- junk at every warp boundary, no smem or sync, but every
               warp's lane 0 progressively masks more slots as r grows.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Alignment fix: compute the cell width once across all frames (including
masked '(N)' parenthesized variants AND lane headers), then apply the
same right-justified width to every cell. Previously narrow numerics
like ' 1' were padded inconsistently with wide variants like '(15)'.

Waste table: replace the single-tile_height waste line with a sweep
across tile_heights spanning [DPT, 4*32*DPT], shown for both the
DEMO geometry (whatever --blocksz the user picked) AND the realistic
kernel geometries (BLOCKSZ=256 for DP, 512 for SP, warp=32).

The tables make the var 6 vs var 8 tradeoff concrete. For BLOCKSZ=256
DPT=2 (the v6 setting in this branch):

  tile_height   var 6 wasted   var 8 wasted   8/6 ratio
            2          0.10%          0.78%       8.0x
            4          0.29%          2.34%       8.0x
           16          1.46%         11.72%       8.0x
           32          3.03%         24.22%       8.0x
           64          6.15%         49.22%       8.0x   <-- our v6 setting
          128         12.40%         74.61%       6.0x
          256         24.90%         87.30%       3.5x

So variant 6 wastes 6.15% at tile_height=64; variant 8 at the SAME
tile_height wastes 49.22% (8x more). This was the previously-overlooked
cost of dropping the cross-warp hand-off -- per-warp-independent only
wins at small tile_height (<~16), where the rotation amortization is
also worse. Earlier I'd ballparked variant 8 waste at ~1.5%, which was
only counting the initial junk slot, not its propagation through the
warp -- the table now shows the propagated cost too.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…-of-bounds local_mp_col writes, eliminate KNN duplicate matches by re-initializing column registers, and implement CMAKE_CUDA_ARCHITECTURES build escape hatch.
Kernel/variant changes:
- Add v7 (shfl bps=8 DPT=4 our=8 kti=16 -> tile_height=128, max for DPT=4)
  and v8 (shfl bps=4 DPT=8 our=8 kti=32 -> tile_height=256). v8 is the
  first DPT=8 variant; combined with the bigger tile it wins 3 of the
  10 (profile, precision) autotune targets at 512K bench size,
  including 1NN_INDEX SP and both SUM_THRESH cases. Recommended default
  on RTX 3080 is now v8 bsz=128 (geomean 1.224x of per-target best,
  worst-case 2.04x).
- Fix do_update_info_shfl dispatcher: it only branched on
  updates_remaining = 0..3, so DPT >= 5 silently skipped slot rotations
  for slots 0..(DPT-5). That eventually de-synced state.global_col
  against tile_start_col and crashed in flush_all_cols_to_smem with a
  smem OOB. Extend ladder to 0..7 + static_assert(DPT <= 8).
- Sliding-window LaunchDoTileWithGeometry now dispatches do_tile on a
  runtime blocksz (64/128/256/512) the same way LaunchDoTileShfl
  already did, so the autotuner can sweep blocksz for those variants
  too.

Autotune machinery:
- g_cfg_override was thread_local; do_SCAMP spawns workers via
  std::async so they never saw it -- every trial silently fell back to
  the on-disk cache, flattening per-variant timings. Change to a
  process-wide std::mutex-guarded optional. This is the fix that
  unblocked all of the per-variant differentiation below.
- RunAutotuneWithBenchmark now sweeps (variant geometry x blocksz),
  prints a progress line per trial with elapsed/ETA, identifies the
  winner by variant name (v0..v8), and emits a cross-target weighted
  score table -- geomean of time/per-target-best across all 10 targets,
  also reports worst-case ratio. Recommends a "best safe default"
  trial at the end.
- DefaultBenchmarkVariant uses computing_rows=true so the workload is
  a real symmetric self-join (was upper-triangle only, hiding the cost
  asymmetry); max_tile_size bumped to 512000 to match the GPU default
  in main.cpp; benchmark input length is overridable by
  SCAMP_AUTOTUNE_INPUT_LENGTH env var (default 131072 for fast dev
  sweeps -- 512K via the env gives results that match production
  scale better).
- IsSupportedKernelConfig admits all of {64, 128, 256, 512} blocksz
  for every enabled variant, not just the precision-tied default.

Built-in autotune cache:
- Refresh data/autotune_cache.txt with the RTX 3080 winners from the
  new sweep. SP entries change substantially -- v8 (DPT=8 shfl) wins
  1NN_INDEX SP + both SUM_THRESH cases, v5 (bps=1 tile=256) wins the
  rest of the SP targets. DP entries shift to bsz=64.

Notes:
- v2 / v5 re-enabled alongside v0/v1/v3/v4/v6 so the score sweep covers
  the full set; only v8 is the brand-new shape, the rest were already
  there.
- SCAMPShflState comment cleaned up to reflect that dfc2/dgc2/inormc2
  staging is gone (read directly from global at rotation now).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two related cache bugs that meant the new shfl variant defaults never
actually made it to end users:

1. data/autotune_cache.txt is embedded into the binary via CMake
   file(READ) + configure_file. file(READ) doesn't track the source
   path as a dependency, so editing the cache and running
   `cmake --build` would silently leave the previous contents baked
   into the binary -- the user updates the cache, rebuilds, and the
   runtime lookup still misses their new entries. Add
   CMAKE_CONFIGURE_DEPENDS for the cache file so changes auto-trigger
   a reconfigure. Verified via `strings build/SCAMP | grep RTX_3080`
   before/after.

2. pyscamp.autotune() called SCAMP::RunAutotune (the older stub) which
   just writes the compile-time default cfg for every (profile,
   precision) target -- no benchmarking. The user-cache write then
   clobbers the built-in cache lookup, so a conda-forge user who ran
   pyscamp.autotune() ended up WORSE off than before they called it
   (the shipped data/autotune_cache.txt entries got masked by the
   stub-default cache). Switch the Python binding to
   RunAutotuneWithBenchmark + DefaultBenchmarkVariant -- same path as
   the CLI's --autotune -- and link pyscamp against autotune_bench.
   Verified end-to-end: pyscamp.autotune() now runs the full sweep
   (~4 min for 9 variants x 4 blocksizes x 10 targets at 131K bench
   size), writes the benchmarked winners to the cache, and SCAMP at
   runtime picks them up (6/6 launches matched the cache entries).

Also default pyscamp.autotune() to device 0 only -- mirrors the CLI
behavior. Tuning every visible GPU when most multi-GPU boxes have
identical devices is wasted work. Callers can pass devices=[...] to
override.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the missing scaffolding around the autotune cache lookup so that
end users get a meaningful diagnostic when their device isn't in the
cache, and the cache + lookup logic is covered by lightweight C++
tests that don't require a GPU at runtime.

What's new for users:
- When SCAMP launches a kernel and finds no autotune entry for the
  current (device, profile, precision) in either the user override
  cache or the binary's built-in cache, a one-shot warning is emitted
  to stderr identifying the missing tuple and the compile-time-default
  config used as a fallback. The warning is deduplicated per tuple so
  a long-running process sees at most one line per tuple.
- SCAMP_AUTOTUNE_QUIET=1 suppresses the warning. pyscamp's module init
  sets this by default (with setenv overwrite=0 so an explicit user
  value still wins) -- notebook users don't get uninvited stderr; CLI
  users still do.
- Docs: new docs/source/autotune.rst covers the cache, lookup order,
  default path, --autotune workflow, the warning, env-var overrides,
  and troubleshooting; added to the toctree.

What's new for developers:
- Factored LookupKernelConfigForDeviceKey() out of
  GetKernelConfigForDevice() -- pure logic, no GPU dependency, takes
  cache pointers + device key + fallback as arguments. Production
  callers still use GetKernelConfigForDevice (which queries the real
  device and threads through the pure helper); tests drive the helper
  directly with synthetic device keys.
- New ResetAutotuneWarnings() for tests; the dedup set is process-
  global so without it later tests would silently no-op.
- test/cpp/test_autotune_cache.cpp: 15 unit tests covering the lookup
  chain (user hit, built-in hit, user-beats-built-in, unsupported
  variant falls through, multi-device key isolation), the cache-miss
  warning (fires once, one-shot per tuple, separately per
  (profile,precision)), the SCAMP_AUTOTUNE_QUIET env var (suppresses
  + respects falsey values), DefaultPath() resolution
  (SCAMP_AUTOTUNE_CACHE -> XDG_CACHE_HOME -> HOME/.cache), on-disk
  round-trip, and the malformed-cache throw.
- BUILD_SCAMP_TESTS=ON CMake option (default OFF) wires the test into
  ctest. Gated on CMAKE_CUDA_COMPILER for now since the only test
  module currently exercises GPU-side machinery (no CUDA *runtime*
  needed -- the test never touches a device -- but the toolchain has
  to be present to build gpu_utils).

Verified end-to-end:
- ctest passes 15/15.
- Real SCAMP run on a device in the built-in cache: no spurious
  warning.
- pyscamp module init sets SCAMP_AUTOTUNE_QUIET=1 (confirmed via
  libc.getenv from ctypes; Python's os.environ is a startup snapshot
  so doesn't show it, but the underlying env that the C++ warning
  code reads is correctly set).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The unit tests added in 1254f40 (test/cpp/test_autotune_cache.cpp) are
behind -DBUILD_SCAMP_TESTS=ON which CI wasn't passing, so they were
shipped but never run. Enable the flag + invoke ctest in all three
CUDA jobs:

- build-cuda-cli (Ubuntu/Windows x g++/clang++/cl): catches compiler
  matrix regressions in the cache parser / lookup code.
- build-cuda-versions (Ubuntu/Windows x CUDA 12.6/12.8/13.0): catches
  CUDA-toolkit-version regressions (rare but the test is cheap to
  run).
- build-and-test-cuda (self-hosted with a GPU): runs alongside the
  existing integration tests.

The unit tests require the CUDA toolchain at build time but never
touch a real device at runtime, so they run successfully on the build-
only runners with no NVIDIA driver. CMakeLists already gates the test
target on CMAKE_CUDA_COMPILER so the CPU-only CI jobs short-circuit
cleanly without any extra workflow changes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ts.h

Variant table churn (prune, rename, add) used to require touching three
files in lockstep:
  - SCAMP_VARIANT_TUPLES in CMakeLists.txt (the build-system list).
  - kVariants[] in kernel_config.cpp (the runtime table).
  - SCAMP_DECL_VARIANTS_FOR_PROFILE + SCAMP_VARIANT_DISPATCH in
    kernels_variants.h (the per-variant forward decls + dispatch ladder).
Drift between them turned into hard-to-spot bugs (e.g., a kernel TU
built but never called because its dispatch case was missing).

Make SCAMP_VARIANT_TUPLES the single source of truth and generate the
other two:

* The tuple list drops its explicit-index prefix -- variant N is the
  Nth entry in the list. Add/remove/reorder edits one place.
* CMake walks the list, building three @-substitution strings: the
  kVariants[] initializer body, the per-variant forward decl block, and
  the per-variant dispatch if-ladder body.
* configure_file emits kernel_variants_table.h (consumed by
  kernel_config.cpp) and kernels_variants.h (consumed by the per-profile
  .cu dispatchers) into the build dir.
* gpu_utils and gpu_kernels pick up CMAKE_CURRENT_BINARY_DIR on the
  include path so the generated headers resolve.

Also retire the four sliding-window/shfl variants the multi-device
autotune sweep showed never won anything: the pre-prune labels v1, v3,
v4, v7 are gone. Surviving variants are compacted to indices 0..4:

  pre-prune  new        geometry              role
  ---------  ---------  --------------------  ----------------------
  v0         v0         {2,2,2,16,16}         sliding-window default
  v2         v1         {2,4,4,4,50}          master-like 4x4
  v5         v2         {1,4,4,16,16}         big tile / low occ
  v6         v3         {8,4,0,8,8}           shfl DPT=4 tile=64
  v8         v4         {4,8,0,8,32}          shfl DPT=8 tile=256

3080 cache entries already used the kept variants; the only entry
that referenced a retired variant was P100 APPROX_ALL_NEIGHBORS
DOUBLE (was v3 pre-prune), substituted with v5 -- the runner-up on
the P100 cross-target score sweep, geometry preserves bsz=256.

Saves 4 x 5 profiles x 6 instantiations = ~120 do_tile template
emissions on a clean build.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
zpzim and others added 30 commits May 22, 2026 10:19
The autotune machinery requires QueryDeviceProps() to resolve the
device cache key, which in turn needs a CUDA-capable GPU. Our CI
machines (windows-latest, ubuntu-latest in the build-cuda-cli matrix)
have the CUDA toolchain but no device, so the bench-driven autotune
pathway was previously untestable in CI.

Refactor:
- Split RunAutotuneWithBenchmark into a CUDA-aware wrapper plus a new
  RunAutotuneWithBenchmarkForDeviceKey(device_key, device_id, bench,
  cache_path, verbose, print_banner) overload that takes the resolved
  device key directly. The wrapper calls QueryDeviceProps, prints the
  device-specific banner, then delegates. The inner overload prints a
  shorter device-key-only banner when print_banner is true (the wrapper
  passes false to avoid duplicating the header).
- No behavior change for production callers: SCAMP --autotune and
  pyscamp.autotune() go through the wrapper unchanged.

Test (test_autotune_cache.cpp):
- Add Test_AutotuneFullPipeline_FakeBench. Drives the full autotune
  sweep with a synthetic BenchmarkFn that returns 0.5s for one rigged
  (variant, blocksz) winner per (profile, precision) target and 1.5s
  for every other trial. After the call:
    * verifies the cache file exists at the requested path,
    * reloads it and asserts every target's recorded config matches the
      rigged winner's geometry tuple and blocksz,
    * exercises LookupKernelConfigForDeviceKey with the reloaded cache
      to confirm the lookup path returns the rigged winner end-to-end.
  Verifies the variant-sweep + per-target-winner + cache-write +
  cache-reload + lookup chain without touching a GPU.

Verification:
- Linux (gcc 11, build_sm86): 18/18 tests pass via ctest.
- Native Windows (VS2022 MSVC 14.30, CUDA 11.6, sm_86): same 18/18
  pass; the test compiles cleanly on MSVC.
- End-to-end on Windows with a real GPU (RTX 3080):
    * SCAMP --autotune wrote 10 entries to
      %LOCALAPPDATA%\scamp\autotune.txt
    * Subsequent SCAMP --print_debug_info launch picked the v3 shfl
      config (bps=8 dpt=4 our=8 kti=8 blocksz=64) from that cache for
      DP 1NN_INDEX without emitting the cache-miss warning.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Several places that document the autotune subsystem had drifted since
recent changes:

- The cache-path resolution was Linux-only in user-visible docs; the
  recent _WIN32 branch of AutotuneCache::DefaultPath (LOCALAPPDATA /
  USERPROFILE) wasn't reflected anywhere.
- The autotune.rst page quoted 9 variants and ~360 trials; we ship 5
  variants today and a full sweep is 200 trials (5 x 4 x 10).
- pyscamp.autotune()'s docstring claimed empty devices=[] means "every
  visible CUDA device is tuned" -- the implementation actually defaults
  to device 0 only (the rationale: a sweep takes O(minutes) and most
  multi-GPU hosts have identical devices).
- docs/pyscamp.py (the Python stub Sphinx reads for autodoc) was
  missing autotune() and gpu_supported() entirely, so neither showed up
  in the rendered pyscamp API page.

Updates:
- docs/source/autotune.rst: TL;DR, lookup chain, "Default cache
  location", "Clearing or resetting", and troubleshooting all mention
  the Windows path (%LOCALAPPDATA%\scamp\autotune.txt, with
  %USERPROFILE%\.cache fallback). Sweep size corrected to "5 variants
  enabled today -> 200 trials"; phrased so the count tracks future
  variant-table changes via the "current variant count" framing.
- docs/pyscamp.py: add autotune() and gpu_supported() to the
  autosummary and emit matching docstrings for the Sphinx stub build.
- src/python/SCAMP_python.cpp: rewrite the autotune() pbdoc to match
  the implementation (device 0 only by default; full per-platform
  cache-path resolution). Single backslashes inside R"pbdoc()" --
  pbdoc is a raw string, so \\scamp would render as \\scamp.
- src/main.cpp: --autotune --help text mentions all three resolution
  steps and the Windows path.
- src/core/gpu_kernel/{autotune.h,autotune_cache.h,autotune.cpp,
  builtin_autotune_cache.h}: doc-comments referencing the cache path
  now mention the platform-specific fallback (or refer readers to
  AutotuneCache::DefaultPath, which is the single source of truth).

No code paths changed; this is documentation only.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Branch-wide sweep ahead of merging to master: comments that narrated
intermediate states ("before this split", "now lives in", "pre-prune
label v6", "design-A"), values that were no longer used, and the
backward-compat RunAutotune stub that nothing on the branch still
calls. Each removal is mechanical -- no behavior change.

Header/comment refresh:
- kernel_variant.cu.in, kernels_impl.h, kernels_compute.h,
  autotune_cache.cpp, autotune.cpp, autotune.h: replace
  before-vs-after-split / before-vs-after-this-change framing with a
  plain description of what the code does today.
- kernels_compute_shfl.h, kernels_impl_shfl.h, kernels.cu,
  kernel_variant_shfl.cu.in, kernel_gpu_utils.h: drop "design-A"
  codename (used during the multi-design bake-off, not meaningful on
  master) in favor of "cov-shuffle" / "shfl".
- kernel_config.cpp: drop "an earlier iteration had 9 entries v0..v8"
  and "pre-prune label v6" history. Reframe GetDefaultKernelConfig so
  it picks the first shfl variant by iterating kVariants rather than
  hard-coding index 3 with a stale-on-renumber comment, and let
  IsSupportedKernelConfig iterate kVariants directly instead of a
  redundant hard-coded {0..4} list.
- docs/source/autotune.rst: drop the "RunAutotune stub vs
  RunAutotuneWithBenchmark" troubleshooting bullet (RunAutotune is
  gone; see below).
- src/core/gpu_kernel/CMakeLists.txt: rewrite the layout/build-table
  prologue, drop "we no longer carry explicit indices" and "same as
  before" framing, drop stale "5 profiles x 6 variants = 30 TUs"
  count (it's 5 x 5 = 25 today and the number tracks variant changes
  anyway).
- SCAMP_python.cpp: drop the "not the older RunAutotune stub"
  comment from the autotune() impl now that there is no older stub.

Dead code removed:
- SCAMP::RunAutotune(...) (the non-benchmarking thin wrapper that
  wrote the compile-time default to every target). Not called by
  anything since RunAutotuneWithBenchmark landed; the bench path is
  now the only entry point.
- DEFAULT_DIAGS_PER_THREAD, DEFAULT_UNROLLED_ROWS,
  DEFAULT_OUTER_UNROLLED_ROWS, DEFAULT_KERNEL_TILE_ITERS,
  DEFAULT_BLOCKSPERSM constants in kernel_constants.h. Variant
  geometries live in SCAMP_VARIANT_TUPLES; these duplicates had no
  callers and would silently drift if a future variant-0 edit
  happened in only one place.

Doc-comment correctness:
- autotune.h prologue + SetKernelConfigOverride doc say "process-wide"
  override, not "thread-local" -- matches the implementation, which
  uses a process-wide std::mutex-guarded optional so std::async
  workers see the override (a thread_local would not).
- autotune_cache.h schema example shows 9 fields (the actual record
  layout) rather than the 6-field draft form.

Verified: SCAMP rebuilds clean; 18/18 unit tests pass on Linux.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The cache file format already had graceful per-entry fall-through
built into the read path (a stale variant tuple loses to the next
cache source on lookup; a wrong-version header silently empties the
cache), but the upgrade contract wasn't explicit anywhere and the
two failure modes weren't all covered by unit tests. Adding both so
future kernel changes have a clear "do I need to wipe everyone's
cache?" decision tree and so a regression of the fall-through path
shows up in CI.

Upgrade contract (now spelled out in autotune_cache.h + the user
docs at docs/source/autotune.rst):

  - Add / remove / reorder variants in SCAMP_VARIANT_TUPLES:
    no version bump. Per-entry fall-through handles it -- the cache
    is keyed by (bps,dpt,ur,our,kti) tuple, not by index, so existing
    entries either still match (lookup hits) or fall through (lookup
    misses on that one tuple, the other cache entries are unaffected).

  - Schema or kernel-semantics change forces a wipe:
    bump kHeader (SCAMP_AUTOTUNE_V<N> -> V<N+1>). ParseStream silently
    treats a non-matching header as empty rather than throwing, so an
    end-user upgrading their pip wheel doesn't see SCAMPException --
    they just fall through to the new release's built-in cache and
    can re-tune at their leisure.

New tests in test/cpp/test_autotune_cache.cpp:

  - Test_FutureVersionHeaderSilentlyEmpties: SCAMP_AUTOTUNE_V99 header
    + a record line loads zero entries (not a throw), lookup hits the
    cache-miss warning. Covers the post-version-bump escape hatch.
  - Test_MissingHeaderSilentlyEmpties: a file with data lines but no
    header at all also loads zero entries.
  - Test_PartiallyStaleCachePreservesGoodEntries: a cache with one
    valid entry + two entries naming bogus variant tuples loads all
    three, but lookups hit only the valid one; the stale entries fall
    through to fallback + the one-shot warning. Covers the
    everyday "we changed the variant table" upgrade case.
  - Test_StaleUserCacheFallsThroughToBuiltin: a user cache with a
    future-version header (all entries silently dropped) plus a
    built-in cache with the same device key -- the built-in entry must
    win. Covers the worst-case "the upgrade wiped my tuned config but
    the shipped built-in still has me covered" path.

22/22 tests pass on both Linux (gcc 11) and native Windows
(VS2022 MSVC 14.30, CUDA 11.6, sm_86).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two unrelated doc fixes that came up together while preparing the
branch for review.

Env-var docs:
- environment.rst gains a "Environment variables" section that lists
  every env var SCAMP / pyscamp reads, build-time and run-time, with
  a one-line purpose for each. Before this commit the cache-related
  ones were documented only in autotune.rst and pyscamp.py, which
  meant they were discoverable only by someone already reading the
  autotune page. The distributed gRPC env vars
  (SCAMP_SERVER_SERVICE_HOST/PORT) weren't documented anywhere.
- pyscamp/intro.rst gains PYSCAMP_BUILD_TYPE and
  PYSCAMP_NO_PLATFORM_AUTOSELECT, two setup.py-honored env vars
  that previously had no user-facing documentation.

Autotune workload-size measurement (RTX 3080, full 10-target sweep):
  64K  -> ~4 min,  geomean 1.325, worst 2.25
  128K -> ~8 min,  geomean 1.308, worst 3.47   (current default)
  256K -> ~25 min, geomean 1.278, worst 2.77

The per-target winners themselves shift across sizes (e.g.
SUM_THRESH/DOUBLE picks v0 at 64K, v2 at 128K, v4 at 256K -- those
are different kernel geometries, not just different rankings), so a
short sweep doesn't just rank suboptimally for the cross-target
default, it picks suboptimal cache entries for individual targets.
256K is meaningfully tighter than the current 128K default.

Updates:
- docs/source/autotune.rst "Choosing the benchmark workload size"
  now has the comparison table above and concrete guidance: 128K is
  the deliberate casual-dev default; 256K or 512K is recommended
  before shipping a cache entry to data/autotune_cache.txt.
- src/core/gpu_kernel/autotune_bench.cpp comment block over
  kBenchmarkInputLengthDefault: dropped the stale "under a minute on
  a fast GPU" claim (the sweep is closer to 10 min at the default
  size) and the stale trial-count formula (it omitted the blocksz
  axis introduced when autotune started sweeping blocksz).

No default value change in this commit -- bumping
SCAMP_AUTOTUNE_INPUT_LENGTH from 131072 to 262144 would 4x the wall
time of the default `--autotune` run, which is a separate UX
trade-off worth a follow-up discussion.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The autotune-size sweep on RTX 3080 (committed in the prior diff) showed
that 256K gives meaningfully tighter cross-target rankings than the
previous 128K default:

  128K: ~8 min,  geomean 1.308, worst-case 3.47
  256K: ~25 min, geomean 1.278, worst-case 2.77   (~20% better worst case)

Per-target winners themselves shift between sizes (e.g.
SUM_THRESH/DOUBLE picks different kernel geometries at 64K vs 128K vs
256K), so a too-small sweep doesn't just rank the cross-target default
suboptimally -- it picks suboptimal entries for individual cache rows.
256K is the smallest size where launch overhead is reliably dominated
by steady-state kernel cost.

The 4x wall-time cost is amortized over the lifetime of the autotuned
cache -- users hit `SCAMP --autotune` once per device. Anyone tuning
on an older / slower GPU where 25 minutes is too long can dial it
back via SCAMP_AUTOTUNE_INPUT_LENGTH=131072 (the doc updated here
calls this out explicitly).

Changes:
- src/core/gpu_kernel/autotune_bench.cpp: kBenchmarkInputLengthDefault
  131072 -> 262144. Comment block above the constant explains the
  trade-off and points at the doc for the per-N comparison.
- docs/source/autotune.rst: "Choosing the benchmark workload size"
  reflects 256K as the new default and reframes the smaller-N row of
  the comparison table as "use this if 256K is too slow on your GPU".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…l-variant

# Conflicts:
#	.github/workflows/build-and-test.yml
#	src/python/SCAMP_python.cpp
…real SCAMP crashes

Four loosely-related fixes that came out of inspecting CI output for
the latest PR push.

1) SCAMP_distributed missing autotune_bench link dep

The BUILD_CLIENT_SERVER=1 docker build was failing with:
  undefined reference to `SCAMP::DefaultBenchmarkVariant(...)'

src/distributed/CMakeLists.txt builds SCAMP_distributed from the same
src/main.cpp the top-level SCAMP target uses, and that file references
DefaultBenchmarkVariant under _HAS_CUDA_ (which mark_cuda_if_available()
propagates to every distributed target). The top-level SCAMP target
gets autotune_bench linked under the same CUDA gate; SCAMP_distributed
did not. Mirror the link rule so the dep is satisfied in both places.

2) Test scripts spammed the autotune-cache-miss warning

run_tests.sh and run_tests.py spawn many short SCAMP processes. The
one-shot dedup on the "no autotune entry for device ..." warning is
per-process, so each spawn re-emits the warning. Both harnesses now
export SCAMP_AUTOTUNE_QUIET=1 before any SCAMP invocation; users
running the harnesses interactively can override by setting the env
var themselves first.

3) Warning text now points at SCAMP_AUTOTUNE_QUIET

Added a parenthetical to the cache-miss warning explaining how to
silence it process-wide. Users hitting this in a tight CI loop now
have a fix-it-yourself hint without grepping the source.

4) run_tests.py was swallowing SCAMP crashes

The harness called subprocess.call() and discarded the exit code, then
read the per-test output files. If SCAMP aborted (e.g.
"GPUasssert: unspecified launch failure" on a kernel-config the
target device couldn't run), the previous subtest's output files were
still on disk, so the comparator silently re-evaluated stale data and
the test "passed". Two changes:
  - rm the output files (mp_columns_out, mp_columns_out_index,
    mp_rows_out, mp_rows_out_index) before each SCAMP invocation, so a
    missing output surfaces as a None array instead of a stale read.
  - check subprocess.call's return code and raise RuntimeError on
    non-zero so the test runner exits non-zero.

This fourth change will likely surface a real bug on the next CI run
of build-and-test-cuda: the compile-time-default kernel variant (v3
shfl, blocksz=128 bps=8 dpt=4 ur=0 our=8 kti=8) hits an unspecified
launch failure on sm_60 (the self-hosted runner reports its device
key as 'NVIDIA_Graphics_Device__sm_60', which doesn't match the
'Tesla_P100-...' built-in cache rows, so the fallback fires). That's
intentional -- the bug was already happening; the harness was just
hiding it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…UTE_COLS

compute-sanitizer on a P100 (sm_60) bisected the
"GPUasssert: unspecified launch failure tile.cpp:175" reproducer
that build-and-test-cuda was hitting down to a single offset in
do_tile_shfl: every thread in block (0,0,0) trapped at
+0x21218 of the (PROFILE_TYPE=1NN, COMPUTE_ROWS=true,
COMPUTE_COLS=false) instantiation of the kernel, attributed to
kernels_impl_shfl.h:148.

That line is the start of the per-tile distc reload loop:

  for (int i = 0; i < DiagsPerThread; ++i) {
    int col_idx = state.local_col[i];
    if (col_idx >= 0 && col_idx < tile_width) {
      ...
      state.distc[i] = smem.local_mp_col[col_idx];
      ...
    }
  }

SCAMPShflSmem's ctor only re-binds local_mp_col to a real backing
pointer when compute_columns is true; with COMPUTE_COLS=false (the
ab-join lower-triangle path, where the col profile belongs to the
other tile in the pair) the Map's data pointer stays at the nullptr
sentinel from the member initializer list. The smem read then
dereferences a null pointer.

Why Ampere happened to "work": the kernel does a check
(col_idx >= 0 && col_idx < tile_width) before the dereference and
the resulting distc is never consumed (merge_to_column_shfl /
flush_all_cols_to_smem are also COMPUTE_COLS-gated, so the trash
distc gets overwritten before it matters). On sm_86 the load
through null seems to retire silently in this configuration; on
sm_60 the MMU traps on the load and the whole kernel aborts.

Fix: gate the entire per-tile distc-reload loop on COMPUTE_COLS,
mirroring the gating already in place for flush_all_cols_to_smem
(kernels_impl_shfl.h:219) and for the smem.local_mp_col writes
inside the init_smem_shfl helpers
(kernels_compute_shfl.h:628/659/691). When COMPUTE_COLS is off
distc is unused for the rest of the tile; just seed it with
init_dist so any speculative reads see a well-defined value.

Verified locally:
- sm_86 (RTX 3080): 460/460 Extended Tests pass before AND after.
  Lower bound that the fix is non-regressing; the bug never
  manifested on Ampere in the first place.
- The fix is small enough to reason about by inspection; the
  sm_60 confirmation comes from re-running compute-sanitizer
  against the patched kernel.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Refactor: blocksz is now a per-variant property declared in
SCAMP_VARIANT_TUPLES rather than a precision-tied global constant
(BLOCKSZ_DP / BLOCKSZ_SP) gated by a hidden "sliding-window vs shfl"
rule inside GetKernelConfigForVariant. The change removes the hidden
coupling without changing runtime behavior: each variant declares
default_blocksz_dp and default_blocksz_sp explicitly, and
GetKernelConfigForVariant returns the field that matches the requested
precision.

Tuple format grows from 5 to 7 fields:
  before: bps|dpt|ur|our|kti
  after:  bps|dpt|ur|our|kti|default_blocksz_dp|default_blocksz_sp

Values are picked to preserve current cold-start behavior bit-for-bit:
  - sliding-window variants get 256|512 (== BLOCKSZ_DP, BLOCKSZ_SP)
  - shfl variants get 128|128 (the prior hardcoded shfl default)

The kernel is still instantiated for all of {64, 128, 256, 512} per
variant, the autotuner still sweeps blocksz as an independent axis,
and on-disk cache entries can still name any of the four values; the
per-variant defaults only feed the cold-start fallback path where no
cache hit exists.

Files removed (everything was used only by the old precision-tied
default path):
  - src/core/gpu_kernel/kernel_constants.h (was BLOCKSZ_DP/SP only)
  - SCAMP::BlocksizeForPrecision in kernel_config.cpp
  - SCAMP::get_blocksz in kernel_gpu_utils.{h,cu} (already dead --
    declared but not called by anything)

Doc updates folded in:
  - docs/source/environment.rst: document SCAMP_USE_EXTERNAL_EIGEN
    (was honored by cpu_kernel/CMakeLists.txt but undocumented), plus
    a note about Eigen 5.x still publishing as the Eigen3 package.
  - src/core/gpu_kernel/CMakeLists.txt: expand the comment above the
    target_link_libraries(gpu_kernels ... Eigen3::Eigen) call to
    mention the dual Eigen-acquisition path (vendored INTERFACE
    target vs find_package import).
  - docs/source/autotune.rst + builtin_autotune_cache.h: small polish
    landed alongside.

Verified locally on RTX 3080:
  - kernel_variants_table.h regenerates cleanly with the 7-field
    initializer rows.
  - 22/22 unit tests pass.
  - run_tests.py subset (windows 13/33, inputs 100/250/1500, tiles
    1024/2048) = 460/460 GPU tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The prior commit gated the per-tile distc-reload loop on COMPUTE_COLS
to fix the v3 Pascal nullptr-deref. Tracing the lifecycle of distc
across a tile shows the gate fixed the bug but missed the deeper
redundancy: the pre-load itself was a no-op.

distc lifecycle inside one tile:

  1. init_smem_shfl: smem.local_mp_col[col_idx] = global_profile[col]
     (cross-tile running best from the global profile).
  2. (pre-load) distc[i] = smem.local_mp_col[col_idx]
     -- COPY of the prior_global value into the per-thread accumulator.
  3. merge_to_column_shfl, called per row:
       if (dist[i] > distc[i]) distc[i] = dist[i];
     -- so distc[i] ends the tile at max(prior_global, this_tile_max).
  4. flush_all_cols_to_smem:
       atomicMax(smem.local_mp_col + col_idx, distc[i])
     -- block-scoped atomic-max merge into the same smem cell.

The atomicMax in step 4 is what actually preserves cross-tile
continuity. Whether step 2 starts distc at prior_global or at init_dist,
step 4's atomic computes:

  smem.local_mp_col[col_idx]
      = max(prior_global, distc_final)
      = max(prior_global, this_tile_max)

in both cases. Same observable result.

Dropping the pre-load:
  - Removes DPT * tiles * blocks LDS instructions per kernel launch.
  - Removes the COMPUTE_COLS branch + reload-or-init duplication added
    in the prior commit -- the null-deref the gate was protecting
    against simply doesn't happen if we never read from
    smem.local_mp_col in this loop.
  - Removes the col_idx bounds check (`col_idx >= 0 && col_idx <
    tile_width`) since the only consumer (merge_to_column_shfl) reads
    distc by register index, not by col_idx.

Per-row merge_to_column_shfl is a branchless predicated select on the
GPU regardless of whether distc starts at prior_global or init_dist,
so no per-row cost difference either way.

A FUTURE OPT comment is left in the kernel for max-style profiles
(1NN / 1NN_INDEX / APPROX_ALL_NEIGHBORS / MATRIX_SUMMARY): pre-loading
plus the _check atomic variants (fAtomicMax_check / MPatomicMax_check)
could skip the flush-time CAS when this tile's distc didn't beat the
prior global. The bookkeeping has to refresh the cached check inside
update_info_shfl on rotation, and adds DPT registers per thread
(potentially regressing occupancy on v4 DPT=8 SP). Not worth pursuing
without a micro-benchmark on smaller GPUs or converged workloads
where the atomic-skip rate would be high.

Verified on RTX 3080 (sm_86):
  - 22/22 unit tests pass.
  - 685/685 GPU run_tests.py subset pass.
  - 1NN_INDEX 1M perf check (3 runs median): DP 12.16 -> 12.22 s
    (+0.5%), SP 2.10 -> 2.15 s (+2%). Both within run-to-run noise;
    net perf-neutral. The Pascal repro is fixed for the same reason as
    the prior commit's gate (no smem.local_mp_col read on the
    COMPUTE_COLS=false path), but expressed as a smaller kernel body
    instead of an if-constexpr.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The DPT-wide per-row hot loop in do_row_shfl was hand-unrolled via
#pragma unroll. The body is pure element-wise array ops on the
SCAMPShflState::cov / inormc / dfc / dgc / local_col / global_col
fields, all of which are already Eigen::Array<T, DPT, 1>. Express
the loop as Eigen array expressions and let nvcc unroll the
expression-template-generated IR instead.

The four sub-operations become:
  1. d_raw = (cov * inormc * inormr).cast<DISTANCE_TYPE>()
  2. cov  += dfc * dgr + dgc * dfr   (uses old cov on RHS only)
  3. diag = local_col.cast<int>() - row_in_tile  (signed; preserves
     the negative-diag case the original used)
  4. slot_valid (DPT-wide bool array) + slot_valid.select(d_raw,
     init_dist) for the final dist

The SLOW-PATH branch keeps the global_row-out-of-range check as a
scalar early-return (setZero on the bool array) because Eigen's && /
& between an Array and a scalar bool doesn't broadcast; per-element
predicates are then chained with Array && Array.

Verified on RTX 3080 (sm_86), 1NN_INDEX, 1M elements, window=100,
median of 8 runs:

  variant   manual loop   Eigen   delta
  ---------------------------------------------
  v3 DP     12.16 s       12.18 s noise (~+0.2%)
  v4 SP      2.10 s        2.00 s **-5%** (every Eigen run faster
                                          than every manual run)

Hypothesis for the SP-only win: v4 has DPT=8, so the manual
#pragma-unrolled loop produces a wider IR that nvcc's scheduler
seems to lay out worse than the equivalent Eigen-expression IR. v3
at DPT=4 is small enough that both forms produce essentially the
same code. The Eigen IR also keeps cov-update and d_raw computation
explicitly independent (no false ordering dependency between scalar
statements that nvcc would have to prove disjoint).

Correctness:
  - 22/22 unit tests pass.
  - 685/685 GPU run_tests.py subset (windows {13,33,100}, inputs
    {100,250,1500}, tiles {1024,2048}) pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the per-element memmove-style cov shift with Eigen segment
assignment:

  // Before:
  for (int i = DPT - 1; i > 0; --i) cov[i] = cov[i - 1];
  cov[0] = wrap_in;

  // After:
  if constexpr (DPT > 1) {
    cov.template tail<DPT - 1>() = cov.template head<DPT - 1>().eval();
  }
  cov[0] = wrap_in;

The .eval() forces a temporary so the tail-and-head overlap doesn't
alias (the manual loop handled aliasing implicitly by walking from
high-to-low). For DPT in {4, 8} the temporary is a small register
array; nvcc folds it into the same register allocation as the
destination, so the .eval() is essentially free.

Verified on RTX 3080 (sm_86), 1NN_INDEX, 1M, window=100, median of
13 SP runs / 5 DP runs:

  variant  prior (manual cov-shift)   cov-shift Eigen   delta
  -----------------------------------------------------------
  v3 DP    12.18 s                    12.16 s           noise
  v4 SP     2.00 s                     1.99 s           ~-1%

The SP improvement is small (within the variance band of any
individual run) but the median moves the right direction and every
run except one is sub-2.00. Combined with the prior commit
(Eigenify per-row dist + cov update), the v4 SP path is now ~6%
faster than the all-manual-loop baseline. DP is flat overall.

Correctness:
  - 22/22 unit tests pass.
  - 685/685 GPU run_tests.py subset pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two manual #pragma unroll loops over DPT replaced with Eigen array
ops on the SCAMPShflState fixed-size arrays:

  state.local_col = state.global_col - tile_start_col;
  state.distc.setConstant(init_dist<DISTANCE_TYPE, PROFILE_TYPE>());
  state.idxc.setZero();

Code-shape change only. Perf-checked on RTX 3080 with the reference
SP 1M / m=1024 / max_tile_size=1M / 1NN_INDEX workload:
  baseline:           ~1.985s (min of 5)
  Eigen-simplified:   ~1.867s (min of 5)
Within run-to-run noise -- treat as perf-neutral, not a win.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three additions to the SCAMP --autotune workflow that make iterative
tuning much faster:

- SCAMP_AUTOTUNE_PRECISION_FILTER={SINGLE,DOUBLE}: restricts the sweep
  to one precision. Filtered-out (profile, precision) targets are
  reported as SKIPPED in verbose mode and do NOT overwrite their cache
  entries on Save(), so the existing cache for the other precision is
  preserved untouched.

- SCAMP_AUTOTUNE_VARIANT_FILTER={shfl,sliding-window,all}: restricts
  the variant loop to one family. shfl variants are identified by
  unrolled_rows == 0 (the sentinel already used in SCAMP_VARIANT_TUPLES
  to route to kernel_variant_shfl.cu.in).

- SCAMP_AUTOTUNE_WARMUP_RUNS=<n> (default 0): per-trial warmup count.
  Was hardcoded at 1 warmup + 1 timed; in practice the first launch of
  a given (variant, blocksz) instantiation on Ampere is only a few %
  off steady-state because module-load / JIT cost is amortized at the
  process level, not per-trial. Skipping warmups by default ~halves
  sweep time. Set to 1 or more on noisier setups.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…onsolidation

Three closely-coupled changes to the GPU kernel infrastructure that
together cut the consolidated shfl + sliding-window variant count in
half while improving SP 1NN_INDEX on RTX 3080 by 14% (1.94s -> 1.71s
at 1M / m=1024).

1. ADAPTIVE __launch_bounds__ VIA safe_bps()

   Replaces __launch_bounds__(BLOCKSZ, variant_bps) with
   __launch_bounds__(BLOCKSZ, safe_bps(target_threads, BLOCKSZ,
   kHwThreadsPerSm)) in both do_tile and do_tile_shfl. The target
   threads/SM stays constant across the autotuner's blocksz sweep, so
   the per-thread register budget no longer collapses when blocksz
   grows past the variant's default. This removed a sharp register-
   spilling cliff observed at bsz=256 (DP 1NN_INDEX went 52s -> 12s,
   SP 14.9s -> 1.88s) where the old fixed-bps launch_bounds was telling
   ptxas to fit 4 * 256 = 1024 threads/SM at ~64 regs/thread budget.

   safe_bps + kHwThreadsPerSm live in kernel_gpu_utils.h. The hw cap
   resolves per __CUDA_ARCH__ (1536 for sm_86/sm_89, 2048 elsewhere)
   so launch_bounds never asks ptxas to overcommit the per-SM thread
   ceiling, which used to be silently clamped.

2. VARIANT TUPLE: DROP default_blocksz_dp (DERIVE AS SP / 2)

   Variant tuples in SCAMP_VARIANT_TUPLES now declare only
   default_blocksz_sp. The DP value is implicit:
   default_blocksz_sp / 2. DP uses 2x the registers per thread on
   Ampere/Ada (FP64 takes two FP32-sized register slots), so halving
   the target threads/SM keeps the per-thread register budget stable
   across precisions. The /2 is hardcoded in the launcher templates
   (LaunchDoTileShflWithGeometry / LaunchDoTileWithGeometry) and in
   DefaultBlockszForPrecision so the ratio is enforced by construction
   instead of being a per-variant convention that drifts.

3. cuda::barrier ARRIVE/WAIT SPLIT IN do_row_shfl

   Replaces the per-row __syncthreads() in the shfl inner loop with a
   cuda::barrier<thread_scope_block> arrive/wait pair (sm_80+ mbarrier
   PTX). The lane-31 cov publish + bar.arrive() are hoisted to right
   after the cov update, so the arrive-to-wait window covers the dist
   masking, merge_to_col, merge_to_row, intra-warp shfl cov rotation,
   register shift, and (every DPT rows) update_info_shfl. Warps that
   finish that second-half work faster don't block warps still
   draining it -- the wait blocks only on the slowest arrival.

   ncu confirms barrier stalls drop from 1.21 cyc/issued instr to
   0.03; eligible warps/sched climb from 0.90 to 1.43; issue rate
   +24%. Output is bit-identical to the __syncthreads version.

4. VARIANT TABLE CONSOLIDATION

   The full SP+DP autotune across the previous 9-variant table
   (3 SW + 6 shfl) showed only 4 variants ever win a target on sm_86,
   so the table is trimmed to those 4:

     SW dpt=4 bps=1  big-tile (target=512 thr/SM): wins DP 1NN + DP AAN
     SW dpt=4 bps=2  big-tile (target=1024 thr/SM): wins SP 1NN + SP/DP
                     MATRIX_SUMMARY + SP AAN
     shfl dpt=8 bps=5: wins SP 1NN_INDEX + DP/SP SUM_THRESH; best
                       overall geomean (1.016x of per-target best,
                       worst-case 1.03x).
     shfl dpt=4 bps=8 tile=128: wins DP 1NN_INDEX. Smaller per-thread
                                state lets DP fit higher occupancy
                                where dpt=8 spills.

   Dropped variants (each never won a target across all 10 (profile,
   precision) targets in the full autotune):
     - SW dpt=2 baseline (geomean rank 22+)
     - SW dpt=4 + 4x4 hand-unroll tile=200 (geomean rank 4-13 but
       always lost to one of the two big-tile variants)
     - SW big-tile bps=4 (forced spill cliff; geomean rank 17+)
     - shfl dpt=4 bps=8 tile=64 (the original short-tile shfl;
       never wins anywhere)
     - shfl dpt=8 bps=4 (loses by ~1% to bps=5)
     - shfl dpt=8 bps=6 (just past spill cliff on sm_86)
     - shfl dpt=4 bps=10 tile=128 (only wins broken-on-GPU
       MATRIX_SUMMARY DP)

   Compile time drops accordingly: 9 -> 4 variants means
   5 fewer per-profile .cu TUs to nvcc-instantiate, with the
   sliding-window TUs each compiling much faster than the shfl ones
   (the shfl variant has more template instantiations per TU).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
GetDefaultKernelConfig previously took only precision and returned the
first shfl variant for every (profile, precision) combo. Now takes
profile_type too, and picks the right family for that profile based on
which family wins in the cross-(profile, precision) autotune:

  - 1NN_INDEX, SUM_THRESH -> shfl family. The shfl variant warp-reduces
    atomics before the final atomicMax / atomicAdd, which dominates
    these profile types.
  - 1NN, MATRIX_SUMMARY, APPROX_ALL_NEIGHBORS -> sliding-window family.
    These profiles have light per-row atomics; the SW variant's smem
    column buffer + heavy inner-loop unroll dominates.

The implementation picks the FIRST variant of the preferred family
(matching by unrolled_rows == 0 / != 0). By convention the first entry
of each family in SCAMP_VARIANT_TUPLES is that family's intended
default -- variant authors should keep the best generalist at the
front of the list. Robust to variant reordering and to the table being
re-tuned for a new device.

DefaultBlockszForPrecision still picks the precision-specific blocksz
within the chosen variant (SP gets default_blocksz_sp, DP gets
default_blocksz_sp / 2). This only matters when the user-cache lookup
misses the device's CacheKey + (profile, precision) -- once the
autotune cache has an entry, the cache wins.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds a Per-variant CUDA Tests step to build-and-test-cuda that runs
the existing run_tests.py harness once per compiled kernel variant
with SCAMP_FORCE_VARIANT pinning the variant for every (profile,
precision) launch. Coverage: each variant gets exercised against all
profile types (1NN_INDEX, 1NN, SUM_THRESH, MATRIX_SUMMARY,
APPROX_ALL_NEIGHBORS) and both precisions, just at a smaller
sweep than the extended-tests step (1 window, 2 tile sizes, 2 input
sizes), so the extra CI time is ~4 min total for the current
4-variant table.

Two new mechanisms support the new step:

- SCAMP_FORCE_VARIANT=<index> env var: read by GetKernelConfigForDevice
  in autotune.cpp. When set to a valid variant index, returns the
  variant's cold-start config (via DefaultBlockszForPrecision) and
  bypasses the autotune cache + per-(profile, precision) default. The
  precision still picks the cold-start blocksz so the kernel runs at
  its variant-author-chosen blocksz, not the autotuned one -- the full
  {64,128,256,512} blocksz axis sweep belongs in `SCAMP --autotune`,
  not the correctness CI.

- --list_variants CLI flag: prints one line per variant in a stable
  machine-parseable format (v<i> bps=N dpt=N ur=N our=N kti=N
  family={shfl,sliding-window}). CI shell scripts iterate via
  $(SCAMP --list_variants | wc -l) so adding or dropping variants
  doesn't require touching the workflow YAML.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
cuda/barrier requires sm_70+; libcudacxx hard-errors on older arches
with #error "CUDA synchronization primitives are only supported for
sm_70 and up." The shfl kernel's arrive/wait split is gated behind
SCAMP_SHFL_HAS_CUDA_BARRIER (defined only when the host pass or a
sm_70+ device pass is active).

ShflRowBarrier is a typedef wrapper used in do_row_shfl's signature
and the per-block __shared__ declaration:
  - sm_70+: cuda::barrier<thread_scope_block>
  - sm_60:  empty stub struct

On sm_60 the lane-31 publish still happens at the same early spot
(right after the cov update); only the per-row sync primitive
differs (cuda::barrier wait vs __syncthreads). The publish is just
an smem store and is ordered with the next row's read by either
sync primitive, so semantics are identical -- sm_60 just loses the
arrive/wait latency-hide.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The cache lookup tests hardcoded the old v6 geometry
(bps=8, dpt=4, ur=0, our=8, kti=8 — the dpt=4 tile=64 shfl variant)
as the SupportedConfig() "hit" example, and the
Test_AutotuneFullPipeline_FakeBench rig referenced variant indices
4+ (out-of-bounds in the 4-variant consolidated table). Both were
silently failing after the variant table dropped from 9 variants to
4 -- SupportedConfig stopped matching anything in kVariants so
IsSupportedKernelConfig rejected cache hits, and the FakeBench rig
read past kVariants.size() (returning garbage geometry values like
-51257 from uninitialized stack).

Updates:
- SupportedConfig() now uses kti=16 (the dpt=4 tile=128 shfl variant
  that survived consolidation).
- All cache-fixture strings bumped from |8|4|0|8|8 to |8|4|0|8|16
  via sed.
- RiggedWinner table reworked to cover variant indices 0..3 (the
  current 4-variant range), with at least one rigging per index.

22/22 tests now pass on sm_86. Same fix applies to the CI failures
on the build-cuda-versions runners.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Fixes from a walk-through of the PR diff vs master:

- kHwThreadsPerSm: added missing arches. Turing (sm_75) was using the
  2048 fallback but Turing's hw cap is actually 1024 thr/SM, so
  safe_bps was overcommitting by 2x on Turing -- ptxas would silently
  clamp, defeating the safe_bps cap entirely. sm_87 (Orin) and sm_120
  (Blackwell GeForce) added at 1536. Also refactored from a constexpr
  #if-ladder to a constexpr function (hw_threads_per_sm()) used as the
  default arg for safe_bps -- simpler to read and extend; same per-arch
  __CUDA_ARCH__-driven values.

- LAUNCH_PRECISION_SHFL: factored the 4-way blocksz dispatch into a
  LAUNCH_PRECISION_SHFL_AT_BLOCKSZ helper macro, mirroring the
  sliding-window LAUNCH_PRECISION_AT_BLOCKSZ pattern in kernels_impl.h.
  ~50 lines of macro removed; identical generated code.

- Stale shipped autotune cache entries dropped from
  data/autotune_cache.txt. Every RTX 3080 and P100 entry referenced
  variant geometries dropped in the consolidation
  (old v6 dpt=4 tile=64, old v8 dpt=8 bps=4, old v0/v1 SW). They were
  silently rejected by IsSupportedKernelConfig -> users got cold-start
  defaults anyway, but the file lied about what was tuned. Header
  notes kept; entries cleared until we re-autotune on hw.

- Comment cleanup:
  * kernels_compute_shfl.h: "one __syncthreads per row" -> describes
    both the sm_70+ cuda::barrier path and the sm_60 __syncthreads
    fallback. Stale "DP variant 6 BLOCKSZ=256 DPT=2" smem-footprint
    example replaced with a real current-variant example.
  * kernels_impl_shfl.h: docstring template-param list said
    blocks_per_sm but the param was renamed to target_threads_per_sm
    when safe_bps landed. FUTURE OPT comment dropped its "v4 (DPT=8
    SP is already register-pressured)" reference -- v4 no longer
    exists; rewritten to "higher-DPT shfl variants" generically.
  * kernels_impl.h: dropped "variant 5 at OUR=16" and "Same as v6"
    references to dead variants. Added a load-bearing note about
    sm_60's 48KB dynamic-smem cap and the launcher's
    cudaFuncSetAttribute opt-in (which sm_60 ignores).
  * autotune_bench.cpp: trimmed "real bug: v5 (OUR=16)" past-incident
    reference; deduped two stale warmup-related comment blocks.
  * autotune.cpp: "Per-thread override" comment was misleading -- the
    override IS process-wide (the function-level comment above the
    setter spells out why), so the lookup-side comment now matches.
  * main.cpp: removed redundant nested #ifdef _HAS_CUDA_.
  * ci yml per-variant step: "one tile size" -> "two tile sizes"
    (matched actual command which sweeps {1024, 4096}).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- SCAMP_FORCE_VARIANT lookup now caches the parsed value at first call
  via a static-local initializer. The lookup sits on the per-tile-launch
  hot path; the prior every-call getenv() added a syscall per tile.
  Out-of-range and malformed values cache as -1 ("no override") so the
  fast path stays a single int compare.

- pyscamp module init no longer force-sets SCAMP_AUTOTUNE_QUIET=1.
  The autotune cache-miss warning now behaves the same in pyscamp as
  in the CLI: prints once per process if no cache entry is found, can
  be silenced with SCAMP_AUTOTUNE_QUIET=1. The prior asymmetric default
  hid the "run --autotune for max perf" hint from notebook users for
  whom it was most relevant.

- safe_bps() comment now spells out the default-arg evaluation semantics
  (hw_threads_per_sm() resolves under the current __CUDA_ARCH__ guard
  at the call site, so each per-arch device compilation pass bakes in
  the right per-SM cap).

- docs/source/autotune.rst: dropped the now-stale "pyscamp silences by
  default" claim; clarified the cache-miss warning is once-per-process.
  Added an "Other autotune environment variables" section covering
  SCAMP_AUTOTUNE_{PRECISION,VARIANT}_FILTER, SCAMP_AUTOTUNE_WARMUP_RUNS,
  and SCAMP_FORCE_VARIANT, including their first-call-cached semantics
  for the ones that latch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Removes the build-time embedded autotune cache and its lookup chain
branch. The maintenance burden of keeping data/autotune_cache.txt
fresh across releases (every variant-table edit invalidates entries;
every new device needs a hand-merged PR; users hit silent stale-entry
rejection) outweighed the small out-of-box perf uplift it bought.

Removed:
- data/autotune_cache.txt (the source file, already empty post-review).
- src/core/gpu_kernel/builtin_autotune_cache.{h,cpp.in} (the embedded
  string + its declaration).
- The file(READ) + configure_file + extra source file in
  src/core/gpu_kernel/CMakeLists.txt that wired the embed into the
  binary.
- GetOrLoadBuiltinCache() + CacheState::builtin_cache from autotune.cpp.
- The `builtin_cache` parameter from LookupKernelConfigForDeviceKey
  (both signature + the only non-test caller, GetKernelConfigForDevice).
- 3 builtin-cache-specific unit tests (Test_BuiltinCacheHit,
  Test_UserCacheBeatsBuiltin, Test_StaleUserCacheFallsThroughToBuiltin).
  The last was renamed and repurposed to verify a stale user cache
  falls through to the FALLBACK config + emits the cache-miss warning
  (the upgrade-compat behavior we actually care about).
- Built-in / "ships with the binary" mentions in docs/source/autotune.rst,
  autotune.h, autotune_cache.h, autotune.cpp banner output.

New lookup chain (in priority order):
1. Process-wide KernelConfig override (autotune bench + SCAMP_FORCE_VARIANT).
2. User cache at $SCAMP_AUTOTUNE_CACHE / $XDG_CACHE_HOME / $HOME.
3. GetDefaultKernelConfig(profile, precision) -- the per-profile-type
   compile-time default, plus a one-shot per-process miss warning.

Users who want optimal perf still run `SCAMP --autotune` once per
device (or `pyscamp.autotune()`); the result lands in the user cache
and persists across runs. Cold-start defaults are reasonable
per-profile (shfl for atomic-heavy 1NN_INDEX/SUM_THRESH, sliding-
window for atomic-light 1NN/MS/AAN), so no-tune behavior is correct
and within a few percent of optimal on the variants and arches we
tested.

22/22 -> 20/20 unit tests (-2 from removed builtin-only tests, the
renamed one still runs).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- docs/source/environment.rst: SCAMP_AUTOTUNE_INPUT_LENGTH default
  documented as 131072 but the constant in autotune_bench.cpp is
  262144 (256K). Fixed. Also dropped the "Set by pyscamp at module
  init for notebook users" line (we removed that auto-set in an
  earlier review pass). Added entries for the four env vars that
  weren't documented at all: SCAMP_AUTOTUNE_PRECISION_FILTER,
  SCAMP_AUTOTUNE_VARIANT_FILTER, SCAMP_AUTOTUNE_WARMUP_RUNS,
  SCAMP_FORCE_VARIANT.

- docs/source/autotune.rst: the cache-miss warning example showed the
  pre-consolidation default (bps=8 dpt=4 ... kti=8); the current
  default for 1NN_INDEX SP is the first shfl variant, which is
  bps=5 dpt=8 ... kti=32. Updated the example to match.

- src/core/gpu_kernel/kernel_gpu_utils.h: hw_threads_per_sm() added
  sm_110 (Jetson Thor, Blackwell mobile) to the 1536-threads/SM
  bucket. Previously fell through to the 2048 default which would
  have over-committed safe_bps's per-arch hw cap by 33% on that
  device. SCAMPMacros.cmake already compiles for sm_110 on CUDA
  13.0+, so the omission was a real gap.

- src/core/gpu_kernel/kernels_impl_shfl.h: the
  "tile_height <= 32 * DPT for the first draft" static_assert message
  was a state-of-development phrasing. Replaced with a fact-only
  explanation of WHY the constraint must hold (update_info_shfl
  fires at most once per warp per tile under that bound).

- src/python/SCAMP_python.cpp: a comment in run_autotune claimed
  "main.cpp dispatches here too" -- backwards. The CLI's --autotune
  path in main.cpp calls RunAutotuneWithBenchmark directly; this
  Python entry point is a separate dispatch that shares the same
  bench impl. Fixed the comment to match.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant