Per-GPU kernel autotune + cov-shuffle (shfl) variant + Eigen port by zpzim · Pull Request #143 · zpzim/SCAMP

zpzim · 2026-05-22T22:41:32Z

Background

The cov-shuffle / warp-shuffle GPU kernel design has been on this repo's
wishlist since 2018 (experimental-warp-shuffle,
experimental-shared-memory) and was retried in late 2022 / early 2023
with an Eigen-port foundation (use-eigen-in-gpu-kernels,
eigen-gpu-warp-shuffle, eigen-gpu-cov-shuffle, eigen-tex-mem).
None of those branches reached a shippable state — the codegen
tradeoffs (compile time, kernel binary size, smem footprint per
variant) made it hard to keep the new kernel alongside the existing
sliding-window one without regressing builds for users who didn't need
it. This PR is the iteration that finally lands the work, on top of a
per-(profile, variant) build split + an autotune cache that picks per
device, so the new kernel can coexist with the legacy one in the same
binary.

Summary

Three threads of work that landed together because the autotune machinery
depends on the kernel-variant split, and the variant split is the thing
that lets the cov-shuffle variant live alongside the existing
sliding-window kernels without compile-time blow-up.

1. Per-GPU autotune

New --autotune CLI flag (and pyscamp.autotune() binding) that
sweeps (variant geometry, blocksz) for every supported
(profile, precision) target and writes the per-target winner to a
plain-text cache file.
Three-tier lookup at kernel-launch time:
process-wide override → user cache → built-in cache (embedded from
data/autotune_cache.txt) → compile-time default. Each tier is
independent; lookup falls through gracefully on miss.
Cache file location resolves $SCAMP_AUTOTUNE_CACHE → $XDG_CACHE_HOME
→ $HOME/.cache/scamp/autotune.txt on POSIX or
%LOCALAPPDATA%\scamp\autotune.txt on Windows.
One-shot stderr warning ("no autotune entry for device '<name>'…")
on miss, silenceable via SCAMP_AUTOTUNE_QUIET. pyscamp sets this at
module init so notebook users don't get uninvited stderr.
Default benchmark workload size is 256K elements
(SCAMP_AUTOTUNE_INPUT_LENGTH=262144); measured against 64K/128K on
an RTX 3080, 256K is the smallest size where launch overhead is
reliably dominated by steady-state kernel cost — see the comparison
table in docs/source/autotune.rst.
Built-in cache ships entries for RTX 3080 (sm_86) and P100 (sm_60).

2. Kernel-variant infrastructure

SCAMP_VARIANT_TUPLES in src/core/gpu_kernel/CMakeLists.txt is the
single source of truth for which kernel geometries the binary
supports. The foreach in that file regenerates the dispatch table,
the kVariants[] constexpr array, and the per-(profile, variant)
.cu files via configure_file.
Five active variants today: 3 sliding-window + 2 shfl, after a
multi-device sweep retired the ones that never won.
Build split: 5 profiles × 5 variants = 25 generated .cu files,
one do_tile<...> instantiation per file, so make/ninja can
dispatch them in parallel. Full gpu_kernels rebuild on a 3080 dev
box dropped from ~260s to ~74s; the dominant v3+v4 (DPT=8) TUs
alone went 211s → 34s after dropping #pragma unroll on the shfl
outer row loop (a __syncthreads()-terminated loop with up to 256
iterations that nvcc was inlining 256× for no ILP gain).

3. cov-shuffle ("shfl") variant

New GPU kernel design where the per-row covariance value walks
lane-to-lane via warp shuffles rather than being read from shared
memory. Files: kernels_compute_shfl.h, kernels_impl_shfl.h,
with an ASCII walkthrough generator in
scripts/cov_shuffle_diagram.py.
Two enabled shfl tuples: DPT=4 / tile=64 and DPT=8 / tile=256. v3
(DPT=4) is the compile-time default fallback; the autotuner picks
per-device.

4. Eigen port of the sliding-window kernel

Sliding-window cov / dist arrays in kernels_compute.h +
kernels_smem.h use Eigen::Array<T, N, 1> and Eigen::Map
instead of hand-unrolled C arrays. Required bumping the Eigen
submodule 3.4.0 → 5.0.1 for better nvcc support; device-callable
under --expt-relaxed-constexpr (already set at the top level).

5. Platform / CUDA-arch matrix

Hardened cmake/SCAMPMacros.cmake set_cuda_architectures() for
CUDA 11.0 through 13.0+, including Blackwell (sm_100, sm_103,
sm_110, sm_120, sm_121) and the CCCL 2.8.x arch-token limit
workarounds.
Honors -DCMAKE_CUDA_ARCHITECTURES=<list> /
USER_SPECIFIED_CUDA_ARCHITECTURES for single-arch dev builds (huge
compile-time win when you know you're targeting one GPU).
Windows MSVC compatibility for the autotune cache code:
std::filesystem for atomic-rename + recursive mkdir,
_putenv_s for the env-var-set test helpers, explicit lambda
captures for MSVC C3493, LOCALAPPDATA / USERPROFILE env-var
resolution.

6. Docs

New docs/source/autotune.rst covering lookup chain, cache
locations (POSIX + Windows), the "no autotune entry" warning,
clearing/resetting the cache, the upgrade compatibility contract,
and per-N workload-size guidance.
New "Environment variables" section in docs/source/environment.rst
listing every env var SCAMP / pyscamp / the distributed gRPC client
reads.
pyscamp autodoc stub (docs/pyscamp.py) now surfaces autotune()
and gpu_supported().

Test plan

22 C++ unit tests in test/cpp/test_autotune_cache.cpp
covering the cache lookup chain, cache-miss warning dedup, the
SCAMP_AUTOTUNE_QUIET env-var contract, on-disk round-trip,
atomic replace, recursive parent-dir creation, the full
bench-driven autotune pipeline driven by a synthetic
BenchmarkFn (no GPU needed), and four cache-upgrade scenarios
(future version header, missing header, partially stale cache,
user wipe with built-in fallback).
Tests build + pass under CUDA-toolchain-only CI on Linux
(g++, clang++) and Windows (MSVC cl), with no GPU required.
Verified locally against VS2022 MSVC 14.30 + CUDA 11.6.
SCAMP --autotune end-to-end on RTX 3080 (Linux + Windows)
produces a cache that subsequent SCAMP runs consume without
the cache-miss warning.
cmake --build . --config Release --parallel 2 (the exact
invocation in build-cuda-cli) builds + runs test_autotune_cache
via ctest --output-on-failure.

🤖 Generated with Claude Code

The CPU kernels already live under src/core/cpu_kernel/. Mirror that layout on the GPU side so the next round of GPU work (per-arch knobs + autotuning, issue #115) has a clear home and is not interleaved with the host-side tile/profile orchestration. Files moved (git rename, no behavior change): kernels.h, kernels.cu -- main kernel entry + dispatch kernels_compute.h, kernels_smem.h -- per-profile compute strategies + smem kernel_gpu_utils.h, kernel_gpu_utils.cu -- launch config helpers qt_kernels.h, qt_kernels.cu -- cuFFT sliding-dot-product kernels Files staying at src/core/ (host-side or shared): kernel_common.h/.cpp -- shared with CPU kernels qt_helper.h/.cpp -- host-side cuFFT context manager (has CPU fallback) tile.h/.cpp -- tile orchestration src/core/gpu_kernel/CMakeLists.txt now owns the qt_kernels, gpu_utils, and gpu_kernels library targets and is only added when CMAKE_CUDA_COMPILER is present, mirroring how cpu_kernel/ is structured. Verified that build_cuda/ builds cleanly with -DFORCE_CUDA=1 and that the 1NN_INDEX self-join on a 2000-element series exactly matches distance_matrix_np reference output (same vectors and indexes). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The do_tile<...> template was instantiated 45 times in a single .cu file (5 profile types x 3 precisions x 3 row/col modes), serialising nvcc into a ~100-second front-end for that one translation unit. Split it so each profile type gets its own .cu file and the per-TU work shrinks to 9 instantiations. kernels_impl.h -- shared template body for do_tile + LaunchDoTile kernels_dispatch.h -- LaunchKernel_<PROFILE> entry-point declarations kernels.cu -- slimmed: just the runtime dispatch + gpu_kernel_self/ab_join_* + match_gpu_sort kernel_1nn.cu kernel_1nn_index.cu -- one per profile type, each instantiates kernel_sum_thresh.cu LaunchDoTile<...> bound to its PROFILE_TYPE kernel_matrix_summary.cu kernel_approx_all_neighbors.cu Also extracted the plain launch constants (KERNEL_TILE_HEIGHT, BLOCKSZ_*, BLOCKSPERSM, DIAGS_PER_THREAD, TILE_HEIGHT_*) into a new CUDA-free kernel_constants.h. kernel_gpu_utils.h still re-exports them; host TUs can now include the constants without dragging in nvcc-only device intrinsics. Measured on this 8-core box (RTX 3080, nvcc 12.2): baseline (single kernels.cu): 101s wall, 740MB peak rss split (6 .cu, --parallel 8): 35s wall, 418MB peak rss A 2.9x wall-clock win and 43% lower peak memory for the gpu_kernels target. GPU output is byte-identical: verified that 1NN_INDEX self-join on a 2000- element series produces vectors+indexes matching distance_matrix_np exactly. PTX is unchanged -- this is a pure refactor. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

) The kernel launch path now consults a small per-(device, profile_type, precision) configuration table to pick the right (blocksz, tile_height, blocks_per_sm) tuple. The table is wired with three layers, checked in priority order at GetKernelConfigForDevice(): 1. User override file ($SCAMP_AUTOTUNE_CACHE, $XDG_CACHE_HOME, or ~/.cache/scamp/autotune.txt). RunAutotune() writes here. Useful for developers iterating on a single machine. 2. Built-in cache embedded into the binary at build time from data/autotune_cache.txt. This is what conda-forge / pip-wheel users benefit from since they cannot recompile to add kernel variants. 3. GetDefaultKernelConfig() -- the compile-time default that matches the constants SCAMP has historically shipped with. Workflow for shipping a new device's tuned config to end users: 1. Build SCAMP from source on a machine with the target GPU. 2. Run `SCAMP --autotune` (or `pyscamp.autotune()`) -- this writes the user override file. 3. Merge the new device's lines into data/autotune_cache.txt and open a PR. The next release picks them up via configure_file embedding. Files added: data/autotune_cache.txt -- checked-in cache with header src/core/gpu_kernel/ kernel_config.{h,cpp} -- KernelConfig struct + default kernel_constants.h -- moved compile-time constants here (CUDA-free) so host TUs can include them device_props.{h,cpp} -- cudaGetDeviceProperties wrapper producing a stable cache key autotune_cache.{h,cpp} -- text-format cache I/O, with LoadFromString hook for the built-in builtin_autotune_cache.h -- declares the embedded constant builtin_autotune_cache.cpp.in -- configure_file template autotune.{h,cpp} -- public API: RunAutotune, GetKernelConfigForDevice (override + builtin lookup) Files modified: src/core/gpu_kernel/CMakeLists.txt -- builds the new sources; configure_file embeds the cache into the binary src/core/gpu_kernel/kernels.cu -- compute_gpu_resources_and_launch calls GetKernelConfigForDevice for every launch src/main.cpp -- --autotune CLI flag src/python/CMakeLists.txt -- pyscamp links gpu_utils + CUDA::cudart when CUDA is on src/python/SCAMP_python.cpp -- pyscamp.autotune() binding Today only the compile-time-default KernelConfig is template-instantiated in the kernel, so IsSupportedKernelConfig() rejects any other tuple and the runtime silently falls through to the default. The infrastructure is in place for follow-up PRs to add alternative (blocksz, tile_height) kernel variants, at which point the cache becomes load-bearing for end users on tuned devices. Verified: - CLI --autotune on RTX 3080 writes a 15-entry override file and prints the developer workflow instructions. - pyscamp.autotune() does the same from Python. - SCAMP_AUTOTUNE_CACHE=/tmp/empty_cache + GPU 1NN_INDEX self-join still matches distance_matrix_np exactly (falls through to default). - Full test/test_pyscamp.py suite passes on GPU. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Eigen 5.0.1 (released Nov 2025) is the first stable release with ~1474 commits of fixes since 3.4.0, including many that matter for device-side use: added EIGEN_DEVICE_FUNC annotations, NVCC build fixes for CUDA 10+, MSVC+NVCC pragma fixes, and bug fixes for inverse evaluators / tridiagonalization / arg() under CUDA. This unblocks using Eigen expressions in .cu code without a long tail of workarounds. The CMake target name (Eigen3::Eigen) is unchanged; downstream link lines are unaffected. Existing CPU-kernel Eigen usage (Eigen::Array + Eigen::Map of fixed/dynamic 1-D arrays, plus EIGEN_MAX_ALIGN_BYTES) is API-compatible. Verified by building the full tree + running a self-join smoke test on both CPU and GPU paths with bit-identical output between them. Note: in Eigen 5 the version macros are EIGEN_WORLD_VERSION=3, EIGEN_MAJOR_VERSION=5, EIGEN_MINOR_VERSION=0 (the WORLD field is frozen at 3 for source-compat with downstream version checks). CMake package version is reported as 5.0.1, so find_package(Eigen3 5.0.0 ...) is the right gate. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Adds the build-system plumbing needed for the follow-up GPU kernel rewrite to start using Eigen expressions in .cu code: - --expt-relaxed-constexpr added to CMAKE_CUDA_FLAGS so nvcc can call Eigen's constexpr host functions from __device__ contexts without warnings. - gpu_kernels now links Eigen3::Eigen. The target is defined by src/core/cpu_kernel/CMakeLists.txt, which the parent src/core/CMakeLists.txt processes before this subdirectory. - A throwaway __global__ smoke test in kernels.cu instantiates Eigen::Array<float, 4, 1> with arithmetic + reduction inside device code. It is never launched at runtime and will be removed when the real Eigen rewrite replaces it. Verified to compile for sm_86 on CUDA 12.2; full SCAMP build + CPU/GPU self-join produce bit-identical output. No behavior change: the dispatcher, kernel templates, and runtime paths are untouched; only the toolchain wiring is in place so the next commit can begin replacing hand-unrolled accumulator arrays with Eigen::Array. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… sliding-window inner loop Heavy rewrite of the per-thread GPU compute path, drawing structure from the earlier use-eigen-in-gpu-kernels experiment branch. The changes are architectural: the kernel still computes the same output (bit-identical to master on 1NN / 1NN_INDEX / SUM_THRESH self-joins on RTX 3080 over randomlist128K, full pyscamp test suite passes including small-m m=3/m=4 and MATRIX_SUMMARY / KNN / AB-join paths). Highlights: - SCAMPSmem becomes templated on (tile_width, tile_height) and exposes each smem region as Eigen::Map<Eigen::Array<T, N, 1>>, so callees can write smem.df_col.segment<unrolled_diags>(offset) instead of hand-unrolled raw-pointer loads. The ctor uses placement new on nullptr-initialized Maps to re-seat them at the per-region smem offsets (Eigen::Map's only public way to rebind a pointer at construction time). - SCAMPSmem gains a `using DataType = DATA_TYPE` typedef so do_iteration_fast can declare its register-window arrays with the smem's scalar type, not the cov accumulator's. This matters for PRECISION_MIXED (cov is double, columns are float). - SCAMPThreadInfo.cov1..cov4 collapse to a single Eigen::Array<ACCUM_TYPE, DIAGS_PER_THREAD, 1> field. cov indexing in do_row / do_row_edge is now cov[i] instead of covN, paving the way for changing DIAGS_PER_THREAD without a kernel rewrite. - kernels_compute.h drops ~500 lines: the five per-profile-type overloads of update_row / merge_to_column / update_cols / reduce_edge / reduce_row collapse into single if-constexpr templates with Eigen::ArrayBase<Derived> parameters. Same for kernels_smem.h (init_smem, write_back). - New for_<N> constexpr-loop helper (std::index_sequence + lambda taking auto i) lets compile-time loop indices flow into template arguments cleanly, replacing #pragma unroll blocks for cases where the body uses the index as a template parameter. - do_iteration_fast now processes outer_unrolled_rows rows per call (vs DIAGS_PER_THREAD=4 in master) using a sliding column register window of width inner_unrolled_cols, refilling from smem after each inner row-batch. New compile-time knobs in kernel_constants.h: DIAGS_PER_THREAD=2, unrolled_rows=2, outer_unrolled_rows=16, inner_unrolled_cols=3, unrolled_cols=17, KERNEL_TILE_HEIGHT=256. These will be the variants the autotuner benchmarks once dispatch lands. - do_row_edge ports to Eigen-array form but keeps element-wise scalar math for the cov*inormc*inormr compute, since Eigen 5 doesn't auto-promote across Arrays of different scalar types in cwise ops and PRECISION_MIXED needs the cross-type math. - PRECISION_MIXED preserved (the eigen branch dropped it). - Small-m fix (num_diags = args.n_x - args.exclusion_upper + 1) preserved (the eigen branch predates it). - kernels.cu smoke-test kernel removed; production paths now exercise Eigen device-side. - --expt-relaxed-constexpr already added to nvcc flags by the earlier scaffold commit. gpu_utils target now also links Eigen3::Eigen (kernel_gpu_utils.h declares Eigen::Map fields). Perf on RTX 3080, randomlist1M window=200, 3 runs each: precision master eigen-port delta double 25.7s 31.7s +23% mixed 50.7s 56.4s +11% single 5.3s 6.2s +17% This regression is expected and acknowledged. The eigen-branch parameter values (DIAGS_PER_THREAD=2, outer_unrolled_rows=16, etc.) were never benchmarked against master; they're a WIP starting point. The follow-up "make the autotune cache load-bearing" PR will pre-instantiate multiple (DIAGS_PER_THREAD, tile_height, blocks_per_sm) variants and let RunAutotune benchmark + select the fastest per device. End-user perf will recover (and likely exceed master) once the autotuner can pick a config tuned for the device. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…IGHBORS The SCAMPProfileType enum is not contiguous: PROFILE_TYPE_1NN=6, PROFILE_TYPE_APPROX_ALL_NEIGHBORS=7, PROFILE_TYPE_MATRIX_SUMMARY=8 all sit above the prior PROFILE_TYPE_1NN_MULTIDIM=5 that was the loop's upper bound. Any cache entry naming one of those three profile types failed to parse, throwing SCAMPException("Unknown profile type ...") from AutotuneCache::Load(); GetKernelConfigForDevice silently caught it and returned the compile-time default, defeating the cache lookup. Existed since 4daa21a; only showed up now because the autotuner only writes a record for profile types that ParseProfileTypeName can also read back, so round-tripping its own output worked, but any hand-edited cache entry for the three affected types was silently ignored. Bumping the loop to PROFILE_TYPE_MATRIX_SUMMARY covers every defined enum value; out-of-range integers fall through to ProfileTypeName's default "INVALID" branch and harmlessly never match user input. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…itch Makes the autotune cache load-bearing at launch time. Previously GetKernelConfigForDevice's return value was discarded (kernels.cu had a literal "(void)cfg" comment) because LaunchDoTile only ever called one pre-instantiated kernel variant. Now the cache result is threaded through and the kernel actually dispatches to one of the enumerated launch-geometry variants. Mechanism: - kKernelVariants[] in kernel_config.cpp enumerates the supported (tile_height, blocks_per_sm) tuples. Index 0 is the canonical default; index 1 is the first alt (128, 4). - LaunchDoTile becomes a cfg-switching shim that calls the right pre-instantiated LaunchDoTileWithGeometry<tile_height, bps>. Each kernel_<X>.cu now instantiates both variants per (precision x row/col mode), doubling the do_tile template instantiation count from 45 to 90 (still parallelizable across the 5 per-profile .cu files). - IsSupportedKernelConfig walks the variant table instead of matching only the default, so cache entries naming any enumerated variant are now honored. - kernels.cu's compute_gpu_resources_and_launch threads cfg through to each LaunchKernel_<X>; blocksz now comes from cfg.blocksz, and get_smem takes cfg.tile_height so smem sizing matches the variant. - The launch-info print (gated by silent_mode) now includes tile_height + blocks_per_sm so users can see which variant fired without rebuilding. The default config is still variant 0, so behavior is unchanged for any cache that names the default or has no entry. Verified with the existing pyscamp test suite (all pass) and a manual test cache injecting variant 1 (128, 4) entries for every (profile_type, precision) tuple, where output remains bit-identical to the default-variant output on the 3080. What's still missing (follow-up commits): RunAutotune today still writes the default config for every (profile, precision) instead of benchmarking each variant and picking the winner. And the variant axes are limited to (tile_height, blocks_per_sm); the Eigen port made diags_per_thread / unrolled_rows / outer_unrolled_rows / kernel_tile_iters parameterizable too, but exposing them adds a template-arg axis to do_tile/SCAMPThreadInfo that ripples further. Both extensions land on top of this scaffold without redesign. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…olled_rows, kernel_tile_iters The 2-axis (tile_height, blocks_per_sm) variant table was a placeholder -- post-Eigen-port the inner-loop shape knobs are all template-friendly, and DPT is the most consequential one for register pressure / occupancy. Adds them as variant axes: KernelConfig fields (replacing the redundant tile_height with the four underlying knobs): - blocksz, blocks_per_sm (as before; blocksz precision-tied) - diags_per_thread (cov array width; smem column stride) - unrolled_rows (inner row-batch in do_iteration_fast) - outer_unrolled_rows (rows per do_iteration_fast call) - kernel_tile_iters (do_iteration_fast calls per tile) - tile_height() (derived: kti * our) Where the changes ripple: - SCAMPThreadInfo<ACCUM_TYPE, DiagsPerThread>: the cov accumulator's Eigen::Array compile-time size now comes from the template arg. - Every compute template (do_iteration_fast, do_row, do_row_edge, merge_to_row/column, update_rows/cols, reduce_row/edge) gains the inner-loop knob template params (DiagsPerThread on all; UnrolledRows + OuterUnrolledRows on do_iteration_fast). The derived constants inner_unrolled_cols = DPT + UR - 1, unrolled_cols = DPT + OUR - 1 become local `constexpr int` inside do_iteration_fast. - do_tile gains 4 new int template params; tile_height = KTI * OUR derived inside. - LaunchDoTileWithGeometry takes the full 5-tuple and dispatches the 9 precision x row/col instantiations via macro (replaces ~100 lines of duplicated switch arms). - LaunchDoTile picks among pre-instantiated geometries via a branched VARIANT_BRANCH macro keyed on the full 5-tuple. - get_smem takes diags_per_thread + tile_height args; GetTileHeight deleted (replaced by KernelConfig::tile_height()). - Host-side num_workers uses cfg.diags_per_thread. Cache file format kept on V1 (no shipped production cache to migrate), but the field count went 6 -> 9: blocksz | blocks_per_sm | diags_per_thread | unrolled_rows | outer_unrolled_rows | kernel_tile_iters. Stale 6-field caches trip SplitN, throw inside GetKernelConfigForDevice, and are caught + fall through to the default (same as any other parse error today). Re-running RunAutotune rewrites in the new format. Variant table seeded with 2 entries to validate the wider dispatch: v0: bps=2 dpt=2 ur=2 our=16 kti=16 (tile_height=256, current default) v1: bps=2 dpt=4 ur=2 our=4 kti=50 (tile_height=200, master-like DPT=4) Verified on RTX 3080, randomlist1M, window=200, DOUBLE: variant 0: ~39.7s variant 1: ~31.9s (20% faster, bit-identical output) That perf delta is real proof the dispatch picks the right kernel -- identical-output alone wouldn't have shown it. Master baseline before the Eigen port was ~25.7s on the same input; the autotuner can now at least narrow the gap by picking variant 1 on this device, and further gain is a matter of expanding kKernelVariants (which is the next commit's job once the benchmark loop is in place). Full pyscamp test suite still passes (1NN_INDEX, 1NN, SUM_THRESH, MATRIX_SUMMARY, APPROX_ALL_NEIGHBORS, KNN, small-m m=3/m=4, self/ab joins). CPU vs GPU bit-identical on randomlist128K self-join. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…XED, RTX 3080 cache Five related changes that together turn the autotune cache from an unused placeholder into a real per-device perf knob. Bundled into one commit because splitting them would break the build mid-history. 1. Benchmark loop (autotune.{h,cpp}, autotune_bench.{h,cpp}) - RunAutotuneWithBenchmark: for each (profile, precision) tuple, time every variant in kKernelVariants via a BenchmarkFn callback, write the fastest to the cache. - DefaultBenchmarkVariant: 65K x window=200 synthetic self-join, runs 1 warmup + 3 timed runs, returns the min. - SetKernelConfigOverride / ClearKernelConfigOverride: thread-local override that GetKernelConfigForDevice consults first; the benchmark loop sets it around each timed do_SCAMP call to force the variant under test. - autotune_bench is a separate library so the dep on scamp_op stays out of gpu_utils (avoiding a scamp_op -> tile -> gpu_kernels -> gpu_utils -> scamp_op cycle). main.cpp links it for --autotune; legacy RunAutotune is preserved for pyscamp which hasn't been re-wired yet. 2. Per-(profile, variant) compile split - kernel_variant.cu.in: configure_file template. CMake foreach generates kernel_<profile>_v<N>.cu (5 profiles x 6 variants = 30 TUs) each instantiating do_tile<...> for one variant geometry across 2 precisions x 3 row/col modes = 6 do_tile bodies. - kernels_variants.h: declarations for all 30 LaunchVariant_<X>_vN entry points + SCAMP_VARIANT_DISPATCH macro (the cfg switch). - kernel_<profile>.cu reduces to a one-liner that expands the macro. LaunchDoTile (the cfg switch as a template) is removed from kernels_impl.h. - Before: each kernel_<profile>.cu compiled all 6 variants x 6 instantiations = 36 do_tile bodies serially. After: 30 small TUs compile in parallel under -j20+. 3. 6-variant kKernelVariants table (kernel_config.cpp + kernels_variants.h SCAMP_VARIANT_DISPATCH) - v0: bps=2 dpt=2 ur=2 our=16 kti=16 (tile=256, eigen-port default) - v1: bps=2 dpt=4 ur=2 our=4 kti=50 (tile=200, DPT=4 master-like) - v2: bps=2 dpt=4 ur=4 our=4 kti=50 (tile=200, matches master's 4x4 hand-unroll exactly) - v3: bps=4 dpt=2 ur=2 our=8 kti=16 (tile=128, higher occupancy) - v4: bps=2 dpt=2 ur=2 our=8 kti=32 (tile=256, smaller outer-unroll) - v5: bps=1 dpt=4 ur=4 our=16 kti=16 (tile=256, low occupancy) 4. Drop PRECISION_MIXED - --mixed_precision CLI flag removed; pyscamp "mixed" kwarg removed; GetPrecisionType signature shrunk to (ultra, double, single). - kAutotuneTargets drops MIXED entries (15 -> 10 tuples). - LaunchDoTileWithGeometry's LAUNCH_FOR_ROWCOL_MODE macro drops the MIXED case. - SCAMPArgs::validate rejects PRECISION_MIXED with a clear error. - The enum value is kept in common.h for proto / distributed wire compat; the only code path that produced it is the now-removed CLI flag. Rationale: MIXED was uniformly slower than DOUBLE in practice (kept DP's accumulator cost without halving smem footprint), so no workload prefers it. - Stale MIXED-related comments in kernels_compute.h updated. 5. RTX 3080 (sm_86) entries baked into data/autotune_cache.txt - 10 lines, one per (profile, precision). Generated by `SCAMP --autotune` on this branch. Embedded into the binary via configure_file so conda-forge / pip-wheel users on an RTX 3080 get tuned configs without rebuilding. End-to-end on RTX 3080, randomlist1M, window=200, DOUBLE self-join, GPU: pre-Eigen-port baseline: ~25.7s Eigen port (default cfg): ~36.7s (+43%, the regression we incurred to land the new architecture) autotuned (this commit): ~25.3s (parity with baseline) So the autotune mechanism + 6-variant table closes the perf gap the Eigen port opened. Future variant additions can extend further. Full pyscamp test suite passes (1NN, 1NN_INDEX, SUM, MATRIX_SUMMARY self/AB joins; small-m m=3/m=4; KNN self/AB), bit-identical CPU vs GPU on randomlist128K. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…runtime Before this change, do_iteration_fast used constexpr for_<N>(lambda) loops that template-specialized do_row / merge_to_column / merge_to_row / update_cols / update_rows / reduce_edge for every (outer_row_iter, row_iter, start_index, iter) value. With OuterUnrolledRows=16 (variants v0 and v5) that meant 16 distinct do_row instantiations per (profile, precision, row/col mode) tuple, which made those variants ~1.7x more expensive to compile than OUR=4 variants. The template args were all only used as runtime offsets into Eigen expressions like best_so_far.segment<unrolled_diags>(row_iter) -- the *size* N is compile-time, but the *offset* can be runtime. So: - merge_to_row, update_rows, merge_to_column, update_cols, do_row, reduce_edge drop their index-only template parameters and take them as function args. The size-typed template parameters (num_to_update, DiagsPerThread, etc.) stay compile-time because they drive Eigen::Array sizes + #pragma unroll bounds. - The for_<OUR/UR>([&](auto j){...}) outer loop in do_iteration_fast, the for_<UR>([&](auto k){...}) inner loop, and the for_<DPT> loop in do_row_edge become plain #pragma-unrolled for loops. nvcc still fully unrolls them (compile-time-known trip counts) and folds j/k constants per iteration after unroll, so emitted PTX is unchanged on the deterministic-comparison cases (1M random self-join GPU output is bit-identical, 25.2s same as before). Per-variant compile-time delta (mean across 5 profiles, 5-way parallel): variant before after delta v0 OUR=16 112.8s 66.2s -41% v1 OUR=4 72.7s 60.4s -17% v2 OUR=4 64.8s 58.0s -10% v3 OUR=8 86.6s 60.4s -30% v4 OUR=8 86.2s 60.1s -30% v5 OUR=16 113.7s 71.0s -38% Per-variant cost variance collapsed from 1.75x spread (65-114s) to 1.22x (58-71s). 5-way parallel wall-clock bottleneck dropped from 114s to ~74s (-35%). Adding more variants (or variants with OUR>16) is now much cheaper than before. Correctness: pyscamp test suite passes 1NN_INDEX self/AB, SUM self/AB, Matrix Summary self, m=3 and m=4 small-window, KNN self/AB on every run. Matrix Summary AB join has a pre-existing ~20% flake rate from np.allclose tolerance interacting with the atomic-merge ordering on the transposed tile path -- unchanged by this refactor (the atomic ops and FMA chains are byte-identical). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…counter The Eigen-port (687a70b) refactor of write_back passed smem.profile_{a,b}_length into write_back_value where the pre-Eigen code passed args.profile_{a,b}_length. smem.profile_{a,b}_length is the smem-cached copy of the global counter that do_tile's NeedsCheckIfDone early-exit consults; the match-output atomic-add position must come from the *global* counter, which is what Profile::CopyFromDevice reads back to size the match_value_unordered vector on the host. With the atomic firing into smem instead, the global counter stayed at zero, host saw length=0, and KNN/APPROX_ALL_NEIGHBORS returned an empty profile. This only surfaced once the autotune cache could resolve APPROX_ALL_NEIGHBORS entries (b471cdc); before that, the parser silently failed and the fallback config happened to still produce zero matches via the same bug -- masked because GetKernelConfigForDevice was returning the default cfg either way. Test coverage gap: 687a70b's commit message claimed "KNN passes" but apparently the run was against the old conda-installed pyscamp build rather than the fresh tree. Validation on RTX 3080: - test_pyscamp.py: all 11 cases pass (1NN/SUM/Matrix/KNN, self+ab, small-m) - run_tests.py --executable pyscamp: 2072/2072 tests pass Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…bench The Eigen Map<Array<T,N,1>>::segment<>() pattern that the Eigen port introduced compiles, under nvcc device codegen, to N scalar ld.shared.{f32,f64} per segment -- the packet path is mostly inactive on the device side. The pre-Eigen master used raw reinterpret_cast<float4*> to get ld.shared.v4.f32. This commit reclaims that pattern via a vec_load<N, AlignBytes, T> helper that picks the widest aligned PTX vector load supported by the (T, AlignBytes) tuple and recursively peels the tail. do_iteration_fast is wired to vec_load at three sites: - initial column window load: aligned to DPT * sizeof(T) (info.local_col is threadIdx.x * DPT). - row data load: aligned to UR * sizeof(T) (info.local_row is a multiple of OUR, OUR is a multiple of UR, j*UR is a multiple of UR). - sliding-window refill: inner_unrolled_cols padded from DPT+UR-1 to DPT+UR so the refill offset lands at local_col + j*UR + DPT, which is UR-aligned for every variant (all current variants satisfy DPT % UR == 0, enforced by static_assert). The unread trailing slot costs 3 registers per thread (dfc/dgc/inormc). PTX inspection on RTX 3080 confirms ld.shared.v4.f32 / v2.f32 / v2.f64 emitted at the three sites. Variant 5 (OUR=16, KTI=16, tile_height=256) for SP self-join needs ~50KB of dynamic smem, exceeding the sm_8.x default 48KB per-block cap; LAUNCH_PRECISION now wraps the launch with cudaFuncSetAttribute(cudaFuncAttributeMaxDynamicSharedMemorySize, smem) when smem > 48KB. sm_86 max is ~100KB so v5 fits. Caught a separate latent autotune bug while wiring the smem opt-in: when a variant kernel failed to launch (e.g. v5 SP self-join exceeding smem cap before the opt-in), do_SCAMP swallowed the SCAMP_CUDA_ERROR return value and the autotune benchmark recorded the fast "failure" path as the winning time, writing a broken variant to the cache. TimeOneRun now calls cudaDeviceSynchronize + cudaGetLastError after do_SCAMP and throws if either reports a failure, so the RunAutotuneWithBenchmark catch path records that variant as +inf time instead. Re-autotuned data/autotune_cache.txt for RTX 3080 with the new code path active. DP cells are within ~3% of each other (FP64 throughput-bound on sm_86 at 1/64 DP rate), so the choice there is in the noise floor; SP cells show 5-15% spread between variants. Validation: - test_pyscamp.py: 11/11 pass - run_tests.py --executable pyscamp: 2072/2072 pass Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Prototype of a no-smem-column-buffer kernel that uses warp shuffles to walk cov across lanes, freeing ~30 KB DP / 15 KB SP of smem for higher occupancy on smem-bound configurations. Algorithm (per warp, per row): - Lane T owns a FIXED DPT-wide column slice. Ownership rolls over every 32*DPT rows via update_info_shfl, which reads the next 32*DPT-block directly from global memory (no smem column buffer). - cov[i] at row r tracks cov(r, lane_T_col + i) on diagonal lane_T_col + i - r. - Per row: compute dist + update cov in place (cov(r, c) -> cov(r+1, c+1)). - Per row: shift cov[i] right within the lane and shuffle in lane (T-1)'s post-update cov[DPT-1] into cov[0]. Cross-warp cov hand-off (the bit the earlier eigen-gpu-cov-shuffle prototype got wrong): - Lane 31 of warp k publishes its post-update cov[DPT-1] into a tiny smem hand-off region (2 * warps_per_block scalars, double-buffered). - Lane 0 of warp k > 0 reads warp k-1's published value. - One __syncthreads() per row at end-of-row, ordered so the read happens-before the write into the same slot (no race). - Lane 0 of warp 0 has no predecessor; its cov[0] is junk after row 0 and is masked from distc/distr updates via the slot_valid check. Smem layout (SCAMPShflSmem): row data + profile data + cov_handoff region. NO df_col/dg_col/inorm_col regions. ~6.7 KB total for DP var 6 vs ~33 KB for sliding-window v4. Variant table: variant 6 entry (bps=2, dpt=2, ur=0, our=8, kti=8) with ur=0 as sentinel for "shfl algorithm." CMakeLists routes ur=0 tuples through kernel_variant_shfl.cu.in instead of kernel_variant.cu.in. State of correctness: - SUM_THRESH self-join + AB-join PASS end-to-end (cleanest signal: the warp-reduce-then-atomicAdd path is robust to bugs that would surface elsewhere). - 1NN_INDEX / MATRIX_SUMMARY / KNN FAIL. Two likely root causes: (a) Slow path (do_row_edge equivalent) is a no-op placeholder. Tail tiles of each meta-diag produce NaN entries in the column profile (~87 NaN of 1801 at n_x=2000). (b) Possible off-by-one in merge_to_column_shfl's idxc assignment or in the slot_valid mask. Atomic-max amplifies indexing bugs that atomic-add masks; SUM_THRESH passing rules out cov drift in the bulk. State of dispatch: - TEMPORARY: v0..v5 commented out in CMakeLists.txt SCAMP_VARIANT_TUPLES and in kernels_variants.h SCAMP_VARIANT_DISPATCH so the build is fast during shfl-variant development. ONLY variant 6 is instantiated. Force the cache to (2,2,0,8,8) to exercise it. Re-enable v0..v5 before merge. Smem footprint comparison on RTX 3080 (sm_86) for the 4 enabled profile types: variant smem (DP) smem (SP) v4 (slide) 32.8 KB 17.2 KB v6 (shfl) 6.8 KB 4.5 KB Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

scripts/cov_shuffle_diagram.py renders the per-row state of the design-A variants (variant 6 cross-warp via smem hand-off; variant 8 per-warp- independent) so you can read off where covariance values flow between threads. Each frame for one row shows: - the BEFORE state: which block-local diagonal each lane's cov[i] slot is currently tracking; - the smem cov_handoff region (variant 6 only): what lane 31 of each warp writes and what lane 0 reads; - the AFTER state: where each value moved to via the intra-warp shuffle + cross-warp override, with masked-but-still-present junk cells shown in parens. Configurable BLOCKSZ, DPT, warp size, row count, variant. Defaults are tiny (BLOCKSZ=8 WARP=4 DPT=2 rows=3) so the diagrams fit on one screen; pass --blocksz 256 --warp 32 for realistic hardware geometry (verbose). Side-by-side, the two variants make the design tradeoff obvious: variant 6 -- one junk source (lane 0 of warp 0), needs smem + 1 sync per row, but only warp 0's lane 0 produces masked cells. variant 8 -- junk at every warp boundary, no smem or sync, but every warp's lane 0 progressively masks more slots as r grows. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Alignment fix: compute the cell width once across all frames (including masked '(N)' parenthesized variants AND lane headers), then apply the same right-justified width to every cell. Previously narrow numerics like ' 1' were padded inconsistently with wide variants like '(15)'. Waste table: replace the single-tile_height waste line with a sweep across tile_heights spanning [DPT, 4*32*DPT], shown for both the DEMO geometry (whatever --blocksz the user picked) AND the realistic kernel geometries (BLOCKSZ=256 for DP, 512 for SP, warp=32). The tables make the var 6 vs var 8 tradeoff concrete. For BLOCKSZ=256 DPT=2 (the v6 setting in this branch): tile_height var 6 wasted var 8 wasted 8/6 ratio 2 0.10% 0.78% 8.0x 4 0.29% 2.34% 8.0x 16 1.46% 11.72% 8.0x 32 3.03% 24.22% 8.0x 64 6.15% 49.22% 8.0x <-- our v6 setting 128 12.40% 74.61% 6.0x 256 24.90% 87.30% 3.5x So variant 6 wastes 6.15% at tile_height=64; variant 8 at the SAME tile_height wastes 49.22% (8x more). This was the previously-overlooked cost of dropping the cross-warp hand-off -- per-warp-independent only wins at small tile_height (<~16), where the rotation amortization is also worse. Earlier I'd ballparked variant 8 waste at ~1.5%, which was only counting the initial junk slot, not its propagation through the warp -- the table now shows the propagated cost too. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…-of-bounds local_mp_col writes, eliminate KNN duplicate matches by re-initializing column registers, and implement CMAKE_CUDA_ARCHITECTURES build escape hatch.

…int tuning, and dynamic tile size defaults

…pu-kernel-shfl-variant

Kernel/variant changes: - Add v7 (shfl bps=8 DPT=4 our=8 kti=16 -> tile_height=128, max for DPT=4) and v8 (shfl bps=4 DPT=8 our=8 kti=32 -> tile_height=256). v8 is the first DPT=8 variant; combined with the bigger tile it wins 3 of the 10 (profile, precision) autotune targets at 512K bench size, including 1NN_INDEX SP and both SUM_THRESH cases. Recommended default on RTX 3080 is now v8 bsz=128 (geomean 1.224x of per-target best, worst-case 2.04x). - Fix do_update_info_shfl dispatcher: it only branched on updates_remaining = 0..3, so DPT >= 5 silently skipped slot rotations for slots 0..(DPT-5). That eventually de-synced state.global_col against tile_start_col and crashed in flush_all_cols_to_smem with a smem OOB. Extend ladder to 0..7 + static_assert(DPT <= 8). - Sliding-window LaunchDoTileWithGeometry now dispatches do_tile on a runtime blocksz (64/128/256/512) the same way LaunchDoTileShfl already did, so the autotuner can sweep blocksz for those variants too. Autotune machinery: - g_cfg_override was thread_local; do_SCAMP spawns workers via std::async so they never saw it -- every trial silently fell back to the on-disk cache, flattening per-variant timings. Change to a process-wide std::mutex-guarded optional. This is the fix that unblocked all of the per-variant differentiation below. - RunAutotuneWithBenchmark now sweeps (variant geometry x blocksz), prints a progress line per trial with elapsed/ETA, identifies the winner by variant name (v0..v8), and emits a cross-target weighted score table -- geomean of time/per-target-best across all 10 targets, also reports worst-case ratio. Recommends a "best safe default" trial at the end. - DefaultBenchmarkVariant uses computing_rows=true so the workload is a real symmetric self-join (was upper-triangle only, hiding the cost asymmetry); max_tile_size bumped to 512000 to match the GPU default in main.cpp; benchmark input length is overridable by SCAMP_AUTOTUNE_INPUT_LENGTH env var (default 131072 for fast dev sweeps -- 512K via the env gives results that match production scale better). - IsSupportedKernelConfig admits all of {64, 128, 256, 512} blocksz for every enabled variant, not just the precision-tied default. Built-in autotune cache: - Refresh data/autotune_cache.txt with the RTX 3080 winners from the new sweep. SP entries change substantially -- v8 (DPT=8 shfl) wins 1NN_INDEX SP + both SUM_THRESH cases, v5 (bps=1 tile=256) wins the rest of the SP targets. DP entries shift to bsz=64. Notes: - v2 / v5 re-enabled alongside v0/v1/v3/v4/v6 so the score sweep covers the full set; only v8 is the brand-new shape, the rest were already there. - SCAMPShflState comment cleaned up to reflect that dfc2/dgc2/inormc2 staging is gone (read directly from global at rotation now). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Two related cache bugs that meant the new shfl variant defaults never actually made it to end users: 1. data/autotune_cache.txt is embedded into the binary via CMake file(READ) + configure_file. file(READ) doesn't track the source path as a dependency, so editing the cache and running `cmake --build` would silently leave the previous contents baked into the binary -- the user updates the cache, rebuilds, and the runtime lookup still misses their new entries. Add CMAKE_CONFIGURE_DEPENDS for the cache file so changes auto-trigger a reconfigure. Verified via `strings build/SCAMP | grep RTX_3080` before/after. 2. pyscamp.autotune() called SCAMP::RunAutotune (the older stub) which just writes the compile-time default cfg for every (profile, precision) target -- no benchmarking. The user-cache write then clobbers the built-in cache lookup, so a conda-forge user who ran pyscamp.autotune() ended up WORSE off than before they called it (the shipped data/autotune_cache.txt entries got masked by the stub-default cache). Switch the Python binding to RunAutotuneWithBenchmark + DefaultBenchmarkVariant -- same path as the CLI's --autotune -- and link pyscamp against autotune_bench. Verified end-to-end: pyscamp.autotune() now runs the full sweep (~4 min for 9 variants x 4 blocksizes x 10 targets at 131K bench size), writes the benchmarked winners to the cache, and SCAMP at runtime picks them up (6/6 launches matched the cache entries). Also default pyscamp.autotune() to device 0 only -- mirrors the CLI behavior. Tuning every visible GPU when most multi-GPU boxes have identical devices is wasted work. Callers can pass devices=[...] to override. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Adds the missing scaffolding around the autotune cache lookup so that end users get a meaningful diagnostic when their device isn't in the cache, and the cache + lookup logic is covered by lightweight C++ tests that don't require a GPU at runtime. What's new for users: - When SCAMP launches a kernel and finds no autotune entry for the current (device, profile, precision) in either the user override cache or the binary's built-in cache, a one-shot warning is emitted to stderr identifying the missing tuple and the compile-time-default config used as a fallback. The warning is deduplicated per tuple so a long-running process sees at most one line per tuple. - SCAMP_AUTOTUNE_QUIET=1 suppresses the warning. pyscamp's module init sets this by default (with setenv overwrite=0 so an explicit user value still wins) -- notebook users don't get uninvited stderr; CLI users still do. - Docs: new docs/source/autotune.rst covers the cache, lookup order, default path, --autotune workflow, the warning, env-var overrides, and troubleshooting; added to the toctree. What's new for developers: - Factored LookupKernelConfigForDeviceKey() out of GetKernelConfigForDevice() -- pure logic, no GPU dependency, takes cache pointers + device key + fallback as arguments. Production callers still use GetKernelConfigForDevice (which queries the real device and threads through the pure helper); tests drive the helper directly with synthetic device keys. - New ResetAutotuneWarnings() for tests; the dedup set is process- global so without it later tests would silently no-op. - test/cpp/test_autotune_cache.cpp: 15 unit tests covering the lookup chain (user hit, built-in hit, user-beats-built-in, unsupported variant falls through, multi-device key isolation), the cache-miss warning (fires once, one-shot per tuple, separately per (profile,precision)), the SCAMP_AUTOTUNE_QUIET env var (suppresses + respects falsey values), DefaultPath() resolution (SCAMP_AUTOTUNE_CACHE -> XDG_CACHE_HOME -> HOME/.cache), on-disk round-trip, and the malformed-cache throw. - BUILD_SCAMP_TESTS=ON CMake option (default OFF) wires the test into ctest. Gated on CMAKE_CUDA_COMPILER for now since the only test module currently exercises GPU-side machinery (no CUDA *runtime* needed -- the test never touches a device -- but the toolchain has to be present to build gpu_utils). Verified end-to-end: - ctest passes 15/15. - Real SCAMP run on a device in the built-in cache: no spurious warning. - pyscamp module init sets SCAMP_AUTOTUNE_QUIET=1 (confirmed via libc.getenv from ctypes; Python's os.environ is a startup snapshot so doesn't show it, but the underlying env that the C++ warning code reads is correctly set). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The unit tests added in 1254f40 (test/cpp/test_autotune_cache.cpp) are behind -DBUILD_SCAMP_TESTS=ON which CI wasn't passing, so they were shipped but never run. Enable the flag + invoke ctest in all three CUDA jobs: - build-cuda-cli (Ubuntu/Windows x g++/clang++/cl): catches compiler matrix regressions in the cache parser / lookup code. - build-cuda-versions (Ubuntu/Windows x CUDA 12.6/12.8/13.0): catches CUDA-toolkit-version regressions (rare but the test is cheap to run). - build-and-test-cuda (self-hosted with a GPU): runs alongside the existing integration tests. The unit tests require the CUDA toolchain at build time but never touch a real device at runtime, so they run successfully on the build- only runners with no NVIDIA driver. CMakeLists already gates the test target on CMAKE_CUDA_COMPILER so the CPU-only CI jobs short-circuit cleanly without any extra workflow changes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ts.h Variant table churn (prune, rename, add) used to require touching three files in lockstep: - SCAMP_VARIANT_TUPLES in CMakeLists.txt (the build-system list). - kVariants[] in kernel_config.cpp (the runtime table). - SCAMP_DECL_VARIANTS_FOR_PROFILE + SCAMP_VARIANT_DISPATCH in kernels_variants.h (the per-variant forward decls + dispatch ladder). Drift between them turned into hard-to-spot bugs (e.g., a kernel TU built but never called because its dispatch case was missing). Make SCAMP_VARIANT_TUPLES the single source of truth and generate the other two: * The tuple list drops its explicit-index prefix -- variant N is the Nth entry in the list. Add/remove/reorder edits one place. * CMake walks the list, building three @-substitution strings: the kVariants[] initializer body, the per-variant forward decl block, and the per-variant dispatch if-ladder body. * configure_file emits kernel_variants_table.h (consumed by kernel_config.cpp) and kernels_variants.h (consumed by the per-profile .cu dispatchers) into the build dir. * gpu_utils and gpu_kernels pick up CMAKE_CURRENT_BINARY_DIR on the include path so the generated headers resolve. Also retire the four sliding-window/shfl variants the multi-device autotune sweep showed never won anything: the pre-prune labels v1, v3, v4, v7 are gone. Surviving variants are compacted to indices 0..4: pre-prune new geometry role --------- --------- -------------------- ---------------------- v0 v0 {2,2,2,16,16} sliding-window default v2 v1 {2,4,4,4,50} master-like 4x4 v5 v2 {1,4,4,16,16} big tile / low occ v6 v3 {8,4,0,8,8} shfl DPT=4 tile=64 v8 v4 {4,8,0,8,32} shfl DPT=8 tile=256 3080 cache entries already used the kept variants; the only entry that referenced a retired variant was P100 APPROX_ALL_NEIGHBORS DOUBLE (was v3 pre-prune), substituted with v5 -- the runner-up on the P100 cross-target score sweep, geometry preserves bsz=256. Saves 4 x 5 profiles x 6 instantiations = ~120 do_tile template emissions on a clean build. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The autotune machinery requires QueryDeviceProps() to resolve the device cache key, which in turn needs a CUDA-capable GPU. Our CI machines (windows-latest, ubuntu-latest in the build-cuda-cli matrix) have the CUDA toolchain but no device, so the bench-driven autotune pathway was previously untestable in CI. Refactor: - Split RunAutotuneWithBenchmark into a CUDA-aware wrapper plus a new RunAutotuneWithBenchmarkForDeviceKey(device_key, device_id, bench, cache_path, verbose, print_banner) overload that takes the resolved device key directly. The wrapper calls QueryDeviceProps, prints the device-specific banner, then delegates. The inner overload prints a shorter device-key-only banner when print_banner is true (the wrapper passes false to avoid duplicating the header). - No behavior change for production callers: SCAMP --autotune and pyscamp.autotune() go through the wrapper unchanged. Test (test_autotune_cache.cpp): - Add Test_AutotuneFullPipeline_FakeBench. Drives the full autotune sweep with a synthetic BenchmarkFn that returns 0.5s for one rigged (variant, blocksz) winner per (profile, precision) target and 1.5s for every other trial. After the call: * verifies the cache file exists at the requested path, * reloads it and asserts every target's recorded config matches the rigged winner's geometry tuple and blocksz, * exercises LookupKernelConfigForDeviceKey with the reloaded cache to confirm the lookup path returns the rigged winner end-to-end. Verifies the variant-sweep + per-target-winner + cache-write + cache-reload + lookup chain without touching a GPU. Verification: - Linux (gcc 11, build_sm86): 18/18 tests pass via ctest. - Native Windows (VS2022 MSVC 14.30, CUDA 11.6, sm_86): same 18/18 pass; the test compiles cleanly on MSVC. - End-to-end on Windows with a real GPU (RTX 3080): * SCAMP --autotune wrote 10 entries to %LOCALAPPDATA%\scamp\autotune.txt * Subsequent SCAMP --print_debug_info launch picked the v3 shfl config (bps=8 dpt=4 our=8 kti=8 blocksz=64) from that cache for DP 1NN_INDEX without emitting the cache-miss warning. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Several places that document the autotune subsystem had drifted since recent changes: - The cache-path resolution was Linux-only in user-visible docs; the recent _WIN32 branch of AutotuneCache::DefaultPath (LOCALAPPDATA / USERPROFILE) wasn't reflected anywhere. - The autotune.rst page quoted 9 variants and ~360 trials; we ship 5 variants today and a full sweep is 200 trials (5 x 4 x 10). - pyscamp.autotune()'s docstring claimed empty devices=[] means "every visible CUDA device is tuned" -- the implementation actually defaults to device 0 only (the rationale: a sweep takes O(minutes) and most multi-GPU hosts have identical devices). - docs/pyscamp.py (the Python stub Sphinx reads for autodoc) was missing autotune() and gpu_supported() entirely, so neither showed up in the rendered pyscamp API page. Updates: - docs/source/autotune.rst: TL;DR, lookup chain, "Default cache location", "Clearing or resetting", and troubleshooting all mention the Windows path (%LOCALAPPDATA%\scamp\autotune.txt, with %USERPROFILE%\.cache fallback). Sweep size corrected to "5 variants enabled today -> 200 trials"; phrased so the count tracks future variant-table changes via the "current variant count" framing. - docs/pyscamp.py: add autotune() and gpu_supported() to the autosummary and emit matching docstrings for the Sphinx stub build. - src/python/SCAMP_python.cpp: rewrite the autotune() pbdoc to match the implementation (device 0 only by default; full per-platform cache-path resolution). Single backslashes inside R"pbdoc()" -- pbdoc is a raw string, so \\scamp would render as \\scamp. - src/main.cpp: --autotune --help text mentions all three resolution steps and the Windows path. - src/core/gpu_kernel/{autotune.h,autotune_cache.h,autotune.cpp, builtin_autotune_cache.h}: doc-comments referencing the cache path now mention the platform-specific fallback (or refer readers to AutotuneCache::DefaultPath, which is the single source of truth). No code paths changed; this is documentation only. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Branch-wide sweep ahead of merging to master: comments that narrated intermediate states ("before this split", "now lives in", "pre-prune label v6", "design-A"), values that were no longer used, and the backward-compat RunAutotune stub that nothing on the branch still calls. Each removal is mechanical -- no behavior change. Header/comment refresh: - kernel_variant.cu.in, kernels_impl.h, kernels_compute.h, autotune_cache.cpp, autotune.cpp, autotune.h: replace before-vs-after-split / before-vs-after-this-change framing with a plain description of what the code does today. - kernels_compute_shfl.h, kernels_impl_shfl.h, kernels.cu, kernel_variant_shfl.cu.in, kernel_gpu_utils.h: drop "design-A" codename (used during the multi-design bake-off, not meaningful on master) in favor of "cov-shuffle" / "shfl". - kernel_config.cpp: drop "an earlier iteration had 9 entries v0..v8" and "pre-prune label v6" history. Reframe GetDefaultKernelConfig so it picks the first shfl variant by iterating kVariants rather than hard-coding index 3 with a stale-on-renumber comment, and let IsSupportedKernelConfig iterate kVariants directly instead of a redundant hard-coded {0..4} list. - docs/source/autotune.rst: drop the "RunAutotune stub vs RunAutotuneWithBenchmark" troubleshooting bullet (RunAutotune is gone; see below). - src/core/gpu_kernel/CMakeLists.txt: rewrite the layout/build-table prologue, drop "we no longer carry explicit indices" and "same as before" framing, drop stale "5 profiles x 6 variants = 30 TUs" count (it's 5 x 5 = 25 today and the number tracks variant changes anyway). - SCAMP_python.cpp: drop the "not the older RunAutotune stub" comment from the autotune() impl now that there is no older stub. Dead code removed: - SCAMP::RunAutotune(...) (the non-benchmarking thin wrapper that wrote the compile-time default to every target). Not called by anything since RunAutotuneWithBenchmark landed; the bench path is now the only entry point. - DEFAULT_DIAGS_PER_THREAD, DEFAULT_UNROLLED_ROWS, DEFAULT_OUTER_UNROLLED_ROWS, DEFAULT_KERNEL_TILE_ITERS, DEFAULT_BLOCKSPERSM constants in kernel_constants.h. Variant geometries live in SCAMP_VARIANT_TUPLES; these duplicates had no callers and would silently drift if a future variant-0 edit happened in only one place. Doc-comment correctness: - autotune.h prologue + SetKernelConfigOverride doc say "process-wide" override, not "thread-local" -- matches the implementation, which uses a process-wide std::mutex-guarded optional so std::async workers see the override (a thread_local would not). - autotune_cache.h schema example shows 9 fields (the actual record layout) rather than the 6-field draft form. Verified: SCAMP rebuilds clean; 18/18 unit tests pass on Linux. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The cache file format already had graceful per-entry fall-through built into the read path (a stale variant tuple loses to the next cache source on lookup; a wrong-version header silently empties the cache), but the upgrade contract wasn't explicit anywhere and the two failure modes weren't all covered by unit tests. Adding both so future kernel changes have a clear "do I need to wipe everyone's cache?" decision tree and so a regression of the fall-through path shows up in CI. Upgrade contract (now spelled out in autotune_cache.h + the user docs at docs/source/autotune.rst): - Add / remove / reorder variants in SCAMP_VARIANT_TUPLES: no version bump. Per-entry fall-through handles it -- the cache is keyed by (bps,dpt,ur,our,kti) tuple, not by index, so existing entries either still match (lookup hits) or fall through (lookup misses on that one tuple, the other cache entries are unaffected). - Schema or kernel-semantics change forces a wipe: bump kHeader (SCAMP_AUTOTUNE_V<N> -> V<N+1>). ParseStream silently treats a non-matching header as empty rather than throwing, so an end-user upgrading their pip wheel doesn't see SCAMPException -- they just fall through to the new release's built-in cache and can re-tune at their leisure. New tests in test/cpp/test_autotune_cache.cpp: - Test_FutureVersionHeaderSilentlyEmpties: SCAMP_AUTOTUNE_V99 header + a record line loads zero entries (not a throw), lookup hits the cache-miss warning. Covers the post-version-bump escape hatch. - Test_MissingHeaderSilentlyEmpties: a file with data lines but no header at all also loads zero entries. - Test_PartiallyStaleCachePreservesGoodEntries: a cache with one valid entry + two entries naming bogus variant tuples loads all three, but lookups hit only the valid one; the stale entries fall through to fallback + the one-shot warning. Covers the everyday "we changed the variant table" upgrade case. - Test_StaleUserCacheFallsThroughToBuiltin: a user cache with a future-version header (all entries silently dropped) plus a built-in cache with the same device key -- the built-in entry must win. Covers the worst-case "the upgrade wiped my tuned config but the shipped built-in still has me covered" path. 22/22 tests pass on both Linux (gcc 11) and native Windows (VS2022 MSVC 14.30, CUDA 11.6, sm_86). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Two unrelated doc fixes that came up together while preparing the branch for review. Env-var docs: - environment.rst gains a "Environment variables" section that lists every env var SCAMP / pyscamp reads, build-time and run-time, with a one-line purpose for each. Before this commit the cache-related ones were documented only in autotune.rst and pyscamp.py, which meant they were discoverable only by someone already reading the autotune page. The distributed gRPC env vars (SCAMP_SERVER_SERVICE_HOST/PORT) weren't documented anywhere. - pyscamp/intro.rst gains PYSCAMP_BUILD_TYPE and PYSCAMP_NO_PLATFORM_AUTOSELECT, two setup.py-honored env vars that previously had no user-facing documentation. Autotune workload-size measurement (RTX 3080, full 10-target sweep): 64K -> ~4 min, geomean 1.325, worst 2.25 128K -> ~8 min, geomean 1.308, worst 3.47 (current default) 256K -> ~25 min, geomean 1.278, worst 2.77 The per-target winners themselves shift across sizes (e.g. SUM_THRESH/DOUBLE picks v0 at 64K, v2 at 128K, v4 at 256K -- those are different kernel geometries, not just different rankings), so a short sweep doesn't just rank suboptimally for the cross-target default, it picks suboptimal cache entries for individual targets. 256K is meaningfully tighter than the current 128K default. Updates: - docs/source/autotune.rst "Choosing the benchmark workload size" now has the comparison table above and concrete guidance: 128K is the deliberate casual-dev default; 256K or 512K is recommended before shipping a cache entry to data/autotune_cache.txt. - src/core/gpu_kernel/autotune_bench.cpp comment block over kBenchmarkInputLengthDefault: dropped the stale "under a minute on a fast GPU" claim (the sweep is closer to 10 min at the default size) and the stale trial-count formula (it omitted the blocksz axis introduced when autotune started sweeping blocksz). No default value change in this commit -- bumping SCAMP_AUTOTUNE_INPUT_LENGTH from 131072 to 262144 would 4x the wall time of the default `--autotune` run, which is a separate UX trade-off worth a follow-up discussion. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The autotune-size sweep on RTX 3080 (committed in the prior diff) showed that 256K gives meaningfully tighter cross-target rankings than the previous 128K default: 128K: ~8 min, geomean 1.308, worst-case 3.47 256K: ~25 min, geomean 1.278, worst-case 2.77 (~20% better worst case) Per-target winners themselves shift between sizes (e.g. SUM_THRESH/DOUBLE picks different kernel geometries at 64K vs 128K vs 256K), so a too-small sweep doesn't just rank the cross-target default suboptimally -- it picks suboptimal entries for individual cache rows. 256K is the smallest size where launch overhead is reliably dominated by steady-state kernel cost. The 4x wall-time cost is amortized over the lifetime of the autotuned cache -- users hit `SCAMP --autotune` once per device. Anyone tuning on an older / slower GPU where 25 minutes is too long can dial it back via SCAMP_AUTOTUNE_INPUT_LENGTH=131072 (the doc updated here calls this out explicitly). Changes: - src/core/gpu_kernel/autotune_bench.cpp: kBenchmarkInputLengthDefault 131072 -> 262144. Comment block above the constant explains the trade-off and points at the doc for the per-N comparison. - docs/source/autotune.rst: "Choosing the benchmark workload size" reflects 256K as the new default and reframes the smaller-N row of the comparison table as "use this if 256K is too slow on your GPU". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…l-variant # Conflicts: # .github/workflows/build-and-test.yml # src/python/SCAMP_python.cpp

…real SCAMP crashes Four loosely-related fixes that came out of inspecting CI output for the latest PR push. 1) SCAMP_distributed missing autotune_bench link dep The BUILD_CLIENT_SERVER=1 docker build was failing with: undefined reference to `SCAMP::DefaultBenchmarkVariant(...)' src/distributed/CMakeLists.txt builds SCAMP_distributed from the same src/main.cpp the top-level SCAMP target uses, and that file references DefaultBenchmarkVariant under _HAS_CUDA_ (which mark_cuda_if_available() propagates to every distributed target). The top-level SCAMP target gets autotune_bench linked under the same CUDA gate; SCAMP_distributed did not. Mirror the link rule so the dep is satisfied in both places. 2) Test scripts spammed the autotune-cache-miss warning run_tests.sh and run_tests.py spawn many short SCAMP processes. The one-shot dedup on the "no autotune entry for device ..." warning is per-process, so each spawn re-emits the warning. Both harnesses now export SCAMP_AUTOTUNE_QUIET=1 before any SCAMP invocation; users running the harnesses interactively can override by setting the env var themselves first. 3) Warning text now points at SCAMP_AUTOTUNE_QUIET Added a parenthetical to the cache-miss warning explaining how to silence it process-wide. Users hitting this in a tight CI loop now have a fix-it-yourself hint without grepping the source. 4) run_tests.py was swallowing SCAMP crashes The harness called subprocess.call() and discarded the exit code, then read the per-test output files. If SCAMP aborted (e.g. "GPUasssert: unspecified launch failure" on a kernel-config the target device couldn't run), the previous subtest's output files were still on disk, so the comparator silently re-evaluated stale data and the test "passed". Two changes: - rm the output files (mp_columns_out, mp_columns_out_index, mp_rows_out, mp_rows_out_index) before each SCAMP invocation, so a missing output surfaces as a None array instead of a stale read. - check subprocess.call's return code and raise RuntimeError on non-zero so the test runner exits non-zero. This fourth change will likely surface a real bug on the next CI run of build-and-test-cuda: the compile-time-default kernel variant (v3 shfl, blocksz=128 bps=8 dpt=4 ur=0 our=8 kti=8) hits an unspecified launch failure on sm_60 (the self-hosted runner reports its device key as 'NVIDIA_Graphics_Device__sm_60', which doesn't match the 'Tesla_P100-...' built-in cache rows, so the fallback fires). That's intentional -- the bug was already happening; the harness was just hiding it. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…UTE_COLS compute-sanitizer on a P100 (sm_60) bisected the "GPUasssert: unspecified launch failure tile.cpp:175" reproducer that build-and-test-cuda was hitting down to a single offset in do_tile_shfl: every thread in block (0,0,0) trapped at +0x21218 of the (PROFILE_TYPE=1NN, COMPUTE_ROWS=true, COMPUTE_COLS=false) instantiation of the kernel, attributed to kernels_impl_shfl.h:148. That line is the start of the per-tile distc reload loop: for (int i = 0; i < DiagsPerThread; ++i) { int col_idx = state.local_col[i]; if (col_idx >= 0 && col_idx < tile_width) { ... state.distc[i] = smem.local_mp_col[col_idx]; ... } } SCAMPShflSmem's ctor only re-binds local_mp_col to a real backing pointer when compute_columns is true; with COMPUTE_COLS=false (the ab-join lower-triangle path, where the col profile belongs to the other tile in the pair) the Map's data pointer stays at the nullptr sentinel from the member initializer list. The smem read then dereferences a null pointer. Why Ampere happened to "work": the kernel does a check (col_idx >= 0 && col_idx < tile_width) before the dereference and the resulting distc is never consumed (merge_to_column_shfl / flush_all_cols_to_smem are also COMPUTE_COLS-gated, so the trash distc gets overwritten before it matters). On sm_86 the load through null seems to retire silently in this configuration; on sm_60 the MMU traps on the load and the whole kernel aborts. Fix: gate the entire per-tile distc-reload loop on COMPUTE_COLS, mirroring the gating already in place for flush_all_cols_to_smem (kernels_impl_shfl.h:219) and for the smem.local_mp_col writes inside the init_smem_shfl helpers (kernels_compute_shfl.h:628/659/691). When COMPUTE_COLS is off distc is unused for the rest of the tile; just seed it with init_dist so any speculative reads see a well-defined value. Verified locally: - sm_86 (RTX 3080): 460/460 Extended Tests pass before AND after. Lower bound that the fix is non-regressing; the bug never manifested on Ampere in the first place. - The fix is small enough to reason about by inspection; the sm_60 confirmation comes from re-running compute-sanitizer against the patched kernel. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Refactor: blocksz is now a per-variant property declared in SCAMP_VARIANT_TUPLES rather than a precision-tied global constant (BLOCKSZ_DP / BLOCKSZ_SP) gated by a hidden "sliding-window vs shfl" rule inside GetKernelConfigForVariant. The change removes the hidden coupling without changing runtime behavior: each variant declares default_blocksz_dp and default_blocksz_sp explicitly, and GetKernelConfigForVariant returns the field that matches the requested precision. Tuple format grows from 5 to 7 fields: before: bps|dpt|ur|our|kti after: bps|dpt|ur|our|kti|default_blocksz_dp|default_blocksz_sp Values are picked to preserve current cold-start behavior bit-for-bit: - sliding-window variants get 256|512 (== BLOCKSZ_DP, BLOCKSZ_SP) - shfl variants get 128|128 (the prior hardcoded shfl default) The kernel is still instantiated for all of {64, 128, 256, 512} per variant, the autotuner still sweeps blocksz as an independent axis, and on-disk cache entries can still name any of the four values; the per-variant defaults only feed the cold-start fallback path where no cache hit exists. Files removed (everything was used only by the old precision-tied default path): - src/core/gpu_kernel/kernel_constants.h (was BLOCKSZ_DP/SP only) - SCAMP::BlocksizeForPrecision in kernel_config.cpp - SCAMP::get_blocksz in kernel_gpu_utils.{h,cu} (already dead -- declared but not called by anything) Doc updates folded in: - docs/source/environment.rst: document SCAMP_USE_EXTERNAL_EIGEN (was honored by cpu_kernel/CMakeLists.txt but undocumented), plus a note about Eigen 5.x still publishing as the Eigen3 package. - src/core/gpu_kernel/CMakeLists.txt: expand the comment above the target_link_libraries(gpu_kernels ... Eigen3::Eigen) call to mention the dual Eigen-acquisition path (vendored INTERFACE target vs find_package import). - docs/source/autotune.rst + builtin_autotune_cache.h: small polish landed alongside. Verified locally on RTX 3080: - kernel_variants_table.h regenerates cleanly with the 7-field initializer rows. - 22/22 unit tests pass. - run_tests.py subset (windows 13/33, inputs 100/250/1500, tiles 1024/2048) = 460/460 GPU tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The prior commit gated the per-tile distc-reload loop on COMPUTE_COLS to fix the v3 Pascal nullptr-deref. Tracing the lifecycle of distc across a tile shows the gate fixed the bug but missed the deeper redundancy: the pre-load itself was a no-op. distc lifecycle inside one tile: 1. init_smem_shfl: smem.local_mp_col[col_idx] = global_profile[col] (cross-tile running best from the global profile). 2. (pre-load) distc[i] = smem.local_mp_col[col_idx] -- COPY of the prior_global value into the per-thread accumulator. 3. merge_to_column_shfl, called per row: if (dist[i] > distc[i]) distc[i] = dist[i]; -- so distc[i] ends the tile at max(prior_global, this_tile_max). 4. flush_all_cols_to_smem: atomicMax(smem.local_mp_col + col_idx, distc[i]) -- block-scoped atomic-max merge into the same smem cell. The atomicMax in step 4 is what actually preserves cross-tile continuity. Whether step 2 starts distc at prior_global or at init_dist, step 4's atomic computes: smem.local_mp_col[col_idx] = max(prior_global, distc_final) = max(prior_global, this_tile_max) in both cases. Same observable result. Dropping the pre-load: - Removes DPT * tiles * blocks LDS instructions per kernel launch. - Removes the COMPUTE_COLS branch + reload-or-init duplication added in the prior commit -- the null-deref the gate was protecting against simply doesn't happen if we never read from smem.local_mp_col in this loop. - Removes the col_idx bounds check (`col_idx >= 0 && col_idx < tile_width`) since the only consumer (merge_to_column_shfl) reads distc by register index, not by col_idx. Per-row merge_to_column_shfl is a branchless predicated select on the GPU regardless of whether distc starts at prior_global or init_dist, so no per-row cost difference either way. A FUTURE OPT comment is left in the kernel for max-style profiles (1NN / 1NN_INDEX / APPROX_ALL_NEIGHBORS / MATRIX_SUMMARY): pre-loading plus the _check atomic variants (fAtomicMax_check / MPatomicMax_check) could skip the flush-time CAS when this tile's distc didn't beat the prior global. The bookkeeping has to refresh the cached check inside update_info_shfl on rotation, and adds DPT registers per thread (potentially regressing occupancy on v4 DPT=8 SP). Not worth pursuing without a micro-benchmark on smaller GPUs or converged workloads where the atomic-skip rate would be high. Verified on RTX 3080 (sm_86): - 22/22 unit tests pass. - 685/685 GPU run_tests.py subset pass. - 1NN_INDEX 1M perf check (3 runs median): DP 12.16 -> 12.22 s (+0.5%), SP 2.10 -> 2.15 s (+2%). Both within run-to-run noise; net perf-neutral. The Pascal repro is fixed for the same reason as the prior commit's gate (no smem.local_mp_col read on the COMPUTE_COLS=false path), but expressed as a smaller kernel body instead of an if-constexpr. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The DPT-wide per-row hot loop in do_row_shfl was hand-unrolled via #pragma unroll. The body is pure element-wise array ops on the SCAMPShflState::cov / inormc / dfc / dgc / local_col / global_col fields, all of which are already Eigen::Array<T, DPT, 1>. Express the loop as Eigen array expressions and let nvcc unroll the expression-template-generated IR instead. The four sub-operations become: 1. d_raw = (cov * inormc * inormr).cast<DISTANCE_TYPE>() 2. cov += dfc * dgr + dgc * dfr (uses old cov on RHS only) 3. diag = local_col.cast<int>() - row_in_tile (signed; preserves the negative-diag case the original used) 4. slot_valid (DPT-wide bool array) + slot_valid.select(d_raw, init_dist) for the final dist The SLOW-PATH branch keeps the global_row-out-of-range check as a scalar early-return (setZero on the bool array) because Eigen's && / & between an Array and a scalar bool doesn't broadcast; per-element predicates are then chained with Array && Array. Verified on RTX 3080 (sm_86), 1NN_INDEX, 1M elements, window=100, median of 8 runs: variant manual loop Eigen delta --------------------------------------------- v3 DP 12.16 s 12.18 s noise (~+0.2%) v4 SP 2.10 s 2.00 s **-5%** (every Eigen run faster than every manual run) Hypothesis for the SP-only win: v4 has DPT=8, so the manual #pragma-unrolled loop produces a wider IR that nvcc's scheduler seems to lay out worse than the equivalent Eigen-expression IR. v3 at DPT=4 is small enough that both forms produce essentially the same code. The Eigen IR also keeps cov-update and d_raw computation explicitly independent (no false ordering dependency between scalar statements that nvcc would have to prove disjoint). Correctness: - 22/22 unit tests pass. - 685/685 GPU run_tests.py subset (windows {13,33,100}, inputs {100,250,1500}, tiles {1024,2048}) pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Replace the per-element memmove-style cov shift with Eigen segment assignment: // Before: for (int i = DPT - 1; i > 0; --i) cov[i] = cov[i - 1]; cov[0] = wrap_in; // After: if constexpr (DPT > 1) { cov.template tail<DPT - 1>() = cov.template head<DPT - 1>().eval(); } cov[0] = wrap_in; The .eval() forces a temporary so the tail-and-head overlap doesn't alias (the manual loop handled aliasing implicitly by walking from high-to-low). For DPT in {4, 8} the temporary is a small register array; nvcc folds it into the same register allocation as the destination, so the .eval() is essentially free. Verified on RTX 3080 (sm_86), 1NN_INDEX, 1M, window=100, median of 13 SP runs / 5 DP runs: variant prior (manual cov-shift) cov-shift Eigen delta ----------------------------------------------------------- v3 DP 12.18 s 12.16 s noise v4 SP 2.00 s 1.99 s ~-1% The SP improvement is small (within the variance band of any individual run) but the median moves the right direction and every run except one is sub-2.00. Combined with the prior commit (Eigenify per-row dist + cov update), the v4 SP path is now ~6% faster than the all-manual-loop baseline. DP is flat overall. Correctness: - 22/22 unit tests pass. - 685/685 GPU run_tests.py subset pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Two manual #pragma unroll loops over DPT replaced with Eigen array ops on the SCAMPShflState fixed-size arrays: state.local_col = state.global_col - tile_start_col; state.distc.setConstant(init_dist<DISTANCE_TYPE, PROFILE_TYPE>()); state.idxc.setZero(); Code-shape change only. Perf-checked on RTX 3080 with the reference SP 1M / m=1024 / max_tile_size=1M / 1NN_INDEX workload: baseline: ~1.985s (min of 5) Eigen-simplified: ~1.867s (min of 5) Within run-to-run noise -- treat as perf-neutral, not a win. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Three additions to the SCAMP --autotune workflow that make iterative tuning much faster: - SCAMP_AUTOTUNE_PRECISION_FILTER={SINGLE,DOUBLE}: restricts the sweep to one precision. Filtered-out (profile, precision) targets are reported as SKIPPED in verbose mode and do NOT overwrite their cache entries on Save(), so the existing cache for the other precision is preserved untouched. - SCAMP_AUTOTUNE_VARIANT_FILTER={shfl,sliding-window,all}: restricts the variant loop to one family. shfl variants are identified by unrolled_rows == 0 (the sentinel already used in SCAMP_VARIANT_TUPLES to route to kernel_variant_shfl.cu.in). - SCAMP_AUTOTUNE_WARMUP_RUNS=<n> (default 0): per-trial warmup count. Was hardcoded at 1 warmup + 1 timed; in practice the first launch of a given (variant, blocksz) instantiation on Ampere is only a few % off steady-state because module-load / JIT cost is amortized at the process level, not per-trial. Skipping warmups by default ~halves sweep time. Set to 1 or more on noisier setups. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…onsolidation Three closely-coupled changes to the GPU kernel infrastructure that together cut the consolidated shfl + sliding-window variant count in half while improving SP 1NN_INDEX on RTX 3080 by 14% (1.94s -> 1.71s at 1M / m=1024). 1. ADAPTIVE __launch_bounds__ VIA safe_bps() Replaces __launch_bounds__(BLOCKSZ, variant_bps) with __launch_bounds__(BLOCKSZ, safe_bps(target_threads, BLOCKSZ, kHwThreadsPerSm)) in both do_tile and do_tile_shfl. The target threads/SM stays constant across the autotuner's blocksz sweep, so the per-thread register budget no longer collapses when blocksz grows past the variant's default. This removed a sharp register- spilling cliff observed at bsz=256 (DP 1NN_INDEX went 52s -> 12s, SP 14.9s -> 1.88s) where the old fixed-bps launch_bounds was telling ptxas to fit 4 * 256 = 1024 threads/SM at ~64 regs/thread budget. safe_bps + kHwThreadsPerSm live in kernel_gpu_utils.h. The hw cap resolves per __CUDA_ARCH__ (1536 for sm_86/sm_89, 2048 elsewhere) so launch_bounds never asks ptxas to overcommit the per-SM thread ceiling, which used to be silently clamped. 2. VARIANT TUPLE: DROP default_blocksz_dp (DERIVE AS SP / 2) Variant tuples in SCAMP_VARIANT_TUPLES now declare only default_blocksz_sp. The DP value is implicit: default_blocksz_sp / 2. DP uses 2x the registers per thread on Ampere/Ada (FP64 takes two FP32-sized register slots), so halving the target threads/SM keeps the per-thread register budget stable across precisions. The /2 is hardcoded in the launcher templates (LaunchDoTileShflWithGeometry / LaunchDoTileWithGeometry) and in DefaultBlockszForPrecision so the ratio is enforced by construction instead of being a per-variant convention that drifts. 3. cuda::barrier ARRIVE/WAIT SPLIT IN do_row_shfl Replaces the per-row __syncthreads() in the shfl inner loop with a cuda::barrier<thread_scope_block> arrive/wait pair (sm_80+ mbarrier PTX). The lane-31 cov publish + bar.arrive() are hoisted to right after the cov update, so the arrive-to-wait window covers the dist masking, merge_to_col, merge_to_row, intra-warp shfl cov rotation, register shift, and (every DPT rows) update_info_shfl. Warps that finish that second-half work faster don't block warps still draining it -- the wait blocks only on the slowest arrival. ncu confirms barrier stalls drop from 1.21 cyc/issued instr to 0.03; eligible warps/sched climb from 0.90 to 1.43; issue rate +24%. Output is bit-identical to the __syncthreads version. 4. VARIANT TABLE CONSOLIDATION The full SP+DP autotune across the previous 9-variant table (3 SW + 6 shfl) showed only 4 variants ever win a target on sm_86, so the table is trimmed to those 4: SW dpt=4 bps=1 big-tile (target=512 thr/SM): wins DP 1NN + DP AAN SW dpt=4 bps=2 big-tile (target=1024 thr/SM): wins SP 1NN + SP/DP MATRIX_SUMMARY + SP AAN shfl dpt=8 bps=5: wins SP 1NN_INDEX + DP/SP SUM_THRESH; best overall geomean (1.016x of per-target best, worst-case 1.03x). shfl dpt=4 bps=8 tile=128: wins DP 1NN_INDEX. Smaller per-thread state lets DP fit higher occupancy where dpt=8 spills. Dropped variants (each never won a target across all 10 (profile, precision) targets in the full autotune): - SW dpt=2 baseline (geomean rank 22+) - SW dpt=4 + 4x4 hand-unroll tile=200 (geomean rank 4-13 but always lost to one of the two big-tile variants) - SW big-tile bps=4 (forced spill cliff; geomean rank 17+) - shfl dpt=4 bps=8 tile=64 (the original short-tile shfl; never wins anywhere) - shfl dpt=8 bps=4 (loses by ~1% to bps=5) - shfl dpt=8 bps=6 (just past spill cliff on sm_86) - shfl dpt=4 bps=10 tile=128 (only wins broken-on-GPU MATRIX_SUMMARY DP) Compile time drops accordingly: 9 -> 4 variants means 5 fewer per-profile .cu TUs to nvcc-instantiate, with the sliding-window TUs each compiling much faster than the shfl ones (the shfl variant has more template instantiations per TU). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

GetDefaultKernelConfig previously took only precision and returned the first shfl variant for every (profile, precision) combo. Now takes profile_type too, and picks the right family for that profile based on which family wins in the cross-(profile, precision) autotune: - 1NN_INDEX, SUM_THRESH -> shfl family. The shfl variant warp-reduces atomics before the final atomicMax / atomicAdd, which dominates these profile types. - 1NN, MATRIX_SUMMARY, APPROX_ALL_NEIGHBORS -> sliding-window family. These profiles have light per-row atomics; the SW variant's smem column buffer + heavy inner-loop unroll dominates. The implementation picks the FIRST variant of the preferred family (matching by unrolled_rows == 0 / != 0). By convention the first entry of each family in SCAMP_VARIANT_TUPLES is that family's intended default -- variant authors should keep the best generalist at the front of the list. Robust to variant reordering and to the table being re-tuned for a new device. DefaultBlockszForPrecision still picks the precision-specific blocksz within the chosen variant (SP gets default_blocksz_sp, DP gets default_blocksz_sp / 2). This only matters when the user-cache lookup misses the device's CacheKey + (profile, precision) -- once the autotune cache has an entry, the cache wins. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Adds a Per-variant CUDA Tests step to build-and-test-cuda that runs the existing run_tests.py harness once per compiled kernel variant with SCAMP_FORCE_VARIANT pinning the variant for every (profile, precision) launch. Coverage: each variant gets exercised against all profile types (1NN_INDEX, 1NN, SUM_THRESH, MATRIX_SUMMARY, APPROX_ALL_NEIGHBORS) and both precisions, just at a smaller sweep than the extended-tests step (1 window, 2 tile sizes, 2 input sizes), so the extra CI time is ~4 min total for the current 4-variant table. Two new mechanisms support the new step: - SCAMP_FORCE_VARIANT=<index> env var: read by GetKernelConfigForDevice in autotune.cpp. When set to a valid variant index, returns the variant's cold-start config (via DefaultBlockszForPrecision) and bypasses the autotune cache + per-(profile, precision) default. The precision still picks the cold-start blocksz so the kernel runs at its variant-author-chosen blocksz, not the autotuned one -- the full {64,128,256,512} blocksz axis sweep belongs in `SCAMP --autotune`, not the correctness CI. - --list_variants CLI flag: prints one line per variant in a stable machine-parseable format (v<i> bps=N dpt=N ur=N our=N kti=N family={shfl,sliding-window}). CI shell scripts iterate via $(SCAMP --list_variants | wc -l) so adding or dropping variants doesn't require touching the workflow YAML. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

cuda/barrier requires sm_70+; libcudacxx hard-errors on older arches with #error "CUDA synchronization primitives are only supported for sm_70 and up." The shfl kernel's arrive/wait split is gated behind SCAMP_SHFL_HAS_CUDA_BARRIER (defined only when the host pass or a sm_70+ device pass is active). ShflRowBarrier is a typedef wrapper used in do_row_shfl's signature and the per-block __shared__ declaration: - sm_70+: cuda::barrier<thread_scope_block> - sm_60: empty stub struct On sm_60 the lane-31 publish still happens at the same early spot (right after the cov update); only the per-row sync primitive differs (cuda::barrier wait vs __syncthreads). The publish is just an smem store and is ordered with the next row's read by either sync primitive, so semantics are identical -- sm_60 just loses the arrive/wait latency-hide. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The cache lookup tests hardcoded the old v6 geometry (bps=8, dpt=4, ur=0, our=8, kti=8 — the dpt=4 tile=64 shfl variant) as the SupportedConfig() "hit" example, and the Test_AutotuneFullPipeline_FakeBench rig referenced variant indices 4+ (out-of-bounds in the 4-variant consolidated table). Both were silently failing after the variant table dropped from 9 variants to 4 -- SupportedConfig stopped matching anything in kVariants so IsSupportedKernelConfig rejected cache hits, and the FakeBench rig read past kVariants.size() (returning garbage geometry values like -51257 from uninitialized stack). Updates: - SupportedConfig() now uses kti=16 (the dpt=4 tile=128 shfl variant that survived consolidation). - All cache-fixture strings bumped from |8|4|0|8|8 to |8|4|0|8|16 via sed. - RiggedWinner table reworked to cover variant indices 0..3 (the current 4-variant range), with at least one rigging per index. 22/22 tests now pass on sm_86. Same fix applies to the CI failures on the build-cuda-versions runners. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Fixes from a walk-through of the PR diff vs master: - kHwThreadsPerSm: added missing arches. Turing (sm_75) was using the 2048 fallback but Turing's hw cap is actually 1024 thr/SM, so safe_bps was overcommitting by 2x on Turing -- ptxas would silently clamp, defeating the safe_bps cap entirely. sm_87 (Orin) and sm_120 (Blackwell GeForce) added at 1536. Also refactored from a constexpr #if-ladder to a constexpr function (hw_threads_per_sm()) used as the default arg for safe_bps -- simpler to read and extend; same per-arch __CUDA_ARCH__-driven values. - LAUNCH_PRECISION_SHFL: factored the 4-way blocksz dispatch into a LAUNCH_PRECISION_SHFL_AT_BLOCKSZ helper macro, mirroring the sliding-window LAUNCH_PRECISION_AT_BLOCKSZ pattern in kernels_impl.h. ~50 lines of macro removed; identical generated code. - Stale shipped autotune cache entries dropped from data/autotune_cache.txt. Every RTX 3080 and P100 entry referenced variant geometries dropped in the consolidation (old v6 dpt=4 tile=64, old v8 dpt=8 bps=4, old v0/v1 SW). They were silently rejected by IsSupportedKernelConfig -> users got cold-start defaults anyway, but the file lied about what was tuned. Header notes kept; entries cleared until we re-autotune on hw. - Comment cleanup: * kernels_compute_shfl.h: "one __syncthreads per row" -> describes both the sm_70+ cuda::barrier path and the sm_60 __syncthreads fallback. Stale "DP variant 6 BLOCKSZ=256 DPT=2" smem-footprint example replaced with a real current-variant example. * kernels_impl_shfl.h: docstring template-param list said blocks_per_sm but the param was renamed to target_threads_per_sm when safe_bps landed. FUTURE OPT comment dropped its "v4 (DPT=8 SP is already register-pressured)" reference -- v4 no longer exists; rewritten to "higher-DPT shfl variants" generically. * kernels_impl.h: dropped "variant 5 at OUR=16" and "Same as v6" references to dead variants. Added a load-bearing note about sm_60's 48KB dynamic-smem cap and the launcher's cudaFuncSetAttribute opt-in (which sm_60 ignores). * autotune_bench.cpp: trimmed "real bug: v5 (OUR=16)" past-incident reference; deduped two stale warmup-related comment blocks. * autotune.cpp: "Per-thread override" comment was misleading -- the override IS process-wide (the function-level comment above the setter spells out why), so the lookup-side comment now matches. * main.cpp: removed redundant nested #ifdef _HAS_CUDA_. * ci yml per-variant step: "one tile size" -> "two tile sizes" (matched actual command which sweeps {1024, 4096}). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- SCAMP_FORCE_VARIANT lookup now caches the parsed value at first call via a static-local initializer. The lookup sits on the per-tile-launch hot path; the prior every-call getenv() added a syscall per tile. Out-of-range and malformed values cache as -1 ("no override") so the fast path stays a single int compare. - pyscamp module init no longer force-sets SCAMP_AUTOTUNE_QUIET=1. The autotune cache-miss warning now behaves the same in pyscamp as in the CLI: prints once per process if no cache entry is found, can be silenced with SCAMP_AUTOTUNE_QUIET=1. The prior asymmetric default hid the "run --autotune for max perf" hint from notebook users for whom it was most relevant. - safe_bps() comment now spells out the default-arg evaluation semantics (hw_threads_per_sm() resolves under the current __CUDA_ARCH__ guard at the call site, so each per-arch device compilation pass bakes in the right per-SM cap). - docs/source/autotune.rst: dropped the now-stale "pyscamp silences by default" claim; clarified the cache-miss warning is once-per-process. Added an "Other autotune environment variables" section covering SCAMP_AUTOTUNE_{PRECISION,VARIANT}_FILTER, SCAMP_AUTOTUNE_WARMUP_RUNS, and SCAMP_FORCE_VARIANT, including their first-call-cached semantics for the ones that latch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Removes the build-time embedded autotune cache and its lookup chain branch. The maintenance burden of keeping data/autotune_cache.txt fresh across releases (every variant-table edit invalidates entries; every new device needs a hand-merged PR; users hit silent stale-entry rejection) outweighed the small out-of-box perf uplift it bought. Removed: - data/autotune_cache.txt (the source file, already empty post-review). - src/core/gpu_kernel/builtin_autotune_cache.{h,cpp.in} (the embedded string + its declaration). - The file(READ) + configure_file + extra source file in src/core/gpu_kernel/CMakeLists.txt that wired the embed into the binary. - GetOrLoadBuiltinCache() + CacheState::builtin_cache from autotune.cpp. - The `builtin_cache` parameter from LookupKernelConfigForDeviceKey (both signature + the only non-test caller, GetKernelConfigForDevice). - 3 builtin-cache-specific unit tests (Test_BuiltinCacheHit, Test_UserCacheBeatsBuiltin, Test_StaleUserCacheFallsThroughToBuiltin). The last was renamed and repurposed to verify a stale user cache falls through to the FALLBACK config + emits the cache-miss warning (the upgrade-compat behavior we actually care about). - Built-in / "ships with the binary" mentions in docs/source/autotune.rst, autotune.h, autotune_cache.h, autotune.cpp banner output. New lookup chain (in priority order): 1. Process-wide KernelConfig override (autotune bench + SCAMP_FORCE_VARIANT). 2. User cache at $SCAMP_AUTOTUNE_CACHE / $XDG_CACHE_HOME / $HOME. 3. GetDefaultKernelConfig(profile, precision) -- the per-profile-type compile-time default, plus a one-shot per-process miss warning. Users who want optimal perf still run `SCAMP --autotune` once per device (or `pyscamp.autotune()`); the result lands in the user cache and persists across runs. Cold-start defaults are reasonable per-profile (shfl for atomic-heavy 1NN_INDEX/SUM_THRESH, sliding- window for atomic-light 1NN/MS/AAN), so no-tune behavior is correct and within a few percent of optimal on the variants and arches we tested. 22/22 -> 20/20 unit tests (-2 from removed builtin-only tests, the renamed one still runs). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- docs/source/environment.rst: SCAMP_AUTOTUNE_INPUT_LENGTH default documented as 131072 but the constant in autotune_bench.cpp is 262144 (256K). Fixed. Also dropped the "Set by pyscamp at module init for notebook users" line (we removed that auto-set in an earlier review pass). Added entries for the four env vars that weren't documented at all: SCAMP_AUTOTUNE_PRECISION_FILTER, SCAMP_AUTOTUNE_VARIANT_FILTER, SCAMP_AUTOTUNE_WARMUP_RUNS, SCAMP_FORCE_VARIANT. - docs/source/autotune.rst: the cache-miss warning example showed the pre-consolidation default (bps=8 dpt=4 ... kti=8); the current default for 1NN_INDEX SP is the first shfl variant, which is bps=5 dpt=8 ... kti=32. Updated the example to match. - src/core/gpu_kernel/kernel_gpu_utils.h: hw_threads_per_sm() added sm_110 (Jetson Thor, Blackwell mobile) to the 1536-threads/SM bucket. Previously fell through to the 2048 default which would have over-committed safe_bps's per-arch hw cap by 33% on that device. SCAMPMacros.cmake already compiles for sm_110 on CUDA 13.0+, so the omission was a real gap. - src/core/gpu_kernel/kernels_impl_shfl.h: the "tile_height <= 32 * DPT for the first draft" static_assert message was a state-of-development phrasing. Replaced with a fact-only explanation of WHY the constraint must hold (update_info_shfl fires at most once per warp per tile under that bound). - src/python/SCAMP_python.cpp: a comment in run_autotune claimed "main.cpp dispatches here too" -- backwards. The CLI's --autotune path in main.cpp calls RunAutotuneWithBenchmark directly; this Python entry point is a separate dispatch that shares the same bench impl. Fixed the comment to match. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

zpzim and others added 30 commits May 16, 2026 11:18

Committing clang-format changes

4815002

Fix GPU shfl kernel bugs: resolve covariance propagation, correct out…

de8dae0

…-of-bounds local_mp_col writes, eliminate KNN duplicate matches by re-initializing column registers, and implement CMAKE_CUDA_ARCHITECTURES build escape hatch.

Optimize GPU shuffle kernel via fast-path dispatch, occupancy constra…

6eea0ed

…int tuning, and dynamic tile size defaults

Merge remote branch 'origin/feat/gpu-kernel-shfl-variant' into feat/g…

e3126c2

…pu-kernel-shfl-variant

Committing clang-format changes

9bbe26a

Committing clang-format changes

88662a9

Add P100 autotune cache

2c6f32e

Committing clang-format changes

1ba5d98

Committing clang-format changes

5f90640

zpzim and others added 30 commits May 22, 2026 10:19

Committing clang-format changes

ebe8cc6

Committing clang-format changes

c1a2320

Merge remote-tracking branch 'origin/master' into feat/gpu-kernel-shf…

8e3878f

…l-variant # Conflicts: # .github/workflows/build-and-test.yml # src/python/SCAMP_python.cpp

Committing clang-format changes

305db28

Committing clang-format changes

0a4efcc

Committing clang-format changes

5d9160e

Committing clang-format changes

18ba9f6

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Per-GPU kernel autotune + cov-shuffle (shfl) variant + Eigen port#143

Per-GPU kernel autotune + cov-shuffle (shfl) variant + Eigen port#143
zpzim wants to merge 63 commits into
masterfrom
feat/gpu-kernel-shfl-variant

zpzim commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

zpzim commented May 22, 2026

Background

Summary

1. Per-GPU autotune

2. Kernel-variant infrastructure

3. cov-shuffle ("shfl") variant

4. Eigen port of the sliding-window kernel

5. Platform / CUDA-arch matrix

6. Docs

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant