Per-GPU kernel autotune + cov-shuffle (shfl) variant + Eigen port#143
Open
zpzim wants to merge 63 commits into
Open
Per-GPU kernel autotune + cov-shuffle (shfl) variant + Eigen port#143zpzim wants to merge 63 commits into
zpzim wants to merge 63 commits into
Conversation
The CPU kernels already live under src/core/cpu_kernel/. Mirror that layout on the GPU side so the next round of GPU work (per-arch knobs + autotuning, issue #115) has a clear home and is not interleaved with the host-side tile/profile orchestration. Files moved (git rename, no behavior change): kernels.h, kernels.cu -- main kernel entry + dispatch kernels_compute.h, kernels_smem.h -- per-profile compute strategies + smem kernel_gpu_utils.h, kernel_gpu_utils.cu -- launch config helpers qt_kernels.h, qt_kernels.cu -- cuFFT sliding-dot-product kernels Files staying at src/core/ (host-side or shared): kernel_common.h/.cpp -- shared with CPU kernels qt_helper.h/.cpp -- host-side cuFFT context manager (has CPU fallback) tile.h/.cpp -- tile orchestration src/core/gpu_kernel/CMakeLists.txt now owns the qt_kernels, gpu_utils, and gpu_kernels library targets and is only added when CMAKE_CUDA_COMPILER is present, mirroring how cpu_kernel/ is structured. Verified that build_cuda/ builds cleanly with -DFORCE_CUDA=1 and that the 1NN_INDEX self-join on a 2000-element series exactly matches distance_matrix_np reference output (same vectors and indexes). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The do_tile<...> template was instantiated 45 times in a single .cu file
(5 profile types x 3 precisions x 3 row/col modes), serialising nvcc into
a ~100-second front-end for that one translation unit. Split it so each
profile type gets its own .cu file and the per-TU work shrinks to 9
instantiations.
kernels_impl.h -- shared template body for do_tile + LaunchDoTile
kernels_dispatch.h -- LaunchKernel_<PROFILE> entry-point declarations
kernels.cu -- slimmed: just the runtime dispatch +
gpu_kernel_self/ab_join_* + match_gpu_sort
kernel_1nn.cu
kernel_1nn_index.cu -- one per profile type, each instantiates
kernel_sum_thresh.cu LaunchDoTile<...> bound to its PROFILE_TYPE
kernel_matrix_summary.cu
kernel_approx_all_neighbors.cu
Also extracted the plain launch constants (KERNEL_TILE_HEIGHT, BLOCKSZ_*,
BLOCKSPERSM, DIAGS_PER_THREAD, TILE_HEIGHT_*) into a new CUDA-free
kernel_constants.h. kernel_gpu_utils.h still re-exports them; host TUs
can now include the constants without dragging in nvcc-only device
intrinsics.
Measured on this 8-core box (RTX 3080, nvcc 12.2):
baseline (single kernels.cu): 101s wall, 740MB peak rss
split (6 .cu, --parallel 8): 35s wall, 418MB peak rss
A 2.9x wall-clock win and 43% lower peak memory for the gpu_kernels target.
GPU output is byte-identical: verified that 1NN_INDEX self-join on a 2000-
element series produces vectors+indexes matching distance_matrix_np
exactly. PTX is unchanged -- this is a pure refactor.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
) The kernel launch path now consults a small per-(device, profile_type, precision) configuration table to pick the right (blocksz, tile_height, blocks_per_sm) tuple. The table is wired with three layers, checked in priority order at GetKernelConfigForDevice(): 1. User override file ($SCAMP_AUTOTUNE_CACHE, $XDG_CACHE_HOME, or ~/.cache/scamp/autotune.txt). RunAutotune() writes here. Useful for developers iterating on a single machine. 2. Built-in cache embedded into the binary at build time from data/autotune_cache.txt. This is what conda-forge / pip-wheel users benefit from since they cannot recompile to add kernel variants. 3. GetDefaultKernelConfig() -- the compile-time default that matches the constants SCAMP has historically shipped with. Workflow for shipping a new device's tuned config to end users: 1. Build SCAMP from source on a machine with the target GPU. 2. Run `SCAMP --autotune` (or `pyscamp.autotune()`) -- this writes the user override file. 3. Merge the new device's lines into data/autotune_cache.txt and open a PR. The next release picks them up via configure_file embedding. Files added: data/autotune_cache.txt -- checked-in cache with header src/core/gpu_kernel/ kernel_config.{h,cpp} -- KernelConfig struct + default kernel_constants.h -- moved compile-time constants here (CUDA-free) so host TUs can include them device_props.{h,cpp} -- cudaGetDeviceProperties wrapper producing a stable cache key autotune_cache.{h,cpp} -- text-format cache I/O, with LoadFromString hook for the built-in builtin_autotune_cache.h -- declares the embedded constant builtin_autotune_cache.cpp.in -- configure_file template autotune.{h,cpp} -- public API: RunAutotune, GetKernelConfigForDevice (override + builtin lookup) Files modified: src/core/gpu_kernel/CMakeLists.txt -- builds the new sources; configure_file embeds the cache into the binary src/core/gpu_kernel/kernels.cu -- compute_gpu_resources_and_launch calls GetKernelConfigForDevice for every launch src/main.cpp -- --autotune CLI flag src/python/CMakeLists.txt -- pyscamp links gpu_utils + CUDA::cudart when CUDA is on src/python/SCAMP_python.cpp -- pyscamp.autotune() binding Today only the compile-time-default KernelConfig is template-instantiated in the kernel, so IsSupportedKernelConfig() rejects any other tuple and the runtime silently falls through to the default. The infrastructure is in place for follow-up PRs to add alternative (blocksz, tile_height) kernel variants, at which point the cache becomes load-bearing for end users on tuned devices. Verified: - CLI --autotune on RTX 3080 writes a 15-entry override file and prints the developer workflow instructions. - pyscamp.autotune() does the same from Python. - SCAMP_AUTOTUNE_CACHE=/tmp/empty_cache + GPU 1NN_INDEX self-join still matches distance_matrix_np exactly (falls through to default). - Full test/test_pyscamp.py suite passes on GPU. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Eigen 5.0.1 (released Nov 2025) is the first stable release with ~1474 commits of fixes since 3.4.0, including many that matter for device-side use: added EIGEN_DEVICE_FUNC annotations, NVCC build fixes for CUDA 10+, MSVC+NVCC pragma fixes, and bug fixes for inverse evaluators / tridiagonalization / arg() under CUDA. This unblocks using Eigen expressions in .cu code without a long tail of workarounds. The CMake target name (Eigen3::Eigen) is unchanged; downstream link lines are unaffected. Existing CPU-kernel Eigen usage (Eigen::Array + Eigen::Map of fixed/dynamic 1-D arrays, plus EIGEN_MAX_ALIGN_BYTES) is API-compatible. Verified by building the full tree + running a self-join smoke test on both CPU and GPU paths with bit-identical output between them. Note: in Eigen 5 the version macros are EIGEN_WORLD_VERSION=3, EIGEN_MAJOR_VERSION=5, EIGEN_MINOR_VERSION=0 (the WORLD field is frozen at 3 for source-compat with downstream version checks). CMake package version is reported as 5.0.1, so find_package(Eigen3 5.0.0 ...) is the right gate. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the build-system plumbing needed for the follow-up GPU kernel
rewrite to start using Eigen expressions in .cu code:
- --expt-relaxed-constexpr added to CMAKE_CUDA_FLAGS so nvcc can
call Eigen's constexpr host functions from __device__ contexts
without warnings.
- gpu_kernels now links Eigen3::Eigen. The target is defined by
src/core/cpu_kernel/CMakeLists.txt, which the parent
src/core/CMakeLists.txt processes before this subdirectory.
- A throwaway __global__ smoke test in kernels.cu instantiates
Eigen::Array<float, 4, 1> with arithmetic + reduction inside
device code. It is never launched at runtime and will be removed
when the real Eigen rewrite replaces it. Verified to compile for
sm_86 on CUDA 12.2; full SCAMP build + CPU/GPU self-join produce
bit-identical output.
No behavior change: the dispatcher, kernel templates, and runtime
paths are untouched; only the toolchain wiring is in place so the
next commit can begin replacing hand-unrolled accumulator arrays
with Eigen::Array.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… sliding-window inner loop
Heavy rewrite of the per-thread GPU compute path, drawing structure
from the earlier use-eigen-in-gpu-kernels experiment branch. The
changes are architectural: the kernel still computes the same output
(bit-identical to master on 1NN / 1NN_INDEX / SUM_THRESH self-joins
on RTX 3080 over randomlist128K, full pyscamp test suite passes
including small-m m=3/m=4 and MATRIX_SUMMARY / KNN / AB-join paths).
Highlights:
- SCAMPSmem becomes templated on (tile_width, tile_height) and
exposes each smem region as Eigen::Map<Eigen::Array<T, N, 1>>,
so callees can write smem.df_col.segment<unrolled_diags>(offset)
instead of hand-unrolled raw-pointer loads. The ctor uses
placement new on nullptr-initialized Maps to re-seat them at
the per-region smem offsets (Eigen::Map's only public way to
rebind a pointer at construction time).
- SCAMPSmem gains a `using DataType = DATA_TYPE` typedef so
do_iteration_fast can declare its register-window arrays with
the smem's scalar type, not the cov accumulator's. This matters
for PRECISION_MIXED (cov is double, columns are float).
- SCAMPThreadInfo.cov1..cov4 collapse to a single
Eigen::Array<ACCUM_TYPE, DIAGS_PER_THREAD, 1> field. cov
indexing in do_row / do_row_edge is now cov[i] instead of
covN, paving the way for changing DIAGS_PER_THREAD without
a kernel rewrite.
- kernels_compute.h drops ~500 lines: the five per-profile-type
overloads of update_row / merge_to_column / update_cols /
reduce_edge / reduce_row collapse into single if-constexpr
templates with Eigen::ArrayBase<Derived> parameters. Same for
kernels_smem.h (init_smem, write_back).
- New for_<N> constexpr-loop helper (std::index_sequence + lambda
taking auto i) lets compile-time loop indices flow into
template arguments cleanly, replacing #pragma unroll blocks for
cases where the body uses the index as a template parameter.
- do_iteration_fast now processes outer_unrolled_rows rows per
call (vs DIAGS_PER_THREAD=4 in master) using a sliding column
register window of width inner_unrolled_cols, refilling from
smem after each inner row-batch. New compile-time knobs in
kernel_constants.h: DIAGS_PER_THREAD=2, unrolled_rows=2,
outer_unrolled_rows=16, inner_unrolled_cols=3, unrolled_cols=17,
KERNEL_TILE_HEIGHT=256. These will be the variants the autotuner
benchmarks once dispatch lands.
- do_row_edge ports to Eigen-array form but keeps element-wise
scalar math for the cov*inormc*inormr compute, since Eigen 5
doesn't auto-promote across Arrays of different scalar types
in cwise ops and PRECISION_MIXED needs the cross-type math.
- PRECISION_MIXED preserved (the eigen branch dropped it).
- Small-m fix (num_diags = args.n_x - args.exclusion_upper + 1)
preserved (the eigen branch predates it).
- kernels.cu smoke-test kernel removed; production paths now
exercise Eigen device-side.
- --expt-relaxed-constexpr already added to nvcc flags by the
earlier scaffold commit. gpu_utils target now also links
Eigen3::Eigen (kernel_gpu_utils.h declares Eigen::Map fields).
Perf on RTX 3080, randomlist1M window=200, 3 runs each:
precision master eigen-port delta
double 25.7s 31.7s +23%
mixed 50.7s 56.4s +11%
single 5.3s 6.2s +17%
This regression is expected and acknowledged. The eigen-branch
parameter values (DIAGS_PER_THREAD=2, outer_unrolled_rows=16, etc.)
were never benchmarked against master; they're a WIP starting
point. The follow-up "make the autotune cache load-bearing" PR
will pre-instantiate multiple (DIAGS_PER_THREAD, tile_height,
blocks_per_sm) variants and let RunAutotune benchmark + select the
fastest per device. End-user perf will recover (and likely exceed
master) once the autotuner can pick a config tuned for the device.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…IGHBORS
The SCAMPProfileType enum is not contiguous: PROFILE_TYPE_1NN=6,
PROFILE_TYPE_APPROX_ALL_NEIGHBORS=7, PROFILE_TYPE_MATRIX_SUMMARY=8 all
sit above the prior PROFILE_TYPE_1NN_MULTIDIM=5 that was the loop's
upper bound. Any cache entry naming one of those three profile types
failed to parse, throwing SCAMPException("Unknown profile type ...")
from AutotuneCache::Load(); GetKernelConfigForDevice silently caught
it and returned the compile-time default, defeating the cache lookup.
Existed since 4daa21a; only showed up now because the autotuner only
writes a record for profile types that ParseProfileTypeName can also
read back, so round-tripping its own output worked, but any
hand-edited cache entry for the three affected types was silently
ignored. Bumping the loop to PROFILE_TYPE_MATRIX_SUMMARY covers
every defined enum value; out-of-range integers fall through to
ProfileTypeName's default "INVALID" branch and harmlessly never
match user input.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…itch
Makes the autotune cache load-bearing at launch time. Previously
GetKernelConfigForDevice's return value was discarded (kernels.cu
had a literal "(void)cfg" comment) because LaunchDoTile only ever
called one pre-instantiated kernel variant. Now the cache result is
threaded through and the kernel actually dispatches to one of the
enumerated launch-geometry variants.
Mechanism:
- kKernelVariants[] in kernel_config.cpp enumerates the supported
(tile_height, blocks_per_sm) tuples. Index 0 is the canonical
default; index 1 is the first alt (128, 4).
- LaunchDoTile becomes a cfg-switching shim that calls the right
pre-instantiated LaunchDoTileWithGeometry<tile_height, bps>.
Each kernel_<X>.cu now instantiates both variants per
(precision x row/col mode), doubling the do_tile template
instantiation count from 45 to 90 (still parallelizable
across the 5 per-profile .cu files).
- IsSupportedKernelConfig walks the variant table instead of
matching only the default, so cache entries naming any
enumerated variant are now honored.
- kernels.cu's compute_gpu_resources_and_launch threads cfg
through to each LaunchKernel_<X>; blocksz now comes from
cfg.blocksz, and get_smem takes cfg.tile_height so smem
sizing matches the variant.
- The launch-info print (gated by silent_mode) now includes
tile_height + blocks_per_sm so users can see which variant
fired without rebuilding.
The default config is still variant 0, so behavior is unchanged
for any cache that names the default or has no entry. Verified
with the existing pyscamp test suite (all pass) and a manual
test cache injecting variant 1 (128, 4) entries for every
(profile_type, precision) tuple, where output remains
bit-identical to the default-variant output on the 3080.
What's still missing (follow-up commits): RunAutotune today still
writes the default config for every (profile, precision) instead of
benchmarking each variant and picking the winner. And the variant
axes are limited to (tile_height, blocks_per_sm); the Eigen port
made diags_per_thread / unrolled_rows / outer_unrolled_rows /
kernel_tile_iters parameterizable too, but exposing them adds a
template-arg axis to do_tile/SCAMPThreadInfo that ripples further.
Both extensions land on top of this scaffold without redesign.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…olled_rows, kernel_tile_iters
The 2-axis (tile_height, blocks_per_sm) variant table was a placeholder
-- post-Eigen-port the inner-loop shape knobs are all
template-friendly, and DPT is the most consequential one for register
pressure / occupancy. Adds them as variant axes:
KernelConfig fields (replacing the redundant tile_height with the four
underlying knobs):
- blocksz, blocks_per_sm (as before; blocksz precision-tied)
- diags_per_thread (cov array width; smem column stride)
- unrolled_rows (inner row-batch in do_iteration_fast)
- outer_unrolled_rows (rows per do_iteration_fast call)
- kernel_tile_iters (do_iteration_fast calls per tile)
- tile_height() (derived: kti * our)
Where the changes ripple:
- SCAMPThreadInfo<ACCUM_TYPE, DiagsPerThread>: the cov accumulator's
Eigen::Array compile-time size now comes from the template arg.
- Every compute template (do_iteration_fast, do_row, do_row_edge,
merge_to_row/column, update_rows/cols, reduce_row/edge) gains the
inner-loop knob template params (DiagsPerThread on all; UnrolledRows
+ OuterUnrolledRows on do_iteration_fast). The derived constants
inner_unrolled_cols = DPT + UR - 1, unrolled_cols = DPT + OUR - 1
become local `constexpr int` inside do_iteration_fast.
- do_tile gains 4 new int template params; tile_height = KTI * OUR
derived inside.
- LaunchDoTileWithGeometry takes the full 5-tuple and dispatches the
9 precision x row/col instantiations via macro (replaces ~100 lines
of duplicated switch arms).
- LaunchDoTile picks among pre-instantiated geometries via a
branched VARIANT_BRANCH macro keyed on the full 5-tuple.
- get_smem takes diags_per_thread + tile_height args; GetTileHeight
deleted (replaced by KernelConfig::tile_height()).
- Host-side num_workers uses cfg.diags_per_thread.
Cache file format kept on V1 (no shipped production cache to migrate),
but the field count went 6 -> 9: blocksz | blocks_per_sm |
diags_per_thread | unrolled_rows | outer_unrolled_rows |
kernel_tile_iters. Stale 6-field caches trip SplitN, throw inside
GetKernelConfigForDevice, and are caught + fall through to the default
(same as any other parse error today). Re-running RunAutotune rewrites
in the new format.
Variant table seeded with 2 entries to validate the wider dispatch:
v0: bps=2 dpt=2 ur=2 our=16 kti=16 (tile_height=256, current default)
v1: bps=2 dpt=4 ur=2 our=4 kti=50 (tile_height=200, master-like DPT=4)
Verified on RTX 3080, randomlist1M, window=200, DOUBLE:
variant 0: ~39.7s
variant 1: ~31.9s (20% faster, bit-identical output)
That perf delta is real proof the dispatch picks the right kernel --
identical-output alone wouldn't have shown it. Master baseline before
the Eigen port was ~25.7s on the same input; the autotuner can now at
least narrow the gap by picking variant 1 on this device, and further
gain is a matter of expanding kKernelVariants (which is the next
commit's job once the benchmark loop is in place).
Full pyscamp test suite still passes (1NN_INDEX, 1NN, SUM_THRESH,
MATRIX_SUMMARY, APPROX_ALL_NEIGHBORS, KNN, small-m m=3/m=4, self/ab
joins). CPU vs GPU bit-identical on randomlist128K self-join.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…XED, RTX 3080 cache
Five related changes that together turn the autotune cache from an unused
placeholder into a real per-device perf knob. Bundled into one commit
because splitting them would break the build mid-history.
1. Benchmark loop (autotune.{h,cpp}, autotune_bench.{h,cpp})
- RunAutotuneWithBenchmark: for each (profile, precision) tuple, time
every variant in kKernelVariants via a BenchmarkFn callback, write
the fastest to the cache.
- DefaultBenchmarkVariant: 65K x window=200 synthetic self-join, runs
1 warmup + 3 timed runs, returns the min.
- SetKernelConfigOverride / ClearKernelConfigOverride: thread-local
override that GetKernelConfigForDevice consults first; the
benchmark loop sets it around each timed do_SCAMP call to force the
variant under test.
- autotune_bench is a separate library so the dep on scamp_op stays
out of gpu_utils (avoiding a scamp_op -> tile -> gpu_kernels ->
gpu_utils -> scamp_op cycle). main.cpp links it for --autotune;
legacy RunAutotune is preserved for pyscamp which hasn't been
re-wired yet.
2. Per-(profile, variant) compile split
- kernel_variant.cu.in: configure_file template. CMake foreach
generates kernel_<profile>_v<N>.cu (5 profiles x 6 variants = 30
TUs) each instantiating do_tile<...> for one variant geometry
across 2 precisions x 3 row/col modes = 6 do_tile bodies.
- kernels_variants.h: declarations for all 30 LaunchVariant_<X>_vN
entry points + SCAMP_VARIANT_DISPATCH macro (the cfg switch).
- kernel_<profile>.cu reduces to a one-liner that expands the macro.
LaunchDoTile (the cfg switch as a template) is removed from
kernels_impl.h.
- Before: each kernel_<profile>.cu compiled all 6 variants x 6
instantiations = 36 do_tile bodies serially. After: 30 small TUs
compile in parallel under -j20+.
3. 6-variant kKernelVariants table (kernel_config.cpp + kernels_variants.h
SCAMP_VARIANT_DISPATCH)
- v0: bps=2 dpt=2 ur=2 our=16 kti=16 (tile=256, eigen-port default)
- v1: bps=2 dpt=4 ur=2 our=4 kti=50 (tile=200, DPT=4 master-like)
- v2: bps=2 dpt=4 ur=4 our=4 kti=50 (tile=200, matches master's
4x4 hand-unroll exactly)
- v3: bps=4 dpt=2 ur=2 our=8 kti=16 (tile=128, higher occupancy)
- v4: bps=2 dpt=2 ur=2 our=8 kti=32 (tile=256, smaller outer-unroll)
- v5: bps=1 dpt=4 ur=4 our=16 kti=16 (tile=256, low occupancy)
4. Drop PRECISION_MIXED
- --mixed_precision CLI flag removed; pyscamp "mixed" kwarg removed;
GetPrecisionType signature shrunk to (ultra, double, single).
- kAutotuneTargets drops MIXED entries (15 -> 10 tuples).
- LaunchDoTileWithGeometry's LAUNCH_FOR_ROWCOL_MODE macro drops the
MIXED case.
- SCAMPArgs::validate rejects PRECISION_MIXED with a clear error.
- The enum value is kept in common.h for proto / distributed wire
compat; the only code path that produced it is the now-removed CLI
flag. Rationale: MIXED was uniformly slower than DOUBLE in practice
(kept DP's accumulator cost without halving smem footprint), so no
workload prefers it.
- Stale MIXED-related comments in kernels_compute.h updated.
5. RTX 3080 (sm_86) entries baked into data/autotune_cache.txt
- 10 lines, one per (profile, precision). Generated by `SCAMP
--autotune` on this branch. Embedded into the binary via
configure_file so conda-forge / pip-wheel users on an RTX 3080
get tuned configs without rebuilding.
End-to-end on RTX 3080, randomlist1M, window=200, DOUBLE self-join, GPU:
pre-Eigen-port baseline: ~25.7s
Eigen port (default cfg): ~36.7s (+43%, the regression we incurred
to land the new architecture)
autotuned (this commit): ~25.3s (parity with baseline)
So the autotune mechanism + 6-variant table closes the perf gap the
Eigen port opened. Future variant additions can extend further.
Full pyscamp test suite passes (1NN, 1NN_INDEX, SUM, MATRIX_SUMMARY
self/AB joins; small-m m=3/m=4; KNN self/AB), bit-identical CPU vs GPU
on randomlist128K.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…runtime
Before this change, do_iteration_fast used constexpr for_<N>(lambda)
loops that template-specialized do_row / merge_to_column / merge_to_row
/ update_cols / update_rows / reduce_edge for every (outer_row_iter,
row_iter, start_index, iter) value. With OuterUnrolledRows=16 (variants
v0 and v5) that meant 16 distinct do_row instantiations per
(profile, precision, row/col mode) tuple, which made those variants
~1.7x more expensive to compile than OUR=4 variants.
The template args were all only used as runtime offsets into Eigen
expressions like best_so_far.segment<unrolled_diags>(row_iter) -- the
*size* N is compile-time, but the *offset* can be runtime. So:
- merge_to_row, update_rows, merge_to_column, update_cols, do_row,
reduce_edge drop their index-only template parameters and take
them as function args. The size-typed template parameters
(num_to_update, DiagsPerThread, etc.) stay compile-time because
they drive Eigen::Array sizes + #pragma unroll bounds.
- The for_<OUR/UR>([&](auto j){...}) outer loop in do_iteration_fast,
the for_<UR>([&](auto k){...}) inner loop, and the for_<DPT> loop
in do_row_edge become plain #pragma-unrolled for loops. nvcc still
fully unrolls them (compile-time-known trip counts) and folds j/k
constants per iteration after unroll, so emitted PTX is unchanged
on the deterministic-comparison cases (1M random self-join GPU
output is bit-identical, 25.2s same as before).
Per-variant compile-time delta (mean across 5 profiles, 5-way parallel):
variant before after delta
v0 OUR=16 112.8s 66.2s -41%
v1 OUR=4 72.7s 60.4s -17%
v2 OUR=4 64.8s 58.0s -10%
v3 OUR=8 86.6s 60.4s -30%
v4 OUR=8 86.2s 60.1s -30%
v5 OUR=16 113.7s 71.0s -38%
Per-variant cost variance collapsed from 1.75x spread (65-114s) to
1.22x (58-71s). 5-way parallel wall-clock bottleneck dropped from 114s
to ~74s (-35%). Adding more variants (or variants with OUR>16) is now
much cheaper than before.
Correctness: pyscamp test suite passes 1NN_INDEX self/AB, SUM self/AB,
Matrix Summary self, m=3 and m=4 small-window, KNN self/AB on every
run. Matrix Summary AB join has a pre-existing ~20% flake rate from
np.allclose tolerance interacting with the atomic-merge ordering on
the transposed tile path -- unchanged by this refactor (the atomic
ops and FMA chains are byte-identical).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…counter The Eigen-port (687a70b) refactor of write_back passed smem.profile_{a,b}_length into write_back_value where the pre-Eigen code passed args.profile_{a,b}_length. smem.profile_{a,b}_length is the smem-cached copy of the global counter that do_tile's NeedsCheckIfDone early-exit consults; the match-output atomic-add position must come from the *global* counter, which is what Profile::CopyFromDevice reads back to size the match_value_unordered vector on the host. With the atomic firing into smem instead, the global counter stayed at zero, host saw length=0, and KNN/APPROX_ALL_NEIGHBORS returned an empty profile. This only surfaced once the autotune cache could resolve APPROX_ALL_NEIGHBORS entries (b471cdc); before that, the parser silently failed and the fallback config happened to still produce zero matches via the same bug -- masked because GetKernelConfigForDevice was returning the default cfg either way. Test coverage gap: 687a70b's commit message claimed "KNN passes" but apparently the run was against the old conda-installed pyscamp build rather than the fresh tree. Validation on RTX 3080: - test_pyscamp.py: all 11 cases pass (1NN/SUM/Matrix/KNN, self+ab, small-m) - run_tests.py --executable pyscamp: 2072/2072 tests pass Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…bench
The Eigen Map<Array<T,N,1>>::segment<>() pattern that the Eigen port introduced
compiles, under nvcc device codegen, to N scalar ld.shared.{f32,f64} per
segment -- the packet path is mostly inactive on the device side. The pre-Eigen
master used raw reinterpret_cast<float4*> to get ld.shared.v4.f32. This commit
reclaims that pattern via a vec_load<N, AlignBytes, T> helper that picks the
widest aligned PTX vector load supported by the (T, AlignBytes) tuple and
recursively peels the tail.
do_iteration_fast is wired to vec_load at three sites:
- initial column window load: aligned to DPT * sizeof(T) (info.local_col
is threadIdx.x * DPT).
- row data load: aligned to UR * sizeof(T) (info.local_row is a multiple
of OUR, OUR is a multiple of UR, j*UR is a multiple of UR).
- sliding-window refill: inner_unrolled_cols padded from DPT+UR-1 to
DPT+UR so the refill offset lands at local_col + j*UR + DPT, which is
UR-aligned for every variant (all current variants satisfy DPT % UR == 0,
enforced by static_assert). The unread trailing slot costs 3 registers
per thread (dfc/dgc/inormc).
PTX inspection on RTX 3080 confirms ld.shared.v4.f32 / v2.f32 / v2.f64
emitted at the three sites. Variant 5 (OUR=16, KTI=16, tile_height=256)
for SP self-join needs ~50KB of dynamic smem, exceeding the sm_8.x default
48KB per-block cap; LAUNCH_PRECISION now wraps the launch with
cudaFuncSetAttribute(cudaFuncAttributeMaxDynamicSharedMemorySize, smem)
when smem > 48KB. sm_86 max is ~100KB so v5 fits.
Caught a separate latent autotune bug while wiring the smem opt-in: when a
variant kernel failed to launch (e.g. v5 SP self-join exceeding smem cap
before the opt-in), do_SCAMP swallowed the SCAMP_CUDA_ERROR return value
and the autotune benchmark recorded the fast "failure" path as the winning
time, writing a broken variant to the cache. TimeOneRun now calls
cudaDeviceSynchronize + cudaGetLastError after do_SCAMP and throws if
either reports a failure, so the RunAutotuneWithBenchmark catch path
records that variant as +inf time instead.
Re-autotuned data/autotune_cache.txt for RTX 3080 with the new code path
active. DP cells are within ~3% of each other (FP64 throughput-bound on
sm_86 at 1/64 DP rate), so the choice there is in the noise floor; SP
cells show 5-15% spread between variants.
Validation:
- test_pyscamp.py: 11/11 pass
- run_tests.py --executable pyscamp: 2072/2072 pass
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Prototype of a no-smem-column-buffer kernel that uses warp shuffles to
walk cov across lanes, freeing ~30 KB DP / 15 KB SP of smem for higher
occupancy on smem-bound configurations.
Algorithm (per warp, per row):
- Lane T owns a FIXED DPT-wide column slice. Ownership rolls over every
32*DPT rows via update_info_shfl, which reads the next 32*DPT-block
directly from global memory (no smem column buffer).
- cov[i] at row r tracks cov(r, lane_T_col + i) on diagonal
lane_T_col + i - r.
- Per row: compute dist + update cov in place (cov(r, c) -> cov(r+1, c+1)).
- Per row: shift cov[i] right within the lane and shuffle in
lane (T-1)'s post-update cov[DPT-1] into cov[0].
Cross-warp cov hand-off (the bit the earlier eigen-gpu-cov-shuffle
prototype got wrong):
- Lane 31 of warp k publishes its post-update cov[DPT-1] into a tiny
smem hand-off region (2 * warps_per_block scalars, double-buffered).
- Lane 0 of warp k > 0 reads warp k-1's published value.
- One __syncthreads() per row at end-of-row, ordered so the read
happens-before the write into the same slot (no race).
- Lane 0 of warp 0 has no predecessor; its cov[0] is junk after row 0
and is masked from distc/distr updates via the slot_valid check.
Smem layout (SCAMPShflSmem): row data + profile data + cov_handoff
region. NO df_col/dg_col/inorm_col regions. ~6.7 KB total for DP var 6
vs ~33 KB for sliding-window v4.
Variant table: variant 6 entry (bps=2, dpt=2, ur=0, our=8, kti=8) with
ur=0 as sentinel for "shfl algorithm." CMakeLists routes ur=0 tuples
through kernel_variant_shfl.cu.in instead of kernel_variant.cu.in.
State of correctness:
- SUM_THRESH self-join + AB-join PASS end-to-end (cleanest signal: the
warp-reduce-then-atomicAdd path is robust to bugs that would surface
elsewhere).
- 1NN_INDEX / MATRIX_SUMMARY / KNN FAIL. Two likely root causes:
(a) Slow path (do_row_edge equivalent) is a no-op placeholder. Tail
tiles of each meta-diag produce NaN entries in the column
profile (~87 NaN of 1801 at n_x=2000).
(b) Possible off-by-one in merge_to_column_shfl's idxc assignment
or in the slot_valid mask. Atomic-max amplifies indexing bugs
that atomic-add masks; SUM_THRESH passing rules out cov drift
in the bulk.
State of dispatch:
- TEMPORARY: v0..v5 commented out in CMakeLists.txt SCAMP_VARIANT_TUPLES
and in kernels_variants.h SCAMP_VARIANT_DISPATCH so the build is
fast during shfl-variant development. ONLY variant 6 is instantiated.
Force the cache to (2,2,0,8,8) to exercise it. Re-enable v0..v5
before merge.
Smem footprint comparison on RTX 3080 (sm_86) for the 4 enabled profile
types:
variant smem (DP) smem (SP)
v4 (slide) 32.8 KB 17.2 KB
v6 (shfl) 6.8 KB 4.5 KB
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
scripts/cov_shuffle_diagram.py renders the per-row state of the design-A
variants (variant 6 cross-warp via smem hand-off; variant 8 per-warp-
independent) so you can read off where covariance values flow between
threads.
Each frame for one row shows:
- the BEFORE state: which block-local diagonal each lane's cov[i]
slot is currently tracking;
- the smem cov_handoff region (variant 6 only): what lane 31 of each
warp writes and what lane 0 reads;
- the AFTER state: where each value moved to via the intra-warp
shuffle + cross-warp override, with masked-but-still-present junk
cells shown in parens.
Configurable BLOCKSZ, DPT, warp size, row count, variant. Defaults are
tiny (BLOCKSZ=8 WARP=4 DPT=2 rows=3) so the diagrams fit on one screen;
pass --blocksz 256 --warp 32 for realistic hardware geometry (verbose).
Side-by-side, the two variants make the design tradeoff obvious:
variant 6 -- one junk source (lane 0 of warp 0), needs smem + 1 sync
per row, but only warp 0's lane 0 produces masked cells.
variant 8 -- junk at every warp boundary, no smem or sync, but every
warp's lane 0 progressively masks more slots as r grows.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Alignment fix: compute the cell width once across all frames (including
masked '(N)' parenthesized variants AND lane headers), then apply the
same right-justified width to every cell. Previously narrow numerics
like ' 1' were padded inconsistently with wide variants like '(15)'.
Waste table: replace the single-tile_height waste line with a sweep
across tile_heights spanning [DPT, 4*32*DPT], shown for both the
DEMO geometry (whatever --blocksz the user picked) AND the realistic
kernel geometries (BLOCKSZ=256 for DP, 512 for SP, warp=32).
The tables make the var 6 vs var 8 tradeoff concrete. For BLOCKSZ=256
DPT=2 (the v6 setting in this branch):
tile_height var 6 wasted var 8 wasted 8/6 ratio
2 0.10% 0.78% 8.0x
4 0.29% 2.34% 8.0x
16 1.46% 11.72% 8.0x
32 3.03% 24.22% 8.0x
64 6.15% 49.22% 8.0x <-- our v6 setting
128 12.40% 74.61% 6.0x
256 24.90% 87.30% 3.5x
So variant 6 wastes 6.15% at tile_height=64; variant 8 at the SAME
tile_height wastes 49.22% (8x more). This was the previously-overlooked
cost of dropping the cross-warp hand-off -- per-warp-independent only
wins at small tile_height (<~16), where the rotation amortization is
also worse. Earlier I'd ballparked variant 8 waste at ~1.5%, which was
only counting the initial junk slot, not its propagation through the
warp -- the table now shows the propagated cost too.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…-of-bounds local_mp_col writes, eliminate KNN duplicate matches by re-initializing column registers, and implement CMAKE_CUDA_ARCHITECTURES build escape hatch.
…int tuning, and dynamic tile size defaults
…pu-kernel-shfl-variant
Kernel/variant changes:
- Add v7 (shfl bps=8 DPT=4 our=8 kti=16 -> tile_height=128, max for DPT=4)
and v8 (shfl bps=4 DPT=8 our=8 kti=32 -> tile_height=256). v8 is the
first DPT=8 variant; combined with the bigger tile it wins 3 of the
10 (profile, precision) autotune targets at 512K bench size,
including 1NN_INDEX SP and both SUM_THRESH cases. Recommended default
on RTX 3080 is now v8 bsz=128 (geomean 1.224x of per-target best,
worst-case 2.04x).
- Fix do_update_info_shfl dispatcher: it only branched on
updates_remaining = 0..3, so DPT >= 5 silently skipped slot rotations
for slots 0..(DPT-5). That eventually de-synced state.global_col
against tile_start_col and crashed in flush_all_cols_to_smem with a
smem OOB. Extend ladder to 0..7 + static_assert(DPT <= 8).
- Sliding-window LaunchDoTileWithGeometry now dispatches do_tile on a
runtime blocksz (64/128/256/512) the same way LaunchDoTileShfl
already did, so the autotuner can sweep blocksz for those variants
too.
Autotune machinery:
- g_cfg_override was thread_local; do_SCAMP spawns workers via
std::async so they never saw it -- every trial silently fell back to
the on-disk cache, flattening per-variant timings. Change to a
process-wide std::mutex-guarded optional. This is the fix that
unblocked all of the per-variant differentiation below.
- RunAutotuneWithBenchmark now sweeps (variant geometry x blocksz),
prints a progress line per trial with elapsed/ETA, identifies the
winner by variant name (v0..v8), and emits a cross-target weighted
score table -- geomean of time/per-target-best across all 10 targets,
also reports worst-case ratio. Recommends a "best safe default"
trial at the end.
- DefaultBenchmarkVariant uses computing_rows=true so the workload is
a real symmetric self-join (was upper-triangle only, hiding the cost
asymmetry); max_tile_size bumped to 512000 to match the GPU default
in main.cpp; benchmark input length is overridable by
SCAMP_AUTOTUNE_INPUT_LENGTH env var (default 131072 for fast dev
sweeps -- 512K via the env gives results that match production
scale better).
- IsSupportedKernelConfig admits all of {64, 128, 256, 512} blocksz
for every enabled variant, not just the precision-tied default.
Built-in autotune cache:
- Refresh data/autotune_cache.txt with the RTX 3080 winners from the
new sweep. SP entries change substantially -- v8 (DPT=8 shfl) wins
1NN_INDEX SP + both SUM_THRESH cases, v5 (bps=1 tile=256) wins the
rest of the SP targets. DP entries shift to bsz=64.
Notes:
- v2 / v5 re-enabled alongside v0/v1/v3/v4/v6 so the score sweep covers
the full set; only v8 is the brand-new shape, the rest were already
there.
- SCAMPShflState comment cleaned up to reflect that dfc2/dgc2/inormc2
staging is gone (read directly from global at rotation now).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two related cache bugs that meant the new shfl variant defaults never actually made it to end users: 1. data/autotune_cache.txt is embedded into the binary via CMake file(READ) + configure_file. file(READ) doesn't track the source path as a dependency, so editing the cache and running `cmake --build` would silently leave the previous contents baked into the binary -- the user updates the cache, rebuilds, and the runtime lookup still misses their new entries. Add CMAKE_CONFIGURE_DEPENDS for the cache file so changes auto-trigger a reconfigure. Verified via `strings build/SCAMP | grep RTX_3080` before/after. 2. pyscamp.autotune() called SCAMP::RunAutotune (the older stub) which just writes the compile-time default cfg for every (profile, precision) target -- no benchmarking. The user-cache write then clobbers the built-in cache lookup, so a conda-forge user who ran pyscamp.autotune() ended up WORSE off than before they called it (the shipped data/autotune_cache.txt entries got masked by the stub-default cache). Switch the Python binding to RunAutotuneWithBenchmark + DefaultBenchmarkVariant -- same path as the CLI's --autotune -- and link pyscamp against autotune_bench. Verified end-to-end: pyscamp.autotune() now runs the full sweep (~4 min for 9 variants x 4 blocksizes x 10 targets at 131K bench size), writes the benchmarked winners to the cache, and SCAMP at runtime picks them up (6/6 launches matched the cache entries). Also default pyscamp.autotune() to device 0 only -- mirrors the CLI behavior. Tuning every visible GPU when most multi-GPU boxes have identical devices is wasted work. Callers can pass devices=[...] to override. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the missing scaffolding around the autotune cache lookup so that end users get a meaningful diagnostic when their device isn't in the cache, and the cache + lookup logic is covered by lightweight C++ tests that don't require a GPU at runtime. What's new for users: - When SCAMP launches a kernel and finds no autotune entry for the current (device, profile, precision) in either the user override cache or the binary's built-in cache, a one-shot warning is emitted to stderr identifying the missing tuple and the compile-time-default config used as a fallback. The warning is deduplicated per tuple so a long-running process sees at most one line per tuple. - SCAMP_AUTOTUNE_QUIET=1 suppresses the warning. pyscamp's module init sets this by default (with setenv overwrite=0 so an explicit user value still wins) -- notebook users don't get uninvited stderr; CLI users still do. - Docs: new docs/source/autotune.rst covers the cache, lookup order, default path, --autotune workflow, the warning, env-var overrides, and troubleshooting; added to the toctree. What's new for developers: - Factored LookupKernelConfigForDeviceKey() out of GetKernelConfigForDevice() -- pure logic, no GPU dependency, takes cache pointers + device key + fallback as arguments. Production callers still use GetKernelConfigForDevice (which queries the real device and threads through the pure helper); tests drive the helper directly with synthetic device keys. - New ResetAutotuneWarnings() for tests; the dedup set is process- global so without it later tests would silently no-op. - test/cpp/test_autotune_cache.cpp: 15 unit tests covering the lookup chain (user hit, built-in hit, user-beats-built-in, unsupported variant falls through, multi-device key isolation), the cache-miss warning (fires once, one-shot per tuple, separately per (profile,precision)), the SCAMP_AUTOTUNE_QUIET env var (suppresses + respects falsey values), DefaultPath() resolution (SCAMP_AUTOTUNE_CACHE -> XDG_CACHE_HOME -> HOME/.cache), on-disk round-trip, and the malformed-cache throw. - BUILD_SCAMP_TESTS=ON CMake option (default OFF) wires the test into ctest. Gated on CMAKE_CUDA_COMPILER for now since the only test module currently exercises GPU-side machinery (no CUDA *runtime* needed -- the test never touches a device -- but the toolchain has to be present to build gpu_utils). Verified end-to-end: - ctest passes 15/15. - Real SCAMP run on a device in the built-in cache: no spurious warning. - pyscamp module init sets SCAMP_AUTOTUNE_QUIET=1 (confirmed via libc.getenv from ctypes; Python's os.environ is a startup snapshot so doesn't show it, but the underlying env that the C++ warning code reads is correctly set). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The unit tests added in 1254f40 (test/cpp/test_autotune_cache.cpp) are behind -DBUILD_SCAMP_TESTS=ON which CI wasn't passing, so they were shipped but never run. Enable the flag + invoke ctest in all three CUDA jobs: - build-cuda-cli (Ubuntu/Windows x g++/clang++/cl): catches compiler matrix regressions in the cache parser / lookup code. - build-cuda-versions (Ubuntu/Windows x CUDA 12.6/12.8/13.0): catches CUDA-toolkit-version regressions (rare but the test is cheap to run). - build-and-test-cuda (self-hosted with a GPU): runs alongside the existing integration tests. The unit tests require the CUDA toolchain at build time but never touch a real device at runtime, so they run successfully on the build- only runners with no NVIDIA driver. CMakeLists already gates the test target on CMAKE_CUDA_COMPILER so the CPU-only CI jobs short-circuit cleanly without any extra workflow changes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ts.h
Variant table churn (prune, rename, add) used to require touching three
files in lockstep:
- SCAMP_VARIANT_TUPLES in CMakeLists.txt (the build-system list).
- kVariants[] in kernel_config.cpp (the runtime table).
- SCAMP_DECL_VARIANTS_FOR_PROFILE + SCAMP_VARIANT_DISPATCH in
kernels_variants.h (the per-variant forward decls + dispatch ladder).
Drift between them turned into hard-to-spot bugs (e.g., a kernel TU
built but never called because its dispatch case was missing).
Make SCAMP_VARIANT_TUPLES the single source of truth and generate the
other two:
* The tuple list drops its explicit-index prefix -- variant N is the
Nth entry in the list. Add/remove/reorder edits one place.
* CMake walks the list, building three @-substitution strings: the
kVariants[] initializer body, the per-variant forward decl block, and
the per-variant dispatch if-ladder body.
* configure_file emits kernel_variants_table.h (consumed by
kernel_config.cpp) and kernels_variants.h (consumed by the per-profile
.cu dispatchers) into the build dir.
* gpu_utils and gpu_kernels pick up CMAKE_CURRENT_BINARY_DIR on the
include path so the generated headers resolve.
Also retire the four sliding-window/shfl variants the multi-device
autotune sweep showed never won anything: the pre-prune labels v1, v3,
v4, v7 are gone. Surviving variants are compacted to indices 0..4:
pre-prune new geometry role
--------- --------- -------------------- ----------------------
v0 v0 {2,2,2,16,16} sliding-window default
v2 v1 {2,4,4,4,50} master-like 4x4
v5 v2 {1,4,4,16,16} big tile / low occ
v6 v3 {8,4,0,8,8} shfl DPT=4 tile=64
v8 v4 {4,8,0,8,32} shfl DPT=8 tile=256
3080 cache entries already used the kept variants; the only entry
that referenced a retired variant was P100 APPROX_ALL_NEIGHBORS
DOUBLE (was v3 pre-prune), substituted with v5 -- the runner-up on
the P100 cross-target score sweep, geometry preserves bsz=256.
Saves 4 x 5 profiles x 6 instantiations = ~120 do_tile template
emissions on a clean build.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The autotune machinery requires QueryDeviceProps() to resolve the
device cache key, which in turn needs a CUDA-capable GPU. Our CI
machines (windows-latest, ubuntu-latest in the build-cuda-cli matrix)
have the CUDA toolchain but no device, so the bench-driven autotune
pathway was previously untestable in CI.
Refactor:
- Split RunAutotuneWithBenchmark into a CUDA-aware wrapper plus a new
RunAutotuneWithBenchmarkForDeviceKey(device_key, device_id, bench,
cache_path, verbose, print_banner) overload that takes the resolved
device key directly. The wrapper calls QueryDeviceProps, prints the
device-specific banner, then delegates. The inner overload prints a
shorter device-key-only banner when print_banner is true (the wrapper
passes false to avoid duplicating the header).
- No behavior change for production callers: SCAMP --autotune and
pyscamp.autotune() go through the wrapper unchanged.
Test (test_autotune_cache.cpp):
- Add Test_AutotuneFullPipeline_FakeBench. Drives the full autotune
sweep with a synthetic BenchmarkFn that returns 0.5s for one rigged
(variant, blocksz) winner per (profile, precision) target and 1.5s
for every other trial. After the call:
* verifies the cache file exists at the requested path,
* reloads it and asserts every target's recorded config matches the
rigged winner's geometry tuple and blocksz,
* exercises LookupKernelConfigForDeviceKey with the reloaded cache
to confirm the lookup path returns the rigged winner end-to-end.
Verifies the variant-sweep + per-target-winner + cache-write +
cache-reload + lookup chain without touching a GPU.
Verification:
- Linux (gcc 11, build_sm86): 18/18 tests pass via ctest.
- Native Windows (VS2022 MSVC 14.30, CUDA 11.6, sm_86): same 18/18
pass; the test compiles cleanly on MSVC.
- End-to-end on Windows with a real GPU (RTX 3080):
* SCAMP --autotune wrote 10 entries to
%LOCALAPPDATA%\scamp\autotune.txt
* Subsequent SCAMP --print_debug_info launch picked the v3 shfl
config (bps=8 dpt=4 our=8 kti=8 blocksz=64) from that cache for
DP 1NN_INDEX without emitting the cache-miss warning.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Several places that document the autotune subsystem had drifted since
recent changes:
- The cache-path resolution was Linux-only in user-visible docs; the
recent _WIN32 branch of AutotuneCache::DefaultPath (LOCALAPPDATA /
USERPROFILE) wasn't reflected anywhere.
- The autotune.rst page quoted 9 variants and ~360 trials; we ship 5
variants today and a full sweep is 200 trials (5 x 4 x 10).
- pyscamp.autotune()'s docstring claimed empty devices=[] means "every
visible CUDA device is tuned" -- the implementation actually defaults
to device 0 only (the rationale: a sweep takes O(minutes) and most
multi-GPU hosts have identical devices).
- docs/pyscamp.py (the Python stub Sphinx reads for autodoc) was
missing autotune() and gpu_supported() entirely, so neither showed up
in the rendered pyscamp API page.
Updates:
- docs/source/autotune.rst: TL;DR, lookup chain, "Default cache
location", "Clearing or resetting", and troubleshooting all mention
the Windows path (%LOCALAPPDATA%\scamp\autotune.txt, with
%USERPROFILE%\.cache fallback). Sweep size corrected to "5 variants
enabled today -> 200 trials"; phrased so the count tracks future
variant-table changes via the "current variant count" framing.
- docs/pyscamp.py: add autotune() and gpu_supported() to the
autosummary and emit matching docstrings for the Sphinx stub build.
- src/python/SCAMP_python.cpp: rewrite the autotune() pbdoc to match
the implementation (device 0 only by default; full per-platform
cache-path resolution). Single backslashes inside R"pbdoc()" --
pbdoc is a raw string, so \\scamp would render as \\scamp.
- src/main.cpp: --autotune --help text mentions all three resolution
steps and the Windows path.
- src/core/gpu_kernel/{autotune.h,autotune_cache.h,autotune.cpp,
builtin_autotune_cache.h}: doc-comments referencing the cache path
now mention the platform-specific fallback (or refer readers to
AutotuneCache::DefaultPath, which is the single source of truth).
No code paths changed; this is documentation only.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Branch-wide sweep ahead of merging to master: comments that narrated
intermediate states ("before this split", "now lives in", "pre-prune
label v6", "design-A"), values that were no longer used, and the
backward-compat RunAutotune stub that nothing on the branch still
calls. Each removal is mechanical -- no behavior change.
Header/comment refresh:
- kernel_variant.cu.in, kernels_impl.h, kernels_compute.h,
autotune_cache.cpp, autotune.cpp, autotune.h: replace
before-vs-after-split / before-vs-after-this-change framing with a
plain description of what the code does today.
- kernels_compute_shfl.h, kernels_impl_shfl.h, kernels.cu,
kernel_variant_shfl.cu.in, kernel_gpu_utils.h: drop "design-A"
codename (used during the multi-design bake-off, not meaningful on
master) in favor of "cov-shuffle" / "shfl".
- kernel_config.cpp: drop "an earlier iteration had 9 entries v0..v8"
and "pre-prune label v6" history. Reframe GetDefaultKernelConfig so
it picks the first shfl variant by iterating kVariants rather than
hard-coding index 3 with a stale-on-renumber comment, and let
IsSupportedKernelConfig iterate kVariants directly instead of a
redundant hard-coded {0..4} list.
- docs/source/autotune.rst: drop the "RunAutotune stub vs
RunAutotuneWithBenchmark" troubleshooting bullet (RunAutotune is
gone; see below).
- src/core/gpu_kernel/CMakeLists.txt: rewrite the layout/build-table
prologue, drop "we no longer carry explicit indices" and "same as
before" framing, drop stale "5 profiles x 6 variants = 30 TUs"
count (it's 5 x 5 = 25 today and the number tracks variant changes
anyway).
- SCAMP_python.cpp: drop the "not the older RunAutotune stub"
comment from the autotune() impl now that there is no older stub.
Dead code removed:
- SCAMP::RunAutotune(...) (the non-benchmarking thin wrapper that
wrote the compile-time default to every target). Not called by
anything since RunAutotuneWithBenchmark landed; the bench path is
now the only entry point.
- DEFAULT_DIAGS_PER_THREAD, DEFAULT_UNROLLED_ROWS,
DEFAULT_OUTER_UNROLLED_ROWS, DEFAULT_KERNEL_TILE_ITERS,
DEFAULT_BLOCKSPERSM constants in kernel_constants.h. Variant
geometries live in SCAMP_VARIANT_TUPLES; these duplicates had no
callers and would silently drift if a future variant-0 edit
happened in only one place.
Doc-comment correctness:
- autotune.h prologue + SetKernelConfigOverride doc say "process-wide"
override, not "thread-local" -- matches the implementation, which
uses a process-wide std::mutex-guarded optional so std::async
workers see the override (a thread_local would not).
- autotune_cache.h schema example shows 9 fields (the actual record
layout) rather than the 6-field draft form.
Verified: SCAMP rebuilds clean; 18/18 unit tests pass on Linux.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The cache file format already had graceful per-entry fall-through
built into the read path (a stale variant tuple loses to the next
cache source on lookup; a wrong-version header silently empties the
cache), but the upgrade contract wasn't explicit anywhere and the
two failure modes weren't all covered by unit tests. Adding both so
future kernel changes have a clear "do I need to wipe everyone's
cache?" decision tree and so a regression of the fall-through path
shows up in CI.
Upgrade contract (now spelled out in autotune_cache.h + the user
docs at docs/source/autotune.rst):
- Add / remove / reorder variants in SCAMP_VARIANT_TUPLES:
no version bump. Per-entry fall-through handles it -- the cache
is keyed by (bps,dpt,ur,our,kti) tuple, not by index, so existing
entries either still match (lookup hits) or fall through (lookup
misses on that one tuple, the other cache entries are unaffected).
- Schema or kernel-semantics change forces a wipe:
bump kHeader (SCAMP_AUTOTUNE_V<N> -> V<N+1>). ParseStream silently
treats a non-matching header as empty rather than throwing, so an
end-user upgrading their pip wheel doesn't see SCAMPException --
they just fall through to the new release's built-in cache and
can re-tune at their leisure.
New tests in test/cpp/test_autotune_cache.cpp:
- Test_FutureVersionHeaderSilentlyEmpties: SCAMP_AUTOTUNE_V99 header
+ a record line loads zero entries (not a throw), lookup hits the
cache-miss warning. Covers the post-version-bump escape hatch.
- Test_MissingHeaderSilentlyEmpties: a file with data lines but no
header at all also loads zero entries.
- Test_PartiallyStaleCachePreservesGoodEntries: a cache with one
valid entry + two entries naming bogus variant tuples loads all
three, but lookups hit only the valid one; the stale entries fall
through to fallback + the one-shot warning. Covers the
everyday "we changed the variant table" upgrade case.
- Test_StaleUserCacheFallsThroughToBuiltin: a user cache with a
future-version header (all entries silently dropped) plus a
built-in cache with the same device key -- the built-in entry must
win. Covers the worst-case "the upgrade wiped my tuned config but
the shipped built-in still has me covered" path.
22/22 tests pass on both Linux (gcc 11) and native Windows
(VS2022 MSVC 14.30, CUDA 11.6, sm_86).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two unrelated doc fixes that came up together while preparing the branch for review. Env-var docs: - environment.rst gains a "Environment variables" section that lists every env var SCAMP / pyscamp reads, build-time and run-time, with a one-line purpose for each. Before this commit the cache-related ones were documented only in autotune.rst and pyscamp.py, which meant they were discoverable only by someone already reading the autotune page. The distributed gRPC env vars (SCAMP_SERVER_SERVICE_HOST/PORT) weren't documented anywhere. - pyscamp/intro.rst gains PYSCAMP_BUILD_TYPE and PYSCAMP_NO_PLATFORM_AUTOSELECT, two setup.py-honored env vars that previously had no user-facing documentation. Autotune workload-size measurement (RTX 3080, full 10-target sweep): 64K -> ~4 min, geomean 1.325, worst 2.25 128K -> ~8 min, geomean 1.308, worst 3.47 (current default) 256K -> ~25 min, geomean 1.278, worst 2.77 The per-target winners themselves shift across sizes (e.g. SUM_THRESH/DOUBLE picks v0 at 64K, v2 at 128K, v4 at 256K -- those are different kernel geometries, not just different rankings), so a short sweep doesn't just rank suboptimally for the cross-target default, it picks suboptimal cache entries for individual targets. 256K is meaningfully tighter than the current 128K default. Updates: - docs/source/autotune.rst "Choosing the benchmark workload size" now has the comparison table above and concrete guidance: 128K is the deliberate casual-dev default; 256K or 512K is recommended before shipping a cache entry to data/autotune_cache.txt. - src/core/gpu_kernel/autotune_bench.cpp comment block over kBenchmarkInputLengthDefault: dropped the stale "under a minute on a fast GPU" claim (the sweep is closer to 10 min at the default size) and the stale trial-count formula (it omitted the blocksz axis introduced when autotune started sweeping blocksz). No default value change in this commit -- bumping SCAMP_AUTOTUNE_INPUT_LENGTH from 131072 to 262144 would 4x the wall time of the default `--autotune` run, which is a separate UX trade-off worth a follow-up discussion. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The autotune-size sweep on RTX 3080 (committed in the prior diff) showed that 256K gives meaningfully tighter cross-target rankings than the previous 128K default: 128K: ~8 min, geomean 1.308, worst-case 3.47 256K: ~25 min, geomean 1.278, worst-case 2.77 (~20% better worst case) Per-target winners themselves shift between sizes (e.g. SUM_THRESH/DOUBLE picks different kernel geometries at 64K vs 128K vs 256K), so a too-small sweep doesn't just rank the cross-target default suboptimally -- it picks suboptimal entries for individual cache rows. 256K is the smallest size where launch overhead is reliably dominated by steady-state kernel cost. The 4x wall-time cost is amortized over the lifetime of the autotuned cache -- users hit `SCAMP --autotune` once per device. Anyone tuning on an older / slower GPU where 25 minutes is too long can dial it back via SCAMP_AUTOTUNE_INPUT_LENGTH=131072 (the doc updated here calls this out explicitly). Changes: - src/core/gpu_kernel/autotune_bench.cpp: kBenchmarkInputLengthDefault 131072 -> 262144. Comment block above the constant explains the trade-off and points at the doc for the per-N comparison. - docs/source/autotune.rst: "Choosing the benchmark workload size" reflects 256K as the new default and reframes the smaller-N row of the comparison table as "use this if 256K is too slow on your GPU". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…l-variant # Conflicts: # .github/workflows/build-and-test.yml # src/python/SCAMP_python.cpp
…real SCAMP crashes
Four loosely-related fixes that came out of inspecting CI output for
the latest PR push.
1) SCAMP_distributed missing autotune_bench link dep
The BUILD_CLIENT_SERVER=1 docker build was failing with:
undefined reference to `SCAMP::DefaultBenchmarkVariant(...)'
src/distributed/CMakeLists.txt builds SCAMP_distributed from the same
src/main.cpp the top-level SCAMP target uses, and that file references
DefaultBenchmarkVariant under _HAS_CUDA_ (which mark_cuda_if_available()
propagates to every distributed target). The top-level SCAMP target
gets autotune_bench linked under the same CUDA gate; SCAMP_distributed
did not. Mirror the link rule so the dep is satisfied in both places.
2) Test scripts spammed the autotune-cache-miss warning
run_tests.sh and run_tests.py spawn many short SCAMP processes. The
one-shot dedup on the "no autotune entry for device ..." warning is
per-process, so each spawn re-emits the warning. Both harnesses now
export SCAMP_AUTOTUNE_QUIET=1 before any SCAMP invocation; users
running the harnesses interactively can override by setting the env
var themselves first.
3) Warning text now points at SCAMP_AUTOTUNE_QUIET
Added a parenthetical to the cache-miss warning explaining how to
silence it process-wide. Users hitting this in a tight CI loop now
have a fix-it-yourself hint without grepping the source.
4) run_tests.py was swallowing SCAMP crashes
The harness called subprocess.call() and discarded the exit code, then
read the per-test output files. If SCAMP aborted (e.g.
"GPUasssert: unspecified launch failure" on a kernel-config the
target device couldn't run), the previous subtest's output files were
still on disk, so the comparator silently re-evaluated stale data and
the test "passed". Two changes:
- rm the output files (mp_columns_out, mp_columns_out_index,
mp_rows_out, mp_rows_out_index) before each SCAMP invocation, so a
missing output surfaces as a None array instead of a stale read.
- check subprocess.call's return code and raise RuntimeError on
non-zero so the test runner exits non-zero.
This fourth change will likely surface a real bug on the next CI run
of build-and-test-cuda: the compile-time-default kernel variant (v3
shfl, blocksz=128 bps=8 dpt=4 ur=0 our=8 kti=8) hits an unspecified
launch failure on sm_60 (the self-hosted runner reports its device
key as 'NVIDIA_Graphics_Device__sm_60', which doesn't match the
'Tesla_P100-...' built-in cache rows, so the fallback fires). That's
intentional -- the bug was already happening; the harness was just
hiding it.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…UTE_COLS
compute-sanitizer on a P100 (sm_60) bisected the
"GPUasssert: unspecified launch failure tile.cpp:175" reproducer
that build-and-test-cuda was hitting down to a single offset in
do_tile_shfl: every thread in block (0,0,0) trapped at
+0x21218 of the (PROFILE_TYPE=1NN, COMPUTE_ROWS=true,
COMPUTE_COLS=false) instantiation of the kernel, attributed to
kernels_impl_shfl.h:148.
That line is the start of the per-tile distc reload loop:
for (int i = 0; i < DiagsPerThread; ++i) {
int col_idx = state.local_col[i];
if (col_idx >= 0 && col_idx < tile_width) {
...
state.distc[i] = smem.local_mp_col[col_idx];
...
}
}
SCAMPShflSmem's ctor only re-binds local_mp_col to a real backing
pointer when compute_columns is true; with COMPUTE_COLS=false (the
ab-join lower-triangle path, where the col profile belongs to the
other tile in the pair) the Map's data pointer stays at the nullptr
sentinel from the member initializer list. The smem read then
dereferences a null pointer.
Why Ampere happened to "work": the kernel does a check
(col_idx >= 0 && col_idx < tile_width) before the dereference and
the resulting distc is never consumed (merge_to_column_shfl /
flush_all_cols_to_smem are also COMPUTE_COLS-gated, so the trash
distc gets overwritten before it matters). On sm_86 the load
through null seems to retire silently in this configuration; on
sm_60 the MMU traps on the load and the whole kernel aborts.
Fix: gate the entire per-tile distc-reload loop on COMPUTE_COLS,
mirroring the gating already in place for flush_all_cols_to_smem
(kernels_impl_shfl.h:219) and for the smem.local_mp_col writes
inside the init_smem_shfl helpers
(kernels_compute_shfl.h:628/659/691). When COMPUTE_COLS is off
distc is unused for the rest of the tile; just seed it with
init_dist so any speculative reads see a well-defined value.
Verified locally:
- sm_86 (RTX 3080): 460/460 Extended Tests pass before AND after.
Lower bound that the fix is non-regressing; the bug never
manifested on Ampere in the first place.
- The fix is small enough to reason about by inspection; the
sm_60 confirmation comes from re-running compute-sanitizer
against the patched kernel.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Refactor: blocksz is now a per-variant property declared in
SCAMP_VARIANT_TUPLES rather than a precision-tied global constant
(BLOCKSZ_DP / BLOCKSZ_SP) gated by a hidden "sliding-window vs shfl"
rule inside GetKernelConfigForVariant. The change removes the hidden
coupling without changing runtime behavior: each variant declares
default_blocksz_dp and default_blocksz_sp explicitly, and
GetKernelConfigForVariant returns the field that matches the requested
precision.
Tuple format grows from 5 to 7 fields:
before: bps|dpt|ur|our|kti
after: bps|dpt|ur|our|kti|default_blocksz_dp|default_blocksz_sp
Values are picked to preserve current cold-start behavior bit-for-bit:
- sliding-window variants get 256|512 (== BLOCKSZ_DP, BLOCKSZ_SP)
- shfl variants get 128|128 (the prior hardcoded shfl default)
The kernel is still instantiated for all of {64, 128, 256, 512} per
variant, the autotuner still sweeps blocksz as an independent axis,
and on-disk cache entries can still name any of the four values; the
per-variant defaults only feed the cold-start fallback path where no
cache hit exists.
Files removed (everything was used only by the old precision-tied
default path):
- src/core/gpu_kernel/kernel_constants.h (was BLOCKSZ_DP/SP only)
- SCAMP::BlocksizeForPrecision in kernel_config.cpp
- SCAMP::get_blocksz in kernel_gpu_utils.{h,cu} (already dead --
declared but not called by anything)
Doc updates folded in:
- docs/source/environment.rst: document SCAMP_USE_EXTERNAL_EIGEN
(was honored by cpu_kernel/CMakeLists.txt but undocumented), plus
a note about Eigen 5.x still publishing as the Eigen3 package.
- src/core/gpu_kernel/CMakeLists.txt: expand the comment above the
target_link_libraries(gpu_kernels ... Eigen3::Eigen) call to
mention the dual Eigen-acquisition path (vendored INTERFACE
target vs find_package import).
- docs/source/autotune.rst + builtin_autotune_cache.h: small polish
landed alongside.
Verified locally on RTX 3080:
- kernel_variants_table.h regenerates cleanly with the 7-field
initializer rows.
- 22/22 unit tests pass.
- run_tests.py subset (windows 13/33, inputs 100/250/1500, tiles
1024/2048) = 460/460 GPU tests pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The prior commit gated the per-tile distc-reload loop on COMPUTE_COLS
to fix the v3 Pascal nullptr-deref. Tracing the lifecycle of distc
across a tile shows the gate fixed the bug but missed the deeper
redundancy: the pre-load itself was a no-op.
distc lifecycle inside one tile:
1. init_smem_shfl: smem.local_mp_col[col_idx] = global_profile[col]
(cross-tile running best from the global profile).
2. (pre-load) distc[i] = smem.local_mp_col[col_idx]
-- COPY of the prior_global value into the per-thread accumulator.
3. merge_to_column_shfl, called per row:
if (dist[i] > distc[i]) distc[i] = dist[i];
-- so distc[i] ends the tile at max(prior_global, this_tile_max).
4. flush_all_cols_to_smem:
atomicMax(smem.local_mp_col + col_idx, distc[i])
-- block-scoped atomic-max merge into the same smem cell.
The atomicMax in step 4 is what actually preserves cross-tile
continuity. Whether step 2 starts distc at prior_global or at init_dist,
step 4's atomic computes:
smem.local_mp_col[col_idx]
= max(prior_global, distc_final)
= max(prior_global, this_tile_max)
in both cases. Same observable result.
Dropping the pre-load:
- Removes DPT * tiles * blocks LDS instructions per kernel launch.
- Removes the COMPUTE_COLS branch + reload-or-init duplication added
in the prior commit -- the null-deref the gate was protecting
against simply doesn't happen if we never read from
smem.local_mp_col in this loop.
- Removes the col_idx bounds check (`col_idx >= 0 && col_idx <
tile_width`) since the only consumer (merge_to_column_shfl) reads
distc by register index, not by col_idx.
Per-row merge_to_column_shfl is a branchless predicated select on the
GPU regardless of whether distc starts at prior_global or init_dist,
so no per-row cost difference either way.
A FUTURE OPT comment is left in the kernel for max-style profiles
(1NN / 1NN_INDEX / APPROX_ALL_NEIGHBORS / MATRIX_SUMMARY): pre-loading
plus the _check atomic variants (fAtomicMax_check / MPatomicMax_check)
could skip the flush-time CAS when this tile's distc didn't beat the
prior global. The bookkeeping has to refresh the cached check inside
update_info_shfl on rotation, and adds DPT registers per thread
(potentially regressing occupancy on v4 DPT=8 SP). Not worth pursuing
without a micro-benchmark on smaller GPUs or converged workloads
where the atomic-skip rate would be high.
Verified on RTX 3080 (sm_86):
- 22/22 unit tests pass.
- 685/685 GPU run_tests.py subset pass.
- 1NN_INDEX 1M perf check (3 runs median): DP 12.16 -> 12.22 s
(+0.5%), SP 2.10 -> 2.15 s (+2%). Both within run-to-run noise;
net perf-neutral. The Pascal repro is fixed for the same reason as
the prior commit's gate (no smem.local_mp_col read on the
COMPUTE_COLS=false path), but expressed as a smaller kernel body
instead of an if-constexpr.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The DPT-wide per-row hot loop in do_row_shfl was hand-unrolled via
#pragma unroll. The body is pure element-wise array ops on the
SCAMPShflState::cov / inormc / dfc / dgc / local_col / global_col
fields, all of which are already Eigen::Array<T, DPT, 1>. Express
the loop as Eigen array expressions and let nvcc unroll the
expression-template-generated IR instead.
The four sub-operations become:
1. d_raw = (cov * inormc * inormr).cast<DISTANCE_TYPE>()
2. cov += dfc * dgr + dgc * dfr (uses old cov on RHS only)
3. diag = local_col.cast<int>() - row_in_tile (signed; preserves
the negative-diag case the original used)
4. slot_valid (DPT-wide bool array) + slot_valid.select(d_raw,
init_dist) for the final dist
The SLOW-PATH branch keeps the global_row-out-of-range check as a
scalar early-return (setZero on the bool array) because Eigen's && /
& between an Array and a scalar bool doesn't broadcast; per-element
predicates are then chained with Array && Array.
Verified on RTX 3080 (sm_86), 1NN_INDEX, 1M elements, window=100,
median of 8 runs:
variant manual loop Eigen delta
---------------------------------------------
v3 DP 12.16 s 12.18 s noise (~+0.2%)
v4 SP 2.10 s 2.00 s **-5%** (every Eigen run faster
than every manual run)
Hypothesis for the SP-only win: v4 has DPT=8, so the manual
#pragma-unrolled loop produces a wider IR that nvcc's scheduler
seems to lay out worse than the equivalent Eigen-expression IR. v3
at DPT=4 is small enough that both forms produce essentially the
same code. The Eigen IR also keeps cov-update and d_raw computation
explicitly independent (no false ordering dependency between scalar
statements that nvcc would have to prove disjoint).
Correctness:
- 22/22 unit tests pass.
- 685/685 GPU run_tests.py subset (windows {13,33,100}, inputs
{100,250,1500}, tiles {1024,2048}) pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the per-element memmove-style cov shift with Eigen segment
assignment:
// Before:
for (int i = DPT - 1; i > 0; --i) cov[i] = cov[i - 1];
cov[0] = wrap_in;
// After:
if constexpr (DPT > 1) {
cov.template tail<DPT - 1>() = cov.template head<DPT - 1>().eval();
}
cov[0] = wrap_in;
The .eval() forces a temporary so the tail-and-head overlap doesn't
alias (the manual loop handled aliasing implicitly by walking from
high-to-low). For DPT in {4, 8} the temporary is a small register
array; nvcc folds it into the same register allocation as the
destination, so the .eval() is essentially free.
Verified on RTX 3080 (sm_86), 1NN_INDEX, 1M, window=100, median of
13 SP runs / 5 DP runs:
variant prior (manual cov-shift) cov-shift Eigen delta
-----------------------------------------------------------
v3 DP 12.18 s 12.16 s noise
v4 SP 2.00 s 1.99 s ~-1%
The SP improvement is small (within the variance band of any
individual run) but the median moves the right direction and every
run except one is sub-2.00. Combined with the prior commit
(Eigenify per-row dist + cov update), the v4 SP path is now ~6%
faster than the all-manual-loop baseline. DP is flat overall.
Correctness:
- 22/22 unit tests pass.
- 685/685 GPU run_tests.py subset pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two manual #pragma unroll loops over DPT replaced with Eigen array ops on the SCAMPShflState fixed-size arrays: state.local_col = state.global_col - tile_start_col; state.distc.setConstant(init_dist<DISTANCE_TYPE, PROFILE_TYPE>()); state.idxc.setZero(); Code-shape change only. Perf-checked on RTX 3080 with the reference SP 1M / m=1024 / max_tile_size=1M / 1NN_INDEX workload: baseline: ~1.985s (min of 5) Eigen-simplified: ~1.867s (min of 5) Within run-to-run noise -- treat as perf-neutral, not a win. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three additions to the SCAMP --autotune workflow that make iterative
tuning much faster:
- SCAMP_AUTOTUNE_PRECISION_FILTER={SINGLE,DOUBLE}: restricts the sweep
to one precision. Filtered-out (profile, precision) targets are
reported as SKIPPED in verbose mode and do NOT overwrite their cache
entries on Save(), so the existing cache for the other precision is
preserved untouched.
- SCAMP_AUTOTUNE_VARIANT_FILTER={shfl,sliding-window,all}: restricts
the variant loop to one family. shfl variants are identified by
unrolled_rows == 0 (the sentinel already used in SCAMP_VARIANT_TUPLES
to route to kernel_variant_shfl.cu.in).
- SCAMP_AUTOTUNE_WARMUP_RUNS=<n> (default 0): per-trial warmup count.
Was hardcoded at 1 warmup + 1 timed; in practice the first launch of
a given (variant, blocksz) instantiation on Ampere is only a few %
off steady-state because module-load / JIT cost is amortized at the
process level, not per-trial. Skipping warmups by default ~halves
sweep time. Set to 1 or more on noisier setups.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…onsolidation
Three closely-coupled changes to the GPU kernel infrastructure that
together cut the consolidated shfl + sliding-window variant count in
half while improving SP 1NN_INDEX on RTX 3080 by 14% (1.94s -> 1.71s
at 1M / m=1024).
1. ADAPTIVE __launch_bounds__ VIA safe_bps()
Replaces __launch_bounds__(BLOCKSZ, variant_bps) with
__launch_bounds__(BLOCKSZ, safe_bps(target_threads, BLOCKSZ,
kHwThreadsPerSm)) in both do_tile and do_tile_shfl. The target
threads/SM stays constant across the autotuner's blocksz sweep, so
the per-thread register budget no longer collapses when blocksz
grows past the variant's default. This removed a sharp register-
spilling cliff observed at bsz=256 (DP 1NN_INDEX went 52s -> 12s,
SP 14.9s -> 1.88s) where the old fixed-bps launch_bounds was telling
ptxas to fit 4 * 256 = 1024 threads/SM at ~64 regs/thread budget.
safe_bps + kHwThreadsPerSm live in kernel_gpu_utils.h. The hw cap
resolves per __CUDA_ARCH__ (1536 for sm_86/sm_89, 2048 elsewhere)
so launch_bounds never asks ptxas to overcommit the per-SM thread
ceiling, which used to be silently clamped.
2. VARIANT TUPLE: DROP default_blocksz_dp (DERIVE AS SP / 2)
Variant tuples in SCAMP_VARIANT_TUPLES now declare only
default_blocksz_sp. The DP value is implicit:
default_blocksz_sp / 2. DP uses 2x the registers per thread on
Ampere/Ada (FP64 takes two FP32-sized register slots), so halving
the target threads/SM keeps the per-thread register budget stable
across precisions. The /2 is hardcoded in the launcher templates
(LaunchDoTileShflWithGeometry / LaunchDoTileWithGeometry) and in
DefaultBlockszForPrecision so the ratio is enforced by construction
instead of being a per-variant convention that drifts.
3. cuda::barrier ARRIVE/WAIT SPLIT IN do_row_shfl
Replaces the per-row __syncthreads() in the shfl inner loop with a
cuda::barrier<thread_scope_block> arrive/wait pair (sm_80+ mbarrier
PTX). The lane-31 cov publish + bar.arrive() are hoisted to right
after the cov update, so the arrive-to-wait window covers the dist
masking, merge_to_col, merge_to_row, intra-warp shfl cov rotation,
register shift, and (every DPT rows) update_info_shfl. Warps that
finish that second-half work faster don't block warps still
draining it -- the wait blocks only on the slowest arrival.
ncu confirms barrier stalls drop from 1.21 cyc/issued instr to
0.03; eligible warps/sched climb from 0.90 to 1.43; issue rate
+24%. Output is bit-identical to the __syncthreads version.
4. VARIANT TABLE CONSOLIDATION
The full SP+DP autotune across the previous 9-variant table
(3 SW + 6 shfl) showed only 4 variants ever win a target on sm_86,
so the table is trimmed to those 4:
SW dpt=4 bps=1 big-tile (target=512 thr/SM): wins DP 1NN + DP AAN
SW dpt=4 bps=2 big-tile (target=1024 thr/SM): wins SP 1NN + SP/DP
MATRIX_SUMMARY + SP AAN
shfl dpt=8 bps=5: wins SP 1NN_INDEX + DP/SP SUM_THRESH; best
overall geomean (1.016x of per-target best,
worst-case 1.03x).
shfl dpt=4 bps=8 tile=128: wins DP 1NN_INDEX. Smaller per-thread
state lets DP fit higher occupancy
where dpt=8 spills.
Dropped variants (each never won a target across all 10 (profile,
precision) targets in the full autotune):
- SW dpt=2 baseline (geomean rank 22+)
- SW dpt=4 + 4x4 hand-unroll tile=200 (geomean rank 4-13 but
always lost to one of the two big-tile variants)
- SW big-tile bps=4 (forced spill cliff; geomean rank 17+)
- shfl dpt=4 bps=8 tile=64 (the original short-tile shfl;
never wins anywhere)
- shfl dpt=8 bps=4 (loses by ~1% to bps=5)
- shfl dpt=8 bps=6 (just past spill cliff on sm_86)
- shfl dpt=4 bps=10 tile=128 (only wins broken-on-GPU
MATRIX_SUMMARY DP)
Compile time drops accordingly: 9 -> 4 variants means
5 fewer per-profile .cu TUs to nvcc-instantiate, with the
sliding-window TUs each compiling much faster than the shfl ones
(the shfl variant has more template instantiations per TU).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
GetDefaultKernelConfig previously took only precision and returned the
first shfl variant for every (profile, precision) combo. Now takes
profile_type too, and picks the right family for that profile based on
which family wins in the cross-(profile, precision) autotune:
- 1NN_INDEX, SUM_THRESH -> shfl family. The shfl variant warp-reduces
atomics before the final atomicMax / atomicAdd, which dominates
these profile types.
- 1NN, MATRIX_SUMMARY, APPROX_ALL_NEIGHBORS -> sliding-window family.
These profiles have light per-row atomics; the SW variant's smem
column buffer + heavy inner-loop unroll dominates.
The implementation picks the FIRST variant of the preferred family
(matching by unrolled_rows == 0 / != 0). By convention the first entry
of each family in SCAMP_VARIANT_TUPLES is that family's intended
default -- variant authors should keep the best generalist at the
front of the list. Robust to variant reordering and to the table being
re-tuned for a new device.
DefaultBlockszForPrecision still picks the precision-specific blocksz
within the chosen variant (SP gets default_blocksz_sp, DP gets
default_blocksz_sp / 2). This only matters when the user-cache lookup
misses the device's CacheKey + (profile, precision) -- once the
autotune cache has an entry, the cache wins.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds a Per-variant CUDA Tests step to build-and-test-cuda that runs
the existing run_tests.py harness once per compiled kernel variant
with SCAMP_FORCE_VARIANT pinning the variant for every (profile,
precision) launch. Coverage: each variant gets exercised against all
profile types (1NN_INDEX, 1NN, SUM_THRESH, MATRIX_SUMMARY,
APPROX_ALL_NEIGHBORS) and both precisions, just at a smaller
sweep than the extended-tests step (1 window, 2 tile sizes, 2 input
sizes), so the extra CI time is ~4 min total for the current
4-variant table.
Two new mechanisms support the new step:
- SCAMP_FORCE_VARIANT=<index> env var: read by GetKernelConfigForDevice
in autotune.cpp. When set to a valid variant index, returns the
variant's cold-start config (via DefaultBlockszForPrecision) and
bypasses the autotune cache + per-(profile, precision) default. The
precision still picks the cold-start blocksz so the kernel runs at
its variant-author-chosen blocksz, not the autotuned one -- the full
{64,128,256,512} blocksz axis sweep belongs in `SCAMP --autotune`,
not the correctness CI.
- --list_variants CLI flag: prints one line per variant in a stable
machine-parseable format (v<i> bps=N dpt=N ur=N our=N kti=N
family={shfl,sliding-window}). CI shell scripts iterate via
$(SCAMP --list_variants | wc -l) so adding or dropping variants
doesn't require touching the workflow YAML.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
cuda/barrier requires sm_70+; libcudacxx hard-errors on older arches with #error "CUDA synchronization primitives are only supported for sm_70 and up." The shfl kernel's arrive/wait split is gated behind SCAMP_SHFL_HAS_CUDA_BARRIER (defined only when the host pass or a sm_70+ device pass is active). ShflRowBarrier is a typedef wrapper used in do_row_shfl's signature and the per-block __shared__ declaration: - sm_70+: cuda::barrier<thread_scope_block> - sm_60: empty stub struct On sm_60 the lane-31 publish still happens at the same early spot (right after the cov update); only the per-row sync primitive differs (cuda::barrier wait vs __syncthreads). The publish is just an smem store and is ordered with the next row's read by either sync primitive, so semantics are identical -- sm_60 just loses the arrive/wait latency-hide. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The cache lookup tests hardcoded the old v6 geometry (bps=8, dpt=4, ur=0, our=8, kti=8 — the dpt=4 tile=64 shfl variant) as the SupportedConfig() "hit" example, and the Test_AutotuneFullPipeline_FakeBench rig referenced variant indices 4+ (out-of-bounds in the 4-variant consolidated table). Both were silently failing after the variant table dropped from 9 variants to 4 -- SupportedConfig stopped matching anything in kVariants so IsSupportedKernelConfig rejected cache hits, and the FakeBench rig read past kVariants.size() (returning garbage geometry values like -51257 from uninitialized stack). Updates: - SupportedConfig() now uses kti=16 (the dpt=4 tile=128 shfl variant that survived consolidation). - All cache-fixture strings bumped from |8|4|0|8|8 to |8|4|0|8|16 via sed. - RiggedWinner table reworked to cover variant indices 0..3 (the current 4-variant range), with at least one rigging per index. 22/22 tests now pass on sm_86. Same fix applies to the CI failures on the build-cuda-versions runners. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Fixes from a walk-through of the PR diff vs master:
- kHwThreadsPerSm: added missing arches. Turing (sm_75) was using the
2048 fallback but Turing's hw cap is actually 1024 thr/SM, so
safe_bps was overcommitting by 2x on Turing -- ptxas would silently
clamp, defeating the safe_bps cap entirely. sm_87 (Orin) and sm_120
(Blackwell GeForce) added at 1536. Also refactored from a constexpr
#if-ladder to a constexpr function (hw_threads_per_sm()) used as the
default arg for safe_bps -- simpler to read and extend; same per-arch
__CUDA_ARCH__-driven values.
- LAUNCH_PRECISION_SHFL: factored the 4-way blocksz dispatch into a
LAUNCH_PRECISION_SHFL_AT_BLOCKSZ helper macro, mirroring the
sliding-window LAUNCH_PRECISION_AT_BLOCKSZ pattern in kernels_impl.h.
~50 lines of macro removed; identical generated code.
- Stale shipped autotune cache entries dropped from
data/autotune_cache.txt. Every RTX 3080 and P100 entry referenced
variant geometries dropped in the consolidation
(old v6 dpt=4 tile=64, old v8 dpt=8 bps=4, old v0/v1 SW). They were
silently rejected by IsSupportedKernelConfig -> users got cold-start
defaults anyway, but the file lied about what was tuned. Header
notes kept; entries cleared until we re-autotune on hw.
- Comment cleanup:
* kernels_compute_shfl.h: "one __syncthreads per row" -> describes
both the sm_70+ cuda::barrier path and the sm_60 __syncthreads
fallback. Stale "DP variant 6 BLOCKSZ=256 DPT=2" smem-footprint
example replaced with a real current-variant example.
* kernels_impl_shfl.h: docstring template-param list said
blocks_per_sm but the param was renamed to target_threads_per_sm
when safe_bps landed. FUTURE OPT comment dropped its "v4 (DPT=8
SP is already register-pressured)" reference -- v4 no longer
exists; rewritten to "higher-DPT shfl variants" generically.
* kernels_impl.h: dropped "variant 5 at OUR=16" and "Same as v6"
references to dead variants. Added a load-bearing note about
sm_60's 48KB dynamic-smem cap and the launcher's
cudaFuncSetAttribute opt-in (which sm_60 ignores).
* autotune_bench.cpp: trimmed "real bug: v5 (OUR=16)" past-incident
reference; deduped two stale warmup-related comment blocks.
* autotune.cpp: "Per-thread override" comment was misleading -- the
override IS process-wide (the function-level comment above the
setter spells out why), so the lookup-side comment now matches.
* main.cpp: removed redundant nested #ifdef _HAS_CUDA_.
* ci yml per-variant step: "one tile size" -> "two tile sizes"
(matched actual command which sweeps {1024, 4096}).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- SCAMP_FORCE_VARIANT lookup now caches the parsed value at first call
via a static-local initializer. The lookup sits on the per-tile-launch
hot path; the prior every-call getenv() added a syscall per tile.
Out-of-range and malformed values cache as -1 ("no override") so the
fast path stays a single int compare.
- pyscamp module init no longer force-sets SCAMP_AUTOTUNE_QUIET=1.
The autotune cache-miss warning now behaves the same in pyscamp as
in the CLI: prints once per process if no cache entry is found, can
be silenced with SCAMP_AUTOTUNE_QUIET=1. The prior asymmetric default
hid the "run --autotune for max perf" hint from notebook users for
whom it was most relevant.
- safe_bps() comment now spells out the default-arg evaluation semantics
(hw_threads_per_sm() resolves under the current __CUDA_ARCH__ guard
at the call site, so each per-arch device compilation pass bakes in
the right per-SM cap).
- docs/source/autotune.rst: dropped the now-stale "pyscamp silences by
default" claim; clarified the cache-miss warning is once-per-process.
Added an "Other autotune environment variables" section covering
SCAMP_AUTOTUNE_{PRECISION,VARIANT}_FILTER, SCAMP_AUTOTUNE_WARMUP_RUNS,
and SCAMP_FORCE_VARIANT, including their first-call-cached semantics
for the ones that latch.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Removes the build-time embedded autotune cache and its lookup chain
branch. The maintenance burden of keeping data/autotune_cache.txt
fresh across releases (every variant-table edit invalidates entries;
every new device needs a hand-merged PR; users hit silent stale-entry
rejection) outweighed the small out-of-box perf uplift it bought.
Removed:
- data/autotune_cache.txt (the source file, already empty post-review).
- src/core/gpu_kernel/builtin_autotune_cache.{h,cpp.in} (the embedded
string + its declaration).
- The file(READ) + configure_file + extra source file in
src/core/gpu_kernel/CMakeLists.txt that wired the embed into the
binary.
- GetOrLoadBuiltinCache() + CacheState::builtin_cache from autotune.cpp.
- The `builtin_cache` parameter from LookupKernelConfigForDeviceKey
(both signature + the only non-test caller, GetKernelConfigForDevice).
- 3 builtin-cache-specific unit tests (Test_BuiltinCacheHit,
Test_UserCacheBeatsBuiltin, Test_StaleUserCacheFallsThroughToBuiltin).
The last was renamed and repurposed to verify a stale user cache
falls through to the FALLBACK config + emits the cache-miss warning
(the upgrade-compat behavior we actually care about).
- Built-in / "ships with the binary" mentions in docs/source/autotune.rst,
autotune.h, autotune_cache.h, autotune.cpp banner output.
New lookup chain (in priority order):
1. Process-wide KernelConfig override (autotune bench + SCAMP_FORCE_VARIANT).
2. User cache at $SCAMP_AUTOTUNE_CACHE / $XDG_CACHE_HOME / $HOME.
3. GetDefaultKernelConfig(profile, precision) -- the per-profile-type
compile-time default, plus a one-shot per-process miss warning.
Users who want optimal perf still run `SCAMP --autotune` once per
device (or `pyscamp.autotune()`); the result lands in the user cache
and persists across runs. Cold-start defaults are reasonable
per-profile (shfl for atomic-heavy 1NN_INDEX/SUM_THRESH, sliding-
window for atomic-light 1NN/MS/AAN), so no-tune behavior is correct
and within a few percent of optimal on the variants and arches we
tested.
22/22 -> 20/20 unit tests (-2 from removed builtin-only tests, the
renamed one still runs).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- docs/source/environment.rst: SCAMP_AUTOTUNE_INPUT_LENGTH default documented as 131072 but the constant in autotune_bench.cpp is 262144 (256K). Fixed. Also dropped the "Set by pyscamp at module init for notebook users" line (we removed that auto-set in an earlier review pass). Added entries for the four env vars that weren't documented at all: SCAMP_AUTOTUNE_PRECISION_FILTER, SCAMP_AUTOTUNE_VARIANT_FILTER, SCAMP_AUTOTUNE_WARMUP_RUNS, SCAMP_FORCE_VARIANT. - docs/source/autotune.rst: the cache-miss warning example showed the pre-consolidation default (bps=8 dpt=4 ... kti=8); the current default for 1NN_INDEX SP is the first shfl variant, which is bps=5 dpt=8 ... kti=32. Updated the example to match. - src/core/gpu_kernel/kernel_gpu_utils.h: hw_threads_per_sm() added sm_110 (Jetson Thor, Blackwell mobile) to the 1536-threads/SM bucket. Previously fell through to the 2048 default which would have over-committed safe_bps's per-arch hw cap by 33% on that device. SCAMPMacros.cmake already compiles for sm_110 on CUDA 13.0+, so the omission was a real gap. - src/core/gpu_kernel/kernels_impl_shfl.h: the "tile_height <= 32 * DPT for the first draft" static_assert message was a state-of-development phrasing. Replaced with a fact-only explanation of WHY the constraint must hold (update_info_shfl fires at most once per warp per tile under that bound). - src/python/SCAMP_python.cpp: a comment in run_autotune claimed "main.cpp dispatches here too" -- backwards. The CLI's --autotune path in main.cpp calls RunAutotuneWithBenchmark directly; this Python entry point is a separate dispatch that shares the same bench impl. Fixed the comment to match. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Background
The cov-shuffle / warp-shuffle GPU kernel design has been on this repo's
wishlist since 2018 (
experimental-warp-shuffle,experimental-shared-memory) and was retried in late 2022 / early 2023with an Eigen-port foundation (
use-eigen-in-gpu-kernels,eigen-gpu-warp-shuffle,eigen-gpu-cov-shuffle,eigen-tex-mem).None of those branches reached a shippable state — the codegen
tradeoffs (compile time, kernel binary size, smem footprint per
variant) made it hard to keep the new kernel alongside the existing
sliding-window one without regressing builds for users who didn't need
it. This PR is the iteration that finally lands the work, on top of a
per-(profile, variant) build split + an autotune cache that picks per
device, so the new kernel can coexist with the legacy one in the same
binary.
Summary
Three threads of work that landed together because the autotune machinery
depends on the kernel-variant split, and the variant split is the thing
that lets the cov-shuffle variant live alongside the existing
sliding-window kernels without compile-time blow-up.
1. Per-GPU autotune
--autotuneCLI flag (andpyscamp.autotune()binding) thatsweeps
(variant geometry, blocksz)for every supported(profile, precision)target and writes the per-target winner to aplain-text cache file.
process-wide override → user cache → built-in cache (embedded from
data/autotune_cache.txt) → compile-time default. Each tier isindependent; lookup falls through gracefully on miss.
→
$HOME/.cache/scamp/autotune.txton POSIX or%LOCALAPPDATA%\scamp\autotune.txton Windows.no autotune entry for device '<name>'…")on miss, silenceable via
SCAMP_AUTOTUNE_QUIET. pyscamp sets this atmodule init so notebook users don't get uninvited stderr.
(
SCAMP_AUTOTUNE_INPUT_LENGTH=262144); measured against 64K/128K onan RTX 3080, 256K is the smallest size where launch overhead is
reliably dominated by steady-state kernel cost — see the comparison
table in
docs/source/autotune.rst.2. Kernel-variant infrastructure
SCAMP_VARIANT_TUPLESinsrc/core/gpu_kernel/CMakeLists.txtis thesingle source of truth for which kernel geometries the binary
supports. The foreach in that file regenerates the dispatch table,
the
kVariants[]constexpr array, and the per-(profile, variant).cufiles viaconfigure_file.multi-device sweep retired the ones that never won.
.cufiles,one
do_tile<...>instantiation per file, so make/ninja candispatch them in parallel. Full
gpu_kernelsrebuild on a 3080 devbox dropped from ~260s to ~74s; the dominant
v3+v4(DPT=8) TUsalone went 211s → 34s after dropping
#pragma unrollon the shflouter row loop (a
__syncthreads()-terminated loop with up to 256iterations that nvcc was inlining 256× for no ILP gain).
3. cov-shuffle ("shfl") variant
lane-to-lane via warp shuffles rather than being read from shared
memory. Files:
kernels_compute_shfl.h,kernels_impl_shfl.h,with an ASCII walkthrough generator in
scripts/cov_shuffle_diagram.py.(DPT=4) is the compile-time default fallback; the autotuner picks
per-device.
4. Eigen port of the sliding-window kernel
kernels_compute.h+kernels_smem.huseEigen::Array<T, N, 1>andEigen::Mapinstead of hand-unrolled C arrays. Required bumping the Eigen
submodule 3.4.0 → 5.0.1 for better nvcc support; device-callable
under
--expt-relaxed-constexpr(already set at the top level).5. Platform / CUDA-arch matrix
cmake/SCAMPMacros.cmakeset_cuda_architectures()forCUDA 11.0 through 13.0+, including Blackwell (sm_100, sm_103,
sm_110, sm_120, sm_121) and the CCCL 2.8.x arch-token limit
workarounds.
-DCMAKE_CUDA_ARCHITECTURES=<list>/USER_SPECIFIED_CUDA_ARCHITECTURESfor single-arch dev builds (hugecompile-time win when you know you're targeting one GPU).
std::filesystemfor atomic-rename + recursivemkdir,_putenv_sfor the env-var-set test helpers, explicit lambdacaptures for MSVC C3493,
LOCALAPPDATA/USERPROFILEenv-varresolution.
6. Docs
docs/source/autotune.rstcovering lookup chain, cachelocations (POSIX + Windows), the "no autotune entry" warning,
clearing/resetting the cache, the upgrade compatibility contract,
and per-N workload-size guidance.
docs/source/environment.rstlisting every env var SCAMP / pyscamp / the distributed gRPC client
reads.
docs/pyscamp.py) now surfacesautotune()and
gpu_supported().Test plan
test/cpp/test_autotune_cache.cppcovering the cache lookup chain, cache-miss warning dedup, the
SCAMP_AUTOTUNE_QUIETenv-var contract, on-disk round-trip,atomic replace, recursive parent-dir creation, the full
bench-driven autotune pipeline driven by a synthetic
BenchmarkFn(no GPU needed), and four cache-upgrade scenarios(future version header, missing header, partially stale cache,
user wipe with built-in fallback).
(g++, clang++) and Windows (MSVC
cl), with no GPU required.Verified locally against VS2022 MSVC 14.30 + CUDA 11.6.
SCAMP --autotuneend-to-end on RTX 3080 (Linux + Windows)produces a cache that subsequent
SCAMPruns consume withoutthe cache-miss warning.
cmake --build . --config Release --parallel 2(the exactinvocation in
build-cuda-cli) builds + runstest_autotune_cachevia
ctest --output-on-failure.🤖 Generated with Claude Code