- Added AMD Instinct MI300X, MI325X, MI350X, MI355X to GPU database with correct peak FP16 TFLOPS, memory bandwidth, and L2 cache specs
- Added `gcnArchName`-based GPU detection for ROCm (device name is often empty on ROCm; `gcnArchName` like `gfx942` is always available)
- Guarded `clock_rate` access behind a `hasattr` + `> 0` check (ROCm devices report `clock_rate=0`)
- Applied the same fixes to the `profile.py` fallback detector
- Tested on AMD Instinct MI300X (gfx942, ROCm 6.3) and MI350X (gfx950, ROCm 7.2)
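The detection order above can be sketched roughly as follows; `describe_gpu` and the returned dict shape are hypothetical stand-ins for the project's actual detector:

```python
def describe_gpu(props):
    """Summarize a device-properties object (shaped like the result of
    torch.cuda.get_device_properties). Hypothetical helper illustrating
    the ROCm-friendly detection order; not the project's actual API."""
    # On ROCm the marketing name is often empty, but gcnArchName
    # (e.g. "gfx942" on MI300X) is always populated.
    name = (getattr(props, "name", "") or "").strip()
    arch = getattr(props, "gcnArchName", "") or ""
    if not name and arch:
        name = arch
    # ROCm devices report clock_rate=0, so guard with hasattr + > 0
    # before trusting the value.
    clock_khz = props.clock_rate if hasattr(props, "clock_rate") and props.clock_rate > 0 else None
    return {"name": name, "arch": arch, "clock_mhz": clock_khz / 1000 if clock_khz else None}
```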
- Fixed `verify.py` SyntaxError on Python 3.13+: moved the `global` declaration before variable usage in `main()` -- the file would not even import on Python 3.13/3.14
- Fixed CUDA flash_attention ignoring its `sm_scale` parameter: the argument was accepted but the kernel hardcoded `rsqrtf(D)` -- now `sm_scale` is passed through to the CUDA kernel
- Fixed CUDA cross_entropy returning the wrong dtype: the loss was cast back to the input dtype instead of always returning `float32` (matching `F.cross_entropy` behavior)
- Fixed Triton rotary_embedding broadcasting truncation: the `cos`/`sin` repeat used integer division, which truncated when `n_rows` was not a multiple of `cos.shape[0]` -- now uses ceiling division and slices to the exact size
- Fixed Triton reduce output shape for non-last-dim reductions: after permuting to move the reduce dim last, the output was reshaped using the original dim order instead of the permuted order
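The ceiling-division fix for the rotary_embedding repeat can be illustrated with plain lists (the real kernel operates on tensors; `tile_to_rows` is a hypothetical name):

```python
def tile_to_rows(table, n_rows):
    """Repeat a cos/sin lookup table (a list of rows) to cover exactly
    n_rows rows. Plain-list sketch of the fix described above; the real
    code works on tensors, and this helper name is hypothetical."""
    # Floor division (n_rows // len(table)) under-repeats whenever n_rows
    # is not an exact multiple of the table length; ceiling division
    # over-repeats, and the final slice trims to the exact size.
    reps = -(-n_rows // len(table))  # ceiling division
    return (table * reps)[:n_rows]
```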
- Added `--export-trace` flag to export Chrome trace JSON for HTA/trace-blame analysis
- Added `--memory-snapshot` flag to capture CUDA memory snapshots for mosaic analysis
- Added `--torch-compile-log` flag to save torch.compile logs for tlparse analysis
- Added optional HTA (Holistic Trace Analysis) integration -- when installed, runs temporal and kernel breakdown analysis
- Added exported-artifacts summary with suggested next steps for each tool
- Added `HolisticTraceAnalysis` as an optional `profiling` dependency
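As a rough sketch of how the three new flags could be declared (the actual profile.py wiring may differ; this is illustrative only):

```python
import argparse

def build_profiling_flags():
    """Illustrative argparse wiring for the three new artifact flags;
    the real profile.py may declare or name these differently."""
    p = argparse.ArgumentParser(description="kernel profiler (sketch)")
    p.add_argument("--export-trace", action="store_true",
                   help="export a Chrome trace JSON for HTA / trace-blame")
    p.add_argument("--memory-snapshot", action="store_true",
                   help="capture a CUDA memory snapshot for mosaic analysis")
    p.add_argument("--torch-compile-log", action="store_true",
                   help="save torch.compile logs for tlparse analysis")
    return p
```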
- Added `export_hf.py` -- exports optimized kernels to the HuggingFace Kernels format
- Supports CUDA C++ kernels: auto-extracts CUDA source, parses function signatures, generates `build.toml` + `torch_binding.cpp` + `__init__.py`
- Supports Triton kernels: packages them as a Python module with a `pyproject.toml`
- Generates a ready-to-upload project structure compatible with the `kernels upload` CLI
- Added `kernels` and `huggingface-hub` as optional `hf-kernels` dependencies
- Added 9 CUDA C++ starter kernels with advanced GPU features:
  - matmul -- Tensor core GEMM via the `wmma` API, 128x128 tiles, double-buffered shared memory
  - softmax -- Warp shuffle reductions, `half2` vectorized loads, grid-stride loop
  - layernorm -- Welford's single-pass algorithm, `float4` vectorized loads, warp shuffle stats
  - rmsnorm -- Warp shuffle cascade, `rsqrtf` fast inverse sqrt, `half2` vectorization
  - flash_attention -- Tiled online softmax, double-buffered shared memory, causal mask support
  - fused_mlp -- Fused SwiGLU (gate + up + SiLU + mul), shared memory tiling
  - cross_entropy -- Fused online log-sum-exp + NLL in a single pass, warp reductions
  - rotary_embedding -- `__sincosf` intrinsic, `half2` read-modify-write
  - reduce -- Hierarchical warp shuffle + shared memory + grid-level atomic
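Two of these kernels (flash_attention's tiled online softmax and cross_entropy's fused log-sum-exp) lean on the same streaming trick: keep a running maximum and rescale the running sum whenever it changes, so the reduction finishes in one pass without overflow. A plain-Python sketch of the idea, not the CUDA code:

```python
import math

def online_logsumexp(xs):
    """Single-pass (streaming) log-sum-exp: track a running max m and a
    running sum s of exp(x - m), rescaling s whenever a new max appears.
    Plain-Python illustration of the online-softmax technique, not the
    project's CUDA implementation."""
    m, s = float("-inf"), 0.0
    for x in xs:
        if x > m:
            # New maximum: rescale the accumulated sum into the new base.
            s = s * math.exp(m - x) + 1.0
            m = x
        else:
            s += math.exp(x - m)
    return m + math.log(s)
```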
- Added `kernels/cuda/_compile.py` -- shared compilation utility:
  - Hash-based caching (recompile only when the source changes)
  - GPU architecture auto-detection via `torch.cuda.get_device_capability()`
  - Forward-declaration extraction for cross-translation-unit linking
  - Thread-safe compilation with file locking
  - Detailed error diagnostics with source line numbers
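Hash-based caching of this kind can be sketched as below; the directory layout and function name are assumptions, not _compile.py's actual interface:

```python
import hashlib
import pathlib
import tempfile

def cached_build_dir(source: str, arch: str, cache_root=None) -> pathlib.Path:
    """Map (kernel source, target arch) to a build directory. Any change
    to either input hashes to a new directory and forces a recompile;
    identical inputs reuse the cached artifacts. Hypothetical sketch of
    the idea behind kernels/cuda/_compile.py, not its actual code."""
    root = pathlib.Path(cache_root or tempfile.gettempdir()) / "kernel-build-cache"
    digest = hashlib.sha256(f"{arch}\n{source}".encode()).hexdigest()[:16]
    build_dir = root / digest
    build_dir.mkdir(parents=True, exist_ok=True)
    return build_dir
```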
- Added `--backend triton|cuda` flag to `extract.py`
- Added a CUDA C++ optimization playbook to `program.md`
- Added `ninja` as an optional dependency for faster compilation
- Added `kernelbench/bridge.py` -- problem loader supporting 3 sources:
  - HuggingFace datasets (`--source hf`)
  - Local KernelBench repo clone (`--source local`)
  - Individual Python files (`--source file`)
  - Automatic problem analysis (50+ operation patterns)
  - Starter `ModelNew` generation with CUDA/Triton templates
- Added `kernelbench/bench_kb.py` -- 4-stage evaluation pipeline:
  - Stage 1: Correctness (5 random input trials, atol/rtol=1e-2)
  - Stage 2: Stability (NaN/Inf detection)
  - Stage 3: Determinism (3 identical runs)
  - Stage 4: Performance (CUDA event timing, trimmed median)
  - Greppable output with `fast_p` at 7 thresholds
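A trimmed median of this sort can be computed as follows (the 20% trim fraction here is an assumption, not necessarily the pipeline's setting):

```python
import statistics

def trimmed_median(times_ms, trim_frac=0.2):
    """Median of the timing samples after dropping the fastest and
    slowest trim_frac of them, which keeps one-off outliers (clock
    ramping, a stray context switch) from skewing the reported number.
    Sketch only; the trim fraction is an assumed default."""
    xs = sorted(times_ms)
    k = int(len(xs) * trim_frac)
    kept = xs[k:len(xs) - k] if len(xs) > 2 * k else xs
    return statistics.median(kept)
```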
- Added `kernelbench/scorer.py` -- batch evaluation and metrics:
  - `fast_p` metric at thresholds: 1.0x, 1.1x, 1.25x, 1.5x, 2.0x, 3.0x, 5.0x
  - Incremental scoring with JSON persistence
  - Leaderboard-style reports with progress bars
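The fast_p metric (from KernelBench) is the fraction of problems whose kernel is both correct and at least p× faster than the baseline. A minimal sketch, with the `(correct, speedup)` pair shape as an assumption about the result records:

```python
def fast_p(results, p):
    """KernelBench-style fast_p: the fraction of problems whose generated
    kernel is both correct and at least p times faster than the baseline
    (speedup = baseline_ms / kernel_ms). `results` is a list of
    (correct: bool, speedup: float) pairs; that shape is an assumption,
    not the scorer's actual record format."""
    if not results:
        return 0.0
    hits = sum(1 for correct, speedup in results if correct and speedup >= p)
    return hits / len(results)
```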
- Added `kernelbench/program_kb.md` -- agent instructions for KernelBench mode:
  - Optimization playbook per difficulty level (L1-L4)
  - CUDA C++ and Triton strategy examples
  - Decision framework and anti-patterns
- Updated README with KernelBench section, dual-backend docs, and Discord link
- Added `datasets>=2.16.0` as an optional `kernelbench` dependency
- Triton kernel optimization pipeline (profile, extract, bench, orchestrate, verify)
- 9 starter Triton kernels (matmul, softmax, layernorm, rmsnorm, flash_attention, fused_mlp, cross_entropy, rotary_embedding, reduce)
- 5-stage correctness harness + roofline analysis
- Amdahl's law orchestration for multi-kernel optimization
- Self-contained model definitions (GPT-2, LLaMA, BERT)
- TSV logging and experiment visualization