kitrakrev/kernel-dev

kernel-dev

Hands-on accelerator kernels and training primitives across multiple stacks.

Layout

MLX/    Apple Silicon exercises + perf benches (local M4)
CUDA/   NVIDIA exercises + perf benches (Colab T4)
TPU/    JAX/TPU exercises + perf benches (Colab v5e)
projects/
  A_softmax/   softmax 4 ways (MLX, JAX, CUDA, Triton) + perf
  B_matmul/    matmul ladder w/ roofline analysis + perf

Device matrix

Stack  Hardware             Compiler / runtime      Verified on
MLX    Apple M4, 16 GB      MLX 0.31.1, Metal       macOS 26.2 arm64
CUDA   Tesla T4 (sm_75)     nvcc, CUDA 13.0         Colab Linux 6.6 x86_64
TPU    v5 lite (v5e), 1 ch  JAX 0.7.2, PJRT C API   Colab Linux 6.6 x86_64

CUDA exercises

#  File                 Concept
1  01_vector_add.cu     kernel, grid/block, cudaMemcpy
2  02_saxpy.cu          streaming, bandwidth measurement
3  03_reduction.cu      shared-mem tree reduce
4  04_transpose.cu      coalescing, bank conflicts
5  05_matmul_tiled.cu   tiled matmul vs naive

Build : nvcc -O3 -arch=sm_75 <file> -o <exe>
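
The shared-memory tree reduce listed for 03_reduction.cu follows a pattern that can be modelled on the host. The sketch below is plain Python, not the repo's kernel: `BLOCK_SIZE`, `block_tree_reduce`, and `reduce_all` are hypothetical names, and the list `shared` stands in for one block's shared-memory buffer.

```python
BLOCK_SIZE = 8  # hypothetical block size; must be a power of two


def block_tree_reduce(values):
    """Sum up to BLOCK_SIZE values the way one thread block would."""
    # Stand-in for the shared-memory buffer, zero-padded to BLOCK_SIZE.
    shared = list(values) + [0.0] * (BLOCK_SIZE - len(values))
    stride = BLOCK_SIZE // 2
    while stride > 0:
        # On a GPU, every thread with tid < stride runs this line in
        # parallel, and a barrier separates one stride from the next.
        for tid in range(stride):
            shared[tid] += shared[tid + stride]
        stride //= 2
    return shared[0]


def reduce_all(data):
    """Reduce each block-sized chunk, then sum the per-block partials."""
    partials = [
        block_tree_reduce(data[i:i + BLOCK_SIZE])
        for i in range(0, len(data), BLOCK_SIZE)
    ]
    return sum(partials)
```

Halving the stride each pass means a block of B elements finishes in log2(B) passes, which is the point of the tree shape.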

TPU/JAX exercises

#  File                 Concept
1  01_jnp_jit.py        jnp ops, jit, inspect HLO
2  02_grad_linreg.py    value_and_grad, plain SGD
3  03_flax_mnist.py     flax.nnx MLP + optax SGD
4  04_vmap.py           per-example fn, vmap, nested vmap
5  05_shard_map.py      Mesh + PartitionSpec + shard_map

Run : colab exec -s tpu -f <file>

MLX exercises

#  File                      Concept
1  01_array_ops.py           mx.array, lazy ops, mx.eval
2  02_linear_regression.py   value_and_grad, manual SGD
3  03_mlp_mnist.py           nn.Module MLP + nn.value_and_grad
4  04_attention.py           scaled-dot-product + multi-head attn
5  05_llama_inference.py     tiny GPT decoder + greedy decode

Run : python3 <file>
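
The "value_and_grad, manual SGD" idea behind 02_linear_regression.py can be sketched without any framework. This is an illustration only: the gradients are derived by hand instead of coming from `value_and_grad`, and `loss_and_grad`, `fit`, the learning rate, and the step count are all hypothetical choices, not taken from the repo.

```python
def loss_and_grad(w, b, xs, ys):
    """Return (mean squared error, (dL/dw, dL/db)) for the model y = w*x + b.

    The return shape mirrors what a value_and_grad-style call hands back:
    the loss value together with the gradients.
    """
    n = len(xs)
    residuals = [w * x + b - y for x, y in zip(xs, ys)]
    loss = sum(r * r for r in residuals) / n
    grad_w = 2.0 * sum(r * x for r, x in zip(residuals, xs)) / n
    grad_b = 2.0 * sum(residuals) / n
    return loss, (grad_w, grad_b)


def fit(xs, ys, lr=0.1, steps=500):
    """Plain SGD on the full batch: step against the gradient each iteration."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        _, (grad_w, grad_b) = loss_and_grad(w, b, xs, ys)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

Calling `fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])` should recover a slope near 2 and an intercept near 1, since the data lie exactly on y = 2x + 1.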

Projects

Project A — softmax 4 ways

Row-wise softmax over a (4096, 4096) input, implemented in MLX, JAX, raw CUDA, and Triton. See projects/A_softmax/bench.md for cross-stack timings.
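
A small reference implementation is handy for checking each backend's output. The sketch below is plain Python, not code from the repo; `softmax_row` and `softmax_rows` are hypothetical names. It subtracts the row maximum before exponentiating, a common stability trick; whether the repo's kernels do the same is an assumption.

```python
import math


def softmax_row(row):
    """Numerically stable softmax of one row."""
    m = max(row)                            # shift so the largest exponent is 0
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_rows(matrix):
    """Apply softmax independently to every row of a list-of-lists matrix."""
    return [softmax_row(row) for row in matrix]
```

Each row is independent, which is why the operation maps naturally onto one block (or one program instance) per row on an accelerator.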

Project B — matmul ladder + roofline

Climbs from naive CUDA -> tiled CUDA -> Tensor Core cuBLAS -> MLX -> JAX MXU, tracking TFLOPS vs peak at each rung. See projects/B_matmul/roofline.md.
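
The roofline bound used to judge each rung can be written down in a few lines. This is a sketch under stated assumptions, not the analysis in roofline.md: the function names are hypothetical, the byte count assumes each of the three N x N matrices crosses the memory bus exactly once, and peak and bandwidth figures must be supplied by the caller.

```python
def matmul_flops(n):
    """Floating-point operations in an n x n by n x n matmul (multiply + add)."""
    return 2.0 * n ** 3


def matmul_min_bytes(n, bytes_per_elem=4):
    """Ideal traffic: read A and B once, write C once (assumes perfect reuse)."""
    return 3.0 * n * n * bytes_per_elem


def arithmetic_intensity(n, bytes_per_elem=4):
    """FLOP per byte under the ideal-traffic assumption."""
    return matmul_flops(n) / matmul_min_bytes(n, bytes_per_elem)


def roofline_tflops(intensity, peak_tflops, bandwidth_gb_s):
    """Attainable TFLOP/s: the lower of the compute roof and the memory roof."""
    memory_roof_tflops = intensity * bandwidth_gb_s / 1000.0  # GFLOP/s -> TFLOP/s
    return min(peak_tflops, memory_roof_tflops)


def fraction_of_peak(measured_tflops, peak_tflops):
    """Share of the compute roof that a measured result reaches."""
    return measured_tflops / peak_tflops
```

With FP32 elements the ideal intensity works out to n / 6 FLOP per byte, so large square matmuls sit well to the right of the ridge point and should be compute-bound if the kernel reuses data well.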

Perf benches

Each stack has a perf/ directory with reproducible benchmarks:

  • MLX/perf/ — matmul size sweep, memory BW
  • CUDA/perf/ — matmul size sweep (cuBLAS), streaming BW
  • TPU/perf/ — matmul size sweep (MXU), HBM BW
  • projects/*/perf/ — end-to-end project runners (run_all.sh) + cached results.txt
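
A bandwidth or throughput bench of this kind usually reduces to a warm-up, repeated timing, and a unit conversion. The helper below is a generic sketch, not the repo's harness; `bench`, `gb_per_s`, and the default warm-up and repeat counts are hypothetical. On an accelerator the timed callable has to block until the device finishes (for example via `mx.eval`, `block_until_ready`, or `cudaDeviceSynchronize`), otherwise only the launch is measured.

```python
import statistics
import time


def bench(fn, warmup=3, repeats=10):
    """Return the median wall-clock seconds of fn() over `repeats` runs."""
    for _ in range(warmup):
        fn()  # warm caches and trigger any one-time compilation
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()  # fn must block until the work has actually completed
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def gb_per_s(bytes_moved, seconds):
    """Convert bytes moved and elapsed seconds to GB/s (1 GB = 1e9 bytes)."""
    return bytes_moved / seconds / 1e9
```

The median is used here because it is less sensitive than the mean to an occasional slow run.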

Outputs

Each NN_*.py / NN_*.cu has a sibling NN_*.out containing real captured stdout from the listed hardware. No mocked results.
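
The sibling-output convention is easy to check mechanically. The script below is a sketch, not a tool shipped in the repo: `missing_outputs` and `SOURCE_PATTERN` are hypothetical names, and it assumes `NN` means exactly two digits.

```python
import re
from pathlib import Path

# Assumed reading of "NN_*": two digits, an underscore, then .py or .cu.
SOURCE_PATTERN = re.compile(r"^\d{2}_.+\.(py|cu)$")


def missing_outputs(root):
    """Return exercise sources under `root` that lack a sibling .out file."""
    missing = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and SOURCE_PATTERN.match(path.name):
            # 01_vector_add.cu -> 01_vector_add.out in the same directory
            if not path.with_suffix(".out").exists():
                missing.append(path)
    return missing
```

Running `missing_outputs(".")` from the repository root would list any exercise whose captured stdout has not been committed.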

Identity

All commits in this repo are authored as kitrakrev via per-command git -c overrides — no global config change on the development host.

About

Trying to understand the hardware running our models
