Hands-on accelerator kernels and training primitives across multiple stacks.
- MLX/: Apple Silicon exercises + perf benches (local M4)
- CUDA/: NVIDIA exercises + perf benches (Colab T4)
- TPU/: JAX/TPU exercises + perf benches (Colab v5e)
- projects/
  - A_softmax/: softmax 4 ways (MLX, JAX, CUDA, Triton) + perf
  - B_matmul/: matmul ladder w/ roofline analysis + perf
| Stack | Hardware | Compiler / runtime | Verified on |
|---|---|---|---|
| MLX | Apple M4, 16 GB | MLX 0.31.1, Metal | macOS 26.2 arm64 |
| CUDA | Tesla T4 (sm_75) | nvcc, CUDA 13.0 | Colab Linux 6.6 x86_64 |
| TPU | v5 lite (v5e), 1 ch | JAX 0.7.2, PJRT C API | Colab Linux 6.6 x86_64 |
| # | File | Concept |
|---|---|---|
| 1 | 01_vector_add.cu | kernel, grid/block, cudaMemcpy |
| 2 | 02_saxpy.cu | streaming, bandwidth measurement |
| 3 | 03_reduction.cu | shared-mem tree reduce |
| 4 | 04_transpose.cu | coalescing, bank conflicts |
| 5 | 05_matmul_tiled.cu | tiled matmul vs naive |
Build : nvcc -O3 -arch=sm_75 <file> -o <exe>
| # | File | Concept |
|---|---|---|
| 1 | 01_jnp_jit.py | jnp ops, jit, inspect HLO |
| 2 | 02_grad_linreg.py | value_and_grad, plain SGD |
| 3 | 03_flax_mnist.py | flax.nnx MLP + optax SGD |
| 4 | 04_vmap.py | per-example fn, vmap, nested vmap |
| 5 | 05_shard_map.py | Mesh + PartitionSpec + shard_map |
Run : colab exec -s tpu -f <file>
| # | File | Concept |
|---|---|---|
| 1 | 01_array_ops.py | mx.array, lazy ops, mx.eval |
| 2 | 02_linear_regression.py | value_and_grad, manual SGD |
| 3 | 03_mlp_mnist.py | nn.Module MLP + nn.value_and_grad |
| 4 | 04_attention.py | scaled-dot-product + multi-head attn |
| 5 | 05_llama_inference.py | tiny GPT decoder + greedy decode |
Run : python3 <file>
Row-wise softmax on (4096, 4096) implemented in MLX, JAX, raw CUDA, and Triton.
See projects/A_softmax/bench.md for cross-stack timings.
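
For orientation, here is a minimal reference sketch of what "row-wise softmax" computes, written in plain stdlib Python. It is not one of the four benchmarked implementations; the name `softmax_rows` is hypothetical, and the max-subtraction step is a common stabilization trick that the project kernels are assumed (not confirmed) to share.

```python
import math


def softmax_rows(matrix):
    """Numerically stable softmax applied independently to each row."""
    result = []
    for row in matrix:
        row_max = max(row)  # subtract the row max so exp() cannot overflow
        exps = [math.exp(x - row_max) for x in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


# Tiny input instead of the (4096, 4096) benchmark shape.
probs = softmax_rows([[1.0, 2.0, 3.0], [1000.0, 1000.0, 1000.0]])
for row in probs:
    print(row, "sum =", sum(row))
```

A slow reference like this can be handy for spot-checking a few rows of an accelerator kernel's output within a tolerance.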
Climbs from naive CUDA -> tiled CUDA -> Tensor Core cuBLAS -> MLX -> JAX MXU,
tracking TFLOPS vs peak at each rung. See projects/B_matmul/roofline.md.
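
The roofline arithmetic behind "TFLOPS vs peak" can be sketched as follows. This assumes the simplest traffic model (each matrix crosses DRAM exactly once), and the peak and bandwidth constants are placeholders chosen for illustration, not values taken from roofline.md. All function names are hypothetical.

```python
def matmul_arithmetic_intensity(m, n, k, bytes_per_elem=4):
    """FLOPs per byte for C = A @ B, assuming A, B, and C each move through DRAM once."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def attainable_tflops(intensity, peak_tflops, bandwidth_gbs):
    """Roofline bound: the lower of the compute roof and the memory roof."""
    memory_roof_tflops = intensity * bandwidth_gbs / 1000.0  # (FLOP/B * GB/s) / 1000 = TFLOP/s
    return min(peak_tflops, memory_roof_tflops)


def percent_of_peak(measured_tflops, peak_tflops):
    return 100.0 * measured_tflops / peak_tflops


# Placeholder hardware numbers (assumptions; substitute the real device specs).
PEAK_TFLOPS = 8.0
BANDWIDTH_GBS = 300.0

for size in (16, 256, 4096):
    ai = matmul_arithmetic_intensity(size, size, size)
    bound = attainable_tflops(ai, PEAK_TFLOPS, BANDWIDTH_GBS)
    print(f"n={size}: intensity={ai:.2f} FLOP/B, roofline bound={bound:.2f} TFLOPS")
```

Under this model small matmuls sit under the memory roof and large ones under the compute roof, which is the distinction the ladder is tracking at each rung.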
Each stack has a perf/ directory with reproducible benchmarks:
- MLX/perf/: matmul size sweep, memory BW
- CUDA/perf/: matmul size sweep (cuBLAS), streaming BW
- TPU/perf/: matmul size sweep (MXU), HBM BW
- projects/*/perf/: end-to-end project runners (run_all.sh) + cached results.txt
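
As a rough illustration of how a bandwidth number can be derived from a timed run, the sketch below times a host-side buffer copy and converts bytes moved into GB/s. The helpers `time_median` and `achieved_gbs` are hypothetical and are not the harness used in the perf/ directories. On an accelerator the timed callable would also need to wait for the device to finish (for example via `mx.eval` in MLX or `block_until_ready()` in JAX) before the clock stops.

```python
import statistics
import time


def time_median(fn, warmup=3, iters=10):
    """Median wall-clock seconds per call, after a few untimed warmup calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def achieved_gbs(bytes_moved, seconds):
    """Convert a byte count and a duration into GB/s (1 GB = 1e9 bytes)."""
    return bytes_moved / seconds / 1e9


# Stand-in workload: copying a 16 MiB host buffer reads and writes it once each.
buf = bytearray(16 * 1024 * 1024)
seconds = time_median(lambda: bytes(buf))
print(f"copy: {achieved_gbs(2 * len(buf), seconds):.2f} GB/s")
```

Taking the median rather than the mean keeps a single slow iteration from skewing the reported figure.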
Each NN_*.py / NN_*.cu has a sibling NN_*.out containing real captured
stdout from the listed hardware. No mocked results.
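
One way to check the sibling-file convention is sketched below. The helper `missing_outputs` is hypothetical and not part of the repo; it assumes "NN" means a two-digit prefix and runs against a throwaway temporary directory rather than the real tree.

```python
import re
import tempfile
from pathlib import Path

# Assumed naming convention: two digits, underscore, name, then .py or .cu.
SOURCE_PATTERN = re.compile(r"^\d\d_.+\.(py|cu)$")


def missing_outputs(root):
    """Return names of NN_*.py / NN_*.cu files under root that lack a sibling .out."""
    missing = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and SOURCE_PATTERN.match(path.name):
            if not path.with_suffix(".out").exists():
                missing.append(path.name)
    return missing


# Demo on a temporary directory: one source has a captured .out, one does not.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("01_vector_add.cu", "01_vector_add.out", "02_saxpy.cu"):
        (root / name).write_text("")
    demo_missing = missing_outputs(root)

print(demo_missing)
```

Pointing `missing_outputs` at the repo root would list any exercise whose captured output has not been committed yet.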
All commits in this repo are authored as kitrakrev via per-command git -c
overrides — no global config change on the development host.