A Bare-Metal, High-Performance LLM Inference Engine in C++ and CUDA/HIP.
cuLlama is a project to understand exactly what happens inside the GPU when an LLM generates text. It strips away the massive overhead of PyTorch and Python to implement a raw inference loop.
There are no dependencies on torch, accelerate, or huggingface. Just C++, CMake, and raw kernel code.
The aim of cuLlama is to demonstrate direct control over the hardware.
- Manual Memory Management: We do not use a Garbage Collector. We allocate a single contiguous block of GPU memory (Arena) at startup and manually manage pointers to avoid fragmentation and allocation overhead.
- Kernel Fusion: We replace PyTorch's 100+ tiny kernel launches per layer with fused kernels (RMSNorm, SwiGLU, RoPE) to keep the GPU compute-bound, not latency-bound.
- Dual-Backend Compilation: The codebase is designed to compile for NVIDIA (CUDA) and AMD (ROCm/HIP) from a single source using a custom abstraction layer.
The project is structured to separate the Host (CPU) logic from the Device (GPU) execution.
```
cuLlama/
├── CMakeLists.txt             # The Build System (Critical for C++)
├── README.md                  # Architecture & Benchmarks
│
├── src/                       # THE HOST CODE (C++ Logic)
│   ├── main.cpp               # Entry point (CLI for text generation)
│   ├── engine.cpp             # Orchestrates the generation loop (Forward -> Sample -> Cache)
│   ├── model_loader.cpp       # mmap() weights directly from disk (System Call knowledge)
│   ├── memory_manager.cpp     # [SYSTEMS] Manual GPU Arena & Paged KV Cache
│   └── sampler.cpp            # Top-K / Top-P sampling logic (Host side)
│
├── include/                   # HEADERS (Interface Definitions)
│   ├── config.h               # Model Hyperparams (Llama-2-7B, TinyLlama)
│   ├── layers.h               # Class definitions for Linear, RMSNorm, Attention
│   ├── kv_cache.h             # [SYSTEMS] Ring Buffer / Paged Attention logic
│   └── cuda_utils.h           # Error checking macros (CUDA_CHECK)
│
├── kernels/                   # THE DEVICE CODE (CUDA/HIP)
│   ├── attention/
│   │   ├── flash_attention.cu # [CORE] Custom Tiled Attention Kernel
│   │   ├── paged_attention.cu # [ADVANCED] Handling non-contiguous KV blocks
│   │   └── rope.cu            # Rotary Positional Embeddings
│   ├── layers/
│   │   ├── rmsnorm.cu         # Fused RMSNorm (Warp Shuffle Reduction)
│   │   ├── silu_mul.cu        # Fused SwiGLU Activation
│   │   └── softmax.cu         # Fast Softmax
│   ├── quantization/
│   │   ├── int8_dequant.cu    # [AMD OPTIMIZATION] W8A16 Kernel
│   │   └── fp8_utils.cu       # FP8 casting (for future proofing)
│   └── common/
│       └── hip_compat.h       # [AMD] Macros to map cudaMalloc -> hipMalloc
│
├── scripts/                   # PYTHON HELPERS
│   ├── export_weights.py      # PyTorch -> Binary format exporter
│   ├── compare_logits.py      # Debugger: checks C++ output vs PyTorch
│   └── benchmark.py           # Plot tokens/sec
│
├── third_party/               # EXTERNAL LIBS
│   ├── cutlass/               # (Optional) For high-performance GEMMs
│   └── nlohmann_json/         # For config parsing
│
└── tests/                     # UNIT TESTS (GoogleTest)
    ├── test_rmsnorm.cpp       # Verifies kernel output vs CPU reference
    └── test_kv_cache.cpp      # Verifies memory logic
```
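Some of the host-side pieces above are small enough to sketch. The ring-buffer indexing behind `kv_cache.h` could look like the following (names and layout are illustrative, not the project's actual header): once the context window fills, new tokens reuse the oldest physical slot instead of reallocating.

```cpp
#include <cstddef>

// Illustrative ring-buffer index for a fixed-window KV cache (not the real
// kv_cache.h). It maps logical token positions onto a fixed pool of physical
// slots; once `window` positions are written, the oldest slot is overwritten.
struct RingKVCache {
    size_t window;        // max positions held (context window)
    size_t written = 0;   // total tokens pushed so far

    explicit RingKVCache(size_t w) : window(w) {}

    // Physical slot where the next token's K/V vectors should be stored.
    size_t push() { return written++ % window; }

    // How many positions are currently valid.
    size_t size() const { return written < window ? written : window; }
};
```

The memory for the slots themselves would live inside the GPU arena; this struct only tracks indices on the host.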
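Likewise, the host-side nucleus sampling in `sampler.cpp` could look roughly like this sketch (temperature scaling and Top-K truncation are omitted for brevity; this is not the project's exact implementation):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Illustrative Top-P (nucleus) sampler over raw logits (not the real
// sampler.cpp). Steps: softmax, sort descending, keep the smallest prefix
// whose probability mass reaches top_p, sample from that prefix.
int sample_top_p(const std::vector<float>& logits, float top_p,
                 std::mt19937& rng) {
    // Softmax with max-subtraction for numerical stability.
    float mx = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(logits.size());
    float sum = 0.f;
    for (size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - mx);
        sum += probs[i];
    }
    for (float& p : probs) p /= sum;

    // Sort token ids by probability, descending.
    std::vector<int> ids(logits.size());
    std::iota(ids.begin(), ids.end(), 0);
    std::sort(ids.begin(), ids.end(),
              [&](int a, int b) { return probs[a] > probs[b]; });

    // Truncate to the nucleus: smallest prefix with cumulative mass >= top_p.
    float cum = 0.f;
    size_t cut = ids.size();
    for (size_t i = 0; i < ids.size(); ++i) {
        cum += probs[ids[i]];
        if (cum >= top_p) { cut = i + 1; break; }
    }

    // Sample within the nucleus, implicitly renormalized by `cum`.
    std::uniform_real_distribution<float> dist(0.f, cum);
    float r = dist(rng);
    for (size_t i = 0; i < cut; ++i) {
        r -= probs[ids[i]];
        if (r <= 0.f) return ids[i];
    }
    return ids[cut - 1];
}
```

Sampling stays on the host because it touches only one logits vector per step; copying that vector back is cheap compared to the forward pass.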
- Zero-Copy Loading: Weights are loaded via `mmap`, allowing the OS to page them in lazily. This avoids a massive CPU RAM allocation and double-copying.
- Static Allocation (The Arena): We allocate one contiguous block of VRAM at startup. All tensors (linear layers, KV cache) are views into this block. There is zero `malloc`/`free` overhead during the inference loop.
- Fused Operations: Instead of launching 100+ small kernels per layer (standard PyTorch behavior), we fuse `RMSNorm + Residual` and `SwiGLU` to keep the GPU compute-bound, not latency-bound.
- Platform Agnosticism: The codebase uses a thin abstraction layer (`hip_compat.h`) to compile native CUDA code for NVIDIA or native HIP code for AMD without changing the kernel logic.
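The zero-copy path in `model_loader.cpp` could look roughly like this POSIX sketch (error handling trimmed, and the flat-float-array view is an assumption for illustration):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>

// Illustrative mmap-based weight loading (not the project's exact loader).
// The kernel pages weight data in lazily on first touch; nothing is copied
// into a separately allocated CPU buffer.
struct MappedFile {
    const float* data = nullptr;  // weights viewed as a flat float array
    size_t bytes = 0;
    void* raw = nullptr;
};

bool map_weights(const char* path, MappedFile& out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return false;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return false; }
    out.bytes = static_cast<size_t>(st.st_size);
    out.raw = mmap(nullptr, out.bytes, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after closing the descriptor
    if (out.raw == MAP_FAILED) return false;
    out.data = static_cast<const float*>(out.raw);
    return true;
}
```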
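As a concrete picture of what kernel fusion buys: the residual add and RMSNorm can be a single pass over the hidden vector instead of separate launches. Here is a CPU reference of that fused math, the kind of thing `tests/test_rmsnorm.cpp` would compare the GPU kernel against (illustrative, not the project's actual test code):

```cpp
#include <cmath>
#include <cstddef>

// CPU reference for a fused residual-add + RMSNorm pass. The GPU kernel does
// the same math, with the sum_sq reduction done via warp shuffles.
// x      : hidden state, updated in place (x += residual, then normalized)
// weight : learned per-channel scale
void fused_add_rmsnorm(float* x, const float* residual,
                       const float* weight, size_t n, float eps = 1e-5f) {
    // Pass 1: residual add + accumulate sum of squares (one launch on GPU).
    float sum_sq = 0.f;
    for (size_t i = 0; i < n; ++i) {
        x[i] += residual[i];
        sum_sq += x[i] * x[i];
    }
    // Pass 2: scale by reciprocal RMS and the per-channel weight.
    float inv_rms = 1.0f / std::sqrt(sum_sq / n + eps);
    for (size_t i = 0; i < n; ++i)
        x[i] = x[i] * inv_rms * weight[i];
}
```

Fusing matters because each of these elementwise steps is memory-bound on its own; doing them in one kernel reads and writes the hidden state once instead of several times.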
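The portability trick in `hip_compat.h` is plain preprocessor mapping. The sketch below shows the shape of it; here the two runtimes are stubbed with host `malloc` so the sketch compiles without either SDK, whereas the real header maps onto the actual CUDA/HIP runtime symbols and covers far more of the API:

```cpp
#include <cstddef>
#include <cstdlib>

// Stub "runtimes" so this sketch compiles without the CUDA or HIP SDKs.
// In the real project these names resolve to the actual runtime libraries.
inline int cudaMalloc(void** p, size_t n) { *p = std::malloc(n); return 0; }
inline int hipMalloc(void** p, size_t n)  { *p = std::malloc(n); return 0; }

// The hip_compat.h idea: engine code uses one spelling, the build picks the
// backend. CMake's -DBACKEND=HIP would define USE_HIP (flag name assumed).
#ifdef USE_HIP
  #define gpuMalloc(ptr, bytes) hipMalloc((ptr), (bytes))
#else
  #define gpuMalloc(ptr, bytes) cudaMalloc((ptr), (bytes))
#endif

// Host/kernel code only ever writes gpuMalloc.
void* alloc_device(size_t bytes) {
    void* p = nullptr;
    gpuMalloc(&p, bytes);
    return p;
}
```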
To build cuLlama you need:
- CMake (3.18+)
- GPU Compiler: `nvcc` (NVIDIA) or `hipcc` (AMD)
- C++ Compiler: `g++` or `clang`
```bash
git clone https://github.com/vivek-kumar9696/cullama.git
cd cuLlama
mkdir build && cd build
cmake ..
make
```

To compile for AMD GPUs, we switch the backend flag. This triggers the `hip_compat.h` layer to remap CUDA calls to HIP.
```bash
mkdir build && cd build
cmake -DBACKEND=HIP ..
make
```

To run the C++ inference engine, you first need to download and convert the model weights from Hugging Face into our custom `.bin` format. We use a short Python script (`scripts/export_weights.py`) for this.
We recommend using uv, an extremely fast Python package installer and resolver, to manage the dependencies.
If you don't have uv installed on your system, you can install it via curl:
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Navigate to your project directory and create a virtual environment using uv:
```bash
# Create a virtual environment
uv venv

# Activate the environment (Linux/macOS)
source .venv/bin/activate
# (On Windows, use `.venv\Scripts\activate`)
```

Once the virtual environment is activated, use `uv pip` to install the required machine learning packages. uv will install these significantly faster than standard pip.

```bash
uv pip install torch numpy transformers
```

With the dependencies installed, you can now run the exporter script to generate the `model.bin` file:

```bash
python3 scripts/export_weights.py
```

Then launch the engine:

```bash
./cuLlama --model model.bin --prompt "The future of AI is"
```

| Device | Precision | Model | Tokens/Sec |
|---|---|---|---|
| RTX 4090 | FP16 | Llama-2-7B | Pending |
| A100 80GB | FP16 | Llama-2-7B | Pending |
| MI250X | FP16 | Llama-2-7B | Pending |
- Phase 0: Build System (CMake with Dual Backend support).
- Phase 1: Memory Arena & KV Cache Manager.
- Phase 2: Fused Kernels (RMSNorm, RoPE).
- Phase 3: FlashAttention Implementation.
- Phase 4: W8A16 Quantization (Int8 weights, FP16 compute).
This is a portfolio project demonstrating Systems Engineering skills. Issues and PRs focusing on kernel optimization or hardware compatibility are welcome.
MIT License.