ECE 60827 CUDA Programming Part 2

Professor: Timothy Rogers
TA: Junrui Pan

Introduction

Large Language Models (LLMs) like GPT and LLaMA are built on the Transformer architecture, which relies heavily on matrix multiplications. The core operations in Transformers—attention mechanisms (Q×K^T, softmax×V) and feed-forward layers—are essentially sequences of GEMMs. In fact, matrix multiplications account for the vast majority of compute in modern LLMs, making GEMM optimization critical for efficient inference and training. See Matrix Multiplication Background for more details.

The purpose of this lab is to deepen your understanding of CUDA programming by implementing General Matrix Multiplication (GEMM). You will implement two versions—a naive approach and an optimized shared memory version—and compare their performance against PyTorch's highly-optimized implementation. Beyond learning CUDA itself, this lab helps you understand how widely-used frameworks like PyTorch interact with the underlying GPU hardware. By building your own GEMM kernels and comparing them against PyTorch's implementation, you'll gain insight into the entire software stack—from high-level Python APIs down to low-level GPU execution.

The official CUDA Documentation is the best resource for implementation details and API specifics.



GEMM (General Matrix Multiplication)

GEMM computes the matrix product:

C = A × B

Where:

  • A is an M × K matrix
  • B is a K × N matrix
  • C is an M × N matrix (output)

Each element of C is computed as:

C[i][j] = Σ(k=0 to K-1) A[i][k] * B[k][j]

Matrix Storage

All matrices are stored in row-major order:

  • A[i][j] is stored at A[i * K + j]
  • B[i][j] is stored at B[i * N + j]
  • C[i][j] is stored at C[i * N + j]
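The definition and storage layout above can be captured in a simple host-side reference implementation, which is also useful as a correctness oracle while debugging your kernels (a sketch; the function name `gemm_ref` is illustrative, not part of the provided code):

```c
#include <stddef.h>

/* Reference row-major GEMM on the host:
   C[i][j] = sum over k of A[i][k] * B[k][j]. */
void gemm_ref(const float *A, const float *B, float *C,
              size_t M, size_t K, size_t N) {
    for (size_t i = 0; i < M; ++i)
        for (size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j]; /* row-major indexing */
            C[i * N + j] = acc;
        }
}
```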


Part A: Naive GEMM

Implement gemm_kernel in cuda_gemm.cu.

Each thread computes one element of the output matrix C. The thread reads an entire row of A and an entire column of B from global memory to compute the dot product.
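One possible shape for such a kernel is sketched below. This is a starting-point sketch, not the required implementation; the exact signature must match what `gemm_cuda()` in `cuda_gemm.cu` launches:

```cuda
// Naive GEMM sketch: one thread per element of C.
__global__ void gemm_kernel(const float *A, const float *B, float *C,
                            int M, int K, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col]; // K loads from A, K from B
        C[row * N + col] = acc;
    }
}
```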



Part B: Shared Memory GEMM

Implement gemm_shared_kernel in cuda_gemm.cu.

Use shared memory to reduce global memory accesses. Threads within a block should cooperatively load data into shared memory before computing.
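A common tiling structure looks like the following sketch (the tile width `TILE` is an assumption and should match your block dimensions; bounds checks handle matrix sizes that are not multiples of the tile):

```cuda
#define TILE 16  // assumed tile width; must match blockDim.x == blockDim.y

__global__ void gemm_shared_kernel(const float *A, const float *B, float *C,
                                   int M, int K, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        // Cooperatively load one tile of A and one tile of B.
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] =
            (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();  // tile fully loaded before use

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // tile fully consumed before overwrite
    }
    if (row < M && col < N)
        C[row * N + col] = acc;
}
```

Each element of A and B is now loaded from global memory once per tile rather than once per thread that needs it.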



Part C: Loop Unrolling Optimization

Build on your shared memory tiled GEMM implementation by manually unrolling the inner computation loop. Instead of relying solely on the compiler’s automatic unrolling, explicitly expand multiple loop iterations within the loop body to reduce loop overhead and expose additional instruction-level parallelism (ILP). You may use #pragma unroll to assist the compiler, but manual unrolling is required.
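Manual unrolling of the inner tile loop might look like the fragment below (a sketch assuming shared-memory tiles `As`/`Bs` and a tile width `TILE` that is a multiple of 4):

```cuda
// Inner product over one tile, manually unrolled by a factor of 4.
// Each statement is independent, exposing instruction-level parallelism.
for (int k = 0; k < TILE; k += 4) {
    acc += As[threadIdx.y][k]     * Bs[k][threadIdx.x];
    acc += As[threadIdx.y][k + 1] * Bs[k + 1][threadIdx.x];
    acc += As[threadIdx.y][k + 2] * Bs[k + 2][threadIdx.x];
    acc += As[threadIdx.y][k + 3] * Bs[k + 3][threadIdx.x];
}
```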



Repository Structure

This repository demonstrates how Python frameworks like PyTorch integrate with custom CUDA code:

  • test_gemm.py — The Python test script that imports cuda_gemm as a module and calls functions like cuda_gemm.gemm(A, B) on PyTorch GPU tensors.
  • setup.py — Defines how the cuda_gemm module gets built as a PyTorch C++/CUDA extension.
  • cuda_gemm.cu — Contains the CUDA kernels you will implement, along with wrapper functions that bridge Python calls to GPU execution.

Understanding this flow—from high-level Python API to low-level CUDA kernel—is a key learning objective of this lab.
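For orientation, the binding layer of a PyTorch CUDA extension typically follows this pattern (illustrative only; the actual wrappers and registrations live in `cuda_gemm.cu`):

```cuda
#include <torch/extension.h>

// Wrapper declared elsewhere: validates tensors and launches gemm_kernel.
torch::Tensor gemm_cuda(torch::Tensor A, torch::Tensor B);

// Exposes the wrapper to Python as cuda_gemm.gemm(A, B).
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("gemm", &gemm_cuda, "Naive GEMM (CUDA)");
}
```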

test_gemm.py

  • test_naive() — Tests Part A correctness by comparing your naive GEMM output against torch.mm() across multiple matrix sizes.
  • test_shared() — Tests Part B correctness by comparing your shared memory GEMM output against torch.mm() across multiple matrix sizes.
  • test_unrolled() — Tests Part C correctness by comparing your loop unrolling GEMM output against torch.mm() across multiple matrix sizes.
  • benchmark() — Measures and compares execution time of all three implementations against PyTorch's optimized GEMM.

cuda_gemm.cu

  • gemm_kernel — [TODO] The naive CUDA kernel where each thread computes one element of the output matrix.
  • gemm_shared_kernel — [TODO] The shared memory CUDA kernel that uses tiling to reduce global memory accesses.
  • gemm_unrolled_kernel — [TODO] The shared memory + loop unrolling CUDA kernel.
  • gemm_cuda() — C++ wrapper that validates input tensors, configures grid/block dimensions, and launches gemm_kernel.
  • gemm_shared_cuda() — C++ wrapper that validates input tensors, configures grid/block dimensions, and launches gemm_shared_kernel.
  • gemm_unrolled_cuda() — C++ wrapper that validates input tensors, configures grid/block dimensions, and launches gemm_unrolled_kernel.
  • PYBIND11_MODULE — Registers the C++ functions as Python-callable methods (cuda_gemm.gemm, cuda_gemm.gemm_shared, and cuda_gemm.gemm_unrolled). See pybind11 documentation.

Note: The grid and block dimensions in the wrapper functions can be adjusted as needed for your kernel implementation.



Building and Testing

Setup Environment

module load gcc/11.4.1 cuda ngc pytorch

Build

make build

Note: The first build may take a few minutes (~3 min) as it compiles the PyTorch CUDA extension.

Test

make test

This will run correctness tests and benchmarks comparing your implementations against PyTorch's optimized GEMM.

Clean

make clean


Expected Output

When your implementation is correct, make test should show:

==================================================
Benchmarking...
==================================================

Matrix size: (1024x2048) x (2048x512)
Naive max error:  X.XXXXXXe-XX
Shared max error: X.XXXXXXe-XX

PyTorch mm:    X.XXX ms
Naive GEMM:    X.XXX ms
Shared GEMM:   X.XXX ms

Speedup (shared vs naive): X.XXx
Relative to PyTorch (shared): X.XXx


Submission and Autograding

When you submit your assignment through GitHub Classroom, an autograder will automatically test your implementation.

What Gets Tested

  • Part A - Naive GEMM: Correctness against PyTorch reference
  • Part B - Shared Memory GEMM: Correctness against PyTorch reference
  • Part C - Loop Unrolling GEMM: Correctness against PyTorch reference

Running the Grader Locally

To test each part separately (as the autograder does):

# Test Part A only
python3 test_gemm.py --part-a

# Test Part B only
python3 test_gemm.py --part-b

# Test Part C only
python3 test_gemm.py --part-c

Running make test will execute the full benchmark, which tests all three parts and reports timing comparisons.



Report

Please write all your answers directly in the report.md file. Do not create a separate PDF or other document.

Report Questions

Answer the following questions in your report. Read through all provided files (cuda_gemm.cu, setup.py, test_gemm.py) carefully before answering. Keep each answer concise—a few sentences is sufficient unless the question asks for data or code.

Part 1: Understanding the Codebase

How does a Python call to cuda_gemm.gemm(A, B) end up executing your CUDA kernel? Trace the path through test_gemm.py, setup.py, and cuda_gemm.cu. See Matrix Multiplication Background for additional context.

Part 2: Performance Analysis

  1. Run make test and report the timing results. How does the shared memory version compare to the naive version? What speedup do you observe?

  2. For the naive kernel, how many global memory loads does each thread perform to compute one element of C? For an M×K by K×N multiplication, what is the total number of global memory loads across all threads?

  3. For the shared memory kernel with tile size T (BLOCK_SIZE), how many global memory loads does each thread perform? How does tiling reduce the total number of global memory accesses?

  4. Both your implementations are likely slower than PyTorch's torch.mm. Why do you think this happens?

Part 3: Loop Unrolling Analysis

  1. What is loop unrolling and how does it help? What trade-offs does it introduce (e.g., instruction cache pressure, register usage)?

  2. Compare the performance of your unrolled kernel against the shared memory version. Do you see any speedup? Why or why not?

  3. Describe one additional optimization technique (beyond shared memory tiling and loop unrolling) that could further improve GEMM performance. Explain the underlying principle and expected benefit.

Part 4: FP16

  1. Write a separate Python script that runs your three GEMM implementations and PyTorch's torch.mm using FP16 (torch.float16) tensors instead of FP32. You can create FP16 tensors like this:

    A = torch.randn(M, K, device='cuda', dtype=torch.float16)

    Report the timing results for all four. What happens to PyTorch's performance compared to FP32? What happens to your kernels' performance? Explain why PyTorch sees a significant speedup with FP16 while your custom kernels do not.

    (Hint: Volta introduced special matrix-multiply hardware, Tensor Cores, that operates on FP16 inputs but not FP32.)

Grading Rubric

  • Part A - Naive GEMM — 15 points
  • Part B - Shared Memory GEMM — 15 points
  • Part C - Loop Unrolling GEMM — 40 points
  • Report — 30 points
  • Total — 100 points

Submission Requirements

Include your written report named report.md in the root folder of your repository. See Markdown Guide for formatting help.

Academic Integrity

The use of AI tools (ChatGPT, GitHub Copilot, Claude, etc.) is prohibited for this assignment. All code must be written by you. Violations will be treated as academic dishonesty.

References

[1] CUDA C++ Programming Guide

[2] PyTorch C++/CUDA Extension Tutorial

[3] Matrix Multiplication Background
