ECE 60827 CUDA Programming Part 2

Professor: Timothy Rogers
TA: Junrui Pan

Introduction

Large Language Models (LLMs) like GPT and LLaMA are built on the Transformer architecture, which relies heavily on matrix multiplications. The core operations in Transformers—attention mechanisms (Q×K^T, softmax×V) and feed-forward layers—are essentially sequences of GEMMs. In fact, matrix multiplications account for the vast majority of compute in modern LLMs, making GEMM optimization critical for efficient inference and training. See Matrix Multiplication Background for more details.

The purpose of this lab is to deepen your understanding of CUDA programming by implementing General Matrix Multiplication (GEMM). You will implement two versions—a naive approach and an optimized shared memory version—and compare their performance against PyTorch's highly-optimized implementation. Beyond learning CUDA itself, this lab helps you understand how widely-used frameworks like PyTorch interact with the underlying GPU hardware. By building your own GEMM kernels and comparing them against PyTorch's implementation, you'll gain insight into the entire software stack—from high-level Python APIs down to low-level GPU execution.

The official CUDA Documentation is the best resource for implementation details and API specifics.



GEMM (General Matrix Multiplication)

GEMM computes the matrix product:

C = A × B

Where:

  • A is an M × K matrix
  • B is a K × N matrix
  • C is an M × N matrix (output)

Each element of C is computed as:

C[i][j] = Σ(k=0 to K-1) A[i][k] * B[k][j]

Matrix Storage

All matrices are stored in row-major order:

  • A[i][j] is stored at A[i * K + j]
  • B[i][j] is stored at B[i * N + j]
  • C[i][j] is stored at C[i * N + j]
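The definition and storage layout above can be captured in a simple host-side reference implementation, which is also useful as a correctness oracle while debugging your kernels (a sketch; the function name `gemm_ref` is illustrative, not part of the provided code):

```c
#include <stddef.h>

/* Reference row-major GEMM on the host:
   C[i][j] = sum over k of A[i][k] * B[k][j]. */
void gemm_ref(const float *A, const float *B, float *C,
              size_t M, size_t K, size_t N) {
    for (size_t i = 0; i < M; ++i)
        for (size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j]; /* row-major indexing */
            C[i * N + j] = acc;
        }
}
```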


Part A: Naive GEMM

Implement gemm_kernel in cuda_gemm.cu.

Each thread computes one element of the output matrix C. The thread reads an entire row of A and an entire column of B from global memory to compute the dot product.
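One possible shape for such a kernel is sketched below. This is a starting-point sketch, not the required implementation; the exact signature must match what `gemm_cuda()` in `cuda_gemm.cu` launches:

```cuda
// Naive GEMM sketch: one thread per element of C.
__global__ void gemm_kernel(const float *A, const float *B, float *C,
                            int M, int K, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col]; // K loads from A, K from B
        C[row * N + col] = acc;
    }
}
```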



Part B: Shared Memory GEMM

Implement gemm_shared_kernel in cuda_gemm.cu.

Use shared memory to reduce global memory accesses. Threads within a block should cooperatively load data into shared memory before computing.
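A common tiling structure looks like the following sketch (the tile width `TILE` is an assumption and should match your block dimensions; bounds checks handle matrix sizes that are not multiples of the tile):

```cuda
#define TILE 16  // assumed tile width; must match blockDim.x == blockDim.y

__global__ void gemm_shared_kernel(const float *A, const float *B, float *C,
                                   int M, int K, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        // Cooperatively load one tile of A and one tile of B.
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] =
            (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();  // tile fully loaded before use

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // tile fully consumed before overwrite
    }
    if (row < M && col < N)
        C[row * N + col] = acc;
}
```

Each element of A and B is now loaded from global memory once per tile rather than once per thread that needs it.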



Part C: Loop Unrolling Optimization

Build on your shared memory tiled GEMM implementation by manually unrolling the inner computation loop. Instead of relying solely on the compiler’s automatic unrolling, explicitly expand multiple loop iterations within the loop body to reduce loop overhead and expose additional instruction-level parallelism (ILP). You may use #pragma unroll to assist the compiler, but manual unrolling is required.
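Manual unrolling of the inner tile loop might look like the fragment below (a sketch assuming shared-memory tiles `As`/`Bs` and a tile width `TILE` that is a multiple of 4):

```cuda
// Inner product over one tile, manually unrolled by a factor of 4.
// Each statement is independent, exposing instruction-level parallelism.
for (int k = 0; k < TILE; k += 4) {
    acc += As[threadIdx.y][k]     * Bs[k][threadIdx.x];
    acc += As[threadIdx.y][k + 1] * Bs[k + 1][threadIdx.x];
    acc += As[threadIdx.y][k + 2] * Bs[k + 2][threadIdx.x];
    acc += As[threadIdx.y][k + 3] * Bs[k + 3][threadIdx.x];
}
```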



Repository Structure

This repository demonstrates how Python frameworks like PyTorch integrate with custom CUDA code:

  • test_gemm.py — The Python test script that imports cuda_gemm as a module and calls functions like cuda_gemm.gemm(A, B) on PyTorch GPU tensors.
  • setup.py — Defines how the cuda_gemm module gets built as a PyTorch C++/CUDA extension.
  • cuda_gemm.cu — Contains the CUDA kernels you will implement, along with wrapper functions that bridge Python calls to GPU execution.

Understanding this flow—from high-level Python API to low-level CUDA kernel—is a key learning objective of this lab.
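For orientation, the binding layer of a PyTorch CUDA extension typically follows this pattern (illustrative only; the actual wrappers and registrations live in `cuda_gemm.cu`):

```cuda
#include <torch/extension.h>

// Wrapper declared elsewhere: validates tensors and launches gemm_kernel.
torch::Tensor gemm_cuda(torch::Tensor A, torch::Tensor B);

// Exposes the wrapper to Python as cuda_gemm.gemm(A, B).
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("gemm", &gemm_cuda, "Naive GEMM (CUDA)");
}
```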

test_gemm.py

  • test_naive() — Tests Part A correctness by comparing your naive GEMM output against torch.mm() across multiple matrix sizes.
  • test_shared() — Tests Part B correctness by comparing your shared memory GEMM output against torch.mm() across multiple matrix sizes.
  • test_unrolled() — Tests Part C correctness by comparing your loop unrolling GEMM output against torch.mm() across multiple matrix sizes.
  • benchmark() — Measures and compares execution time of all three implementations against PyTorch's optimized GEMM.

cuda_gemm.cu

  • gemm_kernel — [TODO] The naive CUDA kernel where each thread computes one element of the output matrix.
  • gemm_shared_kernel — [TODO] The shared memory CUDA kernel that uses tiling to reduce global memory accesses.
  • gemm_unrolled_kernel — [TODO] The shared memory + loop unrolling CUDA kernel.
  • gemm_cuda() — C++ wrapper that validates input tensors, configures grid/block dimensions, and launches gemm_kernel.
  • gemm_shared_cuda() — C++ wrapper that validates input tensors, configures grid/block dimensions, and launches gemm_shared_kernel.
  • gemm_unrolled_cuda() — C++ wrapper that validates input tensors, configures grid/block dimensions, and launches gemm_unrolled_kernel.
  • PYBIND11_MODULE — Registers the C++ functions as Python-callable methods (cuda_gemm.gemm, cuda_gemm.gemm_shared, and cuda_gemm.gemm_unrolled). See pybind11 documentation.

Note: The grid and block dimensions in the wrapper functions can be adjusted as needed for your kernel implementation.



Building and Testing

Setup Environment

module load gcc/11.4.1 cuda ngc pytorch

Build

make build

Note: The first build may take a few minutes (~3 min) as it compiles the PyTorch CUDA extension.

Test

make test

This will run correctness tests and benchmarks comparing your implementations against PyTorch's optimized GEMM.

Clean

make clean


Expected Output

When your implementation is correct, make test should show:

==================================================
Benchmarking...
==================================================

Matrix size: (1024x2048) x (2048x512)
Naive max error:  X.XXXXXXe-XX
Shared max error: X.XXXXXXe-XX

PyTorch mm:    X.XXX ms
Naive GEMM:    X.XXX ms
Shared GEMM:   X.XXX ms

Speedup (shared vs naive): X.XXx
Relative to PyTorch (shared): X.XXx


Submission and Autograding

When you submit your assignment through GitHub Classroom, an autograder will automatically test your implementation.

What Gets Tested

  • Part A - Naive GEMM: Correctness against PyTorch reference
  • Part B - Shared Memory GEMM: Correctness against PyTorch reference
  • Part C - Loop Unrolling GEMM: Correctness against PyTorch reference

Running the Grader Locally

To test each part separately (as the autograder does):

# Test Part A only
python3 test_gemm.py --part-a

# Test Part B only
python3 test_gemm.py --part-b

# Test Part C only
python3 test_gemm.py --part-c

Running make test will execute the full benchmark, which tests all three parts and reports timing comparisons.



Report

Please write all your answers directly in the report.md file. Do not create a separate PDF or other document.

Report Questions

Answer the following questions in your report. Read through all provided files (cuda_gemm.cu, setup.py, test_gemm.py) carefully before answering. Keep each answer concise—a few sentences is sufficient unless the question asks for data or code.

Part 1: Understanding the Codebase

How does a Python call to cuda_gemm.gemm(A, B) end up executing your CUDA kernel? Trace the path through test_gemm.py, setup.py, and cuda_gemm.cu. See Matrix Multiplication Background for additional context.

Part 2: Performance Analysis

  1. Run make test and report the timing results. How does the shared memory version compare to the naive version? What speedup do you observe?

  2. For the naive kernel, how many global memory loads does each thread perform to compute one element of C? For an M×K by K×N multiplication, what is the total number of global memory loads across all threads?

  3. For the shared memory kernel with tile size T (BLOCK_SIZE), how many global memory loads does each thread perform? How does tiling reduce the total number of global memory accesses?

  4. Both your implementations are likely slower than PyTorch's torch.mm. Why do you think this happens?

Part 3: Loop Unrolling Analysis

  1. What is loop unrolling and how does it help? What trade-offs does it introduce (e.g., instruction cache pressure, register usage)?

  2. Compare the performance of your unrolled kernel against the shared memory version. Do you see any speedup? Why or why not?

  3. Describe one additional optimization technique (beyond shared memory tiling and loop unrolling) that could further improve GEMM performance. Explain the underlying principle and expected benefit.

Part 4: FP16

  1. Write a separate Python script that runs your three GEMM implementations and PyTorch's torch.mm using FP16 (torch.float16) tensors instead of FP32. You can create FP16 tensors like this:

    A = torch.randn(M, K, device='cuda', dtype=torch.float16)

    Report the timing results for all four. What happens to PyTorch's performance compared to FP32? What happens to your kernels' performance? Explain why PyTorch sees a significant speedup with FP16 while your custom kernels do not.

    (Hint: Volta introduced special matrix-multiply hardware, Tensor Cores, that operates on FP16 inputs but not FP32.)

Grading Rubric

  • Part A - Naive GEMM — 15 points
  • Part B - Shared Memory GEMM — 15 points
  • Part C - Loop Unrolling GEMM — 40 points
  • Report — 30 points
  • Total — 100 points

Submission Requirements

Include your written report named report.md in the root folder of your repository. See Markdown Guide for formatting help.

Academic Integrity

The use of AI tools (ChatGPT, GitHub Copilot, Claude, etc.) is prohibited for this assignment. All code must be written by you. Violations will be treated as academic dishonesty.

References

[1] CUDA C++ Programming Guide

[2] PyTorch C++/CUDA Extension Tutorial

[3] Matrix Multiplication Background
