A hands-on CUDA learning repository covering fundamental GPU programming concepts — from vector addition to cuBLAS. This project originated as a fork of CoffeeBeforeArch's cuda_programming (Nick's "CUDA Crash Course v3" YouTube series) and has since evolved into an independent, restructured repository with significant additions.
This repository was initially forked from CoffeeBeforeArch/cuda_programming (941+ stars, GPL-3.0 licensed). The original repo contains code examples from Nick's excellent YouTube tutorial series on GPGPU programming with CUDA.
An intermediate fork existed at ruohai0925/cuda_programming before being reorganized into this standalone repository.
- Core CUDA topics (vector addition, matrix multiplication, sum reduction, histogram, convolution) — all code has been modified and enhanced
- The progressive optimization approach (baseline → optimized variants)
- GPL-3.0 license (see LICENSE)
| Area | Original (Nick) | This Repo |
|---|---|---|
| Structure | Flat numbered directories (01–05) | Same core structure + additional modules (06_cuBLAS, misc, timing) |
| cuBLAS | Not included | 06_cuBLAS/: SAXPY with cublasSetVector, batched matrix multiplication with cuRAND |
| Performance Benchmarking | No timing framework | cuda_timing_Nick/: Timing harness for matrix multiplication (naive, coalesced, prefetch, tmp_var, unroll) and sum reduction |
| Comparative Study | N/A | cuda_timing_ZDSJTU/: Subset of examples with timing instrumentation for performance comparison |
| Misc Examples | N/A | misc/: OpenACC matrix multiplication, clock-based timing, device query utility |
| Learning Notes | N/A | learning_notes: Profiling notes with nsys profile commands and example output |
| Documentation | Minimal README | Detailed README (EN/CN), CLAUDE.md with build commands, code patterns, and key concepts |
| Code Comments | Minimal | Enhanced comments and annotations for self-study |
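The cuBLAS additions mentioned above follow the v2 API pattern. A minimal SAXPY sketch in the same spirit as `06_cuBLAS/vector_add_cublas.cu` (not the repo's exact file; error checking trimmed for brevity):

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int N = 1 << 10;
    const float alpha = 2.0f;
    float *h_x = new float[N], *h_y = new float[N];
    for (int i = 0; i < N; i++) { h_x[i] = 1.0f; h_y[i] = 3.0f; }

    float *d_x, *d_y;
    cudaMalloc(&d_x, N * sizeof(float));
    cudaMalloc(&d_y, N * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);

    // cublasSetVector copies host -> device (stride 1 on both sides)
    cublasSetVector(N, sizeof(float), h_x, 1, d_x, 1);
    cublasSetVector(N, sizeof(float), h_y, 1, d_y, 1);

    // y = alpha * x + y
    cublasSaxpy(handle, N, &alpha, d_x, 1, d_y, 1);

    cublasGetVector(N, sizeof(float), d_y, 1, h_y, 1);
    printf("y[0] = %f\n", h_y[0]);  // 2*1 + 3 = 5

    cublasDestroy(handle);
    cudaFree(d_x); cudaFree(d_y);
    delete[] h_x; delete[] h_y;
    return 0;
}
```

Compile with `nvcc vector_add_cublas.cu -o vector_add -lcublas`, as shown in the build instructions below.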
cuda_gpu_for_beginner/
├── 01_vector_addition/ # baseline → grid_stride/vectorized → pinned → unified memory
├── 02_matrix_mul/ # baseline → alignment → restrict → rectangular → nonMultiple → tiled (1D, 2D)
├── 03_sum_reduction/ # diverged → bank_conflicts → reduce_idle → no_conflicts → device_function → cooperative_groups
├── 04_histogram/ # global_atomic → shmem_atomic
├── 05_convolution/ # 1d_naive → 1d_cache → 1d_constant_memory → 1d_tiled → 2d_constant_memory
├── 06_cuBLAS/ # SAXPY (vector_add_cublas.cu), batched matmul (matrix_mul_cublas.cu)
├── cuda_timing_Nick/ # Performance benchmarking framework
├── cuda_timing_ZDSJTU/ # Timing instrumentation for comparison study
├── misc/ # OpenACC, clock, query_device
└── learning_notes # nsys profiling notes
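Each module's progression starts from a baseline like the one below. This is a sketch of the typical starting point of `01_vector_addition/` (one thread per element, explicit copies), not the repo's exact file:

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Baseline: one thread per output element, explicit host<->device copies
__global__ void vectorAdd(const int *a, const int *b, int *c, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)  // boundary check for non-multiple sizes
        c[tid] = a[tid] + b[tid];
}

int main() {
    const int N = 1 << 16;
    const size_t bytes = N * sizeof(int);
    int *h_a = new int[N], *h_b = new int[N], *h_c = new int[N];
    for (int i = 0; i < N; i++) { h_a[i] = i; h_b[i] = 2 * i; }

    int *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (N + threads - 1) / threads;  // ceiling division
    vectorAdd<<<blocks, threads>>>(d_a, d_b, d_c, N);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) assert(h_c[i] == 3 * i);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    delete[] h_a; delete[] h_b; delete[] h_c;
    return 0;
}
```

The later variants in the module (grid-stride, vectorized, pinned, unified memory) all start from this shape and change how memory is allocated, moved, or accessed.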
Every topic in this repository comes with a video walkthrough explaining the code, CUDA concepts, and optimization techniques step by step. Watch the full playlist on YouTube: CUDA GPU for Beginners
| # | Video | Link |
|---|---|---|
| 1 | vector add | Watch |
| 2 | vectorAdd um baseline | Watch |
| 2.5 | vectorAdd grid stride and vectorized memory access | Watch |
| 3 | vectorAdd um prefetch | Watch |
| 4 | vectorAdd pinned | Watch |
| 5 | mmul baseline | Watch |
| 6 | mmul tile | Watch |
| 7 | gpu architecture | Watch |
| 8 | mmul alignment | Watch |
| 9 | cublas vectorAdd | Watch |
| 10 | cublas matrix mul | Watch |
| 11 | sum reduction diverged | Watch |
| 12 | sum reduction bank conflicts | Watch |
| 13 | some questions about bank conflicts | Watch |
| 14 | sum reduction no conflicts | Watch |
| 15 | sum reduction reduce idle | Watch |
| 16 | sum reduction device function | Watch |
| 17 | sum reduction cooperative groups | Watch |
| 18 | visual studio build cuda project | Watch |
| 19 | vectorAdd baseline profiling | Watch |
| 20 | Nsight Systems vs Nsight Compute | Watch |
| 21 | gpu profiling scripts for mmul and sumReduction | Watch |
| 21.5 | gpu profiling results for mmul and sumReduction | Watch |
| 22 | 1d convolution naive | Watch |
| 23 | GPU concepts recap | Watch |
| 24 | 1d convolution constant memory | Watch |
| 25 | 1d convolution shared memory | Watch |
| 26 | 1d convolution cache simplification | Watch |
| 27 | 2d convolution | Watch |
| 28 | short summary thinking spatially | Watch |
| 29 | histogram global atomic | Watch |
| 30 | histogram shmem atomic | Watch |
| 31 | matrix multiplication demo | Watch |
| 32 | OpenACC | Watch |
| 33 | gpu device properties | Watch |
| 34 | clock function | Watch |
- Grid-stride loops & vectorized memory access (`int4` for 128-bit loads)
- Pinned memory (DMA direct access, eliminates double-copy penalty)
- Unified memory (`cudaMallocManaged`, automatic migration)
- Shared memory tiling (`__shared__` for data reuse, 1D & 2D variants)
- Non-square & non-multiple matrices (ceiling division, boundary checks)
- Bank conflicts (sequential vs. strided shared memory access)
- Warp-level optimization (unrolled last warp with `warpReduce()`)
- Cooperative groups (modern sync API, `int4` vectorized loads, `atomicAdd`)
- Constant memory (`__constant__` for read-only mask/coefficient data)
- cuBLAS v2 API (column-major layout, `cublasSetVector`, cuRAND)
- OpenACC (directive-based GPU programming)
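The first two concepts combine naturally. A sketch of a vector-add kernel that uses a grid-stride loop over `int4` (128-bit) loads, assuming the element count is a multiple of 4:

```cuda
// n4 = n / 4; call with the int arrays reinterpret_cast to int4*
__global__ void vectorAddVec4(const int4 *a, const int4 *b, int4 *c, int n4) {
    // Grid-stride loop: each thread processes multiple int4 elements,
    // so the grid size need not match the problem size.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n4;
         i += blockDim.x * gridDim.x) {
        int4 va = a[i], vb = b[i];  // single 128-bit load per operand
        c[i] = make_int4(va.x + vb.x, va.y + vb.y, va.z + vb.z, va.w + vb.w);
    }
}
```

Any tail elements when `n` is not a multiple of 4 would need a scalar cleanup loop, which the sketch omits.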
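Likewise, the unrolled-last-warp idea from the sum-reduction module can be sketched as follows (classic volatile-pointer variant for a 256-thread block; modern code would use `__shfl_down_sync` instead):

```cuda
// Once only 32 active threads remain they form a single warp, so the
// explicit __syncthreads() barriers of the main loop can be dropped.
__device__ void warpReduce(volatile int *sdata, int tid) {
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

__global__ void sumReduction(const int *in, int *out, int n) {
    __shared__ int sdata[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? in[i] : 0;
    __syncthreads();

    // Sequential addressing: avoids shared-memory bank conflicts
    for (int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid < 32) warpReduce(sdata, tid);
    if (tid == 0) out[blockIdx.x] = sdata[0];  // one partial sum per block
}
```

The partial sums per block are then reduced again (a second kernel launch or a host-side loop), which is the pattern the reduction videos walk through.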
There is no build system (no Makefile or CMake); each `.cu` file is compiled individually:
```bash
# Standard compilation
nvcc source.cu -o executable

# cuBLAS projects (require linking)
nvcc vector_add_cublas.cu -o vector_add -lcublas
nvcc matrix_mul_cublas.cu -o matrix_mul -lcublas -lcurand
```

This repository is released under both of the following licenses. All code in this repository (including modifications to the original examples) must comply with both licenses simultaneously:
- GPL-3.0 — inherited from the upstream CoffeeBeforeArch/cuda_programming repository. See LICENSE.
- PolyForm Strict 1.0.0 — additional restriction applied to the entire repository by the maintainer. See LICENSE-POLYFORM.
In summary: all code in this repository is source-available for personal and non-commercial use only. You may study, experiment with, and learn from this code, but commercial use is not permitted without explicit written permission from the maintainer. Both licenses must be respected when using any part of this repository.
- CoffeeBeforeArch (Nick) for the original "CUDA Crash Course (v3)" YouTube series and codebase
- The CUDA community for invaluable learning resources