A hands-on CUDA learning repository covering fundamental GPU programming concepts — from vector addition to cuBLAS. This project originated as a fork of CoffeeBeforeArch's cuda_programming (Nick's "CUDA Crash Course v3" YouTube series) and has since evolved into an independent, restructured repository with significant additions.
This repository was initially forked from CoffeeBeforeArch/cuda_programming (941+ stars, GPL-3.0 licensed). The original repo contains code examples from Nick's excellent YouTube tutorial series on GPGPU programming with CUDA.
An intermediate fork existed at ruohai0925/cuda_programming before being reorganized into this standalone repository.
- Core CUDA topics (vector addition, matrix multiplication, sum reduction, histogram, convolution) — all code has been modified and enhanced
- The progressive optimization approach (baseline → optimized variants)
- GPL-3.0 license (see LICENSE)
| Area | Original (Nick) | This Repo |
|---|---|---|
| Structure | Flat numbered directories (01–05) | Same core structure + additional modules (06_cuBLAS, misc, timing) |
| cuBLAS | Not included | 06_cuBLAS/: SAXPY with cublasSetVector, batched matrix multiplication with cuRAND |
| Performance Benchmarking | No timing framework | cuda_timing_Nick/: Timing harness for matrix multiplication (naive, coalesced, prefetch, tmp_var, unroll) and sum reduction |
| Comparative Study | N/A | cuda_timing_ZDSJTU/: Subset of examples with timing instrumentation for performance comparison |
| Misc Examples | N/A | misc/: OpenACC matrix multiplication, clock-based timing, device query utility |
| Learning Notes | N/A | learning_notes: Profiling notes with nsys profile commands and example output |
| Documentation | Minimal README | Detailed README (EN/CN), CLAUDE.md with build commands, code patterns, and key concepts |
| Code Comments | Minimal | Enhanced comments and annotations for self-study |
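The cuBLAS additions mentioned above follow the v2 API pattern. A minimal SAXPY sketch in the same spirit as `06_cuBLAS/vector_add_cublas.cu` (not the repo's exact file; error checking trimmed for brevity):

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int N = 1 << 10;
    const float alpha = 2.0f;
    float *h_x = new float[N], *h_y = new float[N];
    for (int i = 0; i < N; i++) { h_x[i] = 1.0f; h_y[i] = 3.0f; }

    float *d_x, *d_y;
    cudaMalloc(&d_x, N * sizeof(float));
    cudaMalloc(&d_y, N * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);

    // cublasSetVector copies host -> device (stride 1 on both sides)
    cublasSetVector(N, sizeof(float), h_x, 1, d_x, 1);
    cublasSetVector(N, sizeof(float), h_y, 1, d_y, 1);

    // y = alpha * x + y
    cublasSaxpy(handle, N, &alpha, d_x, 1, d_y, 1);

    cublasGetVector(N, sizeof(float), d_y, 1, h_y, 1);
    printf("y[0] = %f\n", h_y[0]);  // 2*1 + 3 = 5

    cublasDestroy(handle);
    cudaFree(d_x); cudaFree(d_y);
    delete[] h_x; delete[] h_y;
    return 0;
}
```

Compile with `nvcc vector_add_cublas.cu -o vector_add -lcublas`, as shown in the build instructions below.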
cuda_gpu_for_beginner/
├── 01_vector_addition/ # baseline → grid_stride/vectorized → pinned → unified memory
├── 02_matrix_mul/ # baseline → alignment → restrict → rectangular → nonMultiple → tiled (1D, 2D)
├── 03_sum_reduction/ # diverged → bank_conflicts → reduce_idle → no_conflicts → device_function → cooperative_groups
├── 04_histogram/ # global_atomic → shmem_atomic
├── 05_convolution/ # 1d_naive → 1d_cache → 1d_constant_memory → 1d_tiled → 2d_constant_memory
├── 06_cuBLAS/ # SAXPY (vector_add_cublas.cu), batched matmul (matrix_mul_cublas.cu)
├── cuda_timing_Nick/ # Performance benchmarking framework
├── cuda_timing_ZDSJTU/ # Timing instrumentation for comparison study
├── misc/ # OpenACC, clock, query_device
└── learning_notes # nsys profiling notes
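Each module's progression starts from a baseline like the one below. This is a sketch of the typical starting point of `01_vector_addition/` (one thread per element, explicit copies), not the repo's exact file:

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Baseline: one thread per output element, explicit host<->device copies
__global__ void vectorAdd(const int *a, const int *b, int *c, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)  // boundary check for non-multiple sizes
        c[tid] = a[tid] + b[tid];
}

int main() {
    const int N = 1 << 16;
    const size_t bytes = N * sizeof(int);
    int *h_a = new int[N], *h_b = new int[N], *h_c = new int[N];
    for (int i = 0; i < N; i++) { h_a[i] = i; h_b[i] = 2 * i; }

    int *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (N + threads - 1) / threads;  // ceiling division
    vectorAdd<<<blocks, threads>>>(d_a, d_b, d_c, N);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) assert(h_c[i] == 3 * i);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    delete[] h_a; delete[] h_b; delete[] h_c;
    return 0;
}
```

The later variants in the module (grid-stride, vectorized, pinned, unified memory) all start from this shape and change how memory is allocated, moved, or accessed.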
Every topic in this repository comes with a video walkthrough explaining the code, CUDA concepts, and optimization techniques step by step. Watch the full playlist on YouTube: CUDA GPU for Beginners
| # | Video | Link |
|---|---|---|
| 1 | vector add | Watch |
| 2 | vectorAdd um baseline | Watch |
| 2.5 | vectorAdd grid stride and vectorized memory access | Watch |
| 3 | vectorAdd um prefetch | Watch |
| 4 | vectorAdd pinned | Watch |
| 5 | mmul baseline | Watch |
| 6 | mmul tile | Watch |
| 7 | gpu architecture | Watch |
| 8 | mmul alignment | Watch |
| 9 | cublas vectorAdd | Watch |
| 10 | cublas matrix mul | Watch |
| 11 | sum reduction diverged | Watch |
| 12 | sum reduction bank conflicts | Watch |
| 13 | some questions about bank conflicts | Watch |
| 14 | sum reduction no conflicts | Watch |
| 15 | sum reduction reduce idle | Watch |
| 16 | sum reduction device function | Watch |
| 17 | sum reduction cooperative groups | Watch |
| 18 | visual studio build cuda project | Watch |
| 19 | vectorAdd baseline profiling | Watch |
| 20 | Nsight Systems vs Nsight Compute | Watch |
| 21 | gpu profiling scripts for mmul and sumReduction | Watch |
| 21.5 | gpu profiling results for mmul and sumReduction | Watch |
| 22 | 1d convolution naive | Watch |
| 23 | GPU concepts recap | Watch |
| 24 | 1d convolution constant memory | Watch |
| 25 | 1d convolution shared memory | Watch |
| 26 | 1d convolution cache simplification | Watch |
| 27 | 2d convolution | Watch |
| 28 | short summary thinking spatially | Watch |
| 29 | histogram global atomic | Watch |
| 30 | histogram shmem atomic | Watch |
| 31 | matrix multiplication demo | Watch |
| 32 | OpenACC | Watch |
| 33 | gpu device properties | Watch |
| 34 | clock function | Watch |
- Grid-stride loops & vectorized memory access (`int4` for 128-bit loads)
- Pinned memory (DMA direct access, eliminates double-copy penalty)
- Unified memory (`cudaMallocManaged`, automatic migration)
- Shared memory tiling (`__shared__` for data reuse, 1D & 2D variants)
- Non-square & non-multiple matrices (ceiling division, boundary checks)
- Bank conflicts (sequential vs. strided shared memory access)
- Warp-level optimization (unrolled last warp with `warpReduce()`)
- Cooperative groups (modern sync API, `int4` vectorized loads, `atomicAdd`)
- Constant memory (`__constant__` for read-only mask/coefficient data)
- cuBLAS v2 API (column-major layout, `cublasSetVector`, cuRAND)
- OpenACC (directive-based GPU programming)
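The first two concepts combine naturally. A sketch of a vector-add kernel that uses a grid-stride loop over `int4` (128-bit) loads, assuming the element count is a multiple of 4:

```cuda
// n4 = n / 4; call with the int arrays reinterpret_cast to int4*
__global__ void vectorAddVec4(const int4 *a, const int4 *b, int4 *c, int n4) {
    // Grid-stride loop: each thread processes multiple int4 elements,
    // so the grid size need not match the problem size.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n4;
         i += blockDim.x * gridDim.x) {
        int4 va = a[i], vb = b[i];  // single 128-bit load per operand
        c[i] = make_int4(va.x + vb.x, va.y + vb.y, va.z + vb.z, va.w + vb.w);
    }
}
```

Any tail elements when `n` is not a multiple of 4 would need a scalar cleanup loop, which the sketch omits.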
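Likewise, the unrolled-last-warp idea from the sum-reduction module can be sketched as follows (classic volatile-pointer variant for a 256-thread block; modern code would use `__shfl_down_sync` instead):

```cuda
// Once only 32 active threads remain they form a single warp, so the
// explicit __syncthreads() barriers of the main loop can be dropped.
__device__ void warpReduce(volatile int *sdata, int tid) {
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

__global__ void sumReduction(const int *in, int *out, int n) {
    __shared__ int sdata[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? in[i] : 0;
    __syncthreads();

    // Sequential addressing: avoids shared-memory bank conflicts
    for (int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid < 32) warpReduce(sdata, tid);
    if (tid == 0) out[blockIdx.x] = sdata[0];  // one partial sum per block
}
```

The partial sums per block are then reduced again (a second kernel launch or a host-side loop), which is the pattern the reduction videos walk through.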
There is no build system (no Makefile or CMake); each `.cu` file is compiled individually:
```bash
# Standard compilation
nvcc source.cu -o executable

# cuBLAS projects (require linking)
nvcc vector_add_cublas.cu -o vector_add -lcublas
nvcc matrix_mul_cublas.cu -o matrix_mul -lcublas -lcurand
```

This repository is released under both of the following licenses. All code in this repository (including modifications to the original examples) must comply with both licenses simultaneously:
- GPL-3.0 — inherited from the upstream CoffeeBeforeArch/cuda_programming repository. See LICENSE.
- PolyForm Strict 1.0.0 — additional restriction applied to the entire repository by the maintainer. See LICENSE-POLYFORM.
In summary: all code in this repository is source-available for personal and non-commercial use only. You may study, experiment with, and learn from this code, but commercial use is not permitted without explicit written permission from the maintainer. Both licenses must be respected when using any part of this repository.
- CoffeeBeforeArch (Nick) for the original "CUDA Crash Course (v3)" YouTube series and codebase
- The CUDA community for invaluable learning resources