CUDA GPU Programming for Beginners

Chinese version

Overview

A hands-on CUDA learning repository covering fundamental GPU programming concepts — from vector addition to cuBLAS. This project originated as a fork of CoffeeBeforeArch's cuda_programming (Nick's "CUDA Crash Course v3" YouTube series) and has since evolved into an independent, restructured repository with significant additions.

Relationship to the Original Repository

Origin

This repository was initially forked from CoffeeBeforeArch/cuda_programming (941+ stars, GPL-3.0 licensed). The original repo contains code examples from Nick's excellent YouTube tutorial series on GPGPU programming with CUDA.

An intermediate fork existed at ruohai0925/cuda_programming before being reorganized into this standalone repository.

What We Build Upon from Nick's Original

  • Core CUDA topics (vector addition, matrix multiplication, sum reduction, histogram, convolution) — all code has been modified and enhanced
  • The progressive optimization approach (baseline → optimized variants)
  • GPL-3.0 license (see LICENSE)

What's New in This Repository

| Area | Original (Nick) | This Repo |
|------|-----------------|-----------|
| Structure | Flat numbered directories (01–05) | Same core structure plus additional modules (`06_cuBLAS`, `misc`, timing) |
| cuBLAS | Not included | `06_cuBLAS/`: SAXPY with `cublasSetVector`, batched matrix multiplication with cuRAND |
| Performance benchmarking | No timing framework | `cuda_timing_Nick/`: timing harness for matrix multiplication (naive, coalesced, prefetch, tmp_var, unroll) and sum reduction |
| Comparative study | N/A | `cuda_timing_ZDSJTU/`: subset of examples with timing instrumentation for performance comparison |
| Misc examples | N/A | `misc/`: OpenACC matrix multiplication, clock-based timing, device query utility |
| Learning notes | N/A | `learning_notes`: profiling notes with `nsys profile` commands and example output |
| Documentation | Minimal README | Detailed README (EN/CN), CLAUDE.md with build commands, code patterns, and key concepts |
| Code comments | Minimal | Enhanced comments and annotations for self-study |

Repository Structure

```
cuda_gpu_for_beginner/
├── 01_vector_addition/      # baseline → grid_stride/vectorized → pinned → unified memory
├── 02_matrix_mul/           # baseline → alignment → restrict → rectangular → nonMultiple → tiled (1D, 2D)
├── 03_sum_reduction/        # diverged → bank_conflicts → reduce_idle → no_conflicts → device_function → cooperative_groups
├── 04_histogram/            # global_atomic → shmem_atomic
├── 05_convolution/          # 1d_naive → 1d_cache → 1d_constant_memory → 1d_tiled → 2d_constant_memory
├── 06_cuBLAS/               # SAXPY (vector_add_cublas.cu), batched matmul (matrix_mul_cublas.cu)
├── cuda_timing_Nick/        # Performance benchmarking framework
├── cuda_timing_ZDSJTU/      # Timing instrumentation for comparison study
├── misc/                    # OpenACC, clock, query_device
└── learning_notes           # nsys profiling notes
```

Video Tutorials

Every topic in this repository comes with a video walkthrough explaining the code, CUDA concepts, and optimization techniques step by step. Watch the full playlist on YouTube: CUDA GPU for Beginners

| # | Video | Link |
|---|-------|------|
| 1 | vector add | Watch |
| 2 | vectorAdd um baseline | Watch |
| 2.5 | vectorAdd grid stride and vectorized memory access | Watch |
| 3 | vectorAdd um prefetch | Watch |
| 4 | vectorAdd pinned | Watch |
| 5 | mmul baseline | Watch |
| 6 | mmul tile | Watch |
| 7 | gpu architecture | Watch |
| 8 | mmul alignment | Watch |
| 9 | cublas vectorAdd | Watch |
| 10 | cublas matrix mul | Watch |
| 11 | sum reduction diverged | Watch |
| 12 | sum reduction bank conflicts | Watch |
| 13 | some questions about bank conflicts | Watch |
| 14 | sum reduction no conflicts | Watch |
| 15 | sum reduction reduce idle | Watch |
| 16 | sum reduction device function | Watch |
| 17 | sum reduction cooperative groups | Watch |
| 18 | visual studio build cuda project | Watch |
| 19 | vectorAdd baseline profiling | Watch |
| 20 | Nsight Systems vs Nsight Compute | Watch |
| 21 | gpu profiling scripts for mmul and sumReduction | Watch |
| 21.5 | gpu profiling results for mmul and sumReduction | Watch |
| 22 | 1d convolution naive | Watch |
| 23 | GPU concepts recap | Watch |
| 24 | 1d convolution constant memory | Watch |
| 25 | 1d convolution shared memory | Watch |
| 26 | 1d convolution cache simplification | Watch |
| 27 | 2d convolution | Watch |
| 28 | short summary thinking spatially | Watch |
| 29 | histogram global atomic | Watch |
| 30 | histogram shmem atomic | Watch |
| 31 | matrix multiplication demo | Watch |
| 32 | OpenACC | Watch |
| 33 | gpu device properties | Watch |
| 34 | clock function | Watch |
Key Concepts Covered

  • Grid-stride loops & vectorized memory access (int4 for 128-bit loads)
  • Pinned memory (DMA direct access, eliminates double-copy penalty)
  • Unified memory (cudaMallocManaged, automatic migration)
  • Shared memory tiling (__shared__ for data reuse, 1D & 2D variants)
  • Non-square & non-multiple matrices (ceiling division, boundary checks)
  • Bank conflicts (sequential vs. strided shared memory access)
  • Warp-level optimization (unrolled last warp with warpReduce())
  • Cooperative groups (modern sync API, int4 vectorized loads, atomicAdd)
  • Constant memory (__constant__ for read-only mask/coefficient data)
  • cuBLAS v2 API (column-major layout, cublasSetVector, cuRAND)
  • OpenACC (directive-based GPU programming)
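To make the first two items concrete, here is a minimal sketch (not taken from the repository; the kernel name and launch parameters are illustrative) combining a grid-stride loop, `int4` vectorized 128-bit loads, and unified memory via `cudaMallocManaged`. It assumes the element count is a multiple of 4; the repository's examples handle remainders explicitly.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride vector add with int4 (four ints = one 128-bit load/store).
__global__ void vectorAddVec4(const int4 *a, const int4 *b, int4 *c, int n4) {
    // Grid-stride loop: each thread processes multiple elements,
    // so any grid size covers any input size.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n4;
         i += gridDim.x * blockDim.x) {
        int4 x = a[i], y = b[i];  // two 128-bit loads
        c[i] = make_int4(x.x + y.x, x.y + y.y, x.z + y.z, x.w + y.w);
    }
}

int main() {
    const int N = 1 << 20;  // element count, multiple of 4
    int *a, *b, *c;
    cudaMallocManaged(&a, N * sizeof(int));  // unified memory: host and
    cudaMallocManaged(&b, N * sizeof(int));  // device share these pointers
    cudaMallocManaged(&c, N * sizeof(int));
    for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2 * i; }

    int threads = 256, blocks = 128;  // grid need not cover N: strided loop
    vectorAddVec4<<<blocks, threads>>>((int4 *)a, (int4 *)b, (int4 *)c, N / 4);
    cudaDeviceSynchronize();

    printf("c[7] = %d\n", c[7]);  // expect 7 + 2*7 = 21
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The `int4` cast is safe here because `cudaMallocManaged` returns pointers with at least 16-byte alignment.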

Build

There is no build system (no Makefile or CMake); each `.cu` file is compiled individually:

```shell
# Standard compilation
nvcc source.cu -o executable

# cuBLAS examples (require linking)
nvcc vector_add_cublas.cu -o vector_add -lcublas
nvcc matrix_mul_cublas.cu -o matrix_mul -lcublas -lcurand
```
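As a rough illustration of the cuBLAS pattern used in `06_cuBLAS` (along the lines of `vector_add_cublas.cu`, though this sketch is not the repository's code), SAXPY computes `y = alpha * x + y`, with `cublasSetVector`/`cublasGetVector` handling host-device transfers. Build it with the `-lcublas` line above.

```cuda
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int n = 1024;
    static float h_x[1024], h_y[1024];
    for (int i = 0; i < n; ++i) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Copy host -> device (element size, stride 1 on both sides)
    cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);
    cublasSetVector(n, sizeof(float), h_y, 1, d_y, 1);

    const float alpha = 3.0f;
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);  // y = 3*x + y

    // Copy the result back and check one element
    cublasGetVector(n, sizeof(float), d_y, 1, h_y, 1);
    printf("y[0] = %.1f\n", h_y[0]);  // expect 3*1 + 2 = 5.0

    cublasDestroy(handle);
    cudaFree(d_x); cudaFree(d_y);
    return 0;
}
```

Note that cuBLAS uses column-major layouts for matrices; for a 1-D vector like this, only the stride arguments matter.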

License

This repository is released under both of the following licenses. All code in this repository (including modifications to the original examples) must comply with both licenses simultaneously:

  1. GPL-3.0 — inherited from the upstream CoffeeBeforeArch/cuda_programming repository. See LICENSE.

  2. PolyForm Strict 1.0.0 — additional restriction applied to the entire repository by the maintainer. See LICENSE-POLYFORM.

In summary: All code in this repository is source-available for personal and non-commercial use only. You may study, experiment, and learn from this code, but commercial use is not permitted without explicit written permission from the maintainer. Both licenses must be respected when using any part of this repository.

Acknowledgments

  • CoffeeBeforeArch (Nick) for the original "CUDA Crash Course (v3)" YouTube series and codebase
  • The CUDA community for invaluable learning resources
