We implemented a simple neural network in CUDA and trained it on a classification problem. We also implemented a similar PyTorch model to compare against our implementation. The network is built from the following components:
- Linear Layer
- Activation function (ReLU)
- Numerically stable softmax
- NLL Loss
- SGD optimizer
Later, we merged the softmax with the NLL loss into a single kernel (cross-entropy kernel).
- We implemented log-softmax from the beginning to ensure numerical stability, subtracting the maximum from the input to avoid overflow (a kernel sketch follows below): $\mathrm{logSoftmax}(x_i) = (x_i - \max_k x_k) - \log\sum_{j=0}^{n} e^{x_j - \max_k x_k}$
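A minimal sketch of such a kernel, assuming one thread per row (batch element); the names are illustrative and our actual kernel may organize the row reduction differently:

```cuda
// Hypothetical sketch of a numerically stable log-softmax kernel:
// one thread per row (batch element), C = number of classes.
__global__ void log_softmax_kernel(const float* x, float* out, int B, int C) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= B) return;

    const float* xi = x + row * C;
    float* oi = out + row * C;

    // 1) find the row maximum to avoid overflow in exp()
    float m = xi[0];
    for (int j = 1; j < C; ++j) m = fmaxf(m, xi[j]);

    // 2) accumulate the sum of exp(x_j - max)
    float sum = 0.0f;
    for (int j = 0; j < C; ++j) sum += expf(xi[j] - m);

    // 3) logSoftmax(x_i) = (x_i - max) - log(sum)
    float logsum = logf(sum);
    for (int j = 0; j < C; ++j) oi[j] = (xi[j] - m) - logsum;
}
```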
- For forward propagation:
  - Linear forward kernel
  - ReLU forward kernel
  - Log-softmax kernel
  - NLL loss kernel
  - Cross-entropy kernel
- For backward propagation:
  - Cross-entropy backward kernel
  - Linear backward kernel
  - ReLU backward kernel
- SGD optimizer kernel
- Common helper kernels:
  - Three kernels for reduction (Sum, Mean); a minimal sketch follows this list
  - Five kernels for matrix multiplication
  - Reduce-on-axis kernel
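As an illustration of the reduction helpers mentioned above, here is a minimal shared-memory sum-reduction sketch; the kernel name and launch configuration are illustrative, not our exact implementation:

```cuda
// Tree reduction in shared memory: each block reduces blockDim.x elements
// to one partial sum; a second launch (or a final CPU pass) combines the
// partial sums. Assumes blockDim.x is a power of two.
__global__ void reduce_sum_kernel(const float* in, float* partial, int n) {
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // halve the number of active threads at each step
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = sdata[0];
}

// Example launch: reduce_sum_kernel<<<blocks, 256, 256 * sizeof(float)>>>(d_in, d_partial, n);
```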
Peak floating-point throughput and memory bandwidth of each device used in our implementations:
- GPU: RTX 3050 Ti
  - 5.299 TFLOPS
  - Memory bandwidth: 192.0 GB/s
- CPU: i7-11800
  - 4.6 GHz
  - 73.6 GFLOPS per core
  - RAM bandwidth: 51.1 GB/s
For B = 10000, C = 10000, and M = N = L = 10000, the theoretical per-layer times are:
| Layer | Compute complexity | Memory complexity (reads/writes) | CPU time | GPU time (compute + memory) | Theoretical speedup |
|---|---|---|---|---|---|
| softmax | 5BC | (N·C / N·C) × 4 B | 22.47 ms | 0.0943 ms + 4.167 ms = 4.26 ms | x5.274 |
| NLLLoss | B | ((N·C + N) / N) × 4 B | 7.84 ms | 0 + 0.0212 ms = 0.0212 ms | x369.8 |
| Cross Entropy | 5BC + B | ((N·C + N) / (N·C + N)) × 4 B | 22.417 ms | 0.094 ms + 4.1 ms = 4.194 ms | x5.34 |
| Matrix multiplication | 2MNL | ((ML + LN) / MN) × 4 B | 27201 ms | 377.42 ms + 6.25 ms = 383.67 ms | x70.8 (the operation is heavily arithmetic-bound, hence the large speedup) |
| Array reduction | B | (N / 1) × 4 B | 0.0009 ms | 0 + 0.0002 ms = 0.0002 ms | ~no speedup |
| ReLU | BN | (BN / BN) × 4 B | 17.04 ms | 0.0188 ms + 4.1667 ms = 4.185 ms | x4 |
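As a worked example of how these theoretical entries can be reproduced (assuming each time is estimated as FLOPs divided by peak throughput plus bytes moved divided by bandwidth), the matrix-multiplication row gives approximately:

$$t_{\text{CPU}} \approx \frac{2MNL}{73.6\ \text{GFLOPS}} + \frac{(ML + LN + MN)\cdot 4\ \text{B}}{51.1\ \text{GB/s}} \approx 27.17\ \text{s} + 23.5\ \text{ms} \approx 27.2\ \text{s}$$

$$t_{\text{GPU}} \approx \frac{2MNL}{5.299\ \text{TFLOPS}} + \frac{(ML + LN + MN)\cdot 4\ \text{B}}{192\ \text{GB/s}} \approx 377.4\ \text{ms} + 6.25\ \text{ms} \approx 383.7\ \text{ms}$$

giving a theoretical speedup of about 27.2 s / 383.7 ms ≈ x71, which matches the x70.8 entry in the table.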
Measured timings for our kernels, CPU implementation vs. GPU:

| Layer | CPU | GPU | Speedup |
|---|---|---|---|
| softmax | 1327.024048 ms | 84.1993 ms | ~x16 |
| NLLLoss | 0.225000 ms | 0.0155 ms | ~x15 |
| Cross Entropy | 1430.562988 ms | 31.2810 ms | ~x45 |
| ReLU | 790.794983 ms | 0.0074 ms | ~x106864 |
| Linear Layer (2048^3) | 28367.611328 ms | 49.7811 ms | ~x570 |
| Array Reduction | 0.027000 ms | 0.017562 ms | ~x2 |
| SGD | 0.022000 ms | 0.0070 ms | ~x3 |
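For reference, the kind of harness that produces such measurements looks roughly like the sketch below (CUDA events for GPU kernels, std::chrono for the CPU path); this is illustrative, not necessarily our exact benchmarking code:

```cuda
#include <cuda_runtime.h>
#include <chrono>

// Time a GPU kernel launch with CUDA events (returns milliseconds).
float time_gpu(void (*launch_kernel)()) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    launch_kernel();                 // enqueue the kernel under test
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);      // wait until the kernel has finished
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Time the equivalent CPU implementation (returns milliseconds).
double time_cpu(void (*run_cpu_layer)()) {
    auto t0 = std::chrono::high_resolution_clock::now();
    run_cpu_layer();
    auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```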
Forward-pass comparison between our CUDA kernels and the equivalent PyTorch operations:

| Layer | CUDA forward | PyTorch forward |
|---|---|---|
| softmax | 84.1993 ms | 8.226 ms |
| NLLLoss | 0.0155 ms | 0.291530 ms |
| Cross Entropy | 31.2810 ms | 8.053 ms |
| Linear Layer | 49.7811 ms | 5.419 ms |
| ReLU | 0.0074 ms | 4.551 ms |
Measured backward-pass timings, CPU vs. GPU:

| Layer | CPU | GPU | Speedup |
|---|---|---|---|
| Cross Entropy | ≈0 ms | 0.0050 ms | overhead (GPU launch cost dominates) |
| Linear Backward (2048³) | did not finish | 98.2392 ms | ∞ |
| ReLU | 860.95 ms | 0.0103 ms | ~x83587 |
Several factors can contribute to the observed speedup being lower than the theoretical one:
- Memory Transfer Overhead: Moving data between the CPU and GPU can incur significant latency. Optimizing data transfer by minimizing the frequency and size of transfers can help.
- Kernel Launch Overhead: The time taken to launch GPU kernels can affect performance. Overlapping data transfer with computation (using techniques such as CUDA streams) can help.
- Suboptimal GPU Utilization: Not fully utilizing the GPU's computational units can reduce performance. Ensuring that the workload is large enough and properly distributed across the GPU can improve utilization.
- Algorithm Optimization: The specific algorithm and its implementation can have a significant impact. Optimizing the algorithm for parallel execution and leveraging GPU-specific libraries can enhance performance.
To achieve a better speedup:
- Optimize Data Transfer: Use pinned memory and asynchronous data transfers (sketched after this list).
- Kernel Fusion: Combine multiple small kernels into a larger one to reduce launch overhead.
- Algorithm Tuning: Refine the algorithm to better exploit GPU architecture.
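A minimal sketch of the first point, using pinned host memory plus asynchronous copies on a stream; the helper and buffer names are hypothetical:

```cuda
#include <cuda_runtime.h>
#include <cstring>

// Hypothetical helper: stage a mini-batch through a pinned (page-locked)
// host buffer and copy it to the GPU asynchronously so the transfer can
// overlap with compute running on other streams.
void upload_batch_async(const float* src, float* h_pinned, float* d_batch,
                        size_t bytes, cudaStream_t stream) {
    std::memcpy(h_pinned, src, bytes);                // fill the pinned staging buffer
    cudaMemcpyAsync(d_batch, h_pinned, bytes,
                    cudaMemcpyHostToDevice, stream);  // asynchronous H2D copy
}

// One-time setup (error checking omitted):
//   cudaMallocHost(&h_pinned, bytes);   // pinned host memory, required for true overlap
//   cudaMalloc(&d_batch, bytes);
//   cudaStreamCreate(&stream);
```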
- Each layer's file contains the following:
  - The kernel and its kernel launcher
  - An equivalent CPU implementation of the layer
  - A main function to test the layer, benchmark the speed difference, and check that the GPU and CPU produce the same results
  - Code that writes the inputs and outputs to .npy files so they can be verified in Python
- ModelLayers.hpp
  - A class containing the CPU implementations of the layers, collected from the separate files, as static void member functions
- Kernels.cuh
  - A CUDA header file containing all layer kernels, collected from the separate files
- KernelLauncher.cuh
  - A CUDA header with a class whose static void member functions launch each of the kernels in Kernels.cuh
- ModelMemoryHandler.hpp (a simplified sketch follows this list)
  - A class that handles everything related to memory:
    - Allocating/deallocating the layers' memory
    - A to_cuda() function that mimics PyTorch's .cuda() to move our model from CPU to GPU
  - It contains two essential structs:
    - Parameter: stores the linear layers' parameters
    - Activation: stores each layer's output
- Main_training.cu: trains the GPU model
- Main_training_merged_cross.cu
  - Trains the GPU model, but replaces softmax followed by NLLLoss with the fused cross-entropy kernel
- Main_training_cpu.cpp: trains the CPU model
- Main_training_cpu_merged_cross.cu
  - Trains the CPU model, but replaces softmax followed by NLLLoss with the fused cross-entropy kernel
- Mat_muls_kernels.cu: contains all matrix-multiplication kernels
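To make the memory-handling part concrete, here is a heavily simplified sketch of the idea behind ModelMemoryHandler.hpp, reduced to a single layer; apart from Parameter, Activation, and to_cuda, which come from the description above, the fields and helpers are hypothetical:

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

// Parameter: weights/bias of one linear layer.
struct Parameter {
    float* W = nullptr;   // [out_features x in_features]
    float* b = nullptr;   // [out_features]
    int in_features = 0, out_features = 0;
};

// Activation: output buffer of one layer for the current batch.
struct Activation {
    float* out = nullptr; // [batch x out_features]
};

class ModelMemoryHandler {
public:
    // Allocate host-side parameter memory (simplified to one layer).
    void allocate_cpu(Parameter& p) {
        p.W = (float*)std::malloc(sizeof(float) * p.in_features * p.out_features);
        p.b = (float*)std::malloc(sizeof(float) * p.out_features);
    }

    // Mimics PyTorch's .cuda(): copy parameters into freshly allocated device memory.
    void to_cuda(Parameter& p) {
        float *dW, *db;
        cudaMalloc(&dW, sizeof(float) * p.in_features * p.out_features);
        cudaMalloc(&db, sizeof(float) * p.out_features);
        cudaMemcpy(dW, p.W, sizeof(float) * p.in_features * p.out_features, cudaMemcpyHostToDevice);
        cudaMemcpy(db, p.b, sizeof(float) * p.out_features, cudaMemcpyHostToDevice);
        std::free(p.W); std::free(p.b);   // simplification: drop the host copy
        p.W = dW; p.b = db;               // pointers now reference device memory
    }
};
```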
We apply kernel fusion in the following places:
- Merging softmax with NLLLoss into a single cross-entropy kernel in the forward propagation, which reduces the time from 84.2 ms to 31 ms.
- Merging softmax with NLLLoss into a single cross-entropy kernel in the backward propagation, which also simplifies the algorithm: instead of computing the softmax Jacobian, we simply subtract the one-hot target from the softmax output saved during the forward pass (see the sketch below).
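A minimal sketch of the fused backward kernel, using one thread per (sample, class) element; the names are illustrative and may differ from our actual kernel:

```cuda
// Fused cross-entropy backward: d_logits = (softmax - one_hot(target)) / B.
// `log_probs` is the log-softmax saved in the forward pass, so no Jacobian
// is ever materialized.
__global__ void cross_entropy_backward_kernel(const float* log_probs,
                                              const int* targets,
                                              float* d_logits,
                                              int B, int C) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= B * C) return;
    int row = idx / C;
    int col = idx % C;
    float p = expf(log_probs[idx]);                 // softmax probability
    float y = (col == targets[row]) ? 1.0f : 0.0f;  // one-hot target
    d_logits[idx] = (p - y) / (float)B;             // mean reduction over the batch
}
```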
For streaming, the only change is that while one mini-batch runs its forward and backward passes, the next mini-batch loads its data into GPU memory on a second stream and then waits until the weights have been updated from the previous mini-batch.
This results in the following timeline, where the green parts are the data loads issued from the other stream.
One way to improve this approach would be to use more than two streams, each loading its data while waiting for the weight updates of the previous batches. However, we would then be limited by the GPU's VRAM, and the batch size would have to be smaller, so it is a trade-off.
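A minimal sketch of the two-stream scheme described above, using double-buffered device batches; the helper names, buffer handling, and the commented kernel launches are hypothetical:

```cuda
#include <cuda_runtime.h>

// `compute` runs forward/backward/SGD for batch i while `copy` prefetches
// batch i+1 into the other device buffer. copied[j]: buffer j holds fresh
// data; freed[j]: the batch that used buffer j has finished, so it may be
// overwritten. next_host_batch(i) is assumed to return pinned host memory.
void train_with_streams(float* d_batch[2], size_t bytes, int num_batches,
                        float* (*next_host_batch)(int)) {
    cudaStream_t compute, copy;
    cudaEvent_t copied[2], freed[2];
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copy);
    for (int j = 0; j < 2; ++j) { cudaEventCreate(&copied[j]); cudaEventCreate(&freed[j]); }

    // prefetch batch 0 into buffer 0
    cudaMemcpyAsync(d_batch[0], next_host_batch(0), bytes, cudaMemcpyHostToDevice, copy);
    cudaEventRecord(copied[0], copy);

    for (int i = 0; i < num_batches; ++i) {
        int cur = i % 2, nxt = (i + 1) % 2;

        // prefetch the NEXT batch on the copy stream as soon as its buffer is free
        if (i + 1 < num_batches) {
            cudaStreamWaitEvent(copy, freed[nxt], 0);   // don't overwrite data still in use
            cudaMemcpyAsync(d_batch[nxt], next_host_batch(i + 1), bytes,
                            cudaMemcpyHostToDevice, copy);
            cudaEventRecord(copied[nxt], copy);
        }

        // train on the CURRENT batch once its copy has finished; the SGD update is
        // ordered before the next forward pass because all training kernels share `compute`.
        cudaStreamWaitEvent(compute, copied[cur], 0);
        // forward_kernels<<<grid, block, 0, compute>>>(d_batch[cur], ...);
        // backward_kernels<<<grid, block, 0, compute>>>(...);
        // sgd_update<<<grid, block, 0, compute>>>(...);
        cudaEventRecord(freed[cur], compute);           // buffer `cur` may now be reused
    }
    cudaStreamSynchronize(compute);
    cudaStreamDestroy(compute);
    cudaStreamDestroy(copy);
}
```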
- PyTorch: total time 6662.16 ms
- Without streaming: total time 982.333435 ms
- With streaming: total time 922.212280 ms

Streaming gives a ~6% speedup.


