This project focuses on implementing the Breadth First Search (BFS) algorithm using CUDA, a parallel computing platform, to achieve significant performance improvements compared to a simple sequential implementation. The primary goal is to explore the potential of GPU acceleration for graph traversal tasks.
For more information on the approach and its benefits, refer to the NVIDIA paper *Scalable GPU Graph Traversal* by D. Merrill, M. Garland, and A. Grimshaw.
To compile and run the code, you need to set the USE_HOST compilation flag to specify whether you want to run the algorithm sequentially (set to 1) or in parallel on a CUDA-enabled GPU (set to 0).
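For example, the flag can be passed at compile time. The file and binary names below are illustrative; substitute the actual source file used in this repository:

```bash
# Parallel (GPU) build
nvcc -DUSE_HOST=0 -O2 bfs.cu -o bfs

# Sequential (host) build
nvcc -DUSE_HOST=1 -O2 bfs.cu -o bfs_host
```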
Here are the steps to run the code:
- Open the Google Colab notebook provided in this repository and initialize the environment.
- Load the desired test cases from the `test` folder, or create your own test cases if needed.
- Run the CUDA program to execute the BFS algorithm.
Please note that a CUDA-enabled GPU is required to run the parallel version of the algorithm effectively.
The parallel implementation of the BFS algorithm noticeably outperforms the sequential version. One particularly interesting aspect is the performance improvement gained by staging the discovered neighborhood nodes (the next_queue) in each block's shared memory before writing them out, instead of appending directly to global memory as is commonly described in the literature.
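The sketch below illustrates the idea under some assumptions: a CSR graph layout (`row_offsets`, `col_indices`), a `distances` array initialized to -1, and the names `bfs_expand`, `SHARED_QUEUE_CAP`, and `BLOCK_SIZE` are all hypothetical and not taken from this repository's code. It is not the project's implementation, only a minimal illustration of the shared-memory staging technique.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 256
#define SHARED_QUEUE_CAP 1024  // per-block staging capacity (assumption)

__global__ void bfs_expand(const int *row_offsets, const int *col_indices,
                           int *distances, int level,
                           const int *curr_queue, int curr_size,
                           int *next_queue, int *next_size) {
    // Block-local staging buffer in shared memory for newly discovered nodes.
    __shared__ int s_queue[SHARED_QUEUE_CAP];
    __shared__ int s_count;        // nodes staged by this block
    __shared__ int s_global_base;  // this block's slice in the global next_queue

    if (threadIdx.x == 0) s_count = 0;
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < curr_size) {
        int v = curr_queue[tid];
        for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
            int u = col_indices[e];
            // Claim unvisited neighbors; -1 marks "not yet visited".
            if (atomicCAS(&distances[u], -1, level + 1) == -1) {
                int pos = atomicAdd(&s_count, 1);  // cheap shared-memory atomic
                if (pos < SHARED_QUEUE_CAP) {
                    s_queue[pos] = u;
                } else {
                    // Staging buffer full: fall back to a global append.
                    next_queue[atomicAdd(next_size, 1)] = u;
                }
            }
        }
    }
    __syncthreads();

    // A single global atomic per block reserves room for the whole staged batch.
    if (threadIdx.x == 0) {
        s_global_base = atomicAdd(next_size, min(s_count, SHARED_QUEUE_CAP));
    }
    __syncthreads();

    // Threads cooperatively copy the staged nodes to the global next_queue.
    int staged = min(s_count, SHARED_QUEUE_CAP);
    for (int i = threadIdx.x; i < staged; i += blockDim.x) {
        next_queue[s_global_base + i] = s_queue[i];
    }
}
```

The benefit comes from replacing one global atomic per discovered neighbor with fast shared-memory atomics plus a single global atomic per block, followed by a coalesced write of the staged nodes.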
This project is inspired by the aforementioned research paper and aims to demonstrate the advantages of utilizing CUDA for graph traversal tasks. Feel free to explore and adapt the code to your specific requirements and use cases.