_Kernel Launcher_ is a C++ library that enables dynamic compilation of _CUDA_ kernels at run time (using [NVRTC](https://docs.nvidia.com/cuda/nvrtc/index.html)) and launching them in an easy, type-safe way using C++ magic.
On top of that, Kernel Launcher supports _capturing_ kernel launches so that they can be tuned with [Kernel Tuner](https://github.com/KernelTuner/kernel_tuner), and importing the tuning results, known as _wisdom_ files, back into the application.
The result: highly efficient GPU applications with maximum portability.

Recommended installation is using CMake. See the [installation guide](https://ke

## Example

Kernel Launcher can be used in several ways. See the documentation for [examples](https://kerneltuner.github.io/kernel_launcher/example.html) or check out the [examples](https://github.com/KernelTuner/kernel_launcher/tree/master/examples) directory.

### Pragma-based API

The example below uses the pragma-based API, which lets you annotate existing CUDA kernels with Kernel-Launcher-specific directives.

```cpp
    // ...
    auto vector_add_kernel = kl::WisdomKernel(builder);

    // Initialize CUDA memory. This is outside the scope of kernel_launcher.
    unsigned int n = 1000000;
    // ...
    // derived from the kernel specifications and run-time arguments.
    vector_add_kernel(n, dev_C, dev_A, dev_B);
}
```

## License
Licensed under Apache 2.0. See [LICENSE](https://github.com/KernelTuner/kernel_launcher/blob/master/LICENSE).
## Citation

If you use Kernel Launcher in your work, please cite the following publication:

> S. Heldens, B. van Werkhoven (2023), "Kernel Launcher: C++ Library for Optimal-Performance Portable CUDA Applications", The Eighteenth International Workshop on Automatic Performance Tuning (iWAPT2023), co-located with IPDPS 2023.

As BibTeX:

```bibtex
@article{heldens2023kernellauncher,
  title={Kernel Launcher: C++ Library for Optimal-Performance Portable CUDA Applications},
```

In the previous examples, we demonstrated how a tunable kernel can be specified by defining a ``KernelBuilder`` instance in the host-side code.
While this API offers flexibility, it can be cumbersome and requires keeping the CUDA kernel code in sync with the C++ host-side code.

Kernel Launcher also lets you define kernel specifications directly in the CUDA code by annotating the kernel with pragma directives.
Although this method is less flexible than the ``KernelBuilder`` API, it is much more convenient and is suitable for most CUDA kernels.

Source Code
-----------

The following example shows valid CUDA kernel code annotated with pragma directives.
The ``nvcc`` compiler ignores these ``#pragma`` annotations, although they may produce compiler warnings.

.. literalinclude:: vector_add_annotated.cu
   :lines: 1-20
   :lineno-start: 1

Code Explanation
----------------

The kernel contains the following ``pragma`` directives:

.. literalinclude:: vector_add_annotated.cu
   :lines: 1-2
   :lineno-start: 1

The ``tune`` directives specify the tunable parameters: ``threads_per_block`` and ``items_per_thread``.
Since ``items_per_thread`` is also the name of the kernel's template parameter, its value is passed to the kernel as a compile-time constant via this parameter.
The value of ``threads_per_block`` is not passed to the kernel but is used by subsequent pragmas.
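
As an illustration of why a compile-time ``items_per_thread`` is useful, the plain C++ sketch below emulates the kernel's per-thread loop on the CPU. It is not Kernel Launcher code: ``vector_add_thread`` and ``vector_add_emulated`` are hypothetical names, and the indexing scheme (each thread processes ``items_per_thread`` elements spaced one block width apart) is an assumption that may differ from ``vector_add_annotated.cu``. Because ``ItemsPerThread`` is a template argument, the compiler knows the loop's trip count and can unroll it::

    #include <cassert>
    #include <vector>

    // Hypothetical CPU emulation of one GPU thread (not Kernel Launcher code).
    // ItemsPerThread plays the role of the kernel's template parameter: a
    // compile-time constant, so the loop bound is known when compiling.
    template <typename T, int ItemsPerThread>
    void vector_add_thread(int n, T* C, const T* A, const T* B,
                           int block_idx, int block_dim, int thread_idx) {
        static_assert(ItemsPerThread > 0, "items_per_thread must be positive");
        for (int k = 0; k < ItemsPerThread; k++) {
            // Assumed indexing: a thread's elements are block_dim apart, so
            // neighbouring threads touch neighbouring elements.
            int i = block_idx * ItemsPerThread * block_dim + k * block_dim + thread_idx;
            if (i < n) {  // the last block may be only partially filled
                C[i] = A[i] + B[i];
            }
        }
    }

    // Emulates a launch of grid_dim blocks with block_dim threads each.
    template <typename T, int ItemsPerThread>
    std::vector<T> vector_add_emulated(const std::vector<T>& A, const std::vector<T>& B,
                                       int grid_dim, int block_dim) {
        assert(A.size() == B.size() && block_dim > 0);
        int n = static_cast<int>(A.size());
        std::vector<T> C(A.size(), T{});  // elements not covered by the grid stay zero
        for (int b = 0; b < grid_dim; b++) {
            for (int t = 0; t < block_dim; t++) {
                vector_add_thread<T, ItemsPerThread>(n, C.data(), A.data(), B.data(),
                                                     b, block_dim, t);
            }
        }
        return C;
    }

For instance, ``vector_add_emulated<float, 2>(A, B, 2, 4)`` covers up to 2 × 4 × 2 = 16 elements, enough for a 10-element input, whereas a single block of 4 threads covers only 8 elements. The grid size must therefore take ``items_per_thread`` into account.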
.. literalinclude:: vector_add_annotated.cu
   :lines: 3-3
   :lineno-start: 3

The ``set`` directive defines a constant.
In this case, the constant ``items_per_block`` is defined as the product of ``threads_per_block`` and ``items_per_thread``.
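
Taken together, the ``tune`` and ``set`` directives conceptually describe a search space: each combination of tunable values is one candidate configuration, and the constant defined by ``set`` is recomputed for each of them. The sketch below enumerates such a space in plain C++. The candidate value lists are hypothetical (the real ones are on lines 1-2 of ``vector_add_annotated.cu``), and ``Config`` and ``enumerate_configs`` are illustrative names, not part of the Kernel Launcher API::

    #include <cassert>
    #include <vector>

    // One candidate configuration (illustrative, not a Kernel Launcher type).
    struct Config {
        int threads_per_block;  // tunable
        int items_per_thread;   // tunable
        int items_per_block;    // derived, mirrors the `set` directive
    };

    // Cartesian product of the tunable values. The derived constant
    // items_per_block = threads_per_block * items_per_thread is computed
    // separately for every configuration.
    std::vector<Config> enumerate_configs(const std::vector<int>& threads_per_block_values,
                                          const std::vector<int>& items_per_thread_values) {
        std::vector<Config> configs;
        for (int tpb : threads_per_block_values) {
            for (int ipt : items_per_thread_values) {
                assert(tpb > 0 && ipt > 0);
                configs.push_back(Config{tpb, ipt, tpb * ipt});
            }
        }
        return configs;
    }

With four hypothetical block sizes ``{32, 64, 128, 256}`` and three values ``{1, 2, 4}`` for ``items_per_thread``, this yields 4 × 3 = 12 configurations, ranging from 32 to 1024 items per block.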
.. literalinclude:: vector_add_annotated.cu
   :lines: 4-6
   :lineno-start: 4

The ``problem_size`` directive defines the problem size (as discussed in :doc:`basic`), ``block_size`` specifies the thread block size, and ``grid_divisor`` specifies the divisor by which the problem size is divided to obtain the thread grid size.
Alternatively, ``grid_size`` can be used to specify the grid size directly.
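
The following plain C++ sketch computes the launch geometry these directives imply. It assumes that the grid size is the problem size divided by the grid divisor, rounded up so that a partially filled last block still covers the remaining elements, and that the example passes ``threads_per_block`` to ``block_size`` and ``items_per_block`` to ``grid_divisor``. ``ceil_div`` and ``compute_launch_geometry`` are hypothetical helpers; Kernel Launcher performs this computation itself::

    #include <cassert>
    #include <cstdint>

    // Integer division rounded up: the smallest number of blocks such that
    // every one of the problem_size items is assigned to a block.
    std::int64_t ceil_div(std::int64_t problem_size, std::int64_t divisor) {
        assert(problem_size >= 0 && divisor > 0);
        return (problem_size + divisor - 1) / divisor;
    }

    struct LaunchGeometry {
        std::int64_t block_size;  // threads per block
        std::int64_t grid_size;   // number of thread blocks
    };

    // Assumed mapping: problem_size(n), block_size(threads_per_block),
    // grid_divisor(items_per_block), with items_per_block from the `set` directive.
    LaunchGeometry compute_launch_geometry(std::int64_t n,
                                           std::int64_t threads_per_block,
                                           std::int64_t items_per_thread) {
        std::int64_t items_per_block = threads_per_block * items_per_thread;
        return LaunchGeometry{threads_per_block, ceil_div(n, items_per_block)};
    }

For example, with ``n = 1000000``, 256 threads per block and 4 items per thread, each block covers 1024 elements and ``ceil_div(1000000, 1024)`` gives 977 blocks. The last block is only partly used (977 × 1024 = 1000448 > 1000000), which is why the kernel must guard its memory accesses with a bounds check.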
.. literalinclude:: vector_add_annotated.cu
   :lines: 7-7
   :lineno-start: 7

The ``buffers`` directive tells Kernel Launcher the size of each buffer argument: ``A``, ``B``, and ``C`` each contain ``n`` elements.
This is necessary because buffer arguments can be passed as raw pointers, which carry no size information.
If the ``buffers`` pragma is omitted, Kernel Launcher can still be used, but kernel launches cannot be captured.
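
The plain C++ sketch below illustrates why the element count must be supplied separately: a ``const float*`` holds only an address, so code that saves a buffer's contents (for example, to replay a captured launch later) cannot know how many elements to copy. ``BufferSnapshot`` and ``snapshot_buffer`` are hypothetical names; this is not Kernel Launcher's capture mechanism, and it only handles host memory (device buffers would first have to be copied to the host)::

    #include <cassert>
    #include <cstddef>
    #include <string>
    #include <vector>

    // Hypothetical snapshot of one buffer argument.
    template <typename T>
    struct BufferSnapshot {
        std::string name;
        std::vector<T> data;
    };

    // The element count cannot be recovered from `ptr` itself; it must be
    // supplied by the caller, which is the information the `buffers` pragma provides.
    template <typename T>
    BufferSnapshot<T> snapshot_buffer(const std::string& name, const T* ptr, std::size_t count) {
        assert(ptr != nullptr || count == 0);
        // Copy exactly `count` elements; reading past the real allocation
        // would be undefined behavior.
        return BufferSnapshot<T>{name, std::vector<T>(ptr, ptr + count)};
    }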
.. literalinclude:: vector_add_annotated.cu
   :lines: 8-8
   :lineno-start: 8

The ``tuning_key`` directive specifies the tuning key, which can be a concatenation of strings and variables.
In this example, the tuning key is ``"vector_add_" + T``, where ``T`` is replaced by the name of the type.
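
Conceptually, the tuning key is a string computed for each instantiation of the kernel, so that, for example, the ``float`` and ``double`` versions of ``vector_add`` map to different keys. The plain C++ sketch below builds such keys with a hypothetical ``type_name`` trait; the exact spelling Kernel Launcher uses for type names is not specified here and may differ::

    #include <cassert>
    #include <string>

    // Hypothetical trait that maps a C++ type to a printable name.
    template <typename T>
    struct type_name;

    template <>
    struct type_name<float> {
        static std::string get() { return "float"; }
    };

    template <>
    struct type_name<double> {
        static std::string get() { return "double"; }
    };

    // Mirrors the tuning key "vector_add_" + T: a string literal concatenated
    // with the name of the kernel's type parameter.
    template <typename T>
    std::string vector_add_tuning_key() {
        return "vector_add_" + type_name<T>::get();
    }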
Host Code
---------

The code below shows how to call the kernel from host code in C++::

    #include "kernel_launcher/pragma.h"
    namespace kl = kernel_launcher;

    void launch_vector_add(float* C, const float* A, const float* B) {