Description
Is your feature request related to a problem? Please describe.
This problem has been annoying me for years: I find that pycuda runs extremely slowly on Windows but not on Linux. My program contains ~20 ElementwiseKernels and ReductionKernels. SourceModule is used to compile the code, and it saves the cubin files to the cache_dir. This works well on every Linux machine I have tested, with only ~1 s of overhead to load the functions later. However, running my code on Windows costs ~2 min the first time, and still ~1 min on later runs. This is because it always needs to preprocess the code, since the source always contains #include <pycuda-complex.hpp>:
Lines 89 to 90 in 96aab3f:

```python
if "#include" in source:
    checksum.update(preprocess_source(source, options, nvcc).encode("utf-8"))
```
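For context, here is a simplified illustration (not pycuda's actual code) of why the cache check itself is expensive: the cache key is a hash of the preprocessed source, so any source containing #include has to be run through nvcc before the cache can even be consulted.

```python
import hashlib

def cache_key(source, options, nvcc, preprocess):
    """Simplified sketch of how a content-hash cache key might be computed."""
    checksum = hashlib.md5()
    if "#include" in source:
        # This is the expensive step on Windows: nvcc is launched just to
        # compute the key, before we even know whether the cache has a hit.
        checksum.update(preprocess(source, options, nvcc).encode("utf-8"))
    else:
        checksum.update(source.encode("utf-8"))
    checksum.update(" ".join(options).encode("utf-8"))
    return checksum.hexdigest()
```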
In my tests, on any Windows computer, running nvcc --preprocess "empty_file.cu" --compiler-options -EP takes several seconds, even for an empty file. In other words, merely deciding whether the cache can be used takes a very long time.
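A quick way to reproduce the timing on Windows, assuming nvcc is on the PATH (the temporary file name here is just for illustration):

```python
import os
import subprocess
import tempfile
import time

# Create an empty .cu file; the name does not matter for this test.
fd, path = tempfile.mkstemp(suffix=".cu")
os.close(fd)

start = time.time()
subprocess.run(
    ["nvcc", "--preprocess", path, "--compiler-options", "-EP"],
    capture_output=True,
    check=True,
)
print(f"nvcc preprocessing of an empty file took {time.time() - start:.1f} s")

os.remove(path)
```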
Describe the solution you'd like
I tried monkey-patching out the preprocess call above, and it works well (see the sketch below). I'd like to find a better way to do it. The easiest way I can think of is adding an option to force skipping the #include check (it should not be enabled by default, since the user must understand the potential risk of using a stale cache).
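For the record, this is roughly the monkey patch I am using as a workaround (a sketch, not proposed as the real fix): it replaces pycuda.compiler.preprocess_source so that the checksum is computed from the raw source text. The obvious risk is that edits made only inside included headers will no longer invalidate the cache.

```python
import pycuda.compiler

def _skip_preprocess(source, options, nvcc):
    # Return the source unchanged instead of shelling out to nvcc;
    # the cache checksum is then based on the raw source text only.
    return source

pycuda.compiler.preprocess_source = _skip_preprocess
```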
Describe alternatives you've considered
Are there any nvcc options to speed up the preprocessing? I don't know of any.
Additional context
The link below is one of the examples I worked on, but I suspect that any simple GPUArray functionality that relies on SourceModule is affected by this.
https://github.com/bu-cisl/SSNP-IDT/blob/master/examples/forward_model.py
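For reference, my guess at a minimal reproducer (not taken from the linked example) would be something like this, since ElementwiseKernel compiles through SourceModule and its generated source contains the #include mentioned above:

```python
import numpy as np
import pycuda.autoinit  # noqa: F401
import pycuda.gpuarray as gpuarray
from pycuda.elementwise import ElementwiseKernel

# The very first call pays the full preprocessing/cache-check cost on Windows.
double = ElementwiseKernel("float *x", "x[i] = 2*x[i]", "double_it")
a = gpuarray.to_gpu(np.ones(16, dtype=np.float32))
double(a)
```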