This is a follow-up to #346: on ggttggg, build times are clearly becoming very long again (20 minutes or more, mainly in CUDA, but the situation also looks bad in clang/C++).
The issue is clearly related to inlining of FFV functions (hence to their templating in PR #328) and more generally to LTO/RDC/inlining optimizations over very large code bases (#229 et al).
Removing inlining by hand is an option, but the small tests I did in the past showed a large performance penalty.
The only viable solution is most likely splitting kernels (#310), not only for CUDA but also for C++. Once we have more than 1000 Feynman diagrams, as in ggttggg, it makes no sense to attempt optimizations across a single calculate_wavefunctions method with O(1k-10k) FFV calls. It seems better, even just for C++ and for build times, to split this into O(1k) functions, one per diagram.
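
To make the idea concrete, here is a minimal C++ sketch of what "one function per diagram" could look like. This is only an illustration under assumptions: `toyVertex`, `diagram1`..`diagram3`, `allDiagrams`, `calculateWavefunctionsSplit` and the `DIAGRAM_NOINLINE` macro are hypothetical names, the arithmetic is a toy stand-in for real FFV calls, and the generated code may well need a different layout (for example, one diagram per translation unit).

```cpp
#include <array>
#include <cassert>
#include <complex>
#include <cstddef>

// Hypothetical stand-in types and names: nothing here is taken from the generated code.
using cxtype = std::complex<double>;

// Keep each diagram as its own optimization unit (GCC/clang attribute; no-op elsewhere).
#if defined(__GNUC__) || defined(__clang__)
#define DIAGRAM_NOINLINE __attribute__((noinline))
#else
#define DIAGRAM_NOINLINE
#endif

// Toy replacement for an FFV-style helper: still inlinable, but only within one diagram.
inline cxtype toyVertex(const cxtype& w1, const cxtype& w2, double coupling)
{
  return coupling * w1 * w2;
}

// One small function per diagram instead of one huge calculate_wavefunctions body.
DIAGRAM_NOINLINE void diagram1(const cxtype* w, cxtype* amp) { amp[0] += toyVertex(w[0], w[1], 1.0); }
DIAGRAM_NOINLINE void diagram2(const cxtype* w, cxtype* amp) { amp[0] += toyVertex(w[1], w[2], 2.0); }
DIAGRAM_NOINLINE void diagram3(const cxtype* w, cxtype* amp) { amp[0] += toyVertex(w[0], w[2], 0.5); }

using DiagramFn = void (*)(const cxtype*, cxtype*);
static const std::array<DiagramFn, 3> allDiagrams = { diagram1, diagram2, diagram3 };

// Driver: calls the diagrams one by one and accumulates into a single amplitude.
cxtype calculateWavefunctionsSplit(const cxtype* w)
{
  assert(w != nullptr);
  cxtype amp[1] = { cxtype(0, 0) };
  for(DiagramFn fn : allDiagrams) fn(w, amp);
  return amp[0];
}
```

In this sketch the compiler can still inline the toy vertex helper inside each diagram, but it never has to optimize across all diagrams at once, which is the part that appears to make the build so slow. Whether the per-diagram call overhead is acceptable for performance would have to be measured.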