Kernel splitting ihel4-ihel6: Feynman diagram groups #1066
…oups and storeWfs/retrieveWfs only for selected wfs
…gram groups - only missing diagrams.h
…eWfs to diagrams_header.h
… selection of wavefunctions to retrieve or store
…election of wavefunctions to retrieve/store
…t ggttgg fails runTest
…d warning (nevt declared but never referenced)
… - failures in ggttgg, ggttggg, smeftggtttt Note also that already with 5 diagrams per group this is now a factor 2 faster in cuda for ggttggg (and ~30% faster in C++) STARTED AT Sat Oct 18 11:58:05 AM CEST 2025 ./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean ENDED(1) AT Sat Oct 18 01:07:03 PM CEST 2025 [Status=2] ./tput/teeThroughputX.sh -dmf -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -scaling ENDED(1-scaling) AT Sat Oct 18 01:20:08 PM CEST 2025 [Status=0] ./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn ENDED(2) AT Sat Oct 18 01:23:31 PM CEST 2025 [Status=2] ./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -blasOn -scaling ENDED(2-scaling) AT Sat Oct 18 01:39:28 PM CEST 2025 [Status=0] ./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean ENDED(3) AT Sat Oct 18 01:51:53 PM CEST 2025 [Status=2] ./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -d_f -bridge -makeclean ENDED(4) AT Sat Oct 18 01:59:57 PM CEST 2025 [Status=2] ./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst ENDED(5) AT Sat Oct 18 02:02:36 PM CEST 2025 [Status=2] ./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -curhst ENDED(6) AT Sat Oct 18 02:05:07 PM CEST 2025 [Status=2] ./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common ENDED(7) AT Sat Oct 18 02:07:49 PM CEST 2025 [Status=2] ./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -noBlas -makeclean ENDED(8) AT Sat Oct 18 02:19:40 PM CEST 2025 [Status=2] ./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean ENDED(9) AT Sat Oct 18 02:51:16 PM CEST 2025 [Status=2] ./tput/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0_bridge.txt: 2 FAILED TESTS ./tput/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0_blasOn.txt: 2 FAILED TESTS ./tput/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt: 2 FAILED TESTS
…- failures in ggttgg, ggttggg, smeftggtttt STARTED AT Sat Oct 18 02:51:16 PM CEST 2025 (SM tests) ENDED(1) AT Sat Oct 18 03:46:22 PM CEST 2025 [Status=0] (BSM tests) ENDED(1) AT Sat Oct 18 03:49:46 PM CEST 2025 [Status=0] tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt: [XSECTION] ERROR! No cross section in log file: tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt: [XSECTION] ERROR! No cross section in log file: tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt: [XSECTION] ERROR! No cross section in log file:
…itscrd90 Revert "[hack_ihel4p2] rerun 144 tput tests on itscrd90 (with diagram groups) - failures in ggttgg, ggttggg, smeftggtttt" This reverts commit fd6d902. Revert "[hack_ihel4p2] rerun 30 tmad tests on itscrd90 (with diagram groups) - failures in ggttgg, ggttggg, smeftggtttt" This reverts commit 29f3b9b.
…put wavefunctions (4-particle vertices!)
…G FIX in CODEGEN (amplitudes with >=4 input wavefunctions) Checked that ggttg has no change in generated code Checked that ggttgg, ggttggg and smeftggtttt now pass runTest (Note that ggttggg builds seem faster than with a single kernel or with 1k kernels?)
…and CODEGEN bug fix (>=4 input wavefunctions for amplitudes)
…r kernel and CUDA Graphs) - all ok With respect to the last rd90 scaling logs for the 'hack_ihel3_sep25' codebase (commit 6e5d26a): - CUDA (without blas) is a factor ~10 slower for small grids and ~2.5 slower for large grids (1 cycle) for ggttggg - C++ is 15-20% slower STARTED AT Sat Oct 18 09:09:50 PM CEST 2025 ./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean ENDED(1) AT Sat Oct 18 10:20:37 PM CEST 2025 [Status=0] ./tput/teeThroughputX.sh -dmf -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -scaling ENDED(1-scaling) AT Sat Oct 18 10:32:51 PM CEST 2025 [Status=0] ./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn ENDED(2) AT Sat Oct 18 10:37:13 PM CEST 2025 [Status=0] ./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -blasOn -scaling ENDED(2-scaling) AT Sat Oct 18 10:52:17 PM CEST 2025 [Status=0] ./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean ENDED(3) AT Sat Oct 18 11:07:44 PM CEST 2025 [Status=0] ./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -d_f -bridge -makeclean ENDED(4) AT Sat Oct 18 11:17:54 PM CEST 2025 [Status=0] ./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst ENDED(5) AT Sat Oct 18 11:21:09 PM CEST 2025 [Status=0] ./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -curhst ENDED(6) AT Sat Oct 18 11:24:22 PM CEST 2025 [Status=0] ./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common ENDED(7) AT Sat Oct 18 11:27:42 PM CEST 2025 [Status=0] ./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -noBlas -makeclean ENDED(8) AT Sat Oct 18 11:37:29 PM CEST 2025 [Status=0] ./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean ENDED(9) AT Sat Oct 18 11:59:39 PM CEST 2025 [Status=0] No errors found in logs No FPEs or '{ }' found in logs No aborts found in logs
… kernel and CUDA graphs) - all ok With respect to the last itscrd90 logs for the 'hack_ihel3_sep25' codebase (commit 10c3e3b): - CUDA is a factor 2 slower - C++ is 20% slower STARTED AT Sat Oct 18 11:59:39 PM CEST 2025 (SM tests) ENDED(1) AT Sun Oct 19 12:53:11 AM CEST 2025 [Status=0] (BSM tests) ENDED(1) AT Sun Oct 19 12:57:06 AM CEST 2025 [Status=0]
…PUGRAPHS is set and non empty
…f CUDACPP_RUNTIME_GPUGRAPHS is set and non empty Checked that gg_tt.md is regenerated correctly
… option -useGraphs (and run also x10 scaling tests in that case)
…phs (162 total: the old 144 now do not use cuda graphs)
…CPP_RUNTIME_GPUGRAPHS is non empty
for f in logs_ggtt*/*0.scaling; do cp $f ${f/.scaling/_graphs.scaling}; done
… new cuda graphs scaling) ./tput/allTees.sh -scalingonly STARTED AT Sun Oct 19 11:21:55 AM CEST 2025 SKIP './tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean ' ENDED(1) AT Sun Oct 19 11:21:55 AM CEST 2025 [Status=0] ./tput/teeThroughputX.sh -dmf -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -scaling -makeclean ENDED(1-scaling) AT Sun Oct 19 11:37:14 AM CEST 2025 [Status=0] SKIP './tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn ' ENDED(2) AT Sun Oct 19 11:37:14 AM CEST 2025 [Status=0] ./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -blasOn -scaling ENDED(2-scaling) AT Sun Oct 19 11:52:33 AM CEST 2025 [Status=0] SKIP './tput/teeThroughputX.sh -ggttgg -ggttggg -dmf -useGraphs ' ENDED(3) AT Sun Oct 19 11:52:33 AM CEST 2025 [Status=0] ./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -useGraphs -scaling ENDED(3-scaling) AT Sun Oct 19 12:08:54 PM CEST 2025 [Status=0] SKIP './tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly ' ENDED(4) AT Sun Oct 19 12:08:54 PM CEST 2025 [Status=0] SKIP './tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -d_f -bridge ' ENDED(5) AT Sun Oct 19 12:08:54 PM CEST 2025 [Status=0] SKIP './tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst ' ENDED(6) AT Sun Oct 19 12:08:54 PM CEST 2025 [Status=0] SKIP './tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -curhst ' ENDED(7) AT Sun Oct 19 12:08:54 PM CEST 2025 [Status=0] SKIP './tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common ' ENDED(8) AT Sun Oct 19 12:08:54 PM CEST 2025 [Status=0] SKIP './tput/teeThroughputX.sh -ggtt -ggttgg -dmf -noBlas -makeclean ' ENDED(9) AT Sun Oct 19 12:08:54 PM CEST 2025 [Status=0] SKIP './tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb ' ENDED(10) AT Sun Oct 19 12:08:54 PM CEST 2025 [Status=0]
… all 6 wavefunction components
…en dep/indep couplings (add depCoup bool flag for easier codegen after removing templates from helicity amplitude methods)
…pr in color sum (e.g. needed for gg_ttggggg)
… from gg_ttggggg.dpg100dpf1000.sa This was created on the A100 node using the sse4 build - the test took 3h30m On [avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu2/epochX/cudacpp/gg_ttggggg.dpg100dpf1000.sa/... > date; CUDACPP_RUNTIME_GOODHELICITIES=ALL CUDACPP_RUNTEST_DUMPEVENTS=1 \ ./build.sse4_m_inl0_hrd0/runTest_cpp.exe ; date \cp ../../test/ref/dump* ../../../CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/test/ref/ Fri Nov 7 06:42:09 AM CET 2025 [==========] Running 3 tests from 3 test suites. [----------] Global test environment set-up. [----------] 1 test from SIGMA_SM_GG_TTXGGGGG_CPU_XXX [ RUN ] SIGMA_SM_GG_TTXGGGGG_CPU_XXX.testxxx [ OK ] SIGMA_SM_GG_TTXGGGGG_CPU_XXX.testxxx (0 ms) [----------] 1 test from SIGMA_SM_GG_TTXGGGGG_CPU_XXX (0 ms total) [----------] 1 test from SIGMA_SM_GG_TTXGGGGG_CPU_MISC [ RUN ] SIGMA_SM_GG_TTXGGGGG_CPU_MISC.testmisc [ OK ] SIGMA_SM_GG_TTXGGGGG_CPU_MISC.testmisc (3 ms) [----------] 1 test from SIGMA_SM_GG_TTXGGGGG_CPU_MISC (3 ms total) [----------] 1 test from SIGMA_SM_GG_TTXGGGGG_CPU_NOMULTICHANNEL [ RUN ] SIGMA_SM_GG_TTXGGGGG_CPU_NOMULTICHANNEL.compareMomAndME INFO: Env variable CUDACPP_RUNTIME_GOODHELICITIES equals "ALL": keep all helicities Event dump written to ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttxggggg.txt [ OK ] SIGMA_SM_GG_TTXGGGGG_CPU_NOMULTICHANNEL.compareMomAndME (12888302 ms) [----------] 1 test from SIGMA_SM_GG_TTXGGGGG_CPU_NOMULTICHANNEL (12888302 ms total) [----------] Global test environment tear-down [==========] 3 tests from 3 test suites ran. (12888306 ms total) [ PASSED ] 3 tests. Fri Nov 7 10:16:58 AM CET 2025
…handling of constexpr in color sum (e.g. for gg_ttggggg)
With respect to the last LUMI logs for the 'hack_ihel6p1' codebase (commit 6495dbd): - What changed is the removal of templates from helicity amplitude methods - Throughputs from HIP/dcd0/dcd1/noBlas and C++ are unchanged or slightly faster - Throughputs from HIP/blasOn are 10% slower STARTED AT Fri 07 Nov 2025 02:16:05 PM EET ./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -makeclean -nocuda ENDED(1) AT Fri 07 Nov 2025 03:24:01 PM EET [Status=0] ./tput/teeThroughputX.sh -dmf -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -scaling -makeclean -nocuda ENDED(1-scaling) AT Fri 07 Nov 2025 03:34:04 PM EET [Status=0] ./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn -makeclean -nocuda ENDED(2) AT Fri 07 Nov 2025 03:38:37 PM EET [Status=0] ./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -blasOn -scaling -makeclean -nocuda ENDED(2-scaling) AT Fri 07 Nov 2025 03:58:45 PM EET [Status=0] ./tput/teeThroughputX.sh -ggttgg -ggttggg -dmf -useGraphs -makeclean -nocuda ENDED(3) AT Fri 07 Nov 2025 04:02:46 PM EET [Status=0] ./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -useGraphs -scaling -makeclean -nocuda ENDED(3-scaling) AT Fri 07 Nov 2025 04:13:03 PM EET [Status=0] ./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean -nocuda ENDED(4) AT Fri 07 Nov 2025 04:46:13 PM EET [Status=0] ./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -d_f -bridge -makeclean -nocuda ENDED(5) AT Fri 07 Nov 2025 04:52:36 PM EET [Status=0] ./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst -makeclean -nocuda ENDED(6) AT Fri 07 Nov 2025 04:55:54 PM EET [Status=0] SKIP './tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common -makeclean -nocuda' ENDED(7) AT Fri 07 Nov 2025 04:55:54 PM EET [Status=0] ./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common -makeclean -nocuda ENDED(8) AT Fri 07 Nov 2025 04:59:11 PM EET [Status=0] ./tput/teeThroughputX.sh -ggtt
-ggttgg -dmf -noBlas -makeclean -nocuda ENDED(9) AT Fri 07 Nov 2025 05:04:15 PM EET [Status=0] ./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean -nocuda ENDED(10) AT Fri 07 Nov 2025 05:40:05 PM EET [Status=0] ./tput/teeThroughputX.sh -makej -ggttg5 -dcd -makeclean -nocuda ENDED(11) AT Fri 07 Nov 2025 05:42:32 PM EET [Status=0] ./tput/teeThroughputX.sh -makej -ggttg5 -useGraphs -nocuda ENDED(12) AT Fri 07 Nov 2025 05:42:52 PM EET [Status=0] No errors found in logs No FPEs or '{ }' found in logs No aborts found in logs
With respect to the last LUMI logs for the 'hack_ihel6p1' codebase (commit 4aa41e7): - What changed is the removal of templates from helicity amplitude methods - Throughputs from HIP/dcd0 and C++ are unchanged or slightly faster
With respect to the last rd90 logs for the 'hack_ihel6p1' codebase (commit e96ecf3): - What changed is the removal of templates from helicity amplitude methods - Throughputs from CUDA/dcd0/dcd1 (with/without BLAS) are unchanged - Throughputs from C++ are 5% faster STARTED AT Fri Nov 7 01:13:35 PM CET 2025 ./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -makeclean ENDED(1) AT Fri Nov 7 03:38:52 PM CET 2025 [Status=0] ./tput/teeThroughputX.sh -dmf -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -scaling -makeclean ENDED(1-scaling) AT Fri Nov 7 03:51:10 PM CET 2025 [Status=0] ./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn -makeclean ENDED(2) AT Fri Nov 7 03:56:51 PM CET 2025 [Status=0] ./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -blasOn -scaling -makeclean ENDED(2-scaling) AT Fri Nov 7 04:11:45 PM CET 2025 [Status=0] ./tput/teeThroughputX.sh -ggttgg -ggttggg -dmf -useGraphs -makeclean ENDED(3) AT Fri Nov 7 04:19:29 PM CET 2025 [Status=0] ./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -useGraphs -scaling -makeclean ENDED(3-scaling) AT Fri Nov 7 04:40:27 PM CET 2025 [Status=0] ./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean ENDED(4) AT Fri Nov 7 05:21:54 PM CET 2025 [Status=0] ./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -d_f -bridge -makeclean ENDED(5) AT Fri Nov 7 05:31:08 PM CET 2025 [Status=0] ./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst -makeclean ENDED(6) AT Fri Nov 7 05:35:41 PM CET 2025 [Status=0] ./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -curhst -makeclean ENDED(7) AT Fri Nov 7 05:40:00 PM CET 2025 [Status=0] ./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common -makeclean ENDED(8) AT Fri Nov 7 05:44:28 PM CET 2025 [Status=0] ./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -noBlas -makeclean ENDED(9) AT Fri Nov 7 05:50:54 PM CET 2025 [Status=0] ./tput/teeThroughputX.sh -dmf -hrd -makej 
-susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean ENDED(10) AT Fri Nov 7 06:09:23 PM CET 2025 [Status=0] ./tput/teeThroughputX.sh -makej -ggttg5 -dcd -makeclean ENDED(11) AT Fri Nov 7 06:10:50 PM CET 2025 [Status=0] ./tput/teeThroughputX.sh -makej -ggttg5 -useGraphs ENDED(12) AT Fri Nov 7 06:11:28 PM CET 2025 [Status=0] No errors found in logs No FPEs or '{ }' found in logs No aborts found in logs
With respect to the last rd90 logs for the 'hack_ihel6p1' codebase (commit 628be94): - What changed is the removal of templates from helicity amplitude methods - Throughputs from CUDA and C++ are unchanged STARTED AT Fri Nov 7 06:11:28 PM CET 2025 /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/teeMadX.sh -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -dmf -makeclean (SM tests) ENDED(1) AT Fri Nov 7 07:03:51 PM CET 2025 [Status=0] /data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/teeMadX.sh -heftggbb -susyggtt -susyggt1t1 -smeftggtttt -dmf -makeclean (BSM tests) ENDED(1) AT Fri Nov 7 07:08:44 PM CET 2025 [Status=0] tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt:ERROR! events.lhe.cpp.1 and events.lhe.ref.1 differ! No asserts found in logs No segmentation fault found in logs
…EN/generateAndCompare.sh
dab6ac4 to 900b59f
Full documentation is in https://arxiv.org/abs/2510.05392v2, which should appear tomorrow.

good. Excellent! Thanks for the notification. Olivier

Thanks Olivier! Final notification: https://arxiv.org/abs/2510.05392v3 should appear tomorrow and was submitted to EPJC. This clarifies a few points in #1072 and PR #1073 but leaves this PR unchanged. I mark this PR #1066 as ready for review. This completes my kernel splitting work on madgraph4gpu.

Hi Olivier, Daniele, I mark you as reviewers. Let me know if you want to discuss this. Thanks

Hi @oliviermattelaer, as we discussed, this is the PR that completes my work on kernel splitting.
It extends and replaces PR #1050, which I will now close. It is described in v2 of my paper, which will be on arXiv in a couple of days.
Essentially, this makes it possible to define Feynman diagram groups, in separate source code files, and launch them either as kernels (DCDIAG=0) or as device functions in a single kernel (DCDIAG=1). It is mainly of interest for very complex processes: I was able to run gg->ttgggg (2->6) on CPU and GPU and gg->ttggggg (2->7) on CPU. For our standard candles like gg_ttggg (2->5), it has the same performance as the current master when generation is configured to have a single diagram group (e.g. 2000 diagrams per group).
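The two launch modes can be pictured with a plain C++ sketch. This is an illustration only, not the generated cudacpp code: `JampVector`, `DiagramGroup`, `diagramGroup1`, `diagramGroup2`, `runAsSeparateKernels` and `runAsSingleKernel` are hypothetical names, and ordinary function calls stand in for GPU kernel launches and device function calls.

```cpp
#include <complex>
#include <functional>
#include <vector>

// Hypothetical types and names, for illustration only.
using cxtype = std::complex<double>;
using JampVector = std::vector<cxtype>;                  // amplitudes accumulated across diagram groups
using DiagramGroup = std::function<void( JampVector& )>; // one group (one source file in the real code)

// Two toy diagram groups: each adds the contributions of its diagrams to the shared accumulator.
inline void diagramGroup1( JampVector& jamp )
{
  jamp[0] += cxtype( 1., 0. );
  jamp[1] += cxtype( 0., 1. );
}

inline void diagramGroup2( JampVector& jamp )
{
  jamp[0] += cxtype( 0.5, 0. );
  jamp[1] -= cxtype( 0., 1. );
}

// DCDIAG=0 analogue: the host issues one "launch" per diagram group.
// Returns the number of launches (one per group).
inline int runAsSeparateKernels( const std::vector<DiagramGroup>& groups, JampVector& jamp )
{
  int nLaunches = 0;
  for( const auto& group : groups )
  {
    group( jamp ); // stands in for a separate kernel launch
    ++nLaunches;
  }
  return nLaunches;
}

// DCDIAG=1 analogue: a single "kernel" calls every group as an ordinary function.
// Returns the number of launches (always one).
inline int runAsSingleKernel( const std::vector<DiagramGroup>& groups, JampVector& jamp )
{
  for( const auto& group : groups ) group( jamp ); // stands in for a device function call
  return 1;
}
```

Both paths accumulate the same amplitudes; in this sketch they differ only in how many launches the host issues.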
With respect to PR #1050, it also contains a full implementation of CUDA graphs to orchestrate diagram kernels. This is actually not very useful, however, as the main problem with small diagram kernels is access to GPU global memory and not kernel launch overhead. This is also described in the upcoming paper. So, to ease maintenance, I can remove CUDA graphs if you prefer.
The number of diagrams per group is currently configured in the generate script through a hack. This should probably be in the runcard instead; I'd like to discuss the details.
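As a rough picture of what a "diagrams per group" setting controls, the sketch below splits a list of diagrams into groups of bounded size. It assumes that groups are consecutive ranges of diagram numbers, which may not match the actual code generation; `makeDiagramGroups` and its parameters are hypothetical names.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical helper: split diagrams 1..nDiagrams into consecutive groups
// holding at most diagramsPerGroup diagrams each.
// Returns the inclusive (first, last) diagram numbers of each group.
inline std::vector<std::pair<int, int>> makeDiagramGroups( int nDiagrams, int diagramsPerGroup )
{
  std::vector<std::pair<int, int>> groups;
  if( nDiagrams <= 0 || diagramsPerGroup <= 0 ) return groups; // nothing to split
  for( int first = 1; first <= nDiagrams; first += diagramsPerGroup )
  {
    // The last group may be smaller than diagramsPerGroup
    const int last = std::min( first + diagramsPerGroup - 1, nDiagrams );
    groups.emplace_back( first, last );
  }
  return groups;
}
```

With a group size larger than the number of diagrams, this returns a single group covering everything, which corresponds to the single-group configuration mentioned above.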
Concerning BLAS, this is still optional, as in the present master. Note that in gg->ttggg BLAS is now faster than kernels for FPTYPE=d,f and just as fast for m. But it is still slower for simpler processes like gg_tt, so I think it should stay optional.
Let's discuss when the arXiv paper is out, as it will be easier to go through the details then. I keep this in WIP for now.
Thanks!
Andrea