Kernel splitting ihel4-ihel6: Feynman diagram groups #1066

valassi · 2025-11-18T07:16:10Z

Hi @oliviermattelaer, as we discussed, this is the PR that completes my work on kernel splitting.

It extends and replaces PR #1050, which I will now close. It is described in v2 of my paper, which will be on arXiv in a couple of days.

Essentially this makes it possible to define Feynman diagram groups, in separate source code files, and launch them either as separate kernels (DCDIAG=0) or as device functions in a single kernel (DCDIAG=1). It is mainly of interest for very complex processes. I was able to run gg->ttgggg (2->6) on CPU and GPU and gg->ttggggg (2->7) on CPU. For our standard candles like gg_ttggg (2->5) it has the same performance as the current master when generation is configured to have a single diagram group (e.g. 2000 diagrams per group).
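To make the two launch modes concrete, here is a minimal host-only C++ sketch. It is not the code in this PR: `EventState`, `DiagramGroupFn`, `runSeparately` and `runFused` are hypothetical names, and a plain function call stands in for a GPU kernel launch. It only illustrates that both modes compute the same sum and differ in how many launches are issued.

```cpp
// Hypothetical sketch: all names below are illustrative, not the PR's API.
#include <cassert>
#include <vector>

// Per-event state shared by all diagram groups: here just a running amplitude sum.
struct EventState
{
  double ampSum = 0.;
};

// A diagram group is modelled as a function adding the contribution of its diagrams.
using DiagramGroupFn = void ( * )( EventState& );

// Two toy diagram groups (in the PR each group lives in its own source file).
inline void diagramGroup1( EventState& s ) { s.ampSum += 1.5; }
inline void diagramGroup2( EventState& s ) { s.ampSum += 2.5; }

// Analogue of DCDIAG=0: one "kernel launch" per diagram group.
// Returns the number of launches issued.
inline int runSeparately( const std::vector<DiagramGroupFn>& groups, EventState& s )
{
  int nLaunches = 0;
  for( DiagramGroupFn g : groups )
  {
    g( s ); // stands in for one kernel launch
    ++nLaunches;
  }
  return nLaunches;
}

// Analogue of DCDIAG=1: a single "kernel" calls every group as a plain function.
inline int runFused( const std::vector<DiagramGroupFn>& groups, EventState& s )
{
  auto fusedKernel = [&]() { for( DiagramGroupFn g : groups ) g( s ); };
  fusedKernel(); // one launch only
  return 1;
}
```

Calling both functions on the same list of groups gives identical `ampSum` values; only the launch count differs.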

With respect to PR #1050 it also contains a full implementation of CUDA graphs to orchestrate diagram kernels. But this is actually not very useful as the main problem with small diagram kernels is access to GPU global memory and not kernel launch overhead. This is also described in the upcoming paper. So to ease maintenance I can remove CUDA graphs if you prefer.

The number of diagrams per group is currently configured in the generate script through a hack. This should probably be in the runcard instead; I'd like to discuss the details.
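As a rough illustration of what a "diagrams per group" setting implies, the sketch below splits diagrams 1..N into consecutive groups. `makeDiagramGroups` is a hypothetical helper, and consecutive grouping is an assumption, not necessarily what CODEGEN does.

```cpp
// Hypothetical sketch: consecutive grouping is an assumption, not CODEGEN's algorithm.
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Split diagrams 1..nDiagrams into consecutive groups of at most diagramsPerGroup.
// Returns the inclusive [first,last] diagram range of each group.
inline std::vector<std::pair<int, int>> makeDiagramGroups( int nDiagrams, int diagramsPerGroup )
{
  assert( nDiagrams > 0 && diagramsPerGroup > 0 );
  std::vector<std::pair<int, int>> groups;
  for( int first = 1; first <= nDiagrams; first += diagramsPerGroup )
  {
    const int last = std::min( first + diagramsPerGroup - 1, nDiagrams );
    groups.emplace_back( first, last );
  }
  return groups;
}
```

With a made-up process of 250 diagrams, 100 diagrams per group gives three groups (1-100, 101-200, 201-250). Any setting at or above the diagram count, such as 2000, gives a single group, which is the configuration described above as matching master performance.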

Concerning BLAS, this is still optional, as in the current master. Note that for gg->ttggg BLAS is now faster than kernels for FPTYPE=d and f, and just as fast for m. It is still slower for simpler processes like gg_tt, however, so I think it should stay optional.

Let's discuss once the arXiv paper is out; it will be easier to go through the details then. I am keeping this as WIP for now.

Thanks!
Andrea

… - failures in ggttgg, ggttggg, smeftggtttt
Note also that already with 5 diagrams per group this is now a factor 2 faster in cuda for ggttggg (and ~30% faster in C++)
STARTED AT Sat Oct 18 11:58:05 AM CEST 2025
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Sat Oct 18 01:07:03 PM CEST 2025 [Status=2]
./tput/teeThroughputX.sh -dmf -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -scaling
ENDED(1-scaling) AT Sat Oct 18 01:20:08 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn
ENDED(2) AT Sat Oct 18 01:23:31 PM CEST 2025 [Status=2]
./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -blasOn -scaling
ENDED(2-scaling) AT Sat Oct 18 01:39:28 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(3) AT Sat Oct 18 01:51:53 PM CEST 2025 [Status=2]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -d_f -bridge -makeclean
ENDED(4) AT Sat Oct 18 01:59:57 PM CEST 2025 [Status=2]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst
ENDED(5) AT Sat Oct 18 02:02:36 PM CEST 2025 [Status=2]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -curhst
ENDED(6) AT Sat Oct 18 02:05:07 PM CEST 2025 [Status=2]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common
ENDED(7) AT Sat Oct 18 02:07:49 PM CEST 2025 [Status=2]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -noBlas -makeclean
ENDED(8) AT Sat Oct 18 02:19:40 PM CEST 2025 [Status=2]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean
ENDED(9) AT Sat Oct 18 02:51:16 PM CEST 2025 [Status=2]
./tput/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0_bridge.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0_blasOn.txt: 2 FAILED TESTS
./tput/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt: 2 FAILED TESTS

…- failures in ggttgg, ggttggg, smeftggtttt
STARTED AT Sat Oct 18 02:51:16 PM CEST 2025
(SM tests)
ENDED(1) AT Sat Oct 18 03:46:22 PM CEST 2025 [Status=0]
(BSM tests)
ENDED(1) AT Sat Oct 18 03:49:46 PM CEST 2025 [Status=0]
tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt: [XSECTION] ERROR! No cross section in log file:
tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt: [XSECTION] ERROR! No cross section in log file:
tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt: [XSECTION] ERROR! No cross section in log file:

…itscrd90
Revert "[hack_ihel4p2] rerun 144 tput tests on itscrd90 (with diagram groups) - failures in ggttgg, ggttggg, smeftggtttt"
This reverts commit fd6d902.
Revert "[hack_ihel4p2] rerun 30 tmad tests on itscrd90 (with diagram groups) - failures in ggttgg, ggttggg, smeftggtttt"
This reverts commit 29f3b9b.

…G FIX in CODEGEN (amplitudes with >=4 input wavefunctions)
Checked that ggttg has no change in generated code
Checked that ggttgg, ggttggg and smeftggtttt now pass runTest
(Note that ggttggg builds seem faster than with a single kernel or with 1k kernels?)

…r kernel and CUDA Graphs) - all ok
With respect to the last rd90 scaling logs for the 'hack_ihel3_sep25' codebase (commit 6e5d26a):
- CUDA (without blas) is a factor ~10 slower for small grids and ~2.5 slower for large grids (1 cycle) for ggttggg
- C++ is 15-20% slower
STARTED AT Sat Oct 18 09:09:50 PM CEST 2025
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Sat Oct 18 10:20:37 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -dmf -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -scaling
ENDED(1-scaling) AT Sat Oct 18 10:32:51 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn
ENDED(2) AT Sat Oct 18 10:37:13 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -blasOn -scaling
ENDED(2-scaling) AT Sat Oct 18 10:52:17 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(3) AT Sat Oct 18 11:07:44 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -d_f -bridge -makeclean
ENDED(4) AT Sat Oct 18 11:17:54 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst
ENDED(5) AT Sat Oct 18 11:21:09 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -curhst
ENDED(6) AT Sat Oct 18 11:24:22 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common
ENDED(7) AT Sat Oct 18 11:27:42 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -noBlas -makeclean
ENDED(8) AT Sat Oct 18 11:37:29 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean
ENDED(9) AT Sat Oct 18 11:59:39 PM CEST 2025 [Status=0]
No errors found in logs
No FPEs or '{ }' found in logs
No aborts found in logs

… kernel and CUDA graphs) - all ok
With respect to the last itscrd90 logs for the 'hack_ihel3_sep25' codebase (commit 10c3e3b):
- CUDA is a factor 2 slower
- C++ is 20% slower
STARTED AT Sat Oct 18 11:59:39 PM CEST 2025
(SM tests)
ENDED(1) AT Sun Oct 19 12:53:11 AM CEST 2025 [Status=0]
(BSM tests)
ENDED(1) AT Sun Oct 19 12:57:06 AM CEST 2025 [Status=0]

… new cuda graphs scaling)
./tput/allTees.sh -scalingonly
STARTED AT Sun Oct 19 11:21:55 AM CEST 2025
SKIP './tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean '
ENDED(1) AT Sun Oct 19 11:21:55 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -dmf -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -scaling -makeclean
ENDED(1-scaling) AT Sun Oct 19 11:37:14 AM CEST 2025 [Status=0]
SKIP './tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn '
ENDED(2) AT Sun Oct 19 11:37:14 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -blasOn -scaling
ENDED(2-scaling) AT Sun Oct 19 11:52:33 AM CEST 2025 [Status=0]
SKIP './tput/teeThroughputX.sh -ggttgg -ggttggg -dmf -useGraphs '
ENDED(3) AT Sun Oct 19 11:52:33 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -useGraphs -scaling
ENDED(3-scaling) AT Sun Oct 19 12:08:54 PM CEST 2025 [Status=0]
SKIP './tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly '
ENDED(4) AT Sun Oct 19 12:08:54 PM CEST 2025 [Status=0]
SKIP './tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -d_f -bridge '
ENDED(5) AT Sun Oct 19 12:08:54 PM CEST 2025 [Status=0]
SKIP './tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst '
ENDED(6) AT Sun Oct 19 12:08:54 PM CEST 2025 [Status=0]
SKIP './tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -curhst '
ENDED(7) AT Sun Oct 19 12:08:54 PM CEST 2025 [Status=0]
SKIP './tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common '
ENDED(8) AT Sun Oct 19 12:08:54 PM CEST 2025 [Status=0]
SKIP './tput/teeThroughputX.sh -ggtt -ggttgg -dmf -noBlas -makeclean '
ENDED(9) AT Sun Oct 19 12:08:54 PM CEST 2025 [Status=0]
SKIP './tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb '
ENDED(10) AT Sun Oct 19 12:08:54 PM CEST 2025 [Status=0]

… from gg_ttggggg.dpg100dpf1000.sa
This was created on the A100 node using the sse4 build - the test took 3h30m
On [avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu2/epochX/cudacpp/gg_ttggggg.dpg100dpf1000.sa/... >
date; CUDACPP_RUNTIME_GOODHELICITIES=ALL CUDACPP_RUNTEST_DUMPEVENTS=1 \
./build.sse4_m_inl0_hrd0/runTest_cpp.exe ; date
\cp ../../test/ref/dump* ../../../CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/test/ref/
Fri Nov 7 06:42:09 AM CET 2025
[==========] Running 3 tests from 3 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_GG_TTXGGGGG_CPU_XXX
[ RUN ] SIGMA_SM_GG_TTXGGGGG_CPU_XXX.testxxx
[ OK ] SIGMA_SM_GG_TTXGGGGG_CPU_XXX.testxxx (0 ms)
[----------] 1 test from SIGMA_SM_GG_TTXGGGGG_CPU_XXX (0 ms total)
[----------] 1 test from SIGMA_SM_GG_TTXGGGGG_CPU_MISC
[ RUN ] SIGMA_SM_GG_TTXGGGGG_CPU_MISC.testmisc
[ OK ] SIGMA_SM_GG_TTXGGGGG_CPU_MISC.testmisc (3 ms)
[----------] 1 test from SIGMA_SM_GG_TTXGGGGG_CPU_MISC (3 ms total)
[----------] 1 test from SIGMA_SM_GG_TTXGGGGG_CPU_NOMULTICHANNEL
[ RUN ] SIGMA_SM_GG_TTXGGGGG_CPU_NOMULTICHANNEL.compareMomAndME
INFO: Env variable CUDACPP_RUNTIME_GOODHELICITIES equals "ALL": keep all helicities
Event dump written to ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttxggggg.txt
[ OK ] SIGMA_SM_GG_TTXGGGGG_CPU_NOMULTICHANNEL.compareMomAndME (12888302 ms)
[----------] 1 test from SIGMA_SM_GG_TTXGGGGG_CPU_NOMULTICHANNEL (12888302 ms total)
[----------] Global test environment tear-down
[==========] 3 tests from 3 test suites ran. (12888306 ms total)
[ PASSED ] 3 tests.
Fri Nov 7 10:16:58 AM CET 2025

With respect to the last LUMI logs for the 'hack_ihel6p1' codebase (commit 6495dbd):
- What changed is the removal of templates from helicity amplitude methods
- Throughputs from HIP/dcd0/dcd1/noBlas and C++ are unchanged or slightly faster
- Throughputs from HIP/blasOn are 10% slower
STARTED AT Fri 07 Nov 2025 02:16:05 PM EET
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -makeclean -nocuda
ENDED(1) AT Fri 07 Nov 2025 03:24:01 PM EET [Status=0]
./tput/teeThroughputX.sh -dmf -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -scaling -makeclean -nocuda
ENDED(1-scaling) AT Fri 07 Nov 2025 03:34:04 PM EET [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn -makeclean -nocuda
ENDED(2) AT Fri 07 Nov 2025 03:38:37 PM EET [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -blasOn -scaling -makeclean -nocuda
ENDED(2-scaling) AT Fri 07 Nov 2025 03:58:45 PM EET [Status=0]
./tput/teeThroughputX.sh -ggttgg -ggttggg -dmf -useGraphs -makeclean -nocuda
ENDED(3) AT Fri 07 Nov 2025 04:02:46 PM EET [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -useGraphs -scaling -makeclean -nocuda
ENDED(3-scaling) AT Fri 07 Nov 2025 04:13:03 PM EET [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean -nocuda
ENDED(4) AT Fri 07 Nov 2025 04:46:13 PM EET [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -d_f -bridge -makeclean -nocuda
ENDED(5) AT Fri 07 Nov 2025 04:52:36 PM EET [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst -makeclean -nocuda
ENDED(6) AT Fri 07 Nov 2025 04:55:54 PM EET [Status=0]
SKIP './tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common -makeclean -nocuda'
ENDED(7) AT Fri 07 Nov 2025 04:55:54 PM EET [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common -makeclean -nocuda
ENDED(8) AT Fri 07 Nov 2025 04:59:11 PM EET [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -noBlas -makeclean -nocuda
ENDED(9) AT Fri 07 Nov 2025 05:04:15 PM EET [Status=0]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean -nocuda
ENDED(10) AT Fri 07 Nov 2025 05:40:05 PM EET [Status=0]
./tput/teeThroughputX.sh -makej -ggttg5 -dcd -makeclean -nocuda
ENDED(11) AT Fri 07 Nov 2025 05:42:32 PM EET [Status=0]
./tput/teeThroughputX.sh -makej -ggttg5 -useGraphs -nocuda
ENDED(12) AT Fri 07 Nov 2025 05:42:52 PM EET [Status=0]
No errors found in logs
No FPEs or '{ }' found in logs
No aborts found in logs

With respect to the last rd90 logs for the 'hack_ihel6p1' codebase (commit e96ecf3):
- What changed is the removal of templates from helicity amplitude methods
- Throughputs from CUDA/dcd0/dcd1 (with/without BLAS) are unchanged
- Throughputs from C++ are 5% faster
STARTED AT Fri Nov 7 01:13:35 PM CET 2025
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -makeclean
ENDED(1) AT Fri Nov 7 03:38:52 PM CET 2025 [Status=0]
./tput/teeThroughputX.sh -dmf -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -scaling -makeclean
ENDED(1-scaling) AT Fri Nov 7 03:51:10 PM CET 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn -makeclean
ENDED(2) AT Fri Nov 7 03:56:51 PM CET 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -blasOn -scaling -makeclean
ENDED(2-scaling) AT Fri Nov 7 04:11:45 PM CET 2025 [Status=0]
./tput/teeThroughputX.sh -ggttgg -ggttggg -dmf -useGraphs -makeclean
ENDED(3) AT Fri Nov 7 04:19:29 PM CET 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -useGraphs -scaling -makeclean
ENDED(3-scaling) AT Fri Nov 7 04:40:27 PM CET 2025 [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(4) AT Fri Nov 7 05:21:54 PM CET 2025 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -d_f -bridge -makeclean
ENDED(5) AT Fri Nov 7 05:31:08 PM CET 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst -makeclean
ENDED(6) AT Fri Nov 7 05:35:41 PM CET 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -curhst -makeclean
ENDED(7) AT Fri Nov 7 05:40:00 PM CET 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common -makeclean
ENDED(8) AT Fri Nov 7 05:44:28 PM CET 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -noBlas -makeclean
ENDED(9) AT Fri Nov 7 05:50:54 PM CET 2025 [Status=0]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean
ENDED(10) AT Fri Nov 7 06:09:23 PM CET 2025 [Status=0]
./tput/teeThroughputX.sh -makej -ggttg5 -dcd -makeclean
ENDED(11) AT Fri Nov 7 06:10:50 PM CET 2025 [Status=0]
./tput/teeThroughputX.sh -makej -ggttg5 -useGraphs
ENDED(12) AT Fri Nov 7 06:11:28 PM CET 2025 [Status=0]
No errors found in logs
No FPEs or '{ }' found in logs
No aborts found in logs

With respect to the last rd90 logs for the 'hack_ihel6p1' codebase (commit 628be94):
- What changed is the removal of templates from helicity amplitude methods
- Throughputs from CUDA and C++ are unchanged
STARTED AT Fri Nov 7 06:11:28 PM CET 2025
/data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/teeMadX.sh -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -dmf -makeclean
(SM tests)
ENDED(1) AT Fri Nov 7 07:03:51 PM CET 2025 [Status=0]
/data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/teeMadX.sh -heftggbb -susyggtt -susyggt1t1 -smeftggtttt -dmf -makeclean
(BSM tests)
ENDED(1) AT Fri Nov 7 07:08:44 PM CET 2025 [Status=0]
tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt:ERROR! events.lhe.cpp.1 and events.lhe.ref.1 differ!
No asserts found in logs
No segmentation fault found in logs

valassi · 2025-11-23T23:36:32Z

Full documentation is in https://arxiv.org/abs/2510.05392v2, which should appear tomorrow.

oliviermattelaer · 2025-11-24T05:43:11Z

good. Excellent! Thanks for the notification.

Olivier

valassi · 2025-12-11T09:57:02Z

Thanks Olivier!

Final notification: https://arxiv.org/abs/2510.05392v3 should appear tomorrow and was submitted to EPJC. This clarifies a few points in #1072 and PR #1073 but leaves this PR unchanged.

I mark this PR #1066 as ready for review. This completes my kernel splitting work on madgraph4gpu.

valassi · 2025-12-11T10:14:50Z

Hi Olivier, Daniele, I mark you as reviewers. Let me know if you want to discuss this. Thanks
Andrea

valassi added 30 commits October 17, 2025 20:31

[hack_ihel4p2] in ggttg.mad, first functional version with diagram gr…

8af8408

…oups and storeWfs/retrieveWfs only for selected wfs
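To illustrate the idea of storing and retrieving only selected wavefunctions between diagram groups, here is a minimal host-only sketch. It is an assumption-laden stand-in, not the PR's `storeWfs`/`retrieveWfs`: the names `Wavefunction`, `storeSelectedWfs` and `retrieveSelectedWfs` are hypothetical, components are real rather than complex, and a `std::vector` plays the role of GPU global memory.

```cpp
// Hypothetical sketch: simplified stand-ins, not the signatures used in the PR.
#include <cassert>
#include <vector>

constexpr int nw6 = 6; // components per wavefunction (real numbers here, for brevity)

struct Wavefunction
{
  double c[nw6] = {};
};

// Copy only the selected wavefunctions from a group's local storage to the shared buffer.
inline void storeSelectedWfs( const std::vector<Wavefunction>& local,
                              std::vector<Wavefunction>& shared,
                              const std::vector<int>& selected )
{
  for( int iwf : selected ) shared[iwf] = local[iwf]; // copies all 6 components
}

// Copy only the selected wavefunctions from the shared buffer into a group's local storage.
inline void retrieveSelectedWfs( const std::vector<Wavefunction>& shared,
                                 std::vector<Wavefunction>& local,
                                 const std::vector<int>& selected )
{
  for( int iwf : selected ) local[iwf] = shared[iwf];
}
```

The point of the selection is that each copy through the shared buffer is memory traffic, so a group should store only what later groups need and retrieve only what it uses.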

[hack_ihel4p2] in ggttg.mad, add ndiagramgroups in parallel to ndiagrams

2a3116a

[hack_ihel4p2] in ggttg.mad, rename diagramsItoJ as diagramgroupK

4d27b32

[hack_ihel4p2] first part of CODEGEN backport (from ggttg.mad) of dia…

68eb2a5

…gram groups - only missing diagrams.h

[hack_ihel4p2] in ggttg.mad, move storeWfs/retrieveWfs to diagrams_he…

017ab30

…ader.h

[hack_ihel4p2] in CODEGEN (ggttg.mad backport), move storeWfs/retriev…

f815652

…eWfs to diagrams_header.h

[hack_ihel4p2] in ggttg.mad, formatting changes

2aed23f

[hack_ihel4p2] in CODEGEN, complete ggttg.mad backport except for the…

168d75b

… selection of wavefunctions to retrieve or store

[hack_ihel4p2] in CODEGEN, finally complete ggttg.mad backport with s…

7bd1cc8

…election of wavefunctions to retrieve/store

[hack_ihel4p2] regenerate gg_tt/gg_ttg/gg_ttgg.mad - all build ok, bu…

28da2e0

…t ggttgg fails runTest

[hack_ihel4p2] in CODEGEN diagram_boilerplate.h, suppress a CUDA buil…

400c30a

…d warning (nevt declared but never referenced)

[hack_ihel4p2] regenerate all processes - add diagrams_header.h

4533a88

[hack_ihel4p2] BUG FIX in CODEGEN: amplitudes may take more than 3 in…

97de37d

…put wavefunctions (4-particle vertices!)

[hack_ihel4p2] in CODEGEN, move to 100 diagrams per group

9c2538f

[hack_ihel4p2] regenerate all processes with 100 diagrams per groups …

49a172c

…and CODEGEN bug fix (>=4 input wavefunctions for amplitudes)

[hack_ihel4p2] in ggtt.mad, use CUDA graphs only if CUDACPP_RUNTIME_G…

59be4cd

…PUGRAPHS is set and non empty
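The "set and non empty" condition on CUDACPP_RUNTIME_GPUGRAPHS can be sketched as below. Only the environment variable name comes from the commit message; the helpers `envIsSetAndNonEmpty` and `useGpuGraphs` are hypothetical, and the actual check in the generated code may differ.

```cpp
// Hypothetical sketch: helper names are illustrative, only the variable name is from the PR.
#include <cassert>
#include <cstdlib>

// True if the environment variable exists and its value is not the empty string.
inline bool envIsSetAndNonEmpty( const char* name )
{
  const char* value = std::getenv( name );
  return value != nullptr && value[0] != '\0';
}

// Graph orchestration is opt-in: the default (unset or empty) is plain kernel launches.
inline bool useGpuGraphs()
{
  return envIsSetAndNonEmpty( "CUDACPP_RUNTIME_GPUGRAPHS" );
}
```

Under this reading, `CUDACPP_RUNTIME_GPUGRAPHS=1` enables graphs, while an unset variable or `CUDACPP_RUNTIME_GPUGRAPHS=` leaves them off.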

[hack_ihel4p2] in CODEGEN (ggtt.mad backport), use CUDA graphs only i…

45deaaf

…f CUDACPP_RUNTIME_GPUGRAPHS is set and non empty Checked that gg_tt.md is regenerated correctly

[hack_ihel4p2] in tput/throughputX.sh and tput/teeThroughputX.sh, add…

77dc2b2

… option -useGraphs (and run also x10 scaling tests in that case)

[hack_ihel4p2] in tput/allTees.sh, add (6 + 12 scaling) logs with gra…

fb30d82

…phs (162 total: the old 144 now do not use cuda graphs)

[hack_ihel4p2] regenerate all processes: use CUDA graphs only if CUDA…

94ac3b2

…CPP_RUNTIME_GPUGRAPHS is non empty

[hack_ihel4p2] copy previous tput scaling logs to _graphs.scaling logs

ee74b4e

for f in logs_ggtt*/*0.scaling; do cp $f ${f/.scaling/_graphs.scaling}; done

[hack_ihel4p2] minor change in tput/throughputX.sh

fd08107

[hack_ihel4p2] in ggtt.mad, add a comment: need to store and retrieve…

597a573

… all 6 wavefunction components

valassi added 12 commits November 7, 2025 11:52

[hack_ihel6p2] CODEGEN (ee_mumu/gg_tt.mad backport): fix choice betwe…

68e29ad

…en dep/indep couplings (add depCoup bool flag for easier codegen after removing templates from helicity amplitude methods)

[hack_ihel6p2] regenerate all processes - remove templates from helic…

deaa311

…ity amplitudes

[hack_ihel6p2/ggtt5g] gg_tt.mad: improve optional handling of constex…

bc771b2

…pr in color sum (e.g. needed for gg_ttggggg)

[hack_ihel6p2/ggtt5g] CODEGEN (gg_tt.mad backport): improve optional …

20b436d

…handling of constexpr in color sum (e.g. for gg_ttggggg)

[hack_ihel6p2] regenerate all processes (including extra patches for …

8b5c387

…gg_tt5g)

[hack_ihel6p2] rerun 30 tmad tests on LUMI - all ok

0eeea36

With respect to the last LUMI logs for the 'hack_ihel6p1' codebase (commit 4aa41e7): - What changed is the removal of templates from helicity amplitude methods - Throughputs from HIP/dcd0 and C++ are unchanged or slightly faster

[hack_ihel6p2] go back from hack_ihel6p2/LUMI to hack_ihel6p1/rd90 logs

9397172

Revert "[hack_ihel6p2] rerun 30 tmad tests on LUMI - all ok" This reverts commit 0eeea36. Revert "[hack_ihel6p2] rerun 159 tput tests on LUMI - all ok" This reverts commit ba8df93.

[hack_ihel6p2] ** COMPLETE HACK_IHEL6P2 ** fix usage message in CODEG…

ccc2256

…EN/generateAndCompare.sh

valassi self-assigned this Nov 18, 2025

valassi marked this pull request as draft November 18, 2025 07:16

valassi mentioned this pull request Nov 18, 2025

(WIP, NOT FOR MERGING) Kernel splitting ihel4: Feynman diagram kernels #1050

Closed

[hack_ihel6p2/ggtt5g] CODEGEN/generateAndCompare.sh: add gg_ttggggg

900b59f

valassi force-pushed the hack_ihel6_pr branch from dab6ac4 to 900b59f Compare November 22, 2025 09:21

valassi linked an issue Nov 23, 2025 that may be closed by this pull request

Split the sigmakin kernel into smaller kernels #310

Closed

This was referenced Nov 23, 2025

Split the sigmakin kernel into smaller kernels #310

Closed

cuda graphs #12

Open

"xxx" function interface: further separation of data access and calculations? #175

Closed

valassi linked an issue Nov 23, 2025 that may be closed by this pull request

Reduce build times in CUDA and C++ for complex processes (split kernels and more) #348

Closed

valassi mentioned this pull request Nov 23, 2025

Reduce build times in CUDA and C++ for complex processes (split kernels and more) #348

Closed

valassi marked this pull request as ready for review December 11, 2025 09:57

valassi requested review from Qubitol and oliviermattelaer December 11, 2025 10:14
