
Conversation

@valassi commented Nov 18, 2025

Hi @oliviermattelaer, as we discussed, this is the PR that completes my work on kernel splitting.

It extends and replaces PR #1050, which I will now close. It is described in v2 of my paper, which will appear on arXiv in a couple of days.

Essentially, this makes it possible to define Feynman diagram groups in separate source code files and to launch them either as separate kernels (DCDIAG=0) or as device functions within a single kernel (DCDIAG=1). Its main interest is for very complex processes: I was able to run gg->ttgggg (2->6) on CPU and GPU, and gg->ttggggg (2->7) on CPU. For our standard candles like gg_ttggg (2->5) it has the same performance as the current master when code generation is configured to produce a single diagram group (e.g. 2000 diagrams per group).

With respect to PR #1050, it also contains a full implementation of CUDA graphs to orchestrate the diagram kernels. This turns out not to be very useful, however, because the main bottleneck with small diagram kernels is access to GPU global memory, not kernel launch overhead. This is also described in the upcoming paper. So, to ease maintenance, I can remove CUDA graphs if you prefer.
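For reference, orchestrating a chain of small kernels with CUDA graphs typically follows the stream-capture pattern below. This is a hedged sketch, not the code in this PR: the kernel names and launch parameters are placeholders, and only the CUDA runtime calls (cudaStreamBeginCapture, cudaStreamEndCapture, cudaGraphInstantiate, cudaGraphLaunch) are real API.

```cuda
// Sketch (assumed kernel names): record the per-diagram-group kernel
// launches once into a graph, then replay the whole chain with a single
// cudaGraphLaunch per event batch, amortizing launch overhead.
cudaStream_t stream;
cudaStreamCreate( &stream );

cudaGraph_t graph;
cudaGraphExec_t graphExec;

// Capture phase: launches are recorded into the graph, not executed
cudaStreamBeginCapture( stream, cudaStreamCaptureModeGlobal );
diagramGroupKernel1<<<blocks, threads, 0, stream>>>( /* args */ );
diagramGroupKernel2<<<blocks, threads, 0, stream>>>( /* args */ );
cudaStreamEndCapture( stream, &graph );
cudaGraphInstantiate( &graphExec, graph, nullptr, nullptr, 0 );

// Replay phase: one launch of the whole captured chain
cudaGraphLaunch( graphExec, stream );
cudaStreamSynchronize( stream );
```

As noted above, this only helps when launch overhead dominates; it cannot reduce the global-memory traffic that small diagram kernels incur.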

The number of diagrams per group is currently configured in the generate script through a hack. It should probably go in the run card instead; I'd like to discuss the details.

Concerning BLAS, this remains optional as in the present master. Note that in gg_ttggg BLAS is now faster than kernels for FPTYPE=d,f and just as fast for FPTYPE=m. It is still slower for simpler processes like gg_tt, however, so I think it should stay optional.

Let's discuss once the arXiv paper is out; it will be easier to go through the details then. I am keeping this as WIP for now.

Thanks!
Andrea

…oups and storeWfs/retrieveWfs only for selected wfs
… selection of wavefunctions to retrieve or store
…d warning (nevt declared but never referenced)
… - failures in ggttgg, ggttggg, smeftggtttt

Note also that, already with 5 diagrams per group, this is a factor 2 faster in CUDA for ggttggg (and ~30% faster in C++)

STARTED  AT Sat Oct 18 11:58:05 AM CEST 2025
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Sat Oct 18 01:07:03 PM CEST 2025 [Status=2]
./tput/teeThroughputX.sh -dmf -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -scaling
ENDED(1-scaling) AT Sat Oct 18 01:20:08 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn
ENDED(2) AT Sat Oct 18 01:23:31 PM CEST 2025 [Status=2]
./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -blasOn -scaling
ENDED(2-scaling) AT Sat Oct 18 01:39:28 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(3) AT Sat Oct 18 01:51:53 PM CEST 2025 [Status=2]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -d_f -bridge -makeclean
ENDED(4) AT Sat Oct 18 01:59:57 PM CEST 2025 [Status=2]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst
ENDED(5) AT Sat Oct 18 02:02:36 PM CEST 2025 [Status=2]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -curhst
ENDED(6) AT Sat Oct 18 02:05:07 PM CEST 2025 [Status=2]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common
ENDED(7) AT Sat Oct 18 02:07:49 PM CEST 2025 [Status=2]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -noBlas -makeclean
ENDED(8) AT Sat Oct 18 02:19:40 PM CEST 2025 [Status=2]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean
ENDED(9) AT Sat Oct 18 02:51:16 PM CEST 2025 [Status=2]

./tput/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0_bridge.txt: 2 FAILED TESTS
./tput/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0_blasOn.txt: 2 FAILED TESTS
./tput/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt: 2 FAILED TESTS
…- failures in ggttgg, ggttggg, smeftggtttt

STARTED  AT Sat Oct 18 02:51:16 PM CEST 2025
(SM tests)
ENDED(1) AT Sat Oct 18 03:46:22 PM CEST 2025 [Status=0]
(BSM tests)
ENDED(1) AT Sat Oct 18 03:49:46 PM CEST 2025 [Status=0]

tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt: [XSECTION] ERROR! No cross section in log file:
tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt: [XSECTION] ERROR! No cross section in log file:
tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt: [XSECTION] ERROR! No cross section in log file:
…itscrd90

Revert "[hack_ihel4p2] rerun 144 tput tests on itscrd90 (with diagram groups) - failures in ggttgg, ggttggg, smeftggtttt"
This reverts commit fd6d902.

Revert "[hack_ihel4p2] rerun 30 tmad tests on itscrd90 (with diagram groups) - failures in ggttgg, ggttggg, smeftggtttt"
This reverts commit 29f3b9b.
…G FIX in CODEGEN (amplitudes with >=4 input wavefunctions)

Checked that ggttg has no change in generated code

Checked that ggttgg, ggttggg and smeftggtttt now pass runTest

(Note that ggttggg builds seem faster than with a single kernel or with 1k kernels?)
…and CODEGEN bug fix (>=4 input wavefunctions for amplitudes)
…r kernel and CUDA Graphs) - all ok

With respect to the last rd90 scaling logs for the 'hack_ihel3_sep25' codebase (commit 6e5d26a):
- CUDA (without blas) is a factor ~10 slower for small grids and ~2.5 slower for large grids (1 cycle) for ggttggg
- C++ is 15-20% slower

STARTED  AT Sat Oct 18 09:09:50 PM CEST 2025
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Sat Oct 18 10:20:37 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -dmf -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -scaling
ENDED(1-scaling) AT Sat Oct 18 10:32:51 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn
ENDED(2) AT Sat Oct 18 10:37:13 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -blasOn -scaling
ENDED(2-scaling) AT Sat Oct 18 10:52:17 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(3) AT Sat Oct 18 11:07:44 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -d_f -bridge -makeclean
ENDED(4) AT Sat Oct 18 11:17:54 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst
ENDED(5) AT Sat Oct 18 11:21:09 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -curhst
ENDED(6) AT Sat Oct 18 11:24:22 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common
ENDED(7) AT Sat Oct 18 11:27:42 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -noBlas -makeclean
ENDED(8) AT Sat Oct 18 11:37:29 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean
ENDED(9) AT Sat Oct 18 11:59:39 PM CEST 2025 [Status=0]

No errors found in logs

No FPEs or '{ }' found in logs

No aborts found in logs
… kernel and CUDA graphs) - all ok

With respect to the last itscrd90 logs for the 'hack_ihel3_sep25' codebase (commit 10c3e3b):
- CUDA is a factor 2 slower
- C++ is 20% slower

STARTED  AT Sat Oct 18 11:59:39 PM CEST 2025
(SM tests)
ENDED(1) AT Sun Oct 19 12:53:11 AM CEST 2025 [Status=0]
(BSM tests)
ENDED(1) AT Sun Oct 19 12:57:06 AM CEST 2025 [Status=0]
…f CUDACPP_RUNTIME_GPUGRAPHS is set and non empty

Checked that gg_tt.md is regenerated correctly
… option -useGraphs (and run also x10 scaling tests in that case)
…phs (162 total: the old 144 now do not use cuda graphs)
for f in logs_ggtt*/*0.scaling; do cp $f ${f/.scaling/_graphs.scaling}; done
… new cuda graphs scaling)

./tput/allTees.sh -scalingonly

STARTED  AT Sun Oct 19 11:21:55 AM CEST 2025
SKIP './tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean '
ENDED(1) AT Sun Oct 19 11:21:55 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -dmf -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -scaling -makeclean
ENDED(1-scaling) AT Sun Oct 19 11:37:14 AM CEST 2025 [Status=0]
SKIP './tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn  '
ENDED(2) AT Sun Oct 19 11:37:14 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -blasOn -scaling
ENDED(2-scaling) AT Sun Oct 19 11:52:33 AM CEST 2025 [Status=0]
SKIP './tput/teeThroughputX.sh -ggttgg -ggttggg -dmf -useGraphs  '
ENDED(3) AT Sun Oct 19 11:52:33 AM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -useGraphs -scaling
ENDED(3-scaling) AT Sun Oct 19 12:08:54 PM CEST 2025 [Status=0]
SKIP './tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly  '
ENDED(4) AT Sun Oct 19 12:08:54 PM CEST 2025 [Status=0]
SKIP './tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -d_f -bridge  '
ENDED(5) AT Sun Oct 19 12:08:54 PM CEST 2025 [Status=0]
SKIP './tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst '
ENDED(6) AT Sun Oct 19 12:08:54 PM CEST 2025 [Status=0]
SKIP './tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -curhst '
ENDED(7) AT Sun Oct 19 12:08:54 PM CEST 2025 [Status=0]
SKIP './tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common '
ENDED(8) AT Sun Oct 19 12:08:54 PM CEST 2025 [Status=0]
SKIP './tput/teeThroughputX.sh -ggtt -ggttgg -dmf -noBlas -makeclean '
ENDED(9) AT Sun Oct 19 12:08:54 PM CEST 2025 [Status=0]
SKIP './tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb  '
ENDED(10) AT Sun Oct 19 12:08:54 PM CEST 2025 [Status=0]
…en dep/indep couplings

(add depCoup bool flag for easier codegen after removing templates from helicity amplitude methods)
…pr in color sum (e.g. needed for gg_ttggggg)
… from gg_ttggggg.dpg100dpf1000.sa

This was created on the A100 node using the sse4 build - the test took 3h30m

On [avalassi@itscrd-a100 gcc11/usr]
/data/avalassi/GPU2023/test-madgraph4gpu2/epochX/cudacpp/gg_ttggggg.dpg100dpf1000.sa/... >

  date; CUDACPP_RUNTIME_GOODHELICITIES=ALL CUDACPP_RUNTEST_DUMPEVENTS=1 \
  ./build.sse4_m_inl0_hrd0/runTest_cpp.exe ; date

  \cp ../../test/ref/dump* ../../../CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/test/ref/

Fri Nov  7 06:42:09 AM CET 2025
[==========] Running 3 tests from 3 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_GG_TTXGGGGG_CPU_XXX
[ RUN      ] SIGMA_SM_GG_TTXGGGGG_CPU_XXX.testxxx
[       OK ] SIGMA_SM_GG_TTXGGGGG_CPU_XXX.testxxx (0 ms)
[----------] 1 test from SIGMA_SM_GG_TTXGGGGG_CPU_XXX (0 ms total)

[----------] 1 test from SIGMA_SM_GG_TTXGGGGG_CPU_MISC
[ RUN      ] SIGMA_SM_GG_TTXGGGGG_CPU_MISC.testmisc
[       OK ] SIGMA_SM_GG_TTXGGGGG_CPU_MISC.testmisc (3 ms)
[----------] 1 test from SIGMA_SM_GG_TTXGGGGG_CPU_MISC (3 ms total)

[----------] 1 test from SIGMA_SM_GG_TTXGGGGG_CPU_NOMULTICHANNEL
[ RUN      ] SIGMA_SM_GG_TTXGGGGG_CPU_NOMULTICHANNEL.compareMomAndME
INFO: Env variable CUDACPP_RUNTIME_GOODHELICITIES equals "ALL": keep all helicities
Event dump written to ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttxggggg.txt
[       OK ] SIGMA_SM_GG_TTXGGGGG_CPU_NOMULTICHANNEL.compareMomAndME (12888302 ms)
[----------] 1 test from SIGMA_SM_GG_TTXGGGGG_CPU_NOMULTICHANNEL (12888302 ms total)

[----------] Global test environment tear-down
[==========] 3 tests from 3 test suites ran. (12888306 ms total)
[  PASSED  ] 3 tests.
Fri Nov  7 10:16:58 AM CET 2025
…handling of constexpr in color sum (e.g. for gg_ttggggg)
With respect to the last LUMI logs for the 'hack_ihel6p1' codebase (commit 6495dbd):
- What changed is the removal of templates from helicity amplitude methods
- Throughputs from HIP/dcd0/dcd1/noBlas and C++ are unchanged or slightly faster
- Throughputs from HIP/blasOn are 10% slower

STARTED  AT Fri 07 Nov 2025 02:16:05 PM EET
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -makeclean  -nocuda
ENDED(1) AT Fri 07 Nov 2025 03:24:01 PM EET [Status=0]
./tput/teeThroughputX.sh -dmf -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -scaling -makeclean  -nocuda
ENDED(1-scaling) AT Fri 07 Nov 2025 03:34:04 PM EET [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn -makeclean  -nocuda
ENDED(2) AT Fri 07 Nov 2025 03:38:37 PM EET [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -blasOn -scaling -makeclean  -nocuda
ENDED(2-scaling) AT Fri 07 Nov 2025 03:58:45 PM EET [Status=0]
./tput/teeThroughputX.sh -ggttgg -ggttggg -dmf -useGraphs -makeclean  -nocuda
ENDED(3) AT Fri 07 Nov 2025 04:02:46 PM EET [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -useGraphs -scaling -makeclean  -nocuda
ENDED(3-scaling) AT Fri 07 Nov 2025 04:13:03 PM EET [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean  -nocuda
ENDED(4) AT Fri 07 Nov 2025 04:46:13 PM EET [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -d_f -bridge -makeclean  -nocuda
ENDED(5) AT Fri 07 Nov 2025 04:52:36 PM EET [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst -makeclean  -nocuda
ENDED(6) AT Fri 07 Nov 2025 04:55:54 PM EET [Status=0]
SKIP './tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common -makeclean  -nocuda'
ENDED(7) AT Fri 07 Nov 2025 04:55:54 PM EET [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common -makeclean  -nocuda
ENDED(8) AT Fri 07 Nov 2025 04:59:11 PM EET [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -noBlas -makeclean  -nocuda
ENDED(9) AT Fri 07 Nov 2025 05:04:15 PM EET [Status=0]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean  -nocuda
ENDED(10) AT Fri 07 Nov 2025 05:40:05 PM EET [Status=0]
./tput/teeThroughputX.sh -makej -ggttg5 -dcd -makeclean  -nocuda
ENDED(11) AT Fri 07 Nov 2025 05:42:32 PM EET [Status=0]
./tput/teeThroughputX.sh -makej -ggttg5 -useGraphs  -nocuda
ENDED(12) AT Fri 07 Nov 2025 05:42:52 PM EET [Status=0]

No errors found in logs

No FPEs or '{ }' found in logs

No aborts found in logs
With respect to the last LUMI logs for the 'hack_ihel6p1' codebase (commit 4aa41e7):
- What changed is the removal of templates from helicity amplitude methods
- Throughputs from HIP/dcd0 and C++ are unchanged or slightly faster
Revert "[hack_ihel6p2] rerun 30 tmad tests on LUMI - all ok"
This reverts commit 0eeea36.

Revert "[hack_ihel6p2] rerun 159 tput tests on LUMI - all ok"
This reverts commit ba8df93.
With respect to the last rd90 logs for the 'hack_ihel6p1' codebase (commit e96ecf3):
- What changed is the removal of templates from helicity amplitude methods
- Throughputs from CUDA/dcd0/dcd1 (with/without BLAS) are unchanged
- Throughputs from C++ are 5% faster

STARTED  AT Fri Nov  7 01:13:35 PM CET 2025
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -makeclean
ENDED(1) AT Fri Nov  7 03:38:52 PM CET 2025 [Status=0]
./tput/teeThroughputX.sh -dmf -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -scaling -makeclean
ENDED(1-scaling) AT Fri Nov  7 03:51:10 PM CET 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn -makeclean
ENDED(2) AT Fri Nov  7 03:56:51 PM CET 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -blasOn -scaling -makeclean
ENDED(2-scaling) AT Fri Nov  7 04:11:45 PM CET 2025 [Status=0]
./tput/teeThroughputX.sh -ggttgg -ggttggg -dmf -useGraphs -makeclean
ENDED(3) AT Fri Nov  7 04:19:29 PM CET 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -useGraphs -scaling -makeclean
ENDED(3-scaling) AT Fri Nov  7 04:40:27 PM CET 2025 [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(4) AT Fri Nov  7 05:21:54 PM CET 2025 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -d_f -bridge -makeclean
ENDED(5) AT Fri Nov  7 05:31:08 PM CET 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst -makeclean
ENDED(6) AT Fri Nov  7 05:35:41 PM CET 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -curhst -makeclean
ENDED(7) AT Fri Nov  7 05:40:00 PM CET 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common -makeclean
ENDED(8) AT Fri Nov  7 05:44:28 PM CET 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -noBlas -makeclean
ENDED(9) AT Fri Nov  7 05:50:54 PM CET 2025 [Status=0]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean
ENDED(10) AT Fri Nov  7 06:09:23 PM CET 2025 [Status=0]
./tput/teeThroughputX.sh -makej -ggttg5 -dcd -makeclean
ENDED(11) AT Fri Nov  7 06:10:50 PM CET 2025 [Status=0]
./tput/teeThroughputX.sh -makej -ggttg5 -useGraphs
ENDED(12) AT Fri Nov  7 06:11:28 PM CET 2025 [Status=0]

No errors found in logs

No FPEs or '{ }' found in logs

No aborts found in logs
With respect to the last rd90 logs for the 'hack_ihel6p1' codebase (commit 628be94):
- What changed is the removal of templates from helicity amplitude methods
- Throughputs from CUDA and C++ are unchanged

STARTED  AT Fri Nov  7 06:11:28 PM CET 2025
/data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/teeMadX.sh -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -dmf -makeclean
(SM tests)
ENDED(1) AT Fri Nov  7 07:03:51 PM CET 2025 [Status=0]
/data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/teeMadX.sh -heftggbb -susyggtt -susyggt1t1 -smeftggtttt -dmf -makeclean
(BSM tests)
ENDED(1) AT Fri Nov  7 07:08:44 PM CET 2025 [Status=0]

tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt:ERROR! events.lhe.cpp.1 and events.lhe.ref.1 differ!

No asserts found in logs

No segmentation fault found in logs
@valassi commented Nov 23, 2025

Full documentation is in https://arxiv.org/abs/2510.05392v2, which should appear tomorrow.

@oliviermattelaer commented

Good. Excellent! Thanks for the notification.

Olivier

@valassi commented Dec 11, 2025

Thanks Olivier!

Final notification: https://arxiv.org/abs/2510.05392v3 should appear tomorrow and has been submitted to EPJC. It clarifies a few points in #1072 and PR #1073 but leaves this PR unchanged.

I mark this PR #1066 as ready for review. This completes my kernel splitting work on madgraph4gpu.

@valassi marked this pull request as ready for review December 11, 2025 09:57
@valassi commented Dec 11, 2025

Hi Olivier, Daniele, I have marked you as reviewers. Let me know if you want to discuss this. Thanks
Andrea



Development

Successfully merging this pull request may close these issues:
- Reduce build times in CUDA and C++ for complex processes (split kernels and more)
- Split the sigmakin kernel into smaller kernels