Conversation
…n GPU tree walk enabled
…mmenceCalculateGravityLocal is no longer called
We still need to decide what to do about the nodeGravityComputation and particleGravityComputation kernel launches. I don't think the remote gravity performance will benefit much from consolidating these, but this PR as-is probably breaks the local GPU gravity calculation if we aren't using the gpu-local-tree-walk option.
This doesn't even compile if "--enable-gpu-local-tree-walk" is not specified.
I just tested this out using the verbs comm layer and CUDA memory errors are back. I'm guessing the poor performance from MPI was actually preventing the remote gravity kernels from stepping on each other. We'll see if the CSA folks have any other suggestions when we talk to them next Monday, but I think we're going to need to move all of the nodeGravityComputation and particleGravityComputation kernel launches to the DataManager as well.
…Added stats output. See MEMORY_POOL_CHANGES.txt for more info
* Remove barrier preventing calculateGravityRemote from executing before prefetch data is transferred to GPU * Add barrier ensuring prefetch data is transferred before launching PEList kernels * Node-wide GPU data pointers no longer passed through TreePieces
…ks in host memory pool
* CUDA stream for DataManager created later
cudaFreeAsync returns memory to CUDA's pool immediately, but GPU operations using that memory may still be in flight. When cudaMallocAsync returns the same memory, there's a race condition causing non-deterministic results. Fix: Synchronize the stream after cudaFreeAsync to ensure all operations complete before the memory becomes available for reuse. This fixes the teststep/testenergy benchmark which was failing due to ~0.04% energy calculation errors from the race condition.
Host memory pool fully featured
I went ahead and merged Matt's memory pool PR into this. As of that latest commit (04724af), teststep, testcosmo and testcollapse all appear to pass and come back clean under valgrind, address-sanitizer and compute-sanitizer when run on multiple ranks. We've also confirmed that the halo mass function matches between the CPU and GPU versions of runs22, although that comparison was run a couple of months ago with an older commit (10ed833). At this point, the main issue is that the star formation history (using metal cooling) for CosmoRun diverges between the CPU and GPU versions somewhere around z=7. I'm also getting a heap corruption error around this time. This was all run with the verbs build on Vista. I also had to roll back to (10ed833); otherwise CosmoRun crashes and hangs fairly quickly. The MPI/UCX build works much more smoothly with the latest commit (04724af), but I run into RDMA errors 20 or so steps in. Disabling the UCX transport layer with h896 seems much more stable, although I haven't run it for more than about 50 steps.
The GPU code currently hangs if running on one node. I believe it is because there are no remote walks, so DataManager::transferParticleVarsBack() is not called enough times.
I've checked that it seems to run correctly with or without
Previously this was done using the device pointers, which occasionally caused hangs if no data was transferred to the GPU (e.g., a remote walk when running on a single node). Also cleaned up some unused attributes in DataManager.
If there are no remote walks, these asserts will fail.
* fixed host pool init race condition * Address PR #7 review: MemoryPool.cpp cleanups - gpuPoolInit: Replace magic 16 with const maxCudaDevice; add cudaGetDeviceCount and CkAssert(nCudaDeviceCount <= maxCudaDevice) - gpuPoolInit: Check cudaMemPoolSetAttribute return value, CkAbort on failure - hostPoolReportStats: Replace snprintf with make_formatted_string per project pattern (formatted_string.h)
Replace plain bool with std::atomic<bool> for bLocalDataTransferred and bRemoteDataTransferred. These flags are set in async callbacks and read by multiple PEs in PEList::finishWalk(); without atomics, some PEs see stale values and never set bKernelDelayed, causing tryLaunchDelayedKernel to run before they check in → deadlock. - DataManager: .store() for all writes; .load() in PEList when reading - donePrefetch path also sets bRemoteDataTransferred via .store(true)
Having GPU kernel launches tied to the TreePieces degrades performance and is probably causing some data race conditions. The GPU versions of the local tree walk and ewald calculation are now handled by the data manager. Any kernel launches involving interaction list calculations are handled by node groups.
Note that another open PR #183 has already been merged into this branch.