Consolidate GPU Kernel launches #186

Open
spencerw wants to merge 85 commits into N-BodyShop:master from spencerw:kernelfix
Conversation

@spencerw
Member

@spencerw spencerw commented Nov 20, 2024

Having GPU kernel launches tied to the TreePieces degrades performance and is probably causing some data race conditions. The GPU versions of the local tree walk and ewald calculation are now handled by the data manager. Any kernel launches involving interaction list calculations are handled by node groups.

Note that another open PR #183 has already been merged into this branch.

@spencerw
Member Author

We still need to decide what to do about the nodeGravityComputation and particleGravityComputation kernel launches. I don't think the remote gravity performance will benefit much from consolidating these, but this PR as-is probably breaks the local GPU gravity calculation if we aren't using the gpu-local-tree-walk option.

@trquinn
Member

trquinn commented Nov 21, 2024

This doesn't even compile if "--enable-gpu-local-tree-walk" is not specified.

@spencerw
Member Author

I just tested this out using the verbs comm layer and CUDA memory errors are back. I'm guessing the poor performance from MPI was actually preventing the remote gravity kernels from stepping on each other.

We'll see if the CSA folks have any other suggestions when we talk to them next Monday, but I think we're going to need to move all of the nodeGravityComputation and particleGravityComputation kernel launches to the DataManager as well.

@spencerw spencerw changed the title from "Consolidate gpuLocalTreeWalk and Ewald Kernel Launches" to "Consolidate GPU Kernel launches" Apr 15, 2025
mrcawood and others added 5 commits October 2, 2025 15:08
…Added stats output. See MEMORY_POOL_CHANGES.txt for more info
* Remove barrier preventing calculateGravityRemote from executing before
  prefetch data is transferred to GPU
* Add barrier ensuring prefetch data is transferred before launching
  PEList kernels
* Node-wide GPU data pointers no longer passed through TreePieces
spencerw and others added 15 commits October 24, 2025 12:11
* CUDA stream for DataManager created later
cudaFreeAsync returns memory to CUDA's pool immediately, but GPU operations
using that memory may still be in flight. When cudaMallocAsync returns the
same memory, there's a race condition causing non-deterministic results.

Fix: Synchronize the stream after cudaFreeAsync to ensure all operations
complete before the memory becomes available for reuse.

This fixes the teststep/testenergy benchmark which was failing due to
~0.04% energy calculation errors from the race condition.
Host memory pool fully featured
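The synchronization fix in the commit message above follows the usual caveat of CUDA's stream-ordered allocator: cudaFreeAsync only enqueues the free in stream order, so the pool may recycle those pages for a later cudaMallocAsync on a different stream while work is still in flight. A minimal sketch of the pattern (hypothetical names, error checks omitted, not the actual ChaNGa code):

```cpp
#include <cuda_runtime.h>

// Sketch of the race and the fix (names hypothetical, error checks omitted).
// cudaFreeAsync() returns 'buf' to CUDA's pool in stream order; without a
// synchronize, a cudaMallocAsync() on another stream can receive the same
// pages while the kernel enqueued below is still reading them.
void sketch(cudaStream_t stream, size_t nBytes) {
    float *buf;
    cudaMallocAsync(&buf, nBytes, stream);
    // someKernel<<<grid, block, 0, stream>>>(buf);   // still in flight...
    cudaFreeAsync(buf, stream);
    // Fix: ensure every operation using 'buf' has completed before the pool
    // can hand its pages to an allocation on a different stream.
    cudaStreamSynchronize(stream);
}
```

The synchronize trades some overlap for determinism; a finer-grained alternative would be event-based cross-stream ordering, but a stream synchronize after the free is the simplest way to close the window.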
@spencerw
Member Author

I went ahead and merged Matt's memory pool PR into this. As of the latest commit (04724af), teststep, testcosmo and testcollapse all appear to pass, and they come back clean under valgrind, address-sanitizer and compute-sanitizer when run on multiple ranks.

We've also confirmed that the halo mass function matches between the CPU and GPU version of runs22, although this was run a couple months ago with an older commit (10ed833).

At this point, the main issue is that the star formation history (using metal cooling) for CosmoRun diverges between the CPU and GPU version somewhere around z=7. I'm also getting a heap corruption error around this time. This was all run with the verbs build on Vista. I also had to roll back to (10ed833); otherwise CosmoRun crashes and hangs fairly quickly. The MPI/UCX build works much more smoothly with the latest commit (04724af), but I run into RDMA errors 20 or so steps in.

[c637-132:2942276:0:2942276] ib_mlx5_log.c:179  Local length error on mlx5_0:1/IB (synd 0x1 vend 0xda hw_synd 0/0)
[c637-132:2942276:0:2942276] ib_mlx5_log.c:179  RC QP 0x3db8 wqe[58529]: RDMA_READ s-- [rva 0x40034f29cd60 rkey 0x9b2a95c] [va 0x4f5d5e30 len 16384 lkey 0x15accae0] [rqpn 0x10c45 dlid=734 sl=0 port=1 src_path_bits=0]
==== backtrace (tid:2942276) ====
 0  /opt/apps/ucx/1.17.0/lib/libucs.so.0(ucs_handle_error+0x288) [0x40002030eb98]
 1  /opt/apps/ucx/1.17.0/lib/libucs.so.0(ucs_fatal_error_message+0xd0) [0x40002030c1e0]
 2  /opt/apps/ucx/1.17.0/lib/libucs.so.0(ucs_log_default_handler+0xd80) [0x4000203105f0]
 3  /opt/apps/ucx/1.17.0/lib/libucs.so.0(ucs_log_dispatch+0xd0) [0x400020310950]
 4  /opt/apps/ucx/1.17.0/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x420) [0x4000225e6b50]
 5  /opt/apps/ucx/1.17.0/lib/ucx/libuct_ib.so.0(+0x3d694) [0x4000225fd694]
 6  /opt/apps/ucx/1.17.0/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_check_completion+0x68) [0x4000225e7a9c]
 7  /opt/apps/ucx/1.17.0/lib/ucx/libuct_ib.so.0(+0x3f364) [0x4000225ff364]
 8  /opt/apps/ucx/1.17.0/lib/libucp.so.0(ucp_worker_progress+0x64) [0x4000201d8ab8]
 9  /opt/apps/nvidia24/openmpi/5.0.5_nvc249/lib/libmpi.so.40(mca_pml_ucx_recv+0x134) [0x40001f178034]
10  /opt/apps/nvidia24/openmpi/5.0.5_nvc249/lib/libmpi.so.40(PMPI_Recv+0x1bc) [0x40001f00053c]
11  /work/03777/scw7/vista/changa_kernelfix_group_matt/ChaNGa.smp.cuda() [0xa97660]
12  /work/03777/scw7/vista/changa_kernelfix_group_matt/ChaNGa.smp.cuda(_Z24LrtsAdvanceCommunicationi+0x648) [0xa97fc8]
13  /work/03777/scw7/vista/changa_kernelfix_group_matt/ChaNGa.smp.cuda(_Z25CommunicationServerThreadi+0x1c) [0xa94e1c]
14  /work/03777/scw7/vista/changa_kernelfix_group_matt/ChaNGa.smp.cuda() [0xa94cac]
15  /work/03777/scw7/vista/changa_kernelfix_group_matt/ChaNGa.smp.cuda(ConverseInit+0x1694) [0xa94454]
16  /work/03777/scw7/vista/changa_kernelfix_group_matt/ChaNGa.smp.cuda(charm_main+0x2c) [0xa0566c]
17  /lib64/libc.so.6(+0x2c79c) [0x40001f9ec79c]
18  /lib64/libc.so.6(__libc_start_main+0x98) [0x40001f9ec86c]
19  /work/03777/scw7/vista/changa_kernelfix_group_matt/ChaNGa.smp.cuda(_start+0x30) [0x5cadb0]
=================================

Disabling the UCX transport layer with export OMPI_MCA_btl=^uct seems to help it get a bit further, but it still crashes with the same error eventually.

h896 seems much more stable, although I haven't run it for more than about 50 steps.

@trquinn
Member

trquinn commented Jan 16, 2026

The GPU code currently hangs if running on one node. I believe it is because there are no remote walks so DataManager::transferParticleVarsBack() is not called enough times.

@trquinn
Member

trquinn commented Jan 21, 2026

This doesn't even compile if "--enable-gpu-local-tree-walk" is not specified.

I've checked that it seems to run correctly with or without --enable-gpu-local-tree-walk after my recent patch (yet to be merged).

trquinn and others added 6 commits January 21, 2026 07:25
Previously this was done using the device pointers, which occasionally
caused hangs if no data was transferred to the GPU.  (E.g. a remote walk
when running on a single node.)

Also cleaned up some unused attributes in DataManager.
If there are no remote walks, these asserts will fail.
* fixed host pool init race condition

* Address PR #7 review: MemoryPool.cpp cleanups

- gpuPoolInit: Replace magic 16 with const maxCudaDevice; add cudaGetDeviceCount
  and CkAssert(nCudaDeviceCount <= maxCudaDevice)
- gpuPoolInit: Check cudaMemPoolSetAttribute return value, CkAbort on failure
- hostPoolReportStats: Replace snprintf with make_formatted_string per project
  pattern (formatted_string.h)
Replace plain bool with std::atomic<bool> for bLocalDataTransferred and
bRemoteDataTransferred. These flags are set in async callbacks and read by
multiple PEs in PEList::finishWalk(); without atomics, some PEs see stale
values and never set bKernelDelayed, causing tryLaunchDelayedKernel to
run before they check in → deadlock.

- DataManager: .store() for all writes; .load() in PEList when reading
- donePrefetch path also sets bRemoteDataTransferred via .store(true)