Conversation
…n GPU tree walk enabled
…mmenceCalculateGravityLocal is no longer called
We still need to decide what to do about the nodeGravityComputation and particleGravityComputation kernel launches. I don't think the remote gravity performance will benefit much from consolidating these, but this PR as-is probably breaks the local GPU gravity calculation if we aren't using the gpu-local-tree-walk option.
This doesn't even compile if "--enable-gpu-local-tree-walk" is not specified.
I just tested this out using the verbs comm layer and CUDA memory errors are back. I'm guessing the poor performance from MPI was actually preventing the remote gravity kernels from stepping on each other. We'll see if the CSA folks have any other suggestions when we talk to them next Monday, but I think we're going to need to move all of the nodeGravityComputation and particleGravityComputation kernel launches to the DataManager as well.
…Added stats output. See MEMORY_POOL_CHANGES.txt for more info
* Remove barrier preventing calculateGravityRemote from executing before prefetch data is transferred to GPU * Add barrier ensuring prefetch data is transferred before launching PEList kernels * Node-wide GPU data pointers no longer passed through TreePieces
…ks in host memory pool
* CUDA stream for DataManager created later
cudaFreeAsync returns memory to CUDA's pool immediately, but GPU operations using that memory may still be in flight. When cudaMallocAsync returns the same memory, there's a race condition causing non-deterministic results. Fix: Synchronize the stream after cudaFreeAsync to ensure all operations complete before the memory becomes available for reuse. This fixes the teststep/testenergy benchmark which was failing due to ~0.04% energy calculation errors from the race condition.
Host memory pool fully featured
I went ahead and merged Matt's memory pool PR into this. As of that latest commit (04724af), teststep, testcosmo and testcollapse all appear to pass and come back clean under valgrind, address-sanitizer and compute-sanitizer when run on multiple ranks. We've also confirmed that the halo mass function matches between the CPU and GPU versions of runs22, although that comparison was run a couple of months ago with an older commit (10ed833). At this point, the main issue is that the star formation history (using metal cooling) for CosmoRun diverges between the CPU and GPU versions somewhere around z=7. I'm also getting a heap corruption error around this time. This was all run with the verbs build on Vista. I also had to roll back to (10ed833); otherwise CosmoRun crashes and hangs fairly quickly. The MPI/UCX build works much more smoothly with the latest commit (04724af), but I run into RDMA errors 20 or so steps in. Disabling the UCX transport layer with h896 seems much more stable, although I haven't run it for more than about 50 steps.
The GPU code currently hangs if running on one node. I believe it is because there are no remote walks, so DataManager::transferParticleVarsBack() is not called enough times.
I've checked that it seems to run correctly with or without
Previously this was done using the device pointers, which occasionally caused hangs if no data was transferred to the GPU (e.g., a remote walk when running on a single node). Also cleaned up some unused attributes in DataManager.
If there are no remote walks, these asserts will fail.
* fixed host pool init race condition * Address PR #7 review: MemoryPool.cpp cleanups - gpuPoolInit: Replace magic 16 with const maxCudaDevice; add cudaGetDeviceCount and CkAssert(nCudaDeviceCount <= maxCudaDevice) - gpuPoolInit: Check cudaMemPoolSetAttribute return value, CkAbort on failure - hostPoolReportStats: Replace snprintf with make_formatted_string per project pattern (formatted_string.h)
Replace plain bool with std::atomic<bool> for bLocalDataTransferred and bRemoteDataTransferred. These flags are set in async callbacks and read by multiple PEs in PEList::finishWalk(); without atomics, some PEs see stale values and never set bKernelDelayed, causing tryLaunchDelayedKernel to run before they check in → deadlock. - DataManager: .store() for all writes; .load() in PEList when reading - donePrefetch path also sets bRemoteDataTransferred via .store(true)
Having GPU kernel launches tied to the TreePieces degrades performance and is probably causing some data race conditions. The GPU versions of the local tree walk and ewald calculation are now handled by the data manager. Any kernel launches involving interaction list calculations are handled by node groups.
Note that another open PR #183 has already been merged into this branch.