Description
Starting in mid-December, some of my MPI reductions sometimes produce wrong results, but only when I am working with CUDA.
I came to this conclusion via printf-debugging:
```julia
println("PRE", mypid, " ", Array(remapper._interpolated_values)[end, end, end, 1]); flush(stdout)
MPI.Barrier(remapper.comms_ctx.mpicomm)
MPI.Reduce!(remapper._interpolated_values[:, :, :, 1], dest, +, mpicomm)
MPI.Barrier(remapper.comms_ctx.mpicomm)
println("POST", mypid, " ", Array(dest)[end, end, end]); flush(stdout)
```
which produces:
```
PRE1 0.0
PRE2 500.0
POST1 249.99999999999997
POST2 0.0
```
The value 249.99999999999997 is wrong: it should be 500.0 (0.0 + 500.0 = 500.0). I tried forcing `CUDA.synchronize` and/or `CUDA.@sync`, and sometimes the results change (for example, I can get 450.0 instead of 250.0). Note also that 250.0 could be the correct value for another point.
The code above is embedded in a more complex test case, and I can reproduce the problem reliably when running in a relatively isolated environment. When I work interactively or change the test, it becomes very flaky and extremely hard to reproduce. To get consistent results, I have to clean the depot.
I initially thought it was a buggy MPI implementation, but the problem also seems to appear in production on a different machine (with simulations that run for multiple days). On that machine, the same test never triggers the bug (as checked by running it 100 times).
Since the bug is hard to reproduce, these conclusions are tentative:
- With CUDA 5.6, MPI.jl 0.20.21 is fine, but MPI.jl 0.20.22 is not (build)
- With CUDA 5.6, commit aac9688 is not fine (build), but commit 9584ac8 is fine (build)
- Adding `local` in front of `datatype`, as suggested on Slack, does not fix the issue (build)
- With CUDA 5.5, MPI.jl 0.20.22 is fine (build)
- Both Julia 1.10 and Julia 1.11 have the problem
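Given the version matrix above, a temporary mitigation (an assumption based on these observations, not a confirmed fix) would be to pin a known-good combination in the environment's Project.toml, e.g.:

```toml
[compat]
# Known-good pairing per the observations above:
# MPI.jl 0.20.21 works with CUDA 5.6 (0.20.22 triggers the bug there)
MPI = "=0.20.21"
CUDA = "5.6"
```

Alternatively, staying on CUDA 5.5 with MPI.jl 0.20.22 also appeared fine in the tests above.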