Description
Starting in mid-December, some of my MPI reductions sometimes produce wrong results, but only when I am working with CUDA.
I came to this conclusion via printf-debugging:
```julia
println("PRE", mypid, " ", Array(remapper._interpolated_values)[end, end, end, 1]); flush(stdout)
MPI.Barrier(remapper.comms_ctx.mpicomm)
MPI.Reduce!(remapper._interpolated_values[:, :, :, 1], dest, +, mpicomm)
MPI.Barrier(remapper.comms_ctx.mpicomm)
println("POST", mypid, " ", Array(dest)[end, end, end]); flush(stdout)
```
which produces:
```
PRE1 0.0
PRE2 500.0
POST1 249.99999999999997
POST2 0.0
```
The value 249.99999999999997 is wrong: it should be 500.0 (0.0 + 500.0 = 500.0). I tried forcing `CUDA.synchronize` and/or `CUDA.@sync`, and sometimes the results change (for example, I can get 450.0 instead of 250.0). Note also that 250.0 could be the correct value for another point.
The code above is embedded in a more complex test case, and I can reproduce the problem reliably when running in a relatively isolated environment. When I work interactively or change the test, it becomes very flaky and extremely hard to reproduce. To get consistent results, I have to clean the depot.
I initially thought it was a buggy MPI implementation, but the problem also seems to appear in production on a different machine (with simulations that run for multiple days). On that machine, the same test never triggers the bug (as checked by running it 100 times).
Since the bug is hard to reproduce, these conclusions are tentative:
- With CUDA 5.6, MPI.jl 0.20.21 is fine, but MPI.jl 0.20.22 is not (build)
- With CUDA 5.6, commit aac9688 is not fine (build), but commit 9584ac8 is fine (build)
- Adding `local` in front of `datatype`, as suggested on Slack, does not fix the issue (build)
- With CUDA 5.5, MPI.jl 0.20.22 is fine (build)
- Both Julia 1.10 and Julia 1.11 have the problem
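Given the version matrix above, a temporary mitigation (an assumption based on these observations, not a confirmed fix) would be to pin a known-good combination in the environment's Project.toml, e.g.:

```toml
[compat]
# Known-good pairing per the observations above:
# MPI.jl 0.20.21 works with CUDA 5.6 (0.20.22 triggers the bug there)
MPI = "=0.20.21"
CUDA = "5.6"
```

Alternatively, staying on CUDA 5.5 with MPI.jl 0.20.22 also appeared fine in the tests above.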