Reductions with CUDA are sometimes wrong #892

@Sbozzolo

Description

Starting in mid-December, some of my MPI reductions occasionally produce wrong results, but only when I am working with CUDA.

I came to this conclusion via printf-debugging:

println("PRE", mypid, " ", Array(remapper._interpolated_values)[end, end, end, 1]); flush(stdout)
MPI.Barrier(remapper.comms_ctx.mpicomm)
MPI.Reduce!(remapper._interpolated_values[:, :, :, 1], dest, +, mpicomm)
MPI.Barrier(remapper.comms_ctx.mpicomm)
println("POST", mypid, " ", Array(dest)[end, end, end]); flush(stdout)

that produces:

PRE1 0.0
PRE2 500.0
POST1 249.99999999999997 
POST2 0.0 

The value 249.99999999999997 is wrong; it should be 500.0 (0.0 + 500.0 = 500.0). I tried forcing CUDA.synchronize and/or CUDA.@sync, and the results sometimes change (for example, I can get 450.0 instead of 250.0). Note also that 250.0 could be the correct value for a different point.
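
To make the synchronization experiments concrete, here is a minimal, self-contained sketch of the pattern: synchronize the device and copy the slice to the host before calling MPI.Reduce!, so that MPI only ever sees host buffers. The array shape and fill values below are invented for illustration and are not from the actual test; the failing code passes the CuArray slice to MPI.Reduce! directly.

# Illustrative sketch only (not the failing test): reduce host-side copies
# of GPU data after an explicit synchronization.
using MPI
using CUDA

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

gpu_vals = CUDA.fill(Float64(rank), 4, 4, 4)   # stand-in for remapper._interpolated_values[:, :, :, 1]
CUDA.synchronize()                             # wait for any pending GPU work on this task
send_host = Array(gpu_vals)                    # explicit device-to-host copy

dest = zeros(Float64, 4, 4, 4)                 # result is only meaningful on the root rank
MPI.Reduce!(send_host, dest, +, comm)          # reduce purely host-side buffers

rank == 0 && println("reduced value at [end, end, end] = ", dest[end, end, end])
MPI.Finalize()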

The code above is embedded in a larger test case, and I can reproduce the problem reliably only when running in a relatively isolated environment. When I work interactively or when I change the test, the failure becomes very flaky and extremely hard to reproduce. To get consistent results, I have to clean the depot.

I initially thought it was a buggy MPI implementation, but the problem also appears in production on a different machine (with simulations that run for multiple days). On that machine, the same test never triggers the bug (I checked by running it 100 times).

Considering how hard the bug is to reproduce, here is what I think I have concluded so far:
