Commit cd7517c
Fix a race condition bug caused by missing nullptr check
Summary:
This is a very tricky race condition bug that was root caused to the nullptr dereference of kernelFlag.
We are missing nullptr check in the KernelTestHostAbort() kernel, which caused illegal memory access during the distributed test of allgatherP on AMD:
`[1,8]<stdout>:Failed: Cuda error fbcode/comms/ctran/tests/CtranDistAllgatherPTests.cc:59 'an illegal memory access was encountered'
[1,8]<stdout>:W1119 21:46:34.875350 952795 CtranMapperRegMem.cc:197] rtptest042:952795:952795 [0][main] CTRAN WARN Total 2/2 remaining segments are still in RegCache at destroy time. `
full results see P2047923648
For Nvidia only tests, this issue didn't happen because the `abort` fileld in CtranComm [is initialized as `false`](https://fburl.com/code/0fybwtgz)
shmDevState.enableCTancellableWaits inherits this value, so ctran::utils::loadInt(flag) is short-circuited.
For NV/AMD compatible UT of allgatherP, we use the McclComm obj to initialize CtranComm. It by default enables cancellable waits, and thus exposed the dereference of flag.
The failure was very undeterministic because it relys on the race of "step" and "val" on this line https://fburl.com/code/l24vhay5.
The fix is simple, if flag is empty, we assume the user do not need cancellable waits, and we can simply return false in KernelTestHostAbort.
Reviewed By: cenzhaometa
Differential Revision: D87946459
fbshipit-source-id: ce4cc441768d5fbbdcff053fa9790fef0ace73b01 parent 32c23d5 commit cd7517c
1 file changed
+2
-1
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
45 | 45 | | |
46 | 46 | | |
47 | 47 | | |
48 | | - | |
| 48 | + | |
| 49 | + | |
49 | 50 | | |
50 | 51 | | |
51 | 52 | | |
| |||
0 commit comments