Commit cd7517c

dmwu authored and meta-codesync[bot] committed
Fix a race condition bug caused by missing nullptr check
Summary: This is a very tricky race condition bug that was root-caused to a nullptr dereference of kernelFlag. The KernelTestHostAbort() device function was missing a nullptr check, which caused an illegal memory access during the distributed test of allgatherP on AMD:

`[1,8]<stdout>:Failed: Cuda error fbcode/comms/ctran/tests/CtranDistAllgatherPTests.cc:59 'an illegal memory access was encountered' [1,8]<stdout>:W1119 21:46:34.875350 952795 CtranMapperRegMem.cc:197] rtptest042:952795:952795 [0][main] CTRAN WARN Total 2/2 remaining segments are still in RegCache at destroy time. `

For full results, see P2047923648.

For Nvidia-only tests, this issue did not happen because the `abort` field in CtranComm [is initialized as `false`](https://fburl.com/code/0fybwtgz). shmDevState.enableCancellableWaits inherits this value, so ctran::utils::loadInt(flag) is short-circuited and the flag is never dereferenced.

For the NV/AMD-compatible UT of allgatherP, we use the McclComm object to initialize CtranComm. It enables cancellable waits by default, which exposed the dereference of flag. The failure was highly nondeterministic because it relies on the race between "step" and "val" on this line: https://fburl.com/code/l24vhay5.

The fix is simple: if flag is null, we assume the user does not need cancellable waits and skip the load, so KernelTestHostAbort returns false unless kernelDoAbort is set.

Reviewed By: cenzhaometa

Differential Revision: D87946459

fbshipit-source-id: ce4cc441768d5fbbdcff053fa9790fef0ace73b0
1 parent 32c23d5 commit cd7517c

1 file changed, +2 -1 lines changed

comms/ctran/algos/common/GpeKernelDev.cuh

Lines changed: 2 additions & 1 deletion
@@ -45,7 +45,8 @@ static inline __device__ bool KernelTestHostAbort(volatile int* flag) {
   // local), or user active aborts/timeouts will put CtranComm in such state, if
   // abort is enabled for CtranComm.
   return shmDevState.enableCancellableWaits &&
-      (kernelDoAbort || ctran::utils::loadInt(flag) == KERNEL_HOST_ABORT);
+      (kernelDoAbort ||
+       (flag && ctran::utils::loadInt(flag) == KERNEL_HOST_ABORT));
 }
 
 static inline __device__ bool KernelTestHostAbortBlock(volatile int* flag) {
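
The short-circuit behavior that the fix relies on can be sketched on the host as follows. This is a minimal model, not the actual device code: `DevState`, `testHostAbort`, the free function `loadInt`, and the value chosen for `KERNEL_HOST_ABORT` are hypothetical stand-ins that only mirror the names in the diff.

```cpp
#include <cassert>

// Assumed value for illustration; the real constant is defined elsewhere in ctran.
constexpr int KERNEL_HOST_ABORT = 1;

// Hypothetical stand-in for the shared device state (shmDevState in the diff).
struct DevState {
  bool enableCancellableWaits;
};

// Hypothetical stand-in for ctran::utils::loadInt: a plain volatile read.
// Calling it with a null pointer is undefined behavior, which is the bug
// the commit guards against.
inline int loadInt(volatile int* ptr) {
  return *ptr;
}

// Mirrors the fixed predicate. `flag && ...` short-circuits, so loadInt is
// never reached when flag is null; a null flag is treated as "no host abort".
inline bool testHostAbort(
    const DevState& state,
    bool kernelDoAbort,
    volatile int* flag) {
  return state.enableCancellableWaits &&
      (kernelDoAbort || (flag && loadInt(flag) == KERNEL_HOST_ABORT));
}
```

With cancellable waits enabled and a null flag, the sketch returns the value of `kernelDoAbort` without touching memory. With cancellable waits disabled, the whole right-hand side is skipped, which matches the summary's explanation of why the Nvidia-only tests never hit the dereference.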
