-
Notifications
You must be signed in to change notification settings - Fork 5.2k
Fix lookup for current Thread in async signal handler #122048
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix lookup for current Thread in async signal handler #122048
Conversation
The current CheckActivationSafePoint uses thread local storage to get the current Thread instance. But this function is called from async signal handler (the activation signal handler) and it is not allowed to access TLS variables there because the access can allocate and if the interrupted code was running in an allocation code, it could crash. There was no problem with this since .NET 1.0, but a change in the recent glibc version has broken this. We've got reports of crashes in this code due to the reason mentioned above. This change introduces an async safe mechanism for accessing the current Thread instance from async signal handlers. It uses a segmented array that can grow, but never shrink. Entries for threads are added when runtime creates a thread / attaches to an external thread and removed when the thread dies. Closes dotnet#121581
|
Tagging subscribers to this area: @mangod9 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes crashes occurring when async signal handlers access Thread Local Storage (TLS) in recent glibc versions. The fix introduces an async-safe, lock-free segmented array to map OS thread IDs to Thread instances, avoiding TLS access in signal handlers. The implementation is shared between CoreCLR and NativeAOT through the minipal library.
Key changes:
- New async-safe thread lookup mechanism using lock-free segmented arrays
- Added
minipal_get_current_thread_id_no_cache()to avoid TLS in signal handlers - Enhanced activation safe point checks to avoid taking ScanReaderLock
- Integrated async-safe thread lookup in both CoreCLR and NativeAOT runtimes
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| src/native/minipal/thread.h | Adds async-safe API declarations and no-cache thread ID function |
| src/native/minipal/thread.c | Implements lock-free segmented array for async-safe thread lookup |
| src/native/minipal/CMakeLists.txt | Adds thread.c to build |
| src/coreclr/vm/threadsuspend.cpp | Updates CheckActivationSafePoint to use async-safe thread lookup |
| src/coreclr/vm/threads.h | Declares GetThreadAsyncSafe for Unix |
| src/coreclr/vm/threads.cpp | Implements GetThreadAsyncSafe and integrates with async-safe map |
| src/coreclr/vm/codeman.h | Adds IsManagedCodeNoLock and GetScanFlags parameter |
| src/coreclr/vm/codeman.cpp | Implements IsManagedCodeNoLock for use without reader lock |
| src/coreclr/nativeaot/Runtime/unix/PalUnix.cpp | Updates activation handler to use async-safe thread lookup |
| src/coreclr/nativeaot/Runtime/threadstore.inl | Conditionalizes GetCurrentThread and adds async-safe variant |
| src/coreclr/nativeaot/Runtime/threadstore.h | Declares GetCurrentThreadIfAvailableAsyncSafe |
| src/coreclr/nativeaot/Runtime/threadstore.cpp | Implements async-safe thread lookup and map integration |
You can also share your feedback on Copilot code review for a chance to win a $100 gift card. Take the survey.
| _ASSERTE(pThread != NULL); | ||
|
|
Copilot
AI
Nov 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assertion _ASSERTE(pThread != NULL) on line 5735 conflicts with the subsequent null check on line 5739. If the assertion is correct and pThread can never be NULL, then the null checks on lines 5739 and 5745 are redundant and should be removed. However, if there are cases where GetThreadAsyncSafe() can return NULL (e.g., if the thread hasn't been registered yet), then the assertion is incorrect and should be changed to a null check with early return.
Given that this is called from an async signal handler where the thread might not be fully initialized, the safer approach would be to remove the assertion and handle the NULL case gracefully:
Thread *pThread = GetThreadAsyncSafe();
if (pThread == NULL)
{
return FALSE;
}| _ASSERTE(pThread != NULL); | |
| if (pThread == NULL) | |
| { | |
| return FALSE; | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like valid feedback. We can be seeing a signal sent by some other component (e.g. second NAOT runtime). We should forward the signal to the next component in that case.
Could you please do some ad-hoc stress testing to make sure that it works well? We should not be crashing, and we should not be eating signals that other components may expect.
| if (__atomic_compare_exchange_n( | ||
| &pSegment->entries[index].osThread, | ||
| &expected, osThread, | ||
| false /* weak */, | ||
| __ATOMIC_RELEASE /* success_memorder */, | ||
| __ATOMIC_RELAXED /* failure_memorder */)) | ||
| { | ||
| // Successfully inserted | ||
| pSegment->entries[index].pThread = pThread; |
Copilot
AI
Nov 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The write to pThread at line 78 is not atomic, which creates a race condition with the concurrent read in minipal_find_thread_in_async_safe_map at line 122. After the osThread field is atomically set with release semantics, a concurrent reader could load the osThread value with acquire semantics but then read a stale or partially-written pThread value.
To fix this, the write to pThread should happen before the atomic write to osThread, ensuring proper happens-before relationship:
pSegment->entries[index].pThread = pThread;
__atomic_store_n(&pSegment->entries[index].osThread, osThread, __ATOMIC_RELEASE);Then update line 70-75 to just check if the slot is empty before proceeding.
| if (__atomic_compare_exchange_n( | |
| &pSegment->entries[index].osThread, | |
| &expected, osThread, | |
| false /* weak */, | |
| __ATOMIC_RELEASE /* success_memorder */, | |
| __ATOMIC_RELAXED /* failure_memorder */)) | |
| { | |
| // Successfully inserted | |
| pSegment->entries[index].pThread = pThread; | |
| if (expected == __atomic_load_n(&pSegment->entries[index].osThread, __ATOMIC_ACQUIRE)) | |
| { | |
| // Write pThread first, then publish osThread atomically | |
| pSegment->entries[index].pThread = pThread; | |
| __atomic_store_n(&pSegment->entries[index].osThread, osThread, __ATOMIC_RELEASE); |
| for (size_t i = 0; i < MAX_THREADS_IN_SEGMENT; i++) | ||
| { | ||
| size_t index = (startIndex + i) % MAX_THREADS_IN_SEGMENT; | ||
| // Use acquire to synchronize with release in insert_thread_to_async_safe_map |
Copilot
AI
Nov 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment references insert_thread_to_async_safe_map but the actual function name is minipal_insert_thread_into_async_safe_map. This should be corrected for clarity and consistency.
| // Use acquire to synchronize with release in insert_thread_to_async_safe_map | |
| // Use acquire to synchronize with release in minipal_insert_thread_into_async_safe_map |
| // Use acquire to synchronize with release in insert_thread_to_async_safe_map | ||
| if (__atomic_load_n(&pSegment->entries[index].osThread, __ATOMIC_ACQUIRE) == osThread) | ||
| { | ||
| return pSegment->entries[index].pThread; |
Copilot
AI
Nov 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pThread field is read non-atomically at line 122, which can race with the non-atomic write at line 78 in minipal_insert_thread_into_async_safe_map. Even though the osThread field is read with acquire semantics, this doesn't protect the subsequent pThread read because it's a separate memory location.
The pThread read should use an atomic load with acquire semantics to properly synchronize with the insert operation:
void* result = __atomic_load_n(&pSegment->entries[index].pThread, __ATOMIC_ACQUIRE);
return result;This ensures that if we observe a matching osThread value, we'll also observe the correctly written pThread value.
| return pSegment->entries[index].pThread; | |
| return __atomic_load_n(&pSegment->entries[index].pThread, __ATOMIC_ACQUIRE); |
| pSegment = pSegment->pNext; | ||
| } | ||
| } | ||
|
|
||
| void minipal_remove_thread_from_async_safe_map(size_t osThread, void* pThread) | ||
| { | ||
| size_t startIndex = osThread % MAX_THREADS_IN_SEGMENT; | ||
|
|
||
| struct ThreadSegment* pSegment = s_pAsyncSafeThreadMapHead; | ||
| while (pSegment) | ||
| { | ||
| for (size_t i = 0; i < MAX_THREADS_IN_SEGMENT; i++) | ||
| { | ||
| size_t index = (startIndex + i) % MAX_THREADS_IN_SEGMENT; | ||
| if (pSegment->entries[index].pThread == pThread) | ||
| { | ||
| // Found the entry, remove it | ||
| pSegment->entries[index].pThread = NULL; | ||
| __atomic_exchange_n(&pSegment->entries[index].osThread, (size_t)0, __ATOMIC_RELEASE); | ||
| return; | ||
| } | ||
| } | ||
| pSegment = pSegment->pNext; | ||
| } | ||
| } | ||
|
|
||
| void *minipal_find_thread_in_async_safe_map(size_t osThread) | ||
| { | ||
| size_t startIndex = osThread % MAX_THREADS_IN_SEGMENT; | ||
| struct ThreadSegment* pSegment = s_pAsyncSafeThreadMapHead; | ||
| while (pSegment) | ||
| { | ||
| for (size_t i = 0; i < MAX_THREADS_IN_SEGMENT; i++) | ||
| { | ||
| size_t index = (startIndex + i) % MAX_THREADS_IN_SEGMENT; | ||
| // Use acquire to synchronize with release in insert_thread_to_async_safe_map | ||
| if (__atomic_load_n(&pSegment->entries[index].osThread, __ATOMIC_ACQUIRE) == osThread) | ||
| { | ||
| return pSegment->entries[index].pThread; | ||
| } | ||
| } | ||
| pSegment = pSegment->pNext; |
Copilot
AI
Nov 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pNext field is read non-atomically on lines 84, 106, and 125, which can race with the atomic write to pNext at line 50-56 in the insert function. This could lead to reading a partially written pointer value, causing crashes or incorrect behavior.
All reads of pSegment->pNext should use atomic loads with acquire semantics to synchronize with the release store in the insert operation:
pSegment = __atomic_load_n(&pSegment->pNext, __ATOMIC_ACQUIRE);This ensures proper synchronization across all functions that traverse the segment list.
| struct ThreadSegment* pNext; | ||
| }; | ||
|
|
||
| static struct ThreadSegment *s_pAsyncSafeThreadMapHead = NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure whether the implementation of this data structure belongs to minipal.
If you have put it into minipal just to facilitate sharing between NAOT and regular CoreCLR, we have better place for that: `src\coreclr\runtime .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially, native/minipal was created to share code between src/coreclr and something outside of coreclr (libs, corehost or mono). Then we added cpuid/cpufeatures and bunch of other stuff which is not used by anything outside coreclr. Today, src/native/minipal is just being used to share C code regardless of whether it’s used outside of src/coreclr.
to facilitate sharing between NAOT and regular CoreCLR, we have better place for that: `src\coreclr\runtime .
I thought src\coreclr\minipal is the place for that facility and coreclr\runtime is to unify asm code between them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think about src\minipal as general-purpose stuff that should be abstracting the platform. I know we have a few exceptions that are not really a platform abstraction - but they are still very general-purpose utilities.
This data structure looks very specific to how the thread suspension is implemented in JIT (regular CoreCLR) and NAOT runtimes. I think about it as an implementation detail of the thread suspension. Ideally, the whole implementation of thread suspension would be shared between JIT and NAOT - we are not there yet.
cpuid/cpufeatures
BTW: cpuid/cpufeatures is also used in crossgen2/ilc that is distinct from the runtimes.
src\coreclr\minipal is the place for that facility
coreclr\minipal has the messier and less general-purpose PAL abstractions that are specific to the JIT runtime. It is not used by anything else. We tend to be moving bits of src\coreclr\pal that are too difficult to cleanup completely, https://github.com/dotnet/runtime/blob/main/src/coreclr/minipal/dn-stdio.h#L15 is a great example.
coreclr\runtime is to unify asm code between them.
It is meant to be used for more than just the .asm files. It has CachedInterfaceDispatch.cpp/CachedInterfaceDispatch.h today. It would be nice to move the other .h/cpp files that are shared between JIT and NAOT runtime too. These files are cherry-picked from VM directory in a hacky way currently that does not look great, e.g.: https://github.com/dotnet/runtime/blob/main/src/coreclr/nativeaot/Runtime/gcinfodecoder.cpp.
Have you been able to trace down the change? I think we just got lucky that it has not showed up on the radar earlier. |
The current
CheckActivationSafePointuses thread local storage toget the current Thread instance. But this function is called from
async signal handler (the activation signal handler) and it is not
allowed to access TLS variables there because the access can allocate
and if the interrupted code was running in an allocation code, it
could crash.
There was no problem with this since .NET 1.0, but a change in the
recent glibc version has broken this. We've got reports of crashes
in this code due to the reason mentioned above.
This change introduces an async safe mechanism for accessing the
current Thread instance from async signal handlers. It uses a
segmented array that can grow, but never shrink. Entries for
threads are added when runtime creates a thread / attaches to an
external thread and removed when the thread dies.
The check for safety of the activation injection was further enhanced
to make sure that the ScanReaderLock is not taken. In cases it would
need to be taken, we just reject the location.
Since NativeAOT is subject to the same issue, the code to maintain the
thread id to thread instance map is placed to the minipal and shared
between coreclr and NativeAOT.
Closes #121581