fix(duplication): prevent potential crash when removing or pausing duplication #2368
Open
lengyuexuexuan wants to merge 5 commits into apache:master from
Conversation
Contributor
Hi @lengyuexuexuan, thank you for your contribution! Please modify the code according to the suggestions provided by Clang-Tidy and IWYU.
Collaborator
Author
done
empiredan
reviewed
Mar 6, 2026
        return;
    }

    zauto_lock l(_lock);
Contributor
Does _primary_confirmed_decree below no longer need protection by _lock?
What problem does this PR solve?
#2211
What is changed and how does it work?
Background
During duplication of a table, if a dup remove/pause command is executed or a balance operation is performed at the same time, a node may occasionally core dump with signal 11 (SIGSEGV). The core dump locations vary, but they all have one thing in common: they occur during memory allocation or deallocation.
Analysis
Based on extensive testing, the following conclusions can be drawn:
a. The issue only reproduces when there is write traffic. The difference write traffic makes is that it adds the ship and load_private_log tasks.
b. The core dump occurs during the execution of cancel_all().
c. The issue occurs with low probability (approximately 1 in 100).
Through analysis using ASAN (AddressSanitizer):
dup_remove_asan.txt
Based on ASAN analysis, the following conclusions can be drawn:
a. The memory corruption occurs during the ship process. The mutations obtained from replaying the plog are passed to ship, leading to the issue.
b. _load_mutations is captured by a lambda expression and then passed to a std::function. Since std::move is used, the lifetime of _load_mutations is tied to that of the std::function.
c. The cancel_all() function is executed in the default thread pool. At this point, the following function is called. When the std::function is set to nullptr, it releases the memory it manages.
incubator-pegasus/src/task/task.h
Line 341 in e64faa7
d. However, each task executes exec_internal() in its own thread pool, and eventually calls release_ref(), which results in delete this.
incubator-pegasus/src/task/task.cpp
Line 224 in e64faa7
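The lifetime coupling described in points b and c can be illustrated with a minimal, standalone sketch. It is not the Pegasus code: Mutation, make_ship_callback and data_released_on_reset are hypothetical names, and a shared_ptr/weak_ptr pair is used only so that the release can be observed from outside the callback.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a replayed mutation; not the Pegasus type.
struct Mutation
{
    int id;
};

// Builds a callback that owns `mutations` through a move-capture and hands
// out a weak_ptr so the caller can observe when the data is freed.
std::function<void()> make_ship_callback(std::shared_ptr<std::vector<Mutation>> mutations,
                                         std::weak_ptr<std::vector<Mutation>> &observer)
{
    observer = mutations;
    // After the move, the lambda (and thus the std::function) is the only owner.
    return [owned = std::move(mutations)]() { (void)owned->size(); };
}

// Returns true if the data was alive while the callback existed and was
// released as soon as the callback was reset to nullptr.
bool data_released_on_reset()
{
    std::weak_ptr<std::vector<Mutation>> observer;
    auto data = std::make_shared<std::vector<Mutation>>(3, Mutation{0});
    std::function<void()> cb = make_ship_callback(std::move(data), observer);

    const bool alive_before = !observer.expired();
    cb = nullptr; // destroys the stored lambda and everything it captured
    const bool alive_after = !observer.expired();
    return alive_before && !alive_after;
}
```

Calling assert(data_released_on_reset()) from a main function is expected to pass. If two threads perform this kind of reset or destruction on the same std::function without synchronization, the captured data can be freed twice, which is the situation the analysis above describes.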
Conclusion
1. Both task.cancel() and task.exec_internal() destruct the std::function member inside the task object. These two operations run in different threads, and nothing prevents them from racing. As a result, both threads can attempt to destruct the same std::function, which leads to a double deletion of the memory associated with _load_mutations and ultimately causes memory corruption.
2. _duplications is accessed without proper synchronization in certain functions under multi-threaded scenarios, potentially causing race conditions.
Solution
1. Guard the _cb callback to ensure that only one thread executes its destructor.
2. Guard the functions that access _duplications without synchronization to prevent concurrent access conflicts.
Tests
The changes have been production-validated at Xiaomi, running stably on more than 30 clusters for over six months, confirming that they resolve the concurrency issues described above.
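As an illustration of the first item under Solution, here is one possible way to make sure a callback is destroyed by exactly one thread. This is a sketch under assumptions and not the change made in this PR: SketchTask, take_callback, Payload and race_cancel_against_exec are hypothetical names, and std::mutex stands in for whatever lock type the real code uses.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical simplified task; not the Pegasus task class.
class SketchTask
{
public:
    explicit SketchTask(std::function<void()> cb) : _cb(std::move(cb)) {}

    // Cancelling thread: take the callback and let it be destroyed here.
    void cancel() { std::function<void()> cb = take_callback(); }

    // Worker thread: take the callback, run it if we got it, destroy it here.
    void exec_internal()
    {
        std::function<void()> cb = take_callback();
        if (cb) {
            cb();
        }
    }

private:
    // Moves the callback out under the lock, so only one caller ever
    // receives a non-empty std::function and only that caller destroys it.
    std::function<void()> take_callback()
    {
        std::function<void()> taken;
        std::lock_guard<std::mutex> guard(_cb_lock);
        taken.swap(_cb);
        return taken;
    }

    std::mutex _cb_lock;
    std::function<void()> _cb;
};

// Counts how many times the captured state is destroyed.
struct Payload
{
    explicit Payload(std::atomic<int> &counter) : destroyed(counter) {}
    ~Payload() { destroyed.fetch_add(1); }
    std::atomic<int> &destroyed;
};

// Races cancel() against exec_internal() and returns the destruction count.
int race_cancel_against_exec()
{
    std::atomic<int> destroyed{0};
    {
        auto payload = std::make_shared<Payload>(destroyed);
        SketchTask task([owned = std::move(payload)]() { (void)owned; });
        std::thread canceller([&task] { task.cancel(); });
        std::thread worker([&task] { task.exec_internal(); });
        canceller.join();
        worker.join();
    }
    return destroyed.load();
}
```

With this arrangement, assert(race_cancel_against_exec() == 1) is expected to hold whichever thread wins the race, because the std::function member is only ever touched while the lock is held. The same idea applies to the second item: every function that reads or modifies a shared container such as _duplications would take the same lock before touching it.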