
fix(duplication): prevent potential crash when removing or pausing duplication #2368

Open
lengyuexuexuan wants to merge 5 commits into apache:master from lengyuexuexuan:fix_duplication

Conversation

@lengyuexuexuan
Collaborator

What problem does this PR solve?

#2211

What is changed and how does it work?

  1. Background
    While a table is being duplicated, if a dup remove/pause command is executed or a balance operation runs at the same time, a node may occasionally crash with signal 11 (SIGSEGV). The core dump locations vary, but they all share one trait: the crash occurs during memory allocation or deallocation.

  2. Analysis
    Based on extensive testing, the following conclusions can be drawn:
    a. The issue reproduces only when there is write traffic. Write traffic matters because it schedules the additional ship and load_private_log tasks.
    b. The core dump occurs during the execution of cancel_all().
    c. The issue occurs with low probability (roughly 1 in 100 runs).

    Through analysis using ASAN (AddressSanitizer):
    dup_remove_asan.txt
    Based on ASAN analysis, the following conclusions can be drawn:
    a. The memory corruption occurs during the ship process: the mutations obtained by replaying the plog are passed to ship, which triggers the issue.
    b. _load_mutations is captured by a lambda expression and then moved into a std::function. Because std::move is used, the lifetime of _load_mutations is tied to that of the std::function.
    c. cancel_all() runs in the default thread pool and calls the following function. When the std::function is set to nullptr, it releases the memory it manages.

    void clear_non_trivial_on_task_end() override { _cb = nullptr; }

    d. However, each task executes exec_internal() in its own thread pool, and eventually calls release_ref(), which results in delete this.
    this->release_ref(); // added in enqueue(pool)
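
    The double-destruction hazard above can be illustrated with a minimal, hypothetical sketch (the names `load_mutations` and `cb` are illustrative, not Pegasus code). Because the captured data lives inside the std::function, whichever code clears or destroys that std::function frees the data; two threads doing so concurrently would free it twice:

    ```cpp
    #include <cassert>
    #include <functional>
    #include <iostream>
    #include <memory>

    int main() {
        int destroyed = 0;
        // Stand-in for _load_mutations: a heap object whose deleter counts
        // how many times it is destroyed.
        std::shared_ptr<int> load_mutations(new int(42),
            [&destroyed](int *p) { ++destroyed; delete p; });

        // As in the analysis: the data is captured by a lambda whose lifetime
        // is then tied to the std::function holding it.
        std::function<void()> cb = [m = load_mutations]() { (void)*m; };
        load_mutations.reset();   // cb now holds the only reference
        assert(destroyed == 0);

        cb = nullptr;             // what clear_non_trivial_on_task_end() does
        assert(destroyed == 1);   // clearing the std::function frees the payload

        // If `delete this` in exec_internal() ran ~std::function on another
        // thread at the same time as this assignment, both threads would
        // destruct the same target: a data race and a double free, matching
        // the crashes described above.
        std::cout << "payload destroyed exactly once\n";
        return 0;
    }
    ```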

  3. Conclusion
    1. Both task.cancel() and task.exec_internal() destruct the std::function member inside the task object. The two paths run on different threads, and no mechanism prevents a race between them, so both threads can attempt to destruct the same std::function. That leads to a double free of the memory owned by _load_mutations, which ultimately corrupts the heap.
    2. _duplications is accessed without proper synchronization in certain functions under multi-threaded scenarios, potentially causing race conditions.

  4. Solution

    1. Lock the _cb callback to ensure that only one thread executes its destructor.
    2. Add locking to functions that access _duplications without synchronization to prevent concurrent access conflicts.
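
    A minimal sketch of the first fix, under the assumption (names are illustrative, not the PR's actual code) that both the cancel path and the teardown path release _cb through the same lock, so exactly one thread destroys the std::function's target:

    ```cpp
    #include <cassert>
    #include <functional>
    #include <iostream>
    #include <memory>
    #include <mutex>
    #include <thread>
    #include <utility>

    class task {
    public:
        explicit task(std::function<void()> cb) : _cb(std::move(cb)) {}

        // The clear_non_trivial_on_task_end() path, called from cancel_all().
        void clear_on_cancel() { release_cb(); }

        // The exec_internal() path, just before release_ref()/`delete this`.
        void clear_on_exec_end() { release_cb(); }

    private:
        void release_cb() {
            std::function<void()> victim;
            {
                std::lock_guard<std::mutex> guard(_cb_lock); // serialize both paths
                victim = std::move(_cb);
                _cb = nullptr;   // the loser of the race sees an empty _cb
            }
            // `victim` (and any captured mutations) is destroyed here, by
            // exactly one thread, outside the lock.
        }

        std::mutex _cb_lock;
        std::function<void()> _cb;
    };

    int main() {
        int destroyed = 0;
        std::shared_ptr<int> payload(new int(0),
            [&destroyed](int *p) { ++destroyed; delete p; });
        task t([m = payload]() { (void)*m; });
        payload.reset();

        // Race the two paths; with the lock, the payload still dies exactly once.
        std::thread a([&t] { t.clear_on_cancel(); });
        std::thread b([&t] { t.clear_on_exec_end(); });
        a.join();
        b.join();
        assert(destroyed == 1);
        std::cout << "callback released exactly once\n";
        return 0;
    }
    ```

    Moving the std::function out under the lock and letting it die outside the critical section keeps the lock hold time short and avoids running an arbitrary destructor while holding the mutex.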
Tests
  • Manual test
    The changes have been production-validated at Xiaomi, running stably on more than 30 clusters for over six months, confirming that they resolve the concurrency issues described above.

@github-actions github-actions bot added the cpp label Feb 12, 2026
@empiredan
Contributor

Hi @lengyuexuexuan Thank you for your contribution!

Please modify the code according to the suggestions provided by Clang Tidy and IWYU.

@lengyuexuexuan
Collaborator Author

Hi @lengyuexuexuan Thank you for your contribution!

Please modify the code according to the suggestions provided by Clang Tidy and IWYU.

done

return;
}

zauto_lock l(_lock);
Contributor
Does _primary_confirmed_decree below no longer need protection by _lock?

