fix(scheduler): release paged-cache snapshots in ~HybridPrefixCache to avoid teardown use-after-free#455
Open
Sunt-ing wants to merge 1 commit into
Conversation
…o avoid teardown use-after-free PagedCacheSnapshots live on TreeNodes owned by KVPrefixCache, which Scheduler declares before (and therefore destroys after) HybridPrefixCache. Each snapshot OwnedPages borrows from a PagedCacheGroupAllocator owned by HybridPrefixCache, so when HybridPrefixCache is destroyed first the later ~KVPrefixCache deallocates page ids into a freed allocator. This is a heap-use-after-free on the normal teardown path whenever paged-cache prefix state is still attached to the radix tree. Add a ~HybridPrefixCache destructor that detaches every still-attached snapshot before its allocators are destroyed. request_paged_cache_tables_ is declared after paged_cache_allocators_ and so is already destroyed first; only the tree-node snapshot path needed handling. Validated with AddressSanitizer over the scheduler C++ test suite: the heap-use-after-free reported during PagedCacheTestFixtureT teardown disappears and the full suite is ASan/UBSan-clean. Adds a deterministic regression test. Signed-off-by: Ting Sun <suntcrick@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
PagedCacheSnapshots attached toTreeNodes inKVPrefixCacheholdOwnedPagesthat borrow fromPagedCacheGroupAllocators owned byHybridPrefixCache, butSchedulerdestroysHybridPrefixCachefirst, so~KVPrefixCachelater deallocates page ids into a freed allocator.~HybridPrefixCachedestructor that detaches every still-attached snapshot (viapaged_cache_snapshot_nodes_) before its allocators are destroyed, returning the pages to still-live allocators.Root Cause
Schedulerdeclareskv_prefix_cache_beforehybrid_prefix_cache_, so member destruction order tears downHybridPrefixCache(and itspaged_cache_allocators_) first, thenKVPrefixCache:Whenever a paged-cache deployment still has prefix snapshots attached to the radix tree at exit, the snapshot
OwnedPagesdeallocate into an allocator that was already freed.request_paged_cache_tables_does not have this problem: it is declared afterpaged_cache_allocators_and so is destroyed first. The mamba slot path is also unaffected, since the mamba allocators areSchedulermembers declared beforekv_prefix_cache_.#357 fixed the analogous dangling-pointer problem on the prune path (its destroy callback drops dying nodes from these adjunct sets, including
paged_cache_snapshot_nodes_). The teardown path addressed here is a separate gap: at shutdown the snapshots are released by~KVPrefixCacherather thanPruneEmptyByNode, afterHybridPrefixCacheand its allocators are already gone.Test Plan
Built and run on a single host under AddressSanitizer over the full scheduler C++ test suite (
tokenspeed_scheduler_tests).heap-use-after-freeinPageAllocator::DeallocateduringPagedCacheTestFixtureTteardown.The new test
PagedCacheFamilySplitTest.DestructorReleasesAttachedSnapshotsattaches a snapshot, destroys the hybrid cache first, and asserts the node no longer carries a snapshot. With the fix reverted it reproduces the use-after-free under ASan and fails in a normal build; the scheduler CI runs Release without ASan, so this deterministic guard is needed.ASan report (before fix)