fix(scheduler): release paged-cache snapshots in ~HybridPrefixCache to avoid teardown use-after-free #455

Open
Sunt-ing wants to merge 1 commit into lightseekorg:main from Sunt-ing:fix/hybrid-prefix-cache-teardown-uaf

@Sunt-ing

Summary

  • Fixes a heap-use-after-free at scheduler teardown. PagedCacheSnapshots attached to TreeNodes in KVPrefixCache hold OwnedPages that borrow from PagedCacheGroupAllocators owned by HybridPrefixCache. Scheduler destroys HybridPrefixCache first, so ~KVPrefixCache later deallocates page ids into an already-freed allocator.
  • Adds a ~HybridPrefixCache destructor that detaches every still-attached snapshot (via paged_cache_snapshot_nodes_) before its allocators are destroyed, returning the pages to still-live allocators.
  • Adds a deterministic regression test.

Root Cause

Scheduler declares kv_prefix_cache_ before hybrid_prefix_cache_, so member destruction order tears down HybridPrefixCache (and its paged_cache_allocators_) first, then KVPrefixCache:

freed by  ~HybridPrefixCache -> ~map<string, unique_ptr<PagedCacheGroupAllocator>>
used  by  ~KVPrefixCache -> ~RadixTree -> ~TreeNode -> ~PagedCacheSnapshot
                         -> ~PagedCacheGroupSnapshot -> ~OwnedPages
                         -> PageAllocator::Deallocate()   // heap-use-after-free

Whenever a paged-cache deployment still has prefix snapshots attached to the radix tree at exit, each snapshot's OwnedPages deallocates into an allocator that has already been freed. request_paged_cache_tables_ does not have this problem: it is declared after paged_cache_allocators_ and so is destroyed first. The mamba slot path is also unaffected, since the mamba allocators are Scheduler members declared before kv_prefix_cache_ and therefore outlive it.
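For reviewers who want to see the ordering rule in isolation, here is a minimal sketch. Tracer, SchedulerSketch and DestructionOrder are hypothetical names used only for illustration and are not taken from the scheduler code; the only thing mirrored from the real Scheduler is the relative declaration order of the two members.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Appends its label to a shared log when destroyed.
class Tracer {
 public:
  Tracer(std::string label, std::vector<std::string>* log)
      : label_(std::move(label)), log_(log) {}
  ~Tracer() { log_->push_back(label_); }

 private:
  std::string label_;
  std::vector<std::string>* log_;
};

// Hypothetical stand-in for Scheduler; only the declaration order matters.
struct SchedulerSketch {
  explicit SchedulerSketch(std::vector<std::string>* log)
      : kv_prefix_cache_("kv_prefix_cache_", log),
        hybrid_prefix_cache_("hybrid_prefix_cache_", log) {}

  Tracer kv_prefix_cache_;      // declared first, destroyed last
  Tracer hybrid_prefix_cache_;  // declared last, destroyed first
};

// Builds and destroys a SchedulerSketch, returning the destruction order.
std::vector<std::string> DestructionOrder() {
  std::vector<std::string> log;
  {
    SchedulerSketch scheduler(&log);
  }  // members are destroyed here, in reverse declaration order
  assert(log.size() == 2);
  return log;
}
```

Calling DestructionOrder() yields the hybrid_prefix_cache_ label first and the kv_prefix_cache_ label second, which matches the "freed by" / "used by" ordering in the trace above.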

#357 fixed the analogous dangling-pointer problem on the prune path (its destroy callback drops dying nodes from these adjunct sets, including paged_cache_snapshot_nodes_). The teardown path addressed here is a separate gap: at shutdown the snapshots are released by ~KVPrefixCache rather than PruneEmptyByNode, after HybridPrefixCache and its allocators are already gone.

Test Plan

Built and ran the full scheduler C++ test suite (tokenspeed_scheduler_tests) on a single host under AddressSanitizer.

  • Before (fix reverted): ASan reports heap-use-after-free in PageAllocator::Deallocate during PagedCacheTestFixtureT teardown.
  • After: full suite is ASan-clean, 185 tests pass (184 existing + 1 new).

The new test PagedCacheFamilySplitTest.DestructorReleasesAttachedSnapshots attaches a snapshot, destroys the hybrid cache first, and asserts the node no longer carries a snapshot. With the fix reverted it reproduces the use-after-free under ASan and fails in a normal build; the scheduler CI runs Release without ASan, so this deterministic guard is needed.
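The shape of the fix and of the regression check can be sketched as follows. This is a simplified model under stated assumptions, not the code in this PR: Allocator, OwnedPages, TreeNode, HybridCache, TeardownResult and RunTeardown are hypothetical stand-ins, and the real classes have different members and signatures. The sketch only shows the idea of detaching snapshots in the destructor body, which runs before the allocator member is destroyed.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <optional>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical allocator: just collects the page ids handed back to it.
struct Allocator {
  std::vector<int> free_pages;
  void Deallocate(const std::vector<int>& pages) {
    free_pages.insert(free_pages.end(), pages.begin(), pages.end());
  }
};

// Owns page ids and returns them to a borrowed allocator on destruction.
class OwnedPages {
 public:
  OwnedPages(Allocator* allocator, std::vector<int> pages)
      : allocator_(allocator), pages_(std::move(pages)) {}
  OwnedPages(const OwnedPages&) = delete;
  OwnedPages& operator=(const OwnedPages&) = delete;
  ~OwnedPages() { allocator_->Deallocate(pages_); }

 private:
  Allocator* allocator_;  // borrowed; must outlive this object
  std::vector<int> pages_;
};

// Hypothetical tree node that may carry a snapshot.
struct TreeNode {
  std::optional<OwnedPages> snapshot;
};

class HybridCache {
 public:
  explicit HybridCache(std::size_t* pages_returned)
      : pages_returned_(pages_returned),
        allocator_(std::make_unique<Allocator>()) {}

  void Attach(TreeNode* node, std::vector<int> pages) {
    node->snapshot.emplace(allocator_.get(), std::move(pages));
    snapshot_nodes_.insert(node);
  }

  // The destructor body runs before any member is destroyed, so the
  // allocator is still alive while the snapshots hand their pages back.
  ~HybridCache() {
    for (TreeNode* node : snapshot_nodes_) node->snapshot.reset();
    *pages_returned_ = allocator_->free_pages.size();
  }

 private:
  std::size_t* pages_returned_;
  std::unique_ptr<Allocator> allocator_;
  std::unordered_set<TreeNode*> snapshot_nodes_;
};

struct TeardownResult {
  std::size_t pages_returned;
  bool node_still_has_snapshot;
};

// Destroys the cache while the node is still alive, as at scheduler exit.
TeardownResult RunTeardown() {
  TreeNode node;  // outlives the cache, like nodes owned by the radix tree
  std::size_t pages_returned = 0;
  {
    HybridCache cache(&pages_returned);
    cache.Attach(&node, {1, 2, 3});
  }  // ~HybridCache detaches the snapshot here
  return {pages_returned, node.snapshot.has_value()};
}
```

In this model, RunTeardown() reports three returned pages and a node without a snapshot. If the loop in ~HybridCache were removed, the node would still hold a snapshot when it is destroyed and its OwnedPages would call into a freed allocator, which is the failure mode described above.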

ASan report (before fix)
==ERROR: AddressSanitizer: heap-use-after-free ... READ of size 8
  #2 PageAllocator::Deallocate(...) page_allocator.cpp:66
  #3 OwnedPages::~OwnedPages() owned_pages.cpp:32
  #4 PagedCacheGroupSnapshot::~PagedCacheGroupSnapshot() paged_cache_snapshot.h:30
  ...
  #14 PagedCacheSnapshot::~PagedCacheSnapshot()
  #17 TreeNode::~TreeNode()
  #31 RadixTree::~RadixTree()
  #32 KVPrefixCache::~KVPrefixCache()
freed by thread T0 here:
  #1 ~unique_ptr<PagedCacheGroupAllocator>()
  #12 HybridPrefixCache::~HybridPrefixCache() hybrid_prefix_cache.h:48

…o avoid teardown use-after-free

PagedCacheSnapshots live on TreeNodes owned by KVPrefixCache, which Scheduler
declares before (and therefore destroys after) HybridPrefixCache. Each
snapshot's OwnedPages borrows from a PagedCacheGroupAllocator owned by
HybridPrefixCache, so when HybridPrefixCache is destroyed first the later
~KVPrefixCache deallocates page ids into a freed allocator. This is a
heap-use-after-free on the normal teardown path whenever paged-cache prefix
state is still attached to the radix tree.

Add a ~HybridPrefixCache destructor that detaches every still-attached snapshot
before its allocators are destroyed. request_paged_cache_tables_ is declared
after paged_cache_allocators_ and so is already destroyed first; only the
tree-node snapshot path needed handling.

Validated with AddressSanitizer over the scheduler C++ test suite: the
heap-use-after-free reported during PagedCacheTestFixtureT teardown disappears
and the full suite is ASan/UBSan-clean. Adds a deterministic regression test.

Signed-off-by: Ting Sun <suntcrick@gmail.com>
@Sunt-ing Sunt-ing requested a review from a team as a code owner June 15, 2026 15:48