feat (L3 KVStore): prefetch and backup support#293

Open: ehuohz wants to merge 4 commits into lightseekorg:main from ehuohz:main

@ehuohz ehuohz commented May 28, 2026

Summary

Prefetch (L3 → Host)

On request submission, query Mooncake for existing KV pages. If hits exceed prefetch_threshold, take an async prefetch path (Submitted → Prefetching → PrefetchDone → Prefilling) instead of direct prefill. Completed pages are inserted into the radix tree's host layer with proper OwnedPages ownership transfer.
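The prefetch path above can be read as a small state machine. The following is a minimal sketch of that flow, not the PR's C++ FSM: the `State` enum, `next_state`, and `path` are hypothetical names, and the strict `>` comparison is an assumption taken from the wording "hits exceed prefetch_threshold".

```python
from enum import Enum, auto


class State(Enum):
    SUBMITTED = auto()
    PREFETCHING = auto()
    PREFETCH_DONE = auto()
    PREFILLING = auto()


def next_state(state, l3_hit_pages, prefetch_threshold):
    """Advance a request by one step of the flow described in the summary."""
    if state is State.SUBMITTED:
        # Take the async prefetch path only when L3 hits exceed the threshold.
        if l3_hit_pages > prefetch_threshold:
            return State.PREFETCHING
        return State.PREFILLING  # direct prefill, no prefetch
    if state is State.PREFETCHING:
        return State.PREFETCH_DONE  # fetched pages have landed in host memory
    if state is State.PREFETCH_DONE:
        return State.PREFILLING  # prefetch-completed requests enter prefill
    return state


def path(l3_hit_pages, prefetch_threshold):
    """Return the full sequence of states a request passes through."""
    states = [State.SUBMITTED]
    while states[-1] is not State.PREFILLING:
        states.append(next_state(states[-1], l3_hit_pages, prefetch_threshold))
    return states


print([s.name for s in path(l3_hit_pages=6, prefetch_threshold=2)])
print([s.name for s in path(l3_hit_pages=2, prefetch_threshold=2)])
```

With hits above the threshold the request visits all four states; with hits at or below it, the request goes straight from Submitted to Prefilling.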

Backup (Host → L3)

On WriteBackDone, emit a fire-and-forget BackUpOperation to persist host pages to Mooncake. Backup metadata is captured at WriteBackOperation creation time while the Draining state's host node-ref is still alive.

Key changes

  • FSM: Add SchedulePrefillFirstChunkEvent::operator()(PrefetchDone&&) via templated applyFirstChunk() so prefetch-completed requests can enter prefill.
  • forward.cpp: Attempt schedulePrefetch for Submitted requests before falling through to prefill. Treat PrefetchDone with the same scheduling priority as Submitted.
  • outside_event_handler.cpp: Transfer host page ownership via OwnedPages into Insert(); uncompleted pages are freed automatically via RAII. Add WriteBackDone hook to emit BackUpOperation.
  • scheduler.cpp: Capture L3 backup metadata (rolling hashes, host page IDs) in CacheOpSpec during newWriteBackOperation. Drain pending_prefetch_ops_ and pending_backup_ops_ in NextExecutionPlan().

Test Plan

  • Served kimi-k2.5 with Mooncake KVStore enabled in dev container
  • Sent a long-generation request (long prompt, max_tokens=262144), then sent multiple different requests to fill device/host cache and force eviction of the first request's KV pages to L3
  • Re-sent the same first request; verified L3 prefetch path activated (KV pages fetched back from Mooncake → host → device) and output matched the original response
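  • Confirmed L3 fetch via Mooncake batch_get_into logs (439-token prompt; 439/64 rounds down to 6 pages; 6 pages × 4 TP ranks = 24 items, matching Get Item=24/24 in the Mooncake log and key count 6 in each of the four ts log lines):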
Mooncake log:
 | Requests (Success/Total): PutStart=4/4, PutEnd=4/4, PutRevoke=0/0, Get=4/4, Exist=4/4, Del=0/0, DelAll=0/0, Ping=2228/2228, CopyStart=0/0, CopyEnd=0/0, CopyRevoke=0/0, MoveStart=0/0, MoveEnd=0/0, MoveRevoke=0/0, EvictDiskReplica=0/0 | Batch Requests (Req=Success/PartialSuccess/Total, Item=Success/Total): PutStart:(Req=14/0/51, Item=1369/5238), PutEnd:(Req=14/0/14, Item=1369/1369), PutRevoke:(Req=0/0/0, Item=0/0), Get:(Req=4/0/4, Item=24/24), ExistKey:(Req=64/0/64, Item=5524/5524), QueryIp:(Req=0/0/0, Item=0/0), Clear:(Req=0/0/0, Item=0/0), CreateMoveTask:(Req=0/0), CreateCopyTask:(Req=0/0), QueryTask=(Req=0/0), FetchTasks=(Req=2228/2228), MarkTaskToComplete= (Req=0/0),  | Eviction: Success/Attempts=0/0, keys=0, size=0 B | Discard: Released/Total=0/0, StagingSize=0 B | Snapshots: Success=0, Fail=0}, ha={HA Metrics Summary: last_seq=0, applied_seq=0, lag=0, pending=0, mutation_queue=0, batch_commits=0, sync_commits=0, skipped=0, checksum_fail=0, etcd_fail=0, watch_disconn=0, state=0}

ts log:
[ts] I0528 04:36:33.135007 2732203 real_client.cpp:3556] Time taken for batch_get_into: 9814us, read store: 0us, with memory key count: 6, offload key count: 0
[ts] I0528 04:36:33.135859 2732205 real_client.cpp:3556] Time taken for batch_get_into: 10103us, read store: 0us, with memory key count: 6, offload key count: 0
[ts] I0528 04:36:33.136945 2732206 real_client.cpp:3556] Time taken for batch_get_into: 11312us, read store: 0us, with memory key count: 6, offload key count: 0
[ts] I0528 04:36:33.147481 2732204 real_client.cpp:3556] Time taken for batch_get_into: 22004us, read store: 0us, with memory key count: 6, offload key count: 0
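The page arithmetic in the last test-plan bullet can be checked against the ts log with a few lines. This is a sketch under two assumptions: a page size of 64 tokens (inferred from "439/64") and that only full pages are counted (the bullet rounds 439/64 down to 6). The log text is copied from the four batch_get_into lines above with the timestamp prefix omitted.

```python
import re

PAGE_SIZE_TOKENS = 64  # assumption: inferred from "439/64" in the test plan
TP_RANKS = 4           # "TP4" in the test plan
PROMPT_TOKENS = 439

# Only whole pages are counted, so the division rounds down.
full_pages = PROMPT_TOKENS // PAGE_SIZE_TOKENS
expected_items = full_pages * TP_RANKS

ts_log = """\
Time taken for batch_get_into: 9814us, read store: 0us, with memory key count: 6, offload key count: 0
Time taken for batch_get_into: 10103us, read store: 0us, with memory key count: 6, offload key count: 0
Time taken for batch_get_into: 11312us, read store: 0us, with memory key count: 6, offload key count: 0
Time taken for batch_get_into: 22004us, read store: 0us, with memory key count: 6, offload key count: 0
"""

# One log line per TP rank; each reports how many keys it fetched from memory.
key_counts = [int(n) for n in re.findall(r"with memory key count: (\d+)", ts_log)]

print(full_pages, expected_items, key_counts, sum(key_counts))
```

The per-rank key counts sum to the same total as the `Get: Item=24/24` counter in the Mooncake log.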

@ehuohz ehuohz requested a review from a team as a code owner May 28, 2026 07:03
@XucSh XucSh self-assigned this May 28, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 487eba9340

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread tokenspeed-scheduler/csrc/scheduler/operations/forward.cpp
Comment thread tokenspeed-scheduler/csrc/scheduler/scheduler.cpp Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c707d29204

Comment thread tokenspeed-scheduler/csrc/scheduler/outside_event_handler.cpp
@ehuohz ehuohz force-pushed the main branch 2 times, most recently from b6586ec to e732341 on May 28, 2026 08:04

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e732341112

Comment thread python/tokenspeed/runtime/engine/event_loop.py

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 074c709457

Comment thread tokenspeed-scheduler/csrc/scheduler/operations/forward.cpp

Copilot AI left a comment

Pull request overview

This PR wires Mooncake L3 KV-store integration into the scheduler with two flows: (1) an asynchronous prefetch from L3→Host on request submission with proper page-ownership transfer into the radix tree, and (2) a fire-and-forget backup from Host→L3 triggered on WriteBackDone, with backup metadata captured at WriteBackOperation creation time while the Draining state's host node-ref is still alive. FSM support is extended so PrefetchDone can transition into prefill via a templated applyFirstChunk.

Changes:

  • Add async L3 prefetch path in newForwardOperation (Submitted → Prefetching → PrefetchDone → Prefilling), with host node locking before eviction and ownership transfer of host pages into Insert<Host>().
  • Add BackUpDone event + BackUpOperation emission on WriteBackDone, with CacheOpSpec carrying captured host page IDs and rolling hashes.
  • Drain pending_prefetch_ops_ / pending_backup_ops_ in NextExecutionPlan(); extend FSM to allow SchedulePrefillFirstChunkEvent from both Submitted and PrefetchDone.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated no comments.

Summary per file:

| File | Description |
| --- | --- |
| tokenspeed-scheduler/csrc/scheduler/scheduler.h | Declares BackUpDone handler and pending L3 op queues. |
| tokenspeed-scheduler/csrc/scheduler/scheduler.cpp | Captures backup metadata in WriteBackOp; drains pending L3 ops into ExecutionPlan. |
| tokenspeed-scheduler/csrc/scheduler/request.h / request.cpp | Adds TakeHostPages() and GetHostNode<Draining>() accessors. |
| tokenspeed-scheduler/csrc/scheduler/outside_events/cache.h | Adds BackUpDone event and includes it in CacheEvent variant. |
| tokenspeed-scheduler/csrc/scheduler/outside_event_handler.cpp | Refactors PrefetchDone to transfer OwnedPages; emits backup op on WriteBackDone; adds (reserved) BackUpDone handler. |
| tokenspeed-scheduler/csrc/scheduler/operations/forward.cpp | Adds prefetch attempt for Submitted; treats PrefetchDone same as Submitted for prefill priority. |
| tokenspeed-scheduler/csrc/scheduler/operations/cache.cpp | Locks matched host node before eviction in schedulePrefetch. |
| tokenspeed-scheduler/csrc/resource/types.h | Adds backup fields to CacheOpSpec. |
| tokenspeed-scheduler/csrc/fsm/forward_states.h | Exposes HostNode() accessor on Draining. |
| tokenspeed-scheduler/csrc/fsm/forward_events.h / .cpp | Templated applyFirstChunk enables transition from PrefetchDone. |
| tokenspeed-scheduler/csrc/fsm/cache_events.h / .cpp | SchedulePrefetchEvent now owns a HostNodeRef (RAII lock). |
| tokenspeed-scheduler/bindings/python_module.cpp | Binds BackUpDoneEvent. |
| python/tokenspeed/runtime/engine/scheduler_utils.py | Registers BackUpDoneEvent and round-trips completed_pages in payloads. |

@ehuohz ehuohz changed the title from "(feat) L3 KVStore: prefetch and backup support" to "feat (L3 KVStore): prefetch and backup support" Jun 2, 2026

XucSh commented Jun 8, 2026

Contributor

Partial L2+L3 hits insert prefetched suffix pages under the wrong prefix. calc_l3_query_hashes() uses apply_match=True, so the C++ scheduler skips pages already matched in host cache and only returns hashes for token_pages[host_matched:]. However, the PrefetchDone handler still builds insert_token_pages from token_pages.begin(), without carrying the host-match offset. If host has the first h pages and L3 has the next n, the pages fetched from Mooncake are suffix pages but get inserted as the first n prompt pages, so later loadback can use corrupt KV. Please carry the host matched page offset through Prefetching/PrefetchDone and insert starting at that offset.
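The offset problem described in this comment can be illustrated with plain list slicing. This is a minimal sketch, not the handler's code: `insert_range_without_offset` and `insert_range_with_offset` are hypothetical helpers, and pages are represented by labels rather than token IDs.

```python
def insert_range_without_offset(token_pages, fetched):
    """Models the reported behaviour: slice from the start of the prompt."""
    return token_pages[:fetched]


def insert_range_with_offset(token_pages, host_matched, fetched):
    """Models the requested fix: start at the host-matched page offset."""
    return token_pages[host_matched:host_matched + fetched]


token_pages = ["p0", "p1", "p2", "p3", "p4"]
h = 2  # pages already matched in the host cache
n = 3  # suffix pages fetched from L3

# The fetched KV belongs to p2..p4, but slicing from the start labels it p0..p2.
print(insert_range_without_offset(token_pages, n))
print(insert_range_with_offset(token_pages, h, n))
```

With h = 2 and n = 3, the first call returns the pages for p0..p2 while the fetched data actually covers p2..p4, which is the mismatch the comment warns about.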

@@ -40,6 +40,7 @@
_CACHE_EVENT_TYPES = {

Backup source host pages are not pinned until backup completes. WriteBackDone queues a BackUpOp with only raw host page ids, then immediately applies WriteBackDoneEvent, which releases the WritingBack node refs. The Python backup reads those host pages asynchronously later, so host eviction can reuse them before _run_backup() calls batch_set_v1(), storing unrelated KV under the captured hashes.
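One way to close this window is to hold a pin on the source pages from the moment the backup is queued until it completes. The sketch below is an illustration under assumptions, not the PR's allocator: `HostPagePool`, `pin`, `unpin`, and `can_evict` are hypothetical names, and releasing the pin in a backup-completion handler (for example the reserved BackUpDone handler) is only one possible placement.

```python
from collections import Counter


class HostPagePool:
    """Toy host page pool that refuses to evict pinned pages (hypothetical)."""

    def __init__(self):
        self._pins = Counter()  # page id -> number of outstanding pins

    def pin(self, page_ids):
        for pid in page_ids:
            self._pins[pid] += 1

    def unpin(self, page_ids):
        for pid in page_ids:
            self._pins[pid] -= 1
            if self._pins[pid] <= 0:
                del self._pins[pid]

    def can_evict(self, page_id):
        # Counter returns 0 for pages that were never pinned.
        return self._pins[page_id] == 0


pool = HostPagePool()
backup_pages = [7, 8]

pool.pin(backup_pages)  # taken when the backup op is queued
evictable_during_backup = [pool.can_evict(p) for p in backup_pages]

pool.unpin(backup_pages)  # released once the backup has been written to L3
evictable_after_backup = [pool.can_evict(p) for p in backup_pages]

print(evictable_during_backup, evictable_after_backup)
```

While the pin is held, eviction cannot reuse the pages, so the asynchronous backup reads the KV that matches the captured hashes.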

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f8e5fe234e

Comment on lines 62 to 63
std::vector<std::string> hashes(storage.rolling_hashes.begin(),
storage.rolling_hashes.begin() + num_pages_to_fetch);

P1: Align L3 hashes with the match used for prefetch

When a Submitted request waits while the L2 host match changes after admission (for example another request writes back or evicts part of the prefix), the stored rolling_hashes still start at the offset used by calc_l3_query_hashes(..., apply_match=True) at admission, not necessarily at the current match.host.DepthInPage() saved above. This slice always starts at begin(), and PrefetchDone inserts those pages at the current prefetch_start_page, so a changed host prefix downloads/stores the wrong page contents under the wrong token pages. Recompute the hashes against the current match or carry the admission-time start offset and adjust the slice before scheduling the prefetch.
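The second option in this suggestion, carrying the admission-time start offset and adjusting the slice, might look like the sketch below. It is an illustration under assumptions: `hashes_for_prefetch` and its parameters are hypothetical, and returning `None` stands in for "recompute the hashes against the current match".

```python
def hashes_for_prefetch(stored_hashes, admission_start_page, current_start_page, num_pages):
    """Pick hashes for pages [current_start_page, current_start_page + num_pages).

    stored_hashes[0] is assumed to belong to page admission_start_page.
    Returns None when the stored hashes do not cover the requested range,
    in which case the caller would have to recompute them.
    """
    skip = current_start_page - admission_start_page
    if skip < 0 or skip + num_pages > len(stored_hashes):
        return None
    return stored_hashes[skip:skip + num_pages]


# Hashes captured at admission, when the host match ended at page 2.
stored = ["h2", "h3", "h4", "h5"]

# Host prefix grew by one page while the request waited: start at page 3.
print(hashes_for_prefetch(stored, admission_start_page=2, current_start_page=3, num_pages=2))

# Host prefix shrank to page 1: page 1 has no stored hash, so recompute.
print(hashes_for_prefetch(stored, admission_start_page=2, current_start_page=1, num_pages=2))
```

Slicing from the beginning of `stored` in the first case would pick h2 and h3 for pages 3 and 4, which is the misalignment the comment describes.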

ehuohz and others added 4 commits June 12, 2026 16:01
Signed-off-by: He Zhou <zhouhe2025@gmail.com>
Signed-off-by: zhouhe2025 <670085873@qq.com>
…ransition

Signed-off-by: He Zhou <zhouhe2025@gmail.com>
Signed-off-by: zhouhe2025 <670085873@qq.com>
Signed-off-by: He Zhou <zhouhe2025@gmail.com>
Signed-off-by: zhouhe2025 <670085873@qq.com>
Signed-off-by: zhouhe2025 <670085873@qq.com>
@github-actions

This PR has been inactive for 14 days and is marked as stale. It will be closed in 3 days if there is no further activity.

3 participants