Replace the `Sync` component with a new initial block download engine that extends the network protocol by arya2 · Pull Request #10725 · ZcashFoundation/zebra

arya2 · 2026-06-16T15:59:31Z

Note: This PR is not complete or ready for review.

Motivation

We want to make Zebra's initial sync faster and more reliable.

Design

Add known-hash chunks covering all checkpointed blocks, with:
- header hashes
- approximate block sizes
- hashes of note commitment trees at every block height
- the set of output locations still unspent at the max checkpoint height (to avoid writing and then deleting spent outputs)
Add a tower-fair-buffer crate to help Zebra's inbound service fairly respond to syncing peers with block data
Replace the Sync component with a new initial block download engine that drives more parallel block downloads and skips the obtain/extend tip cycles that are now irrelevant thanks to the known hashes
Write blocks to memory first and have another trailing task write blocks to the database
Keep created UTXOs around in memory for 500-1000 blocks to avoid excessive disk reads during the spam attack

TODO

Extend the p2p network protocol to allow downloads of:
- known block hashes chunks,
- note commitment trees, and
- unspent output locations (at the max checkpoint height)
More review and polish iterations, especially improvements to the design spec doc
Final testing

Results So Far

Syncing from genesis to 1.7M blocks on Mainnet in ~90 minutes (8d31e52)

…fault off)

…merkle checks

…fe guard

…ow the known-hash floor

Split the finalized (checkpoint) block write loop into a two-thread pipeline so block commits are not serialized behind disk I/O:
- Thread 1 commits checkpoint-verified blocks to the non-finalized state via the new NonFinalizedState::commit_checkpoint_block (a fast in-memory commit that skips the block commitment and anchor checks the pinned hash chain already guarantees), responds to the state service immediately, and prunes blocks the disk writer has finished. The single pipeline chain is taken out of the chain set for each commit, so it is mutated in place rather than deep-cloned, and stale prefix chains cannot accumulate.
- Thread 2 commits each block to the finalized state through commit_finalized_direct, preserving the tip-linkage assertions, the debug_stop_at_height exit, and elasticsearch indexing.
Spent UTXOs and their output locations are derived from the data the non-finalized commit already resolved, instead of re-reading the database per input; the same derivation now also serves the post-checkpoint finalize path.
Chain gains UpdateWith arms for V1/V2/V3 transactions (including Bctv14 joinsplits), which can now enter the non-finalized state during the checkpoint range, plus a pub(crate) treestate accessor used by the new NonFinalizedState::peek_finalize_tip.
The pipeline does not publish the non-finalized state to the watch channel: this keeps the backup task idle during the checkpoint phase and the chain Arc uniquely owned. The end-of-phase drain only removes blocks at or below the finalized tip, preserving a restored non-finalized-state backup when the pipeline never ran.

…peline
- Fix the peer set init() destructuring (4-tuple) in zebra-network tests.
- Repoint MAX_CHECKPOINT_HEIGHT_GAP / MAX_CHECKPOINT_BYTE_COUNT uses to zebra_node_services::constants, their original home, after the zebra-consensus re-exports were removed with the checkpoint verifier.
- Convert the IBD disk-cache tests to the single-owner sync cache API.
- Delete the engine fetch tests written against the pre-rework engine (reservation tiers, run_fetch_only): the machinery they test was deliberately removed; engine tests need a rewrite against the new API.
- Replace the checkpoint-era router tests (which committed genesis through the router) with a gate test: commits at or below the known-hash floor are rejected and never reach the state.
- populated_state and value_pool_is_updated now wait for the checkpoint pipeline's disk writes to catch up before querying on-disk state, matching the pipeline's earlier commit responses.
- Narrow the WAL/compaction test accessors to cfg(test) and drop test helpers orphaned by the deleted engine tests.

Each known-hash chunk file can now carry one quantized size-hint byte per block after its hash section (33 bytes per block instead of 32); the pinned SHA-256 constants distinguish the formats, so hints ship per-chunk as size data becomes available. The separate size-hints file and its KnownHashListSpec::size_hint_hash field are removed. Mainnet chunks 00-14 (heights 0..=2,249,999) are re-emitted with hints from a checkpoint-anchored sweep of a synced local node (11,240 anchors verified; hash sections byte-identical to the previously pinned chunks). The remaining chunks stay hash-only until their size data is swept near the chain tip. KnownHashList gains a size_hint accessor, and the IBD engine now uses embedded hints for fetch batch packing, falling back to the conservative maximum for hash-only chunks.
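The chunk layout described above (a hash section, optionally followed by one quantized size-hint byte per block) can be sketched roughly as follows. This is an illustration only: `Chunk`, `parse`, the quantization rule, and the constant values are assumptions rather than the PR's code, and the real loader distinguishes the two formats by the pinned SHA-256 constant instead of by file length.

```rust
/// Length of one block hash in bytes.
const HASH_LEN: usize = 32;
/// Hypothetical quantization step; the real SIZE_HINT_UNIT value is not stated here.
const SIZE_HINT_UNIT: u64 = 8 * 1024;
/// Conservative fallback used for hash-only chunks (the Zcash block size limit).
const MAX_BLOCK_BYTES: u64 = 2_000_000;

/// A parsed known-hash chunk: block hashes plus optional per-block size hints.
pub struct Chunk {
    hashes: Vec<[u8; HASH_LEN]>,
    hints: Option<Vec<u8>>,
}

impl Chunk {
    /// Parses a chunk holding `block_count` blocks.
    /// Accepts 32 bytes per block (hash-only) or 33 bytes per block (hashes, then hints).
    pub fn parse(bytes: &[u8], block_count: usize) -> Option<Chunk> {
        let hash_bytes = block_count.checked_mul(HASH_LEN)?;
        let hints = if bytes.len() == hash_bytes {
            None
        } else if bytes.len() == hash_bytes.checked_add(block_count)? {
            Some(bytes[hash_bytes..].to_vec())
        } else {
            return None;
        };
        let hashes = bytes[..hash_bytes]
            .chunks_exact(HASH_LEN)
            .map(|h| {
                let mut hash = [0u8; HASH_LEN];
                hash.copy_from_slice(h);
                hash
            })
            .collect();
        Some(Chunk { hashes, hints })
    }

    /// Returns the hash of the block at `index` within the chunk.
    pub fn hash(&self, index: usize) -> Option<&[u8; HASH_LEN]> {
        self.hashes.get(index)
    }

    /// Returns an upper-bound size estimate for the block at `index`,
    /// falling back to the conservative maximum when no hint is available.
    pub fn size_hint(&self, index: usize) -> u64 {
        match self.hints.as_ref().and_then(|h| h.get(index)) {
            // Assumed dequantization: hint byte q covers sizes up to (q + 1) units.
            Some(&q) => ((q as u64 + 1) * SIZE_HINT_UNIT).min(MAX_BLOCK_BYTES),
            None => MAX_BLOCK_BYTES,
        }
    }
}
```

A fetch scheduler could sum `size_hint` over candidate heights to pack a batch under a byte budget, which is the role the text gives the embedded hints.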

…rifier

…atch concurrency
Three issues found in review of the known-hash IBD engine:
- commit_checkpoint_block skipped block_commitment_is_valid_for_chain_history, and the disk writer's Contextual arm skips it too, so nothing validated the ZIP-244 hashBlockCommitments (authorizing-data) root on the pipeline path. The merkle root pins only transaction IDs, which for NU5-onward blocks exclude authorizing data, so a peer could substitute signatures/proofs/ciphertexts on a pinned block and have it permanently committed. Restore the commitment check in commit_checkpoint_block, before the chain push. The sprout anchor check stays skipped (pre-NU5 txids pin joinsplit anchors via the merkle root; no sprout joinsplits exist after NU5).
- The refill Commit arm gated the frontier block's stage-2 commit on the commit-pipeline caps with no frontier bypass, unlike the Issue and Promote arms. Above-frontier commits could fill the 1024-block cap while their in-order state drain waited on the frontier, and the frontier's own commit was then refused: a silent deadlock (track_stall only watches the fetch frontier). commit_caps_allow now always admits the frontier block; the overshoot is bounded to one block.
- Engine::new sized the batch layer's max_concurrent_batches from a snapshot of the ready-peer count taken before any handshake completed (typically zero -> one batch for the entire run), throttling all of IBD to a single 16-block request in flight. Size the layer at the IBD_MAX_CONCURRENT_BATCHES ceiling instead; the live per-peer limit is already enforced at issuance by fetch_slots_available.
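The frontier bypass in commit_caps_allow can be illustrated with a minimal sketch. The signature and the constant name below are hypothetical, and only the 1024-block cap mentioned above is modelled; any other commit-pipeline caps are left out.

```rust
/// Block cap on the commit pipeline, taken from the description above.
/// The constant name is an assumption made for this sketch.
const COMMIT_PIPELINE_MAX_BLOCKS: usize = 1024;

/// Returns true if the block at `height` may enter the stage-2 commit pipeline.
///
/// The frontier block is always admitted, so above-frontier commits that have
/// filled the cap can never block the one commit their drain is waiting on.
/// The overshoot is at most one block.
pub fn commit_caps_allow(height: u32, frontier: u32, inflight_blocks: usize) -> bool {
    height == frontier || inflight_blocks < COMMIT_PIPELINE_MAX_BLOCKS
}
```

With the cap full, `commit_caps_allow(10, 10, 1024)` is still true while `commit_caps_allow(11, 10, 1024)` is false, which is the property that removes the deadlock.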

…ling
Self-contained correctness fixes from the branch review:
- Hedge re-issue spin: a hedge failing while another fetch is still in flight left the slot's hedge clock at the original issue time, so hedge_frontier re-hedged the height on the very next loop pass. Restart the clock when promoting or dropping a hedge.
- Corrupt-copy refetch spin: discard_slot_copy reset the slot with no backoff, so a peer repeatedly serving a bad body for one height drove a tight fetch -> verify-fail -> refetch loop. Back off before refetching.
- Tip-stall peer eviction now fires only when a peer reports a height above ours, i.e. there is newer chain we are failing to obtain. A fully-synced node during a long block gap, a quiet network, and regtest/unmined testnet (where the tip never grows) no longer evict a healthy peer and cancel its in-flight work every window.
- The startup checkpoint spot-check no longer panics the consensus task (exiting the node) when the best chain shrinks between its two non-atomic state reads, e.g. via an invalidateblock RPC or a reorg.
- DiskDb stores its column family names at construction instead of re-listing them from disk on every flush/compaction/level-0 call. A transient list failure can no longer silently turn the WAL-skip flush into a no-op (which would report bulk writes durable when nothing was flushed).
- The known-hashes tool now prints a KnownHashListSpec constant that compiles as-is (all fields, block::Height path, string-literal hashes).
- NotFound peer attribution uses the current fetch round's source rather than accumulating the earliest, so it no longer blames a peer that delivered blocks in a later round.
- Doc fixes: the IBD engine is on by default for Mainnet; the known_hash_list_download and known_hash_cache_write_ahead config fields are documented as not-yet-implemented.

Quality cleanups from the branch review, no behavior change:
- Remove ten public Engine accessors (base, window_len, budget_used, commit_inflight_*, fetched_blocks, hedge_count, peer_stats, slot, ...) that only existed for the engine tests deleted earlier in the branch, and the now-write-only hedge_count field (the ibd.gap.hedge.count metric already records it).
- Define SIZE_HINT_UNIT once in zebra-chain instead of separately in the engine (as u64) and the asset emitter (as u32). A single source of truth means the quantization that produces shipped hints and the dequantization the engine budgets against can't silently disagree.
- Alias IBD_BATCH_MAX_BLOCKS to the inbound GETDATA_MAX_BLOCK_COUNT serving limit it must match, rather than repeating the literal 16.
- Drop the router spot-check's dead checkpoint_sync capture (the rewritten O(1) spot-check always runs).

…he known-hash engine is off The commit gate floor was raised to the known-hash list max unconditionally, even when sync.known_hash_sync was disabled. The list-max floor exists only to stop a semantic commit from flipping the state to its non-finalized mode while the engine owns the pinned range; with the engine off there is no engine to protect, so the gate was needlessly blocking a node from committing the post-Canopy range through the legacy path. A fresh flag-off Mainnet node could not sync, and an existing-state node below the list max was stuck. router::init now takes known_hash_sync and lowers the floor to the mandatory checkpoint height when the engine is disabled (fresh flag-off sync below Canopy stays unsupported — there is no checkpoint verifier to validate those blocks, which is a separate, documented limitation). When a BelowKnownHashRange rejection does reach the legacy syncer (engine disabled or degraded on a node still below the list max), it is now surfaced as an actionable error explaining that the range is engine-only, instead of being swallowed by the generic retry path as a silent loop.

…and remove the router
With the checkpoint verifier gone, BlockVerifierRouter only added a constant known-hash floor check in front of SemanticBlockVerifier, then forwarded everything else. Fold the floor check into the semantic verifier and delete the router:
- SemanticBlockVerifier gains a known_hash_floor and a VerifyBlockError::BelowKnownHashRange variant (not a duplicate, misbehavior score 0, with is_below_known_hash_range()); its call() rejects commits at or below the floor before verification.
- router::init builds the semantic verifier with the floor (still honoring known_hash_sync) and buffers it directly. BlockVerifierRouter and RouterError are removed; the buffered service's error type is now VerifyBlockError, which ripples through the inbound, sync-download, and RPC submit_block downcasts.
No behavior change: the gate is identical, just one layer shallower. The rpc_submitblock_errors test is updated to reflect that below-floor RPC submissions are rejected by the gate before duplicate detection (a known-hash design consequence noted for a zcashd-compat follow-up); this completes the test adaptation the checkpoint-verifier removal began.

The known-hash commit gate ran before the already-in-chain check, so an RPC submit_block of a block the engine had already committed (any height below the floor) was rejected with BelowKnownHashRange instead of the zcashd-compatible duplicate response. Move the gate to after the already-in-chain check in SemanticBlockVerifier: a duplicate of an already-committed block reports as a duplicate (no new commit happens), while a genuinely new below-floor block is still rejected (it can only be committed by the engine).
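The check ordering this commit describes can be sketched as below. `GateError` and `check_commit` are hypothetical names used only for illustration; the real logic lives in SemanticBlockVerifier and uses VerifyBlockError.

```rust
/// Simplified stand-in for the relevant VerifyBlockError cases.
#[derive(Debug, PartialEq)]
pub enum GateError {
    /// The block is already in the chain (zcashd-compatible duplicate response).
    Duplicate,
    /// A new block at or below the known-hash floor: only the engine may commit it.
    BelowKnownHashRange,
}

/// Applies the already-in-chain check first, then the known-hash floor gate.
pub fn check_commit(
    height: u32,
    already_in_chain: bool,
    known_hash_floor: u32,
) -> Result<(), GateError> {
    if already_in_chain {
        return Err(GateError::Duplicate);
    }
    if height <= known_hash_floor {
        return Err(GateError::BelowKnownHashRange);
    }
    Ok(())
}
```

Swapping the two `if` statements reproduces the old behavior, where a resubmitted below-floor block was reported as BelowKnownHashRange rather than as a duplicate.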

The disk overflow tier is always present: the supervisor creates it on every (re)start and there is no config to disable it (design doc §4.5). The Option only carried a stale 'TODO: wire the production cache' and forced dual-path handling at eight call sites (a never-taken None arm in arrival placement, an unreachable expect in promotion, and several if-let wrappers). Store a plain BlockCache and drop the dead branches.

convert() computed block.zcash_serialized_size() — a full traversal of every transaction — and returned it in a tuple, but the only production caller discarded it, and the engine already records each block's exact size from the network at arrival. Return the converted block alone, removing one whole-block serialization pass per block from the rayon verify closure that runs for all ~3.36M blocks of initial sync.

…eer counts The refill walk is O(window length) per loop event. The window was bounded only by the byte budget, so in small-block eras (where the 256 MiB budget holds hundreds of thousands of ~1 KB blocks) an aggressively-fetching engine grew the window to 90k+ blocks. The per-event rescan then dominated the engine thread and *lowered* throughput as the peer set grew: measured ~400 blk/s at ~20 peers but only ~277 blk/s at ~80 peers, with a 93k-block window. Cap the fetch-ahead at IBD_WINDOW_MAX_BLOCKS (16,384) — far more than the in-flight fetch capacity plus the commit pipeline, so every peer stays fed, but small enough to keep the rescan cheap. In large-block eras the byte budgets bind well below this, so the cap only takes effect when blocks are small (exactly where the unbounded window hurt).
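A rough sketch of how the block-count cap interacts with the byte budget, assuming the window grows one block at a time from per-block size estimates. The function and the `WINDOW_BYTE_BUDGET` name are hypothetical; the 256 MiB and 16,384 values come from the description above.

```rust
/// Fetch-ahead cap in blocks, from the description above.
const IBD_WINDOW_MAX_BLOCKS: usize = 16_384;
/// Byte budget for the window (256 MiB); the constant name is an assumption.
const WINDOW_BYTE_BUDGET: u64 = 256 * 1024 * 1024;

/// Returns how many blocks fit in the fetch window, given per-block size
/// estimates in height order. Whichever limit binds first ends the window.
pub fn window_len(size_hints: impl Iterator<Item = u64>) -> usize {
    let mut used_bytes = 0u64;
    let mut len = 0usize;
    for size in size_hints {
        let over_blocks = len >= IBD_WINDOW_MAX_BLOCKS;
        let over_bytes = used_bytes.saturating_add(size) > WINDOW_BYTE_BUDGET;
        if over_blocks || over_bytes {
            break;
        }
        used_bytes += size;
        len += 1;
    }
    len
}
```

With ~1 KB blocks the byte budget alone would admit 262,144 blocks, so the 16,384-block cap binds; with 2 MB blocks the byte budget binds at 134 blocks and the cap never takes effect.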

…low-ups

…ffer
A byte-faithful copy of tower::buffer from tower 0.4.13, as the base for the prioritized fair buffer built on top of it in the next commit, so the fork's changes review as a plain diff against upstream. Mechanical adaptations, listed exhaustively:
- buffer/mod.rs becomes src/lib.rs, and gains tower's root BoxError alias (error.rs imports it from the crate root)
- tower_service::Service / tower_layer::Layer imports go through the tower facade re-exports
- a minimal vendor Cargo.toml (workspace member, tower's dependencies), and tower 0.4.13's MIT license
No behavior changes.

Replaces tower::buffer's FIFO mpsc channel and semaphore-reserved slots with a priority queue, per the tower-fair-buffer design (#7306):
- requests are tagged with an optional caller key; each key's recent request count (decaying on a two-generation rotation) is its priority, and untagged internal requests are always priority 0
- a crossbeam-skiplist SkipMap orders queued requests by (priority, FIFO sequence); the worker dispatches the lowest key first
- the buffer is always ready: a full queue sheds the highest-key queued request with a Shed error instead of exerting backpressure, and internal requests are never shed (the Layer is dropped: the buffer is constructed directly with its capacity and rotation interval)
- queue mutations are serialized under one mutex (never held across an await), closing the push/teardown races that lock-free designs leave open
- deterministic fairness, shedding, rotation, and teardown tests, plus ports of tower::buffer's worker tests
The diff against the previous commit is the full set of changes relative to tower 0.4.13's buffer.
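The queueing policy in the list above can be sketched synchronously as follows. This is not the crate's implementation: it swaps the SkipMap for a std `BTreeMap`, uses `String` caller keys, and leaves out the mutex, the worker, and the Shed error type. `FairQueue` and its methods are hypothetical names.

```rust
use std::collections::{BTreeMap, HashMap};

/// Queue key: (priority, FIFO sequence). Lower keys are dispatched first.
type QueueKey = (u64, u64);

pub struct FairQueue<R> {
    capacity: usize,
    next_seq: u64,
    /// Request counts for the current and previous rotation generations.
    current: HashMap<String, u64>,
    previous: HashMap<String, u64>,
    /// Queued requests; the bool records whether the request is tagged (sheddable).
    queue: BTreeMap<QueueKey, (bool, R)>,
}

impl<R> FairQueue<R> {
    pub fn new(capacity: usize) -> Self {
        FairQueue {
            capacity,
            next_seq: 0,
            current: HashMap::new(),
            previous: HashMap::new(),
            queue: BTreeMap::new(),
        }
    }

    /// Queues a request. `caller` is None for internal requests (priority 0).
    /// If the queue is over capacity, returns the shed request: the
    /// highest-key tagged entry. Internal requests are never shed.
    pub fn push(&mut self, caller: Option<&str>, request: R) -> Option<R> {
        let priority = match caller {
            Some(key) => {
                let count = self.current.entry(key.to_string()).or_insert(0);
                *count += 1;
                let recent = *count;
                recent + self.previous.get(key).copied().unwrap_or(0)
            }
            None => 0,
        };
        let key = (priority, self.next_seq);
        self.next_seq += 1;
        self.queue.insert(key, (caller.is_some(), request));
        if self.queue.len() <= self.capacity {
            return None;
        }
        let victim = self
            .queue
            .iter()
            .rev()
            .find(|(_, entry)| entry.0)
            .map(|(key, _)| *key)?;
        self.queue.remove(&victim).map(|(_, request)| request)
    }

    /// Dispatches the lowest-key request, if any.
    pub fn pop(&mut self) -> Option<R> {
        self.queue.pop_first().map(|(_, (_, request))| request)
    }

    /// Two-generation rotation: the oldest generation's counts are forgotten.
    pub fn rotate(&mut self) {
        self.previous = std::mem::take(&mut self.current);
    }
}
```

In this sketch a caller that has sent two requests in the current generation queues behind a caller that has sent one, and is the first to lose a queued request when the queue overflows.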

Changes the inbound service contract from Service<Request> to Service<Tagged<PeerSocketAddr, Request>>: the handshake wraps each new connection's inbound service with map_request, tagging every call with the connection's transient peer address. Isolated connections have no transient address, so their requests are tagged as internal (priority 0). The tagging happens per connection rather than per call, so the Connection type, its service bound, and all its tests are unchanged. Connections treat the fair buffer's Shed error like load_shed's Overloaded error, routing it to the existing probabilistic-disconnect overload handling. Part of #7306.

Replaces the load_shed + buffer layers on the inbound service with a FairBuffer keyed by peer address, with the same capacity bound (MAX_INBOUND_CONCURRENCY). Overload now sheds the loudest peer's queued request instead of failing the newest caller at random, so quiet peers keep getting served under load. Recent request counts rotate every INBOUND_FAIRNESS_ROTATION_INTERVAL (53s, mirroring the inventory rotation interval's reasoning). The MAX_INBOUND_RESPONSE_TIME timeout moves outside the buffer, so it now bounds queue wait plus processing: requests starved by the priority queue fail with Elapsed and feed the existing per-connection overload handling, instead of stalling their connection until the heartbeat timeout. Closes #7306.

…n with its handles

…time A caller's recent cost now counts one point per request plus one point per 10ms of inner-service response time, recorded by the response future when each response completes — so priority reflects how expensive a caller's requests are to serve, not just how many it sends. Caller costs move under their own lock so recording never touches the queue state; the inbound fairness rotation interval becomes 7 minutes, long enough that flooding or expensive requests stay deprioritized across many blocks.
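As a worked example of the cost rule, assuming partial 10ms intervals round down (the rounding is not stated above) and using a hypothetical function name:

```rust
use std::time::Duration;

/// Cost added to a caller's recent total for one completed request:
/// one point for the request plus one point per full 10ms of response time.
pub fn request_cost(response_time: Duration) -> u64 {
    1 + (response_time.as_millis() / 10) as u64
}
```

Under this assumption a request answered in 25ms costs 3 points, so a caller sending a few slow requests can rank alongside one sending many cheap ones.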

`Config::initial_peers` resolved the DNS seeders before loading the disk cache, and `resolve_peers` retries DNS indefinitely until it returns at least one address. So when the seeders were slow or unreachable, a node with a populated peer cache would stall in the DNS retry loop and never load the cached peers it could have bootstrapped from. Load the disk cache first and pass the cached peers to `resolve_peers` as a fallback. When fallback peers are available, `resolve_peers` stops after a single resolution round instead of looping on DNS, so Zebra reaches a usable initial peer set straight from cache when the seeders are down. The DNS-only path (no cache) keeps retrying as before, and the per-seeder concurrency, timeouts, and outbound connection limits are unchanged.
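The control flow of the fix can be sketched with DNS resolution abstracted as a closure. The signature below is hypothetical and synchronous; the real `resolve_peers` is async and also handles per-seeder concurrency, timeouts, and connection limits, which are omitted here.

```rust
use std::net::SocketAddr;

/// Resolves initial peers. `dns_round` performs one resolution round over all
/// seeders; `cached` holds peers already loaded from the disk cache.
///
/// With cached peers available, a single DNS round is attempted and the cache
/// fills in for whatever DNS did not return. Without a cache, DNS is retried
/// until it yields at least one address, as before.
pub fn resolve_peers<F>(mut dns_round: F, cached: &[SocketAddr]) -> Vec<SocketAddr>
where
    F: FnMut() -> Vec<SocketAddr>,
{
    loop {
        let mut peers = dns_round();
        if !cached.is_empty() {
            peers.extend_from_slice(cached);
            return peers;
        }
        if !peers.is_empty() {
            return peers;
        }
        // A real implementation would sleep or back off here before retrying.
    }
}
```

The important ordering is in the caller: the disk cache must be loaded before this function runs, otherwise the fallback slice is always empty and the retry loop behaves as it did before the fix.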

Rework commit_checkpoint_block to locate the parent chain by tip hash and source its chain context (spends, value balance, history tree) from that chain, instead of asserting a single best chain. During pure checkpoint sync there is still exactly one chain and the parent is always its tip, so the hot path is unchanged; a checkpoint block forking off a non-best chain now commits correctly against its own parent. Run every fallible, peer-influenceable check (transparent spend, value balance, NU5+ hashBlockCommitments) before any mutation, so an Err return leaves the non-finalized state untouched and the write worker can treat a commit error as recoverable. Return the new tip's ChainTipBlock together with a FinalizableBlock carrying the freshly pushed block and its treestate, ready to hand to the disk writer; delete peek_finalize_tip and switch the write.rs call site. Generalize finalize() into finalize_root(Option<Hash>): None selects the best chain (overflow case), Some(hash) finalizes the chain with that exact root hash and drops mismatched-root siblings whole, so the checkpoint prune path can retire a durable block by hash instead of by transient work. finalize() becomes a thin None wrapper. Single-chain behavior is bit-identical. Adds unit tests for the tip-fork commit, the Err-leaves-state-untouched contract, finalize_root dropping a mismatched sibling, and finalize_root(Some(best)) == finalize().

Promote the inline Thread 2 disk-write closure into a DiskWriter struct in a new write/disk_writer module, the sole caller of commit_finalized_direct. DiskRequest::{Write, EndBulk} replaces the bare FinalizableBlock channel: Write carries a bulk flag (whether to write under the FinalizedWritePhase guard) and an optional ack (Some for senders that own the recovery policy and block on durability; None for the fire-and-forget checkpoint stream whose post-ack disk errors are documented fatals). EndBulk drops the bulk guard mid-stream, FIFO-ordered behind the writes it covered. The guard becomes a reversible Option<FinalizedWritePhase> created on the first bulk write and dropped on EndBulk, a non-bulk write, or channel close. The disk writer now lives for the worker's whole lifetime inside one thread scope spanning both phases, so genesis and the reorg-overflow loop route through it with a blocking ack instead of calling commit_finalized_direct on Thread 1. Delete FinalizedState::commit_finalized (its sole caller is gone). Introduce the disk frontier bookkeeping the any-order worker will need: next_disk_height, disk_frontier_hash, and an inflight_disk queue of handed-off checkpoint blocks. Switch the prune loop to be hash-pinned: a block is finalized out of memory only when its own height is observed durable, by its exact hash via finalize_root(Some(hash)), closing the work-based-prune hole where a transient adversarial fork could orphan the pipeline chain. The worker keeps its two sequential phases; it sends EndBulk where it previously relied on the pipeline scope ending. All stores to the disk-tip atomic now happen on the single disk-writer thread, so its # Correctness note covers uniform monotonicity with no per-phase re-init. Adds disk_writer unit tests for the guard on/off/on lifecycle (including EndBulk and channel-close), and that acked writes return the committed hash and advance the published disk tip in order.
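The reversible bulk-guard lifecycle can be sketched with std channels. The message shape follows the description above, but the field types, `BulkGuard` (a stand-in for FinalizedWritePhase), and the returned log are assumptions made so the sketch is observable; no real disk write happens.

```rust
use std::sync::mpsc;

/// Requests sent to the disk-writer thread.
pub enum DiskRequest {
    /// Write one block. `bulk` selects writing under the bulk guard;
    /// `ack` is Some for senders that block on durability.
    Write { height: u32, bulk: bool, ack: Option<mpsc::Sender<u32>> },
    /// Drop the bulk guard, FIFO-ordered behind the writes it covered.
    EndBulk,
}

/// Stand-in for the FinalizedWritePhase guard.
struct BulkGuard;

/// Drains the channel until it closes. Returns (height, guard_active) per write.
pub fn run_disk_writer(requests: mpsc::Receiver<DiskRequest>) -> Vec<(u32, bool)> {
    let mut guard: Option<BulkGuard> = None;
    let mut log = Vec::new();
    for request in requests {
        match request {
            DiskRequest::Write { height, bulk, ack } => {
                if !bulk {
                    // A non-bulk write drops the guard.
                    guard = None;
                } else if guard.is_none() {
                    // The guard is created lazily on the first bulk write.
                    guard = Some(BulkGuard);
                }
                log.push((height, guard.is_some()));
                if let Some(ack) = ack {
                    // The sender may have gone away; that is not an error here.
                    let _ = ack.send(height);
                }
            }
            DiskRequest::EndBulk => guard = None,
        }
    }
    // Channel closed: any remaining guard is dropped when `guard` goes out of scope.
    log
}
```

Because EndBulk travels on the same channel as the writes, it cannot overtake a bulk write that was sent before it, which is what makes the guard safe to drop mid-stream.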

Move the worker into a new write/worker module as WriteBlockWorker, a single persistent loop that reads one WriteMessage channel and dispatches to four handlers (handle_checkpoint_block, handle_semantic_block, handle_invalidate, handle_reconsider). The worker-local commit and disk-frontier state (next_disk_height, disk_frontier_hash, inflight_disk, parent_error_map, the bulk-active flag) lives in a WorkerLoopState threaded through the handlers, with no synchronization. The hash-pinned prune moves here as prune_durable_blocks. Merge the two service-to-worker channels (finalized + non-finalized) into one UnboundedSender<WriteMessage>; WriteMessage gains a Checkpoint variant alongside Semantic/Invalidate/Reconsider, and NonFinalizedWriteMessage is gone. BlockWriteSender collapses to a single Option<sender> (Option only so Drop can close it), and spawn loses the should_use_finalized parameter. Keep the flip's external behavior with a temporary service-side accepting_checkpoint_blocks boolean replacing the dropped finalized sender: checkpoint blocks queue and send while true, repeated checkpoints get a duplicate error and semantic blocks flow once false. The flip flips the boolean instead of dropping a channel. The first semantic block drains every still-in-flight checkpoint block out of the non-finalized state (reproducing the old sequential-phase boundary so the semantic block commits as a fresh chain), sends EndBulk, and disables the recently-finalized cache. Observably today's behavior: all 151 zebra-state and 138 zebrad lib tests pass, including the commit-ordering and value-pool end-to-end tests.

Delete the service-side flip: checkpoint-verified and semantically-verified blocks now commit in any order, interleaved. The accepting_checkpoint_blocks boolean, the flip block, the post-flip duplicate-error arm, and the SentHashes.can_fork_chain_at_hashes phase gate are gone. Checkpoint blocks are always queued and drained; can_fork_chain_at(hash) is contains(hash) || hash == finalized tip, valid in every phase because every sent checkpoint hash is now recorded via SentHashes::add_sent_hash. After each checkpoint send the service records the hash, releases any queued semantic child, and every 1024 sends prunes the sent-hash and finalized-queue structures by height (replacing the flip's clears).
The write worker gates checkpoint commits for any-order arrival:
- genesis commits directly to disk (blocking ack);
- a block whose parent is neither a chain tip nor the finalized tip is adopted if an identical twin already entered memory via a semantic commit (Ok), else rejected with the new CommitBlockError::OutOfOrder { height, next_height };
- a block above the disk frontier flushes its semantically-committed ancestors to the disk writer first, failing safe (OutOfOrder, no mutation) on a fork below the frontier;
- the recently-finalized cache is enabled lazily on the first checkpoint commit;
- a commit error (the NU5+ auth-data hashBlockCommitments check, or the spend/value-balance build) responds Err and resets the service to the parent instead of panicking; the non-finalized state is untouched (validate-before-mutate), so the honest copy recommits cleanly. This replaces the previously remote-triggerable expect.
OutOfOrder is engine-compatible: the known-hash IBD engine resubmits its retained copy on any error above its frontier without inspecting it, and is_write_task_exited still matches only real channel drops. The first semantic commit drops the disk-writer bulk guard (EndBulk) and the cache; a later checkpoint block re-enables both.
New worker integration tests drive the OutOfOrder gate and unwedge-after-gap recovery through the WriteMessage channel. All 154 zebra-state and 138 zebrad lib tests pass, including the commit-ordering, value-pool, and chain-tip end-to-end tests with the flip removed.
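The phase-independent fork check quoted in this commit message, can_fork_chain_at(hash) = contains(hash) || hash == finalized tip, can be illustrated with a reduced sketch. The struct below keeps only a hash set and takes the finalized tip as a parameter; both simplifications are assumptions, and the real SentHashes also tracks heights for pruning.

```rust
use std::collections::HashSet;

/// Block hash stand-in for this sketch.
pub type Hash = [u8; 32];

/// Reduced model of the service's sent-hash tracking.
#[derive(Default)]
pub struct SentHashes {
    sent: HashSet<Hash>,
}

impl SentHashes {
    /// Records a hash sent to the write worker (checkpoint or semantic).
    pub fn add_sent_hash(&mut self, hash: Hash) {
        self.sent.insert(hash);
    }

    /// A new block may build on `hash` if that hash was already sent to the
    /// write worker, or if it is the finalized tip.
    pub fn can_fork_chain_at(&self, hash: &Hash, finalized_tip: &Hash) -> bool {
        self.sent.contains(hash) || hash == finalized_tip
    }
}
```

The check holds in every phase only because checkpoint sends are recorded too; before this change they were not, which is why a separate phase gate existed.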

…ne precisely
Move the invalidate/reconsider gating into the write worker, with a precise frontier-height guard: a block whose disk write is enqueued, in flight, or complete (its height is at or below the disk frontier) is rejected with the typed error; above the frontier, invalidation and reconsideration now work during checkpoint sync instead of being unconditionally refused. The service-side phase gate is gone (the worker decides). Make the post-op publication empty-non-finalized-state-tolerant: invalidating the only chain empties the state, so publish the empty snapshot and fall the chain tip back to the finalized tip instead of panicking in update_latest_chain_channels. Publish only on success. Fix a latent panic this path reaches: NonFinalizedState::invalidate_block removed the chain root via BTreeSet::remove(&chain), which hits Chain::cmp's 'tip hashes are always unique' unreachable when the target equals an existing chain; remove by tip hash via retain instead (semantically identical, no panic). Update the InvalidateError/ReconsiderError doc comments to the frontier wording. Adds a NonFinalizedState::height_by_hash helper, and worker tests: admin requests below the frontier are rejected (invalidate, reconsider, and unknown hash), and invalidating an above-frontier root empties the state and publishes without panicking. All 156 zebra-state, 138 zebrad, and 69 zebra-rpc lib tests pass.
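The remove-by-tip-hash change can be sketched on a simplified chain type. The `Chain` below is a stand-in with a derived ordering, so it does not reproduce the panic in Zebra's real Chain::cmp; it only shows the retain-based removal, which filters on a field and never compares a probe chain against the set's elements.

```rust
use std::collections::BTreeSet;

/// Simplified stand-in for a non-finalized chain, ordered by work then tip hash.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct Chain {
    pub work: u64,
    pub tip_hash: [u8; 32],
}

/// Removes the chain whose tip is `tip_hash`, without calling `remove(&chain)`
/// and therefore without asking the ordering to compare two equal chains.
pub fn remove_chain_by_tip(chains: &mut BTreeSet<Chain>, tip_hash: &[u8; 32]) {
    chains.retain(|chain| &chain.tip_hash != tip_hash);
}
```

Since tip hashes are unique within the set, filtering on the tip hash removes exactly the chain that `remove(&chain)` would have removed.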

Rewrite known-hash-ibd.md §7.3 to the as-built any-order design: one write worker loop over one WriteMessage channel, one persistent disk-writer thread (the sole commit_finalized_direct caller), disk-frontier bookkeeping, the adopt-twin / ancestor-flush / OutOfOrder gate, the reversible EndBulk bulk guard lifecycle, hash-pinned prune, the error table, the publication policy, the ack contract, and the precise invalidate/reconsider-during-IBD semantics. Update the one-page summary and §3.1 module layout for the worker/disk_writer split; replace §4.7's flip-trigger sentence and §7.2's residual-responsibility paragraph with the any-order safety-by-construction note; fix the stale capacity-100 mentions to the configurable min-500 channel; update the §11 gate list and §12 risk table flip references. Add a CHANGELOG entry (any-order commits, invalidate/reconsider during IBD) and the invalidate-only-chain panic fix; touch up the checkpoint_sync_retained_blocks / checkpoint_sync_pipeline_capacity config doc comments for the reversible-phase and worker-to-disk-writer wording.

feat(state): commit checkpoint and semantic blocks in any order

…ll size hints
Re-sweep the Mainnet every-block known-hash list from a synced node:
- Extend the list from 3,358,431 to 3,373,206 (heights 0..=3,373,206).
- Embed per-block size hints in all 23 chunks. Previously only chunks 00-14 carried hints; chunks 15-22 were hash-only pending the size sweep, so each grows from 32 to 33 bytes per block.
Update MAINNET_KNOWN_HASHES (max_height + the eight regenerated chunk SHA-256 constants) and the loader tests for the new tip and the now-uniform hint coverage.

Emit the Testnet every-block known-hash list as 28 chunk assets with embedded per-block size hints, swept from a synced Testnet node and verified against every spaced checkpoint anchor in test-checkpoints.txt (10,144 anchors). Add the TESTNET_KNOWN_HASHES spec constant and wire for_network() to return it for the default public Testnet (custom testnets fall through). The list max height matches TESTNET_MAX_CHECKPOINT_HEIGHT.

…fact fetch Document the next-step fast-sync direction in known-hash-ibd.md §16: ship the checkpoint-range result at H_max (note commitment trees + anchors + nullifiers, and the survivor transparent outputs keyed by OutputLocation + final value pool/balances) as SHA-256-pinned assets, loaded once before the engine starts so 0..=H_max is skipped rather than replayed. Covers the trust/verification model (chunk hashes, tree roots vs header commitments, UTXO-set hash), a staged plan (shielded-only first as the minimal measurable step), optional P2P content-addressed distribution, and the sync-time measurement plan.

…etwork fmt Extract assert_bundled_list_loads(network) so the mainnet and testnet asset-load tests share the open/boundary/genesis/size-hint contract instead of duplicating ~35 lines (the duplication would drift per network). Collapse the for_network Testnet match arm to one line so cargo fmt --check passes.

…rrent asset sizes

…ic over it Introduce a CommitStage trait (the tower Service<IbdBlock> contract pinned to the engine's stage-2 response and error types) and make Engine generic over it instead of over the state service ZS. VerifyAndCommit becomes the known-hash implementation behind a blanket impl; Engine::new keeps its network+state signature via a constructor impl specialized to Engine<ZN, VerifyAndCommit<ZS>, L>. This is phase 1 of the generic engine unification: the window, weighted-fetch, gap-hedge, and commit-pipeline machinery is now decoupled from the verification strategy, so a later full-validation CommitStage drops in unchanged.

Rename the engine's pinned-hash trait HashList -> HashSource and add two defaulted seam methods that the fixed known-hash list ignores but a discovery source (full-validation sync) will use: extend() to append newly discovered hashes above max_height, and invalidate_above() to drop hashes on a reorg. The run loop now re-reads max_height every pass, so a grown range extends the fetch window automatically. This is phase 2 of the generic engine unification: it defines the generalization point. The engine-side window reorg op and the discovery HashSource implementation land in phase 3 alongside their consumer and an engine test harness that can validate the slot/byte/commit-pipeline accounting.
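The shape of the seam can be sketched as a trait with two defaulted methods. All signatures here are assumptions (the text above names only the trait, max_height, extend(), and invalidate_above()), and `FixedList` and `DiscoveredList` are hypothetical implementors that assume a non-empty list starting at genesis.

```rust
/// Block hash stand-in for this sketch.
pub type Hash = [u8; 32];

pub trait HashSource {
    /// Highest height this source currently pins; re-read on every engine pass.
    fn max_height(&self) -> u32;
    /// The pinned hash at `height`, if any.
    fn hash(&self, height: u32) -> Option<Hash>;
    /// Appends newly discovered hashes above `max_height`. Fixed lists ignore this.
    fn extend(&mut self, _hashes: &[Hash]) {}
    /// Drops hashes above `height` on a reorg. Fixed lists ignore this.
    fn invalidate_above(&mut self, _height: u32) {}
}

/// A fixed known-hash list: relies on the default no-op seam methods.
pub struct FixedList(pub Vec<Hash>);

impl HashSource for FixedList {
    fn max_height(&self) -> u32 {
        self.0.len().saturating_sub(1) as u32
    }
    fn hash(&self, height: u32) -> Option<Hash> {
        self.0.get(height as usize).copied()
    }
}

/// A discovery source that grows and shrinks as the chain is learned.
pub struct DiscoveredList(pub Vec<Hash>);

impl HashSource for DiscoveredList {
    fn max_height(&self) -> u32 {
        self.0.len().saturating_sub(1) as u32
    }
    fn hash(&self, height: u32) -> Option<Hash> {
        self.0.get(height as usize).copied()
    }
    fn extend(&mut self, hashes: &[Hash]) {
        self.0.extend_from_slice(hashes);
    }
    fn invalidate_above(&mut self, height: u32) {
        self.0.truncate(height as usize + 1);
    }
}
```

Because the engine re-reads `max_height` each pass, a source that grows through `extend` widens the fetch window without any engine-side change, while the fixed list behaves exactly as before.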

…mmitStage seam SemanticCommit is the second CommitStage implementation: instead of the known-hash merkle pin plus CommitCheckpointVerifiedBlock, it hands each fetched block to the zebra-consensus verifier via Request::Commit (semantic + contextual validation + non-finalized commit) — the same call the legacy syncer's download_and_verify makes per block. Corrupt cached bytes still fail as a verify-stage error before the verifier is reached, so the engine's discard-and-refetch path is reused unchanged. Purely additive and not yet wired into ChainSync: it proves the CommitStage seam supports full validation with the engine's window/fetch/hedge/commit machinery unchanged. Covered by isolated MockService tests mirroring the known-hash convert_vectors. Verifier error classification (peer-attributable invalid block vs. transient state error) is left as a documented TODO for the ChainSync wiring phase, which defines the discovery error semantics.

- Rename VerifyAndCommitError::StateUnready -> StageUnready and generalize its doc/message: it is the stage-2 commit service (the Buffer'd state for VerifyAndCommit, or the block verifier for SemanticCommit), not always the state, so the surfaced error names the right subsystem.
- Restore the committed-hash parity assert in SemanticCommit::call, matching the known-hash path: the engine keys the committed slot by assigned height, so a verifier returning a different hash must fail loudly rather than mark the wrong block committed.
- Fix the HashSource docs to stop referencing an engine-side invalidate_above window op that does not exist yet: the extend/invalidate_above methods define the seam shape; the engine-side handling (window reorg, restore-cache rescan, growing-range completion) lands with the discovery source in a later phase.
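The committed-hash parity assert from the second item can be sketched as below. This is a hypothetical illustration, not the PR's code: `CommittedSlots`, `mark_committed`, and `is_committed` are invented names, and the real engine's slot bookkeeping is assumed to be more involved.

```rust
use std::collections::BTreeMap;

pub type Hash = [u8; 32];

/// Hypothetical record of committed slots, keyed by assigned height.
#[derive(Default)]
pub struct CommittedSlots {
    by_height: BTreeMap<u32, Hash>,
}

impl CommittedSlots {
    /// Marks `height` committed only if stage 2 returned the hash that was
    /// assigned to that height. A mismatch panics instead of silently
    /// marking the wrong block committed.
    pub fn mark_committed(&mut self, height: u32, assigned: Hash, returned: Hash) {
        assert_eq!(
            returned, assigned,
            "stage 2 returned a different hash than the one assigned to height {}",
            height
        );
        self.by_height.insert(height, assigned);
    }

    pub fn is_committed(&self, height: u32) -> bool {
        self.by_height.contains_key(&height)
    }
}
```

The point of the assert is that the slot key carries no hash information on its own, so without the check a wrong response would be indistinguishable from a correct one once recorded.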

github-actions · 2026-06-16T15:59:49Z

PR titles must follow Conventional Commits format. Your title needs a small adjustment.

No release type found in pull request title "Replace the `Sync` component with a new initial block download engine that extends the network protocol". Add a prefix to indicate what kind of release this pull request corresponds to. For reference, see https://www.conventionalcommits.org/

Available types:
 - feat
 - fix
 - perf
 - refactor
 - build
 - chore
 - docs
 - test
 - ci
 - style
 - revert
 - release

See the contribution guide for details.
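For illustration, a title check along these lines might look like the sketch below. It is an assumption about the shape of the rule, not the bot's actual implementation: `release_type` and `TYPES` are hypothetical names, and the parsing covers only the `type(scope)!: description` form from Conventional Commits.

```rust
/// Release types listed in the bot comment above.
const TYPES: [&str; 12] = [
    "feat", "fix", "perf", "refactor", "build", "chore",
    "docs", "test", "ci", "style", "revert", "release",
];

/// Returns the release type of a `type(scope)!: description` title,
/// or `None` when the title has no recognized prefix.
pub fn release_type(title: &str) -> Option<&str> {
    let (prefix, description) = title.split_once(": ")?;
    if description.trim().is_empty() {
        return None;
    }
    // An optional `!` marks a breaking change.
    let prefix = prefix.strip_suffix('!').unwrap_or(prefix);
    // An optional `(scope)` follows the type.
    let ty = match prefix.split_once('(') {
        Some((ty, scope)) => {
            let scope = scope.strip_suffix(')')?;
            if scope.is_empty() {
                return None;
            }
            ty
        }
        None => prefix,
    };
    TYPES.iter().any(|t| *t == ty).then_some(ty)
}
```

Under this sketch, the current PR title yields `None` because it has no `type: ` prefix, while commit titles from this PR such as `feat(ibd): engine task with ring window and weighted batched fetch` are accepted.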

arya2 added 30 commits June 11, 2026 02:26

docs(design): add known-hash IBD engine design

c6a1986

feat(chain): add KnownHashList with verified windowed chunk loader

783e451

feat(network): live peer height tracking and height-aware block routing

8960122

feat(network): rate-limited peer eviction on stalled chain tip

d9e98d3

feat(network): per-peer block delivery counters and peer set diagnostics

33176fc

feat(zebrad): add IBD engine skeleton behind sync.known_hash_sync (default off)

3b67825

feat(ibd): engine task with ring window and weighted batched fetch

91f28d7

feat(ibd): verify-and-commit tower service over the state with rayon merkle checks

2ef90d4

feat(ibd): disk overflow tier so each block is downloaded at most once

416b3e8

feat(ibd): commit pipeline with gap-priority hedging and reset recovery

e9a204c

feat(zebrad): run IBD engine before legacy syncer when enabled

407ca00

perf(state): tune RocksDB for write-heavy initial sync

3531799

perf(state): pause auto-compaction during known-hash IBD with exit-safe guard

3031210

feat(utils): add known-hashes list assembly tool

eeaf7d4

feat(zebrad)!: enable known-hash sync by default on mainnet

4f8961e

refactor(consensus)!: remove checkpoint verifier and gate commits below the known-hash floor

231d3d8

docs(design): plan folding the commit gate into the semantic block verifier

7f8d9a2

docs(design): mark the block verifier router removal as done

486f2e3

arya2 added 29 commits June 12, 2026 19:49

docs(design): record the paused subtree workflow and engine-crate follow-ups

de39a87

docs(design): name the engine crate zebra-sync

27f43f0

docs(design): record WAL auto-mode and flush-tuning follow-ups

3b28539

fix(utils): keep fair-buffer rotation live under backlog and shut down with its handles

ea1ccf1

test(state): allow stderr in the crash-test writer role

cb4ab8f

fix(ibd): always delete forgotten cache entry files

6d757ce

Merge pull request #4 from arya2/state-write-pipeline-redesign

cf396b7

feat(state): commit checkpoint and semantic blocks in any order

docs(ibd): refresh the design doc for the shipped Testnet list and current asset sizes

77adb3c

docs(ibd): document the generic engine unification (design §17)

cec897d

Replace the `Sync` component with a new initial block download engine that extends the network protocol #10725

arya2 wants to merge 92 commits into main from ibd-engine

3 participants
