Replace the Sync component with a new initial block download engine that extends the network protocol#10725

Draft
arya2 wants to merge 92 commits into main from ibd-engine

arya2 commented Jun 16, 2026

Note: This PR is not complete or ready for review.

Motivation

We want to make Zebra's initial sync faster and more reliable.

Design

  • Add known-hash chunks covering all checkpointed blocks:
    • header hashes
    • approximate block sizes
    • hashes of note commitment trees at every block height
    • a set of unspent output locations at the max checkpoint height (to avoid writing/deleting spent outputs)
  • Add a tower-fair-buffer crate to help Zebra's inbound service fairly respond to syncing peers with block data
  • Replace the Sync component with a new initial block download engine that drives more parallel block downloads and skips obtain/extend tip cycles that are now irrelevant thanks to the known hashes
  • Write blocks to memory first and have another trailing task write blocks to the database
  • Keep created UTXOs around in memory for 500-1000 blocks to avoid excessive disk reads during the spam attack

TODO

  • Extend the p2p network protocol to allow downloads of:
    • known block hashes chunks,
    • note commitment trees, and
    • unspent output locations (at the max checkpoint height)
  • More review and polish iterations, especially improvements to the design spec doc
  • Final testing

Results So Far

Syncing from genesis to 1.7M blocks on Mainnet in ~90 minutes (8d31e52)

arya2 added 30 commits June 11, 2026 02:26
Split the finalized (checkpoint) block write loop into a two-thread
pipeline so block commits are not serialized behind disk I/O:

- Thread 1 commits checkpoint-verified blocks to the non-finalized state
  via the new NonFinalizedState::commit_checkpoint_block (a fast
  in-memory commit that skips the block commitment and anchor checks the
  pinned hash chain already guarantees), responds to the state service
  immediately, and prunes blocks the disk writer has finished. The
  single pipeline chain is taken out of the chain set for each commit,
  so it is mutated in place rather than deep-cloned, and stale prefix
  chains cannot accumulate.
- Thread 2 commits each block to the finalized state through
  commit_finalized_direct, preserving the tip-linkage assertions, the
  debug_stop_at_height exit, and elasticsearch indexing. Spent UTXOs and
  their output locations are derived from the data the non-finalized
  commit already resolved, instead of re-reading the database per input;
  the same derivation now also serves the post-checkpoint finalize path.

Chain gains UpdateWith arms for V1/V2/V3 transactions (including Bctv14
joinsplits), which can now enter the non-finalized state during the
checkpoint range, plus a pub(crate) treestate accessor used by the new
NonFinalizedState::peek_finalize_tip.

The pipeline does not publish the non-finalized state to the watch
channel: this keeps the backup task idle during the checkpoint phase and
the chain Arc uniquely owned. The end-of-phase drain only removes blocks
at or below the finalized tip, preserving a restored non-finalized-state
backup when the pipeline never ran.
…peline

- Fix the peer set init() destructuring (4-tuple) in zebra-network tests.
- Repoint MAX_CHECKPOINT_HEIGHT_GAP / MAX_CHECKPOINT_BYTE_COUNT uses to
  zebra_node_services::constants, their original home, after the
  zebra-consensus re-exports were removed with the checkpoint verifier.
- Convert the IBD disk-cache tests to the single-owner sync cache API.
- Delete the engine fetch tests written against the pre-rework engine
  (reservation tiers, run_fetch_only): the machinery they test was
  deliberately removed; engine tests need a rewrite against the new API.
- Replace the checkpoint-era router tests (which committed genesis through
  the router) with a gate test: commits at or below the known-hash floor
  are rejected and never reach the state.
- populated_state and value_pool_is_updated now wait for the checkpoint
  pipeline's disk writes to catch up before querying on-disk state,
  matching the pipeline's earlier commit responses.
- Narrow the WAL/compaction test accessors to cfg(test) and drop test
  helpers orphaned by the deleted engine tests.
Each known-hash chunk file can now carry one quantized size-hint byte
per block after its hash section (33 bytes per block instead of 32);
the pinned SHA-256 constants distinguish the formats, so hints ship
per-chunk as size data becomes available. The separate size-hints file
and its KnownHashListSpec::size_hint_hash field are removed.

Mainnet chunks 00-14 (heights 0..=2,249,999) are re-emitted with hints
from a checkpoint-anchored sweep of a synced local node (11,240 anchors
verified; hash sections byte-identical to the previously pinned
chunks). The remaining chunks stay hash-only until their size data is
swept near the chain tip.

KnownHashList gains a size_hint accessor, and the IBD engine now uses
embedded hints for fetch batch packing, falling back to the
conservative maximum for hash-only chunks.
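
The chunk layout and the hint fallback described above can be sketched as follows. This is a minimal illustration, not the PR's loader: `Chunk`, `parse`, the `has_hints` flag (standing in for the format that the pinned SHA-256 constant selects), and the `SIZE_HINT_UNIT` value are all hypothetical, and the fallback is assumed to be the 2,000,000-byte block size limit.

```rust
/// Hypothetical quantization unit, for illustration only.
pub const SIZE_HINT_UNIT: u64 = 8 * 1024;
/// Assumed conservative fallback for hash-only chunks.
pub const MAX_BLOCK_BYTES: u64 = 2_000_000;

/// A parsed chunk: a hash section, optionally followed by one hint byte per block.
pub struct Chunk<'a> {
    hashes: &'a [u8],
    hints: Option<&'a [u8]>,
}

impl<'a> Chunk<'a> {
    /// Splits `bytes` into the hash section and the optional hint section.
    /// Returns `None` if the length does not match the selected format.
    pub fn parse(bytes: &'a [u8], has_hints: bool) -> Option<Self> {
        let per_block = if has_hints { 33 } else { 32 };
        if bytes.is_empty() || bytes.len() % per_block != 0 {
            return None;
        }
        let blocks = bytes.len() / per_block;
        let (hashes, rest) = bytes.split_at(blocks * 32);
        Some(Chunk { hashes, hints: has_hints.then_some(rest) })
    }

    /// Number of blocks covered by this chunk.
    pub fn len(&self) -> usize {
        self.hashes.len() / 32
    }

    /// The 32-byte hash of the block at `index` within the chunk.
    pub fn hash(&self, index: usize) -> Option<&'a [u8]> {
        self.hashes.get(index * 32..(index + 1) * 32)
    }

    /// Estimated block size in bytes: the dequantized hint when present,
    /// otherwise the conservative maximum.
    pub fn size_hint(&self, index: usize) -> Option<u64> {
        if index >= self.len() {
            return None;
        }
        Some(match self.hints {
            Some(hints) => u64::from(hints[index]) * SIZE_HINT_UNIT,
            None => MAX_BLOCK_BYTES,
        })
    }
}
```

A batch packer would then sum `size_hint` values until a byte budget is reached, which is why hash-only chunks pack fewer blocks per batch.
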
…atch concurrency

Three issues found in review of the known-hash IBD engine:

- commit_checkpoint_block skipped block_commitment_is_valid_for_chain_history,
  and the disk writer's Contextual arm skips it too, so nothing validated the
  ZIP-244 hashBlockCommitments (authorizing-data) root on the pipeline path.
  The merkle root pins only transaction IDs, which for NU5-onward blocks
  exclude authorizing data, so a peer could substitute signatures/proofs/
  ciphertexts on a pinned block and have it permanently committed. Restore the
  commitment check in commit_checkpoint_block, before the chain push. The
  sprout anchor check stays skipped (pre-NU5 txids pin joinsplit anchors via
  the merkle root; no sprout joinsplits exist after NU5).

- The refill Commit arm gated the frontier block's stage-2 commit on the
  commit-pipeline caps with no frontier bypass, unlike the Issue and Promote
  arms. Above-frontier commits could fill the 1024-block cap while their
  in-order state drain waited on the frontier, and the frontier's own commit
  was then refused — a silent deadlock (track_stall only watches the fetch
  frontier). commit_caps_allow now always admits the frontier block; the
  overshoot is bounded to one block.

- Engine::new sized the batch layer's max_concurrent_batches from a snapshot
  of the ready-peer count taken before any handshake completed (typically
  zero -> one batch for the entire run), throttling all of IBD to a single
  16-block request in flight. Size the layer at the IBD_MAX_CONCURRENT_BATCHES
  ceiling instead; the live per-peer limit is already enforced at issuance by
  fetch_slots_available.
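
The frontier bypass from the second item above can be sketched as a single predicate. The signature and the constant name are hypothetical; only the 1024-block cap and the "frontier is always admitted" rule come from the commit message.

```rust
/// The commit-pipeline cap named in the commit message (constant name is hypothetical).
pub const COMMIT_PIPELINE_MAX_BLOCKS: usize = 1024;

/// Returns true if a stage-2 commit for `height` may be issued.
/// The frontier block is always admitted, so the in-order drain can make
/// progress even when above-frontier commits have filled the cap.
pub fn commit_caps_allow(height: u32, frontier: u32, commits_in_flight: usize) -> bool {
    height == frontier || commits_in_flight < COMMIT_PIPELINE_MAX_BLOCKS
}
```

Because only one block can be the frontier at a time, the cap can be exceeded by at most one block.
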
…ling

Self-contained correctness fixes from the branch review:

- Hedge re-issue spin: a hedge failing while another fetch is still in
  flight left the slot's hedge clock at the original issue time, so
  hedge_frontier re-hedged the height on the very next loop pass. Restart
  the clock when promoting or dropping a hedge.

- Corrupt-copy refetch spin: discard_slot_copy reset the slot with no
  backoff, so a peer repeatedly serving a bad body for one height drove a
  tight fetch -> verify-fail -> refetch loop. Back off before refetching.

- Tip-stall peer eviction now fires only when a peer reports a height
  above ours, i.e. there is newer chain we are failing to obtain. A
  fully-synced node during a long block gap, a quiet network, and
  regtest/unmined testnet (where the tip never grows) no longer evict a
  healthy peer and cancel its in-flight work every window.

- The startup checkpoint spot-check no longer panics the consensus task
  (exiting the node) when the best chain shrinks between its two
  non-atomic state reads, e.g. via an invalidateblock RPC or a reorg.

- DiskDb stores its column family names at construction instead of
  re-listing them from disk on every flush/compaction/level-0 call. A
  transient list failure can no longer silently turn the WAL-skip flush
  into a no-op (which would report bulk writes durable when nothing was
  flushed).

- The known-hashes tool now prints a KnownHashListSpec constant that
  compiles as-is (all fields, block::Height path, string-literal hashes).

- NotFound peer attribution uses the current fetch round's source rather
  than accumulating the earliest, so it no longer blames a peer that
  delivered blocks in a later round.

- Doc fixes: the IBD engine is on by default for Mainnet; the
  known_hash_list_download and known_hash_cache_write_ahead config fields
  are documented as not-yet-implemented.
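
One way to picture the corrupt-copy backoff from the second item: the slot records when it may next be fetched, and each discarded copy pushes that time out. The commit message only says "back off"; the exponential schedule, the constants, and the `Slot` shape below are assumptions for illustration.

```rust
use std::time::{Duration, Instant};

/// Hypothetical backoff bounds.
const REFETCH_BACKOFF_BASE: Duration = Duration::from_millis(250);
const REFETCH_BACKOFF_MAX: Duration = Duration::from_secs(8);

/// Minimal stand-in for a fetch slot.
#[derive(Default)]
pub struct Slot {
    bad_copies: u32,
    retry_at: Option<Instant>,
}

impl Slot {
    /// Drops a copy that failed verification and delays the next fetch.
    /// The delay doubles per bad copy, up to `REFETCH_BACKOFF_MAX`.
    pub fn discard_copy(&mut self, now: Instant) {
        self.bad_copies = self.bad_copies.saturating_add(1);
        let shift = (self.bad_copies - 1).min(16);
        let delay = REFETCH_BACKOFF_BASE
            .saturating_mul(1 << shift)
            .min(REFETCH_BACKOFF_MAX);
        self.retry_at = Some(now + delay);
    }

    /// Returns true once the backoff has elapsed (or was never set).
    pub fn can_fetch(&self, now: Instant) -> bool {
        self.retry_at.map_or(true, |retry_at| now >= retry_at)
    }
}
```

Passing `now` in explicitly keeps the sketch deterministic under test; the engine may well read the clock itself.
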
Quality cleanups from the branch review, no behavior change:

- Remove ten public Engine accessors (base, window_len, budget_used,
  commit_inflight_*, fetched_blocks, hedge_count, peer_stats, slot, ...)
  that only existed for the engine tests deleted earlier in the branch,
  and the now-write-only hedge_count field (the ibd.gap.hedge.count
  metric already records it).

- Define SIZE_HINT_UNIT once in zebra-chain instead of separately in the
  engine (as u64) and the asset emitter (as u32). A single source of
  truth means the quantization that produces shipped hints and the
  dequantization the engine budgets against can't silently disagree.

- Alias IBD_BATCH_MAX_BLOCKS to the inbound GETDATA_MAX_BLOCK_COUNT
  serving limit it must match, rather than repeating the literal 16.

- Drop the router spot-check's dead checkpoint_sync capture (the
  rewritten O(1) spot-check always runs).
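
To illustrate why a single SIZE_HINT_UNIT matters, here is a sketch in which the emitter and the engine share one constant. The unit value, the function names, and the round-up rule are assumptions; the commit message only states that both sides now use the same definition.

```rust
/// Single source of truth for the hint unit (the value here is hypothetical).
pub const SIZE_HINT_UNIT: u32 = 8 * 1024;

/// Emitter side: rounds a block size up to whole units and saturates at one byte.
pub fn quantize(size_bytes: u32) -> u8 {
    let units = size_bytes.div_ceil(SIZE_HINT_UNIT);
    units.min(u32::from(u8::MAX)) as u8
}

/// Engine side: converts a shipped hint back to a byte estimate.
pub fn dequantize(hint: u8) -> u64 {
    u64::from(hint) * u64::from(SIZE_HINT_UNIT)
}
```

With round-up quantization, the dequantized estimate is never below the real size for any block that fits in 255 units, so a budget computed from hints stays conservative.
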
…he known-hash engine is off

The commit gate floor was raised to the known-hash list max
unconditionally, even when sync.known_hash_sync was disabled. The
list-max floor exists only to stop a semantic commit from flipping the
state to its non-finalized mode while the engine owns the pinned range;
with the engine off there is no engine to protect, so the gate was
needlessly blocking a node from committing the post-Canopy range through
the legacy path. A fresh flag-off Mainnet node could not sync, and an
existing-state node below the list max was stuck.

router::init now takes known_hash_sync and lowers the floor to the
mandatory checkpoint height when the engine is disabled (fresh flag-off
sync below Canopy stays unsupported — there is no checkpoint verifier to
validate those blocks, which is a separate, documented limitation).

When a BelowKnownHashRange rejection does reach the legacy syncer (engine
disabled or degraded on a node still below the list max), it is now
surfaced as an actionable error explaining that the range is
engine-only, instead of being swallowed by the generic retry path as a
silent loop.
…and remove the router

With the checkpoint verifier gone, BlockVerifierRouter only added a
constant known-hash floor check in front of SemanticBlockVerifier, then
forwarded everything else. Fold the floor check into the semantic
verifier and delete the router:

- SemanticBlockVerifier gains a known_hash_floor and a
  VerifyBlockError::BelowKnownHashRange variant (not a duplicate,
  misbehavior score 0, with is_below_known_hash_range()); its call()
  rejects commits at or below the floor before verification.
- router::init builds the semantic verifier with the floor (still honoring
  known_hash_sync) and buffers it directly. BlockVerifierRouter and
  RouterError are removed; the buffered service's error type is now
  VerifyBlockError, which ripples through the inbound, sync-download, and
  RPC submit_block downcasts.

No behavior change: the gate is identical, just one layer shallower. The
rpc_submitblock_errors test is updated to reflect that below-floor RPC
submissions are rejected by the gate before duplicate detection (a
known-hash design consequence noted for a zcashd-compat follow-up); this
completes the test adaptation the checkpoint-verifier removal began.
The known-hash commit gate ran before the already-in-chain check, so an
RPC submit_block of a block the engine had already committed (any height
below the floor) was rejected with BelowKnownHashRange instead of the
zcashd-compatible duplicate response.

Move the gate to after the already-in-chain check in
SemanticBlockVerifier: a duplicate of an already-committed block reports
as a duplicate (no new commit happens), while a genuinely new below-floor
block is still rejected (it can only be committed by the engine).
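
The reordered checks can be sketched as below. `GateOutcome`, `gate`, and the set-based duplicate lookup are hypothetical simplifications; the real verifier queries the state service and returns VerifyBlockError variants.

```rust
use std::collections::HashSet;

/// Simplified outcomes of the pre-verification checks.
#[derive(Debug, PartialEq, Eq)]
pub enum GateOutcome {
    /// The block is already in the chain: report a duplicate.
    Duplicate,
    /// A new block at or below the floor: only the engine may commit it.
    BelowKnownHashRange,
    /// Proceed to semantic verification.
    Verify,
}

/// Runs the already-in-chain check first, then the known-hash floor gate.
pub fn gate(
    hash: [u8; 32],
    height: u32,
    committed: &HashSet<[u8; 32]>,
    known_hash_floor: u32,
) -> GateOutcome {
    if committed.contains(&hash) {
        return GateOutcome::Duplicate;
    }
    if height <= known_hash_floor {
        return GateOutcome::BelowKnownHashRange;
    }
    GateOutcome::Verify
}
```
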
The disk overflow tier is always present: the supervisor creates it on
every (re)start and there is no config to disable it (design doc §4.5).
The Option only carried a stale 'TODO: wire the production cache' and
forced dual-path handling at eight call sites (a never-taken None arm in
arrival placement, an unreachable expect in promotion, and several
if-let wrappers). Store a plain BlockCache and drop the dead branches.
convert() computed block.zcash_serialized_size() — a full traversal of
every transaction — and returned it in a tuple, but the only production
caller discarded it, and the engine already records each block's exact
size from the network at arrival. Return the converted block alone,
removing one whole-block serialization pass per block from the rayon
verify closure that runs for all ~3.36M blocks of initial sync.
…eer counts

The refill walk is O(window length) per loop event. The window was
bounded only by the byte budget, so in small-block eras (where the
256 MiB budget holds hundreds of thousands of ~1 KB blocks) an
aggressively-fetching engine grew the window to 90k+ blocks. The
per-event rescan then dominated the engine thread and *lowered*
throughput as the peer set grew: measured ~400 blk/s at ~20 peers but
only ~277 blk/s at ~80 peers, with a 93k-block window.

Cap the fetch-ahead at IBD_WINDOW_MAX_BLOCKS (16,384) — far more than the
in-flight fetch capacity plus the commit pipeline, so every peer stays
fed, but small enough to keep the rescan cheap. In large-block eras the
byte budgets bind well below this, so the cap only takes effect when
blocks are small (exactly where the unbounded window hurt).
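
The interaction between the byte budget and the new block cap can be sketched as follows. This simplifies the engine to a single 256 MiB budget and a hypothetical `window_len` helper that walks per-block size hints; the real engine has more than one byte budget.

```rust
/// Block cap from the commit message.
pub const IBD_WINDOW_MAX_BLOCKS: usize = 16_384;
/// The 256 MiB budget from the commit message (constant name is hypothetical).
pub const WINDOW_BYTE_BUDGET: u64 = 256 * 1024 * 1024;

/// Returns how many blocks fit in the fetch-ahead window, given their size
/// hints in height order. Whichever limit binds first ends the window.
pub fn window_len(size_hints: impl IntoIterator<Item = u64>) -> usize {
    let mut bytes = 0u64;
    let mut blocks = 0usize;
    for hint in size_hints {
        if blocks == IBD_WINDOW_MAX_BLOCKS || bytes.saturating_add(hint) > WINDOW_BYTE_BUDGET {
            break;
        }
        bytes += hint;
        blocks += 1;
    }
    blocks
}
```

With ~1 KB blocks the block cap binds at 16,384; with 2,000,000-byte blocks the byte budget binds at 134 blocks, far below the cap.
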
arya2 added 29 commits June 12, 2026 19:49
…ffer

A byte-faithful copy of tower::buffer from tower 0.4.13, as the base for
the prioritized fair buffer built on top of it in the next commit, so the
fork's changes review as a plain diff against upstream.

Mechanical adaptations, listed exhaustively:
- buffer/mod.rs becomes src/lib.rs, and gains tower's root BoxError alias
  (error.rs imports it from the crate root)
- tower_service::Service / tower_layer::Layer imports go through the tower
  facade re-exports
- a minimal vendor Cargo.toml (workspace member, tower's dependencies), and
  tower 0.4.13's MIT license

No behavior changes.
Replaces tower::buffer's FIFO mpsc channel and semaphore-reserved slots
with a priority queue, per the tower-fair-buffer design (#7306):

- requests are tagged with an optional caller key; each key's recent
  request count (decaying on a two-generation rotation) is its priority,
  and untagged internal requests are always priority 0
- a crossbeam-skiplist SkipMap orders queued requests by
  (priority, FIFO sequence); the worker dispatches the lowest key first
- the buffer is always ready: a full queue sheds the highest-key queued
  request with a Shed error instead of exerting backpressure, and internal
  requests are never shed (the Layer is dropped: the buffer is constructed
  directly with its capacity and rotation interval)
- queue mutations are serialized under one mutex (never held across an
  await), closing the push/teardown races that lock-free designs leave open
- deterministic fairness, shedding, rotation, and teardown tests, plus
  ports of tower::buffer's worker tests

The diff against the previous commit is the full set of changes relative
to tower 0.4.13's buffer.
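
The queue ordering and shedding rules above can be sketched with a standard-library BTreeMap in place of the crossbeam-skiplist SkipMap. This is a single-threaded illustration with hypothetical names (`FairQueue`, `Entry`); it omits the mutex, the worker, and the Shed error type.

```rust
use std::collections::BTreeMap;

/// A queued request plus whether it came from an untagged internal caller.
struct Entry<T> {
    internal: bool,
    request: T,
}

/// Requests ordered by (priority, FIFO sequence); lowest key is served first.
pub struct FairQueue<T> {
    queue: BTreeMap<(u64, u64), Entry<T>>,
    capacity: usize,
    next_seq: u64,
}

impl<T> FairQueue<T> {
    pub fn new(capacity: usize) -> Self {
        FairQueue { queue: BTreeMap::new(), capacity, next_seq: 0 }
    }

    /// Queues a request. `caller_count` is the caller's recent request count,
    /// or `None` for internal requests (priority 0). Never blocks: if the
    /// queue is over capacity, the highest-key external request is shed and
    /// returned. Internal requests are never shed.
    pub fn push(&mut self, caller_count: Option<u64>, request: T) -> Option<T> {
        let key = (caller_count.unwrap_or(0), self.next_seq);
        self.next_seq += 1;
        self.queue.insert(key, Entry { internal: caller_count.is_none(), request });
        if self.queue.len() <= self.capacity {
            return None;
        }
        let victim = self
            .queue
            .iter()
            .rev()
            .find(|(_, entry)| !entry.internal)
            .map(|(key, _)| *key)?;
        self.queue.remove(&victim).map(|entry| entry.request)
    }

    /// Dispatches the lowest-key request.
    pub fn pop(&mut self) -> Option<T> {
        self.queue.pop_first().map(|(_, entry)| entry.request)
    }
}
```

Note that the shed request can be the one just pushed, if its caller is currently the loudest.
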
Changes the inbound service contract from Service<Request> to
Service<Tagged<PeerSocketAddr, Request>>: the handshake wraps each new
connection's inbound service with map_request, tagging every call with the
connection's transient peer address. Isolated connections have no transient
address, so their requests are tagged as internal (priority 0).

The tagging happens per connection rather than per call, so the Connection
type, its service bound, and all its tests are unchanged. Connections treat
the fair buffer's Shed error like load_shed's Overloaded error, routing it
to the existing probabilistic-disconnect overload handling.

Part of #7306.
Replaces the load_shed + buffer layers on the inbound service with a
FairBuffer keyed by peer address, with the same capacity bound
(MAX_INBOUND_CONCURRENCY). Overload now sheds the loudest peer's queued
request instead of failing the newest caller at random, so quiet peers keep
getting served under load.

Recent request counts rotate every INBOUND_FAIRNESS_ROTATION_INTERVAL (53s,
mirroring the inventory rotation interval's reasoning).

The MAX_INBOUND_RESPONSE_TIME timeout moves outside the buffer, so it now
bounds queue wait plus processing: requests starved by the priority queue
fail with Elapsed and feed the existing per-connection overload handling,
instead of stalling their connection until the heartbeat timeout.

Closes #7306.
…time

A caller's recent cost now counts one point per request plus one point per
10ms of inner-service response time, recorded by the response future when
each response completes — so priority reflects how expensive a caller's
requests are to serve, not just how many it sends. Caller costs move under
their own lock so recording never touches the queue state; the inbound
fairness rotation interval becomes 7 minutes, long enough that flooding or
expensive requests stay deprioritized across many blocks.
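
A sketch of the cost accounting, under two assumptions the commit message does not spell out: a caller's priority is the sum of its current and previous generation, and both points are recorded together at completion. `CallerCosts` and its methods are hypothetical names.

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::time::Duration;

/// Recent per-caller costs, kept in two generations so old activity decays.
pub struct CallerCosts<K> {
    current: HashMap<K, u64>,
    previous: HashMap<K, u64>,
}

impl<K: Hash + Eq> CallerCosts<K> {
    pub fn new() -> Self {
        CallerCosts { current: HashMap::new(), previous: HashMap::new() }
    }

    /// Records one completed request: one point for the request itself,
    /// plus one point per full 10 ms of response time.
    pub fn record(&mut self, caller: K, response_time: Duration) {
        let points = 1 + (response_time.as_millis() / 10) as u64;
        *self.current.entry(caller).or_insert(0) += points;
    }

    /// The caller's queue priority (lower is served first).
    pub fn priority(&self, caller: &K) -> u64 {
        let current = self.current.get(caller).copied().unwrap_or(0);
        let previous = self.previous.get(caller).copied().unwrap_or(0);
        current + previous
    }

    /// Called once per rotation interval: the oldest generation is dropped.
    pub fn rotate(&mut self) {
        self.previous = std::mem::take(&mut self.current);
    }
}
```

Under this scheme a caller's cost is fully forgotten after two rotations, so a 7-minute interval keeps expensive callers deprioritized for between 7 and 14 minutes.
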
`Config::initial_peers` resolved the DNS seeders before loading the disk
cache, and `resolve_peers` retries DNS indefinitely until it returns at
least one address. So when the seeders were slow or unreachable, a node
with a populated peer cache would stall in the DNS retry loop and never
load the cached peers it could have bootstrapped from.

Load the disk cache first and pass the cached peers to `resolve_peers` as
a fallback. When fallback peers are available, `resolve_peers` stops after
a single resolution round instead of looping on DNS, so Zebra reaches a
usable initial peer set straight from cache when the seeders are down.

The DNS-only path (no cache) keeps retrying as before, and the per-seeder
concurrency, timeouts, and outbound connection limits are unchanged.
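
The fallback control flow can be sketched synchronously, with a closure standing in for one DNS resolution round. The real `resolve_peers` is async and its signature differs; merging DNS results with the cached peers is also an assumption.

```rust
use std::net::SocketAddr;

/// Resolves initial peers. With fallback peers available, a single DNS round
/// is attempted and the fallback peers are appended, even if DNS returned
/// nothing. Without fallback peers, DNS is retried until it returns at least
/// one address.
pub fn resolve_peers(
    mut resolve_round: impl FnMut() -> Vec<SocketAddr>,
    fallback: &[SocketAddr],
) -> Vec<SocketAddr> {
    loop {
        let mut peers = resolve_round();
        if !fallback.is_empty() {
            peers.extend_from_slice(fallback);
            return peers;
        }
        if !peers.is_empty() {
            return peers;
        }
        // A real implementation would wait before the next round.
    }
}
```
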
Rework commit_checkpoint_block to locate the parent chain by tip hash and
source its chain context (spends, value balance, history tree) from that
chain, instead of asserting a single best chain. During pure checkpoint sync
there is still exactly one chain and the parent is always its tip, so the hot
path is unchanged; a checkpoint block forking off a non-best chain now commits
correctly against its own parent.

Run every fallible, peer-influenceable check (transparent spend, value
balance, NU5+ hashBlockCommitments) before any mutation, so an Err return
leaves the non-finalized state untouched and the write worker can treat a
commit error as recoverable. Return the new tip's ChainTipBlock together with
a FinalizableBlock carrying the freshly pushed block and its treestate, ready
to hand to the disk writer; delete peek_finalize_tip and switch the write.rs
call site.

Generalize finalize() into finalize_root(Option<Hash>): None selects the best
chain (overflow case), Some(hash) finalizes the chain with that exact root
hash and drops mismatched-root siblings whole, so the checkpoint prune path
can retire a durable block by hash instead of by transient work. finalize()
becomes a thin None wrapper.

Single-chain behavior is bit-identical. Adds unit tests for the tip-fork
commit, the Err-leaves-state-untouched contract, finalize_root dropping a
mismatched sibling, and finalize_root(Some(best)) == finalize().
Promote the inline Thread 2 disk-write closure into a DiskWriter struct in a
new write/disk_writer module, the sole caller of commit_finalized_direct.
DiskRequest::{Write, EndBulk} replaces the bare FinalizableBlock channel:
Write carries a bulk flag (whether to write under the FinalizedWritePhase
guard) and an optional ack (Some for senders that own the recovery policy and
block on durability; None for the fire-and-forget checkpoint stream whose
post-ack disk errors are documented fatals). EndBulk drops the bulk guard
mid-stream, FIFO-ordered behind the writes it covered. The guard becomes a
reversible Option<FinalizedWritePhase> created on the first bulk write and
dropped on EndBulk, a non-bulk write, or channel close.
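
The guard lifecycle above can be sketched with a standard-library channel. `BulkGuard` stands in for FinalizedWritePhase, heights stand in for blocks, and the returned log exists only to make the transitions visible; none of these shapes are taken from the PR.

```rust
use std::sync::mpsc::{Receiver, Sender};

/// Simplified disk-writer requests.
pub enum DiskRequest {
    Write { height: u32, bulk: bool, ack: Option<Sender<u32>> },
    EndBulk,
}

/// Stand-in for the bulk write-phase guard.
struct BulkGuard;

/// Processes requests in FIFO order until the channel closes, and returns
/// the sequence of guard transitions.
pub fn run_disk_writer(requests: Receiver<DiskRequest>) -> Vec<&'static str> {
    let mut guard: Option<BulkGuard> = None;
    let mut log = Vec::new();
    for request in requests {
        match request {
            DiskRequest::Write { height, bulk, ack } => {
                if bulk && guard.is_none() {
                    // The first bulk write creates the guard.
                    guard = Some(BulkGuard);
                    log.push("guard on");
                } else if !bulk && guard.take().is_some() {
                    // A non-bulk write drops it.
                    log.push("guard off");
                }
                // (The block would be written to disk here.)
                if let Some(ack) = ack {
                    // Senders that block on durability get an ack.
                    let _ = ack.send(height);
                }
            }
            DiskRequest::EndBulk => {
                if guard.take().is_some() {
                    log.push("guard off");
                }
            }
        }
    }
    // Channel close also drops the guard.
    if guard.take().is_some() {
        log.push("guard off");
    }
    log
}
```
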

The disk writer now lives for the worker's whole lifetime inside one thread
scope spanning both phases, so genesis and the reorg-overflow loop route
through it with a blocking ack instead of calling commit_finalized_direct on
Thread 1. Delete FinalizedState::commit_finalized (its sole caller is gone).

Introduce the disk frontier bookkeeping the any-order worker will need:
next_disk_height, disk_frontier_hash, and an inflight_disk queue of handed-off
checkpoint blocks. Switch the prune loop to be hash-pinned: a block is
finalized out of memory only when its own height is observed durable, by its
exact hash via finalize_root(Some(hash)), closing the work-based-prune hole
where a transient adversarial fork could orphan the pipeline chain.

The worker keeps its two sequential phases; it sends EndBulk where it
previously relied on the pipeline scope ending. All stores to the disk-tip
atomic now happen on the single disk-writer thread, so its # Correctness note
covers uniform monotonicity with no per-phase re-init.

Adds disk_writer unit tests for the guard on/off/on lifecycle (including
EndBulk and channel-close), and that acked writes return the committed hash
and advance the published disk tip in order.
Move the worker into a new write/worker module as WriteBlockWorker, a single
persistent loop that reads one WriteMessage channel and dispatches to four
handlers (handle_checkpoint_block, handle_semantic_block, handle_invalidate,
handle_reconsider). The worker-local commit and disk-frontier state
(next_disk_height, disk_frontier_hash, inflight_disk, parent_error_map, the
bulk-active flag) lives in a WorkerLoopState threaded through the handlers,
with no synchronization. The hash-pinned prune moves here as
prune_durable_blocks.

Merge the two service-to-worker channels (finalized + non-finalized) into one
UnboundedSender<WriteMessage>; WriteMessage gains a Checkpoint variant
alongside Semantic/Invalidate/Reconsider, and NonFinalizedWriteMessage is
gone. BlockWriteSender collapses to a single Option<sender> (Option only so
Drop can close it), and spawn loses the should_use_finalized parameter.

Keep the flip's external behavior with a temporary service-side
accepting_checkpoint_blocks boolean replacing the dropped finalized sender:
checkpoint blocks queue and send while true, repeated checkpoints get a
duplicate error and semantic blocks flow once false. The flip flips the
boolean instead of dropping a channel. The first semantic block drains every
still-in-flight checkpoint block out of the non-finalized state (reproducing
the old sequential-phase boundary so the semantic block commits as a fresh
chain), sends EndBulk, and disables the recently-finalized cache.

Observable behavior matches today's: all 151 zebra-state and 138 zebrad lib tests
pass, including the commit-ordering and value-pool end-to-end tests.
Delete the service-side flip: checkpoint-verified and semantically-verified
blocks now commit in any order, interleaved. The accepting_checkpoint_blocks
boolean, the flip block, the post-flip duplicate-error arm, and the
SentHashes.can_fork_chain_at_hashes phase gate are gone. Checkpoint blocks are
always queued and drained; can_fork_chain_at(hash) is contains(hash) || hash
== finalized tip, valid in every phase because every sent checkpoint hash is
now recorded via SentHashes::add_sent_hash. After each checkpoint send the
service records the hash, releases any queued semantic child, and every 1024
sends prunes the sent-hash and finalized-queue structures by height (replacing
the flip's clears).
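
The phase-independent fork check and the periodic prune can be sketched as below. The struct layout and method signatures are hypothetical, and pruning everything at or below the finalized height is an assumption about what "by height" means.

```rust
use std::collections::HashMap;

pub type Hash = [u8; 32];

/// Hashes of blocks sent to the write worker, with their heights.
#[derive(Default)]
pub struct SentHashes {
    sent: HashMap<Hash, u32>,
    sends_since_prune: u32,
}

impl SentHashes {
    /// Records a sent block, and prunes by height every 1024 sends.
    pub fn add_sent_hash(&mut self, hash: Hash, height: u32, finalized_height: u32) {
        self.sent.insert(hash, height);
        self.sends_since_prune += 1;
        if self.sends_since_prune == 1024 {
            self.sends_since_prune = 0;
            self.sent.retain(|_, sent_height| *sent_height > finalized_height);
        }
    }

    /// A child can be queued on `hash` if that block was sent, or if it is
    /// the finalized tip. No phase flag is consulted.
    pub fn can_fork_chain_at(&self, hash: &Hash, finalized_tip: &Hash) -> bool {
        self.sent.contains_key(hash) || hash == finalized_tip
    }

    pub fn len(&self) -> usize {
        self.sent.len()
    }
}
```
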

The write worker gates checkpoint commits for any-order arrival:
- genesis commits directly to disk (blocking ack);
- a block whose parent is neither a chain tip nor the finalized tip is adopted if
  an identical twin already entered memory via a semantic commit (Ok), else
  rejected with the new CommitBlockError::OutOfOrder { height, next_height };
- a block above the disk frontier flushes its semantically-committed ancestors
  to the disk writer first, failing safe (OutOfOrder, no mutation) on a fork
  below the frontier;
- the recently-finalized cache is enabled lazily on the first checkpoint
  commit;
- a commit error (the NU5+ auth-data hashBlockCommitments check, or the
  spend/value-balance build) responds Err and resets the service to the parent
  instead of panicking — the non-finalized state is untouched
  (validate-before-mutate), so the honest copy recommits cleanly. This
  replaces the previously remote-triggerable expect.

OutOfOrder is engine-compatible: the known-hash IBD engine resubmits its
retained copy on any error above its frontier without inspecting it, and
is_write_task_exited still matches only real channel drops. The first semantic
commit drops the disk-writer bulk guard (EndBulk) and the cache; a later
checkpoint block re-enables both.

New worker integration tests drive the OutOfOrder gate and unwedge-after-gap
recovery through the WriteMessage channel. All 154 zebra-state and 138 zebrad
lib tests pass, including the commit-ordering, value-pool, and chain-tip
end-to-end tests with the flip removed.
…ne precisely

Move the invalidate/reconsider gating into the write worker, with a precise
frontier-height guard: a block whose disk write is enqueued, in flight, or
complete (its height is at or below the disk frontier) is rejected with the
typed error; above the frontier, invalidation and reconsideration now work
during checkpoint sync instead of being unconditionally refused. The
service-side phase gate is gone (the worker decides).

Make the post-op publication empty-non-finalized-state-tolerant: invalidating
the only chain empties the state, so publish the empty snapshot and fall the
chain tip back to the finalized tip instead of panicking in
update_latest_chain_channels. Publish only on success.

Fix a latent panic this path reaches: NonFinalizedState::invalidate_block
removed the chain root via BTreeSet::remove(&chain), which hits Chain::cmp's
'tip hashes are always unique' unreachable when the target equals an existing
chain; remove by tip hash via retain instead (semantically identical, no
panic). Update the InvalidateError/ReconsiderError doc comments to the
frontier wording.
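
The remove-versus-retain difference can be reproduced with a toy model. The `Chain` below is a hypothetical two-field stand-in whose `cmp` panics on equal tip hashes, mirroring the invariant described above; the real Chain ordering is more involved.

```rust
use std::cmp::Ordering;
use std::collections::BTreeSet;

/// Toy chain: ordered by work, with tip hashes assumed unique.
#[derive(Debug, PartialEq, Eq)]
pub struct Chain {
    pub work: u64,
    pub tip_hash: [u8; 32],
}

impl Ord for Chain {
    fn cmp(&self, other: &Self) -> Ordering {
        match self.work.cmp(&other.work) {
            Ordering::Equal if self.tip_hash == other.tip_hash => {
                unreachable!("tip hashes are always unique")
            }
            Ordering::Equal => self.tip_hash.cmp(&other.tip_hash),
            ordering => ordering,
        }
    }
}

impl PartialOrd for Chain {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

/// Removes a chain by tip hash. `BTreeSet::remove(&chain)` would compare the
/// target with its equal in the set and hit the `unreachable!`; `retain`
/// only evaluates the predicate and never calls `cmp`.
pub fn remove_chain(chains: &mut BTreeSet<Chain>, tip_hash: [u8; 32]) {
    chains.retain(|chain| chain.tip_hash != tip_hash);
}
```
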

Adds a NonFinalizedState::height_by_hash helper, and worker tests: admin
requests below the frontier are rejected (invalidate, reconsider, and unknown
hash), and invalidating an above-frontier root empties the state and publishes
without panicking. All 156 zebra-state, 138 zebrad, and 69 zebra-rpc lib tests
pass.
Rewrite known-hash-ibd.md §7.3 to the as-built any-order design: one write
worker loop over one WriteMessage channel, one persistent disk-writer thread
(the sole commit_finalized_direct caller), disk-frontier bookkeeping, the
adopt-twin / ancestor-flush / OutOfOrder gate, the reversible EndBulk bulk
guard lifecycle, hash-pinned prune, the error table, the publication policy,
the ack contract, and the precise invalidate/reconsider-during-IBD semantics.

Update the one-page summary and §3.1 module layout for the worker/disk_writer
split; replace §4.7's flip-trigger sentence and §7.2's residual-responsibility
paragraph with the any-order safety-by-construction note; fix the stale
capacity-100 mentions to the configurable min-500 channel; update the §11 gate
list and §12 risk table flip references.

Add a CHANGELOG entry (any-order commits, invalidate/reconsider during IBD)
and the invalidate-only-chain panic fix; touch up the
checkpoint_sync_retained_blocks / checkpoint_sync_pipeline_capacity config doc
comments for the reversible-phase and worker-to-disk-writer wording.
feat(state): commit checkpoint and semantic blocks in any order
…ll size hints

Re-sweep the Mainnet every-block known-hash list from a synced node:

- Extend the list from 3,358,431 to 3,373,206 (heights 0..=3,373,206).
- Embed per-block size hints in all 23 chunks. Previously only chunks
  00-14 carried hints; chunks 15-22 were hash-only pending the size
  sweep, so each grows from 32 to 33 bytes per block.

Update MAINNET_KNOWN_HASHES (max_height + the eight regenerated chunk
SHA-256 constants) and the loader tests for the new tip and the
now-uniform hint coverage.
Emit the Testnet every-block known-hash list as 28 chunk assets with embedded
per-block size hints, swept from a synced Testnet node and verified against
every spaced checkpoint anchor in test-checkpoints.txt (10,144 anchors). Add the
TESTNET_KNOWN_HASHES spec constant and wire for_network() to return it for the
default public Testnet (custom testnets fall through). The list max height
matches TESTNET_MAX_CHECKPOINT_HEIGHT.
…fact fetch

Document the next-step fast-sync direction in known-hash-ibd.md §16: ship the
checkpoint-range result at H_max (note commitment trees + anchors + nullifiers,
and the survivor transparent outputs keyed by OutputLocation + final value
pool/balances) as SHA-256-pinned assets, loaded once before the engine starts so
0..=H_max is skipped rather than replayed. Covers the trust/verification model
(chunk hashes, tree roots vs header commitments, UTXO-set hash), a staged plan
(shielded-only first as the minimal measurable step), optional P2P
content-addressed distribution, and the sync-time measurement plan.
…etwork fmt

Extract assert_bundled_list_loads(network) so the mainnet and testnet
asset-load tests share the open/boundary/genesis/size-hint contract instead of
duplicating ~35 lines (the duplication would drift per network). Collapse the
for_network Testnet match arm to one line so cargo fmt --check passes.
…ic over it

Introduce a CommitStage trait (the tower Service<IbdBlock> contract pinned to
the engine's stage-2 response and error types) and make Engine generic over it
instead of over the state service ZS. VerifyAndCommit becomes the known-hash
implementation behind a blanket impl; Engine::new keeps its network+state
signature via a constructor impl specialized to Engine<ZN, VerifyAndCommit<ZS>, L>.

This is phase 1 of the generic engine unification: the window, weighted-fetch,
gap-hedge, and commit-pipeline machinery is now decoupled from the verification
strategy, so a later full-validation CommitStage drops in unchanged.
Rename the engine's pinned-hash trait HashList -> HashSource and add two
defaulted seam methods that the fixed known-hash list ignores but a discovery
source (full-validation sync) will use: extend() to append newly discovered
hashes above max_height, and invalidate_above() to drop hashes on a reorg. The
run loop now re-reads max_height every pass, so a grown range extends the fetch
window automatically.
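
The seam shape can be sketched as a trait with two defaulted methods. The method signatures and both implementing types are hypothetical; the PR's trait has its own accessors and error handling.

```rust
pub type Hash = [u8; 32];

/// Source of pinned hashes for the engine, indexed by height from genesis.
pub trait HashSource {
    /// Highest height currently covered. The run loop re-reads this each pass.
    fn max_height(&self) -> u32;

    /// Appends newly discovered hashes above `max_height`. Ignored by default.
    fn extend(&mut self, _new_hashes: &[Hash]) {}

    /// Drops hashes above `height` after a reorg. Ignored by default.
    fn invalidate_above(&mut self, _height: u32) {}
}

/// A fixed list: relies on both defaults, so its range never changes.
pub struct KnownHashes(pub Vec<Hash>);

impl HashSource for KnownHashes {
    fn max_height(&self) -> u32 {
        self.0.len().saturating_sub(1) as u32
    }
}

/// A discovery source: grows and shrinks as the chain is explored.
pub struct DiscoveredHashes(pub Vec<Hash>);

impl HashSource for DiscoveredHashes {
    fn max_height(&self) -> u32 {
        self.0.len().saturating_sub(1) as u32
    }

    fn extend(&mut self, new_hashes: &[Hash]) {
        self.0.extend_from_slice(new_hashes);
    }

    fn invalidate_above(&mut self, height: u32) {
        self.0.truncate(height as usize + 1);
    }
}
```

Defaulted methods let the existing known-hash list compile unchanged while the discovery source opts in.
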

This is phase 2 of the generic engine unification: it defines the
generalization point. The engine-side window reorg op and the discovery
HashSource implementation land in phase 3 alongside their consumer and an engine
test harness that can validate the slot/byte/commit-pipeline accounting.
…mmitStage seam

SemanticCommit is the second CommitStage implementation: instead of the
known-hash merkle pin plus CommitCheckpointVerifiedBlock, it hands each fetched
block to the zebra-consensus verifier via Request::Commit (semantic +
contextual validation + non-finalized commit) — the same call the legacy
syncer's download_and_verify makes per block. Corrupt cached bytes still fail as
a verify-stage error before the verifier is reached, so the engine's
discard-and-refetch path is reused unchanged.

Purely additive and not yet wired into ChainSync: it proves the CommitStage
seam supports full validation with the engine's window/fetch/hedge/commit
machinery unchanged. Covered by isolated MockService tests mirroring the
known-hash convert_vectors.

Verifier error classification (peer-attributable invalid block vs. transient
state error) is left as a documented TODO for the ChainSync wiring phase, which
defines the discovery error semantics.
- Rename VerifyAndCommitError::StateUnready -> StageUnready and generalize its
  doc/message: it is the stage-2 commit service (the Buffer'd state for
  VerifyAndCommit, or the block verifier for SemanticCommit), not always the
  state, so the surfaced error names the right subsystem.
- Restore the committed-hash parity assert in SemanticCommit::call, matching the
  known-hash path: the engine keys the committed slot by assigned height, so a
  verifier returning a different hash must fail loudly rather than mark the wrong
  block committed.
- Fix the HashSource docs to stop referencing an engine-side invalidate_above
  window op that does not exist yet: the extend/invalidate_above methods define
  the seam shape; the engine-side handling (window reorg, restore-cache rescan,
  growing-range completion) lands with the discovery source in a later phase.
@github-actions

PR titles must follow Conventional Commits format. Your title needs a small adjustment.

No release type found in pull request title "Replace the `Sync` component with a new initial block download engine that extends the network protocol". Add a prefix to indicate what kind of release this pull request corresponds to. For reference, see https://www.conventionalcommits.org/

Available types:
 - feat
 - fix
 - perf
 - refactor
 - build
 - chore
 - docs
 - test
 - ci
 - style
 - revert
 - release

See the contribution guide for details.
