Skip to content

refactor(network): complete the Zakura peer-routine migration#180

Draft
evan-forbes wants to merge 10 commits into
ironwood-mainfrom
evan/peer-routines-everywhere
Draft

refactor(network): complete the Zakura peer-routine migration#180
evan-forbes wants to merge 10 commits into
ironwood-mainfrom
evan/peer-routines-everywhere

Conversation

@evan-forbes

@evan-forbes evan-forbes commented Jun 21, 2026

Copy link
Copy Markdown

Motivation

Block-sync already runs as per-peer routines; header-sync and discovery did not. This completes the Zakura peer-routine migration so that every negotiated ordered stream is driven by one supervised peer routine that owns its per-peer protocol state, while reactors keep only genuinely shared coordination (global work production, frontier watches, broadcasts, admission, commit ordering). It also retires the now-unused generic pipe/DAG scaffolding and hardens shared state against panic poisoning.

Implements the plan in art/peer_routine_centric_refactor/. The block-sync peer-routine work this builds on has now landed in ironwood-main via #164, so this PR is rebased directly onto ironwood-main and the diff is exactly the 10 discovery, header-sync, poison-hardening, and scaffold-retirement commits summarized below — nothing from #164 is restated here.

Solution

Landed as independently-reviewed chunks:

  • Discovery → one routine. DiscoveryPeerRoutine owns admission, startup sends, inbound frame handling, progress, settle timeout, and cleanup in a single supervised task; discovery/pipe.rs is removed. Admitted peer state is removed on every exit path (normal, cancel, stream-close, protocol/local reject, panic); discovery-only-disconnect and protocol-vs-local semantics preserved.
  • Header-sync → routine data plane. HeaderSyncPeerRoutine owns the recv loop, decode, and peer-local Status/Headers/GetHeaders/NewBlock validation. HeaderSyncEvent::WireMessage is replaced by narrow shared-effect events (PeerStatusUpdated, PeerHeadersReceived, InboundGetHeadersRequested, NewBlockCandidate, PeerMisbehavior). The routine pulls GetHeaders work from a shared range queue and records expectations locally before yielding; a per-session generation scheme prevents reconnect/reset stranding. Outbound writes are routine-owned via a bounded/coalescing command channel (never-drop commands park the peer rather than lose correctness state). Misbehavior is record-only (no disconnect) for decoded-but-semantically-invalid messages; only hard decode/guard/no-expectation failures Protocol-disconnect.
  • Poison hardening. Production shared state reachable from peer routines uses non-poisoning parking_lot locks, so a routine panic cannot poison structures used by other peers.
  • Scaffold retirement. The generic Pipe/PipeShape/Flow/Node/Edge/NodeKind DAG vocabulary is deleted; the supervised launcher is split into transport/peer_task.rs. ServiceRegistry, capability negotiation, lazy escalation, duplicate-collision handling, and request/response dispatch are unchanged.

Tests

  • zebra-network and zebrad build clean; cargo clippy -p zebra-network --all-targets -- -D warnings clean.
  • Zakura suites green: zakura::header_sync 141/0, zakura::block_sync 124/0, zakura::discovery 99/0, zakura::transport 36/0, broad zakura 599/0 (--test-threads=4; the two remaining full-parallelism flakes pass in isolation).
  • New focused tests cover: teardown/cancellation and protocol-vs-local reject semantics; panic isolation with work/byte-budget return; header-sync correlation ordering; range-queue strand-prevention across timeout/send-fail/disconnect/cancel/panic/reconnect/reset; covered-range over-fetch/strand prevention; command backpressure and full-queue policy per variant; and negotiation/lazy-escalation/dispatch regressions.

Follow-up Work

  • The optional static service manifest and legacy-gossip-routine migration (chunk 08) were evaluated and deferred: the registry is already small and clear, and legacy gossip is already non-poisoning and independent of the deleted scaffold.
  • Pre-existing testkit/mock_blocksync.rs rustfmt drift and a couple of too-many-arguments clippy lints exist on the base branch and are out of scope for this PR.

AI Disclosure

  • AI tools were used: Claude Code (Opus) implemented the migration chunk-by-chunk, ran an independent review agent to gate each chunk, rebased the work onto ironwood-main after the block-sync peer-routine PR (refactor!: use peer routines for high throughput #164) landed, and drafted this description. The contributor is the responsible author and has reviewed the changes.

PR Checklist

  • The PR title follows conventional commits format.
  • The solution is tested.
  • This change was discussed in an issue or with the team beforehand.
  • The documentation and changelogs are up to date.

…er routine

Fold discovery admission, startup sends, inbound frame handling, progress
tracking, settle timeout, and cleanup into one supervised DiscoveryPeerRoutine
per stream. Remove DiscoveryExchangeStart, spawn_discovery_exchange,
DiscoverySink, DiscoverySource, run_discovery_pipe, and DiscoveryExchangeProgress;
delete discovery/pipe.rs. Reuse the existing shared discovery handle and
transport registry unchanged. Admitted peer state is removed on every exit path
(normal, cancel, stream-close, protocol/local reject, panic) with no double
remove; protocol-vs-local and discovery-only-disconnect semantics preserved.

(cherry picked from commit ab0a819)
Lift the free run_peer recv loop into a concrete HeaderSyncPeerRoutine that
owns HsLocal (expected-Headers FIFO/correlation), command draining, inbound
stream-guard admission, and frame decode for admitted header-sync peers. Spawn
it through the supervised pipe launcher with unchanged teardown semantics
(normal exit and panic send PeerDisconnected; panic and protocol reject cancel
the connection; local reject does not blame the peer). This is a compatibility
step: the reactor still owns global scheduling and direct outbound writes, and
HeaderSyncEvent::WireMessage plus HeaderSyncPeerCommand::RecordExpectedHeaders
are retained for chunks 03-05. Command-drain ordering is byte-faithful to the
old run_peer so expectations are recorded before a matching Headers can decode.

(cherry picked from commit 39746be)
Remove HeaderSyncEvent::WireMessage (and the WireDecodeFailed/WireProtocolFailure
demux) so the reactor no longer matches decoded stream-5 wire variants. Relocate
peer-local protocol validation into HeaderSyncPeerRoutine: Status validation +
status-spam metering, advertised tip/cap clamping and received-status gate,
Headers correlation + shape checks, inbound GetHeaders status/count gates, and
the pre-decode NewBlock rate gate. The routine now forwards only typed
shared-effect events (PeerStatusUpdated, PeerHeadersReceived,
InboundGetHeadersRequested, NewBlockCandidate, PeerMisbehavior); the reactor
keeps admission/snapshots, commit ordering, global NewBlock acceptance sets,
backend dispatch, and the connection-level disconnect decision.

Disconnect semantics are preserved exactly: hard protocol failures (frame/control
decode failure, oversize guard, no-expectation Headers, correlated-Headers decode
failure) still reject the connection, while decoded-but-semantically-invalid
messages (InvalidStatus, StatusSpam, GetHeadersSpam/TooLong, decoded-Headers
body-size) emit PeerMisbehavior and continue, matching the record-only
"stop peer scoring" baseline. Inbound serving-slot inflight accounting stays
reactor-side (its release couples to backend completion, chunk 04/05).

(cherry picked from commit f575f81)
Invert header-sync scheduling from reactor-push to routine-pull. Add a
SharedHeaderRangeQueue (ensure_forward/ensure_backward/take_for_peer/
return_work/mark_covered/reset_above) that reproduces the reactor's range math
(narrow/clamp/fanout/finalized-shrink/forward-before-backward) behind a short
synchronous critical section (no lock held across .await). HeaderSyncPeerRoutine
now pulls eligible work, sends GetHeaders on its own stream, and records the
expected Headers locally before yielding, emitting PeerWorkAssigned; the reactor
records the OutstandingRange and keeps commit/covered ordering.

Add a process-monotone per-session generation carried on session/PeerHeaderState/
OutstandingRange/assigned-map/DropExpectation so an old session's teardown or
timeout cannot erase a reconnected session's live assignment, and old work is
returned rather than orphaned. Outstanding ranges are returned on every exit
path: timeout and send-failure via the queue, stream-close/cancel/protocol-
reject/panic via the routine Drop. Remove HeaderSyncPeerCommand::RecordExpectedHeaders
and the try_send_get_headers expectation side effect. Misbehavior stays
record-only; no return/reset/timeout path disconnects a peer.

(cherry picked from commit 3458b3f)
…tines

Make every header-sync stream write routine-owned. The reactor still selects
destinations (eligible_tip_destinations, status-refresh broadcast) but enqueues
typed commands instead of writing frames: replace direct try_send_status /
try_send_headers_with_sizes / try_send_new_block with a coalescing single-slot
watch for Status plus a bounded (depth 256) HeaderSyncPeerCommand mpsc carrying
SendHeaders, ForwardNewBlock, Wake, and DropExpectation. Each HeaderSyncPeerRoutine
writes the frame on its own stream from its select loop.

Full-queue policy is explicit and tested per variant: Status coalesces (complete
snapshot), Wake coalesces (idempotent re-pull), ForwardNewBlock drops on full
(advisory, matches prior best-effort), while SendHeaders and DropExpectation are
never-drop -> a full/closed queue parks the peer. park_peer_for_undeliverable_command
returns the peer's outstanding work under its generation (no strand) and cancels
only the service token, leaving the connection and co-resident services alive. It
is wired only to undeliverable never-drop commands, never to misbehavior, so the
record-only misbehavior posture is unchanged. A full/slow one-peer queue cannot
wedge unrelated routines or the reactor; no shared lock is held across a send.

(cherry picked from commit ef0938d)
Add parking_lot and convert every production peer-routine-reachable
std::sync::Mutex in the Zakura layer to parking_lot::Mutex so a peer-routine
panic cannot poison shared state used by other peers: block-sync work queue,
peer registry, throughput meter, and per-service peer map; the header-sync
SharedHeaderRangeQueue; handler connection-freshness, message-rate buckets, and
upgrade-dial registry; and legacy-gossip outbound/session/latest-block and
supervisor registries. Remove the poison-specific unwrap/expect/if-let-Ok
fallbacks (all were poison-only branches; control flow is unchanged). ByteBudget
stays lock-free (AtomicU64). No lock is held across .await (parking_lot guards
are !Send, enforced by the build). Pure lock-type swap: no work-queue, registry,
throughput, or misbehavior semantics change. Add panic-while-locked tests proving
each converted structure stays usable and returns outstanding work / byte budget
after a panic; retarget the legacy-gossip replay-panic test off mutex poisoning
onto a test-only panic hook.

(cherry picked from commit 13e4070)
Now that discovery, header-sync, and block-sync own concrete peer routines,
delete the generic data-plane vocabulary Pipe, PipeCx, PipeEntry, PipeShape,
PipeSink, Flow, Node, Edge, and NodeKind. Split the supervised launcher from the
pipe vocabulary: transport/pipe.rs becomes transport/peer_task.rs exposing
spawn_supervised_routine/spawn_supervised_peer_task/handle_routine_exit and the
teardown guards, keeping SinkReject, the session guard, the panic=abort
compile_error guard, and the panic metric unchanged.

header_sync/pipe.rs becomes header_sync/routine.rs: HeaderSyncPeerRoutine owns
peer_id/HsLocal/HsEnv/SessionGuard directly and uses a routine-local
Ingest{Continue,Done,Reject} enum in place of Flow, with byte-identical select-arm
ordering, Local->Done mapping, restore-on-Local correlation, record-only-vs-protocol
misbehavior split, and Drop strand-return all preserved. block_sync/pipe.rs is
deleted (block_sync_guard moves into peer_routine.rs, which already owned its
stream-6 loop). ServiceRegistry, capability negotiation, lazy escalation,
duplicate-collision handling, request/response dispatch, and all deliver_frame
paths are unchanged; the obsolete PipeShape/PipeSink drift tests are removed while
the teardown and negotiation guardrail tests are preserved.

(cherry picked from commit 983c686)
…nly contract

Three header-sync hostile-peer tests asserted disconnect-on-misbehavior behavior
that was deliberately removed by 9b9d641 (stop peer scoring -> misbehavior is
record-only) and 325dd66 (tolerate unsolicited headers -> no misbehavior action,
but the routine still Protocol-rejects). Production behavior is correct; the tests
were stale. Realign them to the current contract without weakening coverage:

- header_sync_misbehavior_action_disconnects_peer ->
  header_sync_misbehavior_action_is_record_only_and_keeps_connection: assert the
  Misbehavior action does NOT cancel the connection and the driver keeps draining.
- native_stream5_hostile_bytes_...: keep the malformed/oversize/truncated
  hard-protocol disconnect assertions; prove unsolicited-headers teardown via the
  routine's SinkReject::Protocol path (peer-set removal) instead of the retired
  misbehavior action.
- header_sync_e2e_hostile_peer_disconnect_matrix_...: keep record-only matrix
  cases; assert the unsolicited-headers violation trace (synthetic harness has no
  connection to observe teardown, which the native test covers); expect
  MalformedMessage for an over-count Headers (decode is now bound by the
  correlated request count, so ResponseTooLong is unreachable from ingest); use an
  encodable GetHeaders count over the serving limit for GetHeadersTooLong.

Adds an assert_header_violation testkit helper. Test/testkit-only; no production
code changed.

(cherry picked from commit 016e37a)
@evan-forbes evan-forbes force-pushed the evan/peer-routines-everywhere branch from 335c20e to 3eb8a7c Compare June 22, 2026 00:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant