refactor(network): complete the Zakura peer-routine migration#180
Draft
evan-forbes wants to merge 10 commits into
Draft
refactor(network): complete the Zakura peer-routine migration#180evan-forbes wants to merge 10 commits into
evan-forbes wants to merge 10 commits into
Conversation
(cherry picked from commit a15ff7f)
(cherry picked from commit be55733)
…er routine Fold discovery admission, startup sends, inbound frame handling, progress tracking, settle timeout, and cleanup into one supervised DiscoveryPeerRoutine per stream. Remove DiscoveryExchangeStart, spawn_discovery_exchange, DiscoverySink, DiscoverySource, run_discovery_pipe, and DiscoveryExchangeProgress; delete discovery/pipe.rs. Reuse the existing shared discovery handle and transport registry unchanged. Admitted peer state is removed on every exit path (normal, cancel, stream-close, protocol/local reject, panic) with no double remove; protocol-vs-local and discovery-only-disconnect semantics preserved. (cherry picked from commit ab0a819)
Lift the free run_peer recv loop into a concrete HeaderSyncPeerRoutine that owns HsLocal (expected-Headers FIFO/correlation), command draining, inbound stream-guard admission, and frame decode for admitted header-sync peers. Spawn it through the supervised pipe launcher with unchanged teardown semantics (normal exit and panic send PeerDisconnected; panic and protocol reject cancel the connection; local reject does not blame the peer). This is a compatibility step: the reactor still owns global scheduling and direct outbound writes, and HeaderSyncEvent::WireMessage plus HeaderSyncPeerCommand::RecordExpectedHeaders are retained for chunks 03-05. Command-drain ordering is byte-faithful to the old run_peer so expectations are recorded before a matching Headers can decode. (cherry picked from commit 39746be)
Remove HeaderSyncEvent::WireMessage (and the WireDecodeFailed/WireProtocolFailure demux) so the reactor no longer matches decoded stream-5 wire variants. Relocate peer-local protocol validation into HeaderSyncPeerRoutine: Status validation + status-spam metering, advertised tip/cap clamping and received-status gate, Headers correlation + shape checks, inbound GetHeaders status/count gates, and the pre-decode NewBlock rate gate. The routine now forwards only typed shared-effect events (PeerStatusUpdated, PeerHeadersReceived, InboundGetHeadersRequested, NewBlockCandidate, PeerMisbehavior); the reactor keeps admission/snapshots, commit ordering, global NewBlock acceptance sets, backend dispatch, and the connection-level disconnect decision. Disconnect semantics are preserved exactly: hard protocol failures (frame/control decode failure, oversize guard, no-expectation Headers, correlated-Headers decode failure) still reject the connection, while decoded-but-semantically-invalid messages (InvalidStatus, StatusSpam, GetHeadersSpam/TooLong, decoded-Headers body-size) emit PeerMisbehavior and continue, matching the record-only "stop peer scoring" baseline. Inbound serving-slot inflight accounting stays reactor-side (its release couples to backend completion, chunk 04/05). (cherry picked from commit f575f81)
Invert header-sync scheduling from reactor-push to routine-pull. Add a SharedHeaderRangeQueue (ensure_forward/ensure_backward/take_for_peer/ return_work/mark_covered/reset_above) that reproduces the reactor's range math (narrow/clamp/fanout/finalized-shrink/forward-before-backward) behind a short synchronous critical section (no lock held across .await). HeaderSyncPeerRoutine now pulls eligible work, sends GetHeaders on its own stream, and records the expected Headers locally before yielding, emitting PeerWorkAssigned; the reactor records the OutstandingRange and keeps commit/covered ordering. Add a process-monotone per-session generation carried on session/PeerHeaderState/ OutstandingRange/assigned-map/DropExpectation so an old session's teardown or timeout cannot erase a reconnected session's live assignment, and old work is returned rather than orphaned. Outstanding ranges are returned on every exit path: timeout and send-failure via the queue, stream-close/cancel/protocol- reject/panic via the routine Drop. Remove HeaderSyncPeerCommand::RecordExpectedHeaders and the try_send_get_headers expectation side effect. Misbehavior stays record-only; no return/reset/timeout path disconnects a peer. (cherry picked from commit 3458b3f)
…tines Make every header-sync stream write routine-owned. The reactor still selects destinations (eligible_tip_destinations, status-refresh broadcast) but enqueues typed commands instead of writing frames: replace direct try_send_status / try_send_headers_with_sizes / try_send_new_block with a coalescing single-slot watch for Status plus a bounded (depth 256) HeaderSyncPeerCommand mpsc carrying SendHeaders, ForwardNewBlock, Wake, and DropExpectation. Each HeaderSyncPeerRoutine writes the frame on its own stream from its select loop. Full-queue policy is explicit and tested per variant: Status coalesces (complete snapshot), Wake coalesces (idempotent re-pull), ForwardNewBlock drops on full (advisory, matches prior best-effort), while SendHeaders and DropExpectation are never-drop -> a full/closed queue parks the peer. park_peer_for_undeliverable_command returns the peer's outstanding work under its generation (no strand) and cancels only the service token, leaving the connection and co-resident services alive. It is wired only to undeliverable never-drop commands, never to misbehavior, so the record-only misbehavior posture is unchanged. A full/slow one-peer queue cannot wedge unrelated routines or the reactor; no shared lock is held across a send. (cherry picked from commit ef0938d)
Add parking_lot and convert every production peer-routine-reachable std::sync::Mutex in the Zakura layer to parking_lot::Mutex so a peer-routine panic cannot poison shared state used by other peers: block-sync work queue, peer registry, throughput meter, and per-service peer map; the header-sync SharedHeaderRangeQueue; handler connection-freshness, message-rate buckets, and upgrade-dial registry; and legacy-gossip outbound/session/latest-block and supervisor registries. Remove the poison-specific unwrap/expect/if-let-Ok fallbacks (all were poison-only branches; control flow is unchanged). ByteBudget stays lock-free (AtomicU64). No lock is held across .await (parking_lot guards are !Send, enforced by the build). Pure lock-type swap: no work-queue, registry, throughput, or misbehavior semantics change. Add panic-while-locked tests proving each converted structure stays usable and returns outstanding work / byte budget after a panic; retarget the legacy-gossip replay-panic test off mutex poisoning onto a test-only panic hook. (cherry picked from commit 13e4070)
Now that discovery, header-sync, and block-sync own concrete peer routines,
delete the generic data-plane vocabulary Pipe, PipeCx, PipeEntry, PipeShape,
PipeSink, Flow, Node, Edge, and NodeKind. Split the supervised launcher from the
pipe vocabulary: transport/pipe.rs becomes transport/peer_task.rs exposing
spawn_supervised_routine/spawn_supervised_peer_task/handle_routine_exit and the
teardown guards, keeping SinkReject, the session guard, the panic=abort
compile_error guard, and the panic metric unchanged.
header_sync/pipe.rs becomes header_sync/routine.rs: HeaderSyncPeerRoutine owns
peer_id/HsLocal/HsEnv/SessionGuard directly and uses a routine-local
Ingest{Continue,Done,Reject} enum in place of Flow, with byte-identical select-arm
ordering, Local->Done mapping, restore-on-Local correlation, record-only-vs-protocol
misbehavior split, and Drop strand-return all preserved. block_sync/pipe.rs is
deleted (block_sync_guard moves into peer_routine.rs, which already owned its
stream-6 loop). ServiceRegistry, capability negotiation, lazy escalation,
duplicate-collision handling, request/response dispatch, and all deliver_frame
paths are unchanged; the obsolete PipeShape/PipeSink drift tests are removed while
the teardown and negotiation guardrail tests are preserved.
(cherry picked from commit 983c686)
…nly contract Three header-sync hostile-peer tests asserted disconnect-on-misbehavior behavior that was deliberately removed by 9b9d641 (stop peer scoring -> misbehavior is record-only) and 325dd66 (tolerate unsolicited headers -> no misbehavior action, but the routine still Protocol-rejects). Production behavior is correct; the tests were stale. Realign them to the current contract without weakening coverage: - header_sync_misbehavior_action_disconnects_peer -> header_sync_misbehavior_action_is_record_only_and_keeps_connection: assert the Misbehavior action does NOT cancel the connection and the driver keeps draining. - native_stream5_hostile_bytes_...: keep the malformed/oversize/truncated hard-protocol disconnect assertions; prove unsolicited-headers teardown via the routine's SinkReject::Protocol path (peer-set removal) instead of the retired misbehavior action. - header_sync_e2e_hostile_peer_disconnect_matrix_...: keep record-only matrix cases; assert the unsolicited-headers violation trace (synthetic harness has no connection to observe teardown, which the native test covers); expect MalformedMessage for an over-count Headers (decode is now bound by the correlated request count, so ResponseTooLong is unreachable from ingest); use an encodable GetHeaders count over the serving limit for GetHeadersTooLong. Adds an assert_header_violation testkit helper. Test/testkit-only; no production code changed. (cherry picked from commit 016e37a)
335c20e to
3eb8a7c
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
Block-sync already runs as per-peer routines; header-sync and discovery did not. This completes the Zakura peer-routine migration so that every negotiated ordered stream is driven by one supervised peer routine that owns its per-peer protocol state, while reactors keep only genuinely shared coordination (global work production, frontier watches, broadcasts, admission, commit ordering). It also retires the now-unused generic pipe/DAG scaffolding and hardens shared state against panic poisoning.
Implements the plan in
art/peer_routine_centric_refactor/. The block-sync peer-routine work this builds on has now landed inironwood-mainvia #164, so this PR is rebased directly ontoironwood-mainand the diff is exactly the 10 discovery, header-sync, poison-hardening, and scaffold-retirement commits summarized below — nothing from #164 is restated here.Solution
Landed as independently-reviewed chunks:
DiscoveryPeerRoutineowns admission, startup sends, inbound frame handling, progress, settle timeout, and cleanup in a single supervised task;discovery/pipe.rsis removed. Admitted peer state is removed on every exit path (normal, cancel, stream-close, protocol/local reject, panic); discovery-only-disconnect and protocol-vs-local semantics preserved.HeaderSyncPeerRoutineowns the recv loop, decode, and peer-localStatus/Headers/GetHeaders/NewBlockvalidation.HeaderSyncEvent::WireMessageis replaced by narrow shared-effect events (PeerStatusUpdated,PeerHeadersReceived,InboundGetHeadersRequested,NewBlockCandidate,PeerMisbehavior). The routine pullsGetHeaderswork from a shared range queue and records expectations locally before yielding; a per-session generation scheme prevents reconnect/reset stranding. Outbound writes are routine-owned via a bounded/coalescing command channel (never-drop commands park the peer rather than lose correctness state). Misbehavior is record-only (no disconnect) for decoded-but-semantically-invalid messages; only hard decode/guard/no-expectation failures Protocol-disconnect.parking_lotlocks, so a routine panic cannot poison structures used by other peers.Pipe/PipeShape/Flow/Node/Edge/NodeKindDAG vocabulary is deleted; the supervised launcher is split intotransport/peer_task.rs.ServiceRegistry, capability negotiation, lazy escalation, duplicate-collision handling, and request/response dispatch are unchanged.Tests
zebra-networkandzebradbuild clean;cargo clippy -p zebra-network --all-targets -- -D warningsclean.zakura::header_sync141/0,zakura::block_sync124/0,zakura::discovery99/0,zakura::transport36/0, broadzakura599/0 (--test-threads=4; the two remaining full-parallelism flakes pass in isolation).Follow-up Work
testkit/mock_blocksync.rsrustfmt drift and a couple oftoo-many-argumentsclippy lints exist on the base branch and are out of scope for this PR.AI Disclosure
ironwood-mainafter the block-sync peer-routine PR (refactor!: use peer routines for high throughput #164) landed, and drafted this description. The contributor is the responsible author and has reviewed the changes.PR Checklist