Skip to content

fix(network): block-sync stack review follow-ups#295

Open
evan-forbes wants to merge 1 commit into
feat/pre-release-mainfrom
evan/blocksync-stack-flups
Open

fix(network): block-sync stack review follow-ups#295
evan-forbes wants to merge 1 commit into
feat/pre-release-mainfrom
evan/blocksync-stack-flups

Conversation

@evan-forbes

Copy link
Copy Markdown

Motivation

Review follow-ups ("flups") collected from the merged block-sync throughput stack
(#284, #286, #287, #288, #289). These were filed as deferred during stack review and
are bundled here as a single follow-up PR against feat/pre-release-main.

The two largest themes:

  • Stop relying on fixed-cadence timers/poll loops where a real event already
    exists — two backstops remained after the stack landed (the fix(network): add block sync congestion control #288 merge kept both
    the demand-driven floor-funding path and the old timer backstop).
  • Remove a config double-negative before release (disable_vct_fast_sync).

Solution

  • sequencer: drop the redundant timer-driven floor-starvation shed (F1, fix(network): prioritize block sync floor requests #287).
    The 500ms FLOOR_STARVATION_SHED_INTERVAL / shed_tick arm is deleted. Floor
    rescue is already fully demand-driven: inline after each accepted body, plus the
    synchronous SequencerControlInput::FundFloorReservation path a floor requester
    drives when it cannot reserve budget. The periodic backstop was a pure
    fixed-cadence poll on top of those.

  • reactor: make the floor watchdog event-driven (F2). Instead of polling on a
    fixed floor_watchdog_tick, the reactor arms a sleep to the earliest
    outstanding floor-claim deadline
    (a new non-allocating
    PeerRegistry::earliest_outstanding_deadline_at), mirroring the per-peer
    routine's existing own-timeout arm. The cancellation logic (run_floor_watchdog)
    is unchanged; only when it fires moves from a 1s poll to the exact deadline.
    Removes the floor_watchdog_tick config knob (+ accessor, default, const, and
    both stored test configs).

  • consensus/state: rename consensus.disable_vct_fast_sync
    consensus.vct_fast_sync (F3, fix(network): tune block sync throughput defaults #289).
    Default flips from false to true
    (same behavior, no double negative). Inverted at every consumer: the state-config
    mirror, select_source_mode, finalized_state resume guard, the zebrad
    startup wiring, unit/prop tests, the VCT design doc, and CHANGELOG. No
    serde(alias) — this is a clean pre-release rename (the old name never shipped in
    a stable release).

  • work_queue: assert the take_in_range_budgeted precondition (F4, perf(network): pack block sync ranges by size hint #284).
    Adds debug_assert!(max_count > 0 && low <= high, ...) so an invalid call is a
    loud caller bug in debug/test builds, while release still degrades to an empty
    Vec rather than panicking a live node.

  • block_sync: reword refactor-historical comments (F5, perf(network): retain raw block bodies in reorder backlog #286). Comments framed
    as "ported from / verbatim from reactor.rs / matches the previous
    run_peer / used to …" required knowing the pre-refactor code and carried
    rot-prone line numbers. Reworded across the module to describe current behavior;
    genuine "why it's shaped this way" comments were kept.

Dropped after discussion (left as-is): bumping MAX_BS_INFLIGHT_REQUESTS and the
32k DEFAULT_BS_MAX_INFLIGHT default.

Tests

  • cargo fmt --all -- --check — clean.
  • cargo clippy -p zebra-network -p zebra-consensus -p zebra-state -p zebrad --all-targets -- -D warnings — clean.
  • cargo nextest run -p zebra-network -p zebra-consensus -p zebra-state — 1149/1150
    pass. The one failure (peer::handshake::tests::mutual_p2p_v2_legacy_upgrade_forms_zakura_connection)
    is a pre-existing concurrency flake unrelated to this change — it passes 3/3 in
    isolation and touches handshake code, not block-sync/config/VCT.
  • F-specific surfaces all pass: vct::tests::source_mode_precedence (rewritten for
    the inverted flag), the consensus/state default tests, the stored-config TOML
    parse/round-trip, and the block-sync shed_top_* / admission tests.
  • Being validated IRL on a running node.

Specifications & References

Originating review threads: #284, #286, #287, #288, #289.

Follow-up Work

  • peer_routine's OUTBOUND_FULL_POLL_INTERVAL (10ms re-check while a peer's
    outbound stream is full) is the one remaining poll left in place: converting it
    needs a session-level "outbound capacity freed" notification that may not exist
    yet, and the recv gating is subtle. Left untouched for safety and noted here.

AI Disclosure

  • AI tools were used: Claude Code (Opus 4.8) — implemented the changes (timer
    removal, event-driven watchdog, config rename, comment sweep) and ran the local
    fmt/clippy/test validation.

PR Checklist

  • The PR title follows conventional commits format: type(scope): description
  • The PR follows the contribution guidelines.
  • This change was discussed in an issue or with the team beforehand.
  • The solution is tested.
  • The documentation and changelogs are up to date.

@evan-forbes evan-forbes marked this pull request as ready for review June 28, 2026 02:22
@evan-forbes

Copy link
Copy Markdown
Author

happy to revert this as well fwiw

Remove a config double-negative before release (disable_vct_fast_sync).

might cause issues with existing deployments

@evan-forbes

Copy link
Copy Markdown
Author

fixing conflicts now

Follow-ups from review of the merged block-sync throughput stack
(#284, #286, #287, #288, #289):

- sequencer: drop the redundant 500ms timer-driven floor-starvation shed.
  Floor rescue is already demand-driven (inline after each accepted body,
  plus the synchronous FundFloorReservation path), so the periodic backstop
  was a pure fixed-cadence poll.
- reactor: make the floor watchdog event-driven. Arm a sleep to the earliest
  outstanding floor-claim deadline instead of polling on a fixed tick,
  mirroring the per-peer routine's own-timeout arm. Removes the
  floor_watchdog_tick config knob.
- consensus/state: rename consensus.disable_vct_fast_sync ->
  consensus.vct_fast_sync (default true) to remove the double negative.
  Mirror updated in state config, docs, and CHANGELOG.
- work_queue: debug_assert the take_in_range_budgeted precondition (positive
  count, low <= high) instead of silently returning empty.
- block_sync: reword refactor-historical comments ("ported from",
  "verbatim", "matches the previous", "used to") to describe current
  behavior.

AI-assisted: implemented with Claude Code (Opus 4.8).
@evan-forbes evan-forbes force-pushed the evan/blocksync-stack-flups branch from c8db64e to bfdfe76 Compare June 28, 2026 03:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant