fix(network): add block sync congestion control by evan-forbes · Pull Request #288 · valargroup/zebra

evan-forbes · 2026-06-27T01:24:58Z

Motivation

This is the fifth block-sync PR in the stack. It adds the admission, queueing, and transport budget controls needed to keep block-sync scheduling bounded under pressure.

Solution

Add block-sync admission policy for floor and speculative work.
Add/adjust congestion controls in the peer registry, work queue, peer routine, reactor, and transport guard.
Add adaptive request-window slow start/backoff behavior.
Add central floor watchdog and bounded reorder lookahead configuration.
Wire block-sync config validation into top-level network config deserialization.
Keep state/read support needed by congestion decisions.
Add tests for admission, byte budget conservation, watchdog recovery, peer timeout behavior, queue bounds, config validation, and reactor scheduling edge cases.
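
For readers unfamiliar with slow start/backoff, the general shape of an adaptive request window can be sketched as below. This is a minimal illustration only: the `RequestWindow` type, its method names, and the plain double-on-success, halve-on-timeout rule are assumptions made for the example, and the ramp actually used in this PR may differ.

```rust
/// Hypothetical per-peer request window (illustrative, not the PR's type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RequestWindow {
    size: u32,
    min: u32,
    max: u32,
}

impl RequestWindow {
    /// Start at the smallest window and grow from there (slow start).
    pub fn new(min: u32, max: u32) -> Self {
        assert!(min >= 1 && min <= max, "bounds must satisfy 1 <= min <= max");
        Self { size: min, min, max }
    }

    /// Current number of requests allowed in flight.
    pub fn size(&self) -> u32 {
        self.size
    }

    /// Error-free response: double the window, capped at `max`.
    pub fn on_success(&mut self) {
        self.size = self.size.saturating_mul(2).min(self.max);
    }

    /// Timeout or error: halve the window, never dropping below `min`.
    pub fn on_timeout(&mut self) {
        self.size = (self.size / 2).max(self.min);
    }
}
```

In this sketch a peer that keeps answering cleanly reaches the ceiling in a logarithmic number of steps, while a single timeout gives back half of the window.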

Scope boundary:

This PR intentionally excludes packaging/tooling changes and the generated config fixture/final byte-budget tuning PR.
It also excludes the later sequencer-owned apply/committer refactor.

Tests

Passed locally on the rebuilt split stack:

git diff --check origin/review/headersync-roots..review/blocksync-throughput-defaults
cargo fmt --all -- --check
cargo test -p zebra-network zakura::block_sync --lib (146 passed on the final stack tip)
cargo test -p zebra-network p2p_v2_block_sync_config_validation_rejects_degenerate_values --lib
Focused review tests for watchdog/admission/window behavior passed.

Attempted but blocked by the local toolchain before test execution:

cargo test -p zebrad --test acceptance latest_config_is_stored -- --nocapture
Blocker: bundled librocksdb-sys C++ compilation fails on RocksDB headers using uint64_t without <cstdint>.

Specifications & References

Stack context:

Base: fix(network): prioritize block sync floor requests #287 / review/blocksync-floor-priority

Follow-up Work

Throughput-default tuning is stacked above this PR. Sequencer-owned apply remains for the later refactor PR.

AI Disclosure

No AI tools were used in this PR
AI tools were used: Codex was used to reconstruct the stacked branch, run sequential review agents, fix review findings, draft/update PR text, and run validation commands.

PR Checklist

The PR title follows conventional commits format: type(scope): description
The PR follows the contribution guidelines.
This change was discussed in an issue or with the team beforehand.
The solution is tested.
The documentation and changelogs are up to date.

evan-forbes · 2026-06-27T18:49:20Z

fixing conflicts now

p0mvn · 2026-06-27T18:56:18Z

@evan-forbes sorry, saw your message later and frontran you a bit

evan-forbes · 2026-06-27T19:02:40Z

all good! thanks

p0mvn

LGTM. Focused on grasping the concepts and dependencies in this review as opposed to bug-finding

p0mvn · 2026-06-27T19:10:10Z

+    download_floor: block::Height,
+    start_height: block::Height,
+) -> RequestPriority {
+    if start_height <= floor_rescue_high(download_floor) {


Why is this <= and not ==?

Can't we be rescuing bodies that have already been downloaded?

Ahh, no: inherently, a rescue means we haven't downloaded the block. We got a timeout and need to re-request it from someone else.

floor_rescue_high(download_floor) is download_floor + 1, and download_floor means we've already downloaded.
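
The comparison discussed in this thread can be sketched in isolation. The snippet is a simplified stand-in, not the PR's code: `Height` replaces `block::Height`, the function name `classify_request` and the `RequestPriority` variant names are hypothetical, and `floor_rescue_high` is written as `download_floor + 1` only because the reply above describes it that way.

```rust
/// Stand-in for `block::Height` (assumption for this sketch).
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct Height(pub u32);

/// Hypothetical priority classes; the variant names are assumptions.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum RequestPriority {
    FloorRescue,
    Speculative,
}

/// Highest start height still treated as floor rescue: `download_floor + 1`.
pub fn floor_rescue_high(download_floor: Height) -> Height {
    Height(download_floor.0.saturating_add(1))
}

/// Requests starting at or below the rescue bound are floor work;
/// everything above it is speculative.
pub fn classify_request(download_floor: Height, start_height: Height) -> RequestPriority {
    if start_height <= floor_rescue_high(download_floor) {
        RequestPriority::FloorRescue
    } else {
        RequestPriority::Speculative
    }
}
```

With `<=`, a request whose start height sits at or below the floor is also classified as rescue, whereas `==` would match only the single height `download_floor + 1`.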

p0mvn · 2026-06-27T19:25:27Z

+/// [`MAX_BS_INFLIGHT_REQUESTS`]) only after sustained error-free responses (see
+/// the streak-gated cubic ramp on `DownloadWindow`). In a homogeneous fleet this
+/// is the per-peer concurrency ceiling every peer offers.
+pub const DEFAULT_BS_MAX_INFLIGHT: u32 = 32000;


Why is this value so high? Is it even realistic?

Yeah, totally. The reasoning was that our timeout here is 8s. With congestion control we're ideally maximizing a peer's throughput to the point where they have 8s of congestion at any moment. Our buffer is ~4k blocks, so if we want to keep the buffer filled over 8s, that's 32k. That's at least the reasoning.

I've seen it hit 22-26k before
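
The sizing argument above works out as follows. This is a back-of-the-envelope sketch: the constant and function names are invented for the example, and it assumes the ~4k-block buffer is refilled roughly once per second, which the comment implies but does not state.

```rust
use std::time::Duration;

/// Request timeout quoted in the thread (8s).
pub const REQUEST_TIMEOUT: Duration = Duration::from_secs(8);

/// Approximate buffer size quoted in the thread (~4k blocks).
pub const BUFFER_BLOCKS: u64 = 4_000;

/// In-flight requests needed to keep the buffer full for one timeout period,
/// assuming one buffer's worth of blocks per second (assumption).
pub fn inflight_target(buffer_blocks: u64, timeout: Duration) -> u64 {
    buffer_blocks.saturating_mul(timeout.as_secs())
}
```

Calling `inflight_target(BUFFER_BLOCKS, REQUEST_TIMEOUT)` gives 32,000, which matches `DEFAULT_BS_MAX_INFLIGHT`.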

p0mvn · 2026-06-27T19:28:02Z

+///
+/// This is the hard ceiling the default advertisement ([`DEFAULT_BS_MAX_INFLIGHT`]
+/// = 32,000) is clamped to, and also the per-peer outstanding-request safety bound
+/// (`EFFECTIVE_BS_OUTBOUND_INFLIGHT_PER_PEER`). It bounds how many concurrent
+/// requests a remote peer can make us hold against it, so it doubles as a DoS bound.
+pub const MAX_BS_INFLIGHT_REQUESTS: u32 = 32_768;


Why not have a single DEFAULT_BS_MAX_INFLIGHT? Is 768 value difference between them meaningful?

Having a bit of overflow here is good, just in case we're completely maxed out and need to rescue. It's a good point, though; perhaps it should be even a bit higher.
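
The relationship between the two constants can be sketched as a clamp plus a headroom calculation. The constant values are the ones quoted in the diff; the helper names `clamp_advertised_inflight` and `rescue_headroom` are hypothetical and exist only for this illustration.

```rust
/// Default advertisement, as quoted in the diff.
pub const DEFAULT_BS_MAX_INFLIGHT: u32 = 32_000;

/// Hard ceiling, as quoted in the diff.
pub const MAX_BS_INFLIGHT_REQUESTS: u32 = 32_768;

/// Clamp whatever a peer advertises to the hard ceiling (hypothetical helper).
pub fn clamp_advertised_inflight(advertised: u32) -> u32 {
    advertised.min(MAX_BS_INFLIGHT_REQUESTS)
}

/// Slots left above the default advertisement (hypothetical helper).
pub const fn rescue_headroom() -> u32 {
    MAX_BS_INFLIGHT_REQUESTS - DEFAULT_BS_MAX_INFLIGHT
}
```

Under these values the default passes through the clamp unchanged, and 768 slots remain above it, which is the overflow the reply refers to.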

p0mvn · 2026-06-27T19:29:55Z

+pub const DEFAULT_BS_REQUEST_TIMEOUT: Duration = Duration::from_secs(8);
+/// Default central floor-watchdog cadence.
+pub const DEFAULT_BS_FLOOR_WATCHDOG_TICK: Duration = Duration::from_secs(1);


Big fan of reducing dependency on timeouts in the upcoming work

Follow-ups from review of the merged block-sync throughput stack (#284, #286, #287, #288, #289):

- sequencer: drop the redundant 500ms timer-driven floor-starvation shed. Floor rescue is already demand-driven (inline after each accepted body, plus the synchronous FundFloorReservation path), so the periodic backstop was a pure fixed-cadence poll.
- reactor: make the floor watchdog event-driven. Arm a sleep to the earliest outstanding floor-claim deadline instead of polling on a fixed tick, mirroring the per-peer routine's own-timeout arm. Removes the floor_watchdog_tick config knob.
- consensus/state: rename consensus.disable_vct_fast_sync -> consensus.vct_fast_sync (default true) to remove the double negative. Mirror updated in state config, docs, and CHANGELOG.
- work_queue: debug_assert the take_in_range_budgeted precondition (positive count, low <= high) instead of silently returning empty.
- block_sync: reword refactor-historical comments ("ported from", "verbatim", "matches the previous", "used to") to describe current behavior.

AI-assisted: implemented with Claude Code (Opus 4.8).
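
The event-driven watchdog idea in the reactor item can be sketched with the standard library alone. This is an illustration under assumptions: the function name `next_watchdog_sleep` and the slice-of-deadlines representation are hypothetical, and a real reactor would feed the returned duration to its async runtime's sleep rather than call this directly.

```rust
use std::time::{Duration, Instant};

/// How long to sleep until the earliest outstanding floor-claim deadline.
/// Returns `None` when there are no claims, so no timer needs to be armed.
pub fn next_watchdog_sleep(deadlines: &[Instant], now: Instant) -> Option<Duration> {
    deadlines
        .iter()
        .min()
        // A deadline already in the past yields a zero sleep (fire immediately).
        .map(|earliest| earliest.saturating_duration_since(now))
}
```

Compared with a fixed tick, the timer in this sketch fires only when some claim can actually have expired, and stays disarmed while nothing is outstanding.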

evan-forbes mentioned this pull request Jun 27, 2026

fix(network): tune block sync throughput defaults #289

Merged

7 tasks

evan-forbes force-pushed the review/blocksync-congestion-control branch from 4119bd9 to b06c085 June 27, 2026 04:56

evan-forbes force-pushed the review/blocksync-floor-priority branch from 6195c51 to 224c828 June 27, 2026 04:56

evan-forbes force-pushed the review/blocksync-congestion-control branch from b06c085 to 18eac76 June 27, 2026 05:01

evan-forbes force-pushed the review/blocksync-floor-priority branch from 224c828 to 96789ea June 27, 2026 05:01

p0mvn mentioned this pull request Jun 27, 2026

feat: pre-release main (sync perf, vct, block sync fixes) #290

Draft

p0mvn force-pushed the review/blocksync-floor-priority branch from 96789ea to 6a23ba8 June 27, 2026 06:33

evan-forbes added 3 commits June 27, 2026 03:14

perf(network): retain raw block bodies in reorder backlog

cb8b994

fix(network): prioritize block sync floor requests

00fe3fd

fix(network): add block sync congestion control

a7e79f3

evan-forbes force-pushed the review/blocksync-floor-priority branch from 6a23ba8 to 00fe3fd June 27, 2026 08:18

evan-forbes force-pushed the review/blocksync-congestion-control branch from 18eac76 to a7e79f3 June 27, 2026 08:18

evan-forbes marked this pull request as ready for review June 27, 2026 08:27

Merge feat/pre-release-main into block sync congestion control

e4d136e

p0mvn changed the base branch from review/blocksync-floor-priority to feat/pre-release-main June 27, 2026 18:55

p0mvn added 6 commits June 27, 2026 13:16

fix test

8683e1f

nits

f53dc41

lint

6444c2a

more comments

43222ce

renames

9ec8a10

nits

55ae293

p0mvn approved these changes Jun 27, 2026

p0mvn merged commit 04191c0 into feat/pre-release-main Jun 27, 2026
43 checks passed

evan-forbes mentioned this pull request Jun 28, 2026

fix(network): block-sync stack review follow-ups #295

Open

6 tasks

p0mvn merged 10 commits into feat/pre-release-main from review/blocksync-congestion-control

2 participants
