fix(network): avoid disconnecting peers that return empty find responses at the chain tip#10732
fix(network): avoid disconnecting peers that return empty find responses at the chain tip#10732jvff wants to merge 10 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR adjusts Zebra’s peer stall-tracking so that peers are no longer disconnected for empty FindBlocks/FindHeaders responses when Zebra is already at (or near) the network chain tip, while preserving the existing stall-disconnect behavior during catch-up syncing.
Changes:
zebra-chain: adds an “at/near tip” threshold constant and a providedChainTip::is_at_or_near_network_tip()helper.zebra-network: exposesMinimumPeerVersion::chain_tip()and uses it to gate stall tracking inPeerSet::route_p2c; updates handshakestart_heightderivation accordingly.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| zebra-network/src/peer/minimum_peer_version.rs | Replaces chain_tip_height() with chain_tip() accessor for callers needing richer tip context. |
| zebra-network/src/peer/handshake.rs | Computes start_height from chain_tip().best_tip_height() (defaulting to height 0). |
| zebra-network/src/peer_set/set.rs | Disables find-response stall event emission when at/near tip; keeps it enabled while syncing. |
| zebra-chain/src/chain_tip.rs | Introduces AT_OR_NEAR_TIP_THRESHOLD and is_at_or_near_network_tip() to centralize the tip proximity decision. |
ad77fdf to
33914be
Compare
33914be to
779e11f
Compare
…king on chain-tip state Carry of ZcashFoundation/zebra#10732 (open, closes their #10715) to fix the stall-tracker false-positive that disconnects a single-upstream internal node at the tip (Luxor). Drop on the next subtree pull after #10732 merges upstream. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
In some places the `MinimumPeerVersion` becomes the way to access the `ChainTip`, so we should make that explicit instead of adding unrelated methods to `MinimumPeerVersion`.
Refactor to keep `MinimumPeerVersion` more focused.
A helper method to check if the node can be considered to be synced.
779e11f to
e48127e
Compare
e48127e to
649ff3f
Compare
Empty find block or find headers responses are expected once nodes are synced.
Ensure that the stall tracker only disconnects peers while the node is syncing.
Empty find blocks or find headers responses can slow down synchronization, so the peers should be penalized with a disconnection.
Ensure that the initial behavior is to disconnect from peers that return empty find blocks or headers responses.
Ensure that if the node falls behind, it will still track stalls from peers.
Simulate the scenario described in the issue, where an internal Zebra node disconnects from its upstream (and sole peer) because it has finished syncing.
List the fix to not disconnect from peers missing blocks when synced.
649ff3f to
f18537a
Compare
| /// The maximum estimated distance to the network chain tip that is considered "at or near tip". | ||
| /// | ||
| /// Allows for normal block-time variance and propagation delay. | ||
| /// Most chain forks are 1–7 blocks long; 1 is used in sync progress tracking. | ||
| pub const AT_OR_NEAR_TIP_THRESHOLD: block::HeightDiff = 2; |
| fn is_at_or_near_network_tip(&self, network: &Network) -> bool { | ||
| match self.estimate_distance_to_network_chain_tip(network) { | ||
| None => false, | ||
| Some((distance, _height)) => distance <= AT_OR_NEAR_TIP_THRESHOLD, |
There was a problem hiding this comment.
Perhaps we could use a value that would work for the getblocktemplate RPC too and use a single constant for both, MAX_ESTIMATED_DISTANCE_TO_NETWORK_CHAIN_TIP is currently 100, I think that's more reasonable.
There was a problem hiding this comment.
Should I reuse MAX_ESTIMATED_DISTANCE_TO_NETWORK_CHAIN_TIP or should just I just set the AT_OR_NEAR_TIP_THRESHOLD to have the same value?
There was a problem hiding this comment.
Do we want to update FIND_RESPONSE_STALL_THRESHOLD to be higher in this PR?
There was a problem hiding this comment.
I don't think we need to change it 🤔
|
We could also add a config flag for disabling the stall tracker altogether (we'll get rid of it anyway if we replace our checkpoints with the hashes of chunks of all of the checkpointed block hashes, done in the experimental sync PR). |
Sure, I opened an issue for that (#10842). We can prioritize it if needed or close it if we end up not needing it anymore. |
Motivation
Closes #10715.
At the shared chain tip,
FindBlocks/FindHeaderscorrectly return empty responses — thereare no new blocks to report. The stall tracker was treating these as misbehaviour,
disconnecting peers that were behaving correctly. This was introduced as a side effect of
the fix for GHSA-h9hm-m2xj-4rq9, which must remain active during catch-up.
Solution
Gate stall tracking on chain-tip state in
PeerSet::route_p2c. When the node is at ornear the network tip,
track_stallsis set tofalseand no stall event is emitted —the peer's counter is neither incremented nor cleared. During catch-up the node is
meaningfully behind, so stall detection fires exactly as before.
Changes:
zebra-chain: addAT_OR_NEAR_TIP_THRESHOLDconstant andis_at_or_near_network_tipprovided method to the
ChainTiptrait.zebra-network: addchain_tip()getter toMinimumPeerVersion, replacing theremoved
chain_tip_height()method; gatetrack_stallsonis_syncinginroute_p2c.Tests
Four unit tests were added:
find_blocks_stall_not_tracked_when_at_tip: Verifies that a peer sending emptyFindBlocksresponses is not disconnected when the node is at the chain tip, confirming the stall tracker is suppressed by the fix.find_blocks_stall_tracked_when_syncing: Verifies that a peer sending only emptyFindBlocksresponses is still disconnected after exceeding the stall threshold during initial sync, confirming the security property from GHSA-h9hm-m2xj-4rq9 is preserved.find_blocks_stall_tracked_when_tip_unknown: Verifies that stall tracking remains active when the chain tip is unknown (empty state), so the node is protected even before it has synced its first block.find_blocks_stall_count_preserved_across_tip_transition: Verifies that stall counts accumulated during sync are not reset when the node transitions to at-tip and back, so a peer cannot avoid detection by sending one useful response when the node nears the tip.A regression test was added that spawns a regtest network with only two nodes. One mines 100 blocks, then the other syncs. When the tip is reached, the test ensures that the nodes are still connected.
Specifications & References
Follow-up Work
We might want a
AI Disclosure
PR Checklist
type(scope): description