fix(network): avoid disconnecting peers that return empty find responses at the chain tip#10732

Open
jvff wants to merge 10 commits into main from jvff/ignore-stall-events-if-synced

@jvff jvff commented Jun 17, 2026

Contributor

Motivation

Closes #10715.

At the shared chain tip, FindBlocks/FindHeaders correctly return empty responses — there
are no new blocks to report. The stall tracker was treating these as misbehaviour,
disconnecting peers that were behaving correctly. This was introduced as a side effect of
the fix for GHSA-h9hm-m2xj-4rq9, which must remain active during catch-up.

Solution

Gate stall tracking on chain-tip state in PeerSet::route_p2c. When the node is at or
near the network tip, track_stalls is set to false and no stall event is emitted —
the peer's counter is neither incremented nor cleared. During catch-up the node is
meaningfully behind, so stall detection fires exactly as before.
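A minimal sketch of the gating described above, not the code in this PR. The `StallEvent` enum and the `stall_event` function are hypothetical names used only for illustration; the real decision lives inside `PeerSet::route_p2c`.

```rust
/// Hypothetical event a router could emit for the stall tracker.
#[derive(Debug, PartialEq, Eq)]
enum StallEvent {
    /// The peer returned an empty find response: increment its counter.
    EmptyResponse,
    /// The peer returned useful data: clear its counter.
    UsefulResponse,
}

/// Decides which stall event, if any, to emit for a find response.
///
/// Assumption: `at_or_near_tip` is computed elsewhere from the chain tip.
fn stall_event(at_or_near_tip: bool, response_is_empty: bool) -> Option<StallEvent> {
    // Stall tracking is only active while the node is catching up.
    let track_stalls = !at_or_near_tip;

    if !track_stalls {
        // No event at all, so the counter is neither incremented nor cleared.
        return None;
    }

    Some(if response_is_empty {
        StallEvent::EmptyResponse
    } else {
        StallEvent::UsefulResponse
    })
}
```

Returning `None` rather than a "clear" event at the tip is the point of the design: a peer cannot erase stalls it accumulated during catch-up just because the node is now synced.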

Changes:

  • zebra-chain: add AT_OR_NEAR_TIP_THRESHOLD constant and is_at_or_near_network_tip
    provided method to the ChainTip trait.
  • zebra-network: add chain_tip() getter to MinimumPeerVersion, replacing the
    removed chain_tip_height() method; gate track_stalls on is_syncing in
    route_p2c.
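The provided-method pattern mentioned in the first item can be sketched as follows. This is a simplified stand-in, not the `ChainTip` trait itself: `TipEstimate`, `estimated_distance_to_network_tip`, and `FixedTip` are hypothetical, a plain `i64` stands in for `block::HeightDiff`, and the `network` parameter is omitted.

```rust
/// Stand-in for `block::HeightDiff` (assumption: a signed block count).
type HeightDiff = i64;

/// Mirrors the threshold value quoted in the review of this PR.
const AT_OR_NEAR_TIP_THRESHOLD: HeightDiff = 2;

/// Hypothetical, simplified version of a chain tip trait.
trait TipEstimate {
    /// Required method: estimated blocks left until the network tip,
    /// or `None` when no estimate is available (for example, empty state).
    fn estimated_distance_to_network_tip(&self) -> Option<HeightDiff>;

    /// Provided method: implementors get this for free.
    /// An unknown tip is treated as "not near the tip", so stall
    /// tracking stays active before the first block is synced.
    fn is_at_or_near_network_tip(&self) -> bool {
        match self.estimated_distance_to_network_tip() {
            None => false,
            Some(distance) => distance <= AT_OR_NEAR_TIP_THRESHOLD,
        }
    }
}

/// Hypothetical test double that reports a fixed distance.
struct FixedTip(Option<HeightDiff>);

impl TipEstimate for FixedTip {
    fn estimated_distance_to_network_tip(&self) -> Option<HeightDiff> {
        self.0
    }
}
```

With this shape, `FixedTip(Some(2))` counts as near the tip, while `FixedTip(Some(3))` and `FixedTip(None)` do not.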

Tests

Four unit tests were added:

  • find_blocks_stall_not_tracked_when_at_tip: Verifies that a peer sending empty FindBlocks responses is not disconnected when the node is at the chain tip, confirming the stall tracker is suppressed by the fix.
  • find_blocks_stall_tracked_when_syncing: Verifies that a peer sending only empty FindBlocks responses is still disconnected after exceeding the stall threshold during initial sync, confirming the security property from GHSA-h9hm-m2xj-4rq9 is preserved.
  • find_blocks_stall_tracked_when_tip_unknown: Verifies that stall tracking remains active when the chain tip is unknown (empty state), so the node is protected even before it has synced its first block.
  • find_blocks_stall_count_preserved_across_tip_transition: Verifies that stall counts accumulated during sync are not reset when the node transitions to at-tip and back, so a peer cannot avoid detection by sending one useful response when the node nears the tip.
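The property checked by the last test can be illustrated with a small model of a per-peer counter. This is a sketch under stated assumptions: `PeerStalls`, `record`, and `STALL_LIMIT` are hypothetical names, and the limit of 3 is an arbitrary illustrative value, not the value of `FIND_RESPONSE_STALL_THRESHOLD`.

```rust
/// Hypothetical disconnect limit, chosen only for this illustration.
const STALL_LIMIT: u32 = 3;

/// Hypothetical per-peer stall state.
#[derive(Debug, Default)]
struct PeerStalls {
    consecutive_empty: u32,
}

impl PeerStalls {
    /// Records one find response and returns `true` when the peer
    /// should be disconnected.
    fn record(&mut self, at_or_near_tip: bool, response_is_empty: bool) -> bool {
        if at_or_near_tip {
            // Tracking is suppressed: the count is left untouched.
            return false;
        }

        if response_is_empty {
            self.consecutive_empty += 1;
        } else {
            self.consecutive_empty = 0;
        }

        self.consecutive_empty >= STALL_LIMIT
    }
}
```

In this model, two empty responses during sync followed by any number of responses at the tip leave the count at 2, so one more empty response after the node falls behind again reaches the limit.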

A regression test was added that spawns a regtest network with only two nodes. One mines 100 blocks, then the other syncs. When the tip is reached, the test ensures that the nodes are still connected.

Specifications & References

Follow-up Work

We might want a

AI Disclosure

  • No AI tools were used in this PR
  • AI tools were used: Claude Sonnet for discussing the solution and coding it

PR Checklist

  • The PR title follows conventional commits format: type(scope): description
  • The PR follows the contribution guidelines.
  • This change was discussed in an issue or with the team beforehand.
  • The solution is tested.
  • The documentation and changelogs are up to date.

Copilot AI left a comment

Contributor

Pull request overview

This PR adjusts Zebra’s peer stall-tracking so that peers are no longer disconnected for empty FindBlocks/FindHeaders responses when Zebra is already at (or near) the network chain tip, while preserving the existing stall-disconnect behavior during catch-up syncing.

Changes:

  • zebra-chain: adds an “at/near tip” threshold constant and a provided ChainTip::is_at_or_near_network_tip() helper.
  • zebra-network: exposes MinimumPeerVersion::chain_tip() and uses it to gate stall tracking in PeerSet::route_p2c; updates handshake start_height derivation accordingly.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| zebra-network/src/peer/minimum_peer_version.rs | Replaces chain_tip_height() with chain_tip() accessor for callers needing richer tip context. |
| zebra-network/src/peer/handshake.rs | Computes start_height from chain_tip().best_tip_height() (defaulting to height 0). |
| zebra-network/src/peer_set/set.rs | Disables find-response stall event emission when at/near tip; keeps it enabled while syncing. |
| zebra-chain/src/chain_tip.rs | Introduces AT_OR_NEAR_TIP_THRESHOLD and is_at_or_near_network_tip() to centralize the tip proximity decision. |

Comment thread zebra-chain/src/chain_tip.rs
Comment thread zebra-network/src/peer_set/set.rs Outdated
@jvff jvff force-pushed the jvff/ignore-stall-events-if-synced branch from ad77fdf to 33914be June 19, 2026 19:14
@jvff jvff self-assigned this Jun 19, 2026
@jvff jvff added the C-enhancement (Category: This is an improvement) and A-network (Area: Network protocol updates or fixes) labels Jun 19, 2026
@jvff jvff force-pushed the jvff/ignore-stall-events-if-synced branch from 33914be to 779e11f June 19, 2026 19:41
@jvff jvff marked this pull request as ready for review June 22, 2026 14:13
aphelionz added a commit to ShieldedLabs/zero that referenced this pull request Jun 25, 2026
…king on chain-tip state

Carry of ZcashFoundation/zebra#10732 (open, closes their #10715) to fix the
stall-tracker false-positive that disconnects a single-upstream internal node
at the tip (Luxor). Drop on the next subtree pull after #10732 merges upstream.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@jvff jvff requested a review from arya2 June 25, 2026 20:18
jvff added 3 commits June 25, 2026 21:15
  • In some places the `MinimumPeerVersion` becomes the way to access the `ChainTip`, so we should make that explicit instead of adding unrelated methods to `MinimumPeerVersion`.
  • Refactor to keep `MinimumPeerVersion` more focused.
  • A helper method to check if the node can be considered to be synced.
Copilot AI review requested due to automatic review settings June 25, 2026 21:16
@jvff jvff force-pushed the jvff/ignore-stall-events-if-synced branch from 779e11f to e48127e June 25, 2026 21:16

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Comment thread zebra-network/src/peer_set/set.rs Outdated
@jvff jvff force-pushed the jvff/ignore-stall-events-if-synced branch from e48127e to 649ff3f June 25, 2026 21:35
jvff added 7 commits June 25, 2026 21:39
  • Empty find block or find headers responses are expected once nodes are synced.
  • Ensure that the stall tracker only disconnects peers while the node is syncing.
  • Empty find blocks or find headers responses can slow down synchronization, so the peers should be penalized with a disconnection.
  • Ensure that the initial behavior is to disconnect from peers that return empty find blocks or headers responses.
  • Ensure that if the node falls behind, it will still track stalls from peers.
  • Simulate the scenario described in the issue, where an internal Zebra node disconnects from its upstream (and sole peer) because it has finished syncing.
  • List the fix to not disconnect from peers missing blocks when synced.
Copilot AI review requested due to automatic review settings June 25, 2026 21:39
@jvff jvff force-pushed the jvff/ignore-stall-events-if-synced branch from 649ff3f to f18537a June 25, 2026 21:39

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Comment on lines +18 to +22
/// The maximum estimated distance to the network chain tip that is considered "at or near tip".
///
/// Allows for normal block-time variance and propagation delay.
/// Most chain forks are 1–7 blocks long; 1 is used in sync progress tracking.
pub const AT_OR_NEAR_TIP_THRESHOLD: block::HeightDiff = 2;

@arya2 arya2 left a comment

Contributor

Looks great! Thank you.

Comment on lines +133 to +136
fn is_at_or_near_network_tip(&self, network: &Network) -> bool {
    match self.estimate_distance_to_network_chain_tip(network) {
        None => false,
        Some((distance, _height)) => distance <= AT_OR_NEAR_TIP_THRESHOLD,

Contributor

Perhaps we could use a value that would work for the getblocktemplate RPC too, and use a single constant for both. MAX_ESTIMATED_DISTANCE_TO_NETWORK_CHAIN_TIP is currently 100, which I think is more reasonable.

Contributor Author

Should I reuse MAX_ESTIMATED_DISTANCE_TO_NETWORK_CHAIN_TIP, or should I just set AT_OR_NEAR_TIP_THRESHOLD to the same value?

Contributor

Do we want to update FIND_RESPONSE_STALL_THRESHOLD to be higher in this PR?

Contributor Author

I don't think we need to change it 🤔

@arya2

arya2 commented Jun 26, 2026

Contributor

We could also add a config flag for disabling the stall tracker altogether (we'll get rid of it anyway if we replace our checkpoints with the hashes of chunks of all of the checkpointed block hashes, done in the experimental sync PR).

@jvff

jvff commented Jun 26, 2026

Contributor Author

> We could also add a config flag for disabling the stall tracker altogether (we'll get rid of it anyway if we replace our checkpoints with the hashes of chunks of all of the checkpointed block hashes, done in the experimental sync PR).

Sure, I opened an issue for that (#10842). We can prioritize it if needed or close it if we end up not needing it anymore.

Labels

A-network Area: Network protocol updates or fixes
C-enhancement Category: This is an improvement

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Stall tracker false-positive: an internal Zebra (or zcashd) relaying through a single upstream Zebra disconnects it and self-isolates

4 participants