fix(rpc): stop read-state syncer re-subscribing ~1/sec under finalized lag by nuttycom · Pull Request #10805 · ZcashFoundation/zebra

nuttycom · 2026-06-25T01:11:28Z

Motivation

A co-located read-state follower (zebra_rpc::sync::TrustedChainSync, used by Zaino's ReadStateService backend and other init_read_state_with_syncer consumers) re-subscribes to the non_finalized_state_change indexer stream roughly once per second for the entire time its finalized (secondary) state lags the primary. Each teardown makes the node log one INFO line:

INFO zebra_rpc::indexer::methods: client disconnected, dropping non_finalized_state_change task

so a multi-minute catch-up window produces thousands of them.

The cadence is the consumer's commit-retry backoff, not many clients: sync() subscribes → receives one block → try_commit fails (the secondary's finalized state hasn't caught up, so the streamed block has no parent — ValidateContextError::NotReadyToBeCommitted) → drops the subscription and sleeps COMMIT_RETRY_DELAY (1s) → re-subscribes. Re-subscribing buys nothing: it replays the same backlog from the consumer's unchanged chain tips.
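The cycle described above can be modelled in a few lines. This is a minimal sketch, not the zebra-rpc code: `Follower`, `attempts_until_caught_up`, and `subscriptions_with_resubscribe_per_failure` are hypothetical names, the real syncer is async, and the 1s sleep is left out so the example runs instantly.

```rust
/// Hypothetical stand-in for the follower: a commit succeeds only once the
/// secondary's finalized state has caught up, modelled here as a countdown.
struct Follower {
    attempts_until_caught_up: u32,
}

impl Follower {
    /// Models `try_commit`: fails while the streamed block has no parent.
    fn try_commit(&mut self) -> Result<(), &'static str> {
        if self.attempts_until_caught_up == 0 {
            Ok(())
        } else {
            self.attempts_until_caught_up -= 1;
            Err("NotReadyToBeCommitted")
        }
    }
}

/// Old loop shape: every failed commit drops the subscription, backs off,
/// and re-subscribes. Returns how many subscriptions were opened.
fn subscriptions_with_resubscribe_per_failure(follower: &mut Follower) -> u32 {
    let mut subscriptions = 0;
    loop {
        subscriptions += 1; // subscribe and receive one block
        if follower.try_commit().is_ok() {
            return subscriptions;
        }
        // The subscription is dropped here; the real syncer would now sleep
        // COMMIT_RETRY_DELAY (1s) before subscribing again.
    }
}

fn main() {
    // A two-minute lag at one attempt per second.
    let mut follower = Follower { attempts_until_caught_up: 120 };
    let subscriptions = subscriptions_with_resubscribe_per_failure(&mut follower);
    println!("subscriptions opened: {subscriptions}"); // prints 121
}
```

In this model every failed attempt costs one teardown, so the number of dropped subscriptions equals the number of failed commits.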

Closes #10803.

Solution

zebra-rpc/src/sync.rs: on a commit failure, retry the same block in place (keeping the subscription open) up to MAX_IN_PLACE_COMMIT_RETRIES (30) × COMMIT_RETRY_DELAY (1s). try_commit drives try_catch_up_with_primary + fill_finalized_gap on every attempt, so the block becomes committable as the finalized gap closes — without churning the connection. Re-subscribe only as a bounded backstop, kept under the server's 60s non-finalized send timeout so the syncer resets before the server would drop it as a slow consumer. This is not the reorg path: a healthy syncer commits reorg blocks as they arrive on the open subscription, so the retry loop isn't entered.
zebra-rpc/src/indexer/methods.rs: demote the three indexer stream-teardown logs (client disconnected, dropping … task in chain_tip_change, non_finalized_state_change, mempool_change) from info! to debug! — a consumer disconnect is a normal lifecycle event.
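The retry-in-place change in sync.rs might look roughly like the sketch below. The two constants and their values come from this description; `CommitOutcome` and `commit_in_place` are hypothetical names, the loop is synchronous rather than async, and whether the first attempt counts against the retry budget is an assumption (here it does not).

```rust
use std::{thread, time::Duration};

// Values stated in this PR description.
const MAX_IN_PLACE_COMMIT_RETRIES: u32 = 30;
const COMMIT_RETRY_DELAY: Duration = Duration::from_secs(1);

/// Hypothetical outcome type, not taken from zebra-rpc.
#[derive(Debug, PartialEq)]
enum CommitOutcome {
    /// Committed after this many failed attempts; the subscription stays open.
    Committed { failed_attempts: u32 },
    /// Retry budget exhausted; the caller should drop and reopen the stream.
    Resubscribe,
}

/// Tries the same block once, then retries it up to `max_retries` more times,
/// sleeping `delay` between attempts, without touching the subscription.
fn commit_in_place<E>(
    mut try_commit: impl FnMut() -> Result<(), E>,
    max_retries: u32,
    delay: Duration,
) -> CommitOutcome {
    for failed_attempts in 0..=max_retries {
        if try_commit().is_ok() {
            return CommitOutcome::Committed { failed_attempts };
        }
        // No point sleeping after the final failed attempt.
        if failed_attempts < max_retries {
            thread::sleep(delay);
        }
    }
    CommitOutcome::Resubscribe
}

fn main() {
    // Simulated commit that fails three times, then succeeds.
    let mut remaining_failures = 3;
    let outcome = commit_in_place(
        || {
            if remaining_failures == 0 {
                Ok(())
            } else {
                remaining_failures -= 1;
                Err("NotReadyToBeCommitted")
            }
        },
        MAX_IN_PLACE_COMMIT_RETRIES,
        Duration::ZERO, // zero delay so the demo returns immediately
    );
    println!("{outcome:?}"); // prints Committed { failed_attempts: 3 }
    println!(
        "worst-case in-place wait: {:?}",
        COMMIT_RETRY_DELAY * MAX_IN_PLACE_COMMIT_RETRIES
    );
}
```

Returning an explicit `Resubscribe` value keeps the backstop decision in the caller, which is the only place that owns the subscription.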

Net effect: during a finalized-state catch-up window, re-subscriptions drop from ~1/sec to at most ~1/30s (and only when a block is genuinely stuck), and the residual teardown lines no longer appear at default INFO.
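The figures above follow from simple arithmetic, sketched here. The constants mirror the values quoted in this description; `max_resubscriptions` is a hypothetical helper, the five-minute window is an arbitrary illustration, and the bound ignores time spent inside `try_commit` itself.

```rust
const COMMIT_RETRY_DELAY_SECS: u64 = 1;
const MAX_IN_PLACE_COMMIT_RETRIES: u64 = 30;
const SERVER_SEND_TIMEOUT_SECS: u64 = 60;

/// Upper bound on re-subscriptions within `window_secs`, given the minimum
/// number of seconds between two stream resets (must be non-zero).
fn max_resubscriptions(window_secs: u64, secs_between_resets: u64) -> u64 {
    window_secs / secs_between_resets
}

fn main() {
    // Longest time a block is retried in place before the backstop fires.
    let in_place_window = MAX_IN_PLACE_COMMIT_RETRIES * COMMIT_RETRY_DELAY_SECS;
    // The backstop has to fire before the server's slow-consumer timeout.
    assert!(in_place_window < SERVER_SEND_TIMEOUT_SECS);

    let window = 5 * 60; // illustrative five-minute catch-up window
    println!("before: up to {}", max_resubscriptions(window, COMMIT_RETRY_DELAY_SECS)); // 300
    println!("after:  up to {}", max_resubscriptions(window, in_place_window)); // 10
}
```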

Tests

cargo fmt -p zebra-rpc -- --check — clean.
cargo clippy -p zebra-rpc --lib -- -D warnings — clean.
cargo test -p zebra-rpc --lib indexer — passes (indexer decode tests + server spawn).
The happy path is unchanged (first try_commit Ok → break immediately), so the existing zebrad/tests/e2e/trusted_chain.rs tip-change assertions are unaffected.

The failure/retry path is not yet covered by an automated test — there's no mock-indexer harness for TrustedChainSync, and the existing e2e test only exercises the happy path. See Follow-up Work.

Specifications & References

Root-cause issue: read-state syncer re-subscribes ~1/sec under finalized-state lag, flooding INFO logs with 'client disconnected, dropping non_finalized_state_change task' #10803
Prior frequency-reduction work: fix(state): make read-only state DB open safe and back off the read-state syncer #10741 (1s backoff), feat(rpc): resumable non-finalized streaming, finalized-gap bridging, and syncer hardening #10776 (resume-from-tips + backpressure)
Related but distinct: bug: non_finalized_state_change indexer stream livelocks near the tip - initial burst exceeds the 64-slot response buffer #10728 (slow consumer buffer-overflow livelock)

Follow-up Work

A focused integration test for the retry-in-place + backstop path (a mock read-state that fails N commits then succeeds) would be valuable but needs a new harness.
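One possible shape for such a test is sketched below, without any gRPC or state machinery. `MockReadState` and `sync_one_block` are hypothetical and only model the control flow; a real harness would have to drive `TrustedChainSync` against a mock indexer.

```rust
/// Hypothetical mock: fails the first `failures_left` commits, then succeeds.
struct MockReadState {
    failures_left: u32,
    commit_calls: u32,
}

impl MockReadState {
    fn new(failures: u32) -> Self {
        Self { failures_left: failures, commit_calls: 0 }
    }

    fn try_commit(&mut self) -> Result<(), &'static str> {
        self.commit_calls += 1;
        if self.failures_left == 0 {
            return Ok(());
        }
        self.failures_left -= 1;
        Err("NotReadyToBeCommitted")
    }
}

/// Drives the mock until the block commits and returns the number of
/// subscriptions opened along the way.
fn sync_one_block(state: &mut MockReadState, max_in_place_retries: u32) -> u32 {
    let mut subscriptions = 0;
    loop {
        subscriptions += 1;
        // First attempt plus up to `max_in_place_retries` retries, all on
        // the same subscription.
        for _ in 0..=max_in_place_retries {
            if state.try_commit().is_ok() {
                return subscriptions;
            }
        }
        // Backstop: budget exhausted, so drop the stream and re-subscribe.
    }
}

fn main() {
    // Retry-in-place path: two failures fit within one subscription.
    let mut state = MockReadState::new(2);
    assert_eq!(sync_one_block(&mut state, 30), 1);
    assert_eq!(state.commit_calls, 3);

    // Backstop path: 40 failures exceed the budget, forcing one reset.
    let mut stuck = MockReadState::new(40);
    assert_eq!(sync_one_block(&mut stuck, 30), 2);
    assert_eq!(stuck.commit_calls, 41);
    println!("ok");
}
```

Asserting on the subscription count, rather than only on the final commit, is what would separate in-place retry from re-subscribing on every failure.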

AI Disclosure

AI tools were used: Claude Code (Opus 4.8) for investigation, implementation, and drafting this description. The contributor reviewed and is responsible for all changes.

PR Checklist

The PR title follows conventional commits format: type(scope): description
The PR follows the contribution guidelines.
This change was discussed in an issue or with the team beforehand.
The solution is tested.
The documentation and changelogs are up to date.

…d lag

`TrustedChainSync` tore down its non-finalized block subscription and re-subscribed on every commit failure, backing off 1s (`COMMIT_RETRY_DELAY`). While the secondary's finalized state lags the primary the streamed block can't attach, so this repeated once per second for the whole catch-up window, and each teardown made the server log `client disconnected, dropping non_finalized_state_change task` at INFO: thousands of lines.

Re-subscribing buys nothing here: it replays the same backlog from our unchanged chain tips. Retry the same block in place instead, keeping the subscription open; `try_commit` advances the secondary's finalized state on each attempt, so the block becomes committable as the gap closes.

Re-subscribe only as a bounded backstop (`MAX_IN_PLACE_COMMIT_RETRIES`) so a genuinely stuck block (e.g. a primary reorg our forward-only stream can't observe) still resets the stream.

Also demote the three indexer stream-teardown logs (`client disconnected, dropping … task`) from info to debug: a consumer going away is a normal lifecycle event.

Refs ZcashFoundation#10803.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Stands up a fully-custom mock indexer gRPC server and a genesis-only follower db, then drives the "finalized state behind the primary" scenario: the streamed block's parent (the gap block) is fetched via `get_block`, which fails twice before succeeding.

Asserts the syncer subscribes exactly once (retrying the streamed block in place rather than re-subscribing per failure) and commits it once the gap becomes fillable. The single-subscription assertion is what distinguishes the new in-place retry from the old re-subscribe-per-failure behavior.

The early mainnet block vectors carry V1/V2 transactions that the non-finalized state rejects, so the test re-emits each block's coinbase as V4 and re-links the chain, mirroring zebra-state's own continuous-block test helpers.

Also reword the two `fill_finalized_gap` log messages ("will retry" rather than "on the next subscription") to match the in-place retry behavior.

Refs ZcashFoundation#10803.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

arya2 · 2026-06-26T03:34:29Z

superseded by #10818

nuttycom and others added 2 commits June 24, 2026 19:06

arya2 reviewed Jun 26, 2026

Comment thread zebra-rpc/src/sync.rs

arya2 closed this Jun 26, 2026

nuttycom wants to merge 2 commits into ZcashFoundation:main from nuttycom:fix/readstate-syncer-resubscribe-churn
