fix(rpc): stop read-state syncer re-subscribing ~1/sec under finalized lag#10805
Closed
nuttycom wants to merge 2 commits into
Closed
fix(rpc): stop read-state syncer re-subscribing ~1/sec under finalized lag#10805nuttycom wants to merge 2 commits into
nuttycom wants to merge 2 commits into
Conversation
…d lag `TrustedChainSync` tore down its non-finalized block subscription and re-subscribed on every commit failure, backing off 1s (`COMMIT_RETRY_DELAY`). While the secondary's finalized state lags the primary the streamed block can't attach, so this repeated once per second for the whole catch-up window, and each teardown made the server log `client disconnected, dropping non_finalized_state_change task` at INFO — thousands of lines. Re-subscribing buys nothing here: it replays the same backlog from our unchanged chain tips. Retry the same block in place instead, keeping the subscription open; `try_commit` advances the secondary's finalized state on each attempt, so the block becomes committable as the gap closes. Re-subscribe only as a bounded backstop (`MAX_IN_PLACE_COMMIT_RETRIES`) so a genuinely stuck block (e.g. a primary reorg our forward-only stream can't observe) still resets the stream. Also demote the three indexer stream-teardown logs (`client disconnected, dropping … task`) from info to debug: a consumer going away is a normal lifecycle event. Refs ZcashFoundation#10803. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Stands up a fully-custom mock indexer gRPC server and a genesis-only follower
db, then drives the "finalized state behind the primary" scenario: the streamed
block's parent (the gap block) is fetched via `get_block`, which fails twice
before succeeding.
Asserts the syncer subscribes exactly once — retrying the streamed block in
place rather than re-subscribing per failure — and commits it once the gap
becomes fillable. The single-subscription assertion is what distinguishes the
new in-place retry from the old re-subscribe-per-failure behavior.
The early mainnet block vectors carry V1/V2 transactions that the non-finalized
state rejects, so the test re-emits each block's coinbase as V4 and re-links the
chain, mirroring zebra-state's own continuous-block test helpers.
Also reword the two `fill_finalized_gap` log messages ("will retry" rather than
"on the next subscription") to match the in-place retry behavior.
Refs ZcashFoundation#10803.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
arya2
reviewed
Jun 26, 2026
Contributor
|
superseded by #10818 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
A co-located read-state follower (
zebra_rpc::sync::TrustedChainSync, used by Zaino'sReadStateServicebackend and otherinit_read_state_with_syncerconsumers) re-subscribes to thenon_finalized_state_changeindexer stream roughly once per second for the entire time its finalized (secondary) state lags the primary. Each teardown makes the node log oneINFOline:so a multi-minute catch-up window produces thousands of them.
The cadence is the consumer's commit-retry backoff, not many clients:
sync()subscribes → receives one block →try_commitfails (the secondary's finalized state hasn't caught up, so the streamed block has no parent —ValidateContextError::NotReadyToBeCommitted) → drops the subscription and sleepsCOMMIT_RETRY_DELAY(1s) → re-subscribes. Re-subscribing buys nothing: it replays the same backlog from the consumer's unchanged chain tips.Closes #10803.
Solution
zebra-rpc/src/sync.rs: on a commit failure, retry the same block in place (keeping the subscription open) up toMAX_IN_PLACE_COMMIT_RETRIES(30) ×COMMIT_RETRY_DELAY(1s).try_commitdrivestry_catch_up_with_primary+fill_finalized_gapon every attempt, so the block becomes committable as the finalized gap closes — without churning the connection. Re-subscribe only as a bounded backstop, kept under the server's 60s non-finalized send timeout so the syncer resets before the server would drop it as a slow consumer. This is not the reorg path: a healthy syncer commits reorg blocks as they arrive on the open subscription, so the retry loop isn't entered.zebra-rpc/src/indexer/methods.rs: demote the three indexer stream-teardown logs (client disconnected, dropping … taskinchain_tip_change,non_finalized_state_change,mempool_change) frominfo!todebug!— a consumer disconnect is a normal lifecycle event.Net effect: during a finalized-state catch-up window, re-subscriptions drop from ~1/sec to at most ~1/30s (and only when a block is genuinely stuck), and the residual teardown lines no longer appear at default
INFO.Tests
cargo fmt -p zebra-rpc -- --check— clean.cargo clippy -p zebra-rpc --lib -- -D warnings— clean.cargo test -p zebra-rpc --lib indexer— passes (indexer decode tests + server spawn).try_commitOk→breakimmediately), so the existingzebrad/tests/e2e/trusted_chain.rstip-change assertions are unaffected.The failure/retry path is not yet covered by an automated test — there's no mock-indexer harness for
TrustedChainSync, and the existing e2e test only exercises the happy path. See Follow-up Work.Specifications & References
slow consumerbuffer-overflow livelock)Follow-up Work
AI Disclosure
PR Checklist
type(scope): description