fix(rpc): stop read-state syncer commit races under finalized-state lag #10818
arya2 wants to merge 1 commit
The co-located read-state syncer (`TrustedChainSync`, used by Zaino's
`ReadStateService` backend) shares its secondary RocksDB instance with the
`update_finalized_chain_tip` task. While the secondary lags the primary, that
task calls `try_catch_up_with_primary` ~1/sec, advancing the secondary's view
concurrently with `sync()`'s commits. A block can become finalized between
`sync()`'s pre-commit check and the commit itself, so the commit fails with a
duplicate-effects error (`Duplicate*Nullifier { in_finalized_state: true }`),
and the syncer drops the subscription and re-subscribes. This is the
per-second churn in #10803.

Stop `update_finalized_chain_tip` as soon as `sync()` receives its first
parseable block from the `non_finalized_state_change` stream, via a dedicated
watch signal. From then on `sync()` is the sole caller of catch-up on the
secondary, so its view can't advance between a check and a commit. The task
also yields the published chain tip to `sync()`'s (higher) non-finalized tip
at that point, so the handover no longer drags the reported tip backwards.

Also skip re-inserting an emptied side chain into the chain set during
finalization, which could otherwise leave a zero-length chain behind.
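
The check-then-commit race described above can be sketched with a toy model. This is only an illustration under stated assumptions: `SecondaryView`, `CommitError`, and `check_then_commit` are hypothetical names, heights stand in for blocks, and the `interleaved_catch_up` flag stands in for the other task calling catch-up at the worst possible moment. It is not Zebra's actual code.

```rust
/// Hypothetical error, modelling `Duplicate*Nullifier { in_finalized_state: true }`.
#[derive(Debug, PartialEq)]
enum CommitError {
    AlreadyFinalized { height: u32 },
}

/// Hypothetical stand-in for the secondary's view of the finalized state.
struct SecondaryView {
    /// What the primary has already finalized.
    primary_finalized_tip: u32,
    /// What this secondary currently sees.
    finalized_tip: u32,
}

impl SecondaryView {
    /// Models a catch-up call: the secondary's view only advances here.
    fn catch_up(&mut self) {
        self.finalized_tip = self.primary_finalized_tip;
    }

    /// Pre-commit check: is the block still above the finalized tip?
    fn is_non_finalized(&self, height: u32) -> bool {
        height > self.finalized_tip
    }

    /// The commit validates against whatever the view is *now*.
    fn commit(&self, height: u32) -> Result<(), CommitError> {
        if self.is_non_finalized(height) {
            Ok(())
        } else {
            Err(CommitError::AlreadyFinalized { height })
        }
    }
}

/// Returns `None` if the pre-commit check skips the block, otherwise the
/// commit result. `interleaved_catch_up` models a second caller advancing
/// the shared view between the check and the commit.
fn check_then_commit(
    view: &mut SecondaryView,
    height: u32,
    interleaved_catch_up: bool,
) -> Option<Result<(), CommitError>> {
    if !view.is_non_finalized(height) {
        return None;
    }
    if interleaved_catch_up {
        view.catch_up();
    }
    Some(view.commit(height))
}
```

With a view at finalized height 100 and a primary at 101, committing height 101 returns `Some(Err(AlreadyFinalized { height: 101 }))` when a catch-up is interleaved and `Some(Ok(()))` when it is not, which is the difference between two catch-up callers and one.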
Pull request overview
This PR fixes a subscription-churn/log-flood problem (#10803) in the co-located read-state syncer (`zebra_rpc::sync::TrustedChainSync`). The root cause is a race on the shared secondary RocksDB handle: while the secondary finalized state lags the primary, both the `sync()` loop and the `update_finalized_chain_tip` task call `try_catch_up_with_primary`, so the secondary's view can advance between `sync()`'s pre-commit check and its commit, finalizing the very block being committed and failing it with a duplicate-effects error. The syncer then drops and re-subscribes ~1/sec. The fix makes `sync()` the sole catch-up caller from its first parseable block onward by stopping the finalized-tip task via a dedicated `watch::Sender<bool>` signal, moving the chain-tip handover earlier, and skipping re-insertion of an emptied side chain during finalization.
Changes:
- Add a `started_sync` watch channel; `sync()` flips it on its first parseable streamed block, and `update_finalized_chain_tip` stops promptly (top-of-loop check, `select!` on `changed()` while parked on `message()`, and a re-check before catch-up/publish).
- Replace the prior "non-finalized best chain is populated" handover signal with this earlier "first parseable block" signal, so the finalized-tip task can no longer advance the secondary concurrently with `sync()`'s commits or drag the reported tip backwards.
- In `NonFinalizedState::finalize`, guard side-chain re-insertion with `!side_chain.is_empty()` (mirroring the existing `best_chain` guard) to avoid leaving a zero-length chain; in `fill_finalized_gap`, use `.max(finalized_tip_height())` instead of `.or_else(...)` for the highest known height.
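
The three stop points in the first item (top-of-loop check, waking while parked, re-check before publishing) can be sketched without an async runtime. This is a rough std-only analogue, not the PR's code: an `AtomicBool` plus an `mpsc` channel stand in for the `watch` channel and `select!`, and `Wake`, `finalized_tip_task`, and `signal_started_sync` are hypothetical names.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{Receiver, Sender};
use std::sync::Arc;

/// Hypothetical wake-up reasons. In the PR these are two separate sources
/// combined with `select!`; one channel is used here to keep the sketch
/// runtime-free.
#[derive(Debug)]
enum Wake {
    TipChanged(u32),
    StartedSync,
}

/// Toy finalized-tip task. Returns the heights it published before stopping.
fn finalized_tip_task(started_sync: Arc<AtomicBool>, wakes: Receiver<Wake>) -> Vec<u32> {
    let mut published = Vec::new();
    loop {
        // 1. Top-of-loop check: never start another round after the handover.
        if started_sync.load(Ordering::SeqCst) {
            break;
        }
        // 2. Parked here; wakes on a tip change *or* the stop signal, so the
        //    task does not have to wait for the next tip change to stop.
        let height = match wakes.recv() {
            Ok(Wake::TipChanged(height)) => height,
            Ok(Wake::StartedSync) | Err(_) => break,
        };
        // 3. Re-check right before the catch-up/publish step.
        if started_sync.load(Ordering::SeqCst) {
            break;
        }
        published.push(height); // stands in for catch-up + publishing the tip
    }
    published
}

/// Toy version of `sync()` flipping the signal on its first parseable block.
fn signal_started_sync(started_sync: &AtomicBool, wakes: &Sender<Wake>) {
    started_sync.store(true, Ordering::SeqCst);
    // Ignore the error: the task may already have exited.
    let _ = wakes.send(Wake::StartedSync);
}
```

In this model, tip changes queued before the signal are published, and anything queued after it is never processed. If the flag is already set when the task runs, it publishes nothing.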
Risk scan: The change is concurrency-sensitive (task handover on a shared secondary DB) and touches consensus-adjacent finalization logic in `zebra-state`. A narrow one-shot window remains between `send_replace(true)` and the task actually returning (the task could be mid-`spawn_try_catch_up_with_primary` at line 167), but this is at most a single residual occurrence on startup and is explicitly scoped as follow-up in the PR description, rather than the prior persistent ~1/sec churn. No broken references, serialization changes, or protocol/RPC shape changes were found; the `.max` change is semantically equivalent-or-more-defensive than the prior `.or_else` in reachable states.
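
The "equivalent-or-more-defensive" claim about `.max` versus `.or_else` follows from how Rust orders `Option` values: `None` sorts below every `Some`. The sketch below assumes both heights are `Option`s and uses `u32` in place of Zebra's height type; the function names are hypothetical.

```rust
/// Assumed old shape: prefer the non-finalized tip, fall back to the
/// finalized tip only when there is no non-finalized tip at all.
fn highest_known_or_else(non_finalized_tip: Option<u32>, finalized_tip: Option<u32>) -> Option<u32> {
    non_finalized_tip.or_else(|| finalized_tip)
}

/// Assumed new shape: take whichever height is greater. `Option<T: Ord>`
/// orders `None` below every `Some(_)`, so a missing value never wins.
fn highest_known_max(non_finalized_tip: Option<u32>, finalized_tip: Option<u32>) -> Option<u32> {
    non_finalized_tip.max(finalized_tip)
}
```

The two agree whenever the non-finalized tip is missing or is at least as high as the finalized tip. They differ only when the non-finalized tip is present but lower: for `(Some(5), Some(7))`, `.or_else` yields `Some(5)` while `.max` yields `Some(7)`.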
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `zebra-rpc/src/sync.rs` | Adds `started_sync` watch channel and uses it to stop `update_finalized_chain_tip` on the first parseable streamed block, making `sync()` the sole secondary catch-up caller; switches `fill_finalized_gap`'s highest-height computation to `.max(finalized_tip_height())`. |
| `zebra-state/src/service/non_finalized_state.rs` | Skips re-inserting an emptied side chain during `finalize`, consistent with the existing `best_chain` guard. |
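
The side-chain guard in the second row can be illustrated with a toy chain set. This sketch makes several assumptions that are not taken from Zebra's code: chains are queues of block ids with the root first, the best chain is stored first, and `Chain` and `finalize_root` are hypothetical names.

```rust
use std::collections::VecDeque;

/// Toy chain: block ids, root first.
type Chain = VecDeque<u32>;

/// Finalizes the root of the best chain (assumed to be `chains[0]`) and
/// returns it. Chains that end up empty are not put back into the set.
fn finalize_root(chains: &mut Vec<Chain>) -> Option<u32> {
    let mut old = std::mem::take(chains).into_iter();
    let mut best = old.next()?;
    let root = best.pop_front()?;

    // Existing guard: only keep the best chain if it still has blocks.
    if !best.is_empty() {
        chains.push(best);
    }

    for mut side in old {
        // Side chains that do not start at the finalized block are dropped.
        if side.front() != Some(&root) {
            continue;
        }
        side.pop_front();
        // The guard described above: skip a side chain that is now empty,
        // so no zero-length chain is left in the set.
        if !side.is_empty() {
            chains.push(side);
        }
    }
    Some(root)
}
```

Starting from `[[1, 2, 3], [1], [1, 4], [9, 10]]`, finalizing root `1` leaves `[[2, 3], [4]]`: the side chain `[1]` empties out and is skipped, and `[9, 10]` is dropped because it forks elsewhere.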
Motivation
The co-located read-state syncer (`zebra_rpc::sync::TrustedChainSync`, used by Zaino's `ReadStateService` backend and other `init_read_state_with_syncer` consumers) tears down and re-subscribes to the `non_finalized_state_change` stream roughly once per second for as long as its secondary finalized state lags the primary, flooding logs and churning subscriptions (#10803).

The root cause is a race, not just backoff cadence. `TrustedChainSync` shares its secondary RocksDB instance with the `update_finalized_chain_tip` task. A RocksDB secondary's view only advances when `try_catch_up_with_primary` is called on it, but both the sync loop and that task call it on the same shared handle. While the secondary lags, `update_finalized_chain_tip` catches up ~1/sec, so the secondary's view can advance between `sync()`'s pre-commit check and the commit itself. When that advance finalizes the very block being committed, the non-finalized commit fails with a duplicate-effects error (`ValidateContextError::Duplicate{Sprout,Sapling,Orchard}Nullifier { in_finalized_state: true }`), the syncer drops the subscription, sleeps `COMMIT_RETRY_DELAY`, and re-subscribes.

Solution
- Stop `update_finalized_chain_tip` as soon as `sync()` decodes its first parseable block from the `non_finalized_state_change` stream, via a dedicated `watch::Sender<bool>` signal. The task checks the flag at the top of its loop, `select!`s on it while parked on `chain_tip_change.message()` (so it stops promptly rather than at the next tip change), and re-checks before catching up / publishing.
- From then on, `sync()` is the sole caller of `try_catch_up_with_primary` on the secondary, so its view can no longer advance between a check and a commit. This is also the earlier, cleaner handover point for the published chain tip: `sync()` owns the (higher) non-finalized tip from then on, so the finalized-tip task yielding earlier can't drag the reported tip backwards.
- Skip re-inserting an emptied side chain into the chain set during finalization (`NonFinalizedState::finalize`), which could otherwise leave a zero-length chain behind.

This removes the concurrent-task source of the race. The in-call (AI got this wrong, it's fixed in this PR) `fill_finalized_gap` catch-up during bootstrap is a separate, narrower path; with the sync task now the single writer of the secondary view, a finalized-check placed immediately before `commit()` is race-free and can close it in a follow-up if needed.

Testing
- `cargo fmt --all -- --check`
- `cargo clippy -p zebra-state -p zebra-rpc --lib -- -D warnings`
- `cargo test -p zebra-state --lib non_finalized` (32 passed)

Related: #10803 (and the prior frequency-reduction work in #10741, #10776).
AI disclosure
Used Claude Code to analyze the race, draft the implementation, and write this PR description. The maintainer reviewed and edited the change (including the `non_finalized_state.rs` fix). The contributor is the responsible author.