Skip to content

fix(rpc): stop read-state syncer commit races under finalized-state lag#10818

Open
arya2 wants to merge 1 commit into
mainfrom
fix-trusted-chain-sync-finalized-churn
Open

fix(rpc): stop read-state syncer commit races under finalized-state lag#10818
arya2 wants to merge 1 commit into
mainfrom
fix-trusted-chain-sync-finalized-churn

Conversation

@arya2

@arya2 arya2 commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Motivation

The co-located read-state syncer (zebra_rpc::sync::TrustedChainSync, used by Zaino's ReadStateService backend and other init_read_state_with_syncer consumers) tears down and re-subscribes to the non_finalized_state_change stream roughly once per second for as long as its secondary finalized state lags the primary, flooding logs and churning subscriptions (#10803).

The root cause is a race, not just backoff cadence. TrustedChainSync shares its secondary RocksDB instance with the update_finalized_chain_tip task. A RocksDB secondary's view only advances when try_catch_up_with_primary is called on it — but both the sync loop and that task call it on the same shared handle. While the secondary lags, update_finalized_chain_tip catches up ~1/sec, so the secondary's view can advance between sync()'s pre-commit check and the commit itself. When that advance finalizes the very block being committed, the non-finalized commit fails with a duplicate-effects error (ValidateContextError::Duplicate{Sprout,Sapling,Orchard}Nullifier { in_finalized_state: true }), the syncer drops the subscription, sleeps COMMIT_RETRY_DELAY, and re-subscribes.

Solution

  • Stop update_finalized_chain_tip as soon as sync() decodes its first parseable block from the non_finalized_state_change stream, via a dedicated watch::Sender<bool> signal. The task checks the flag at the top of its loop, select!s on it while parked on chain_tip_change.message() (so it stops promptly rather than at the next tip change), and re-checks before catching up / publishing.
  • From the first streamed block onward, sync() is the sole caller of try_catch_up_with_primary on the secondary, so its view can no longer advance between a check and a commit. This is also the earlier, cleaner handover point for the published chain tip: sync() owns the (higher) non-finalized tip from then on, so the finalized-tip task yielding earlier can't drag the reported tip backwards.
  • Skip re-inserting an emptied side chain into the chain set during finalization (NonFinalizedState::finalize), which could otherwise leave a zero-length chain behind.
  • use greater of finalized or non-finalized tip height in fill_finalized_gap() to decide the next block height to get

This removes the concurrent-task source of the race. The in-call fill_finalized_gap catch-up during bootstrap is a separate, narrower path; with the sync task now the single writer of the secondary view, a finalized-check placed immediately before commit() is race-free and can close it in a follow-up if needed. (AI got this wrong, it's fixed in this PR)

Testing

  • cargo fmt --all -- --check
  • cargo clippy -p zebra-state -p zebra-rpc --lib -- -D warnings
  • cargo test -p zebra-state --lib non_finalized (32 passed)
  • Syncer behavior is exercised end-to-end by the cached-state read-state integration tests.

Related: #10803 (and the prior frequency-reduction work in #10741, #10776).

AI disclosure

Used Claude Code to analyze the race, draft the implementation, and write this PR description. The maintainer reviewed and edited the change (including the non_finalized_state.rs fix). The contributor is the responsible author.

The co-located read-state syncer (`TrustedChainSync`, used by Zaino's
`ReadStateService` backend) shares its secondary RocksDB instance with the
`update_finalized_chain_tip` task. While the secondary lags the primary, that
task calls `try_catch_up_with_primary` ~1/sec, advancing the secondary's view
concurrently with `sync()`'s commits. A block can become finalized between
`sync()`'s pre-commit check and the commit itself, so the commit fails as a
duplicate-effects error (`Duplicate*Nullifier { in_finalized_state: true }`),
drops the subscription, and re-subscribes — the per-second churn in #10803.

Stop `update_finalized_chain_tip` as soon as `sync()` receives its first
parseable block from the `non_finalized_state_change` stream, via a dedicated
watch signal. From then on `sync()` is the sole caller of catch-up on the
secondary, so its view can't advance between a check and a commit. The task
also yields the published chain tip to `sync()`'s (higher) non-finalized tip
at that point, so the handover no longer drags the reported tip backwards.

Also skip re-inserting an emptied side chain into the chain set during
finalization, which could otherwise leave a zero-length chain behind.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a subscription-churn/log-flood problem (#10803) in the co-located read-state syncer (zebra_rpc::sync::TrustedChainSync). The root cause is a race on the shared secondary RocksDB handle: while the secondary finalized state lags the primary, both the sync() loop and the update_finalized_chain_tip task call try_catch_up_with_primary, so the secondary's view can advance between sync()'s pre-commit check and its commit, finalizing the very block being committed and failing it with a duplicate-effects error. The syncer then drops and re-subscribes ~1/sec. The fix makes sync() the sole catch-up caller from its first parseable block onward by stopping the finalized-tip task via a dedicated watch::Sender<bool> signal, moving the chain-tip handover earlier, and skipping re-insertion of an emptied side chain during finalization.

Changes:

  • Add a started_sync watch channel; sync() flips it on its first parseable streamed block, and update_finalized_chain_tip stops promptly (top-of-loop check, select! on changed() while parked on message(), and a re-check before catch-up/publish).
  • Replace the prior "non-finalized best chain is populated" handover signal with this earlier "first parseable block" signal, so the finalized-tip task can no longer advance the secondary concurrently with sync()'s commits or drag the reported tip backwards.
  • In NonFinalizedState::finalize, guard side-chain re-insertion with !side_chain.is_empty() (mirroring the existing best_chain guard) to avoid leaving a zero-length chain; in fill_finalized_gap, use .max(finalized_tip_height()) instead of .or_else(...) for the highest known height.

Risk scan: The change is concurrency-sensitive (task handover on a shared secondary DB) and touches consensus-adjacent finalization logic in zebra-state. A narrow one-shot window remains between send_replace(true) and the task actually returning (the task could be mid-spawn_try_catch_up_with_primary at line 167), but this is at most a single residual occurrence on startup and is explicitly scoped as follow-up in the PR description, rather than the prior persistent ~1/sec churn. No broken references, serialization changes, or protocol/RPC shape changes were found; the .max change is semantically equivalent-or-more-defensive than the prior .or_else in reachable states.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
zebra-rpc/src/sync.rs Adds started_sync watch channel and uses it to stop update_finalized_chain_tip on the first parseable streamed block, making sync() the sole secondary catch-up caller; switches fill_finalized_gap's highest-height computation to .max(finalized_tip_height()).
zebra-state/src/service/non_finalized_state.rs Skips re-inserting an emptied side chain during finalize, consistent with the existing best_chain guard.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

3 participants