fix: stabilize flaky Zakura network and VCT fast-sync tests#293
Merged
Conversation
…fast sync
`cache_genesis_roots::quick_check` runs on the background format-validity
thread (`DbFormatChange::spawn_format_change`) concurrently with block
commits. It reads its `is_vct_synced()` guard once, then reads the genesis
note-commitment trees via readers (`sapling_tree_by_height` etc.) that return
`None` for any height in a fast-synced database's `[U, H)` absent band by
re-reading the same `vct_synced_below` marker. A fast-sync commit can set that
marker in the window between the two reads, so the function passes the
`is_vct_synced()` guard (marker unset) and then a genesis read returns `None`
(marker now set) -- and the old `.expect("just checked for genesis block")`
panicked. Under the test suite's parallelism this surfaced as a flaky
`vct_mode_switches_continue_from_safe_boundaries` failure (reproduced 2/10).
Treat an absent genesis tree as a (mid-flight) fast-synced database, where the
genesis-root-caching invariant does not apply, and return `Ok` instead of
panicking. A `None` here can only mean the absent-band guard fired; a genuinely
missing genesis tree on a non-fast database makes the readers panic with their
own "must exist" message rather than return `None`, so no real corruption is
masked.
The Zakura networking tests (`zakura::legacy_gossip::tests::*`, `zakura::testkit::cluster::tests::*`, `peer::handshake::tests::mutual_p2p_*`, the testkit gossip mesh) dial real iroh/QUIC endpoints and register peers within a bounded deadline. Run at the suite's `num-cpus` parallelism they oversubscribe the CPU and starve those handshakes, so the hardcoded 5s connect and `await_until` convergence deadlines are exceeded and the upgrade handshake takes a different path. In the PR lane (`--profile all-tests`, `fail-fast` and no retries) a single such timeout sinks the whole run. Reproduced at 2x thread oversubscription: 10/10 iterations had >= 1 failure. Two complementary changes: - Introduce a single generous `TEST_NET_TIMEOUT = 30s` constant in the zebra-network testkit and replace the scattered 5s/15s `connect_native`, `connect_full_mesh`, `await_all_connected`, `await_until`, and convergence `timeout` deadlines with it. A slow connect under load is not a correctness signal, and these helpers return as soon as the awaited condition holds, so a generous deadline never masks a genuine hang. - Add a nextest `zakura-network` test group (`max-threads = 4`) and a default-profile override binding the real-connect modules to it, so they do not starve each other. The override only assigns a group; it excludes no tests, so coverage is unchanged, and living on the `default` profile it applies to both `all-tests` and `full-tests`. With both changes the same 2x-oversubscription reproduction is 0/5, and a full `zebra-network`/`zebra-state` run under `--profile all-tests` passes (1007/1007).
ad4d412 to
42c75fe
Compare
p0mvn
approved these changes
Jun 27, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
The PR lane (
cargo nextest run --profile all-tests) runs withfail-fast = trueand no retries, so a single flaky failure sinks the whole run and forces a manual re-trigger. Two distinct root causes were producing intermittent failures under the suite'snum-cpusparallelism:zakura::legacy_gossip::tests::*,zakura::testkit::cluster::tests::*,peer::handshake::tests::mutual_p2p_*, the testkit gossip mesh) dial real iroh/QUIC endpoints with hardcoded 5s/15s connect andawait_untilconvergence deadlines. Under CPU contention those deadlines are exceeded (and the legacy→Zakura upgrade handshake takes a different path), so the tests time out or assert the wrong error.vct_mode_switches_continue_from_safe_boundariespanicked withjust checked for genesis blockdue to a TOCTOU in thecache_genesis_rootsformat-validity check racing an in-progress verified-commitment-trees fast sync.Solution
test(network)): introduce a single generousTEST_NET_TIMEOUT = 30sconstant in the zebra-network testkit and replace the scattered 5s/15sconnect_native/connect_full_mesh/await_all_connected/await_until/ convergencetimeoutdeadlines with it. A slow connect under load is not a correctness signal, and these helpers return as soon as the awaited condition holds, so a generous deadline never masks a genuine hang.test(network)): add[test-groups] zakura-network = { max-threads = 4 }and adefault-profile override binding the real-connect modules to it, bounding their concurrency so they don't starve each other. The override only assigns a group — it excludes no tests, so coverage is unchanged — and living on thedefaultprofile it applies to bothall-testsandfull-tests.fix(state)):cache_genesis_roots::quick_checknow treats an absent genesis tree as a (mid-flight) fast-synced database — where the genesis-root-caching invariant does not apply — and returnsOkinstead of.expect()-panicking. ANonegenesis-tree read can only mean the absent-band guard fired; a genuinely missing genesis tree on a non-fast database makes the readers panic with their own "must exist" message rather than returnNone, so no real corruption is masked. CHANGELOG updated.Test evidence
Each flake was reproduced, then each fix was A/B-validated on this machine (24 cores):
mutual_p2p+ 1 hostile still fail (5/5)vct_proptests @ 24 threadsjust checked for genesis block)Also green locally:
cargo fmt --all -- --check(changed files; a pre-existingzakura/block_sync/wire.rsformatting nit on the base branch is left untouched)cargo clippy -p zebra-network -p zebra-state --all-targets --features proptest-impl -- -D warningszebra-network+zebra-staterun under--profile all-tests: 1007 passed, 4 skippedNote on the VCT fix: the race is an inherent TOCTOU — the torn state (
is_vct_synced()readingfalsewhile a genesis read returnsNone) only exists transiently across two reads — so it cannot be frozen into a static unit test. It is guarded by the now-reliablevct_mode_switches_continue_from_safe_boundariesproptest under the suite's parallelism, plus a detailed code comment.AI disclosure
Used Claude Code (Opus 4.8) to reproduce the flakes, root-cause them, implement the fixes, and run the validation above.