Skip to content

fix: stabilize flaky Zakura network and VCT fast-sync tests#293

Merged
p0mvn merged 2 commits into
feat/pre-release-mainfrom
evan/fix-flaky-tests
Jun 27, 2026
Merged

fix: stabilize flaky Zakura network and VCT fast-sync tests#293
p0mvn merged 2 commits into
feat/pre-release-mainfrom
evan/fix-flaky-tests

Conversation

@evan-forbes

Copy link
Copy Markdown

Motivation

The PR lane (cargo nextest run --profile all-tests) runs with fail-fast = true and no retries, so a single flaky failure sinks the whole run and forces a manual re-trigger. Two distinct root causes were producing intermittent failures under the suite's num-cpus parallelism:

  1. Zakura networking tests (zakura::legacy_gossip::tests::*, zakura::testkit::cluster::tests::*, peer::handshake::tests::mutual_p2p_*, the testkit gossip mesh) dial real iroh/QUIC endpoints with hardcoded 5s/15s connect and await_until convergence deadlines. Under CPU contention those deadlines are exceeded (and the legacy→Zakura upgrade handshake takes a different path), so the tests time out or assert the wrong error.
  2. vct_mode_switches_continue_from_safe_boundaries panicked with just checked for genesis block due to a TOCTOU in the cache_genesis_roots format-validity check racing an in-progress verified-commitment-trees fast sync.

Solution

  • Harness timeouts (test(network)): introduce a single generous TEST_NET_TIMEOUT = 30s constant in the zebra-network testkit and replace the scattered 5s/15s connect_native / connect_full_mesh / await_all_connected / await_until / convergence timeout deadlines with it. A slow connect under load is not a correctness signal, and these helpers return as soon as the awaited condition holds, so a generous deadline never masks a genuine hang.
  • nextest test group (test(network)): add [test-groups] zakura-network = { max-threads = 4 } and a default-profile override binding the real-connect modules to it, bounding their concurrency so they don't starve each other. The override only assigns a group — it excludes no tests, so coverage is unchanged — and living on the default profile it applies to both all-tests and full-tests.
  • VCT format-check fix (fix(state)): cache_genesis_roots::quick_check now treats an absent genesis tree as a (mid-flight) fast-synced database — where the genesis-root-caching invariant does not apply — and returns Ok instead of .expect()-panicking. A None genesis-tree read can only mean the absent-band guard fired; a genuinely missing genesis tree on a non-fast database makes the readers panic with their own "must exist" message rather than return None, so no real corruption is masked. CHANGELOG updated.

Test evidence

Each flake was reproduced, then each fix was A/B-validated on this machine (24 cores):

Scenario Before After
Network family @ 48 threads, no retries (10 iters) 10/10 iterations fail
+ timeout fix only 13 tests fixed; mutual_p2p + 1 hostile still fail (5/5)
+ timeout fix and test group (5 iters) 0/5 fail
vct_ proptests @ 24 threads 2/10 fail (just checked for genesis block) 0/15 fail

Also green locally:

  • cargo fmt --all -- --check (changed files; a pre-existing zakura/block_sync/wire.rs formatting nit on the base branch is left untouched)
  • cargo clippy -p zebra-network -p zebra-state --all-targets --features proptest-impl -- -D warnings
  • full zebra-network + zebra-state run under --profile all-tests: 1007 passed, 4 skipped

Note on the VCT fix: the race is an inherent TOCTOU — the torn state (is_vct_synced() reading false while a genesis read returns None) only exists transiently across two reads — so it cannot be frozen into a static unit test. It is guarded by the now-reliable vct_mode_switches_continue_from_safe_boundaries proptest under the suite's parallelism, plus a detailed code comment.

AI disclosure

Used Claude Code (Opus 4.8) to reproduce the flakes, root-cause them, implement the fixes, and run the validation above.

…fast sync

`cache_genesis_roots::quick_check` runs on the background format-validity
thread (`DbFormatChange::spawn_format_change`) concurrently with block
commits. It reads its `is_vct_synced()` guard once, then reads the genesis
note-commitment trees via readers (`sapling_tree_by_height` etc.) that return
`None` for any height in a fast-synced database's `[U, H)` absent band by
re-reading the same `vct_synced_below` marker. A fast-sync commit can set that
marker in the window between the two reads, so the function passes the
`is_vct_synced()` guard (marker unset) and then a genesis read returns `None`
(marker now set) -- and the old `.expect("just checked for genesis block")`
panicked. Under the test suite's parallelism this surfaced as a flaky
`vct_mode_switches_continue_from_safe_boundaries` failure (reproduced 2/10).

Treat an absent genesis tree as a (mid-flight) fast-synced database, where the
genesis-root-caching invariant does not apply, and return `Ok` instead of
panicking. A `None` here can only mean the absent-band guard fired; a genuinely
missing genesis tree on a non-fast database makes the readers panic with their
own "must exist" message rather than return `None`, so no real corruption is
masked.
The Zakura networking tests (`zakura::legacy_gossip::tests::*`,
`zakura::testkit::cluster::tests::*`, `peer::handshake::tests::mutual_p2p_*`,
the testkit gossip mesh) dial real iroh/QUIC endpoints and register peers
within a bounded deadline. Run at the suite's `num-cpus` parallelism they
oversubscribe the CPU and starve those handshakes, so the hardcoded 5s connect
and `await_until` convergence deadlines are exceeded and the upgrade handshake
takes a different path. In the PR lane (`--profile all-tests`, `fail-fast` and
no retries) a single such timeout sinks the whole run. Reproduced at 2x thread
oversubscription: 10/10 iterations had >= 1 failure.

Two complementary changes:

- Introduce a single generous `TEST_NET_TIMEOUT = 30s` constant in the
  zebra-network testkit and replace the scattered 5s/15s `connect_native`,
  `connect_full_mesh`, `await_all_connected`, `await_until`, and convergence
  `timeout` deadlines with it. A slow connect under load is not a correctness
  signal, and these helpers return as soon as the awaited condition holds, so a
  generous deadline never masks a genuine hang.

- Add a nextest `zakura-network` test group (`max-threads = 4`) and a
  default-profile override binding the real-connect modules to it, so they do
  not starve each other. The override only assigns a group; it excludes no
  tests, so coverage is unchanged, and living on the `default` profile it
  applies to both `all-tests` and `full-tests`.

With both changes the same 2x-oversubscription reproduction is 0/5, and a full
`zebra-network`/`zebra-state` run under `--profile all-tests` passes (1007/1007).
@evan-forbes evan-forbes force-pushed the evan/fix-flaky-tests branch from ad4d412 to 42c75fe Compare June 27, 2026 19:34
@p0mvn p0mvn merged commit cd838a9 into feat/pre-release-main Jun 27, 2026
48 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants