Draft
34 commits
188ba8d
perf(consensus): precompute auth data root concurrently in the checkp…
p0mvn Jun 19, 2026
acb269a
perf: de-duplicate the librustzcash conversion for txid and auth digest
p0mvn Jun 18, 2026
d83ae46
perf: de-duplicate the librustzcash conversion for txid and auth digest
p0mvn Jun 19, 2026
6bbd343
perf(state): parallelize per-block serialization in the finalized blo…
p0mvn Jun 19, 2026
003c703
perf(state): gate parallel block batch-prep on a transaction-count th…
p0mvn Jun 19, 2026
b37ad32
perf(chain): compute ZIP-244 txid and auth digest natively (#131)
p0mvn Jun 19, 2026
c2c24b0
perf(chain): drop the discarded librustzcash reparse on v5 deserializ…
p0mvn Jun 19, 2026
e632fdf
perf(chain): defer Sapling cv/epk decompression, enforce on the seman…
p0mvn Jun 19, 2026
c8f1196
perf(state): parallelize and de-duplicate the committer's UTXO/addres…
p0mvn Jun 19, 2026
20eeea2
perf(state): optimize checkpoint prepare digest fanout (#148)
p0mvn Jun 19, 2026
7fb4ffe
perf(state): precompute note-commitment tree hashing off the committe…
p0mvn Jun 20, 2026
763c580
[REVERT] Roman's AI workspace
p0mvn Jun 21, 2026
84576dd
perf(sync): hedge head-of-line block download on registry-miss [proto…
p0mvn Jun 21, 2026
111ad33
perf(state): overlap raw-transaction serialization with the committer…
p0mvn Jun 23, 2026
414bc33
perf(state): run write_block on the committer thread instead of the c…
p0mvn Jun 23, 2026
b7cbe2e
perf: verified commitment trees (#189)
p0mvn Jun 27, 2026
646e19c
perf(network): pack block sync ranges by size hint (#284)
evan-forbes Jun 27, 2026
06f4c32
perf(network): reduce block sync request bookkeeping (#285)
evan-forbes Jun 27, 2026
95a39c9
fix(chain): gate V6 sapling-point check on nu6.3/nu7 cfg, not removed…
evan-forbes Jun 27, 2026
a86368f
fix(rpc): drop redundant clone on Copy ValueCommitment
evan-forbes Jun 27, 2026
d9d51b0
docs: document header body-size hints
evan-forbes Jun 27, 2026
69c3318
test(state): let VCT reopen proptest bypass the interrupted-fast-sync…
evan-forbes Jun 27, 2026
f70eec2
test(state): expect advertised body-size hints in header_only boundar…
evan-forbes Jun 27, 2026
ea17fdd
test(zebrad): accept CommitCheckpointPrecomputed in legacy sync vectors
evan-forbes Jun 27, 2026
180b426
perf(network): retain raw block bodies in reorder backlog (#286)
evan-forbes Jun 27, 2026
3c29c94
fix(network): prioritize block sync floor requests (#287)
evan-forbes Jun 27, 2026
04191c0
fix(network): add block sync congestion control (#288)
evan-forbes Jun 27, 2026
16705ff
fix(network): tune block sync throughput defaults (#289)
evan-forbes Jun 27, 2026
cd838a9
fix: stabilize flaky Zakura network and VCT fast-sync tests (#293)
evan-forbes Jun 27, 2026
f6668dc
Merge branch 'ironwood-main' into feat/pre-release-main
p0mvn Jun 27, 2026
f0ddf79
Merge remote-tracking branch 'origin/ironwood-main' into feat/pre-rel…
p0mvn Jun 28, 2026
07f3fd1
fix(network): enforce checkpoint-safe block apply budget (#292)
evan-forbes Jun 28, 2026
4243388
Merge branch 'ironwood-main' into feat/pre-release-main
p0mvn Jun 28, 2026
290bf55
Merge branch 'ironwood-main' into feat/pre-release-main
p0mvn Jun 28, 2026
21 changes: 21 additions & 0 deletions .config/nextest.toml
@@ -5,6 +5,19 @@
fail-fast = true
status-level = "pass"

# --- Test groups ---

# Zakura networking tests dial real iroh/QUIC endpoints and register peers within
# a bounded deadline. Running many of them at once oversubscribes the CPU and
# starves those handshakes/upgrades, which surfaced as flaky timeouts and upgrade
# races in the PR lane (which has no retries). Cap their concurrency so they do
# not starve each other. This complements the generous `TEST_NET_TIMEOUT` deadline
# in the zebra-network testkit: the deadline absorbs slow connects, while the group
# limits the contention that causes them. It does not exclude any tests, so
# coverage is unchanged.
[test-groups]
zakura-network = { max-threads = 4 }

# --- Platform-specific overrides ---

# Skip Windows-incompatible tests
@@ -13,6 +26,14 @@ platform = 'cfg(target_os = "windows")'

filter = "not test(=trusted_chain_sync_handles_forks_correctly) and not test(=delete_old_databases)"

# Assign the Zakura networking tests (real iroh/QUIC connects) to the
# concurrency-limited `zakura-network` test group defined above. This override
# lives on the `default` profile so it applies to every profile (all-tests,
# full-tests, ...). It only assigns a test group; it does not filter any tests
# out, so coverage is unchanged.
[[profile.default.overrides]]
filter = "package(zebra-network) and (test(~zakura::legacy_gossip::tests) or test(~zakura::testkit::cluster::tests) or test(~zakura::testkit::gossip::tests) or test(~zakura::testkit::mock_blocksync) or test(~mutual_p2p))"
test-group = "zakura-network"

# --- All Tests profile ---
# CI-friendly test selection.
#
53 changes: 53 additions & 0 deletions .github/workflows/checkpoint-update.yml
@@ -35,6 +35,7 @@ jobs:
env:
MAINNET_CHECKPOINTS: zebra-chain/src/parameters/checkpoint/main-checkpoints.txt
TESTNET_CHECKPOINTS: zebra-chain/src/parameters/checkpoint/test-checkpoints.txt
MAINNET_FRONTIER: zebra-state/src/service/finalized_state/vct/mainnet-frontier.bin
EOS_FILE: zebrad/src/components/sync/end_of_support.rs
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd #v6.0.2
@@ -87,6 +88,15 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
continue-on-error: true

- name: Download mainnet frontier artifact
id: mainnet-frontier-artifact
uses: actions/download-artifact@3e5f45b2cfb9172054b4087a40e8e0b5a5461e7c #v8.0.1
with:
name: generate-checkpoints-mainnet-frontier
run-id: ${{ steps.resolve-run.outputs.run_id }}
github-token: ${{ secrets.GITHUB_TOKEN }}
continue-on-error: true

- name: Download testnet checkpoint artifact
id: testnet-artifact
uses: actions/download-artifact@3e5f45b2cfb9172054b4087a40e8e0b5a5461e7c #v8.0.1
@@ -112,6 +122,11 @@ jobs:
HAS_MAINNET="true"
fi

if [ -f "mainnet-frontier.bin" ]; then
BYTES=$(wc -c < mainnet-frontier.bin | tr -d ' ')
echo "Mainnet frontier artifact: ${BYTES} bytes"
fi

if [ -f "test-checkpoints.txt" ]; then
LINES=$(wc -l < test-checkpoints.txt | tr -d ' ')
echo "Testnet artifact: ${LINES} checkpoint lines"
@@ -130,6 +145,7 @@ jobs:

# Append new mainnet checkpoints (entries with heights higher than current last)
- name: Append new mainnet checkpoints
id: append-mainnet
if: steps.check-artifacts.outputs.has_mainnet == 'true'
run: |
CURRENT_LAST=$(tail -1 "${MAINNET_CHECKPOINTS}" | awk '{print $1}')
@@ -138,13 +154,48 @@ jobs:
# Extract only new entries (height > current last)
NEW_COUNT=$(awk -v last="$CURRENT_LAST" '$1 > last' main-checkpoints.txt | wc -l | tr -d ' ')
echo "New mainnet checkpoints to append: ${NEW_COUNT}"
echo "new_count=${NEW_COUNT}" >> "$GITHUB_OUTPUT"

if [ "$NEW_COUNT" -gt 0 ]; then
awk -v last="$CURRENT_LAST" '$1 > last' main-checkpoints.txt >> "${MAINNET_CHECKPOINTS}"
NEW_LAST=$(tail -1 "${MAINNET_CHECKPOINTS}" | awk '{print $1}')
echo "Updated last mainnet checkpoint: ${NEW_LAST}"
echo "new_last=${NEW_LAST}" >> "$GITHUB_OUTPUT"
else
echo "new_last=${CURRENT_LAST}" >> "$GITHUB_OUTPUT"
fi

- name: Update Mainnet VCT frontier
if: >-
steps.check-artifacts.outputs.has_mainnet == 'true' &&
steps.append-mainnet.outputs.new_count != '0'
env:
EXPECTED_HEIGHT: ${{ steps.append-mainnet.outputs.new_last }}
run: |
if [ ! -s "mainnet-frontier.bin" ]; then
echo "Mainnet checkpoints advanced, but mainnet-frontier.bin is missing or empty"
exit 1
fi

FRONTIER_HEIGHT=$(python3 - <<'PY'
import struct

with open("mainnet-frontier.bin", "rb") as frontier:
    height_bytes = frontier.read(4)
if len(height_bytes) != 4:
    raise SystemExit("frontier artifact is shorter than its height prefix")
print(struct.unpack("<I", height_bytes)[0])
PY
)

if [ "${FRONTIER_HEIGHT}" != "${EXPECTED_HEIGHT}" ]; then
echo "Frontier height ${FRONTIER_HEIGHT} does not match updated Mainnet checkpoint ${EXPECTED_HEIGHT}"
exit 1
fi

cp mainnet-frontier.bin "${MAINNET_FRONTIER}"
echo "Updated ${MAINNET_FRONTIER} for checkpoint height ${EXPECTED_HEIGHT}"
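
The height check in this step relies on the frontier artifact starting with a 4-byte little-endian height. A minimal sketch of that check as a reusable function, assuming only what the step itself shows (a `<I` prefix); the function names, the sample height, and the payload bytes are hypothetical, and the rest of the file layout is not described here:

```python
import struct


def frontier_height(blob: bytes) -> int:
    """Return the little-endian u32 height stored in the first 4 bytes."""
    if len(blob) < 4:
        raise ValueError("frontier artifact is shorter than its height prefix")
    return struct.unpack("<I", blob[:4])[0]


def check_frontier(blob: bytes, expected_height: int) -> None:
    """Raise if the artifact is empty or its height differs from the checkpoint."""
    if not blob:
        raise ValueError("frontier artifact is missing or empty")
    height = frontier_height(blob)
    if height != expected_height:
        raise ValueError(
            f"frontier height {height} does not match checkpoint {expected_height}"
        )


# Hypothetical artifact: a height prefix followed by opaque payload bytes.
sample = struct.pack("<I", 3_000_000) + b"opaque-frontier-payload"
check_frontier(sample, 3_000_000)
print(frontier_height(sample))
```

Keeping the prefix parse separate from the comparison makes the two failure modes (truncated file, stale frontier) easy to tell apart in logs.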

# Append new testnet checkpoints
- name: Append new testnet checkpoints
if: steps.check-artifacts.outputs.has_testnet == 'true'
@@ -214,6 +265,7 @@ jobs:
### Changes

- Updated mainnet and/or testnet checkpoint files with new entries
- Updated `mainnet-frontier.bin` when Mainnet checkpoints advanced
- Updated `ESTIMATED_RELEASE_HEIGHT` in `end_of_support.rs` to match the latest mainnet checkpoint

### Validation
@@ -223,6 +275,7 @@ jobs:
- Heights are monotonically increasing
- No gaps exceed 400 blocks
- No duplicate heights or hashes
- Mainnet frontier height matches the updated Mainnet checkpoint height, when present

### Review

29 changes: 29 additions & 0 deletions .github/workflows/zfnd-deploy-integration-tests-gcp.yml
@@ -394,6 +394,7 @@ jobs:
CONTAINER_ID: ${{ steps.find-container.outputs.CONTAINER_ID }}
INSTANCE_NAME: ${{ inputs.test_id }}-${{ env.GITHUB_REF_SLUG_URL }}-${{ env.GITHUB_SHA_SHORT }}
GCP_ZONE: ${{ vars.GCP_ZONE }}
CAPTURE_MAINNET_FRONTIER: ${{ contains(inputs.test_id, 'mainnet') }}
run: |
gcloud compute ssh "${INSTANCE_NAME}" \
--zone "${GCP_ZONE}" \
@@ -403,6 +404,11 @@ jobs:
--command="
sudo docker logs ${CONTAINER_ID} 2>&1 | grep -oE '[0-9]+ [0-9a-f]{64}' > /tmp/checkpoints.txt;
echo \"Captured \$(wc -l < /tmp/checkpoints.txt) checkpoint lines\";
if [ \"${CAPTURE_MAINNET_FRONTIER}\" = 'true' ]; then
sudo docker cp ${CONTAINER_ID}:/tmp/mainnet-frontier.bin /tmp/mainnet-frontier.bin;
test -s /tmp/mainnet-frontier.bin || exit 1;
echo \"Captured Mainnet VCT frontier artifact\";
fi
"

# Upload the checkpoint file captured in the test-result job as a workflow
@@ -468,13 +474,36 @@ jobs:
exit 1
fi

- name: Pull Mainnet frontier artifact from instance
if: ${{ contains(inputs.test_id, 'mainnet') }}
run: |
INSTANCE_NAME="${TEST_ID}-${GITHUB_REF_SLUG_URL}-${GITHUB_SHA_SHORT}"

gcloud compute scp \
--zone "${GCP_ZONE}" \
"${INSTANCE_NAME}:/tmp/mainnet-frontier.bin" \
"mainnet-frontier.bin"

if [ ! -s "mainnet-frontier.bin" ]; then
echo "ERROR: Mainnet frontier artifact is empty"
exit 1
fi

- name: Upload checkpoint artifact
uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a #v7.0.1
with:
name: ${{ inputs.test_id }}-checkpoints
path: "*-checkpoints.txt"
retention-days: 30

- name: Upload Mainnet frontier artifact
if: ${{ contains(inputs.test_id, 'mainnet') }}
uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a #v7.0.1
with:
name: ${{ inputs.test_id }}-frontier
path: mainnet-frontier.bin
retention-days: 30

# create a state image from the instance's state disk, if requested by the caller
create-state-image:
name: Create ${{ inputs.test_id }} cached state image
17 changes: 17 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,17 @@
{
// Rust build artifacts (~140G in target/, ~5G in unity-node/target/) saturate
// the file watcher and index, which hangs the extension host (agents + terminal).
// These dirs are gitignored, so hiding them from the editor is safe.
"files.watcherExclude": {
"**/target/**": true,
"**/.git/objects/**": true
},
"search.exclude": {
"**/target": true
},
"files.exclude": {
"**/target": true
},
// Keep rust-analyzer from scanning build output under target/.
"rust-analyzer.files.excludeDirs": ["target"]
}
36 changes: 36 additions & 0 deletions CHANGELOG.md
@@ -41,6 +41,17 @@ and this project adheres to [Semantic Versioning](https://semver.org).
hosts (~20 → ~42 blk/s on an 8-core machine at 1.7M height). A new
default-off `commit-metrics` feature emits per-block timing histograms
(`zebra.state.write.*`) for future profiling.
- Precompute note-commitment tree hashing ahead of the finalized committer. The
per-leaf Merkle hashing for a block (the dominant committer cost on shielded
blocks) depends only on the starting note count, not the frontier's hashes, so
the finalized write loop now does a one-block look-ahead and runs the next
block's Sapling/Orchard hashing on idle cores while the current block commits;
the committer then only applies the precomputed subtree roots onto the frontier
(`update_trees_parallel_with` in `zebra-chain`). The precompute is applied only
if its starting tree size still matches at commit time and otherwise falls back
to inline hashing, so it affects only speed, never the resulting tree. This cuts
the committer's tree-update cost by ~54% (12.5 → 5.7 ms/block) where the
committer is the bottleneck.
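
The "apply only if the starting tree size still matches, otherwise hash inline" rule can be illustrated with a toy model. This is a sketch under stated assumptions, not the `zebra-chain` implementation: the tree is modelled as a flat list of position-dependent leaf hashes rather than a Merkle frontier, and `ToyTree`, `Precomputed`, `precompute`, and `commit` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass, field


def leaf_hash(position: int, note: bytes) -> bytes:
    """Stand-in for the expensive per-leaf work; depends on the leaf position."""
    return hashlib.sha256(position.to_bytes(8, "little") + note).digest()


def hash_block(start_size: int, notes: list) -> list:
    """Hash every note of a block, given the tree size before the block."""
    return [leaf_hash(start_size + i, note) for i, note in enumerate(notes)]


@dataclass
class ToyTree:
    """Toy append-only tree: only the list of leaf hashes (not a real frontier)."""
    leaves: list = field(default_factory=list)


@dataclass
class Precomputed:
    start_size: int  # tree size the hashes were computed for
    hashes: list


def precompute(start_size: int, notes: list) -> Precomputed:
    # In a real system this would run ahead of the committer on another thread.
    return Precomputed(start_size, hash_block(start_size, notes))


def commit(tree: ToyTree, notes: list, pre: Precomputed = None) -> bool:
    """Append a block's notes; return True if the precompute was used."""
    if pre is not None and pre.start_size == len(tree.leaves):
        tree.leaves.extend(pre.hashes)
        return True
    # Stale or missing precompute: fall back to inline hashing.
    tree.leaves.extend(hash_block(len(tree.leaves), notes))
    return False


block_1, block_2 = [b"n0", b"n1"], [b"n2"]

fast = ToyTree()
commit(fast, block_1)
used_fast = commit(fast, block_2, precompute(2, block_2))  # correct start size

slow = ToyTree()
commit(slow, block_1)
used_slow = commit(slow, block_2, precompute(5, block_2))  # stale start size

print(used_fast, used_slow, fast.leaves == slow.leaves)
```

Both paths end with the same leaves, which is the property the entry describes: the precompute changes only where the hashing work happens, not the result.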

### Changed

@@ -89,6 +100,23 @@ and this project adheres to [Semantic Versioning](https://semver.org).
duplicate-peer handling scaffolding.
- Added bounded Zakura header-sync stream-5 wire messages, stateless header
validation, and the default `network.zakura.header_sync` config surface.
- Verified-commitment-trees fast checkpoint sync. Below the last checkpoint Zebra
now fetches per-block Sapling/Orchard commitment roots from peers over a new
header-sync-aligned `tree_aux` stream, verifies each root against the node's own
checkpoint-committed block headers (the ZIP-221 ChainHistory MMR plus direct
below-Heartwood/below-NU5 checks), and folds the verified roots into the anchor
set and history tree — skipping the per-block note-commitment frontier recompute
that dominates checkpoint-sync CPU cost. At the checkpoint handoff an embedded
final frontier, verified against that block's proven root, is written as the tip
treestate and normal per-block recompute resumes. The resulting consensus state
is byte-identical to the legacy recompute; a root that cannot be obtained or
verified is rejected rather than recomputed against the stale frozen frontier, so
no untrusted data can influence consensus state. This is the default whenever
`consensus.checkpoint_sync = true` on a network with an embedded handoff frontier
(Mainnet), for both Archive and Pruned storage modes. The new
`consensus.disable_vct_fast_sync` flag (default `false`) keeps checkpoint sync
enabled while forcing the legacy per-block recompute. Bumps the state database
format to 27.3.0 (new column families only; no data migration).
- Include the `zebra-rollback-state` and `zebra-prune-state` utilities alongside
`zebrad` in release Docker images and Docker CI builds.
- Use the `5.0.0-rc.3` release identity for this fork's v5 rollback build.
@@ -145,6 +173,14 @@ and this project adheres to [Semantic Versioning](https://semver.org).

### Fixed

- Stop the database format-validity check from panicking with "just checked for
genesis block" while a verified-commitment-trees fast sync is in progress. The
check runs on a background thread, concurrently with block commits. It could
read its `is_vct_synced()` guard as `false`, then find the genesis
note-commitment tree absent because a concurrent fast-sync commit set the
marker between the two reads. It now treats an absent genesis tree as a
(mid-flight) fast-synced database, where the genesis-root-caching invariant
does not apply, instead of panicking.
- Treat missing transaction inventory responses during mempool download as a
recoverable download failure, avoiding a panic when public peers no longer
have a gossiped transaction available.
1 change: 1 addition & 0 deletions Cargo.lock
@@ -9492,6 +9492,7 @@ dependencies = [
"zebra-chain",
"zebra-node-services",
"zebra-rpc",
"zebra-state",
]

[[package]]
53 changes: 53 additions & 0 deletions HOL_HEDGE_RESULTS.md
@@ -0,0 +1,53 @@
# Hedged head-of-line download — benchmark results

**Branch:** `proto-hedged-hol-download` (binary `/root/wal-bench/zebrad-hedge`, built `--features commit-metrics`).
**PR:** #151 (`hedge-hol-rebased` → `proto-note-tree-precompute`).
**Method:** single binary, env-toggled `SYNC_HOL_HEDGE_FANOUT=0` (baseline) vs `=4` (hedged), **random DNS peers** (the stall only manifests with diverse/churning peers — a pinned peer never reproduces it). Interleaved off/on/off/on/off/on so temporal peer drift hits both arms equally. 7.5-min fork windows from the 1,707,210 snapshot. `checkpoint_verify=1500`, `download=150`. Harness: `hedge_ab.sh`.

## Per-run data (N=3 per arm)

| run | Δblocks (7.5 min) | stall intervals (blk/s<2 & in_flight>1000) | reg_miss | all_missing | route_hedge win | steady blk/s |
|---|---|---|---|---|---|---|
| OFF-1 | 10,899 | 18/84 | 97,676 | 380,894 | — | 27.9 |
| OFF-2 | 10,539 | 21/83 | 93,517 | 364,439 | — | 25.4 |
| OFF-3 | 22,438 | 9/84 | 50,060 | 195,060 | — | 68.9 |
| **ON-1** | 18,316 | 7/81 | 43,328 | 62,469 | 17,990 | 45.4 |
| **ON-2** | 19,434 | 12/84 | 44,729 | 57,630 | 18,295 | 50.8 |
| **ON-3** | 28,213 | 3/84 | 0 | 7 | 0 (inert) | 64.7 |

## Medians (OFF → ON)

| metric | OFF | ON | Δ |
|---|---|---|---|
| stall intervals | 18 | 7 | **−61%** |
| reg_miss | 93,517 | 43,328 | **−54%** |
| **all_missing** (stale-marker fails) | 364,439 | 57,630 | **−84%** |
| Δblocks per 7.5-min window | 10,899 | 19,434 | **+78%** |
| steady-state blk/s | 27.9 | 50.8 | +82% |
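
The medians and percentage deltas above can be rechecked from the per-run table. A minimal sketch that uses only the values in that table; the dictionary layout and the `pct_change` helper are illustrative names, not part of the benchmark harness:

```python
from statistics import median

# Per-run values copied from the per-run table (N=3 per arm).
runs = {
    "stall_intervals": {"off": [18, 21, 9], "on": [7, 12, 3]},
    "reg_miss": {"off": [97_676, 93_517, 50_060], "on": [43_328, 44_729, 0]},
    "all_missing": {"off": [380_894, 364_439, 195_060], "on": [62_469, 57_630, 7]},
    "delta_blocks": {"off": [10_899, 10_539, 22_438], "on": [18_316, 19_434, 28_213]},
    "steady_blk_s": {"off": [27.9, 25.4, 68.9], "on": [45.4, 50.8, 64.7]},
}


def pct_change(off: float, on: float) -> int:
    """Percentage change from the OFF median to the ON median, rounded."""
    return round((on - off) / off * 100)


# Median per arm, then the relative change between the two medians.
summary = {
    name: (median(arms["off"]), median(arms["on"])) for name, arms in runs.items()
}
deltas = {name: pct_change(off, on) for name, (off, on) in summary.items()}

for name, (off, on) in summary.items():
    print(f"{name}: {off} -> {on} ({deltas[name]:+d}%)")
```

With three runs per arm the median is simply the middle value, so a single outlier draw (such as OFF-3 or ON-3) does not move it.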

## Verdict — the hedge works, and is well-behaved

**It does exactly what it was designed to do, confirmed across N=3:**

1. **Active when peers thrash.** On the two bad draws (ON-1, ON-2), the baseline equivalent would have accumulated ~360k `all_missing` synthetic failures; the hedge fired (`dispatch` ~140k per-peer, **~18k wins**), bypassing the stale "missing" inventory markers and delivering the head block from a real ready peer. Result: `all_missing` −84%, `reg_miss` −54%, stalls cut, ~+78% more blocks committed in the window.

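2. **Inert when peers are clean.** ON-3 drew a healthy peer set with **0 registry-misses**, so the hedge stayed at 0 dispatches. Its steady-state rate was close to the best baseline draw (OFF-3: 68.9 vs ON-3: 64.7 blk/s), and it committed more blocks in the window (28,213 vs 22,438). With no hedge requests sent, the hedge added no overhead and caused no regression when there was nothing to fix.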

**This contradicts the handoff's "honest risk"** that #105 might already absorb the stall: on bad draws the baseline still thrashed hard (364k `all_missing`, 18–21 stall intervals), and the hedge sharply reduced it. #105 (let markers age out during the 2s backoff) and the hedge (bypass the markers entirely on retry) are complementary — the hedge attacks the residual cases #105 doesn't resolve within budget.

## Mechanism evidence (`route_hedge` counters, bad-draw arms)

- `dispatch` ~136k–147k per-peer requests, `win` ~18k, `exhausted` ~117k–127k. So ~12–13% of per-peer hedge requests delivered the block; the rest exhausted and fell back to the unchanged #105 backoff. Even at that win rate, `all_missing` collapsed −84% and throughput rose — because each win resolves a head-of-line block that would otherwise have stalled the strictly-ordered commit for a full 2s backoff cycle.

## Honest caveats

- **Throughput is peer-draw-dependent.** The +78% Δblocks / +82% steady-state are real within these runs but confounded by which peers each window drew (the ON arm happened to also escape cold-start faster on average). The robust, mechanism-level claims are the **`all_missing` −84%** and the **18k hedge wins** — these directly measure the stale-marker bypass and are not throughput-noise.
- N=3 per arm. More runs would tighten the medians, but the direction is consistent: every ON run has far lower `all_missing` than every OFF run. The clean ON-3 run is not evidence for the hedge itself, since it had no registry misses to begin with and the hedge never fired.

## DoS posture (unchanged from the design)

Scoped to the single head-of-line hash in `registry_miss_retry`; small fanout (4) clamped to ready peers; `select_random_ready_peers` (random, load-ignoring, broadcast stance); losers cancelled on first win; no new retry budget; counts as one request against `download_concurrency_limit`.
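
The fan-out behaviour described above (small fanout clamped to ready peers, random selection, losers cancelled on first win, fall back when exhausted) can be sketched as follows. This is an illustration in Python `asyncio`, not the Zebra implementation; `hedged_fetch`, the `fetch` callback, and the peer names are hypothetical.

```python
import asyncio
import random


async def hedged_fetch(block_hash, ready_peers, fetch, fanout=4):
    """Ask up to `fanout` random ready peers for one block; first success wins."""
    # Clamp the fanout to the peers that are actually ready.
    peers = random.sample(ready_peers, min(fanout, len(ready_peers)))
    tasks = [asyncio.create_task(fetch(peer, block_hash)) for peer in peers]
    try:
        for finished in asyncio.as_completed(tasks):
            try:
                return await finished  # first win
            except Exception:
                continue  # this peer failed; wait for the remaining ones
        return None  # exhausted: the caller falls back to its normal backoff
    finally:
        # Cancel the losers so they stop consuming bandwidth.
        for task in tasks:
            task.cancel()


async def demo():
    async def fetch(peer, block_hash):
        # Hypothetical peers: only "peer-c" has the block, the rest report missing.
        await asyncio.sleep(0.01)
        if peer != "peer-c":
            raise LookupError("missing")
        return (peer, block_hash)

    return await hedged_fetch("head", ["peer-a", "peer-b", "peer-c"], fetch)


result = asyncio.run(demo())
print(result)
```

Because the whole fan-out resolves to a single result or `None`, a caller can treat it as one logical request, which matches the stated goal of adding no new retry budget.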

## Recommendation

Ship-worthy as a prototype. The lever is validated: it converts stale-marker `all_missing` failures into deliveries and reduces head-of-line stalls, with zero overhead on clean draws. Next tuning (per handoff §7): cut the 2s backoff for hedged retries (the fanout already addresses the root cause, so the wait is mostly wasted), and/or add latency-aware peer selection to raise the floor.