Closed
16 commits
188ba8d
perf(consensus): precompute auth data root concurrently in the checkp…
p0mvn Jun 19, 2026
acb269a
perf: de-duplicate the librustzcash conversion for txid and auth digest
p0mvn Jun 18, 2026
d83ae46
perf: de-duplicate the librustzcash conversion for txid and auth digest
p0mvn Jun 19, 2026
6bbd343
perf(state): parallelize per-block serialization in the finalized blo…
p0mvn Jun 19, 2026
003c703
perf(state): gate parallel block batch-prep on a transaction-count th…
p0mvn Jun 19, 2026
b37ad32
perf(chain): compute ZIP-244 txid and auth digest natively (#131)
p0mvn Jun 19, 2026
c2c24b0
perf(chain): drop the discarded librustzcash reparse on v5 deserializ…
p0mvn Jun 19, 2026
e632fdf
perf(chain): defer Sapling cv/epk decompression, enforce on the seman…
p0mvn Jun 19, 2026
c8f1196
perf(state): parallelize and de-duplicate the committer's UTXO/addres…
p0mvn Jun 19, 2026
20eeea2
perf(state): optimize checkpoint prepare digest fanout (#148)
p0mvn Jun 19, 2026
7fb4ffe
perf(state): precompute note-commitment tree hashing off the committe…
p0mvn Jun 20, 2026
763c580
[REVERT] Roman's AI workspace
p0mvn Jun 21, 2026
84576dd
perf(sync): hedge head-of-line block download on registry-miss [proto…
p0mvn Jun 21, 2026
111ad33
perf(state): overlap raw-transaction serialization with the committer…
p0mvn Jun 23, 2026
414bc33
perf(state): run write_block on the committer thread instead of the c…
p0mvn Jun 23, 2026
b7cbe2e
perf: verified commitment trees (#189)
p0mvn Jun 27, 2026
53 changes: 53 additions & 0 deletions .github/workflows/checkpoint-update.yml
@@ -35,6 +35,7 @@ jobs:
env:
MAINNET_CHECKPOINTS: zebra-chain/src/parameters/checkpoint/main-checkpoints.txt
TESTNET_CHECKPOINTS: zebra-chain/src/parameters/checkpoint/test-checkpoints.txt
MAINNET_FRONTIER: zebra-state/src/service/finalized_state/vct/mainnet-frontier.bin
EOS_FILE: zebrad/src/components/sync/end_of_support.rs
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd #v6.0.2
@@ -87,6 +88,15 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
continue-on-error: true

- name: Download mainnet frontier artifact
id: mainnet-frontier-artifact
uses: actions/download-artifact@3e5f45b2cfb9172054b4087a40e8e0b5a5461e7c #v8.0.1
with:
name: generate-checkpoints-mainnet-frontier
run-id: ${{ steps.resolve-run.outputs.run_id }}
github-token: ${{ secrets.GITHUB_TOKEN }}
continue-on-error: true

- name: Download testnet checkpoint artifact
id: testnet-artifact
uses: actions/download-artifact@3e5f45b2cfb9172054b4087a40e8e0b5a5461e7c #v8.0.1
@@ -112,6 +122,11 @@ jobs:
HAS_MAINNET="true"
fi

if [ -f "mainnet-frontier.bin" ]; then
BYTES=$(wc -c < mainnet-frontier.bin | tr -d ' ')
echo "Mainnet frontier artifact: ${BYTES} bytes"
fi

if [ -f "test-checkpoints.txt" ]; then
LINES=$(wc -l < test-checkpoints.txt | tr -d ' ')
echo "Testnet artifact: ${LINES} checkpoint lines"
@@ -130,6 +145,7 @@ jobs:

# Append new mainnet checkpoints (entries with heights higher than current last)
- name: Append new mainnet checkpoints
id: append-mainnet
if: steps.check-artifacts.outputs.has_mainnet == 'true'
run: |
CURRENT_LAST=$(tail -1 "${MAINNET_CHECKPOINTS}" | awk '{print $1}')
@@ -138,13 +154,48 @@ jobs:
# Extract only new entries (height > current last)
NEW_COUNT=$(awk -v last="$CURRENT_LAST" '$1 > last' main-checkpoints.txt | wc -l | tr -d ' ')
echo "New mainnet checkpoints to append: ${NEW_COUNT}"
echo "new_count=${NEW_COUNT}" >> "$GITHUB_OUTPUT"

if [ "$NEW_COUNT" -gt 0 ]; then
awk -v last="$CURRENT_LAST" '$1 > last' main-checkpoints.txt >> "${MAINNET_CHECKPOINTS}"
NEW_LAST=$(tail -1 "${MAINNET_CHECKPOINTS}" | awk '{print $1}')
echo "Updated last mainnet checkpoint: ${NEW_LAST}"
echo "new_last=${NEW_LAST}" >> "$GITHUB_OUTPUT"
else
echo "new_last=${CURRENT_LAST}" >> "$GITHUB_OUTPUT"
fi

- name: Update Mainnet VCT frontier
if: >-
steps.check-artifacts.outputs.has_mainnet == 'true' &&
steps.append-mainnet.outputs.new_count != '0'
env:
EXPECTED_HEIGHT: ${{ steps.append-mainnet.outputs.new_last }}
run: |
if [ ! -s "mainnet-frontier.bin" ]; then
echo "Mainnet checkpoints advanced, but mainnet-frontier.bin is missing or empty"
exit 1
fi

FRONTIER_HEIGHT=$(python3 - <<'PY'
import struct

# The frontier artifact starts with a 4-byte little-endian block height.
with open("mainnet-frontier.bin", "rb") as frontier:
    height_bytes = frontier.read(4)
if len(height_bytes) != 4:
    raise SystemExit("frontier artifact is shorter than its height prefix")
print(struct.unpack("<I", height_bytes)[0])
PY
)

if [ "${FRONTIER_HEIGHT}" != "${EXPECTED_HEIGHT}" ]; then
echo "Frontier height ${FRONTIER_HEIGHT} does not match updated Mainnet checkpoint ${EXPECTED_HEIGHT}"
exit 1
fi

cp mainnet-frontier.bin "${MAINNET_FRONTIER}"
echo "Updated ${MAINNET_FRONTIER} for checkpoint height ${EXPECTED_HEIGHT}"

# Append new testnet checkpoints
- name: Append new testnet checkpoints
if: steps.check-artifacts.outputs.has_testnet == 'true'
@@ -214,6 +265,7 @@ jobs:
### Changes

- Updated mainnet and/or testnet checkpoint files with new entries
- Updated `mainnet-frontier.bin` when Mainnet checkpoints advanced
- Updated `ESTIMATED_RELEASE_HEIGHT` in `end_of_support.rs` to match the latest mainnet checkpoint

### Validation
@@ -223,6 +275,7 @@ jobs:
- Heights are monotonically increasing
- No gaps exceed 400 blocks
- No duplicate heights or hashes
- Mainnet frontier height matches the updated Mainnet checkpoint height, when present

### Review

Expand Down
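The validation rules listed in the PR body template above (monotonic heights, no gap over 400 blocks, no duplicates, frontier height equal to the last checkpoint) can be sketched in a few lines. This is only an illustration, not the workflow's own implementation: `validate_checkpoints`, `frontier_height`, and `MAX_GAP` are hypothetical names, and the only format detail taken from the diff is the 4-byte little-endian height prefix of `mainnet-frontier.bin`.

```python
import struct

MAX_GAP = 400  # hypothetical constant for the "no gaps exceed 400 blocks" rule


def validate_checkpoints(entries):
    """Check (height, hash) pairs in file order; return a list of problems."""
    problems = []
    seen_hashes = set()
    previous = None
    for height, block_hash in entries:
        if previous is not None:
            if height <= previous:
                # Covers both duplicate heights and out-of-order heights.
                problems.append(f"height {height} does not exceed {previous}")
            elif height - previous > MAX_GAP:
                problems.append(f"gap {previous} -> {height} exceeds {MAX_GAP}")
        if block_hash in seen_hashes:
            problems.append(f"duplicate hash at height {height}")
        seen_hashes.add(block_hash)
        previous = height
    return problems


def frontier_height(blob):
    """Read the 4-byte little-endian height prefix of a frontier artifact."""
    if len(blob) < 4:
        raise ValueError("frontier artifact is shorter than its height prefix")
    return struct.unpack("<I", blob[:4])[0]
```

A caller would compare `frontier_height(...)` with the height of the last appended checkpoint and fail the job on any mismatch, which mirrors the `FRONTIER_HEIGHT` versus `EXPECTED_HEIGHT` check in the workflow step.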
29 changes: 29 additions & 0 deletions .github/workflows/zfnd-deploy-integration-tests-gcp.yml
@@ -394,6 +394,7 @@ jobs:
CONTAINER_ID: ${{ steps.find-container.outputs.CONTAINER_ID }}
INSTANCE_NAME: ${{ inputs.test_id }}-${{ env.GITHUB_REF_SLUG_URL }}-${{ env.GITHUB_SHA_SHORT }}
GCP_ZONE: ${{ vars.GCP_ZONE }}
CAPTURE_MAINNET_FRONTIER: ${{ contains(inputs.test_id, 'mainnet') }}
run: |
gcloud compute ssh "${INSTANCE_NAME}" \
--zone "${GCP_ZONE}" \
@@ -403,6 +404,11 @@ jobs:
--command="
sudo docker logs ${CONTAINER_ID} 2>&1 | grep -oE '[0-9]+ [0-9a-f]{64}' > /tmp/checkpoints.txt;
echo \"Captured \$(wc -l < /tmp/checkpoints.txt) checkpoint lines\";
if [ \"${CAPTURE_MAINNET_FRONTIER}\" = 'true' ]; then
sudo docker cp ${CONTAINER_ID}:/tmp/mainnet-frontier.bin /tmp/mainnet-frontier.bin;
test -s /tmp/mainnet-frontier.bin;
echo \"Captured Mainnet VCT frontier artifact\";
fi
"

# Upload the checkpoint file captured in the test-result job as a workflow
@@ -468,13 +474,36 @@ jobs:
exit 1
fi

- name: Pull Mainnet frontier artifact from instance
if: ${{ contains(inputs.test_id, 'mainnet') }}
run: |
INSTANCE_NAME="${TEST_ID}-${GITHUB_REF_SLUG_URL}-${GITHUB_SHA_SHORT}"

gcloud compute scp \
--zone "${GCP_ZONE}" \
"${INSTANCE_NAME}:/tmp/mainnet-frontier.bin" \
"mainnet-frontier.bin"

if [ ! -s "mainnet-frontier.bin" ]; then
echo "ERROR: Mainnet frontier artifact is empty"
exit 1
fi

- name: Upload checkpoint artifact
uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a #v7.0.1
with:
name: ${{ inputs.test_id }}-checkpoints
path: "*-checkpoints.txt"
retention-days: 30

- name: Upload Mainnet frontier artifact
if: ${{ contains(inputs.test_id, 'mainnet') }}
uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a #v7.0.1
with:
name: ${{ inputs.test_id }}-frontier
path: mainnet-frontier.bin
retention-days: 30

# create a state image from the instance's state disk, if requested by the caller
create-state-image:
name: Create ${{ inputs.test_id }} cached state image
17 changes: 17 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,17 @@
{
// Rust build artifacts (~140G in target/, ~5G in unity-node/target/) saturate
// the file watcher and index, which hangs the extension host (agents + terminal).
// These dirs are gitignored, so hiding them from the editor is safe.
"files.watcherExclude": {
"**/target/**": true,
"**/.git/objects/**": true
},
"search.exclude": {
"**/target": true
},
"files.exclude": {
"**/target": true
},
// Let rust-analyzer manage the workspace without a redundant cargo check storm.
"rust-analyzer.files.excludeDirs": ["target"]
}
28 changes: 28 additions & 0 deletions CHANGELOG.md
@@ -41,6 +41,17 @@ and this project adheres to [Semantic Versioning](https://semver.org).
hosts (~20 → ~42 blk/s on an 8-core machine at 1.7M height). A new
default-off `commit-metrics` feature emits per-block timing histograms
(`zebra.state.write.*`) for future profiling.
- Precompute note-commitment tree hashing ahead of the finalized committer. The
per-leaf Merkle hashing for a block (the dominant committer cost on shielded
blocks) depends only on the starting note count, not the frontier's hashes, so
the finalized write loop now does a one-block look-ahead and runs the next
block's Sapling/Orchard hashing on idle cores while the current block commits;
the committer then only applies the precomputed subtree roots onto the frontier
(`update_trees_parallel_with` in `zebra-chain`). The precompute is applied only
if its starting tree size still matches at commit time and otherwise falls back
to inline hashing, so it affects only speed, never the resulting tree. This cuts
the committer's tree-update cost by ~54% (12.5 → 5.7 ms/block) where the
committer is the bottleneck.

### Changed

@@ -87,6 +98,23 @@ and this project adheres to [Semantic Versioning](https://semver.org).
duplicate-peer handling scaffolding.
- Added bounded Zakura header-sync stream-5 wire messages, stateless header
validation, and the default `network.zakura.header_sync` config surface.
- Verified-commitment-trees fast checkpoint sync. Below the last checkpoint Zebra
now fetches per-block Sapling/Orchard commitment roots from peers over a new
header-sync-aligned `tree_aux` stream, verifies each root against the node's own
checkpoint-committed block headers (the ZIP-221 ChainHistory MMR plus direct
below-Heartwood/below-NU5 checks), and folds the verified roots into the anchor
set and history tree — skipping the per-block note-commitment frontier recompute
that dominates checkpoint-sync CPU cost. At the checkpoint handoff an embedded
final frontier, verified against that block's proven root, is written as the tip
treestate and normal per-block recompute resumes. The resulting consensus state
is byte-identical to the legacy recompute; a root that cannot be obtained or
verified is rejected rather than recomputed against the stale frozen frontier, so
no untrusted data can influence consensus state. This is the default whenever
`consensus.checkpoint_sync = true` on a network with an embedded handoff frontier
(Mainnet), for both Archive and Pruned storage modes. The new
`consensus.disable_vct_fast_sync` flag (default `false`) keeps checkpoint sync
enabled while forcing the legacy per-block recompute. Bumps the state database
format to 27.3.0 (new column families only; no data migration).
- Include the `zebra-rollback-state` and `zebra-prune-state` utilities alongside
`zebrad` in release Docker images and Docker CI builds.
- Use the `5.0.0-rc.3` release identity for this fork's v5 rollback build.
Expand Down
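The note-commitment precompute entry in the changelog above states that a precomputed result is applied only if its starting tree size still matches at commit time, and that the committer otherwise falls back to inline hashing. A toy sketch of that guard follows. It is not Zebra's code: `Precomputed`, `hash_leaves`, `precompute`, and `commit_block` are hypothetical names, a flat list stands in for the frontier, and string formatting stands in for Merkle hashing.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Precomputed:
    start_size: int    # tree size the look-ahead assumed
    hashes: List[str]  # stand-in for the precomputed subtree roots


def hash_leaves(start_size, leaves):
    """Toy hashing that depends only on leaf position and leaf value."""
    return [f"h({start_size + i}:{leaf})" for i, leaf in enumerate(leaves)]


def precompute(start_size, leaves):
    """Look-ahead work that could run on an idle core."""
    return Precomputed(start_size, hash_leaves(start_size, leaves))


def commit_block(tree, leaves, pre: Optional[Precomputed] = None):
    """Append one block's leaves, using `pre` only if it is still valid."""
    if pre is not None and pre.start_size == len(tree):
        tree.extend(pre.hashes)
        return "precomputed"
    # Stale or missing precompute: recompute inline at the real tree size.
    tree.extend(hash_leaves(len(tree), leaves))
    return "inline"
```

Because the stale case recomputes from the actual tree size, both paths yield the same tree, which is the property the changelog entry describes as affecting only speed.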
1 change: 1 addition & 0 deletions Cargo.lock
@@ -9492,6 +9492,7 @@ dependencies = [
"zebra-chain",
"zebra-node-services",
"zebra-rpc",
"zebra-state",
]

[[package]]
53 changes: 53 additions & 0 deletions HOL_HEDGE_RESULTS.md
@@ -0,0 +1,53 @@
# Hedged head-of-line download — benchmark results

**Branch:** `proto-hedged-hol-download` (binary `/root/wal-bench/zebrad-hedge`, built `--features commit-metrics`).
**PR:** #151 (`hedge-hol-rebased` → `proto-note-tree-precompute`).
**Method:** single binary, toggled by environment variable: `SYNC_HOL_HEDGE_FANOUT=0` (baseline) vs `=4` (hedged). Runs used **random DNS peers**, because the stall only manifests with diverse, churning peers; a pinned peer never reproduces it. Runs were interleaved off/on/off/on/off/on so that temporal peer drift hits both arms equally. Each run is a 7.5-min fork window from the 1,707,210 snapshot, with `checkpoint_verify=1500` and `download=150`. Harness: `hedge_ab.sh`.

## Per-run data (N=3 per arm)

| run | Δblocks (7.5 min) | stall intervals (blk/s<2 & in_flight>1000) | reg_miss | all_missing | route_hedge win | steady blk/s |
|---|---|---|---|---|---|---|
| OFF-1 | 10,899 | 18/84 | 97,676 | 380,894 | — | 27.9 |
| OFF-2 | 10,539 | 21/83 | 93,517 | 364,439 | — | 25.4 |
| OFF-3 | 22,438 | 9/84 | 50,060 | 195,060 | — | 68.9 |
| **ON-1** | 18,316 | 7/81 | 43,328 | 62,469 | 17,990 | 45.4 |
| **ON-2** | 19,434 | 12/84 | 44,729 | 57,630 | 18,295 | 50.8 |
| **ON-3** | 28,213 | 3/84 | 0 | 7 | 0 (inert) | 64.7 |
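
The stall-interval column counts sampling intervals that satisfy the condition in its header (`blk/s<2 & in_flight>1000`). A minimal sketch of that count, assuming each sample is a `(blk_per_s, in_flight)` pair; the sample format and the name `count_stalls` are hypothetical and are not taken from `hedge_ab.sh`:

```python
def count_stalls(samples, max_rate=2.0, min_in_flight=1000):
    """Count intervals with a low commit rate despite many in-flight requests."""
    return sum(
        1
        for blk_per_s, in_flight in samples
        if blk_per_s < max_rate and in_flight > min_in_flight
    )


# Four made-up intervals; only the first and last satisfy both conditions.
example = [(0.5, 1500), (30.0, 1500), (1.0, 200), (1.9, 1001)]
stalled = count_stalls(example)
```

Requiring both conditions keeps idle periods (low rate, little in flight) from being counted as head-of-line stalls.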

## Medians (OFF → ON)

| metric | OFF | ON | Δ |
|---|---|---|---|
| stall intervals | 18 | 7 | **−61%** |
| reg_miss | 93,517 | 43,328 | **−54%** |
| **all_missing** (stale-marker fails) | 364,439 | 57,630 | **−84%** |
| Δblocks per 7.5-min window | 10,899 | 19,434 | **+78%** |
| steady-state blk/s | 27.9 | 50.8 | +82% |
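
The medians and percentage changes can be reproduced directly from the per-run table. The sketch below copies those values; the `RUNS` layout and the `summarize` helper are illustrative names, not part of the benchmark harness.

```python
from statistics import median

# Per-run values from the per-run table, as (OFF runs, ON runs).
RUNS = {
    "stall_intervals": ([18, 21, 9], [7, 12, 3]),
    "reg_miss": ([97_676, 93_517, 50_060], [43_328, 44_729, 0]),
    "all_missing": ([380_894, 364_439, 195_060], [62_469, 57_630, 7]),
    "delta_blocks": ([10_899, 10_539, 22_438], [18_316, 19_434, 28_213]),
    "steady_blk_s": ([27.9, 25.4, 68.9], [45.4, 50.8, 64.7]),
}


def summarize(runs):
    """Return {metric: (off_median, on_median, percent_change)}."""
    summary = {}
    for metric, (off, on) in runs.items():
        off_med, on_med = median(off), median(on)
        change = round(100 * (on_med - off_med) / off_med)
        summary[metric] = (off_med, on_med, change)
    return summary


for metric, (off_med, on_med, change) in summarize(RUNS).items():
    print(f"{metric}: {off_med} -> {on_med} ({change:+d}%)")
```

With N=3 per arm the median is simply the middle run, so a single outlier draw such as OFF-3 or ON-3 does not move it.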

## Verdict — the hedge works, and is well-behaved

**It does exactly what it was designed to do, confirmed across N=3:**

1. **Active when peers thrash.** On the two bad draws (ON-1, ON-2), the baseline equivalent would have accumulated ~360k `all_missing` synthetic failures; the hedge fired (`dispatch` ~140k per-peer, **~18k wins**), bypassing the stale "missing" inventory markers and delivering the head block from a real ready peer. Result: `all_missing` −84%, `reg_miss` −54%, stalls cut, ~+78% more blocks committed in the window.

2. **Inert when peers are clean.** ON-3 drew a healthy peer set with **0 registry-misses**: the hedge stayed at 0 dispatches and performed comparably to the best baseline draw (steady state 64.7 blk/s for ON-3 vs 68.9 for OFF-3; Δblocks 28,213 vs 22,438). No measurable overhead and no clear regression when there is nothing to fix.

**This contradicts the handoff's "honest risk"** that #105 might already absorb the stall: on bad draws the baseline still thrashed hard (364k `all_missing`, 18–21 stall intervals), and the hedge sharply reduced it. #105 (let markers age out during the 2s backoff) and the hedge (bypass the markers entirely on retry) are complementary — the hedge attacks the residual cases #105 doesn't resolve within budget.

## Mechanism evidence (`route_hedge` counters, bad-draw arms)

- `dispatch` ~136k–147k per-peer requests, `win` ~18k, `exhausted` ~117k–127k. So ~12–13% of per-peer hedge requests delivered the block; the rest exhausted and fell back to the unchanged #105 backoff. Even at that win rate, `all_missing` collapsed −84% and throughput rose — because each win resolves a head-of-line block that would otherwise have stalled the strictly-ordered commit for a full 2s backoff cycle.
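
The quoted win rate follows from the rounded counter values. A quick check, using the approximate figures from the bullet above (the counters are rounded to the nearest thousand, so the result is only indicative):

```python
def win_rate(wins, dispatches):
    """Fraction of per-peer hedge requests that delivered the block."""
    return wins / dispatches


WINS = 18_000                               # ~18k wins
DISPATCH_LOW, DISPATCH_HIGH = 136_000, 147_000  # ~136k to ~147k dispatches

lowest = win_rate(WINS, DISPATCH_HIGH)   # most dispatches, lowest rate
highest = win_rate(WINS, DISPATCH_LOW)   # fewest dispatches, highest rate
print(f"win rate between {lowest:.1%} and {highest:.1%}")
```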

## Honest caveats

- **Throughput is peer-draw-dependent.** The +78% Δblocks / +82% steady-state are real within these runs but confounded by which peers each window drew (the ON arm happened to also escape cold-start faster on average). The robust, mechanism-level claims are the **`all_missing` −84%** and the **18k hedge wins** — these directly measure the stale-marker bypass and are not throughput-noise.
- N=3 per arm. More runs would tighten the medians, but the direction is consistent: every ON arm has far lower `all_missing` than every OFF arm. Note that ON-3's near-zero count (7) reflects a clean peer draw with no registry misses, not hedge activity.

## DoS posture (unchanged from the design)

Scoped to the single head-of-line hash in `registry_miss_retry`; small fanout (4) clamped to ready peers; `select_random_ready_peers` (random, load-ignoring, broadcast stance); losers cancelled on first win; no new retry budget; counts as one request against `download_concurrency_limit`.
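
The fanout behaviour described above (small fanout clamped to ready peers, random selection, losers cancelled on first win) can be illustrated with a short `asyncio` sketch. This is not the Rust implementation in the PR: `hedged_fetch`, `fetch`, and `demo_fetch` are hypothetical names, and the concurrency-limit accounting is not modelled.

```python
import asyncio
import random


async def hedged_fetch(block_hash, ready_peers, fetch, fanout=4, rng=random):
    """Request one block from up to `fanout` random ready peers; first hit wins.

    `fetch(peer, block_hash)` is a hypothetical coroutine that returns the
    block, or None if that peer does not have it.
    """
    if fanout <= 0 or not ready_peers:
        return None  # hedge disabled or nobody ready: caller keeps its backoff
    # Clamp the fanout to the ready set, then pick peers uniformly at random.
    chosen = rng.sample(list(ready_peers), min(fanout, len(ready_peers)))
    tasks = [asyncio.ensure_future(fetch(peer, block_hash)) for peer in chosen]
    try:
        for next_done in asyncio.as_completed(tasks):
            try:
                block = await next_done
            except Exception:
                continue  # one failing peer does not end the hedge
            if block is not None:
                return block  # first win
        return None  # every hedged request was exhausted
    finally:
        # Cancel the losers and wait for the cancellations to settle.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)


async def demo_fetch(peer, block_hash):
    """Toy peer: a (delay_seconds, has_block) pair."""
    delay, has_block = peer
    await asyncio.sleep(delay)
    return f"block:{block_hash}" if has_block else None
```

For example, `asyncio.run(hedged_fetch("abc", [(0.01, False), (0.02, True), (0.5, True)], demo_fetch))` returns as soon as the second peer answers and cancels the slow third request instead of waiting for it.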

## Recommendation

Ship-worthy as a prototype. The lever is validated: it converts stale-marker `all_missing` failures into deliveries and reduces head-of-line stalls, with no measurable overhead on the one clean draw (ON-3). Next tuning (per handoff §7): cut the 2s backoff for hedged retries (the fanout already addresses the root cause, so the wait is mostly wasted), and/or add latency-aware peer selection to raise the floor.