prototype(state): any-order finalized commit pipeline (no gain in CPU-bound regime) [DO NOT MERGE]#129
Draft
p0mvn wants to merge 16 commits into
Draft
Conversation
Break the single-core checkpoint-sync commit ceiling (~17-22 blk/s) by parallelizing the note-commitment-tree work on the finalized writer. - Part 1: overlap update_trees_parallel with the chain-history commitment check via rayon::in_place_scope_fifo (the check reads only the parent history tree, so they are independent). - Part 3: add a generic parallel batch frontier append (zebra-chain/src/parallel/batch_frontier.rs) and wire it into the Sapling and Orchard note-commitment trees via NoteCommitmentTree::append_batch. Byte-identical to sequential append (differential proptests + known-answer vectors); subtree completion tracking preserved. - Gate per-block commit-phase timing histograms (zebra.state.write.*) behind a new 'commit-metrics' feature, off by default. Measured: update_trees ~30 -> ~16.5 ms/block; throughput ~2x; CPU ~1.7 -> ~3.3/8.
A peer returning zero (or multiple) blocks for a single-hash request hit a hard assert_eq! in the download path, panicking and killing the node. Treat a non-single/unavailable response as a retryable DownloadFailed error so the block is retried from another peer instead — a remote peer must not be able to crash us.
In the checkpoint-commit path, the ZIP-244 chain-history commitment check (`block_commitment_is_valid_for_chain_history`) runs on the single finalized writer thread. Its cost is dominated by `Block::auth_data_root`, which computes each transaction's authorizing-data digest (re-serializing the transaction and BLAKE2b-hashing its authorizing data, scaling with shielded I/O). This was a plain serial `iter().map()`, so on heavy shielded-spam blocks (~1.72M+ on mainnet) it grew to the largest single component of the serial commit (measured ~90-123 ms/block, exceeding the already-parallel note- commitment tree update). The per-transaction digests are independent, so compute them across the rayon pool. `collect` into a Vec preserves transaction order, so the resulting Merkle root is byte-identical to the sequential version. Measured (checkpoint sync from a 1.7M snapshot, Zakura-off, pinned peer, height-aligned bins): commitment_check -27% to -47% in the heavy region, heavy-region throughput ~7.3 -> 9.3 blk/s (~1.27x). Consensus-neutral: same digests, same order, same root. Verified by a differential proptest (parallel vs sequential, byte-identical root).
Adds a `zebra.state.write.block_tx_count` histogram (gated behind the existing `commit-metrics` feature, zero overhead in production) recording each committed block's transaction count. Used to attribute checkpoint-commit phase times to block composition during perf investigation.
The checkpoint-commit treestate computation (note-commitment tree update and the ZIP-244 auth-data-root commitment check) runs on the single finalized writer thread. Both are internally parallel via rayon, but on the global rayon pool they contend with each other and with the block download/verification pipeline (equihash, batch signature/proof verification). In-node this left the tree-update burst at only ~1.6 effective cores despite scaling ~7x in isolation. Run both operations inside a dedicated rayon thread pool so the commit-stage crypto has its own workers, isolated from the verifier's global-pool work. The commit burst and the download/verify feed alternate, so the burst can claim the otherwise-idle cores instead of fighting the verifier for global-pool workers. No behavior change: the computation is identical, only the thread pool it runs on differs.
…oint verifier The ZIP-244 authorizing-data commitment (Block::auth_data_root, a per-transaction auth digest) is one of the two dominant serial costs of the finalized committer on heavy shielded blocks. Unlike the note-commitment tree update it depends only on the block's own transactions, not chain state, so it can be computed ahead of the committer. Computing it inline in `check_block` does NOT help: the checkpoint verifier is wrapped in a tower `Buffer` (single worker), and `check_block` runs on that serialized path, so the work just moves to another single-threaded stage. Instead, compute it in the per-block task the verifier already `tokio::spawn`s to commit each verified block. That task runs off the buffer worker, one per block, so many blocks' auth digests are computed concurrently (via `spawn_blocking`), overlapping with and ahead of the single-threaded committer. The committer uses the precomputed value (carried on `SemanticallyVerifiedBlock::auth_data_root`), falling back to computing it when absent. Only Nu5-onward blocks bind the auth data in their block commitment. Consensus-neutral: the value is byte-identical to recomputing it at commit time; an end-to-end differential mainnet sync is the proof, since a wrong auth data root fails the commitment check and rejects the block.
…mit-metrics) Adds commit-metrics-gated histograms for the synchronous, single-threaded work the checkpoint verifier does on the tower Buffer worker per block (queue_block / check_block: equihash, precompute, merkle; and process_checkpoint_range) to localize the feed bottleneck once the committer is no longer the limiter.
Computing a v5+ transaction's txid (`Transaction::hash`) and its ZIP-244 authorizing-data digest (`auth_digest`) each independently convert the whole transaction to its librustzcash representation (re-serialize + re-parse), which dominates the per-transaction cost on heavy shielded blocks. The checkpoint commit path paid this twice: once building the transaction hashes in `CheckpointVerifiedBlock::new`, and again computing the auth data root. Add `Transaction::txid_and_auth_digest`, which performs one conversion and returns both. `SemanticallyVerifiedBlock::with_hash` now computes the transaction hashes and the auth data root together from that single shared conversion (the auth digest is nearly free once the txid is computed), so the auth data root is carried on the block and the separate per-block conversion in the checkpoint verifier's commit task is removed. Byte-identical to the separate computations (differential proptest `txid_and_auth_digest_matches_separate`); an end-to-end mainnet sync is the consensus proof.
…nload tasks The checkpoint verifier runs inside a tower Buffer (single worker), so the per-block precomputation it does in CheckpointVerifiedBlock construction (the per-transaction txids and auth data root, which re-serialize each transaction via librustzcash and dominate the cost on heavy shielded blocks) is serialized across all blocks. With the committer no longer the bottleneck, this serial "feed" stage became the throughput limit. Move that precomputation into the syncer's per-block download tasks, which run concurrently (one tokio task per in-flight block). For checkpoint-height blocks the download task builds the CheckpointVerifiedBlock (via with_hash, in spawn_blocking) and sends the new Request::CommitCheckpointPrecomputed; the checkpoint verifier's new call_precomputed path runs all the same validity checks (height, proof of work, Merkle root) on the prebuilt block but skips the now-already-done precomputation. The verifier's Service<Arc<Block>> path is unchanged and remains the fallback for inbound/gossip blocks, so behavior is identical for non-syncer callers. Validation is unchanged; an end-to-end mainnet sync is the consensus proof.
…rep in dedicated pool
…gain, CPU-bound) Split the finalized writer into a two-thread pipeline: - Stage A (new compute thread): the ordered, chained treestate computation (compute_finalized — note-tree update, ZIP-244 commitment check, history-tree push), threading the note + history trees in memory from each iteration to the next so the compute never reads a not-yet-written tip. - Stage B (existing writer thread): finish_pipelined — batch build + RocksDB write + set_finalized_tip, in height order. ChainTipSender stays in Stage B (no Clone needed); the finalized receiver is moved into Stage A via mem::replace. A write error signals Stage A (AtomicBool) to drop its threaded trees and re-seed from the reset tip; Stage B re-checks block height against the real db tip before writing. Verified correct: 46/46 finalized_state unit tests; clean checkpoint sync 1.707M->1.737M with no commitment errors or resets (every block's history root validated against the threaded trees), shutdown DB-format integrity check passed. MEASURED: no throughput gain, ~10% slower in the heavy region (29.5->26.4 blk/s). The region is already CPU-saturated (~7.75/8 cores, downloads buffered), so splitting the commit adds no cores: Stage A (the un-parallelizable tree chain) becomes the gating stage and Stage B idles waiting on it (writer cycle 33.7->37.8 ms/block), with extra COMMIT_COMPUTE_POOL contention. The bottleneck is now total CPU work, not the serial commit stage. Kept as a prototype for reference; not proposed for merge. See ANY_ORDER_COMMIT_DESIGN.md section 7d.
|
Note Complete: Audit complete. No review-worthy issues remain after automatic triage. Three findings were auto-invalidated. Open the full results here. Analyzed 11 files, diff |
|
PR titles must follow Conventional Commits format. Your title needs a small adjustment. See the contribution guide for details. |
6cbb939 to
092f440
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
⛔ DO NOT MERGE — draft / parked experiment (kept for reference)
This is a negative-but-useful result, opened as a draft on top of #128 so the work and its
measurements aren't lost. It is not intended for merge in the current regime. See
ANY_ORDER_COMMIT_DESIGN.md§7d (added here) for the full write-up.Motivation
The "any-order commit" idea: let the per-block finalized commit's independent work (serialization +
disk write) run while the next block's chained treestate is computed, so the writer pipelines instead of
doing everything serially per block. The hypothesis was ~1.5–1.8× on the writer.
What this does
Splits the finalized (checkpoint) writer loop into two threads:
compute_finalized— the ordered, chained treestate computation(note-commitment tree update, ZIP-244 commitment check, history-tree push). Threads the note + history
trees in memory from one iteration to the next, so it never reads a not-yet-written DB tip.
finish_pipelined— batch build + RocksDB write +set_finalized_tip, strictly in height order.ChainTipSenderstays in Stage B (noCloneneeded); the finalized receiver is moved into Stage A viamem::replace. A write error signals Stage A (AtomicBool) to drop its threaded trees and re-seed fromthe reset tip; Stage B re-checks each block's height against the real DB tip before writing. Also includes
the feature-gated
commit-metricsinstrumentation (writer wait/busy + a new compute-stage histogram)used to measure it — off by default, zero production overhead.
Result — no throughput gain (CPU-bound), ~10% slower
Matched-height benchmark, mainnet heavy region 1.72M→1.73M, pinned peer, Zakura off, 8-core box, vs the
#128 binary:
Why: after the cyc1–7 + #128 wins, the heavy region is already CPU-saturated (~7.75/8 cores) with
downloads fully buffered. Splitting commit across two threads adds no cores — it redistributes the same
work. Stage A (the un-parallelizable tree chain) becomes the gating stage, so the writer idles waiting on
it (7.8→19.6 ms), and the cross-thread handoff + shared commit pool contention leave cores more idle.
Work-conservation: at ~N/N cores, wall ≥ total_work / N regardless of how the commit is partitioned.
Why it may still be useful (why this is parked, not closed)
The design is correct and validated — it's just dominated by the CPU-saturation constraint here. If
total CPU work later drops enough to un-saturate the box (e.g. native ZIP-244 digests removing the per-tx
to_librustzcashreparse), or the node runs on production-class hardware with more cores, the commitchain could gate again with spare cores to overlap onto — exactly the regime where this pays off.
Reviving it from this branch is then cheap. Treat it as a measured "shelf it for now" rather than a dead
end.
Test evidence
cargo test -p zebra-state --lib finalized_state::tests→ 46/46 pass (the split isbehaviour-preserving;
commit_finalizednow delegates throughcompute_finalized+finish_pipelined).and no state resets — each block's history root validates against the in-memory threaded trees, so the
treestate threading is proven correct on the happy path — and the on-shutdown DB-format integrity check
passed.
cargo check -p zebra-state -p zebra-consensus -p zebrad --features commit-metricsclean.Caveats / review surface (if ever un-parked)
The error/reset path, the contextual (non-finalized) finalization handoff, and shutdown draining are
not exercised by a clean checkpoint-sync differential test — that's the review surface for any future
merge attempt.
AI disclosure
Implemented and benchmarked with Claude Code (Claude Opus). The contributor is the responsible author.