prototype(state): any-order finalized commit pipeline (no gain in CPU-bound regime) [DO NOT MERGE] by p0mvn · Pull Request #129 · valargroup/zebra

p0mvn · 2026-06-18T14:12:35Z

⛔ DO NOT MERGE — draft / parked experiment (kept for reference)

This is a negative-but-useful result, opened as a draft on top of #128 so the work and its
measurements aren't lost. It is not intended for merge in the current regime. See
ANY_ORDER_COMMIT_DESIGN.md §7d (added here) for the full write-up.

Motivation

The "any-order commit" idea: let the per-block finalized commit's independent work (serialization +
disk write) run while the next block's chained treestate is computed, so the writer pipelines instead of
doing everything serially per block. The hypothesis was ~1.5–1.8× on the writer.

What this does

Splits the finalized (checkpoint) writer loop into two threads:

Stage A (new compute thread): compute_finalized — the ordered, chained treestate computation
(note-commitment tree update, ZIP-244 commitment check, history-tree push). Threads the note + history
trees in memory from one iteration to the next, so it never reads a not-yet-written DB tip.
Stage B (existing writer thread): finish_pipelined — batch build + RocksDB write +
set_finalized_tip, strictly in height order.

ChainTipSender stays in Stage B (no Clone needed); the finalized receiver is moved into Stage A via
mem::replace. A write error signals Stage A (AtomicBool) to drop its threaded trees and re-seed from
the reset tip; Stage B re-checks each block's height against the real DB tip before writing. Also includes
the feature-gated commit-metrics instrumentation (writer wait/busy + a new compute-stage histogram)
used to measure it — off by default, zero production overhead.

Result — no throughput gain (CPU-bound), ~10% slower

Matched-height benchmark, mainnet heavy region 1.72M→1.73M, pinned peer, Zakura off, 8-core box, vs the
#128 binary:

metric	#128 (serial)	this PR (pipeline)
throughput	29.5 blk/s	26.4 blk/s
writer busy	25.9 ms (compute+write)	18.2 ms (write only)
Stage A compute	— (inline)	17.7 ms (concurrent)
writer wait (idle)	7.8 ms	19.6 ms
writer cycle (busy+wait)	33.7 ms/blk	37.8 ms/blk
CPU	7.75 / 8 cores	7.14 / 8 cores
downloads in-flight	1550	1357 (buffer full → not peer-starved)

Why: after the cyc1–7 + #128 wins, the heavy region is already CPU-saturated (~7.75/8 cores) with
downloads fully buffered. Splitting commit across two threads adds no cores — it redistributes the same
work. Stage A (the un-parallelizable tree chain) becomes the gating stage, so the writer idles waiting on
it (7.8→19.6 ms), and the cross-thread handoff + shared commit pool contention leave cores more idle.
Work-conservation: at ~N/N cores, wall ≥ total_work / N regardless of how the commit is partitioned.

Why it may still be useful (why this is parked, not closed)

The design is correct and validated — it's just dominated by the CPU-saturation constraint here. If
total CPU work later drops enough to un-saturate the box (e.g. native ZIP-244 digests removing the per-tx
to_librustzcash reparse), or the node runs on production-class hardware with more cores, the commit
chain could gate again with spare cores to overlap onto — exactly the regime where this pays off.
Reviving it from this branch is then cheap. Treat it as a measured "shelf it for now" rather than a dead
end.

Test evidence

cargo test -p zebra-state --lib finalized_state::tests → 46/46 pass (the split is
behaviour-preserving; commit_finalized now delegates through compute_finalized + finish_pipelined).
Clean differential checkpoint sync 1.707M → 1.737M: every block committed with no commitment errors
and no state resets — each block's history root validates against the in-memory threaded trees, so the
treestate threading is proven correct on the happy path — and the on-shutdown DB-format integrity check
passed.
cargo check -p zebra-state -p zebra-consensus -p zebrad --features commit-metrics clean.

Caveats / review surface (if ever un-parked)

The error/reset path, the contextual (non-finalized) finalization handoff, and shutdown draining are
not exercised by a clean checkpoint-sync differential test — that's the review surface for any future
merge attempt.

AI disclosure

Implemented and benchmarked with Claude Code (Claude Opus). The contributor is the responsible author.

Break the single-core checkpoint-sync commit ceiling (~17-22 blk/s) by parallelizing the note-commitment-tree work on the finalized writer. - Part 1: overlap update_trees_parallel with the chain-history commitment check via rayon::in_place_scope_fifo (the check reads only the parent history tree, so they are independent). - Part 3: add a generic parallel batch frontier append (zebra-chain/src/parallel/batch_frontier.rs) and wire it into the Sapling and Orchard note-commitment trees via NoteCommitmentTree::append_batch. Byte-identical to sequential append (differential proptests + known-answer vectors); subtree completion tracking preserved. - Gate per-block commit-phase timing histograms (zebra.state.write.*) behind a new 'commit-metrics' feature, off by default. Measured: update_trees ~30 -> ~16.5 ms/block; throughput ~2x; CPU ~1.7 -> ~3.3/8.

A peer returning zero (or multiple) blocks for a single-hash request hit a hard assert_eq! in the download path, panicking and killing the node. Treat a non-single/unavailable response as a retryable DownloadFailed error so the block is retried from another peer instead — a remote peer must not be able to crash us.

In the checkpoint-commit path, the ZIP-244 chain-history commitment check (`block_commitment_is_valid_for_chain_history`) runs on the single finalized writer thread. Its cost is dominated by `Block::auth_data_root`, which computes each transaction's authorizing-data digest (re-serializing the transaction and BLAKE2b-hashing its authorizing data, scaling with shielded I/O). This was a plain serial `iter().map()`, so on heavy shielded-spam blocks (~1.72M+ on mainnet) it grew to the largest single component of the serial commit (measured ~90-123 ms/block, exceeding the already-parallel note- commitment tree update). The per-transaction digests are independent, so compute them across the rayon pool. `collect` into a Vec preserves transaction order, so the resulting Merkle root is byte-identical to the sequential version. Measured (checkpoint sync from a 1.7M snapshot, Zakura-off, pinned peer, height-aligned bins): commitment_check -27% to -47% in the heavy region, heavy-region throughput ~7.3 -> 9.3 blk/s (~1.27x). Consensus-neutral: same digests, same order, same root. Verified by a differential proptest (parallel vs sequential, byte-identical root).

Adds a `zebra.state.write.block_tx_count` histogram (gated behind the existing `commit-metrics` feature, zero overhead in production) recording each committed block's transaction count. Used to attribute checkpoint-commit phase times to block composition during perf investigation.

The checkpoint-commit treestate computation (note-commitment tree update and the ZIP-244 auth-data-root commitment check) runs on the single finalized writer thread. Both are internally parallel via rayon, but on the global rayon pool they contend with each other and with the block download/verification pipeline (equihash, batch signature/proof verification). In-node this left the tree-update burst at only ~1.6 effective cores despite scaling ~7x in isolation. Run both operations inside a dedicated rayon thread pool so the commit-stage crypto has its own workers, isolated from the verifier's global-pool work. The commit burst and the download/verify feed alternate, so the burst can claim the otherwise-idle cores instead of fighting the verifier for global-pool workers. No behavior change: the computation is identical, only the thread pool it runs on differs.

…oint verifier The ZIP-244 authorizing-data commitment (Block::auth_data_root, a per-transaction auth digest) is one of the two dominant serial costs of the finalized committer on heavy shielded blocks. Unlike the note-commitment tree update it depends only on the block's own transactions, not chain state, so it can be computed ahead of the committer. Computing it inline in `check_block` does NOT help: the checkpoint verifier is wrapped in a tower `Buffer` (single worker), and `check_block` runs on that serialized path, so the work just moves to another single-threaded stage. Instead, compute it in the per-block task the verifier already `tokio::spawn`s to commit each verified block. That task runs off the buffer worker, one per block, so many blocks' auth digests are computed concurrently (via `spawn_blocking`), overlapping with and ahead of the single-threaded committer. The committer uses the precomputed value (carried on `SemanticallyVerifiedBlock::auth_data_root`), falling back to computing it when absent. Only Nu5-onward blocks bind the auth data in their block commitment. Consensus-neutral: the value is byte-identical to recomputing it at commit time; an end-to-end differential mainnet sync is the proof, since a wrong auth data root fails the commitment check and rejects the block.

…mit-metrics) Adds commit-metrics-gated histograms for the synchronous, single-threaded work the checkpoint verifier does on the tower Buffer worker per block (queue_block / check_block: equihash, precompute, merkle; and process_checkpoint_range) to localize the feed bottleneck once the committer is no longer the limiter.

Computing a v5+ transaction's txid (`Transaction::hash`) and its ZIP-244 authorizing-data digest (`auth_digest`) each independently convert the whole transaction to its librustzcash representation (re-serialize + re-parse), which dominates the per-transaction cost on heavy shielded blocks. The checkpoint commit path paid this twice: once building the transaction hashes in `CheckpointVerifiedBlock::new`, and again computing the auth data root. Add `Transaction::txid_and_auth_digest`, which performs one conversion and returns both. `SemanticallyVerifiedBlock::with_hash` now computes the transaction hashes and the auth data root together from that single shared conversion (the auth digest is nearly free once the txid is computed), so the auth data root is carried on the block and the separate per-block conversion in the checkpoint verifier's commit task is removed. Byte-identical to the separate computations (differential proptest `txid_and_auth_digest_matches_separate`); an end-to-end mainnet sync is the consensus proof.

…nload tasks The checkpoint verifier runs inside a tower Buffer (single worker), so the per-block precomputation it does in CheckpointVerifiedBlock construction (the per-transaction txids and auth data root, which re-serialize each transaction via librustzcash and dominate the cost on heavy shielded blocks) is serialized across all blocks. With the committer no longer the bottleneck, this serial "feed" stage became the throughput limit. Move that precomputation into the syncer's per-block download tasks, which run concurrently (one tokio task per in-flight block). For checkpoint-height blocks the download task builds the CheckpointVerifiedBlock (via with_hash, in spawn_blocking) and sends the new Request::CommitCheckpointPrecomputed; the checkpoint verifier's new call_precomputed path runs all the same validity checks (height, proof of work, Merkle root) on the prebuilt block but skips the now-already-done precomputation. The verifier's Service<Arc<Block>> path is unchanged and remains the fallback for inbound/gossip blocks, so behavior is identical for non-syncer callers. Validation is unchanged; an end-to-end mainnet sync is the consensus proof.

…mmit-metrics)

…rep in dedicated pool

…gain, CPU-bound) Split the finalized writer into a two-thread pipeline: - Stage A (new compute thread): the ordered, chained treestate computation (compute_finalized — note-tree update, ZIP-244 commitment check, history-tree push), threading the note + history trees in memory from each iteration to the next so the compute never reads a not-yet-written tip. - Stage B (existing writer thread): finish_pipelined — batch build + RocksDB write + set_finalized_tip, in height order. ChainTipSender stays in Stage B (no Clone needed); the finalized receiver is moved into Stage A via mem::replace. A write error signals Stage A (AtomicBool) to drop its threaded trees and re-seed from the reset tip; Stage B re-checks block height against the real db tip before writing. Verified correct: 46/46 finalized_state unit tests; clean checkpoint sync 1.707M->1.737M with no commitment errors or resets (every block's history root validated against the threaded trees), shutdown DB-format integrity check passed. MEASURED: no throughput gain, ~10% slower in the heavy region (29.5->26.4 blk/s). The region is already CPU-saturated (~7.75/8 cores, downloads buffered), so splitting the commit adds no cores: Stage A (the un-parallelizable tree chain) becomes the gating stage and Stage B idles waiting on it (writer cycle 33.7->37.8 ms/block), with extra COMMIT_COMPUTE_POOL contention. The bottleneck is now total CPU work, not the serial commit stage. Kept as a prototype for reference; not proposed for merge. See ANY_ORDER_COMMIT_DESIGN.md section 7d.

v12-auditor · 2026-06-18T14:12:43Z

Note

Complete: Audit complete. No review-worthy issues remain after automatic triage. Three findings were auto-invalidated.

Open the full results here.

Analyzed 11 files, diff 6cbb939...65cec65.

github-actions · 2026-06-18T14:12:51Z

PR titles must follow Conventional Commits format. Your title needs a small adjustment.

Unknown release type "prototype" found in pull request title "prototype(state): any-order finalized commit pipeline (no gain in CPU-bound regime) [DO NOT MERGE]".

Available types:
 - feat
 - fix
 - perf
 - refactor
 - build
 - chore
 - docs
 - test
 - ci
 - style
 - revert
 - release

See the contribution guide for details.

p0mvn added 16 commits June 17, 2026 18:57

address v12 bug

0432d8c

instrument verify->commit handoff: writer wait/busy/channel-depth (co…

a879b45

…mmit-metrics)

instrument write_block: batch_reads/prepare_batch + per-sub-batch (co…

3336b86

…mmit-metrics)

instrument value_pools + precommit_metrics sub-batches (commit-metrics)

168e4ae

perf(state): serialize raw transactions in parallel when writing blocks

db66c70

perf(state): compute block size in parallel + run block-write batch p…

30976c7

…rep in dedicated pool

p0mvn force-pushed the perf-parallel-writer-serialization branch from 6cbb939 to 092f440 Compare June 19, 2026 14:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

prototype(state): any-order finalized commit pipeline (no gain in CPU-bound regime) [DO NOT MERGE]#129

prototype(state): any-order finalized commit pipeline (no gain in CPU-bound regime) [DO NOT MERGE]#129
p0mvn wants to merge 16 commits into
perf-parallel-writer-serializationfrom
proto-any-order-pipeline

p0mvn commented Jun 18, 2026

Uh oh!

v12-auditor Bot commented Jun 18, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

p0mvn commented Jun 18, 2026

⛔ DO NOT MERGE — draft / parked experiment (kept for reference)

Motivation

What this does

Result — no throughput gain (CPU-bound), ~10% slower

Why it may still be useful (why this is parked, not closed)

Test evidence

Caveats / review surface (if ever un-parked)

AI disclosure

Uh oh!

v12-auditor Bot commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

v12-auditor Bot commented Jun 18, 2026 •

edited

Loading