Skip to content

prototype(state): any-order finalized commit pipeline (no gain in CPU-bound regime) [DO NOT MERGE]#129

Draft
p0mvn wants to merge 16 commits into
perf-parallel-writer-serializationfrom
proto-any-order-pipeline
Draft

prototype(state): any-order finalized commit pipeline (no gain in CPU-bound regime) [DO NOT MERGE]#129
p0mvn wants to merge 16 commits into
perf-parallel-writer-serializationfrom
proto-any-order-pipeline

Conversation

@p0mvn

@p0mvn p0mvn commented Jun 18, 2026

Copy link
Copy Markdown

⛔ DO NOT MERGE — draft / parked experiment (kept for reference)

This is a negative-but-useful result, opened as a draft on top of #128 so the work and its
measurements aren't lost. It is not intended for merge in the current regime. See
ANY_ORDER_COMMIT_DESIGN.md §7d (added here) for the full write-up.

Motivation

The "any-order commit" idea: let the per-block finalized commit's independent work (serialization +
disk write) run while the next block's chained treestate is computed, so the writer pipelines instead of
doing everything serially per block. The hypothesis was ~1.5–1.8× on the writer.

What this does

Splits the finalized (checkpoint) writer loop into two threads:

  • Stage A (new compute thread): compute_finalized — the ordered, chained treestate computation
    (note-commitment tree update, ZIP-244 commitment check, history-tree push). Threads the note + history
    trees in memory from one iteration to the next, so it never reads a not-yet-written DB tip.
  • Stage B (existing writer thread): finish_pipelined — batch build + RocksDB write +
    set_finalized_tip, strictly in height order.

ChainTipSender stays in Stage B (no Clone needed); the finalized receiver is moved into Stage A via
mem::replace. A write error signals Stage A (AtomicBool) to drop its threaded trees and re-seed from
the reset tip; Stage B re-checks each block's height against the real DB tip before writing. Also includes
the feature-gated commit-metrics instrumentation (writer wait/busy + a new compute-stage histogram)
used to measure it — off by default, zero production overhead.

Result — no throughput gain (CPU-bound), ~10% slower

Matched-height benchmark, mainnet heavy region 1.72M→1.73M, pinned peer, Zakura off, 8-core box, vs the
#128 binary:

metric #128 (serial) this PR (pipeline)
throughput 29.5 blk/s 26.4 blk/s
writer busy 25.9 ms (compute+write) 18.2 ms (write only)
Stage A compute — (inline) 17.7 ms (concurrent)
writer wait (idle) 7.8 ms 19.6 ms
writer cycle (busy+wait) 33.7 ms/blk 37.8 ms/blk
CPU 7.75 / 8 cores 7.14 / 8 cores
downloads in-flight 1550 1357 (buffer full → not peer-starved)

Why: after the cyc1–7 + #128 wins, the heavy region is already CPU-saturated (~7.75/8 cores) with
downloads fully buffered. Splitting commit across two threads adds no cores — it redistributes the same
work. Stage A (the un-parallelizable tree chain) becomes the gating stage, so the writer idles waiting on
it (7.8→19.6 ms), and the cross-thread handoff + shared commit pool contention leave cores more idle.
Work-conservation: at ~N/N cores, wall ≥ total_work / N regardless of how the commit is partitioned.

Why it may still be useful (why this is parked, not closed)

The design is correct and validated — it's just dominated by the CPU-saturation constraint here. If
total CPU work later drops enough to un-saturate the box (e.g. native ZIP-244 digests removing the per-tx
to_librustzcash reparse), or the node runs on production-class hardware with more cores, the commit
chain could gate again with spare cores to overlap onto — exactly the regime where this pays off.
Reviving it from this branch is then cheap. Treat it as a measured "shelf it for now" rather than a dead
end.

Test evidence

  • cargo test -p zebra-state --lib finalized_state::tests46/46 pass (the split is
    behaviour-preserving; commit_finalized now delegates through compute_finalized + finish_pipelined).
  • Clean differential checkpoint sync 1.707M → 1.737M: every block committed with no commitment errors
    and no state resets — each block's history root validates against the in-memory threaded trees, so the
    treestate threading is proven correct on the happy path — and the on-shutdown DB-format integrity check
    passed.
  • cargo check -p zebra-state -p zebra-consensus -p zebrad --features commit-metrics clean.

Caveats / review surface (if ever un-parked)

The error/reset path, the contextual (non-finalized) finalization handoff, and shutdown draining are
not exercised by a clean checkpoint-sync differential test — that's the review surface for any future
merge attempt.

AI disclosure

Implemented and benchmarked with Claude Code (Claude Opus). The contributor is the responsible author.

p0mvn added 16 commits June 17, 2026 18:57
Break the single-core checkpoint-sync commit ceiling (~17-22 blk/s) by
parallelizing the note-commitment-tree work on the finalized writer.

- Part 1: overlap update_trees_parallel with the chain-history commitment
  check via rayon::in_place_scope_fifo (the check reads only the parent
  history tree, so they are independent).
- Part 3: add a generic parallel batch frontier append
  (zebra-chain/src/parallel/batch_frontier.rs) and wire it into the Sapling
  and Orchard note-commitment trees via NoteCommitmentTree::append_batch.
  Byte-identical to sequential append (differential proptests + known-answer
  vectors); subtree completion tracking preserved.
- Gate per-block commit-phase timing histograms (zebra.state.write.*) behind
  a new 'commit-metrics' feature, off by default.

Measured: update_trees ~30 -> ~16.5 ms/block; throughput ~2x; CPU ~1.7 -> ~3.3/8.
A peer returning zero (or multiple) blocks for a single-hash request hit a
hard assert_eq! in the download path, panicking and killing the node. Treat
a non-single/unavailable response as a retryable DownloadFailed error so the
block is retried from another peer instead — a remote peer must not be able
to crash us.
In the checkpoint-commit path, the ZIP-244 chain-history commitment check
(`block_commitment_is_valid_for_chain_history`) runs on the single finalized
writer thread. Its cost is dominated by `Block::auth_data_root`, which computes
each transaction's authorizing-data digest (re-serializing the transaction and
BLAKE2b-hashing its authorizing data, scaling with shielded I/O).

This was a plain serial `iter().map()`, so on heavy shielded-spam blocks
(~1.72M+ on mainnet) it grew to the largest single component of the serial
commit (measured ~90-123 ms/block, exceeding the already-parallel note-
commitment tree update). The per-transaction digests are independent, so
compute them across the rayon pool. `collect` into a Vec preserves transaction
order, so the resulting Merkle root is byte-identical to the sequential version.

Measured (checkpoint sync from a 1.7M snapshot, Zakura-off, pinned peer,
height-aligned bins): commitment_check -27% to -47% in the heavy region,
heavy-region throughput ~7.3 -> 9.3 blk/s (~1.27x).

Consensus-neutral: same digests, same order, same root. Verified by a
differential proptest (parallel vs sequential, byte-identical root).
Adds a `zebra.state.write.block_tx_count` histogram (gated behind the existing
`commit-metrics` feature, zero overhead in production) recording each committed
block's transaction count. Used to attribute checkpoint-commit phase times to
block composition during perf investigation.
The checkpoint-commit treestate computation (note-commitment tree update and
the ZIP-244 auth-data-root commitment check) runs on the single finalized
writer thread. Both are internally parallel via rayon, but on the global rayon
pool they contend with each other and with the block download/verification
pipeline (equihash, batch signature/proof verification). In-node this left the
tree-update burst at only ~1.6 effective cores despite scaling ~7x in isolation.

Run both operations inside a dedicated rayon thread pool so the commit-stage
crypto has its own workers, isolated from the verifier's global-pool work. The
commit burst and the download/verify feed alternate, so the burst can claim the
otherwise-idle cores instead of fighting the verifier for global-pool workers.

No behavior change: the computation is identical, only the thread pool it runs
on differs.
…oint verifier

The ZIP-244 authorizing-data commitment (Block::auth_data_root, a per-transaction
auth digest) is one of the two dominant serial costs of the finalized committer
on heavy shielded blocks. Unlike the note-commitment tree update it depends only
on the block's own transactions, not chain state, so it can be computed ahead of
the committer.

Computing it inline in `check_block` does NOT help: the checkpoint verifier is
wrapped in a tower `Buffer` (single worker), and `check_block` runs on that
serialized path, so the work just moves to another single-threaded stage.

Instead, compute it in the per-block task the verifier already `tokio::spawn`s to
commit each verified block. That task runs off the buffer worker, one per block,
so many blocks' auth digests are computed concurrently (via `spawn_blocking`),
overlapping with and ahead of the single-threaded committer. The committer uses
the precomputed value (carried on `SemanticallyVerifiedBlock::auth_data_root`),
falling back to computing it when absent. Only Nu5-onward blocks bind the auth
data in their block commitment.

Consensus-neutral: the value is byte-identical to recomputing it at commit time;
an end-to-end differential mainnet sync is the proof, since a wrong auth data root
fails the commitment check and rejects the block.
…mit-metrics)

Adds commit-metrics-gated histograms for the synchronous, single-threaded work
the checkpoint verifier does on the tower Buffer worker per block (queue_block /
check_block: equihash, precompute, merkle; and process_checkpoint_range) to
localize the feed bottleneck once the committer is no longer the limiter.
Computing a v5+ transaction's txid (`Transaction::hash`) and its ZIP-244
authorizing-data digest (`auth_digest`) each independently convert the whole
transaction to its librustzcash representation (re-serialize + re-parse), which
dominates the per-transaction cost on heavy shielded blocks. The checkpoint
commit path paid this twice: once building the transaction hashes in
`CheckpointVerifiedBlock::new`, and again computing the auth data root.

Add `Transaction::txid_and_auth_digest`, which performs one conversion and
returns both. `SemanticallyVerifiedBlock::with_hash` now computes the transaction
hashes and the auth data root together from that single shared conversion (the
auth digest is nearly free once the txid is computed), so the auth data root is
carried on the block and the separate per-block conversion in the checkpoint
verifier's commit task is removed.

Byte-identical to the separate computations (differential proptest
`txid_and_auth_digest_matches_separate`); an end-to-end mainnet sync is the
consensus proof.
…nload tasks

The checkpoint verifier runs inside a tower Buffer (single worker), so the
per-block precomputation it does in CheckpointVerifiedBlock construction (the
per-transaction txids and auth data root, which re-serialize each transaction
via librustzcash and dominate the cost on heavy shielded blocks) is serialized
across all blocks. With the committer no longer the bottleneck, this serial
"feed" stage became the throughput limit.

Move that precomputation into the syncer's per-block download tasks, which run
concurrently (one tokio task per in-flight block). For checkpoint-height blocks
the download task builds the CheckpointVerifiedBlock (via with_hash, in
spawn_blocking) and sends the new Request::CommitCheckpointPrecomputed; the
checkpoint verifier's new call_precomputed path runs all the same validity
checks (height, proof of work, Merkle root) on the prebuilt block but skips the
now-already-done precomputation.

The verifier's Service<Arc<Block>> path is unchanged and remains the fallback for
inbound/gossip blocks, so behavior is identical for non-syncer callers.
Validation is unchanged; an end-to-end mainnet sync is the consensus proof.
…gain, CPU-bound)

Split the finalized writer into a two-thread pipeline:
- Stage A (new compute thread): the ordered, chained treestate computation
  (compute_finalized — note-tree update, ZIP-244 commitment check, history-tree
  push), threading the note + history trees in memory from each iteration to the
  next so the compute never reads a not-yet-written tip.
- Stage B (existing writer thread): finish_pipelined — batch build + RocksDB
  write + set_finalized_tip, in height order.

ChainTipSender stays in Stage B (no Clone needed); the finalized receiver is moved
into Stage A via mem::replace. A write error signals Stage A (AtomicBool) to drop
its threaded trees and re-seed from the reset tip; Stage B re-checks block height
against the real db tip before writing.

Verified correct: 46/46 finalized_state unit tests; clean checkpoint sync
1.707M->1.737M with no commitment errors or resets (every block's history root
validated against the threaded trees), shutdown DB-format integrity check passed.

MEASURED: no throughput gain, ~10% slower in the heavy region (29.5->26.4 blk/s).
The region is already CPU-saturated (~7.75/8 cores, downloads buffered), so
splitting the commit adds no cores: Stage A (the un-parallelizable tree chain)
becomes the gating stage and Stage B idles waiting on it (writer cycle 33.7->37.8
ms/block), with extra COMMIT_COMPUTE_POOL contention. The bottleneck is now total
CPU work, not the serial commit stage. Kept as a prototype for reference; not
proposed for merge. See ANY_ORDER_COMMIT_DESIGN.md section 7d.
@v12-auditor

v12-auditor Bot commented Jun 18, 2026

Copy link
Copy Markdown

Note

Complete: Audit complete. No review-worthy issues remain after automatic triage. Three findings were auto-invalidated.

Open the full results here.

Analyzed 11 files, diff 6cbb939...65cec65.

@github-actions

Copy link
Copy Markdown

PR titles must follow Conventional Commits format. Your title needs a small adjustment.

Unknown release type "prototype" found in pull request title "prototype(state): any-order finalized commit pipeline (no gain in CPU-bound regime) [DO NOT MERGE]".

Available types:
 - feat
 - fix
 - perf
 - refactor
 - build
 - chore
 - docs
 - test
 - ci
 - style
 - revert
 - release

See the contribution guide for details.

@p0mvn p0mvn force-pushed the perf-parallel-writer-serialization branch from 6cbb939 to 092f440 Compare June 19, 2026 14:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant