perf(state): skip the WAL for checkpoint-verified block commits by p0mvn · Pull Request #90 · valargroup/zebra

p0mvn · 2026-06-15T05:13:39Z

Checkpoint-verified blocks are reproducible from the hard-coded checkpoint hashes, so their durability does not need to survive a crash: if a crash discards memtable contents that have not yet been flushed to SST files, sync simply resumes from the recovered finalized tip. Writing them through RocksDB's write-ahead log therefore only adds write amplification (every block's bytes are written once to the WAL and again on memtable flush).
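As a rough illustration of the write-amplification argument, the sketch below counts only the two writes named above: the WAL append and the memtable flush. It is a back-of-the-envelope model, not a measurement. The function name is hypothetical, and compaction, WAL record framing, and index/filter overhead are deliberately ignored.

```rust
/// Bytes written to disk for `blocks` blocks of `block_bytes` each,
/// counting only the WAL append and the memtable flush.
/// Assumption: compaction, WAL framing, and SST metadata are ignored.
pub fn bytes_written(blocks: u64, block_bytes: u64, wal_enabled: bool) -> u64 {
    // Every block's bytes reach an SST file once, on memtable flush.
    let flushed = blocks * block_bytes;
    // With the WAL enabled, the same bytes are also appended to the log.
    let logged = if wal_enabled { flushed } else { 0 };
    flushed + logged
}
```

Under this simplified model, disabling the WAL halves the bytes written for checkpoint-verified blocks, whatever the block size.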

Write checkpoint-verified blocks with `WriteOptions::disable_wal(true)`, keyed off the `FinalizableBlock::Checkpoint` variant. Semantically-verified (contextual) blocks are not reproducible and keep full WAL durability.
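A minimal sketch of the variant-keyed decision, using stand-in types so it compiles on its own. `WritePolicy` and `write_policy_for` are hypothetical names, the `Contextual` variant name is an assumption, and the real `FinalizableBlock` variants carry block data that is omitted here. In the actual change the resulting flag would feed RocksDB's write options rather than a local struct.

```rust
/// Stand-in for Zebra's `FinalizableBlock`; payloads are omitted.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FinalizableBlock {
    /// Verified against a hard-coded checkpoint hash, so it can be
    /// downloaded and verified again after a crash.
    Checkpoint,
    /// Semantically (contextually) verified; not reproducible.
    /// Assumption: the variant name is illustrative.
    Contextual,
}

/// Hypothetical mirror of the single write option this change touches.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct WritePolicy {
    pub disable_wal: bool,
}

/// Only reproducible (checkpoint-verified) blocks skip the WAL.
pub fn write_policy_for(block: FinalizableBlock) -> WritePolicy {
    match block {
        FinalizableBlock::Checkpoint => WritePolicy { disable_wal: true },
        FinalizableBlock::Contextual => WritePolicy { disable_wal: false },
    }
}
```

Matching exhaustively on the enum, rather than passing a boolean around, means any new variant has to make an explicit durability choice before the code compiles.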

To keep crash recovery consistent:

- Enable RocksDB `atomic_flush`. With the WAL off and ~31 column families, non-atomic flush could persist families to different heights and recover into a torn state; atomic flush guarantees the recovered state is always a consistent prefix across every column family.
- Before the first WAL-backed write that follows any WAL-less writes, flush the memtables. This makes earlier checkpoint blocks durable before a non-reproducible block (and its WAL entry) depends on them, so a later crash cannot replay the WAL on top of a state missing those blocks. The flush flag is restored if the flush itself fails, so a retry re-attempts it.
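The boundary-flush rule can be sketched with a toy in-memory model that tracks block heights only and performs no I/O. Everything here (`ToyDb`, its fields, `fail_next_flush`) is hypothetical and is not Zebra's or RocksDB's API. A single memtable stands in for all column families, so every flush in the model is atomic by construction.

```rust
use std::collections::BTreeSet;

/// Toy store: heights only, no real I/O.
#[derive(Debug, Default)]
pub struct ToyDb {
    memtable: BTreeSet<u32>, // volatile: lost on a hard crash
    sst: BTreeSet<u32>,      // durable: written by a flush
    wal: Vec<u32>,           // durable: replayed on recovery
    needs_flush_before_wal_write: bool,
    /// Hook that makes the next flush fail once.
    pub fail_next_flush: bool,
}

impl ToyDb {
    /// Moves the whole memtable to durable storage in one step.
    fn flush(&mut self) -> Result<(), &'static str> {
        if self.fail_next_flush {
            self.fail_next_flush = false;
            return Err("simulated flush failure");
        }
        self.sst.append(&mut self.memtable);
        self.wal.clear(); // flushed data no longer needs its log entries
        Ok(())
    }

    /// Commits one block; `skip_wal` is true for checkpoint-verified blocks.
    pub fn commit(&mut self, height: u32, skip_wal: bool) -> Result<(), &'static str> {
        if skip_wal {
            self.needs_flush_before_wal_write = true;
        } else if self.needs_flush_before_wal_write {
            // Boundary flush: clear the flag, but restore it if the flush
            // fails so that a retry attempts the flush again.
            self.needs_flush_before_wal_write = false;
            if let Err(e) = self.flush() {
                self.needs_flush_before_wal_write = true;
                return Err(e);
            }
        }
        self.memtable.insert(height);
        if !skip_wal {
            self.wal.push(height);
        }
        Ok(())
    }

    /// Hard crash: drops the memtable and returns the heights recovered
    /// from SST files plus WAL replay, in ascending order.
    pub fn crash_and_recover(&mut self) -> Vec<u32> {
        self.memtable.clear();
        self.needs_flush_before_wal_write = false;
        let mut state = self.sst.clone();
        state.extend(self.wal.iter().copied());
        state.into_iter().collect()
    }
}
```

In this model, committing heights 1 and 2 without the WAL and then height 3 with it recovers all three after a crash, because the boundary flush makes 1 and 2 durable before 3 is logged. Without that flush, recovery would replay height 3 on top of a state missing 1 and 2.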

Graceful shutdown already flushes memtables, so WAL-less writes survive a clean exit; only a hard crash discards the unflushed tail.

Adds a low-level unit test for the WAL-less write path and the boundary flush, and a criterion benchmark (`checkpoint_commit`) that A/Bs the commit path with the WAL enabled vs disabled. An in-memory, benchmark-only override (`set_checkpoint_skip_wal`, gated behind `proptest-impl`) drives the A/B; the WAL-skip is always on in production.
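One way such a feature-gated override could be shaped is sketched below. The static flag and the `checkpoint_skip_wal` getter are assumptions made for illustration, and the override in this PR may instead live on the database handle. Only the `set_checkpoint_skip_wal` name and the `proptest-impl` gate come from the description above.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// In-memory switch; defaults to skipping the WAL for checkpoint blocks.
/// Assumption: a process-wide static is used only to keep the sketch small.
static CHECKPOINT_SKIP_WAL: AtomicBool = AtomicBool::new(true);

/// Hypothetical getter read by the commit path.
pub fn checkpoint_skip_wal() -> bool {
    CHECKPOINT_SKIP_WAL.load(Ordering::Relaxed)
}

/// Benchmark-only override. It is compiled out unless the `proptest-impl`
/// feature is enabled, so production builds cannot re-enable the WAL for
/// checkpoint-verified blocks.
#[cfg(feature = "proptest-impl")]
pub fn set_checkpoint_skip_wal(skip: bool) {
    CHECKPOINT_SKIP_WAL.store(skip, Ordering::Relaxed);
}
```

Gating the setter rather than the flag keeps the production code path identical in both builds, so the benchmark exercises the same commit logic that ships.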

Benchmark note: because Zebra runs the WAL asynchronously (RocksDB `WriteOptions::sync` defaults to false and is never set), there is no per-block fsync to remove. The micro-benchmark over a short chain of small blocks shows a modest (~4%) improvement; the saving is write-amplification bound, so it scales with block size and is expected to be larger over real mainnet blocks and longer runs.


p0mvn marked this pull request as ready for review June 15, 2026 14:33

p0mvn marked this pull request as draft June 16, 2026 14:39

p0mvn wants to merge 1 commit into ironwood-main from roman/checkpoint-skip-wal
