perf(state): skip the WAL for checkpoint-verified block commits #90

Draft
p0mvn wants to merge 1 commit into ironwood-main from roman/checkpoint-skip-wal


p0mvn commented Jun 15, 2026


Checkpoint-verified blocks are reproducible from the hard-coded checkpoint hashes, so they do not need to be durable across a crash: if a crash discards memtable contents that have not yet been flushed to SST files, sync simply resumes from the recovered finalized tip. Writing them through RocksDB's write-ahead log (WAL) therefore only adds write amplification (every block's bytes are written once to the WAL and again on memtable flush).

Write checkpoint-verified blocks with `WriteOptions::disable_wal(true)`, keyed off the `FinalizableBlock::Checkpoint` variant. Semantically-verified (contextual) blocks are not reproducible and keep full WAL durability.
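
As a rough illustration of that selection, here is a minimal, dependency-free sketch. The enum and `WriteOpts` struct are hypothetical stand-ins, not the actual Zebra or RocksDB types: the real `FinalizableBlock` variants carry block data, and the real change sets the flag through `WriteOptions::disable_wal(true)`.

```rust
/// Hypothetical stand-in for Zebra's `FinalizableBlock`; the real variants carry block data.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FinalizableBlock {
    /// Verified against a hard-coded checkpoint hash, so it can be re-synced after a crash.
    Checkpoint,
    /// Semantically (contextually) verified; not reproducible from checkpoints alone.
    Contextual,
}

/// Hypothetical stand-in for the single RocksDB write option this change touches.
#[derive(Debug, Default, PartialEq, Eq)]
struct WriteOpts {
    disable_wal: bool,
}

/// Pick write options from the block's verification path:
/// only checkpoint-verified blocks skip the WAL.
fn write_opts_for(block: FinalizableBlock) -> WriteOpts {
    WriteOpts {
        disable_wal: matches!(block, FinalizableBlock::Checkpoint),
    }
}
```

Keying the decision off the variant keeps the durability policy in one place, so a contextual block cannot skip the WAL by accident.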

To keep crash recovery consistent:

  • Enable RocksDB `atomic_flush`. With the WAL off and ~31 column families, non-atomic flush could persist families to different heights and recover into a torn state; atomic flush guarantees the recovered state is always a consistent prefix across every column family.

  • Before the first WAL-backed write that follows any WAL-less writes, flush the memtables. This makes earlier checkpoint blocks durable before a non-reproducible block (and its WAL entry) depends on them, so a later crash cannot replay the WAL on top of a state missing those blocks. The flush flag is restored if the flush itself fails, so a retry re-attempts it.
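
The boundary flush can be sketched with a toy model. Everything here is an assumption for illustration: `ToyDb`, its fields, and the `flush_ok` parameter (which simulates a flush failure) are hypothetical, block heights stand in for block data, and a single list replaces the ~31 column families that atomic flush keeps in step.

```rust
/// Toy model of the finalized state. All names are hypothetical.
#[derive(Debug, Default)]
struct ToyDb {
    /// Heights flushed to SST files; these survive a hard crash.
    durable: Vec<u32>,
    /// Heights only in the memtable; a hard crash discards them.
    memtable: Vec<u32>,
    /// Heights with a WAL entry; these are replayed after a crash.
    wal: Vec<u32>,
    /// True when WAL-less writes have happened since the last boundary flush.
    wal_less_pending: bool,
}

impl ToyDb {
    /// Move the memtable into durable storage; `flush_ok = false` simulates a failure.
    fn flush(&mut self, flush_ok: bool) -> Result<(), &'static str> {
        if !flush_ok {
            return Err("flush failed");
        }
        self.durable.append(&mut self.memtable);
        // Simplification: flushed data no longer needs its WAL entries.
        self.wal.clear();
        Ok(())
    }

    /// Commit one block. `skip_wal` is true for checkpoint-verified blocks.
    fn commit(&mut self, height: u32, skip_wal: bool, flush_ok: bool) -> Result<(), &'static str> {
        if !skip_wal && self.wal_less_pending {
            // First WAL-backed write after WAL-less writes: flush first.
            self.wal_less_pending = false;
            if let Err(e) = self.flush(flush_ok) {
                // Restore the flag so a retry attempts the flush again.
                self.wal_less_pending = true;
                return Err(e);
            }
        }
        if skip_wal {
            self.wal_less_pending = true;
        } else {
            self.wal.push(height);
        }
        self.memtable.push(height);
        Ok(())
    }

    /// State after a hard crash: durable data plus WAL replay.
    fn recover_after_crash(&self) -> Vec<u32> {
        let mut recovered = self.durable.clone();
        recovered.extend(self.wal.iter().copied());
        recovered
    }
}
```

In this model, committing heights 1 and 2 without the WAL and then height 3 with it recovers to 1, 2, 3 after a crash. Without the flush in `commit`, recovery would replay height 3 on top of an empty state.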

Graceful shutdown already flushes memtables, so WAL-less writes survive a clean exit; only a hard crash discards the unflushed tail.

Adds a low-level unit test for the WAL-less write path and the boundary flush, and a criterion benchmark (`checkpoint_commit`) that A/Bs the commit path with the WAL enabled vs disabled. An in-memory, benchmark-only override (`set_checkpoint_skip_wal`, gated behind `proptest-impl`) drives the A/B; the WAL-skip is always on in production.

Benchmark note: because Zebra runs the WAL asynchronously (RocksDB `WriteOptions::sync` defaults to false and is never set), there is no per-block fsync to remove. The micro-benchmark over a short chain of small blocks shows a modest (~4%) improvement; the saving is write-amplification bound, so it scales with block size and is expected to be larger over real mainnet blocks and longer runs.
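
To make the write-amplification argument concrete, here is a back-of-the-envelope sketch. It is a simplified model, not a measurement: the hypothetical `bytes_written` helper counts only the WAL append and the memtable flush, and ignores compaction, WAL record framing, and index overhead.

```rust
/// Approximate bytes written to disk for a run of blocks.
/// Simplified model: each block's bytes are written once on memtable flush,
/// plus once more to the WAL when the WAL is enabled.
fn bytes_written(block_sizes: &[u64], wal_enabled: bool) -> u64 {
    let payload: u64 = block_sizes.iter().sum();
    if wal_enabled {
        payload * 2
    } else {
        payload
    }
}
```

Under this model the bytes saved equal the total block payload, which is why the saving grows with block size.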

p0mvn marked this pull request as ready for review June 15, 2026 14:33
p0mvn marked this pull request as draft June 16, 2026 14:39