perf(state): skip the WAL for checkpoint-verified block commits #90
Draft
p0mvn wants to merge 1 commit into
Checkpoint-verified blocks are reproducible from the hard-coded checkpoint hashes, so they do not need to be durable across a crash: if a crash discards memtable contents that have not yet been flushed to SST files, sync simply resumes from the recovered finalized tip. Writing them through RocksDB's write-ahead log therefore only adds write amplification (every block's bytes are written once to the WAL and again on memtable flush).

Write checkpoint-verified blocks with `WriteOptions::disable_wal(true)`, keyed off the `FinalizableBlock::Checkpoint` variant. Semantically-verified (contextual) blocks are not reproducible and keep full WAL durability.

To keep crash recovery consistent:

- Enable RocksDB `atomic_flush`. With the WAL off and ~31 column families, non-atomic flush could persist families to different heights and recover into a torn state; atomic flush guarantees the recovered state is always a consistent prefix across every column family.
- Before the first WAL-backed write that follows any WAL-less writes, flush the memtables. This makes earlier checkpoint blocks durable before a non-reproducible block (and its WAL entry) depends on them, so a later crash cannot replay the WAL on top of a state missing those blocks. The flush flag is restored if the flush itself fails, so a retry re-attempts it.

Graceful shutdown already flushes memtables, so WAL-less writes survive a clean exit; only a hard crash discards the unflushed tail.

Adds a low-level unit test for the WAL-less write path and the boundary flush, and a criterion benchmark (`checkpoint_commit`) that A/Bs the commit path with the WAL enabled vs disabled. An in-memory, benchmark-only override (`set_checkpoint_skip_wal`, gated behind `proptest-impl`) drives the A/B; the WAL-skip is always on in production.

Benchmark note: because Zebra runs the WAL asynchronously (RocksDB `WriteOptions::sync` defaults to false and is never set), there is no per-block fsync to remove. The micro-benchmark over a short chain of small blocks shows a modest (~4%) improvement; the saving is write-amplification bound, so it scales with block size and is expected to be larger over real mainnet blocks and longer runs.
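
For reviewers, the WAL-skip decision and the boundary flush can be sketched with a toy in-memory store. This is only a model under stated assumptions, not the code in this PR: `BlockKind`, `ToyStore`, `Committer`, `fail_next_flush`, and `recover_after_crash` are hypothetical names, the store has a single "column family", and a crash is modelled as "flushed data plus WAL replay". The real change uses `WriteOptions::disable_wal(true)` and RocksDB's own flush.

```rust
/// Hypothetical stand-in for the two `FinalizableBlock` variants.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum BlockKind {
    Checkpoint,
    Contextual,
}

/// Toy model of a store: `flushed` plays the role of SST files, and
/// `memtable` holds unflushed writes tagged with whether they hit the WAL.
#[derive(Default)]
pub struct ToyStore {
    flushed: Vec<u32>,
    memtable: Vec<(u32, bool)>,
    /// Test hook (assumption): make the next flush fail once.
    pub fail_next_flush: bool,
}

impl ToyStore {
    fn write(&mut self, height: u32, skip_wal: bool) {
        self.memtable.push((height, !skip_wal));
    }

    fn flush(&mut self) -> Result<(), String> {
        if std::mem::take(&mut self.fail_next_flush) {
            return Err("simulated flush failure".to_string());
        }
        // Move every memtable entry into the durable set.
        let drained = self.memtable.drain(..).map(|(height, _)| height);
        self.flushed.extend(drained);
        Ok(())
    }

    /// Heights visible after a hard crash: flushed data plus WAL replay.
    /// WAL-less memtable entries are lost.
    pub fn recover_after_crash(&self) -> Vec<u32> {
        let replayed = self
            .memtable
            .iter()
            .filter(|(_, in_wal)| *in_wal)
            .map(|(height, _)| *height);
        self.flushed.iter().copied().chain(replayed).collect()
    }
}

pub struct Committer {
    pub store: ToyStore,
    /// True while the memtable may hold writes that the WAL does not cover.
    pending_wal_less: bool,
}

impl Committer {
    pub fn new() -> Self {
        Self { store: ToyStore::default(), pending_wal_less: false }
    }

    pub fn commit(&mut self, height: u32, kind: BlockKind) -> Result<(), String> {
        // Only reproducible (checkpoint-verified) blocks skip the WAL.
        let skip_wal = kind == BlockKind::Checkpoint;

        // Boundary flush: the first WAL-backed write after WAL-less writes
        // must first make the earlier blocks durable.
        if !skip_wal && std::mem::take(&mut self.pending_wal_less) {
            if let Err(e) = self.store.flush() {
                // Restore the flag so a retry attempts the flush again.
                self.pending_wal_less = true;
                return Err(e);
            }
        }

        self.store.write(height, skip_wal);
        if skip_wal {
            self.pending_wal_less = true;
        }
        Ok(())
    }
}
```

In this model, committing two checkpoint blocks and crashing recovers nothing, which is acceptable because sync resumes from the recovered tip. Committing a contextual block afterwards flushes the checkpoint blocks first, so a crash then recovers all three heights rather than a WAL entry on top of missing parents. If the flush fails, the commit returns an error and the next attempt flushes again.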