feat(perf): add private Zakura cohort deploy and bench tooling #291
* feat(network): add private Zakura dev-network cohorts

  Zakura (v2) dev nodes bootstrap from a few peers, but discovery and gossip then pull in the rest of the network, so concurrent experiments by different team members collide. This adds an opt-in way to run an isolated v2 overlay on top of unchanged mainnet consensus.

  Add an optional `[network.zakura] dev_network` cohort tag. When set, a node only forms Zakura connections with peers sharing the same tag: its `ZakuraHandshakeConfig` advertises `ZakuraNetworkId::Configured` and a cohort-derived `chain_id` (`derive_dev_chain_id` = domain-separated blake2b over the real genesis hash and the tag).

  Both fields are already validated in the Zakura control handshake, the legacy->Zakura upgrade prelude, and signed discovery records (records copy `handshake.chain_id`), so isolation propagates with no wire-format change and no new reject code:

  - a public mainnet node (`network_id = Mainnet`) and a dev node reject each other with `WrongNetwork` and stay on legacy;
  - different cohorts (both `Configured`, different `chain_id`) reject with `WrongChain`, and cross-cohort discovery records fail import;
  - same-tag peers match and form the private overlay.

  The tag only scopes the Zakura v2 overlay. Genesis, network magic, and activation heights are unchanged, so a cohort node validates the real chain. `chain_id` here is a Zakura peer-matching id only; block validation uses the unchanged network parameters. Has no effect unless `v2_p2p` is enabled.

  The legacy->Zakura upgrade path rebuilds the handshake config from scratch, so it now also threads the cohort tag; otherwise a tagged node would advertise the cohort id on its native endpoint but the plain id during upgrades and could not upgrade with its own cohort.
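The isolation rule described in this commit can be illustrated with a small Python sketch. It is not the Rust implementation: the personalization label `DOMAIN`, the input ordering inside the hash, the all-zero placeholder genesis hash, and the use of the plain genesis hash as the untagged `chain_id` are all assumptions made for illustration. Only the names `derive_dev_chain_id`, `WrongNetwork`, and `WrongChain` come from the commit message.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

# Hypothetical domain-separation label; the real constant is not stated in
# the commit message. BLAKE2b personalization is limited to 16 bytes.
DOMAIN = b"ZakuraDevCohort"


def derive_dev_chain_id(genesis_hash: bytes, tag: str) -> bytes:
    """Domain-separated BLAKE2b over the genesis hash and the cohort tag."""
    h = hashlib.blake2b(digest_size=32, person=DOMAIN)
    h.update(genesis_hash)
    h.update(tag.encode("utf-8"))
    return h.digest()


@dataclass(frozen=True)
class Handshake:
    network_id: str  # "Mainnet" or "Configured"
    chain_id: bytes


def handshake_for(genesis_hash: bytes, dev_network: Optional[str]) -> Handshake:
    """Build the advertised handshake fields for a tagged or untagged node."""
    if dev_network is None:
        # Assumption: an untagged node advertises an id tied to the real chain.
        return Handshake("Mainnet", genesis_hash)
    return Handshake("Configured", derive_dev_chain_id(genesis_hash, dev_network))


def check(local: Handshake, remote: Handshake) -> str:
    """Compare the two validated fields in order: network first, then chain."""
    if local.network_id != remote.network_id:
        return "WrongNetwork"
    if local.chain_id != remote.chain_id:
        return "WrongChain"
    return "Accepted"


GENESIS = bytes(32)  # placeholder, not the real mainnet genesis hash
public = handshake_for(GENESIS, None)
cohort_a = handshake_for(GENESIS, "cohort-a")
cohort_b = handshake_for(GENESIS, "cohort-b")
print(check(public, cohort_a), check(cohort_a, cohort_b), check(cohort_a, cohort_a))
```

Because both fields are ordinary handshake values, the sketch needs no extra message type or reject code, which mirrors the "no wire-format change" point above.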
* docs(network): add Zakura dev-network README Add a developer-facing README for the `[network.zakura] dev_network` cohort feature next to the code (`zebra-network/src/zakura/README.md`): what it does, the network_id/chain_id mechanism, the code map, and how to test. Complements the operator-facing book guide.
Add deploy/deployer/, a dependency-free Python CLI for deploying zebrad to a fleet of nodes and collecting their logs. It reuses the build -> scp -> install-with-.bak -> systemctl restart -> rollback pattern from deploy-zcashd-compat.yml, generalized to a dynamic multi-node TOML config (per-node name / ssh_string / commit).

- build: resolve each node's commit to a SHA (origin/<ref> fallback) and build zebrad once per unique SHA into a cache, using a throwaway detached git worktree so the caller's working tree is untouched. Honors CARGO_TARGET_DIR.
- deploy: distribute the binary + a rendered zebrad.toml (deterministic [tracing] log_file) + a systemd unit, install with a .bak backup, restart, and roll back on failure. Nodes run in parallel.
- status: per-node service state + version.
- logs fetch / logs follow: copy or live-tail the deterministic log file by name.

Remote scripts are fed on stdin (bash -s) rather than as ssh args, since ssh flattens argv and would collapse `bash -c '<multi-word>'` to its first word.
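Two of the ideas above, building once per unique SHA and feeding remote scripts on stdin, can be sketched as follows. This is a minimal illustration and not the code in deploy.py: the function names, the node dictionaries, the host names, and the fake ref-to-SHA table are hypothetical. Only the per-node fields (name / ssh_string / commit) and the `bash -s` approach come from the commit message.

```python
from collections import defaultdict


def group_by_sha(nodes, resolve):
    """Map each unique commit SHA to the node names that need that build."""
    builds = defaultdict(list)
    for node in nodes:
        builds[resolve(node["commit"])].append(node["name"])
    return dict(builds)


def remote_invocation(ssh_string, script):
    """Return (argv, stdin_text) for running a multi-line script on a node.

    The script travels on stdin to `bash -s`, so ssh never has to re-join
    a quoted multi-word argument into one remote command string.
    """
    return ["ssh", ssh_string, "bash", "-s"], script


# Hypothetical fleet: two refs that resolve to the same SHA, one that differs.
nodes = [
    {"name": "serve-1", "ssh_string": "ops@serve-1.example", "commit": "main"},
    {"name": "serve-2", "ssh_string": "ops@serve-2.example", "commit": "abc1234"},
    {"name": "bench-1", "ssh_string": "ops@bench-1.example", "commit": "feature-x"},
]
fake_refs = {"main": "abc1234", "abc1234": "abc1234", "feature-x": "def5678"}

builds = group_by_sha(nodes, fake_refs.__getitem__)
print(len(builds), "builds for", len(nodes), "nodes")

argv, stdin_text = remote_invocation(
    nodes[0]["ssh_string"],
    "set -euo pipefail\nsystemctl is-active zebrad\n",
)
print(argv)
```

A real caller would pass `argv` and `stdin_text` to something like `subprocess.run(argv, input=stdin_text, text=True)`; the sketch stops short of that so it can run without a reachable host.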
…267)

Tooling for deterministic, isolated sync-perf benchmarking against a private Zakura cohort of operator-controlled serving nodes.

- deploy/runner/: perf.sh is the single entry point over the seed -> peers -> freeze -> run -> analyze lifecycle; feed_run.sh forks a snapshot and samples five-category bottleneck metrics into a CSV; feed_analyze.py does steady-state bottleneck attribution; a tokenized bench config and serving-fleet template; the live metrics dashboard; and a runbook. Host-specific paths and the cohort identity live in an untracked cohort.env.
- deploy/deployer/deploy.py: render a fleet-wide [network.zakura] cohort block, storage_mode / v2_p2p / legacy_p2p, and an optional metrics endpoint plus tracing filter; `status` now reports the running git commit and configured ref.
- make/perf.mk and Makefile: `make perf-*` wrappers (build-local, run, analyze, dashboard, verify-isolation, seed/peers/freeze/status).
…refinements (#271)

Adds per-phase commit-pipeline instrumentation (behind the `commit-metrics` feature) and the perf-harness refinements that consume it, on top of the private-cohort bench harness (#267).

State instrumentation (commit-metrics gated; default builds unaffected):

- finalized_state.rs: time the history-tree MMR push (history_push phase).
- block.rs: time spent-UTXO reads, address-balance reads, and batch build, and record committed batch size in bytes (write-throughput / per-block size).
- disk_db.rs: DiskWriteBatch::size_in_bytes() accessor for the above; gated to its sole caller so a default build does not flag it dead.

Perf harness (deploy/runner + make/perf.mk):

- feed_run.sh: stop a prior same-label run before re-forking, so a rerun can't collide on the fork/ports (RocksDB "multiple active instances" -> instant exit).
- perf.sh: add a `logs` subcommand (drift-spam filtered; RAW=1 for all).
- zebra-metrics-dashboard.py: smooth throughput over a 20s trailing window so batched checkpoint commits don't alias into false spike/zero; add --smooth.
- feed_analyze.py: commit-pipeline attribution over the new metric set.
- make/perf.mk: perf-logs target; default stop height -> 1.9M; wider analyze window.

Verified: `cargo check -p zebra-state` passes with and without commit-metrics; rustfmt clean. Instrumentation exercised live by the bench binary emitting these metrics during cohort sync.

AI disclosure: drafted with Claude Code (Claude Opus), covering instrumentation, harness changes, and this description.
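The dashboard smoothing mentioned above can be illustrated with a short sketch. It is one plausible reading of "smooth throughput over a 20s trailing window", not the code in zebra-metrics-dashboard.py: the function names, the `(time, height)` sample shape, and the synthetic 400-blocks-every-4-seconds commit pattern are assumptions chosen to show the aliasing.

```python
def instantaneous_rate(samples):
    """Blocks per second between consecutive (t_seconds, height) samples."""
    rates = [0.0]
    for (t0, h0), (t1, h1) in zip(samples, samples[1:]):
        rates.append((h1 - h0) / (t1 - t0))
    return rates


def smoothed_rate(samples, window=20.0):
    """Blocks per second over a trailing window ending at each sample."""
    rates = []
    start = 0
    for t, h in samples:
        # Advance to the oldest sample that is still inside the window.
        while t - samples[start][0] > window:
            start += 1
        t0, h0 = samples[start]
        rates.append((h - h0) / (t - t0) if t > t0 else 0.0)
    return rates


# Synthetic run: sampled once per second, but heights only advance in
# batches of 400 blocks every 4 seconds (hypothetical checkpoint batching).
samples = [(t, 400 * (t // 4)) for t in range(41)]

raw = instantaneous_rate(samples)
smooth = smoothed_rate(samples, window=20.0)
print("raw rates seen:", sorted(set(raw)))
print("smoothed rate at end:", smooth[-1])
```

With per-sample differences the series alternates between zero and a spike, even though the underlying sync speed is steady; averaging over a window several batches long reports the steady rate instead.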
@cursor review
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
Bugbot Autofix is ON. A cloud agent has been kicked off to fix the reported issue.
Reviewed by Cursor Bugbot for commit 151bd15.

Motivation
Sync-perf debugging from the 1.8M snapshot was unreliable when bench nodes peered with the shared public Zakura fleet: serving peers could change mid-run as other engineers redeployed nodes, making results hard to compare. This PR brings the previously merged private Zakura cohort, deployer, and benchmark harness work onto `ironwood-main` so an isolated, operator-controlled benchmark can be run against a deterministic serving cohort.

This also restores a local, ad-hoc multi-node `zebrad` deploy path. The existing automated deploy path targets a single hard-coded compatibility host in CI; the new deployer can build a chosen commit, push it to a fleet, run it as a service, and fetch or follow node logs.

Solution

- Add `deploy/deployer/`, a dependency-free Python 3.11+ CLI that builds `zebrad` once per unique commit SHA, deploys binaries/config/systemd units to multiple SSH targets, reports status, and fetches or follows deterministic log files.
- Add `deploy/runner/` benchmark tooling for the private-cohort lifecycle: seed serving nodes, render peer configs, freeze serving nodes, run local sync benchmarks, analyze CSV bottleneck metrics, show a live dashboard, verify isolation, and collect logs.
- Add `make perf-*` wrappers and extend the deployer renderer for `[network.zakura]`, storage mode, V2/legacy P2P toggles, metrics endpoints, tracing filters, and running commit reporting.
- Add `commit-metrics`-gated state instrumentation for commit pipeline phases and batch bytes, consumed by the benchmark analyzer/dashboard while leaving default builds unaffected.

Tests

Source PR validation, carried over from #264 and #267:

- … `zakura.p2p.conn.*` metrics with zero `wrong_network`/`wrong_chain` rejects.
- … `legacy peers = 0`, VCT fast path active), producing CSV output and `analyze` results.
- … `zebrad 5.0.0-rc.3`, a second build reused the cache, deploy succeeded to both nodes, `status` reported active services, `logs fetch` copied deterministic log files, and `logs follow` streamed live output.
- `make -n` confirmed each perf target maps to the intended `perf.sh` command.