feat(perf): add private Zakura cohort deploy and bench tooling #291

Merged
p0mvn merged 7 commits into ironwood-main from roman/ironwood-zakura-perf-deploy
Jun 27, 2026
@p0mvn p0mvn commented Jun 27, 2026

Motivation

Sync-perf debugging from the 1.8M snapshot was unreliable when bench nodes peered with the shared public Zakura fleet: serving peers could change mid-run as other engineers redeployed nodes, making results hard to compare. This PR brings the previously merged private Zakura cohort, deployer, and benchmark harness work onto ironwood-main so an isolated, operator-controlled benchmark can be run against a deterministic serving cohort.

This also restores a local, ad-hoc multi-node zebrad deploy path. The existing automated deploy path targets a single hard-coded compatibility host in CI; the new deployer can build a chosen commit, push it to a fleet, run it as a service, and fetch or follow node logs.

Solution

  • Add private Zakura dev-network cohort support and documentation, including cohort identity checks in the Zakura handshake so isolated test cohorts do not mix with public/default peers.
  • Add deploy/deployer/, a dependency-free Python 3.11+ CLI that builds zebrad once per unique commit SHA, deploys binaries/config/systemd units to multiple SSH targets, reports status, and fetches or follows deterministic log files.
  • Add deploy/runner/ benchmark tooling for the private-cohort lifecycle: seed serving nodes, render peer configs, freeze serving nodes, run local sync benchmarks, analyze CSV bottleneck metrics, show a live dashboard, verify isolation, and collect logs.
  • Add make perf-* wrappers and extend the deployer renderer for [network.zakura], storage mode, V2/legacy P2P toggles, metrics endpoints, tracing filters, and running commit reporting.
  • Add commit-metrics-gated state instrumentation for commit pipeline phases and batch bytes, consumed by the benchmark analyzer/dashboard while leaving default builds unaffected.

Tests

Source PR validation, carried over from #264 and #267:

  • Deployed two live archive serving nodes, formed a private Zakura cohort, and verified zakura.p2p.conn.* metrics with zero wrong_network/wrong_chain rejects.
  • Ran an isolated bench that synced from the cohort only (legacy peers = 0, VCT fast path active), producing CSV output and analyze results.
  • Validated deployer behavior end-to-end: cold build produced zebrad 5.0.0-rc.3, a second build reused the cache, deploy succeeded to both nodes, status reported active services, logs fetch copied deterministic log files, and logs follow streamed live output.
  • Unit-checked deployer render paths by parsing rendered TOML for seed/freeze phases and confirming existing fleets without new fields render unchanged.
  • make -n confirmed each perf target maps to the intended perf.sh command.

p0mvn added 4 commits June 27, 2026 01:08
* feat(network): add private Zakura dev-network cohorts

Zakura (v2) dev nodes bootstrap from a few peers, but discovery and gossip
then pull in the rest of the network, so concurrent experiments by different
team members collide. This adds an opt-in way to run an isolated v2 overlay on
top of unchanged mainnet consensus.

Add an optional `[network.zakura] dev_network` cohort tag. When set, a node
only forms Zakura connections with peers sharing the same tag: its
`ZakuraHandshakeConfig` advertises `ZakuraNetworkId::Configured` and a
cohort-derived `chain_id` (`derive_dev_chain_id` = domain-separated blake2b
over the real genesis hash and the tag). Both fields are already validated in
the Zakura control handshake, the legacy->Zakura upgrade prelude, and signed
discovery records (records copy `handshake.chain_id`), so isolation propagates
with no wire-format change and no new reject code:

- a public mainnet node (`network_id = Mainnet`) and a dev node reject each
  other with `WrongNetwork` and stay on legacy;
- different cohorts (both `Configured`, different `chain_id`) reject with
  `WrongChain`, and cross-cohort discovery records fail import;
- same-tag peers match and form the private overlay.
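
The three outcomes above can be modelled in a few lines. This is an illustrative Python sketch, not the Rust implementation: the personalization string `ZakuraDevCohort`, the use of the genesis hash as the public `chain_id`, and the `Handshake` / `check_peer` names are all assumptions made for the example. Only the idea (domain-separated BLAKE2b over the genesis hash and the tag, then `network_id` and `chain_id` comparison) comes from this commit message.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

# Hypothetical domain-separation label (BLAKE2b allows at most 16 bytes).
DEV_CHAIN_PERSONAL = b"ZakuraDevCohort"


def derive_dev_chain_id(genesis_hash: bytes, tag: str) -> bytes:
    """Domain-separated BLAKE2b over the real genesis hash and the cohort tag."""
    h = hashlib.blake2b(digest_size=32, person=DEV_CHAIN_PERSONAL)
    h.update(genesis_hash)  # fixed 32-byte prefix, so the tag boundary is unambiguous
    h.update(tag.encode("utf-8"))
    return h.digest()


@dataclass(frozen=True)
class Handshake:
    network_id: str  # "Mainnet" or "Configured"
    chain_id: bytes


def handshake_for(genesis_hash: bytes, dev_network: Optional[str]) -> Handshake:
    if dev_network is None:
        # Assumption: the untagged chain_id is modelled as the genesis hash itself.
        return Handshake("Mainnet", genesis_hash)
    return Handshake("Configured", derive_dev_chain_id(genesis_hash, dev_network))


def check_peer(local: Handshake, remote: Handshake) -> str:
    """Return the outcome of comparing two handshakes."""
    if local.network_id != remote.network_id:
        return "WrongNetwork"
    if local.chain_id != remote.chain_id:
        return "WrongChain"
    return "Accepted"
```

With a placeholder genesis hash, a public node against a tagged node yields `WrongNetwork`, two different tags yield `WrongChain`, and two nodes with the same tag are accepted.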

The tag only scopes the Zakura v2 overlay. Genesis, network magic, and
activation heights are unchanged, so a cohort node validates the real chain.
`chain_id` here is a Zakura peer-matching id only; block validation uses the
unchanged network parameters. Has no effect unless `v2_p2p` is enabled.

The legacy->Zakura upgrade path rebuilds the handshake config from scratch, so
it now also threads the cohort tag; otherwise a tagged node would advertise the
cohort id on its native endpoint but the plain id during upgrades and could not
upgrade with its own cohort.

* docs(network): add Zakura dev-network README

Add a developer-facing README for the `[network.zakura] dev_network` cohort
feature next to the code (`zebra-network/src/zakura/README.md`): what it does,
the network_id/chain_id mechanism, the code map, and how to test. Complements
the operator-facing book guide.
Add deploy/deployer/, a dependency-free Python CLI for deploying zebrad to a
fleet of nodes and collecting their logs. It reuses the build -> scp ->
install-with-.bak -> systemctl restart -> rollback pattern from
deploy-zcashd-compat.yml, generalized to a dynamic multi-node TOML config
(per-node name / ssh_string / commit).

- build: resolve each node's commit to a SHA (origin/<ref> fallback) and build
  zebrad once per unique SHA into a cache, using a throwaway detached git
  worktree so the caller's working tree is untouched. Honors CARGO_TARGET_DIR.
- deploy: distribute the binary + a rendered zebrad.toml (deterministic
  [tracing] log_file) + a systemd unit, install with a .bak backup, restart,
  and roll back on failure. Nodes run in parallel.
- status: per-node service state + version.
- logs fetch / logs follow: copy or live-tail the deterministic log file by name.
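
The "build once per unique SHA" step amounts to grouping nodes by their resolved commit. A minimal sketch, assuming each node is a dict with the `name` and `commit` fields named above; `plan_builds` and the injected `resolve` callable are hypothetical names (a real resolver would presumably shell out to git, which is omitted here so the example stays self-contained):

```python
from collections import defaultdict
from typing import Callable


def plan_builds(
    nodes: list[dict], resolve: Callable[[str], str]
) -> dict[str, list[str]]:
    """Group node names by resolved commit SHA so each SHA is built only once."""
    plan: dict[str, list[str]] = defaultdict(list)
    for node in nodes:
        sha = resolve(node["commit"])  # a branch, tag, or SHA prefix becomes a full SHA
        plan[sha].append(node["name"])
    return dict(plan)


# Placeholder fleet and ref table, for illustration only.
fleet = [
    {"name": "serve-1", "ssh_string": "ops@serve-1", "commit": "my-branch"},
    {"name": "serve-2", "ssh_string": "ops@serve-2", "commit": "aaaa111"},
    {"name": "bench-1", "ssh_string": "ops@bench-1", "commit": "other-branch"},
]
fake_refs = {"my-branch": "aaaa111", "aaaa111": "aaaa111", "other-branch": "bbbb222"}
build_plan = plan_builds(fleet, fake_refs.__getitem__)
```

Here three nodes need only two builds, because `my-branch` and `aaaa111` resolve to the same commit.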

Remote scripts are fed on stdin (bash -s) rather than as ssh args, since ssh
flattens argv and would collapse `bash -c '<multi-word>'` to its first word.
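
The argv-flattening problem and the stdin workaround can be illustrated as follows. This is a sketch under assumptions: `flattened_remote_argv` only simulates what the remote shell sees after ssh joins its command arguments with spaces, and `run_remote_script` is a hypothetical helper name, not the deployer's function.

```python
import shlex
import subprocess


def flattened_remote_argv(argv: list[str]) -> list[str]:
    """Simulate ssh joining its command arguments with spaces and the
    remote shell splitting the resulting string again."""
    return shlex.split(" ".join(argv))


def run_remote_script(ssh_string: str, script: str) -> subprocess.CompletedProcess:
    """Feed the whole script to `bash -s` on stdin, so no quoting is lost."""
    return subprocess.run(
        ["ssh", ssh_string, "bash", "-s"],
        input=script,
        text=True,
        capture_output=True,
        check=False,
    )


# Locally this is three arguments; remotely `bash -c` receives only `echo`.
seen_remotely = flattened_remote_argv(["bash", "-c", "echo hello world"])
```

After the round trip, `seen_remotely` is `['bash', '-c', 'echo', 'hello', 'world']`: the script passed to `-c` is just `echo`, and the remaining words become positional parameters.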
…267)

Tooling for deterministic, isolated sync-perf benchmarking against a private
Zakura cohort of operator-controlled serving nodes.

- deploy/runner/: perf.sh is the single entry point over the
  seed -> peers -> freeze -> run -> analyze lifecycle; feed_run.sh forks a
  snapshot and samples five-category bottleneck metrics into a CSV;
  feed_analyze.py does steady-state bottleneck attribution; a tokenized bench
  config and serving-fleet template; the live metrics dashboard; and a runbook.
  Host-specific paths and the cohort identity live in an untracked cohort.env.
- deploy/deployer/deploy.py: render a fleet-wide [network.zakura] cohort block,
  storage_mode / v2_p2p / legacy_p2p, and an optional metrics endpoint plus
  tracing filter; `status` now reports the running git commit and configured ref.
- make/perf.mk and Makefile: `make perf-*` wrappers (build-local, run, analyze,
  dashboard, verify-isolation, seed/peers/freeze/status).
…refinements (#271)

Adds per-phase commit-pipeline instrumentation (behind the `commit-metrics`
feature) and the perf-harness refinements that consume it, on top of the
private-cohort bench harness (#267).

State instrumentation (commit-metrics gated; default builds unaffected):
- finalized_state.rs: time the history-tree MMR push (history_push phase).
- block.rs: time spent-UTXO reads, address-balance reads, and batch build, and
  record committed batch size in bytes (write-throughput / per-block size).
- disk_db.rs: DiskWriteBatch::size_in_bytes() accessor for the above; gated to
  its sole caller so a default build does not flag it dead.

Perf harness (deploy/runner + make/perf.mk):
- feed_run.sh: stop a prior same-label run before re-forking, so a rerun can't
  collide on the fork/ports (RocksDB "multiple active instances" -> instant exit).
- perf.sh: add a `logs` subcommand (drift-spam filtered; RAW=1 for all).
- zebra-metrics-dashboard.py: smooth throughput over a 20s trailing window so
  batched checkpoint commits don't alias into false spikes or zeros; add --smooth.
- feed_analyze.py: commit-pipeline attribution over the new metric set.
- make/perf.mk: perf-logs target; default stop height -> 1.9M; wider analyze window.
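
The trailing-window smoothing mentioned for the dashboard can be sketched like this. It illustrates the idea only; `TrailingRate` and its sampling scheme are assumptions, not the code in zebra-metrics-dashboard.py.

```python
from collections import deque


class TrailingRate:
    """Blocks per second measured over a trailing time window."""

    def __init__(self, window_s: float = 20.0) -> None:
        self.window_s = window_s
        self.samples: deque[tuple[float, int]] = deque()  # (timestamp, height)

    def update(self, t: float, height: int) -> float:
        self.samples.append((t, height))
        # Drop old samples, keeping the newest one at or before the window start.
        while len(self.samples) >= 2 and self.samples[1][0] <= t - self.window_s:
            self.samples.popleft()
        t0, h0 = self.samples[0]
        if t == t0:
            return 0.0  # only one sample so far
        return (height - h0) / (t - t0)


# Height advances by 400 blocks every 4 seconds (a batched commit), sampled at 1 Hz.
rate = TrailingRate(window_s=20.0)
smoothed = [rate.update(float(t), 400 * (t // 4)) for t in range(41)]
```

A per-second difference on this input alternates between 0 and 400 blocks/s, while the 20s window reports a steady 100 blocks/s once it has filled.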

Verified: `cargo check -p zebra-state` passes with and without commit-metrics;
rustfmt clean. Instrumentation exercised live by the bench binary emitting these
metrics during cohort sync.

AI disclosure: drafted with Claude Code (Claude Opus) — instrumentation, harness
changes, and this description.
v12-auditor Bot commented Jun 27, 2026

Note

Complete: Audit complete. V12 found one issue worth reviewing.

Finding | Severity | Details
F-93627 | 🟠 High | Ironwood state is unenforced

V6 transactions contain an ironwood_shielded_data field and expose Ironwood nullifiers and note commitments, while ironwood.rs states that Ironwood has distinct note-commitment and nullifier state. The state layer still enforces only Sprout, Sapling, and Orchard: finalized duplicate-nullifier checks, mempool/best-chain duplicate checks, anchor checks, nullifier persistence, and note-commitment tree persistence all omit Ironwood. The finalized write path calls prepare_shielded_transaction_batch and prepare_trees_batch, but those helpers write only Sprout/Sapling/Orchard nullifiers and trees. On builds where V6 is enabled and NU6.3/NU7 is active, a malicious block or mempool transaction can therefore reuse an Ironwood nullifier across transactions or rely on Ironwood commitments and anchors that are never added to consensus state. The defect is conditional on Ironwood activation, but under that configuration it is a consensus-critical validation gap.

And one more auto-invalidated finding.

Analyzed six files, diff f2bf436...9cfc07a.

@p0mvn p0mvn marked this pull request as draft June 27, 2026 17:55
p0mvn commented Jun 27, 2026

@cursor review

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Bugbot Autofix is ON. A cloud agent has been kicked off to fix the reported issue.

Reviewed by Cursor Bugbot for commit 151bd15.

Comment thread deploy/runner/perf.sh Outdated
@p0mvn p0mvn marked this pull request as ready for review June 27, 2026 20:56
@p0mvn p0mvn merged commit 8934575 into ironwood-main Jun 27, 2026
54 checks passed