unicity-node hardening — follow-ups from the bft-snapshots audit (issues #1 / #5 / #6)

## Context

  The merged `bft-snapshots` branch (#1 / PR #3, #5 / PR #11, #6 / PR #10) closed
  the **Priority 1** items in each issue. The Priority 2 and Priority 3 items
  remain open — they were listed in the original issue bodies but the merged PRs
  deferred them. This tracker collects those + two adjacent gaps discovered
  during the audit + one item the PR #3 review marked as future work.

  Each item has a passing-while-the-gap-holds repro test on
  `test-coverage-issues-5-6` (will land in a separate test-coverage PR linked
  here). When a fix lands, the corresponding test fails until the implementer
  flips the assertion and adds the positive-path case.

  ## Hard rules for the fixes

  1. **Propose and implement the EXACT minimal fix.** No new abstractions,
     schedulers, or sub-systems. Reuse existing machinery (`connect_to_sync` for
     F6a, `ProcessTimers()` for F6b, etc.). If a fix seems to require a new
     subsystem, stop and discuss.
  2. **Every fix must be covered by tests exercising BOTH paths** — positive
     (valid behavior still works) AND negative (the gap is closed). The
     existing repros become the negative-path tests once flipped.
  3. **No behavior change beyond the stated fix.**

  ## Findings — items the original tasks named (priority order)

  ### 1. F2b — RPC server does not support HTTP keep-alive
  **Original task:** **#5 Priority 3** (verbatim from #5 body).
  **Status:** open after PR #11 (P1 backlog fix only).
  **System impact now:** every RPC call opens a fresh TCP connection. FGP issues
  ~6 calls/round × 3 nodes → constant connect/teardown churn, accept-loop
  pressure, defeats FGP's client-side pooling (compounds with FGP F3).
  **System impact after fix:** persistent connections cut churn; FGP pool works.
  **Exact fix:** in `src/network/rpc_server.cpp`, honor `Connection: keep-alive`
  — keep the socket open for subsequent requests up to a sane idle timeout /
  request limit. No new I/O framework.
  **Tests:** `test/functional/bug_rpc_no_keep_alive.py` (runtime — verified
  reproduces: second request on same socket → `BrokenPipeError`).
  **Fix-test coverage:** positive — single request still works; negative — two
  sequential requests on one socket both succeed.

  ### 2. F6a — `addnode` has no retry on transient failure
  **Original task:** **#6 Priority 2** — *"Retry logic for addnode connections.
  When addnode 'add' initiates a connection and the async callback fails,
  schedule a retry (e.g., 3 attempts with exponential backoff)."*
  **Status:** open after PR #10 (sync conversion + logging fixed; retry deferred).
  **System impact now:** `addnode add <peer>` against a momentarily-unreachable
  peer fails once and is never retried — operator's manually-added peer
  silently never connects if it was down at add time. Reduces connectivity
  resilience.
  **System impact after fix:** transient failures are retried with backoff;
  manually-added peers eventually connect.
  **Exact fix:** add bounded retry/backoff (3 attempts, exponential, per #6's
  spec) for manually-added peers on transient connect failure, reusing
  `connect_to_sync` / the existing outbound scheduler. Don't build a new one.
  **Tests:** `test/functional/bug_addnode_no_retry.py` (runtime — verified
  reproduces: only the original failed attempt in `debug.log`, no retry).
  **Fix-test coverage:** positive — addnode to a reachable peer connects;
  negative — addnode to a down-then-up peer eventually connects.

  ### 3. F2 — thread-per-request `std::thread(...).detach()` model
  **Original task:** **#5 Priority 2** — *"Replace detached threads with thread
  pool"* (with code example in #5 body).
  **Status:** open after PR #11.
  **System impact now:** RPC accept loop spawns a detached `std::thread` per
  request. Under burst (3 FGP nodes' synchronized T1 timers firing ~6 calls
  each within ~35 ms), thread creation is unbounded — resource pressure, no
  backpressure beyond the busy-path cap.
  **System impact after fix:** bounded concurrency via a fixed-size worker pool + queue; predictable resource use under burst.
  **Exact fix:** replace per-request `std::thread(...).detach()` with a
  pre-allocated 4–8 worker thread pool (as #5 P2 itself suggests). No new
  concurrency framework.
  **Tests:** `test/functional/bug_rpc_thread_per_request.py` (source-level grep
  — a runtime load test was evaluated and deferred: thread/FD counts are
  scheduler-dependent and would be flaky; threading model is a source property).
  **Fix-test coverage:** positive — RPC still serves requests; negative — flip
  the source assertion (no per-request `std::thread/.detach`; pool present).

  ### 4. F6b — no FAST post-IBD chain-sync stall detection
  **Original task:** **#6 Priority 3** — *"Consider post-IBD stall detection.
  Add a lighter post-IBD check..."*
  **Status:** scoped down by the audit — most post-IBD logic IS present
  (`ConsiderEviction`, PING/inactivity timeouts, multi-peer sync). The specific
  gap: in `HeaderSyncManager::ProcessTimers()`, the `if (in_ibd)` branch has a
  `sync_deadline` check; the `else` (post-IBD) branch has only
  `ConsiderEviction`, no deadline.
  **System impact now:** a post-IBD peer that answers PINGs and has chainwork
  not visibly behind but silently drags feet on headers is only caught by the
  1200 s inactivity timeout — ~20 min wedge. IBD catches the same in seconds.
  **System impact after fix:** a post-IBD deadline analog catches stalling peers
  in seconds, matching the IBD path.
  **Exact fix:** add a post-IBD deadline-based stall check in the `else` branch
  of `ProcessTimers()` — a `sync_deadline` analog that disconnects a peer
  which hasn't delivered expected headers within a window. Mirror the IBD
  branch.
  **Tests:** `test/functional/bug_post_ibd_stall_disabled.py` (source-level —
  runtime variant deferred until a `node_simulator` "post-IBD header-stall"
  scenario exists).
  **Fix-test coverage:** positive — healthy post-IBD peer not disconnected;
  negative — flip the source assertion.

  ## Findings the PR #3 review marked as future work

  ### 5. S5 — Trust-base disk-load validation
  **Origin:** PR #3 review explicitly marked **"future work"** (issue #1 audit).
  **System impact now:** `LocalTrustBaseManager::Load()`
  (`src/chain/trust_base_manager.cpp:19-43`) deserializes `.cbor` via
  `FromCBOR(data)` and stores without `IsValid()` / `VerifySignatures()`. A
  tampered or corrupt-but-parseable `.cbor` loads silently. Threat model
  narrow (requires write access to data dir) but real.
  **System impact after fix:** invalid trust bases rejected/skipped at load
  time, not silently accepted as authoritative state.
  **Exact fix:** in `Load()`'s try block, after `FromCBOR(data)`, call
  `tb.IsValid(prev)` and `tb.VerifySignatures(prev)`; reject (or skip with a
  logged error) on failure. ~5 lines.
  **Tests:** the new SECTION in `test/unit/chain/trust_base_manager_tests.cpp`
  (writes a `quorum_threshold = 0` `.cbor` directly to the data subdir,
  asserts `Load()` accepts it silently — gap holds).
  **Fix-test coverage:** positive — valid trust bases still load; negative —
  flip (the invalid TB is rejected / not promoted to latest).

  ## Findings discovered during this audit

  ### 6. F6c — `getaddednodeinfo` RPC not implemented
  **Origin:** not in #6's body — surfaced during audit as an adjacent
  operability gap (Bitcoin Core has it; we don't).
  **System impact now:** none to consensus or liveness — operability only.
  Operators can't inspect what `addnode add` queued except by grepping
  `debug.log`.
  **System impact after fix:** operators can query the manually-added-node list + connection status, matching Bitcoin Core ergonomics.
  **Exact fix:** implement `getaddednodeinfo` returning the manually-added-node
  list + per-node connection state, mirroring Bitcoin Core's response shape.
  **Tests:** `test/functional/bug_getaddednodeinfo_missing.py` (runtime —
  returns `'Unknown command'`).
  **Fix-test coverage:** positive — added nodes appear in `getaddednodeinfo`;
  negative — flip the `'Unknown command'` assertion.

  ### 7. H2 — `bug_rpc_busy_path_resets` exits 0 even when INCONCLUSIVE
  **Origin:** discovered during Step-1 verification.
  **System impact now:** test-hygiene only — when the test can't saturate the
  busy-path cap (the normal case now that #5 P1 is fixed), it prints
  `"INCONCLUSIVE"` but still exits 0, looking like a pass when run on demand.
  **Exact fix:** either exit non-0 / skip on "couldn't saturate," OR retire
  this reproducer now that `feature_rpc_socket_backlog.py` covers the F1
  fix authoritatively.

  ## Priority order

  1. **F2b** (keep-alive, #5 P3) — clear runtime fix, FGP-pool interaction.
  2. **F6a** (addnode retry, #6 P2) — reuses existing reconnect machinery.
  3. **F2** (thread-per-request, #5 P2) — bounded-pool refactor (largest fix).
  4. **F6b** (post-IBD stall, #6 P3) — deadline analog in `ProcessTimers()`.
  5. **S5** (trust-base disk-load) — PR review future work.
  6. **F6c** (`getaddednodeinfo`) — operability improvement.
  7. **H2** (test exit code / retire) — test hygiene, anytime.

  ## Out of scope (link, don't duplicate)

  - **unicity-node #7** (pre-VERACK no-disconnect) — already its own open issue.
  - **JSON-RPC `id` echo** — fixed upstream in `f7d9fdb` (now on main); the
    FGP-side TODO cleanup is in finality-gadget#17.
  - **AC.11 list-tokens CLI** — satisfied via `<datadir>/reward_tokens.csv` per
    @ahtotruu's clarification ("`cat csv` is also CLI"). If #13's spec changes
    prompt a query RPC, that work belongs in #13.
  - **vPayload memory pruning** — explicitly deferred future work per the
    original #1 design comment (*"Payload pruning will not be implemented
    initially"*). Larger refactor; not defense-in-depth.



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

unicity-node hardening — follow-ups from the bft-snapshots audit (issues #1 / #5 / #6) #14

Context

Hard rules for the fixes

Findings — items the original tasks named (priority order)

1. F2b — RPC server does not support HTTP keep-alive

2. F6a — `addnode` has no retry on transient failure

3. F2 — thread-per-request `std::thread(...).detach()` model

4. F6b — no FAST post-IBD chain-sync stall detection

Findings the PR #3 review marked as future work

5. S5 — Trust-base disk-load validation

Findings discovered during this audit

6. F6c — `getaddednodeinfo` RPC not implemented

7. H2 — `bug_rpc_busy_path_resets` exits 0 even when INCONCLUSIVE

Priority order

Out of scope (link, don't duplicate)

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

unicity-node hardening — follow-ups from the bft-snapshots audit (issues #1 / #5 / #6) #14

Description

Context

Hard rules for the fixes

Findings — items the original tasks named (priority order)

1. F2b — RPC server does not support HTTP keep-alive

2. F6a — addnode has no retry on transient failure

3. F2 — thread-per-request std::thread(...).detach() model

4. F6b — no FAST post-IBD chain-sync stall detection

Findings the PR #3 review marked as future work

5. S5 — Trust-base disk-load validation

Findings discovered during this audit

6. F6c — getaddednodeinfo RPC not implemented

7. H2 — bug_rpc_busy_path_resets exits 0 even when INCONCLUSIVE

Priority order

Out of scope (link, don't duplicate)

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

2. F6a — `addnode` has no retry on transient failure

3. F2 — thread-per-request `std::thread(...).detach()` model

6. F6c — `getaddednodeinfo` RPC not implemented

7. H2 — `bug_rpc_busy_path_resets` exits 0 even when INCONCLUSIVE