Skip to content

unicity-node hardening — follow-ups from the bft-snapshots audit (issues #1 / #5 / #6) #14

@b3y0urs3lf

Description

@b3y0urs3lf

Context

The merged bft-snapshots branch (#1 / PR #3, #5 / PR #11, #6 / PR #10) closed
the Priority 1 items in each issue. The Priority 2 and Priority 3 items
remain open — they were listed in the original issue bodies but the merged PRs
deferred them. This tracker collects those + two adjacent gaps discovered
during the audit + one item the PR #3 review marked as future work.

Each item has a passing-while-the-gap-holds repro test on
test-coverage-issues-5-6 (will land in a separate test-coverage PR linked
here). When a fix lands, the corresponding test fails until the implementer
flips the assertion and adds the positive-path case.

Hard rules for the fixes

  1. Propose and implement the EXACT minimal fix. No new abstractions,
    schedulers, or sub-systems. Reuse existing machinery (connect_to_sync for
    F6a, ProcessTimers() for F6b, etc.). If a fix seems to require a new
    subsystem, stop and discuss.
  2. Every fix must be covered by tests exercising BOTH paths — positive
    (valid behavior still works) AND negative (the gap is closed). The
    existing repros become the negative-path tests once flipped.
  3. No behavior change beyond the stated fix.

Findings — items the original tasks named (priority order)

1. F2b — RPC server does not support HTTP keep-alive

Original task: #5 Priority 3 (verbatim from #5 body).
Status: open after PR #11 (P1 backlog fix only).
System impact now: every RPC call opens a fresh TCP connection. FGP issues
~6 calls/round × 3 nodes → constant connect/teardown churn, accept-loop
pressure, defeats FGP's client-side pooling (compounds with FGP F3).
System impact after fix: persistent connections cut churn; FGP pool works.
Exact fix: in src/network/rpc_server.cpp, honor Connection: keep-alive
— keep the socket open for subsequent requests up to a sane idle timeout /
request limit. No new I/O framework.
Tests: test/functional/bug_rpc_no_keep_alive.py (runtime — verified
reproduces: second request on same socket → BrokenPipeError).
Fix-test coverage: positive — single request still works; negative — two
sequential requests on one socket both succeed.

2. F6a — addnode has no retry on transient failure

Original task: #6 Priority 2"Retry logic for addnode connections.
When addnode 'add' initiates a connection and the async callback fails,
schedule a retry (e.g., 3 attempts with exponential backoff)."

Status: open after PR #10 (sync conversion + logging fixed; retry deferred).
System impact now: addnode add <peer> against a momentarily-unreachable
peer fails once and is never retried — operator's manually-added peer
silently never connects if it was down at add time. Reduces connectivity
resilience.
System impact after fix: transient failures are retried with backoff;
manually-added peers eventually connect.
Exact fix: add bounded retry/backoff (3 attempts, exponential, per #6's
spec) for manually-added peers on transient connect failure, reusing
connect_to_sync / the existing outbound scheduler. Don't build a new one.
Tests: test/functional/bug_addnode_no_retry.py (runtime — verified
reproduces: only the original failed attempt in debug.log, no retry).
Fix-test coverage: positive — addnode to a reachable peer connects;
negative — addnode to a down-then-up peer eventually connects.

3. F2 — thread-per-request std::thread(...).detach() model

Original task: #5 Priority 2"Replace detached threads with thread
pool"
(with code example in #5 body).
Status: open after PR #11.
System impact now: RPC accept loop spawns a detached std::thread per
request. Under burst (3 FGP nodes' synchronized T1 timers firing ~6 calls
each within ~35 ms), thread creation is unbounded — resource pressure, no
backpressure beyond the busy-path cap.
System impact after fix: bounded concurrency via a fixed-size worker pool + queue; predictable resource use under burst.
Exact fix: replace per-request std::thread(...).detach() with a
pre-allocated 4–8 worker thread pool (as #5 P2 itself suggests). No new
concurrency framework.
Tests: test/functional/bug_rpc_thread_per_request.py (source-level grep
— a runtime load test was evaluated and deferred: thread/FD counts are
scheduler-dependent and would be flaky; threading model is a source property).
Fix-test coverage: positive — RPC still serves requests; negative — flip
the source assertion (no per-request std::thread/.detach; pool present).

4. F6b — no FAST post-IBD chain-sync stall detection

Original task: #6 Priority 3"Consider post-IBD stall detection.
Add a lighter post-IBD check..."

Status: scoped down by the audit — most post-IBD logic IS present
(ConsiderEviction, PING/inactivity timeouts, multi-peer sync). The specific
gap: in HeaderSyncManager::ProcessTimers(), the if (in_ibd) branch has a
sync_deadline check; the else (post-IBD) branch has only
ConsiderEviction, no deadline.
System impact now: a post-IBD peer that answers PINGs and has chainwork
not visibly behind but silently drags feet on headers is only caught by the
1200 s inactivity timeout — ~20 min wedge. IBD catches the same in seconds.
System impact after fix: a post-IBD deadline analog catches stalling peers
in seconds, matching the IBD path.
Exact fix: add a post-IBD deadline-based stall check in the else branch
of ProcessTimers() — a sync_deadline analog that disconnects a peer
which hasn't delivered expected headers within a window. Mirror the IBD
branch.
Tests: test/functional/bug_post_ibd_stall_disabled.py (source-level —
runtime variant deferred until a node_simulator "post-IBD header-stall"
scenario exists).
Fix-test coverage: positive — healthy post-IBD peer not disconnected;
negative — flip the source assertion.

Findings the PR #3 review marked as future work

5. S5 — Trust-base disk-load validation

Origin: PR #3 review explicitly marked "future work" (issue #1 audit).
System impact now: LocalTrustBaseManager::Load()
(src/chain/trust_base_manager.cpp:19-43) deserializes .cbor via
FromCBOR(data) and stores without IsValid() / VerifySignatures(). A
tampered or corrupt-but-parseable .cbor loads silently. Threat model
narrow (requires write access to data dir) but real.
System impact after fix: invalid trust bases rejected/skipped at load
time, not silently accepted as authoritative state.
Exact fix: in Load()'s try block, after FromCBOR(data), call
tb.IsValid(prev) and tb.VerifySignatures(prev); reject (or skip with a
logged error) on failure. ~5 lines.
Tests: the new SECTION in test/unit/chain/trust_base_manager_tests.cpp
(writes a quorum_threshold = 0 .cbor directly to the data subdir,
asserts Load() accepts it silently — gap holds).
Fix-test coverage: positive — valid trust bases still load; negative —
flip (the invalid TB is rejected / not promoted to latest).

Findings discovered during this audit

6. F6c — getaddednodeinfo RPC not implemented

Origin: not in #6's body — surfaced during audit as an adjacent
operability gap (Bitcoin Core has it; we don't).
System impact now: none to consensus or liveness — operability only.
Operators can't inspect what addnode add queued except by grepping
debug.log.
System impact after fix: operators can query the manually-added-node list + connection status, matching Bitcoin Core ergonomics.
Exact fix: implement getaddednodeinfo returning the manually-added-node
list + per-node connection state, mirroring Bitcoin Core's response shape.
Tests: test/functional/bug_getaddednodeinfo_missing.py (runtime —
returns 'Unknown command').
Fix-test coverage: positive — added nodes appear in getaddednodeinfo;
negative — flip the 'Unknown command' assertion.

7. H2 — bug_rpc_busy_path_resets exits 0 even when INCONCLUSIVE

Origin: discovered during Step-1 verification.
System impact now: test-hygiene only — when the test can't saturate the
busy-path cap (the normal case now that #5 P1 is fixed), it prints
"INCONCLUSIVE" but still exits 0, looking like a pass when run on demand.
Exact fix: either exit non-0 / skip on "couldn't saturate," OR retire
this reproducer now that feature_rpc_socket_backlog.py covers the F1
fix authoritatively.

Priority order

  1. F2b (keep-alive, IPC Socket drops connections under concurrent access (listen backlog overflow) #5 P3) — clear runtime fix, FGP-pool interaction.
  2. F6a (addnode retry, Silent outbound connection failure — addnode returns success but connection never established #6 P2) — reuses existing reconnect machinery.
  3. F2 (thread-per-request, IPC Socket drops connections under concurrent access (listen backlog overflow) #5 P2) — bounded-pool refactor (largest fix).
  4. F6b (post-IBD stall, Silent outbound connection failure — addnode returns success but connection never established #6 P3) — deadline analog in ProcessTimers().
  5. S5 (trust-base disk-load) — PR review future work.
  6. F6c (getaddednodeinfo) — operability improvement.
  7. H2 (test exit code / retire) — test hygiene, anytime.

Out of scope (link, don't duplicate)

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions