You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The merged bft-snapshots branch (#1 / PR #3, #5 / PR #11, #6 / PR #10) closed
the Priority 1 items in each issue. The Priority 2 and Priority 3 items
remain open — they were listed in the original issue bodies but the merged PRs
deferred them. This tracker collects those + two adjacent gaps discovered
during the audit + one item the PR #3 review marked as future work.
Each item has a passing-while-the-gap-holds repro test on test-coverage-issues-5-6 (will land in a separate test-coverage PR linked
here). When a fix lands, the corresponding test fails until the implementer
flips the assertion and adds the positive-path case.
Hard rules for the fixes
Propose and implement the EXACT minimal fix. No new abstractions,
schedulers, or sub-systems. Reuse existing machinery (connect_to_sync for
F6a, ProcessTimers() for F6b, etc.). If a fix seems to require a new
subsystem, stop and discuss.
Every fix must be covered by tests exercising BOTH paths — positive
(valid behavior still works) AND negative (the gap is closed). The
existing repros become the negative-path tests once flipped.
No behavior change beyond the stated fix.
Findings — items the original tasks named (priority order)
1. F2b — RPC server does not support HTTP keep-alive
Original task:#5 Priority 3 (verbatim from #5 body). Status: open after PR #11 (P1 backlog fix only). System impact now: every RPC call opens a fresh TCP connection. FGP issues
~6 calls/round × 3 nodes → constant connect/teardown churn, accept-loop
pressure, defeats FGP's client-side pooling (compounds with FGP F3). System impact after fix: persistent connections cut churn; FGP pool works. Exact fix: in src/network/rpc_server.cpp, honor Connection: keep-alive
— keep the socket open for subsequent requests up to a sane idle timeout /
request limit. No new I/O framework. Tests:test/functional/bug_rpc_no_keep_alive.py (runtime — verified
reproduces: second request on same socket → BrokenPipeError). Fix-test coverage: positive — single request still works; negative — two
sequential requests on one socket both succeed.
2. F6a — addnode has no retry on transient failure
Original task:#6 Priority 2 — "Retry logic for addnode connections.
When addnode 'add' initiates a connection and the async callback fails,
schedule a retry (e.g., 3 attempts with exponential backoff)." Status: open after PR #10 (sync conversion + logging fixed; retry deferred). System impact now:addnode add <peer> against a momentarily-unreachable
peer fails once and is never retried — operator's manually-added peer
silently never connects if it was down at add time. Reduces connectivity
resilience. System impact after fix: transient failures are retried with backoff;
manually-added peers eventually connect. Exact fix: add bounded retry/backoff (3 attempts, exponential, per #6's
spec) for manually-added peers on transient connect failure, reusing connect_to_sync / the existing outbound scheduler. Don't build a new one. Tests:test/functional/bug_addnode_no_retry.py (runtime — verified
reproduces: only the original failed attempt in debug.log, no retry). Fix-test coverage: positive — addnode to a reachable peer connects;
negative — addnode to a down-then-up peer eventually connects.
3. F2 — thread-per-request std::thread(...).detach() model
Original task:#5 Priority 2 — "Replace detached threads with thread
pool" (with code example in #5 body). Status: open after PR #11. System impact now: RPC accept loop spawns a detached std::thread per
request. Under burst (3 FGP nodes' synchronized T1 timers firing ~6 calls
each within ~35 ms), thread creation is unbounded — resource pressure, no
backpressure beyond the busy-path cap. System impact after fix: bounded concurrency via a fixed-size worker pool + queue; predictable resource use under burst. Exact fix: replace per-request std::thread(...).detach() with a
pre-allocated 4–8 worker thread pool (as #5 P2 itself suggests). No new
concurrency framework. Tests:test/functional/bug_rpc_thread_per_request.py (source-level grep
— a runtime load test was evaluated and deferred: thread/FD counts are
scheduler-dependent and would be flaky; threading model is a source property). Fix-test coverage: positive — RPC still serves requests; negative — flip
the source assertion (no per-request std::thread/.detach; pool present).
4. F6b — no FAST post-IBD chain-sync stall detection
Original task:#6 Priority 3 — "Consider post-IBD stall detection.
Add a lighter post-IBD check..." Status: scoped down by the audit — most post-IBD logic IS present
(ConsiderEviction, PING/inactivity timeouts, multi-peer sync). The specific
gap: in HeaderSyncManager::ProcessTimers(), the if (in_ibd) branch has a sync_deadline check; the else (post-IBD) branch has only ConsiderEviction, no deadline. System impact now: a post-IBD peer that answers PINGs and has chainwork
not visibly behind but silently drags feet on headers is only caught by the
1200 s inactivity timeout — ~20 min wedge. IBD catches the same in seconds. System impact after fix: a post-IBD deadline analog catches stalling peers
in seconds, matching the IBD path. Exact fix: add a post-IBD deadline-based stall check in the else branch
of ProcessTimers() — a sync_deadline analog that disconnects a peer
which hasn't delivered expected headers within a window. Mirror the IBD
branch. Tests:test/functional/bug_post_ibd_stall_disabled.py (source-level —
runtime variant deferred until a node_simulator "post-IBD header-stall"
scenario exists). Fix-test coverage: positive — healthy post-IBD peer not disconnected;
negative — flip the source assertion.
Origin: PR #3 review explicitly marked "future work" (issue #1 audit). System impact now:LocalTrustBaseManager::Load()
(src/chain/trust_base_manager.cpp:19-43) deserializes .cbor via FromCBOR(data) and stores without IsValid() / VerifySignatures(). A
tampered or corrupt-but-parseable .cbor loads silently. Threat model
narrow (requires write access to data dir) but real. System impact after fix: invalid trust bases rejected/skipped at load
time, not silently accepted as authoritative state. Exact fix: in Load()'s try block, after FromCBOR(data), call tb.IsValid(prev) and tb.VerifySignatures(prev); reject (or skip with a
logged error) on failure. ~5 lines. Tests: the new SECTION in test/unit/chain/trust_base_manager_tests.cpp
(writes a quorum_threshold = 0.cbor directly to the data subdir,
asserts Load() accepts it silently — gap holds). Fix-test coverage: positive — valid trust bases still load; negative —
flip (the invalid TB is rejected / not promoted to latest).
Findings discovered during this audit
6. F6c — getaddednodeinfo RPC not implemented
Origin: not in #6's body — surfaced during audit as an adjacent
operability gap (Bitcoin Core has it; we don't). System impact now: none to consensus or liveness — operability only.
Operators can't inspect what addnode add queued except by grepping debug.log. System impact after fix: operators can query the manually-added-node list + connection status, matching Bitcoin Core ergonomics. Exact fix: implement getaddednodeinfo returning the manually-added-node
list + per-node connection state, mirroring Bitcoin Core's response shape. Tests:test/functional/bug_getaddednodeinfo_missing.py (runtime —
returns 'Unknown command'). Fix-test coverage: positive — added nodes appear in getaddednodeinfo;
negative — flip the 'Unknown command' assertion.
7. H2 — bug_rpc_busy_path_resets exits 0 even when INCONCLUSIVE
Origin: discovered during Step-1 verification. System impact now: test-hygiene only — when the test can't saturate the
busy-path cap (the normal case now that #5 P1 is fixed), it prints "INCONCLUSIVE" but still exits 0, looking like a pass when run on demand. Exact fix: either exit non-0 / skip on "couldn't saturate," OR retire
this reproducer now that feature_rpc_socket_backlog.py covers the F1
fix authoritatively.
vPayload memory pruning — explicitly deferred future work per the
original [TGE] BFT state snapshots to PoW chain #1 design comment ("Payload pruning will not be implemented
initially"). Larger refactor; not defense-in-depth.
Context
The merged
bft-snapshotsbranch (#1 / PR #3, #5 / PR #11, #6 / PR #10) closedthe Priority 1 items in each issue. The Priority 2 and Priority 3 items
remain open — they were listed in the original issue bodies but the merged PRs
deferred them. This tracker collects those + two adjacent gaps discovered
during the audit + one item the PR #3 review marked as future work.
Each item has a passing-while-the-gap-holds repro test on
test-coverage-issues-5-6(will land in a separate test-coverage PR linkedhere). When a fix lands, the corresponding test fails until the implementer
flips the assertion and adds the positive-path case.
Hard rules for the fixes
schedulers, or sub-systems. Reuse existing machinery (
connect_to_syncforF6a,
ProcessTimers()for F6b, etc.). If a fix seems to require a newsubsystem, stop and discuss.
(valid behavior still works) AND negative (the gap is closed). The
existing repros become the negative-path tests once flipped.
Findings — items the original tasks named (priority order)
1. F2b — RPC server does not support HTTP keep-alive
Original task: #5 Priority 3 (verbatim from #5 body).
Status: open after PR #11 (P1 backlog fix only).
System impact now: every RPC call opens a fresh TCP connection. FGP issues
~6 calls/round × 3 nodes → constant connect/teardown churn, accept-loop
pressure, defeats FGP's client-side pooling (compounds with FGP F3).
System impact after fix: persistent connections cut churn; FGP pool works.
Exact fix: in
src/network/rpc_server.cpp, honorConnection: keep-alive— keep the socket open for subsequent requests up to a sane idle timeout /
request limit. No new I/O framework.
Tests:
test/functional/bug_rpc_no_keep_alive.py(runtime — verifiedreproduces: second request on same socket →
BrokenPipeError).Fix-test coverage: positive — single request still works; negative — two
sequential requests on one socket both succeed.
2. F6a —
addnodehas no retry on transient failureOriginal task: #6 Priority 2 — "Retry logic for addnode connections.
When addnode 'add' initiates a connection and the async callback fails,
schedule a retry (e.g., 3 attempts with exponential backoff)."
Status: open after PR #10 (sync conversion + logging fixed; retry deferred).
System impact now:
addnode add <peer>against a momentarily-unreachablepeer fails once and is never retried — operator's manually-added peer
silently never connects if it was down at add time. Reduces connectivity
resilience.
System impact after fix: transient failures are retried with backoff;
manually-added peers eventually connect.
Exact fix: add bounded retry/backoff (3 attempts, exponential, per #6's
spec) for manually-added peers on transient connect failure, reusing
connect_to_sync/ the existing outbound scheduler. Don't build a new one.Tests:
test/functional/bug_addnode_no_retry.py(runtime — verifiedreproduces: only the original failed attempt in
debug.log, no retry).Fix-test coverage: positive — addnode to a reachable peer connects;
negative — addnode to a down-then-up peer eventually connects.
3. F2 — thread-per-request
std::thread(...).detach()modelOriginal task: #5 Priority 2 — "Replace detached threads with thread
pool" (with code example in #5 body).
Status: open after PR #11.
System impact now: RPC accept loop spawns a detached
std::threadperrequest. Under burst (3 FGP nodes' synchronized T1 timers firing ~6 calls
each within ~35 ms), thread creation is unbounded — resource pressure, no
backpressure beyond the busy-path cap.
System impact after fix: bounded concurrency via a fixed-size worker pool + queue; predictable resource use under burst.
Exact fix: replace per-request
std::thread(...).detach()with apre-allocated 4–8 worker thread pool (as #5 P2 itself suggests). No new
concurrency framework.
Tests:
test/functional/bug_rpc_thread_per_request.py(source-level grep— a runtime load test was evaluated and deferred: thread/FD counts are
scheduler-dependent and would be flaky; threading model is a source property).
Fix-test coverage: positive — RPC still serves requests; negative — flip
the source assertion (no per-request
std::thread/.detach; pool present).4. F6b — no FAST post-IBD chain-sync stall detection
Original task: #6 Priority 3 — "Consider post-IBD stall detection.
Add a lighter post-IBD check..."
Status: scoped down by the audit — most post-IBD logic IS present
(
ConsiderEviction, PING/inactivity timeouts, multi-peer sync). The specificgap: in
HeaderSyncManager::ProcessTimers(), theif (in_ibd)branch has async_deadlinecheck; theelse(post-IBD) branch has onlyConsiderEviction, no deadline.System impact now: a post-IBD peer that answers PINGs and has chainwork
not visibly behind but silently drags feet on headers is only caught by the
1200 s inactivity timeout — ~20 min wedge. IBD catches the same in seconds.
System impact after fix: a post-IBD deadline analog catches stalling peers
in seconds, matching the IBD path.
Exact fix: add a post-IBD deadline-based stall check in the
elsebranchof
ProcessTimers()— async_deadlineanalog that disconnects a peerwhich hasn't delivered expected headers within a window. Mirror the IBD
branch.
Tests:
test/functional/bug_post_ibd_stall_disabled.py(source-level —runtime variant deferred until a
node_simulator"post-IBD header-stall"scenario exists).
Fix-test coverage: positive — healthy post-IBD peer not disconnected;
negative — flip the source assertion.
Findings the PR #3 review marked as future work
5. S5 — Trust-base disk-load validation
Origin: PR #3 review explicitly marked "future work" (issue #1 audit).
System impact now:
LocalTrustBaseManager::Load()(
src/chain/trust_base_manager.cpp:19-43) deserializes.cborviaFromCBOR(data)and stores withoutIsValid()/VerifySignatures(). Atampered or corrupt-but-parseable
.cborloads silently. Threat modelnarrow (requires write access to data dir) but real.
System impact after fix: invalid trust bases rejected/skipped at load
time, not silently accepted as authoritative state.
Exact fix: in
Load()'s try block, afterFromCBOR(data), calltb.IsValid(prev)andtb.VerifySignatures(prev); reject (or skip with alogged error) on failure. ~5 lines.
Tests: the new SECTION in
test/unit/chain/trust_base_manager_tests.cpp(writes a
quorum_threshold = 0.cbordirectly to the data subdir,asserts
Load()accepts it silently — gap holds).Fix-test coverage: positive — valid trust bases still load; negative —
flip (the invalid TB is rejected / not promoted to latest).
Findings discovered during this audit
6. F6c —
getaddednodeinfoRPC not implementedOrigin: not in #6's body — surfaced during audit as an adjacent
operability gap (Bitcoin Core has it; we don't).
System impact now: none to consensus or liveness — operability only.
Operators can't inspect what
addnode addqueued except by greppingdebug.log.System impact after fix: operators can query the manually-added-node list + connection status, matching Bitcoin Core ergonomics.
Exact fix: implement
getaddednodeinforeturning the manually-added-nodelist + per-node connection state, mirroring Bitcoin Core's response shape.
Tests:
test/functional/bug_getaddednodeinfo_missing.py(runtime —returns
'Unknown command').Fix-test coverage: positive — added nodes appear in
getaddednodeinfo;negative — flip the
'Unknown command'assertion.7. H2 —
bug_rpc_busy_path_resetsexits 0 even when INCONCLUSIVEOrigin: discovered during Step-1 verification.
System impact now: test-hygiene only — when the test can't saturate the
busy-path cap (the normal case now that #5 P1 is fixed), it prints
"INCONCLUSIVE"but still exits 0, looking like a pass when run on demand.Exact fix: either exit non-0 / skip on "couldn't saturate," OR retire
this reproducer now that
feature_rpc_socket_backlog.pycovers the F1fix authoritatively.
Priority order
ProcessTimers().getaddednodeinfo) — operability improvement.Out of scope (link, don't duplicate)
idecho — fixed upstream inf7d9fdb(now on main); theFGP-side TODO cleanup is in finality-gadget#17.
<datadir>/reward_tokens.csvper@ahtotruu's clarification ("
cat csvis also CLI"). If [TGE] Amend block reward token handling #13's spec changesprompt a query RPC, that work belongs in [TGE] Amend block reward token handling #13.
original [TGE] BFT state snapshots to PoW chain #1 design comment ("Payload pruning will not be implemented
initially"). Larger refactor; not defense-in-depth.