test: unicity-node audit coverage (#1 / #5 / #6) + hardening repros (…#15
test: unicity-node audit coverage (#1 / #5 / #6) + hardening repros (…#15b3y0urs3lf wants to merge 1 commit into
Conversation
) Adds the test suite produced during the May 2026 audit of issues #1 (PR #3 — BFT state snapshots), #5 (PR #11 — RPC server robustness), and #6 (PR #10 — addnode & state queries). Test-only change; no production code is modified. Group A — regression / lock-in for behavior already delivered: - validation_tests.cpp: S1 mismatched-payloadRoot (3 sections, #1 AC.2/4), S4 exact-MAX boundary (#1 AC.3), req-6 negatives (2 sections, #1 AC.6) - token_generator_test.cpp: 0600 file-mode (#1 AC.9 / PR #3 review fix) - bft_client_tests.cpp: read-timeout (#1 AC.5 failure mode) - persistence_tests.cpp: vPayload round-trip (#1 AC.2/4 persistence) - rpc_integration_tests.cpp: fix "Invalid hex characters" SECTION that previously bailed on missing-rewardtokenid before hex parsing (now reaches validation; companion to lploom's Apr 8 Python-side fix) - feature_rpc_socket_backlog.py: #5 Priority 1 / PR #11 - feature_addnode_sync_rpc.py + feature_addnode_logging.py: #6 sync conversion + logging - feature_reward_tokens_csv.py: #1 AC.11 (CSV is the agreed interface per @ahtotruu's clarification: "cat csv is also CLI") - feature_bft_utb_retrieval.py: #1 AC.5 e2e against the live bft-fgp-2sh cluster (HTTP fetch → CBOR → safety check) Group B — passes-while-gap-holds repros (tracked in #14): - bug_rpc_no_keep_alive.py: #5 P3 / #14 F2b - bug_addnode_no_retry.py: #6 P2 / #14 F6a - bug_rpc_thread_per_request.py: #5 P2 / #14 F2 (source-level) - bug_post_ibd_stall_disabled.py: #6 P3 / #14 F6b (source-level) - bug_getaddednodeinfo_missing.py: audit-discovered / #14 F6c - bug_reward_token_list_missing.py: design-doc tripwire per @ahtotruu's clarification (no RPC by design; CSV is the interface) - bug_rpc_busy_path_resets.py: fixed-but-inconclusive / #14 H2 Flip-on-fix contract: the bug_* tests pass today by reproducing the gap. They are NOT discovered by test/functional/test_runner.py (its allowlist excludes bug_) — run them on demand. When an #14 fix lands, the corresponding bug_* test will exit non-zero ("gap closed unexpectedly"); the implementer should then invert the assertion, add the positive-path case, and rename to feature_* so it joins default CI. Refs: #1 #5 #6 #14 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request adds a comprehensive suite of functional and unit tests to reproduce known bugs and verify key features, including synchronous addnode behavior, RPC server backlog and concurrency handling, BFT UTB retrieval, and block validation. The reviewer's feedback highlights several opportunities to harden these tests against crashes and flakiness. Key recommendations include wrapping Linux-specific path checks in try-except blocks for cross-platform compatibility, dynamically allocating ports to avoid CI conflicts, handling potential IndexError and socket connection exceptions, simplifying fragile regex parsing, and refactoring the BFT client to make timeouts configurable to avoid slow unit tests.
| with open("/proc/sys/net/core/somaxconn") as f: | ||
| somaxconn = int(f.read().strip()) | ||
| log(f" /proc/sys/net/core/somaxconn = {somaxconn}", | ||
| GREEN if somaxconn >= 128 else RED) | ||
| if somaxconn < 128: | ||
| failures.append(f"SOMAXCONN too small: {somaxconn}") |
There was a problem hiding this comment.
The test attempts to read /proc/sys/net/core/somaxconn directly, which is a Linux-specific path. Running this test on non-Linux platforms (such as macOS or Windows) will raise a FileNotFoundError and crash the test suite.
To ensure cross-platform compatibility, wrap this check in a try-except block to gracefully skip the kernel parameter check on non-Linux systems.
| with open("/proc/sys/net/core/somaxconn") as f: | |
| somaxconn = int(f.read().strip()) | |
| log(f" /proc/sys/net/core/somaxconn = {somaxconn}", | |
| GREEN if somaxconn >= 128 else RED) | |
| if somaxconn < 128: | |
| failures.append(f"SOMAXCONN too small: {somaxconn}") | |
| try: | |
| with open("/proc/sys/net/core/somaxconn") as f: | |
| somaxconn = int(f.read().strip()) | |
| log(f" /proc/sys/net/core/somaxconn = {somaxconn}", | |
| GREEN if somaxconn >= 128 else RED) | |
| if somaxconn < 128: | |
| failures.append(f"SOMAXCONN too small: {somaxconn}") | |
| except FileNotFoundError: | |
| log(" /proc/sys/net/core/somaxconn not found (non-Linux platform), skipping kernel check", YELLOW) |
| node.start() | ||
|
|
||
| # Trigger several distinct failures on different ports so we can count log lines. | ||
| ports = [65501, 65502, 65503, 65504, 65505] |
There was a problem hiding this comment.
Using hardcoded ports (like 65501 through 65505) in functional tests can lead to intermittent test failures and flakiness if those ports are already in use by other processes on the host machine or in a shared CI environment.
Since pick_free_port is already imported in this file, use it to dynamically allocate unused ports for the test.
| ports = [65501, 65502, 65503, 65504, 65505] | |
| ports = [pick_free_port() for _ in range(5)] |
| rows = list(reader) | ||
|
|
||
| # Header line | ||
| header = rows[0] |
There was a problem hiding this comment.
If the CSV file is empty (e.g., due to a write failure or serialization issue in the node), rows will be an empty list, and accessing rows[0] will raise an unhandled IndexError: list index out of range, crashing the test.
Add a check to ensure rows is not empty before accessing the header.
if not rows:
log(" FAIL — CSV file is empty", RED)
return 1
header = rows[0]| s.settimeout(timeout) | ||
| data = b"" | ||
| while b"\r\n\r\n" not in data: | ||
| chunk = s.recv(4096) | ||
| if not chunk: | ||
| return None, None, None, True | ||
| data += chunk |
There was a problem hiding this comment.
The read_http_response helper function does not handle socket connection errors (such as ConnectionResetError, BrokenPipeError, or general OSError) during the initial receive loop. If the server abruptly closes or resets the connection, the test will crash with an unhandled exception instead of failing gracefully or correctly identifying that the connection was closed.
Wrap the receive loop in a try-except block to handle these connection errors and return closed = True.
| s.settimeout(timeout) | |
| data = b"" | |
| while b"\r\n\r\n" not in data: | |
| chunk = s.recv(4096) | |
| if not chunk: | |
| return None, None, None, True | |
| data += chunk | |
| s.settimeout(timeout) | |
| data = b"" | |
| try: | |
| while b"\r\n\r\n" not in data: | |
| chunk = s.recv(4096) | |
| if not chunk: | |
| return None, None, None, True | |
| data += chunk | |
| except (ConnectionResetError, BrokenPipeError, OSError): | |
| return None, None, None, True |
| m = re.search(r"void\s+RPCServer::ServerThread\s*\(\s*\)\s*\{(.+?)\n\}\n", text, re.DOTALL) | ||
| if not m: | ||
| log("ServerThread() function not found — has it been renamed/restructured?", YELLOW) | ||
| log("That alone could indicate the fix landed (e.g. a thread-pool refactor).", YELLOW) | ||
| return 1 | ||
| body = m.group(1) |
There was a problem hiding this comment.
The regular expression used to extract the body of RPCServer::ServerThread is highly fragile. It relies on the exact indentation and formatting of the closing braces (expecting \n}\n with no leading whitespace). If the source file is auto-formatted, or if nested blocks with specific indentation are added, the regex will fail to match or truncate the function body prematurely.
Since the presence of std::thread and .detach() in rpc_server.cpp is already a strong indicator of the thread-per-request pattern, consider searching the entire file content instead of attempting to extract the function body with a fragile regex.
| m = re.search(r"void\s+RPCServer::ServerThread\s*\(\s*\)\s*\{(.+?)\n\}\n", text, re.DOTALL) | |
| if not m: | |
| log("ServerThread() function not found — has it been renamed/restructured?", YELLOW) | |
| log("That alone could indicate the fix landed (e.g. a thread-pool refactor).", YELLOW) | |
| return 1 | |
| body = m.group(1) | |
| if "void RPCServer::ServerThread" not in text: | |
| log("ServerThread() function not found — has it been renamed/restructured?", YELLOW) | |
| log("That alone could indicate the fix landed (e.g. a thread-pool refactor).", YELLOW) | |
| return 1 | |
| body = text |
| SECTION("Timeout: slow server triggers read timeout (req 5 failure mode)") { | ||
| svr.Get("/api/v1/trustbases/slow", [](const httplib::Request&, httplib::Response& res) { | ||
| // Block longer than the client's 5 s read-timeout. | ||
| std::this_thread::sleep_for(std::chrono::seconds(6)); |
There was a problem hiding this comment.
Sleeping for 6 seconds in a unit test is a major anti-pattern as it significantly slows down the test suite execution. This is currently required because the read timeout of 5 seconds is hardcoded in bft_client.cpp.
Consider refactoring HttpBFTClient in a future production PR to make the read timeout configurable (e.g., via a constructor parameter or setter). This would allow unit tests to configure a very short timeout (e.g., 50ms) and execute almost instantaneously.
…#14)
Adds the test suite produced during the May 2026 audit of issues #1 (PR #3 — BFT state snapshots), #5 (PR #11 — RPC server robustness), and #6 (PR #10 — addnode & state queries). Test-only change; no production code is modified.
Group A — regression / lock-in for behavior already delivered:
Group B — passes-while-gap-holds repros (tracked in #14):
Flip-on-fix contract: the bug_* tests pass today by reproducing the gap. They are NOT discovered by test/functional/test_runner.py (its allowlist excludes bug_) — run them on demand. When an #14 fix lands, the corresponding bug_* test will exit non-zero ("gap closed unexpectedly"); the implementer should then invert the assertion, add the positive-path case, and rename to feature_* so it joins default CI.
Refs: #1 #5 #6 #14