fix(aggregator): classify submit_commitment/get_inclusion_proof failures as unreachable by vrogojin · Pull Request #1 · unicitynetwork/infra-probe

vrogojin · 2026-05-20T12:17:51Z

Summary

A real testnet outage on goggregator-test.unicity.network (HTTP 401 Unauthorized on submit_commitment) was being reported as DEGRADED (CLI exit 1) instead of UNREACHABLE (CLI exit 2). Downstream e2e pre-flight gates that only fire on exit 2 silently let test suites run against an aggregator that couldn't accept commitments — producing 35 phantom `Submit failed: [object Object]` failures on the SDK side (companion: sphere-sdk #191 / PR #192 fixes the SDK-side error stringification so the next outage at least surfaces the underlying reason).

Root cause: the verdict logic in src/probes/aggregator.mjs counted ALL failed checks uniformly — one fail = degraded, two fails = unreachable. That treated submit_commitment (the canonical write-path functional check the aggregator EXISTS to serve) as no more important than the diagnostic JSON-RPC plane probe. The CLAUDE.md "False-negative discipline" section explicitly warns against exactly this: "Functional checks are authoritative for the verdict; advisory checks are not."

Fix

Mark submit_commitment and get_inclusion_proof checks with a new critical: true field (backward-compatible — default falsy).
Extract verdict computation into a pure exported function computeAggregatorStatus(checks) so the rule is testable network-free.
New rule:
1. Any critical-check fail → `unreachable`.
2. Else ≥ 2 non-critical fails → `unreachable` (legacy preserved).
3. Else exactly 1 non-critical fail OR any warn → `degraded`.
4. Else → `healthy`.

Tests

8 new network-free tests in `tests/smoke.test.mjs` pin the rule:

healthy path
single non-critical fail → degraded
warn → degraded
critical fail (submit_commitment) → unreachable regardless of other passes (the headline reproducer)
critical fail (get_inclusion_proof) → unreachable
two non-critical fails → unreachable (legacy preserved)
critical fail wins over warn-on-others
explicit `critical: false` treated as non-critical

All 29 tests pass: `npm test` → `29 pass | 0 fail`.

Verified against live testnet

Before (v0.4.0):
```
⚠ aggregator https://goggregator-test.unicity.network
✓ health
✓ json-rpc
✗ submit_commitment HTTP 401 Unauthorized
✓ get_inclusion_proof
Status: DEGRADED (3/4 checks passed)
```
exit code: 1 (e2e preflight gates that only fire on exit 2 miss this).

After (this PR):
```
❌ aggregator https://goggregator-test.unicity.network
✓ health
✓ json-rpc
✗ submit_commitment HTTP 401 Unauthorized
✓ get_inclusion_proof
Status: UNREACHABLE (3/4 checks passed)
```
exit code: 2 (preflight gate fires; tests fail cleanly with a real signal).

Scope intentionally tight

Other probes (nostr publish, ipfs add+fetch, fulcrum tx-index) likely have similar functional checks that deserve critical: true flagging. Deferred to follow-up — this PR scopes to the reported regression.

The looseness in get_inclusion_proof (accepts any `result` object even when the just-submitted commitment didn't land) is a separate concern noted in CHANGELOG. Also deferred.

Test plan

`npm test` — 29/29 pass
Live probe against goggregator-test reproduces UNREACHABLE + exit 2
Exit code 0 verified for healthy services (`--only fulcrum`)
Downstream e2e preflight scripts pick up the new exit-2 signal — needs a quick check of consumers (out of scope for this repo)

Adds a sixth probe to the suite for the Unicity test faucet (https://faucet.unicity.network). Many SDK e2e suites depend on the faucet for wallet funding (uxf-send-receive, pointer-roundtrip, migrate-to-profile-conservation, profile-export-roundtrip); when the faucet is down, those suites silently time out at 240 s with a generic "Faucet top-up timed out" message. Probing upfront converts that into a clean 1-2 s SKIP with a precise diagnostic. The probe issues POST /api/v1/faucet/request with a deliberately- invalid probe nametag (`infra-probe-do-not-mint-zk7q3xa9p2v`). Healthy faucet returns HTTP 4xx + structured `{success: false, error: "Nametag not found: ..."}`. This exercises the full HTTP/parse/nametag-resolve/response-shaping pipeline WITHOUT consuming actual faucet quota or requiring a real wallet — the same "empty cost, full coverage" pattern the aggregator probe uses with its known-bad shard ID. Defense-in-depth: if the faucet ever returns success:true for the probe nametag, the probe DOWNGRADES to 'degraded' (validation may be broken — the probe nametag is supposed to be invalid). Network config: - testnet, dev: faucet = "https://faucet.unicity.network" - mainnet: faucet = null (no faucet by design) The probe layer treats null as a clean skip with verdict 'healthy' — cleaner than emitting a misleading "unreachable" verdict against a default URL that doesn't apply to the network being probed. SERVICES expands from 5 to 6: ['nostr', 'aggregator', 'ipfs', 'fulcrum', 'market', 'faucet']. The 'faucet' entry is appended at the end so existing --only invocations are unaffected. Tests: - 21 smoke tests pass (added 1 for mainnet/testnet faucet contract) - Live testnet probe: 6/6 services healthy - Live mainnet probe with --only faucet: skipped cleanly with "network mainnet has no faucet by design — skipped"

…res as unreachable The verdict logic counted ALL failed checks uniformly — one fail = degraded, two fails = unreachable. That treated submit_commitment (the canonical write-path functional check the aggregator exists to serve) as no more important than a diagnostic JSON-RPC plane probe. Symptom: testnet returns HTTP 401 Unauthorized on submit_commitment while /health and get_block_height keep working. Old verdict: `DEGRADED` (exit 1). Downstream e2e pre-flight gates that only fire on exit code 2 silently let test suites run against an aggregator that cannot accept commitments, producing 35 phantom `Submit failed: [object Object]` failures on the sphere-sdk side (companion issue: sphere-sdk #191). Fix: - Mark submit_commitment + get_inclusion_proof checks with `critical: true`. - Extract verdict computation into a pure `computeAggregatorStatus(checks)` function, exported so the rule is testable network-free. - Rule: any critical-check fail → 'unreachable'. Non-critical fails follow the legacy "≥ 2 → unreachable, exactly 1 → degraded" rule. Warns continue to drive 'degraded'. - 8 new smoke tests pin the rule: healthy / non-critical degraded / warn / critical-fail unreachable (both checks independently) / multi-fail legacy preserved / critical-fail-wins-over-warn / explicit critical:false. Encodes the CLAUDE.md "False-negative discipline" principle directly: the functional layer "is what catches real outages that liveness misses" — and when it catches them, the verdict MUST reflect that the service is unusable for its intended purpose, not merely slow. Bumps to 0.4.1.

gemini-code-assist

Code Review

This pull request introduces a new 'faucet' probe to the infrastructure monitoring tool, enabling verification of the Unicity test faucet. It also improves the aggregator probe's reliability by introducing a 'critical' check status, ensuring that failures in core write paths (submit_commitment and get_inclusion_proof) correctly trigger an 'unreachable' verdict instead of 'degraded'. The reviewer suggested elevating several error conditions in the new faucet probe from 'degraded' to 'unreachable' to better reflect the impact of malformed API responses on service availability.

gemini-code-assist · 2026-05-20T12:21:22Z

+        latencyMs,
+        message: 'response was not JSON',
+      });
+      return finalize('degraded', 'response shape malformed');


The request check is the primary functional check for the faucet. If the response is not valid JSON, the service is effectively unusable for a client. This should be considered unreachable, not degraded. This aligns with the primary goal of this PR, which is to correctly classify critical failures.

Suggested change

return finalize('degraded', 'response shape malformed');

return finalize('unreachable', 'response shape malformed');

gemini-code-assist · 2026-05-20T12:21:23Z

+        latencyMs,
+        message: `unexpected success field: ${typeof json.success}`,
+      });
+      return finalize('degraded', 'response shape malformed');


An unexpected success field value indicates a malformed response from the faucet's API. This should be treated as a critical failure, making the service unreachable rather than degraded.

Suggested change

return finalize('degraded', 'response shape malformed');

return finalize('unreachable', 'response shape malformed');

gemini-code-assist · 2026-05-20T12:21:23Z

+        latencyMs,
+        message: 'success:false but no error/message field',
+      });
+      return finalize('degraded', 'error field missing on rejection');


When the faucet API indicates a failure (success: false) but omits the error or message field, the response is incomplete and violates the expected contract. This points to a server-side issue that should render the service unreachable, not just degraded.

Suggested change

return finalize('degraded', 'error field missing on rejection');

return finalize('unreachable', 'error field missing on rejection');

Companion work for the sphere-sdk hermetic e2e stack (tests/e2e/local-infra/) which boots a local aggregator in BFT_ENABLED=false standalone mode. Two changes layered on top of #1's critical-check verdict fix: 1. NETWORKS.local — endpoints for the docker-compose stack: - aggregator http://127.0.0.1:3001 - nostr ws://127.0.0.1:7777 - ipfs http://127.0.0.1:8082 Fulcrum/Market intentionally non-local (no local counterpart yet). Faucet null — the local faucet is DM-driven, not HTTP, so the existing faucet probe doesn't apply. 2. aggregator /health body parser accepts both shapes: - BFT mode: {"status":"healthy","database":"ok","aggregators":{...}} - Standalone mode: {"status":"ok","role":"standalone","details":{"database":"connected",...}} Previously the standalone shape was reported as `degraded` even when the aggregator was fully functional (submit_commitment + get_inclusion_proof both passing). Verified against the live local stack: unicity-infra-probe --network local --only aggregator → HEALTHY (4/4 checks passed), exit 0. Tests: 30/30 pass. Two new tests cover NETWORKS.local shape + NETWORKS.local.faucet === null. Bumps to 0.4.2.

vrogojin added 2 commits May 4, 2026 17:01

gemini-code-assist Bot reviewed May 20, 2026

View reviewed changes

vrogojin mentioned this pull request May 20, 2026

feat: add 'local' network + accept standalone aggregator status='ok' #2

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(aggregator): classify submit_commitment/get_inclusion_proof failures as unreachable#1

fix(aggregator): classify submit_commitment/get_inclusion_proof failures as unreachable#1
vrogojin wants to merge 2 commits into
mainfrom
fix/aggregator-critical-fail-unreachable

vrogojin commented May 20, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 20, 2026

Uh oh!

gemini-code-assist Bot May 20, 2026

Uh oh!

gemini-code-assist Bot May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	return finalize('degraded', 'response shape malformed');
	return finalize('unreachable', 'response shape malformed');

	return finalize('degraded', 'error field missing on rejection');
	return finalize('unreachable', 'error field missing on rejection');

Conversation

vrogojin commented May 20, 2026

Summary

Fix

Tests

Verified against live testnet

Scope intentionally tight

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant