fix(aggregator): classify submit_commitment/get_inclusion_proof failures as unreachable#1
fix(aggregator): classify submit_commitment/get_inclusion_proof failures as unreachable#1vrogojin wants to merge 2 commits into
Conversation
Adds a sixth probe to the suite for the Unicity test faucet (https://faucet.unicity.network). Many SDK e2e suites depend on the faucet for wallet funding (uxf-send-receive, pointer-roundtrip, migrate-to-profile-conservation, profile-export-roundtrip); when the faucet is down, those suites silently time out at 240 s with a generic "Faucet top-up timed out" message. Probing upfront converts that into a clean 1-2 s SKIP with a precise diagnostic. The probe issues POST /api/v1/faucet/request with a deliberately- invalid probe nametag (`infra-probe-do-not-mint-zk7q3xa9p2v`). Healthy faucet returns HTTP 4xx + structured `{success: false, error: "Nametag not found: ..."}`. This exercises the full HTTP/parse/nametag-resolve/response-shaping pipeline WITHOUT consuming actual faucet quota or requiring a real wallet — the same "empty cost, full coverage" pattern the aggregator probe uses with its known-bad shard ID. Defense-in-depth: if the faucet ever returns success:true for the probe nametag, the probe DOWNGRADES to 'degraded' (validation may be broken — the probe nametag is supposed to be invalid). Network config: - testnet, dev: faucet = "https://faucet.unicity.network" - mainnet: faucet = null (no faucet by design) The probe layer treats null as a clean skip with verdict 'healthy' — cleaner than emitting a misleading "unreachable" verdict against a default URL that doesn't apply to the network being probed. SERVICES expands from 5 to 6: ['nostr', 'aggregator', 'ipfs', 'fulcrum', 'market', 'faucet']. The 'faucet' entry is appended at the end so existing --only invocations are unaffected. Tests: - 21 smoke tests pass (added 1 for mainnet/testnet faucet contract) - Live testnet probe: 6/6 services healthy - Live mainnet probe with --only faucet: skipped cleanly with "network mainnet has no faucet by design — skipped"
…res as unreachable
The verdict logic counted ALL failed checks uniformly — one fail = degraded,
two fails = unreachable. That treated submit_commitment (the canonical
write-path functional check the aggregator exists to serve) as no more
important than a diagnostic JSON-RPC plane probe.
Symptom: testnet returns HTTP 401 Unauthorized on submit_commitment while
/health and get_block_height keep working. Old verdict: `DEGRADED` (exit 1).
Downstream e2e pre-flight gates that only fire on exit code 2 silently let
test suites run against an aggregator that cannot accept commitments,
producing 35 phantom `Submit failed: [object Object]` failures on the
sphere-sdk side (companion issue: sphere-sdk #191).
Fix:
- Mark submit_commitment + get_inclusion_proof checks with `critical: true`.
- Extract verdict computation into a pure `computeAggregatorStatus(checks)`
function, exported so the rule is testable network-free.
- Rule: any critical-check fail → 'unreachable'. Non-critical fails follow
the legacy "≥ 2 → unreachable, exactly 1 → degraded" rule. Warns
continue to drive 'degraded'.
- 8 new smoke tests pin the rule: healthy / non-critical degraded / warn /
critical-fail unreachable (both checks independently) / multi-fail legacy
preserved / critical-fail-wins-over-warn / explicit critical:false.
Encodes the CLAUDE.md "False-negative discipline" principle directly: the
functional layer "is what catches real outages that liveness misses" — and
when it catches them, the verdict MUST reflect that the service is
unusable for its intended purpose, not merely slow.
Bumps to 0.4.1.
There was a problem hiding this comment.
Code Review
This pull request introduces a new 'faucet' probe to the infrastructure monitoring tool, enabling verification of the Unicity test faucet. It also improves the aggregator probe's reliability by introducing a 'critical' check status, ensuring that failures in core write paths (submit_commitment and get_inclusion_proof) correctly trigger an 'unreachable' verdict instead of 'degraded'. The reviewer suggested elevating several error conditions in the new faucet probe from 'degraded' to 'unreachable' to better reflect the impact of malformed API responses on service availability.
| latencyMs, | ||
| message: 'response was not JSON', | ||
| }); | ||
| return finalize('degraded', 'response shape malformed'); |
There was a problem hiding this comment.
The request check is the primary functional check for the faucet. If the response is not valid JSON, the service is effectively unusable for a client. This should be considered unreachable, not degraded. This aligns with the primary goal of this PR, which is to correctly classify critical failures.
| return finalize('degraded', 'response shape malformed'); | |
| return finalize('unreachable', 'response shape malformed'); |
| latencyMs, | ||
| message: `unexpected success field: ${typeof json.success}`, | ||
| }); | ||
| return finalize('degraded', 'response shape malformed'); |
There was a problem hiding this comment.
An unexpected success field value indicates a malformed response from the faucet's API. This should be treated as a critical failure, making the service unreachable rather than degraded.
| return finalize('degraded', 'response shape malformed'); | |
| return finalize('unreachable', 'response shape malformed'); |
| latencyMs, | ||
| message: 'success:false but no error/message field', | ||
| }); | ||
| return finalize('degraded', 'error field missing on rejection'); |
There was a problem hiding this comment.
When the faucet API indicates a failure (success: false) but omits the error or message field, the response is incomplete and violates the expected contract. This points to a server-side issue that should render the service unreachable, not just degraded.
| return finalize('degraded', 'error field missing on rejection'); | |
| return finalize('unreachable', 'error field missing on rejection'); |
Companion work for the sphere-sdk hermetic e2e stack (tests/e2e/local-infra/) which boots a local aggregator in BFT_ENABLED=false standalone mode. Two changes layered on top of #1's critical-check verdict fix: 1. NETWORKS.local — endpoints for the docker-compose stack: - aggregator http://127.0.0.1:3001 - nostr ws://127.0.0.1:7777 - ipfs http://127.0.0.1:8082 Fulcrum/Market intentionally non-local (no local counterpart yet). Faucet null — the local faucet is DM-driven, not HTTP, so the existing faucet probe doesn't apply. 2. aggregator /health body parser accepts both shapes: - BFT mode: {"status":"healthy","database":"ok","aggregators":{...}} - Standalone mode: {"status":"ok","role":"standalone","details":{"database":"connected",...}} Previously the standalone shape was reported as `degraded` even when the aggregator was fully functional (submit_commitment + get_inclusion_proof both passing). Verified against the live local stack: unicity-infra-probe --network local --only aggregator → HEALTHY (4/4 checks passed), exit 0. Tests: 30/30 pass. Two new tests cover NETWORKS.local shape + NETWORKS.local.faucet === null. Bumps to 0.4.2.
Summary
A real testnet outage on
goggregator-test.unicity.network(HTTP 401 Unauthorized onsubmit_commitment) was being reported asDEGRADED(CLI exit 1) instead ofUNREACHABLE(CLI exit 2). Downstream e2e pre-flight gates that only fire on exit 2 silently let test suites run against an aggregator that couldn't accept commitments — producing 35 phantom `Submit failed: [object Object]` failures on the SDK side (companion: sphere-sdk #191 / PR #192 fixes the SDK-side error stringification so the next outage at least surfaces the underlying reason).Root cause: the verdict logic in
src/probes/aggregator.mjscounted ALL failed checks uniformly — one fail = degraded, two fails = unreachable. That treatedsubmit_commitment(the canonical write-path functional check the aggregator EXISTS to serve) as no more important than the diagnostic JSON-RPC plane probe. The CLAUDE.md "False-negative discipline" section explicitly warns against exactly this: "Functional checks are authoritative for the verdict; advisory checks are not."Fix
submit_commitmentandget_inclusion_proofchecks with a newcritical: truefield (backward-compatible — default falsy).computeAggregatorStatus(checks)so the rule is testable network-free.Tests
8 new network-free tests in `tests/smoke.test.mjs` pin the rule:
All 29 tests pass: `npm test` → `29 pass | 0 fail`.
Verified against live testnet
Before (v0.4.0):
```
⚠ aggregator https://goggregator-test.unicity.network
✓ health
✓ json-rpc
✗ submit_commitment HTTP 401 Unauthorized
✓ get_inclusion_proof
Status: DEGRADED (3/4 checks passed)
```
exit code: 1 (e2e preflight gates that only fire on exit 2 miss this).
After (this PR):
```
❌ aggregator https://goggregator-test.unicity.network
✓ health
✓ json-rpc
✗ submit_commitment HTTP 401 Unauthorized
✓ get_inclusion_proof
Status: UNREACHABLE (3/4 checks passed)
```
exit code: 2 (preflight gate fires; tests fail cleanly with a real signal).
Scope intentionally tight
Other probes (nostr publish, ipfs add+fetch, fulcrum tx-index) likely have similar functional checks that deserve
critical: trueflagging. Deferred to follow-up — this PR scopes to the reported regression.The looseness in
get_inclusion_proof(accepts any `result` object even when the just-submitted commitment didn't land) is a separate concern noted in CHANGELOG. Also deferred.Test plan