Skip to content

fix(aggregator): classify submit_commitment/get_inclusion_proof failures as unreachable#1

Open
vrogojin wants to merge 2 commits into
mainfrom
fix/aggregator-critical-fail-unreachable
Open

fix(aggregator): classify submit_commitment/get_inclusion_proof failures as unreachable#1
vrogojin wants to merge 2 commits into
mainfrom
fix/aggregator-critical-fail-unreachable

Conversation

@vrogojin
Copy link
Copy Markdown
Contributor

Summary

A real testnet outage on goggregator-test.unicity.network (HTTP 401 Unauthorized on submit_commitment) was being reported as DEGRADED (CLI exit 1) instead of UNREACHABLE (CLI exit 2). Downstream e2e pre-flight gates that only fire on exit 2 silently let test suites run against an aggregator that couldn't accept commitments — producing 35 phantom `Submit failed: [object Object]` failures on the SDK side (companion: sphere-sdk #191 / PR #192 fixes the SDK-side error stringification so the next outage at least surfaces the underlying reason).

Root cause: the verdict logic in src/probes/aggregator.mjs counted ALL failed checks uniformly — one fail = degraded, two fails = unreachable. That treated submit_commitment (the canonical write-path functional check the aggregator EXISTS to serve) as no more important than the diagnostic JSON-RPC plane probe. The CLAUDE.md "False-negative discipline" section explicitly warns against exactly this: "Functional checks are authoritative for the verdict; advisory checks are not."

Fix

  • Mark submit_commitment and get_inclusion_proof checks with a new critical: true field (backward-compatible — default falsy).
  • Extract verdict computation into a pure exported function computeAggregatorStatus(checks) so the rule is testable network-free.
  • New rule:
    1. Any critical-check fail → `unreachable`.
    2. Else ≥ 2 non-critical fails → `unreachable` (legacy preserved).
    3. Else exactly 1 non-critical fail OR any warn → `degraded`.
    4. Else → `healthy`.

Tests

8 new network-free tests in `tests/smoke.test.mjs` pin the rule:

  • healthy path
  • single non-critical fail → degraded
  • warn → degraded
  • critical fail (submit_commitment) → unreachable regardless of other passes (the headline reproducer)
  • critical fail (get_inclusion_proof) → unreachable
  • two non-critical fails → unreachable (legacy preserved)
  • critical fail wins over warn-on-others
  • explicit `critical: false` treated as non-critical

All 29 tests pass: `npm test` → `29 pass | 0 fail`.

Verified against live testnet

Before (v0.4.0):
```
⚠ aggregator https://goggregator-test.unicity.network
✓ health
✓ json-rpc
✗ submit_commitment HTTP 401 Unauthorized
✓ get_inclusion_proof
Status: DEGRADED (3/4 checks passed)
```
exit code: 1 (e2e preflight gates that only fire on exit 2 miss this).

After (this PR):
```
❌ aggregator https://goggregator-test.unicity.network
✓ health
✓ json-rpc
✗ submit_commitment HTTP 401 Unauthorized
✓ get_inclusion_proof
Status: UNREACHABLE (3/4 checks passed)
```
exit code: 2 (preflight gate fires; tests fail cleanly with a real signal).

Scope intentionally tight

Other probes (nostr publish, ipfs add+fetch, fulcrum tx-index) likely have similar functional checks that deserve critical: true flagging. Deferred to follow-up — this PR scopes to the reported regression.

The looseness in get_inclusion_proof (accepts any `result` object even when the just-submitted commitment didn't land) is a separate concern noted in CHANGELOG. Also deferred.

Test plan

  • `npm test` — 29/29 pass
  • Live probe against goggregator-test reproduces UNREACHABLE + exit 2
  • Exit code 0 verified for healthy services (`--only fulcrum`)
  • Downstream e2e preflight scripts pick up the new exit-2 signal — needs a quick check of consumers (out of scope for this repo)

vrogojin added 2 commits May 4, 2026 17:01
Adds a sixth probe to the suite for the Unicity test faucet
(https://faucet.unicity.network). Many SDK e2e suites depend on the
faucet for wallet funding (uxf-send-receive, pointer-roundtrip,
migrate-to-profile-conservation, profile-export-roundtrip); when the
faucet is down, those suites silently time out at 240 s with a generic
"Faucet top-up timed out" message. Probing upfront converts that into
a clean 1-2 s SKIP with a precise diagnostic.

The probe issues POST /api/v1/faucet/request with a deliberately-
invalid probe nametag (`infra-probe-do-not-mint-zk7q3xa9p2v`).
Healthy faucet returns HTTP 4xx + structured
`{success: false, error: "Nametag not found: ..."}`. This exercises the
full HTTP/parse/nametag-resolve/response-shaping pipeline WITHOUT
consuming actual faucet quota or requiring a real wallet — the same
"empty cost, full coverage" pattern the aggregator probe uses with its
known-bad shard ID.

Defense-in-depth: if the faucet ever returns success:true for the
probe nametag, the probe DOWNGRADES to 'degraded' (validation may be
broken — the probe nametag is supposed to be invalid).

Network config:
- testnet, dev: faucet = "https://faucet.unicity.network"
- mainnet: faucet = null (no faucet by design)
The probe layer treats null as a clean skip with verdict 'healthy' —
cleaner than emitting a misleading "unreachable" verdict against a
default URL that doesn't apply to the network being probed.

SERVICES expands from 5 to 6: ['nostr', 'aggregator', 'ipfs',
'fulcrum', 'market', 'faucet']. The 'faucet' entry is appended at the
end so existing --only invocations are unaffected.

Tests:
- 21 smoke tests pass (added 1 for mainnet/testnet faucet contract)
- Live testnet probe: 6/6 services healthy
- Live mainnet probe with --only faucet: skipped cleanly with
  "network mainnet has no faucet by design — skipped"
…res as unreachable

The verdict logic counted ALL failed checks uniformly — one fail = degraded,
two fails = unreachable. That treated submit_commitment (the canonical
write-path functional check the aggregator exists to serve) as no more
important than a diagnostic JSON-RPC plane probe.

Symptom: testnet returns HTTP 401 Unauthorized on submit_commitment while
/health and get_block_height keep working. Old verdict: `DEGRADED` (exit 1).
Downstream e2e pre-flight gates that only fire on exit code 2 silently let
test suites run against an aggregator that cannot accept commitments,
producing 35 phantom `Submit failed: [object Object]` failures on the
sphere-sdk side (companion issue: sphere-sdk #191).

Fix:
  - Mark submit_commitment + get_inclusion_proof checks with `critical: true`.
  - Extract verdict computation into a pure `computeAggregatorStatus(checks)`
    function, exported so the rule is testable network-free.
  - Rule: any critical-check fail → 'unreachable'. Non-critical fails follow
    the legacy "≥ 2 → unreachable, exactly 1 → degraded" rule. Warns
    continue to drive 'degraded'.
  - 8 new smoke tests pin the rule: healthy / non-critical degraded / warn /
    critical-fail unreachable (both checks independently) / multi-fail legacy
    preserved / critical-fail-wins-over-warn / explicit critical:false.

Encodes the CLAUDE.md "False-negative discipline" principle directly: the
functional layer "is what catches real outages that liveness misses" — and
when it catches them, the verdict MUST reflect that the service is
unusable for its intended purpose, not merely slow.

Bumps to 0.4.1.
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new 'faucet' probe to the infrastructure monitoring tool, enabling verification of the Unicity test faucet. It also improves the aggregator probe's reliability by introducing a 'critical' check status, ensuring that failures in core write paths (submit_commitment and get_inclusion_proof) correctly trigger an 'unreachable' verdict instead of 'degraded'. The reviewer suggested elevating several error conditions in the new faucet probe from 'degraded' to 'unreachable' to better reflect the impact of malformed API responses on service availability.

Comment thread src/probes/faucet.mjs
latencyMs,
message: 'response was not JSON',
});
return finalize('degraded', 'response shape malformed');
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The request check is the primary functional check for the faucet. If the response is not valid JSON, the service is effectively unusable for a client. This should be considered unreachable, not degraded. This aligns with the primary goal of this PR, which is to correctly classify critical failures.

Suggested change
return finalize('degraded', 'response shape malformed');
return finalize('unreachable', 'response shape malformed');

Comment thread src/probes/faucet.mjs
latencyMs,
message: `unexpected success field: ${typeof json.success}`,
});
return finalize('degraded', 'response shape malformed');
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

An unexpected success field value indicates a malformed response from the faucet's API. This should be treated as a critical failure, making the service unreachable rather than degraded.

Suggested change
return finalize('degraded', 'response shape malformed');
return finalize('unreachable', 'response shape malformed');

Comment thread src/probes/faucet.mjs
latencyMs,
message: 'success:false but no error/message field',
});
return finalize('degraded', 'error field missing on rejection');
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

When the faucet API indicates a failure (success: false) but omits the error or message field, the response is incomplete and violates the expected contract. This points to a server-side issue that should render the service unreachable, not just degraded.

Suggested change
return finalize('degraded', 'error field missing on rejection');
return finalize('unreachable', 'error field missing on rejection');

vrogojin added a commit that referenced this pull request May 24, 2026
Companion work for the sphere-sdk hermetic e2e stack
(tests/e2e/local-infra/) which boots a local aggregator in
BFT_ENABLED=false standalone mode. Two changes layered on top of #1's
critical-check verdict fix:

1. NETWORKS.local — endpoints for the docker-compose stack:
   - aggregator http://127.0.0.1:3001
   - nostr     ws://127.0.0.1:7777
   - ipfs      http://127.0.0.1:8082
   Fulcrum/Market intentionally non-local (no local counterpart yet).
   Faucet null — the local faucet is DM-driven, not HTTP, so the
   existing faucet probe doesn't apply.

2. aggregator /health body parser accepts both shapes:
   - BFT mode:        {"status":"healthy","database":"ok","aggregators":{...}}
   - Standalone mode: {"status":"ok","role":"standalone","details":{"database":"connected",...}}
   Previously the standalone shape was reported as `degraded` even
   when the aggregator was fully functional (submit_commitment +
   get_inclusion_proof both passing).

Verified against the live local stack:
  unicity-infra-probe --network local --only aggregator
  → HEALTHY (4/4 checks passed), exit 0.

Tests: 30/30 pass. Two new tests cover NETWORKS.local shape +
NETWORKS.local.faucet === null.

Bumps to 0.4.2.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant