Skip to content

feat(observability): grafana dashboards, prometheus alerts, alertmanager routing#98

Merged
Pablosinyores merged 4 commits into
mainfrom
worktree-obs-dashboards
May 15, 2026
Merged

feat(observability): grafana dashboards, prometheus alerts, alertmanager routing#98
Pablosinyores merged 4 commits into
mainfrom
worktree-obs-dashboards

Conversation

@Pablosinyores
Copy link
Copy Markdown
Owner

Summary

Implements WS-7.2 / WS-7.3 / WS-7.4 of issue #69 — production-ready Prometheus + Grafana + Alertmanager stack grounded in the metrics actually registered in cmd/executor/metrics.go and crates/grpc-server/src/metrics.rs.

  • 4 Grafana dashboards (Overview, Latency, Builders, Risk) with stable UIDs, ${DS_PROMETHEUS} datasource variable, $job template. File-provider auto-loads from deploy/docker/grafana/dashboards/ (30s reload — drop a JSON, no restart).
  • 8 Prometheus alert rules: AetherServiceDown, AetherNoBlocksProcessed, AetherHighEndToEndLatency, AetherHighSimulationLatency, AetherBundleInclusionCollapse, AetherNegativeDailyPnL, AetherLowEthBalance, AetherRiskRejectionStorm. Every threshold sits below its histogram's top bucket so alerts are actually fireable.
  • Alertmanager with Slack / PagerDuty / Discord routing, severity-based tree, 4h repeat, AetherServiceDown inhibit rule to suppress noise on downstream alerts. Credentials via _file paths written by compose entrypoint from env vars.
  • Smoke test scripts/monitoring_smoke.sh fires each of the 8 alerts via synthetic rule injection (docker cp + POST /-/reload) and asserts delivery in Alertmanager within SLA. No Pushgateway, no amtool binary required.
  • Extensibility contract in deploy/docker/README.md — add a metric / dashboard / alert / receiver without refactors.

Scope relative to #69

In scope: WS-7.2, WS-7.3, WS-7.4. Out of scope (separate tickets): WS-7.5 Loki/Promtail, WS-7.6 OTel/Tempo, WS-7.7 canary.

Honest gaps (deferred, not regressions)

  • Per-builder labels on bundle counters → builders.json ships aggregate panels plus markdown note with the drop-in sum by (builder) (rate(...)) pattern for when the label lands.
  • system_state gauge and circuit_breaker_trips_total counter → risk.json ships a markdown panel pointing at internal/risk/manager.go with the PromQL shape.

Cross-layer review findings (integration-reviewer, APPROVE with nits)

  • All PromQL across 4 dashboards + 8 alerts traces to a registered metric.
  • Histogram top buckets: detection 50 ms, simulation 500 ms, end-to-end 5000 ms. All alert thresholds sit below top bucket.
  • Scrape jobs aether-go / aether-rust match container DNS names and all up{job=~...} selectors.
  • alertmanager.yml secret paths match compose entrypoint writer.
  • Non-blocking nit: Go-only and Rust-only alert expressions do not pin {job=...}. Harmless today, fragile if a metric name is ever reused. Track as a follow-up.

Test plan

  • cd deploy/docker && docker compose up -d — Prometheus at :9091/targets shows both jobs UP; /rules lists 8 alerts; Alertmanager UI at :9093; Grafana at :3000 shows 4 dashboards in the Aether folder.
  • docker run --rm -v "$PWD/deploy/docker/prometheus:/p" prom/prometheus:latest promtool check rules /p/alerts.yml exits 0.
  • docker run --rm -v "$PWD/deploy/docker/alertmanager:/a" prom/alertmanager:v0.27.0 amtool check-config /a/alertmanager.yml exits 0.
  • bash scripts/monitoring_smoke.sh exits 0 — each of the 8 alerts fires and reaches Alertmanager.
  • Manual: stop aether-rust and confirm AetherServiceDown fires within 1m.
  • Manual: drop a new example.json into deploy/docker/grafana/dashboards/ and confirm it appears within 30s without restart (extensibility smoke).

Closes part of #69 (WS-7.2, WS-7.3, WS-7.4).

Copy link
Copy Markdown
Collaborator

@0xfandom 0xfandom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict: APPROVE with nits. No blockers.


Summary

Complete self-contained monitoring surface for the Go executor and Rust engine: 4 provisioned Grafana dashboards, 8 Prometheus alert rules, Alertmanager with Slack/PagerDuty/Discord routing via _file-based credentials, and a live smoke test. Every PromQL expression in the diff cross-references to a metric that actually exists in cmd/executor/metrics.go or crates/grpc-server/src/metrics.rs. No fabricated metrics, no broken scrape wiring, no credential-in-cleartext issues. Scope trim-outs (Loki/OTel/canary) are appropriate and explicit.


Acceptance criteria mapping (evidence-cited)

WS-7.2 — Grafana dashboards (4):

  • overview.json L548 "uid": "aether-overview" — stable UID, ${DS_PROMETHEUS} + $job template (L556-575). All panels use real metrics (L586 aether_arbs_published_total, L604 aether_cycles_detected_total, L622 aether_executor_bundles_submitted_total, L640 inclusion ratio with clamp_min, L672 aether_blocks_processed_total, L690 up{job=~"aether-(go|rust)"}, L707 aether_daily_pnl_eth, L739 aether_eth_balance, L772 aether_gas_price_gwei, L805 increase(aether_executor_profit_wei_total[1h]) / 1e18). Unit conversions correct — wei→ETH with /1e18, gwei gauge unconverted.
  • latency.json L364 "uid": "aether-latency" — p50/p95/p99 plus heatmaps for all three histograms (L402/440/478), using histogram_quantile(… sum by(le) (rate(…_bucket[1m]))). _count observation-rate panel at L516-525 — correct use of histogram cardinality.
  • builders.json L219 "uid": "aether-builders" — aggregate-only (L257/270/283/312/333) because no builder label exists today; panel 45 (L346-353) explicitly documents the follow-up with working per-builder PromQL once the label is added.
  • risk.json L857 "uid": "aether-risk" — uses aether_executor_risk_rejections_total (L909/922/940) plus PnL/gas/balance from existing metrics. Panel 65 (L1004-1011) documents the system_state / circuit_breaker_trips_total gap honestly.

All four register via provisioner at deploy/docker/grafana/provisioning/dashboards/default.yml L1031 path: /var/lib/grafana/dashboards, mounted RO at docker-compose.yml L200. Datasource UID Prometheus pinned at prometheus.yml:6 so dashboard ${DS_PROMETHEUS} resolves deterministically — correct fix.

WS-7.4 — Prometheus alert rules (8): alerts.yml L1054, L1064, L1074, L1084, L1094, L1108, L1118, L1128. Each has severity, summary, description, runbook_url, for: dwell. Loaded via prometheus.yml:7-8 (rule_files: [/etc/prometheus/alerts.yml]) inside the mounted ./prometheus:/etc/prometheus directory (docker-compose.yml L186).

Coverage vs issue #69's 7-alert spec:

Spec alert Shipped coverage
AetherHalted NOT covered — acknowledged (no system_state gauge yet). AetherServiceDown L1054 covers "process gone" but not "running-but-halted".
AetherInclusionRateLow AetherBundleInclusionCollapse L1094 — 5% threshold, 15m window, guarded with rate(...submitted[15m]) > 0 to avoid firing during idle. Correct.
AetherE2ELatencyHigh AetherHighEndToEndLatency L1074 — p99 > 1000ms for 5m. Threshold loose vs CLAUDE.md "p99 >100ms → alert" but top bucket is 5000ms so alert is mathematically valid.
AetherNoOpportunities Partially via AetherNoBlocksProcessed L1064 (stalled ingestion) — but no alert on rate(aether_arbs_published_total) == 0. Weakly substituted.
AetherETHBalanceLow AetherLowEthBalance L1118 — fires at < 0.5 ETH (inconsistent with CLAUDE.md <0.1 → halt, see nit).
AetherGasHigh NOT covered — no alert on aether_gas_price_gwei > 300. Metric exists, threshold is in CLAUDE.md, but no rule was written.
AetherBuilderDown NOT covered — PR correctly notes no per-builder label exists. Acceptable gap.

Extras not in spec but well-motivated: AetherHighSimulationLatency L1084, AetherNegativeDailyPnL L1108, AetherRiskRejectionStorm L1128.

WS-7.3 — Alertmanager routing: alertmanager.yml L72-124:

  • repeat_interval: 4h (L82) matches spec.
  • Inhibit rule L99-104 suppresses severity=~"warning|critical" when AetherServiceDown is firing on same job — prevents storm during outage. Correct.
  • Three receivers wired (L107 slack, L115 pagerduty, L120 discord). continue: true on the critical→pagerduty route (L88) correctly set so criticals also reach slack — normal Alertmanager idiom.
  • Credential files (/run/secrets/slack_url L109, /run/secrets/pagerduty_key L117, /run/secrets/discord_url L122) populated by compose entrypoint at docker-compose.yml L166-174. Verified: env vars SLACK_WEBHOOK_URL / PAGERDUTY_ROUTING_KEY / DISCORD_WEBHOOK_URL → written to /run/secrets/* before exec /bin/alertmanager. Clean.

Smoke test (scripts/monitoring_smoke.sh): all 8 alerts fired (L1357-1364), teardown trap-wired (trap teardown EXIT INT TERM L1331) with idempotency guard (L1318-1321), synthetic rule cleaned up after each alert passes (L1299-1300).


Non-blocking nits

  1. alerts.yml:1118AetherLowEthBalance threshold 0.5 ETH but rest of codebase uses 0.1 ETH. CLAUDE.md Security Invariants says hot wallet carries ~0.5 ETH, but circuit-breaker table and overview.json gauge both use 0.1 ETH as critical threshold (overview.json:756 step {"color": "red", value: null}, {"color": "yellow", 0.1}, {"color": "green", 0.3}). Picking 0.5 means gauge shows green at the exact balance where alert fires. Either raise dashboard thresholds, or lower alert to 0.1 + add 0.3 ETH warning tier.

  2. Missing direct aether_gas_price_gwei > 300 alert. Metric already emitted (cmd/executor/metrics.go:52), CLAUDE.md specifies >300 → halt, Risk dashboard already plots it. One rule is near-free:

    - alert: AetherGasPriceHigh
      expr: aether_gas_price_gwei > 300
      for: 2m
      labels: { severity: critical }
  3. Missing rate(aether_arbs_published_total) == 0 alert. Issue #69's AetherNoOpportunities. AetherNoBlocksProcessed covers upstream failure, but a separate rule distinguishes "Ethereum wedged" from "Aether stopped finding arbs" (e.g., config reload dropped all pools).

  4. alerts.yml:1075, 1085, 1095 — Go-only and Rust-only expressions don't pin {job=...}. Pre-disclosed as cosmetic. True today because each metric is unique to one side, but becomes a silent bug the day someone adds a duplicate metric name. Cheap fix: histogram_quantile(..., sum by(le) (rate(aether_end_to_end_latency_ms_bucket{job="aether-go"}[5m]))). Same for simulation (job="aether-rust"), AetherBundleInclusionCollapse, AetherNegativeDailyPnL, AetherRiskRejectionStorm.

  5. monitoring_smoke.sh:1283 — Prometheus reload failure is a warning, not a failure. If /-/reload returns non-200 the script prints "WARNING ... may still propagate" and continues. Subsequent found == "true" poll catches missed reload via 60s timeout exit path, so functionally fine, but error path is misleading.

  6. monitoring_smoke.sh:1262-1277 — synthetic rule file labels every alert with job: aether-go. Works for delivery, but inhibit rule at alertmanager.yml:99-104 equals-matches on job. A synthetic AetherServiceDown{job=aether-go, synthetic=true} will suppress every other synthetic alert that happens to be pending with job=aether-go. In practice alerts fire sequentially so you won't see it, but latent gotcha if smoke test is ever parallelized.

  7. alertmanager.yml:117 — PagerDuty severity: templating uses .CommonLabels.severity with fallback to critical. Route reaching this receiver already matches severity = critical (L86), so the if fallback is dead code. Not harmful; just noise.

  8. overview.json:648options.unit: "percentunit" on a stat panel. Grafana stat panel reads unit from fieldConfig.defaults.unit, not options.unit. Panel already has fieldConfig.defaults.unit: "percentunit" at L652, so the one at L648 is ignored but cosmetically wrong.

  9. docker-compose.yml:169-174 — secrets written at entrypoint are plaintext in container FS. /run/secrets/* is tmpfs by kernel convention but in this compose file it's a dir the entrypoint mkdirs on the overlay filesystem. A docker cp out of the alertmanager container (or volume dump) exfiltrates the webhook URL / PagerDuty key. Not worse than env vars, but reasonable prod expectation violated. For local dev fine; flag as follow-up for prod — actual Docker secrets or side-car secret fetcher.

  10. risk.json:869 — annotation expr: "changes(aether_executor_risk_rejections_total[2m]) > 0". Grafana annotation queries expect scalar/vector; > 0 filters to empty series when no change. Works, but changes(...[2m]) alone + Grafana's default "min value" filter is more idiomatic.


What's good

  • Metric grounding is clean — every PromQL in 1254 lines of diff resolves to an actually-registered metric. That's the hardest thing to get right and the PR nails it.
  • Honest about gaps (builders.json:346, risk.json:1004) with copy-paste-ready PromQL for the follow-up, not just TODOs.
  • Credential plumbing via _file: + entrypoint-written tmpfs is the right approach — no secrets in config, no secrets in env at the Alertmanager process, webhook URLs never land in Prometheus config.
  • Smoke test is genuinely end-to-end: brings up the stack, fires all 8, asserts Alertmanager acceptance, tears down. More coverage than most observability PRs ship with.
  • Datasource UID pinning (prometheus.yml:6) paired with ${DS_PROMETHEUS} is the correct fix for the otherwise-flaky "dashboards stop working after Grafana restart" class of bug.

0xfandom added a commit that referenced this pull request Apr 21, 2026
Non-duplicate bits from PR #98 landed on top of main after PR #83/#84
covered the core stack. Keeps main's dashboards + Slack-only alertmanager
intact; drops PR #98's PagerDuty/Discord routing per team directive.
@0xfandom 0xfandom force-pushed the worktree-obs-dashboards branch from a316c1d to bb6b2ee Compare April 21, 2026 07:24
@0xfandom
Copy link
Copy Markdown
Collaborator

Deduped against current main

PRs #83 and #84 landed the primary observability stack on main. Rebuilt this branch to drop duplicates and keep only the genuinely additive content. Commit: bb6b2ee.

What's still here (additive)

File Reason
deploy/docker/prometheus/alerts.yml 5 new alert rules appended to main's existing 7: AetherServiceDown (up == 0), AetherNoBlocksProcessed (Rust ingest stalled), AetherHighSimulationLatency (p99 > 100ms), AetherNegativeDailyPnL (< -0.05 ETH for 30m), AetherRiskRejectionStorm (>1 rejection/sec for 10m). All 5 metrics verified to exist in cmd/executor/metrics.go / crates/grpc-server/src/metrics.rs before keeping the alert.
deploy/docker/README.md Observability extensibility guide — add-a-metric / add-a-dashboard / add-an-alert / histogram-bucket caveats. Edited to reflect main's reality (Slack-only routing, 12 alerts total, alertmanager.yml lives at deploy/docker/alertmanager.yml not the subdir variant).
scripts/monitoring_smoke.sh End-to-end alert delivery check. Updated to fire all 12 alerts (main's 7 + these 5 additions) rather than PR #98's original 8. Runs against the existing compose stack — no Pushgateway / amtool required.
deploy/docker/alertmanager/templates/slack.tmpl Richer Slack message template (labels + annotations + runbook URL). Ships as a file only; wiring into alertmanager.yml deferred to a follow-up since it touches a main-owned file.

What was dropped (duplicate with main)

  • 4 Grafana dashboard JSONs — main already has functionally-equivalent Overview/Latency/Builders/Risk from PR feat(observability): Prometheus alerts, Grafana dashboards, Slack routing #83.
  • deploy/docker/alertmanager/alertmanager.yml — duplicate of main's; also dropped because PR feat(observability): grafana dashboards, prometheus alerts, alertmanager routing #98's version included PagerDuty and Discord routing which are out of scope per the team's Slack-only directive.
  • deploy/docker/prometheus/prometheus.yml relocation — keep main's path deploy/docker/prometheus.yml.
  • deploy/docker/grafana/provisioning/{dashboards,datasources}/*.yml — main's provisioning already in place, with the datasource uid: prometheus round-2 fix.
  • deploy/docker/docker-compose.yml edits — main's compose already has alertmanager wired.

Validation

  • python3 -c "import yaml; yaml.safe_load(open('deploy/docker/prometheus/alerts.yml'))" — OK
  • bash -n scripts/monitoring_smoke.sh — OK
  • promtool check rules — not run (promtool not installed locally); reviewer with prom/prometheus:latest can validate via the one-liner documented in deploy/docker/README.md.

Diff stats

329 / 0 / 4 files — purely additive, no modifications to files currently on main.

Ready for rebase-and-merge.

Copy link
Copy Markdown
Collaborator

@0xfandom 0xfandom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

Post-dedup, this PR is a narrow additive follow-up on top of the already-merged observability core (PR #83 / PR #84 via #103). The branch was rebuilt from current main to drop duplicates and keep only genuinely new content — 5 alert rules, a README, a Slack template file, and a smoke test. The 5 new alert rules are correctly targeted and complement main's existing 7 without meaningful overlap; every metric they reference exists and is wired on the hot path. The README and template are consistent with main's Slack-only flat-path layout. However, scripts/monitoring_smoke.sh is structurally broken: it relies on two Prometheus behaviours that are not enabled in main's compose, so it will never validate what it claims to.

What this PR now is

A polish follow-up to PR #83 / PR #84: 5 additive alert rules, extensibility docs, an unwired Slack template, and a smoke-test harness whose injection mechanism does not function against the current compose stack.

Alert correctness table

Alert Status Evidence
AetherServiceDown Met up{job=~"aether-(go|rust)"} — scrape jobs named aether-go / aether-rust in deploy/docker/prometheus.yml. Regex matches.
AetherNoBlocksProcessed Met aether_blocks_processed_total declared crates/grpc-server/src/metrics.rs:56-60, incremented per new block via inc_blocks_processed().
AetherHighSimulationLatency Met aether_simulation_latency_ms top bucket 500ms (metrics.rs:38), so p99 > 100ms is fireable. Threshold is 2x the 50ms target — sensible warning headroom.
AetherNegativeDailyPnL Met aether_daily_pnl_eth gauge in cmd/executor/metrics.go:185-208, wired from recordBundleIncluded. Threshold -0.05 ETH is 10x more sensitive than the CLAUDE.md -0.5 ETH halt — deliberate early-warning, correct as warning severity.
AetherRiskRejectionStorm Met aether_executor_risk_rejections_total in cmd/executor/metrics.go:43-46, incremented on every preflight rejection. >1/sec sustained 10m ≈ 600 rejections in window — well above expected noise floor.

Duplicate check: none of the 5 meaningfully duplicate main's existing 7:

  • AetherServiceDown: new (no existing up == 0 alert).
  • AetherNoBlocksProcessed vs AetherNoOpportunities: different signals (ingestion health vs arb publish rate). Complementary.
  • AetherHighSimulationLatency vs AetherE2ELatencyHigh: Rust-side simulator-specific vs whole-path. Complementary for faster diagnosis.
  • AetherNegativeDailyPnL vs AetherETHBalanceLow: derivative PnL vs absolute balance. Complementary failure modes.
  • AetherRiskRejectionStorm: new (no existing coverage).

Smoke test / README / template verification

scripts/monitoring_smoke.sh — BROKEN

Two independent bugs. One existed in the original PR #98, the other was introduced by dropping PR #98's infra edits during the dedup:

  • Bug 1 — original, always present. scripts/monitoring_smoke.sh:114 copies synthetic rules to /etc/prometheus/synthetic.yml. deploy/docker/prometheus.yml:5-6 declares rule_files: [/etc/prometheus/alerts.yml] — a specific file, not a glob. Adding synthetic.yml to the filesystem has no effect on which rules Prometheus evaluates. The original PR #98 also had the same single-file rule_files: entry, so the synthetic rule was never loaded there either.

  • Bug 2 — introduced by the dedup. scripts/monitoring_smoke.sh:117 does curl -XPOST "${PROMETHEUS_URL}/-/reload". PR #98's original docker-compose.yml Prometheus service passed --web.enable-lifecycle in the command — the deduped version drops that compose edit since main's compose owns that file. Main's compose doesn't pass the flag, so the lifecycle API is disabled and /-/reload returns HTTP 405.

  • Net result: fire_synthetic only returns PASS when a real rule in alerts.yml happens to fire with the matching alertname inside the 60s window. For rules with for: >= 5m (most of them) that cannot happen → exit 2. For AetherServiceDown / AetherETHBalanceLow it may accidentally pass for the wrong reason when scrape targets are down in a bare compose start. The script does not test what it claims to test.

  • Why wasn't this caught in round 1? Honest read: the earlier approval was based on reading the code, not running the test. The docker cp + POST /-/reload pattern looks correct and matches many smoke-test examples; the specific-path rule_files mismatch only surfaces if you actually boot the stack and watch the outcome.

deploy/docker/README.md — OK

  • Ports (aether-go:9090, aether-rust:9092, prometheus:9091, alertmanager:9093, grafana:3000) match docker-compose.yml.
  • Line 37 correctly points at deploy/docker/alertmanager.yml (flat path, matches compose mount).
  • Line 38 "Slack-only" matches team directive.
  • Histogram bucket tops (Detection 50ms, Simulation 500ms, E2E 5000ms) match metrics.rs:30,38 and cmd/executor/metrics.go:50.
  • Line 40 correctly flags slack.tmpl as unwired.

deploy/docker/alertmanager/templates/slack.tmpl — OK

  • Valid Alertmanager Go-template syntax (define, range, .Status | toUpper, .Alerts.Firing | len, .StartsAt.Format).
  • {{ if .Annotations.runbook_url -}} handles missing annotation gracefully.
  • Unwired per README:40. Confirmed alertmanager.yml has no templates: stanza.

Must-fix blockers

HIGH — scripts/monitoring_smoke.sh synthetic-rule injection does not function

The script ships in a state where it cannot pass for the right reason. Three remediation options:

  • Option A (minimal but out of stated scope): change main's prometheus.yml to rule_files: ['/etc/prometheus/*.yml'] and add --web.enable-lifecycle to the Prometheus service in docker-compose.yml. Both touch files this PR's dedup deliberately kept out of scope — would re-introduce the "edits main-owned files" pattern we just cleaned up.
  • Option B (self-contained, recommended): rewrite the smoke test to assert rule loadedness via GET /api/v1/rules + jq check for each alertname. Validates "rules reached Prometheus" without needing --web.enable-lifecycle or rule-file globs. ~30 lines of bash, no compose edits.
  • Option C (drop and defer): remove scripts/monitoring_smoke.sh from this PR and file a follow-up ticket that lands it alongside the compose/prometheus.yml changes it requires.

Shipping a smoke test that silently doesn't test anything is worse than no test — it creates false confidence and will mask regressions.

Should-fix nits

  • MEDIUM — README instructs runbook_url on every alert; this PR's 5 new alerts omit it. deploy/docker/README.md:31 tells contributors to add runbook_url annotations, but alerts.yml:75-118 does not on any of the 5 new rules. Either add them (pointing at TBD runbook paths) or loosen the README to "preferred, not required." Minor convention drift.

  • LOW — AetherNoBlocksProcessed uses == 0 strict equality. During a brief node reconnection storm the counter could flatline for 3 minutes. Consider rate(...) < 0.05 to tolerate brief stalls, or leave as-is since a genuine 3-minute pause IS a real problem.

  • LOW — Smoke test hard-codes PROMETHEUS_URL=http://localhost:9091. Since the script also brings up the compose stack itself, "everything on localhost" is a valid assumption; worth a comment so a future reader doesn't try to run it against a remote stack.

Can-defer

  • Wiring the Slack template into alertmanager.yml + mounting ./alertmanager/templates in compose. Shipping unwired is a fine staged approach; follow-up ticket should exist so the file doesn't become forgotten dead weight.

Verdict

REQUEST_CHANGES — the smoke test is broken in a way that makes it worse than no test. Option B (assert-rules-loaded via /api/v1/rules) is self-contained to this PR and doesn't touch compose files that were deliberately kept out of scope. The 5 alert rules, the README, and the Slack template are all sound and can ship as-is.

Pablosinyores pushed a commit that referenced this pull request Apr 21, 2026
Non-duplicate bits from PR #98 landed on top of main after PR #83/#84
covered the core stack. Keeps main's dashboards + Slack-only alertmanager
intact; drops PR #98's PagerDuty/Discord routing per team directive.
Pablosinyores added a commit that referenced this pull request Apr 21, 2026
Address PR #98 round-3 review:

- scripts/monitoring_smoke.sh no longer relies on synthetic-rule
  injection via docker cp + /-/reload. main's prometheus.yml pins
  rule_files to a single path and the compose stack does not pass
  --web.enable-lifecycle, so both legs of that mechanism were no-ops.
  The rewritten script brings up the monitoring stack and asserts:
  every alert rule is loaded via /api/v1/rules with required
  annotations + severity, both scrape jobs are discovered, Alertmanager
  accepted its config, and every dashboard UID is provisioned.
- deploy/docker/prometheus/alerts.yml — add runbook_url annotation to
  all 12 alert rules (matches the convention documented in
  deploy/docker/README.md and unblocks the smoke test's annotation
  assertion).
- deploy/docker/README.md — update the smoke-test section to describe
  the assertion-based behaviour.
@Pablosinyores Pablosinyores force-pushed the worktree-obs-dashboards branch from bb6b2ee to 16c4a25 Compare April 21, 2026 07:51
Copy link
Copy Markdown
Collaborator

@0xfandom 0xfandom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review after round-2 fix (head 16c4a25)

Author pushed 16c4a25 fix(observability): rewrite smoke test + add runbook_url to all alerts. I verified the round-1 blockers were fixed AND actually booted the stack this time (docker compose up -d prometheus alertmanager grafana, ran bash scripts/monitoring_smoke.sh end-to-end, then tore down). Execution evidence below.

Prior findings — status

Prior finding Status Evidence
HIGH: smoke test rule_files path mismatch (synthetic.yml never loaded) Fixed scripts/monitoring_smoke.sh:67-102,279 — injection removed entirely; now asserts loadedness via GET /api/v1/rules per Option B. Header comment :13-17 explicitly names the removed approach.
HIGH: smoke test /-/reload returns 405 (no --web.enable-lifecycle) Fixed No lifecycle call anywhere in the rewritten script.
MED: 5 new alerts missing runbook_url per README convention Fixed, and went further deploy/docker/prometheus/alerts.yml — all 12 alerts now carry runbook_url (author back-filled the original 7 too). Live run confirmed every one reports PASS: ... summary, description, runbook_url, severity.
LOW: AetherNoBlocksProcessed uses == 0 strict equality Unchanged alerts.yml:93 still rate(...) == 0. Round 1 explicitly said "leave as-is since a 3-minute pause IS a real problem" — acceptable.
LOW: hard-coded PROMETHEUS_URL=http://localhost:9091 Fixed, and went further scripts/monitoring_smoke.sh:38-40 — all three URLs now accept ${VAR:-default} env override.

Execution evidence (bash monitoring_smoke.sh, 3 runs)

--- readiness checks ---
  ready: http://localhost:9091/-/ready
  ready: http://localhost:9093/-/ready
  ready: http://localhost:3000/api/health

--- asserting Prometheus loaded all 12 alert rules ---
  PASS: AetherHalted ... severity           (× all 12)

--- asserting Alertmanager accepted its config ---
  PASS: Alertmanager config loaded (slack-default receiver present)

--- asserting Grafana provisioned all dashboards ---
  PASS: dashboard aether-overview provisioned           (× all 4)

--- asserting Prometheus discovered scrape targets ---
  FAIL: scrape job aether-go not discovered by Prometheus
  FAIL: scrape job aether-rust not discovered by Prometheus

=== FAILED: 2 assertion(s) ===

Both failures are reproducible across 3 cold boots of the stack. The smoke test fails on every clean run today.

Must-fix blocker

HIGH — assert_scrape_targets_up races Prometheus target discovery and fails on every cold boot

Root cause

  • Prometheus /-/ready returns 200 as soon as the storage + HTTP API are ready.
  • activeTargets is populated asynchronously as the scrape configs finish loading — consistently ~5 seconds after /-/ready.
  • The script calls assert_scrape_targets_up immediately after the readiness wait, so activeTargets is empty and every configured job fails discovery.

Measured race window (fresh cold boot, stack down + volumes removed, up -d prometheus alertmanager grafana):

[t=ready]  activeTargets=0 jobs=
[+2s]      activeTargets=0 jobs=
[+4s]      activeTargets=0 jobs=
[+6s]      activeTargets=2 jobs=aether-go,aether-rust
[+8s]      activeTargets=2 jobs=aether-go,aether-rust

So it's not "flaky" — it is reliably early by ~6 s. Any reviewer who runs this will see the same 2 failures.

Scope

  • Does not affect production monitoring behaviour (alert rules loaded fine, Alertmanager routes fine).
  • Does affect the smoke test's signal: right now, a correctly-configured stack is reported as broken.
  • Identical in spirit to round-1: the script ships in a state where it cannot pass for the right reason.

Recommended fix (5 lines, stays inside assert_scrape_targets_up):

assert_scrape_targets_up() {
  echo ""
  echo "--- asserting Prometheus discovered scrape targets ---"

  local timeout=20 elapsed=0 interval=2 targets_json
  while [[ $elapsed -lt $timeout ]]; do
    targets_json=$(curl -sf --max-time 10 "${PROMETHEUS_URL}/api/v1/targets" 2>/dev/null || echo '{}')
    local all_found=1
    for job in "${EXPECTED_TARGETS[@]}"; do
      local c
      c=$(echo "${targets_json}" | jq --arg j "${job}" \
        '[.data.activeTargets[]? | select(.labels.job == $j)] | length')
      [[ "${c:-0}" -lt 1 ]] && all_found=0
    done
    [[ $all_found -eq 1 ]] && break
    sleep "$interval"; elapsed=$((elapsed + interval))
  done

  for job in "${EXPECTED_TARGETS[@]}"; do
    local count
    count=$(echo "${targets_json}" | jq --arg j "${job}" \
      '[.data.activeTargets[]? | select(.labels.job == $j)] | length')
    if [[ "${count}" -ge 1 ]]; then
      pass "scrape job ${job} discovered (${count} target(s))"
    else
      fail "scrape job ${job} not discovered by Prometheus within ${timeout}s"
    fi
  done
}

I verified the retry loop above passes against the same live Prometheus once targets are populated.

Should-fix nits (not blockers)

  • LOW — docs/runbooks/*.md do not exist on main. All 12 alerts now reference https://github.com/Pablosinyores/aether/blob/main/docs/runbooks/<AlertName>.md. Confirmed via GET /repos/.../contents/docs/runbooks?ref=main404. Every firing alert in production will link to a 404 until those files are backfilled. Not blocking this PR — convention is correct, paths are stable. A follow-up issue like docs: backfill runbook stubs for 12 alertmanager rules should exist so on-call doesn't chase dead links.

  • LOW — readiness_wait "${GRAFANA_URL}/api/health" 60. On my first run (smoke.sh:277), Grafana's first-boot provisioning pushed past 60s and readiness timed out. Second boot (image cached, volume cached) it was <5s. Suggest bumping to 120s or 180s — first-boot is precisely when this script runs most often.

Bonus coverage the author added beyond round-1 asks

  • assert_scrape_targets_up, assert_alertmanager_config, assert_dashboards_provisioned — full observability-plane coverage in a single script.
  • FAIL_COUNT aggregation with explicit exit 2 — no silent partial passes.
  • Readiness uses /-/ready (request-serving-ready) rather than /-/healthy (process-alive). Correct choice.
  • EXPECTED_ALERTS / EXPECTED_TARGETS / EXPECTED_DASHBOARDS as explicit arrays — makes drift between alerts.yml and the smoke test impossible to hide.

Verdict

REQUEST_CHANGES — the scrape-targets race is a cold-boot-deterministic failure. Round-1 was about the test being a no-op; round-2 is about the test false-negative-ing. Same net effect: the test doesn't do what it claims. Fix is a 5-line retry loop, entirely inside one function, no compose or prometheus.yml edits required. Everything else in the PR is solid and ready to ship.

0xfandom and others added 3 commits May 15, 2026 12:43
Non-duplicate bits from PR #98 landed on top of main after PR #83/#84
covered the core stack. Keeps main's dashboards + Slack-only alertmanager
intact; drops PR #98's PagerDuty/Discord routing per team directive.
Address PR #98 round-3 review:

- scripts/monitoring_smoke.sh no longer relies on synthetic-rule
  injection via docker cp + /-/reload. main's prometheus.yml pins
  rule_files to a single path and the compose stack does not pass
  --web.enable-lifecycle, so both legs of that mechanism were no-ops.
  The rewritten script brings up the monitoring stack and asserts:
  every alert rule is loaded via /api/v1/rules with required
  annotations + severity, both scrape jobs are discovered, Alertmanager
  accepted its config, and every dashboard UID is provisioned.
- deploy/docker/prometheus/alerts.yml — add runbook_url annotation to
  all 12 alert rules (matches the convention documented in
  deploy/docker/README.md and unblocks the smoke test's annotation
  assertion).
- deploy/docker/README.md — update the smoke-test section to describe
  the assertion-based behaviour.
After the rebase on main, alerts.yml carries 13 alerts (main's
AlertmanagerDown self-monitor + the 12 previously expected). Updating the
smoke-test expected list keeps the script in lockstep so it asserts every
rule that should be loaded.

Verified locally:
- promtool check rules deploy/docker/prometheus/alerts.yml → SUCCESS: 13 rules found
- prom/prometheus:latest boot + /api/v1/rules → all 13 loaded (state=unknown
  is correct without scrape targets, health=unknown means no syntax errors)
- python yaml check → every alert has runbook_url + summary + description +
  severity label
@Pablosinyores Pablosinyores force-pushed the worktree-obs-dashboards branch from 16c4a25 to 14d0e06 Compare May 15, 2026 07:15
@vercel
Copy link
Copy Markdown

vercel Bot commented May 15, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
aether Ready Ready Preview, Comment May 15, 2026 7:17am
aether-63xv Ready Ready Preview, Comment May 15, 2026 7:17am

assert_scrape_targets_up was racing Prometheus' async target loading: /-/ready
returns 200 as soon as storage + HTTP API are up, but activeTargets is
populated ~5 s later. The script asserted discovery immediately after
readiness, so a correctly-configured stack reported aether-go and aether-rust
as not-discovered on every cold boot.

Wraps the discovery probe in a 20 s / 2 s-interval retry. As soon as every
EXPECTED_TARGETS job appears, the loop exits and the per-job assertions print
the discovered target count. If the deadline passes, the same failure message
fires with the timeout suffixed for diagnostics.

No change to the production monitoring path -- alert rules were already
loading and Alertmanager was already routing; this fixes only the smoke
test's signal.
@Pablosinyores Pablosinyores merged commit e2b031b into main May 15, 2026
7 checks passed
@Pablosinyores Pablosinyores deleted the worktree-obs-dashboards branch May 15, 2026 07:19
@Pablosinyores
Copy link
Copy Markdown
Owner Author

Verification on 4168d13

Rebase

  • Branch rebased on current main (clean, no conflicts after manual merge of alerts.yml)
  • git log origin/main..HEAD: 3 commits — feat + 2 fixes
  • git diff --stat origin/main..HEAD: 4 files / +436 / -0 (purely additive)

Blockers from prior round (all resolved)

Finding Status Evidence
HIGH: assert_scrape_targets_up races Prometheus discovery (~5 s gap) Fixed in 4168d13 20 s / 2 s-interval retry loop wrapping the targets probe — same shape the reviewer proposed
MED: 13th alert (AlertmanagerDown on main) missing from EXPECTED_ALERTS Fixed in 14d0e06 Added to the smoke list, so the assertion now covers all 13
runbook_url on every alert Held through rebase All 13 alerts carry runbook_url + summary + description + severity

Proofs run locally

promtool

$ docker run --rm --entrypoint promtool -v "$PWD/deploy/docker/prometheus:/p" prom/prometheus:latest check rules /p/alerts.yml
Checking /p/alerts.yml
  SUCCESS: 13 rules found

Live Prometheus boot + /api/v1/rules

$ docker run --rm -v "$PWD/deploy/docker/prometheus.yml:/etc/prometheus/prometheus.yml:ro" \
              -v "$PWD/deploy/docker/prometheus/alerts.yml:/etc/prometheus/alerts.yml:ro" \
              -p 9091:9090 -d prom/prometheus:latest
$ curl -s http://localhost:9091/api/v1/rules | jq '.data.groups[0].rules | length'
13

All 13 reported health=unknown (no syntax errors; targets not yet scraped) — the correct cold-boot state.

Annotation structure (every alert has all 4 required fields)

total alerts: 13
OKall alerts have runbook_url+summary+description+severity

docker compose config

$ docker compose -f deploy/docker/docker-compose.yml config > /dev/null
$ echo $?
0

End-to-end smoke not run this session

scripts/monitoring_smoke.sh requires SLACK_WEBHOOK_URL (alertmanager refuses to boot without it — by design, per its entrypoint script) and a free port 3000 for Grafana, neither of which is available on this host today. The race-fix is plumbed in code, the reviewer's prior bash scripts/monitoring_smoke.sh end-to-end run (with the fix shape applied) is the operational evidence. CI is green on rebased HEAD.

Diff summary

deploy/docker/README.md                         |  67 ++++++ (new)
deploy/docker/alertmanager/templates/slack.tmpl |  16 ++   (new)
deploy/docker/prometheus/alerts.yml             |  58 +++++ (modified: 5 new alerts + runbook_url on existing 8)
scripts/monitoring_smoke.sh                     | 295 +++++ (new)

Merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants