Merged
67 changes: 67 additions & 0 deletions deploy/docker/README.md
@@ -0,0 +1,67 @@
# Aether Docker Monitoring Stack

## Overview

| Service | Port | Purpose |
|---------|------|---------|
| aether-go | 9090 | Go executor metrics (Prometheus) |
| aether-rust | 9092 | Rust engine metrics (Prometheus) |
| prometheus | 9091 | Metrics store, alert evaluation |
| alertmanager | 9093 | Alert routing (Slack) |
| grafana | 3000 | Dashboards (default login: admin/admin) |

Start the monitoring-only stack: `docker compose up -d prometheus alertmanager grafana`

## Adding a Metric

1. Emit the metric in `cmd/executor/metrics.go` (Go) or `crates/grpc-server/src/metrics.rs` (Rust).
2. Restart the relevant service. Prometheus auto-scrapes on the 15s interval.
3. Verify it appears at `http://localhost:9091/graph`.
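
Before checking Prometheus, it can help to confirm that the service itself exposes the new metric. The helper below is a minimal sketch: `has_metric` is a hypothetical name, and the `/metrics` path in the usage comment is an assumption (it is the Prometheus default, but this README only documents the ports).

```bash
# has_metric NAME: reads Prometheus text exposition format on stdin and
# succeeds if at least one sample line starts with NAME.
has_metric() {
  local name="$1"
  # Sample lines look like `name{labels} value` or `name value`.
  # `# HELP` and `# TYPE` lines never match because they start with '#'.
  grep -Eq "^${name}(\{| )"
}

# Usage against a running Go executor (the /metrics path is an assumption):
#   curl -fsS http://localhost:9090/metrics | has_metric aether_detection_latency_ms_bucket
```

The function exits 0 when the metric is present and non-zero otherwise, so it can gate a script step directly.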

## Adding a Dashboard

1. Create a JSON file in `deploy/docker/grafana/dashboards/`. Assign a stable `"uid"` string.
2. All panels must reference `${DS_PROMETHEUS}` as the datasource uid.
3. The provisioner reloads dashboards every 30s — no Grafana restart needed.
4. Use `jq empty <file>.json` to validate JSON before committing.
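
Steps 2 and 4 can be combined into one pre-commit check. This is a sketch only: `check_dashboard` is a hypothetical helper, and it assumes each panel stores its datasource as an object with a `uid` field and that row panels (`"type": "row"`) carry no datasource.

```bash
# check_dashboard FILE: validates the JSON, then prints the title of every
# non-row panel whose datasource uid is not the literal ${DS_PROMETHEUS}.
check_dashboard() {
  local file="$1" bad
  # `jq empty` exits non-zero on a JSON syntax error.
  jq empty "$file" || return 1
  bad="$(jq -r '
    .panels[]?
    | select(.type != "row")
    | select((.datasource | if type == "object" then .uid else null end)
             != "${DS_PROMETHEUS}")
    | .title // "(untitled)"
  ' "$file")"
  if [ -n "$bad" ]; then
    printf 'panels with wrong datasource in %s:\n%s\n' "$file" "$bad" >&2
    return 1
  fi
}

# Usage:
#   for f in deploy/docker/grafana/dashboards/*.json; do check_dashboard "$f"; done
```

The jq filter is single-quoted, so the shell does not expand `${DS_PROMETHEUS}` and jq compares against the literal placeholder.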

## Adding an Alert

1. Add a rule block to `deploy/docker/prometheus/alerts.yml` under group `aether.rules`.
2. Include `summary`, `description`, and `runbook_url` annotations. Use `{{ $labels.job }}` and `{{ $value }}` for context.
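3. Validate: `docker run --rm --entrypoint promtool -v "$PWD/deploy/docker/prometheus:/p" prom/prometheus:latest check rules /p/alerts.yml` (the image's default entrypoint is `prometheus`, so `promtool` has to be selected explicitly).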
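4. Reload Prometheus: `curl -XPOST http://localhost:9091/-/reload`. This endpoint only works when Prometheus runs with `--web.enable-lifecycle`; if the compose stack does not pass that flag, restart the container instead: `docker compose restart prometheus`.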
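
The annotation requirement in step 2 can be spot-checked before running `promtool`. The function below is a rough heuristic, not a YAML parser: `lint_rule_annotations` is a hypothetical name, and it simply assumes each annotation key appears on its own line exactly once per `- alert:` entry.

```bash
# lint_rule_annotations FILE: fails unless summary, description, and
# runbook_url each occur as many times as there are `- alert:` entries.
lint_rule_annotations() {
  local file="$1" alerts key count status=0
  # grep -c prints 0 and exits 1 when nothing matches; `|| true` keeps the count.
  alerts="$(grep -Ec '^[[:space:]]*-[[:space:]]+alert:' "$file" || true)"
  for key in summary description runbook_url; do
    count="$(grep -Ec "^[[:space:]]*${key}:" "$file" || true)"
    if [ "$count" -ne "$alerts" ]; then
      printf '%s: %s alerts but %s %s annotations\n' \
        "$file" "$alerts" "$count" "$key" >&2
      status=1
    fi
  done
  return "$status"
}

# Usage:
#   lint_rule_annotations deploy/docker/prometheus/alerts.yml
```

A count mismatch only says that some rule is missing a key, not which one, so `promtool check rules` remains the authoritative validation.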

## Adding a Receiver

1. Edit `deploy/docker/alertmanager.yml`.
2. Alerting is Slack-only in production. PagerDuty/Discord/Telegram receivers are intentionally out of scope — propose via a separate design ticket if the team decides to broaden channels.
3. The Slack webhook is injected at container start via sed substitution of `__SLACK_WEBHOOK_URL__` from `$SLACK_WEBHOOK_URL` env.
4. An optional richer Slack message template lives at `deploy/docker/alertmanager/templates/slack.tmpl` — wiring it requires adding a `templates:` stanza to `alertmanager.yml` and mounting the directory in docker-compose. Deferred as a follow-up.
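
The placeholder substitution in step 3 can be pictured roughly as follows. This is an illustrative sketch, not the container's actual entrypoint: `render_alertmanager_config` is a hypothetical name, and it assumes the webhook URL contains no `|` or `&` characters, which `sed` would treat specially.

```bash
# render_alertmanager_config: reads a config on stdin and writes it to stdout
# with __SLACK_WEBHOOK_URL__ replaced by the value of $SLACK_WEBHOOK_URL.
render_alertmanager_config() {
  # Fail early rather than start Alertmanager with an unresolved placeholder.
  : "${SLACK_WEBHOOK_URL:?SLACK_WEBHOOK_URL must be set}"
  # '|' is the sed delimiter because webhook URLs contain '/'.
  sed "s|__SLACK_WEBHOOK_URL__|${SLACK_WEBHOOK_URL}|g"
}

# Usage (the output path is hypothetical):
#   render_alertmanager_config < deploy/docker/alertmanager.yml > /tmp/alertmanager.yml
```

Failing on an unset variable makes a missing webhook show up at container start instead of as silently dropped notifications.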

## Histogram Bucket Caveats

Quantile estimates are bounded by the top finite histogram bucket. If p99 reads exactly as the top-bucket value, the quantile fell into the `+Inf` bucket: more than 1% of observations exceeded that boundary and the true p99 is higher, not equal to the reported value. Configured bucket tops:

- Detection latency: 50ms (`aether_detection_latency_ms`)
- Simulation latency: 500ms (`aether_simulation_latency_ms`)
- End-to-end latency: 5000ms (`aether_end_to_end_latency_ms`)

Add finer or higher buckets in the metric definition to get better resolution.
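
The clamping behaviour can be reproduced offline. The sketch below mimics the linear interpolation that `histogram_quantile` performs over cumulative bucket counts; `estimate_quantile` is a hypothetical helper, and the bucket boundaries and counts in the usage comment are made-up sample data (only the 5000ms top matches the list above).

```bash
# estimate_quantile Q: reads "le cumulative_count" pairs on stdin, sorted by
# le with "+Inf" last, and prints the interpolated quantile estimate.
estimate_quantile() {
  awk -v q="$1" '
    { le[NR] = $1; cnt[NR] = $2; inf[NR] = ($1 ~ /Inf/) }
    END {
      if (NR == 0 || cnt[NR] == 0) { print "NaN"; exit }
      rank = q * cnt[NR]               # target rank among all observations
      for (i = 1; i <= NR; i++) if (cnt[i] >= rank) break
      if (inf[i]) {                    # beyond the top finite bucket:
        print le[NR - 1]; exit         # clamp to its upper bound
      }
      lo   = (i == 1) ? 0 : le[i - 1]  # lower bound of the selected bucket
      prev = (i == 1) ? 0 : cnt[i - 1] # observations below that bound
      print lo + (le[i] - lo) * (rank - prev) / (cnt[i] - prev)
    }'
}

# Usage with made-up counts:
#   printf '100 900\n1000 960\n5000 980\n+Inf 1000\n' | estimate_quantile 0.99
```

With that sample data, 20 of 1000 observations exceed 5000ms, so the 0.99 quantile lands in the `+Inf` bucket and the function prints `5000`, while `estimate_quantile 0.95` interpolates inside the 100 to 1000 bucket and prints `850`.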

## Running the Smoke Test

```bash
bash scripts/monitoring_smoke.sh
```

Brings up the monitoring stack (prometheus, alertmanager, grafana), waits for readiness, and asserts:

- every expected alert rule was loaded by Prometheus, with `summary`, `description`, `runbook_url`, and `severity` populated;
- both `aether-go` and `aether-rust` scrape jobs are discovered;
- Alertmanager accepted its config (slack-default receiver resolved);
- every expected Grafana dashboard UID is provisioned.

The script tears the stack down on exit. Requires `docker`, `curl`, `jq`.

Synthetic rule injection (`docker cp` + `/-/reload`) is deliberately avoided: main's `prometheus.yml` pins `rule_files` to an explicit path and the compose stack does not pass `--web.enable-lifecycle`, so that approach cannot work without modifying files owned by other workstreams. Asserting via `/api/v1/rules` that every rule is loaded gives deterministic coverage within this PR's scope.
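
As an illustration of the `/api/v1/rules` assertion, the filter below lists alerting rules that lack any of the required fields. It is a sketch and not an excerpt from `scripts/monitoring_smoke.sh`; `rules_missing_metadata` is a hypothetical name.

```bash
# rules_missing_metadata: reads a Prometheus /api/v1/rules response on stdin
# and prints the name of each alerting rule with an empty or missing
# summary, description, runbook_url, or severity.
rules_missing_metadata() {
  jq -r '
    .data.groups[].rules[]
    | select(.type == "alerting")
    | select(
        (.annotations.summary     // "") == "" or
        (.annotations.description // "") == "" or
        (.annotations.runbook_url // "") == "" or
        (.labels.severity         // "") == ""
      )
    | .name
  '
}

# Usage:
#   curl -fsS http://localhost:9091/api/v1/rules | rules_missing_metadata
```

Empty output means every loaded alerting rule carries all four fields; any printed name identifies a rule to fix.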
16 changes: 16 additions & 0 deletions deploy/docker/alertmanager/templates/slack.tmpl
@@ -0,0 +1,16 @@
{{ define "aether.slack.title" -}}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }} ({{ .CommonLabels.severity }})
{{- end }}

{{ define "aether.slack.body" -}}
{{ range .Alerts -}}
*Job:* {{ .Labels.job }}
*Summary:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Starts At:* {{ .StartsAt.UTC.Format "2006-01-02 15:04:05 UTC" }}
{{ if .Annotations.runbook_url -}}
*Runbook:* {{ .Annotations.runbook_url }}
{{ end }}
---
{{ end -}}
{{- end }}
58 changes: 58 additions & 0 deletions deploy/docker/prometheus/alerts.yml
@@ -15,6 +15,7 @@ groups:
annotations:
summary: "Aether system is Halted"
description: "System state gauge reports Halted (3) for >1m. Halted requires manual reset. Check executor logs for the breaker reason."
runbook_url: "https://github.com/Pablosinyores/aether/blob/main/docs/runbooks/AetherHalted.md"

- alert: AetherInclusionRateLow
expr: |
@@ -27,6 +28,7 @@ groups:
annotations:
summary: "Bundle inclusion rate below 20% over the last hour"
description: "Over the last 1h, fewer than 20% of submitted bundles were included. Current ratio: {{ printf \"%.2f\" $value }}."
runbook_url: "https://github.com/Pablosinyores/aether/blob/main/docs/runbooks/AetherInclusionRateLow.md"

- alert: AetherE2ELatencyHigh
expr: histogram_quantile(0.99, sum by (le) (rate(aether_end_to_end_latency_ms_bucket[5m]))) > 100
@@ -36,6 +38,7 @@ groups:
annotations:
summary: "End-to-end p99 latency above 100ms"
description: "p99 end-to-end latency = {{ printf \"%.1f\" $value }}ms over last 5m (target <100ms)."
runbook_url: "https://github.com/Pablosinyores/aether/blob/main/docs/runbooks/AetherE2ELatencyHigh.md"

- alert: AetherNoOpportunities
# Suppress during the first 30m after process start so a fresh boot or
@@ -52,6 +55,7 @@ groups:
annotations:
summary: "Fewer than 5 opportunities per minute"
description: "Arb publish rate = {{ printf \"%.1f\" $value }}/min over last 10m. Detection pipeline may be stalled."
runbook_url: "https://github.com/Pablosinyores/aether/blob/main/docs/runbooks/AetherNoOpportunities.md"

- alert: AetherETHBalanceLow
expr: aether_eth_balance < 0.15
@@ -61,6 +65,7 @@ groups:
annotations:
summary: "Searcher ETH balance below 0.15"
description: "Hot wallet ETH balance = {{ printf \"%.4f\" $value }}. Top up before bundles start reverting on gas."
runbook_url: "https://github.com/Pablosinyores/aether/blob/main/docs/runbooks/AetherETHBalanceLow.md"

- alert: AetherGasHigh
expr: aether_gas_price_gwei > 300
@@ -70,6 +75,7 @@ groups:
annotations:
summary: "Gas price above 300 gwei"
description: "Base fee = {{ printf \"%.1f\" $value }} gwei. Executor preflight will reject arbs until this drops."
runbook_url: "https://github.com/Pablosinyores/aether/blob/main/docs/runbooks/AetherGasHigh.md"

- alert: AetherBuilderDown
# Disabled builders register zero-total on both {success} and {failure}
@@ -86,6 +92,7 @@ groups:
annotations:
summary: "Builder {{ $labels.builder }} has no successful submissions"
description: "Builder {{ $labels.builder }} received submissions but zero succeeded over the last 2m. Check builder endpoint health and auth. Note: builders configured with Enabled=false are intentionally silent here."
runbook_url: "https://github.com/Pablosinyores/aether/blob/main/docs/runbooks/AetherBuilderDown.md"

# Self-monitor of the alerting path. If Alertmanager crashloops (bad
# config, SLACK_WEBHOOK_URL missing, etc.) the rest of the alerts above
@@ -99,3 +106,54 @@ groups:
annotations:
summary: "Alertmanager scrape target is down"
description: "Prometheus has been unable to scrape alertmanager:9093 for 2m. Slack delivery is offline. Check the alertmanager container logs and SLACK_WEBHOOK_URL config."
runbook_url: "https://github.com/Pablosinyores/aether/blob/main/docs/runbooks/AlertmanagerDown.md"

- alert: AetherServiceDown
expr: up{job=~"aether-(go|rust)"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Aether service {{ $labels.job }} is down"
description: "Job {{ $labels.job }} has been unreachable for >1m. Prometheus scrape target reports up=0."
runbook_url: "https://github.com/Pablosinyores/aether/blob/main/docs/runbooks/AetherServiceDown.md"

- alert: AetherNoBlocksProcessed
expr: rate(aether_blocks_processed_total[3m]) == 0
for: 3m
labels:
severity: critical
annotations:
summary: "Aether engine {{ $labels.job }} stopped processing blocks"
description: "Job {{ $labels.job }} has processed 0 blocks over the last 3m. The Rust ingestion pipeline may be stalled or disconnected from the Ethereum node."
runbook_url: "https://github.com/Pablosinyores/aether/blob/main/docs/runbooks/AetherNoBlocksProcessed.md"

- alert: AetherHighSimulationLatency
expr: histogram_quantile(0.99, sum by(le) (rate(aether_simulation_latency_ms_bucket[5m]))) > 100
for: 10m
labels:
severity: warning
annotations:
summary: "Aether simulation latency p99 exceeds 100ms"
description: "EVM simulation p99 latency is {{ printf \"%.1f\" $value }}ms over the last 5m (alert threshold 100ms, target <50ms). revm fork state may be stale or the RPC provider is slow."
runbook_url: "https://github.com/Pablosinyores/aether/blob/main/docs/runbooks/AetherHighSimulationLatency.md"

- alert: AetherNegativeDailyPnL
expr: aether_daily_pnl_eth < -0.05
for: 30m
labels:
severity: warning
annotations:
summary: "Aether daily PnL is negative ({{ printf \"%.4f\" $value }} ETH)"
description: "Daily PnL has been below -0.05 ETH for 30m. Review gas costs, tip share, and revert rate."
runbook_url: "https://github.com/Pablosinyores/aether/blob/main/docs/runbooks/AetherNegativeDailyPnL.md"

- alert: AetherRiskRejectionStorm
expr: rate(aether_executor_risk_rejections_total[5m]) > 1
for: 10m
labels:
severity: warning
annotations:
summary: "Aether risk manager rejecting >1 arb/sec"
description: "Risk rejection rate is {{ printf \"%.2f\" $value }} rejections/sec over a 5m window, sustained for 10m. Investigate circuit breaker state, gas, and position limits."
runbook_url: "https://github.com/Pablosinyores/aether/blob/main/docs/runbooks/AetherRiskRejectionStorm.md"