
feat(monitoring): split alertmanager routes by severity + raise latency thresholds (ECHO-813) #23

Draft

spashii wants to merge 1 commit into main from sam/echo-813-alertmanager-routing

Conversation


@spashii spashii commented May 13, 2026

Closes ECHO-813 by implementing the noise vs. dropped-alerts split described there.

Three changes

1. Alertmanager route tree

  • Default receiver is now blackhole (no notification). An unlabelled alert now silently drops instead of paging.
  • severity=critical → slack-prod (#prod-alerts); behaviour unchanged for critical alerts.
  • severity=warning → blackhole. Visible in the alertmanager UI, no Slack. Discoverability via the weekly digest from ECHO-817.
  • alertname=Watchdog → blackhole. Always-firing canary, never pages.
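
A minimal sketch of the resulting route tree (receiver names as above; the slack-prod block is illustrative and assumes a global slack_api_url):

```yaml
route:
  receiver: blackhole            # default: unlabelled alerts drop silently
  routes:
    - matchers:
        - alertname = "Watchdog"
      receiver: blackhole        # always-firing canary, never notifies
    - matchers:
        - severity = "critical"
      receiver: slack-prod       # unchanged: pages #prod-alerts
    - matchers:
        - severity = "warning"
      receiver: blackhole        # visible in the UI only; weekly digest covers discoverability

receivers:
  - name: blackhole              # no notifier config = dropped on purpose
  - name: slack-prod
    slack_configs:
      - channel: '#prod-alerts'  # assumes slack_api_url set in global
```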

2. Latency thresholds

| Alert | Old | New |
| --- | --- | --- |
| IngressHighLatencyP95 (warning) | 0.5s for 10m | 1s for 15m |
| IngressCriticalLatencyP95 (critical) | 1.5s for 5m | 3s for 10m |
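
For concreteness, a sketch of what the adjusted rules could look like. The histogram metric name is an assumption (typical nginx-ingress exporter naming), not pulled from the repo:

```yaml
groups:
  - name: ingress-latency
    rules:
      - alert: IngressHighLatencyP95
        expr: |
          histogram_quantile(0.95, sum by (le) (
            rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])
          )) > 1
        for: 15m
        labels:
          severity: warning
      - alert: IngressCriticalLatencyP95
        expr: |
          histogram_quantile(0.95, sum by (le) (
            rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])
          )) > 3
        for: 10m
        labels:
          severity: critical
# Re-tuning query (run ad hoc against Prometheus): the same
# histogram_quantile over a [7d] range window gives the real p95
# distribution the ticket asked for.
```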

These are starting values. The ticket flagged that the real numbers should come from the actual p95 distribution over 7d. I don't have prometheus access from where I run, so I couldn't pull that; happy to re-tune from a query you run.

Path-level exclusion (transcription, LLM streaming) needs a route label that isn't currently extracted from the ingress metric. Filed as a follow-up note in the rule comment rather than added here.

3. Watchdog alert

New pipeline rule group with a vector(1) always-firing canary, severity: none. Routed to blackhole so it never notifies; a future dead-man's-switch consumer can poll the alertmanager API for its presence.
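
This is the canonical Prometheus watchdog pattern; file placement follows whatever rule layout the repo already uses:

```yaml
groups:
  - name: pipeline
    rules:
      - alert: Watchdog
        expr: vector(1)          # fires unconditionally by construction
        labels:
          severity: none         # routed by alertname, not severity
        annotations:
          summary: Always-firing canary; its absence means the alerting pipeline is broken.
```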

Severity audit

Every existing alert has an explicit severity: warning or severity: critical. No mislabels. The audit was:

| Alert | Severity |
| --- | --- |
| HighNodeCPU, HighNodeMemory, NodeDiskFillingSoon | warning |
| IngressHighErrorRate, IngressHighLatencyP95 | warning |
| IngressCriticalErrorRate, IngressCriticalLatencyP95 | critical |
| CrashLoopBackOff, OOMKilled, DeploymentAvailabilityShortfall | critical |
| PodRestartSpike, HighContainerCPUUtilization, HighContainerMemoryUtilization | warning |

Verification

  • alertmanager.yml parses with python-yaml; route tree resolves all referenced receivers.
  • Latency-rule YAML structure parses cleanly.
  • amtool check-config / promtool check rules aren't available locally, so they didn't run here. CI should run them; if it doesn't yet, we should add that (sketch below).
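
On that last point, a minimal GitHub Actions job could look like this. The workflow name, trigger paths, and file locations are assumptions about the repo layout:

```yaml
# .github/workflows/monitoring-lint.yml (hypothetical path)
name: monitoring-lint
on:
  pull_request:
    paths: ['monitoring/**']
jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: amtool check-config
        run: |
          docker run --rm -v "$PWD/monitoring:/m" \
            --entrypoint amtool quay.io/prometheus/alertmanager:latest \
            check-config /m/alertmanager.yml
      - name: promtool check rules
        run: |
          docker run --rm -v "$PWD/monitoring:/m" \
            --entrypoint promtool quay.io/prometheus/prometheus:latest \
            check rules /m/rules/ingress-latency.yml /m/rules/pipeline.yml
```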

Confidence

Medium. Routing logic is straightforward and matches alertmanager's documented matcher syntax. Watchdog is the canonical Prometheus example. The thresholds are educated guesses — if you have the 7d p95 numbers, I'd rather use those.

Refs: ECHO-813

feat(monitoring): split alertmanager routes by severity + raise latency thresholds

Three changes per ECHO-813:

1. Alertmanager route tree
   - Default receiver is now 'blackhole' (no notification).
   - severity=critical -> #prod-alerts (current behaviour preserved).
   - severity=warning -> 'blackhole' (visible in alertmanager UI, no
     slack). Discoverability via weekly digest (ECHO-817).
   - alertname=Watchdog -> 'blackhole' (always-firing canary, never
     pages).

   Side-effect: an unlabelled alert now silently drops instead of
   paging. Intentional — makes a missing label a dev-time bug, not a
   3am alert.

2. Latency thresholds
   - IngressHighLatencyP95: 0.5s/10m -> 1s/15m (severity=warning)
   - IngressCriticalLatencyP95: 1.5s/5m -> 3s/10m (severity=critical)

   Starting values; should be re-tuned once we can pull the actual p95
   distribution from prometheus. Path-level exclusion (transcription,
   LLM streaming) needs a route label that isn't currently extracted —
   separate follow-up.

3. Watchdog alert
   New 'pipeline' rule group with a vector(1) always-firing canary.
   Routed to 'blackhole' so it never notifies; a future dead-man's
   switch can poll the alertmanager API for its presence.

Severity audit: every existing alert has an explicit severity
(warning|critical). No mislabels found.

Refs: ECHO-813

Co-authored-by: Sameer <sameer@dembrane.com>

linear Bot commented May 13, 2026

ECHO-813
