feat(monitoring): split alertmanager routes by severity + raise latency thresholds (ECHO-813)#23
Draft
spashii wants to merge 1 commit into
Draft
feat(monitoring): split alertmanager routes by severity + raise latency thresholds (ECHO-813)#23spashii wants to merge 1 commit into
spashii wants to merge 1 commit into
Conversation
…cy thresholds
Three changes per ECHO-813:
1. Alertmanager route tree
- Default receiver is now 'blackhole' (no notification).
- severity=critical -> #prod-alerts (current behaviour preserved).
- severity=warning -> 'blackhole' (visible in alertmanager UI, no
slack). Discoverability via weekly digest (ECHO-817).
- alertname=Watchdog -> 'blackhole' (always-firing canary, never
pages).
Side-effect: an unlabelled alert now silently drops instead of
paging. Intentional — makes a missing label a dev-time bug, not a
3am alert.
2. Latency thresholds
- IngressHighLatencyP95: 0.5s/10m -> 1s/15m (severity=warning)
- IngressCriticalLatencyP95: 1.5s/5m -> 3s/10m (severity=critical)
Starting values; should be re-tuned once we can pull the actual p95
distribution from prometheus. Path-level exclusion (transcription,
LLM streaming) needs a route label that isn't currently extracted —
separate follow-up.
3. Watchdog alert
New 'pipeline' rule group with a vector(1) always-firing canary.
Routed to 'blackhole' so it never notifies; a future dead-man's
switch can poll the alertmanager API for its presence.
Severity audit: every existing alert has an explicit severity
(warning|critical). No mislabels found.
Refs: ECHO-813
Co-authored-by: Sameer <sameer@dembrane.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes the noise/dropped-alerts split described in ECHO-813.
Three changes
1. Alertmanager route tree
blackhole(no notification). An unlabelled alert silently drops instead of paging.severity=critical→slack-prod→#prod-alerts(unchanged behaviour for crit).severity=warning→blackhole. Visible in alertmanager UI, no Slack. Discoverability via the weekly digest from #ECHO-817.alertname=Watchdog→blackhole. Always-firing canary, never pages.2. Latency thresholds
These are starting values. The ticket flagged that real numbers should come from the actual p95 distribution over 7d. I don't have prometheus access from where I run, so couldn't pull that — happy to re-tune from a query you run.
Path-level exclusion (transcription, LLM streaming) needs a route label that isn't currently extracted from the ingress metric. Filed as a follow-up note in the rule comment rather than added here.
3. Watchdog alert
New
pipelinerule group withvector(1)always-firing canary,severity: none. Routed toblackholeso it never notifies; a future dead-man's-switch consumer can poll the alertmanager API for its presence.Severity audit
Every existing alert has an explicit
severity: warningorseverity: critical. No mislabels. The audit was:Verification
amtool check-config/promtool check rulesnot available locally. CI should run them; if it doesn't, flagging that we should add it.Confidence
Medium. Routing logic is straightforward and matches alertmanager's documented matcher syntax. Watchdog is the canonical Prometheus example. The thresholds are educated guesses — if you have the 7d p95 numbers, I'd rather use those.
Refs: ECHO-813