feat(monitoring): weekly warning digest cronjob (ECHO-817) #24
Draft
spashii wants to merge 1 commit into
Adds a CronJob in the monitoring namespace that runs every Monday at
09:00 Europe/Amsterdam, queries prometheus for the ALERTS metric over
the past 7d, groups by alertname, and posts a scannable summary to
#prod-alerts via the existing webhook in monitoring-secrets.
Replaces the discoverability that ECHO-813's null-receiver routing
removed. Goal: "is there signal in the noise?" — not paging.
Implementation:
- ConfigMap with a stdlib-only python script (~120 LOC including
comments). No third-party deps. Uses python:3.11-slim.
- CronJob mounts the script read-only and the monitoring-secrets
secret for the slack webhook URL.
- Uses `count_over_time(ALERTS{alertstate="firing",
severity="warning"}[7d])` for total firing duration and
`changes(ALERTS_FOR_STATE{severity="warning"}[7d])` (halved) for
event count.
- timeZone field requires k8s >= 1.27. Falls back to UTC if dropped.
Verified offline with a mocked prometheus response — script compiles,
queries fire, slack payload renders correctly.
Refs: ECHO-817
Co-authored-by: Sameer <sameer@dembrane.com>
Implements the weekly warning digest from ECHO-817. Companion to #23 (ECHO-813), which dropped `severity=warning` notifications.
Shape
One CronJob in the `monitoring` namespace:
- Schedule: `0 9 * * 1` Europe/Amsterdam (Mon 9am).
- Queries the `ALERTS` metric over 7d, filtered to `severity=warning`.
- Posts to `#prod-alerts` via the existing webhook in `monitoring-secrets`.
Output shape
Sample posted message:
Empty week prints a green-check line, not silence — easier to confirm the job actually ran.
Sort is by total firing minutes, descending. The eyeball pattern: which warnings are eating the most time, and are any of those worth promoting to critical?
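A minimal sketch of the output rules described above (sort by firing minutes descending, green-check line for an empty week). This is not the script from the PR: the function names, the message wording, and the alert names in the usage note are all hypothetical, and the only assumption about Slack is that the incoming webhook accepts a JSON body with a `text` field.

```python
def format_digest(rows):
    """Render the digest text.

    rows: list of (alertname, firing_minutes, event_count) tuples.
    The wording of every line here is illustrative, not the PR's.
    """
    if not rows:
        # Empty week: still post a line so it is obvious the job ran.
        return ":white_check_mark: No warning alerts fired in the past 7d."

    # Most time-consuming warnings first.
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)

    lines = ["Weekly warning digest (past 7d)"]
    for name, minutes, events in ordered:
        lines.append(f"{name}: {minutes:.0f} min firing, {events} events")
    return "\n".join(lines)


def slack_payload(rows):
    """Wrap the digest in the JSON shape an incoming webhook expects."""
    return {"text": format_digest(rows)}
```

Usage, with made-up alert names: `format_digest([("DiskPressure", 30, 2), ("HighLatency", 412, 5)])` lists `HighLatency` before `DiskPressure`, and `format_digest([])` returns the green-check line.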
How the numbers are computed
- Firing minutes: `count_over_time(ALERTS{alertstate="firing", severity="warning"}[7d])` × eval-interval (15s) ÷ 60.
- Event count: `changes(ALERTS_FOR_STATE{severity="warning"}[7d]) / 2`, since `changes()` counts both 0→1 and 1→0 transitions.
Both are approximations, not exact. Good enough for "scan in 10s."
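The arithmetic above, as a small sketch. The helper names are hypothetical; the 15s evaluation interval and the halving of `changes()` are taken from this description, and the inputs are assumed to be the raw values returned by the two queries.

```python
EVAL_INTERVAL_S = 15  # rule evaluation interval stated in this PR


def firing_minutes(sample_count, eval_interval_s=EVAL_INTERVAL_S):
    """Convert a count_over_time() sample count into minutes of firing.

    Each sample stands for one evaluation interval in which the alert fired.
    """
    return sample_count * eval_interval_s / 60


def event_count(changes_value):
    """Halve the changes() result, following the heuristic in this PR."""
    return changes_value / 2


# 240 samples at 15s each: 240 * 15 / 60 = 60 minutes of firing.
print(firing_minutes(240))  # 60.0
# 6 recorded changes are treated as 3 firing events.
print(event_count(6))       # 3.0
```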
Implementation notes
- Stdlib-only Python script (`urllib`). No pip layer, no third-party.
- `cronjob-warning-digest.yaml`.
- `timeZone: Europe/Amsterdam` requires k8s ≥ 1.27. If the cluster is older, drop the field and switch the schedule to UTC.
Out of scope (deliberate)
- `#alerts-non-prod` digest. Filed separately: once ECHO-814 lands and there are non-prod warnings to summarise, this CronJob can be cloned with a different label filter + webhook.
Verification
Ran the script offline with a mocked prometheus response — compiles clean, queries fire, slack payload formats correctly.
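For reviewers who want to repeat a check like this, here is a rough sketch of an offline run against a mocked response. It is not the test used for this PR. It assumes the standard Prometheus instant-query endpoint (`/api/v1/query`) and its JSON vector format; the base URL, function names, and alert names are hypothetical.

```python
import json
from urllib.parse import urlencode

# Hypothetical in-cluster address; substitute whatever the CronJob uses.
PROM_URL = "http://prometheus.monitoring.svc:9090"

QUERY = 'count_over_time(ALERTS{alertstate="firing", severity="warning"}[7d])'


def build_query_url(base, promql):
    """Build an instant-query URL with the PromQL expression URL-encoded."""
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(body):
    """Sum an instant-vector response per alertname."""
    doc = json.loads(body)
    if doc.get("status") != "success":
        raise ValueError(f"prometheus query failed: {doc.get('error')}")
    totals = {}
    for series in doc["data"]["result"]:
        name = series["metric"].get("alertname", "unknown")
        # Each value is a [timestamp, "string-encoded number"] pair.
        totals[name] = totals.get(name, 0.0) + float(series["value"][1])
    return totals


# Mocked response: two HighLatency series (different instances), one DiskPressure.
MOCK_BODY = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"alertname": "HighLatency", "instance": "a"},
             "value": [1700000000, "200"]},
            {"metric": {"alertname": "HighLatency", "instance": "b"},
             "value": [1700000000, "40"]},
            {"metric": {"alertname": "DiskPressure", "instance": "a"},
             "value": [1700000000, "8"]},
        ],
    },
})

print(build_query_url(PROM_URL, QUERY))
print(parse_vector(MOCK_BODY))
```

Keeping URL construction and response parsing in separate pure functions means both can be exercised without a live Prometheus.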
Confidence
Medium. Script logic is straightforward and tested offline. Three things I'd want a human eye on:
- `changes()` heuristic: for very flappy alerts (rapid on/off) the event count could over-count. Probably fine for first pass.
- `timeZone` field: if the cluster is < 1.27, this needs to drop.
- Posts to `#prod-alerts` (the channel the existing webhook is bound to). If you want the digest somewhere else, point me at the new webhook secret.
Refs: ECHO-817