feat(monitoring): weekly warning digest cronjob (ECHO-817) #24

Draft
spashii wants to merge 1 commit into main from sam/echo-817-warning-digest-cronjob

spashii commented May 13, 2026

Implements the weekly warning digest from ECHO-817. Companion to #23 (ECHO-813), which dropped severity=warning notifications.

Shape

One CronJob in the monitoring namespace:

  • Schedule: 0 9 * * 1, Europe/Amsterdam (Mon 9am).
  • Source: prometheus ALERTS metric over 7d, filtered to severity=warning.
  • Sink: #prod-alerts via the existing webhook in monitoring-secrets.

Output shape

Sample posted message:

:bar_chart: *Warning digest — last 7d*  (3 unique alertnames)
• *HighNodeCPU* — 3 firing event(s), ~60 min total
• *IngressHighErrorRate* — 2 firing event(s), ~5 min total
• *PodRestartSpike* — 1 firing event(s), ~2 min total

Empty week prints a green-check line, not silence — easier to confirm the job actually ran.

Sort is by total firing minutes, descending. The eyeball pattern: which warnings are eating the most time, and are any of those worth promoting to critical?
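As a rough illustration of the output shape and sort order described above, the formatting step might look like the sketch below. `AlertSummary` and `format_digest` are hypothetical names, and the wording of the empty-week line is an assumption; only the populated layout is taken from the sample message.

```python
from typing import NamedTuple


class AlertSummary(NamedTuple):
    """One digest row (hypothetical container, not taken from the PR)."""
    alertname: str
    events: int
    minutes: int


def format_digest(rows: list[AlertSummary], window: str = "7d") -> str:
    """Render the Slack text, busiest alert (most firing minutes) first."""
    dash = "\u2014"    # same dash character the sample message uses
    bullet = "\u2022"  # same bullet character the sample message uses
    if not rows:
        # Assumed wording: the PR only says an empty week prints a green-check line.
        return f":white_check_mark: *Warning digest {dash} last {window}*  no warnings fired"
    ordered = sorted(rows, key=lambda r: r.minutes, reverse=True)
    lines = [
        f":bar_chart: *Warning digest {dash} last {window}*  "
        f"({len(ordered)} unique alertnames)"
    ]
    for r in ordered:
        lines.append(
            f"{bullet} *{r.alertname}* {dash} {r.events} firing event(s), "
            f"~{r.minutes} min total"
        )
    return "\n".join(lines)
```

Feeding it the three alerts from the sample in any order should reproduce the sample message, with HighNodeCPU on top.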

How the numbers are computed

  • Minutes firing: count_over_time(ALERTS{alertstate="firing", severity="warning"}[7d]) × eval-interval (15s) ÷ 60.
  • Event count: changes(ALERTS_FOR_STATE{severity="warning"}[7d]) / 2, since changes() counts both 0→1 and 1→0 transitions.

Both are approximations, not exact. Good enough for "scan in 10s."
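To make the arithmetic concrete, here is a minimal sketch of the two conversions as described above. The function names are hypothetical, and rounding to the nearest integer is an assumption; the PR does not say how the script rounds.

```python
EVAL_INTERVAL_S = 15  # rule evaluation interval quoted in the description


def minutes_firing(sample_count: float, eval_interval_s: int = EVAL_INTERVAL_S) -> int:
    """Convert a count_over_time(...) result into approximate firing minutes."""
    # Each stored sample stands for one evaluation interval of firing time.
    return round(sample_count * eval_interval_s / 60)


def event_count(state_changes: float) -> int:
    """Convert a changes(...) result into an approximate event count (halved)."""
    return round(state_changes / 2)
```

For example, 240 firing samples at a 15s interval come to 240 × 15 ÷ 60 = 60 minutes, and 6 recorded changes are reported as 3 events.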

Implementation notes

  • Python 3.11-slim, stdlib only (urllib). No pip layer, no third-party.
  • Script is in a ConfigMap, mounted read-only. The CronJob and ConfigMap ship together in cronjob-warning-digest.yaml.
  • timeZone: Europe/Amsterdam requires k8s ≥ 1.27. If the cluster is older, drop the field and switch the schedule to UTC.
  • Resources: 10m CPU / 32Mi mem requested, 100m / 128Mi limit. Job typically runs <2s.
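For readers who want a picture of the stdlib-only approach, the query side might be sketched as follows. This is not the script from the ConfigMap: `PROMETHEUS_URL` is a hypothetical in-cluster address, the function names are made up, and grouping with `sum by (alertname)` in PromQL rather than in Python is an assumption. The `/api/v1/query` endpoint and response layout are the standard Prometheus HTTP API.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical in-cluster address; the PR does not state the real one.
PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"

# Firing samples per alertname over the last 7 days (grouping in PromQL is assumed).
FIRING_SAMPLES_QUERY = (
    "sum by (alertname) (count_over_time("
    'ALERTS{alertstate="firing", severity="warning"}[7d]))'
)


def build_query_url(promql: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_vector(payload: dict) -> dict[str, float]:
    """Map alertname -> value from an instant-vector response body."""
    if payload.get("status") != "success":
        raise RuntimeError(f"prometheus query failed: {payload}")
    return {
        item["metric"].get("alertname", "unknown"): float(item["value"][1])
        for item in payload["data"]["result"]
    }


def run_query(promql: str, timeout_s: float = 10.0) -> dict[str, float]:
    """Execute the query over HTTP; needs network access to Prometheus."""
    with urllib.request.urlopen(build_query_url(promql), timeout=timeout_s) as resp:
        return parse_vector(json.load(resp))
```

Keeping URL building and response parsing separate from the network call means both can be checked offline against a canned response.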

Out of scope (deliberate)

  • #alerts-non-prod digest. Filed separately — once ECHO-814 lands and there are non-prod warnings to summarise, this CronJob can be cloned with a different label filter + webhook.

Verification

Ran the script offline with a mocked prometheus response — compiles clean, queries fire, slack payload formats correctly.
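One way to check the Slack side offline is to build the request object without sending it, then inspect the body. The sketch below assumes a standard Slack incoming webhook that accepts a JSON `{"text": ...}` payload; `SLACK_WEBHOOK_URL` is a hypothetical environment variable name, since the PR does not say how the secret is exposed to the container.

```python
import json
import os
import urllib.request


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for a Slack incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_digest(text: str) -> int:
    """Send the digest and return the HTTP status; needs network access."""
    # Hypothetical env var name; the real script may read the secret differently.
    req = build_slack_request(os.environ["SLACK_WEBHOOK_URL"], text)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

An offline test can then assert on `req.data` and `req.get_method()` without touching the real channel.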

Confidence

Medium. Script logic is straightforward and tested offline. Three things I'd want a human eye on:

  1. The prometheus changes() heuristic — for very flappy alerts (rapid on/off) the event count could over-count. Probably fine for first pass.
  2. timeZone field — if the cluster is < 1.27, this needs to drop.
  3. The slack webhook only renders this in #prod-alerts (the channel the existing webhook is bound to). If you want the digest somewhere else, point me at the new webhook secret.
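On the second point, a fixed UTC schedule cannot track 09:00 Amsterdam time all year, because the offset changes with daylight saving. The sketch below (hypothetical helper name, and it assumes tzdata is available to `zoneinfo`) shows which UTC cron expression matches a given Monday.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def utc_schedule(year: int, month: int, day: int,
                 local_hour: int = 9, tz: str = "Europe/Amsterdam") -> str:
    """Return the Monday cron expression, in UTC, matching local_hour on that date."""
    local = datetime(year, month, day, local_hour, 0, tzinfo=ZoneInfo(tz))
    utc = local.astimezone(timezone.utc)
    # 09:00 in Amsterdam never crosses midnight in UTC, so the weekday stays Monday.
    return f"{utc.minute} {utc.hour} * * 1"


print(utc_schedule(2026, 5, 18))  # a Monday in summer time (CEST)
print(utc_schedule(2026, 1, 5))   # a Monday in winter time (CET)
```

The two calls give `0 7 * * 1` and `0 8 * * 1` respectively, so a UTC-only schedule posts an hour early or late for part of the year unless it is adjusted at each clock change.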

Refs: ECHO-817

Adds a CronJob in the monitoring namespace that runs every Monday at
09:00 Europe/Amsterdam, queries prometheus for the ALERTS metric over
the past 7d, groups by alertname, and posts a scannable summary to
#prod-alerts via the existing webhook in monitoring-secrets.

Replaces the discoverability that ECHO-813's null-receiver routing
removed. Goal: "is there signal in the noise?" — not paging.

Implementation:

  - ConfigMap with a stdlib-only python script (~120 LOC including
    comments). No third-party deps. Uses python:3.11-slim.
  - CronJob mounts the script read-only and the monitoring-secrets
    secret for the slack webhook URL.
  - Uses `count_over_time(ALERTS{alertstate="firing",
    severity="warning"}[7d])` for total firing duration and
    `changes(ALERTS_FOR_STATE{severity="warning"}[7d])` (halved) for
    event count.
  - timeZone field requires k8s >= 1.27. Falls back to UTC if dropped.

Verified offline with a mocked prometheus response — script compiles,
queries fire, slack payload renders correctly.

Refs: ECHO-817

Co-authored-by: Sameer <sameer@dembrane.com>

linear Bot commented May 13, 2026

ECHO-817
