fix(monitoring): tighten grafana log panel regex (ECHO-815) #22

Draft
spashii wants to merge 1 commit into main from sam/echo-815-grafana-log-regex
Conversation

@spashii
Member

@spashii spashii commented May 13, 2026

Closes the substring false-positive problem on Grafana log panels described in ECHO-815.

What changes

Replace the substring regex used in all log panels:

(?i)(error|exception|fail|fatal|critical|panic|warn|warning)

with level-anchored patterns:

  • Error panels: \b(ERROR|FATAL|CRITICAL|Exception|Traceback)\b|\[(error|fatal|critical)\]
  • Warning panels: \b(WARN|WARNING)\b|\[(warn|warning)\]

Touches 14 places in configmap-grafana-dashboards.yaml:

  • 10 per-service "Error Logs" queries (api / directus / worker / worker-cpu / neo4j × 2 panels)
  • 1 "Echo Application Error Logs" panel
  • 2 logfilter variable values + 1 query string

Why this drops the noise

_ is a word character in RE2. \berror\b does NOT match inside transcript_error / latest_error / latest_error_code — directus's loudest column-name offenders disappear without exclusion lists.

ERROR|FATAL|CRITICAL is case-sensitive on purpose: matches python logging level prefixes (confirmed: echo server uses standard logging.getLogger(...)). Exception / Traceback covers tracebacks. \[(error|fatal|critical)\] covers nginx-style bracketed lowercase.
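
To make the word-boundary and case-sensitivity argument concrete, here is a minimal sketch that runs the old and new error-panel patterns side by side. It uses Python's `re` as a stand-in for RE2; the two agree on `\b`, alternation, and `_` being a word character for ASCII input, but that equivalence is an assumption of the sketch, not something checked against Loki. The log lines are made up for illustration.

```python
import re

# Old pattern from this PR: case-insensitive substring match.
OLD = re.compile(r"(?i)(error|exception|fail|fatal|critical|panic|warn|warning)")
# New error-panel pattern from this PR: case-sensitive, word-bounded tokens,
# plus bracketed lowercase levels.
NEW_ERROR = re.compile(
    r"\b(ERROR|FATAL|CRITICAL|Exception|Traceback)\b|\[(error|fatal|critical)\]"
)

# Hypothetical log lines, invented for this example.
lines = [
    "INFO: updated latest_error_code for row 7",   # lowercase column name
    "INFO: column TRANSCRIPT_ERROR backfilled",    # uppercase, but glued to "_"
    "ERROR: upstream timed out",                   # python-style level prefix
    "2026/05/13 10:00:00 [error] 31#31: connect() failed",  # nginx-style
]

def matches(pattern, line):
    return pattern.search(line) is not None

results = [(line, matches(OLD, line), matches(NEW_ERROR, line)) for line in lines]
for line, old, new in results:
    print(f"old={old!s:5} new={new!s:5} {line}")
```

The second line is the one that depends on `\b`: `ERROR` is present in uppercase, but the preceding `_` is a word character, so there is no boundary and the new pattern skips it.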

What this drops that you might miss

  • Lowercase mentions of "error" inside log bodies at non-error levels (e.g. an INFO line that talks about an error). Mostly low-signal.
  • Custom log formats not using one of these tokens. None found in the echo codebase.
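
A short sketch of what falls out of the error panels under the new pattern, again using Python's `re` as an approximation of RE2. The sample lines are hypothetical. Note that `fail` and `panic` were in the old alternation and are not in the new one, so lines that carry only those words count as "not using one of these tokens".

```python
import re

# New error-panel pattern from this PR.
NEW_ERROR = re.compile(
    r"\b(ERROR|FATAL|CRITICAL|Exception|Traceback)\b|\[(error|fatal|critical)\]"
)

# Hypothetical lines the old substring regex would have caught.
candidates = [
    "INFO: retrying after error from upstream",   # lowercase mention at INFO level
    "panic: runtime error: index out of range",   # no level token, no brackets
    "request failed with status 500",             # "fail" is no longer a token
]

still_matched = [line for line in candidates if NEW_ERROR.search(line)]
dropped = [line for line in candidates if not NEW_ERROR.search(line)]

for line in dropped:
    print("dropped:", line)
```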

The real fix (promtail JSON pipeline → extract level as a label → panels filter on {level=~"error|fatal|critical"}) supersedes this. Filed in ECHO-815 as a follow-up.

Verification

Locally JSON-parsed each changed expr string and ran the inner regex against sample log lines:

| line | matches |
|---|---|
| ERROR: kafka conn refused | yes |
| Exception in thread main | yes |
| nginx [error] 502 bad gateway | yes |
| INFO: latest_error column was null | no |
| INFO: transcript_error field updated | no |
| INFO accessing /api/transcript_errors?id=42 | no |
| WARNING: slow query | no (handled by warning panel) |
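
The check above can be sketched roughly as follows. This is not the script that was actually run: the panel fragment, its `targets[0].expr` shape, the `{app="api"}` selector, and the `inner_regex` helper are all hypothetical, and the sketch assumes the pattern sits in a double-quoted LogQL string after `|~`. Python's `re` stands in for RE2.

```python
import json
import re

# Hypothetical panel fragment; the real dashboard JSON may be shaped differently.
PANEL_JSON = r'''{"targets": [{"expr": "{app=\"api\"} |~ \"\\\\b(ERROR|FATAL|CRITICAL|Exception|Traceback)\\\\b|\\\\[(error|fatal|critical)\\\\]\""}]}'''

def inner_regex(expr):
    """Pull the double-quoted pattern after `|~` and undo its backslash escaping."""
    literal = re.search(r'\|~\s*("(?:[^"\\]|\\.)*")', expr).group(1)
    # Assumption: only backslash and quote escapes occur, which JSON decodes the same way.
    return json.loads(literal)

expr = json.loads(PANEL_JSON)["targets"][0]["expr"]
error_re = re.compile(inner_regex(expr))
# Warning-panel pattern from this PR, written out directly.
warning_re = re.compile(r"\b(WARN|WARNING)\b|\[(warn|warning)\]")

# Sample lines taken from the verification table.
SAMPLES = [
    "ERROR: kafka conn refused",
    "Exception in thread main",
    "nginx [error] 502 bad gateway",
    "INFO: latest_error column was null",
    "INFO: transcript_error field updated",
    "INFO accessing /api/transcript_errors?id=42",
    "WARNING: slow query",
]

table = [
    (line, bool(error_re.search(line)), bool(warning_re.search(line)))
    for line in SAMPLES
]
for line, is_error, is_warning in table:
    print(f"error={is_error!s:5} warning={is_warning!s:5} {line}")
```

Going through `json.loads` twice mirrors the two layers of escaping in this sketch: once for the dashboard JSON, once for the quoted LogQL string inside `expr`.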

Helm is not installed locally, so I couldn't render the full template. The changes are pure data inside a YAML block scalar and touch no template substitutions.

Confidence

Medium-high. The change is mechanical and the regex was verified against the directus-column false-positives. The unknown is panel-level UX: warning panels now only show WARN/WARNING (instead of also catching the bottom of the error spectrum). If that turns out to be wrong, easy follow-up.

Refs: ECHO-815

Replace substring patterns like
(?i)(error|exception|fail|fatal|critical|panic|warn|warning) with
level-anchored \b(ERROR|FATAL|CRITICAL|Exception|Traceback)\b
|\[(error|fatal|critical)\] across all per-service Error Logs panels,
the Echo Application Error Logs panel, and the logfilter variable.
Same shape for warning panels with WARN|WARNING.

Underscore is a word character in RE2, so \berror\b does not match
inside transcript_error / latest_error / etc. — drops the directus
column noise without exclusion lists. Case-sensitive uppercase
catches python logging level prefixes; bracketed lowercase catches
nginx-style. Exception/Traceback covers python tracebacks.

Real fix (promtail label extraction) supersedes this — separate
follow-up.

Refs: ECHO-815

Co-authored-by: Sameer <sameer@dembrane.com>
@linear

linear Bot commented May 13, 2026

ECHO-815
