Skip to content

fix: Keep reporter thread alive when a collector raises#133

Merged
adamlogic merged 1 commit into
mainfrom
adam/jdo-1362-reporter-thread-dies-on-a-single-collector-exception
May 11, 2026
Merged

fix: Keep reporter thread alive when a collector raises#133
adamlogic merged 1 commit into
mainfrom
adam/jdo-1362-reporter-thread-dies-on-a-single-collector-exception

Conversation

@adamlogic
Copy link
Copy Markdown
Member

@adamlogic adamlogic commented May 11, 2026

Summary

Closes JDO-1362.

Today, if any collector.collect() raises, the exception propagates all the way through _run_loop and kills the reporter thread for the lifetime of the process — silently disabling all metric reporting (across every adapter) until the dyno restarts.

We hit this in the wild on a customer's Celery + Heroku TLS Redis setup where inspect.active() (triggered by TRACK_BUSY_JOBS=True) blew up with an SSL handshake error. The stack trace ended at threading.py:1012, in run — the reporter thread was gone. See the linked Plain thread on JDO-1362 for context.

This PR adds two defensive layers in judoscale/core/reporter.py:

  1. all_metrics now wraps each collector.collect() in try/except, logs the failing collector by class name with a full traceback, and continues with the rest. One bad collector no longer takes down the others.
  2. _run_loop now wraps _report_metrics() in try/except so any unanticipated error in a reporting cycle is logged but the thread survives and tries again on the next interval.

Separately, JDO-1363 tracks investigating the actual SSL/pidbox root cause.

Test plan

  • Added test_all_metrics_continues_when_one_collector_raises — a failing collector raises RuntimeError, a healthy collector still produces its metric, and the failure is logged.
  • Added test_run_loop_survives_report_metrics_exception_report_metrics raises on the first cycle; the run loop runs another cycle (proving the thread is alive) before being stopped cleanly.
  • Full suite passes locally (137/137) with all extras installed.

Made with Cursor


Note

Medium Risk
Changes core metrics reporting resilience by swallowing/logging exceptions in the reporter loop and per-collector collection; risk is moderate because it can mask unexpected failures and alters runtime behavior of the background thread.

Overview
Improves reporter robustness by preventing metric reporting from stopping when unexpected exceptions occur.

The reporter _run_loop now wraps _report_metrics() in a broad try/except to log failures and continue on the next interval, and all_metrics now logs and skips individual collectors whose collect() raises so other collectors still report.

Adds tests covering continued metric collection when one collector fails and ensuring the reporter thread survives a _report_metrics exception.

Reviewed by Cursor Bugbot for commit 330f409. Bugbot is set up for automated code reviews on this repo. Configure here.

A single collector raising an exception during collect() would propagate
all the way up through _run_loop, killing the reporter thread for the
lifetime of the process. Wrap each collector's collect() and the per-cycle
_report_metrics() call so transient/unexpected errors are logged but the
reporter keeps running on the next interval.

JDO-1362

Co-authored-by: Cursor <cursoragent@cursor.com>
@adamlogic adamlogic marked this pull request as ready for review May 11, 2026 18:18
Copy link
Copy Markdown
Member

@carlosantoniodasilva carlosantoniodasilva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍 , works similarly to our Ruby implementation with log exceptions...

@adamlogic adamlogic merged commit ece7ea1 into main May 11, 2026
6 checks passed
@adamlogic adamlogic deleted the adam/jdo-1362-reporter-thread-dies-on-a-single-collector-exception branch May 11, 2026 19:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants