fix: Keep reporter thread alive when a collector raises #133
Merged
adamlogic merged 1 commit on May 11, 2026
A single collector raising an exception during collect() would propagate all the way up through _run_loop, killing the reporter thread for the lifetime of the process. Wrap each collector's collect() and the per-cycle _report_metrics() call so transient/unexpected errors are logged but the reporter keeps running on the next interval.

JDO-1362

Co-authored-by: Cursor <cursoragent@cursor.com>
carlosantoniodasilva (Member) approved these changes on May 11, 2026 and left a comment:
LGTM 👍 , works similarly to our Ruby implementation with log exceptions...
Summary
Closes JDO-1362.
Today, if any `collector.collect()` raises, the exception propagates all the way through `_run_loop` and kills the reporter thread for the lifetime of the process, silently disabling all metric reporting (across every adapter) until the dyno restarts.

We hit this in the wild on a customer's Celery + Heroku TLS Redis setup where `inspect.active()` (triggered by `TRACK_BUSY_JOBS=True`) blew up with an SSL handshake error. The stack trace ended at `threading.py:1012, in run`: the reporter thread was gone. See the linked Plain thread on JDO-1362 for context.

This PR adds two defensive layers in `judoscale/core/reporter.py`:

- `all_metrics` now wraps each `collector.collect()` in try/except, logs the failing collector by class name with a full traceback, and continues with the rest. One bad collector no longer takes down the others.
- `_run_loop` now wraps `_report_metrics()` in try/except so any unanticipated error in a reporting cycle is logged but the thread survives and tries again on the next interval.

Separately, JDO-1363 tracks investigating the actual SSL/pidbox root cause.
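For readers without the diff open, the two layers can be sketched roughly as follows. This is an illustration under assumptions, not the code in `judoscale/core/reporter.py`: the `SketchReporter` class, its constructor arguments, the `reported` list, the `stop()` helper, and the log messages are all hypothetical, and `all_metrics` is written as a plain method here even though the real attribute may be a property.

```python
import logging
import threading

logger = logging.getLogger("reporter_sketch")  # hypothetical logger name


class SketchReporter:
    """Hypothetical stand-in for the reporter; not the real judoscale class."""

    def __init__(self, collectors, interval=0.01):
        self.collectors = collectors
        self.interval = interval
        self.reported = []  # batches "sent" so far (assumption, for illustration)
        self._stop_event = threading.Event()

    def all_metrics(self):
        metrics = []
        for collector in self.collectors:
            try:
                metrics.extend(collector.collect())
            except Exception:
                # Layer 1: name the failing collector, log the traceback,
                # and carry on with the remaining collectors.
                logger.exception("Collector %s failed", type(collector).__name__)
        return metrics

    def _report_metrics(self):
        self.reported.append(self.all_metrics())

    def _run_loop(self):
        while not self._stop_event.is_set():
            try:
                self._report_metrics()
            except Exception:
                # Layer 2: a failed cycle is logged; the loop (and therefore
                # the thread running it) survives until the next interval.
                logger.exception("Reporting cycle failed")
            self._stop_event.wait(self.interval)

    def stop(self):
        self._stop_event.set()
```

The sketch catches `Exception` rather than `BaseException`, so `KeyboardInterrupt` and `SystemExit` still propagate; whether the merged change draws the line in the same place is not stated in this description.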
Test plan
- `test_all_metrics_continues_when_one_collector_raises`: a failing collector raises `RuntimeError`, a healthy collector still produces its metric, and the failure is logged.
- `test_run_loop_survives_report_metrics_exception`: `_report_metrics` raises on the first cycle; the run loop runs another cycle (proving the thread is alive) before being stopped cleanly.

Made with Cursor
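The run-loop test in the plan above could take roughly the following shape. This is a sketch, not the test added by the PR: the `run_loop` function is a hypothetical stand-in for the loop under test, and its signature, the `stop_event` argument, and the logger name are assumptions made so the example is self-contained.

```python
import logging
import threading

log = logging.getLogger("reporter_test_sketch")  # hypothetical logger name


def run_loop(report_metrics, stop_event, interval=0.01):
    """Hypothetical loop under test: log cycle errors and keep going."""
    while not stop_event.is_set():
        try:
            report_metrics()
        except Exception:
            log.exception("Reporting cycle failed")
        stop_event.wait(interval)


def test_run_loop_survives_report_metrics_exception():
    stop_event = threading.Event()
    calls = []

    def report_metrics():
        calls.append(len(calls) + 1)
        if len(calls) == 1:
            raise RuntimeError("first cycle fails")
        # Reaching a second cycle shows the thread is still alive;
        # ask the loop to stop cleanly.
        stop_event.set()

    thread = threading.Thread(
        target=run_loop, args=(report_metrics, stop_event), daemon=True
    )
    thread.start()
    thread.join(timeout=2)

    assert not thread.is_alive()  # the loop exited because it was stopped
    assert calls == [1, 2]        # the first cycle raised, the second still ran
    return calls
```

Stopping the loop from inside the second cycle keeps the test deterministic: it does not depend on sleeping for a guessed number of intervals.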
Note
Medium Risk
Changes core metrics reporting resilience by swallowing/logging exceptions in the reporter loop and per-collector collection; risk is moderate because it can mask unexpected failures and alters runtime behavior of the background thread.
Overview
Improves reporter robustness by preventing metric reporting from stopping when unexpected exceptions occur.
The reporter `_run_loop` now wraps `_report_metrics()` in a broad `try/except` to log failures and continue on the next interval, and `all_metrics` now logs and skips individual collectors whose `collect()` raises so other collectors still report.

Adds tests covering continued metric collection when one collector fails and ensuring the reporter thread survives a `_report_metrics` exception.

Reviewed by Cursor Bugbot for commit 330f409. Bugbot is set up for automated code reviews on this repo.