Skip to content

Wrap MetricsCollector initialization in log_exceptions#268

Closed
ScotterC wants to merge 1 commit into
judoscale:mainfrom
ScotterC:fix/protect-metrics-collector-initialization
Closed

Wrap MetricsCollector initialization in log_exceptions#268
ScotterC wants to merge 1 commit into
judoscale:mainfrom
ScotterC:fix/protect-metrics-collector-initialization

Conversation

@ScotterC
Copy link
Copy Markdown

MetricsCollector#initialize can run DB queries (e.g., GoodJob fetches queue names via SELECT distinct queue_name). If the database is under load and the query hits a statement timeout (PG::QueryCanceled), the exception propagates through Reporter#start! and crashes the process.

The collect() method is already wrapped in log_exceptions, but the .new call was not. This applies the same protection to initialization, consistent with how transient errors are handled elsewhere.

If a collector fails to initialize, it's skipped and the reporter continues with the remaining collectors. If all collectors fail, the reporter logs a warning and returns without starting the loop.

Fixes #267

MetricsCollector#initialize can run DB queries (e.g., GoodJob fetches
queue names via SELECT distinct queue_name). If the database is under
load and the query hits a statement timeout (PG::QueryCanceled), the
exception propagates through Reporter#start! and crashes the process.

The collect() method is already wrapped in log_exceptions, but the
.new call was not. This applies the same protection to initialization,
consistent with how transient errors are handled elsewhere.

If a collector fails to initialize, it's skipped and the reporter
continues with the remaining collectors. If all collectors fail, the
reporter logs a warning and returns without starting the loop.

Fixes judoscale#267

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@carlosantoniodasilva
Copy link
Copy Markdown
Member

@ScotterC appreciate the issue and pull request!

We discussed this a bit, and are leaning towards going a slightly different direction, by sidestepping the issue entirely, and removing the init methods that eagerly load queues and force a potential DB / Redis connection. This would leave collect handling all queue related business / queries, which already runs inside the reporter thread and has all exception handling.

I have a WIP change here: #269

@carlosantoniodasilva
Copy link
Copy Markdown
Member

@ScotterC I just released a new v1.13.2 which removes the init eager loading of queues in favor of fetching them on the first collect, which should hopefully handle this better going forward. Please let us know how that works for you. Thanks again for your report.

@ScotterC
Copy link
Copy Markdown
Author

Great. I'll try it out

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MetricsCollector#initialize crashes process on DB timeout (PG::QueryCanceled)

2 participants