fix: Include broker connection stats in adapter report payload #134
Conversation
Reports `connected_clients` and `maxclients` from Redis `INFO clients` under the celery adapter's entry on each report cycle. Surfaces when the broker is approaching its connection cap, which on TLS Redis can otherwise manifest as an opaque SSLEOFError during a new connection's handshake rather than a clear "max number of clients reached" error. JDO-1363 Co-authored-by: Cursor <cursoragent@cursor.com>
Logs a warning during `_refresh_broker_stats` whenever the broker reports `maxclients - connected_clients < 10`. Surfaces the connection-cap pressure that causes `inspect.active()` to fail on TLS Redis (where exhaustion manifests as an opaque SSLEOFError) before the collector starts dropping cycles. JDO-1363 Co-authored-by: Cursor <cursoragent@cursor.com>
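The headroom check these commits describe can be sketched as follows. This is a minimal illustration, not the PR's actual code: `check_broker_headroom` is a hypothetical stand-in for the logic inside `_refresh_broker_stats`, and only the threshold constant and the tail of the log message are taken from the diffs quoted below.

```python
import logging

logger = logging.getLogger(__name__)

# Threshold from the commit message above; later in the thread the PR
# debates 10 vs 20 and whether to log at WARN or INFO.
BROKER_CONNECTIONS_WARN_THRESHOLD = 10


def check_broker_headroom(connected_clients, maxclients):
    """Sketch (hypothetical name): log when the broker's remaining
    connection slots drop below the threshold, and return the headroom."""
    remaining = maxclients - connected_clients
    if remaining < BROKER_CONNECTIONS_WARN_THRESHOLD:
        logger.warning(
            "Redis broker is near its connection limit: %s/%s connections "
            "in use (%s remaining). New connections may be rejected, which "
            "on TLS Redis can surface as SSL handshake errors.",
            connected_clients, maxclients, remaining,
        )
    return remaining
```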
Force-pushed from 7226b03 to 900fd3c
carlosantoniodasilva
left a comment
Looks good, I had a couple thoughts below, mainly about whether to expand on the adapters key vs exposing this as separate metadata.
Do you plan on shipping and keeping this, or is it more of a potential temp troubleshooting change?
```python
if self.metrics_collector is not None:
    # Let the collector contribute adapter-scoped fields (broker
    # stats, etc.) computed during its most recent collect() cycle.
    extra = getattr(self.metrics_collector, "report_metadata", None) or {}
```
We introduced the concept of "metadata" added to each report, which is stored separately: I had applied it to a previous PR for troubleshooting but it never got merged, but the infra should still be in place.
Applying this to the adapter info itself means it gets mangled with adapters and will cause additional updates on the backend as the adapters get propagated, so it might be best if we can keep it out of that. (adapters is more "stable" and only changes with version changes, metadata is more volatile since it is likely going to be different with each report)
Good call, I forgot we supported that.
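The stable-vs-volatile split the reviewer describes can be illustrated with two payload shapes. Everything here except the `broker` numbers and the `adapters`/`metadata` keys is an assumed example, not the real report format.

```python
# Nesting broker stats under the adapter entry makes the otherwise-stable
# "adapters" section change on every report cycle:
report_under_adapters = {
    "adapters": {
        "celery": {
            "version": "5.4.0",  # assumed example value
            "broker": {"connected_clients": 3, "maxclients": 40},
        },
    },
}

# Keeping them in the per-report "metadata" block leaves "adapters"
# untouched except on version changes:
report_under_metadata = {
    "adapters": {"celery": {"version": "5.4.0"}},
    "metadata": {"broker": {"connected_clients": 3, "maxclients": 40}},
}
```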
```python
# Crossing into single digits is a strong predictor of pidbox failures
# (e.g. the SSLEOFError seen on TLS Redis when new connections are
# rejected mid-handshake under cap exhaustion).
BROKER_CONNECTIONS_WARN_THRESHOLD = 10
```
I wonder if this should be a % of connections left, e.g. if you have 40 connections then maybe 4, if you have 100 then it's 10. (assuming we keep a 10% threshold)
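A percentage-based threshold as floated here could look like this. The function name and the floor parameter are hypothetical; only the 10% fraction and the 40/100-client examples come from the comment.

```python
def percent_headroom_threshold(maxclients, fraction=0.10, floor=1):
    """Warn once fewer than `fraction` of the broker's total client
    slots remain, never letting the threshold drop below `floor`."""
    return max(floor, int(maxclients * fraction))
```

With a 10% fraction this yields 4 for a 40-client cap and 10 for a 100-client cap, matching the reviewer's examples.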
I went with a static number because our connection usage is not dependent on the total max clients. Our "busy job tracking" can use up to 18 clients in my testing. (Details in Linear comment)
But as for why I chose "10" as the number... I'm not really sure. Since I'm seeing 18 clients in my testing, maybe 20 makes more sense here.
I think I want to stick with 10... 20 feels too aggressive since Heroku's cheapest Redis plans have very restrictive limits. We'll only potentially use more than 10 connections if the client is configured to track busy jobs, so I feel like 10 hits the spot between the two. 🤷♂️ Not a strong opinion on it.
Sounds good, it's a starting point we can tweak going forward.
It's really odd how many clients/connections tracking busy jobs can take tbh.
20 feels too aggressive
I'm changing my mind on this. I think 20 is a better starting point based on what I'm seeing with a current customer support thread. I'm going to change the logging from WARN to INFO though so it's easier to silence (it's going to log every 10 seconds).
```python
stats: Dict[str, int] = {}
for key in ("connected_clients", "maxclients"):
    value = info.get(key) if isinstance(info, dict) else None
```
It feels like this `if isinstance(info, dict)` could be pushed up, since it is a guard that has nothing to do with the value check? It's fine as-is though, just a thought.
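Sketched out, the suggestion hoists the type check so the loop only deals with per-key values. The function name is made up and the missing-key handling is an assumption; only the guard and the two keys come from the quoted diff.

```python
from typing import Dict


def parse_client_stats(info) -> Dict[str, int]:
    # Check the reply's type once, up front, instead of re-evaluating
    # the isinstance() guard on every loop iteration.
    if not isinstance(info, dict):
        return {}
    stats: Dict[str, int] = {}
    for key in ("connected_clients", "maxclients"):
        value = info.get(key)
        if value is None:
            return {}  # drop the whole block rather than report partial stats
        stats[key] = int(value)
    return stats
```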
```python
def test_report_metadata_empty_before_collect(self, worker_1, celery):
    celery.connection_for_read().channel().client.scan_iter.return_value = []
    collector = CeleryMetricsCollector(worker_1, celery)
    assert collector.report_metadata == {}

def test_report_metadata_populated_after_collect(self, worker_1, celery):
    celery.connection_for_read().channel().client.scan_iter.return_value = []
    collector = CeleryMetricsCollector(worker_1, celery)
    collector.collect()
    assert collector.report_metadata == {
        "broker": {"connected_clients": 3, "maxclients": 40}
    }
```
Feels like these could be merged maybe.
Suggested change:

```diff
-def test_report_metadata_empty_before_collect(self, worker_1, celery):
-    celery.connection_for_read().channel().client.scan_iter.return_value = []
-    collector = CeleryMetricsCollector(worker_1, celery)
-    assert collector.report_metadata == {}
-
-def test_report_metadata_populated_after_collect(self, worker_1, celery):
-    celery.connection_for_read().channel().client.scan_iter.return_value = []
-    collector = CeleryMetricsCollector(worker_1, celery)
-    collector.collect()
-    assert collector.report_metadata == {
-        "broker": {"connected_clients": 3, "maxclients": 40}
-    }
+def test_report_metadata_populates_after_collect(self, worker_1, celery):
+    celery.connection_for_read().channel().client.scan_iter.return_value = []
+    collector = CeleryMetricsCollector(worker_1, celery)
+    assert collector.report_metadata == {}
+    collector.collect()
+    assert collector.report_metadata == {
+        "broker": {"connected_clients": 3, "maxclients": 40}
+    }
```
```python
        for record in caplog.records
    )

def test_does_not_warn_when_broker_has_headroom(
```
We could also maybe test this one as part of the base "populate" test, since it's all the same setup, just a matter of checking no log output. But it's fine on its own too to maybe not make that one too noisy.
adamlogic
left a comment
Do you plan on shipping and keeping this, or is it more of a potential temp troubleshooting change?
Was thinking it'd be helpful to have permanently.
Per Carlos's PR review: keep adapter-scoped stable fields in `adapters` and put volatile per-report fields in a sibling `metadata` block. Also merge two report_metadata tests per his code suggestion. Co-authored-by: Cursor <cursoragent@cursor.com>
Per Carlos's PR review: the type guard has nothing to do with the per-key value check, so handle it once up front alongside the other early-return failure modes. Co-authored-by: Cursor <cursoragent@cursor.com>
Per Carlos's PR review: the no-warning case shares its setup with `test_report_metadata_populates_after_collect`, so assert it there instead of in a dedicated test. Co-authored-by: Cursor <cursoragent@cursor.com>
adamlogic
left a comment
Thanks for the suggestions! Ready for another look.
carlosantoniodasilva
left a comment
Just another small suggestion, but otherwise LGTM! 👍
```python
# INFO call failed so the report doesn't carry stale numbers.
if not self._broker_stats:
    return {}
return {"broker": dict(self._broker_stats)}
```
Since this goes in a "generic metadata" field, maybe we prefix with celery, e.g.:
Suggested change:

```diff
-return {"broker": dict(self._broker_stats)}
+return {"celery-broker": dict(self._broker_stats)}
```
or similar... just in case we add more stuff to metadata from other packages in the future. (for example dramatiq also has the concept of a "broker")
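The collision the prefix avoids is easy to show with a plain dict merge. The dramatiq numbers are invented for illustration; only the `celery-broker` key and the celery values come from the thread.

```python
# If two packages both contributed a bare "broker" key, merging their
# metadata would silently keep only the last one:
celery_meta = {"broker": {"connected_clients": 3, "maxclients": 40}}
dramatiq_meta = {"broker": {"connected_clients": 5, "maxclients": 100}}
collided = {**celery_meta, **dramatiq_meta}  # celery's entry is lost

# Prefixing each entry with its package name keeps both:
namespaced = {
    "celery-broker": {"connected_clients": 3, "maxclients": 40},
    "dramatiq-broker": {"connected_clients": 5, "maxclients": 100},
}
```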
Namespacing the key as `celery-broker` keeps the generic metadata field collision-free as other adapter packages (e.g. dramatiq, which also has a broker) start contributing their own entries. Co-authored-by: Cursor <cursoragent@cursor.com>
Raising the threshold to 20 gives a little more lead time before the pidbox starts losing connections, and demoting the message to INFO keeps it out of WARN-level dashboards now that we're casting a wider net. Co-authored-by: Cursor <cursoragent@cursor.com>
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 8e4b971.
```python
    f"connections in use ({remaining} remaining). "
    f"New connections may be rejected, which on TLS Redis "
    f"can surface as SSL handshake errors."
)
```
Log level is info instead of warning
Medium Severity
The broker connection-limit message uses logger.info() but the PR description explicitly states it should be a warning — including a sample output prefixed with WARNING. Using INFO means the message won't appear at the WARNING log level many production deployments default to, defeating the purpose of alerting operators to connection-cap pressure.
```python
# Running low on headroom is a strong predictor of pidbox failures
# (e.g. the SSLEOFError seen on TLS Redis when new connections are
# rejected mid-handshake under cap exhaustion).
BROKER_CONNECTIONS_INFO_THRESHOLD = 20
```
Threshold constant doesn't match agreed-upon value
Medium Severity
BROKER_CONNECTIONS_INFO_THRESHOLD is set to 20, but the PR description states the threshold is 10 (with an example showing 9 remaining triggering the warning). The PR discussion also has the author explicitly saying "I think I want to stick with 10" and the reviewer agreeing. The constant appears to not have been updated to match the final decision.


Summary
- Reads `connected_clients` and `maxclients` from the broker Redis via `INFO clients` at the top of every `CeleryMetricsCollector.collect()` cycle and ships them under a new `broker` block on the celery adapter's entry in the outgoing report.
- Logs when remaining connection headroom runs low, before `inspect.active()` starts failing.
- Adds a `report_metadata` property to the `Collector` protocol (default `{}`) and merges it into `Adapter.as_tuple` so any collector can contribute adapter-scoped fields without further wiring.
- Why it matters: on TLS Redis, hitting the connection cap surfaces as an opaque `SSLEOFError: UNEXPECTED_EOF_WHILE_READING` during a fresh connection's handshake (Redis closes the socket before any application-layer error can be sent), instead of the clean `max number of clients reached` error you'd see over plain TCP.

JDO-1363
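The `INFO clients` snapshot the summary describes amounts to one call on the already-open redis-py client. A sketch, mocked so it runs without a live broker; the reply values echo the ones used in the PR's tests.

```python
from unittest.mock import MagicMock

# Stand-in for an already-open redis-py client; Redis.info("clients")
# returns a dict containing the "clients" section fields.
client = MagicMock()
client.info.return_value = {"connected_clients": 3, "maxclients": 40}

info = client.info("clients")  # one round trip on the open connection
broker_block = {key: info[key] for key in ("connected_clients", "maxclients")}
```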
Cost
`INFO clients` is one round trip on the already-open broker connection. Benchmarked locally at ~70-100µs server-side; expect ~1-2ms total in-region. The collector already calls `INFO` (all sections) at startup for the version check, so this isn't a new permission requirement on managed Redis providers.

Payload shape
Old:
New (when stats are available):
The `broker` block is omitted entirely if the `INFO` call fails or returns missing keys, so the report never carries stale or partial numbers. Additive change to the payload; older agents simply don't send it.

Warning behavior
When `maxclients - connected_clients < 10`: one warning per report cycle (every 10s by default). Stays silent when headroom is healthy.
Test plan
- `poetry run pytest`: 151 passed (6 new tests covering the broker-stats lifecycle, warning threshold, and adapter merge behavior).
- The `broker` block makes it onto the payload on both cycles, the warning fires whenever `< 10` slots remain, and `inspect.active()`'s `SSLEOFError` on cycle 2 is absorbed by the JDO-1362 collector-survival wrapper without losing the broker metadata.
- `broker` field: tracked separately.

Note
Medium Risk
Adds a new top-level `metadata` to the report payload and introduces Redis `INFO clients` calls/logging in the Celery collector; could affect downstream API parsing and increase broker load/log volume if misconfigured.

Overview
Reports now include a new top-level `metadata` block, built by merging a new `report_metadata` property exposed by all metrics collectors. `CeleryMetricsCollector` now snapshots Redis broker connection stats (`connected_clients`, `maxclients`) via `INFO clients` each `collect()` cycle, publishes them under `metadata['celery-broker']`, and logs when remaining connection headroom drops below a threshold.

Tests were updated/added to cover section-aware Redis `INFO` mocking, metadata inclusion/omission behavior, and the near-connection-limit log trigger.