
fix: Include broker connection stats in adapter report payload #134

Merged
adamlogic merged 7 commits into main from adam/jdo-1363-report-broker-connection-stats on May 13, 2026

Conversation

Member

@adamlogic adamlogic commented May 11, 2026

Summary

  • Snapshots connected_clients and maxclients from the broker Redis via INFO clients at the top of every CeleryMetricsCollector.collect() cycle and ships them under a new broker block on the celery adapter's entry in the outgoing report.
  • Logs a warning whenever the broker reports fewer than 10 free connection slots, so customers can spot connection-cap pressure before inspect.active() starts failing.
  • Adds a report_metadata property to the Collector protocol (default {}) and merges it into Adapter.as_tuple so any collector can contribute adapter-scoped fields without further wiring.
  • Purpose: surface when a broker is approaching its connection cap. On TLS Redis that ceiling otherwise manifests as an opaque SSLEOFError: UNEXPECTED_EOF_WHILE_READING during a fresh connection's handshake (Redis closes the socket before any application-layer error can be sent), instead of the clean "max number of clients reached" error you'd see over plain TCP.

JDO-1363
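A minimal sketch of the collection step described above, assuming a redis-py-style client. The function name, logger name, and error handling here are illustrative; only the stat keys, the omit-on-failure behavior, the threshold, and the warning text come from this PR:

```python
import logging

logger = logging.getLogger("judoscale")

# Threshold from the PR description; the constant name mirrors the
# BROKER_CONNECTIONS_WARN_THRESHOLD discussed in the review threads.
BROKER_CONNECTIONS_WARN_THRESHOLD = 10


def refresh_broker_stats(redis_client):
    """Snapshot connected_clients/maxclients via `INFO clients`.

    Returns {} on any failure so the report can omit the broker block
    entirely rather than carry stale or partial numbers.
    """
    try:
        info = redis_client.info("clients")
    except Exception:
        return {}
    if not isinstance(info, dict):
        return {}

    stats = {}
    for key in ("connected_clients", "maxclients"):
        value = info.get(key)
        if value is None:
            return {}  # missing key: omit the whole block
        stats[key] = int(value)

    remaining = stats["maxclients"] - stats["connected_clients"]
    if remaining < BROKER_CONNECTIONS_WARN_THRESHOLD:
        logger.warning(
            "Broker is near its connection limit: %d/%d connections in use "
            "(%d remaining). New connections may be rejected, which on TLS "
            "Redis can surface as SSL handshake errors.",
            stats["connected_clients"],
            stats["maxclients"],
            remaining,
        )
    return stats
```

Running this once at the top of each collect() cycle gives one snapshot per report, matching the one-warning-per-cycle behavior described below.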

Cost

INFO clients is one round trip on the already-open broker connection. Benchmarked locally at ~70-100µs server-side; expect ~1-2ms total in-region. The collector already calls INFO (all sections) at startup for the version check, so this isn't a new permission requirement on managed Redis providers.

Payload shape

Old:

"judoscale-celery": {"runtime_version": "5.6.3", "adapter_version": "1.13.2"}

New (when stats are available):

"judoscale-celery": {
  "runtime_version": "5.6.3",
  "adapter_version": "1.13.2",
  "broker": {"connected_clients": 31, "maxclients": 40}
}

The broker block is omitted entirely if the INFO call fails or returns missing keys, so the report never carries stale or partial numbers. Additive change to the payload — older agents simply don't send it.
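The merge-and-omit behavior can be sketched as follows. `adapter_payload` and `_StubCollector` are hypothetical names for illustration; the `report_metadata` property, its `{}` default, and the merge into the adapter's entry are from this PR:

```python
def adapter_payload(base_info, metrics_collector=None):
    """Build one adapter's report entry, layering collector metadata on top.

    report_metadata defaults to {} (or the collector may be absent), so a
    collector that contributes nothing leaves the entry unchanged and the
    broker block is simply absent from the payload.
    """
    extra = getattr(metrics_collector, "report_metadata", None) or {}
    return {**base_info, **extra}


class _StubCollector:
    report_metadata = {"broker": {"connected_clients": 31, "maxclients": 40}}


payload = adapter_payload(
    {"runtime_version": "5.6.3", "adapter_version": "1.13.2"},
    _StubCollector(),
)
```

With a collector, `payload` carries the broker block shown above; without one, the entry is just the static version fields, which is why older agents simply don't send it.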

Warning behavior

When maxclients - connected_clients < 10:

WARNING - [judoscale] Broker is near its connection limit: 31/40 connections in use (9 remaining). New connections may be rejected, which on TLS Redis can surface as SSL handshake errors.

One warning per report cycle (every 10s by default). Stays silent when headroom is healthy.

Test plan

  • poetry run pytest — 151 passed (6 new tests covering the broker-stats lifecycle, warning threshold, and adapter merge behavior).
  • Manual end-to-end against a local TLS Redis in Docker, sized so the first reporter cycle succeeds and the second one trips the cap: verified the broker block makes it onto the payload on both cycles, the warning fires whenever < 10 slots remain, and inspect.active()'s SSLEOFError on cycle 2 is absorbed by the JDO-1362 collector-survival wrapper without losing the broker metadata.
  • Backend coordination: the receiving adapter API needs to know how to parse and store the new broker field. Tracked separately.

Note

Medium Risk
Adds new top-level metadata to the report payload and introduces Redis INFO clients calls/logging in the Celery collector; could affect downstream API parsing and increase broker load/log volume if misconfigured.

Overview
Reports now include a new top-level metadata block, built by merging a new report_metadata property exposed by all metrics collectors.

CeleryMetricsCollector now snapshots Redis broker connection stats (connected_clients, maxclients) via INFO clients each collect() cycle, publishes them under metadata['celery-broker'], and logs when remaining connection headroom drops below a threshold.

Tests were updated/added to cover section-aware Redis INFO mocking, metadata inclusion/omission behavior, and the near-connection-limit log trigger.
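A section-aware Redis INFO mock along these lines could back those tests; `make_redis_mock` and the version string are illustrative, not the repo's actual fixtures:

```python
from unittest.mock import MagicMock


def make_redis_mock(connected_clients=3, maxclients=40):
    """Redis client mock whose INFO response depends on the section requested."""
    client = MagicMock()

    def info(section=None):
        if section == "clients":
            return {
                "connected_clients": connected_clients,
                "maxclients": maxclients,
            }
        # Full INFO (no section): include the server fields used by the
        # startup version check alongside the clients fields.
        return {
            "redis_version": "7.2.4",
            "connected_clients": connected_clients,
            "maxclients": maxclients,
        }

    client.info.side_effect = info
    return client
```

Parametrizing `connected_clients` lets one fixture cover both the healthy-headroom case and the near-limit log trigger.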

Reviewed by Cursor Bugbot for commit 8e4b971. Bugbot is set up for automated code reviews on this repo.

Base automatically changed from adam/jdo-1362-reporter-thread-dies-on-a-single-collector-exception to main May 11, 2026 19:39
adamlogic and others added 2 commits May 11, 2026 15:57
Reports `connected_clients` and `maxclients` from Redis `INFO clients`
under the celery adapter's entry on each report cycle. Surfaces when
the broker is approaching its connection cap, which on TLS Redis can
otherwise manifest as an opaque SSLEOFError during a new connection's
handshake rather than a clear "max number of clients reached" error.

JDO-1363

Co-authored-by: Cursor <cursoragent@cursor.com>
Logs a warning during `_refresh_broker_stats` whenever the broker
reports `maxclients - connected_clients < 10`. Surfaces the
connection-cap pressure that causes `inspect.active()` to fail on TLS
Redis (where exhaustion manifests as an opaque SSLEOFError) before the
collector starts dropping cycles.

JDO-1363

Co-authored-by: Cursor <cursoragent@cursor.com>
@adamlogic adamlogic force-pushed the adam/jdo-1363-report-broker-connection-stats branch from 7226b03 to 900fd3c Compare May 11, 2026 19:57
@adamlogic adamlogic marked this pull request as ready for review May 11, 2026 20:33
Comment thread judoscale/celery/collector.py Outdated
Member

@carlosantoniodasilva carlosantoniodasilva left a comment

Looks good, I had a couple thoughts below, mainly about whether to expand on the adapters key vs exposing this as separate metadata.

Do you plan on shipping and keeping this, or is it more of a potential temp troubleshooting change?

Comment thread judoscale/core/adapter.py Outdated
if self.metrics_collector is not None:
# Let the collector contribute adapter-scoped fields (broker
# stats, etc.) computed during its most recent collect() cycle.
extra = getattr(self.metrics_collector, "report_metadata", None) or {}
Member

We introduced the concept of "metadata" added to each report, which is stored separately. I had applied it in a previous PR for troubleshooting that never got merged, but the infra should still be in place.

https://github.com/judoscale/judoscale-python/pull/114/changes#diff-dc3605e76b7fa7403b885af969869ff577f3020b0bea517455a47caeb6bf745aR131-R135

Applying this to the adapter info itself means it gets mangled with adapters and will cause additional updates on the backend as the adapters get propagated, so it might be best if we can keep it out of that. (adapters is more "stable" and only changes with version changes, metadata is more volatile since it is likely going to be different with each report)

Member Author

Good call, I forgot we supported that.

Comment thread judoscale/celery/collector.py Outdated
# Crossing into single digits is a strong predictor of pidbox failures
# (e.g. the SSLEOFError seen on TLS Redis when new connections are
# rejected mid-handshake under cap exhaustion).
BROKER_CONNECTIONS_WARN_THRESHOLD = 10
Member

I wonder if this should be a % of connections left, e.g. if you have 40 connections then maybe 4, if you have 100 then it's 10. (assuming we keep a 10% threshold)

Member Author

I went with a static number because our connection-usage is not dependent on the total max clients. Our "busy job tracking" can use up to 18 clients in my testing. (Details in Linear comment)

But as for why I chose "10" as the number... I'm not really sure. Since I'm seeing 18 clients in my testing, maybe 20 makes more sense here.

Member Author

I think I want to stick with 10... 20 feels too aggressive since Heroku's cheapest Redis plans have very restrictive limits. We'll only potentially use more than 10 connections if the client is configured to track busy jobs, so I feel like 10 hits the spot between the two. 🤷‍♂️ Not a strong opinion on it.

Member

Sounds good, it's a starting point we can tweak going forward.

It's really odd how many clients/connections tracking busy jobs can take tbh.

Member Author

20 feels too aggressive

I'm changing my mind on this. I think 20 is a better starting point based on what I'm seeing in a current customer support thread. I'm going to change the logging from WARN to INFO, though, so it's easier to silence (it's going to log every 10 seconds).

Comment thread judoscale/celery/collector.py Outdated
Comment thread judoscale/celery/collector.py Outdated

stats: Dict[str, int] = {}
for key in ("connected_clients", "maxclients"):
value = info.get(key) if isinstance(info, dict) else None
Member

It feels like this if isinstance(info, dict) could be pushed up, since it is a guard that has nothing to do with the value check? It's fine as-is though, just a thought.
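Hoisted as suggested, the guard might read something like this (a sketch of the idea, not the actual diff):

```python
def extract_broker_stats(info):
    # Handle the non-dict case once up front, alongside the other
    # early-return failure modes, instead of re-checking it per key.
    if not isinstance(info, dict):
        return {}

    stats = {}
    for key in ("connected_clients", "maxclients"):
        value = info.get(key)
        if value is None:
            return {}  # partial data: omit the whole block
        stats[key] = int(value)
    return stats
```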

Comment thread tests/test_collectors.py
Comment thread tests/test_collectors.py Outdated
Comment on lines +372 to +383
def test_report_metadata_empty_before_collect(self, worker_1, celery):
celery.connection_for_read().channel().client.scan_iter.return_value = []
collector = CeleryMetricsCollector(worker_1, celery)
assert collector.report_metadata == {}

def test_report_metadata_populated_after_collect(self, worker_1, celery):
celery.connection_for_read().channel().client.scan_iter.return_value = []
collector = CeleryMetricsCollector(worker_1, celery)
collector.collect()
assert collector.report_metadata == {
"broker": {"connected_clients": 3, "maxclients": 40}
}
Member

Feels like these could be merged maybe.

Suggested change
def test_report_metadata_empty_before_collect(self, worker_1, celery):
celery.connection_for_read().channel().client.scan_iter.return_value = []
collector = CeleryMetricsCollector(worker_1, celery)
assert collector.report_metadata == {}
def test_report_metadata_populated_after_collect(self, worker_1, celery):
celery.connection_for_read().channel().client.scan_iter.return_value = []
collector = CeleryMetricsCollector(worker_1, celery)
collector.collect()
assert collector.report_metadata == {
"broker": {"connected_clients": 3, "maxclients": 40}
}
def test_report_metadata_populates_after_collect(self, worker_1, celery):
celery.connection_for_read().channel().client.scan_iter.return_value = []
collector = CeleryMetricsCollector(worker_1, celery)
assert collector.report_metadata == {}
collector.collect()
assert collector.report_metadata == {
"broker": {"connected_clients": 3, "maxclients": 40}
}

Comment thread tests/test_collectors.py Outdated
for record in caplog.records
)

def test_does_not_warn_when_broker_has_headroom(
Member

We could also maybe test this one as part of the base "populate" test, since it's all the same setup; it's just a matter of checking no log output. But it's fine on its own too, to avoid making that one too noisy.

Member Author

@adamlogic adamlogic left a comment

Do you plan on shipping and keeping this, or is it more of a potential temp troubleshooting change?

Was thinking it'd be helpful to have permanently.

Comment thread judoscale/celery/collector.py Outdated
# Crossing into single digits is a strong predictor of pidbox failures
# (e.g. the SSLEOFError seen on TLS Redis when new connections are
# rejected mid-handshake under cap exhaustion).
BROKER_CONNECTIONS_WARN_THRESHOLD = 10
Member Author

I went with a static number because our connection-usage is not dependent on the total max clients. Our "busy job tracking" can use up to 18 clients in my testing. (Details in Linear comment)

But as for why I chose "10" as the number... I'm not really sure. Since I'm seeing 18 clients in my testing, maybe 20 makes more sense here.

Comment thread tests/test_collectors.py
Comment thread judoscale/core/adapter.py Outdated
if self.metrics_collector is not None:
# Let the collector contribute adapter-scoped fields (broker
# stats, etc.) computed during its most recent collect() cycle.
extra = getattr(self.metrics_collector, "report_metadata", None) or {}
Member Author

Good call, I forgot we supported that.

adamlogic and others added 3 commits May 12, 2026 16:33
Per Carlos's PR review: keep adapter-scoped stable fields in `adapters`
and put volatile per-report fields in a sibling `metadata` block. Also
merge two report_metadata tests per his code suggestion.

Co-authored-by: Cursor <cursoragent@cursor.com>
Per Carlos's PR review: the type guard has nothing to do with the
per-key value check, so handle it once up front alongside the other
early-return failure modes.

Co-authored-by: Cursor <cursoragent@cursor.com>
Per Carlos's PR review: the no-warning case shares its setup with
`test_report_metadata_populates_after_collect`, so assert it there
instead of in a dedicated test.

Co-authored-by: Cursor <cursoragent@cursor.com>
Member Author

@adamlogic adamlogic left a comment

Thanks for the suggestions! Ready for another look.

Member

@carlosantoniodasilva carlosantoniodasilva left a comment

Just another small suggestion, but otherwise LGTM! 👍

Comment thread judoscale/celery/collector.py Outdated
# Crossing into single digits is a strong predictor of pidbox failures
# (e.g. the SSLEOFError seen on TLS Redis when new connections are
# rejected mid-handshake under cap exhaustion).
BROKER_CONNECTIONS_WARN_THRESHOLD = 10
Member

Sounds good, it's a starting point we can tweak going forward.

It's really odd how many clients/connections tracking busy jobs can take tbh.

Comment thread judoscale/celery/collector.py Outdated
# INFO call failed so the report doesn't carry stale numbers.
if not self._broker_stats:
return {}
return {"broker": dict(self._broker_stats)}
Member

Since this goes in a "generic metadata" field, maybe we prefix with celery, e.g.:

Suggested change
return {"broker": dict(self._broker_stats)}
return {"celery-broker": dict(self._broker_stats)}

or similar... just in case we add more stuff to metadata from other packages in the future. (for example dramatiq also has the concept of a "broker")

adamlogic and others added 2 commits May 13, 2026 10:25
Namespacing the key as `celery-broker` keeps the generic metadata field
collision-free as other adapter packages (e.g. dramatiq, which also has
a broker) start contributing their own entries.

Co-authored-by: Cursor <cursoragent@cursor.com>
Raising the threshold to 20 gives a little more lead time before the
pidbox starts losing connections, and demoting the message to INFO
keeps it out of WARN-level dashboards now that we're casting a wider
net.

Co-authored-by: Cursor <cursoragent@cursor.com>
Member Author

@adamlogic adamlogic left a comment

Thanks!

Comment thread judoscale/celery/collector.py Outdated
# Crossing into single digits is a strong predictor of pidbox failures
# (e.g. the SSLEOFError seen on TLS Redis when new connections are
# rejected mid-handshake under cap exhaustion).
BROKER_CONNECTIONS_WARN_THRESHOLD = 10
Member Author

20 feels too aggressive

I'm changing my mind on this. I think 20 is a better starting point based on what I'm seeing in a current customer support thread. I'm going to change the logging from WARN to INFO, though, so it's easier to silence (it's going to log every 10 seconds).

@adamlogic adamlogic merged commit 2b7afdf into main May 13, 2026
6 checks passed
@adamlogic adamlogic deleted the adam/jdo-1363-report-broker-connection-stats branch May 13, 2026 14:30
@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.


f"connections in use ({remaining} remaining). "
f"New connections may be rejected, which on TLS Redis "
f"can surface as SSL handshake errors."
)

Log level is info instead of warning

Medium Severity

The broker connection-limit message uses logger.info() but the PR description explicitly states it should be a warning — including a sample output prefixed with WARNING. Using INFO means the message won't appear at the WARNING log level many production deployments default to, defeating the purpose of alerting operators to connection-cap pressure.


# Running low on headroom is a strong predictor of pidbox failures
# (e.g. the SSLEOFError seen on TLS Redis when new connections are
# rejected mid-handshake under cap exhaustion).
BROKER_CONNECTIONS_INFO_THRESHOLD = 20

Threshold constant doesn't match agreed-upon value

Medium Severity

BROKER_CONNECTIONS_INFO_THRESHOLD is set to 20, but the PR description states the threshold is 10 (with an example showing 9 remaining triggering the warning). The PR discussion also has the author explicitly saying "I think I want to stick with 10" and the reviewer agreeing. The constant appears to not have been updated to match the final decision.


2 participants