fix: Include broker connection stats in adapter report payload #134
Conversation
Reports `connected_clients` and `maxclients` from Redis `INFO clients` under the celery adapter's entry on each report cycle. Surfaces when the broker is approaching its connection cap, which on TLS Redis can otherwise manifest as an opaque SSLEOFError during a new connection's handshake rather than a clear "max number of clients reached" error. JDO-1363 Co-authored-by: Cursor <cursoragent@cursor.com>
Logs a warning during `_refresh_broker_stats` whenever the broker reports `maxclients - connected_clients < 10`. Surfaces the connection-cap pressure that causes `inspect.active()` to fail on TLS Redis (where exhaustion manifests as an opaque SSLEOFError) before the collector starts dropping cycles. JDO-1363 Co-authored-by: Cursor <cursoragent@cursor.com>
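The headroom check these commits describe can be sketched as follows. This is a minimal illustration, not the PR's actual code: `check_broker_headroom` is a hypothetical stand-in for the logic inside `_refresh_broker_stats`, and only the threshold constant and the tail of the log message are taken from the diffs quoted below.

```python
import logging

logger = logging.getLogger(__name__)

# Threshold from the commit message above; later in the thread the PR
# debates 10 vs 20 and whether to log at WARN or INFO.
BROKER_CONNECTIONS_WARN_THRESHOLD = 10


def check_broker_headroom(connected_clients, maxclients):
    """Sketch (hypothetical name): log when the broker's remaining
    connection slots drop below the threshold, and return the headroom."""
    remaining = maxclients - connected_clients
    if remaining < BROKER_CONNECTIONS_WARN_THRESHOLD:
        logger.warning(
            "Redis broker is near its connection limit: %s/%s connections "
            "in use (%s remaining). New connections may be rejected, which "
            "on TLS Redis can surface as SSL handshake errors.",
            connected_clients, maxclients, remaining,
        )
    return remaining
```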
Force-pushed from 7226b03 to 900fd3c
carlosantoniodasilva
left a comment
Looks good, I had a couple thoughts below, mainly about whether to expand on the adapters key vs exposing this as separate metadata.
Do you plan on shipping and keeping this, or is it more of a potential temp troubleshooting change?
```python
if self.metrics_collector is not None:
    # Let the collector contribute adapter-scoped fields (broker
    # stats, etc.) computed during its most recent collect() cycle.
    extra = getattr(self.metrics_collector, "report_metadata", None) or {}
```
We introduced the concept of "metadata" added to each report, which is stored separately: I had applied it to a previous PR for troubleshooting but it never got merged, but the infra should still be in place.
Applying this to the adapter info itself means it gets mangled with adapters and will cause additional updates on the backend as the adapters get propagated, so it might be best if we can keep it out of that. (adapters is more "stable" and only changes with version changes, metadata is more volatile since it is likely going to be different with each report)
Good call, I forgot we supported that.
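The stable-vs-volatile split the reviewer describes can be illustrated with two payload shapes. Everything here except the `broker` numbers and the `adapters`/`metadata` keys is an assumed example, not the real report format.

```python
# Nesting broker stats under the adapter entry makes the otherwise-stable
# "adapters" section change on every report cycle:
report_under_adapters = {
    "adapters": {
        "celery": {
            "version": "5.4.0",  # assumed example value
            "broker": {"connected_clients": 3, "maxclients": 40},
        },
    },
}

# Keeping them in the per-report "metadata" block leaves "adapters"
# untouched except on version changes:
report_under_metadata = {
    "adapters": {"celery": {"version": "5.4.0"}},
    "metadata": {"broker": {"connected_clients": 3, "maxclients": 40}},
}
```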
```python
# Crossing into single digits is a strong predictor of pidbox failures
# (e.g. the SSLEOFError seen on TLS Redis when new connections are
# rejected mid-handshake under cap exhaustion).
BROKER_CONNECTIONS_WARN_THRESHOLD = 10
```
I wonder if this should be a % of connections left, e.g. if you have 40 connections then maybe 4, if you have 100 then it's 10. (assuming we keep a 10% threshold)
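A percentage-based threshold as floated here could look like this. The function name and the floor parameter are hypothetical; only the 10% fraction and the 40/100-client examples come from the comment.

```python
def percent_headroom_threshold(maxclients, fraction=0.10, floor=1):
    """Warn once fewer than `fraction` of the broker's total client
    slots remain, never letting the threshold drop below `floor`."""
    return max(floor, int(maxclients * fraction))
```

With a 10% fraction this yields 4 for a 40-client cap and 10 for a 100-client cap, matching the reviewer's examples.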
I went with a static number because our connection usage is not dependent on the total max clients. Our "busy job tracking" can use up to 18 clients in my testing. (Details in Linear comment)
But as for why I chose "10" as the number... I'm not really sure. Since I'm seeing 18 clients in my testing, maybe 20 makes more sense here.
I think I want to stick with 10... 20 feels too aggressive since Heroku's cheapest Redis plans have very restrictive limits. We'll only potentially use more than 10 connections if the client is configured to track busy jobs, so I feel like 10 hits the spot between the two. 🤷♂️ Not a strong opinion on it.
Sounds good, it's a starting point we can tweak going forward.
It's really odd how many clients/connections tracking busy jobs can take tbh.
20 feels too aggressive
I'm changing my mind on this. I think 20 is a better starting point based on what I'm seeing with a current customer support thread. I'm going to change the logging from WARN to INFO though so it's easier to silence (it's going to log every 10 seconds).
```python
stats: Dict[str, int] = {}
for key in ("connected_clients", "maxclients"):
    value = info.get(key) if isinstance(info, dict) else None
```
It feels like this `if isinstance(info, dict)` could be pushed up, since it is a guard that has nothing to do with the value check? It's fine as-is though, just a thought.
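Sketched out, the suggestion hoists the type check so the loop only deals with per-key values. The function name is made up and the missing-key handling is an assumption; only the guard and the two keys come from the quoted diff.

```python
from typing import Dict


def parse_client_stats(info) -> Dict[str, int]:
    # Check the reply's type once, up front, instead of re-evaluating
    # the isinstance() guard on every loop iteration.
    if not isinstance(info, dict):
        return {}
    stats: Dict[str, int] = {}
    for key in ("connected_clients", "maxclients"):
        value = info.get(key)
        if value is None:
            return {}  # drop the whole block rather than report partial stats
        stats[key] = int(value)
    return stats
```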
```python
def test_report_metadata_empty_before_collect(self, worker_1, celery):
    celery.connection_for_read().channel().client.scan_iter.return_value = []
    collector = CeleryMetricsCollector(worker_1, celery)
    assert collector.report_metadata == {}

def test_report_metadata_populated_after_collect(self, worker_1, celery):
    celery.connection_for_read().channel().client.scan_iter.return_value = []
    collector = CeleryMetricsCollector(worker_1, celery)
    collector.collect()
    assert collector.report_metadata == {
        "broker": {"connected_clients": 3, "maxclients": 40}
    }
```
Feels like these could be merged maybe.
Suggested change:

```diff
-def test_report_metadata_empty_before_collect(self, worker_1, celery):
-    celery.connection_for_read().channel().client.scan_iter.return_value = []
-    collector = CeleryMetricsCollector(worker_1, celery)
-    assert collector.report_metadata == {}
-
-def test_report_metadata_populated_after_collect(self, worker_1, celery):
-    celery.connection_for_read().channel().client.scan_iter.return_value = []
-    collector = CeleryMetricsCollector(worker_1, celery)
-    collector.collect()
-    assert collector.report_metadata == {
-        "broker": {"connected_clients": 3, "maxclients": 40}
-    }
+def test_report_metadata_populates_after_collect(self, worker_1, celery):
+    celery.connection_for_read().channel().client.scan_iter.return_value = []
+    collector = CeleryMetricsCollector(worker_1, celery)
+    assert collector.report_metadata == {}
+    collector.collect()
+    assert collector.report_metadata == {
+        "broker": {"connected_clients": 3, "maxclients": 40}
+    }
```
```python
        for record in caplog.records
    )

def test_does_not_warn_when_broker_has_headroom(
```
We could also maybe test this one as part of the base "populate" test, since it's all the same setup, just a matter of checking no log output. But it's fine on its own too to maybe not make that one too noisy.
adamlogic
left a comment
Do you plan on shipping and keeping this, or is it more of a potential temp troubleshooting change?
Was thinking it'd be helpful to have permanently.
Per Carlos's PR review: keep adapter-scoped stable fields in `adapters` and put volatile per-report fields in a sibling `metadata` block. Also merge two report_metadata tests per his code suggestion. Co-authored-by: Cursor <cursoragent@cursor.com>
Per Carlos's PR review: the type guard has nothing to do with the per-key value check, so handle it once up front alongside the other early-return failure modes. Co-authored-by: Cursor <cursoragent@cursor.com>
Per Carlos's PR review: the no-warning case shares its setup with `test_report_metadata_populates_after_collect`, so assert it there instead of in a dedicated test. Co-authored-by: Cursor <cursoragent@cursor.com>
adamlogic
left a comment
Thanks for the suggestions! Ready for another look.
carlosantoniodasilva
left a comment
Just another small suggestion, but otherwise LGTM! 👍
```python
# INFO call failed so the report doesn't carry stale numbers.
if not self._broker_stats:
    return {}
return {"broker": dict(self._broker_stats)}
```
Since this goes in a "generic metadata" field, maybe we prefix with celery, e.g.:
Suggested change:

```diff
-return {"broker": dict(self._broker_stats)}
+return {"celery-broker": dict(self._broker_stats)}
```
or similar... just in case we add more stuff to metadata from other packages in the future. (for example dramatiq also has the concept of a "broker")
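The collision the prefix avoids is easy to show with a plain dict merge. The dramatiq numbers are invented for illustration; only the `celery-broker` key and the celery values come from the thread.

```python
# If two packages both contributed a bare "broker" key, merging their
# metadata would silently keep only the last one:
celery_meta = {"broker": {"connected_clients": 3, "maxclients": 40}}
dramatiq_meta = {"broker": {"connected_clients": 5, "maxclients": 100}}
collided = {**celery_meta, **dramatiq_meta}  # celery's entry is lost

# Prefixing each entry with its package name keeps both:
namespaced = {
    "celery-broker": {"connected_clients": 3, "maxclients": 40},
    "dramatiq-broker": {"connected_clients": 5, "maxclients": 100},
}
```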
Namespacing the key as `celery-broker` keeps the generic metadata field collision-free as other adapter packages (e.g. dramatiq, which also has a broker) start contributing their own entries. Co-authored-by: Cursor <cursoragent@cursor.com>
Raising the threshold to 20 gives a little more lead time before the pidbox starts losing connections, and demoting the message to INFO keeps it out of WARN-level dashboards now that we're casting a wider net. Co-authored-by: Cursor <cursoragent@cursor.com>
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 8e4b971.
```python
    f"connections in use ({remaining} remaining). "
    f"New connections may be rejected, which on TLS Redis "
    f"can surface as SSL handshake errors."
)
```
Log level is info instead of warning
Medium Severity
The broker connection-limit message uses logger.info() but the PR description explicitly states it should be a warning — including a sample output prefixed with WARNING. Using INFO means the message won't appear at the WARNING log level many production deployments default to, defeating the purpose of alerting operators to connection-cap pressure.
```python
# Running low on headroom is a strong predictor of pidbox failures
# (e.g. the SSLEOFError seen on TLS Redis when new connections are
# rejected mid-handshake under cap exhaustion).
BROKER_CONNECTIONS_INFO_THRESHOLD = 20
```
Threshold constant doesn't match agreed-upon value
Medium Severity
BROKER_CONNECTIONS_INFO_THRESHOLD is set to 20, but the PR description states the threshold is 10 (with an example showing 9 remaining triggering the warning). The PR discussion also has the author explicitly saying "I think I want to stick with 10" and the reviewer agreeing. The constant appears to not have been updated to match the final decision.


Summary
- Reads `connected_clients` and `maxclients` from the broker Redis via `INFO clients` at the top of every `CeleryMetricsCollector.collect()` cycle and ships them under a new `broker` block on the celery adapter's entry in the outgoing report.
- Logs when remaining connection headroom runs low, before `inspect.active()` starts failing.
- Adds a `report_metadata` property to the `Collector` protocol (default `{}`) and merges it into `Adapter.as_tuple` so any collector can contribute adapter-scoped fields without further wiring.
- Why it matters: on TLS Redis, hitting the connection cap surfaces as an opaque `SSLEOFError: UNEXPECTED_EOF_WHILE_READING` during a fresh connection's handshake (Redis closes the socket before any application-layer error can be sent), instead of the clean `max number of clients reached` error you'd see over plain TCP.

JDO-1363
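The `INFO clients` snapshot the summary describes amounts to one call on the already-open redis-py client. A sketch, mocked so it runs without a live broker; the reply values echo the ones used in the PR's tests.

```python
from unittest.mock import MagicMock

# Stand-in for an already-open redis-py client; Redis.info("clients")
# returns a dict containing the "clients" section fields.
client = MagicMock()
client.info.return_value = {"connected_clients": 3, "maxclients": 40}

info = client.info("clients")  # one round trip on the open connection
broker_block = {key: info[key] for key in ("connected_clients", "maxclients")}
```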
Cost
`INFO clients` is one round trip on the already-open broker connection. Benchmarked locally at ~70-100µs server-side; expect ~1-2ms total in-region. The collector already calls `INFO` (all sections) at startup for the version check, so this isn't a new permission requirement on managed Redis providers.

Payload shape
Old:
New (when stats are available):
The `broker` block is omitted entirely if the `INFO` call fails or returns missing keys, so the report never carries stale or partial numbers. Additive change to the payload; older agents simply don't send it.

Warning behavior
When `maxclients - connected_clients < 10`: one warning per report cycle (every 10s by default). Stays silent when headroom is healthy.
Test plan
- `poetry run pytest`: 151 passed (6 new tests covering the broker-stats lifecycle, warning threshold, and adapter merge behavior).
- The `broker` block makes it onto the payload on both cycles, the warning fires whenever `< 10` slots remain, and `inspect.active()`'s `SSLEOFError` on cycle 2 is absorbed by the JDO-1362 collector-survival wrapper without losing the broker metadata.
- `broker` field: tracked separately.

Note
Medium Risk
Adds a new top-level `metadata` to the report payload and introduces Redis `INFO clients` calls/logging in the Celery collector; could affect downstream API parsing and increase broker load/log volume if misconfigured.

Overview
Reports now include a new top-level `metadata` block, built by merging a new `report_metadata` property exposed by all metrics collectors. `CeleryMetricsCollector` now snapshots Redis broker connection stats (`connected_clients`, `maxclients`) via `INFO clients` each `collect()` cycle, publishes them under `metadata['celery-broker']`, and logs when remaining connection headroom drops below a threshold.

Tests were updated/added to cover section-aware Redis `INFO` mocking, metadata inclusion/omission behavior, and the near-connection-limit log trigger.