-
Notifications
You must be signed in to change notification settings - Fork 14.8k
KAFKA-19747: Update ClientTelemetryReporter telemetry push error handling #20661
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KAFKA-19747: Update ClientTelemetryReporter telemetry push error handling #20661
Conversation
clients/src/main/java/org/apache/kafka/common/telemetry/internals/ClientTelemetryReporter.java
Outdated
Show resolved
Hide resolved
clients/src/main/java/org/apache/kafka/common/telemetry/internals/ClientTelemetryReporter.java
Outdated
Show resolved
Hide resolved
|
@apoorvmittal10 PTAL |
apoorvmittal10
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for the fix.
clients/src/main/java/org/apache/kafka/common/telemetry/internals/ClientTelemetryReporter.java
Outdated
Show resolved
Hide resolved
clients/src/main/java/org/apache/kafka/common/telemetry/internals/ClientTelemetryReporter.java
Outdated
Show resolved
Hide resolved
mjsax
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, LGTM, I am just not sure if we make it too complex, tracing down the full chain for "causes"?
839a65f to
026728b
Compare
…g between retryable and fatal exceptions and add test cases
026728b to
900be01
Compare
|
@mjsax I've updated the logic for checking exceptions |
mjsax
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update. LGTM.
|
Merged #20661 into trunk |
…ling (#20661) When a failure occurs with a push telemetry request, any exception is treated as fatal, increasing the time interval to `Integer.MAX_VALUE` effectively turning telemetry off. This PR updates the error handling to check if the exception is a transient one with expected recovery and keeps the telemetry interval value the same in those cases since a recovery is expected. Reviewers: Apoorv Mittal <[email protected]>, Matthias Sax<[email protected]>
…ling (#20661) When a failure occurs with a push telemetry request, any exception is treated as fatal, increasing the time interval to `Integer.MAX_VALUE` effectively turning telemetry off. This PR updates the error handling to check if the exception is a transient one with expected recovery and keeps the telemetry interval value the same in those cases since a recovery is expected. Reviewers: Apoorv Mittal <[email protected]>, Matthias Sax<[email protected]>
|
cherry-picked to 4.1 aceb32d |
|
cherry-picked to 4.0 3243300 |
| public void handleFailedPushTelemetryRequest(KafkaException maybeFatalException) { | ||
| log.debug("The broker generated an error for the push telemetry network API request", maybeFatalException); | ||
| handleFailedRequest(maybeFatalException != null); | ||
| handleFailedRequest(isRetryable(maybeFatalException)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this change, the consumer will display the following warning, even though users may not be concerned with telemetry
[2025-10-16 02:27:16,077] WARN Received unrecoverable error from broker, disabling telemetry (org.apache.kafka.common.telemetry.internals.ClientTelemetryReporter)
Perhaps we shouldn't print the warning if the error is an UnsupportedVersionException. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. It's even an issue before this change, and I did raise this to @apoorvmittal10 at some point in the past...
We get something like
org.apache.kafka.common.errors.UnsupportedVersionException: The node does not support GET_TELEMETRY_SUBSCRIPTIONS
This come from https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/NodeApiVersions.java#L149
I always found it very annoying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree to both. For the issue pointed by @mjsax, AFAIR, that was a common issue in code and not specific to telemetry, the log comes in debug mode. The current PR issue gets highlighted as it logs in WARN and do not differentiate with UnsupportedVersionException.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the feedback we will file a PR to fix the "highlight"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chia7712 have you done this already? We'd like to get the follow-up in the upcoming 4.1.1 release, if you don't have the bandwidth right now, I can do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries @chia7712, thanks for the update
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was already fixed by #20722 😄
…ling (apache#20661) When a failure occurs with a push telemetry request, any exception is treated as fatal, increasing the time interval to `Integer.MAX_VALUE` effectively turning telemetry off. This PR updates the error handling to check if the exception is a transient one with expected recovery and keeps the telemetry interval value the same in those cases since a recovery is expected. Reviewers: Apoorv Mittal <[email protected]>, Matthias Sax<[email protected]>
When a failure occurs with a push telemetry request, any exception is
treated as fatal, increasing the time interval to
Integer.MAX_VALUEeffectively turning telemetry off. This PR updates the error handling
to check if the exception is a transient one with expected recovery and
keeps the telemetry interval value the same in those cases since a
recovery is expected.
Reviewers: Apoorv Mittal [email protected], Matthias
Sax[email protected]