Member

@bbejeck bbejeck commented Oct 8, 2025

When a push telemetry request fails, any exception is treated as fatal:
the push interval is raised to Integer.MAX_VALUE, which effectively turns
telemetry off. This PR updates the error handling to check whether the
exception is a transient one and, if so, keeps the telemetry interval
unchanged, since recovery is expected.

Reviewers: Apoorv Mittal, Matthias Sax
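
To illustrate the behavior described above, here is a minimal, self-contained sketch. It does not use the real Kafka classes: `TelemetryIntervalSketch`, its nested exception types, and `onPushFailure` are hypothetical stand-ins, and the actual `ClientTelemetryReporter` logic may differ in detail.

```java
// Hypothetical sketch only; none of these types are the real Kafka classes.
class TelemetryIntervalSketch {

    // Stand-in for a transient error from which recovery is expected.
    static class TransientException extends RuntimeException {
        TransientException(String message) { super(message); }
    }

    // Stand-in for an error that will not resolve on its own.
    static class FatalException extends RuntimeException {
        FatalException(String message) { super(message); }
    }

    private int intervalMs;

    TelemetryIntervalSketch(int initialIntervalMs) {
        this.intervalMs = initialIntervalMs;
    }

    // Assumed classification: only TransientException counts as recoverable.
    static boolean isTransient(RuntimeException e) {
        return e instanceof TransientException;
    }

    // On a failed push, keep the interval for transient errors; otherwise
    // raise it to Integer.MAX_VALUE, which effectively disables telemetry.
    void onPushFailure(RuntimeException e) {
        if (!isTransient(e)) {
            intervalMs = Integer.MAX_VALUE;
        }
    }

    int intervalMs() {
        return intervalMs;
    }

    public static void main(String[] args) {
        TelemetryIntervalSketch reporter = new TelemetryIntervalSketch(5_000);
        reporter.onPushFailure(new TransientException("request timed out"));
        System.out.println("after transient failure: " + reporter.intervalMs());
        reporter.onPushFailure(new FatalException("unrecoverable"));
        System.out.println("after fatal failure: " + reporter.intervalMs());
    }
}
```

Before the change, every failure would have taken the second path, so a single timeout was enough to switch telemetry off for the lifetime of the client.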

@bbejeck bbejeck requested a review from apoorvmittal10 October 8, 2025 14:35
Member Author

bbejeck commented Oct 8, 2025

@apoorvmittal10 PTAL

Contributor

@apoorvmittal10 apoorvmittal10 left a comment

LGTM, thanks for the fix.

Member

@mjsax mjsax left a comment

Overall, LGTM. I am just not sure whether we are making it too complex by tracing down the full chain of "causes".
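
For context, the trade-off raised here can be sketched as follows. The thread does not show which variant the PR finally adopted, so this is only an illustration: `CauseChainSketch`, `TransientException`, and both methods are hypothetical names, and the depth limit of 32 is an arbitrary assumption to guard against cyclic cause chains.

```java
// Hypothetical sketch only; not the code from the PR.
class CauseChainSketch {

    // Stand-in for a transient, recoverable error type.
    static class TransientException extends RuntimeException {
        TransientException(String message) { super(message); }
    }

    // Simple variant: inspect only the exception that was handed over.
    static boolean isTransientDirect(Throwable t) {
        return t instanceof TransientException;
    }

    // More complex variant: walk getCause() links until a transient
    // exception is found, the chain ends, or the depth limit is reached.
    static boolean isTransientInChain(Throwable t) {
        int depth = 0;
        for (Throwable cur = t; cur != null && depth < 32; cur = cur.getCause(), depth++) {
            if (cur instanceof TransientException) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Throwable wrapped = new RuntimeException("wrapper", new TransientException("timeout"));
        System.out.println("direct check:   " + isTransientDirect(wrapped));
        System.out.println("chain traversal: " + isTransientInChain(wrapped));
    }
}
```

The two variants disagree only when a transient error arrives wrapped in another exception, which is the case the extra complexity pays for.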

@bbejeck bbejeck force-pushed the KAFKA-19747_improve_failed_push_handling branch from 839a65f to 026728b Compare October 8, 2025 22:19
@bbejeck bbejeck force-pushed the KAFKA-19747_improve_failed_push_handling branch from 026728b to 900be01 Compare October 9, 2025 13:18
Member Author

bbejeck commented Oct 9, 2025

@mjsax I've updated the logic for checking exceptions
@apoorvmittal10 not sure if you would like to take another look

Member

@mjsax mjsax left a comment

Thanks for the update. LGTM.

@bbejeck bbejeck merged commit 1e95d04 into apache:trunk Oct 9, 2025
32 of 35 checks passed
Member Author

bbejeck commented Oct 9, 2025

Merged #20661 into trunk

@bbejeck bbejeck deleted the KAFKA-19747_improve_failed_push_handling branch October 9, 2025 20:32
bbejeck added a commit that referenced this pull request Oct 9, 2025
…ling (#20661)

When a failure occurs with a push telemetry request, any exception is
treated as fatal, increasing the time interval to `Integer.MAX_VALUE`
effectively turning telemetry off.  This PR updates the error handling
to check if the exception is a transient one with expected recovery and
keeps the telemetry interval value the same in those cases since a
recovery is expected.

Reviewers: Apoorv Mittal, Matthias Sax
bbejeck added a commit that referenced this pull request Oct 9, 2025
…ling (#20661)

Member Author

bbejeck commented Oct 9, 2025

cherry-picked to 4.1 aceb32d

Member Author

bbejeck commented Oct 9, 2025

cherry-picked to 4.0 3243300

 public void handleFailedPushTelemetryRequest(KafkaException maybeFatalException) {
     log.debug("The broker generated an error for the push telemetry network API request", maybeFatalException);
-    handleFailedRequest(maybeFatalException != null);
+    handleFailedRequest(isRetryable(maybeFatalException));
Member

With this change, the consumer will display the following warning, even though users may not be concerned with telemetry:

[2025-10-16 02:27:16,077] WARN Received unrecoverable error from broker, disabling telemetry (org.apache.kafka.common.telemetry.internals.ClientTelemetryReporter)

Perhaps we shouldn't print the warning if the error is an UnsupportedVersionException. What do you think?
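
One way to picture this suggestion is the sketch below, which picks a quieter log level when the failure only means the broker does not support the API. It is an assumption-laden illustration, not the actual follow-up change: `WarningLevelSketch`, `UnsupportedVersionStandIn`, and `levelFor` are hypothetical names, and a real implementation would check Kafka's own `UnsupportedVersionException` and use its logger.

```java
// Hypothetical sketch only; not the real ClientTelemetryReporter logic.
class WarningLevelSketch {

    enum Level { DEBUG, WARN }

    // Stand-in for Kafka's UnsupportedVersionException.
    static class UnsupportedVersionStandIn extends RuntimeException {
        UnsupportedVersionStandIn(String message) { super(message); }
    }

    // Assumed policy: an unsupported API is expected on brokers without
    // telemetry support, so report it quietly; other errors keep WARN.
    static Level levelFor(RuntimeException e) {
        return e instanceof UnsupportedVersionStandIn ? Level.DEBUG : Level.WARN;
    }

    public static void main(String[] args) {
        RuntimeException unsupported =
            new UnsupportedVersionStandIn("The node does not support GET_TELEMETRY_SUBSCRIPTIONS");
        System.out.println("unsupported version -> " + levelFor(unsupported));
        System.out.println("other failure       -> " + levelFor(new IllegalStateException("boom")));
    }
}
```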

Member

I agree. It was an issue even before this change, and I did raise this with @apoorvmittal10 at some point in the past...

We get something like

org.apache.kafka.common.errors.UnsupportedVersionException: The node does not support GET_TELEMETRY_SUBSCRIPTIONS

This comes from https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/NodeApiVersions.java#L149

I always found it very annoying.

Contributor

@apoorvmittal10 apoorvmittal10 Oct 15, 2025

Yeah, I agree with both. As for the issue pointed out by @mjsax, AFAIR that was a common issue in the code and not specific to telemetry; that log is emitted at DEBUG level. The issue in the current PR stands out because it logs at WARN and does not treat UnsupportedVersionException differently.

Member

Thanks for all the feedback. We will file a PR to fix the "highlight".

Member Author

@chia7712 have you done this already? We'd like to get the follow-up into the upcoming 4.1.1 release. If you don't have the bandwidth right now, I can do it.

Member

@bbejeck Sorry for the delay. @DL1231 will file a PR as soon as possible.

Member Author

No worries @chia7712, thanks for the update

Member

this was already fixed by #20722 😄

eduwercamacaro pushed a commit to littlehorse-enterprises/kafka that referenced this pull request Nov 12, 2025
…ling (apache#20661)
