KAFKA-19928: Added retry and backoff mechanism in NetworkPartitionMetadataClient #21001

chirag-wadhwa5 · 2025-11-26T18:49:53Z

Currently, if a ListOffsets request fails in
NetworkPartitionMetadataClient for any reason, the corresponding future
is completed immediately, without any retries. However, the NetworkClient
and InterbrokerSendThread are created lazily in the
NetworkPartitionMetadataClient when the first request arrives. That
first request is enqueued in the NetworkClient immediately, before the
connection can be established, so it always fails. To fix this, this PR
introduces a retry mechanism with an upper limit on the retry attempts,
as well as exponential backoff between successive retries.
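
As a rough illustration of the idea described above (bounded retries with exponential backoff around an asynchronous request), here is a minimal, self-contained sketch. It is not the code from this PR: the class `RetryWithBackoffSketch`, the methods `backoffMs` and `callWithRetry`, and all parameter names are hypothetical, and jitter is omitted for brevity.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Illustrative sketch only: all names here are hypothetical, not the PR's code.
final class RetryWithBackoffSketch {

    /** Delay to wait after failed attempt number {@code attempt} (1-based), capped at maxMs. */
    public static long backoffMs(int attempt, long initialMs, long maxMs) {
        long delay = initialMs;
        // Double once per earlier attempt, stopping as soon as the cap is reached.
        for (int i = 1; i < attempt && delay < maxMs; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMs);
    }

    /** Runs the request, retrying failures up to maxAttempts total attempts. */
    public static <T> CompletableFuture<T> callWithRetry(
            Supplier<CompletableFuture<T>> request, int maxAttempts,
            long initialMs, long maxMs, ScheduledExecutorService scheduler) {
        CompletableFuture<T> result = new CompletableFuture<>();
        attempt(request, 1, maxAttempts, initialMs, maxMs, scheduler, result);
        return result;
    }

    private static <T> void attempt(
            Supplier<CompletableFuture<T>> request, int attemptNo, int maxAttempts,
            long initialMs, long maxMs, ScheduledExecutorService scheduler,
            CompletableFuture<T> result) {
        CompletableFuture<T> inFlight;
        try {
            inFlight = request.get();
        } catch (RuntimeException e) {
            // Treat a synchronous failure like a failed response.
            inFlight = CompletableFuture.failedFuture(e);
        }
        inFlight.whenComplete((value, error) -> {
            if (error == null) {
                result.complete(value);
            } else if (attemptNo >= maxAttempts) {
                // Attempts exhausted: only now is the caller's future failed.
                result.completeExceptionally(error);
            } else {
                // Schedule the next attempt after the backoff delay.
                scheduler.schedule(
                    () -> attempt(request, attemptNo + 1, maxAttempts,
                                  initialMs, maxMs, scheduler, result),
                    backoffMs(attemptNo, initialMs, maxMs), TimeUnit.MILLISECONDS);
            }
        });
    }

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        int[] calls = {0};
        // Simulates the lazily created client: the first request fails, later ones succeed.
        Supplier<CompletableFuture<String>> flaky = () -> {
            if (++calls[0] == 1) {
                return CompletableFuture.<String>failedFuture(
                    new IllegalStateException("not connected yet"));
            }
            return CompletableFuture.completedFuture("offsets");
        };
        String value = callWithRetry(flaky, 3, 10, 100, scheduler).join();
        System.out.println(value + " after " + calls[0] + " attempt(s)");
        scheduler.shutdown();
    }
}
```

In this sketch the caller's future fails only in the exhausted-attempts branch, which is also a natural single place to log a failure instead of logging on every retriable error.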

Reviewers: Apoorv Mittal [email protected], Andrew Schofield
[email protected], Sushant Mahajan [email protected]

apoorvmittal10

Thanks for the changes, some doubts.

apoorvmittal10 · 2025-11-26T19:56:43Z

clients/src/main/java/org/apache/kafka/common/utils/ExponentialBackoffManager.java

+/**
+ * Manages retry attempts and exponential backoff for requests.
+ */
+public class ExponentialBackoffManager {


Do we really require a new class and can't use some existing mechanism?

The PersisterStateManager already had a subclass called BackOffManager, and I required something similar. Instead of having 2 subclasses in different places, I think having a utility class is better. It could be used elsewhere in the future as well.

There already exists https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/common/utils/ExponentialBackoff.java, can't that be used?

core/src/main/scala/kafka/server/BrokerServer.scala

AndrewJSchofield · 2025-11-26T20:07:35Z

@chirag-wadhwa5 Checkstyle failures to fix please.

smjn · 2025-11-27T09:02:13Z

clients/src/main/java/org/apache/kafka/common/utils/ExponentialBackoffManager.java

+/**
+ * Manages retry attempts and exponential backoff for requests.
+ */
+public class ExponentialBackoffManager {


Since this is a public class now, please add a few unit tests.

smjn · 2025-11-27T09:07:33Z

...dinator/src/main/java/org/apache/kafka/coordinator/group/NetworkPartitionMetadataClient.java

+            shouldRetry = true;
        } else if (clientResponse.wasTimedOut()) {
-            log.error("Response for ListOffsets for TopicPartitions: {} timed out - {}.", partitionFutures.keySet(), clientResponse);
+            log.warn("Response for ListOffsets for TopicPartitions: {} timed out - {}.", partitionFutures.keySet(), clientResponse);


maybe remove logging from retriable requests - log if attempts exhausted only.

Thanks for the review. Maybe I'll change it to debug.

smjn · 2025-11-27T09:22:41Z

@chirag-wadhwa5 There are thread leaks from the new reaper:
org.opentest4j.AssertionFailedError: Thread leak detected: network-partition-metadata-client-reaper ==>
Please check that it is being closed properly in BrokerServer.
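
To illustrate the kind of leak being reported, the following is a minimal sketch of a component that owns a background reaper thread and stops it in `close()`. It is an assumption-laden illustration, not Kafka's reaper or the PR's code: `ReaperOwnerSketch`, `reaperAlive`, and the thread name are hypothetical. The point is only that whoever creates the component (BrokerServer, in this discussion) has to call `close()` during shutdown, otherwise the thread outlives the test.

```java
// Illustrative sketch only: a hypothetical owner of a background reaper thread.
final class ReaperOwnerSketch implements AutoCloseable {
    private final Thread reaper;
    private volatile boolean running = true;

    public ReaperOwnerSketch(String threadName) {
        reaper = new Thread(() -> {
            while (running) {
                try {
                    // Placeholder for periodic work such as expiring timed-out requests.
                    Thread.sleep(50);
                } catch (InterruptedException e) {
                    return; // Interrupted by close(): exit promptly.
                }
            }
        }, threadName);
        reaper.setDaemon(true);
        reaper.start();
    }

    public boolean reaperAlive() {
        return reaper.isAlive();
    }

    /** Stops the reaper and waits for it to exit so that no thread is leaked. */
    @Override
    public void close() {
        running = false;
        reaper.interrupt();
        try {
            reaper.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        ReaperOwnerSketch client = new ReaperOwnerSketch("sketch-reaper");
        // try-with-resources guarantees close() runs, mirroring an explicit shutdown call.
        try (client) {
            System.out.println("alive while open: " + client.reaperAlive());
        }
        System.out.println("alive after close: " + client.reaperAlive());
    }
}
```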

AndrewJSchofield · 2025-11-28T10:19:22Z

@chirag-wadhwa5 Please can you merge latest changes from trunk.

apoorvmittal10

Approving, given that we will improve it in the near future.

AndrewJSchofield · 2025-11-30T21:26:42Z

Cherry-picked to 4.2.

KAFKA-19928: Added retry and backoff mechanism in NetworkPartitionMetadataClient

c45f96c

github-actions bot added the triage (PRs from the community), core (Kafka Broker), clients, and group-coordinator labels on Nov 26, 2025

KAFKA-19928: Added the kafka license to ExponentialBackoffManager

db69337

apoorvmittal10 added the ci-approved label and removed the triage (PRs from the community) label on Nov 26, 2025

apoorvmittal10 reviewed Nov 26, 2025

AndrewJSchofield added the KIP-932 (Queues for Kafka) label on Nov 26, 2025

KAFKA-19928: minor changes to resolve checkstyle failures

2fe22eb

smjn reviewed Nov 27, 2025

chirag-wadhwa5 added 2 commits November 27, 2025 15:59

KAFKA-19928: Added tests for ExponentialBackoffManager

6f7bb60

KAFKA-19928: Minor changes for better readability

de2bf10

apoorvmittal10 approved these changes Nov 28, 2025

Merge branch 'trunk' into KAFKA-19928

cfdce65

chirag-wadhwa5 requested a review from AndrewJSchofield November 29, 2025 08:14

AndrewJSchofield approved these changes Nov 30, 2025

AndrewJSchofield merged commit ed3af72 into apache:trunk Nov 30, 2025
23 checks passed
