Skip to content

consultopo: add retry logic for transient consul errors#883

Open
sbaker617 wants to merge 5 commits into
slack-22.0from
v22-add-consul-retry
Open

consultopo: add retry logic for transient consul errors#883
sbaker617 wants to merge 5 commits into
slack-22.0from
v22-add-consul-retry

Conversation

@sbaker617

@sbaker617 sbaker617 commented Jun 23, 2026

Copy link
Copy Markdown

What's this?

Adds transparent retry with exponential backoff to the consul topology service's KV operations. During consul leader elections or transient network blips, operations now retry instead of immediately propagating errors to callers.

How it works

Defines a kvClient interface matching the 4 KV methods used by the package (Get, List, Keys, Txn), and wraps the real consul client with a retryKV decorator that retries on transient errors (5xx, connection refused/reset, HTTP timeouts). Only server.go is modified — all other existing files are untouched.

Configurable via flags (enabled by default):

  • --topo_consul_retry_count (default 3)
  • --topo_consul_retry_base_delay (default 250ms)
  • --topo_consul_retry_max_delay (default 5s)
  • --topo_consul_retry_enabled (default true)

Lock lifecycle operations (Lock/Unlock/Destroy) are intentionally not retried.

Testing

deployed branch builds:
vttablet:
slackhqB_v22-add-consul-retry -> v22-add-consul-retry_220
deployed to all stages

vtgate: -> sb_v22-consul-retry_DEV-STAGING -> v22-add-consul-retry_220
deployed to all DEV stages
vtorc: -> sb_v22-consul-retry_DEV-STAGING -> v22-add-consul-retry_220
deployed to all DEV stages
vtctld: -> sb_v22-consul-retry_DEV-STAGING -> v22-add-consul-retry_220
deployed to all DEV stages

manually triggered leader election:
https://consul-ops-dev.tinyspeck.com/datacenters/dev-us-east-1-vtcell1e/3504474a-b55d-4d8c-8133-7c3a2219084b

zero errors observed:
default/muron
datastores/datastores
## Why

Brings consul topo resilience closer to etcd (which gets retries for free via gRPC) and ZooKeeper (which has withRetry). Motivated by flakiness during consul leader elections in production.


Mostly written by Claude Code — I provided direction and design constraints.

@github-actions github-actions Bot added this to the v22.0.4 milestone Jun 23, 2026
sbaker617 and others added 2 commits June 25, 2026 18:54
Introduces a kvClient interface wrapper (retryKV) that transparently
retries consul KV operations on transient network errors and 5xx
responses. Configurable via flags, enabled by default.

Covers Get, List, Keys, Txn — all operations used by file.go,
directory.go, watch.go, and election.go. Lock lifecycle operations
are intentionally not retried.

Flags:
  --topo_consul_retry_count (default 3)
  --topo_consul_retry_base_delay (default 250ms)
  --topo_consul_retry_max_delay (default 5s)
  --topo_consul_retry_enabled (default true)

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Steve Baker <s.baker@slack-corp.com>
Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Steve Baker <s.baker@slack-corp.com>
@sbaker617 sbaker617 force-pushed the v22-add-consul-retry branch from c156ba0 to 142b516 Compare June 25, 2026 22:55
@sbaker617 sbaker617 marked this pull request as ready for review June 26, 2026 18:03
@sbaker617 sbaker617 requested a review from a team as a code owner June 26, 2026 18:03
@sbaker617 sbaker617 requested a review from tanjinx June 26, 2026 18:03
sbaker617 and others added 3 commits June 26, 2026 15:18
rand.Int64N(0) panics if baseDelay is set to 0 via flag. Return
zero duration immediately when the computed delay is non-positive.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Steve Baker <s.baker@slack-corp.com>
Replace time.Sleep with a select on ctx.Done and time.After so that
retries abort promptly when the caller's context is canceled or
expired. Pass context from callers in file.go, directory.go, and
election.go via QueryOptions so the retry loop can observe it.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Steve Baker <s.baker@slack-corp.com>
A retried CAS transaction can return a spurious conflict if the
original commit succeeded but the response was lost. Document that
this is inherent and not worse than the no-retry baseline.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Steve Baker <s.baker@slack-corp.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant