consultopo: add retry logic for transient consul errors#883
Open
sbaker617 wants to merge 5 commits into
Open
Conversation
Introduces a kvClient interface wrapper (retryKV) that transparently retries consul KV operations on transient network errors and 5xx responses. Configurable via flags, enabled by default. Covers Get, List, Keys, Txn — all operations used by file.go, directory.go, watch.go, and election.go. Lock lifecycle operations are intentionally not retried. Flags: --topo_consul_retry_count (default 3) --topo_consul_retry_base_delay (default 250ms) --topo_consul_retry_max_delay (default 5s) --topo_consul_retry_enabled (default true) Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com> Signed-off-by: Steve Baker <s.baker@slack-corp.com>
Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com> Signed-off-by: Steve Baker <s.baker@slack-corp.com>
c156ba0 to
142b516
Compare
rand.Int64N(0) panics if baseDelay is set to 0 via flag. Return zero duration immediately when the computed delay is non-positive. Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com> Signed-off-by: Steve Baker <s.baker@slack-corp.com>
Replace time.Sleep with a select on ctx.Done and time.After so that retries abort promptly when the caller's context is canceled or expired. Pass context from callers in file.go, directory.go, and election.go via QueryOptions so the retry loop can observe it. Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com> Signed-off-by: Steve Baker <s.baker@slack-corp.com>
A retried CAS transaction can return a spurious conflict if the original commit succeeded but the response was lost. Document that this is inherent and not worse than the no-retry baseline. Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com> Signed-off-by: Steve Baker <s.baker@slack-corp.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What's this?
Adds transparent retry with exponential backoff to the consul topology service's KV operations. During consul leader elections or transient network blips, operations now retry instead of immediately propagating errors to callers.
How it works
Defines a
kvClientinterface matching the 4 KV methods used by the package (Get,List,Keys,Txn), and wraps the real consul client with aretryKVdecorator that retries on transient errors (5xx, connection refused/reset, HTTP timeouts). Onlyserver.gois modified — all other existing files are untouched.Configurable via flags (enabled by default):
--topo_consul_retry_count(default 3)--topo_consul_retry_base_delay(default 250ms)--topo_consul_retry_max_delay(default 5s)--topo_consul_retry_enabled(default true)Lock lifecycle operations (
Lock/Unlock/Destroy) are intentionally not retried.Testing
deployed branch builds:
vttablet:
slackhqB_v22-add-consul-retry -> v22-add-consul-retry_220
deployed to all stages
vtgate: -> sb_v22-consul-retry_DEV-STAGING -> v22-add-consul-retry_220
deployed to all DEV stages
vtorc: -> sb_v22-consul-retry_DEV-STAGING -> v22-add-consul-retry_220
deployed to all DEV stages
vtctld: -> sb_v22-consul-retry_DEV-STAGING -> v22-add-consul-retry_220
deployed to all DEV stages
manually triggered leader election:
https://consul-ops-dev.tinyspeck.com/datacenters/dev-us-east-1-vtcell1e/3504474a-b55d-4d8c-8133-7c3a2219084b
zero errors observed:
default/muron
datastores/datastores
## Why
Brings consul topo resilience closer to etcd (which gets retries for free via gRPC) and ZooKeeper (which has
withRetry). Motivated by flakiness during consul leader elections in production.Mostly written by Claude Code — I provided direction and design constraints.