fix: recover cluster when majority of primaries are lost#244
Conversation
When a majority of shard primaries are down, the operator was stuck indefinitely because it refused to forget dead primaries that still had live replicas referencing them, and automatic failover could never succeed without quorum. Add promoteOrphanedReplicas which issues CLUSTER FAILOVER TAKEOVER to the best replica (highest replication offset) of each dead primary when quorum is unreachable. TAKEOVER runs before FORGET so slots remain continuously owned. With persistence enabled, pods return with the same node ID and the operator skips TAKEOVER to let them rejoin naturally. Also bypasses the HasReplicaOf guard in forgetStaleNodes when quorum is lost, allowing dead primaries to be cleaned up from the gossip table. Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
| // the pod will return with the same node ID, so TAKEOVER is skipped. | ||
| // Returns (result, true) if a promotion was made and reconcile should requeue. | ||
| func (r *ValkeyClusterReconciler) promoteOrphanedReplicas(ctx context.Context, cluster *valkeyiov1alpha1.ValkeyCluster, state *valkey.ClusterState) (ctrl.Result, bool) { | ||
| if state.HasFailoverQuorum() || cluster.Spec.Persistence != nil { |
There was a problem hiding this comment.
Why does persistence enabled make a difference here? If the primary is down, wouldn't we still need to force a FAILOVER regardless?
There was a problem hiding this comment.
My thinking was that with persistence a crashed/killed pod returns with the same node ID and data, then the takeover is unnecessary.
The edge case of a stuck pod (unschedulable, PVC issues) probably need a timeout or a pod-status check of some sort? Maybe I should add a issue about that as a followup?
There was a problem hiding this comment.
Hmm yeah you're right. I think @melancholictheory mentioned this case previously. Persistence will always get the best data guarantees, when we don't persist we need the operator to fix things.
As a small optimisation, we can short-circuit the if condition by putting the cheaper call on the left side of the OR statement:
if cluster.Spec.Persistence != nil || state.HasFailoverQuorum() {Then state.HasFailoverQuorum() would not be called if cluster.Spec.Persistence != nil.
I saw this line being done in a couple of places.
There was a problem hiding this comment.
I switched order now.
I was wondering if tracing the reconcile loop would be of any benefit (using OpenTelemetry) to see whats taking time in the loop. I cant see any existing operator using it internally, so maybe people just use counters and pprof to find performance issues/improvements..
There was a problem hiding this comment.
Tracing would definitely be great to have, but I think having metrics first would be ideal. When we want to triage high reconciliation times, at that point having tracing would be a good feature to implement to investigate.
|
| Filename | Overview |
|---|---|
| internal/valkey/clusterstate.go | Adds HasFailoverQuorum, IsNodeFailed, and BestReplicaOf helpers. HasFailoverQuorum conservatively returns false when cluster_size is absent; BestReplicaOf correctly uses slave_repl_offset; IsNodeFailed correctly covers both fail and fail? states. |
| internal/controller/valkeycluster_controller.go | Adds promoteOrphanedReplicas called before forgetStaleNodes. All TAKEOVER attempts (even failed ones) now prevent forgetStaleNodes from running that cycle via the attempted > 0 guard, and the forgetStaleNodes guard bypass is correctly gated on both quorum loss and persistence absence. |
| internal/valkey/clusterstate_test.go | Adds thorough table-driven tests for HasFailoverQuorum (including the missing cluster_size case), IsNodeFailed (fail, pfail, healthy, unknown), and BestReplicaOf (highest offset selection, no-match). Tests now use slave_repl_offset matching the implementation. |
| docs/status-conditions.md | Adds a Recovery events section documenting the new ReplicasTakenOver Normal event emitted when orphaned replicas are promoted via CLUSTER FAILOVER TAKEOVER. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Reconcile] --> B[getValkeyClusterState]
B --> C{promoteOrphanedReplicas}
C --> D{Persistence enabled OR HasFailoverQuorum?}
D -- Yes --> E[return false, no-op]
D -- No --> F[Find dead primaries via IsNodeFailed]
F --> G{Any dead primaries with replicas?}
G -- No --> E
G -- Yes --> H[BestReplicaOf each dead primary]
H --> I[CLUSTER FAILOVER TAKEOVER]
I --> J{attempted > 0?}
J -- Yes --> K[Emit ReplicasTakenOver event if promoted > 0]
K --> L[Requeue after 2s, return handled=true]
J -- No --> E
E --> M[forgetStaleNodes]
M --> N{Node has replica referencing it?}
N -- No --> O[CLUSTER FORGET]
N -- Yes --> P{Persistence enabled OR HasFailoverQuorum?}
P -- Yes --> Q[Skip forget, wait for auto-failover]
P -- No --> O
O --> R[Continue reconcile loop]
Reviews (4): Last reviewed commit: "fixup: check Persistence before HasFailo..." | Re-trigger Greptile
Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
Prevents forgetStaleNodes from running when TAKEOVER was attempted but failed on a reachable replica. Without this, a transient communication error could cause FORGET to remove the dead primary from gossip while the replica is still alive, blocking its promotion. Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
Matches what Valkey's internal getNodeReplicationOffset() uses for replicas. Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
jdheyburn
left a comment
There was a problem hiding this comment.
LGTM! I've not tested it, but looks like it does all the right things ![]()
valkey-io#244 added ClusterState.BestReplicaOf, which picks the highest-offset replica with its own inline slave_repl_offset parse. Extract a single HighestOffsetReplica helper (backed by NodeState.ReplicationOffset) and have both BestReplicaOf and the proactive-failover path go through it, so the offset-selection logic lives in one place. Signed-off-by: melancholictheory <selimvhorst@gmail.com>
This PR closes #240
Summary
When a majority of shard primaries are down, the operator was stuck indefinitely because it refused to forget dead primaries that still had live replicas referencing them, and automatic failover could never succeed without quorum.
Add
promoteOrphanedReplicaswhich issuesCLUSTER FAILOVER TAKEOVERto the best replica (highest replication offset) of each dead primary when quorum is unreachable. TAKEOVER runs before FORGET so slots remain continuously owned. With persistence enabled, pods return with the same node ID and the operator skips TAKEOVER to let them rejoin naturally.Also bypasses the
HasReplicaOfguard inforgetStaleNodeswhen quorum is lost, allowing dead primaries to be cleaned up from the gossip table.Testing
Found with #203 and can be reproduced with:
CHAOS_SHARDS=3 CHAOS_REPLICAS=1 CHAOS_TARGET_SHARDS=0,1 CHAOS_SCENARIOS=delete-primary-pod make test-chaos3 shards 1 replica each, kills 2 primaries every iteration for quorum loss.
Checklist
Before submitting the PR make sure the following are checked:
pre-commit run --all-filesor hooks on commit)