fix: recover cluster when majority of primaries are lost by bjosv · Pull Request #244 · valkey-io/valkey-operator

bjosv · 2026-06-11T07:44:32Z

This PR closes #240

Summary

When a majority of shard primaries are down, the operator was stuck indefinitely because it refused to forget dead primaries that still had live replicas referencing them, and automatic failover could never succeed without quorum.

Add promoteOrphanedReplicas which issues CLUSTER FAILOVER TAKEOVER to the best replica (highest replication offset) of each dead primary when quorum is unreachable. TAKEOVER runs before FORGET so slots remain continuously owned. With persistence enabled, pods return with the same node ID and the operator skips TAKEOVER to let them rejoin naturally.

Also bypasses the HasReplicaOf guard in forgetStaleNodes when quorum is lost, allowing dead primaries to be cleaned up from the gossip table.
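The selection rule described above can be sketched in a few lines of Go. This is only an illustration under stated assumptions: the `Node` type, `bestReplicaOf`, and `takeoverTargets` are hypothetical names invented for the example and are not the operator's actual types or signatures. Skipping failed replicas is also an assumption of the sketch.

```go
package main

import "fmt"

// Node is a hypothetical, simplified view of one cluster node.
type Node struct {
	ID         string
	PrimaryID  string // empty for primaries
	Failed     bool
	ReplOffset int64
}

// bestReplicaOf returns the live replica of primaryID with the highest
// replication offset, or nil if there is none.
func bestReplicaOf(nodes []Node, primaryID string) *Node {
	var best *Node
	for i := range nodes {
		n := &nodes[i]
		if n.PrimaryID != primaryID || n.Failed {
			continue
		}
		if best == nil || n.ReplOffset > best.ReplOffset {
			best = n
		}
	}
	return best
}

// takeoverTargets lists the replica IDs that would receive
// CLUSTER FAILOVER TAKEOVER. It returns nil when persistence is enabled
// or quorum is still available, mirroring the guard in the PR.
func takeoverTargets(nodes []Node, hasQuorum, persistence bool) []string {
	if persistence || hasQuorum {
		return nil
	}
	var targets []string
	for _, n := range nodes {
		if n.PrimaryID != "" || !n.Failed {
			continue // only dead primaries are of interest
		}
		if r := bestReplicaOf(nodes, n.ID); r != nil {
			targets = append(targets, r.ID)
		}
	}
	return targets
}

func main() {
	// 3 shards, 2 of 3 primaries down: quorum is lost.
	nodes := []Node{
		{ID: "p0", Failed: true},
		{ID: "p1", Failed: true},
		{ID: "p2"},
		{ID: "r0a", PrimaryID: "p0", ReplOffset: 100},
		{ID: "r0b", PrimaryID: "p0", ReplOffset: 250},
		{ID: "r1", PrimaryID: "p1", ReplOffset: 90},
		{ID: "r2", PrimaryID: "p2", ReplOffset: 300},
	}
	fmt.Println(takeoverTargets(nodes, false, false))
	fmt.Println(takeoverTargets(nodes, false, true))
}
```

In this sketch the first call selects `r0b` and `r1`, one replica per dead primary, and the second call selects nothing because persistence is enabled.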

Testing

Found with #203 and can be reproduced with:

CHAOS_SHARDS=3 CHAOS_REPLICAS=1 CHAOS_TARGET_SHARDS=0,1 CHAOS_SCENARIOS=delete-primary-pod make test-chaos

This runs 3 shards with 1 replica each and kills 2 primaries every iteration to force quorum loss.

Checklist

Before submitting the PR make sure the following are checked:

This Pull Request is related to one issue.
Commit message explains what changed and why
Tests are added or updated.
Documentation files are updated.
I have run pre-commit locally (pre-commit run --all-files or hooks on commit)

jdheyburn · 2026-06-11T09:32:28Z

+// the pod will return with the same node ID, so TAKEOVER is skipped.
+// Returns (result, true) if a promotion was made and reconcile should requeue.
+func (r *ValkeyClusterReconciler) promoteOrphanedReplicas(ctx context.Context, cluster *valkeyiov1alpha1.ValkeyCluster, state *valkey.ClusterState) (ctrl.Result, bool) {
+	if state.HasFailoverQuorum() || cluster.Spec.Persistence != nil {


Why does persistence enabled make a difference here? If the primary is down, wouldn't we still need to force a FAILOVER regardless?

My thinking was that with persistence, a crashed or killed pod returns with the same node ID and data, so the takeover is unnecessary.
The edge case of a stuck pod (unschedulable, PVC issues) probably needs a timeout or a pod-status check of some sort. Maybe I should open an issue about that as a follow-up?

Hmm yeah, you're right. I think @melancholictheory mentioned this case previously. Persistence will always give the best data guarantees; when we don't persist, we need the operator to fix things.

As a small optimisation, we can short-circuit the if condition by putting the cheaper call on the left side of the OR statement:

if cluster.Spec.Persistence != nil || state.HasFailoverQuorum() {

Then state.HasFailoverQuorum() would not be called if cluster.Spec.Persistence != nil.

I saw this line being done in a couple of places.
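A minimal sketch of the short-circuit behaviour, assuming a stand-in `hasFailoverQuorum` function with a call counter (both hypothetical, added only to make the effect observable). Go's `||` operator evaluates its right operand only when the left operand is false.

```go
package main

import "fmt"

// calls counts how often the stand-in quorum check runs.
var calls int

// hasFailoverQuorum is a hypothetical stand-in for state.HasFailoverQuorum().
func hasFailoverQuorum() bool {
	calls++
	return false
}

// shouldSkipTakeover mirrors the reordered condition: the cheap
// persistence check sits on the left of the || operator.
func shouldSkipTakeover(persistenceEnabled bool) bool {
	return persistenceEnabled || hasFailoverQuorum()
}

func main() {
	skip := shouldSkipTakeover(true)
	fmt.Println(skip, calls) // quorum check not reached

	skip = shouldSkipTakeover(false)
	fmt.Println(skip, calls) // quorum check ran once
}
```

Running this prints `true 0` followed by `false 1`: the quorum check is skipped entirely when persistence is enabled.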

I switched order now.
I was wondering if tracing the reconcile loop (using OpenTelemetry) would be of any benefit, to see what's taking time in the loop. I can't see any existing operator using it internally, so maybe people just use counters and pprof to find performance issues and improvements.

Tracing would definitely be great to have, but I think having metrics first would be ideal. When we want to triage high reconciliation times, at that point having tracing would be a good feature to implement to investigate.

jdheyburn · 2026-06-12T13:25:57Z

@greptile-apps

greptile-apps · 2026-06-12T13:32:20Z

Greptile Summary

This PR adds recovery logic for Valkey clusters that have lost a majority of shard primaries, a scenario where automatic failover was previously impossible due to quorum requirements. The fix introduces promoteOrphanedReplicas which issues CLUSTER FAILOVER TAKEOVER to the best replica of each dead primary before any CLUSTER FORGET is attempted.

promoteOrphanedReplicas: New controller method that detects quorum loss via HasFailoverQuorum(), identifies dead primaries, and issues TAKEOVER to the replica with the highest slave_repl_offset. Returns early (requeueing after 2 s) whenever any TAKEOVER was attempted, preventing forgetStaleNodes from running until at least one promotion succeeds and quorum is restored.
forgetStaleNodes guard bypass: When quorum is unreachable and persistence is disabled, the HasReplicaOf guard that previously prevented forgetting dead-but-still-referenced primaries is bypassed, allowing the gossip table to be cleaned up after promotion.
New clusterstate.go helpers: HasFailoverQuorum (uses cluster_size from CLUSTER INFO; safely returns false when the field is absent), IsNodeFailed (checks both fail and fail? flags), and BestReplicaOf (selects highest slave_repl_offset replica).
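The helper behaviour summarised above could look roughly like the following. This is a sketch, not the code in `clusterstate.go`: the function names, the raw-string inputs, and the majority formula (`cluster_size/2 + 1` healthy primaries) are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// clusterSize extracts cluster_size from raw CLUSTER INFO output.
// ok is false when the field is missing or not a number.
func clusterSize(info string) (size int, ok bool) {
	for _, line := range strings.Split(info, "\n") {
		key, val, found := strings.Cut(strings.TrimSpace(line), ":")
		if !found || key != "cluster_size" {
			continue
		}
		n, err := strconv.Atoi(val)
		if err != nil {
			return 0, false
		}
		return n, true
	}
	return 0, false
}

// hasFailoverQuorum reports whether a majority of primaries is healthy.
// It conservatively returns false when cluster_size is unavailable.
func hasFailoverQuorum(info string, healthyPrimaries int) bool {
	size, ok := clusterSize(info)
	if !ok || size <= 0 {
		return false
	}
	return healthyPrimaries >= size/2+1
}

// isNodeFailed checks the comma-separated flags field of a CLUSTER NODES
// line for either the confirmed (fail) or suspected (fail?) state.
func isNodeFailed(flags string) bool {
	for _, f := range strings.Split(flags, ",") {
		if f == "fail" || f == "fail?" {
			return true
		}
	}
	return false
}

func main() {
	info := "cluster_state:fail\r\ncluster_size:3\r\n"
	fmt.Println(hasFailoverQuorum(info, 1))                  // 1 of 3 primaries healthy
	fmt.Println(hasFailoverQuorum(info, 2))                  // 2 of 3 primaries healthy
	fmt.Println(hasFailoverQuorum("cluster_state:fail", 3)) // field missing
	fmt.Println(isNodeFailed("master,fail?"))
}
```

With 3 shards and 2 primaries killed, as in the chaos scenario from the PR description, only 1 primary remains healthy, which is below the majority of 2, so the sketch reports no quorum.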

Confidence Score: 5/5

Safe to merge. The quorum-loss recovery path is well-guarded: TAKEOVER is skipped when persistence is enabled, all TAKEOVER attempts (including failed ones) block forgetStaleNodes from running in the same cycle, and HasFailoverQuorum conservatively returns false when cluster_size is unavailable.

All three previously identified concerns have been addressed: TAKEOVER failures now requeue rather than falling through to forgetStaleNodes, the missing cluster_size field now safely defaults to no-quorum instead of masking quorum loss, and BestReplicaOf now uses slave_repl_offset. The new helpers are well-tested including the edge cases.

No files require special attention. The controller logic and clusterstate helpers are consistent with each other and with the existing codebase patterns.

Important Files Changed

Filename	Overview
internal/valkey/clusterstate.go	Adds HasFailoverQuorum, IsNodeFailed, and BestReplicaOf helpers. HasFailoverQuorum conservatively returns false when cluster_size is absent; BestReplicaOf correctly uses slave_repl_offset; IsNodeFailed correctly covers both fail and fail? states.
internal/controller/valkeycluster_controller.go	Adds promoteOrphanedReplicas called before forgetStaleNodes. All TAKEOVER attempts (even failed ones) now prevent forgetStaleNodes from running that cycle via the attempted > 0 guard, and the forgetStaleNodes guard bypass is correctly gated on both quorum loss and persistence absence.
internal/valkey/clusterstate_test.go	Adds thorough table-driven tests for HasFailoverQuorum (including the missing cluster_size case), IsNodeFailed (fail, pfail, healthy, unknown), and BestReplicaOf (highest offset selection, no-match). Tests now use slave_repl_offset matching the implementation.
docs/status-conditions.md	Adds a Recovery events section documenting the new ReplicasTakenOver Normal event emitted when orphaned replicas are promoted via CLUSTER FAILOVER TAKEOVER.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Reconcile] --> B[getValkeyClusterState]
    B --> C{promoteOrphanedReplicas}
    C --> D{Persistence enabled OR HasFailoverQuorum?}
    D -- Yes --> E[return false, no-op]
    D -- No --> F[Find dead primaries via IsNodeFailed]
    F --> G{Any dead primaries with replicas?}
    G -- No --> E
    G -- Yes --> H[BestReplicaOf each dead primary]
    H --> I[CLUSTER FAILOVER TAKEOVER]
    I --> J{attempted > 0?}
    J -- Yes --> K[Emit ReplicasTakenOver event if promoted > 0]
    K --> L[Requeue after 2s, return handled=true]
    J -- No --> E
    E --> M[forgetStaleNodes]
    M --> N{Node has replica referencing it?}
    N -- No --> O[CLUSTER FORGET]
    N -- Yes --> P{Persistence enabled OR HasFailoverQuorum?}
    P -- Yes --> Q[Skip forget, wait for auto-failover]
    P -- No --> O
    O --> R[Continue reconcile loop]

Reviews (4): Last reviewed commit: "fixup: check Persistence before HasFailo..."

Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>

Prevents forgetStaleNodes from running when TAKEOVER was attempted but failed on a reachable replica. Without this, a transient communication error could cause FORGET to remove the dead primary from gossip while the replica is still alive, blocking its promotion. Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
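The guard described in this commit message can be illustrated with a small sketch. The `replica` type, `promote`, and `failingTakeover` are hypothetical names for the example; the 2-second requeue interval is taken from the review summary, and everything else is an assumption rather than the controller's actual code.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// replica is a hypothetical promotion candidate.
type replica struct {
	ID        string
	Reachable bool
}

// takeoverFunc stands in for sending CLUSTER FAILOVER TAKEOVER to one replica.
type takeoverFunc func(id string) error

// failingTakeover simulates a transient communication error.
func failingTakeover(id string) error {
	return errors.New("connection reset while contacting " + id)
}

// promote tries TAKEOVER on every reachable candidate. handled is true
// whenever at least one attempt was made, even if every attempt failed,
// so that the caller requeues instead of moving on to the forget step.
func promote(candidates []replica, takeover takeoverFunc) (requeueAfter time.Duration, handled bool, promoted int) {
	attempted := 0
	for _, c := range candidates {
		if !c.Reachable {
			continue // nothing to send; does not count as an attempt
		}
		attempted++
		if err := takeover(c.ID); err != nil {
			continue // retry on the next reconcile
		}
		promoted++
	}
	if attempted > 0 {
		return 2 * time.Second, true, promoted
	}
	return 0, false, promoted
}

func main() {
	candidates := []replica{{ID: "r0", Reachable: true}, {ID: "r1", Reachable: true}}
	after, handled, promoted := promote(candidates, failingTakeover)
	fmt.Println(after, handled, promoted)
}
```

Keying the early return on attempts rather than on successes is the point of the fix: a failed TAKEOVER on a live replica still blocks the forget step for that cycle.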

Matches what Valkey's internal getNodeReplicationOffset() uses for replicas. Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>

Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>

jdheyburn

LGTM! I've not tested it, but looks like it does all the right things

valkey-io#244 added ClusterState.BestReplicaOf, which picks the highest-offset replica with its own inline slave_repl_offset parse. Extract a single HighestOffsetReplica helper (backed by NodeState.ReplicationOffset) and have both BestReplicaOf and the proactive-failover path go through it, so the offset-selection logic lives in one place. Signed-off-by: melancholictheory <selimvhorst@gmail.com>

greptile-apps Bot reviewed Jun 11, 2026

jdheyburn reviewed Jun 11, 2026

jdheyburn mentioned this pull request Jun 12, 2026

feat: fail over to the highest-offset replica #249

Merged

5 tasks

greptile-apps Bot reviewed Jun 12, 2026

Comment thread internal/controller/valkeycluster_controller.go Outdated

Comment thread internal/valkey/clusterstate.go Outdated

bjosv added 3 commits June 12, 2026 16:17

fixup: improve readability

e76a41d

Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>

fixup: change HasFailoverQuorum default

9151a5e

Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>

greptile-apps Bot reviewed Jun 12, 2026

Comment thread internal/valkey/clusterstate.go Outdated

bjosv added 2 commits June 12, 2026 18:21

fixup: use slave_repl_offset in BestReplicaOf()

69ab4a9

Matches what Valkey's internal getNodeReplicationOffset() uses for replicas. Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>

fixup: check Persistence before HasFailoverQuorum()

602b7c5

Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>

jdheyburn approved these changes Jun 16, 2026

bjosv merged commit cebde25 into valkey-io:main Jun 17, 2026
8 checks passed

bjosv deleted the recover-primaries-lost branch June 17, 2026 08:41

bjosv mentioned this pull request Jun 17, 2026

[bug] Handle quorum loss with persistence enabled (stuck/unschedulable pods) #256

Open
