Skip to content

fix: recover cluster when majority of primaries are lost#244

Merged
bjosv merged 6 commits into
valkey-io:mainfrom
Nordix:recover-primaries-lost
Jun 17, 2026
Merged

fix: recover cluster when majority of primaries are lost#244
bjosv merged 6 commits into
valkey-io:mainfrom
Nordix:recover-primaries-lost

Conversation

@bjosv

@bjosv bjosv commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator

This PR closes #240

Summary

When a majority of shard primaries are down, the operator was stuck indefinitely because it refused to forget dead primaries that still had live replicas referencing them, and automatic failover could never succeed without quorum.

Add promoteOrphanedReplicas which issues CLUSTER FAILOVER TAKEOVER to the best replica (highest replication offset) of each dead primary when quorum is unreachable. TAKEOVER runs before FORGET so slots remain continuously owned. With persistence enabled, pods return with the same node ID and the operator skips TAKEOVER to let them rejoin naturally.

Also bypasses the HasReplicaOf guard in forgetStaleNodes when quorum is lost, allowing dead primaries to be cleaned up from the gossip table.

Testing

Found with #203 and can be reproduced with:

CHAOS_SHARDS=3 CHAOS_REPLICAS=1 CHAOS_TARGET_SHARDS=0,1 CHAOS_SCENARIOS=delete-primary-pod make test-chaos

3 shards 1 replica each, kills 2 primaries every iteration for quorum loss.

Checklist

Before submitting the PR make sure the following are checked:

  • This Pull Request is related to one issue.
  • Commit message explains what changed and why
  • Tests are added or updated.
  • Documentation files are updated.
  • I have run pre-commit locally (pre-commit run --all-files or hooks on commit)

When a majority of shard primaries are down, the operator was stuck
indefinitely because it refused to forget dead primaries that still had
live replicas referencing them, and automatic failover could never
succeed without quorum.

Add promoteOrphanedReplicas which issues CLUSTER FAILOVER TAKEOVER to
the best replica (highest replication offset) of each dead primary when
quorum is unreachable. TAKEOVER runs before FORGET so slots remain
continuously owned. With persistence enabled, pods return with the same
node ID and the operator skips TAKEOVER to let them rejoin naturally.

Also bypasses the HasReplicaOf guard in forgetStaleNodes when quorum is
lost, allowing dead primaries to be cleaned up from the gossip table.

Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>

@greptile-apps greptile-apps Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Comment thread internal/valkey/clusterstate.go Outdated
Comment thread internal/valkey/clusterstate.go
Comment thread internal/valkey/clusterstate.go
// the pod will return with the same node ID, so TAKEOVER is skipped.
// Returns (result, true) if a promotion was made and reconcile should requeue.
func (r *ValkeyClusterReconciler) promoteOrphanedReplicas(ctx context.Context, cluster *valkeyiov1alpha1.ValkeyCluster, state *valkey.ClusterState) (ctrl.Result, bool) {
if state.HasFailoverQuorum() || cluster.Spec.Persistence != nil {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does persistence enabled make a difference here? If the primary is down, wouldn't we still need to force a FAILOVER regardless?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking was that with persistence a crashed/killed pod returns with the same node ID and data, then the takeover is unnecessary.
The edge case of a stuck pod (unschedulable, PVC issues) probably need a timeout or a pod-status check of some sort? Maybe I should add a issue about that as a followup?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yeah you're right. I think @melancholictheory mentioned this case previously. Persistence will always get the best data guarantees, when we don't persist we need the operator to fix things.

As a small optimisation, we can short-circuit the if condition by putting the cheaper call on the left side of the OR statement:

if cluster.Spec.Persistence != nil || state.HasFailoverQuorum() {

Then state.HasFailoverQuorum() would not be called if cluster.Spec.Persistence != nil.

I saw this line being done in a couple of places.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I switched order now.
I was wondering if tracing the reconcile loop would be of any benefit (using OpenTelemetry) to see whats taking time in the loop. I cant see any existing operator using it internally, so maybe people just use counters and pprof to find performance issues/improvements..

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracing would definitely be great to have, but I think having metrics first would be ideal. When we want to triage high reconciliation times, at that point having tracing would be a good feature to implement to investigate.

@jdheyburn

Copy link
Copy Markdown
Collaborator

@greptile-apps

@greptile-apps

greptile-apps Bot commented Jun 12, 2026

Copy link
Copy Markdown

Greptile Summary

This PR adds recovery logic for Valkey clusters that have lost a majority of shard primaries, a scenario where automatic failover was previously impossible due to quorum requirements. The fix introduces promoteOrphanedReplicas which issues CLUSTER FAILOVER TAKEOVER to the best replica of each dead primary before any CLUSTER FORGET is attempted.

  • promoteOrphanedReplicas: New controller method that detects quorum loss via HasFailoverQuorum(), identifies dead primaries, and issues TAKEOVER to the replica with the highest slave_repl_offset. Returns early (requeueing after 2 s) whenever any TAKEOVER was attempted, preventing forgetStaleNodes from running until at least one promotion succeeds and quorum is restored.
  • forgetStaleNodes guard bypass: When quorum is unreachable and persistence is disabled, the HasReplicaOf guard that previously prevented forgetting dead-but-still-referenced primaries is bypassed, allowing the gossip table to be cleaned up after promotion.
  • New clusterstate.go helpers: HasFailoverQuorum (uses cluster_size from CLUSTER INFO; safely returns false when the field is absent), IsNodeFailed (checks both fail and fail? flags), and BestReplicaOf (selects highest slave_repl_offset replica).

Confidence Score: 5/5

Safe to merge. The quorum-loss recovery path is well-guarded: TAKEOVER is skipped when persistence is enabled, all TAKEOVER attempts (including failed ones) block forgetStaleNodes from running in the same cycle, and HasFailoverQuorum conservatively returns false when cluster_size is unavailable.

All three previously identified concerns have been addressed: TAKEOVER failures now requeue rather than falling through to forgetStaleNodes, the missing cluster_size field now safely defaults to no-quorum instead of masking quorum loss, and BestReplicaOf now uses slave_repl_offset. The new helpers are well-tested including the edge cases.

No files require special attention. The controller logic and clusterstate helpers are consistent with each other and with the existing codebase patterns.

Important Files Changed

Filename Overview
internal/valkey/clusterstate.go Adds HasFailoverQuorum, IsNodeFailed, and BestReplicaOf helpers. HasFailoverQuorum conservatively returns false when cluster_size is absent; BestReplicaOf correctly uses slave_repl_offset; IsNodeFailed correctly covers both fail and fail? states.
internal/controller/valkeycluster_controller.go Adds promoteOrphanedReplicas called before forgetStaleNodes. All TAKEOVER attempts (even failed ones) now prevent forgetStaleNodes from running that cycle via the attempted > 0 guard, and the forgetStaleNodes guard bypass is correctly gated on both quorum loss and persistence absence.
internal/valkey/clusterstate_test.go Adds thorough table-driven tests for HasFailoverQuorum (including the missing cluster_size case), IsNodeFailed (fail, pfail, healthy, unknown), and BestReplicaOf (highest offset selection, no-match). Tests now use slave_repl_offset matching the implementation.
docs/status-conditions.md Adds a Recovery events section documenting the new ReplicasTakenOver Normal event emitted when orphaned replicas are promoted via CLUSTER FAILOVER TAKEOVER.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Reconcile] --> B[getValkeyClusterState]
    B --> C{promoteOrphanedReplicas}
    C --> D{Persistence enabled OR HasFailoverQuorum?}
    D -- Yes --> E[return false, no-op]
    D -- No --> F[Find dead primaries via IsNodeFailed]
    F --> G{Any dead primaries with replicas?}
    G -- No --> E
    G -- Yes --> H[BestReplicaOf each dead primary]
    H --> I[CLUSTER FAILOVER TAKEOVER]
    I --> J{attempted > 0?}
    J -- Yes --> K[Emit ReplicasTakenOver event if promoted > 0]
    K --> L[Requeue after 2s, return handled=true]
    J -- No --> E
    E --> M[forgetStaleNodes]
    M --> N{Node has replica referencing it?}
    N -- No --> O[CLUSTER FORGET]
    N -- Yes --> P{Persistence enabled OR HasFailoverQuorum?}
    P -- Yes --> Q[Skip forget, wait for auto-failover]
    P -- No --> O
    O --> R[Continue reconcile loop]
Loading

Reviews (4): Last reviewed commit: "fixup: check Persistence before HasFailo..." | Re-trigger Greptile

Comment thread internal/controller/valkeycluster_controller.go Outdated
Comment thread internal/valkey/clusterstate.go Outdated
bjosv added 3 commits June 12, 2026 16:17
Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
Prevents forgetStaleNodes from running when TAKEOVER was attempted but
failed on a reachable replica. Without this, a transient communication
error could cause FORGET to remove the dead primary from gossip while
the replica is still alive, blocking its promotion.

Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
Comment thread internal/valkey/clusterstate.go Outdated
bjosv added 2 commits June 12, 2026 18:21
Matches what Valkey's internal getNodeReplicationOffset()
uses for replicas.

Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>

@jdheyburn jdheyburn left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I've not tested it, but looks like it does all the right things :shipit:

@bjosv bjosv merged commit cebde25 into valkey-io:main Jun 17, 2026
8 checks passed
@bjosv bjosv deleted the recover-primaries-lost branch June 17, 2026 08:41
melancholictheory added a commit to melancholictheory/valkey-operator that referenced this pull request Jun 17, 2026
valkey-io#244 added ClusterState.BestReplicaOf, which picks the highest-offset
replica with its own inline slave_repl_offset parse. Extract a single
HighestOffsetReplica helper (backed by NodeState.ReplicationOffset) and
have both BestReplicaOf and the proactive-failover path go through it, so
the offset-selection logic lives in one place.

Signed-off-by: melancholictheory <selimvhorst@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[bug] Operator cannot recover cluster when majority of primaries are lost

2 participants