fix: rolling update loses all keys when replicas are not synced by bjosv · Pull Request #208 · valkey-io/valkey-operator

bjosv · 2026-05-29T13:43:07Z

Summary

This PR fixes two issues that together make rolling updates safe with and without PVCs.

Without persistence: restarting a replica destroys its cluster membership and data. The operator considered it Ready and immediately rolled the primary. With no synced replica to fail over to, all shard data was lost.
With persistence: the replica stays synced and proactive failover is attempted, but the _operator user ACL was missing cluster|failover. Every failover failed with NOPERM and the primary was rolled without a graceful handoff.

Testing

Found using #203 running the rolling-update scenario. It patches io-threads on the ValkeyCluster to trigger a rolling restart of all pods, then verifies the cluster recovers and all keys are preserved.

# Without persistence
CHAOS_SCENARIOS=rolling-update make test-chaos

# With persistence
CHAOS_SCENARIOS=rolling-update CHAOS_PERSISTENCE=true make test-chaos

Checklist

Before submitting the PR make sure the following are checked:

This Pull Request is related to one issue.
Commit message explains what changed and why
Tests are added or updated.
Documentation files are updated.
I have run pre-commit locally (pre-commit run --all-files or hooks on commit)

The proactive failover logic issues CLUSTER FAILOVER on a replica before rolling the shard primary, but the _operator system user was missing this command in its ACL. Every failover attempt failed with NOPERM and the primary was rolled without a graceful handoff. Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>

During a rolling update without persistence, restarting a replica destroys its cluster membership (nodes.conf) and data. The operator marked it Ready after PING and immediately rolled the primary. With no synced replica to fail over to, all shard data was lost. Skip rolling a primary when it has no synced replicas, letting the reconcile continue to the MEET/REPLICATE phases that rejoin the replica. Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>

greptile-apps · 2026-05-29T13:49:17Z

Greptile Summary

This PR fixes two independent bugs that together could cause complete shard data loss during a rolling update when replicas are not synced.

ACL fix (users.go): Adds +cluster|failover to the _operator user's ACL. Without this, CLUSTER FAILOVER would fail with NOPERM, silently skipping the graceful handoff before rolling a primary — even when a synced replica was available.
Safe-roll gate (valkeycluster_controller.go): When findFailoverShard returns nil (no synced replica found), the new code checks whether the node being rolled is actually the shard primary. If it is and no synced replicas exist (the dangerous case), the roll is deferred by returning (false, false, nil) rather than proceeding immediately. The outer reconcile loop continues so that MEET / REPLICATE phases can still run and the replica can re-join; the "check replicas in sync" block (requeuing every 2 s) then drives retries until the replica catches up, at which point the next iteration succeeds with a healthy failover target.

Confidence Score: 4/5

Safe to merge — both changes are targeted and correct; no data-loss path exists in the new code.

The ACL addition is a clean one-liner with no risk. The controller change defers a primary roll when replicas aren't synced, relies on the existing 'check replicas in sync' requeue to drive retries, and correctly allows MEET/REPLICATE phases to proceed so the replica can rejoin the cluster. One edge case: when the deferred primary's replicas sync before the second cluster-state fetch inside the same reconcile, the cluster gets a brief Ready/non-Progressing status (~30 s) while the primary's spec update is still pending. This won't cause data loss or stalls, but could produce a confusing Ready→Progressing event transition visible to operators.

The controller change in internal/controller/valkeycluster_controller.go deserves a second look around the interaction between the deferred-roll return value and the cluster-condition reporting path.

Important Files Changed

Filename	Overview
internal/controller/users.go	Adds `+cluster
internal/controller/valkeycluster_controller.go	Adds a guard that defers a primary roll when no synced replica exists; logic is sound and the requeue path via the existing replica-sync check is correct, but after the defer the cluster can briefly be marked Ready (up to 30 s) while the rolling update is still pending.

Sequence Diagram

sequenceDiagram
    participant Operator
    participant ReconcileValkeyNodes
    participant ClusterState
    participant ValkeyPrimary
    participant ValkeyReplica

    Note over Operator: Rolling update triggered (config change)
    Operator->>ReconcileValkeyNodes: reconcileValkeyNodes()
    ReconcileValkeyNodes->>ClusterState: getValkeyClusterState() [snapshot]

    Note over ReconcileValkeyNodes: Iterate replicas-first
    ReconcileValkeyNodes->>ValkeyReplica: Update spec (roll replica)
    ValkeyReplica-->>ReconcileValkeyNodes: "requeue=true"
    ReconcileValkeyNodes-->>Operator: (true, nil) → RequeueAfter 2s

    Note over Operator: Wait for replica K8s Ready...

    Operator->>ReconcileValkeyNodes: reconcileValkeyNodes() [fresh snapshot]
    ReconcileValkeyNodes->>ClusterState: getValkeyClusterState()
    ClusterState-->>ReconcileValkeyNodes: replica not synced (master_link_status≠up)

    ReconcileValkeyNodes->>ReconcileValkeyNodes: findFailoverShard → nil
    ReconcileValkeyNodes->>ClusterState: FindShardForAddress(primaryIP)
    ClusterState-->>ReconcileValkeyNodes: shardInState (primary confirmed)
    Note over ReconcileValkeyNodes: NEW: primary has no synced replicas → defer roll
    ReconcileValkeyNodes-->>Operator: (false, nil) — continues reconcile

    Operator->>ClusterState: getValkeyClusterState() [second fetch]
    ClusterState-->>Operator: replica not in sync
    Note over Operator: IsReplicationInSync()=false → RequeueAfter 2s

    Note over Operator: Replica syncs (master_link_status=up)...

    Operator->>ReconcileValkeyNodes: reconcileValkeyNodes() [fresh snapshot]
    ReconcileValkeyNodes->>ClusterState: getValkeyClusterState()
    ClusterState-->>ReconcileValkeyNodes: replica IS synced
    ReconcileValkeyNodes->>ReconcileValkeyNodes: findFailoverShard → shard + replicas
    Note over ReconcileValkeyNodes: NEW: +cluster|failover ACL now permits this
    ReconcileValkeyNodes->>ValkeyReplica: CLUSTER FAILOVER
    ValkeyReplica-->>ValkeyPrimary: becomes new primary
    ReconcileValkeyNodes->>ValkeyPrimary: Update spec (roll old primary)
    ValkeyPrimary-->>ReconcileValkeyNodes: "requeue=true"
    ReconcileValkeyNodes-->>Operator: (true, nil) → RequeueAfter 2s

_{Reviews (1): Last reviewed commit: "fix: defer primary roll until replica is..." | Re-trigger Greptile}

greptile-apps · 2026-05-29T13:49:22Z

+			} else if cluster.Spec.Replicas > 0 {
+				// findFailoverShard returned nil for one of three reasons:
+				// 1. Node is the shard primary but has no synced replicas: skip roll, would lose data
+				// 2. Node is in a shard but is a replica: safe to roll
+				// 3. Node isn't in any shard (isolated): safe to roll
+				// Only case 1 requires skipping; identify it's the actual primary of its shard.
+				shardInState := clusterState.FindShardForAddress(current.Status.PodIP)
+				if shardInState != nil && shardInState.GetPrimaryNode() != nil && shardInState.GetPrimaryNode().Address == current.Status.PodIP {
+					log.Info("primary has no synced replicas, deferring roll",
+						"name", node.Name, "address", current.Status.PodIP,
+						"shardNodes", len(shardInState.Nodes),
+						"shardId", shardInState.Id)
+					return false, false, nil
+				}
 			}


Rolling update briefly appears complete while primary is still pending

When the primary roll is deferred and the replica subsequently syncs before the next reconcile pass, reconcileValkeyNodes returns (false, nil) rather than (true, nil). The main reconcile then fetches a fresh cluster state, passes the "replicas in sync" check, and reaches the healthy-cluster path at line 342 — emitting a ClusterReady event and clearing Progressing. The primary's pending spec update is not reflected in any condition, and the next attempt is delayed by the 30 s periodic requeue (line 354).

In practice the window is short and data is never at risk (the replica is synced, failover can succeed on the next pass), but operators watching events or conditions during a rolling update may see a spurious Ready→Progressing transition ~30 s later. A return true, false, nil here would keep Progressing=True throughout, though it would need to be done carefully to not block the MEET/REPLICATE phases that the deferred path relies on.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

jdheyburn · 2026-05-29T14:31:37Z

 			"+cluster|addslotsrange",     // assign slots
 			"+cluster|replicate",         // set up replication
 			"+cluster|forget",            // remove stale nodes
+			"+cluster|failover",          // proactive failover before rolling primary


Oh man we really need to get some CI check to make sure the operator has everything it needs... This particular line was on me - so my bad

I think we merged ACL stuff PR right after the proactive PR, so we missed it.
Also, I guess a test of failover pass without the proactive feature since there is a fallback (except it lost data).
But there is probably a way to check for NOPERM or some other way

jdheyburn · 2026-05-29T14:34:31Z

Something I've been wanting to introduce at some point is better readiness probes - perhaps gated on cluster_state:ok. I tried to do that in #120 but there were some other instabilities which prevented reconcile from progressing. I think that would help prevent the operator from rolling pods when the cluster is unhealthy, and also prevent k8s from descheduling pods when it is reconciling.

bjosv added 2 commits May 29, 2026 15:14

greptile-apps Bot reviewed May 29, 2026

View reviewed changes

jdheyburn reviewed May 29, 2026

View reviewed changes

jdheyburn approved these changes Jun 2, 2026

View reviewed changes

jdheyburn merged commit 6be74d2 into valkey-io:main Jun 2, 2026
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: rolling update loses all keys when replicas are not synced#208

fix: rolling update loses all keys when replicas are not synced#208
jdheyburn merged 2 commits into
valkey-io:mainfrom
Nordix:fix-rolling-updates

bjosv commented May 29, 2026

Uh oh!

greptile-apps Bot commented May 29, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot May 29, 2026

Uh oh!

jdheyburn May 29, 2026 •

edited

Loading

Uh oh!

bjosv May 29, 2026

Uh oh!

jdheyburn commented May 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

bjosv commented May 29, 2026

Summary

Testing

Checklist

Uh oh!

greptile-apps Bot commented May 29, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

jdheyburn May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bjosv May 29, 2026

Choose a reason for hiding this comment

Uh oh!

jdheyburn commented May 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jdheyburn May 29, 2026 •

edited

Loading