Skip to content

fix: rolling update loses all keys when replicas are not synced#208

Merged
jdheyburn merged 2 commits into
valkey-io:mainfrom
Nordix:fix-rolling-updates
Jun 2, 2026
Merged

fix: rolling update loses all keys when replicas are not synced#208
jdheyburn merged 2 commits into
valkey-io:mainfrom
Nordix:fix-rolling-updates

Conversation

@bjosv

@bjosv bjosv commented May 29, 2026

Copy link
Copy Markdown
Collaborator

Summary

This PR fixes two issues that together make rolling updates safe with and without PVCs.

  1. Without persistence: restarting a replica destroys its cluster membership and data. The operator considered it Ready and immediately rolled the primary. With no synced replica to fail over to, all shard data was lost.

  2. With persistence: the replica stays synced and proactive failover is attempted, but the _operator user ACL was missing cluster|failover. Every failover failed with NOPERM and the primary was rolled without a graceful handoff.

Testing

Found using #203 running the rolling-update scenario. It patches io-threads on the ValkeyCluster to trigger a rolling restart of all pods, then verifies the cluster recovers and all keys are preserved.

# Without persistence
CHAOS_SCENARIOS=rolling-update make test-chaos

# With persistence
CHAOS_SCENARIOS=rolling-update CHAOS_PERSISTENCE=true make test-chaos

Checklist

Before submitting the PR make sure the following are checked:

  • This Pull Request is related to one issue.
  • Commit message explains what changed and why
  • Tests are added or updated.
  • Documentation files are updated.
  • I have run pre-commit locally (pre-commit run --all-files or hooks on commit)

bjosv added 2 commits May 29, 2026 15:14
The proactive failover logic issues CLUSTER FAILOVER on a replica
before rolling the shard primary, but the _operator system user
was missing this command in its ACL. Every failover attempt failed
with NOPERM and the primary was rolled without a graceful handoff.

Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
During a rolling update without persistence, restarting a replica
destroys its cluster membership (nodes.conf) and data. The operator
marked it Ready after PING and immediately rolled the primary.
With no synced replica to fail over to, all shard data was lost.

Skip rolling a primary when it has no synced replicas, letting the
reconcile continue to the MEET/REPLICATE phases that rejoin the
replica.

Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
@greptile-apps

greptile-apps Bot commented May 29, 2026

Copy link
Copy Markdown

Greptile Summary

This PR fixes two independent bugs that together could cause complete shard data loss during a rolling update when replicas are not synced.

  • ACL fix (users.go): Adds +cluster|failover to the _operator user's ACL. Without this, CLUSTER FAILOVER would fail with NOPERM, silently skipping the graceful handoff before rolling a primary — even when a synced replica was available.
  • Safe-roll gate (valkeycluster_controller.go): When findFailoverShard returns nil (no synced replica found), the new code checks whether the node being rolled is actually the shard primary. If it is and no synced replicas exist (the dangerous case), the roll is deferred by returning (false, false, nil) rather than proceeding immediately. The outer reconcile loop continues so that MEET / REPLICATE phases can still run and the replica can re-join; the "check replicas in sync" block (requeuing every 2 s) then drives retries until the replica catches up, at which point the next iteration succeeds with a healthy failover target.

Confidence Score: 4/5

Safe to merge — both changes are targeted and correct; no data-loss path exists in the new code.

The ACL addition is a clean one-liner with no risk. The controller change defers a primary roll when replicas aren't synced, relies on the existing 'check replicas in sync' requeue to drive retries, and correctly allows MEET/REPLICATE phases to proceed so the replica can rejoin the cluster. One edge case: when the deferred primary's replicas sync before the second cluster-state fetch inside the same reconcile, the cluster gets a brief Ready/non-Progressing status (~30 s) while the primary's spec update is still pending. This won't cause data loss or stalls, but could produce a confusing Ready→Progressing event transition visible to operators.

The controller change in internal/controller/valkeycluster_controller.go deserves a second look around the interaction between the deferred-roll return value and the cluster-condition reporting path.

Important Files Changed

Filename Overview
internal/controller/users.go Adds `+cluster
internal/controller/valkeycluster_controller.go Adds a guard that defers a primary roll when no synced replica exists; logic is sound and the requeue path via the existing replica-sync check is correct, but after the defer the cluster can briefly be marked Ready (up to 30 s) while the rolling update is still pending.

Sequence Diagram

sequenceDiagram
    participant Operator
    participant ReconcileValkeyNodes
    participant ClusterState
    participant ValkeyPrimary
    participant ValkeyReplica

    Note over Operator: Rolling update triggered (config change)
    Operator->>ReconcileValkeyNodes: reconcileValkeyNodes()
    ReconcileValkeyNodes->>ClusterState: getValkeyClusterState() [snapshot]

    Note over ReconcileValkeyNodes: Iterate replicas-first
    ReconcileValkeyNodes->>ValkeyReplica: Update spec (roll replica)
    ValkeyReplica-->>ReconcileValkeyNodes: "requeue=true"
    ReconcileValkeyNodes-->>Operator: (true, nil) → RequeueAfter 2s

    Note over Operator: Wait for replica K8s Ready...

    Operator->>ReconcileValkeyNodes: reconcileValkeyNodes() [fresh snapshot]
    ReconcileValkeyNodes->>ClusterState: getValkeyClusterState()
    ClusterState-->>ReconcileValkeyNodes: replica not synced (master_link_status≠up)

    ReconcileValkeyNodes->>ReconcileValkeyNodes: findFailoverShard → nil
    ReconcileValkeyNodes->>ClusterState: FindShardForAddress(primaryIP)
    ClusterState-->>ReconcileValkeyNodes: shardInState (primary confirmed)
    Note over ReconcileValkeyNodes: NEW: primary has no synced replicas → defer roll
    ReconcileValkeyNodes-->>Operator: (false, nil) — continues reconcile

    Operator->>ClusterState: getValkeyClusterState() [second fetch]
    ClusterState-->>Operator: replica not in sync
    Note over Operator: IsReplicationInSync()=false → RequeueAfter 2s

    Note over Operator: Replica syncs (master_link_status=up)...

    Operator->>ReconcileValkeyNodes: reconcileValkeyNodes() [fresh snapshot]
    ReconcileValkeyNodes->>ClusterState: getValkeyClusterState()
    ClusterState-->>ReconcileValkeyNodes: replica IS synced
    ReconcileValkeyNodes->>ReconcileValkeyNodes: findFailoverShard → shard + replicas
    Note over ReconcileValkeyNodes: NEW: +cluster|failover ACL now permits this
    ReconcileValkeyNodes->>ValkeyReplica: CLUSTER FAILOVER
    ValkeyReplica-->>ValkeyPrimary: becomes new primary
    ReconcileValkeyNodes->>ValkeyPrimary: Update spec (roll old primary)
    ValkeyPrimary-->>ReconcileValkeyNodes: "requeue=true"
    ReconcileValkeyNodes-->>Operator: (true, nil) → RequeueAfter 2s
Loading

Reviews (1): Last reviewed commit: "fix: defer primary roll until replica is..." | Re-trigger Greptile

Comment on lines +484 to 498
} else if cluster.Spec.Replicas > 0 {
// findFailoverShard returned nil for one of three reasons:
// 1. Node is the shard primary but has no synced replicas: skip roll, would lose data
// 2. Node is in a shard but is a replica: safe to roll
// 3. Node isn't in any shard (isolated): safe to roll
// Only case 1 requires skipping; identify it's the actual primary of its shard.
shardInState := clusterState.FindShardForAddress(current.Status.PodIP)
if shardInState != nil && shardInState.GetPrimaryNode() != nil && shardInState.GetPrimaryNode().Address == current.Status.PodIP {
log.Info("primary has no synced replicas, deferring roll",
"name", node.Name, "address", current.Status.PodIP,
"shardNodes", len(shardInState.Nodes),
"shardId", shardInState.Id)
return false, false, nil
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Rolling update briefly appears complete while primary is still pending

When the primary roll is deferred and the replica subsequently syncs before the next reconcile pass, reconcileValkeyNodes returns (false, nil) rather than (true, nil). The main reconcile then fetches a fresh cluster state, passes the "replicas in sync" check, and reaches the healthy-cluster path at line 342 — emitting a ClusterReady event and clearing Progressing. The primary's pending spec update is not reflected in any condition, and the next attempt is delayed by the 30 s periodic requeue (line 354).

In practice the window is short and data is never at risk (the replica is synced, failover can succeed on the next pass), but operators watching events or conditions during a rolling update may see a spurious Ready→Progressing transition ~30 s later. A return true, false, nil here would keep Progressing=True throughout, though it would need to be done carefully to not block the MEET/REPLICATE phases that the deferred path relies on.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

"+cluster|addslotsrange", // assign slots
"+cluster|replicate", // set up replication
"+cluster|forget", // remove stale nodes
"+cluster|failover", // proactive failover before rolling primary

@jdheyburn jdheyburn May 29, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh man we really need to get some CI check to make sure the operator has everything it needs... This particular line was on me - so my bad

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we merged ACL stuff PR right after the proactive PR, so we missed it.
Also, I guess a test of failover pass without the proactive feature since there is a fallback (except it lost data).
But there is probably a way to check for NOPERM or some other way

@jdheyburn

Copy link
Copy Markdown
Collaborator

Something I've been wanting to introduce at some point is better readiness probes - perhaps gated on cluster_state:ok. I tried to do that in #120 but there were some other instabilities which prevented reconcile from progressing. I think that would help prevent the operator from rolling pods when the cluster is unhealthy, and also prevent k8s from descheduling pods when it is reconciling.

@jdheyburn jdheyburn merged commit 6be74d2 into valkey-io:main Jun 2, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants