fix: rolling update loses all keys when replicas are not synced #208
The proactive failover logic issues CLUSTER FAILOVER on a replica before rolling the shard primary, but the _operator system user was missing this command in its ACL. Every failover attempt failed with NOPERM and the primary was rolled without a graceful handoff.
Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
During a rolling update without persistence, restarting a replica destroys its cluster membership (nodes.conf) and data. The operator marked it Ready after PING and immediately rolled the primary. With no synced replica to fail over to, all shard data was lost. Skip rolling a primary when it has no synced replicas, letting the reconcile continue to the MEET/REPLICATE phases that rejoin the replica.
Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
| Filename | Overview |
|---|---|
| internal/controller/users.go | Adds `+cluster |
| internal/controller/valkeycluster_controller.go | Adds a guard that defers a primary roll when no synced replica exists; logic is sound and the requeue path via the existing replica-sync check is correct, but after the defer the cluster can briefly be marked Ready (up to 30 s) while the rolling update is still pending. |
Sequence Diagram
sequenceDiagram
participant Operator
participant ReconcileValkeyNodes
participant ClusterState
participant ValkeyPrimary
participant ValkeyReplica
Note over Operator: Rolling update triggered (config change)
Operator->>ReconcileValkeyNodes: reconcileValkeyNodes()
ReconcileValkeyNodes->>ClusterState: getValkeyClusterState() [snapshot]
Note over ReconcileValkeyNodes: Iterate replicas-first
ReconcileValkeyNodes->>ValkeyReplica: Update spec (roll replica)
ValkeyReplica-->>ReconcileValkeyNodes: "requeue=true"
ReconcileValkeyNodes-->>Operator: (true, nil) → RequeueAfter 2s
Note over Operator: Wait for replica K8s Ready...
Operator->>ReconcileValkeyNodes: reconcileValkeyNodes() [fresh snapshot]
ReconcileValkeyNodes->>ClusterState: getValkeyClusterState()
ClusterState-->>ReconcileValkeyNodes: replica not synced (master_link_status≠up)
ReconcileValkeyNodes->>ReconcileValkeyNodes: findFailoverShard → nil
ReconcileValkeyNodes->>ClusterState: FindShardForAddress(primaryIP)
ClusterState-->>ReconcileValkeyNodes: shardInState (primary confirmed)
Note over ReconcileValkeyNodes: NEW: primary has no synced replicas → defer roll
ReconcileValkeyNodes-->>Operator: (false, nil) — continues reconcile
Operator->>ClusterState: getValkeyClusterState() [second fetch]
ClusterState-->>Operator: replica not in sync
Note over Operator: IsReplicationInSync()=false → RequeueAfter 2s
Note over Operator: Replica syncs (master_link_status=up)...
Operator->>ReconcileValkeyNodes: reconcileValkeyNodes() [fresh snapshot]
ReconcileValkeyNodes->>ClusterState: getValkeyClusterState()
ClusterState-->>ReconcileValkeyNodes: replica IS synced
ReconcileValkeyNodes->>ReconcileValkeyNodes: findFailoverShard → shard + replicas
Note over ReconcileValkeyNodes: NEW: +cluster|failover ACL now permits this
ReconcileValkeyNodes->>ValkeyReplica: CLUSTER FAILOVER
ValkeyReplica-->>ValkeyPrimary: becomes new primary
ReconcileValkeyNodes->>ValkeyPrimary: Update spec (roll old primary)
ValkeyPrimary-->>ReconcileValkeyNodes: "requeue=true"
ReconcileValkeyNodes-->>Operator: (true, nil) → RequeueAfter 2s
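The diagram treats a replica as synced once `master_link_status` is `up`. A minimal sketch of that check, assuming the operator has the raw `INFO replication` text in hand; the helper names `infoField` and `replicaSynced` are hypothetical and are not taken from the operator's code, which may check more than this one field:

```go
package main

import (
	"fmt"
	"strings"
)

// infoField returns the value of key in INFO-style "key:value" output.
// Hypothetical helper for illustration; not part of the operator.
func infoField(info, key string) (string, bool) {
	for _, line := range strings.Split(info, "\n") {
		line = strings.TrimSpace(line) // also drops the "\r" of CRLF output
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and section headers
		}
		k, v, ok := strings.Cut(line, ":")
		if ok && k == key {
			return v, true
		}
	}
	return "", false
}

// replicaSynced reports whether the replication link to the primary is up.
// A missing field (for example on a primary) counts as not synced.
func replicaSynced(info string) bool {
	v, ok := infoField(info, "master_link_status")
	return ok && v == "up"
}

func main() {
	linked := "# Replication\r\nrole:slave\r\nmaster_link_status:up\r\n"
	syncing := "# Replication\r\nrole:slave\r\nmaster_link_status:down\r\n"
	fmt.Println(replicaSynced(linked), replicaSynced(syncing))
}
```

Treating a missing field as "not synced" errs on the side of deferring the roll, which matches the intent of the fix.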
Reviews (1): Last reviewed commit: "fix: defer primary roll until replica is..."
| } else if cluster.Spec.Replicas > 0 { | ||
| // findFailoverShard returned nil for one of three reasons: | ||
| // 1. Node is the shard primary but has no synced replicas: skip roll, would lose data | ||
| // 2. Node is in a shard but is a replica: safe to roll | ||
| // 3. Node isn't in any shard (isolated): safe to roll | ||
| // Only case 1 requires skipping; identify it's the actual primary of its shard. | ||
| shardInState := clusterState.FindShardForAddress(current.Status.PodIP) | ||
| if shardInState != nil && shardInState.GetPrimaryNode() != nil && shardInState.GetPrimaryNode().Address == current.Status.PodIP { | ||
| log.Info("primary has no synced replicas, deferring roll", | ||
| "name", node.Name, "address", current.Status.PodIP, | ||
| "shardNodes", len(shardInState.Nodes), | ||
| "shardId", shardInState.Id) | ||
| return false, false, nil | ||
| } | ||
| } |
Rolling update briefly appears complete while primary is still pending
When the primary roll is deferred and the replica subsequently syncs before the next reconcile pass, reconcileValkeyNodes returns (false, nil) rather than (true, nil). The main reconcile then fetches a fresh cluster state, passes the "replicas in sync" check, and reaches the healthy-cluster path at line 342 — emitting a ClusterReady event and clearing Progressing. The primary's pending spec update is not reflected in any condition, and the next attempt is delayed by the 30 s periodic requeue (line 354).
In practice the window is short and data is never at risk (the replica is synced, failover can succeed on the next pass), but operators watching events or conditions during a rolling update may see a spurious Ready→Progressing transition ~30 s later. Returning `true, false, nil` here would keep Progressing=True throughout, though that would need to be done carefully so as not to block the MEET/REPLICATE phases that the deferred path relies on.
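The three cases listed in the code comment, plus the proactive-failover path, can be modelled as a small decision function. This is a simplified sketch, not the operator's implementation: the `node`, `shard`, and `action` types and the `decide` function are hypothetical stand-ins, and the `wantReplicas` parameter mirrors the `cluster.Spec.Replicas > 0` condition in the diff.

```go
package main

import "fmt"

// node and shard are simplified stand-ins for the operator's cluster-state types.
type node struct {
	Address string
	Primary bool
	Synced  bool // for replicas: replication link to the primary is up
}

type shard struct{ Nodes []node }

type action string

const (
	rollNow       action = "roll"
	failoverFirst action = "failover-then-roll"
	deferRoll     action = "defer"
)

// decide picks the action for the pod at addr.
// s is nil when the pod is not part of any shard.
func decide(s *shard, addr string, wantReplicas int) action {
	if s == nil {
		return rollNow // isolated node: nothing to lose
	}
	isPrimary, syncedReplica := false, false
	for _, n := range s.Nodes {
		if n.Address == addr && n.Primary {
			isPrimary = true
		}
		if !n.Primary && n.Synced {
			syncedReplica = true
		}
	}
	switch {
	case !isPrimary:
		return rollNow // replica: safe to roll
	case syncedReplica:
		return failoverFirst // hand over before rolling the primary
	case wantReplicas > 0:
		return deferRoll // primary without a synced replica: rolling would lose data
	default:
		return rollNow // no replicas configured, so there is nothing to wait for
	}
}

func main() {
	s := &shard{Nodes: []node{
		{Address: "10.0.0.1", Primary: true},
		{Address: "10.0.0.2", Synced: false},
	}}
	fmt.Println(decide(s, "10.0.0.1", 1)) // primary, replica still syncing
	fmt.Println(decide(s, "10.0.0.2", 1)) // the replica itself
}
```

In this model the deferred case is the only one that leaves a spec update pending, which is why the review asks whether it should also keep the cluster in a progressing state.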
| "+cluster|addslotsrange", // assign slots | ||
| "+cluster|replicate", // set up replication | ||
| "+cluster|forget", // remove stale nodes | ||
| "+cluster|failover", // proactive failover before rolling primary |
Oh man we really need to get some CI check to make sure the operator has everything it needs... This particular line was on me - so my bad
I think we merged the ACL PR right after the proactive-failover PR, so we missed it.
Also, I guess a failover test would pass even without the proactive feature, since there is a fallback (except that it lost data).
But there is probably a way to check for NOPERM, or some other way to catch this.
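One possible shape for such a CI check, sketched under assumptions: keep a list of the ACL rules the operator's code paths depend on and compare it against the rules granted to the `_operator` user. The function `missingACLRules` and the `required` list are hypothetical; only the rule strings come from the diff above.

```go
package main

import (
	"fmt"
	"sort"
)

// missingACLRules returns the required rules that are absent from granted, sorted.
// Hypothetical helper: it compares rule strings exactly and does not expand
// ACL categories or wildcards.
func missingACLRules(granted, required []string) []string {
	have := make(map[string]struct{}, len(granted))
	for _, r := range granted {
		have[r] = struct{}{}
	}
	var missing []string
	for _, r := range required {
		if _, ok := have[r]; !ok {
			missing = append(missing, r)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Rules granted before this PR (subset shown in the diff).
	granted := []string{"+cluster|addslotsrange", "+cluster|replicate", "+cluster|forget"}
	// Assumed list of rules the operator needs; maintained by hand in this sketch.
	required := []string{"+cluster|replicate", "+cluster|forget", "+cluster|failover"}
	fmt.Println(missingACLRules(granted, required)) // prints [+cluster|failover]
}
```

A unit test that fails when the result is non-empty would have flagged the missing `+cluster|failover` rule, provided the required list is updated whenever a new command is introduced. Checking for NOPERM replies in an end-to-end test would catch the same gap without a hand-maintained list.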
Something I've been wanting to introduce at some point is better readiness probes - perhaps gated on cluster_state:ok. I tried to do that in #120 but there were some other instabilities which prevented reconcile from progressing. I think that would help prevent the operator from rolling pods when the cluster is unhealthy, and also prevent k8s from descheduling pods when it is reconciling.
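As a rough illustration of the readiness idea, the sketch below scans `CLUSTER INFO` output for the `cluster_state:ok` line. The function name `clusterStateOK` is hypothetical, and how the probe would obtain the output (exec probe, sidecar, or the operator itself) is left open here.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// clusterStateOK reports whether CLUSTER INFO output contains cluster_state:ok.
// Hypothetical helper for illustration; not taken from the operator or from #120.
func clusterStateOK(clusterInfo string) bool {
	sc := bufio.NewScanner(strings.NewReader(clusterInfo))
	for sc.Scan() {
		if strings.TrimSpace(sc.Text()) == "cluster_state:ok" {
			return true
		}
	}
	return false
}

func main() {
	healthy := "cluster_state:ok\r\ncluster_slots_assigned:16384\r\n"
	degraded := "cluster_state:fail\r\ncluster_slots_assigned:16000\r\n"
	fmt.Println(clusterStateOK(healthy), clusterStateOK(degraded))
}
```

A probe built on this would report not-ready while the cluster is in the `fail` state, so whether it helps or hinders reconcile depends on the instabilities mentioned above.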
Summary
This PR fixes two issues that together make rolling updates safe with and without PVCs.
Without persistence: restarting a replica destroys its cluster membership and data. The operator considered it Ready and immediately rolled the primary. With no synced replica to fail over to, all shard data was lost.
With persistence: the replica stays synced and proactive failover is attempted, but the `_operator` user ACL was missing `cluster|failover`. Every failover failed with NOPERM and the primary was rolled without a graceful handoff.
Testing
Found using #203 running the rolling-update scenario. It patches io-threads on the ValkeyCluster to trigger a rolling restart of all pods, then verifies the cluster recovers and all keys are preserved.
Checklist
Before submitting the PR make sure the following are checked:
`pre-commit run --all-files` or hooks on commit)