feat: Safer cluster node rolls by jdheyburn · Pull Request #120 · valkey-io/valkey-operator

jdheyburn · 2026-03-21T15:14:03Z

This PR enhances the readiness probe checks so that nodes
must be in a ready state before the sequential roll
introduced in #116 progresses.

It does this by introducing a node liveness check via a
Running status field. Once the node is live, the controller
attempts to rejoin it to the cluster, regardless of whether
a volume is set.


bjosv · 2026-03-31T22:26:33Z

+if echo "$cluster_info" | grep -q '^cluster_state:'; then
+    response=$(echo "$cluster_info" | grep '^cluster_state:' | tr -d '[:space:]')
+    if [ "$response" != "cluster_state:ok" ]; then
+        echo "$response" >&2
+        exit 1
+    fi
+fi


We can avoid grepping twice by using:

Suggested change

-if echo "$cluster_info" | grep -q '^cluster_state:'; then
-    response=$(echo "$cluster_info" | grep '^cluster_state:' | tr -d '[:space:]')
-    if [ "$response" != "cluster_state:ok" ]; then
-        echo "$response" >&2
-        exit 1
-    fi
-fi
+cluster_state=$(echo "$cluster_info" | grep '^cluster_state:' | tr -d '[:space:]')
+if [ -n "$cluster_state" ] && [ "$cluster_state" != "cluster_state:ok" ]; then
+    echo "$cluster_state" >&2
+    exit 1
+fi

Thanks, added!

bjosv · 2026-03-31T23:06:41Z

 		setCondition(cluster, valkeyiov1alpha1.ConditionProgressing, valkeyiov1alpha1.ReasonUpdatingNodes, "Updating ValkeyNodes", metav1.ConditionTrue)
-		_ = r.updateStatus(ctx, cluster, nil)
-		return ctrl.Result{RequeueAfter: 2 * time.Second}, nil
+		return r.reconcileNodeTransition(ctx, cluster)


Why can't we just continue here? The code below already gets the list and the cluster state, calls forgetStaleNodes, and runs phase 1, much as reconcileNodeTransition() does.

Fair point - I've done some refactoring by introducing a nodeTransitioning to indicate an early exit. Thanks!


jdheyburn · 2026-04-02T08:59:59Z

I'm having a nightmare with this e2e test. It passes every time locally but not in CI. I'll try to find some more time to work out why.


jdheyburn · 2026-04-08T12:48:15Z

Placing back into draft while I look to enhance cluster stability


jdheyburn · 2026-04-10T13:28:07Z

This branch has pulled in the proactive failover from the PR below:

feat: proactive failovers before cluster rolls #128

And it seems to be much more stable now.


melancholictheory · 2026-06-08T11:26:06Z

read through this. the Running-vs-Ready split for the bootstrap-vs-rolling gate is a nice way to dodge the cluster_state:ok chicken-and-egg at MEET time, shouldWaitForNode reads clean. a few notes from our own go at safer rolls:

the part i'd think hardest about: cluster_state is a cluster-global signal, and putting it in the pod readiness probe couples every pod's endpoint membership to overall cluster health. during a roll, the moment a shard is transiently without a serving primary, cluster_state flips and every pod's probe fails at once, not just the one being rolled, so healthy shards get pulled from the Service too. for the roll-progression gate that's exactly what you want (wait for convergence), but baking it into the probe spreads it to Service membership. since the operator already pulls clusterState in reconcileValkeyNodes, one option is to keep the probe node-local (PING, not loading, this node serves its own slots) and do the cluster_state:ok convergence check operator-side in shouldWaitForNode where it already lives. that decouples "is this a healthy endpoint" from "is the whole cluster converged".
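A minimal sketch of that split, not the PR's actual probe. The function names are hypothetical, and each function takes command output as text so the logic can be read without a live server. It assumes `valkey-cli ping` replies `PONG`, that `INFO persistence` carries a `loading:` field, and that `CLUSTER INFO` carries a `cluster_state:` field. The "this node serves its own slots" check is left out.

```shell
#!/bin/sh
# Hypothetical helpers; names are illustrative and not taken from the operator.

# field_value TEXT KEY
# Print the value of the first "KEY:value" line in INFO-style output.
# INFO-style replies end lines with \r\n, so carriage returns are stripped.
field_value() {
    printf '%s\n' "$1" | tr -d '\r' | sed -n "s/^$2://p" | head -n 1
}

# node_local_ready PING_REPLY INFO_PERSISTENCE
# Node-local readiness: the server answers PING and is not loading a dataset.
node_local_ready() {
    [ "$1" = "PONG" ] || return 1
    [ "$(field_value "$2" loading)" = "0" ]
}

# cluster_converged CLUSTER_INFO
# Cluster-wide convergence, meant for the operator-side roll gate only,
# so that one shard's failover does not mark every pod NotReady.
cluster_converged() {
    [ "$(field_value "$1" cluster_state)" = "ok" ]
}
```

In a probe script the inputs would come from something like `node_local_ready "$(valkey-cli ping)" "$(valkey-cli info persistence)"`, with whatever auth or TLS flags the deployment needs; `cluster_converged` would stay out of the probe entirely.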

worth being explicit about cluster-require-full-coverage too. with the default yes, a shard mid-failover flips cluster_state:fail cluster-wide, so the new probe takes the whole cluster NotReady during that window. with no, only the affected shard degrades and cluster_state stays ok. the proactive failover you pulled in from #128 is what keeps a planned primary roll from tripping it (replica promoted before the old primary goes down), so those two changes lean on each other.

on the e2e flake (passes local, fails CI): that pattern on roll tests is usually a convergence-timing race. couple of things we'd check: whether the readiness probe's periodSeconds + timeout is short enough that a transient cluster_state:fail clears inside the test's eventually-window on slower CI, and whether the test waits on the operator's clusterFormed/Ready convergence rather than a fixed sleep. the 2s requeue racing a longer probe period can also leave the node looking NotReady longer than the test expects.
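The "wait on convergence rather than a fixed sleep" point can be sketched as a poll-until-deadline helper. This is a shell illustration of the idea only; the e2e suite itself may use a different mechanism, and `wait_until` is a hypothetical name.

```shell
#!/bin/sh
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-run COMMAND until it succeeds or the deadline passes.
# Returns 0 on success, 1 on timeout.
# Note: elapsed time counts only the sleeps, so a slow COMMAND stretches
# the real wall-clock window. POSIX sh has no "local", so the variables
# below are global.
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    while ! "$@"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 0
}
```

A test would then call something like `wait_until 120 2 cluster_is_ready`, where `cluster_is_ready` is a placeholder for whatever check reads the operator's convergence status. The window should be sized so that at least a few probe periods fit inside it on slower CI runners.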

happy to dig into any of these.

jdheyburn · 2026-06-08T14:51:53Z

@melancholictheory I'll probably close out this PR since it's so far behind main, which has had a number of stability improvements on it.

Essentially what I am looking to do here is prevent pods from being descheduled when they shouldn't be, such as while the operator is rolling out changes. The operator could be performing a proactive failover from primary to replica. While that's happening the pod readiness is OK. In the meantime, a pod could be descheduled because the K8s node is being terminated. Having some gate on whether the cluster/pod is healthy would help prevent this.

Keen to hear your thoughts?

melancholictheory · 2026-06-08T15:06:46Z

makes sense to close it if it's drifted that far from main. the goal is worth keeping though, and i think it's separable from the readiness rework.

the thing we kept running into: you can't really gate descheduling through readiness. the SIGTERM can land at any point, and during a proactive failover the old primary is still correctly Ready (it's serving until the replica takes over), so anything keyed on Ready won't fire at the moment you'd actually want it to. and if you force it NotReady to block eviction, you're back to flapping it out of the Service endpoints. readiness ends up fighting itself.

what worked better for us was two layers, one per disruption class:

PodDisruptionBudget for the voluntary stuff. a per-shard PDB (maxUnavailable: 1) makes kubectl drain, the eviction API, and the cluster-autoscaler respect "don't take a pod if the shard can't afford it", the eviction just gets a 429 until it's safe. that already covers most "node is terminating" cases, since a well-behaved drain goes through the eviction API.

shutdown-on-sigterm failover for the actual termination. it landed in valkey 9.0 (valkey#1091): on SIGTERM a cluster primary does a manual failover to a replica as part of graceful shutdown, before the process exits. the nice property is it's triggered by the SIGTERM itself, so whenever a pod does get descheduled it hands off on the way out, you're not racing the operator's failover against the kubelet. needs terminationGracePeriodSeconds >= cluster-manual-failover-timeout so the grace window doesn't get cut short by SIGKILL.

together those cover both classes: PDB blocks the drains that shouldn't happen, sigterm-failover makes the descheduling that does happen graceful. the one gap neither closes is a truly involuntary loss (node just disappears, no SIGTERM), but that's the same blast radius as any unplanned primary failure, the cluster's own failover handles it.
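A rough sketch of the two layers, under stated assumptions. `print_shard_pdb` emits a `policy/v1` PodDisruptionBudget for one shard; the label key `example.com/shard` is a placeholder, not the operator's real label. `grace_period_covers_failover` checks the grace-period constraint, assuming `cluster-manual-failover-timeout` is expressed in milliseconds (worth verifying against the Valkey version in use). Both function names are hypothetical.

```shell
#!/bin/sh
# print_shard_pdb NAME SHARD_LABEL_VALUE
# Emit a PodDisruptionBudget that allows at most one pod of the shard
# to be voluntarily evicted at a time.
# The label key below is a placeholder, not the operator's actual label.
print_shard_pdb() {
    cat <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: $1
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      example.com/shard: "$2"
EOF
}

# grace_period_covers_failover GRACE_SECONDS FAILOVER_TIMEOUT_MS
# Succeeds when terminationGracePeriodSeconds is at least as long as the
# manual failover timeout (assumed to be in milliseconds), so SIGKILL
# does not cut the failover short.
grace_period_covers_failover() {
    [ $(( $1 * 1000 )) -ge "$2" ]
}
```

For example, `print_shard_pdb my-shard-0 0 | kubectl apply -f -` would create the budget, and `grace_period_covers_failover 30 5000` succeeds because 30 seconds covers a 5000 ms timeout.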

and it lines up with the proactive failover you pulled from #128: the operator owns the planned rolls, sigterm-failover owns the out-of-band descheduling. same split, two mechanisms.

Closes #248

### Summary

Inject `shutdown-on-sigterm failover` into the managed Valkey config by default for every ValkeyCluster. On SIGTERM (node drain, eviction, preemption, or `kubectl delete pod`), a primary fails its slots over to a replica during graceful shutdown, which covers the unplanned termination that the operator's own proactive failover never sees.

### Implementation

- Added the directive to `getBaseConfig`, alongside the other operator-managed defaults.
- It is a Valkey 9.0+ directive, which matches the operator's documented baseline, so it renders unconditionally with no version gate.
- No double-failover risk: it only fires if the node is still primary at SIGTERM, and it is a no-op on replicas.

### Acceptance criteria

- [x] `shutdown-on-sigterm failover` injected into valkey.conf by default
- [x] Works alongside the operator's proactive failover without double-firing (no-op on replicas)
- [ ] E2E test (drain the primary's node, verify a replica is promoted before the pod exits)

I left the E2E test out for now since it needs node-drain orchestration in the e2e harness. Happy to add it here or as a follow-up, whichever you prefer. The unit test asserts the directive is rendered. The `terminationGracePeriodSeconds` constraint is tracked separately in #260; the docs note the relationship.

### Testing

- Unit test asserts `shutdown-on-sigterm failover` is in the rendered config.
- `make test` and `make lint` pass locally.

### References

- Discussion #231 (shutdown-on-sigterm section)
- #120 (out-of-band termination discussion)
- valkey/valkey#1091

### Checklist

- [x] This Pull Request is related to one issue.
- [x] Commit message explains what changed and why
- [x] Tests are added or updated.
- [x] Documentation files are updated.
- [ ] I have run pre-commit locally (ran `make test` and `make lint` instead)

---------

Signed-off-by: melancholictheory <selimvhorst@gmail.com>

yangqiu-dot and others added 2 commits March 4, 2026 11:05

fix race condition between forget and failover

6097785

Signed-off-by: yang.qiu <yang.qiu@reddit.com>

jdheyburn force-pushed the jdheyburn/feat/safer-cluster-rolls branch from 990bf0a to 131249e Compare March 30, 2026 12:57

jdheyburn changed the title ~~Safer cluster rolls~~ feat: Safer cluster rolls Mar 30, 2026

jdheyburn changed the title ~~feat: Safer cluster rolls~~ feat: Safer cluster node rolls Mar 30, 2026

jdheyburn marked this pull request as ready for review March 30, 2026 13:12

ci: trigger

b9f0d72

Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>

bjosv reviewed Mar 31, 2026


jdheyburn added 3 commits April 1, 2026 17:15

refactor: better readiness probe

b4a617f

Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>

refactor: reuse logic for joining a node back to cluster

0025218

Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>

Merge remote-tracking branch 'ysqyang/failover-bug-fix' into jdheybur…

6c05c5d

…n/feat/safer-cluster-rolls Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>

feat: proactive cluster failover before rolling pod updates

b326498

Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>

jdheyburn marked this pull request as draft April 8, 2026 12:48

Merge branch 'feat/cluster-proactive-failover' of github.com:jdheybur…

180e596

…n/valkey-operator into jdheyburn/feat/safer-cluster-rolls Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>

jdheyburn added 2 commits April 10, 2026 14:41

Merge branch 'main' of github.com:valkey-io/valkey-operator into jdhe…

e489d6b

…yburn/feat/safer-cluster-rolls Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>

chore: fix lint

1a81563

Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>

jdheyburn mentioned this pull request Apr 14, 2026

[Design] Persistent Volumes for ValkeyNode #121

Closed

jdheyburn added 4 commits April 14, 2026 21:17

Merge branch 'main' of github.com:valkey-io/valkey-operator into jdhe…

d5647e7

…yburn/feat/safer-cluster-rolls Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>

Merge branch 'main' of github.com:valkey-io/valkey-operator into jdhe…

aa3d2f9

…yburn/feat/safer-cluster-rolls Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>

Fix lint

599db08

Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>

Merge branch 'main' of github.com:valkey-io/valkey-operator into jdhe…

dbffae6

…yburn/feat/safer-cluster-rolls Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>

jdheyburn mentioned this pull request May 29, 2026

fix: rolling update loses all keys when replicas are not synced #208

Merged

5 tasks

jdheyburn mentioned this pull request Jun 12, 2026

[feat] Support shutdown-on-sigterm failover config #248

Closed

3 tasks

melancholictheory mentioned this pull request Jun 19, 2026

feat: inject shutdown-on-sigterm failover by default #268

Merged

8 tasks


jdheyburn wants to merge 14 commits into valkey-io:main from jdheyburn:jdheyburn/feat/safer-cluster-rolls


4 participants
