Skip to content

feat: Safer cluster node rolls#120

Draft
jdheyburn wants to merge 14 commits into
valkey-io:mainfrom
jdheyburn:jdheyburn/feat/safer-cluster-rolls
Draft

feat: Safer cluster node rolls#120
jdheyburn wants to merge 14 commits into
valkey-io:mainfrom
jdheyburn:jdheyburn/feat/safer-cluster-rolls

Conversation

@jdheyburn

@jdheyburn jdheyburn commented Mar 21, 2026

Copy link
Copy Markdown
Collaborator

This PR enhances the readiness probe checks such that
nodes must be in a ready state before progressing with
the sequential roll that was introduced in #116.

It does this by introducing checks for the node liveness
via a Running status field. Once the node is live then
the controller is enhanced to attempt to get the node to
rejoin the cluster, regardless if a volume is set or not.

yangqiu-dot and others added 2 commits March 4, 2026 11:05
Signed-off-by: yang.qiu <yang.qiu@reddit.com>
This PR enhances the readiness probe checks such that
nodes must be in a ready state before progressing with
the sequential roll that was introduced in valkey-io#116.

It does this by introducing checks for the node liveness
via a Running status field. Once the node is live then
the controller is enhanced to attempt to get the node to
rejoin the cluster, regardless if a volume is set or not.

Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>
@jdheyburn jdheyburn force-pushed the jdheyburn/feat/safer-cluster-rolls branch from 990bf0a to 131249e Compare March 30, 2026 12:57
@jdheyburn jdheyburn changed the title Safer cluster rolls feat: Safer cluster rolls Mar 30, 2026
@jdheyburn jdheyburn changed the title feat: Safer cluster rolls feat: Safer cluster node rolls Mar 30, 2026
@jdheyburn jdheyburn marked this pull request as ready for review March 30, 2026 13:12
Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>
Comment on lines +54 to +60
if echo "$cluster_info" | grep -q '^cluster_state:'; then
response=$(echo "$cluster_info" | grep '^cluster_state:' | tr -d '[:space:]')
if [ "$response" != "cluster_state:ok" ]; then
echo "$response" >&2
exit 1
fi
fi

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can avoid grep:ing twice using

Suggested change
if echo "$cluster_info" | grep -q '^cluster_state:'; then
response=$(echo "$cluster_info" | grep '^cluster_state:' | tr -d '[:space:]')
if [ "$response" != "cluster_state:ok" ]; then
echo "$response" >&2
exit 1
fi
fi
cluster_state=$(echo "$cluster_info" | grep '^cluster_state:' | tr -d '[:space:]')
if [ -n "$cluster_state" ] && [ "$cluster_state" != "cluster_state:ok" ]; then
echo "$cluster_state" >&2
exit 1
fi

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, added!

setCondition(cluster, valkeyiov1alpha1.ConditionProgressing, valkeyiov1alpha1.ReasonUpdatingNodes, "Updating ValkeyNodes", metav1.ConditionTrue)
_ = r.updateStatus(ctx, cluster, nil)
return ctrl.Result{RequeueAfter: 2 * time.Second}, nil
return r.reconcileNodeTransition(ctx, cluster)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we just continue here? We get the list, the cluster state, forgetStaleNodes and run phase 1.. in below code in a similar way as in reconcileNodeTransition()

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point - I've done some refactoring by introducing a nodeTransitioning to indicate an early exit. Thanks!

Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>
Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>
…n/feat/safer-cluster-rolls

Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>
@jdheyburn

Copy link
Copy Markdown
Collaborator Author

I'm having a nightmare with this e2e test. It passes everytime locally but not on here. I'll try to find some more time to work out why

Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>
@jdheyburn

Copy link
Copy Markdown
Collaborator Author

Placing back into draft while I look to enhance cluster stability

@jdheyburn jdheyburn marked this pull request as draft April 8, 2026 12:48
…n/valkey-operator into jdheyburn/feat/safer-cluster-rolls

Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>
@jdheyburn

Copy link
Copy Markdown
Collaborator Author

This branch has pulled in the proactive failover from below PR:

And it seems to be much more stable now.

…yburn/feat/safer-cluster-rolls

Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>
Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>
…yburn/feat/safer-cluster-rolls

Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>
…yburn/feat/safer-cluster-rolls

Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>
Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>
…yburn/feat/safer-cluster-rolls

Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>
@melancholictheory

Copy link
Copy Markdown
Contributor

read through this. the Running-vs-Ready split for the bootstrap-vs-rolling gate is a nice way to dodge the cluster_state:ok chicken-and-egg at MEET time, shouldWaitForNode reads clean. a few notes from our own go at safer rolls:

the part i'd think hardest about: cluster_state is a cluster-global signal, and putting it in the pod readiness probe couples every pod's endpoint membership to overall cluster health. during a roll, the moment a shard is transiently without a serving primary, cluster_state flips and every pod's probe fails at once, not just the one being rolled, so healthy shards get pulled from the Service too. for the roll-progression gate that's exactly what you want (wait for convergence), but baking it into the probe spreads it to Service membership. since the operator already pulls clusterState in reconcileValkeyNodes, one option is to keep the probe node-local (PING, not loading, this node serves its own slots) and do the cluster_state:ok convergence check operator-side in shouldWaitForNode where it already lives. that decouples "is this a healthy endpoint" from "is the whole cluster converged".

worth being explicit about cluster-require-full-coverage too. with the default yes, a shard mid-failover flips cluster_state:fail cluster-wide, so the new probe takes the whole cluster NotReady during that window. with no, only the affected shard degrades and cluster_state stays ok. the proactive failover you pulled in from #128 is what keeps a planned primary roll from tripping it (replica promoted before the old primary goes down), so those two changes lean on each other.

on the e2e flake (passes local, fails CI): that pattern on roll tests is usually a convergence-timing race. couple of things we'd check: whether the readiness probe's periodSeconds + timeout is short enough that a transient cluster_state:fail clears inside the test's eventually-window on slower CI, and whether the test waits on the operator's clusterFormed/Ready convergence rather than a fixed sleep. the 2s requeue racing a longer probe period can also leave the node looking NotReady longer than the test expects.

happy to dig into any of these.

@jdheyburn

Copy link
Copy Markdown
Collaborator Author

@melancholictheory I'll probably close out this PR since it's so far behind main, which has had a number of stability improvements on it.

Essentially what I am looking to do here is prevent pods from being descheduled when they shouldn't be - such as the operator deploying out changes. The operator could be performing a proactive failover from primary to replica. While that's happening the pod readiness is OK. In the meantime, a pod could be descheduled because the K8s node is being terminated. Having some gate on whether the cluster/pod is healthy would help prevent this.

Keen to hear your thoughts?

@melancholictheory

Copy link
Copy Markdown
Contributor

makes sense to close it if it's drifted that far from main. the goal is worth keeping though, and i think it's separable from the readiness rework.

the thing we kept running into: you can't really gate descheduling through readiness. the SIGTERM can land at any point, and during a proactive failover the old primary is still correctly Ready (it's serving until the replica takes over), so anything keyed on Ready won't fire at the moment you'd actually want it to. and if you force it NotReady to block eviction, you're back to flapping it out of the Service endpoints. readiness ends up fighting itself.

what worked better for us was two layers, one per disruption class:

PodDisruptionBudget for the voluntary stuff. a per-shard PDB (maxUnavailable: 1) makes kubectl drain, the eviction API, and the cluster-autoscaler respect "don't take a pod if the shard can't afford it", the eviction just gets a 429 until it's safe. that already covers most "node is terminating" cases, since a well-behaved drain goes through the eviction API.

shutdown-on-sigterm failover for the actual termination. it landed in valkey 9.0 (valkey#1091): on SIGTERM a cluster primary does a manual failover to a replica as part of graceful shutdown, before the process exits. the nice property is it's triggered by the SIGTERM itself, so whenever a pod does get descheduled it hands off on the way out, you're not racing the operator's failover against the kubelet. needs terminationGracePeriodSeconds >= cluster-manual-failover-timeout so the grace window doesn't get cut short by SIGKILL.

together those cover both classes: PDB blocks the drains that shouldn't happen, sigterm-failover makes the descheduling that does happen graceful. the one gap neither closes is a truly involuntary loss (node just disappears, no SIGTERM), but that's the same blast radius as any unplanned primary failure, the cluster's own failover handles it.

and it lines up with the proactive failover you pulled from #128: the operator owns the planned rolls, sigterm-failover owns the out-of-band descheduling. same split, two mechanisms.

bjosv pushed a commit that referenced this pull request Jun 23, 2026
Closes #248

### Summary

Inject `shutdown-on-sigterm failover` into the managed Valkey config by
default for every ValkeyCluster. On SIGTERM (node drain, eviction,
preemption, or `kubectl delete pod`), a primary fails its slots over to
a replica during graceful shutdown, which covers the unplanned
termination that the operator's own proactive failover never sees.

### Implementation

- Added the directive to `getBaseConfig`, alongside the other
operator-managed defaults.
- It is a Valkey 9.0+ directive, which matches the operator's documented
baseline, so it renders unconditionally with no version gate.
- No double-failover risk: it only fires if the node is still primary at
SIGTERM, and it is a no-op on replicas.

### Acceptance criteria

- [x] `shutdown-on-sigterm failover` injected into valkey.conf by
default
- [x] Works alongside the operator's proactive failover without
double-firing (no-op on replicas)
- [ ] E2E test (drain the primary's node, verify a replica is promoted
before the pod exits)

I left the E2E test out for now since it needs node-drain orchestration
in the e2e harness. Happy to add it here or as a follow-up, whichever
you prefer. The unit test asserts the directive is rendered.

The `terminationGracePeriodSeconds` constraint is tracked separately in
#260; the docs note the relationship.

### Testing

- Unit test asserts `shutdown-on-sigterm failover` is in the rendered
config.
- `make test` and `make lint` pass locally.

### References

- Discussion #231 (shutdown-on-sigterm section)
- #120 (out-of-band termination discussion)
- valkey/valkey#1091

### Checklist

- [x] This Pull Request is related to one issue.
- [x] Commit message explains what changed and why
- [x] Tests are added or updated.
- [x] Documentation files are updated.
- [ ] I have run pre-commit locally (ran `make test` and `make lint`
instead)

---------

Signed-off-by: melancholictheory <selimvhorst@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants