feat: Safer cluster node rolls#120
Conversation
Signed-off-by: yang.qiu <yang.qiu@reddit.com>
This PR enhances the readiness probe checks such that nodes must be in a ready state before progressing with the sequential roll that was introduced in valkey-io#116. It does this by introducing checks for the node liveness via a Running status field. Once the node is live then the controller is enhanced to attempt to get the node to rejoin the cluster, regardless if a volume is set or not. Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>
990bf0a to
131249e
Compare
Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>
| if echo "$cluster_info" | grep -q '^cluster_state:'; then | ||
| response=$(echo "$cluster_info" | grep '^cluster_state:' | tr -d '[:space:]') | ||
| if [ "$response" != "cluster_state:ok" ]; then | ||
| echo "$response" >&2 | ||
| exit 1 | ||
| fi | ||
| fi |
There was a problem hiding this comment.
We can avoid grep:ing twice using
| if echo "$cluster_info" | grep -q '^cluster_state:'; then | |
| response=$(echo "$cluster_info" | grep '^cluster_state:' | tr -d '[:space:]') | |
| if [ "$response" != "cluster_state:ok" ]; then | |
| echo "$response" >&2 | |
| exit 1 | |
| fi | |
| fi | |
| cluster_state=$(echo "$cluster_info" | grep '^cluster_state:' | tr -d '[:space:]') | |
| if [ -n "$cluster_state" ] && [ "$cluster_state" != "cluster_state:ok" ]; then | |
| echo "$cluster_state" >&2 | |
| exit 1 | |
| fi |
| setCondition(cluster, valkeyiov1alpha1.ConditionProgressing, valkeyiov1alpha1.ReasonUpdatingNodes, "Updating ValkeyNodes", metav1.ConditionTrue) | ||
| _ = r.updateStatus(ctx, cluster, nil) | ||
| return ctrl.Result{RequeueAfter: 2 * time.Second}, nil | ||
| return r.reconcileNodeTransition(ctx, cluster) |
There was a problem hiding this comment.
Why can't we just continue here? We get the list, the cluster state, forgetStaleNodes and run phase 1.. in below code in a similar way as in reconcileNodeTransition()
There was a problem hiding this comment.
Fair point - I've done some refactoring by introducing a nodeTransitioning to indicate an early exit. Thanks!
Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>
Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>
…n/feat/safer-cluster-rolls Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>
|
I'm having a nightmare with this e2e test. It passes everytime locally but not on here. I'll try to find some more time to work out why |
Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>
|
Placing back into draft while I look to enhance cluster stability |
…n/valkey-operator into jdheyburn/feat/safer-cluster-rolls Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>
|
This branch has pulled in the proactive failover from below PR: And it seems to be much more stable now. |
…yburn/feat/safer-cluster-rolls Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>
Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>
…yburn/feat/safer-cluster-rolls Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>
…yburn/feat/safer-cluster-rolls Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>
…yburn/feat/safer-cluster-rolls Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>
|
read through this. the Running-vs-Ready split for the bootstrap-vs-rolling gate is a nice way to dodge the cluster_state:ok chicken-and-egg at MEET time, shouldWaitForNode reads clean. a few notes from our own go at safer rolls: the part i'd think hardest about: cluster_state is a cluster-global signal, and putting it in the pod readiness probe couples every pod's endpoint membership to overall cluster health. during a roll, the moment a shard is transiently without a serving primary, cluster_state flips and every pod's probe fails at once, not just the one being rolled, so healthy shards get pulled from the Service too. for the roll-progression gate that's exactly what you want (wait for convergence), but baking it into the probe spreads it to Service membership. since the operator already pulls clusterState in reconcileValkeyNodes, one option is to keep the probe node-local (PING, not loading, this node serves its own slots) and do the cluster_state:ok convergence check operator-side in shouldWaitForNode where it already lives. that decouples "is this a healthy endpoint" from "is the whole cluster converged". worth being explicit about cluster-require-full-coverage too. with the default yes, a shard mid-failover flips cluster_state:fail cluster-wide, so the new probe takes the whole cluster NotReady during that window. with no, only the affected shard degrades and cluster_state stays ok. the proactive failover you pulled in from #128 is what keeps a planned primary roll from tripping it (replica promoted before the old primary goes down), so those two changes lean on each other. on the e2e flake (passes local, fails CI): that pattern on roll tests is usually a convergence-timing race. couple of things we'd check: whether the readiness probe's periodSeconds + timeout is short enough that a transient cluster_state:fail clears inside the test's eventually-window on slower CI, and whether the test waits on the operator's clusterFormed/Ready convergence rather than a fixed sleep. the 2s requeue racing a longer probe period can also leave the node looking NotReady longer than the test expects. happy to dig into any of these. |
|
@melancholictheory I'll probably close out this PR since it's so far behind main, which has had a number of stability improvements on it. Essentially what I am looking to do here is prevent pods from being descheduled when they shouldn't be - such as the operator deploying out changes. The operator could be performing a proactive failover from primary to replica. While that's happening the pod readiness is OK. In the meantime, a pod could be descheduled because the K8s node is being terminated. Having some gate on whether the cluster/pod is healthy would help prevent this. Keen to hear your thoughts? |
|
makes sense to close it if it's drifted that far from main. the goal is worth keeping though, and i think it's separable from the readiness rework. the thing we kept running into: you can't really gate descheduling through readiness. the SIGTERM can land at any point, and during a proactive failover the old primary is still correctly Ready (it's serving until the replica takes over), so anything keyed on Ready won't fire at the moment you'd actually want it to. and if you force it NotReady to block eviction, you're back to flapping it out of the Service endpoints. readiness ends up fighting itself. what worked better for us was two layers, one per disruption class: PodDisruptionBudget for the voluntary stuff. a per-shard PDB (maxUnavailable: 1) makes kubectl drain, the eviction API, and the cluster-autoscaler respect "don't take a pod if the shard can't afford it", the eviction just gets a 429 until it's safe. that already covers most "node is terminating" cases, since a well-behaved drain goes through the eviction API. shutdown-on-sigterm failover for the actual termination. it landed in valkey 9.0 (valkey#1091): on SIGTERM a cluster primary does a manual failover to a replica as part of graceful shutdown, before the process exits. the nice property is it's triggered by the SIGTERM itself, so whenever a pod does get descheduled it hands off on the way out, you're not racing the operator's failover against the kubelet. needs terminationGracePeriodSeconds >= cluster-manual-failover-timeout so the grace window doesn't get cut short by SIGKILL. together those cover both classes: PDB blocks the drains that shouldn't happen, sigterm-failover makes the descheduling that does happen graceful. the one gap neither closes is a truly involuntary loss (node just disappears, no SIGTERM), but that's the same blast radius as any unplanned primary failure, the cluster's own failover handles it. and it lines up with the proactive failover you pulled from #128: the operator owns the planned rolls, sigterm-failover owns the out-of-band descheduling. same split, two mechanisms. |
Closes #248 ### Summary Inject `shutdown-on-sigterm failover` into the managed Valkey config by default for every ValkeyCluster. On SIGTERM (node drain, eviction, preemption, or `kubectl delete pod`), a primary fails its slots over to a replica during graceful shutdown, which covers the unplanned termination that the operator's own proactive failover never sees. ### Implementation - Added the directive to `getBaseConfig`, alongside the other operator-managed defaults. - It is a Valkey 9.0+ directive, which matches the operator's documented baseline, so it renders unconditionally with no version gate. - No double-failover risk: it only fires if the node is still primary at SIGTERM, and it is a no-op on replicas. ### Acceptance criteria - [x] `shutdown-on-sigterm failover` injected into valkey.conf by default - [x] Works alongside the operator's proactive failover without double-firing (no-op on replicas) - [ ] E2E test (drain the primary's node, verify a replica is promoted before the pod exits) I left the E2E test out for now since it needs node-drain orchestration in the e2e harness. Happy to add it here or as a follow-up, whichever you prefer. The unit test asserts the directive is rendered. The `terminationGracePeriodSeconds` constraint is tracked separately in #260; the docs note the relationship. ### Testing - Unit test asserts `shutdown-on-sigterm failover` is in the rendered config. - `make test` and `make lint` pass locally. ### References - Discussion #231 (shutdown-on-sigterm section) - #120 (out-of-band termination discussion) - valkey/valkey#1091 ### Checklist - [x] This Pull Request is related to one issue. - [x] Commit message explains what changed and why - [x] Tests are added or updated. - [x] Documentation files are updated. - [ ] I have run pre-commit locally (ran `make test` and `make lint` instead) --------- Signed-off-by: melancholictheory <selimvhorst@gmail.com>
This PR enhances the readiness probe checks such that
nodes must be in a ready state before progressing with
the sequential roll that was introduced in #116.
It does this by introducing checks for the node liveness
via a Running status field. Once the node is live then
the controller is enhanced to attempt to get the node to
rejoin the cluster, regardless if a volume is set or not.