feat: inject shutdown-on-sigterm failover by default by melancholictheory · Pull Request #268 · valkey-io/valkey-operator

melancholictheory · 2026-06-19T18:51:11Z

Closes #248

Summary

Inject shutdown-on-sigterm failover into the managed Valkey config by default for every ValkeyCluster. On SIGTERM (node drain, eviction, preemption, or kubectl delete pod), a primary fails its slots over to a replica during graceful shutdown, which covers the unplanned termination that the operator's own proactive failover never sees.

Implementation

Added the directive to getBaseConfig, alongside the other operator-managed defaults.
It is a Valkey 9.0+ directive, which matches the operator's documented baseline, so it renders unconditionally with no version gate.
No double-failover risk: it only fires if the node is still primary at SIGTERM, and it is a no-op on replicas.

Acceptance criteria

shutdown-on-sigterm failover injected into valkey.conf by default
Works alongside the operator's proactive failover without double-firing (no-op on replicas)
E2E test (drain the primary's node, verify a replica is promoted before the pod exits)

I left the E2E test out for now since it needs node-drain orchestration in the e2e harness. Happy to add it here or as a follow-up, whichever you prefer. The unit test asserts the directive is rendered.

The terminationGracePeriodSeconds constraint is tracked separately in #260; the docs note the relationship.

Testing

Unit test asserts shutdown-on-sigterm failover is in the rendered config.
make test and make lint pass locally.

References

Discussion Design: proactive (zero-downtime) rolling restart — pre-empt the failover before recreating the primary #231 (shutdown-on-sigterm section)
feat: Safer cluster node rolls #120 (out-of-band termination discussion)
valkey/valkey#1091

Checklist

This Pull Request is related to one issue.
Commit message explains what changed and why
Tests are added or updated.
Documentation files are updated.
I have run pre-commit locally (ran make test and make lint instead)

Render `shutdown-on-sigterm failover` in the base cluster config so a primary hands its slots off to a replica during graceful shutdown (node drain, eviction, preemption). This covers out-of-band descheduling that the operator's own rolling failover never observes. Requires Valkey 9.0+. Signed-off-by: melancholictheory <selimvhorst@gmail.com>

greptile-apps · 2026-06-19T18:54:30Z

Greptile Summary

Injects shutdown-on-sigterm failover into the operator-managed Valkey config so that a primary node receiving SIGTERM (from a node drain, eviction, or preemption) hands its slots to a replica before exiting, covering out-of-band descheduling that the operator's proactive failover never observes.

config.go: Adds "shutdown-on-sigterm": "failover" to getBaseConfig alongside the other non-overridable cluster directives; scoped only to ValkeyCluster (not standalone ValkeyNode), and is a no-op on replicas so there is no double-failover risk.
config_test.go: Extends the existing cluster config unit test to assert the directive appears in the rendered valkey.conf output.
docs/valkeycluster.md: Adds a new "Graceful shutdown" section explaining the Valkey 9.0+ requirement, the terminationGracePeriodSeconds / cluster-manual-failover-timeout relationship, and what to do if users raise the manual failover timeout.

Confidence Score: 5/5

Safe to merge — the change is a single base-config addition that fires only on SIGTERM, is a no-op on replicas, and does not interfere with the operator's own proactive failover path.

The code change is minimal: one map entry in getBaseConfig, a matching unit test, and accurate documentation. The directive is correctly scoped to ValkeyCluster only (standalone ValkeyNode uses buildManagedConfig, which is untouched). The default timeout values cited in the docs (5s cluster-manual-failover-timeout, 30s Kubernetes grace period) are verified correct against the Valkey documentation, giving roughly 25s of margin. No control-flow, reconciler logic, or data paths are modified.

No files require special attention.

Important Files Changed

Filename	Overview
internal/controller/config.go	Adds "shutdown-on-sigterm failover" to getBaseConfig; correctly scoped to ValkeyCluster only (standalone ValkeyNode uses buildManagedConfig which is unaffected), and last-value-wins placement makes it non-overridable by user config.
internal/controller/config_test.go	Adds a ContainSubstring assertion for "shutdown-on-sigterm failover" in the rendered config; matches the exact "key value\n" format produced by writeConfigLine.
docs/valkeycluster.md	Adds a Graceful shutdown section; default values cited (30s terminationGracePeriodSeconds, 5s cluster-manual-failover-timeout) are accurate per Valkey documentation and give correct margin guidance.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant K8s as Kubernetes
    participant Primary as Valkey Primary Pod
    participant Replica as Valkey Replica Pod

    K8s->>Primary: SIGTERM (node drain / eviction / preemption)
    Note over Primary: shutdown-on-sigterm failover triggers
    Primary->>Replica: CLUSTER FAILOVER (manual)
    Note over Replica: Pauses replication stream,<br/>catches up to primary offset
    Replica-->>Primary: Failover votes secured
    Note over Replica: Promoted to Primary<br/>(within cluster-manual-failover-timeout, default 5s)
    Primary->>Primary: Demoted to Replica
    Note over Primary: Normal shutdown sequence<br/>(within terminationGracePeriodSeconds, default 30s)
    Primary->>K8s: Pod exits cleanly
    Note over K8s: SIGKILL never sent<br/>(~25s margin with defaults)

%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant K8s as Kubernetes
    participant Primary as Valkey Primary Pod
    participant Replica as Valkey Replica Pod

    K8s->>Primary: SIGTERM (node drain / eviction / preemption)
    Note over Primary: shutdown-on-sigterm failover triggers
    Primary->>Replica: CLUSTER FAILOVER (manual)
    Note over Replica: Pauses replication stream,<br/>catches up to primary offset
    Replica-->>Primary: Failover votes secured
    Note over Replica: Promoted to Primary<br/>(within cluster-manual-failover-timeout, default 5s)
    Primary->>Primary: Demoted to Replica
    Note over Primary: Normal shutdown sequence<br/>(within terminationGracePeriodSeconds, default 30s)
    Primary->>K8s: Pod exits cleanly
    Note over K8s: SIGKILL never sent<br/>(~25s margin with defaults)

_{Reviews (2): Last reviewed commit: "docs: clarify shutdown-on-sigterm grace-..." | Re-trigger Greptile}

Spell out the actual defaults: the Kubernetes terminationGracePeriodSeconds of 30s comfortably exceeds the Valkey cluster-manual-failover-timeout default of 5s, so the failover completes out of the box. A grace-period bump is only needed when the failover timeout is raised. Signed-off-by: melancholictheory <selimvhorst@gmail.com>

melancholictheory · 2026-06-19T19:18:07Z

on the grace-period note: the numbers are the other way round. the default cluster-manual-failover-timeout is 5s, not 30s (config.c registers it with a 5000ms default), so with the Kubernetes default terminationGracePeriodSeconds of 30s there is roughly 25s of margin and the failover completes out of the box. updated the docs to spell out both defaults and to say you only need to bump the grace period if you raise the timeout.

on whether shutdown-on-sigterm should be overridable: fair distinction, it is more of an operational preference than the correctness settings it sits next to in getBaseConfig. the operator doesn't have an "overridable defaults" layer today, it is either enforced base config or user config. #248 framed it as injected by default, so i kept it enforced, but happy to move it to an overridable layer if you'd rather users could set nosave / default. @jdheyburn that one is a design call for you.

jdheyburn · 2026-06-22T09:03:13Z

Let's keep it enforced for now, if there is a need to override the defaults then we'll consider that when we get there.

jdheyburn

LGTM, thank you for updating the docs too!

jdheyburn · 2026-06-22T09:05:13Z

For the e2e test, I think a follow up issue is fine for now.

bjosv

Nice!

greptile-apps Bot reviewed Jun 19, 2026

View reviewed changes

Comment thread docs/valkeycluster.md Outdated

Comment thread internal/controller/config.go

jdheyburn approved these changes Jun 22, 2026

View reviewed changes

This was referenced Jun 22, 2026

E2E test for shutdown-on-sigterm failover (drain primary node, verify replica promotion) #270

Open

feat: configurable terminationGracePeriodSeconds for graceful failover #271

Open

bjosv approved these changes Jun 23, 2026

View reviewed changes

bjosv merged commit 21dd262 into valkey-io:main Jun 23, 2026
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: inject shutdown-on-sigterm failover by default#268

feat: inject shutdown-on-sigterm failover by default#268
bjosv merged 2 commits into
valkey-io:mainfrom
melancholictheory:feat/shutdown-on-sigterm-failover

melancholictheory commented Jun 19, 2026

Uh oh!

greptile-apps Bot commented Jun 19, 2026 •

edited

Loading

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

melancholictheory commented Jun 19, 2026

Uh oh!

jdheyburn commented Jun 22, 2026

Uh oh!

jdheyburn left a comment

Uh oh!

jdheyburn commented Jun 22, 2026

Uh oh!

bjosv left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

melancholictheory commented Jun 19, 2026

Summary

Implementation

Acceptance criteria

Testing

References

Checklist

Uh oh!

greptile-apps Bot commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

melancholictheory commented Jun 19, 2026

Uh oh!

jdheyburn commented Jun 22, 2026

Uh oh!

jdheyburn left a comment

Choose a reason for hiding this comment

Uh oh!

jdheyburn commented Jun 22, 2026

Uh oh!

bjosv left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

greptile-apps Bot commented Jun 19, 2026 •

edited

Loading