[bug] Handle quorum loss with persistence enabled (stuck/unschedulable pods) #256

Description

@bjosv

Not sure this is a top priority, but recording it here: the issue was identified while working on PR #244.

Currently, promoteOrphanedReplicas skips TAKEOVER when persistence is configured, assuming the crashed pod will return with the same node ID and data. This is correct for transient failures (OOMKill, node reboot), but leaves the cluster stuck in a degraded state when the primary cannot come back:

  • Pod unschedulable (node drained, resource pressure, taint added)
  • PVC bound to a failed node (local storage, zone loss)
  • PVC manually deleted
  • Node permanently removed from the cluster

Expected behavior:
After a configurable timeout, or based on pod status checks, the operator should detect that the primary is not coming back and promote a replica via TAKEOVER (accepting potential data loss for the unreplicated writes).

Alternatives:

  1. Timeout-based: If a primary remains in the fail state for longer than a threshold (e.g., spec.failoverTimeout, default 5m), proceed with TAKEOVER even with persistence enabled.
  2. Pod-status-based: Check if the primary's pod is in an unrecoverable state (Unschedulable, PVC Lost, node deleted) and trigger TAKEOVER immediately.
  3. Annotation escape hatch: Allow manual intervention via valkey.io/force-takeover: "true" on the ValkeyCluster resource.
  4. ...

Risks:
TAKEOVER trades consensus for liveness. If the "dead" primary is actually partitioned (still serving clients on the other side of a network split), promoting a replica creates two owners for the same slots with competing epochs. With persistence, the old primary could resume serving stale data when the partition heals.

To mitigate this, the operator should:

  1. Confirm via the Kubernetes API that the old primary is actually gone (pod deleted, node gone), not just unreachable over the cluster bus.
  2. Fence the old primary (delete its pod) before issuing TAKEOVER, preventing it from coming back as a competing primary.

Context:
