fix(valkey): switchover scripts iterate stale POD_FQDN_LIST after scale-out

## Problem

When a Valkey cluster is scaled out (e.g. 3 -> 4 replicas) and a targeted switchover is then issued to the freshly added replica, the OpsRequest fails with:

```
WARNING: could not confirm new primary within 300s
```

even though Sentinel has already promoted the fresh candidate, the topology after settling is correct, and `replica-priority` has been restored.

## Root cause

`addons/valkey/scripts/switchover.sh` iterates a member list sourced from the container env variable `VALKEY_POD_FQDN_LIST`, which is rendered into the pod environment at pod creation time via `componentVarRef.podFQDNs`. KubeBlocks does not refresh the container env of an existing pod after scale-out.

So when scale-out grows replicas from N to N+1, the **old primary's** action container still sees the old N-entry list. All iteration points in `switchover.sh` then miss the freshly added candidate:

- `set_priorities_with_candidate_bias()` — does not set `replica-priority=1` on the fresh candidate
- `restore_priorities()` — does not restore on the fresh candidate
- `wait_for_new_master()` — never probes the fresh candidate, so it cannot observe `role:master` even after Sentinel promotion
- `check_*` helpers that iterate the same list and therefore skip the fresh candidate too
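
The failure mode above can be reproduced in isolation. The snippet below is an illustration only, not code from `switchover.sh`: the FQDNs are made-up placeholders, and it assumes the list is comma-separated. It shows that a loop over the N-entry list never visits the replica added by scale-out.

```shell
# Illustration only: the FQDNs below are made-up placeholders.
# Assumption: the env list is comma-separated.
VALKEY_POD_FQDN_LIST="valkey-0.example,valkey-1.example,valkey-2.example"  # rendered at pod creation (N=3)
candidate_fqdn="valkey-3.example"                                          # replica added by scale-out

candidate_visited=false
old_ifs=$IFS
IFS=','
for fqdn in $VALKEY_POD_FQDN_LIST; do
  # Any per-member work (set priority, probe role) would happen here.
  if [ "$fqdn" = "$candidate_fqdn" ]; then
    candidate_visited=true
  fi
done
IFS=$old_ifs

echo "candidate visited: $candidate_visited"
```

Because the candidate is never visited, no per-member step (priority bias, priority restore, role probe) can reach it, which matches the timeout seen in the OpsRequest.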

## Fix

Introduce `pod_fqdns_with_candidate()`, which unions `KB_SWITCHOVER_CANDIDATE_FQDN` (passed at action time as `expected_fqdn` / `candidate_fqdn`) into the env list. All iteration points now consume the union list instead of reading `VALKEY_POD_FQDN_LIST` directly.
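
A minimal sketch of what such a union helper could look like. This is not the code from the change itself: it assumes `VALKEY_POD_FQDN_LIST` is comma-separated, and the FQDNs in the usage lines are placeholders.

```shell
# Sketch under assumptions: comma-separated list, candidate passed via
# KB_SWITCHOVER_CANDIDATE_FQDN. The actual helper may differ.
pod_fqdns_with_candidate() {
  list="${VALKEY_POD_FQDN_LIST:-}"
  candidate="${KB_SWITCHOVER_CANDIDATE_FQDN:-}"

  # No candidate given: fall back to the env list unchanged.
  if [ -z "$candidate" ]; then
    printf '%s\n' "$list"
    return 0
  fi

  case ",${list}," in
    # Candidate already present: do not add a duplicate.
    *",${candidate},"*)
      printf '%s\n' "$list"
      ;;
    *)
      if [ -z "$list" ]; then
        printf '%s\n' "$candidate"
      else
        printf '%s,%s\n' "$list" "$candidate"
      fi
      ;;
  esac
}

# Usage with placeholder values: a stale 3-entry list plus a new candidate.
VALKEY_POD_FQDN_LIST="valkey-0.example,valkey-1.example,valkey-2.example"
KB_SWITCHOVER_CANDIDATE_FQDN="valkey-3.example"
pod_fqdns_with_candidate
```

Appending only when the candidate is absent keeps the helper idempotent, so it behaves the same on a pod whose env list is already current.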

## Validation

- ShellSpec: 55 examples, 0 failures (`scripts-ut-spec/valkey_switchover_spec.sh`), with new cases covering stale-list scenarios.
- Broader live smoke test: 143 PASS / 4 FAIL / 2 SKIP; the 4 failures are environment/capability gaps, not product defects. Relevant cases: T09 (targeted switchover to a freshly scaled-out replica) passes on the first attempt, T14 (targeted switchover) ends with Ops Succeed and the candidate as primary, and T15 (Sentinel failover) behaves normally.
- Live chaos suite: **143 PASS / 0 FAIL / 0 SKIP**, covering master kill, all-sentinel kill, killing all 6 pods, rapid master kill, restart, scale-out/in during writes, and vscale during writes. The fix holds under concurrent writes and chaos.

## Same-pattern risk in other addons

Redis (`addons/redis/scripts/redis-switchover.sh`) follows the identical pattern, with `REDIS_POD_FQDN_LIST` and `SENTINEL_POD_FQDN_LIST` injected via `componentVarRef.podFQDNs`. The same iteration points (`set_redis_priorities`, `recover_redis_priorities`, `check_redis_kernel_status`, `check_switchover_result`) carry the same architectural risk. This PR does not modify Redis; that is left for a follow-up evaluation.

