Skip to content

fix: check primary node first when resolving shard index#207

Merged
bjosv merged 1 commit into
valkey-io:mainfrom
Nordix:fix-scale-shards
May 31, 2026
Merged

fix: check primary node first when resolving shard index#207
bjosv merged 1 commit into
valkey-io:mainfrom
Nordix:fix-scale-shards

Conversation

@bjosv

@bjosv bjosv commented May 29, 2026

Copy link
Copy Markdown
Collaborator

Summary

During scale-in, shardIndexFromState could match a stale replica from a drained shard that temporarily appears in a remaining shard via gossip before CLUSTER FORGET propagates. This returns the wrong shard index, causing the controller to drain the wrong shards, e.g. draining 3 shards out of 4 total, instead of just draining 1 shard, leading to an unrecoverable Reconciling state.

Fixed by checking the primary node first. The primary is the authoritative slot owner and its ValkeyNode CR always has the correct shard-index label. Stale replicas from drained shards are never primaries of remaining shards.

Testing

This has been found using #203 running:
CHAOS_SCENARIOS=scale-shards CHAOS_MIN_SHARDS=3 CHAOS_MAX_SHARDS=9 make test-chaos

This repeatedly scales the cluster to a random shard count (between 3–9) and verifies it recovers correctly each time, running until failure.

Previously failed within 3–20 iterations; now passes 1000+ without failure.

Checklist

Before submitting the PR make sure the following are checked:

  • This Pull Request is related to one issue.
  • Commit message explains what changed and why
  • Tests are added or updated.
  • Documentation files are updated.
  • I have run pre-commit locally (pre-commit run --all-files or hooks on commit)

During scale-in, shardIndexFromState could match a stale replica from a
drained shard that temporarily appears in a remaining shard via gossip
before CLUSTER FORGET propagates. This returns the wrong shard index,
causing the controller to drain the wrong shards — e.g. draining 3 of 4
instead of 1, leading to an unrecoverable Reconciling state.

Fix by checking the primary node first. The primary is the authoritative
slot owner and its ValkeyNode CR always has the correct shard-index label.
Stale replicas from drained shards are never primaries of remaining shards.

Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
@greptile-apps

greptile-apps Bot commented May 29, 2026

Copy link
Copy Markdown

Greptile Summary

This PR fixes a scale-in race condition in drainExcessShards where shardIndexFromState could return a wrong shard index because a stale gossip replica from a drained shard temporarily appeared as a node in a remaining shard, before CLUSTER FORGET propagated. The fix makes the function consult the primary node first — which is the authoritative slot owner and always holds the correct shard-index label — before falling back to iterating all nodes.

  • A primary-first lookup is inserted at the top of shardIndexFromState; if the primary's address resolves to a ValkeyNode, that shard index is returned immediately without inspecting replicas.
  • The pre-existing fallback loop over all shard.Nodes is kept for the case where the primary is unavailable, so the fix does not break the no-primary path but also does not fully close that fallback's exposure to the same race in edge cases.
  • No unit test is added for the corrected ordering; a regression test that places a stale replica first in Nodes would lock in this invariant more reliably than chaos testing alone.

Confidence Score: 4/5

Safe to merge — the change is a targeted, minimal fix to a real scale-in bug with strong empirical validation (1000+ chaos iterations).

The primary-first lookup correctly addresses the root cause for the common case. The retained fallback loop is only reached when the primary pod IP has not yet propagated to the ValkeyNode status, a narrow window, but it is the same code path that triggered the original bug. No unit test covers the corrected ordering, so a future refactor could reintroduce the regression without an obvious signal.

internal/controller/valkeycluster_controller.go — specifically the fallback loop in shardIndexFromState and the absence of a unit test for the primary-first ordering guarantee.

Important Files Changed

Filename Overview
internal/controller/valkeycluster_controller.go Adds primary-first lookup in shardIndexFromState to avoid misidentifying a shard as draining during scale-in due to stale gossip replicas; fallback to full-node iteration is retained and could theoretically still exhibit the race in a narrow edge case.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["shardIndexFromState(shard, nodes)"] --> B{GetPrimaryNode non-nil?}
    B -- "yes" --> C["nodeRoleAndShard(primary.Address, nodes)"]
    C --> D{shardIndex >= 0?}
    D -- "yes" --> E["return shardIndex ✓ (authoritative result)"]
    D -- "no (primary IP not in ValkeyNode list yet)" --> F
    B -- "no (PrimaryId not in shard.Nodes)" --> F
    F["iterate shard.Nodes"] --> G["nodeRoleAndShard(node.Address, nodes)"]
    G --> H{shardIndex >= 0?}
    H -- "yes" --> I["return shardIndex (may still match stale replica in edge case)"]
    H -- "no" --> J["next node"]
    J --> G
    J -- "exhausted" --> K["return -1"]
Loading

Comments Outside Diff (2)

  1. internal/controller/valkeycluster_controller.go, line 1213-1226 (link)

    P2 Fallback loop still susceptible to the original race

    The primary-first check correctly guards the common path, but when GetPrimaryNode() returns non-nil and yet nodeRoleAndShard returns -1 (primary address not yet in the ValkeyNode list), the code falls through to the full shard.Nodes loop. That loop is the exact path that triggered the original gossip-race bug: a stale replica from the drained shard whose ValkeyNode CR still carries the old shard-index will produce a wrong result.

    In practice this secondary path is narrow — it only fires when the primary pod IP hasn't propagated to the ValkeyNode status yet — but it means the fix doesn't fully eliminate the race condition; it narrows its window. Worth noting in the comment or, better, logging a warning before entering the fallback so the case is observable in production.

  2. internal/controller/valkeycluster_controller.go, line 1213-1226 (link)

    P2 No unit test for the primary-first ordering

    The PR checklist marks "Tests are added or updated" as unchecked, and no test for shardIndexFromState exists anywhere in the controller test files. Given this is a subtle ordering fix where a stale-replica in shard.Nodes at index 0 causes the wrong return value, a unit test that constructs a ShardState with a correct primary and a stale replica node placed first in Nodes — and asserts the function returns the primary's shard index — would lock in the invariant and prevent regression without requiring chaos testing.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Reviews (1): Last reviewed commit: "fix: check primary node first when resol..." | Re-trigger Greptile

@jdheyburn jdheyburn left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thank you! The chaos testing suite really was worth the time you put into it!

@bjosv bjosv merged commit 37421b7 into valkey-io:main May 31, 2026
8 checks passed
@bjosv bjosv deleted the fix-scale-shards branch May 31, 2026 19:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants