fix: check primary node first when resolving shard index #207
During scale-in, shardIndexFromState could match a stale replica from a drained shard that temporarily appears in a remaining shard via gossip before CLUSTER FORGET propagates. This returns the wrong shard index, causing the controller to drain the wrong shards — e.g. draining 3 of 4 instead of 1, leading to an unrecoverable Reconciling state. Fix by checking the primary node first. The primary is the authoritative slot owner and its ValkeyNode CR always has the correct shard-index label. Stale replicas from drained shards are never primaries of remaining shards. Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
| Filename | Overview |
|---|---|
| internal/controller/valkeycluster_controller.go | Adds primary-first lookup in shardIndexFromState to avoid misidentifying a shard as draining during scale-in due to stale gossip replicas; fallback to full-node iteration is retained and could theoretically still exhibit the race in a narrow edge case. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["shardIndexFromState(shard, nodes)"] --> B{GetPrimaryNode non-nil?}
B -- "yes" --> C["nodeRoleAndShard(primary.Address, nodes)"]
C --> D{shardIndex >= 0?}
D -- "yes" --> E["return shardIndex ✓ (authoritative result)"]
D -- "no (primary IP not in ValkeyNode list yet)" --> F
B -- "no (PrimaryId not in shard.Nodes)" --> F
F["iterate shard.Nodes"] --> G["nodeRoleAndShard(node.Address, nodes)"]
G --> H{shardIndex >= 0?}
H -- "yes" --> I["return shardIndex (may still match stale replica in edge case)"]
H -- "no" --> J["next node"]
J --> G
J -- "exhausted" --> K["return -1"]
Comments Outside Diff (2)
-
internal/controller/valkeycluster_controller.go, line 1213-1226: Fallback loop still susceptible to the original race
The primary-first check correctly guards the common path, but when GetPrimaryNode() returns non-nil and yet nodeRoleAndShard returns -1 (primary address not yet in the ValkeyNode list), the code falls through to the full shard.Nodes loop. That loop is the exact path that triggered the original gossip-race bug: a stale replica from the drained shard whose ValkeyNode CR still carries the old shard-index will produce a wrong result.

In practice this secondary path is narrow: it only fires when the primary pod IP hasn't propagated to the ValkeyNode status yet. Still, it means the fix doesn't fully eliminate the race condition; it narrows its window. Worth noting in the code comment or, better, logging a warning before entering the fallback so the case is observable in production.
-
internal/controller/valkeycluster_controller.go, line 1213-1226: No unit test for the primary-first ordering
The PR checklist marks "Tests are added or updated" as unchecked, and no test for shardIndexFromState exists anywhere in the controller test files. Given this is a subtle ordering fix where a stale replica in shard.Nodes at index 0 causes the wrong return value, a unit test would lock in the invariant and prevent regression without requiring chaos testing. Such a test would construct a ShardState with a correct primary and a stale replica node placed first in Nodes, and assert that the function returns the primary's shard index.
jdheyburn left a comment
Makes sense, thank you! The chaos testing suite really was worth the time you put into it!
Summary
During scale-in, shardIndexFromState could match a stale replica from a drained shard that temporarily appears in a remaining shard via gossip before CLUSTER FORGET propagates. This returns the wrong shard index, causing the controller to drain the wrong shards, e.g. draining 3 shards out of 4 total, instead of just draining 1 shard, leading to an unrecoverable Reconciling state.

Fixed by checking the primary node first. The primary is the authoritative slot owner and its ValkeyNode CR always has the correct shard-index label. Stale replicas from drained shards are never primaries of remaining shards.
Testing
This was found using #203, running:
CHAOS_SCENARIOS=scale-shards CHAOS_MIN_SHARDS=3 CHAOS_MAX_SHARDS=9 make test-chaos
This repeatedly scales the cluster to a random shard count (between 3–9) and verifies it recovers correctly each time, running until failure.
Previously failed within 3–20 iterations; now passes 1000+ without failure.
Checklist
Before submitting the PR make sure the following are checked:
pre-commit run --all-files or hooks on commit)