Add chaos test suite for long-running fault injection#203
We still have work to do before a 1.0 according to these tests. 😅
| Filename | Overview |
|---|---|
| test/chaos/chaos_test.go | New 1113-line chaos test suite with fault injection scenarios; pauseWorkerNode is missing node deduplication (unlike the parallel network-partition functions), and rollingUpdate/deleteControllerPod call Eventually().Should() directly inside Inject, bypassing the main loop's error routing. |
| test/utils/chaos.go | New 743-line chaos utility file; HealNode uses iptables -F which flushes all filter-table rules (not just the two added by PartitionNode), and deleteRecreateCluster calls Eventually inside the scenario function rather than returning an error. |
| test/chaos/chaos_suite_test.go | Standard Ginkgo suite bootstrap: builds images, loads them into Kind, deploys operator, and sets up cert-manager. Straightforward and correct. |
| test/chaos/client/main.go | Minimal Go client that seeds keys then continuously overwrites them at a configurable RPS; logic is clean with no issues. |
| Makefile | Adds test-chaos target and parameterizes KIND_WORKERS; worker-count validation on existing clusters is a good guard. |
| .github/workflows/test.yml | Adds a compile-only CI step for the chaos package using the chaos build tag; no functional test execution in CI. |
| docs/chaos-testing.md | New documentation covering all environment variables, scenarios, and usage examples; comprehensive and accurate. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[BeforeSuite: build images, deploy operator] --> B[BeforeAll: parse env config, create ValkeyCluster]
B --> C[Seed test data via background client pod]
C --> D{Loop: pick scenario}
D -->|random / sequential| E[Log cluster state before]
E --> F[Inject fault]
F -->|error with skip:| D
F -->|inject error| G[Fail: log iteration + scenario + seed]
F -->|success| H[Eventually: wait for cluster recovery]
H -->|timeout| G
H -->|recovered| I{scenario.losesData?}
I -->|no| J[VerifyTestData]
J -->|fail| G
J -->|pass| D
I -->|yes| K[FlushAll + re-seed]
K --> D
Reviews (6): Last reviewed commit: "chaos: validate config hash, add test st..."
### Summary

During scale-in, `shardIndexFromState` could match a stale replica from a drained shard that temporarily appears in a remaining shard via gossip before `CLUSTER FORGET` propagates. This returns the wrong shard index, causing the controller to drain the wrong shards (e.g. draining 3 of 4 shards instead of just 1), which leads to an unrecoverable Reconciling state.

Fixed by checking the primary node first. The primary is the authoritative slot owner and its ValkeyNode CR always has the correct shard-index label. Stale replicas from drained shards are never primaries of remaining shards.

### Testing

This was found using #203 by running:

`CHAOS_SCENARIOS=scale-shards CHAOS_MIN_SHARDS=3 CHAOS_MAX_SHARDS=9 make test-chaos`

This repeatedly scales the cluster to a random shard count (between 3 and 9) and verifies that it recovers correctly each time, running until failure. Previously it failed within 3–20 iterations; now it passes 1000+ iterations without failure.

Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
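The "primary first" lookup described in this fix might look roughly like the sketch below. The `Node` type, the `labels` map, and the `shardIndexFor` function are hypothetical stand-ins for illustration; the real `shardIndexFromState` signature is not shown in this conversation, and the fallback to replicas is an assumption.

```go
package main

import "fmt"

// Node is a hypothetical view of one node reported inside a shard's state.
type Node struct {
	ID        string
	IsPrimary bool
}

// shardIndexFor returns the shard index for a group of nodes.
// labels maps node ID to the shard-index label of its (assumed) ValkeyNode CR.
func shardIndexFor(nodes []Node, labels map[string]int) (int, bool) {
	// Pass 1: the primary is the authoritative slot owner, so trust its label.
	for _, n := range nodes {
		if !n.IsPrimary {
			continue
		}
		if idx, ok := labels[n.ID]; ok {
			return idx, true
		}
	}
	// Pass 2 (assumed fallback): use any labeled replica if no primary matched.
	for _, n := range nodes {
		if idx, ok := labels[n.ID]; ok {
			return idx, true
		}
	}
	return 0, false
}

func main() {
	// A stale replica from drained shard 3 shows up first in shard 0 via gossip.
	nodes := []Node{{ID: "stale-replica"}, {ID: "primary-0", IsPrimary: true}}
	labels := map[string]int{"stale-replica": 3, "primary-0": 0}
	fmt.Println(shardIndexFor(nodes, labels))
}
```

A single pass that returns the first labeled node would pick index 3 here, which is the failure mode the summary describes.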
### Summary

This PR fixes two issues that together make rolling updates safe with and without PVCs.

1. Without persistence: restarting a replica destroys its cluster membership and data. The operator considered it Ready and immediately rolled the primary. With no synced replica to fail over to, all shard data was lost.
2. With persistence: the replica stays synced and proactive failover is attempted, but the `_operator` user ACL was missing `cluster|failover`. Every failover failed with NOPERM and the primary was rolled without a graceful handoff.

### Testing

Found using #203 by running the rolling-update scenario. It patches io-threads on the ValkeyCluster to trigger a rolling restart of all pods, then verifies that the cluster recovers and all keys are preserved.

```
# Without persistence
CHAOS_SCENARIOS=rolling-update make test-chaos
# With persistence
CHAOS_SCENARIOS=rolling-update CHAOS_PERSISTENCE=true make test-chaos
```

### Checklist

Before submitting the PR make sure the following are checked:

- [ ] This Pull Request is related to one issue.
- [x] Commit message explains what changed and why
- [ ] Tests are added or updated.
- [ ] Documentation files are updated.
- [x] I have run pre-commit locally (`pre-commit run --all-files` or hooks on commit)

---------

Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
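The first issue amounts to a missing gate before rolling the primary. A minimal sketch of such a gate, assuming a hypothetical `Replica` summary with `LinkUp` and `Synced` fields (the operator's actual readiness check is not shown in this conversation):

```go
package main

import "fmt"

// Replica is a hypothetical summary of one replica's replication status.
type Replica struct {
	Name   string
	LinkUp bool // replication link to the primary is established
	Synced bool // initial sync has completed
}

// canRollPrimary reports whether at least one replica could take over,
// i.e. whether restarting the primary would not lose the shard's data.
func canRollPrimary(replicas []Replica) bool {
	for _, r := range replicas {
		if r.LinkUp && r.Synced {
			return true
		}
	}
	return false
}

func main() {
	// A freshly restarted replica without persistence: Ready, but not yet synced.
	fmt.Println(canRollPrimary([]Replica{{Name: "replica-0", LinkUp: true}}))
	// The same replica once its sync has finished.
	fmt.Println(canRollPrimary([]Replica{{Name: "replica-0", LinkUp: true, Synced: true}}))
}
```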
…222)

This PR closes #216

### Summary

Add `cluster-allow-replica-migration=no` to prevent Valkey from moving replicas between shards autonomously, which conflicts with the operator's topology management.

Add `cluster-replica-validity-factor=0` so replicas always attempt failover regardless of disconnection time. With cluster-node-timeout at 2s, the default factor of 10 gives only a 30s window before replicas refuse to fail over, causing stuck clusters under disruption.

Reorder `PlanDrainMove` to check slot count before primary existence, preventing a spurious error on already-drained shards. Returning an error because of the missing primary would cause the caller (`drainExcessShards`) to propagate it up, halting scale-down and leaving stale ValkeyNodes around forever. Previously this problem resolved itself because `cluster-allow-replica-migration=yes` was in effect.

### Testing

This problem was found by using #203 and running on a machine with moderate load:

`CHAOS_MAX_SHARDS=15 CHAOS_SCENARIOS="scale-shards" make test-chaos`

### Checklist

Before submitting the PR make sure the following are checked:

- [x] This Pull Request is related to one issue.
- [x] Commit message explains what changed and why
- [ ] Tests are added or updated.
- [ ] Documentation files are updated.
- [x] I have run pre-commit locally (`pre-commit run --all-files` or hooks on commit)

Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
Introduce a standalone chaos test framework in test/chaos/ that
continuously injects faults into a ValkeyCluster until failure.
Scenarios include pod deletion, workload deletion, network partitions,
container pauses, shard scaling, rolling updates, controller pod
deletion, worker node pauses, and full cluster delete/recreate.
- Dedicated build tag (//go:build chaos) and Kind cluster
- New Makefile target: make test-chaos
- CI step to verify compilation
- Configurable via environment variables (scenarios, shards,
replicas, workload type, tolerations, CPU pressure ...)
- CPU pressure mode: randomly throttle Kind worker node CPUs per
iteration to simulate loaded nodes (CHAOS_CPU_PRESSURE=true)
- Network partition scenarios block all traffic (not just Valkey
ports) to simulate fully unreachable nodes
- Scenarios disabled by default: network-partition-primary,
network-partition-replica, pause-worker-node (require tolerations
for meaningful testing)
Examples:
# Run all default scenarios
make test-chaos
# Stress test random scaling between 3 and 9 shards
CHAOS_SCENARIOS=scale-shards CHAOS_MIN_SHARDS=3 CHAOS_MAX_SHARDS=9 make test-chaos
# Randomly alternate between pod deletion and scaling
CHAOS_SCENARIOS=delete-primary-pod,scale-shards make test-chaos
# Alternate between pod deletion and scaling in sequence
CHAOS_SCENARIOS=delete-primary-pod,scale-shards CHAOS_MODE=sequential make test-chaos
# Repeatedly kill the primary pod of a random shard
CHAOS_SCENARIOS=delete-primary-pod make test-chaos
# Stress test with CPU pressure
CHAOS_CPU_PRESSURE=true make test-chaos
# Test network partitions with pod evictions
CHAOS_SCENARIOS=network-partition-primary,pause-worker-node \
CHAOS_TOLERATION_SECONDS=10 make test-chaos
# Repeatedly delete pods across multiple shards
CHAOS_SCENARIOS=delete-multiple-shard-pods \
CHAOS_SHARDS=7 CHAOS_REPLICAS=2 KIND_WORKERS=3 make test-chaos
# Run with Deployment workload type
CHAOS_WORKLOAD_TYPE=Deployment make test-chaos
Note: Network partition, container pause, and CPU pressure scenarios
require Docker access to Kind worker nodes. These only work when
running against a local Kind cluster, not remote clusters.
Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
All scenarios now accept multiple target shards via CHAOS_TARGET_SHARDS (comma-separated indices, "all", or "random"). This removes the delete-multiple-shard-pods scenario by merging it into delete-shard-pods, and enables multi-shard testing for all fault injection scenarios. Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
… scenarios Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
…ary-pod Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
… writes
Add a lightweight Go client (test/chaos/client/) that seeds keys deterministically
and optionally maintains continuous writes at a configurable rate (CHAOS_WRITE_RPS).
This replaces valkey-benchmark for both seeding and background writes because
valkey-benchmark's {tag} rewriting produces non-deterministic key names,
making exact key count verification impossible.
The custom client:
- Seeds keys sequentially (key:000000000000 to key:000000099999)
- Overwrites the same keys in a loop (no extra keys created)
- Uses valkey-go cluster mode (auto-routing, auto-reconnect)
- Reports writes/errors every 5s via pod logs
- Supports configurable value size (CHAOS_DATA_SIZE)
Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
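The deterministic key scheme is what makes exact key-count verification possible. A minimal sketch of the naming and the overwrite cycle, with hypothetical helper names (`seedKey`, `overwriteKey`) and without any actual Valkey calls:

```go
package main

import "fmt"

// seedKey returns the deterministic name of the i-th key: "key:" followed by
// the index zero-padded to 12 digits.
func seedKey(i int) string {
	return fmt.Sprintf("key:%012d", i)
}

// overwriteKey returns the key targeted by the n-th background write. Writes
// cycle through the seeded range, so no extra keys are ever created.
func overwriteKey(n, totalKeys int) string {
	return seedKey(n % totalKeys)
}

func main() {
	const totalKeys = 100000
	fmt.Println(seedKey(0))                          // key:000000000000
	fmt.Println(seedKey(totalKeys - 1))              // key:000000099999
	fmt.Println(overwriteKey(totalKeys, totalKeys))  // key:000000000000
}
```

Because every write lands on a key in the seeded range, the expected key count after any fault stays equal to the seed count.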
    func pauseWorkerNode(ctx *ChaosContext) error {
    	// Eviction threshold: 40s (node-monitor-grace) + tolerationSeconds.
    	// Range spans below and above threshold to cover both eviction and non-eviction cases.
    	evictionThreshold := 40*time.Second + time.Duration(ctx.TolerationSec)*time.Second
    	duration := randomDuration(ctx.Rand, 3*time.Second, evictionThreshold+30*time.Second)
    	var paused []string
    	for _, shard := range ctx.TargetShards {
    		pod, err := utils.GetShardPrimaryPod(ctx.ClusterName, ctx.Namespace, shard)
    		if err != nil {
    			return err
    		}
    		nodeName, err := utils.GetPodNodeName(pod, ctx.Namespace)
    		if err != nil {
    			return err
    		}
    		_, _ = fmt.Fprintf(GinkgoWriter, " Pausing Kind node %s (primary pod: %s, shard %d) for %s\n", nodeName, pod, shard, duration.Truncate(time.Second))
    		logIfControllerNode(nodeName)
    		cmd := exec.Command("docker", "pause", nodeName)
    		if _, err := utils.Run(cmd); err != nil {
    			return err
    		}
    		paused = append(paused, nodeName)
    	}
    	time.Sleep(duration)
    	for _, nodeName := range paused {
    		_, _ = fmt.Fprintf(GinkgoWriter, " Unpausing Kind node %s\n", nodeName)
    		cmd := exec.Command("docker", "unpause", nodeName)
    		if _, err := utils.Run(cmd); err != nil {
    			return err
    		}
    	}
    	return nil
    }
pauseWorkerNode doesn't deduplicate nodes, leaving nodes paused on failure
When CHAOS_TARGET_SHARDS targets multiple shards that happen to share a Kind worker node (common with the default 2-worker setup and many shards), the second docker pause <nodeName> fails because the node is already paused. The function returns an error immediately, leaving the paused node in the paused slice — but the time.Sleep and unpause loop are never reached. The test then calls Fail(...), and the Kind node stays frozen for the rest of the suite.
Both networkPartitionPrimary and networkPartitionReplica handle this correctly with slices.Contains(nodes, nodeName) deduplication before acting. The same guard is missing here.
Summary
Introduce a standalone chaos test framework in `test/chaos/` that continuously injects faults into a `ValkeyCluster` until failure. Scenarios include pod deletion, workload deletion, network partitions, container pauses, worker node pauses, shard scaling, rolling updates, and full cluster delete/recreate.

The framework creates a `ValkeyCluster`, seeds test data, then enters a loop that randomly selects a fault scenario, runs it, waits for the cluster to recover to `Ready` state, and verifies data integrity. On failure it collects `CLUSTER NODES` output, pod logs, and resource state for debugging.

Examples:

    # Repeatedly kill the primary pod of a random shard
    CHAOS_SCENARIOS=delete-primary-pod make test-chaos
    # Stress test random scaling between 3 and 9 shards
    CHAOS_SCENARIOS=scale-shards CHAOS_MIN_SHARDS=3 CHAOS_MAX_SHARDS=9 make test-chaos
    # Randomly alternate between pod deletion and scaling
    CHAOS_SCENARIOS=delete-primary-pod,scale-shards make test-chaos
    # Alternate between pod deletion and scaling in sequence
    CHAOS_SCENARIOS=delete-primary-pod,scale-shards CHAOS_MODE=sequential make test-chaos
    # Delete and recreate the cluster repeatedly
    CHAOS_SCENARIOS=delete-recreate-cluster make test-chaos
    # Run all default scenarios with Deployment workload type
    CHAOS_WORKLOAD_TYPE=Deployment make test-chaos
    # Run with CPU pressure on worker nodes
    CHAOS_CPU_PRESSURE=true make test-chaos

Note: Network partition, container pause, and CPU pressure scenarios require Docker access to Kind worker nodes. These only work when running against a local Kind cluster, not remote clusters.
Scenarios
- `delete-primary-pod`: Kill the primary pod of a random shard
- `delete-replica-pod`: Kill a replica pod of targeted shards
- `delete-shard-pods`: Kill all pods in targeted shards
- `delete-primary-workload`: Delete the Deployment/StatefulSet of a primary
- `delete-replica-workload`: Delete the Deployment/StatefulSet of a replica
- `pause-primary-container`: Docker-pause a primary container
- `pause-replica-container`: Docker-pause a replica container
- `scale-shards`: Randomly scale the cluster up or down
- `rolling-update`: Change cluster config to trigger a rolling update
- `delete-recreate-cluster`: Delete the CR, wait for cleanup, recreate
- `delete-controller-pod`: Kill the operator controller pod
- `pause-worker-node`: Pause the Kind worker node hosting the primary (disabled by default)
- `network-partition-primary`: Isolate the primary's node (disabled by default)
- `network-partition-replica`: Isolate a replica's node (disabled by default)

Configuration
All behavior is controlled via environment variables:
- `CHAOS_TARGET_SHARDS`: Shards to target: random, all, or comma-separated indices (default: random)
- `CHAOS_WORKLOAD_TYPE`: Deployment or StatefulSet (default: StatefulSet)
- `CHAOS_SCENARIOS`: Comma-separated list to enable (default: all except disabled)
- `CHAOS_MODE`: random or sequential (default: random)
- `CHAOS_MIN_SHARDS` / `CHAOS_MAX_SHARDS`: Shard range for scaling
- `CHAOS_REPLICAS`: Number of replicas per shard
- `CHAOS_RECOVERY_TIMEOUT`: Max time to wait for recovery
- `CHAOS_TOLERATION_SECONDS`: Pod toleration seconds for not-ready/unreachable (0 = not set)
- `CHAOS_CPU_PRESSURE`: Throttle Kind worker node CPUs each iteration (default: false)
- `CHAOS_SEED`: Reproducible randomness

Other info
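As an illustration of the Configuration list above, the documented defaults could be applied when reading the environment roughly as follows. The `Config` struct and `loadConfig` function are hypothetical and cover only the variables whose defaults are stated; the lookup function is injected so the parsing can be tried without touching the real environment.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Config holds the subset of settings whose defaults are documented.
type Config struct {
	TargetShards string
	WorkloadType string
	Mode         string
	CPUPressure  bool
}

// loadConfig reads settings through getenv (os.Getenv in real use), applies
// the documented defaults, and rejects values outside the documented choices.
func loadConfig(getenv func(string) string) (Config, error) {
	orDefault := func(key, def string) string {
		if v := getenv(key); v != "" {
			return v
		}
		return def
	}
	cfg := Config{
		TargetShards: orDefault("CHAOS_TARGET_SHARDS", "random"),
		WorkloadType: orDefault("CHAOS_WORKLOAD_TYPE", "StatefulSet"),
		Mode:         orDefault("CHAOS_MODE", "random"),
	}
	if cfg.Mode != "random" && cfg.Mode != "sequential" {
		return Config{}, fmt.Errorf("CHAOS_MODE must be random or sequential, got %q", cfg.Mode)
	}
	if cfg.WorkloadType != "StatefulSet" && cfg.WorkloadType != "Deployment" {
		return Config{}, fmt.Errorf("CHAOS_WORKLOAD_TYPE must be Deployment or StatefulSet, got %q", cfg.WorkloadType)
	}
	pressure, err := strconv.ParseBool(orDefault("CHAOS_CPU_PRESSURE", "false"))
	if err != nil {
		return Config{}, fmt.Errorf("CHAOS_CPU_PRESSURE: %w", err)
	}
	cfg.CPUPressure = pressure
	return cfg, nil
}

func main() {
	cfg, err := loadConfig(os.Getenv)
	fmt.Printf("%+v %v\n", cfg, err)
}
```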
Checklist
Before submitting the PR make sure the following are checked:
- I have run pre-commit locally (`pre-commit run --all-files` or hooks on commit)