Add chaos test suite for long-running fault injection by bjosv · Pull Request #203 · valkey-io/valkey-operator

bjosv · 2026-05-28T09:53:50Z

Summary

Introduce a standalone chaos test framework in test/chaos/ that continuously injects faults into a ValkeyCluster until failure. Scenarios include pod deletion, workload deletion, network partitions, container pauses, worker node pauses, shard scaling, rolling updates, and full cluster delete/recreate.

The framework creates a ValkeyCluster, seeds test data, then enters a loop that selects a fault scenario (randomly by default), runs it, waits for the cluster to recover to the Ready state, and verifies data integrity. On failure it collects CLUSTER NODES output, pod logs, and resource state for debugging.

Examples:

# Repeatedly kill the primary pod of a random shard
CHAOS_SCENARIOS=delete-primary-pod make test-chaos
# Stress test random scaling between 3 and 9 shards
CHAOS_SCENARIOS=scale-shards CHAOS_MIN_SHARDS=3 CHAOS_MAX_SHARDS=9 make test-chaos
# Randomly alternate between pod deletion and scaling
CHAOS_SCENARIOS=delete-primary-pod,scale-shards make test-chaos
# Alternate between pod deletion and scaling in sequence
CHAOS_SCENARIOS=delete-primary-pod,scale-shards CHAOS_MODE=sequential make test-chaos
# Delete and recreate the cluster repeatedly
CHAOS_SCENARIOS=delete-recreate-cluster make test-chaos
# Run all default scenarios with Deployment workload type
CHAOS_WORKLOAD_TYPE=Deployment make test-chaos
# Run with CPU pressure on worker nodes
CHAOS_CPU_PRESSURE=true make test-chaos

Note: Network partition, container pause, and CPU pressure scenarios require Docker access to Kind worker nodes. These only work when running against a local Kind cluster, not remote clusters.

Scenarios

delete-primary-pod: Kill the primary pod of a random shard
delete-replica-pod: Kill a replica pod of targeted shards
delete-shard-pods: Kill all pods in targeted shards
delete-primary-workload: Delete the Deployment/StatefulSet of a primary
delete-replica-workload: Delete the Deployment/StatefulSet of a replica
pause-primary-container: Docker-pause a primary container
pause-replica-container: Docker-pause a replica container
scale-shards: Randomly scale the cluster up or down
rolling-update: Change the cluster config to trigger a rolling update
delete-recreate-cluster: Delete the CR, wait for cleanup, recreate
delete-controller-pod: Kill the operator controller pod
pause-worker-node: Pause the Kind worker node hosting the primary (disabled by default)
network-partition-primary: Isolate the primary's node (disabled by default)
network-partition-replica: Isolate a replica's node (disabled by default)

Configuration

All behavior is controlled via environment variables:

CHAOS_TARGET_SHARDS: Shards to target: random, all, or comma-separated indices (default: random)
CHAOS_WORKLOAD_TYPE: Deployment or StatefulSet (default: StatefulSet)
CHAOS_SCENARIOS: Comma-separated list to enable (default: all except disabled)
CHAOS_MODE: random or sequential (default: random)
CHAOS_MIN_SHARDS / CHAOS_MAX_SHARDS: Shard range for scaling
CHAOS_REPLICAS: Number of replicas per shard
CHAOS_RECOVERY_TIMEOUT: Max time to wait for recovery
CHAOS_TOLERATION_SECONDS: Pod toleration seconds for not-ready/unreachable (0 = not set)
CHAOS_CPU_PRESSURE: Throttle Kind worker node CPUs each iteration (default: false)
CHAOS_SEED: Reproducible randomness

Other info

Dedicated build tag (//go:build chaos) and Kind cluster
New Makefile target: make test-chaos
CI step to verify compilation

Checklist

Before submitting the PR make sure the following are checked:

This Pull Request is related to one issue.
Commit message explains what changed and why
Tests are added or updated.
Documentation files are updated.
I have run pre-commit locally (pre-commit run --all-files or hooks on commit)

bjosv · 2026-05-28T09:57:59Z

We still have work to do before a 1.0 according to these tests.. 😅

greptile-apps · 2026-05-28T10:06:13Z

Greptile Summary

This PR introduces a standalone chaos test framework (test/chaos/) that continuously injects faults into a ValkeyCluster until failure, covering pod deletion, network partitions, container pauses, node pauses, shard/replica scaling, rolling updates, and full cluster delete/recreate.

A new test-chaos Makefile target spins up a dedicated Kind cluster, deploys the operator, seeds test data via a custom Go client pod, and loops indefinitely through randomly or sequentially selected fault scenarios, verifying recovery and data integrity after each one.
Fourteen named scenarios are registered in allScenarios; compound faults (+-separated), CPU pressure throttling, and configurable target-shard selection are supported via environment variables.
A CI step verifies the package compiles with the chaos build tag, but does not execute the tests in CI.

Confidence Score: 3/5

The chaos framework is well-structured but has several defects in fault-injection and cleanup paths that would cause stuck Kind nodes or silent test misbehaviour during actual runs.

The pauseWorkerNode function attempts to pause each target shard's host node without deduplicating, unlike networkPartitionPrimary, which explicitly does so. When two target shards land on the same worker (the common case with 2 workers and many shards), the second docker pause fails and the function returns early, leaving the node frozen with no cleanup path. Additionally, rollingUpdate and deleteControllerPod call Eventually().Should() directly inside their Inject functions, so a failure there bypasses the main loop and is reported without the iteration/scenario/seed context. HealNode runs iptables -F, which flushes all filter-table rules, including CNI rules. Finally, scaleShards scales below MinShards when both bounds are equal.

test/chaos/chaos_test.go and test/utils/chaos.go contain the defects in fault injection and cleanup logic.

Important Files Changed

| Filename | Overview |
| --- | --- |
| test/chaos/chaos_test.go | New 1113-line chaos test suite with fault injection scenarios; `pauseWorkerNode` is missing node deduplication (unlike the parallel network-partition functions), and `rollingUpdate`/`deleteControllerPod` call `Eventually().Should()` directly inside `Inject`, bypassing the main loop's error routing. |
| test/utils/chaos.go | New 743-line chaos utility file; `HealNode` uses `iptables -F` which flushes all filter-table rules (not just the two added by PartitionNode), and `deleteRecreateCluster` calls `Eventually` inside the scenario function rather than returning an error. |
| test/chaos/chaos_suite_test.go | Standard Ginkgo suite bootstrap: builds images, loads them into Kind, deploys operator, and sets up cert-manager. Straightforward and correct. |
| test/chaos/client/main.go | Minimal Go client that seeds keys then continuously overwrites them at a configurable RPS; logic is clean with no issues. |
| Makefile | Adds `test-chaos` target and parameterizes `KIND_WORKERS`; worker-count validation on existing clusters is a good guard. |
| .github/workflows/test.yml | Adds a compile-only CI step for the chaos package using the `chaos` build tag; no functional test execution in CI. |
| docs/chaos-testing.md | New documentation covering all environment variables, scenarios, and usage examples; comprehensive and accurate. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[BeforeSuite: build images, deploy operator] --> B[BeforeAll: parse env config, create ValkeyCluster]
    B --> C[Seed test data via background client pod]
    C --> D{Loop: pick scenario}
    D -->|random / sequential| E[Log cluster state before]
    E --> F[Inject fault]
    F -->|error with skip:| D
    F -->|inject error| G[Fail: log iteration + scenario + seed]
    F -->|success| H[Eventually: wait for cluster recovery]
    H -->|timeout| G
    H -->|recovered| I{scenario.losesData?}
    I -->|no| J[VerifyTestData]
    J -->|fail| G
    J -->|pass| D
    I -->|yes| K[FlushAll + re-seed]
    K --> D

Reviews (6): Last reviewed commit: "chaos: validate config hash, add test st..."

### Summary

During scale-in, `shardIndexFromState` could match a stale replica from a drained shard that temporarily appears in a remaining shard via gossip before `CLUSTER FORGET` propagates. This returns the wrong shard index, causing the controller to drain the wrong shards (for example, draining 3 shards out of 4 instead of just 1), which leads to an unrecoverable Reconciling state.

Fixed by checking the primary node first. The primary is the authoritative slot owner and its ValkeyNode CR always has the correct shard-index label. Stale replicas from drained shards are never primaries of remaining shards.

### Testing

This was found using #203, running:

`CHAOS_SCENARIOS=scale-shards CHAOS_MIN_SHARDS=3 CHAOS_MAX_SHARDS=9 make test-chaos`

This repeatedly scales the cluster to a random shard count (between 3 and 9) and verifies it recovers correctly each time, running until failure. It previously failed within 3–20 iterations; it now passes 1000+ iterations without failure.

Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>

### Summary

This PR fixes two issues that together make rolling updates safe with and without PVCs.

1. Without persistence: restarting a replica destroys its cluster membership and data. The operator considered it Ready and immediately rolled the primary. With no synced replica to fail over to, all shard data was lost.
2. With persistence: the replica stays synced and proactive failover is attempted, but the `_operator` user ACL was missing `cluster|failover`. Every failover failed with NOPERM and the primary was rolled without a graceful handoff.

### Testing

Found using #203 running the rolling-update scenario. It patches io-threads on the ValkeyCluster to trigger a rolling restart of all pods, then verifies the cluster recovers and all keys are preserved.

```
# Without persistence
CHAOS_SCENARIOS=rolling-update make test-chaos
# With persistence
CHAOS_SCENARIOS=rolling-update CHAOS_PERSISTENCE=true make test-chaos
```

### Checklist

Before submitting the PR make sure the following are checked:

- [ ] This Pull Request is related to one issue.
- [x] Commit message explains what changed and why
- [ ] Tests are added or updated.
- [ ] Documentation files are updated.
- [x] I have run pre-commit locally (`pre-commit run --all-files` or hooks on commit)

---------

Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>

…222)

This PR closes #216

### Summary

Add `cluster-allow-replica-migration=no` to prevent Valkey from moving replicas between shards autonomously, which conflicts with the operator's topology management.

Add `cluster-replica-validity-factor=0` so replicas always attempt failover regardless of disconnection time. With cluster-node-timeout at 2s, the default factor of 10 gives only a 30s window before replicas refuse to failover, causing stuck clusters under disruption.

Reorder `PlanDrainMove` to check slot count before primary existence, preventing a spurious error on already-drained shards. Returning an error due to the primary would cause the caller (drainExcessShards) to propagate it up, halting scale-down and leaving stale ValkeyNodes around forever. This problem solved itself previously when `cluster-allow-replica-migration=yes`.

### Testing

This problem was found by using #203 and running on a machine with moderate load:

`CHAOS_MAX_SHARDS=15 CHAOS_SCENARIOS="scale-shards" make test-chaos`

### Checklist

Before submitting the PR make sure the following are checked:

- [x] This Pull Request is related to one issue.
- [x] Commit message explains what changed and why
- [ ] Tests are added or updated.
- [ ] Documentation files are updated.
- [x] I have run pre-commit locally (`pre-commit run --all-files` or hooks on commit)

Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>

Introduce a standalone chaos test framework in test/chaos/ that continuously injects faults into a ValkeyCluster until failure. Scenarios include pod deletion, workload deletion, network partitions, container pauses, shard scaling, rolling updates, controller pod deletion, worker node pauses, and full cluster delete/recreate.

- Dedicated build tag (//go:build chaos) and Kind cluster
- New Makefile target: make test-chaos
- CI step to verify compilation
- Configurable via environment variables (scenarios, shards, replicas, workload type, tolerations, CPU pressure ...)
- CPU pressure mode: randomly throttle Kind worker node CPUs per iteration to simulate loaded nodes (CHAOS_CPU_PRESSURE=true)
- Network partition scenarios block all traffic (not just Valkey ports) to simulate fully unreachable nodes
- Scenarios disabled by default: network-partition-primary, network-partition-replica, pause-worker-node (require tolerations for meaningful testing)

Examples:

# Run all default scenarios
make test-chaos
# Stress test random scaling between 3 and 9 shards
CHAOS_SCENARIOS=scale-shards CHAOS_MIN_SHARDS=3 CHAOS_MAX_SHARDS=9 make test-chaos
# Randomly alternate between pod deletion and scaling
CHAOS_SCENARIOS=delete-primary-pod,scale-shards make test-chaos
# Alternate between pod deletion and scaling in sequence
CHAOS_SCENARIOS=delete-primary-pod,scale-shards CHAOS_MODE=sequential make test-chaos
# Repeatedly kill the primary pod of a random shard
CHAOS_SCENARIOS=delete-primary-pod make test-chaos
# Stress test with CPU pressure
CHAOS_CPU_PRESSURE=true make test-chaos
# Test network partitions with pod evictions
CHAOS_SCENARIOS=network-partition-primary,pause-worker-node \
CHAOS_TOLERATION_SECONDS=10 make test-chaos
# Repeatedly delete pods across multiple shards
CHAOS_SCENARIOS=delete-multiple-shard-pods \
CHAOS_SHARDS=7 CHAOS_REPLICAS=2 KIND_WORKERS=3 make test-chaos
# Run with Deployment workload type
CHAOS_WORKLOAD_TYPE=Deployment make test-chaos

Note: Network partition, container pause, and CPU pressure scenarios require Docker access to Kind worker nodes. These only work when running against a local Kind cluster, not remote clusters.

Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>