Add chaos test suite for long-running fault injection#203

Open
bjosv wants to merge 11 commits into valkey-io:main from Nordix:chaos-testing

@bjosv bjosv commented May 28, 2026

Collaborator

Summary

Introduce a standalone chaos test framework in test/chaos/ that continuously injects faults into a ValkeyCluster until failure. Scenarios include pod deletion, workload deletion, network partitions, container pauses, worker node pauses, shard scaling, rolling updates, and full cluster delete/recreate.

The framework creates a ValkeyCluster, seeds test data, then enters a loop that randomly selects a fault scenario, runs it, waits for the cluster to recover to the Ready state, and verifies data integrity. On failure it collects CLUSTER NODES output, pod logs, and resource state for debugging.

Examples:

  • Repeatedly kill the primary pod of a random shard
    CHAOS_SCENARIOS=delete-primary-pod make test-chaos

  • Stress test random scaling between 3 and 9 shards
    CHAOS_SCENARIOS=scale-shards CHAOS_MIN_SHARDS=3 CHAOS_MAX_SHARDS=9 make test-chaos

  • Randomly alternate between pod deletion and scaling
    CHAOS_SCENARIOS=delete-primary-pod,scale-shards make test-chaos

  • Alternate between pod deletion and scaling in sequence
    CHAOS_SCENARIOS=delete-primary-pod,scale-shards CHAOS_MODE=sequential make test-chaos

  • Delete and recreate the cluster repeatedly
    CHAOS_SCENARIOS=delete-recreate-cluster make test-chaos

  • Run all default scenarios with Deployment workload type
    CHAOS_WORKLOAD_TYPE=Deployment make test-chaos

  • Run with CPU pressure on worker nodes
    CHAOS_CPU_PRESSURE=true make test-chaos

Note: Network partition, container pause, and CPU pressure scenarios require Docker access to Kind worker nodes. These only work when running against a local Kind cluster, not remote clusters.

Scenarios

  • delete-primary-pod: Kill the primary pod of a random shard
  • delete-replica-pod: Kill a replica pod of targeted shards
  • delete-shard-pods: Kill all pods in targeted shards
  • delete-primary-workload: Delete the Deployment/StatefulSet of a primary
  • delete-replica-workload: Delete the Deployment/StatefulSet of a replica
  • pause-primary-container: Docker-pause a primary container
  • pause-replica-container: Docker-pause a replica container
  • scale-shards: Randomly scale the cluster up or down
  • rolling-update: Change cluster config to trigger a rolling update
  • delete-recreate-cluster: Delete the CR, wait for cleanup, recreate
  • delete-controller-pod: Kill the operator controller pod
  • pause-worker-node: Pause the Kind worker node hosting the primary (disabled by default)
  • network-partition-primary: Isolate the primary's node (disabled by default)
  • network-partition-replica: Isolate a replica's node (disabled by default)

Configuration

All behavior is controlled via environment variables:

  • CHAOS_TARGET_SHARDS: Shards to target: random, all, or comma-separated indices (default: random)
  • CHAOS_WORKLOAD_TYPE: Deployment or StatefulSet (default: StatefulSet)
  • CHAOS_SCENARIOS: Comma-separated list to enable (default: all except disabled)
  • CHAOS_MODE: random or sequential (default: random)
  • CHAOS_MIN_SHARDS / CHAOS_MAX_SHARDS: Shard range for scaling
  • CHAOS_REPLICAS: Number of replicas per shard
  • CHAOS_RECOVERY_TIMEOUT: Max time to wait for recovery
  • CHAOS_TOLERATION_SECONDS: Pod toleration seconds for not-ready/unreachable (0 = not set)
  • CHAOS_CPU_PRESSURE: Throttle Kind worker node CPUs each iteration (default: false)
  • CHAOS_SEED: Reproducible randomness
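
As an illustration of how such environment-driven configuration might be read, here is a minimal sketch covering a few of the variables above with their documented defaults. The `Config` struct and `loadConfig` function are hypothetical names, not taken from this PR; the lookup function is injected so the parsing can be tested without touching the process environment.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// Config holds a subset of the documented settings (illustrative only).
type Config struct {
	Scenarios    []string // empty means "use the default set"
	Mode         string   // "random" or "sequential"
	WorkloadType string   // "StatefulSet" or "Deployment"
	CPUPressure  bool
	Seed         int64 // 0 means "not set"
}

// loadConfig reads settings through lookup, which has the same shape as
// os.LookupEnv. Unset or empty variables fall back to the documented defaults.
func loadConfig(lookup func(string) (string, bool)) (Config, error) {
	get := func(key, def string) string {
		if v, ok := lookup(key); ok && v != "" {
			return v
		}
		return def
	}
	cfg := Config{
		Mode:         get("CHAOS_MODE", "random"),
		WorkloadType: get("CHAOS_WORKLOAD_TYPE", "StatefulSet"),
	}
	if cfg.Mode != "random" && cfg.Mode != "sequential" {
		return cfg, fmt.Errorf("CHAOS_MODE: unsupported value %q", cfg.Mode)
	}
	if cfg.WorkloadType != "StatefulSet" && cfg.WorkloadType != "Deployment" {
		return cfg, fmt.Errorf("CHAOS_WORKLOAD_TYPE: unsupported value %q", cfg.WorkloadType)
	}
	if raw := get("CHAOS_SCENARIOS", ""); raw != "" {
		for _, s := range strings.Split(raw, ",") {
			if s = strings.TrimSpace(s); s != "" {
				cfg.Scenarios = append(cfg.Scenarios, s)
			}
		}
	}
	var err error
	if cfg.CPUPressure, err = strconv.ParseBool(get("CHAOS_CPU_PRESSURE", "false")); err != nil {
		return cfg, fmt.Errorf("CHAOS_CPU_PRESSURE: %w", err)
	}
	if raw := get("CHAOS_SEED", ""); raw != "" {
		if cfg.Seed, err = strconv.ParseInt(raw, 10, 64); err != nil {
			return cfg, fmt.Errorf("CHAOS_SEED: %w", err)
		}
	}
	return cfg, nil
}

func main() {
	cfg, err := loadConfig(os.LookupEnv)
	fmt.Printf("%+v %v\n", cfg, err)
}
```

Rejecting unknown values early gives a clear error before any cluster is created, rather than a confusing failure several iterations in.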

Other info

  • Dedicated build tag (//go:build chaos) and Kind cluster
  • New Makefile target: make test-chaos
  • CI step to verify compilation

Checklist

Before submitting the PR make sure the following are checked:

  • This Pull Request is related to one issue.
  • Commit message explains what changed and why
  • Tests are added or updated.
  • Documentation files are updated.
  • I have run pre-commit locally (pre-commit run --all-files or hooks on commit)

bjosv commented May 28, 2026

Collaborator Author

We still have work to do before a 1.0 according to these tests.. 😅

greptile-apps Bot commented May 28, 2026


Greptile Summary

This PR introduces a standalone chaos test framework (test/chaos/) that continuously injects faults into a ValkeyCluster until failure, covering pod deletion, network partitions, container pauses, node pauses, shard/replica scaling, rolling updates, and full cluster delete/recreate.

  • A new test-chaos Makefile target spins up a dedicated Kind cluster, deploys the operator, seeds test data via a custom Go client pod, and loops indefinitely through randomly or sequentially selected fault scenarios, verifying recovery and data integrity after each one.
  • Fourteen named scenarios are registered in allScenarios; compound faults (+-separated), CPU pressure throttling, and configurable target-shard selection are supported via environment variables.
  • A CI step verifies the package compiles with the chaos build tag, but does not execute the tests in CI.
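
To make the comma and `+` syntax concrete, here is a small sketch of how a scenario list with compound entries could be split and validated. The function name `parseScenarioList` and the reading that each `+`-joined entry is a group of faults applied together in one iteration are assumptions for illustration; the PR's own parser may behave differently.

```go
package main

import (
	"fmt"
	"strings"
)

// parseScenarioList splits a CHAOS_SCENARIOS-style value on commas, then
// splits each entry on "+" into the faults that make up a compound entry.
// Every part must appear in known, otherwise an error is returned.
func parseScenarioList(raw string, known map[string]bool) ([][]string, error) {
	var out [][]string
	for _, entry := range strings.Split(raw, ",") {
		entry = strings.TrimSpace(entry)
		if entry == "" {
			continue
		}
		parts := strings.Split(entry, "+")
		for i, p := range parts {
			parts[i] = strings.TrimSpace(p)
			if !known[parts[i]] {
				return nil, fmt.Errorf("unknown scenario %q in entry %q", parts[i], entry)
			}
		}
		out = append(out, parts)
	}
	return out, nil
}

func main() {
	known := map[string]bool{
		"delete-primary-pod": true,
		"scale-shards":       true,
		"rolling-update":     true,
	}
	groups, err := parseScenarioList("delete-primary-pod+scale-shards,rolling-update", known)
	fmt.Println(groups, err)
}
```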

Confidence Score: 3/5

The chaos framework is well-structured but has several defects in fault-injection and cleanup paths that would cause stuck Kind nodes or silent test misbehaviour during actual runs.

The pauseWorkerNode function attempts to pause each target shard's host node without deduplicating — unlike networkPartitionPrimary which explicitly does so. When two target shards land on the same worker (the common case with 2 workers and many shards), the second docker pause fails and the function returns early, leaving the node frozen with no cleanup path. Additionally, rollingUpdate and deleteControllerPod call Eventually().Should() directly inside their Inject functions, bypassing the iteration/scenario/seed context. HealNode runs iptables -F flushing all filter-table rules including CNI rules. And scaleShards scales below MinShards when both bounds are equal.

test/chaos/chaos_test.go and test/utils/chaos.go contain the defects in fault injection and cleanup logic.

Important Files Changed

| Filename | Overview |
|---|---|
| test/chaos/chaos_test.go | New 1113-line chaos test suite with fault injection scenarios; pauseWorkerNode is missing node deduplication (unlike the parallel network-partition functions), and rollingUpdate/deleteControllerPod call Eventually().Should() directly inside Inject, bypassing the main loop's error routing. |
| test/utils/chaos.go | New 743-line chaos utility file; HealNode uses iptables -F which flushes all filter-table rules (not just the two added by PartitionNode), and deleteRecreateCluster calls Eventually inside the scenario function rather than returning an error. |
| test/chaos/chaos_suite_test.go | Standard Ginkgo suite bootstrap: builds images, loads them into Kind, deploys operator, and sets up cert-manager. Straightforward and correct. |
| test/chaos/client/main.go | Minimal Go client that seeds keys then continuously overwrites them at a configurable RPS; logic is clean with no issues. |
| Makefile | Adds test-chaos target and parameterizes KIND_WORKERS; worker-count validation on existing clusters is a good guard. |
| .github/workflows/test.yml | Adds a compile-only CI step for the chaos package using the chaos build tag; no functional test execution in CI. |
| docs/chaos-testing.md | New documentation covering all environment variables, scenarios, and usage examples; comprehensive and accurate. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[BeforeSuite: build images, deploy operator] --> B[BeforeAll: parse env config, create ValkeyCluster]
    B --> C[Seed test data via background client pod]
    C --> D{Loop: pick scenario}
    D -->|random / sequential| E[Log cluster state before]
    E --> F[Inject fault]
    F -->|error with skip:| D
    F -->|inject error| G[Fail: log iteration + scenario + seed]
    F -->|success| H[Eventually: wait for cluster recovery]
    H -->|timeout| G
    H -->|recovered| I{scenario.losesData?}
    I -->|no| J[VerifyTestData]
    J -->|fail| G
    J -->|pass| D
    I -->|yes| K[FlushAll + re-seed]
    K --> D

Reviews (6): Last reviewed commit: "chaos: validate config hash, add test st..."

Comment thread test/chaos/chaos_test.go Outdated
Comment thread test/utils/chaos.go Outdated
bjosv added a commit that referenced this pull request May 31, 2026
### Summary

During scale-in, `shardIndexFromState` could match a stale replica from
a drained shard that temporarily appears in a remaining shard via gossip
before `CLUSTER FORGET` propagates. This returns the wrong shard index,
causing the controller to drain the wrong shards, e.g. draining 3 shards
out of 4 total, instead of just draining 1 shard, leading to an
unrecoverable Reconciling state.

Fixed by checking the primary node first. The primary is the
authoritative slot owner and its ValkeyNode CR always has the correct
shard-index label. Stale replicas from drained shards are never
primaries of remaining shards.
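
A rough sketch of the "primary first" lookup, for illustration only. The `NodeState` and `ShardState` types, the `labels` map, and `shardIndexFor` are hypothetical and do not reproduce the operator's `shardIndexFromState`.

```go
package main

import "fmt"

// NodeState and ShardState are simplified, hypothetical views of what the
// cluster reports for one shard.
type NodeState struct {
	ID      string
	Primary bool
}

type ShardState struct {
	Nodes []NodeState
}

// shardIndexFor resolves a shard's index from per-node labels. The primary
// is consulted first because it is the authoritative slot owner; replicas
// are only used as a fallback when the primary has no label.
func shardIndexFor(shard ShardState, labels map[string]int) (int, bool) {
	for _, n := range shard.Nodes {
		if n.Primary {
			if idx, ok := labels[n.ID]; ok {
				return idx, true
			}
		}
	}
	for _, n := range shard.Nodes {
		if !n.Primary {
			if idx, ok := labels[n.ID]; ok {
				return idx, true
			}
		}
	}
	return 0, false
}

func main() {
	// A stale replica from drained shard 3 shows up in shard 1 via gossip
	// and happens to be listed before the primary.
	shard := ShardState{Nodes: []NodeState{
		{ID: "stale-replica", Primary: false},
		{ID: "primary-1", Primary: true},
	}}
	labels := map[string]int{"stale-replica": 3, "primary-1": 1}
	fmt.Println(shardIndexFor(shard, labels)) // 1 true
}
```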

### Testing

This has been found using #203 running:
`CHAOS_SCENARIOS=scale-shards CHAOS_MIN_SHARDS=3 CHAOS_MAX_SHARDS=9 make
test-chaos`

This repeatedly scales the cluster to a random shard count (between 3–9)
and verifies it recovers correctly each time, running until failure.

Previously failed within 3–20 iterations; now passes 1000+ without
failure.

Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
jdheyburn pushed a commit that referenced this pull request Jun 2, 2026
### Summary

This PR fixes two issues that together make rolling updates safe with
and without PVCs.

1. Without persistence: restarting a replica destroys its cluster
membership and data. The operator considered it Ready and immediately
rolled the primary. With no synced replica to fail over to, all shard
data was lost.

2. With persistence: the replica stays synced and proactive failover is
attempted, but the `_operator` user ACL was missing `cluster|failover`.
Every failover failed with NOPERM and the primary was rolled without a
graceful handoff.

### Testing

Found using #203 running the rolling-update scenario. It patches
io-threads on the ValkeyCluster to trigger a rolling restart of all
pods, then verifies the cluster recovers and all keys are preserved.

```
# Without persistence
CHAOS_SCENARIOS=rolling-update make test-chaos

# With persistence
CHAOS_SCENARIOS=rolling-update CHAOS_PERSISTENCE=true make test-chaos
```



### Checklist

Before submitting the PR make sure the following are checked:

- [ ] This Pull Request is related to one issue.
- [x] Commit message explains what changed and why
- [ ] Tests are added or updated.
- [ ] Documentation files are updated.
- [x] I have run pre-commit locally (`pre-commit run --all-files` or
hooks on commit)

---------

Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
bjosv added a commit that referenced this pull request Jun 7, 2026
…222)

This PR closes #216 

### Summary

Add `cluster-allow-replica-migration=no` to prevent Valkey from moving
replicas between shards autonomously, which conflicts with the
operator's topology management.

Add `cluster-replica-validity-factor=0` so replicas always attempt
failover regardless of disconnection time. With cluster-node-timeout at
2s, the default factor of 10 gives only a 30s window before replicas
refuse to failover, causing stuck clusters under disruption.

Reorder `PlanDrainMove` to check slot count before primary existence,
preventing a spurious error on already-drained shards. Returning an
error due to the primary would cause the caller (drainExcessShards) to
propagate it up, halting scale-down and leaving stale ValkeyNodes around
forever.
This problem solved itself previously when
`cluster-allow-replica-migration=yes`.

### Testing

This problem was found by using #203 and running on a machine with
moderate load:
`CHAOS_MAX_SHARDS=15 CHAOS_SCENARIOS="scale-shards" make test-chaos`

### Checklist

Before submitting the PR make sure the following are checked:

- [x] This Pull Request is related to one issue.
- [x] Commit message explains what changed and why
- [ ] Tests are added or updated.
- [ ] Documentation files are updated.
- [x] I have run pre-commit locally (`pre-commit run --all-files` or
hooks on commit)

Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
@bjosv bjosv marked this pull request as ready for review June 9, 2026 09:46
bjosv added 3 commits June 9, 2026 16:09
Introduce a standalone chaos test framework in test/chaos/ that
continuously injects faults into a ValkeyCluster until failure.
Scenarios include pod deletion, workload deletion, network partitions,
container pauses, shard scaling, rolling updates, controller pod
deletion, worker node pauses, and full cluster delete/recreate.

- Dedicated build tag (//go:build chaos) and Kind cluster
- New Makefile target: make test-chaos
- CI step to verify compilation
- Configurable via environment variables (scenarios, shards,
  replicas, workload type, tolerations, CPU pressure ...)
- CPU pressure mode: randomly throttle Kind worker node CPUs per
  iteration to simulate loaded nodes (CHAOS_CPU_PRESSURE=true)
- Network partition scenarios block all traffic (not just Valkey
  ports) to simulate fully unreachable nodes
- Scenarios disabled by default: network-partition-primary,
  network-partition-replica, pause-worker-node (require tolerations
  for meaningful testing)

Examples:

  # Run all default scenarios
  make test-chaos

  # Stress test random scaling between 3 and 9 shards
  CHAOS_SCENARIOS=scale-shards CHAOS_MIN_SHARDS=3 CHAOS_MAX_SHARDS=9 make test-chaos

  # Randomly alternate between pod deletion and scaling
  CHAOS_SCENARIOS=delete-primary-pod,scale-shards make test-chaos

  # Alternate between pod deletion and scaling in sequence
  CHAOS_SCENARIOS=delete-primary-pod,scale-shards CHAOS_MODE=sequential make test-chaos

  # Repeatedly kill the primary pod of a random shard
  CHAOS_SCENARIOS=delete-primary-pod make test-chaos

  # Stress test with CPU pressure
  CHAOS_CPU_PRESSURE=true make test-chaos

  # Test network partitions with pod evictions
  CHAOS_SCENARIOS=network-partition-primary,pause-worker-node \
    CHAOS_TOLERATION_SECONDS=10 make test-chaos

  # Repeatedly delete pods across multiple shards
  CHAOS_SCENARIOS=delete-multiple-shard-pods \
    CHAOS_SHARDS=7 CHAOS_REPLICAS=2 KIND_WORKERS=3 make test-chaos

  # Run with Deployment workload type
  CHAOS_WORKLOAD_TYPE=Deployment make test-chaos

Note: Network partition, container pause, and CPU pressure scenarios
require Docker access to Kind worker nodes. These only work when
running against a local Kind cluster, not remote clusters.

Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
All scenarios now accept multiple target shards via CHAOS_TARGET_SHARDS
(comma-separated indices, "all", or "random"). This removes the
delete-multiple-shard-pods scenario by merging it into delete-shard-pods,
and enables multi-shard testing for all fault injection scenarios.

Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
bjosv added 2 commits June 10, 2026 14:52
… scenarios

Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>

bjosv added 6 commits June 16, 2026 01:39
Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
…ary-pod

Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
… writes

Add a lightweight Go client (test/chaos/client/) that seeds keys deterministically
and optionally maintains continuous writes at a configurable rate (CHAOS_WRITE_RPS).

This replaces valkey-benchmark for both seeding and background writes because
valkey-benchmark's {tag} rewriting produces non-deterministic key names,
making exact key count verification impossible.

The custom client:
- Seeds keys sequentially (key:000000000000 to key:000000099999)
- Overwrites the same keys in a loop (no extra keys created)
- Uses valkey-go cluster mode (auto-routing, auto-reconnect)
- Reports writes/errors every 5s via pod logs
- Supports configurable value size (CHAOS_DATA_SIZE)
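
The deterministic key scheme can be illustrated as below. This is a sketch inferred from the key range quoted above (a 12-digit zero-padded index), not the client's actual code; `keyName`, `nextIndex`, `makeValue`, and the `totalKeys` constant are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
)

// totalKeys matches the range key:000000000000 to key:000000099999.
const totalKeys = 100000

// keyName formats index i as a 12-digit zero-padded key, so the full key
// set is known in advance and can be counted exactly.
func keyName(i int) string {
	return fmt.Sprintf("key:%012d", i)
}

// nextIndex advances the write cursor and wraps around, so background
// writes overwrite existing keys instead of creating new ones.
func nextIndex(i, total int) int {
	return (i + 1) % total
}

// makeValue builds a payload of the configured size (filler content only).
func makeValue(size int) []byte {
	return bytes.Repeat([]byte("x"), size)
}

func main() {
	fmt.Println(keyName(0))                        // key:000000000000
	fmt.Println(keyName(totalKeys - 1))            // key:000000099999
	fmt.Println(nextIndex(totalKeys-1, totalKeys)) // 0
	fmt.Println(len(makeValue(64)))                // 64
}
```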

Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
Comment thread test/chaos/chaos_test.go
Comment on lines +719 to +751
func pauseWorkerNode(ctx *ChaosContext) error {
	// Eviction threshold: 40s (node-monitor-grace) + tolerationSeconds.
	// Range spans below and above threshold to cover both eviction and non-eviction cases.
	evictionThreshold := 40*time.Second + time.Duration(ctx.TolerationSec)*time.Second
	duration := randomDuration(ctx.Rand, 3*time.Second, evictionThreshold+30*time.Second)
	var paused []string
	for _, shard := range ctx.TargetShards {
		pod, err := utils.GetShardPrimaryPod(ctx.ClusterName, ctx.Namespace, shard)
		if err != nil {
			return err
		}
		nodeName, err := utils.GetPodNodeName(pod, ctx.Namespace)
		if err != nil {
			return err
		}
		_, _ = fmt.Fprintf(GinkgoWriter, " Pausing Kind node %s (primary pod: %s, shard %d) for %s\n", nodeName, pod, shard, duration.Truncate(time.Second))
		logIfControllerNode(nodeName)
		cmd := exec.Command("docker", "pause", nodeName)
		if _, err := utils.Run(cmd); err != nil {
			return err
		}
		paused = append(paused, nodeName)
	}
	time.Sleep(duration)
	for _, nodeName := range paused {
		_, _ = fmt.Fprintf(GinkgoWriter, " Unpausing Kind node %s\n", nodeName)
		cmd := exec.Command("docker", "unpause", nodeName)
		if _, err := utils.Run(cmd); err != nil {
			return err
		}
	}
	return nil
}

P1 pauseWorkerNode doesn't deduplicate nodes, leaving nodes paused on failure

When CHAOS_TARGET_SHARDS targets multiple shards that happen to share a Kind worker node (common with the default 2-worker setup and many shards), the second docker pause <nodeName> fails because the node is already paused. The function returns an error immediately, leaving the paused node in the paused slice — but the time.Sleep and unpause loop are never reached. The test then calls Fail(...), and the Kind node stays frozen for the rest of the suite.

Both networkPartitionPrimary and networkPartitionReplica handle this correctly with slices.Contains(nodes, nodeName) deduplication before acting. The same guard is missing here.
