Production-ready Jepsen and Litmus chaos automation for CloudNativePG (CNPG) clusters.
Want to run chaos testing immediately? Follow these streamlined steps:
- Clone this repo → Get the chaos experiments and scripts (section 0)
- Setup cluster → Bootstrap CNPG Playground (section 1)
- Install CNPG → Deploy operator + sample cluster (section 2)
- Install Litmus → Install operator, experiments, and RBAC (sections 3, 3.5, 3.6)
- Smoke-test chaos → Run the quick pod-delete check without monitoring (section 4)
- Add monitoring → Install Prometheus for probe validation (section 5; required before section 6 with probes enabled)
- Run Jepsen → Full consistency testing layered on chaos (section 6)
First-time users: use section 4 as a smoke test without Prometheus, then return to section 5 to install monitoring before running the Jepsen workflow in section 6.
- Linux/macOS shell with `bash`, `git`, `curl`, `jq`, and internet access.
- Container + Kubernetes tooling: Docker or Podman, the Kind CLI tool, `kubectl`, `helm`, the `kubectl cnpg` plugin binary, and the `cmctl` utility for cert-manager.
- Install the CNPG plugin using kubectl krew (recommended):
# Install or update to the latest version
kubectl krew update
kubectl krew install cnpg || kubectl krew upgrade cnpg
kubectl cnpg version
Alternative installation methods:
- For Debian/Ubuntu: Download the `.deb` package from the releases page
- For RHEL/Fedora: Download the `.rpm` package from the releases page
- See the official installation docs for all methods
- Optional but recommended: `kubectx`, `stern`, `kubectl-view-secret` (see the CNPG Playground README for a complete list).
- Disk space: minimum 30GB free disk space recommended:
- Kind cluster nodes: ~5GB
- Container images: ~5GB (first run with image pull)
- Prometheus/MongoDB storage: ~10GB
- Jepsen results + logs: ~5GB
- Buffer for growth: ~5GB
- Sufficient local resources for a multi-node Kind cluster (≈8 CPUs / 12 GB RAM) and permission to run port-forwards.
Once the tooling is present, everything else is managed via repository scripts and Helm charts.
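A quick way to confirm the required CLIs are on your `PATH` before starting (substitute `podman` for `docker` if that is your runtime):

# Report any missing prerequisite tools
for tool in bash git curl jq docker kind kubectl helm cmctl; do
  command -v "$tool" >/dev/null || echo "missing: $tool"
done
# Confirms the CNPG plugin is installed (see the krew instructions above)
kubectl cnpg version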
Follow these sections in order; each references the authoritative upstream documentation to keep this README concise.
First, clone this repository to access the chaos experiments and scripts:
git clone https://github.com/cloudnative-pg/chaos-testing.git
cd chaos-testing

All subsequent commands reference files in this repository (experiments, scripts, monitoring configs).
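The directories referenced throughout the rest of this guide live at the repository root; a quick look (paths as used in later sections):

# Chaos experiments, helper scripts, monitoring manifests, and cluster definitions
ls experiments/ scripts/ monitoring/ clusters/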
The upstream documentation provides detailed instructions for prerequisites and networking. Follow the setup instructions here: https://github.com/cloudnative-pg/cnpg-playground#usage.
Deploy the cnpg-playground project in a folder parallel to chaos-testing:
cd ..
git clone https://github.com/cloudnative-pg/cnpg-playground.git
cd cnpg-playground
./scripts/setup.sh eu # creates the kind-k8s-eu cluster

Follow the instructions on the screen. In particular, make sure that you:
- export the `KUBECONFIG` variable, as described
- set the correct context for kubectl
For example:
export KUBECONFIG=<PATH_TO_CNPG_PLAYGROUND>/k8s/kube-config.yaml
kubectl config use-context kind-k8s-eu
If unsure, type:
./scripts/info.sh # displays contexts and access information
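With `KUBECONFIG` exported and the context set, a quick sanity check that the playground cluster is reachable (cluster and context names as created above):

# All kind-k8s-eu nodes should report Ready before installing the operator
kubectl --context kind-k8s-eu get nodes
kubectl --context kind-k8s-eu cluster-info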
With the Kind cluster running, install the operator using the kubectl cnpg plugin, as recommended in the CloudNativePG Installation & Upgrades guide. This approach ensures you get the latest stable operator version.
In the cnpg-playground folder:
# Install the latest operator version using the kubectl cnpg plugin
kubectl cnpg install generate --control-plane | \
kubectl --context kind-k8s-eu apply -f - --server-side
# Verify the controller rollout
kubectl --context kind-k8s-eu rollout status deployment \
  -n cnpg-system cnpg-controller-manager

In the chaos-testing folder:
cd ../chaos-testing
# Create the pg-eu PostgreSQL cluster for chaos testing
kubectl apply -f clusters/pg-eu-cluster.yaml
# Verify cluster is ready (this will watch until healthy)
kubectl get cluster pg-eu -w # Wait until status shows "Cluster in healthy state"
# Press Ctrl+C when you see: pg-eu   3   3   ready   XXm

Litmus 3.x separates the operator (the litmus-core chart) from the ChaosCenter UI (the litmus chart). This workflow only needs the operator: install litmus-core, then add the experiment definitions and RBAC:
# Add Litmus Helm repository
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update
# Install litmus-core (operator + CRDs)
helm upgrade --install litmus-core litmuschaos/litmus-core \
--namespace litmus --create-namespace \
--wait --timeout 10m
# Verify CRDs are installed
kubectl get crd chaosengines.litmuschaos.io chaosexperiments.litmuschaos.io chaosresults.litmuschaos.io
# Verify operator is running
kubectl -n litmus get deploy litmus
kubectl -n litmus wait --for=condition=Available deployment/litmus --timeout=5m

The ChaosEngine requires ChaosExperiment resources to exist before it can run. Install the pod-delete experiment:
# Install from Chaos Hub (has namespace: default hardcoded, so override it)
kubectl apply --namespace=litmus -f https://hub.litmuschaos.io/api/chaos/master?file=faults/kubernetes/pod-delete/fault.yaml
# Verify experiment is installed
kubectl -n litmus get chaosexperiments
# Should show: pod-delete

Apply the RBAC configuration and verify the service account has correct permissions:
# Apply RBAC (ServiceAccount, ClusterRole, ClusterRoleBinding)
kubectl apply -f litmus-rbac.yaml
# Verify the ServiceAccount exists in litmus namespace
kubectl -n litmus get serviceaccount litmus-admin
# Verify the ClusterRoleBinding points to correct namespace
kubectl get clusterrolebinding litmus-admin -o jsonpath='{.subjects[0].namespace}'
# Should output: litmus (not default)
# Test permissions (optional)
kubectl auth can-i delete pods --as=system:serviceaccount:litmus:litmus-admin -n default
# Should output: yes

Important: The `litmus-rbac.yaml` ClusterRoleBinding must reference `namespace: litmus` in the subjects section. If you see errors like `"litmus-admin" cannot get resource "chaosengines"`, verify the namespace matches where the ServiceAccount exists.
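If the binding was created with the wrong namespace, one way to fix it in place (a sketch assuming the `litmus-admin` binding and ServiceAccount names from `litmus-rbac.yaml` above) is a JSON patch:

# Point the first subject of the litmus-admin ClusterRoleBinding at the litmus namespace
kubectl patch clusterrolebinding litmus-admin --type=json \
  -p='[{"op": "replace", "path": "/subjects/0/namespace", "value": "litmus"}]'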
Before setting up the full monitoring stack, you can verify chaos mechanics work independently:
# Apply the probe-free chaos engine (no Prometheus dependency)
kubectl apply -f experiments/cnpg-jepsen-chaos-noprobes.yaml
# Watch the chaos runner pod start (refreshes every 2s)
# Press Ctrl+C once you see the runner pod appear
watch -n2 'kubectl -n litmus get pods | grep cnpg-jepsen-chaos-noprobes-runner'
# Monitor CNPG pod deletions in real-time
bash scripts/monitor-cnpg-pods.sh pg-eu default litmus kind-k8s-eu
# Wait for chaos runner pod to be created, then check logs
kubectl -n litmus wait --for=condition=ready pod -l chaos-runner-name=cnpg-jepsen-chaos-noprobes --timeout=60s && \
runner_pod=$(kubectl -n litmus get pods -l chaos-runner-name=cnpg-jepsen-chaos-noprobes -o jsonpath='{.items[0].metadata.name}') && \
kubectl -n litmus logs -f "$runner_pod"
# After completion, check the result (engine name differs)
kubectl -n litmus get chaosresult cnpg-jepsen-chaos-noprobes-pod-delete -o jsonpath='{.status.experimentStatus.verdict}'
# Should output: Pass (if probes are disabled) or Error (if Prometheus probes enabled but Prometheus not installed)
# Clean up for next test
kubectl -n litmus delete chaosengine cnpg-jepsen-chaos-noprobes

What to observe:
- The runner pod starts and creates an experiment pod (`pod-delete-xxxxx`)
- CNPG primary pods are deleted every 60 seconds
- CNPG automatically promotes a replica to primary after each deletion (see the check below)
- Deleted pods are recreated by the StatefulSet controller
- The experiment runs for 10 minutes (TOTAL_CHAOS_DURATION=600)
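To watch the promotion happen, list the cluster's pods with their role label (the `cnpg.io/instanceRole` label key is an assumption; older CNPG releases expose the role under `role`):

# Show pg-eu instances with their current role; add -w to follow changes during chaos
kubectl get pods -l cnpg.io/cluster=pg-eu -L cnpg.io/instanceRole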
Note: Keep using `experiments/cnpg-jepsen-chaos-noprobes.yaml` until section 5 installs Prometheus/Grafana. Once monitoring is online, switch to `experiments/cnpg-jepsen-chaos.yaml` (probes enabled) for full observability.
The cnpg-playground provides a built-in monitoring stack with Prometheus and Grafana. From the cnpg-playground directory:
cd ../cnpg-playground
./monitoring/setup.sh eu

This script installs:
- Prometheus Operator (in the `prometheus-operator` namespace)
- Grafana Operator with the official CloudNativePG dashboard (in the `grafana` namespace)
- Auto-configured for the `kind-k8s-eu` cluster (a quick verification is shown below)
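Before moving on, you can confirm the stack listed above came up (namespaces as installed by the script):

# Both namespaces should show Running pods once the setup script finishes
kubectl -n prometheus-operator get pods
kubectl -n grafana get pods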
Once installation completes, create the PodMonitor to expose CNPG metrics:
# Switch back to chaos-testing directory
cd ../chaos-testing
# Apply CNPG PodMonitor
kubectl apply -f monitoring/podmonitor-pg-eu.yaml
# Verify PodMonitor
kubectl get podmonitor pg-eu -o wide
# Verify Prometheus is scraping CNPG metrics
kubectl -n prometheus-operator port-forward svc/prometheus-operated 9090:9090 &
curl -s --data-urlencode 'query=sum(cnpg_collector_up{cluster="pg-eu"})' "http://localhost:9090/api/v1/query"

Access the Grafana dashboard:
kubectl -n grafana port-forward svc/grafana-service 3000:3000
# Open http://localhost:3000 with:
# Username: admin
# Password: admin (you'll be prompted to change it on first login)

The official CloudNativePG dashboard is pre-configured and available at: Home → Dashboards → grafana → CloudNativePG
Note: If you recreate the `pg-eu` cluster, reapply the PodMonitor so Prometheus resumes scraping: `kubectl apply -f monitoring/podmonitor-pg-eu.yaml`
✅ Required before section 6 (when probes are enabled): Complete this monitoring setup so the Prometheus probes defined in `experiments/cnpg-jepsen-chaos.yaml` can succeed.
This project relies on cnpg-playground's monitoring implementation. Be aware of the following dependencies:
What we depend on:
- Script: `/path/to/cnpg-playground/monitoring/setup.sh`
- Namespace: `prometheus-operator`
- Service: `prometheus-operated` (created by the Prometheus Operator for the CR named `prometheus`)
- Port: `9090` (Prometheus default; a reachability check is shown below)
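To confirm these are live from inside the cluster, one option (the throwaway `prom-check` pod name is illustrative) is a one-off pod that hits the Prometheus readiness endpoint:

# One-off pod that curls the Prometheus readiness endpoint and is deleted afterwards
kubectl -n litmus run prom-check --rm -i --restart=Never \
  --image=curlimages/curl --command -- \
  curl -s -o /dev/null -w '%{http_code}\n' \
  http://prometheus-operated.prometheus-operator.svc:9090/-/ready
# Expect 200 if the service, namespace, and port listed above are correct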
If cnpg-playground monitoring changes, you may need to update:
- Prometheus endpoint in `experiments/cnpg-jepsen-chaos.yaml` (lines 89, 132, 148)
- Service check in `.github/workflows/chaos-test-full.yml` (line 57)
- Service check in `scripts/run-jepsen-chaos-test.sh` (line 279)
Troubleshooting: If probes fail with connection errors:
# Verify the Prometheus service exists
kubectl -n prometheus-operator get svc
# If service name changed, update all probe endpoints
# in experiments/cnpg-jepsen-chaos.yaml

./scripts/run-jepsen-chaos-test.sh pg-eu app 600

This script deploys Jepsen (the jepsenpg image), applies the Litmus ChaosEngine (primary pod delete), monitors logs, collects results, and cleans up transient resources automatically (no manual exit needed; the script handles everything).
Prerequisites before running the script:
- Section 5 completed (Prometheus/Grafana running) so probes succeed.
- Chaos workflow validated (run `experiments/cnpg-jepsen-chaos.yaml` once manually if you need to confirm the Litmus + CNPG wiring).
- Docker registry access to pull the `ardentperf/jepsenpg` image (or pre-pull it into the cluster).
- `kubectl` context pointing to the playground cluster with sufficient resources.
- Increase the max open files limit if needed (required for Jepsen on some systems):
ulimit -n 65536

This may need to be configured in your container runtime or Kind cluster configuration if running in a containerized environment.
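To see what limit currently applies (the `docker` systemd unit name is an assumption; skip the second command on non-systemd hosts or when using Podman):

# Soft file-descriptor limit of the current shell
ulimit -n
# Limit a systemd-managed Docker daemon passes on to its containers
systemctl show docker --property=LimitNOFILE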
Script knobs:
- `LITMUS_NAMESPACE` (default `litmus`) – set if you installed Litmus in a different namespace.
- `PROMETHEUS_NAMESPACE` (default `prometheus-operator`) – used to auto-detect the Prometheus service backing the Litmus probes.
- `JEPSEN_IMAGE` is pinned to `ardentperf/jepsenpg@sha256:4a3644d9484de3144ad2ea300e1b66568b53d85a87bf12aa64b00661a82311ac` for reproducibility. Update this digest only after verifying upstream releases.
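For example, a run with both namespaces overridden (the values here are illustrative) looks like:

# Same invocation as above, with Litmus and Prometheus in non-default namespaces
LITMUS_NAMESPACE=chaos PROMETHEUS_NAMESPACE=monitoring \
  ./scripts/run-jepsen-chaos-test.sh pg-eu app 600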
- All test results are stored under `logs/jepsen-chaos-<timestamp>/`.
- Quick validation commands:

# Check Litmus chaos verdict (note: use -n litmus, not -n default)
kubectl -n litmus get chaosresult cnpg-jepsen-chaos-pod-delete \
  -o jsonpath='{.status.experimentStatus.verdict}'

# View full chaos result details
kubectl -n litmus get chaosresult cnpg-jepsen-chaos-pod-delete -o yaml

# Check probe results (if Prometheus was installed)
kubectl -n litmus get chaosresult cnpg-jepsen-chaos-pod-delete \
  -o jsonpath='{.status.probeStatuses}' | jq
- Archive `history.edn` and `chaos-results/chaosresult.yaml` for analysis or reporting.
- Each run creates a folder under `logs/jepsen-chaos-<timestamp>/`.
- Key files:
  - `results/history.edn` → Jepsen operation history.
  - `results/chaos-results/chaosresult.yaml` → Litmus verdict + probe output.
- Quick checks:

# Chaos results (note: namespace is 'litmus' by default)
kubectl -n litmus get chaosresult cnpg-jepsen-chaos-pod-delete \
  -o jsonpath='{.status.experimentStatus.verdict}'
- CNPG Playground: https://github.com/cloudnative-pg/cnpg-playground
- CloudNativePG Installation & Upgrades: https://cloudnative-pg.io/documentation/current/installation_upgrade/
- Litmus Helm chart: https://github.com/litmuschaos/litmus-helm/
- kube-prometheus-stack: https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
- CNPG Grafana dashboards: https://github.com/cloudnative-pg/grafana-dashboards
- License: Apache 2.0 (see `LICENSE`).
Watch CNPG pods, chaos engines, and cluster events during experiments:
# Monitor pod deletions and failovers in real-time
bash scripts/monitor-cnpg-pods.sh <cluster-name> <cnpg-namespace> <chaos-namespace> <kube-context>
# Example
bash scripts/monitor-cnpg-pods.sh pg-eu default litmus kind-k8s-eu

What it shows:
- CNPG pod status with role labels (primary/replica)
- Active ChaosEngines in the chaos namespace
- Recent Kubernetes events (pod deletions, promotions, etc.)
- Updates every 2 seconds
- CNPG Documentation: https://cloudnative-pg.io/documentation/
- Litmus Documentation: https://docs.litmuschaos.io/
- Jepsen Documentation: https://jepsen.io/
- PostgreSQL High Availability: https://www.postgresql.org/docs/current/high-availability.html
Follow the sections above to execute chaos tests. Review the logs for analysis, and consult the /archive directory for additional documentation if needed.
