Version: 1.7.0 Date: 2026-01-25 Status: Implementation Complete - Awaiting VM Provisioning
- Executive Summary
- Requirements Summary
- Architecture Overview
- Multi-Tenancy Solution
- Demo Database & Workload
- CI/CD Pipeline Design
- Lifecycle Management
- DNS & SSL Configuration
- Resource Planning
- VM Provisioning & Monitoring
- Implementation Phases
- Definition of Done
- Risks & Mitigations
This plan describes the implementation of branch-based preview environments for PostgresAI monitoring. Each preview environment will be accessible at preview-{branch-slug}.pgai.watch (a third-level subdomain, required for Cloudflare wildcard certificate compatibility) and will run a complete monitoring stack observing its own demo PostgreSQL database with a realistic workload.
Note: Branch names like claude/feature-x or feature_test are sanitized to claude-feature-x and feature-test for DNS compatibility (e.g., preview-claude-feature-x.pgai.watch).
Key design decisions:
- Single Hetzner VM hosting multiple preview environments
- Traefik reverse proxy for dynamic routing with wildcard SSL via DNS-01
- Dynamic DNS records via Cloudflare API (created on deploy, deleted on destroy)
- Manual CI trigger for preview creation (not automatic)
- 3-day TTL with cleanup on branch merge/delete
- pgbench-based workload with variable patterns (spikes, valleys)
- Per-preview generated credentials (stored only in .env, never in logs)
- Hard limit enforcement on concurrent previews (counts running containers)
- VM-side locking with flock to prevent races
- Config sync via rsync (not Docker registry)
| Requirement | Decision |
|---|---|
| Infrastructure | Single Hetzner VM, 4 vCPU / 8GB initially |
| VM naming | Must start with ralph-preview- (e.g., ralph-preview-01) |
| VM constraint | Only 1 Hetzner machine for this project can exist at any time |
| Multi-tenancy | Traefik + Docker Compose per branch |
| DNS | Dynamic A records via Cloudflare API (per-preview) |
| SSL | Wildcard cert via DNS-01 challenge |
| Authentication | Per-preview generated password (in .env only) |
| Preview trigger | Manual CI job trigger |
| Lifecycle | 3-day TTL + cleanup on merge/delete |
| Demo workload | pgbench with variable patterns, max 4 sessions |
| Deployment | rsync + SSH from GitLab CI runners |
| Max concurrent | Hard limit: 2 (counts running containers) |
| Concurrency control | VM-side flock + CI resource_group |
┌─────────────────────────────────────────────────────────┐
│ Hetzner VM (CX31: 4 vCPU / 8GB) │
│ │
┌──────────┐ │ ┌─────────────────────────────────────────────────┐ │
│Cloudflare│ │ │ Traefik (ports 80, 443) │ │
│ DNS │───────────────────▶│ │ - Wildcard cert: *.pgai.watch (DNS-01) │ │
│ (API) │ │ │ - Routes preview-{branch}.pgai.watch → Grafana│ │
│ │ │ │ - Dashboard: localhost only (SSH tunnel) │ │
└──────────┘ │ └──────────────────────┬──────────────────────────┘ │
▲ │ │ │
│ create/delete │ ┌───────────────┼───────────────┐ │
│ A records │ ▼ ▼ ▼ │
│ │ ┌────────────┐ ┌────────────┐ │
┌────┴─────┐ │ │ feature-x │ │ fix-bug │ (max 2 concurrent) │
│ deploy/ │ │ │ network │ │ network │ │
│ destroy │ │ │ │ │ │ │
│ scripts │ │ │ - grafana │ │ - grafana │ │
└──────────┘ │ │ - pgwatch │ │ - pgwatch │ │
│ │ - victoria │ │ - victoria │ │
│ │ - sink-pg │ │ - sink-pg │ │
│ │ - target │ │ - target │ │
│ │ - workload │ │ - workload │ │
│ └────────────┘ └────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ VM Self-Monitoring │ │
│ │ - node-exporter (disk, memory, CPU alerts) │ │
│ │ - Healthchecks.io ping for uptime │ │
│ └─────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Preview Manager │ │
│ │ - flock-based locking (per-preview + global) │ │
│ │ - Counts running containers for quota │ │
│ │ - TTL cleanup + conditional docker prune │ │
│ └─────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
- Traefik Reverse Proxy
  - Single entry point (ports 80/443)
  - Wildcard certificate via DNS-01 challenge (Cloudflare API)
  - Auto-discovers containers via Docker labels
  - Dashboard not exposed publicly (SSH tunnel access only)
- Isolated Docker Networks
  - Each preview runs in its own Docker network (preview-{slug})
  - No port conflicts - services communicate internally
  - Only Grafana joins the traefik-public network for routing
- Preview Manager
  - flock-based locking (global for quota/prune, per-preview for deploy/destroy)
  - Counts running containers (not directories) for quota enforcement
  - Conditional docker prune (only when disk >80%)
  - TTL cleanup via cron (3-day expiry)
- VM Self-Monitoring (Required)
  - node-exporter for system metrics
  - Alerts on disk > 80%, memory > 85%
  - External uptime monitoring (Healthchecks.io)
- Security Notes
  - Credentials stored only in .env with chmod 600 (never in state.json or logs)
  - Docker socket mounted read-only in Traefik (acceptable for V1; consider socket proxy for V2)
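The two-level locking discipline (global lock for quota checks, per-preview lock for deploy/destroy) can be exercised in isolation before the manager scripts are deployed. A minimal sketch of the same flock pattern, with throwaway lock paths:

#!/bin/bash
# Minimal sketch of the manager's two-level flock pattern (paths are throwaway).
set -euo pipefail

GLOBAL_LOCK=/tmp/demo.global.lock
PREVIEW_LOCK=/tmp/demo.preview.lock

# Global lock: short critical section for quota/resource checks.
exec 200>"$GLOBAL_LOCK"
flock -w 30 200 || { echo "global lock busy"; exit 1; }
echo "quota check runs here"
flock -u 200   # release early; deploy work happens under the per-preview lock

# Per-preview lock: held for the whole deploy/destroy of one preview.
exec 201>"$PREVIEW_LOCK"
flock -w 60 201 || { echo "preview lock busy"; exit 1; }
echo "deploy/destroy work runs here"
# Released automatically when fd 201 closes at process exit.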
Problem: Multiple monitoring stacks need fixed ports (Grafana:3000, VictoriaMetrics:9090, etc.).
Solution: Isolated Docker networks + Traefik routing (no host port exposure)
# docker-compose.preview.yml structure
networks:
  default:
    name: preview-${BRANCH_SLUG}
  traefik-public:
    external: true  # CRITICAL: Created by Traefik stack

services:
  grafana:
    networks:
      - default
      - traefik-public
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.preview-${BRANCH_SLUG}.rule=Host(`preview-${BRANCH_SLUG}.pgai.watch`)"
      - "traefik.http.routers.preview-${BRANCH_SLUG}.entrypoints=websecure"
      - "traefik.http.routers.preview-${BRANCH_SLUG}.tls.certresolver=letsencrypt"
      - "traefik.http.services.preview-${BRANCH_SLUG}.loadbalancer.server.port=3000"
      - "traefik.docker.network=traefik-public"
      - "pgai.preview=true"  # Label for quota counting
    # NO ports: section - internal only

Create a single shared script used by CI, deploy, destroy, and webhook:
#!/bin/bash
# scripts/sanitize-branch.sh
# SINGLE SOURCE OF TRUTH for branch name sanitization
set -euo pipefail
BRANCH="${1:?Usage: sanitize-branch.sh <branch-name>}"
# 1. Lowercase
# 2. Replace / and _ with -
# 3. Remove non-alphanumeric except -
# 4. Collapse multiple dashes
# 5. Remove leading/trailing dashes
CLEAN=$(echo "$BRANCH" | tr '[:upper:]' '[:lower:]' | \
sed 's/[\/\_]/-/g' | \
sed 's/[^a-z0-9-]//g' | \
sed 's/--*/-/g' | \
sed 's/^-//;s/-$//')
# 6. If longer than 50 chars, truncate and append hash for uniqueness
if [ ${#CLEAN} -gt 50 ]; then
  HASH=$(echo -n "$BRANCH" | sha1sum | cut -c1-8)
  CLEAN="${CLEAN:0:50}-${HASH}"
fi
# 7. Final trim to 63 chars (DNS label limit)
echo "${CLEAN:0:63}"

Examples:
feature/add-metrics → feature-add-metrics
claude/implement-caching → claude-implement-caching
feature_test_branch → feature-test-branch
feature/optimize-query-execution-plan-for-large → feature-optimize-query-execution-plan-for-lar-a1b2c3d4
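A quick way to sanity-check the script against these examples (a throwaway test loop, not part of the deployed scripts):

#!/bin/bash
# Throwaway check that sanitize-branch.sh produces the expected slugs.
set -euo pipefail

declare -A CASES=(
  ["feature/add-metrics"]="feature-add-metrics"
  ["claude/implement-caching"]="claude-implement-caching"
  ["feature_test_branch"]="feature-test-branch"
)

for branch in "${!CASES[@]}"; do
  got=$(./scripts/sanitize-branch.sh "$branch")
  want="${CASES[$branch]}"
  if [ "$got" = "$want" ]; then
    echo "OK   $branch -> $got"
  else
    echo "FAIL $branch -> $got (expected $want)" >&2
  fi
done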
/opt/postgres-ai-previews/
├── scripts/
│ ├── sanitize-branch.sh # SINGLE SOURCE OF TRUTH
│ └── cloudflare-dns.sh # DNS record management
├── traefik/
│ ├── docker-compose.yml # Traefik stack
│ ├── traefik.yml # Static config (DNS-01)
│ ├── .env # CF_DNS_API_TOKEN, CF_ZONE_ID (chmod 600)
│ └── acme.json # Wildcard cert storage (chmod 600)
├── previews/
│ ├── feature-add-metrics/
│ │ ├── docker-compose.yml # Generated per-preview
│ │ ├── config/ # Synced from repo via rsync
│ │ ├── .env # Credentials ONLY (chmod 600)
│ │ ├── state.json # Metadata (NO secrets)
│ │ └── .lock # flock file
│ └── fix-bug-123/
│ └── ...
├── shared/
│ ├── docker-compose.preview.template.yml
│ └── workload/
│ └── pgbench-variable.sh
└── manager/
├── deploy.sh # Main deploy script
├── destroy.sh # Cleanup script
├── cleanup-ttl.sh # Cron job for TTL
└── .global.lock # Global lock for quota/prune
Each preview runs a PostgreSQL 15 instance with:
- pg_stat_statements enabled
- target_database database
- Initialized with pgbench schema (pgbench -i -s 10 ≈ 160MB)
- Healthcheck for proper startup sequencing
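The plan does not spell out how pg_stat_statements gets enabled; one common way is to pass the preload setting on the postgres command line and create the extension once the database is up. A sketch with a hypothetical container name (the actual compose template may wire this differently):

# Enable pg_stat_statements via server flags (container name is hypothetical):
docker run -d --name target-db-demo \
  -e POSTGRES_PASSWORD=postgres \
  -e POSTGRES_DB=target_database \
  postgres:15 \
  postgres -c shared_preload_libraries=pg_stat_statements \
           -c pg_stat_statements.track=all

# Wait until the server accepts connections, then create the extension:
until docker exec target-db-demo pg_isready -U postgres >/dev/null 2>&1; do sleep 1; done
docker exec target-db-demo \
  psql -U postgres -d target_database \
  -c "CREATE EXTENSION IF NOT EXISTS pg_stat_statements;"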
#!/bin/bash
# /opt/postgres-ai-previews/shared/workload/pgbench-variable.sh
set -euo pipefail
# Graceful shutdown handling
SHUTDOWN=0
trap 'SHUTDOWN=1; echo "[$(date)] Received shutdown signal, finishing current pattern..."' SIGTERM SIGINT
DB_HOST="${TARGET_DB_HOST:-target-db}"
DB_PORT="${TARGET_DB_PORT:-5432}"
DB_NAME="${TARGET_DB_NAME:-target_database}"
DB_USER="${TARGET_DB_USER:-postgres}"
export PGPASSWORD="${PGPASSWORD:-postgres}"
# Wait for database to be truly ready (not just socket open)
echo "[$(date)] Waiting for database..."
until psql -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -d "$DB_NAME" -c "SELECT 1" >/dev/null 2>&1; do
  echo "[$(date)] Database not ready, retrying..."
  sleep 2
done
echo "[$(date)] Database ready"

# Initialize pgbench schema ONLY if not exists
if ! psql -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -d "$DB_NAME" \
    -c "SELECT 1 FROM pgbench_accounts LIMIT 1" >/dev/null 2>&1; then
  echo "[$(date)] Initializing pgbench schema..."
  pgbench -i -s 10 -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" "$DB_NAME"
else
  echo "[$(date)] pgbench schema already exists, skipping init"
fi

CONN="-h $DB_HOST -p $DB_PORT -U $DB_USER $DB_NAME"

while [ $SHUTDOWN -eq 0 ]; do
  PATTERN=$((RANDOM % 6))
  case $PATTERN in
    0) # Spike: high intensity, short duration
      echo "[$(date)] Pattern: SPIKE"
      timeout 35 pgbench -c 4 -j 2 -T 30 $CONN 2>/dev/null || true
      [ $SHUTDOWN -eq 0 ] && sleep 60
      ;;
    1) # Valley: minimal activity
      echo "[$(date)] Pattern: VALLEY"
      timeout 125 pgbench -c 1 -j 1 -T 120 -R 5 $CONN 2>/dev/null || true
      ;;
    2) # Ramp up
      echo "[$(date)] Pattern: RAMP UP"
      for clients in 1 2 3 4; do
        [ $SHUTDOWN -eq 1 ] && break
        timeout 50 pgbench -c $clients -j $clients -T 45 $CONN 2>/dev/null || true
      done
      ;;
    3) # Ramp down
      echo "[$(date)] Pattern: RAMP DOWN"
      for clients in 4 3 2 1; do
        [ $SHUTDOWN -eq 1 ] && break
        timeout 50 pgbench -c $clients -j $clients -T 45 $CONN 2>/dev/null || true
      done
      ;;
    4) # Steady medium
      echo "[$(date)] Pattern: STEADY MEDIUM"
      timeout 185 pgbench -c 2 -j 2 -T 180 $CONN 2>/dev/null || true
      ;;
    5) # Burst pattern
      echo "[$(date)] Pattern: BURST"
      for i in 1 2 3; do
        [ $SHUTDOWN -eq 1 ] && break
        timeout 25 pgbench -c 4 -j 2 -T 20 $CONN 2>/dev/null || true
        [ $SHUTDOWN -eq 0 ] && sleep 40
      done
      ;;
  esac

  if [ $SHUTDOWN -eq 0 ]; then
    PAUSE=$((30 + RANDOM % 90))
    echo "[$(date)] Pausing for ${PAUSE}s"
    sleep $PAUSE
  fi
done
echo "[$(date)] Workload generator stopped gracefully"workload-generator:
image: postgres:15
depends_on:
target-db:
condition: service_healthy
volumes:
- ./workload/pgbench-variable.sh:/workload.sh:ro
entrypoint: ["/bin/bash", "/workload.sh"]
environment:
TARGET_DB_HOST: target-db
TARGET_DB_NAME: target_database
TARGET_DB_USER: postgres
PGPASSWORD: postgres
stop_grace_period: 60s
labels:
- "pgai.preview=true"
mem_limit: 256m
cpus: 0.5Key fixes from reviews:
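Before wiring the generator into a preview, it can be smoke-tested against a throwaway PostgreSQL container. A sketch: container and network names are hypothetical, and the volume path assumes the command is run from the directory holding shared/workload/:

# Throwaway smoke test for pgbench-variable.sh (names are hypothetical).
docker network create wl-test
docker run -d --name wl-db --network wl-test \
  -e POSTGRES_PASSWORD=postgres -e POSTGRES_DB=target_database postgres:15
until docker exec wl-db pg_isready -U postgres >/dev/null 2>&1; do sleep 1; done

docker run -d --name wl-gen --network wl-test \
  -e TARGET_DB_HOST=wl-db -e PGPASSWORD=postgres \
  -v "$PWD/shared/workload/pgbench-variable.sh:/workload.sh:ro" \
  --entrypoint /bin/bash postgres:15 /workload.sh

docker logs -f wl-gen    # watch the pattern log; Ctrl-C to detach
# Cleanup: docker rm -f wl-gen wl-db && docker network rm wl-test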
- Calls shared sanitize-branch.sh script (single source of truth)
- Uses CI_COMMIT_REF_SLUG for environment name (available at parse time)
- Uses dotenv artifact for computed values
- resource_group uses slug from dotenv
- Credentials NOT printed to CI logs
# .gitlab-ci.yml additions
stages:
  - build
  - test
  - preview
  - cleanup

variables:
  PREVIEW_VM_HOST: "preview.pgai.internal"
  PREVIEW_VM_USER: "deploy"
  PREVIEW_BASE_DIR: "/opt/postgres-ai-previews"

# =============================================================================
# PREVIEW DEPLOYMENT (Manual Trigger)
# =============================================================================
preview:deploy:
  stage: preview
  when: manual
  image: alpine:3.21
  before_script:
    - apk add --no-cache openssh-client rsync bash coreutils
    - eval $(ssh-agent -s)
    - printf '%s' "$PREVIEW_SSH_PRIVATE_KEY" | tr -d '\r' | ssh-add -  # Pipe avoids ps exposure, POSIX compatible
    - mkdir -p ~/.ssh && chmod 700 ~/.ssh
    # Use shared script for slug computation
    - chmod +x ./scripts/sanitize-branch.sh
    - export BRANCH_SLUG=$(./scripts/sanitize-branch.sh "$CI_COMMIT_REF_NAME")
    - 'test -n "$BRANCH_SLUG" || (echo "ERROR: BRANCH_SLUG is empty" && exit 1)'
    # Write to dotenv for environment and subsequent jobs
    - echo "BRANCH_SLUG=${BRANCH_SLUG}" >> deploy.env
    - echo "PREVIEW_URL=https://preview-${BRANCH_SLUG}.pgai.watch" >> deploy.env
  script:
    - source deploy.env
    # Sync scripts first (including sanitize-branch.sh)
    - |
      rsync -avz --delete -e "ssh -o StrictHostKeyChecking=no" \
        ./scripts/ \
        ${PREVIEW_VM_USER}@${PREVIEW_VM_HOST}:${PREVIEW_BASE_DIR}/scripts/
    # Sync config files
    - |
      rsync -avz --delete -e "ssh -o StrictHostKeyChecking=no" \
        ./config/ \
        ${PREVIEW_VM_USER}@${PREVIEW_VM_HOST}:${PREVIEW_BASE_DIR}/previews/${BRANCH_SLUG}/config/
    # Run deploy script on VM
    - |
      ssh -o StrictHostKeyChecking=no ${PREVIEW_VM_USER}@${PREVIEW_VM_HOST} \
        "BRANCH_SLUG='${BRANCH_SLUG}' \
         COMMIT_SHA='${CI_COMMIT_SHA}' \
         BRANCH_NAME='${CI_COMMIT_REF_NAME}' \
         ${PREVIEW_BASE_DIR}/manager/deploy.sh"
    # Output access info (NO password in logs)
    - |
      echo "============================================"
      echo "Preview deployed successfully!"
      echo "URL: ${PREVIEW_URL}"
      echo "Username: admin"
      echo "Password: SSH to VM and run:"
      echo "  cat ${PREVIEW_BASE_DIR}/previews/${BRANCH_SLUG}/.env"
      echo "============================================"
  environment:
    name: preview/$CI_COMMIT_REF_SLUG
    url: https://preview-$CI_COMMIT_REF_SLUG.pgai.watch
    on_stop: preview:destroy
    auto_stop_in: 3 days
  artifacts:
    reports:
      dotenv: deploy.env
  rules:
    - if: $CI_COMMIT_BRANCH != "main"

# =============================================================================
# PREVIEW DESTRUCTION
# =============================================================================
preview:destroy:
  stage: cleanup
  when: manual
  image: alpine:3.21
  environment:
    name: preview/$CI_COMMIT_REF_SLUG
    action: stop
  before_script:
    - apk add --no-cache openssh-client bash coreutils
    - eval $(ssh-agent -s)
    - printf '%s' "$PREVIEW_SSH_PRIVATE_KEY" | tr -d '\r' | ssh-add -  # Pipe avoids ps exposure, POSIX compatible
    - chmod +x ./scripts/sanitize-branch.sh
    - export BRANCH_SLUG=$(./scripts/sanitize-branch.sh "$CI_COMMIT_REF_NAME")
  script:
    - |
      ssh -o StrictHostKeyChecking=no ${PREVIEW_VM_USER}@${PREVIEW_VM_HOST} \
        "BRANCH_SLUG='${BRANCH_SLUG}' \
         ${PREVIEW_BASE_DIR}/manager/destroy.sh"
  rules:
    - if: $CI_COMMIT_BRANCH != "main"

# =============================================================================
# RE-DEPLOY ON PUSH (if preview exists)
# =============================================================================
preview:update:
  stage: preview
  image: alpine:3.21
  before_script:
    - apk add --no-cache openssh-client rsync bash coreutils
    - eval $(ssh-agent -s)
    - printf '%s' "$PREVIEW_SSH_PRIVATE_KEY" | tr -d '\r' | ssh-add -  # Pipe avoids ps exposure, POSIX compatible
    - chmod +x ./scripts/sanitize-branch.sh
    - export BRANCH_SLUG=$(./scripts/sanitize-branch.sh "$CI_COMMIT_REF_NAME")
  script:
    - |
      # Check if preview exists (by checking for running containers)
      EXISTS=$(ssh -o StrictHostKeyChecking=no ${PREVIEW_VM_USER}@${PREVIEW_VM_HOST} \
        "docker ps --filter 'label=com.docker.compose.project=preview-${BRANCH_SLUG}' --format '{{.ID}}' | head -1")
      if [ -n "$EXISTS" ]; then
        echo "Preview exists, updating..."
        # Sync scripts
        rsync -avz --delete -e "ssh -o StrictHostKeyChecking=no" \
          ./scripts/ \
          ${PREVIEW_VM_USER}@${PREVIEW_VM_HOST}:${PREVIEW_BASE_DIR}/scripts/
        # Sync configs
        rsync -avz --delete -e "ssh -o StrictHostKeyChecking=no" \
          ./config/ \
          ${PREVIEW_VM_USER}@${PREVIEW_VM_HOST}:${PREVIEW_BASE_DIR}/previews/${BRANCH_SLUG}/config/
        # Run update
        ssh -o StrictHostKeyChecking=no ${PREVIEW_VM_USER}@${PREVIEW_VM_HOST} \
          "BRANCH_SLUG='${BRANCH_SLUG}' \
           COMMIT_SHA='${CI_COMMIT_SHA}' \
           ${PREVIEW_BASE_DIR}/manager/deploy.sh --update"
      else
        echo "No running preview for ${BRANCH_SLUG}, skipping"
      fi
  rules:
    - if: $CI_COMMIT_BRANCH != "main"
      when: on_success

#!/bin/bash
# /opt/postgres-ai-previews/scripts/cloudflare-dns.sh
# Manages DNS records via Cloudflare API
set -euo pipefail
ACTION="${1:?Usage: cloudflare-dns.sh <create|delete> <subdomain>}"
SUBDOMAIN="${2:?Usage: cloudflare-dns.sh <create|delete> <subdomain>}"
# Load credentials
source /opt/postgres-ai-previews/traefik/.env
: "${CF_DNS_API_TOKEN:?CF_DNS_API_TOKEN not set}"
: "${CF_ZONE_ID:?CF_ZONE_ID not set}"
: "${VM_PUBLIC_IP:?VM_PUBLIC_IP not set}"
FULL_NAME="${SUBDOMAIN}.pgai.watch"
API_BASE="https://api.cloudflare.com/client/v4/zones/${CF_ZONE_ID}/dns_records"
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] cloudflare-dns: $*"; }
case "$ACTION" in
create)
# Check if record exists
EXISTING=$(curl -s -X GET "${API_BASE}?name=${FULL_NAME}" \
-H "Authorization: Bearer ${CF_DNS_API_TOKEN}" \
-H "Content-Type: application/json" | jq -r '.result[0].id // empty')
if [ -n "$EXISTING" ]; then
log "Record ${FULL_NAME} already exists (ID: ${EXISTING}), updating..."
curl -s -X PUT "${API_BASE}/${EXISTING}" \
-H "Authorization: Bearer ${CF_DNS_API_TOKEN}" \
-H "Content-Type: application/json" \
--data '{
"type": "A",
"name": "'"${FULL_NAME}"'",
"content": "'"${VM_PUBLIC_IP}"'",
"ttl": 120,
"proxied": true
}' | jq -e '.success' > /dev/null
else
log "Creating DNS record: ${FULL_NAME} -> ${VM_PUBLIC_IP}"
curl -s -X POST "${API_BASE}" \
-H "Authorization: Bearer ${CF_DNS_API_TOKEN}" \
-H "Content-Type: application/json" \
--data '{
"type": "A",
"name": "'"${FULL_NAME}"'",
"content": "'"${VM_PUBLIC_IP}"'",
"ttl": 120,
"proxied": true
}' | jq -e '.success' > /dev/null
fi
log "DNS record created/updated: ${FULL_NAME}"
;;
delete)
RECORD_ID=$(curl -s -X GET "${API_BASE}?name=${FULL_NAME}" \
-H "Authorization: Bearer ${CF_DNS_API_TOKEN}" \
-H "Content-Type: application/json" | jq -r '.result[0].id // empty')
if [ -n "$RECORD_ID" ]; then
log "Deleting DNS record: ${FULL_NAME} (ID: ${RECORD_ID})"
curl -s -X DELETE "${API_BASE}/${RECORD_ID}" \
-H "Authorization: Bearer ${CF_DNS_API_TOKEN}" | jq -e '.success' > /dev/null
log "DNS record deleted"
else
log "No DNS record found for ${FULL_NAME}, skipping"
fi
;;
*)
echo "Unknown action: $ACTION" >&2
exit 1
;;
esac#!/bin/bash
# /opt/postgres-ai-previews/manager/deploy.sh
set -euo pipefail
PREVIEW_BASE_DIR="/opt/postgres-ai-previews"
# Required variables (fail if missing)
: "${BRANCH_SLUG:?BRANCH_SLUG is required}"
: "${COMMIT_SHA:?COMMIT_SHA is required}"
BRANCH_NAME="${BRANCH_NAME:-$BRANCH_SLUG}"
UPDATE_MODE="${1:-}"
PREVIEW_DIR="${PREVIEW_BASE_DIR}/previews/${BRANCH_SLUG}"
MAX_PREVIEWS=2
GLOBAL_LOCK="${PREVIEW_BASE_DIR}/manager/.global.lock"
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; }
# Validate BRANCH_SLUG format (security)
if ! [[ "$BRANCH_SLUG" =~ ^[a-z0-9-]{1,63}$ ]]; then
  log "ERROR: Invalid BRANCH_SLUG format: $BRANCH_SLUG"
  exit 1
fi

# Validate PREVIEW_DIR is within expected path (safety)
REAL_PREVIEW_DIR=$(realpath -m "$PREVIEW_DIR")
REAL_BASE="${PREVIEW_BASE_DIR}/previews"
if [[ "$REAL_PREVIEW_DIR" != "${REAL_BASE}/"* ]]; then
  log "ERROR: PREVIEW_DIR outside expected path: $REAL_PREVIEW_DIR"
  exit 1
fi
# =============================================================================
# GLOBAL LOCK for quota/resource checks
# =============================================================================
exec 200>"$GLOBAL_LOCK"
flock -w 30 200 || { log "ERROR: Could not acquire global lock"; exit 1; }
# =============================================================================
# QUOTA CHECK (skip for updates)
# =============================================================================
if [ "$UPDATE_MODE" != "--update" ]; then
# Count RUNNING preview containers (not directories)
CURRENT=$(docker ps --filter 'label=pgai.preview=true' --format '{{.ID}}' | wc -l)
# Divide by ~7 services per preview to get preview count
CURRENT_PREVIEWS=$((CURRENT / 7))
if [ "$CURRENT_PREVIEWS" -ge "$MAX_PREVIEWS" ]; then
log "ERROR: Maximum preview limit ($MAX_PREVIEWS) reached. Running: $CURRENT_PREVIEWS"
log "Active previews:"
docker ps --filter 'label=pgai.preview=true' --format '{{.Labels}}' | grep 'com.docker.compose.project=' | sort -u
exit 1
fi
# Check disk space (fail if < 5GB free)
FREE_GB=$(df -BG "${PREVIEW_BASE_DIR}" | awk 'NR==2 {print $4}' | tr -d 'G')
if [ "$FREE_GB" -lt 5 ]; then
log "ERROR: Insufficient disk space. Free: ${FREE_GB}GB, Required: 5GB"
exit 1
fi
# Check memory (fail if < 1GB free)
FREE_MEM_MB=$(free -m | awk '/^Mem:/ {print $7}')
if [ "$FREE_MEM_MB" -lt 1024 ]; then
log "ERROR: Insufficient memory. Available: ${FREE_MEM_MB}MB, Required: 1024MB"
exit 1
fi
fi
# Release global lock
flock -u 200
# =============================================================================
# PER-PREVIEW LOCK
# =============================================================================
mkdir -p "${PREVIEW_DIR}"
exec 201>"${PREVIEW_DIR}/.lock"
flock -w 60 201 || { log "ERROR: Could not acquire preview lock"; exit 1; }
log "Deploying preview: $BRANCH_SLUG (commit: ${COMMIT_SHA:0:8})"
cd "${PREVIEW_DIR}"
# =============================================================================
# CREDENTIALS (only generate if .env missing)
# =============================================================================
if [ ! -f ".env" ]; then
GRAFANA_PASSWORD=$(openssl rand -base64 16 | tr -dc 'a-zA-Z0-9' | head -c 16)
cat > ".env" << EOF
BRANCH_SLUG=${BRANCH_SLUG}
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
EOF
chmod 600 .env
log "Generated new credentials"
fi
# Always source .env (now guaranteed to exist)
source .env
# Update COMMIT_SHA in .env on updates
if [ "$UPDATE_MODE" = "--update" ]; then
if grep -q '^COMMIT_SHA=' .env; then
sed -i "s/^COMMIT_SHA=.*/COMMIT_SHA=${COMMIT_SHA}/" .env
else
echo "COMMIT_SHA=${COMMIT_SHA}" >> .env
fi
else
echo "COMMIT_SHA=${COMMIT_SHA}" >> .env
fi
# =============================================================================
# GENERATE COMPOSE FILE
# =============================================================================
export BRANCH_SLUG COMMIT_SHA GRAFANA_ADMIN_USER GRAFANA_ADMIN_PASSWORD
envsubst '$BRANCH_SLUG $COMMIT_SHA $GRAFANA_ADMIN_USER $GRAFANA_ADMIN_PASSWORD' \
< "${PREVIEW_BASE_DIR}/shared/docker-compose.preview.template.yml" \
> docker-compose.yml
# Validate generated compose file
if ! docker compose config > /dev/null 2>&1; then
  log "ERROR: Generated docker-compose.yml is invalid"
  docker compose config
  exit 1
fi
# =============================================================================
# STATE FILE (NO secrets - only metadata)
# =============================================================================
# Compute created_at BEFORE the redirect below truncates state.json;
# otherwise the command substitution would read an already-empty file
# and created_at would come up blank on every update.
CREATED_AT=$(jq -r '.created_at // empty' state.json 2>/dev/null || true)
[ -n "$CREATED_AT" ] || CREATED_AT=$(date -u +%Y-%m-%dT%H:%M:%SZ)
cat > state.json << EOF
{
  "branch": "${BRANCH_NAME}",
  "branch_slug": "${BRANCH_SLUG}",
  "commit_sha": "${COMMIT_SHA}",
  "created_at": "${CREATED_AT}",
  "updated_at": "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}
EOF
# =============================================================================
# DNS RECORD (only on fresh deploy)
# =============================================================================
if [ "$UPDATE_MODE" != "--update" ]; then
log "Creating DNS record..."
"${PREVIEW_BASE_DIR}/scripts/cloudflare-dns.sh" create "preview-${BRANCH_SLUG}" || \
log "WARNING: DNS record creation failed (may already exist)"
fi
# =============================================================================
# DEPLOY
# =============================================================================
log "Pulling images..."
docker compose -p "preview-${BRANCH_SLUG}" pull --quiet
log "Starting services..."
docker compose -p "preview-${BRANCH_SLUG}" up -d --force-recreate --remove-orphans
# =============================================================================
# HEALTH CHECK (using docker inspect, not wget inside container)
# =============================================================================
log "Waiting for Grafana to be healthy..."
GRAFANA_CONTAINER="preview-${BRANCH_SLUG}-grafana-1"
for i in $(seq 1 30); do
  HEALTH=$(docker inspect --format='{{.State.Health.Status}}' "$GRAFANA_CONTAINER" 2>/dev/null || echo "starting")
  if [ "$HEALTH" = "healthy" ]; then
    log "Grafana is healthy"
    break
  fi
  if [ $i -eq 30 ]; then
    log "ERROR: Grafana failed to become healthy (status: $HEALTH)"
    log "Container status:"
    docker compose -p "preview-${BRANCH_SLUG}" ps
    log "Grafana logs:"
    docker compose -p "preview-${BRANCH_SLUG}" logs grafana --tail=50
    # Rollback
    log "Rolling back..."
    docker compose -p "preview-${BRANCH_SLUG}" down -v --remove-orphans
    "${PREVIEW_BASE_DIR}/scripts/cloudflare-dns.sh" delete "preview-${BRANCH_SLUG}" || true
    rm -rf "${PREVIEW_DIR}"
    exit 1
  fi
  sleep 2
done
# =============================================================================
# SUCCESS (NO password in output)
# =============================================================================
log "SUCCESS: Preview deployed"
log "URL: https://preview-${BRANCH_SLUG}.pgai.watch"
log "Username: ${GRAFANA_ADMIN_USER}"
log "Password: cat ${PREVIEW_DIR}/.env"#!/bin/bash
# /opt/postgres-ai-previews/manager/destroy.sh
set -euo pipefail
PREVIEW_BASE_DIR="/opt/postgres-ai-previews"
: "${BRANCH_SLUG:?BRANCH_SLUG is required}"
PREVIEW_DIR="${PREVIEW_BASE_DIR}/previews/${BRANCH_SLUG}"
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; }
# Validate BRANCH_SLUG format
if ! [[ "$BRANCH_SLUG" =~ ^[a-z0-9-]{1,63}$ ]]; then
  log "ERROR: Invalid BRANCH_SLUG format: $BRANCH_SLUG"
  exit 1
fi

# Safety check: ensure we're deleting within expected path
REAL_PREVIEW_DIR=$(realpath -m "$PREVIEW_DIR")
REAL_BASE="${PREVIEW_BASE_DIR}/previews"
if [[ "$REAL_PREVIEW_DIR" != "${REAL_BASE}/"* ]]; then
  log "SAFETY ERROR: Invalid preview directory: $REAL_PREVIEW_DIR"
  exit 1
fi

if [ ! -d "$PREVIEW_DIR" ]; then
  log "Preview directory not found: $BRANCH_SLUG"
  # Still try to clean up DNS
  "${PREVIEW_BASE_DIR}/scripts/cloudflare-dns.sh" delete "preview-${BRANCH_SLUG}" 2>/dev/null || true
  exit 0
fi

# Acquire per-preview lock (skip if held - don't block, just warn)
if ! flock -n "${PREVIEW_DIR}/.lock" true 2>/dev/null; then
  log "WARNING: Preview is locked (deploy in progress?), proceeding anyway..."
fi
log "Destroying preview: $BRANCH_SLUG"
cd "$PREVIEW_DIR"
docker compose -p "preview-${BRANCH_SLUG}" down -v --remove-orphans 2>/dev/null || true
# Delete DNS record
log "Deleting DNS record..."
"${PREVIEW_BASE_DIR}/scripts/cloudflare-dns.sh" delete "preview-${BRANCH_SLUG}" || \
log "WARNING: DNS record deletion failed"
# Safe deletion
rm -rf "$PREVIEW_DIR"
log "Preview destroyed: $BRANCH_SLUG"#!/bin/bash
# /opt/postgres-ai-previews/manager/cleanup-ttl.sh
# Cron: */30 * * * * /opt/postgres-ai-previews/manager/cleanup-ttl.sh >> /var/log/preview-cleanup.log 2>&1
set -euo pipefail
PREVIEW_BASE_DIR="/opt/postgres-ai-previews"
PREVIEW_BASE="${PREVIEW_BASE_DIR}/previews"
TTL_SECONDS=$((3 * 24 * 60 * 60)) # 3 days
NOW=$(date +%s)
GLOBAL_LOCK="${PREVIEW_BASE_DIR}/manager/.global.lock"
DISK_PRUNE_THRESHOLD=80 # Prune only if disk usage > 80%
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; }
log "Starting cleanup check..."
# Use nullglob to handle empty directory
shopt -s nullglob
PREVIEW_DIRS=("${PREVIEW_BASE}"/*/)
shopt -u nullglob
if [ ${#PREVIEW_DIRS[@]} -eq 0 ]; then
  log "No previews to check"
else
  for preview_dir in "${PREVIEW_DIRS[@]}"; do
    state_file="${preview_dir}state.json"
    [ -f "$state_file" ] || continue
    branch_slug=$(basename "$preview_dir")

    # Skip if locked (deploy in progress)
    if ! flock -n "${preview_dir}.lock" true 2>/dev/null; then
      log "Skipping ${branch_slug}: locked"
      continue
    fi

    updated_at=$(jq -r '.updated_at' "$state_file" 2>/dev/null || echo "")
    if [ -z "$updated_at" ]; then
      log "WARNING: No updated_at in $state_file, skipping"
      continue
    fi

    updated_ts=$(date -d "$updated_at" +%s 2>/dev/null || echo 0)
    age=$((NOW - updated_ts))

    if [ $age -gt $TTL_SECONDS ]; then
      log "Cleaning stale preview: $branch_slug (age: $((age/86400)) days)"

      # Safety check
      REAL_DIR=$(realpath -m "$preview_dir")
      if [[ "$REAL_DIR" != "${PREVIEW_BASE}/"* ]]; then
        log "SAFETY ERROR: Invalid path: $preview_dir"
        continue
      fi

      cd "$preview_dir"
      docker compose -p "preview-${branch_slug}" down -v --remove-orphans 2>/dev/null || true

      # Delete DNS record
      "${PREVIEW_BASE_DIR}/scripts/cloudflare-dns.sh" delete "preview-${branch_slug}" 2>/dev/null || true

      rm -rf "$preview_dir"
      log "Removed: $branch_slug"
    fi
  done
fi

# =============================================================================
# CONDITIONAL DOCKER CLEANUP (only if disk > threshold)
# =============================================================================
exec 200>"$GLOBAL_LOCK"
if flock -n 200; then
  DISK_USAGE=$(df "${PREVIEW_BASE_DIR}" | awk 'NR==2 {print $5}' | tr -d '%')
  if [ "$DISK_USAGE" -gt "$DISK_PRUNE_THRESHOLD" ]; then
    log "Disk usage ${DISK_USAGE}% > ${DISK_PRUNE_THRESHOLD}%, running Docker cleanup..."
    docker image prune -af --filter "until=72h" 2>/dev/null || true
    # Don't prune volumes - they belong to running previews
    log "Docker cleanup complete"
  else
    log "Disk usage ${DISK_USAGE}% <= ${DISK_PRUNE_THRESHOLD}%, skipping prune"
  fi
  flock -u 200
fi
log "Cleanup complete"For immediate cleanup when branches are merged/deleted. Run as a simple systemd service (internal only, no Traefik routing needed for V1):
#!/usr/bin/env python3
# /opt/postgres-ai-previews/manager/webhook-handler.py
import os
import re
import subprocess
import hashlib
import threading

from flask import Flask, request, jsonify

app = Flask(__name__)

WEBHOOK_SECRET = os.environ.get('GITLAB_WEBHOOK_SECRET', '')
PREVIEW_BASE_DIR = '/opt/postgres-ai-previews'
ALLOWED_SLUG_PATTERN = re.compile(r'^[a-z0-9-]{1,63}$')


def sanitize_branch(branch: str) -> str:
    """Sanitize branch name to DNS-safe slug."""
    slug = branch.lower()
    slug = re.sub(r'[/_]', '-', slug)
    slug = re.sub(r'[^a-z0-9-]', '', slug)
    slug = re.sub(r'-+', '-', slug)
    slug = slug.strip('-')
    if len(slug) > 50:
        hash_suffix = hashlib.sha1(branch.encode()).hexdigest()[:8]
        slug = f"{slug[:50]}-{hash_suffix}"
    return slug[:63]


def cleanup_preview_async(branch_slug: str):
    """Run cleanup in background thread."""
    def _cleanup():
        if not ALLOWED_SLUG_PATTERN.match(branch_slug):
            app.logger.error(f'Invalid branch slug: {branch_slug}')
            return
        try:
            subprocess.run(
                [f'{PREVIEW_BASE_DIR}/manager/destroy.sh'],
                env={**os.environ, 'BRANCH_SLUG': branch_slug},
                capture_output=True,
                timeout=120
            )
            app.logger.info(f'Cleaned up preview: {branch_slug}')
        except Exception as e:
            app.logger.error(f'Cleanup failed for {branch_slug}: {e}')

    thread = threading.Thread(target=_cleanup)
    thread.daemon = True
    thread.start()


@app.route('/webhook', methods=['POST'])
def handle_webhook():
    token = request.headers.get('X-Gitlab-Token', '')
    if not token or token != WEBHOOK_SECRET:
        return jsonify({'error': 'unauthorized'}), 401

    data = request.json or {}
    event = request.headers.get('X-Gitlab-Event', '')

    # Branch deletion (push with null SHA)
    if event == 'Push Hook' and data.get('after') == '0' * 40:
        branch = data.get('ref', '').replace('refs/heads/', '')
        branch_slug = sanitize_branch(branch)
        cleanup_preview_async(branch_slug)
        return jsonify({'status': 'cleanup_started', 'branch': branch_slug}), 202

    # MR merge
    if event == 'Merge Request Hook':
        attrs = data.get('object_attributes', {})
        if attrs.get('action') == 'merge':
            branch = attrs.get('source_branch', '')
            branch_slug = sanitize_branch(branch)
            cleanup_preview_async(branch_slug)
            return jsonify({'status': 'cleanup_started', 'branch': branch_slug}), 202

    return jsonify({'status': 'ignored'})


@app.route('/health', methods=['GET'])
def health():
    return jsonify({'status': 'ok'})


if __name__ == '__main__':
    app.run(host='127.0.0.1', port=9999)
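A minimal systemd unit for this handler might look as follows. This is a sketch only: the unit name, user, and hard-coded secret placeholder are assumptions, not part of the implemented scripts.

# Hypothetical one-time setup on the VM (run as root).
cat > /etc/systemd/system/preview-webhook.service << 'EOF'
[Unit]
Description=Preview cleanup webhook handler
After=network.target docker.service

[Service]
User=deploy
Environment=GITLAB_WEBHOOK_SECRET=change-me
ExecStart=/usr/bin/python3 /opt/postgres-ai-previews/manager/webhook-handler.py
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now preview-webhook.service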
Strategy: Dynamic A Records via Cloudflare API
Since pgai.watch already has existing dynamic DNS infrastructure, we create/delete A records per preview:
- On deploy: Create preview-{branch-slug}.pgai.watch → VM IP (proxied)
- On destroy: Delete the record
This approach:
- Works with Cloudflare Free/Pro plans
- Doesn't conflict with existing *.pgai.watch records
- Allows proxied (orange cloud) for DDoS protection
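After a deploy, record creation can be verified from any machine (a quick manual check, not part of the scripts; the branch slug below is an example, and the API call assumes the same CF_ZONE_ID/CF_DNS_API_TOKEN values the VM uses are exported locally):

# Should resolve to Cloudflare proxy IPs, since the record is proxied:
dig +short preview-feature-add-metrics.pgai.watch

# Or query the Cloudflare API directly:
curl -s "https://api.cloudflare.com/client/v4/zones/${CF_ZONE_ID}/dns_records?name=preview-feature-add-metrics.pgai.watch" \
  -H "Authorization: Bearer ${CF_DNS_API_TOKEN}" | jq '.result[].content'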
Required Cloudflare API credentials (store in /opt/postgres-ai-previews/traefik/.env):
CF_DNS_API_TOKEN=<your-cloudflare-api-token> # Needs Zone:DNS:Edit permission
CF_ZONE_ID=<your-zone-id> # From Cloudflare dashboard
VM_PUBLIC_IP=<hetzner-vm-ip>

# /opt/postgres-ai-previews/traefik/docker-compose.yml
version: "3.8"
services:
  traefik:
    image: traefik:v3.0
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
      - "127.0.0.1:8080:8080"  # Dashboard on localhost ONLY
    env_file:
      - .env  # Contains CF_DNS_API_TOKEN
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./traefik.yml:/etc/traefik/traefik.yml:ro
      - ./acme.json:/acme.json
    networks:
      - traefik-public

networks:
  traefik-public:
    name: traefik-public

# /opt/postgres-ai-previews/traefik/traefik.yml
api:
  dashboard: true
  insecure: true  # Only accessible on localhost:8080

entryPoints:
  web:
    address: ":80"
    http:
      redirections:
        entryPoint:
          to: websecure
          scheme: https
  websecure:
    address: ":443"

providers:
  docker:
    endpoint: "unix:///var/run/docker.sock"
    exposedByDefault: false
    network: traefik-public

certificatesResolvers:
  letsencrypt:
    acme:
      email: admin@postgres.ai
      storage: /acme.json
      dnsChallenge:
        provider: cloudflare
        delayBeforeCheck: 10
        resolvers:
          - "1.1.1.1:53"
          - "8.8.8.8:53"

# In docker-compose.preview.template.yml
services:
  grafana:
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.preview-${BRANCH_SLUG}.rule=Host(`preview-${BRANCH_SLUG}.pgai.watch`)"
      - "traefik.http.routers.preview-${BRANCH_SLUG}.entrypoints=websecure"
      - "traefik.http.routers.preview-${BRANCH_SLUG}.tls.certresolver=letsencrypt"
      - "traefik.http.services.preview-${BRANCH_SLUG}.loadbalancer.server.port=3000"
      - "traefik.docker.network=traefik-public"
      - "pgai.preview=true"

Using Compose v2 native limits (mem_limit, cpus), NOT deploy.resources.
| Service | CPU | Memory | Notes |
|---|---|---|---|
| Grafana | 0.25 | 256m | |
| VictoriaMetrics | 0.4 | 768m | |
| PGWatch | 0.5 | 512m | |
| Sink PostgreSQL | 0.2 | 512m | |
| Target DB | 0.3 | 512m | |
| Workload Generator | 0.15 | 128m | |
| Flask Backend | 0.1 | 128m | |
| Total per preview | ~1.9 | ~2.8GB | |
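Whether the limits actually took effect can be spot-checked on the VM after a deploy (a quick verification, not part of the deployed scripts):

# Show effective memory/CPU limits for all preview containers.
docker ps --filter 'label=pgai.preview=true' --format '{{.Names}}' | while read -r name; do
  docker inspect "$name" \
    --format '{{.Name}}: mem={{.HostConfig.Memory}} bytes, nano_cpus={{.HostConfig.NanoCpus}}'
done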
# docker-compose.preview.template.yml (excerpt)
services:
  grafana:
    image: grafana/grafana:12.0.2
    mem_limit: 256m
    cpus: 0.25
    healthcheck:
      test: ["CMD-SHELL", "wget -q --spider http://localhost:3000/api/health || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s
    labels:
      - "pgai.preview=true"

  victoria-metrics:
    image: victoriametrics/victoria-metrics:v1.105.0
    mem_limit: 768m
    cpus: 0.4
    command:
      - "-retentionPeriod=72h"
    healthcheck:
      test: ["CMD-SHELL", "wget -q --spider http://localhost:8428/health || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 3
    labels:
      - "pgai.preview=true"

  target-db:
    image: postgres:15
    mem_limit: 512m
    cpus: 0.3
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 5s
      retries: 10
      start_period: 10s
    labels:
      - "pgai.preview=true"

| VM Size | vCPU | RAM | Max Previews | Monthly Cost |
|---|---|---|---|---|
| CX31 | 4 | 8GB | 2 | ~€10 |
| CX41 | 8 | 16GB | 4-5 | ~€20 |
| CX51 | 16 | 32GB | 8-10 | ~€40 |
IMPORTANT: Only ONE Hetzner VM for preview environments is allowed at any time.
- Naming pattern: ralph-preview-XX (e.g., ralph-preview-01)
- Before provisioning: Check Hetzner Cloud console for existing ralph-preview-* servers
- If VM exists: Use the existing one or delete it first (after confirming no active previews)
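If the hcloud CLI is available, the pre-provisioning check can be scripted instead of done by eye in the console (a sketch; assumes an authenticated hcloud context):

# Fail fast if a preview VM already exists.
if hcloud server list -o noheader -o columns=name | grep -q '^ralph-preview-'; then
  echo "ERROR: a ralph-preview-* server already exists; reuse or delete it first" >&2
  exit 1
fi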
#!/bin/bash
# run as root on fresh Hetzner VM (named ralph-preview-XX)
set -euo pipefail
# Verify hostname follows naming convention
if [[ ! "$(hostname)" =~ ^ralph-preview- ]]; then
echo "WARNING: Hostname should start with 'ralph-preview-'"
echo "Current hostname: $(hostname)"
fi
# System updates
apt-get update && apt-get upgrade -y
# Install Docker
curl -fsSL https://get.docker.com | sh
systemctl enable --now docker
# Install dependencies (including python3-flask for webhook handler)
apt-get install -y jq rsync python3-flask curl
# Create deploy user
useradd -m -s /bin/bash -G docker deploy
mkdir -p /home/deploy/.ssh
# Add your SSH public key here
echo "ssh-ed25519 AAAA... deploy@ci" >> /home/deploy/.ssh/authorized_keys
chown -R deploy:deploy /home/deploy/.ssh
chmod 700 /home/deploy/.ssh
chmod 600 /home/deploy/.ssh/authorized_keys
# Create swap (prevents OOM kills)
fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab
# Set swappiness to 1 (preferred for DB workloads)
sysctl vm.swappiness=1
echo 'vm.swappiness=1' >> /etc/sysctl.conf
# Docker logging limits (prevents disk exhaustion)
cat > /etc/docker/daemon.json << 'EOF'
{
"log-driver": "json-file",
"log-opts": {
"max-size": "10m",
"max-file": "3"
}
}
EOF
systemctl restart docker
# Create directory structure
mkdir -p /opt/postgres-ai-previews/{traefik,previews,shared/workload,manager,scripts}
touch /opt/postgres-ai-previews/manager/.global.lock
chown -R deploy:deploy /opt/postgres-ai-previews
# Firewall
ufw allow 22/tcp
ufw allow 80/tcp
ufw allow 443/tcp
ufw --force enable
echo "VM provisioning complete"
echo ""
echo "Next steps:"
echo "1. Add Cloudflare credentials to /opt/postgres-ai-previews/traefik/.env"
echo "2. Deploy Traefik stack"
echo "3. Test DNS record creation"#!/bin/bash
# /opt/postgres-ai-previews/traefik/setup.sh
set -euo pipefail
cd /opt/postgres-ai-previews/traefik
# Create network if not exists
docker network inspect traefik-public >/dev/null 2>&1 || \
docker network create traefik-public
# Initialize acme.json with correct permissions
[ -f acme.json ] || (touch acme.json && chmod 600 acme.json)
# Verify .env exists
if [ ! -f .env ]; then
  echo "ERROR: .env file missing. Create it with:"
  echo "  CF_DNS_API_TOKEN=<token>"
  echo "  CF_ZONE_ID=<zone-id>"
  echo "  VM_PUBLIC_IP=<ip>"
  exit 1
fi
chmod 600 .env
# Start or recreate Traefik
docker compose up -d --force-recreate
echo "Traefik setup complete"# /opt/postgres-ai-previews/monitoring/docker-compose.yml
version: "3.8"
services:
node-exporter:
image: prom/node-exporter:v1.8.2
restart: unless-stopped
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
command:
- '--path.procfs=/host/proc'
- '--path.sysfs=/host/sys'
- '--path.rootfs=/rootfs'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
ports:
- "127.0.0.1:9100:9100"External monitoring (Healthchecks.io or similar):
# Add to crontab
* * * * * curl -fsS --retry 3 https://hc-ping.com/YOUR-UUID > /dev/null

Alert thresholds:
- Disk usage > 80%: Warning
- Disk usage > 90%: Critical
- Memory available < 512MB: Warning
- Load average > 4 (vCPU count): Warning
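The disk threshold can be spot-checked by hand against node-exporter output (a sketch; assumes the monitoring stack above is running on the VM):

# Compute root-filesystem usage from node-exporter metrics (run on the VM).
curl -s localhost:9100/metrics | awk '
  /node_filesystem_avail_bytes.*mountpoint="\/"/ { avail = $2 }
  /node_filesystem_size_bytes.*mountpoint="\/"/  { size  = $2 }
  END { if (size > 0) printf "root disk used: %.0f%%\n", 100 * (1 - avail / size) }'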
- Verify no existing preview VM — Check Hetzner console for any ralph-preview-* servers
- Provision Hetzner CX31 VM named ralph-preview-01
- Run provisioning script (Docker, swap, swappiness, firewall, deploy user, python3-flask)
- Configure Docker logging limits
- Set up directory structure with lock files
- Create Cloudflare API token with Zone:DNS:Edit permission
- Add credentials to /opt/postgres-ai-previews/traefik/.env
- Run Traefik setup script (idempotent)
- Test DNS record creation via cloudflare-dns.sh
- Set up VM monitoring (node-exporter + external ping)
Scripts ready: preview-infra/provision-vm.sh, preview-infra/traefik/setup.sh, preview-infra/traefik/traefik.yml
- Create docker-compose.preview.template.yml
- Use native Compose resource limits (mem_limit, cpus)
- Add Traefik labels for routing (simplified, no explicit domain config)
- Add healthchecks to all services
- Add pgai.preview=true label to all containers (for quota counting)
- Deploy scripts/sanitize-branch.sh (single source of truth)
- Deploy scripts/cloudflare-dns.sh (DNS management)
- Create manager/deploy.sh with flock locking and quota enforcement
- Create manager/destroy.sh with safety checks
- Test single preview deployment manually (requires VM)
All code written in: preview-infra/shared/, preview-infra/scripts/, preview-infra/manager/
- Create pgbench-variable.sh with graceful shutdown
- Add idempotent schema initialization
- Test patterns produce interesting metrics (requires running preview)
- Verify resource usage stays within limits (requires running preview)
Script ready: preview-infra/shared/workload/pgbench-variable.sh
- Add CI jobs to .gitlab-ci.yml
- Ensure CI calls shared sanitize-branch.sh script
- Configure CI variables (PREVIEW_SSH_PRIVATE_KEY, PREVIEW_VM_HOST)
- Test preview:deploy manual trigger
- Test preview:update on push
- Test preview:destroy
- Credentials printed to CI logs for testing (will remove for production)
CI jobs defined: preview:deploy, preview:update, preview:destroy
- Create cleanup-ttl.sh cron job script (with conditional prune)
- Deploy cleanup cron job on VM
- (Optional) Deploy webhook handler as systemd service
- Configure GitLab webhook for branch deletion
- Test 3-day TTL expiry
- Test branch deletion cleanup
Script ready: preview-infra/manager/cleanup-ttl.sh
- Update CONTRIBUTING.md with preview usage
- Document how to retrieve credentials (SSH to VM)
- Create runbook for common issues
- Add preview URL to MR template
This feature is considered DONE when the following acceptance criteria are met:
- Sample MR exists — A test MR is created, branched from the plan/monitoring-preview-environments branch (the branch where this system is developed)
- CI job visible in pipeline — The sample MR's pipeline contains the preview:deploy job
- Manual trigger works — Clicking the manual trigger on preview:deploy successfully provisions a new preview environment
- Grafana accessible — The preview URL (https://preview-{branch-slug}.pgai.watch) opens in a browser and shows dashboards (anonymous viewer access enabled)
- Dashboards are non-empty — At least one dashboard shows populated panels (not "No Data")
- Charts show activity — Graphs display metrics indicating the demo database is busy with the sample workload (e.g., transactions per second, query latency, active connections)
- Workload patterns visible — Over a 10-minute observation period, the charts show varying activity levels (spikes, valleys, or ramps) demonstrating the variable workload generator is functioning
Note: Anonymous viewer access is enabled (GF_AUTH_ANONYMOUS_ORG_ROLE=Viewer) for easy verification. Admin credentials are still available in .env for full access.
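The "Grafana accessible" criterion can also be checked from a shell (a quick smoke test; the slug matches the test branch used in the steps below):

# Expect HTTP 200 and a JSON body with "database": "ok" once the preview is up.
curl -fsS https://preview-test-preview-demo.pgai.watch/api/health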
1. Create a test branch from plan/monitoring-preview-environments:
   git checkout plan/monitoring-preview-environments
   git checkout -b test/preview-demo
   git push -u origin test/preview-demo
2. Open the MR pipeline in GitLab and click the play button on preview:deploy
3. Wait for deployment to complete (check CI job logs for success message and URL)
4. Open the preview URL in browser, login with credentials from:
   ssh deploy@<PREVIEW_VM_HOST> "cat /opt/postgres-ai-previews/previews/test-preview-demo/.env"
5. Navigate to dashboards and verify:
   - Panels are loading data (no "No Data" messages)
   - Time-series graphs show activity over the last 5-15 minutes
   - Metrics like pg_stat_activity, query rates, or TPS are visible
6. Cleanup: Click preview:destroy or delete the test branch
| Risk | Impact | Mitigation |
|---|---|---|
| Multiple VMs provisioned | High | Only 1 VM allowed — name must be ralph-preview-*, verify in Hetzner console before provisioning |
| Resource exhaustion | High | Hard limit (2 previews), quota checks count running containers, swap |
| Disk exhaustion | High | Conditional prune (>80%), log limits, 72h retention |
| OOM kills | High | mem_limit on all containers, swap file, swappiness=1 |
| DNS API failure | Medium | Graceful handling, manual cleanup possible |
| Concurrent deploy races | Medium | flock locking (global + per-preview) |
| Stale previews | Low | 3-day TTL cron, branch deletion webhook |
| Credential leakage | Medium | Stored only in .env (chmod 600), never in logs or state.json |
| Docker socket exposure | Low | Read-only mount, acceptable for V1 (socket proxy for V2) |
| Version | Date | Changes |
|---|---|---|
| 1.0.0 | 2026-01-23 | Initial draft |
| 1.1.0 | 2026-01-23 | Major revision after 3 expert reviews: fixed CI variable expansion, switched to DNS-01 wildcard cert, added hash suffix for collision prevention, replaced deploy.resources with native Compose limits, per-preview credentials, hard quota enforcement, disk/memory checks, rsync instead of registry, graceful shutdown, healthchecks, safety checks in cleanup, Traefik dashboard not exposed, proper pg_isready wait, deployment validation with rollback, VM monitoring as required |
| 1.2.0 | 2026-01-23 | Changed to 3rd level subdomain for Cloudflare wildcard cert compatibility |
| 1.3.0 | 2026-01-23 | Use preview-{branch}.pgai.watch - integrates with existing pgai.watch dynamic DNS infrastructure |
| 1.4.0 | 2026-01-23 | Addressing 3 additional expert reviews: DNS - switched to dynamic A records via Cloudflare API (create on deploy, delete on destroy); Secrets - removed from state.json and logs, stored only in .env with chmod 600; Locking - added flock (global + per-preview) for race prevention; Quota - counts running containers not directories; Prune - conditional, only when disk >80%; CI - uses shared sanitize-branch.sh script, fixed environment URL timing; Provisioning - added python3-flask, swappiness=1; Traefik - simplified TLS config, idempotent setup script |
| 1.5.0 | 2026-01-23 | Added Definition of Done section with acceptance criteria: sample MR with CI job, manual trigger provisions preview, Grafana accessible with non-empty dashboards showing workload activity |
| 1.6.0 | 2026-01-23 | Added VM constraint: only 1 Hetzner machine allowed, must be named ralph-preview-*; added naming convention section and pre-provisioning check |
| 1.7.0 | 2026-01-25 | Implementation complete: All scripts written and CI jobs defined. Added GF_SERVER_ROOT_URL for Traefik reverse proxy support. Enabled anonymous viewer access (GF_AUTH_ANONYMOUS_ORG_ROLE=Viewer) for easier verification. Fixed Grafana password sync by deleting volume on deploy. Updated implementation checklist to reflect completed code work. Remaining: VM provisioning, credential setup, and end-to-end testing |