Skip to content

Latest commit

 

History

History
1164 lines (803 loc) · 53.8 KB

File metadata and controls

1164 lines (803 loc) · 53.8 KB

Self-Hosting Guide

This comprehensive guide walks you through deploying Simple Agent Manager (SAM) to your own infrastructure. Follow each section carefully—skipping steps is the most common cause of deployment issues.


Quick Start (Automated Deployment)

For the fastest deployment experience, use the automated GitHub Actions workflow with Pulumi infrastructure management. Deployment is automatic on every push to main.

For detailed step-by-step instructions, see the Quickstart Guide.

Prerequisites (One-Time Setup)

  1. Fork this repository
  2. Have a domain on Cloudflare (nameservers already pointed to Cloudflare — see Cloudflare Setup if not yet done)
  3. Create a Cloudflare API Token — see the detailed permissions table below
  4. Note your Account ID and Zone ID from the Cloudflare dashboard (domain overview, right sidebar)
  5. Create an R2 API Token (separate from above - for Pulumi state storage):
    • Go to Cloudflare Dashboard → R2 → Manage R2 API Tokens
    • Create token with Object Read & Write permissions
    • Note: The state bucket is created automatically by the workflow
  6. Create GitHub App (see GitHub Setup below)
  7. Generate a Pulumi passphrase for encrypting state:
    openssl rand -base64 32

GitHub Environment Configuration

Automated deployment configuration lives in a GitHub Environment named production. This makes deployment inputs visible and editable in the GitHub UI. Runtime Worker vars that are not explicitly passed by the workflow still come from the checked-in top-level [vars] in apps/api/wrangler.toml.

Create the environment:

  1. Go to your fork's Settings → Environments
  2. Click New environment
  3. Name it production and click Configure environment

Add environment variables (visible in UI):

Variable Description Example
BASE_DOMAIN Your domain for the deployment example.com
RESOURCE_PREFIX Prefix for Cloudflare resources (optional) sam
PULUMI_STATE_BUCKET R2 bucket for Pulumi state (optional) sam-pulumi-state

Optional feature flags (GitHub Environment variables):

Variable Description Default
REQUIRE_APPROVAL Require admin approval for new users. First user becomes superadmin. (unset — all users active)
HETZNER_BASE_IMAGE Hetzner VM base image. Set to ubuntu-24.04 for emergency rollback from the faster docker-ce marketplace default. docker-ce

Optional runtime-config limit variables (Worker vars):

These are runtime Worker variables, not GitHub Environment variables in the current workflow. To change them for automated deployments, edit the top-level [vars] in apps/api/wrangler.toml before deploying, or extend .github/workflows/deploy-reusable.yml and scripts/deploy/sync-wrangler-config.ts to pass them through. Cloudflare Wrangler environment vars are non-inheritable, so the sync script copies top-level [vars] into the generated [env.production.vars] / [env.staging.vars] sections.

Variable Description Default
MAX_PROJECT_RUNTIME_ENV_VARS_PER_PROJECT Max runtime env vars saved per project 150
MAX_PROJECT_RUNTIME_FILES_PER_PROJECT Max runtime files saved per project 50
MAX_PROJECT_RUNTIME_ENV_VALUE_BYTES Max bytes per runtime env var value 8192
MAX_PROJECT_RUNTIME_FILE_CONTENT_BYTES Max bytes per runtime file content 131072
MAX_PROJECT_RUNTIME_FILE_PATH_LENGTH Max runtime file path length (chars) 256

Optional AI task title generation variables (Worker vars):

Variable Description Default
TASK_TITLE_MODEL Workers AI model for task title generation @cf/google/gemma-3-12b-it
TASK_TITLE_MAX_LENGTH Max characters in a generated title 100
TASK_TITLE_TIMEOUT_MS Timeout (ms) for AI title generation before falling back to truncation 5000
TASK_TITLE_GENERATION_ENABLED Set to false to disable AI generation entirely true
TASK_TITLE_SHORT_MESSAGE_THRESHOLD Messages at or below this length bypass AI 100
TASK_TITLE_MAX_RETRIES Max retry attempts on AI generation failure (rate limit, transient errors) 2
TASK_TITLE_RETRY_DELAY_MS Base delay (ms) between retries (exponential backoff: delay × 2^attempt) 1000
TASK_TITLE_RETRY_MAX_DELAY_MS Max delay (ms) cap for retry backoff 4000

Add environment secrets (hidden):

Secret Description
CF_API_TOKEN Cloudflare API token with D1, KV, R2, DNS, Workers Scripts, Workers Observability, AI Gateway, Workers Routes, Pages, and SSL/Certificates permissions
CF_ACCOUNT_ID Your Cloudflare account ID (32-char hex). Also used as a Worker secret for the admin observability log viewer.
CF_ZONE_ID Your domain's zone ID (32-char hex)
R2_ACCESS_KEY_ID R2 API token access key
R2_SECRET_ACCESS_KEY R2 API token secret key
PULUMI_CONFIG_PASSPHRASE Your generated passphrase
GH_CLIENT_ID GitHub App client ID
GH_CLIENT_SECRET GitHub App client secret
GH_APP_ID GitHub App ID
GH_APP_PRIVATE_KEY GitHub App private key (raw PEM or base64 encoded — both work)
GH_APP_SLUG GitHub App slug (URL name)
GH_WEBHOOK_SECRET GitHub webhook HMAC-SHA256 verification secret. Required when the GitHub App webhook is active; must match the GitHub App webhook secret exactly. The deploy workflow maps this to the Worker secret GITHUB_WEBHOOK_SECRET.

Optional secrets (TLS — usually not needed):

Secret Description
CF_ORIGIN_CA_KEY Deprecated fallback. Cloudflare Origin CA Key — only needed if your CF_API_TOKEN lacks the Zone > SSL and Certificates > Edit permission and you can't update it. The Origin CA Key is deprecated by Cloudflare (removal Sept 2026). Prefer adding the SSL permission to your API token instead.

Optional secrets (purpose-specific security overrides — recommended for production):

Secret Description
BETTER_AUTH_SECRET BetterAuth session signing/encryption (overrides ENCRYPTION_KEY for sessions)
CREDENTIAL_ENCRYPTION_KEY AES-GCM encryption of user cloud credentials (overrides ENCRYPTION_KEY for credential storage)

Optional secrets (for GCP OIDC integration — see GCP Setup Guide for full instructions):

Secret Description
GOOGLE_CLIENT_ID Google Cloud Console OAuth 2.0 client ID (enables "Connect Google Cloud" in Settings)
GOOGLE_CLIENT_SECRET Google Cloud Console OAuth 2.0 client secret

GCP OAuth Redirect URI: When creating a Google OAuth 2.0 client, add https://api.<YOUR_BASE_DOMAIN>/api/deployment/gcp/callback as an authorized redirect URI. This is a single static URI shared by all projects — no per-project URIs needed.

Optional GCP VM provisioning configuration (env vars, not secrets — sensible defaults provided):

Variable Default Description
GCP_STS_SCOPE https://www.googleapis.com/auth/cloud-platform OAuth scope for STS token exchange
GCP_SA_IMPERSONATION_SCOPES https://www.googleapis.com/auth/compute Comma-separated scopes for SA impersonation

For the full list of GCP configuration variables, see the GCP Setup Guide.

Optional GCP deployment configuration (for project-level Defang deployment — sensible defaults provided):

Variable Default Description
GCP_DEPLOY_WIF_POOL_ID sam-deploy-pool WIF pool ID for project-level deployment auth
GCP_DEPLOY_WIF_PROVIDER_ID sam-oidc OIDC provider within the deploy pool
GCP_DEPLOY_SERVICE_ACCOUNT_ID sam-deployer Service account for deployment operations
GCP_DEPLOY_IDENTITY_TOKEN_EXPIRY_SECONDS 600 Identity token lifetime in seconds

⚠️ Naming Convention — read this before troubleshooting "missing secret" errors: GitHub App secrets use GH_* prefix (not GITHUB_*) because GitHub Actions secret names cannot start with GITHUB_. The deployment workflow automatically maps GH_*GITHUB_* when setting Cloudflare Worker secrets. If you see GITHUB_CLIENT_ID or GITHUB_WEBHOOK_SECRET in code or .env files, those are Worker-side names — use GH_CLIENT_ID and GH_WEBHOOK_SECRET in GitHub Environment secrets. Google OAuth secrets use GOOGLE_* directly.

Note: Security keys (ENCRYPTION_KEY, JWT_PRIVATE_KEY, JWT_PUBLIC_KEY) and TLS certificates (ORIGIN_CA_CERT, ORIGIN_CA_KEY) are automatically generated and persisted via Pulumi state in R2. No manual intervention required—keys are created on first deployment and reused automatically on subsequent deployments.

Deploy

Automatic deployment: Every push to main triggers a deployment automatically.

First deployment:

  1. Configure the GitHub Environment (see above)
  2. Push any commit to main, OR
  3. Go to Actions"Deploy""Run workflow" for manual trigger

Subsequent deployments: Just merge PRs to main. The workflow:

  • Validates all required configuration exists
  • Provisions infrastructure via Pulumi (idempotent)
  • Deploys API Worker and Web UI via Wrangler
  • Runs database migrations
  • Builds and uploads VM Agent binaries
  • Runs health check

Teardown

To remove all resources:

  1. Go to Actions"Teardown"
  2. Click "Run workflow"
  3. Type DELETE to confirm
  4. Click "Run workflow"

For more control or troubleshooting, continue with the manual setup below.


Table of Contents

  1. Prerequisites & Preparation
  2. Cloudflare Setup
  3. GitHub Setup
  4. Project Setup
  5. Manual Building & Deployment (Optional)
  6. DNS Configuration
  7. Verification
  8. Maintenance
  9. Troubleshooting
  10. Cost Estimation

Prerequisites & Preparation

Before starting, ensure you have the following ready.

Required Accounts

Account Purpose Tier Needed Sign-up Link
Cloudflare API hosting, DNS, storage Free tier cloudflare.com
GitHub Authentication, repository access Free tier github.com
Domain Registrar Your workspace domain Any (you likely already have one)

Note on cloud providers: SAM uses a Bring-Your-Own-Cloud (BYOC) model. Each user provides their own Hetzner (or other provider) API token through the Settings UI to create workspaces. You do not need a shared cloud provider account for the platform itself — Cloudflare is the only infrastructure the platform operator manages.

Required Tools

Install these on your development machine:

# Node.js 20+ (check version)
node --version  # Should be v20.x or higher

# pnpm 9+ (install if missing)
npm install -g pnpm
pnpm --version  # Should be 9.x or higher

# Go 1.22+ (needed to compile the VM Agent — the binary that runs on each workspace VM)
go version  # Should be go1.22.x or higher

# Git
git --version

Installing Go (if not installed):

Preparation Checklist

  • All required accounts created
  • All tools installed and verified
  • A domain you control (e.g., example.com or workspaces.example.com)
  • 30-60 minutes of uninterrupted time

Cloudflare Setup

This section covers setting up Cloudflare as your infrastructure provider.

Step 1: Add Your Domain to Cloudflare

If your domain is not already on Cloudflare:

  1. Log in to Cloudflare Dashboard
  2. Click "Add a Site" (or "Add site" button)
  3. Enter your domain (e.g., example.com) and click Continue
  4. Select the Free plan and click Continue
  5. Cloudflare will scan your existing DNS records—review and click Continue
  6. Important: Note the two nameservers Cloudflare assigns (e.g., ivy.ns.cloudflare.com, rudy.ns.cloudflare.com)

Step 2: Update Nameservers at Your Registrar

You must point your domain to Cloudflare's nameservers. This varies by registrar:

GoDaddy:

  1. Go to my.godaddy.comMy ProductsDNS
  2. Click NameserversChangeEnter custom nameservers
  3. Enter Cloudflare's nameservers, click Save

Namecheap:

  1. Go to namecheap.comDomain ListManage
  2. Under Nameservers, select Custom DNS
  3. Enter Cloudflare's nameservers, click Save

Google Domains / Squarespace Domains:

  1. Go to domains.squarespace.com
  2. Select your domain → DNSNameserversUse custom nameservers
  3. Enter Cloudflare's nameservers

Other Registrars: Look for "Nameservers" or "DNS Settings" in your registrar's dashboard.

Important: Nameserver changes can take up to 24 hours to propagate. Cloudflare will email you when the domain is active.

Step 3: Find Your Account ID and Zone ID

You'll need these IDs for configuration:

  1. In Cloudflare Dashboard, select your domain
  2. Scroll down on the Overview page
  3. In the right sidebar under API, you'll see:
    • Zone ID: Copy this (32-character hex string)
    • Account ID: Copy this (32-character hex string)

Save these values—you'll need them later.

Step 4: Create API Token with Required Permissions

SAM needs a Cloudflare API token with specific permissions:

  1. Go to My Profile (top-right icon) → API Tokens
  2. Click "Create Token"
  3. Click "Create Custom Token" (not a template)
  4. Configure the token:

Token name: simple-agent-manager

Permissions — add all of these. Each row maps to a single permission in the Cloudflare UI: select the Scope (Account or Zone), then the Category group, then the specific Permission and Access Level.

Scope Category Permission Access Level
Account Developer Platform D1 Edit
Account Developer Platform Workers KV Storage Edit
Account Developer Platform Workers R2 Storage Edit
Account Developer Platform Workers Scripts Edit
Account Developer Platform Workers Observability Read
Account Developer Platform Pages Edit
Account AI AI Gateway Edit
Zone Developer Platform Workers Routes Edit
Zone SSL & Certificates SSL and Certificates Edit
Zone DNS & Zone DNS Edit
Zone DNS & Zone Zone Read

Zone Resources: Select IncludeSpecific zoneyour domain

Account Resources: Select IncludeYour account name

  1. Click Continue to summaryCreate Token
  2. Copy the token immediately—it won't be shown again

Step 5: Create Cloudflare Resources

Note: If using the Quick Start (Automated Deployment), skip this step. Pulumi automatically creates D1, KV, and R2 resources when you push to main.

Manual resource creation (optional)

Open your terminal and run these commands:

# Login to Cloudflare via Wrangler
npx wrangler login

# Create D1 Database
npx wrangler d1 create workspaces
# Note the database_id from the output!

# Create KV Namespace for sessions
npx wrangler kv namespace create sessions
# Note the namespace id from the output!

# Create R2 Bucket for VM Agent binaries and task attachments
npx wrangler r2 bucket create workspaces-assets

R2 CORS Configuration (Required for Task Attachments)

Quick Start (Automated Deployment): R2 CORS is configured automatically on every deploy by scripts/deploy/configure-r2-cors.sh. Skip this section if you are using the automated deployment pipeline.

If you are deploying manually and want to enable file attachments on task submissions, configure CORS on the R2 bucket to allow direct browser uploads via presigned PUT URLs:

# Create a cors-rules.json file:
cat > cors-rules.json << 'CORS'
[
  {
    "AllowedOrigins": ["https://app.YOUR_DOMAIN"],
    "AllowedMethods": ["PUT"],
    "AllowedHeaders": ["*"],
    "ExposeHeaders": ["ETag"],
    "MaxAgeSeconds": 3600
  }
]
CORS

# Apply CORS rules to the bucket (via S3-compatible API or Cloudflare Dashboard)
# Dashboard: R2 → workspaces-assets → Settings → CORS Policy

Replace YOUR_DOMAIN with your BASE_DOMAIN value (e.g., https://app.simple-agent-manager.org).

You also need R2 S3-compatible API credentials for presigned URL generation. Create these in the Cloudflare Dashboard under R2 → Manage R2 API Tokens, with Object Read & Write permissions scoped to the workspaces-assets bucket. Set R2_ACCESS_KEY_ID and R2_SECRET_ACCESS_KEY as Worker secrets.

Save these IDs from the command outputs:

  • D1 Database ID (e.g., abc123...)
  • KV Namespace ID (e.g., def456...)

GitHub Setup

SAM uses a single GitHub App for both user login (OAuth) and repository access.

Step 1: Create GitHub App

  1. Go to GitHub App Settings
  2. Click "New GitHub App"
  3. Fill in the form:

Basic Information:

Field Value
GitHub App name Simple Agent Manager
Homepage URL https://app.example.com

Identifying and authorizing users:

Field Value
Callback URL https://api.example.com/api/auth/callback/github
Expire user authorization tokens ✓ Checked
Request user authorization (OAuth) during installation Unchecked
Enable Device Flow ☐ Unchecked

Important: "Request user authorization (OAuth) during installation" MUST be unchecked. When checked, it disables the Setup URL and causes the post-installation redirect to hit the OAuth callback, which fails because BetterAuth didn't initiate the flow. Users log in separately via the app's login button.

Important: GitHub App user access tokens use app/user permissions (not OAuth scopes). SAM reads the user's primary email from GET /user/emails, so the Email addresses user permission must be granted.

Post installation:

Field Value
Setup URL (optional) https://api.example.com/api/github/callback
Redirect on update ✓ Checked

Note: The Setup URL points to the API, not the web UI. The API records the installation in the database and then redirects the user to https://app.example.com/settings.

Webhook:

Field Value
Active ✓ Checked
Webhook URL https://api.example.com/api/github/webhook
Webhook secret Generate a random string and save the same value as the GH_WEBHOOK_SECRET GitHub Environment secret

Permissions:

Repository permissions:

Permission Access
Contents Read and write
Metadata Read-only

Note: Contents requires Read and write access because workspaces need to commit and push code changes back to repositories.

Account permissions:

Permission Access
Email addresses Read-only

Note: SAM uses this permission to read the account's primary email from GET /user/emails. Without it, SAM falls back to the public profile email from GET /user, or a GitHub noreply fallback when no email is available.

Where can this GitHub App be installed?: Select based on your needs:

  • Only on this account: For personal use
  • Any account: For public/team use
  1. Click "Create GitHub App"
  2. Note the App ID (number shown at top)
  3. Copy the Client ID and generate a Client Secret — you'll need both for OAuth login

Step 2: Generate GitHub App Private Key

  1. On the GitHub App page, scroll to "Private keys"
  2. Click "Generate a private key"
  3. A .pem file will download automatically
  4. Save this file securely—you'll need it for configuration

Project Setup

Step 1: Clone and Install

# Clone the repository
git clone https://github.com/your-org/simple-agent-manager.git
cd simple-agent-manager

# Install dependencies
pnpm install

Step 2: Generate Security Keys (Local Development Only)

Note: For production deployments, security keys are automatically managed by Pulumi and persist in R2. This step is only needed for local development.

# Generate JWT and encryption keys for local development
pnpm tsx scripts/deploy/generate-keys.ts

This generates:

  • ENCRYPTION_KEY: Shared fallback key — used for credential encryption, session management, and webhook verification when purpose-specific overrides are not set
  • JWT_PRIVATE_KEY: RSA private key for signing terminal access tokens
  • JWT_PUBLIC_KEY: RSA public key for token verification

Step 3: Configure Environment Variables (Local Development)

Note: For production deployment via GitHub Actions, use GitHub Environment Configuration instead. This step is only needed for local development.

Naming Convention: Local .env files use GITHUB_* prefix (e.g., GITHUB_CLIENT_ID) because that's what the Worker code reads. This differs from GitHub Environment secrets which use GH_* prefix. The deployment workflow maps between them.

Create your .env file:

cp .env.example .env

Edit .env with your values:

# Cloudflare Configuration
CF_API_TOKEN=your-cloudflare-api-token-from-step-4
CF_ZONE_ID=your-zone-id-from-step-3
CF_ACCOUNT_ID=your-account-id-from-step-3

# Domain Configuration
# Use your workspace subdomain (workspaces will be ws-xxx.workspaces.example.com)
BASE_DOMAIN=workspaces.example.com

# GitHub App (from GitHub App setup)
GITHUB_CLIENT_ID=Iv1.xxxxxxxxxxxx
GITHUB_CLIENT_SECRET=your-github-app-client-secret
GITHUB_APP_ID=123456
# For the private key, base64 encode the entire .pem file:
# cat your-key.pem | base64 -w0
GITHUB_APP_PRIVATE_KEY=LS0tLS1CRUdJTi4uLi4=

# Security Keys (from generate-keys.ts script)
ENCRYPTION_KEY=your-encryption-key-from-generate-keys
JWT_PRIVATE_KEY=your-jwt-private-key
JWT_PUBLIC_KEY=your-jwt-public-key

Step 4: Update wrangler.toml

Note: If using the Quick Start (Automated Deployment), skip this step. The sync-wrangler-config.ts script automatically updates wrangler.toml with Pulumi-provisioned resource IDs.

Manual configuration (for local development or manual deployment)

Edit apps/api/wrangler.toml with your resource IDs:

name = "workspaces-api"
main = "src/index.ts"
compatibility_date = "2024-01-01"
compatibility_flags = ["nodejs_compat"]

[vars]
BASE_DOMAIN = "workspaces.example.com"  # Your domain
VERSION = "1.0.0"

# D1 Database (use your database_id from Step 5)
[[d1_databases]]
binding = "DATABASE"
database_name = "workspaces"
database_id = "your-d1-database-id-here"

# KV Namespace (use your namespace id from Step 5)
[[kv_namespaces]]
binding = "KV"
id = "your-kv-namespace-id-here"

# R2 Bucket
[[r2_buckets]]
binding = "R2"
bucket_name = "workspaces-assets"

# Cron for provisioning timeout checks
[triggers]
crons = ["*/5 * * * *"]

Manual Building & Deployment (Optional)

Most users should skip this section. The Quick Start (Automated Deployment) handles all build, deploy, and configuration steps automatically via GitHub Actions. The manual steps below are only needed for local development, custom deployments, or troubleshooting.

Manual Deployment Steps

Step 1: Build All Packages

# Build TypeScript packages
pnpm build

Step 2: Build VM Agent (Go)

The VM Agent runs on workspace VMs and requires compilation:

cd packages/vm-agent

# Install Go dependencies
go mod download

# Build for Linux (VMs use Linux)
make build-all

This creates binaries in packages/vm-agent/bin/:

  • vm-agent-linux-amd64
  • vm-agent-linux-arm64
  • vm-agent-darwin-amd64 (for local testing)
  • vm-agent-darwin-arm64 (for local testing)

Step 3: Set Cloudflare Worker Secrets

Secrets must be set separately (not in wrangler.toml):

cd apps/api

# Set each secret (you'll be prompted for the value)
wrangler secret put CF_API_TOKEN
wrangler secret put CF_ACCOUNT_ID
wrangler secret put CF_ZONE_ID
wrangler secret put GITHUB_CLIENT_ID
wrangler secret put GITHUB_CLIENT_SECRET
wrangler secret put GITHUB_APP_ID
wrangler secret put GITHUB_APP_PRIVATE_KEY
wrangler secret put GITHUB_APP_SLUG
wrangler secret put GITHUB_WEBHOOK_SECRET
wrangler secret put ENCRYPTION_KEY
wrangler secret put JWT_PRIVATE_KEY
wrangler secret put JWT_PUBLIC_KEY
wrangler secret put ORIGIN_CA_CERT
wrangler secret put ORIGIN_CA_KEY
wrangler secret put TRIAL_CLAIM_TOKEN_SECRET

# Optional purpose-specific overrides (recommended for production)
# wrangler secret put BETTER_AUTH_SECRET
# wrangler secret put CREDENTIAL_ENCRYPTION_KEY

# Optional task attachment upload support
# wrangler secret put R2_ACCESS_KEY_ID
# wrangler secret put R2_SECRET_ACCESS_KEY

Tip: For multiline values (like private keys), you can pipe them:

cat path/to/github-app-key.pem | wrangler secret put GITHUB_APP_PRIVATE_KEY

Step 4: Run Database Migrations

# Apply migrations to production D1
wrangler d1 migrations apply workspaces --remote

Step 5: Deploy API

cd apps/api
wrangler deploy

Note the deployed URL (e.g., workspaces-api.your-subdomain.workers.dev)

Step 6: Deploy Web UI

cd apps/web
pnpm build
wrangler pages deploy dist --project-name simple-agent-manager

If this is your first Pages deployment, Wrangler will create the project. Note the URL (e.g., simple-agent-manager.pages.dev).

Step 7: Upload VM Agent to R2

cd packages/vm-agent

# Upload each binary
wrangler r2 object put workspaces-assets/agents/vm-agent-linux-amd64 --file bin/vm-agent-linux-amd64 --remote
wrangler r2 object put workspaces-assets/agents/vm-agent-linux-arm64 --file bin/vm-agent-linux-arm64 --remote

# Upload version info
echo '{"version": "1.0.0", "buildDate": "'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"}' > bin/version.json
wrangler r2 object put workspaces-assets/agents/version.json --file bin/version.json --remote

Manual deployment note: The automated Pulumi workflow generates and persists ENCRYPTION_KEY, JWT_PRIVATE_KEY, JWT_PUBLIC_KEY, ORIGIN_CA_CERT, ORIGIN_CA_KEY, and TRIAL_CLAIM_TOKEN_SECRET. In the manual flow, you must generate and set those Worker secrets yourself. Use wrangler secret put <NAME> --env production if you deploy the Worker with a Wrangler environment.


DNS Configuration

Note: If using the Quick Start (Automated Deployment), DNS records are created automatically by Pulumi. This section is for manual deployment or reference.

Configure DNS records in Cloudflare to route traffic to your deployments.

Required DNS Records

In Cloudflare Dashboard → your domain → DNS:

Type Name Content Proxy Status
CNAME api workspaces-api.your-subdomain.workers.dev Proxied (orange)
CNAME app simple-agent-manager.pages.dev Proxied (orange)
CNAME * workspaces-api.your-subdomain.workers.dev Proxied (orange)

Notes:

  • The * (wildcard) record catches workspace subdomains (e.g., ws-abc123.workspaces.example.com)
  • The wildcard record should target the deployed API Worker hostname, matching the automated Pulumi deployment
  • All records should be proxied (orange cloud) for SSL and Workers routing
  • If you configure Worker routes manually, add routes for api.example.com/* and *.example.com/*, plus a more-specific *.vm.example.com/* route with no Worker script so VM-agent backend traffic bypasses the wildcard Worker route.

SSL/TLS Configuration

  1. In Cloudflare Dashboard → your domain → SSL/TLS
  2. Set encryption mode to Full (strict)
  3. Under Edge Certificates, ensure:
    • Always Use HTTPS: On
    • Automatic HTTPS Rewrites: On

Cloudflare automatically provisions SSL certificates including wildcard (*.workspaces.example.com).


Verification

Test each component to ensure everything works.

Test 1: API Health Check

curl https://api.example.com/health
# Should return: {"status":"healthy","timestamp":"..."}

Test 2: Web UI Access

Open https://app.example.com in your browser. You should see the login page.

Test 3: GitHub OAuth Login

  1. Click "Sign in with GitHub"
  2. Authorize the OAuth application
  3. You should be redirected back and see the dashboard

Test 4: Agent Binary Download

curl -I "https://api.example.com/api/agent/download?os=linux&arch=amd64"
# Should return: HTTP/2 200 with Content-Type: application/octet-stream

Test 5: Create a Workspace (Full E2E)

  1. Add your Hetzner API token in Settings
  2. Install the GitHub App on a test repository
  3. Create a workspace from the dashboard
  4. Wait for provisioning (2-5 minutes)
  5. Connect to the terminal

Maintenance

Viewing Logs

# Stream real-time logs
wrangler tail

# Filter to errors only
wrangler tail --format=pretty --filter error

Node Log Configuration

Nodes use systemd journald for centralized log aggregation. The cloud-init template automatically configures journald and Docker logging on new nodes.

Journald configuration (applied via /etc/systemd/journald.conf.d/sam.conf):

Setting Default Description
SystemMaxUse 500M Max disk space for journal
SystemKeepFree 1G Minimum free disk to maintain
MaxRetentionSec 7day Max log retention period
Storage persistent Persist logs across reboots
Compress yes Compress stored journal entries

These defaults can be overridden per-node by passing logJournalMaxUse, logJournalKeepFree, and logJournalMaxRetention to the cloud-init generator.

Docker logging: Docker is configured to use the journald log driver, so all container stdout/stderr flows into the same journal. This enables unified log viewing from the control plane UI.

VM Agent environment variables:

Variable Default Description
LOG_LEVEL info Agent log level (debug, info, warn, error)
LOG_FORMAT json Log output format (json or text)
LOG_RETRIEVAL_DEFAULT_LIMIT 200 Default entries per log page
LOG_RETRIEVAL_MAX_LIMIT 1000 Maximum entries per log page
LOG_STREAM_BUFFER_SIZE 100 Catch-up entries sent on stream connect
LOG_READER_TIMEOUT 30s Timeout for journalctl read commands
LOG_STREAM_PING_INTERVAL 30s WebSocket ping interval for log stream
LOG_STREAM_PONG_TIMEOUT 90s WebSocket pong deadline for log stream
SYSINFO_DOCKER_LIST_TIMEOUT 10s Timeout for docker ps command
SYSINFO_DOCKER_STATS_TIMEOUT 10s Timeout for docker stats command

Updating the VM Agent

When you make changes to the VM Agent:

cd packages/vm-agent
make build-all

# Re-upload to R2
wrangler r2 object put workspaces-assets/agents/vm-agent-linux-amd64 --file bin/vm-agent-linux-amd64 --remote
wrangler r2 object put workspaces-assets/agents/vm-agent-linux-arm64 --file bin/vm-agent-linux-arm64 --remote

# Update version
echo '{"version": "1.0.1", "buildDate": "'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"}' > bin/version.json
wrangler r2 object put workspaces-assets/agents/version.json --file bin/version.json --remote

Database Migrations

SAM uses two D1 databases:

  • DATABASE (workspaces): Core platform data (users, nodes, workspaces, projects, tasks)
  • OBSERVABILITY_DATABASE (observability): Error storage for the admin observability dashboard (spec 023). Isolated from the main database to prevent error volume from affecting core queries.

When schema changes are needed:

# Create a new migration for the main database
wrangler d1 migrations create workspaces your-migration-name

# Create a new migration for the observability database
wrangler d1 migrations create observability your-migration-name

# Apply all migrations to production (run-migrations.ts handles both databases)
pnpm tsx scripts/deploy/run-migrations.ts --env production

# Or apply individually
wrangler d1 migrations apply workspaces --remote
wrangler d1 migrations apply observability --remote

Note: Durable Object (DO) SQLite migrations are managed automatically. Each project's DO runs pending migrations in its constructor via blockConcurrencyWhile(). No manual migration step is needed for DO schemas.

Durable Object Configuration

SAM uses a per-project Durable Object (PROJECT_DATA) for chat sessions, messages, activity events, and real-time WebSocket streaming. This is configured automatically by Pulumi during deployment.

For manual deployments, ensure your wrangler.toml includes the DO binding:

[[durable_objects.bindings]]
name = "PROJECT_DATA"
class_name = "ProjectData"

[[migrations]]
tag = "v1"
new_sqlite_classes = ["ProjectData"]

Configurable DO limits (set as Worker vars or environment variables):

Variable Description Default
MAX_SESSIONS_PER_PROJECT Max chat sessions per project 10000
MAX_MESSAGES_PER_SESSION Max messages per chat session 10000
MESSAGE_SIZE_THRESHOLD Max message size in bytes 102400
ACTIVITY_RETENTION_DAYS Days to retain activity events 90
SESSION_IDLE_TIMEOUT_MINUTES Idle session timeout 60
DO_SUMMARY_SYNC_DEBOUNCE_MS Debounce for DO-to-D1 summary sync 5000
DEFAULT_TASK_AGENT_TYPE Agent used for autonomous task execution opencode
WORKSPACE_IDLE_TIMEOUT_MS Global default idle timeout before workspace is stopped (overridable per-project) 7200000 (2h)

See apps/api/.env.example for the full list of configurable variables.

Codex Token Refresh Proxy

SAM includes a centralized token refresh proxy for OpenAI Codex OAuth tokens. Codex uses rotating refresh tokens — when one instance refreshes, the old refresh token is permanently invalidated. If two workspaces refresh concurrently, one breaks permanently.

The proxy intercepts Codex refresh requests and serializes them per user via a Durable Object, preventing the race condition. This is enabled by default and requires no additional configuration.

Configurable variables:

Variable Description Default
CODEX_REFRESH_PROXY_ENABLED Kill switch — set to "false" to disable Enabled
CODEX_REFRESH_LOCK_TIMEOUT_MS Per-user lock timeout 30000 (30s)
CODEX_REFRESH_UPSTREAM_URL OpenAI token endpoint https://auth.openai.com/oauth/token
CODEX_REFRESH_UPSTREAM_TIMEOUT_MS Upstream request timeout 10000 (10s)
CODEX_CLIENT_ID OpenAI OAuth client ID app_EMoamEEZ73f0CkXaXp7hrann
RATE_LIMIT_CODEX_REFRESH_PER_HOUR Max refresh requests per hour per user (enforced atomically via CodexRefreshLock DO ctx.storage) 30
RATE_LIMIT_CODEX_REFRESH_WINDOW_SECONDS Rate limit window in seconds 3600 (1 hour)
CODEX_EXPECTED_SCOPES Comma-separated allowlist of OAuth scopes the upstream may return. Unexpected scopes block the refresh with 502 and the previous token remains valid. Empty string disables validation. openid,profile,email,offline_access

Trial Onboarding

If you want to expose the zero-friction /try URL-to-workspace flow on your deployment, see trial-configuration.md for the required TRIAL_CLAIM_TOKEN_SECRET secret, tunable env vars (monthly cap, workspace TTL, data retention), and the KV-backed kill switch.

Rotating Security Keys

Security keys are managed by Pulumi and normally don't need rotation. If you need to rotate keys:

Option 1: Force Pulumi to recreate keys

# Remove protection from key resources (temporarily)
cd infra
pulumi state unprotect "urn:pulumi:prod::infra::random:index/randomId:RandomId::encryption-key"
pulumi state unprotect "urn:pulumi:prod::infra::tls:index/privateKey:PrivateKey::jwt-signing-key"

# Delete the resources
pulumi state delete "urn:pulumi:prod::infra::random:index/randomId:RandomId::encryption-key"
pulumi state delete "urn:pulumi:prod::infra::tls:index/privateKey:PrivateKey::jwt-signing-key"

# Re-deploy to create new keys
pulumi up

Option 2: Manual rotation

# Generate new keys locally
pnpm tsx scripts/deploy/generate-keys.ts

# Update secrets directly
cd apps/api
wrangler secret put JWT_PRIVATE_KEY
wrangler secret put JWT_PUBLIC_KEY
wrangler secret put ENCRYPTION_KEY

Warning: Rotating keys will:

  • Invalidate all active terminal sessions (JWT keys)
  • Make existing encrypted credentials unreadable (CREDENTIAL_ENCRYPTION_KEY, or ENCRYPTION_KEY if the override is not set) - users will need to re-enter their Hetzner tokens

Troubleshooting

Pulumi & Automated Deployment Issues

"error: failed to decrypt state"

Cause: PULUMI_CONFIG_PASSPHRASE doesn't match the one used when state was created.

Fix:

  1. Use the same passphrase used during initial deployment
  2. If you lost the passphrase, delete the stack in R2 and start fresh:
    # In Cloudflare Dashboard → R2 → sam-pulumi-state bucket
    # Delete the .pulumi/ folder for your stack

"error: failed to load checkpoint"

Cause: R2 backend connection failed or bucket doesn't exist.

Fix:

  1. Verify the Pulumi state bucket exists in Cloudflare R2
  2. Check R2 credentials (R2_ACCESS_KEY_ID, R2_SECRET_ACCESS_KEY) in your GitHub Environment
  3. Verify the bucket name matches the PULUMI_STATE_BUCKET environment variable (default: sam-pulumi-state)

"error: stack 'prod' not found"

Cause: First deployment or stack was removed.

Fix: This is normal for first deployments. The workflow automatically creates the stack. If you see this after a previous deployment, the state may have been deleted.

"error: resource already exists"

Cause: Resource was created outside Pulumi or imported incorrectly.

Fix:

  1. If the resource should be managed by Pulumi, import it:
    pulumi import cloudflare:index/d1Database:D1Database sam-database <database-id>
  2. Or delete the resource in Cloudflare Dashboard and re-run deployment

"You need a workers.dev subdomain in order to proceed"

Cause: Cron triggers (used for provisioning timeout checks) require the account's workers.dev subdomain to be initialized. The deploy workflow handles this automatically via the Cloudflare API, but it may fail if the API token lacks the Workers Scripts permission.

Fix:

  1. Automatic: The deployment workflow includes an "Ensure workers.dev Subdomain" step that initializes it. Verify your API token has Account: Workers Scripts (Edit) permission.
  2. Manual: Go to Cloudflare DashboardWorkers & Pages → click on any worker → SettingsDomains & Routes → enable the workers.dev route.

"Unable to authenticate request" on Pages deploy

Cause: wrangler pages deploy needs the account ID but doesn't read it from wrangler.toml.

Fix: Ensure CF_ACCOUNT_ID is set as a secret in your GitHub Environment. The deploy workflow passes it as CLOUDFLARE_ACCOUNT_ID to the Pages deploy step.

"Deployment succeeded but health check failed"

Cause: Worker deployed but configuration issue preventing startup.

Fix:

  1. Check worker logs: wrangler tail
  2. Verify all secrets are set correctly
  3. Check D1 migrations were applied

Admin Logs show "Cloudflare Observability API returned 403"

Cause: The CF_API_TOKEN is missing the "Workers Observability (Read)" permission, which is required for the admin log viewer.

Fix:

  1. Go to Cloudflare DashboardMy ProfileAPI Tokens
  2. Edit the token used for SAM
  3. Add permission: Account → Workers Observability → Read
  4. Save the token
  5. If the token was regenerated, update the CF_API_TOKEN secret in your GitHub Environment and redeploy

"Configure AI Gateway" fails with 403

Cause: The CF_API_TOKEN is missing the account-level "AI Gateway (Edit)" permission. The deploy workflow configures the account AI Gateway before deploying the Worker.

Fix:

  1. Go to Cloudflare DashboardMy ProfileAPI Tokens
  2. Edit the token used for SAM
  3. Add permission: Account → AI Gateway → Edit
  4. Save the token
  5. If the token was regenerated, update the CF_API_TOKEN secret in your GitHub Environment and redeploy

"OAuth callback failed" or BetterAuth "unknown" error

Cause: Callback URL mismatch or incorrect GitHub App settings

Fix:

  1. Check your GitHub App's Callback URL matches exactly: https://api.example.com/api/auth/callback/github
  2. Check your GitHub App's Setup URL is set to: https://api.example.com/api/github/callback
  3. Ensure "Request user authorization (OAuth) during installation" is unchecked — when checked, it disables the Setup URL and causes post-installation redirects to hit BetterAuth, which fails
  4. Ensure HTTPS is used (not HTTP)
  5. Verify the domain in Cloudflare is active

"D1_ERROR: no such table"

Cause: Migrations haven't been applied

Fix:

wrangler d1 migrations apply workspaces --remote

"Failed to download agent binary"

Cause: R2 bucket not configured or binaries not uploaded

Fix:

  1. Verify R2 bucket exists: wrangler r2 bucket list
  2. Re-upload binaries (see Step 7 above)

"Workspace stuck in provisioning"

Cause: VM provisioning failed or agent didn't start

Fix:

  1. Check Hetzner console for VM status
  2. If VM is running, SSH in and check: systemctl status vm-agent
  3. View cloud-init logs: cat /var/log/cloud-init-output.log

"pkcs8 must be PKCS#8 formatted string" / GitHub App installation fails

Cause: The GitHub App private key is stored in an unsupported format. GitHub App keys are generated as PKCS#1 (-----BEGIN RSA PRIVATE KEY-----), and the API automatically converts them to PKCS#8 format at runtime.

Fix:

  1. Ensure the key is stored either as raw PEM or base64-encoded PEM (both work)
  2. For base64 encoding: cat your-key.pem | base64 -w0
  3. For raw PEM via wrangler: cat your-key.pem | wrangler secret put GITHUB_APP_PRIVATE_KEY
  4. Make sure the key isn't truncated — PKCS#1 RSA 2048 keys are ~1700 characters

"JWT verification failed"

Cause: Key mismatch between API and expectations

Fix:

  1. Ensure JWT_PUBLIC_KEY and JWT_PRIVATE_KEY are from the same key pair
  2. Check keys aren't truncated (base64 encoding)
  3. Regenerate keys if needed

"DNS_PROBE_FINISHED_NXDOMAIN"

Cause: DNS not propagated or misconfigured

Fix:

  1. Verify nameservers changed at registrar
  2. Check DNS records in Cloudflare dashboard
  3. Wait up to 24 hours for propagation
  4. Test with: dig +short api.example.com

Cost Estimation

Platform Costs (Your Infrastructure)

Component Free Tier Limit Paid Overage
Cloudflare Workers 100K requests/day $0.15/million
Cloudflare D1 5M rows read/day $0.001/million
Cloudflare KV 100K reads/day $0.50/million
Cloudflare R2 10GB storage $0.015/GB/month
Cloudflare Pages Unlimited Free

Typical SAM deployment: Stays within free tier for small to medium usage.

User VM Costs (Paid by Users)

Users provide their own Hetzner API token. Workspace VMs are billed to their account:

VM Size Specs Hourly Monthly
Small (CX22) 2 vCPU, 4GB RAM €0.006 (~$0.007) €3.79 (~$4.15)
Medium (CX32) 4 vCPU, 8GB RAM €0.011 (~$0.012) €6.80 (~$7.50)
Large (CX42) 8 vCPU, 16GB RAM €0.027 (~$0.030) €16.40 (~$18)

VMs are billed hourly until they are explicitly stopped or deleted.


Security Considerations

  1. Rotate Keys Regularly: Generate new JWT and encryption keys quarterly
  2. Minimal GitHub App Permissions: Only Contents: Read and write (required for committing) and Metadata: Read-only
  3. No Embedded Secrets: Bootstrap tokens ensure no secrets in cloud-init
  4. HTTPS Only: All traffic is encrypted via Cloudflare
  5. Session Security: BetterAuth handles secure session management

Getting Help


Last updated: 2026-04-14