Skip to content

Latest commit

 

History

History
930 lines (751 loc) · 32.7 KB

File metadata and controls

930 lines (751 loc) · 32.7 KB

Architecture Walkthrough

A comprehensive guide to the Simple Agent Manager (SAM) architecture — how every system fits together, from the user's browser to the VM terminal.


Table of Contents

  1. High-Level Architecture
  2. Request Routing
  3. Control Plane — API Worker
  4. Web Application
  5. Data Layer
  6. VM Agent
  7. Workspace Lifecycle
  8. Node Lifecycle
  9. Authentication & Security
  10. VM Provisioning
  11. Terminal & Agent Sessions
  12. Deployment Pipeline
  13. Infrastructure (Pulumi)

High-Level Architecture

SAM is a serverless platform for ephemeral AI coding environments. Users create cloud VMs running devcontainers with Claude Code pre-installed, then interact with them through a web terminal and agent chat interface.

graph TB
    subgraph "User"
        Browser["Browser"]
    end

    subgraph "Cloudflare Edge"
        Pages["Cloudflare Pages<br/><i>app.domain</i><br/>React + Vite"]
        Worker["Cloudflare Worker<br/><i>api.domain + *.domain</i><br/>Hono API"]
        D1["D1 (SQLite)<br/>Users, Nodes, Workspaces,<br/>Credentials, Sessions"]
        KV["KV Namespace<br/>Auth Sessions,<br/>Bootstrap Tokens,<br/>Boot Logs"]
        R2["R2 Bucket<br/>VM Agent Binaries,<br/>Pulumi State"]
    end

    subgraph "Hetzner Cloud"
        Node1["Node VM"]
        subgraph "Node VM Internals"
            VMAgent["VM Agent<br/>(Go Binary, :8443)"]
            Docker["Docker Engine"]
            WS1["Workspace Container 1<br/>Devcontainer + Claude Code"]
            WS2["Workspace Container N<br/>Devcontainer + Claude Code"]
        end
    end

    subgraph "External Services"
        GitHub["GitHub<br/>OAuth + App API"]
        Hetzner["Hetzner Cloud API<br/>VM Provisioning"]
        CFDNS["Cloudflare DNS API<br/>Dynamic Records"]
    end

    Browser -->|"HTTPS"| Pages
    Browser -->|"HTTPS/WSS"| Worker
    Worker --> D1
    Worker --> KV
    Worker --> R2
    Worker -->|"Proxy ws-*.domain"| VMAgent
    Worker -->|"Proxy app.domain"| Pages
    Worker -->|"OAuth + App"| GitHub
    Worker -->|"Create/Delete VMs"| Hetzner
    Worker -->|"DNS Records"| CFDNS
    VMAgent --> Docker
    Docker --> WS1
    Docker --> WS2
    VMAgent -->|"Callbacks"| Worker
Loading

Key Design Decisions

Decision Rationale
Cloudflare Worker as API + reverse proxy Single Worker handles API requests AND proxies workspace subdomain traffic to VMs
D1 for persistent state SQLite at the edge — zero latency, zero management
User-provided Hetzner tokens (BYOC) Users own their infrastructure; platform never stores cloud provider auth
Callback-driven provisioning VMs POST /ready when bootstrapped — no polling required
Dynamic DNS per workspace ws-{id}.domain resolves instantly; deleted when workspace stops

Request Routing

Every HTTP request to *.simple-agent-manager.org passes through the same Cloudflare Worker. The Worker uses the Host header to decide what to do.

flowchart TD
    Request["Incoming Request<br/><code>*.simple-agent-manager.org</code>"]

    Request --> HostCheck{"Host header?"}

    HostCheck -->|"app.domain"| ProxyPages["Proxy to Cloudflare Pages<br/><i>sam-web-prod.pages.dev</i>"]
    HostCheck -->|"ws-{id}.domain"| WSProxy["Workspace Proxy"]
    HostCheck -->|"api.domain"| APIRoutes["API Route Handler"]

    WSProxy --> LookupDB["Lookup workspace in D1<br/>Get nodeId, status"]
    LookupDB --> StatusCheck{"status in {running,recovery}?"}
    StatusCheck -->|No| Return503["503 Not Ready"]
    StatusCheck -->|Yes| ResolveNode["Resolve backend:<br/><code>{nodeId}.vm.domain:8443</code>"]
    ResolveNode --> ProxyVM["Proxy request to VM Agent<br/>Inject X-SAM-Node-Id,<br/>X-SAM-Workspace-Id headers"]

    APIRoutes --> CORS["CORS Middleware"]
    CORS --> Logger["Logger Middleware"]
    Logger --> Routes["Route Matching<br/>/api/auth, /api/nodes,<br/>/api/workspaces, ..."]
Loading

Subdomain Routing Rules

Pattern Destination How
app.{domain} Cloudflare Pages Worker proxies to {project}.pages.dev
api.{domain} Worker API routes Direct handling by Hono router
ws-{id}.{domain} VM Agent on port 8443 Worker proxies via proxied {nodeId}.vm.{domain}
*.{domain} (other) 404 No matching route

Why two-level backend subdomains? Cloudflare Workers cannot fetch IP addresses directly (Error 1003), and the wildcard route *.{domain}/* causes same-zone routing for single-level VM subdomains. We use {nodeId}.vm.{domain} (two levels, bypasses the wildcard) with orange-clouded (proxied) A records. CF edge terminates TLS and re-encrypts to the VM's Origin CA cert.


Control Plane — API Worker

The API Worker (apps/api/) is a Hono application running on Cloudflare Workers. It handles authentication, resource management, and proxying.

graph TB
    subgraph "API Worker (apps/api/)"
        Entry["index.ts<br/>Entry Point"]

        subgraph "Middleware Layer"
            ErrHandler["app.onError()<br/>Global Error Handler"]
            AppProxy["app.* Proxy → Pages"]
            WSProxy["ws-*.* Proxy → VM"]
            CORSMw["CORS Middleware"]
            LogMw["Logger Middleware"]
        end

        subgraph "Route Layer"
            Auth["/api/auth/*<br/>BetterAuth<br/>(GitHub OAuth)"]
            Nodes["/api/nodes/*<br/>Node CRUD,<br/>Lifecycle, Events"]
            Workspaces["/api/workspaces/*<br/>Workspace CRUD,<br/>Lifecycle, Events,<br/>Agent Sessions"]
            Creds["/api/credentials/*<br/>Cloud + Agent<br/>Credentials"]
            GH["/api/github/*<br/>Installations,<br/>Repositories"]
            Terminal["/api/terminal/*<br/>WebSocket Token<br/>Generation"]
            Agent["/api/agent/*<br/>Binary Download,<br/>Install Script"]
            Bootstrap["/api/bootstrap/*<br/>One-Time Token<br/>Redemption"]
            JWKS["/.well-known/jwks.json<br/>Public Key Set"]
            Health["/health<br/>Health Check"]
        end

        subgraph "Service Layer"
            NodeSvc["Node Service<br/>Provision, Stop,<br/>Delete, Events"]
            NodeAgentSvc["Node Agent Service<br/>HTTP calls to VM Agent"]
            JWTSvc["JWT Service<br/>Sign, Verify,<br/>JWKS Export"]
            DNSSvc["DNS Service<br/>Create/Delete Records"]
            CredSvc["Credential Service<br/>Encrypt/Decrypt<br/>(AES-256-GCM)"]
            GHAppSvc["GitHub App Service<br/>Installation Tokens"]
            LimitsSvc["Limits Service<br/>Per-user/node caps"]
            BootLogSvc["Boot Log Service<br/>KV-backed progress"]
            TimeoutSvc["Timeout Service<br/>Cron-triggered"]
        end

        subgraph "Bindings"
            D1B["D1 Database"]
            KVB["KV Namespace"]
            R2B["R2 Bucket"]
        end

        Entry --> ErrHandler
        Entry --> AppProxy
        Entry --> WSProxy
        Entry --> CORSMw
        CORSMw --> LogMw
        LogMw --> Auth & Nodes & Workspaces & Creds & GH & Terminal & Agent & Bootstrap & Health & JWKS

        Nodes --> NodeSvc
        Nodes --> NodeAgentSvc
        Workspaces --> NodeAgentSvc
        Workspaces --> JWTSvc
        Workspaces --> BootLogSvc
        Terminal --> JWTSvc
        Creds --> CredSvc
        GH --> GHAppSvc
        Nodes --> DNSSvc
        Nodes --> LimitsSvc

        NodeSvc --> D1B
        CredSvc --> D1B
        JWTSvc --> KVB
        BootLogSvc --> KVB
        Agent --> R2B
    end

    subgraph "Cron Trigger (every 5 min)"
        Cron["scheduled()"] --> TimeoutSvc
        TimeoutSvc --> D1B
    end
Loading

Route Summary

Route Auth Purpose
/api/auth/* Public GitHub OAuth sign-in/out, session management
/api/nodes/* Required Node CRUD, stop, delete, events, ready/heartbeat callbacks
/api/workspaces/* Required Workspace CRUD, stop, restart, delete, events, boot logs
/api/workspaces/:id/agent-sessions/* Required Create/list/stop agent sessions
/api/credentials/* Required Cloud provider + agent API key management
/api/github/* Required GitHub App installations, repository listing
/api/terminal/token Required Generate workspace JWT for WebSocket auth
/api/agent/* Public VM Agent binary download, version, install script
/api/bootstrap/:token Token One-time token redemption (VM → API)
/.well-known/jwks.json Public JWT public key set for VM Agent verification
/health Public Health check with version and limits

Web Application

The web UI (apps/web/) is a React SPA deployed to Cloudflare Pages, served through the Worker's app.* proxy.

graph TB
    subgraph "Web App (apps/web/)"
        subgraph "Pages"
            Landing["/ Landing Page"]
            Dashboard["/dashboard<br/>Project Cards"]
            WSList["/workspaces<br/>All Workspaces (filterable)"]
            CreateWS["/workspaces/new<br/>Create Workspace"]
            WSView["/workspaces/:id<br/>Terminal + Agent Chat"]
            NodeList["/nodes<br/>All Nodes"]
            NodeView["/nodes/:id<br/>Node Details + Events"]
            Settings["/settings<br/>Credentials + Config"]
        end

        subgraph "Components"
            AuthProvider["AuthProvider<br/>BetterAuth React Client"]
            ProtectedRoute["ProtectedRoute<br/>Auth Guard"]
            MultiTerminal["MultiTerminal<br/>Tab bar + xterm.js"]
            TabBar["TabBar<br/>Shell + Chat Tabs"]
        end

        subgraph "Libraries"
            APIClient["api.ts<br/>Typed API Client"]
            AuthLib["auth.ts<br/>BetterAuth Wrapper"]
        end
    end

    subgraph "External"
        API["API Worker"]
        VMAgent["VM Agent<br/>(via ws-*.domain)"]
    end

    AuthProvider --> AuthLib
    AuthLib -->|"Session/OAuth"| API
    ProtectedRoute --> AuthProvider
    Dashboard --> APIClient
    WSView --> MultiTerminal
    MultiTerminal -->|"WebSocket"| VMAgent
    MultiTerminal --> TabBar
    APIClient -->|"REST"| API
    Settings --> APIClient
Loading

Key UI Features

  • Mobile-first design — Single-column layouts, 56px+ touch targets, responsive text
  • Real-time terminal — xterm.js with WebSocket reconnection and exponential backoff
  • Tab-based workspace view — Shell terminals and agent chat sessions in tabs
  • Session persistence — Tabs restored from VM Agent SQLite on page refresh

Data Layer

All persistent state lives in Cloudflare's edge storage services.

Entity Relationships

erDiagram
    users ||--o{ nodes : "owns"
    users ||--o{ workspaces : "owns"
    users ||--o{ credentials : "has"
    users ||--o{ github_installations : "has"
    users ||--o{ sessions : "has"
    users ||--o{ accounts : "has (OAuth)"
    nodes ||--o{ workspaces : "hosts"
    workspaces ||--o{ agent_sessions : "has"

    users {
        text id PK
        text email
        text github_id UK
        text name
        text avatar_url
        int created_at
        int updated_at
    }

    nodes {
        text id PK
        text user_id FK
        text name
        text status "pending|creating|running|stopping|stopped|error"
        text health_status "healthy|stale|unhealthy"
        text vm_size "small|medium|large"
        text vm_location "nbg1|fsn1|hel1"
        text provider_instance_id
        text ip_address
        text backend_dns_record_id
        text last_heartbeat_at
        int heartbeat_stale_after_seconds
    }

    workspaces {
        text id PK
        text node_id FK
        text user_id FK
        text installation_id FK
        text display_name
        text name
        text repository
        text branch
        text status "pending|creating|running|recovery|stopping|stopped|error"
        text vm_ip
        text dns_record_id

    }

    credentials {
        text id PK
        text user_id FK
        text provider
        text credential_type "cloud-provider|agent-api-key"
        text agent_type "claude-code|openai-codex|..."
        text credential_kind "api-key|oauth-token"
        int is_active
        text encrypted_token
        text iv
    }

    agent_sessions {
        text id PK
        text workspace_id FK
        text user_id FK
        text status "running|stopped|error"
        text label
    }

    github_installations {
        text id PK
        text user_id FK
        text installation_id UK
        text account_type "personal|organization"
        text account_name
    }
Loading

Storage Services

Service Binding Purpose Key Patterns
D1 (SQLite) DATABASE All persistent state Users, nodes, workspaces, credentials, sessions
KV KV Transient/session data session:{token} → session data, boot-log:{workspaceId} → JSON progress, bootstrap:{token} → credential payload
R2 R2 Binary artifacts agents/vm-agent-linux-amd64, agents/version.json

Credential Encryption

User credentials (Hetzner tokens, agent API keys) are encrypted at rest using AES-256-GCM with a per-credential random IV. The ENCRYPTION_KEY is a platform secret stored as a Cloudflare Worker secret.

Encrypt: plaintext + ENCRYPTION_KEY → { ciphertext, iv }  (stored in D1)
Decrypt: { ciphertext, iv } + ENCRYPTION_KEY → plaintext   (on-demand)

VM Agent

The VM Agent (packages/vm-agent/) is a Go binary that runs on each Hetzner node. It manages Docker containers (workspaces), terminal PTY sessions, and Claude Code agent sessions.

graph TB
    subgraph "VM Agent (Go Binary, :8443)"
        Main["main.go<br/>Bootstrap → Server → Signal Handler"]

        subgraph "HTTP Server"
            Router["HTTP Router"]
            AuthMw["JWT Validator<br/>+ Session Manager"]
            CORSMw["CORS Middleware<br/>(Wildcard Subdomain)"]
        end

        subgraph "Core Subsystems"
            PTYMgr["PTY Manager<br/>Terminal Multiplexing<br/>Ring Buffer Replay"]
            ContainerMgr["Container Manager<br/>Docker create/exec<br/>Devcontainer CLI"]
            ACPGateway["ACP Gateway<br/>Claude Code Protocol<br/>Initialize → Session → Prompt"]
            Persistence["SQLite Store<br/>Tab Persistence<br/>(modernc.org/sqlite)"]
        end

        subgraph "Bootstrap"
            BootLog["Boot Logger<br/>POST progress to KV"]
            NodeReg["Node Registration<br/>POST /ready callback"]
            DockerSetup["Docker + Devcontainer<br/>Installation"]
        end

        subgraph "HTTP Routes"
            HealthR["GET /health"]
            AuthR["POST /auth/token"]
            ShellR["WS /workspaces/:id/shell"]
            AgentR["WS /workspaces/:id/agent"]
            TabsR["GET /workspaces/:id/tabs"]
            WSCreateR["POST /workspaces<br/>(from API Worker)"]
            WSDeleteR["DELETE /workspaces/:id<br/>(from API Worker)"]
        end
    end

    subgraph "Docker Engine"
        DC1["Workspace Container 1<br/>Devcontainer"]
        DC2["Workspace Container N<br/>Devcontainer"]
    end

    subgraph "Control Plane"
        API["API Worker"]
    end

    Browser["Browser"] -->|"WSS"| Router
    API -->|"HTTP"| Router
    Router --> AuthMw
    Router --> CORSMw
    AuthMw --> HealthR & AuthR & ShellR & AgentR & TabsR & WSCreateR & WSDeleteR

    ShellR -->|"WebSocket ↔ PTY"| PTYMgr
    AgentR -->|"WebSocket ↔ ACP"| ACPGateway
    TabsR --> Persistence
    WSCreateR --> ContainerMgr
    WSDeleteR --> ContainerMgr

    PTYMgr --> DC1 & DC2
    ACPGateway -->|"stdin/stdout"| DC1 & DC2
    ContainerMgr --> DC1 & DC2

    Main --> BootLog
    Main --> NodeReg
    NodeReg -->|"POST /api/nodes/:id/ready"| API
    BootLog -->|"POST /api/workspaces/:id/boot-log"| API
Loading

VM Agent Subsystems

Subsystem Package Responsibility
PTY Manager internal/pty/ Terminal session multiplexing, ring buffer for replay on reconnect, session lifecycle
Container Manager internal/container/ Docker exec, devcontainer CLI, named volume management, git credential injection
JWT Validator internal/auth/ Validates workspace JWTs via JWKS endpoint, extracts claims
Session Manager internal/auth/ HTTP cookie-based sessions, TTL cleanup
ACP Gateway internal/acp/ ACP SDK protocol — Initialize → NewSession → Prompt — streams to WebSocket
Persistence internal/persistence/ SQLite storage for workspace tabs (survives browser refresh)
Boot Logger internal/bootlog/ Reports provisioning progress to control plane KV

Key Go Dependencies

Dependency Version Purpose
github.com/coder/acp-go-sdk v0.6.3 Agent Control Protocol for Claude Code
github.com/creack/pty v1.1.21 PTY allocation and management
github.com/gorilla/websocket v1.5.3 WebSocket server
github.com/golang-jwt/jwt v5.2.1 JWT validation
modernc.org/sqlite v1.45.0 Pure Go SQLite (no CGO)

Workspace Lifecycle

Workspaces transition through a defined state machine. Transitions are triggered by API calls and VM Agent callbacks.

stateDiagram-v2
    [*] --> pending : User creates workspace

    pending --> creating : API dispatches to Node Agent
    creating --> running : VM Agent POST /ready (status=running)
    creating --> recovery : VM Agent POST /ready (status=recovery)
    creating --> error : VM Agent POST /provisioning-failed<br/>or provisioning timeout (cron)

    running --> stopping : User clicks Stop
    recovery --> stopping : User clicks Stop
    stopping --> stopped : Resources cleaned up

    running --> error : Unexpected failure
    recovery --> error : Unexpected failure

    stopped --> creating : User clicks Restart
    error --> creating : User clicks Restart

    stopped --> [*] : User deletes workspace
    error --> [*] : User deletes workspace
Loading

Workspace Creation Flow

sequenceDiagram
    actor User
    participant Browser
    participant API as API Worker
    participant D1
    participant NodeAgent as VM Agent

    User->>Browser: Click "Create Workspace"
    Browser->>API: POST /api/workspaces
    API->>API: Validate limits, ownership
    API->>D1: INSERT workspace (status=pending)

    alt Node exists and healthy
        API->>API: Select existing node
    else No suitable node
        API->>D1: INSERT node (status=pending)
        API->>API: Provision Hetzner VM (async)
        Note over API: VM boots, runs cloud-init,<br/>VM Agent starts, POSTs /ready
        NodeAgent->>API: POST /api/nodes/:id/ready
        API->>D1: UPDATE node (status=running)
    end

    API->>D1: UPDATE workspace (status=creating)
    API->>NodeAgent: POST /workspaces (create container)
    NodeAgent->>NodeAgent: devcontainer up (async)
    NodeAgent->>API: POST /api/workspaces/:id/boot-log
    NodeAgent->>API: POST /api/workspaces/:id/ready
    API->>D1: UPDATE workspace (status=running|recovery)
    API->>Browser: 201 Created (workspace)
    Browser->>User: Redirect to workspace view
Loading

Node Lifecycle

Nodes are Hetzner VMs that host one or more workspace containers.

stateDiagram-v2
    [*] --> pending : Node created (with or without workspace request)

    pending --> creating : Hetzner API called
    creating --> running : VM Agent POST /api/nodes/:id/ready
    creating --> error : Hetzner API failure<br/>or bootstrap timeout

    running --> stopping : User clicks Stop
    stopping --> stopped : Hetzner VM powered off

    running --> error : Heartbeat timeout

    stopped --> [*] : User deletes node
    error --> [*] : User deletes node
Loading

Node Health Status

The VM Agent sends periodic heartbeats. Health is derived from heartbeat freshness:

Health Status Condition
healthy Last heartbeat within heartbeatStaleAfterSeconds (default: 180s)
stale Heartbeat older than threshold but node still running
unhealthy No heartbeat received or node not running

Authentication & Security

Authentication Flow

sequenceDiagram
    actor User
    participant Browser
    participant API as API Worker
    participant GitHub
    participant D1
    participant KV

    User->>Browser: Click "Sign in with GitHub"
    Browser->>API: POST /api/auth/sign-in/social (provider=github)
    API->>GitHub: OAuth redirect
    GitHub->>User: Authorize SAM?
    User->>GitHub: Approve
    GitHub->>API: Callback with code
    API->>GitHub: Exchange code for tokens
    API->>GitHub: GET /user (profile)
    API->>GitHub: GET /user/emails (primary email)
    API->>D1: Upsert user + account
    API->>KV: Create session token
    API->>Browser: Set session cookie
    Browser->>User: Redirect to /dashboard
Loading

Security Model

graph LR
    subgraph "Platform Secrets (Worker Secrets)"
        EK["ENCRYPTION_KEY<br/>AES-256-GCM"]
        JWT_PRIV["JWT_PRIVATE_KEY<br/>RSA-2048"]
        JWT_PUB["JWT_PUBLIC_KEY<br/>RSA-2048"]
        CF["CF_API_TOKEN<br/>Cloudflare DNS"]
        GH_OAUTH["GITHUB_CLIENT_*<br/>OAuth App"]
        GH_APP["GITHUB_APP_*<br/>GitHub App"]
    end

    subgraph "User Credentials (Encrypted in D1)"
        HETZNER["Hetzner API Token"]
        AGENT_KEY["Agent API Key<br/>(Claude/OpenAI/Gemini)"]
        OAUTH_TOK["Agent OAuth Token<br/>(Claude Pro/Max)"]
    end

    subgraph "Short-Lived Tokens"
        SESSION["Session Token<br/>(KV, cookie)"]
        WS_JWT["Workspace JWT<br/>(terminal auth)"]
        BOOTSTRAP["Bootstrap Token<br/>(one-time, 5min)"]
        CALLBACK["Callback Token<br/>(VM → API auth)"]
    end

    EK -->|"Encrypts"| HETZNER & AGENT_KEY & OAUTH_TOK
    JWT_PRIV -->|"Signs"| WS_JWT & CALLBACK & BOOTSTRAP
    JWT_PUB -->|"Verifies (via JWKS)"| WS_JWT & CALLBACK
Loading

Token Types

Token Lifetime Purpose Where Validated
Session cookie Hours Browser auth (BetterAuth) API Worker
Workspace JWT Minutes Terminal WebSocket auth VM Agent (via JWKS)
Bootstrap token 5 minutes One-time VM credential injection API Worker
Callback token Minutes VM Agent → API callbacks API Worker

OS-Level Firewall (iptables)

VMs are provisioned with an iptables firewall via cloud-init (packages/cloud-init/src/template.ts) that restricts inbound traffic to the VM agent port (VM_AGENT_PORT, default 8443) from Cloudflare IP ranges only. This provides defense-in-depth: even if someone discovers the VM's public IP, they cannot reach the VM agent directly — traffic must flow through Cloudflare's edge.

Firewall rules (INPUT chain):

Rule Purpose
Allow loopback (lo) Local process communication
Allow ESTABLISHED,RELATED Responses to outbound connections (apt, API callbacks, heartbeats)
Allow docker0 and br-+ interfaces → VM agent port Container-to-host communication (scoped to agent port only)
Allow Cloudflare IPs → VM agent port Legitimate proxied traffic from Cloudflare edge
Default policy: DROP Block all other inbound traffic (including SSH port 22)

Cloudflare IP updates: The firewall setup script (/etc/sam/firewall/setup-firewall.sh) fetches current Cloudflare IP ranges from https://www.cloudflare.com/ips-v4 and https://www.cloudflare.com/ips-v6 at boot time, with hardcoded fallback defaults if the fetch fails. A daily cron job (/etc/cron.daily/update-cloudflare-firewall) refreshes the rules automatically.

Docker compatibility: Only the INPUT chain is modified. Docker's FORWARD and NAT chains (used for container networking, port publishing, and masquerading) are left untouched.


VM Provisioning

When a new node is created, the VM bootstraps itself through cloud-init and the VM Agent.

sequenceDiagram
    participant API as API Worker
    participant Hetzner as Hetzner API
    participant VM as Hetzner VM
    participant Agent as VM Agent
    participant Docker as Docker Engine

    API->>Hetzner: Create server (cloud-init script)
    Hetzner->>VM: Boot VM

    Note over VM: Cloud-init executes:
    VM->>VM: Install Docker, git, curl
    VM->>VM: Configure iptables firewall (Cloudflare IPs only)
    VM->>VM: Download VM Agent from R2
    VM->>VM: Create systemd service
    VM->>VM: Install Node.js + devcontainer CLI
    VM->>VM: Start VM Agent service

    Agent->>Agent: Load config from environment
    Agent->>Agent: Run bootstrap sequence
    Agent->>API: POST /api/nodes/:id/ready
    API->>API: Create DNS record ({id}.vm.domain → IP)
    API->>API: Update node status → running

    Note over API: Node is ready for workspaces

    API->>Agent: POST /workspaces (create container)
    Agent->>Docker: devcontainer up (async)
    Docker->>Docker: Pull image, build container
    Agent->>API: POST /api/workspaces/:id/boot-log (progress)
    Agent->>Docker: Inject git credentials
    Agent->>API: POST /api/workspaces/:id/ready
Loading

Cloud-Init Configuration

The cloud-init template (packages/cloud-init/src/template.ts) creates a fully provisioned VM with:

  1. System packages — Docker, git, curl, jq
  2. VM Agent binary — Downloaded from R2 via /api/agent/download
  3. Systemd service — Auto-restart, environment injection (NODE_ID, CONTROL_PLANE_URL, CALLBACK_TOKEN)
  4. Node.js + devcontainer CLI — For building devcontainer images
  5. Config file — Written to /etc/workspace/config.json
  6. OS-level firewall — iptables rules restricting VM agent port to Cloudflare IPs, persisted via iptables-persistent and refreshed daily

No credentials are embedded in cloud-init. The VM Agent uses a one-time callback token to fetch credentials from the control plane during bootstrap.


Terminal & Agent Sessions

Terminal WebSocket Flow

sequenceDiagram
    actor User
    participant Browser
    participant API as API Worker
    participant Worker as Worker Proxy
    participant Agent as VM Agent
    participant PTY as PTY Manager

    User->>Browser: Open workspace
    Browser->>API: POST /api/terminal/token
    API->>API: Sign workspace JWT
    API->>Browser: { token, workspaceUrl }

    Browser->>Worker: WSS ws-{id}.domain/workspaces/{id}/shell?token=...
    Worker->>Worker: Lookup workspace in D1
    Worker->>Agent: Proxy WebSocket to {nodeId}.vm.domain:8443

    Agent->>Agent: Validate JWT (via JWKS)
    Agent->>PTY: Create or reattach PTY session
    PTY->>Agent: Terminal output stream

    loop Terminal I/O
        User->>Browser: Type command
        Browser->>Agent: WebSocket text frame
        Agent->>PTY: Write to PTY stdin
        PTY->>Agent: Read from PTY stdout
        Agent->>Browser: WebSocket text frame
        Browser->>User: Render in xterm.js
    end
Loading

Agent Session (ACP) Flow

sequenceDiagram
    actor User
    participant Browser
    participant API as API Worker
    participant Agent as VM Agent
    participant ACP as ACP Gateway
    participant Claude as Claude Code Binary

    User->>Browser: Click "+ New Chat"
    Browser->>API: POST /api/workspaces/:id/agent-sessions
    API->>Agent: POST /workspaces/:id/agent-sessions
    Agent->>ACP: Create ACP Gateway instance
    ACP->>Claude: Initialize (ProtocolVersion, Capabilities)
    Claude->>ACP: Initialized
    ACP->>Claude: NewSession (Cwd, McpServers)
    Claude->>ACP: Session created
    Agent->>API: Session ID
    API->>Browser: { id, status: "running" }

    User->>Browser: Type prompt
    Browser->>Agent: WebSocket: session/prompt
    Agent->>ACP: Parse prompt → ContentBlock[]
    ACP->>Claude: Prompt (blocking)

    loop Streaming Response
        Claude->>ACP: session/update notification
        ACP->>Agent: SessionUpdate callback
        Agent->>Browser: WebSocket: session/update
        Browser->>User: Render agent output
    end

    Claude->>ACP: Prompt returns (final result)
    ACP->>Agent: Result
    Agent->>Browser: WebSocket: prompt complete
Loading

Deployment Pipeline

CI/CD Workflows

flowchart LR
    subgraph "CI (Every Push/PR)"
        Lint["Lint<br/>(ESLint)"]
        TypeCheck["Type Check<br/>(tsc)"]
        Test["Unit Tests<br/>(Vitest)"]
        Build["Build<br/>(Turbo)"]
        GoTest["Go Tests<br/>(vm-agent)"]
        GoInteg["Go Integration<br/>(Docker tests)"]
        Preflight["Preflight Evidence<br/>(PR only)"]
        InfraTest["Infra Tests<br/>(Pulumi)"]
    end

    subgraph "Deploy (Push to main)"
        direction TB
        P1["Phase 1: Infrastructure<br/>Pulumi up (D1, KV, R2, DNS)"]
        P2["Phase 2: Configuration<br/>Sync wrangler.toml,<br/>Read security keys"]
        P3["Phase 3: Application<br/>Build → Deploy Worker<br/>→ Deploy Pages<br/>→ Run Migrations<br/>→ Configure Secrets"]
        P4["Phase 4: VM Agent<br/>Build Go (multi-arch)<br/>→ Upload to R2"]
        P5["Phase 5: Validation<br/>Health check polling"]

        P1 --> P2 --> P3 --> P4 --> P5
    end

    Build --> P1
Loading

Deployment Architecture

graph TB
    subgraph "GitHub"
        Repo["Repository<br/>(main branch)"]
        Actions["GitHub Actions"]
        Secrets["GitHub Environment<br/>(production)"]
    end

    subgraph "Pulumi"
        State["Pulumi State<br/>(R2 encrypted)"]
        Stack["Stack: prod"]
    end

    subgraph "Cloudflare"
        WorkerDeploy["Worker Deploy<br/>(wrangler deploy)"]
        PagesDeploy["Pages Deploy<br/>(wrangler pages deploy)"]
        D1Migrate["D1 Migrations<br/>(wrangler d1 migrations apply)"]
        SecretConfig["Secret Configuration<br/>(configure-secrets.sh)"]
    end

    Repo -->|"Push to main"| Actions
    Actions --> Secrets
    Secrets --> Stack
    Stack --> State
    Stack -->|"Outputs"| WorkerDeploy
    Stack -->|"Outputs"| PagesDeploy
    WorkerDeploy --> D1Migrate
    D1Migrate --> SecretConfig
Loading

Infrastructure (Pulumi)

Infrastructure is defined as code in infra/ using Pulumi with TypeScript.

graph TB
    subgraph "Pulumi Stack (infra/)"
        subgraph "Compute"
            Worker["Cloudflare Worker<br/><code>sam-api-prod</code>"]
            Pages["Cloudflare Pages<br/><code>sam-web-prod</code>"]
        end

        subgraph "Storage"
            D1["D1 Database<br/><code>sam-prod</code><br/>SQLite"]
            KV["KV Namespace<br/><code>sam-prod-sessions</code>"]
            R2["R2 Bucket<br/><code>sam-prod-assets</code><br/>Region: WNAM"]
        end

        subgraph "DNS"
            APIDns["CNAME api.domain<br/>→ sam-api-prod.workers.dev"]
            AppDns["CNAME app.domain<br/>→ sam-web-prod.pages.dev"]
            WildDns["CNAME *.domain<br/>→ sam-api-prod.workers.dev"]
        end

        subgraph "Security (Protected)"
            EncKey["Encryption Key<br/>256-bit random"]
            JWTKeys["JWT RSA-2048<br/>Key Pair"]
        end
    end

    Worker --> D1 & KV & R2
    APIDns --> Worker
    WildDns --> Worker
    AppDns --> Pages
Loading

Resource Details

Resource Pulumi Type Name Pattern Notes
D1 Database cloudflare.D1Database {prefix}-{stack} SQLite at edge
KV Namespace cloudflare.WorkersKvNamespace {prefix}-{stack}-sessions Transient data
R2 Bucket cloudflare.R2Bucket {prefix}-{stack}-assets WNAM region
DNS (API) cloudflare.Record api.{domain} CNAME, proxied
DNS (App) cloudflare.Record app.{domain} CNAME, proxied
DNS (Wildcard) cloudflare.Record *.{domain} CNAME, proxied
Encryption Key random.RandomId 32 bytes, base64, protected
JWT Keys tls.PrivateKey RSA-2048, PKCS#8, protected

Appendix: Key File Locations

Component Path Language
API entry apps/api/src/index.ts TypeScript
DB schema apps/api/src/db/schema.ts TypeScript
API routes apps/api/src/routes/*.ts TypeScript
API services apps/api/src/services/*.ts TypeScript
Web entry apps/web/src/main.tsx TypeScript
Web pages apps/web/src/pages/*.tsx TypeScript
API client apps/web/src/lib/api.ts TypeScript
Shared types packages/shared/src/types.ts TypeScript
Provider packages/providers/src/hetzner.ts TypeScript
Cloud-init packages/cloud-init/src/template.ts TypeScript
Terminal UI packages/terminal/src/*.tsx TypeScript
VM Agent packages/vm-agent/main.go Go
Agent server packages/vm-agent/internal/server/ Go
ACP gateway packages/vm-agent/internal/acp/ Go
PTY manager packages/vm-agent/internal/pty/ Go
Infra infra/resources/*.ts TypeScript
Deploy CI .github/workflows/deploy.yml YAML
Wrangler config apps/api/wrangler.toml TOML