11 changes: 11 additions & 0 deletions .agents/skills/debug-openshell-cluster/SKILL.md
@@ -189,6 +189,17 @@ openshell status
openshell logs <sandbox-name>
```

## Telemetry Signals

Before drilling into logs, check whether the gateway is exporting telemetry — the pull-based metrics surface and the push-based trace export are the fastest signals that the control plane is alive and that requests are reaching it.

| Signal | Where it shows up | When to use it |
|---|---|---|
| Prometheus metrics on `/metrics` | A scrape target via the chart's `ServiceMonitor` (`monitoring.serviceMonitor.enabled`). Local: `kubectl -n openshell port-forward statefulset/openshell <metrics-port>:<metrics-port>`. | Confirm the gateway listener is up and gRPC requests are landing. `up{job="openshell"} == 1` in Prometheus is a quick liveness ping. |
| OTLP traces | Jaeger / Tempo / OTel backend (`monitoring.tracing.enabled`). Look for service `openshell-gateway`. | Confirm an inbound request reached the multiplex layer; spans carry `method`, `path`, `request_id`. If traces are missing under load, OTLP export is misconfigured or the endpoint is unreachable. |

If the chart's `monitoring.serviceMonitor.enabled` or `monitoring.tracing.enabled` was not set, the corresponding signal is unavailable; fall back to gateway logs. See [Monitoring the Gateway](../../../docs/kubernetes/monitoring.mdx) for setup.
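
A quick sanity pass over both signals, assuming the namespace and StatefulSet names used above; the metrics port (9091) and the Jaeger UI port (16686) are assumptions, so substitute your configured values:

```bash
# Pull side: forward the gateway metrics port and scrape it directly.
# 9091 is an assumed port; use whatever --metrics-port is set to.
kubectl -n openshell port-forward statefulset/openshell 9091:9091 &
curl -s http://localhost:9091/metrics | head -n 20

# Push side: if a Jaeger UI is reachable (16686 assumed), confirm the
# gateway shows up as a traced service via Jaeger's query API.
curl -s http://localhost:16686/api/services | grep openshell-gateway
```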

## Common Failure Patterns

| Symptom | Likely cause | Check |
33 changes: 33 additions & 0 deletions .agents/skills/helm-dev-environment/SKILL.md
@@ -169,6 +169,39 @@ To remove Keycloak:
mise run keycloak:k8s:teardown
```

### Monitoring (Prometheus + Grafana + Jaeger)

One-time setup — installs `kube-prometheus-stack` (slimmed: no Alertmanager,
node-exporter, or kube-state-metrics) and a Jaeger all-in-one Pod:

```bash
mise run observability:k8s:setup
```

Then activate monitoring on the gateway:

1. Uncomment `#- ci/values-monitoring.yaml` in `skaffold.yaml`
2. Redeploy: `mise run helm:skaffold:run`

Forward UIs to localhost:

```bash
mise run observability:port-forward
# Grafana http://localhost:3000 (admin / admin)
# Prometheus http://localhost:9090
# Jaeger UI http://localhost:16686
```

Teardown:

```bash
mise run observability:k8s:teardown
```

The chart's `monitoring.serviceMonitor.enabled` creates a `ServiceMonitor`
that Prometheus scrapes, and `monitoring.tracing.enabled` projects `OTEL_*`
env vars onto the gateway so it exports OTLP/gRPC traces to Jaeger.
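
To verify the wiring took effect after redeploying, a sketch like the following works; the pod label selector is an assumption, so adjust it to whatever labels the chart actually sets:

```bash
# The ServiceMonitor should exist in the release namespace.
kubectl -n openshell get servicemonitor

# The gateway container should carry the projected OTEL_* env vars
# (label selector assumed; adjust to the chart's labels).
kubectl -n openshell get pods -l app.kubernetes.io/name=openshell \
  -o jsonpath='{.items[0].spec.containers[0].env[*].name}' | tr ' ' '\n' | grep OTEL
```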

---

## Cluster Lifecycle (suspend/resume)
126 changes: 126 additions & 0 deletions .github/workflows/branch-helm-e2e.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,126 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

name: Branch Helm E2E

on:
push:
branches:
- "pull-request/[0-9]+"
workflow_dispatch: {}

permissions: {}

jobs:
pr_metadata:
name: Resolve PR metadata
runs-on: ubuntu-latest
permissions:
contents: read
pull-requests: read
outputs:
should_run: ${{ steps.gate.outputs.should_run }}
steps:
- uses: actions/checkout@v6

- id: gate
uses: ./.github/actions/pr-gate
with:
required_label: test:e2e-helm

build-gateway:
needs: [pr_metadata]
if: needs.pr_metadata.outputs.should_run == 'true'
permissions:
contents: read
packages: write
uses: ./.github/workflows/docker-build.yml
with:
component: gateway
platform: linux/amd64

build-supervisor:
needs: [pr_metadata]
if: needs.pr_metadata.outputs.should_run == 'true'
permissions:
contents: read
packages: write
uses: ./.github/workflows/docker-build.yml
with:
component: supervisor
platform: linux/amd64

helm-e2e:
name: Helm E2E (Rust smoke)
needs: [pr_metadata, build-gateway, build-supervisor]
if: needs.pr_metadata.outputs.should_run == 'true'
# Bare runner: running kind-in-container hits nested-Docker / kubeconfig
# complications. The runner has Docker; mise installs helm, kubectl, and
# the Rust toolchain.
runs-on: linux-amd64-cpu8
timeout-minutes: 60
permissions:
contents: read
packages: read
env:
MISE_GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
KIND_CLUSTER_NAME: helm-e2e-${{ github.run_id }}
steps:
- uses: actions/checkout@v6

- name: Install mise
run: |
curl https://mise.run | sh
echo "$HOME/.local/bin" >> "$GITHUB_PATH"
echo "$HOME/.local/share/mise/shims" >> "$GITHUB_PATH"

- name: Install tools
run: mise install --locked

# The openshell-policy crate transitively pulls in z3-sys, whose
# build script needs the z3 C/C++ headers and clang/bindgen to
# compile. The bare runner doesn't ship them; the CI container
# image used by other Rust e2e jobs does, but we can't run helm-e2e
# there (the runner's container handler injects its own --network
# bridge, which conflicts with the --network host we need so kind's
# API server is reachable from the test process).
- name: Install z3 build deps
run: sudo apt-get update && sudo apt-get install -y --no-install-recommends libz3-dev clang

- name: Log in to GHCR
run: echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u "${{ github.actor }}" --password-stdin

- name: Create kind cluster
uses: helm/kind-action@v1
with:
cluster_name: ${{ env.KIND_CLUSTER_NAME }}
wait: 120s

# mise.toml sets KUBECONFIG="{{config_root}}/kubeconfig"; helm/kind-action
# writes to ~/.kube/config. Materialize the kind context at the mise path
# so `mise run e2e:helm` (and the wrapper's `kubectl --context=…`) finds
# the kind cluster.
- name: Export kind kubeconfig to mise path
run: |
set -euo pipefail
kind get kubeconfig --name "$KIND_CLUSTER_NAME" > "$GITHUB_WORKSPACE/kubeconfig"
chmod 600 "$GITHUB_WORKSPACE/kubeconfig"

# Pre-pull and side-load: kind nodes don't have ghcr credentials, and
# tagging IMAGE_TAG to a SHA means the chart's IfNotPresent pull policy
# is satisfied once the image is loaded into the node's containerd.
- name: Load gateway and supervisor images into kind
run: |
set -euo pipefail
for component in gateway supervisor; do
image="ghcr.io/nvidia/openshell/${component}:${{ github.sha }}"
docker pull "$image"
kind load docker-image "$image" --name "$KIND_CLUSTER_NAME"
done

- name: Run Helm E2E (Rust smoke)
env:
OPENSHELL_E2E_KUBE_CONTEXT: kind-${{ env.KIND_CLUSTER_NAME }}
IMAGE_TAG: ${{ github.sha }}
OPENSHELL_REGISTRY: ghcr.io/nvidia/openshell
run: mise run --no-deps --skip-deps e2e:helm
3 changes: 2 additions & 1 deletion .github/workflows/e2e-label-help.yml
@@ -19,7 +19,7 @@ permissions: {}
jobs:
hint:
name: Post next-step hint for E2E label
if: github.event.label.name == 'test:e2e' || github.event.label.name == 'test:e2e-gpu'
if: github.event.label.name == 'test:e2e' || github.event.label.name == 'test:e2e-gpu' || github.event.label.name == 'test:e2e-helm'
runs-on: ubuntu-latest
permissions:
pull-requests: write
@@ -40,6 +40,7 @@ jobs:
case "$LABEL_NAME" in
test:e2e) workflow_file=branch-e2e.yml; workflow_name="Branch E2E Checks" ;;
test:e2e-gpu) workflow_file=test-gpu.yml; workflow_name="GPU Test" ;;
test:e2e-helm) workflow_file=branch-helm-e2e.yml; workflow_name="Branch Helm E2E" ;;
*) echo "Unrecognized label $LABEL_NAME"; exit 1 ;;
esac

85 changes: 85 additions & 0 deletions Cargo.lock


6 changes: 6 additions & 0 deletions Cargo.toml
@@ -58,6 +58,12 @@ tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["env-filter", "json"] }
tracing-appender = "0.2"

# OpenTelemetry — pinned to a tonic-0.12 / prost-0.13 compatible release set.
opentelemetry = "0.29"
opentelemetry_sdk = { version = "0.29", features = ["rt-tokio"] }
opentelemetry-otlp = { version = "0.29", default-features = false, features = ["grpc-tonic", "trace"] }
tracing-opentelemetry = "0.30"
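
# A suggested sanity check (run from the workspace root, not part of this
# manifest): `cargo tree -i tonic` errors if more than one tonic version
# is in the graph, so a clean run confirms these pins resolve to a single
# tonic/prost pair.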

# Metrics
metrics = "0.24"
metrics-exporter-prometheus = { version = "0.18", default-features = false, features = ["http-listener"] }
17 changes: 17 additions & 0 deletions architecture/gateway.md
@@ -54,6 +54,23 @@ Domain objects use shared metadata: stable server-generated IDs, human-readable
names, creation timestamps, and labels. Crate-level details live in
`crates/openshell-core/README.md`.

### Observability surface

The gateway exposes three independent telemetry surfaces, each with its own
configuration knob and consumer:

| Surface | Direction | Configured by | Consumers |
|---|---|---|---|
| Prometheus metrics on `/metrics` | Pull | `--metrics-port` (CLI), `monitoring.serviceMonitor.*` (Helm) | Prometheus / kube-prometheus-stack via `ServiceMonitor`. |
| OpenTelemetry traces over OTLP/gRPC | Push | `--otlp-endpoint` / `OTEL_EXPORTER_OTLP_*` env, `monitoring.tracing.*` (Helm) | Any OTLP backend (Jaeger, Tempo, OTel Collector). The per-request span set up by `TraceLayer` becomes the OTLP root. |
| Sandbox log fan-out | Push (gRPC stream) | Always on (per-sandbox subscription) | CLI / TUI / SDK consumers via `WatchSandbox` and `GetSandboxLogs`; OCSF JSONL when enabled inside the sandbox. |

Trace export is opt-in: the gateway only installs the OpenTelemetry layer
when an OTLP endpoint is supplied. Spans flush on `SIGTERM` via an explicit
`shutdown()` in the gateway shutdown path. See
[Monitoring the Gateway](../docs/kubernetes/monitoring.mdx) for the operator
guide.
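
A minimal local sketch of the two export modes; the binary name is an assumption, while `--metrics-port` and `--otlp-endpoint` are the knobs listed above:

```bash
# Start the gateway with both surfaces enabled (binary name assumed;
# 4317 is the conventional OTLP/gRPC port, e.g. Jaeger's).
openshell-gateway --metrics-port 9091 --otlp-endpoint http://localhost:4317 &

# Pull surface: scrape on demand.
curl -s http://localhost:9091/metrics | head -n 20

# Push surface: spans reach the backend only after requests arrive and the
# batch exporter flushes (or on SIGTERM, via the explicit shutdown()).
```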

## Persistence

The gateway persistence layer is a protobuf object store. Domain services store
6 changes: 6 additions & 0 deletions crates/openshell-server/Cargo.toml
@@ -64,6 +64,12 @@ anyhow = { workspace = true }
tracing = { workspace = true }
tracing-subscriber = { workspace = true }

# OpenTelemetry tracing export (opt-in, configured via env)
opentelemetry = { workspace = true }
opentelemetry_sdk = { workspace = true }
opentelemetry-otlp = { workspace = true }
tracing-opentelemetry = { workspace = true }

# Metrics
metrics = { workspace = true }
metrics-exporter-prometheus = { workspace = true }