11 changes: 11 additions & 0 deletions .agents/skills/debug-openshell-cluster/SKILL.md
@@ -189,6 +189,17 @@ openshell status
openshell logs <sandbox-name>
```

## Telemetry Signals

Before drilling into logs, check whether the gateway is exporting telemetry — the pull-based metrics surface and the push-based trace export are the fastest signals that the control plane is alive and that requests are reaching it.

| Signal | Where it shows up | When to use it |
|---|---|---|
| Prometheus metrics on `/metrics` | A scrape target via the chart's `ServiceMonitor` (`monitoring.serviceMonitor.enabled`). Local: `kubectl -n openshell port-forward statefulset/openshell <metrics-port>:<metrics-port>`. | Confirm the gateway listener is up and gRPC requests are landing. `up{job="openshell"} == 1` in Prometheus is a quick liveness ping. |
| OTLP traces | Jaeger / Tempo / OTel backend (`monitoring.tracing.enabled`). Look for service `openshell-gateway`. | Confirm an inbound request reached the multiplex layer; spans carry `method`, `path`, `request_id`. If traces are missing under load, OTLP export is misconfigured or the endpoint is unreachable. |

If the chart's `monitoring.serviceMonitor.enabled` or `monitoring.tracing.enabled` was not set, the corresponding signal is unavailable; fall back to gateway logs. See [Monitoring the Gateway](../../../docs/kubernetes/monitoring.mdx) for setup.
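
A quick sanity pass over both signals, assuming the namespace and StatefulSet names used above; the metrics port (9091) and the Jaeger UI port (16686) are assumptions, so substitute your configured values:

```bash
# Pull side: forward the gateway metrics port and scrape it directly.
# 9091 is an assumed port; use whatever --metrics-port is set to.
kubectl -n openshell port-forward statefulset/openshell 9091:9091 &
curl -s http://localhost:9091/metrics | head -n 20

# Push side: if a Jaeger UI is reachable (16686 assumed), confirm the
# gateway shows up as a traced service via Jaeger's query API.
curl -s http://localhost:16686/api/services | grep openshell-gateway
```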

## Common Failure Patterns

| Symptom | Likely cause | Check |
33 changes: 33 additions & 0 deletions .agents/skills/helm-dev-environment/SKILL.md
@@ -169,6 +169,39 @@ To remove Keycloak:
mise run keycloak:k8s:teardown
```

### Monitoring (Prometheus + Grafana + Jaeger)

One-time setup — installs `kube-prometheus-stack` (slimmed: no Alertmanager,
node-exporter, or kube-state-metrics) and a Jaeger all-in-one Pod:

```bash
mise run observability:k8s:setup
```

Then activate monitoring on the gateway:

1. Uncomment `#- ci/values-monitoring.yaml` in `skaffold.yaml`
2. Redeploy: `mise run helm:skaffold:run`

Forward UIs to localhost:

```bash
mise run observability:port-forward
# Grafana http://localhost:3000 (admin / admin)
# Prometheus http://localhost:9090
# Jaeger UI http://localhost:16686
```

Teardown:

```bash
mise run observability:k8s:teardown
```

The chart's `monitoring.serviceMonitor.enabled` creates a `ServiceMonitor`
that Prometheus scrapes, and `monitoring.tracing.enabled` projects `OTEL_*`
env vars onto the gateway so it exports OTLP/gRPC traces to Jaeger.
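
To verify the wiring took effect after redeploying, a sketch like the following works; the pod label selector is an assumption, so adjust it to whatever labels the chart actually sets:

```bash
# The ServiceMonitor should exist in the release namespace.
kubectl -n openshell get servicemonitor

# The gateway container should carry the projected OTEL_* env vars
# (label selector assumed; adjust to the chart's labels).
kubectl -n openshell get pods -l app.kubernetes.io/name=openshell \
  -o jsonpath='{.items[0].spec.containers[0].env[*].name}' | tr ' ' '\n' | grep OTEL
```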

---

## Cluster Lifecycle (suspend/resume)
126 changes: 126 additions & 0 deletions .github/workflows/branch-helm-e2e.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,126 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

name: Branch Helm E2E

on:
push:
branches:
- "pull-request/[0-9]+"
workflow_dispatch: {}

permissions: {}

jobs:
pr_metadata:
name: Resolve PR metadata
runs-on: ubuntu-latest
permissions:
contents: read
pull-requests: read
outputs:
should_run: ${{ steps.gate.outputs.should_run }}
steps:
- uses: actions/checkout@v6

- id: gate
uses: ./.github/actions/pr-gate
with:
required_label: test:e2e-helm

build-gateway:
needs: [pr_metadata]
if: needs.pr_metadata.outputs.should_run == 'true'
permissions:
contents: read
packages: write
uses: ./.github/workflows/docker-build.yml
with:
component: gateway
platform: linux/amd64

build-supervisor:
needs: [pr_metadata]
if: needs.pr_metadata.outputs.should_run == 'true'
permissions:
contents: read
packages: write
uses: ./.github/workflows/docker-build.yml
with:
component: supervisor
platform: linux/amd64

helm-e2e:
name: Helm E2E (Rust smoke)
needs: [pr_metadata, build-gateway, build-supervisor]
if: needs.pr_metadata.outputs.should_run == 'true'
# Bare runner: running kind-in-container hits nested-Docker / kubeconfig
# complications. The runner has Docker; mise installs helm, kubectl, and
# the Rust toolchain.
runs-on: linux-amd64-cpu8
timeout-minutes: 60
permissions:
contents: read
packages: read
env:
MISE_GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
KIND_CLUSTER_NAME: helm-e2e-${{ github.run_id }}
steps:
- uses: actions/checkout@v6

- name: Install mise
run: |
curl https://mise.run | sh
echo "$HOME/.local/bin" >> "$GITHUB_PATH"
echo "$HOME/.local/share/mise/shims" >> "$GITHUB_PATH"

- name: Install tools
run: mise install --locked

# The openshell-policy crate transitively pulls in z3-sys, whose
# build script needs the z3 C/C++ headers and clang/bindgen to
# compile. The bare runner doesn't ship them; the CI container
# image used by other Rust e2e jobs does, but we can't run helm-e2e
# there (the runner's container handler injects its own --network
# bridge, which conflicts with the --network host we need so kind's
# API server is reachable from the test process).
- name: Install z3 build deps
run: sudo apt-get update && sudo apt-get install -y --no-install-recommends libz3-dev clang

- name: Log in to GHCR
run: echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u "${{ github.actor }}" --password-stdin

- name: Create kind cluster
uses: helm/kind-action@v1
with:
cluster_name: ${{ env.KIND_CLUSTER_NAME }}
wait: 120s

# mise.toml sets KUBECONFIG="{{config_root}}/kubeconfig"; helm/kind-action
# writes to ~/.kube/config. Materialize the kind context at the mise path
# so `mise run e2e:helm` (and the wrapper's `kubectl --context=…`) finds
# the kind cluster.
- name: Export kind kubeconfig to mise path
run: |
set -euo pipefail
kind get kubeconfig --name "$KIND_CLUSTER_NAME" > "$GITHUB_WORKSPACE/kubeconfig"
chmod 600 "$GITHUB_WORKSPACE/kubeconfig"

# Pre-pull and side-load: kind nodes don't have ghcr credentials, and
# tagging IMAGE_TAG to a SHA means the chart's IfNotPresent pull policy
# is satisfied once the image is loaded into the node's containerd.
- name: Load gateway and supervisor images into kind
run: |
set -euo pipefail
for component in gateway supervisor; do
image="ghcr.io/nvidia/openshell/${component}:${{ github.sha }}"
docker pull "$image"
kind load docker-image "$image" --name "$KIND_CLUSTER_NAME"
done

- name: Run Helm E2E (Rust smoke)
env:
OPENSHELL_E2E_KUBE_CONTEXT: kind-${{ env.KIND_CLUSTER_NAME }}
IMAGE_TAG: ${{ github.sha }}
OPENSHELL_REGISTRY: ghcr.io/nvidia/openshell
run: mise run --no-deps --skip-deps e2e:helm
3 changes: 2 additions & 1 deletion .github/workflows/e2e-label-help.yml
@@ -19,7 +19,7 @@ permissions: {}
jobs:
hint:
name: Post next-step hint for E2E label
if: github.event.label.name == 'test:e2e' || github.event.label.name == 'test:e2e-gpu'
if: github.event.label.name == 'test:e2e' || github.event.label.name == 'test:e2e-gpu' || github.event.label.name == 'test:e2e-helm'
runs-on: ubuntu-latest
permissions:
pull-requests: write
@@ -40,6 +40,7 @@ jobs:
case "$LABEL_NAME" in
test:e2e) workflow_file=branch-e2e.yml; workflow_name="Branch E2E Checks" ;;
test:e2e-gpu) workflow_file=test-gpu.yml; workflow_name="GPU Test" ;;
test:e2e-helm) workflow_file=branch-helm-e2e.yml; workflow_name="Branch Helm E2E" ;;
*) echo "Unrecognized label $LABEL_NAME"; exit 1 ;;
esac

85 changes: 85 additions & 0 deletions Cargo.lock


6 changes: 6 additions & 0 deletions Cargo.toml
@@ -58,6 +58,12 @@ tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["env-filter", "json"] }
tracing-appender = "0.2"

# OpenTelemetry — pinned to a tonic-0.12 / prost-0.13 compatible release set.
opentelemetry = "0.29"
opentelemetry_sdk = { version = "0.29", features = ["rt-tokio"] }
opentelemetry-otlp = { version = "0.29", default-features = false, features = ["grpc-tonic", "trace"] }
tracing-opentelemetry = "0.30"
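
# A suggested sanity check (run from the workspace root, not part of this
# manifest): `cargo tree -i tonic` errors if more than one tonic version
# is in the graph, so a clean run confirms these pins resolve to a single
# tonic/prost pair.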

# Metrics
metrics = "0.24"
metrics-exporter-prometheus = { version = "0.18", default-features = false, features = ["http-listener"] }
17 changes: 17 additions & 0 deletions architecture/gateway.md
@@ -54,6 +54,23 @@ Domain objects use shared metadata: stable server-generated IDs, human-readable
names, creation timestamps, and labels. Crate-level details live in
`crates/openshell-core/README.md`.

### Observability surface

The gateway exposes three independent telemetry surfaces, each with its own
configuration knob and consumer:

| Surface | Direction | Configured by | Consumers |
|---|---|---|---|
| Prometheus metrics on `/metrics` | Pull | `--metrics-port` (CLI), `monitoring.serviceMonitor.*` (Helm) | Prometheus / kube-prometheus-stack via `ServiceMonitor`. |
| OpenTelemetry traces over OTLP/gRPC | Push | `--otlp-endpoint` / `OTEL_EXPORTER_OTLP_*` env, `monitoring.tracing.*` (Helm) | Any OTLP backend (Jaeger, Tempo, OTel Collector). The per-request span set up by `TraceLayer` becomes the OTLP root. |
| Sandbox log fan-out | Push (gRPC stream) | Always on (per-sandbox subscription) | CLI / TUI / SDK consumers via `WatchSandbox` and `GetSandboxLogs`; OCSF JSONL when enabled inside the sandbox. |

Trace export is opt-in: the gateway only installs the OpenTelemetry layer
when an OTLP endpoint is supplied. Spans flush on `SIGTERM` via an explicit
`shutdown()` in the gateway shutdown path. See
[Monitoring the Gateway](../docs/kubernetes/monitoring.mdx) for the operator
guide.
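
A minimal local sketch of the two export modes; the binary name is an assumption, while `--metrics-port` and `--otlp-endpoint` are the knobs listed above:

```bash
# Start the gateway with both surfaces enabled (binary name assumed;
# 4317 is the conventional OTLP/gRPC port, e.g. Jaeger's).
openshell-gateway --metrics-port 9091 --otlp-endpoint http://localhost:4317 &

# Pull surface: scrape on demand.
curl -s http://localhost:9091/metrics | head -n 20

# Push surface: spans reach the backend only after requests arrive and the
# batch exporter flushes (or on SIGTERM, via the explicit shutdown()).
```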

## Persistence

The gateway persistence layer is a protobuf object store. Domain services store
6 changes: 6 additions & 0 deletions crates/openshell-server/Cargo.toml
@@ -64,6 +64,12 @@ anyhow = { workspace = true }
tracing = { workspace = true }
tracing-subscriber = { workspace = true }

# OpenTelemetry tracing export (opt-in, configured via env)
opentelemetry = { workspace = true }
opentelemetry_sdk = { workspace = true }
opentelemetry-otlp = { workspace = true }
tracing-opentelemetry = { workspace = true }

# Metrics
metrics = { workspace = true }
metrics-exporter-prometheus = { workspace = true }