
feat(observability): add gateway OTLP traces and Helm monitoring surface #1270

Draft
TaylorMutch wants to merge 4 commits into main from tmutch/otel-metrics-traces

@TaylorMutch
Collaborator

Summary

Adds opt-in OpenTelemetry trace export to the gateway and a Prometheus ServiceMonitor to the Helm chart. Both surfaces are independent from the existing /metrics endpoint and the OCSF sandbox log fan-out, default off, and configured via standard OTEL_* env vars or chart values.

Changes

Gateway (crates/openshell-server)

  • Pin OTel 0.29 / tracing-opentelemetry 0.30 (the latest set compatible with the workspace's tonic 0.12 + prost 0.13).
  • TracingLogBus::install_subscriber now optionally appends a tracing-opentelemetry layer when an OTLP endpoint is configured. The existing tower_http::trace::TraceLayer per-request span automatically becomes the OTLP root — no #[instrument] rewrites required.
  • New OtlpTracingConfig::resolve honors the precedence OTEL_EXPORTER_OTLP_TRACES_ENDPOINT > OTEL_EXPORTER_OTLP_ENDPOINT > --otlp-endpoint.
  • Sampler reads OTEL_TRACES_SAMPLER / OTEL_TRACES_SAMPLER_ARG; default parent_based_traceidratio(1.0).
  • New shutdown() flushes the BatchSpanProcessor from the gateway shutdown path on SIGTERM.
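
For reviewers, the endpoint precedence can be sketched as a pure function. This is a minimal illustration, not the code in crates/openshell-server: the `OtlpTracingConfig` fields, the `non_empty` helper, and the `resolve` signature below are assumptions, and the ordering (traces-specific env var first, generic env var second, CLI flag last) is read from the order in which the PR description lists the sources.

```rust
/// Hypothetical stand-in for the PR's `OtlpTracingConfig`; the real struct's
/// fields are not shown in the PR description.
#[derive(Debug, PartialEq, Eq)]
pub struct OtlpTracingConfig {
    pub endpoint: String,
}

/// Treat unset, empty, and whitespace-only values the same way (assumption).
fn non_empty(value: Option<&str>) -> Option<String> {
    value
        .map(str::trim)
        .filter(|v| !v.is_empty())
        .map(str::to_owned)
}

/// Pick the first configured endpoint in the assumed precedence order.
/// Returns `None` when nothing is configured, which keeps tracing off.
pub fn resolve(
    traces_endpoint: Option<&str>,  // OTEL_EXPORTER_OTLP_TRACES_ENDPOINT
    generic_endpoint: Option<&str>, // OTEL_EXPORTER_OTLP_ENDPOINT
    cli_endpoint: Option<&str>,     // --otlp-endpoint
) -> Option<OtlpTracingConfig> {
    non_empty(traces_endpoint)
        .or_else(|| non_empty(generic_endpoint))
        .or_else(|| non_empty(cli_endpoint))
        .map(|endpoint| OtlpTracingConfig { endpoint })
}

/// Convenience wrapper that reads the two standard env vars from the process.
pub fn resolve_from_env(cli_endpoint: Option<&str>) -> Option<OtlpTracingConfig> {
    let traces = std::env::var("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT").ok();
    let generic = std::env::var("OTEL_EXPORTER_OTLP_ENDPOINT").ok();
    resolve(traces.as_deref(), generic.as_deref(), cli_endpoint)
}
```

Keeping the environment lookup in a thin wrapper lets the precedence logic be unit-tested without mutating process-wide environment variables.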

Helm chart

  • New monitoring.serviceMonitor.* and monitoring.tracing.* blocks in values.yaml (off by default).
  • New templates/servicemonitor.yaml (gated, scrapes the existing named metrics port).
  • StatefulSet projects OTEL_* env vars when tracing is enabled, including merged OTEL_RESOURCE_ATTRIBUTES.
  • New ci/values-monitoring.yaml overlay and commented-in kube-prometheus-stack + jaeger Helm releases in skaffold.yaml.
  • New Monitoring section in deploy/helm/openshell/README.md.
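
The chart performs the OTEL_RESOURCE_ATTRIBUTES merge in Helm templates, which are not shown here. The semantics of merging two comma-separated `key=value` lists can still be sketched in Rust for illustration. The function names are hypothetical, and the rule that the second list overrides the first is an assumption, not something this PR description states.

```rust
/// Parse "k1=v1,k2=v2" into ordered pairs, skipping entries without '='
/// or with an empty key.
fn parse_attrs(raw: &str) -> Vec<(String, String)> {
    raw.split(',')
        .filter_map(|pair| pair.split_once('='))
        .map(|(k, v)| (k.trim().to_owned(), v.trim().to_owned()))
        .filter(|(k, _)| !k.is_empty())
        .collect()
}

/// Merge two attribute lists. Keys keep the position of their first
/// occurrence; values from `overrides` replace values from `defaults`
/// (assumed rule).
pub fn merge_resource_attributes(defaults: &str, overrides: &str) -> String {
    let mut merged: Vec<(String, String)> = Vec::new();
    for (key, value) in parse_attrs(defaults).into_iter().chain(parse_attrs(overrides)) {
        match merged.iter_mut().find(|entry| entry.0 == key) {
            Some(entry) => entry.1 = value,
            None => merged.push((key, value)),
        }
    }
    merged
        .iter()
        .map(|(k, v)| format!("{k}={v}"))
        .collect::<Vec<_>>()
        .join(",")
}
```

For example, merging `service.namespace=openshell,deployment.environment=dev` with `deployment.environment=prod` would yield `service.namespace=openshell,deployment.environment=prod` under this rule.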

Tooling

  • New tasks/observability.toml exposing observability:k8s:setup, observability:k8s:teardown, and observability:port-forward.
  • New scripts under tasks/scripts/ mirroring the existing keycloak-k8s-setup.sh shape: install slim kube-prometheus-stack + Jaeger all-in-one, idempotent re-runs.

Docs / agent skills

  • New docs/kubernetes/monitoring.mdx (operator + local-dev guide).
  • Cross-links from docs/observability/overview.mdx and a new "Observability surface" subsection in architecture/gateway.md.
  • helm-dev-environment and debug-openshell-cluster skills updated.

Testing

  • mise run pre-commit passes (lint, format, license headers, clippy, helm-lint matrix, full workspace tests).
  • Unit tests added for OtlpTracingConfig::resolve and sampler_from_env.
  • End-to-end on local k3d: created cluster, ran observability:k8s:setup, deployed gateway with ci/values-monitoring.yaml, drove 5 ListSandboxes + 3 Health gRPC calls. Verified:
    • Prometheus target up{job="openshell"} == 1; openshell_server_grpc_requests_total totals match driven traffic (8).
    • Jaeger registers openshell-gateway service; 8 request spans with correct method, path, request_id attributes; resource attributes include service.namespace=openshell, service.version=0.0.0, deployment.environment=dev, telemetry.sdk.version=0.29.0.
  • No new e2e runtime test in CI for OTLP — unit tests + manual validation are sufficient for v1; standing up Jaeger in CI is disproportionate.
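
The sampler selection that the sampler_from_env unit tests cover can be sketched as a pure function over the two variable values. This is a sketch under assumptions, not the PR's implementation: `SamplerChoice`, `ratio`, and `sampler_from_values` are hypothetical names, the accepted strings are the standard OTEL_TRACES_SAMPLER values from the OpenTelemetry specification, and falling back to a ratio of 1.0 on a missing or invalid argument is an assumed behavior.

```rust
/// Hypothetical, SDK-independent description of a sampler choice.
#[derive(Debug, Clone, PartialEq)]
pub enum SamplerChoice {
    AlwaysOn,
    AlwaysOff,
    TraceIdRatio(f64),
    ParentBasedAlwaysOn,
    ParentBasedAlwaysOff,
    ParentBasedTraceIdRatio(f64),
}

/// Parse the ratio argument; fall back to 1.0 when it is missing,
/// not a number, or outside [0.0, 1.0] (assumed fallback).
fn ratio(arg: Option<&str>) -> f64 {
    arg.and_then(|a| a.trim().parse::<f64>().ok())
        .filter(|r| (0.0..=1.0).contains(r))
        .unwrap_or(1.0)
}

/// Map OTEL_TRACES_SAMPLER / OTEL_TRACES_SAMPLER_ARG values to a choice.
pub fn sampler_from_values(name: Option<&str>, arg: Option<&str>) -> SamplerChoice {
    match name.map(str::trim) {
        Some("always_on") => SamplerChoice::AlwaysOn,
        Some("always_off") => SamplerChoice::AlwaysOff,
        Some("traceidratio") => SamplerChoice::TraceIdRatio(ratio(arg)),
        Some("parentbased_always_on") => SamplerChoice::ParentBasedAlwaysOn,
        Some("parentbased_always_off") => SamplerChoice::ParentBasedAlwaysOff,
        Some("parentbased_traceidratio") => {
            SamplerChoice::ParentBasedTraceIdRatio(ratio(arg))
        }
        // Unset or unrecognized: the default stated in the PR description.
        _ => SamplerChoice::ParentBasedTraceIdRatio(1.0),
    }
}
```

A wrapper would pass `std::env::var("OTEL_TRACES_SAMPLER").ok().as_deref()` and the matching `_ARG` lookup into `sampler_from_values`, which keeps the parsing testable without touching the process environment.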

Out of scope (follow-ups)

  • OTLP log export (Loki / Collector logs receiver). OCSF JSONL remains the canonical log story.
  • In-process OTLP metrics push exporter — Prometheus pull is sufficient.
  • HTTP/protobuf OTLP transport — gateway currently only supports gRPC; chart accepts protocol: grpc.
  • Pre-built Grafana dashboards as ConfigMaps.
  • Per-handler #[tracing::instrument] annotations on gRPC handlers.

Checklist

@copy-pr-bot

copy-pr-bot Bot commented May 8, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
Adds a label-gated GitHub Actions workflow that exercises the Helm
chart end-to-end against the Rust e2e suite via `mise run e2e:helm`.

Pipeline:
- pr_metadata gates on the `test:e2e-helm` label via the pr-gate action.
- build-gateway / build-supervisor build and push Docker images using
  the reusable docker-build.yml workflow.
- helm-e2e (bare runner): apt-installs z3 build deps so cargo can
  compile the openshell-policy crate's z3-sys backend, creates a kind
  cluster via helm/kind-action, materializes the kind kubeconfig at the
  path mise's [env] block expects, side-loads the freshly built
  gateway/supervisor images, applies
  deploy/kube/manifests/agent-sandbox.yaml so the
  sandboxes.agents.x-k8s.io CRD and reconciling StatefulSet are in
  place, and finally runs `mise run e2e:helm`.

Also expands the `e2e:helm` task to run the full Rust e2e suite
(matching `e2e:podman`) instead of only the smoke test, with
OPENSHELL_E2E_KUBE_TEST as an opt-in single-test override for local
debugging.

Extends the e2e-label-help workflow so applying `test:e2e-helm` posts
the next-step hint pointing at this workflow.

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
Adds opt-in OpenTelemetry trace export and a Prometheus ServiceMonitor to
the gateway Helm chart. The exporter and chart toggles are independent
from the existing /metrics surface and the OCSF sandbox log fan-out.

- Gateway: append a tracing-opentelemetry layer to TracingLogBus when an
  OTLP/gRPC endpoint is configured; flush spans on shutdown. CLI gains
  --otlp-endpoint; standard OTEL_* env vars drive sampling and resource
  attributes.
- Helm: monitoring.serviceMonitor.* renders a Prometheus-Operator
  ServiceMonitor; monitoring.tracing.* projects OTEL_* env vars onto the
  gateway container. Both default off.
- Tooling: observability:k8s:{setup,teardown,port-forward} mise tasks
  install kube-prometheus-stack + Jaeger all-in-one for local dev.
- Docs: new docs/kubernetes/monitoring.mdx; cross-links from observability
  overview and architecture/gateway.md; helm-dev-environment and
  debug-openshell-cluster skills updated.
…files

The kube-prometheus-stack and Jaeger releases were configured via long
chains of `--set` flags, which obscure the configuration and make the
script hard to extend. Extract them into two checked-in values files
the setup script consumes via `--values`.

- tasks/scripts/observability-prometheus-values.yaml — slim chart config
  plus Grafana auto-provisioning of a Jaeger datasource (stable uid so
  dashboards can reference it).
- tasks/scripts/observability-jaeger-values.yaml — all-in-one Jaeger.
- PROMSTACK_VALUES and JAEGER_VALUES env vars allow pointing at custom
  files for local experimentation.
@TaylorMutch TaylorMutch force-pushed the tmutch/otel-metrics-traces branch from a551804 to c6463bf Compare May 8, 2026 23:14
@github-actions

github-actions Bot commented May 8, 2026
