3 changes: 2 additions & 1 deletion .agents/skills/debug-openshell-cluster/SKILL.md
@@ -74,6 +74,7 @@ Common findings:
- Sandbox image missing or pull denied: verify image reference and registry credentials.
- Docker driver cannot initialize because it cannot find `openshell-sandbox`: verify `OPENSHELL_DOCKER_SUPERVISOR_BIN`, the sibling binary next to `openshell-gateway`, or the configured supervisor image contains `/openshell-sandbox`.
- Sandbox never registers: check gateway logs and supervisor callback endpoint.
- Supervisor image exits before printing `openshell-sandbox --version`: the image should be the scratch supervisor image from `deploy/docker/Dockerfile.supervisor` and must contain a static executable at `/openshell-sandbox`.
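
The version check in the last finding can be scripted. The following is a minimal sketch, not a project tool: `version_output_ok` is a hypothetical helper name, and the match mirrors the `grep` that the CI build workflow runs against `--version` output.

```shell
# Hypothetical helper (not part of the repository): succeed when some line of
# the captured --version output starts with the binary name followed by a space.
version_output_ok() {
  binary_name="$1"
  output="$2"
  printf '%s\n' "$output" | grep -q "^${binary_name} "
}

# Possible usage, assuming Docker is available and IMAGE holds the supervisor
# image reference (both are assumptions):
#   out="$(docker run --rm --entrypoint /openshell-sandbox "$IMAGE" --version)"
#   version_output_ok openshell-sandbox "$out" || echo "unexpected version output" >&2
```

An empty capture fails the check, which is the symptom described above when the image does not contain a working static executable at `/openshell-sandbox`.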

For source checkout development, restart the local gateway with:

@@ -126,7 +127,7 @@ kubectl -n openshell get statefulset openshell -o jsonpath="{.spec.template.spec
helm -n openshell get values openshell | grep -E 'repository|tag|supervisorImage'
```

The gateway image and `server.supervisorImage` should use the same build tag in branch and E2E deploys. A stale supervisor image can make sandbox behavior lag behind gateway policy or proto changes.
The gateway image built from `deploy/docker/Dockerfile.gateway` and the scratch supervisor image built from `deploy/docker/Dockerfile.supervisor` should use the same build tag in branch and E2E deploys. A stale supervisor image can make sandbox behavior lag behind gateway policy or proto changes.
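
To compare the two tags mechanically, a small helper can extract the tag from each image reference. This is a sketch under stated assumptions: `image_tag` and `same_build_tag` are hypothetical names, and the references passed in would come from the `kubectl` and `helm` output above.

```shell
# Hypothetical helper: print the tag portion of an image reference, or an empty
# line if the reference has no tag. Only the last path component is inspected,
# so a registry port such as localhost:5000 is not mistaken for a tag.
image_tag() {
  ref="${1%%@*}"        # drop any @sha256:... digest suffix
  last="${ref##*/}"     # last path component, e.g. gateway:dev
  case "$last" in
    *:*) printf '%s\n' "${last##*:}" ;;
    *)   printf '\n' ;;
  esac
}

# Hypothetical helper: succeed only when both references carry the same
# non-empty tag.
same_build_tag() {
  a="$(image_tag "$1")"
  b="$(image_tag "$2")"
  [ -n "$a" ] && [ "$a" = "$b" ]
}
```

A failing `same_build_tag` on the gateway and supervisor references is a quick signal that the supervisor image may be stale.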

For local/external pull mode (the default local path via `mise run cluster`), local images are tagged to the configured local registry base, pushed to that registry, and pulled by k3s via the `registries.yaml` mirror endpoint. The `cluster` task pushes prebuilt local tags (`openshell/*:dev`, falling back to `localhost:5000/openshell/*:dev` or `127.0.0.1:5000/openshell/*:dev`).

9 changes: 6 additions & 3 deletions .github/workflows/e2e-gpu-test.yaml
@@ -49,6 +49,11 @@ jobs:
OPENSHELL_REGISTRY_USERNAME: ${{ github.actor }}
OPENSHELL_REGISTRY_PASSWORD: ${{ secrets.GITHUB_TOKEN }}
OPENSHELL_E2E_DOCKER_GPU: "1"
# NVIDIA-managed Ubuntu base used as the GPU probe target: it has the
# filesystem layout CDI injection expects (ldconfig, populated /usr/bin)
# which the distroless gateway runtime lacks. Consumed by the prereq
# probe below and by the e2e tests in e2e/rust/tests/gpu_device_selection.rs.
OPENSHELL_E2E_GPU_PROBE_IMAGE: "nvcr.io/nvidia/base/ubuntu:noble-20251013"
steps:
- uses: actions/checkout@v6

@@ -58,9 +63,7 @@ jobs:
- name: Check Docker GPU prerequisites
run: |
docker info --format '{{json .CDISpecDirs}}'
GPU_PROBE_IMAGE="$(awk '$1 == "FROM" && $3 == "AS" && $4 == "gateway" { print $2; exit }' deploy/docker/Dockerfile.images)"
test -n "${GPU_PROBE_IMAGE}"
docker run --rm --device nvidia.com/gpu=all "${GPU_PROBE_IMAGE}" nvidia-smi -L
docker run --rm --device nvidia.com/gpu=all "${OPENSHELL_E2E_GPU_PROBE_IMAGE}" nvidia-smi -L

- name: Run tests
run: mise run --no-deps --skip-deps e2e:docker:gpu
53 changes: 46 additions & 7 deletions .github/workflows/rust-native-build.yml
@@ -1,10 +1,12 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

name: Rust Native Build (openshell-gateway / openshell-sandbox)
name: Rust Image Binary Build (openshell-gateway / openshell-sandbox)

# Build Rust binaries natively per Linux architecture before the Docker image
# build consumes them as prebuilt artifacts.
# Build Rust binaries per Linux architecture before the Docker image build
# consumes them as prebuilt artifacts. Gateway images use GNU-linked binaries
# for the NVIDIA distroless C/C++ runtime; supervisor images use musl/static
# binaries so the final image can remain scratch.

on:
workflow_call:
@@ -105,10 +107,12 @@ jobs:
gateway)
crate=openshell-server
binary=openshell-gateway
zig_target=
;;
sandbox)
crate=openshell-sandbox
binary=openshell-sandbox
zig_target=
;;
*)
echo "unsupported component: $COMPONENT" >&2
@@ -118,10 +122,20 @@ jobs:

case "$ARCH" in
amd64)
target=x86_64-unknown-linux-gnu
if [[ "$COMPONENT" == "sandbox" ]]; then
target=x86_64-unknown-linux-musl
zig_target=x86_64-linux-musl
else
target=x86_64-unknown-linux-gnu
fi
;;
arm64)
target=aarch64-unknown-linux-gnu
if [[ "$COMPONENT" == "sandbox" ]]; then
target=aarch64-unknown-linux-musl
zig_target=aarch64-linux-musl
else
target=aarch64-unknown-linux-gnu
fi
;;
*)
echo "unsupported arch: $ARCH" >&2
@@ -133,6 +147,7 @@ jobs:
echo "crate=$crate"
echo "binary=$binary"
echo "target=$target"
echo "zig_target=$zig_target"
} >> "$GITHUB_OUTPUT"

- name: Configure GHA sccache backend
@@ -163,6 +178,30 @@ jobs:
set -euo pipefail
sed -i -E '/^\[workspace\.package\]/,/^\[/{s/^version[[:space:]]*=[[:space:]]*".*"/version = "'"${{ steps.version.outputs.cargo_version }}"'"/}' Cargo.toml

- name: Set up zig musl wrappers
if: contains(steps.target.outputs.target, 'musl')
run: |
set -euo pipefail
ZIG="$(mise which zig)"
ZIG_TARGET="${{ steps.target.outputs.zig_target }}"
mkdir -p /tmp/zig-musl

# cc-rs injects --target=<rust-triple>, which zig does not parse.
# Strip caller-provided --target and use the wrapper's zig target.
for tool in cc c++; do
printf '#!/bin/bash\nargs=()\nfor arg in "$@"; do\n case "$arg" in\n --target=*) ;;\n *) args+=("$arg") ;;\n esac\ndone\nexec "%s" %s --target=%s "${args[@]}"\n' \
"$ZIG" "$tool" "$ZIG_TARGET" > "/tmp/zig-musl/${tool}"
chmod +x "/tmp/zig-musl/${tool}"
done

TARGET_ENV=$(echo "${{ steps.target.outputs.target }}" | tr '-' '_')
TARGET_ENV_UPPER=${TARGET_ENV^^}

echo "CC_${TARGET_ENV}=/tmp/zig-musl/cc" >> "$GITHUB_ENV"
echo "CXX_${TARGET_ENV}=/tmp/zig-musl/c++" >> "$GITHUB_ENV"
echo "CARGO_TARGET_${TARGET_ENV_UPPER}_LINKER=/tmp/zig-musl/cc" >> "$GITHUB_ENV"
echo "CARGO_TARGET_${TARGET_ENV_UPPER}_RUSTFLAGS=-Clink-self-contained=no" >> "$GITHUB_ENV"

- name: Build ${{ steps.target.outputs.binary }} (${{ steps.target.outputs.target }})
env:
# Preserve the release-codegen setting used by the old Dockerfile
@@ -171,6 +210,7 @@ jobs:
OPENSHELL_IMAGE_TAG: ${{ inputs['image-tag'] }}
run: |
set -euo pipefail
mise x -- rustup target add "${{ steps.target.outputs.target }}"
args=(
--release
--target "${{ steps.target.outputs.target }}"
@@ -192,8 +232,7 @@ jobs:
OUTPUT="$("$BIN" --version)"
echo "$OUTPUT"
grep -q "^${{ steps.target.outputs.binary }} " <<<"$OUTPUT"
# Record glibc linkage so drift from the Ubuntu noble runtime base
# image is visible in logs.
# Record linkage so image runtime drift is visible in logs.
ldd --version
ldd "$BIN" || true

30 changes: 25 additions & 5 deletions architecture/build.md
@@ -12,7 +12,8 @@ OpenShell builds these main artifacts:
|---|---|
| Gateway binary | `crates/openshell-server` |
| CLI package and Python SDK | `python/openshell` plus Rust binaries where packaged |
| Gateway and supervisor container images | `deploy/docker/Dockerfile.images` |
| Gateway container image | `deploy/docker/Dockerfile.gateway` |
| Supervisor container image | `deploy/docker/Dockerfile.supervisor` |
| Helm chart | `deploy/helm/openshell` |
| VM driver/runtime assets | `crates/openshell-driver-vm` |
| Published docs site | `docs/` rendered by Fern config in `fern/` |
@@ -21,10 +22,29 @@ Sandbox community images are built outside this repository.

## Container Builds

The Docker image pipeline stages prebuilt Rust binaries, then builds container
images from `deploy/docker/Dockerfile.images`. CI builds native artifacts on the
target architecture, stages them under `deploy/docker/.build/`, and then uses
Buildx to publish per-architecture images and multi-architecture tags.
The Docker image pipeline is a two-step flow: build the Rust binary natively
for the target architecture, then assemble the container image from the
prebuilt binary. The gateway image is built from `deploy/docker/Dockerfile.gateway`
and the supervisor image from `deploy/docker/Dockerfile.supervisor`. Neither
Dockerfile compiles Rust — both copy a staged binary out of
`deploy/docker/.build/prebuilt-binaries/<arch>/` into the final image.

Binary staging is driven by `tasks/scripts/stage-prebuilt-binaries.sh`, which
runs `cargo build` natively on a matching host or `cargo zigbuild` when
cross-compiling. CI invokes the same staging step via the
`rust-native-build.yml` workflow (per-architecture, per-component) and uploads
the result as an artifact that the image build job downloads back into the
staging directory before running Buildx.
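
The native-versus-cross decision can be sketched as follows. This is an illustration only, not the contents of `stage-prebuilt-binaries.sh`: the helper name `cargo_subcommand_for` and the architecture normalization are assumptions.

```shell
# Hypothetical helper: choose the cargo subcommand for a target architecture.
# host_arch is in `uname -m` form (x86_64, aarch64, arm64); target_arch is in
# Docker form (amd64, arm64).
cargo_subcommand_for() {
  host_arch="$1"
  target_arch="$2"
  case "$host_arch" in
    x86_64|amd64)  host_norm=amd64 ;;
    aarch64|arm64) host_norm=arm64 ;;
    *)             host_norm="$host_arch" ;;
  esac
  if [ "$host_norm" = "$target_arch" ]; then
    echo build      # matching host: plain `cargo build`
  else
    echo zigbuild   # cross-compiling: `cargo zigbuild`
  fi
}
```

A caller might run `cargo "$(cargo_subcommand_for "$(uname -m)" arm64)" --release` with the appropriate `--target`; the real script may select the subcommand differently.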

Runtime layout:

- **Gateway**: `nvcr.io/nvidia/distroless/cc` base, GNU-linked binary at
`/usr/local/bin/openshell-gateway`, runs as the non-root `nvs` user (UID `1000`, set by `USER nvs:nvs` in `Dockerfile.gateway`).
- **Supervisor**: `scratch` base, static musl binary at `/openshell-sandbox`.
Static linkage is required because the image is mounted/extracted into
sandbox environments (Docker extraction, Podman image volumes, Kubernetes
init-container copy-self) and cannot rely on a dynamic loader.
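
The linkage split above corresponds to the Rust target triple chosen per component and architecture. The sketch below restates the `case` logic from `rust-native-build.yml` as a standalone function; the function name `rust_target_for` is hypothetical.

```shell
# Hypothetical helper: map a component and Docker architecture to the Rust
# target triple, following the case statements in rust-native-build.yml.
rust_target_for() {
  component="$1"
  arch="$2"
  case "$arch" in
    amd64) cpu=x86_64 ;;
    arm64) cpu=aarch64 ;;
    *) echo "unsupported arch: $arch" >&2; return 1 ;;
  esac
  case "$component" in
    gateway) libc=gnu ;;   # GNU-linked for the distroless cc runtime
    sandbox) libc=musl ;;  # static musl for the scratch supervisor image
    *) echo "unsupported component: $component" >&2; return 1 ;;
  esac
  printf '%s-unknown-linux-%s\n' "$cpu" "$libc"
}
```

For example, `rust_target_for sandbox arm64` prints `aarch64-unknown-linux-musl`, the triple the workflow uses for the supervisor binary on arm64.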

Gateway image builds bake the corresponding supervisor image tag into the
gateway binary so Docker sandboxes do not depend on `:latest` by default.
Package formulas also pin Docker supervisor extraction to the matching release
6 changes: 3 additions & 3 deletions crates/openshell-driver-podman/README.md
@@ -86,8 +86,8 @@ sequenceDiagram
C->>C: entrypoint: /opt/openshell/bin/openshell-sandbox
```

The `supervisor` target in `deploy/docker/Dockerfile.images` copies the
`openshell-sandbox` binary to `/openshell-sandbox` in the supervisor image.
The supervisor image from `deploy/docker/Dockerfile.supervisor` copies the static
`openshell-sandbox` binary to `/openshell-sandbox`.
Mounting that image at `/opt/openshell/bin` makes the binary available as
`/opt/openshell/bin/openshell-sandbox`.

@@ -352,4 +352,4 @@ matter compared to cluster or rootful runtimes:
netns, proxy, and relay behavior shared by all drivers.
- Container engine abstraction: `tasks/scripts/container-engine.sh` for
build/deploy support across Docker and Podman.
- Supervisor image build: `deploy/docker/Dockerfile.images`.
- Supervisor image build: `deploy/docker/Dockerfile.supervisor`.
14 changes: 12 additions & 2 deletions crates/openshell-sandbox/src/sandbox/linux/seccomp.rs
@@ -25,6 +25,16 @@ use tracing::debug;
/// Value of `SECCOMP_SET_MODE_FILTER` (linux/seccomp.h).
const SECCOMP_SET_MODE_FILTER: u64 = 1;

// libc 0.2.185 omits `SYS_kexec_file_load` from the musl/aarch64 bindings even
// though the kernel exposes syscall 294. Fall back to the literal so the
// supervisor's seccomp filter still blocks fileless kernel-image loads when
// built statically against musl on aarch64.
#[cfg(all(target_arch = "aarch64", target_env = "musl"))]
#[allow(non_upper_case_globals)]
const SYS_kexec_file_load: libc::c_long = 294;
#[cfg(not(all(target_arch = "aarch64", target_env = "musl")))]
use libc::SYS_kexec_file_load;

/// Apply the supervisor seccomp filter across the running process.
///
/// This runs after privileged startup helpers complete and synchronizes the
@@ -81,7 +91,7 @@ fn build_supervisor_prelude_rules() -> BTreeMap<i64, Vec<SeccompRule>> {
libc::SYS_finit_module,
libc::SYS_delete_module,
libc::SYS_kexec_load,
libc::SYS_kexec_file_load,
SYS_kexec_file_load,
] {
rules.entry(syscall).or_default();
}
@@ -423,7 +433,7 @@ mod tests {
libc::SYS_finit_module,
libc::SYS_delete_module,
libc::SYS_kexec_load,
libc::SYS_kexec_file_load,
SYS_kexec_file_load,
] {
assert!(
filter_rules.contains_key(&syscall),
1 change: 1 addition & 0 deletions deploy/docker/Dockerfile.ci
@@ -29,6 +29,7 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
libz3-dev \
pkg-config \
libssl-dev \
musl-tools \
openssh-client \
python3 \
python3-venv \
41 changes: 41 additions & 0 deletions deploy/docker/Dockerfile.gateway
@@ -0,0 +1,41 @@
# syntax=docker/dockerfile:1.4

# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

# Gateway image build.
#
# The Rust binary is built natively before this image build runs and staged at:
# deploy/docker/.build/prebuilt-binaries/<arch>/openshell-gateway
#
# Use tasks/scripts/docker-build-image.sh gateway (or `mise run build:docker:gateway`)
# to stage the binary and build the image in one step. CI builds the binary
# per-architecture via the `rust-native-build.yml` workflow and uploads it as
# an artifact, which is downloaded into the same staging directory before the
# image build job runs.
#
# The runtime defaults to `nvcr.io/nvidia/distroless/cc:v4.0.4` (see
# GATEWAY_BASE_IMAGE below), which provides glibc and the dynamic loader
# needed by the GNU-linked gateway binary while keeping the attack surface small.

ARG GATEWAY_BASE_IMAGE=nvcr.io/nvidia/distroless/cc:v4.0.4

FROM ${GATEWAY_BASE_IMAGE} AS gateway

ARG TARGETARCH

WORKDIR /app

# --chmod=0550 restores the executable bit that the actions/upload-artifact +
# download-artifact roundtrip strips, without granting world-execute.
# --chown=nvs:nvs matches the image's only defined
# non-root user (`nvs:1000`, the NVIDIA distroless convention) and aligns
# with the Helm chart's `securityContext.runAsUser: 1000`, which overrides
# the Dockerfile's USER at runtime.
COPY --chown=nvs:nvs --chmod=0550 deploy/docker/.build/prebuilt-binaries/${TARGETARCH}/openshell-gateway /usr/local/bin/openshell-gateway

USER nvs:nvs
EXPOSE 8080

ENTRYPOINT ["/usr/local/bin/openshell-gateway"]
CMD ["--bind-address", "0.0.0.0", "--port", "8080"]