Skip to content

feat(gateway): add TOML configuration file (RFC 0003)#1317

Merged
TaylorMutch merged 16 commits into
mainfrom
tmutch/gateway-config-impl
May 15, 2026
Merged

feat(gateway): add TOML configuration file (RFC 0003)#1317
TaylorMutch merged 16 commits into
mainfrom
tmutch/gateway-config-impl

Conversation

@TaylorMutch
Copy link
Copy Markdown
Collaborator

Summary

Implements RFC 0003: an opt-in --config / OPENSHELL_GATEWAY_CONFIG flag that loads a TOML file with gateway-wide settings and per-driver tables. Source precedence stays CLI > env > file > built-in default, so existing flags and OPENSHELL_* env vars are not affected when no file is supplied. Includes the matching Helm ConfigMap cutover.

Related Issue

N/A — implements RFC 0003.

Changes

  • Loader (crates/openshell-server/src/config_file.rs): TOML parser with deny_unknown_fields everywhere, version-1 enforcement, env-only rejection of database_url / ssh_handshake_secret, and a driver_table() that overlays inheritable [openshell.gateway] defaults onto each [openshell.drivers.<name>] table.
  • CLI (crates/openshell-server/src/cli.rs): new --config flag; merge_file_into_args() applies file values only when clap's ValueSource::DefaultValue indicates an arg was not supplied; new build_vm_config() / build_docker_config() consume merged driver tables.
  • Driver crates (kubernetes, docker, podman, vm): config structs derive Deserialize/Serialize with struct-level #[serde(default, deny_unknown_fields)]. SupervisorSideloadMethod gains Deserialize with kebab-case rename. Refreshed Default impls where needed.
  • run_server / build_compute_runtime thread Option<&ConfigFile> through; the ad-hoc env reads for Kubernetes/Podman supervisor settings now fold into the file-driven pipeline (env still wins).
  • Helm chart: new templates/gateway-config.yaml ConfigMap rendering gateway.toml from existing .Values.server.* / .Values.service.* / .Values.supervisor.* keys; StatefulSet mounts it at /etc/openshell/gateway.toml, passes --config, and drops the 18 migrated OPENSHELL_* env entries. Only OPENSHELL_SSH_HANDSHAKE_SECRET (Secret-backed) remains.
  • Docs: architecture/gateway.md documents source precedence and the inheritance allowlist; examples/gateway/gateway.example.toml ships a worked example.
  • Inheritance scope (Q4 high-overlap set, per-driver allowlist): default_image, supervisor_image, image_pull_policy, client_tls_secret_name, host_gateway_ip, enable_user_namespaces, guest_tls_ca/cert/key.
  • Follow-up commit removes ssh_handshake_skew_secs / ssh_handshake_secret references from examples and architecture ahead of their planned removal.

Testing

  • mise run pre-commit passes
  • Unit tests added/updated — 10 new tests in config_file.rs (parse, secret rejection, version, unknown field, inheritance merge, override) and 6 new tests in cli.rs (file value applies when defaulted, CLI overrides file, env overrides file, OIDC block, driver inheritance, driver override). Full mise run test passes.
  • E2E tests — test:e2e label applied; the sandbox-infra path should be exercised against the new --config startup mode.

Manual smoke:

  • helm lint . clean, helm template . renders gateway.toml with TLS, with TLS disabled, and with OIDC enabled.
  • TOML loader rejects database_url, ssh_handshake_secret, version = 2, and unknown fields with friendly errors.

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

Deferred (explicit follow-ups)

  • Helm unit tests for the new ConfigMap (separate effort).
  • "Raw ConfigMap contents" escape hatch in values.yaml.
  • --config-dir directory loader (RFC open question Prototype data access layer that can connect to outlook data #2).
  • RFC schema-gap fixes (Kubernetes ssh fields, Podman 6 missing fields, VM 9+ missing fields).

@TaylorMutch TaylorMutch added the test:e2e Requires end-to-end coverage label May 12, 2026
@github-actions
Copy link
Copy Markdown

Label test:e2e applied for e324f7e. Open the existing run and click Re-run all jobs to execute with the label set. The E2E Gate check on this PR will flip green automatically once the run finishes.

@TaylorMutch TaylorMutch force-pushed the tmutch/gateway-config-impl branch from bf3670f to 02df5f7 Compare May 12, 2026 22:38
Copy link
Copy Markdown
Collaborator

@drew drew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ran this through codex, these seem reasonable to consider. "Restart pods when the rendered ConfigMap changes" feels possibly out of scope.

Review Comments

[P1] Move Helm grpc_endpoint into a supported table

/Users/anewberry/dev/openshell.tmutch-gateway-config-impl/deploy/helm/openshell/templates/gateway-config.yaml:36

The generated Helm config always writes grpc_endpoint under [openshell.gateway], but GatewayFileSection uses deny_unknown_fields and has no grpc_endpoint field. Default chart installs will fail during config_file::load before the gateway starts; this needs to move under
[openshell.drivers.kubernetes] or be added to the gateway schema and merge path.

[P1] Keep Kubernetes imagePullPolicy empty by default

/Users/anewberry/dev/openshell.tmutch-gateway-config-impl/crates/openshell-driver-kubernetes/src/config.rs:76

When no Kubernetes image pull policy is configured, this now defaults to missing, which is the Podman vocabulary and is not a valid Kubernetes imagePullPolicy. Because the driver emits the field whenever it is non-empty, default Kubernetes deployments will create pods rejected by the
API; keep this default empty so Kubernetes applies its own default.

[P1] Preserve driver-table gRPC endpoints

/Users/anewberry/dev/openshell.tmutch-gateway-config-impl/crates/openshell-server/src/lib.rs:508

When a TOML file sets [openshell.drivers.kubernetes].grpc_endpoint, kubernetes_config_from_file reads it and this assignment immediately overwrites it with config.grpc_endpoint, which can only come from CLI/env in this patch. File-only Kubernetes configs therefore pass an empty
OPENSHELL_ENDPOINT into sandboxes, so the callback path breaks even though the driver table contains the endpoint.

[P2] Keep disableTls wired after moving config to TOML

/Users/anewberry/dev/openshell.tmutch-gateway-config-impl/deploy/helm/openshell/templates/statefulset.yaml:49

When .Values.server.disableTls is true, the ConfigMap omits the TLS section, but this args/env block no longer passes --disable-tls or OPENSHELL_DISABLE_TLS. RunArgs.disable_tls remains false, so startup still requires TLS cert/key/CA paths and the plaintext Helm option fails.

[P2] Restart pods when the rendered ConfigMap changes

/Users/anewberry/dev/openshell.tmutch-gateway-config-impl/deploy/helm/openshell/templates/statefulset.yaml:65

After moving settings such as log level, sandbox image, TLS paths, and driver options into this ConfigMap, Helm upgrades that change those values no longer change the StatefulSet pod template. The gateway only reads /etc/openshell/gateway.toml at startup, so without a checksum
annotation or similar rollout trigger, pods keep running with stale config.

[P2] Honor health and metrics listener addresses

/Users/anewberry/dev/openshell.tmutch-gateway-config-impl/crates/openshell-server/src/cli.rs:511

For config files that set health or metrics on a different interface than the main listener, these lines copy only the port and later build the listener with args.bind_address. For example, a health listener intended for loopback can be exposed on the public bind address; preserve the
full SocketAddr from the TOML file or validate that separate IPs are unsupported.

[P2] Apply ssh_session_ttl_secs from the TOML file

/Users/anewberry/dev/openshell.tmutch-gateway-config-impl/crates/openshell-server/src/config_file.rs:101

The loader accepts ssh_session_ttl_secs, but merge_file_into_args and run_from_args never apply it to openshell_core::Config with with_ssh_session_ttl_secs. Operators can set the documented key and get a successful parse while session tokens still use the built-in default TTL.

@TaylorMutch
Copy link
Copy Markdown
Collaborator Author

Sounds good, agree these look reasonable to me.

Comment thread examples/gateway/docker.example.toml Outdated
Comment thread deploy/helm/openshell/templates/statefulset.yaml
@TaylorMutch TaylorMutch force-pushed the tmutch/gateway-config-impl branch from 02df5f7 to 9f19b02 Compare May 13, 2026 23:57
TaylorMutch added a commit that referenced this pull request May 14, 2026
Resolves the P1 and P2 issues raised in PR #1317:

- Helm gateway ConfigMap moves `grpc_endpoint` under
  `[openshell.drivers.kubernetes]` so the default install no longer fails
  the gateway's `deny_unknown_fields` schema check.
- `kubernetes_config_from_file` and `podman_config_from_file` only let
  the gateway-wide CLI/env `grpc_endpoint` overwrite the driver-table
  value when it was actually supplied, preserving file-only configs.
- Kubernetes driver default `image_pull_policy` is now empty (was Podman
  vocabulary "missing"), so default deployments let the Kubernetes API
  apply its own policy instead of being rejected.
- New `disable_tls` gateway field plumbs `.Values.server.disableTls`
  through the TOML ConfigMap instead of relying on env vars dropped from
  the StatefulSet.
- StatefulSet pod template now carries a `checksum/gateway-config`
  annotation so `helm upgrade` rolls pods when the ConfigMap changes.
- Auxiliary listener resolution preserves the full `SocketAddr` from
  `health_bind_address` / `metrics_bind_address`, so a loopback-pinned
  health port is not silently relocated onto the public bind address.
- `ssh_session_ttl_secs` from the file is now applied to `Config` (it
  was previously accepted by the loader but never read).

New regression coverage: cli-level merge tests for the new fields plus
helm-unittest assertions for the ConfigMap shape, checksum annotation,
and `disable_tls` rendering.
TaylorMutch added a commit that referenced this pull request May 14, 2026
Replaces the per-driver example files under examples/gateway/ with a
single published reference page at docs/reference/gateway-config.mdx
covering source precedence, layout, the full example, and the four
per-driver examples (Kubernetes, Docker, Podman, microVM). Drew flagged
during PR #1317 review that the examples belong with the user-facing
docs rather than in a sibling examples/ directory.

The cross-references in architecture/gateway.md and RFC 0003 are updated
to point at the new docs page; the round-trip test in config_file.rs is
removed (schema coverage stays on the inline parses_full_example test
and per-field merge tests — doc-snippet drift belongs in a separate
docs-lint, not in a cross-tree Rust unit test).
@github-actions
Copy link
Copy Markdown

@TaylorMutch
Copy link
Copy Markdown
Collaborator Author

Ran an audit with Claude:


⏺ Verification complete. No breaking changes to either CLI or env-var surface.

  Gateway binary (crates/openshell-server/src/cli.rs):
  - 0 env = "OPENSHELL_*" declarations removed or renamed
  - 0 --long-flag declarations removed or renamed
  - 1 additive: --config / OPENSHELL_GATEWAY_CONFIG (new TOML loader entry)
  - All 47 prior RunArgs fields preserved; 1 new field (config) added

  Direct env reads (crates/openshell-server/src/lib.rs):
  - All 8 prior std::env::var("OPENSHELL_*") reads preserved
  - 1 additive: OPENSHELL_SUPERVISOR_SIDELOAD_METHOD

  Helm chart StatefulSet env block:
  - 26 envs dropped from the StatefulSet (now in the TOML ConfigMap)
  - 1 retained: OPENSHELL_SSH_HANDSHAKE_SECRET (must stay as a Kubernetes Secret ref)
  - Every dropped env has a verified-rendered TOML equivalent. Spot-checked with helm template (including conditional toggles like disableGatewayAuth, all OIDC fields, server SANs, image-pull policies)

  For operators who set OPENSHELL_* via custom extraEnv or pod-level injection:
  - Every clap env = "OPENSHELL_*" is unchanged, so the gateway binary still reads each one
  - Env still wins over the TOML file — locked in by the merge tests in cli.rs (env_var_overrides_file_value, env_var_overrides_file_loopback_service_http, etc.)

  Net effect

  ┌────────────────────────────┬─────────┬─────────┬────────────────────────────────┐
  │          Surface           │ Removed │ Renamed │             Added              │
  ├────────────────────────────┼─────────┼─────────┼────────────────────────────────┤
  │ Gateway CLI flags          │ 0       │ 0       │ --config                       │
  ├────────────────────────────┼─────────┼─────────┼────────────────────────────────┤
  │ Gateway env vars           │ 0       │ 0       │ OPENSHELL_GATEWAY_CONFIG       │
  ├────────────────────────────┼─────────┼─────────┼────────────────────────────────┤
  │ Helm extraEnv API contract │ 0       │ 0       │ 0 (still respected by gateway) │
  └────────────────────────────┴─────────┴─────────┴────────────────────────────────┘

TaylorMutch added a commit that referenced this pull request May 14, 2026
Resolves the P1 and P2 issues raised in PR #1317:

- Helm gateway ConfigMap moves `grpc_endpoint` under
  `[openshell.drivers.kubernetes]` so the default install no longer fails
  the gateway's `deny_unknown_fields` schema check.
- `kubernetes_config_from_file` and `podman_config_from_file` only let
  the gateway-wide CLI/env `grpc_endpoint` overwrite the driver-table
  value when it was actually supplied, preserving file-only configs.
- Kubernetes driver default `image_pull_policy` is now empty (was Podman
  vocabulary "missing"), so default deployments let the Kubernetes API
  apply its own policy instead of being rejected.
- New `disable_tls` gateway field plumbs `.Values.server.disableTls`
  through the TOML ConfigMap instead of relying on env vars dropped from
  the StatefulSet.
- StatefulSet pod template now carries a `checksum/gateway-config`
  annotation so `helm upgrade` rolls pods when the ConfigMap changes.
- Auxiliary listener resolution preserves the full `SocketAddr` from
  `health_bind_address` / `metrics_bind_address`, so a loopback-pinned
  health port is not silently relocated onto the public bind address.
- `ssh_session_ttl_secs` from the file is now applied to `Config` (it
  was previously accepted by the loader but never read).

New regression coverage: cli-level merge tests for the new fields plus
helm-unittest assertions for the ConfigMap shape, checksum annotation,
and `disable_tls` rendering.
TaylorMutch added a commit that referenced this pull request May 14, 2026
Replaces the per-driver example files under examples/gateway/ with a
single published reference page at docs/reference/gateway-config.mdx
covering source precedence, layout, the full example, and the four
per-driver examples (Kubernetes, Docker, Podman, microVM). Drew flagged
during PR #1317 review that the examples belong with the user-facing
docs rather than in a sibling examples/ directory.

The cross-references in architecture/gateway.md and RFC 0003 are updated
to point at the new docs page; the round-trip test in config_file.rs is
removed (schema coverage stays on the inline parses_full_example test
and per-field merge tests — doc-snippet drift belongs in a separate
docs-lint, not in a cross-tree Rust unit test).
@TaylorMutch TaylorMutch force-pushed the tmutch/gateway-config-impl branch from abe8b74 to b034436 Compare May 14, 2026 16:43
TaylorMutch added a commit that referenced this pull request May 14, 2026
Resolves the P1 and P2 issues raised in PR #1317:

- Helm gateway ConfigMap moves `grpc_endpoint` under
  `[openshell.drivers.kubernetes]` so the default install no longer fails
  the gateway's `deny_unknown_fields` schema check.
- `kubernetes_config_from_file` and `podman_config_from_file` only let
  the gateway-wide CLI/env `grpc_endpoint` overwrite the driver-table
  value when it was actually supplied, preserving file-only configs.
- Kubernetes driver default `image_pull_policy` is now empty (was Podman
  vocabulary "missing"), so default deployments let the Kubernetes API
  apply its own policy instead of being rejected.
- New `disable_tls` gateway field plumbs `.Values.server.disableTls`
  through the TOML ConfigMap instead of relying on env vars dropped from
  the StatefulSet.
- StatefulSet pod template now carries a `checksum/gateway-config`
  annotation so `helm upgrade` rolls pods when the ConfigMap changes.
- Auxiliary listener resolution preserves the full `SocketAddr` from
  `health_bind_address` / `metrics_bind_address`, so a loopback-pinned
  health port is not silently relocated onto the public bind address.
- `ssh_session_ttl_secs` from the file is now applied to `Config` (it
  was previously accepted by the loader but never read).

New regression coverage: cli-level merge tests for the new fields plus
helm-unittest assertions for the ConfigMap shape, checksum annotation,
and `disable_tls` rendering.
TaylorMutch added a commit that referenced this pull request May 14, 2026
Replaces the per-driver example files under examples/gateway/ with a
single published reference page at docs/reference/gateway-config.mdx
covering source precedence, layout, the full example, and the four
per-driver examples (Kubernetes, Docker, Podman, microVM). Drew flagged
during PR #1317 review that the examples belong with the user-facing
docs rather than in a sibling examples/ directory.

The cross-references in architecture/gateway.md and RFC 0003 are updated
to point at the new docs page; the round-trip test in config_file.rs is
removed (schema coverage stays on the inline parses_full_example test
and per-field merge tests — doc-snippet drift belongs in a separate
docs-lint, not in a cross-tree Rust unit test).
@TaylorMutch TaylorMutch force-pushed the tmutch/gateway-config-impl branch from b034436 to 919cc7c Compare May 14, 2026 20:52
@TaylorMutch TaylorMutch requested a review from drew May 14, 2026 21:39
TaylorMutch added a commit that referenced this pull request May 15, 2026
Resolves the P1 and P2 issues raised in PR #1317:

- Helm gateway ConfigMap moves `grpc_endpoint` under
  `[openshell.drivers.kubernetes]` so the default install no longer fails
  the gateway's `deny_unknown_fields` schema check.
- `kubernetes_config_from_file` and `podman_config_from_file` only let
  the gateway-wide CLI/env `grpc_endpoint` overwrite the driver-table
  value when it was actually supplied, preserving file-only configs.
- Kubernetes driver default `image_pull_policy` is now empty (was Podman
  vocabulary "missing"), so default deployments let the Kubernetes API
  apply its own policy instead of being rejected.
- New `disable_tls` gateway field plumbs `.Values.server.disableTls`
  through the TOML ConfigMap instead of relying on env vars dropped from
  the StatefulSet.
- StatefulSet pod template now carries a `checksum/gateway-config`
  annotation so `helm upgrade` rolls pods when the ConfigMap changes.
- Auxiliary listener resolution preserves the full `SocketAddr` from
  `health_bind_address` / `metrics_bind_address`, so a loopback-pinned
  health port is not silently relocated onto the public bind address.
- `ssh_session_ttl_secs` from the file is now applied to `Config` (it
  was previously accepted by the loader but never read).

New regression coverage: cli-level merge tests for the new fields plus
helm-unittest assertions for the ConfigMap shape, checksum annotation,
and `disable_tls` rendering.
TaylorMutch added a commit that referenced this pull request May 15, 2026
Replaces the per-driver example files under examples/gateway/ with a
single published reference page at docs/reference/gateway-config.mdx
covering source precedence, layout, the full example, and the four
per-driver examples (Kubernetes, Docker, Podman, microVM). Drew flagged
during PR #1317 review that the examples belong with the user-facing
docs rather than in a sibling examples/ directory.

The cross-references in architecture/gateway.md and RFC 0003 are updated
to point at the new docs page; the round-trip test in config_file.rs is
removed (schema coverage stays on the inline parses_full_example test
and per-field merge tests — doc-snippet drift belongs in a separate
docs-lint, not in a cross-tree Rust unit test).
@TaylorMutch TaylorMutch force-pushed the tmutch/gateway-config-impl branch from a8ac397 to 381784e Compare May 15, 2026 18:25
Introduces an opt-in --config / OPENSHELL_GATEWAY_CONFIG flag that loads a
TOML file with gateway-wide settings and per-driver tables. Source
precedence is CLI > env > file > built-in default, implemented via clap's
ValueSource so existing flags and env vars keep their priority.

Driver crates (kubernetes, docker, podman, vm) now derive Deserialize on
their config structs. SupervisorSideloadMethod gains Deserialize with
kebab-case rename. A per-driver inheritance allowlist on the loader side
overlays [openshell.gateway] shared defaults (default_image,
supervisor_image, image_pull_policy, guest_tls_*, ssh_handshake_skew_secs,
client_tls_secret_name, host_gateway_ip, enable_user_namespaces) onto
each [openshell.drivers.<name>] table before deserialization.

The Helm chart renders a new gateway-config ConfigMap and mounts it at
/etc/openshell/gateway.toml. The migrated OPENSHELL_* env entries are
dropped from the StatefulSet — only the Secret-backed
OPENSHELL_SSH_HANDSHAKE_SECRET remains. database_url stays on --db-url.

Adds examples/gateway/gateway.example.toml and updates architecture/gateway.md
with the source precedence and inheritance rules.
…from examples

Both fields are scheduled for removal. Remove the example values and the
env-only note so the gateway.toml example and the architecture doc stop
recommending settings that will not exist much longer.
TaylorMutch and others added 13 commits May 15, 2026 12:16
Adds focused single-driver examples next to the comprehensive
gateway.example.toml: kubernetes, docker, podman, and microvm. Each one
demonstrates the realistic settings for that driver plus how shared
[openshell.gateway] defaults inherit into the driver table.

A new unit test (`checked_in_examples_parse`) loads every example through
the config_file loader so schema drift fails CI rather than silently
shipping a broken example.
Kubernetes and Podman use mutually-incompatible vocabularies for the same
TOML key:

  - Kubernetes: `Always | IfNotPresent | Never` (free-form string passed
    verbatim to the K8s API).
  - Podman: `always | missing | never | newer` (strict lowercase enum
    deserialised into `ImagePullPolicy`).

No value means the same thing in both drivers. Sharing the key at
`[openshell.gateway]` scope and inheriting it into every active driver's
table meant any value safe for one driver was either wrong or silently
dropped for the other (`IfNotPresent` → `ImagePullPolicy::Missing` after
`.unwrap_or_default()`). Operators run one driver per gateway, so the
"shared default" never pays for itself.

Make `image_pull_policy` driver-local:

  - Remove the field from `GatewayFileSection` and from
    `inheritable_keys()` for both Kubernetes and Podman.
  - Drop the file→`RunArgs` merge for the gateway-scope key.
  - Stop unconditionally clobbering the driver value with
    `config.sandbox_image_pull_policy` in the runtime wiring — only apply
    the CLI/env override when it was set (and, for Podman, only when it
    parses into the lowercase enum).
  - Move the key under `[openshell.drivers.kubernetes]` and
    `[openshell.drivers.podman]` in every example, the RFC, the
    architecture doc, and the Helm-rendered gateway ConfigMap.

The supervisor pull policy follows the same shape: it is K8s-only and
moves into `[openshell.drivers.kubernetes]` alongside `image_pull_policy`
in the Helm template.
Resolves the P1 and P2 issues raised in PR #1317:

- Helm gateway ConfigMap moves `grpc_endpoint` under
  `[openshell.drivers.kubernetes]` so the default install no longer fails
  the gateway's `deny_unknown_fields` schema check.
- `kubernetes_config_from_file` and `podman_config_from_file` only let
  the gateway-wide CLI/env `grpc_endpoint` overwrite the driver-table
  value when it was actually supplied, preserving file-only configs.
- Kubernetes driver default `image_pull_policy` is now empty (was Podman
  vocabulary "missing"), so default deployments let the Kubernetes API
  apply its own policy instead of being rejected.
- New `disable_tls` gateway field plumbs `.Values.server.disableTls`
  through the TOML ConfigMap instead of relying on env vars dropped from
  the StatefulSet.
- StatefulSet pod template now carries a `checksum/gateway-config`
  annotation so `helm upgrade` rolls pods when the ConfigMap changes.
- Auxiliary listener resolution preserves the full `SocketAddr` from
  `health_bind_address` / `metrics_bind_address`, so a loopback-pinned
  health port is not silently relocated onto the public bind address.
- `ssh_session_ttl_secs` from the file is now applied to `Config` (it
  was previously accepted by the loader but never read).

New regression coverage: cli-level merge tests for the new fields plus
helm-unittest assertions for the ConfigMap shape, checksum annotation,
and `disable_tls` rendering.
Replaces the per-driver example files under examples/gateway/ with a
single published reference page at docs/reference/gateway-config.mdx
covering source precedence, layout, the full example, and the four
per-driver examples (Kubernetes, Docker, Podman, microVM). Drew flagged
during PR #1317 review that the examples belong with the user-facing
docs rather than in a sibling examples/ directory.

The cross-references in architecture/gateway.md and RFC 0003 are updated
to point at the new docs page; the round-trip test in config_file.rs is
removed (schema coverage stays on the inline parses_full_example test
and per-field merge tests — doc-snippet drift belongs in a separate
docs-lint, not in a cross-tree Rust unit test).
The constant is Kubernetes-specific (used only by KubernetesComputeConfig's
Default impl) and does not belong in openshell-core. Relocate it to the
driver crate that owns the K8s vocabulary; openshell-core retains only
truly cross-cutting defaults.
DEFAULT_NETWORK_NAME is Podman vocabulary, consumed only by the Podman
driver. Also drops the unused DEFAULT_IMAGE_PULL_POLICY constant.
Sweeps the trailing mentions left after the rebase: the gateway
config-file module doc, the Helm gateway-config ConfigMap header,
the gateway-config.mdx env-only note, and the RPM systemd unit
comment for init-gateway-env.sh.
The previous comment implied the callback endpoint was Kubernetes-only,
but the value is propagated to every compute driver (Kubernetes, Docker,
Podman, VM) and must be reachable from wherever the sandbox runs.
Signed-off-by: Drew Newberry <anewberry@nvidia.com>
…esses

The gateway CLI flags moved into TOML config tables in 560550d (#1394),
which made every existing e2e/with-{docker,podman}-gateway.sh invocation
fail with "unexpected argument '--sandbox-namespace'" (and a long tail
of similar driver-specific options) before the gateway could even bind.

Replace the obsolete CLI flags with a synthesized
`[openshell.drivers.<driver>]` table written to `${STATE_DIR}/gateway.toml`
and passed via the new `--config` flag. Only the gateway-wide flags
that survived 560550d (bind-address, port, drivers, db-url, tls-*,
disable-tls, log-level, health-port) stay on the command line.

Both scripts get a small `toml_string` helper to properly TOML-quote
the values (the previous `%q` printf format produced bash-escape, not
TOML-escape). The Docker harness also corrects two field names that
diverged from the driver schema: `docker_network_name` →
`network_name`, and the supervisor binary/image plumbing now reads
through to `supervisor_bin` / `supervisor_image` in the same table.

The Podman harness drops `--ssh-gateway-port` (deleted in 560550d —
gRPC + SSH are multiplexed on the same port now) and substitutes
`network_name` + `gateway_port` for the obsolete `--sandbox-namespace`
(which the Podman driver never had as a typed field).
CLI's resolve_ssh_gateway treated 0.0.0.0 as a loopback "keep as-is"
when the cluster URL was also loopback, so the SSH proxy connected to
0.0.0.0:port. The unspecified address is never a valid connect target
and is not present in any TLS cert SAN, which produced BadCertificate
TLS handshake failures during `openshell sandbox create -- ...` in
docker/podman e2e (e.g. bypass_detection).

Resolution: when the server returns 0.0.0.0 or :: as the gateway host
and both endpoints are loopback, fall back to the cluster URL's host
(which the CLI is already using to reach the gateway, so it must
resolve and match the cert).
Podman 5.x with the applehv/libkrun provider no longer creates the legacy
~/.local/share/containers/podman/machine/podman.sock symlink, and
`podman system service` is a Linux-only subcommand — the macOS client
delegates the API service to the VM. Both assumptions in the harness
were stale, so the script tried to start a temporary service that podman
rejected with "unknown flag: --time".

- Discover the macOS socket via `podman machine inspect` instead of the
  hardcoded path.
- On Darwin, fail fast with a "start podman machine" message rather than
  attempting the Linux-only `podman system service` fallback.
- Write socket_path into [openshell.drivers.podman] so the in-process
  driver picks up the discovered socket; the driver reads TOML only
  after the config refactor (560550d), so OPENSHELL_PODMAN_SOCKET alone
  was no longer enough.
clippy 1.95.0 rejects assigning the result of `Clone::clone()` to an
existing variable under `-D warnings` (`assigning_clones`). Switch to
`clone_from(&...)` to satisfy the lint and avoid the redundant
allocation.
@TaylorMutch TaylorMutch force-pushed the tmutch/gateway-config-impl branch from 381784e to 9bc2e11 Compare May 15, 2026 19:17
@TaylorMutch TaylorMutch merged commit b61a98d into main May 15, 2026
38 of 40 checks passed
@TaylorMutch TaylorMutch deleted the tmutch/gateway-config-impl branch May 15, 2026 19:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test:e2e Requires end-to-end coverage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants