feat(server): declare gRPC auth (mode + scope + role) at the handler, enforce at the router #1586

@mrunalp

Issue draft: Per-Handler gRPC Auth Annotations


Problem Statement

The gateway's gRPC auth metadata is currently spread across four hand-maintained
constants in three files:

| Constant | File | Purpose |
| --- | --- | --- |
| SCOPED_METHODS | auth/authz.rs | Bearer scope per method |
| ADMIN_METHODS | auth/authz.rs | Admin-vs-user role mapping |
| UNAUTHENTICATED_METHODS | auth/oidc.rs | Methods that bypass auth entirely |
| ALLOWED_SANDBOX_METHODS | auth/sandbox_methods.rs | Methods callable by sandbox principals |

This has three concrete consequences:

  1. Asymmetric router enforcement. The router rejects Principal::Sandbox
    when the method isn't in ALLOWED_SANDBOX_METHODS, but it does not do
    the inverse for Principal::User. A Bearer user with openshell:all (or
    even just the right scope) is therefore not stopped at the router from
    reaching a handler that is intended for sandbox supervisors. Several
    handlers (GetSandboxProviderEnvironment, ReportPolicyStatus,
    SubmitPolicyAnalysis, PushSandboxLogs) rely on ensure_sandbox_scope,
    which intentionally lets users through. The streaming RPCs
    (ConnectSupervisor, RelayStream) don't gate at all before opening
    their bidi channel. End-to-end against an in-cluster Keycloak with an
    openshell-admin + openshell:all token (full A/B captured below):

    | Method | main today | Expected |
    | --- | --- | --- |
    | GetSandboxProviderEnvironment | NotFound: sandbox not found (handler reached, store queried) | PermissionDenied at router |
    | ReportPolicyStatus | InvalidArgument: sandbox_id is required (handler validated body) | PermissionDenied |
    | SubmitPolicyAnalysis | InvalidArgument: name is required | PermissionDenied |
    | PushSandboxLogs | {} (stream accepted, log push succeeded) | PermissionDenied |
    | ConnectSupervisor | InvalidArgument: expected SupervisorHello (bidi stream opened) | PermissionDenied |
    | RelayStream | InvalidArgument: first RelayFrame must be init… (bidi stream opened) | PermissionDenied |

    IssueSandboxToken, RefreshSandboxToken, and GetInferenceBundle are
    already safe today because their handlers call
    ensure_sandbox_principal_scope, but that safety is per-handler discipline,
    not a structural guarantee.

  2. Silent drift on every new RPC. A new RPC that lands in the proto but
    doesn't get added to SCOPED_METHODS falls back to requiring
    openshell:all at runtime — no compile-time error, no test failure unless
    someone wrote specific coverage for it. A method missing from
    ALLOWED_SANDBOX_METHODS silently denies sandbox supervisors. A typo in
    one of the hand-written gRPC path strings (e.g.
    "/openshell.v1.OpenShell/CreateProvider") is undetectable until
    production: the proto is the source of truth, but nothing connects the
    two.

  3. Auth posture is not visible at the handler. A reviewer reading
    handle_create_provider has no local signal that the method requires
    admin role and provider:write scope — they have to cross-reference
    three files.

Checklist

  • I've reviewed existing issues and the architecture docs.
  • This is a design proposal, not a "please build this" request.

Proposed Design

Declare auth metadata at the handler definition site and enforce it at the
router. Replace the four constants with generated tables backed by per-method
annotations; enforce the missing user-side AuthMode check; gate everything
with a compile-time and a descriptor-set-driven test.

Auth model

| Auth mode | Principal::Sandbox? | Bearer? | Scope applies? | Role applies? | Examples |
| --- | --- | --- | --- | --- | --- |
| unauthenticated | n/a | n/a | no | no | Health, gRPC reflection (handled by prefix, not annotation) |
| sandbox | yes | no | no | no | ReportPolicyStatus, PushSandboxLogs, GetInferenceBundle |
| bearer | no | yes | yes | yes | ListSandboxes, CreateProvider, SetClusterInference |
| dual | yes | yes | yes (Bearer path only) | yes (Bearer path only) | GetSandboxConfig, UpdateConfig, GetDraftPolicy |

sandbox auth uses the per-sandbox gateway-minted JWT introduced in #1404;
the old shared sandbox secret no longer exists. A handler annotated sandbox
authenticates as a specific Principal::Sandbox; handlers still perform a
same-sandbox check on the request body where applicable.

Roles are coarse (admin or user). Scopes are fine-grained (sandbox:read,
provider:write, etc.).

Per-handler annotation

#[rpc_authz(service = "openshell.v1.OpenShell")]
#[tonic::async_trait]
impl OpenShell for OpenShellService {
    #[rpc_auth(auth = "unauthenticated")]
    async fn health(...) -> Result<_, Status> { ... }

    #[rpc_auth(auth = "bearer", scope = "sandbox:read", role = "user")]
    async fn list_sandboxes(...) -> Result<_, Status> { ... }

    #[rpc_auth(auth = "bearer", scope = "provider:write", role = "admin")]
    async fn create_provider(...) -> Result<_, Status> { ... }

    #[rpc_auth(auth = "dual", scope = "config:read", role = "user")]
    async fn get_sandbox_config(...) -> Result<_, Status> { ... }

    #[rpc_auth(auth = "sandbox")]
    async fn report_policy_status(...) -> Result<_, Status> { ... }
}

What the macros generate

#[rpc_authz] is a new impl-level attribute macro (first proc macro in the
workspace, in a small openshell-server-macros crate). It inspects each
method's #[rpc_auth] attribute and emits, adjacent to the impl block:

pub const OPEN_SHELL_AUTH_METADATA: &[MethodAuth] = &[
    MethodAuth {
        path: "/openshell.v1.OpenShell/Health",
        mode: AuthMode::Unauthenticated,
        scope: None,
        role: None,
    },
    MethodAuth {
        path: "/openshell.v1.OpenShell/ListSandboxes",
        mode: AuthMode::Bearer,
        scope: Some("sandbox:read"),
        role: Some(Role::User),
    },
    // ...
];

The const name is derived from the trait identifier in the impl
(impl OpenShell for ... → OPEN_SHELL_AUTH_METADATA). Paths are derived
from the service = "..." argument and the snake_case method name converted
to PascalCase, so they cannot drift from the proto.
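
The name derivation described above can be sketched with plain string functions. This is only an illustration of the convention, not the macro's actual code: the helper names `snake_to_pascal`, `grpc_path`, and `const_name` are hypothetical, and the sketch assumes ASCII identifiers with no acronyms in the trait name.

```rust
/// Convert a snake_case Rust method name to the PascalCase RPC name
/// (e.g. `list_sandboxes` -> `ListSandboxes`). Hypothetical helper.
fn snake_to_pascal(method: &str) -> String {
    method
        .split('_')
        .filter(|part| !part.is_empty())
        .map(|part| {
            let mut chars = part.chars();
            match chars.next() {
                // Uppercase the first character, keep the rest unchanged.
                Some(first) => first.to_ascii_uppercase().to_string() + chars.as_str(),
                None => String::new(),
            }
        })
        .collect()
}

/// Build the gRPC path from the `service = "..."` argument and a method name.
fn grpc_path(service: &str, method: &str) -> String {
    format!("/{}/{}", service, snake_to_pascal(method))
}

/// Derive the const name from the trait identifier
/// (`OpenShell` -> `OPEN_SHELL_AUTH_METADATA`).
/// Assumption: every uppercase letter after the first starts a new word.
fn const_name(trait_ident: &str) -> String {
    let mut out = String::new();
    for (i, c) in trait_ident.chars().enumerate() {
        if c.is_ascii_uppercase() && i > 0 {
            out.push('_');
        }
        out.push(c.to_ascii_uppercase());
    }
    out + "_AUTH_METADATA"
}
```

Under these assumptions, `grpc_path("openshell.v1.OpenShell", "create_provider")` yields the same string as the hand-written path quoted in the problem statement, which is the property that removes the typo risk.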

The macro strips #[rpc_auth(...)] attributes from the methods before
re-emitting the impl block, so #[tonic::async_trait] sees a normal impl.

MethodAuth, AuthMode, and Role live in openshell-server
(auth/method_authz.rs). The macro emits crate::auth::method_authz::*
paths; that only needs to work from inside openshell-server.

Compile-time enforcement

#[rpc_authz] fails compilation when:

  • An RPC method is missing #[rpc_auth].
  • An auth = "unauthenticated" or auth = "sandbox" method is annotated
    with scope or role.
  • An auth = "bearer" or auth = "dual" method is missing scope or role.
  • Two methods on the same service produce the same path.
  • The same key (auth, scope, or role) appears twice in one
    #[rpc_auth(...)].
  • An invalid auth mode or role string is supplied.
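
The per-attribute rules in this list can be written as an ordinary function over already-split key/value pairs. This is a sketch of the rule set only: `validate_rpc_auth` is a hypothetical name, a real proc macro would work on token streams and emit compile_error! instead of returning strings, and rejecting unknown keys is an assumption not stated above. The missing-annotation and duplicate-path checks are per-impl rather than per-attribute, so they are not covered here.

```rust
/// Check one `#[rpc_auth(...)]` attribute, given as (key, value) pairs.
/// Hypothetical helper mirroring the rules listed above.
fn validate_rpc_auth(pairs: &[(&str, &str)]) -> Result<(), String> {
    let mut auth = None;
    let mut scope = None;
    let mut role = None;

    for &(key, value) in pairs {
        let slot = match key {
            "auth" => &mut auth,
            "scope" => &mut scope,
            "role" => &mut role,
            // Assumption: keys other than auth/scope/role are rejected.
            other => return Err(format!("unknown key `{other}`")),
        };
        // `replace` returns the previous value, so `Some` means a duplicate.
        if slot.replace(value).is_some() {
            return Err(format!("duplicate key `{key}`"));
        }
    }

    let auth = auth.ok_or("missing `auth`")?;

    if let Some(r) = role {
        if r != "admin" && r != "user" {
            return Err(format!("invalid role `{r}`"));
        }
    }

    match auth {
        "unauthenticated" | "sandbox" => {
            if scope.is_some() || role.is_some() {
                return Err(format!("`{auth}` methods must not set scope or role"));
            }
        }
        "bearer" | "dual" => {
            if scope.is_none() || role.is_none() {
                return Err(format!("`{auth}` methods require both scope and role"));
            }
        }
        other => return Err(format!("invalid auth mode `{other}`")),
    }
    Ok(())
}
```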

Aggregation

The macro emits one pub const per service. Aggregation is a manual one-liner
in a new module auth/method_authz.rs:

const SERVICES: &[&[MethodAuth]] = &[
    crate::grpc::OPEN_SHELL_AUTH_METADATA,
    crate::inference::INFERENCE_AUTH_METADATA,
];

pub fn lookup(method: &str) -> Option<&'static MethodAuth> {
    SERVICES.iter().flat_map(|s| s.iter()).find(|m| m.path == method)
}

This is the single source of truth queried by authz.rs, oidc.rs, and
sandbox_methods.rs. No inventory crate, no linker tricks, no runtime
initialization.

Router enforcement (closes problem #1)

AuthGrpcRouter already checks is_sandbox_callable for Principal::Sandbox.
Add the mirror for Principal::User via a new
method_authz::is_user_callable(path):

Principal::User(ref user) => {
    if !method_authz::is_user_callable(&path) {
        return Ok(status_response(
            tonic::Status::permission_denied(
                "this method requires a sandbox principal")));
    }
    if let Some(policy) = authz_policy {
        if let Err(s) = policy.check(&user.identity, &path) { return ...; }
    }
}

is_user_callable returns true for Bearer / Dual (the only modes a
user principal should reach), false for Sandbox / Unauthenticated,
and true for unknown methods so AuthzPolicy::check still gets to apply
the openshell:all fallback (defense-in-depth, see below).
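
The decision table for is_user_callable is small enough to spell out. The sketch below is a simplification: it takes the already-looked-up mode instead of a path, and the local `AuthMode` definition is a stand-in for the type described earlier.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AuthMode {
    Unauthenticated,
    Sandbox,
    Bearer,
    Dual,
}

/// `mode` is the result of looking the path up in the generated tables;
/// `None` means the method is unknown to the tables.
fn is_user_callable(mode: Option<AuthMode>) -> bool {
    match mode {
        // The only modes a user principal should reach.
        Some(AuthMode::Bearer | AuthMode::Dual) => true,
        // Sandbox-only and unauthenticated methods are not user-callable.
        Some(AuthMode::Sandbox | AuthMode::Unauthenticated) => false,
        // Unknown methods pass this gate so that the later policy check
        // can still apply the openshell:all fallback.
        None => true,
    }
}
```

Returning true for unknown methods does not open them up: it only defers the decision to the scope check that runs next.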

What this replaces

| Today | After |
| --- | --- |
| SCOPED_METHODS in auth/authz.rs | method_authz::required_scope() reading from generated tables |
| ADMIN_METHODS in auth/authz.rs | method_authz::required_role() reading from generated tables |
| UNAUTHENTICATED_METHODS in auth/oidc.rs | method_authz::is_unauthenticated() reading from generated tables |
| ALLOWED_SANDBOX_METHODS in auth/sandbox_methods.rs | method_authz::is_sandbox_callable() reading from generated tables |
| UNAUTHENTICATED_PREFIXES in auth/oidc.rs | Stays: prefix matching for /grpc.reflection.* and /grpc.health.* is structural, not per-method |
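
As a rough sketch of how the four replacement predicates could sit on top of the lookup function: the type definitions, derives, and function signatures below are assumptions, and the table is a hand-written stand-in for macro output using entries from the annotation example. Unknown methods return None / false here; the openshell:all fallback is assumed to stay in the caller.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AuthMode { Unauthenticated, Sandbox, Bearer, Dual }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role { User, Admin }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct MethodAuth {
    pub path: &'static str,
    pub mode: AuthMode,
    pub scope: Option<&'static str>,
    pub role: Option<Role>,
}

// Stand-in for a macro-generated table.
pub const OPEN_SHELL_AUTH_METADATA: &[MethodAuth] = &[
    MethodAuth { path: "/openshell.v1.OpenShell/Health",
                 mode: AuthMode::Unauthenticated, scope: None, role: None },
    MethodAuth { path: "/openshell.v1.OpenShell/CreateProvider",
                 mode: AuthMode::Bearer, scope: Some("provider:write"), role: Some(Role::Admin) },
    MethodAuth { path: "/openshell.v1.OpenShell/GetSandboxConfig",
                 mode: AuthMode::Dual, scope: Some("config:read"), role: Some(Role::User) },
    MethodAuth { path: "/openshell.v1.OpenShell/ReportPolicyStatus",
                 mode: AuthMode::Sandbox, scope: None, role: None },
];

const SERVICES: &[&[MethodAuth]] = &[OPEN_SHELL_AUTH_METADATA];

pub fn lookup(method: &str) -> Option<&'static MethodAuth> {
    SERVICES.iter().flat_map(|s| s.iter()).find(|m| m.path == method)
}

/// Scope required on the Bearer path, if the method declares one.
pub fn required_scope(method: &str) -> Option<&'static str> {
    lookup(method).and_then(|m| m.scope)
}

/// Role required on the Bearer path, if the method declares one.
pub fn required_role(method: &str) -> Option<Role> {
    lookup(method).and_then(|m| m.role)
}

pub fn is_unauthenticated(method: &str) -> bool {
    matches!(lookup(method), Some(m) if m.mode == AuthMode::Unauthenticated)
}

/// Sandbox principals may call `sandbox` and `dual` methods only.
pub fn is_sandbox_callable(method: &str) -> bool {
    matches!(lookup(method).map(|m| m.mode),
             Some(AuthMode::Sandbox | AuthMode::Dual))
}
```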

Exhaustiveness test (closes problem #2)

openshell-core/build.rs is extended to emit a binary FileDescriptorSet
via tonic_build::configure().file_descriptor_set_path(...). The descriptor
is exposed as openshell_core::FILE_DESCRIPTOR_SET (a &'static [u8]).

A test in openshell-server parses the descriptor, enumerates every
(service, method) pair, and verifies each one is covered exactly once by
the aggregated MethodAuth tables (or matches one of the prefix-bypassed
paths). Failure modes:

  • A new RPC is added to a proto but no annotation lands → test fails loudly.
  • A method appears with two different annotations across services → test fails.
  • An annotated path doesn't match any real proto RPC → test fails (catches
    stale annotations after a rename).

The exhaustiveness test is the primary safety net. The runtime keeps the
openshell:all fallback for unknown methods (preserved
unknown_method_requires_openshell_all test) as defense in depth: if a
future refactor introduces a code path the test can't see (e.g. a method
routed through the server without appearing in the gateway-facing
descriptor set, or the aggregation list drifts), an unknown method still
requires the all-scope rather than falling open. The two layers are
deliberate and complementary.
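
The core comparison in the exhaustiveness test can be sketched independently of descriptor parsing. The version below assumes the proto paths have already been extracted from the FileDescriptorSet into strings; `CoverageReport` and `check_coverage` are hypothetical names, and the three fields correspond to the three failure modes listed above.

```rust
use std::collections::BTreeSet;

#[derive(Debug, Default, PartialEq, Eq)]
struct CoverageReport {
    missing: Vec<String>,    // in the proto, but not annotated
    stale: Vec<String>,      // annotated, but not in the proto
    duplicated: Vec<String>, // annotated more than once
}

/// Compare every proto RPC path against the aggregated annotation paths.
/// Paths matching a bypass prefix do not need an annotation.
fn check_coverage<'a>(
    proto_paths: &[&'a str],
    annotated_paths: &[&'a str],
    bypass_prefixes: &[&str],
) -> CoverageReport {
    let mut seen = BTreeSet::new();
    let mut duplicated = BTreeSet::new();
    for &path in annotated_paths {
        // `insert` returns false when the path was already present.
        if !seen.insert(path) {
            duplicated.insert(path);
        }
    }

    let proto: BTreeSet<&str> = proto_paths.iter().copied().collect();

    let missing = proto
        .iter()
        .filter(|p| !seen.contains(*p))
        .filter(|p| !bypass_prefixes.iter().any(|pre| p.starts_with(*pre)))
        .map(|p| p.to_string())
        .collect();

    let stale = seen
        .iter()
        .filter(|p| !proto.contains(*p))
        .map(|p| p.to_string())
        .collect();

    CoverageReport {
        missing,
        stale,
        duplicated: duplicated.into_iter().map(String::from).collect(),
    }
}
```

A test built on this would assert that the report equals `CoverageReport::default()`, so any of the three drift cases fails with the offending paths in the message.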

Implementation outline

  1. Macro crate, types, annotations. Add crates/openshell-server-macros/
    with #[rpc_authz] + #[rpc_auth]. Add MethodAuth, AuthMode, Role,
    and the aggregator in auth/method_authz.rs. Annotate every RPC method
    on OpenShellService and InferenceService.
  2. Wire lookups. Replace SCOPED_METHODS, ADMIN_METHODS,
    UNAUTHENTICATED_METHODS, ALLOWED_SANDBOX_METHODS with calls through
    the aggregator. Existing unit tests in authz.rs, oidc.rs,
    sandbox_methods.rs keep exercising the public predicates and continue
    to pass.
  3. Router enforcement + exhaustiveness. Add is_user_callable and the
    Principal::User check in AuthGrpcRouter. Emit the descriptor set
    from build.rs. Add the exhaustiveness test plus a router test that
    proves openshell-admin + openshell:all is rejected on every
    sandbox-annotated method.

Backwards compatibility

The four old constants are removed in the same commit that introduces the
aggregator, so external state stays consistent. Behavior changes visible
to deployed gateways:

  • Six sandbox-only methods (listed in the table above) start rejecting
    Bearer users at the router. Before this change, those methods either
    succeeded on incomplete requests or surfaced NotFound / InvalidArgument
    from the handler. Nothing in the CLI or any user-facing flow calls them;
    only sandbox supervisors do. No CLI or e2e regression.
  • A handful of provider-profile methods and ExecSandboxInteractive that
    previously fell back to openshell:all now have explicit scope/role.
    Pragmatically: openshell:all tokens still work; provider:read-only
    tokens gain access to ListProviderProfiles / GetProviderProfile.

Risks and constraints

  • First proc macro in the workspace. Adds ~1–2 s of build time for the
    macro crate. Mitigated by keeping the macro small and focused on auth
    metadata only.
  • Compiler diagnostics. Proc-macro errors are noisier than const-table
    errors; the macro emits compile_error! spans pointing at the offending
    method.
  • Method-name convention. Relies on tonic's snake_case → PascalCase
    convention. If a proto introduces a non-conventional method name later,
    the macro will need an explicit path override; the current proto surface
    doesn't require it.
  • #[tonic::async_trait] composition. The macro must apply before
    #[tonic::async_trait], parse the impl body, strip #[rpc_auth]
    attributes, and re-emit a clean impl so async_trait's expansion is
    unaffected. This is exercised by the full server test suite.

Alternatives Considered

  1. Const tables per module, no proc macro. Improves review locality but
    keeps the drift problems intact: paths stay hand-written, missing
    registrations still fall back silently, auth mode stays in a separate
    file. The proc macro is worth the build-time cost specifically to fix
    those issues.

  2. inventory / linker-tricks distributed registration. Avoids the
    manual aggregator one-liner but adds runtime startup work and platform
    fragility for marginal benefit. Rejected.

  3. Macro-free runtime registry built from the proto descriptor.
    Defer the policy table to a build-time scan of the descriptor with
    external YAML attached. Loses the "auth metadata at the call site"
    property that motivated this work in the first place. Rejected.

  4. Declarative macro_rules! instead of a proc macro. Can do most of
    the work but can't easily generate a canonical service-derived const
    name and has weaker diagnostics. Rejected as a worse trade-off.

  5. Just add the router check, skip the refactor. Closes problem #1 in
    ~20 lines. Doesn't address problems #2 and #3. Considered as a
    point-fix; the team chose the structural fix because the asymmetry is
    a symptom of the source-of-truth split, not of one missing line.

Agent Investigation

  • Searched current and recent issues/PRs for OIDC, RBAC, scope, role,
    SCOPED_METHODS, sandbox principal, and per-handler auth terminology.
  • Found PR #935 (closed; feat(auth): add OIDC/Keycloak authentication
    with RBAC and scope-based permissions), which introduced
    SCOPED_METHODS, ADMIN_METHODS, and UNAUTHENTICATED_METHODS in
    auth/authz.rs and auth/oidc.rs as flat constants, the hand-maintained
    shape this proposal replaces.
  • Found PR #1404 (merged; feat(auth): per-sandbox authentication to
    gateway), which added the per-sandbox gateway-minted JWT,
    Principal::Sandbox, ALLOWED_SANDBOX_METHODS in auth/sandbox_methods.rs,
    and the handler-level ensure_sandbox_scope /
    ensure_sandbox_principal_scope guards. The router was given an
    is_sandbox_callable check on Principal::Sandbox, but the inverse
    is_user_callable was not added; that asymmetry is the gap problem #1
    describes.
  • Found #1506 (open; feat(auth): add HA-compatible sandbox JWT refresh
    replay protection), which tracks HA-compatible sandbox JWT refresh as
    the other follow-up from the #1404 review; it's orthogonal to this work
    (refresh-state replication, not method-level policy).
  • Found #1470 (closed; Supervisor ConnectSupervisor and RelayStream RPCs
    rejected when OIDC is enabled without mTLS), which covers a related
    streaming-method case: the same RPCs that today accept a bearer user's
    bidi stream open. The fix proposed here closes that opening for any
    caller without a sandbox principal, regardless of mTLS.
  • Read crates/openshell-server/src/multiplex.rs to confirm the router
    evaluates Principal::Sandbox through is_sandbox_callable but routes
    Principal::User straight into AuthzPolicy::check, which only knows
    about role/scope — not auth mode.
  • Read crates/openshell-server/src/auth/guard.rs to confirm
    ensure_sandbox_scope lets Principal::User through unconditionally,
    which is what lets GetSandboxProviderEnvironment,
    ReportPolicyStatus, SubmitPolicyAnalysis, and PushSandboxLogs
    reach their handler bodies as a user.
  • Did not find an open umbrella issue covering per-handler auth metadata
    or the router-side Principal::User AuthMode check.
  • Verified the gap end-to-end against a live local-up-cluster + in-cluster
    Keycloak using scripts/test-keycloak-e2e.sh: with an
    openshell-admin + openshell:all token, 8 of 9 sandbox-only methods
    return non-PermissionDenied on main (handler reached; PushSandboxLogs
    fully succeeds with {}). The same probes return
    PermissionDenied: this method requires a sandbox principal on the
    proposed branch.
