✨ feat: Add mTLS runtime identity verification for AgentCards#284
✨ feat: Add mTLS runtime identity verification for AgentCards#284kevincogan wants to merge 21 commits into
Conversation
…d signing Introduce a reusable SignCard function that produces JWS x5c signatures, supporting ECDSA (P-256/P-384/P-521) and RSA key types with RFC 7518 fixed-width R||S encoding. Add SpiffeSigner that wraps a SPIFFE Workload API X509Source to sign AgentCards with the workload's SVID. Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com> Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>
Add Writer that creates or updates signed AgentCard ConfigMaps using an uncached API reader to avoid cache-scoping issues across namespaces. Refactor the init-container signer to use the shared signature.SignCard function and the canonical agentcard.ConfigMapName constant, eliminating duplicate logic. Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com> Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>
…entity Wire the SpiffeSigner and CardWriter into the AgentCard reconciler behind the --enable-operator-signing flag (off by default). When enabled, the operator signs cards with its own SVID, writes signed ConfigMaps, and heals tampered signatures on reconcile. Add Helm values, SPIRE CSI volume mount, ClusterSPIFFEID template, and expanded ConfigMap RBAC. Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com> Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>
|
Thanks for the PR @kevincogan. The signing infrastructure looks solid (shared What this PR does well
What this PR does not address (and needs to for Option 4 from #292)1. The fetcher is unchanged. 2. The HTTP fetch is plaintext and unauthenticated. When the fetcher does fall back to HTTP, it uses 3. The agent's SPIFFE ID is not in the signed payload. The JWS x5c chain contains the operator's certificate. The agent's workload identity is lost from the signature. As discussed in #292, the operator would need to verify the agent's identity during fetch (via mTLS) and embed the verified agent SPIFFE ID as a claim in the signed payload (notarial model). SummaryThis PR provides the signing machinery (which is good and needed), but it's not yet Option 4 from #292. To get there, it needs:
Without (1), the operator signs the empty skeleton. Without (2) and (3), you get operator-attested cards without agent identity binding or tamper protection on the fetch. |
Skip NetworkPolicy creation when identity binding fails and strict mode is disabled, allowing agents with unverified bindings to retain network access. Add test coverage for strict vs non-strict behavior. Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>
Run make manifests to reflect the new Strict field in the AgentCard CRD, updated controller RBAC markers, and webhook timeout removal. Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>
Grant update permission on deployments/finalizers and statefulsets/finalizers so the operator can set blockOwnerDeletion on owned workloads. Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>
Point the enforcement demo prerequisite back to agentcard-spire-signing (init-container path). Add a new agentcard-operator-signing demo with deployment manifests, ClusterSPIFFEID, AgentCard CR, and teardown script for the opt-in operator signing path. Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>
…ditional fetcher Add IdentityBinding.Strict field (default false) for per-card binding enforcement granularity. When strict is true, a trust-domain mismatch removes the signature-verified label; when false, the mismatch is recorded in status but the label is preserved. Gate the AgentCard fetcher on --enable-operator-signing: when disabled (default), the reconciler uses ConfigMapFetcher to read init-container- signed cards; when enabled, it uses DefaultFetcher to fetch cards via HTTP for operator-side signing. Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>
|
Thanks for the update @kevincogan. The signing infrastructure is solid. Clean package structure, proper A few observations on how the current PR relates to the Option 4 flow proposed in #292: The fetcher is unchanged (assessed by Claude)
The operator signs whatever the fetcher gives it. Since the fetcher prefers the signed ConfigMap (which contains the empty skeleton from the init-container), the operator re-signs empty content. The real card from the live endpoint is still only used as a fallback. For Option 4 as proposed in #292 ("fetch from the live agent endpoint via HTTP and remove the init-container signer and all ConfigMaps from the signing pipeline"), I think the follow-up needs to:
Fetch security and agent identityAs discussed in my comment on #292, the HTTP fetch is plaintext and unauthenticated, and the agent's SPIFFE ID is not embedded in the signed payload. These are follow-up items, but worth tracking explicitly since they affect whether we can market this as zero-trust agent verification. SummaryI see this PR as the correct signing infrastructure. It provides the building blocks. The full Option 4 flow (HTTP fetch from live endpoint, mTLS, agent identity embedding, init-container removal) would be a next step, but is it clear that we can implement those ? But before continuing on this PR we should be crystal clear how we solve:
This should be tackled first before we can think about adding this to the code base. |
Operator-signing wrote signed ConfigMaps on behalf of agents, adding complexity without meaningful trust improvement. Replace with Phase 1 mTLS verified fetch which authenticates agents directly via SPIFFE. Removes CardSigner, CardWriter, SpiffeSigner, and the operator-signing demo. Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>
Adds Phase 1 identity verification: the operator authenticates agents via mutual TLS using SPIFFE workload identity before trusting their AgentCard data. - SpiffeFetcher performs mTLS-authenticated fetch with cached HTTP client - evaluateTrust unifies mTLS and JWS signature verification paths - Verified condition is the single trust signal for NetworkPolicy gating - Identity binding evaluates attested SPIFFE IDs against trust domain - AttestedAgent printer column shows verified SPIFFE ID (wide output) - Proper gRPC connection lifecycle for X509Source Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>
- Add verifiedFetch Helm values with SPIFFE CSI volume and operator flags - Create ClusterSPIFFEID template for operator workload identity - Inject POD_NAMESPACE via Downward API for NetworkPolicy peer selection - Use kubernetes.io/metadata.name label instead of custom labeling - Replace hardcoded namespace with ClusterDefaultsNamespace constant Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>
Lightweight test server that serves AgentCard data over mTLS using SPIFFE workload identity. Used for E2E testing of verified fetch on clusters with SPIRE deployed. Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>
Ensures the injected Envoy sidecar exposes the agent-tls named port so the operator can discover mTLS endpoints by port name. Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>
…card-signing Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com> Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # charts/kagenti-operator/templates/manager/manager.yaml
- Check write error returns in HTTP handlers - Suppress errcheck on deferred Close and Shutdown calls - Remove redundant fmt.Sprintf for string argument Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>
…card-signing Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com> Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # charts/kagenti-operator/templates/manager/manager.yaml
…card-signing Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com> Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # kagenti-operator/internal/controller/agentcard_controller.go
|
@mrsabath @maia-iyer @rhuss I'd appreciate your review on this PR when you get a chance. Quick context: this PR started as the operator-side signing work we collaborated on. After @rhuss review and the Slack discussion on the AgentCard trust model, the direction shifted toward mTLS verified fetch rather than operator signing. The key argument was that within a single cluster, mTLS via SPIFFE already proves agent identity at fetch time without the circular trust concern. What changed:
The underlying principle is the same: SPIRE-attested identity as the foundation for trust, just using mTLS rather than a signature to verify it. References:
The only change to the Sigstore plan is that Phase 2A (operator signing) is replaced by mTLS verified fetch. The supply-chain provenance phases (image verification, blob verification, attestor composition) remain the same. Would like to confirm this still aligns from your perspective. Thanks! |
|
@kevincogan fyi: you have tagged the wrong person. |
c5fa688 to
1502f39
Compare
|
I am nervous about approving security PRs without understanding the pieces. This PR has no link to an issue or discussion, so I will ask here. I know of three ways to get AgentCards without this PR:
Does mTLS identity verification extend any of these? Will I be doing |
pdettori
left a comment
There was a problem hiding this comment.
Solid feature-gated implementation of mTLS runtime identity verification. The dual-mode coexistence (mTLS + JWS) is well-designed with clear condition semantics — evaluateTrust cleanly unifies both trust paths under a single Verified condition, and the NetworkPolicy controller correctly keys on it.
Two inline suggestions (non-blocking) and one nit below.
Areas reviewed: Go, Helm/K8s, YAML, Dockerfile, Shell, Docs
Commits: 19 commits (15 feature + 4 merge), all signed-off ✓
CI status: 15/15 checks passing ✓
Assisted-By: Claude Code
mrsabath
left a comment
There was a problem hiding this comment.
Solid implementation of mTLS runtime identity verification. The dual-mode coexistence design is sound, feature gating is correct, and the reconciler refactoring into fetchCardData → evaluateTrust improves testability. CI is fully green and E2E coverage on ROSA is thorough.
Three bugs worth addressing (non-blocking given the feature gate is off by default):
Critical (low exploitability due to feature gate):
-
Silent error in trust domain parsing (
fetcher.go:260):td, _ := spiffeid.TrustDomainFromString(trustDomain)— iftrustDomainis empty (the Helm default!),tdis a zero-valueTrustDomainandAuthorizeMemberOf(td)behavior is undefined in go-spiffe. Should fail loudly. -
Wrong error returned (
fetcher.go:297):return nil, errreturns the first endpoint's error instead oflegacyErr. Masks the actual failure when both endpoints fail. -
SPIFFE ID from
PeerCertificates[0]vsVerifiedChains(fetcher.go:341):PeerCertificatesis populated before chain validation. Defense-in-depth says preferVerifiedChains[0][0]. Mitigated by go-spiffe'sVerifyPeerCertificatecallback.
Major (suggestions):
-
Silent HTTP fallback: When
--enable-verified-fetchis set but agent lacksagent-tlsport, the operator logs at Info level and falls back to unverified HTTP. Consider emitting a K8s Event on the AgentCard or aDegradedFetchcondition. -
Binding precedence undocumented: When both mTLS and JWS succeed (possibly with different SPIFFE IDs), mTLS takes unconditional precedence in
evaluateTrust. Worth a comment.
Areas reviewed: Go (fetcher, reconciler, binding, NetworkPolicy controller, test-tls-agent), Helm chart, CRD, RBAC, Dockerfile, demos
Commits: 19 (15 feature + 4 merge), all signed-off
CI: 15/15 passing
- Return error from NewSpiffeFetcher on invalid trust domain - Fix wrong error variable returned in legacy endpoint fallback - Prefer VerifiedChains over PeerCertificates for SPIFFE ID extraction - Emit Warning Event when falling back to unverified HTTP fetch - Document mTLS precedence over JWS in evaluateTrust - Fix unused parameter lint warning in test-tls-agent Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>
@mrsabath addressed both suggestions:
|
Good questions @esnible. This PR changes how the operator fetches and trusts AgentCard data internally. The mTLS happens between the operator and the agent during reconciliation, not between end users and agents. To answer each:
The new attestation field is This resolves #292 and implements Phase 1 of the AgentCard Trust Model RFC. |
r3v5
left a comment
There was a problem hiding this comment.
Great work, @kevincogan ! Left small comments, not a blocker
- Add nil check on r.Recorder.Event() in FallbackToHTTP path - Change cleanupVerifiedFetchFields to return error to caller - Caller logs and continues without failing the reconcile loop Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>
Summary
Adds runtime identity verification for AgentCards using mTLS via SPIFFE. The operator can now cryptographically verify the identity of agents at fetch time by performing mutual TLS authentication against the agent's SPIFFE SVID. Also removes the deprecated operator-side signing code in favor of the existing init-container signer combined with runtime verification.
What's included
Runtime Identity Verification (mTLS Verified Fetch)
Authenticated fetcher (
internal/agentcard/fetcher.go): mTLS-capable HTTP client using go-spiffe X509Source to fetch agent cards with mutual TLS authentication. Extracts the agent's SPIFFE ID from the TLS connection state. IncludesdoHTTPFetchhelper to eliminate HTTP boilerplate duplication.Unified
Verifiedcondition: Single trust decision condition that is True if either mTLS verification or JWS signature verification succeeds.Identity binding for mTLS:
computeBindingevaluates trust-domain membership for SPIFFE IDs from both mTLS attestation and JWS x5c certificates.Readycondition: Composite signal indicating agent usability. True when Synced and Verified are both satisfied. Correctly set to False on error paths such as fetch failure or verification failure.NetworkPolicy enforcement: Simplified to rely solely on the unified
Verifiedcondition for permit/deny decisions.Reconciler refactoring:
Reconcilesplit intofetchCardDataandevaluateTrusthelpers for testability. Status updates consolidated into a single atomicStatus().Update()call. Binding logic deduplicated viaapplyBindingToStatushelper.Label propagation:
propagateVerifiedLabelsyncs theagent.kagenti.dev/signature-verifiedlabel to workloads based on verification status.Helm chart changes: New
verifiedFetchvalues section (disabled by default), SPIRE CSI volume mount, ClusterSPIFFEID template, TLS port configuration.Test TLS agent (
cmd/test-tls-agent/): Lightweight Go server that serves an agent card over mTLS using SPIFFE SVIDs, used for E2E testing of the verified fetch path.Cleanup
Removed
internal/signature/spiffe_signer.goandinternal/agentcard/writer.go(operator-side signing, superseded by init-container signer combined with runtime verification).Removed
demos/agentcard-operator-signing/directory (deprecated demo).Corrected misleading naming: NetworkPolicy labels now use "identity-verification", startup logs reference "identity verification" instead of "signature verification".
Design decisions
Feature-gated:
--enable-verified-fetch=false(off by default). No behavioral change unless explicitly enabled.Coexistence: Both trust mechanisms (init-container JWS signer and mTLS verified fetch) operate independently in the same namespace. Each produces Verified=True via its own path.
Graceful fallback: If the
agent-tlsport is absent on a service, the operator falls back to plain HTTP fetch. There is no hard dependency on SPIRE.Condition semantics documented: Comment block above
updateAgentCardStatusexplains the meaning of each condition (Verified, SignatureVerified, Synced, Ready, Bound).What was removed
Related issue(s)
Resolves: Signed AgentCard always has empty skills/capabilities in automated deploy path #292
Implements runtime identity verification from the AgentCard Trust Model RFC.
Testing
Unit tests
E2E validation
Tested on ROSA 4.19 with 18/18 tests passing across three groups:
Init-container signer: Deploy, signature verification, binding, NetworkPolicy, label propagation, invalid signature rejection
mTLS verified fetch: Deploy, mTLS verification, SPIFFE ID capture, binding, NetworkPolicy, TLS port removal/restoration, wrong trust domain rejection
Interaction: Both mechanisms coexisting in same namespace, NetworkPolicy differentiation, agent unavailability detection and recovery