✨ feat: Add mTLS runtime identity verification for AgentCards#284

Open
kevincogan wants to merge 21 commits into
kagenti:mainfrom
kevincogan:feat/operator-agentcard-signing

@kevincogan
Contributor

@kevincogan kevincogan commented Apr 14, 2026

Summary

Adds runtime identity verification for AgentCards using mTLS via SPIFFE. The operator can now cryptographically verify the identity of agents at fetch time by performing mutual TLS authentication against the agent's SPIFFE SVID. Also removes the deprecated operator-side signing code in favor of the existing init-container signer combined with runtime verification.

What's included

Runtime Identity Verification (mTLS Verified Fetch)

  • Authenticated fetcher (internal/agentcard/fetcher.go): mTLS-capable HTTP client using go-spiffe X509Source to fetch agent cards with mutual TLS authentication. Extracts the agent's SPIFFE ID from the TLS connection state. Includes doHTTPFetch helper to eliminate HTTP boilerplate duplication.

  • Unified Verified condition: Single trust decision condition that is True if either mTLS verification or JWS signature verification succeeds.

  • Identity binding for mTLS: computeBinding evaluates trust-domain membership for SPIFFE IDs from both mTLS attestation and JWS x5c certificates.

  • Ready condition: Composite signal indicating agent usability. True when Synced and Verified are both satisfied. Correctly set to False on error paths such as fetch failure or verification failure.

  • NetworkPolicy enforcement: Simplified to rely solely on the unified Verified condition for permit/deny decisions.

  • Reconciler refactoring: Reconcile split into fetchCardData and evaluateTrust helpers for testability. Status updates consolidated into a single atomic Status().Update() call. Binding logic deduplicated via applyBindingToStatus helper.

  • Label propagation: propagateVerifiedLabel syncs the agent.kagenti.dev/signature-verified label to workloads based on verification status.

  • Helm chart changes: New verifiedFetch values section (disabled by default), SPIRE CSI volume mount, ClusterSPIFFEID template, TLS port configuration.

  • Test TLS agent (cmd/test-tls-agent/): Lightweight Go server that serves an agent card over mTLS using SPIFFE SVIDs, used for E2E testing of the verified fetch path.
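The Verified and Ready semantics listed above can be sketched as a pure function. This is a minimal illustration, not the reconciler's code: the `TrustInputs` and `Conditions` types and the `Evaluate` function are hypothetical names introduced here.

```go
package main

import "fmt"

// TrustInputs is a hypothetical summary of what one reconcile observed.
type TrustInputs struct {
	Synced      bool // the card was fetched and stored successfully
	MTLSOK      bool // the mTLS verified fetch authenticated the agent
	SignatureOK bool // the JWS signature on the card verified
}

// Conditions holds the two derived signals described in the PR summary.
type Conditions struct {
	Verified bool
	Ready    bool
}

// Evaluate derives Verified (either trust path succeeded) and
// Ready (Synced and Verified both hold) from the raw observations.
func Evaluate(in TrustInputs) Conditions {
	verified := in.MTLSOK || in.SignatureOK
	return Conditions{Verified: verified, Ready: in.Synced && verified}
}

func main() {
	fmt.Printf("%+v\n", Evaluate(TrustInputs{Synced: true, MTLSOK: true}))
	fmt.Printf("%+v\n", Evaluate(TrustInputs{Synced: false, SignatureOK: true}))
	fmt.Printf("%+v\n", Evaluate(TrustInputs{Synced: true}))
}
```

Keeping the derivation free of side effects is one way to make the error paths (fetch failure, verification failure) easy to cover in table-driven tests.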

Cleanup

  • Removed internal/signature/spiffe_signer.go and internal/agentcard/writer.go (operator-side signing, superseded by init-container signer combined with runtime verification).

  • Removed demos/agentcard-operator-signing/ directory (deprecated demo).

  • Corrected misleading naming: NetworkPolicy labels now use "identity-verification", startup logs reference "identity verification" instead of "signature verification".

Design decisions

  • Feature-gated: --enable-verified-fetch=false (off by default). No behavioral change unless explicitly enabled.

  • Coexistence: Both trust mechanisms (init-container JWS signer and mTLS verified fetch) operate independently in the same namespace. Each produces Verified=True via its own path.

  • Graceful fallback: If the agent-tls port is absent on a service, the operator falls back to plain HTTP fetch. There is no hard dependency on SPIRE.

  • Condition semantics documented: Comment block above updateAgentCardStatus explains the meaning of each condition (Verified, SignatureVerified, Synced, Ready, Bound).
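The feature gate and graceful fallback described above might look roughly like the following. This is a sketch under stated assumptions: `ServicePort`, `FetchPlan`, and `PlanFetch` are hypothetical names, and the port numbers in `main` are made up. Only the `agent-tls` port name and the `/.well-known/agent-card.json` path come from this PR.

```go
package main

import "fmt"

// ServicePort is a pared-down stand-in for a Kubernetes Service port entry.
type ServicePort struct {
	Name string
	Port int
}

// FetchPlan describes how the card would be fetched.
type FetchPlan struct {
	URL      string
	Verified bool // true only when the fetch goes over mTLS
}

const tlsPortName = "agent-tls"

// PlanFetch picks the mTLS endpoint when the feature is enabled and the
// service exposes the agent-tls named port; otherwise it falls back to HTTP.
func PlanFetch(host string, ports []ServicePort, verifiedFetch bool, httpPort int) FetchPlan {
	const path = "/.well-known/agent-card.json"
	if verifiedFetch {
		for _, p := range ports {
			if p.Name == tlsPortName {
				return FetchPlan{URL: fmt.Sprintf("https://%s:%d%s", host, p.Port, path), Verified: true}
			}
		}
	}
	return FetchPlan{URL: fmt.Sprintf("http://%s:%d%s", host, httpPort, path), Verified: false}
}

func main() {
	ports := []ServicePort{{Name: "http", Port: 8000}, {Name: tlsPortName, Port: 8443}}
	fmt.Println(PlanFetch("agent.ns.svc.cluster.local", ports, true, 8000))
	fmt.Println(PlanFetch("agent.ns.svc.cluster.local", ports[:1], true, 8000))
	fmt.Println(PlanFetch("agent.ns.svc.cluster.local", ports, false, 8000))
}
```

Returning the `Verified` flag alongside the URL lets the caller record that a fallback fetch was unauthenticated rather than treating both paths as equivalent.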

What was removed

| Component | Reason |
| --- | --- |
| SpiffeSigner | Operator signing superseded by init-container signer combined with runtime verification |
| CardWriter | No longer needed without operator-side signing |
| operatorSigning Helm values | Replaced by verifiedFetch |
| demos/agentcard-operator-signing/ | Deprecated demo |

Related issue(s)

Resolves: Signed AgentCard always has empty skills/capabilities in automated deploy path #292

Implements runtime identity verification from the AgentCard Trust Model RFC.

Testing

Unit tests

cd kagenti-operator
go test ./internal/controller/... ./internal/agentcard/... ./internal/signature/... ./api/... -count=1 -short

E2E validation

Tested on ROSA 4.19 with 18/18 tests passing across three groups:

  • Init-container signer: Deploy, signature verification, binding, NetworkPolicy, label propagation, invalid signature rejection

  • mTLS verified fetch: Deploy, mTLS verification, SPIFFE ID capture, binding, NetworkPolicy, TLS port removal/restoration, wrong trust domain rejection

  • Interaction: Both mechanisms coexisting in same namespace, NetworkPolicy differentiation, agent unavailability detection and recovery

…d signing

Introduce a reusable SignCard function that produces JWS x5c signatures,
supporting ECDSA (P-256/P-384/P-521) and RSA key types with RFC 7518
fixed-width R||S encoding. Add SpiffeSigner that wraps a SPIFFE Workload
API X509Source to sign AgentCards with the workload's SVID.
Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>
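
For readers unfamiliar with the RFC 7518 fixed-width R||S encoding mentioned in this commit, here is a minimal standard-library sketch. It is not the PR's SignCard code: `EncodeRS`, `DecodeRS`, `curveFor`, and `SignRoundTrip` are hypothetical names, and the sketch hashes with SHA-256 for every curve to stay short, whereas JWS pairs ES384 and ES512 with SHA-384 and SHA-512.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/sha256"
	"errors"
	"fmt"
	"math/big"
)

// curveFor maps a key size in bits to its NIST curve; anything else falls back to P-256.
func curveFor(bits int) elliptic.Curve {
	switch bits {
	case 384:
		return elliptic.P384()
	case 521:
		return elliptic.P521()
	default:
		return elliptic.P256()
	}
}

// EncodeRS serializes an ECDSA signature as fixed-width big-endian R||S.
// Each half is padded to the curve's byte size, so the length never varies.
func EncodeRS(curve elliptic.Curve, r, s *big.Int) []byte {
	size := (curve.Params().BitSize + 7) / 8
	out := make([]byte, 2*size)
	r.FillBytes(out[:size])
	s.FillBytes(out[size:])
	return out
}

// DecodeRS splits a fixed-width R||S signature back into its two integers.
func DecodeRS(curve elliptic.Curve, sig []byte) (*big.Int, *big.Int, error) {
	size := (curve.Params().BitSize + 7) / 8
	if len(sig) != 2*size {
		return nil, nil, errors.New("signature has the wrong length for this curve")
	}
	return new(big.Int).SetBytes(sig[:size]), new(big.Int).SetBytes(sig[size:]), nil
}

// SignRoundTrip signs a fixed message, encodes and decodes the signature,
// and reports the encoded length and whether the decoded signature verifies.
func SignRoundTrip(bits int) (int, bool, error) {
	key, err := ecdsa.GenerateKey(curveFor(bits), rand.Reader)
	if err != nil {
		return 0, false, err
	}
	digest := sha256.Sum256([]byte("agent card payload"))
	r, s, err := ecdsa.Sign(rand.Reader, key, digest[:])
	if err != nil {
		return 0, false, err
	}
	sig := EncodeRS(key.Curve, r, s)
	r2, s2, err := DecodeRS(key.Curve, sig)
	if err != nil {
		return 0, false, err
	}
	return len(sig), ecdsa.Verify(&key.PublicKey, digest[:], r2, s2), nil
}

func main() {
	for _, bits := range []int{256, 384, 521} {
		n, ok, err := SignRoundTrip(bits)
		fmt.Println(bits, n, ok, err)
	}
}
```

The encoded signature is 64 bytes for P-256, 96 for P-384, and 132 for P-521 (66 bytes per half, since 521 bits round up to 66 bytes).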

Add Writer that creates or updates signed AgentCard ConfigMaps using an
uncached API reader to avoid cache-scoping issues across namespaces.
Refactor the init-container signer to use the shared signature.SignCard
function and the canonical agentcard.ConfigMapName constant, eliminating
duplicate logic.
Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>

…entity

Wire the SpiffeSigner and CardWriter into the AgentCard reconciler behind
the --enable-operator-signing flag (off by default). When enabled, the
operator signs cards with its own SVID, writes signed ConfigMaps, and
heals tampered signatures on reconcile. Add Helm values, SPIRE CSI
volume mount, ClusterSPIFFEID template, and expanded ConfigMap RBAC.
Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>

@rhuss

rhuss commented Apr 17, 2026

Thanks for the PR @kevincogan. The signing infrastructure looks solid (shared SignCard, SpiffeSigner with auto-rotation, feature-gated). A few observations:

What this PR does well

  • Clean extraction of signing logic into internal/signature/ (reusable by both init-container and operator)
  • Proper SVID lifecycle management via X509Source (auto-rotation, no manual key zeroing)
  • Feature-gated behind --enable-operator-signing (safe rollout)
  • Eliminates pod restarts for re-signing (operator re-signs each reconcile)

What this PR does not address (and needs to for Option 4 from #292)

1. The fetcher is unchanged. ConfigMapFetcher still reads <agent>-card-signed ConfigMap first and falls back to HTTP. For Option 4 to work as described in #292 (operator fetches real card from live endpoint, signs it), the default fetcher needs to be changed to DefaultFetcher (HTTP-first) when operator signing is enabled. Otherwise the operator signs whatever is in the ConfigMap, which is still the empty skeleton.

2. The HTTP fetch is plaintext and unauthenticated. When the fetcher does fall back to HTTP, it uses http://<agent>.<ns>.svc.cluster.local without mTLS. I've detailed the attack vectors this opens (MITM, service hijacking, no agent authentication) in my comment on #292. For a platform marketing zero-trust agent security, the operator should validate the agent's SVID via mTLS during the fetch.

3. The agent's SPIFFE ID is not in the signed payload. The JWS x5c chain contains the operator's certificate. The agent's workload identity is lost from the signature. As discussed in #292, the operator would need to verify the agent's identity during fetch (via mTLS) and embed the verified agent SPIFFE ID as a claim in the signed payload (notarial model).

Summary

This PR provides the signing machinery (which is good and needed), but it's not yet Option 4 from #292. To get there, it needs:

  1. Switch to DefaultFetcher (HTTP fetch from live endpoint) when operator signing is enabled
  2. Use mTLS on the fetch to authenticate the agent and encrypt the channel
  3. Embed the verified agent SPIFFE ID in the signed payload

Without (1), the operator signs the empty skeleton. Without (2) and (3), you get operator-attested cards without agent identity binding or tamper protection on the fetch.

Skip NetworkPolicy creation when identity binding fails and strict mode
is disabled, allowing agents with unverified bindings to retain network
access. Add test coverage for strict vs non-strict behavior.

Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>
Run make manifests to reflect the new Strict field in the AgentCard CRD,
updated controller RBAC markers, and webhook timeout removal.

Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>
Grant update permission on deployments/finalizers and
statefulsets/finalizers so the operator can set blockOwnerDeletion
on owned workloads.

Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>
Point the enforcement demo prerequisite back to agentcard-spire-signing
(init-container path). Add a new agentcard-operator-signing demo with
deployment manifests, ClusterSPIFFEID, AgentCard CR, and teardown script
for the opt-in operator signing path.

Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>
…ditional fetcher

Add IdentityBinding.Strict field (default false) for per-card binding
enforcement granularity. When strict is true, a trust-domain mismatch
removes the signature-verified label; when false, the mismatch is
recorded in status but the label is preserved.
Gate the AgentCard fetcher on --enable-operator-signing: when disabled
(default), the reconciler uses ConfigMapFetcher to read init-container-
signed cards; when enabled, it uses DefaultFetcher to fetch cards via
HTTP for operator-side signing.

Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>
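
The strict versus non-strict binding behavior in this commit can be illustrated with a small standard-library sketch. The names `InTrustDomain`, `BindingResult`, and `EvaluateBinding` are hypothetical, and the trust-domain check is a plain URL comparison standing in for whatever SPIFFE parsing the operator actually uses.

```go
package main

import (
	"fmt"
	"net/url"
)

// InTrustDomain reports whether id is a spiffe:// URI whose host is trustDomain.
func InTrustDomain(id, trustDomain string) bool {
	u, err := url.Parse(id)
	if err != nil || trustDomain == "" {
		return false
	}
	return u.Scheme == "spiffe" && u.Host == trustDomain
}

// BindingResult captures the two outcomes the commit message distinguishes.
type BindingResult struct {
	Bound     bool // recorded in status either way
	KeepLabel bool // whether the signature-verified label survives
}

// EvaluateBinding keeps the label on a mismatch unless strict mode is on.
func EvaluateBinding(id, trustDomain string, strict bool) BindingResult {
	bound := InTrustDomain(id, trustDomain)
	return BindingResult{Bound: bound, KeepLabel: bound || !strict}
}

func main() {
	id := "spiffe://other.example/ns/demo/sa/agent"
	fmt.Printf("%+v\n", EvaluateBinding(id, "example.org", false))
	fmt.Printf("%+v\n", EvaluateBinding(id, "example.org", true))
}
```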
@rhuss

rhuss commented Apr 17, 2026

Thanks for the update @kevincogan. The signing infrastructure is solid. Clean package structure, proper Signer interface, X509Source auto-rotation, good test coverage, feature-gated. The shared SignCard refactor eliminating duplication between init-container and operator is a nice touch.

A few observations on how the current PR relates to the Option 4 flow proposed in #292:

The fetcher is unchanged (assessed by Claude)

ConfigMapFetcher is still the default. When both init-container and operator signing are active (the coexistence mode described in the PR), the flow is:

  1. Init container signs empty skeleton → writes <agent>-card-signed ConfigMap
  2. Reconciler runs → ConfigMapFetcher reads <agent>-card-signed → gets the signed empty skeleton
  3. Operator re-signs the empty skeleton with its own SVID → overwrites the ConfigMap

The operator signs whatever the fetcher gives it. Since the fetcher prefers the signed ConfigMap (which contains the empty skeleton from the init-container), the operator re-signs empty content. The real card from the live endpoint is still only used as a fallback.

For Option 4 as proposed in #292 ("fetch from the live agent endpoint via HTTP and remove the init-container signer and all ConfigMaps from the signing pipeline"), I think the follow-up needs to:

  • Use DefaultFetcher (HTTP) instead of ConfigMapFetcher when operator signing is enabled
  • Drop the card-unsigned and init-container from the pipeline entirely in operator-signing mode

Fetch security and agent identity

As discussed in my comment on #292, the HTTP fetch is plaintext and unauthenticated, and the agent's SPIFFE ID is not embedded in the signed payload. These are follow-up items, but worth tracking explicitly since they affect whether we can market this as zero-trust agent verification.

Summary

I see this PR as the correct signing infrastructure. It provides the building blocks. The full Option 4 flow (HTTP fetch from live endpoint, mTLS, agent identity embedding, init-container removal) would be a next step, but is it clear that we can implement those?

But before continuing on this PR we should be crystal clear on how we solve:

  • the encryption challenge (https)
  • the identity challenge (agent or operator SVID)

This should be tackled first before we can think about adding this to the code base.

@kevincogan kevincogan changed the title from "✨ feat: Add operator-side AgentCard signing with SPIFFE identity" to "✨ feat: Add mTLS runtime identity verification for AgentCards" May 6, 2026
kevincogan added 7 commits May 7, 2026 22:37
Operator-signing wrote signed ConfigMaps on behalf of agents, adding
complexity without meaningful trust improvement. Replace with Phase 1
mTLS verified fetch which authenticates agents directly via SPIFFE.
Removes CardSigner, CardWriter, SpiffeSigner, and the operator-signing demo.

Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>
Adds Phase 1 identity verification: the operator authenticates agents
via mutual TLS using SPIFFE workload identity before trusting their
AgentCard data.
- SpiffeFetcher performs mTLS-authenticated fetch with cached HTTP client
- evaluateTrust unifies mTLS and JWS signature verification paths
- Verified condition is the single trust signal for NetworkPolicy gating
- Identity binding evaluates attested SPIFFE IDs against trust domain
- AttestedAgent printer column shows verified SPIFFE ID (wide output)
- Proper gRPC connection lifecycle for X509Source

Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>
- Add verifiedFetch Helm values with SPIFFE CSI volume and operator flags
- Create ClusterSPIFFEID template for operator workload identity
- Inject POD_NAMESPACE via Downward API for NetworkPolicy peer selection
- Use kubernetes.io/metadata.name label instead of custom labeling
- Replace hardcoded namespace with ClusterDefaultsNamespace constant

Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>
Lightweight test server that serves AgentCard data over mTLS using
SPIFFE workload identity. Used for E2E testing of verified fetch on
clusters with SPIRE deployed.

Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>
Ensures the injected Envoy sidecar exposes the agent-tls named port
so the operator can discover mTLS endpoints by port name.

Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>
…card-signing

Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	charts/kagenti-operator/templates/manager/manager.yaml
- Check write error returns in HTTP handlers
- Suppress errcheck on deferred Close and Shutdown calls
- Remove redundant fmt.Sprintf for string argument

Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>
@kevincogan kevincogan marked this pull request as ready for review May 7, 2026 22:29
@kevincogan kevincogan requested a review from a team as a code owner May 7, 2026 22:29
…card-signing

Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	charts/kagenti-operator/templates/manager/manager.yaml
…card-signing

Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	kagenti-operator/internal/controller/agentcard_controller.go
@kevincogan
Contributor Author

kevincogan commented May 11, 2026

@mrsabath @maia-iyer @rhuss I'd appreciate your review on this PR when you get a chance.

Quick context: this PR started as the operator-side signing work we collaborated on. After @rhuss's review and the Slack discussion on the AgentCard trust model, the direction shifted toward mTLS verified fetch rather than operator signing. The key argument was that within a single cluster, mTLS via SPIFFE already proves agent identity at fetch time without the circular trust concern.

What changed:

  • Removed: Operator-side signing (SpiffeSigner, CardWriter, ConfigMap pipeline)
  • Added: mTLS authenticated fetch using go-spiffe. The operator verifies the agent's SPIFFE SVID during the TLS handshake and captures the agent's SPIFFE ID in the CRD status
  • Preserved: The init-container JWS signer still works independently. Both trust paths coexist

The underlying principle is the same: SPIRE-attested identity as the foundation for trust, just using mTLS rather than a signature to verify it.

References:

The only change to the Sigstore plan is that Phase 2A (operator signing) is replaced by mTLS verified fetch. The supply-chain provenance phases (image verification, blob verification, attestor composition) remain the same. Would like to confirm this still aligns from your perspective. Thanks!

@maia

maia commented May 11, 2026

@kevincogan fyi: you have tagged the wrong person.

@kevincogan kevincogan force-pushed the feat/operator-agentcard-signing branch from c5fa688 to 1502f39 Compare May 11, 2026 20:00
@esnible
Contributor

esnible commented May 12, 2026

I am nervous about approving security PRs without understanding the pieces.

This PR has no link to an issue or discussion, so I will ask here. I know of three ways to get AgentCards without this PR:

Does mTLS identity verification extend any of these? Will I be doing curl --cert cert.crt, or seeing a lock icon in the Kagenti UI, or seeing a new attestation field in our Kubernetes AgentCard CRD?

Contributor

@pdettori pdettori left a comment

Solid feature-gated implementation of mTLS runtime identity verification. The dual-mode coexistence (mTLS + JWS) is well-designed with clear condition semantics — evaluateTrust cleanly unifies both trust paths under a single Verified condition, and the NetworkPolicy controller correctly keys on it.

Two inline suggestions (non-blocking) and one nit below.

Areas reviewed: Go, Helm/K8s, YAML, Dockerfile, Shell, Docs
Commits: 19 commits (15 feature + 4 merge), all signed-off ✓
CI status: 15/15 checks passing ✓

Assisted-By: Claude Code

Comment thread kagenti-operator/internal/agentcard/fetcher.go Outdated
Comment thread kagenti-operator/internal/agentcard/fetcher.go Outdated
Comment thread kagenti-operator/cmd/test-tls-agent/Dockerfile
Contributor

@mrsabath mrsabath left a comment

Solid implementation of mTLS runtime identity verification. The dual-mode coexistence design is sound, feature gating is correct, and the reconciler refactoring into fetchCardData and evaluateTrust improves testability. CI is fully green and E2E coverage on ROSA is thorough.

Three bugs worth addressing (non-blocking given the feature gate is off by default):

Critical (low exploitability due to feature gate):

  1. Silent error in trust domain parsing (fetcher.go:260): td, _ := spiffeid.TrustDomainFromString(trustDomain) — if trustDomain is empty (the Helm default!), td is a zero-value TrustDomain and AuthorizeMemberOf(td) behavior is undefined in go-spiffe. Should fail loudly.

  2. Wrong error returned (fetcher.go:297): return nil, err returns the first endpoint's error instead of legacyErr. Masks the actual failure when both endpoints fail.

  3. SPIFFE ID from PeerCertificates[0] vs VerifiedChains (fetcher.go:341): PeerCertificates is populated before chain validation. Defense-in-depth says prefer VerifiedChains[0][0]. Mitigated by go-spiffe's VerifyPeerCertificate callback.
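
Items 1 and 3 above can be illustrated with a standard-library sketch. It is an approximation, not the fetcher's code and not go-spiffe's API: `RequireTrustDomain`, `AuthorizePeer`, and `fakeState` are hypothetical names, and the sketch assumes the TLS stack has populated `VerifiedChains`, which depends on how certificate verification is configured.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"net/url"
)

// RequireTrustDomain fails loudly on an empty trust domain instead of
// letting a zero value flow into the authorizer.
func RequireTrustDomain(td string) (string, error) {
	if td == "" {
		return "", errors.New("trust domain must not be empty")
	}
	return td, nil
}

// AuthorizePeer reads the SPIFFE ID from the verified leaf certificate
// (VerifiedChains[0][0]) and checks it belongs to the trust domain.
func AuthorizePeer(state tls.ConnectionState, trustDomain string) (string, error) {
	td, err := RequireTrustDomain(trustDomain)
	if err != nil {
		return "", err
	}
	if len(state.VerifiedChains) == 0 || len(state.VerifiedChains[0]) == 0 {
		return "", errors.New("no verified certificate chain")
	}
	leaf := state.VerifiedChains[0][0]
	if len(leaf.URIs) != 1 || leaf.URIs[0].Scheme != "spiffe" {
		return "", errors.New("leaf must carry exactly one spiffe:// URI SAN")
	}
	if leaf.URIs[0].Host != td {
		return "", fmt.Errorf("peer %q is not a member of trust domain %q", leaf.URIs[0], td)
	}
	return leaf.URIs[0].String(), nil
}

// fakeState builds a ConnectionState whose verified leaf carries the given
// URI SAN. It exists only to exercise AuthorizePeer without a handshake.
func fakeState(uri string) tls.ConnectionState {
	u, err := url.Parse(uri)
	if err != nil {
		return tls.ConnectionState{}
	}
	leaf := &x509.Certificate{URIs: []*url.URL{u}}
	return tls.ConnectionState{VerifiedChains: [][]*x509.Certificate{{leaf}}}
}

func main() {
	state := fakeState("spiffe://example.org/ns/demo/sa/agent")
	fmt.Println(AuthorizePeer(state, "example.org"))
	fmt.Println(AuthorizePeer(state, ""))
	fmt.Println(AuthorizePeer(tls.ConnectionState{}, "example.org"))
}
```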

Major (suggestions):

  1. Silent HTTP fallback: When --enable-verified-fetch is set but agent lacks agent-tls port, the operator logs at Info level and falls back to unverified HTTP. Consider emitting a K8s Event on the AgentCard or a DegradedFetch condition.

  2. Binding precedence undocumented: When both mTLS and JWS succeed (possibly with different SPIFFE IDs), mTLS takes unconditional precedence in evaluateTrust. Worth a comment.
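
The precedence rule in item 2 can be made explicit in code as well as in a comment. The sketch below is illustrative only: `Attestation` and `BindingIdentity` are hypothetical names, not identifiers from evaluateTrust.

```go
package main

import "fmt"

// Attestation is the outcome of one trust path.
type Attestation struct {
	OK       bool
	SPIFFEID string
}

// BindingIdentity picks the SPIFFE ID used for identity binding.
// When both paths succeed, the mTLS identity wins unconditionally,
// even if the JWS certificate names a different SPIFFE ID.
func BindingIdentity(mtls, jws Attestation) (id, source string, ok bool) {
	switch {
	case mtls.OK:
		return mtls.SPIFFEID, "mtls", true
	case jws.OK:
		return jws.SPIFFEID, "jws", true
	default:
		return "", "", false
	}
}

func main() {
	m := Attestation{OK: true, SPIFFEID: "spiffe://example.org/ns/demo/sa/agent"}
	j := Attestation{OK: true, SPIFFEID: "spiffe://example.org/ns/demo/sa/signer"}
	fmt.Println(BindingIdentity(m, j))
	fmt.Println(BindingIdentity(Attestation{}, j))
	fmt.Println(BindingIdentity(Attestation{}, Attestation{}))
}
```

Returning the source alongside the ID makes it possible to log or surface which path produced the binding when the two identities differ.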

Areas reviewed: Go (fetcher, reconciler, binding, NetworkPolicy controller, test-tls-agent), Helm chart, CRD, RBAC, Dockerfile, demos
Commits: 19 (15 feature + 4 merge), all signed-off
CI: 15/15 passing

Comment thread kagenti-operator/internal/agentcard/fetcher.go Outdated
Comment thread kagenti-operator/internal/agentcard/fetcher.go Outdated
Comment thread kagenti-operator/internal/agentcard/fetcher.go Outdated
- Return error from NewSpiffeFetcher on invalid trust domain
- Fix wrong error variable returned in legacy endpoint fallback
- Prefer VerifiedChains over PeerCertificates for SPIFFE ID extraction
- Emit Warning Event when falling back to unverified HTTP fetch
- Document mTLS precedence over JWS in evaluateTrust
- Fix unused parameter lint warning in test-tls-agent

Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>
@kevincogan
Contributor Author

Major (suggestions):

  1. Silent HTTP fallback: When --enable-verified-fetch is set but agent lacks agent-tls port, the operator logs at Info level and falls back to unverified HTTP. Consider emitting a K8s Event on the AgentCard or a DegradedFetch condition.
  2. Binding precedence undocumented: When both mTLS and JWS succeed (possibly with different SPIFFE IDs), mTLS takes unconditional precedence in evaluateTrust. Worth a comment.

Areas reviewed: Go (fetcher, reconciler, binding, NetworkPolicy controller, test-tls-agent), Helm chart, CRD, RBAC, Dockerfile, demos
Commits: 19 (15 feature + 4 merge), all signed-off
CI: 15/15 passing

@mrsabath addressed both suggestions:

  • Silent HTTP fallback: Added a Warning Event (FallbackToHTTP) on the AgentCard when the operator falls back to unverified HTTP due to missing agent-tls port.
  • Binding precedence: Added comment above the Verified status computation documenting that mTLS takes unconditional precedence over JWS.

@kevincogan
Contributor Author

I am nervous about approving security PRs without understanding the pieces.

This PR has no link to an issue or discussion, so I will ask here. I know of three ways to get AgentCards without this PR:

Does mTLS identity verification extend any of these? Will I be doing curl --cert cert.crt, or seeing a lock icon in the Kagenti UI, or seeing a new attestation field in our Kubernetes AgentCard CRD?

Good questions @esnible. This PR changes how the operator fetches and trusts AgentCard data internally. The mTLS happens between the operator and the agent during reconciliation, not between end users and agents.

To answer each:

  • oc get agentcards: The existing Verified column now reflects the unified trust decision (mTLS or JWS). With -o wide, a new AttestedAgent column shows the verified SPIFFE ID.
  • Kagenti UI: No change in this PR. The UI reads CRD status, so it could surface the new fields in a follow-up.
  • curl /.well-known/agent-card.json: No change. You will not be doing curl --cert.

The new attestation field is status.attestedAgentSpiffeId on the AgentCard CRD. NetworkPolicy enforcement keys on the Verified condition to gate network access.

This resolves #292 and implements Phase 1 of the AgentCard Trust Model RFC.

Contributor

@r3v5 r3v5 left a comment

Great work, @kevincogan ! Left small comments, not a blocker

Comment thread kagenti-operator/internal/controller/agentcard_controller.go Outdated
Comment thread kagenti-operator/internal/controller/agentcard_controller.go Outdated
- Add nil check on r.Recorder.Event() in FallbackToHTTP path
- Change cleanupVerifiedFetchFields to return error to caller
- Caller logs and continues without failing the reconcile loop

Signed-off-by: Kevin Cogan <kevin.s.cogan@gmail.com>
@kevincogan kevincogan requested a review from r3v5 May 19, 2026 10:48

7 participants