WIP: Add OCI skill image mounting to AgentRuntime#332

Draft
cooktheryan wants to merge 1 commit into kagenti:main from cooktheryan:feat/skill-image-volumes

Conversation

@cooktheryan
Contributor

@cooktheryan cooktheryan commented May 6, 2026

Summary

  • Adds a skills field to AgentRuntimeSpec for declaring OCI skill images to mount into agent pods as Kubernetes ImageVolumes
  • Gated behind a skillImageVolumes feature gate (default off), requires Kubernetes 1.31+
  • Uses the skillimage OCI format: FROM scratch images with skill.yaml + SKILL.md
  • Each skill specifies a mountPath, making the feature framework-agnostic (Claude, Cursor, custom agents, etc.)

Example

apiVersion: agent.kagenti.dev/v1alpha1
kind: AgentRuntime
metadata:
  name: resume-agent-runtime
spec:
  type: agent
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: resume-agent
  skills:
    - name: resume-reviewer
      image: ghcr.io/redhat-et/skillimage/resume-reviewer:v1.0.0
      mountPath: /agent/skills/resume-reviewer
    - name: blog-writer
      image: ghcr.io/redhat-et/skillimage/blog-writer:latest
      mountPath: /app/.claude/skills/blog-writer
      pullPolicy: Always
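
A minimal sketch of how one `skills` entry from the example could translate into an image-backed volume plus a read-only mount. The struct fields are inferred from the YAML above, not copied from the PR's Go types; `Volume`, `ImageVolumeSource`, and `VolumeMount` are local stand-ins for the `k8s.io/api/core/v1` types so the snippet compiles without external modules, and `skillVolume` is a hypothetical helper name.

```go
package main

import "fmt"

// SkillImageRef mirrors the fields visible in the example manifest.
// Assumption: the real CRD type may carry more fields or validation tags.
type SkillImageRef struct {
	Name       string
	Image      string
	MountPath  string
	PullPolicy string // optional; empty leaves the default to the cluster
}

// Local stand-ins for the core/v1 volume types (illustrative only).
type ImageVolumeSource struct {
	Reference  string
	PullPolicy string
}

type Volume struct {
	Name  string
	Image ImageVolumeSource
}

type VolumeMount struct {
	Name      string
	MountPath string
	ReadOnly  bool
}

// skillVolume converts one skill entry into a volume and a read-only mount.
// The "skill-<name>" volume name follows the naming mentioned in this PR.
func skillVolume(s SkillImageRef) (Volume, VolumeMount) {
	name := "skill-" + s.Name
	vol := Volume{
		Name:  name,
		Image: ImageVolumeSource{Reference: s.Image, PullPolicy: s.PullPolicy},
	}
	mount := VolumeMount{Name: name, MountPath: s.MountPath, ReadOnly: true}
	return vol, mount
}

func main() {
	vol, mount := skillVolume(SkillImageRef{
		Name:      "resume-reviewer",
		Image:     "ghcr.io/redhat-et/skillimage/resume-reviewer:v1.0.0",
		MountPath: "/agent/skills/resume-reviewer",
	})
	fmt.Printf("%+v\n%+v\n", vol, mount)
}
```

The volume and the mount share a name, which is what ties the OCI reference to the path the agent framework reads from.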

Changes

| Area | Files | What |
|---|---|---|
| CRD types | api/v1alpha1/agentruntime_types.go | SkillImageRef, SkillPullPolicy, Skills field |
| Feature gate | internal/webhook/config/feature_gates.go | SkillImageVolumes (default false) |
| Controller | internal/controller/agentruntime_controller.go, agentruntime_skills.go | Reconcile ImageVolumes on Deployment/StatefulSet, cleanup on deletion |
| Config hash | internal/controller/agentruntime_config.go | Skills in hash → rolling updates on change |
| Webhook | internal/webhook/v1alpha1/agentruntime_webhook.go | Validate duplicate names, reserved volume collisions |
| Wiring | cmd/main.go | Pass feature gate loader to reconciler |
| Docs | docs/api-reference.md, docs/architecture.md | SkillImageRef reference, conditions, examples |
| Samples | config/samples/agent_v1alpha1_agentruntime_skills.yaml, updated _full.yaml | New and updated sample manifests |
| Helm | charts/kagenti-operator/values.yaml, CRD YAML | Feature gate + CRD schema |
| Tests | agentruntime_skills_test.go, agentruntime_webhook_test.go | Volume reconciliation, config hash, validation |
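
To illustrate the "Skills in hash → rolling updates on change" row, here is a rough sketch of a stable digest over the skill list. It is not the PR's implementation: `skillsHash` is a hypothetical name, and sorting by skill name before hashing is an assumption made here so that reordering entries in the CR does not trigger a rollout.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"sort"
)

// SkillImageRef mirrors the fields from the example manifest (assumed shape).
type SkillImageRef struct {
	Name       string `json:"name"`
	Image      string `json:"image"`
	MountPath  string `json:"mountPath"`
	PullPolicy string `json:"pullPolicy,omitempty"`
}

// skillsHash returns a hex SHA-256 digest of the skill list.
// The input is copied and sorted by name so the result is order-independent.
func skillsHash(skills []SkillImageRef) (string, error) {
	sorted := append([]SkillImageRef(nil), skills...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Name < sorted[j].Name })
	raw, err := json.Marshal(sorted)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	h, err := skillsHash([]SkillImageRef{
		{Name: "resume-reviewer", Image: "ghcr.io/redhat-et/skillimage/resume-reviewer:v1.0.0", MountPath: "/agent/skills/resume-reviewer"},
	})
	if err != nil {
		fmt.Println("hash error:", err)
		return
	}
	fmt.Println("skills hash:", h)
}
```

Writing a digest like this into the pod template is one common way to make a workload roll when the inputs change; whether the operator stores it in an annotation or elsewhere is not stated here.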

Relationship to ConfigMap-based skill linking (kagenti/kagenti#1440)

This feature complements the ConfigMap-based skill mounting in kagenti/kagenti#1440. Both deliver skill files into agent pods, but target different maturity stages from kagenti/kagenti#1342:

| | #1440 (ConfigMap) | This PR (OCI ImageVolume) |
|---|---|---|
| Storage | Kubernetes ConfigMap (~1 MiB limit) | OCI registry (no ConfigMap size limit) |
| Versioning | None (mutable ConfigMap) | OCI tags + digests (digests are immutable) |
| Lifecycle | Create/delete ConfigMap | draft → testing → published → deprecated → archived |
| Declaration | Backend API at deploy time | AgentRuntime CR (declarative, GitOps-friendly) |
| Mount path | Hardcoded /app/skills/<name> | User-specified per skill |
| K8s version | Any | 1.31+ (ImageVolume feature gate) |

Integration opportunities for discussion

  1. SKILL_FOLDERS env var — #1440 sets SKILL_FOLDERS so agents discover mounted skills. This operator feature could inject the same env var so agents work transparently with both delivery mechanisms.

  2. kagenti.io/skills annotation — #1440 stores linked skills in this annotation. The operator could write this annotation when skills are declared on the AgentRuntime CR, enabling the UI/backend to display OCI-mounted skills alongside ConfigMap-mounted ones.

  3. Coexistence — Both mechanisms can coexist on the same pod. ConfigMap volumes use names like skill-0, skill-1; OCI ImageVolumes use skill-<name>. Different volume types, different names, no conflicts.

  4. Migration path — ConfigMap skills work today on any K8s version. OCI ImageVolume skills are the upgrade path when clusters reach K8s 1.31+. Teams can adopt incrementally.
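
The coexistence point depends on volume names never overlapping. A skill literally named `0` would map to `skill-0`, the same name a ConfigMap volume uses, so a collision check is still worth having. The sketch below shows one way such a check could look; `validateSkillVolumes` is a hypothetical helper, and the exact rules enforced by the PR's webhook are not reproduced here.

```go
package main

import "fmt"

// validateSkillVolumes rejects duplicate skill names and any derived
// "skill-<name>" volume name that is already present on the pod.
// Assumption: existingVolumes holds the names of volumes not managed
// by the skills feature (for example ConfigMap volumes such as skill-0).
func validateSkillVolumes(skillNames, existingVolumes []string) error {
	taken := make(map[string]bool, len(existingVolumes))
	for _, v := range existingVolumes {
		taken[v] = true
	}
	seen := make(map[string]bool, len(skillNames))
	for _, n := range skillNames {
		if seen[n] {
			return fmt.Errorf("duplicate skill name %q", n)
		}
		seen[n] = true
		vol := "skill-" + n
		if taken[vol] {
			return fmt.Errorf("skill %q: volume name %q is already in use", n, vol)
		}
	}
	return nil
}

func main() {
	existing := []string{"skill-0", "skill-1"} // ConfigMap-style names
	fmt.Println(validateSkillVolumes([]string{"resume-reviewer", "blog-writer"}, existing))
	fmt.Println(validateSkillVolumes([]string{"0"}, existing))
}
```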

Test plan

  • Unit tests: volume reconciliation (add/remove/update/multi-container), config hash, webhook validation
  • make manifests generate — CRD and deepcopy regenerated
  • go build ./... — compiles cleanly
  • go test ./internal/controller/ ./internal/webhook/... — all tests pass
  • Kind cluster (K8s 1.31): CRD installs, schema validation works, fields round-trip correctly
  • E2E: Full operator deployment with skill ImageVolumes on K8s 1.33+ cluster (requires kind v0.29.0+ with containerd 2.1.1 for runtime-level ImageVolume support)

Assisted-By: Claude Code

@cooktheryan
Contributor Author

DO NOT MERGE at the current time. I would like feedback based on kagenti/kagenti#1342

@cooktheryan cooktheryan force-pushed the feat/skill-image-volumes branch from d3257c1 to 274bd62 Compare May 6, 2026 15:42
Collaborator

@cwiklik cwiklik left a comment

Solid implementation with proper feature gating, comprehensive tests (unit + E2E), clean separation of concerns (controller, webhook, config hash), and thorough docs. The ImageVolume K8s requirement (1.31+) is well-documented and the graceful degradation (condition + event when gate is disabled) is good UX.

Areas reviewed: Go (types, controller, webhook), Helm, CRD, Docs, Tests
Commits: 3 commits, all signed-off: yes
CI status: all passing (E2E pending manual trigger)


Suggestions (non-blocking)

1. PR body attribution (nit)
PR body ends with "Generated with Claude Code" — per repo conventions this should be "Assisted-By: Claude Code".

2. Commit hygiene (suggestion)
Commits 45e06bc ("include e2e tests for oci") and 4bd2f5a ("fixes due to code review") don't follow the imperative commit convention and are vague. Consider squashing into the main commit before merge.

3. Skill mounts applied to all containers (suggestion)
Skills are currently mounted into ALL containers including sidecars (envoy-proxy, spiffe-helper). For pods with AuthBridge injection, sidecars don't need skill files. Consider targeting only the agent container in a follow-up. Not a blocker for alpha — the extra read-only mounts are harmless — but worth tracking to avoid clutter in complex pod specs.
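
One possible shape for that follow-up, sketched under assumptions: the agent container is identified by name (how the operator would actually pick it is an open question), and `Container` and `VolumeMount` are local stand-ins for the `k8s.io/api/core/v1` types. Mounts are forced read-only regardless of what the caller passes.

```go
package main

import "fmt"

// Local stand-ins for the core/v1 types (illustrative only).
type VolumeMount struct {
	Name      string
	MountPath string
	ReadOnly  bool
}

type Container struct {
	Name         string
	VolumeMounts []VolumeMount
}

// addSkillMounts returns a copy of containers in which only the container
// named agentContainer receives the skill mounts. Sidecars are left as-is.
func addSkillMounts(containers []Container, agentContainer string, mounts []VolumeMount) []Container {
	out := make([]Container, len(containers))
	for i, c := range containers {
		out[i] = c
		if c.Name != agentContainer {
			continue
		}
		merged := append([]VolumeMount(nil), c.VolumeMounts...)
		for _, m := range mounts {
			m.ReadOnly = true // skills are never writable from the agent
			merged = append(merged, m)
		}
		out[i].VolumeMounts = merged
	}
	return out
}

func main() {
	pod := []Container{{Name: "agent"}, {Name: "envoy-proxy"}, {Name: "spiffe-helper"}}
	mounts := []VolumeMount{{Name: "skill-resume-reviewer", MountPath: "/agent/skills/resume-reviewer"}}
	for _, c := range addSkillMounts(pod, "agent", mounts) {
		fmt.Printf("%s: %d skill mount(s)\n", c.Name, len(c.VolumeMounts))
	}
}
```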

@pavelanni
Contributor

pavelanni commented May 6, 2026

It's important to make sure that the mounted skills are listed in the AgentCard exposed by the agent running in Agent Runtime. There is a section in the AgentCard spec for that.

https://agent2agent.info/docs/concepts/agentcard/

In my agent harness (https://github.com/redhat-et/docsclaw) it is implemented by the agent itself, but it would be good to have it implemented at the runtime level to make it agent-agnostic.

Another important thing is to ensure that images are mounted in containers read-only, to avoid any risk of malicious agents mutating them. If the Operator mounts them, this should be part of its logic.

@cooktheryan cooktheryan force-pushed the feat/skill-image-volumes branch 2 times, most recently from bd605a8 to 2961db3 Compare May 6, 2026 18:45
@pavelanni
Contributor

Please take a look at the SkillCard schema that I use in Skill Image: https://github.com/redhat-et/skillimage/blob/main/schemas/skillcard-v1.json
It might be used as a prototype for Kagenti skills.

@cooktheryan cooktheryan force-pushed the feat/skill-image-volumes branch from 2961db3 to 2f3bf03 Compare May 7, 2026 19:47
Add kagenti.io/skills annotation on target workload metadata with a JSON
array of mounted skill names for downstream discovery (agent card
controllers, UI). The annotation is set when the skillImageVolumes
feature gate is enabled and removed on skill clearing or AgentRuntime
deletion.

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ryan Cook <rcook@redhat.com>
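
A rough sketch of the annotation handling this commit message describes: a JSON array of skill names written under `kagenti.io/skills`, and the key removed when no skills remain. `setSkillsAnnotation` is a hypothetical helper, and sorting the names is an assumption made here to keep the value stable across reconciles.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

const skillsAnnotation = "kagenti.io/skills"

// setSkillsAnnotation writes the skill names as a JSON array, or deletes
// the annotation when the list is empty. It returns the (possibly newly
// allocated) annotation map.
func setSkillsAnnotation(annotations map[string]string, skillNames []string) (map[string]string, error) {
	if annotations == nil {
		annotations = map[string]string{}
	}
	if len(skillNames) == 0 {
		delete(annotations, skillsAnnotation)
		return annotations, nil
	}
	names := append([]string(nil), skillNames...)
	sort.Strings(names)
	raw, err := json.Marshal(names)
	if err != nil {
		return annotations, err
	}
	annotations[skillsAnnotation] = string(raw)
	return annotations, nil
}

func main() {
	ann, err := setSkillsAnnotation(nil, []string{"resume-reviewer", "blog-writer"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(ann[skillsAnnotation]) // prints ["blog-writer","resume-reviewer"]
}
```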
@cooktheryan cooktheryan force-pushed the feat/skill-image-volumes branch from 2f3bf03 to 58f3815 Compare May 7, 2026 19:57
@cooktheryan
Contributor Author

@kevincogan one thing my brain is stuck on right now: when we build the container image for an agent, we build it with the AgentCard, and that AgentCard is read-only. The sticking point is that the OCI mounting for skills may be dynamic, but the skills section of an AgentCard is pretty much locked with our mechanism. Any advice or thoughts here?

@cooktheryan
Contributor Author

Additionally, @Ladas, do you have any opinions here based on your work launching Claude Code and similar tools using the OCI mounting mechanisms?

@kevincogan
Contributor

> @kevincogan one thing my brain is stuck on right now: when we build the container image for an agent, we build it with the AgentCard, and that AgentCard is read-only. The sticking point is that the OCI mounting for skills may be dynamic, but the skills section of an AgentCard is pretty much locked with our mechanism. Any advice or thoughts here?

@cooktheryan I don't think we should touch the signed card the agent serves. That stays locked down. But the AgentCardReconciler already fetches and caches the card into status, so we can just append the runtime skills to that cached copy after verification completes. One flow, one CR, just an enriched status at the end.

Security-wise nothing changes. Verification (JWS or mTLS) still runs against the original card before any merging happens. The Verified condition, NetworkPolicy, and identity binding are all driven by the original signed card. The appended skills are purely informational for discovery and the UI.

Your kagenti.io/skills annotation is basically all I'd need on my side. The AgentCard controller reads that and appends anything not already in the card.

Let me know if I am missing anything. If not I can pick this up as a follow-up once yours lands.
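
The merge step proposed in this comment could look roughly like the sketch below. Everything here is illustrative: `CardSkill` and `mergeRuntimeSkills` are hypothetical names, matching card skills to annotation entries by ID is an assumption, and the `Source` field is only a way to show that appended entries are informational rather than part of the signed card.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CardSkill is a simplified stand-in for a skill entry in the cached card.
type CardSkill struct {
	ID     string
	Source string // "card" for signed entries, "runtime" for appended ones
}

// mergeRuntimeSkills appends skills listed in the kagenti.io/skills
// annotation that are not already in the verified card. It is meant to run
// only after verification, and it never modifies the input slice.
func mergeRuntimeSkills(cardSkills []CardSkill, annotation string) ([]CardSkill, error) {
	var names []string
	if annotation != "" {
		if err := json.Unmarshal([]byte(annotation), &names); err != nil {
			return nil, fmt.Errorf("parse skills annotation: %w", err)
		}
	}
	known := make(map[string]bool, len(cardSkills))
	for _, s := range cardSkills {
		known[s.ID] = true
	}
	merged := append([]CardSkill(nil), cardSkills...)
	for _, n := range names {
		if known[n] {
			continue
		}
		known[n] = true
		merged = append(merged, CardSkill{ID: n, Source: "runtime"})
	}
	return merged, nil
}

func main() {
	card := []CardSkill{{ID: "resume-reviewer", Source: "card"}}
	merged, err := mergeRuntimeSkills(card, `["resume-reviewer","blog-writer"]`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", merged)
}
```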

@pdettori
Contributor

pdettori commented May 7, 2026

@cooktheryan should we set this PR as draft until it is ready to merge?

@cooktheryan
Contributor Author

@pdettori yes, for sure. I was feeling confident in the PR early on, then I realized how many pieces we have to tie in.

@cooktheryan cooktheryan changed the title from "Feat: Add OCI skill image mounting to AgentRuntime" to "WIP: Add OCI skill image mounting to AgentRuntime" May 8, 2026
@pdettori pdettori marked this pull request as draft May 8, 2026 02:19
@eranra

eranra commented May 10, 2026

@cooktheryan @pavelanni @pdettori there is an initial community effort around OCI and skills here: https://github.com/agentskills/agentskills/discussions/292?ref=thomasvitale.com. Are you in sync with it? It would be best to make Kagenti as "generic" as possible, and to join forces with the community effort and align the code.

@pavelanni
Contributor

@eranra Yes, I reached out to Thomas Vitale on Slack and we are working on organizing a meeting. There is also a CNCF initiative around that: cncf/toc#1740 which I am participating in as well.
I'm also in contact with the Lola project: https://github.com/LobsterTrap/lola where we are adding OCI extension to their toolset.

@eranra

eranra commented May 11, 2026

> @eranra Yes, I reached out to Thomas Vitale on Slack and we are working on organizing a meeting. There is also a CNCF initiative around that: cncf/toc#1740 which I am participating in as well. I'm also in contact with the Lola project: https://github.com/LobsterTrap/lola where we are adding OCI extension to their toolset.

@pavelanni Thanks for sharing ;-)

I looked at the link/initiative, and it is indeed very interesting. I think we should also consider a more “shift-right” approach that automates processes and moves more of the intelligence and optimization into the runtime space.

Focusing on the AI developer persona makes a lot of sense today, but as the skills and AI ecosystem evolves toward greater automation and iterative optimization, the outer loop will become just as important. In particular, the ability to automatically improve, adapt, and incorporate new skills over time will be critical for long-term scalability and operational efficiency. I think that dynamic interaction with skills is a characteristic we need to consider in the interface between agents and skills.

Projects

Status: Backlog

7 participants