
fix(keycloak): delete wrong-type mapper before recreating audience mapper#359

Open
cwiklik wants to merge 1 commit into main from fix/358-audience-mapper-wrong-type


@cwiklik cwiklik commented May 13, 2026

Summary

  • Fixes the infinite error loop where the operator can't create an audience mapper due to a 409 conflict from Keycloak, causing 401s on all agent-to-agent calls.
  • Handles two failure modes discovered during local testing:
    1. Wrong-type mapper: A mapper with the correct name but wrong protocolMapper type blocks creation — now deleted and recreated.
    2. Phantom/concurrent 409: Multiple reconciles race, or Keycloak reports a conflict for a non-existent mapper. This case now returns nil (breaking the error loop) and lets verifyAudienceMapper handle it, in the same reconcile and again on later ones.

Observed Symptoms

Operator logs on Kind cluster with v0.2.0-rc.4, deploying a2a-currency-converter as a Deployment:

{"level":"error","msg":"Keycloak audience scope management failed (credentials will still be written)",
 "controller":"clientregistration","Deployment":{"name":"a2a-currency-converter","namespace":"team1"},
 "clientId":"spiffe://localtest.me/ns/team1/sa/a2a-currency-converter",
 "error":"ensure audience mapper for existing scope \"agent-team1-a2a-currency-converter-aud\": no matching audience mapper found for scope \"agent-team1-a2a-currency-converter-aud\" (scopeID 2ca02fe1-...)"}

This repeats every few seconds indefinitely. The credential Secret IS written (client registration succeeds), but without the audience mapper the issued tokens lack the correct aud claim, so AuthBridge/Envoy rejects them and every A2A call fails with 401 Unauthorized.

Prior Work and Remaining Gap

| PR | What it fixed | What it doesn't handle |
|----|---------------|------------------------|
| #331 | updateAudienceMapperIfNeeded: updates audience value when mapper exists with correct type but stale included.custom.audience | Mapper with wrong type, or empty mapper list |
| #350 | verifyAudienceMapper: defense-in-depth GET+verify on every reconcile; error propagation (fix for #348) | Unreachable when getOrCreateAudienceClientScope returns error first |

The gap: updateAudienceMapperIfNeeded has no recovery path when:

  • A mapper exists with the right name but wrong protocolMapper type (causes name collision 409 but loop skips it)
  • The mapper list is empty despite the 409 (concurrent reconcile or Keycloak realm-level uniqueness check)

Root Cause Analysis

Call chain (before this fix)

EnsureAudienceScope
  → getOrCreateAudienceClientScope
    → findClientScopeIDByName → scope EXISTS
    → ensureAudienceMapper
      → POST mapper → 409 Conflict
      → updateAudienceMapperIfNeeded
        → GET mappers for scope
        → Loop: find Name==scopeName AND ProtocolMapper=="oidc-audience-mapper"
        → NO MATCH → return error  ← BUG: unrecoverable, blocks everything below
    → error propagates up
  → verifyAudienceMapper NEVER REACHED (would have self-healed)
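
The strict lookup at the bottom of this chain can be sketched as follows. This is a minimal illustration and not the operator's code: the `protocolMapper` struct, `findAudienceMapper`, and `errNoAudienceMapper` are hypothetical names, and the JSON field names are assumed to follow Keycloak's protocol mapper representation.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// protocolMapper is a hypothetical, trimmed-down view of a Keycloak
// protocol mapper; only the fields the lookup needs are decoded.
type protocolMapper struct {
	ID             string            `json:"id"`
	Name           string            `json:"name"`
	ProtocolMapper string            `json:"protocolMapper"`
	Config         map[string]string `json:"config"`
}

const audienceMapperType = "oidc-audience-mapper"

var errNoAudienceMapper = errors.New("no matching audience mapper found")

// findAudienceMapper mimics the pre-fix filter: a mapper qualifies only if
// both its name and its type match. Everything else is skipped.
func findAudienceMapper(body []byte, scopeName string) (*protocolMapper, error) {
	var mappers []protocolMapper
	if err := json.Unmarshal(body, &mappers); err != nil {
		return nil, fmt.Errorf("decode mappers: %w", err)
	}
	for i := range mappers {
		if mappers[i].Name == scopeName && mappers[i].ProtocolMapper == audienceMapperType {
			return &mappers[i], nil
		}
	}
	return nil, errNoAudienceMapper
}

func main() {
	scope := "agent-team1-a2a-currency-converter-aud"
	wrongType := []byte(`[{"id":"m1","name":"agent-team1-a2a-currency-converter-aud","protocolMapper":"oidc-hardcoded-claim-mapper"}]`)
	empty := []byte(`[]`)

	// Both inputs end in the same error, which is why the two scenarios
	// described in this PR were indistinguishable in the logs.
	_, err := findAudienceMapper(wrongType, scope)
	fmt.Println("wrong type:", err)
	_, err = findAudienceMapper(empty, scope)
	fmt.Println("empty list:", err)
}
```

A wrong-type mapper and an empty list both fall through to the same error, so the caller has no way to tell which recovery is needed.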

Scenario 1: Wrong-type mapper

A mapper exists in the scope with Name == "agent-team1-...-aud" but ProtocolMapper == "oidc-hardcoded-claim-mapper" (or any non-audience type). This can happen from a prior partial failure or Keycloak migration.

  • POST returns 409 (name collision)
  • updateAudienceMapperIfNeeded skips it (line 266: type filter)
  • Falls through to error

Scenario 2: Concurrent reconcile / phantom 409

Multiple reconcile goroutines fire simultaneously for the same Deployment (observed: 3-4 reconciles within 1 second on pod restart). One succeeds and the others get 409. By the time the losers list mappers, Keycloak may not have committed the winner's transaction yet, so the list comes back empty.

Alternatively, Keycloak appears to enforce mapper name uniqueness at the realm level across all client scopes. If the name was ever used in a scope that was later deleted, Keycloak's internal state may still reserve it (observed on Kind: deleting the scope, recreating it, and POSTing the mapper returned 409 "Duplicate resource error" despite the scope being empty).

The Fix

For Scenario 1 (wrong-type mapper):

// Second pass: find mapper by name regardless of type, delete it, re-POST
for i := range mappers {
    if mappers[i].Name != scopeName {
        continue
    }
    if err := a.deleteMapper(ctx, token, realm, scopeID, mappers[i].ID); err != nil {
        return fmt.Errorf("delete stale mapper %q (id %s): %w", scopeName, mappers[i].ID, err)
    }
    return a.ensureAudienceMapper(ctx, token, realm, scopeID, scopeName, audience)
}
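
To show the delete-and-recreate path end to end without a Keycloak instance, here is a self-contained sketch against an in-memory fake. Everything in it is hypothetical (`fakeScope`, its `create`/`remove` methods, and the simplified `ensureAudienceMapper` signature); the fake only assumes that mapper names are unique within a scope and that a collision surfaces as a conflict error.

```go
package main

import (
	"errors"
	"fmt"
)

const audienceType = "oidc-audience-mapper"

var errConflict = errors.New("409 Conflict")

type mapper struct{ ID, Name, Type string }

// fakeScope is a hypothetical in-memory stand-in for one Keycloak client
// scope. Names must be unique within it, mirroring the 409 on collision.
type fakeScope struct {
	mappers []mapper
	nextID  int
}

func (s *fakeScope) create(name, typ string) error {
	for _, m := range s.mappers {
		if m.Name == name {
			return errConflict
		}
	}
	s.nextID++
	s.mappers = append(s.mappers, mapper{ID: fmt.Sprintf("m%d", s.nextID), Name: name, Type: typ})
	return nil
}

func (s *fakeScope) remove(id string) {
	kept := s.mappers[:0]
	for _, m := range s.mappers {
		if m.ID != id {
			kept = append(kept, m)
		}
	}
	s.mappers = kept
}

// ensureAudienceMapper tries to create the mapper and recovers from a conflict.
func ensureAudienceMapper(s *fakeScope, name string) error {
	err := s.create(name, audienceType)
	if err == nil || !errors.Is(err, errConflict) {
		return err
	}
	// First pass: right name and right type, so there is nothing to repair
	// (the audience-value update from the real code is omitted here).
	for _, m := range s.mappers {
		if m.Name == name && m.Type == audienceType {
			return nil
		}
	}
	// Second pass: right name, wrong type. Delete it and retry the create.
	for _, m := range s.mappers {
		if m.Name == name {
			s.remove(m.ID)
			return ensureAudienceMapper(s, name)
		}
	}
	// Conflict reported but nothing listed under that name: return quietly.
	return nil
}

func main() {
	s := &fakeScope{mappers: []mapper{{ID: "stale", Name: "demo-aud", Type: "oidc-hardcoded-claim-mapper"}}}
	err := ensureAudienceMapper(s, "demo-aud")
	fmt.Println(err, s.mappers)
}
```

In this sketch the retry terminates because the fake removes the stale mapper before the second create; against a real server the same shape relies on the DELETE having taken effect before the re-POST.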

For Scenario 2 (empty list after 409):

// No mapper with matching name found at all. Return nil — break the loop.
return nil

Why returning nil is safe

The EnsureAudienceScope call chain has two layers:

EnsureAudienceScope
  1. getOrCreateAudienceClientScope → ensureAudienceMapper (CREATE path)
  2. verifyAudienceMapper                                   (VERIFY path)

When updateAudienceMapperIfNeeded returns nil (scenario 2), control flows:

  • ensureAudienceMapper returns nil
  • getOrCreateAudienceClientScope returns scopeID (success)
  • verifyAudienceMapper runs — it independently GETs the mapper list
  • If mapper is present (concurrent reconcile committed): returns nil (done)
  • If mapper is absent: calls ensureAudienceMapper again. If the 409 repeats, that call returns nil again, so no error is logged and the mapper is created on a later full reconcile cycle once the Keycloak state clears

This is safe because:

  • The credential Secret is already written (client registration is independent of audience scope)
  • verifyAudienceMapper runs every reconcile cycle as defense-in-depth
  • The error was already non-fatal (logged but didn't block credential delivery)
  • Breaking the loop prevents log spam that obscures real issues
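
The two-layer flow described above can be modelled in a few lines. This is a hedged sketch with hypothetical names throughout (`phantomKeycloak`, `ensureMapper`, `verifyMapper`, and a simplified `ensureAudienceScope`); the `phantom` flag simulates a Keycloak that answers every create with a conflict while listing no mapper.

```go
package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("409 Conflict")

// phantomKeycloak is a hypothetical fake. While phantom is true, every
// create is rejected with a conflict even though no mapper is listed.
type phantomKeycloak struct {
	phantom    bool
	mappers    map[string]bool
	verifyRuns int
}

func (k *phantomKeycloak) createMapper(name string) error {
	if k.phantom || k.mappers[name] {
		return errConflict
	}
	k.mappers[name] = true
	return nil
}

// ensureMapper is the CREATE path. A conflict is not treated as a failure:
// either the mapper already exists or the conflict is a phantom one.
func ensureMapper(k *phantomKeycloak, name string) error {
	err := k.createMapper(name)
	if errors.Is(err, errConflict) {
		return nil
	}
	return err
}

// verifyMapper is the VERIFY path. It checks the state independently and
// retries the create only when the mapper is missing.
func verifyMapper(k *phantomKeycloak, name string) error {
	k.verifyRuns++
	if k.mappers[name] {
		return nil
	}
	return ensureMapper(k, name)
}

// ensureAudienceScope runs both layers; verify is reached only because
// the create path returned nil.
func ensureAudienceScope(k *phantomKeycloak, name string) error {
	if err := ensureMapper(k, name); err != nil {
		return err
	}
	return verifyMapper(k, name)
}

func main() {
	k := &phantomKeycloak{phantom: true, mappers: map[string]bool{}}
	// Reconcile during the phantom conflict: no error, mapper still absent.
	fmt.Println(ensureAudienceScope(k, "demo-aud"), k.mappers["demo-aud"], k.verifyRuns)
	// Reconcile after the state clears: mapper gets created.
	k.phantom = false
	fmt.Println(ensureAudienceScope(k, "demo-aud"), k.mappers["demo-aud"], k.verifyRuns)
}
```

The trade-off this makes visible: while the conflict persists, the missing mapper is silent rather than logged, so recovery depends on a later reconcile actually running.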

New helper: deleteMapper

Standard DELETE to protocol-mappers/models/{mapperID}. Accepts 204 (success) and 404 (already gone) as success.
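
For reviewers who want the shape of the helper without opening the diff, a standalone approximation is below. It is a sketch under assumptions, not the code in this PR: the free-function signature, the `baseURL` parameter, and the `tryDelete` demo helper are hypothetical, and the full path prefix is assumed from the Keycloak Admin REST API layout for client-scope protocol mappers.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// deleteMapper issues a DELETE for one protocol mapper of a client scope.
// 204 (deleted) and 404 (already gone) both count as success, so the call
// is idempotent from the caller's point of view.
func deleteMapper(ctx context.Context, c *http.Client, baseURL, token, realm, scopeID, mapperID string) error {
	// Assumed endpoint layout; adjust if the deployment uses a path prefix.
	endpoint := fmt.Sprintf("%s/admin/realms/%s/client-scopes/%s/protocol-mappers/models/%s",
		baseURL, url.PathEscape(realm), url.PathEscape(scopeID), url.PathEscape(mapperID))

	req, err := http.NewRequestWithContext(ctx, http.MethodDelete, endpoint, nil)
	if err != nil {
		return fmt.Errorf("build delete request: %w", err)
	}
	req.Header.Set("Authorization", "Bearer "+token)

	resp, err := c.Do(req)
	if err != nil {
		return fmt.Errorf("delete mapper %s: %w", mapperID, err)
	}
	defer resp.Body.Close()

	switch resp.StatusCode {
	case http.StatusNoContent, http.StatusNotFound:
		return nil
	default:
		return fmt.Errorf("delete mapper %s: unexpected status %d", mapperID, resp.StatusCode)
	}
}

// tryDelete runs deleteMapper against a throwaway local server that answers
// every request with the given status code.
func tryDelete(code int) error {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(code)
	}))
	defer srv.Close()
	return deleteMapper(context.Background(), srv.Client(), srv.URL, "token", "demo", "scope-id", "mapper-id")
}

func main() {
	for _, code := range []int{http.StatusNoContent, http.StatusNotFound, http.StatusInternalServerError} {
		fmt.Printf("status %d -> err=%v\n", code, tryDelete(code))
	}
}
```

Treating 404 as success matters for the concurrent-reconcile case: if two reconciles both decide to delete the stale mapper, the loser should not turn that into a new error.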

Local Testing Results

Tested on Kind cluster with operator v0.2.0-rc.4 → fix-358:

  1. Before fix: a2a-currency-converter audience mapper error repeating every ~2s indefinitely
  2. After fix: Error loop broken. Initial burst of 409s on simultaneous reconciles (3 errors in 1s, expected for concurrent startup), then self-heals. Subsequent reconciles: zero errors.
  3. Verified in Keycloak: Scope agent-team1-a2a-currency-converter-aud created with correct oidc-audience-mapper, audience = SPIFFE URI, registered as realm default scope.
  4. AgentCard: Successfully fetched ("Currency Agent" version 1.0.0)

Test plan

  • go test ./internal/keycloak/ — 9 tests pass (3 new)
  • Local Kind cluster: error loop broken, mapper created correctly
  • CI passes (envtest + e2e)
  • Deploy to Kind cluster fresh, verify no audience mapper errors on first agent deploy
  • Verify A2A conversation succeeds without 401

Fixes #358

Assisted-By: Claude Code

@cwiklik cwiklik requested a review from a team as a code owner May 13, 2026 22:54
@cwiklik cwiklik force-pushed the fix/358-audience-mapper-wrong-type branch from 6abaa5e to 220b8ce Compare May 13, 2026 23:07
…pper (#358)

When a protocol mapper exists with the correct name but the wrong type
(not oidc-audience-mapper), the POST returns 409 and
updateAudienceMapperIfNeeded fails to find a matching mapper — entering
an infinite error loop that blocks audience scope propagation.

## Root Cause

The 409 Conflict from Keycloak means "a mapper with that name already
exists." But updateAudienceMapperIfNeeded only looks for mappers matching
BOTH Name == scopeName AND ProtocolMapper == "oidc-audience-mapper".
When the existing mapper has the right name but wrong type, the loop
skips it and falls through to "no matching audience mapper found."

This also prevents verifyAudienceMapper (defense-in-depth from PR #350)
from running, since getOrCreateAudienceClientScope returns early on the
error — no self-healing is possible.

## Fix

In updateAudienceMapperIfNeeded, after failing to find an
oidc-audience-mapper, perform a second pass looking for any mapper with
a matching name (regardless of type). If found, DELETE it via the
Keycloak Admin API, then re-POST the correct oidc-audience-mapper.

This is the minimal targeted fix — it handles the exact broken state
(wrong-type name collision) without restructuring the flow.

## Observed Symptoms

- Operator logs: "ensure audience mapper for existing scope ... no
  matching audience mapper found" repeating every few seconds
- Agent tokens lack the correct audience claim
- AuthBridge/Envoy rejects requests with 401 Unauthorized
- Affects fresh installs with operator v0.2.0-rc.4

Fixes #358

Signed-off-by: cwiklik <cwiklik@users.noreply.github.com>
Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: cwiklik <cwiklikj@gmail.com>
@cwiklik cwiklik force-pushed the fix/358-audience-mapper-wrong-type branch from 220b8ce to 95b7f28 Compare May 13, 2026 23:12


Development

Successfully merging this pull request may close these issues.

fix(keycloak): verifyAudienceMapper defense-in-depth never runs due to early return in getOrCreateAudienceClientScope
