feat(mariadb): topology merge + addon-api conformance (alpha.86 → alpha.90) #2633

Open
weicao wants to merge 79 commits into main from feat/mariadb-alpha37-semisync-fencing-pr

Conversation

Contributor

@weicao weicao commented May 9, 2026

Summary

Rolls up the MariaDB addon evolution from the alpha.37 semisync fencing baseline through alpha.86–90 topology merge and addon-api/12b source-traceability conformance. Final head dd8f2c43 is the live-validated alpha.90 chart with the source-traceability fixes from PR #2657 and PR #2663 already merged in.

Key changes

  • Topology merge (alpha.89): single replication topology with merged ComponentDefinition (cmpd-replication-merged.yaml) covers both async and semi-sync; legacy cmpd-semisync.yaml and cmpd-replication.yaml retained for in-place upgrade compatibility.
  • Synthetic-parameter mapper (alpha.89, C3 design): replicationMode={async,semisync} is the user-facing switch; CUE schema validates, addon-side mapper translates into the four real rpl_semi_sync_* engine variables at reconfigure-time, install-time seeder writes the same overrides at first boot. Three-layer fail-closed: Helm template, install-time, reconfigure-time.
  • Live N=1 first-blocker fixes (alpha.90): removed CUE bottom literal that broke live PD OpenAPI schema gen; added merged CmpD to ComponentVersion compatibility rules; dropped MySQL-specific rpl_semi_sync_master_wait_for_slave_count (unsupported in MariaDB 11.4); chart bumped alpha.89 → alpha.90 to escape CmpD immutability lock.
  • addon-api/12b conformance (PR #2657, "fix(mariadb): align README + chart claims with addon-api/12b acceptance"): README per-topology capability matrix (verified vs declared); explicit "not supported" declarations for PITR / encryption / sharding / proxy / hostNetwork / cross-namespace ServiceRef / multi-instance TLS; ComponentVersion evidence-boundary comment; BackupPolicyTemplate target block added (role: secondary, fallbackRole: primary, account: root).
  • Source-traceability fix (PR #2663, "fix(mariadb): document BPT per-topology resolution + helper-ize galera scripts CM name"): BackupPolicyTemplate target block comment expanded with per-topology resolution table, MySQL cross-engine comparison, PITR forward note; new mariadb.galera.scriptConfigMapName helper replaces three hard-coded literal references.

Validation

  • helm lint clean; helm template render diff clean across cleanup commits
  • live N=1 GREEN on idc/mariadb-test5 vcluster for mdb-alpha90-conformance-test (2-replica replication topology, role labels primary/secondary, runtime BPT verified): evidence sha256 518595924deb6ed53a1881353d814c137030c4331999e428ddecbe538d24c747
  • public hygiene grep clean across changed files

Follow-up

Full kubeblocks-tests suite (83 sections × 4 topologies) is in progress against the alpha.90 chart at this branch head to validate release-standard readiness.

weicao and others added 2 commits May 9, 2026 15:05
Add MariaDB 11.4 standalone, replication, semisync, and Galera chart resources.

Harden semisync startup, role publication, switchover fencing, and script distribution.

Add shell specs for replication member join, role probe, switchover, and standalone template mapping.
@weicao weicao requested review from a team and leon-ape as code owners May 9, 2026 07:06

codecov-commenter commented May 9, 2026

Codecov Report

❌ Patch coverage is 0% with 1350 lines in your changes missing coverage. Please review.
✅ Project coverage is 0.00%. Comparing base (69b3b6d) to head (b6ddb12).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...iadb/scripts-ut-spec/replication_roleprobe_spec.sh | 0.00% | 458 Missing ⚠️ |
| ...db/scripts-ut-spec/replication_member_join_spec.sh | 0.00% | 432 Missing ⚠️ |
| ...pts-ut-spec/semisync_rejoin_fence_template_spec.sh | 0.00% | 281 Missing ⚠️ |
| ...ripts-ut-spec/replication_user_convergence_spec.sh | 0.00% | 142 Missing ⚠️ |
| ...cripts-ut-spec/standalone_template_mapping_spec.sh | 0.00% | 37 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##            main   #2633     +/-   ##
=======================================
  Coverage   0.00%   0.00%             
=======================================
  Files         73      79      +6     
  Lines       9197   12535   +3338     
=======================================
- Misses      9197   12535   +3338     


weicao added 23 commits May 9, 2026 17:06
Keep the KubeBlocks health-check table schema on fresh replicas and clear only local rows before starting or repairing SQL replication. This prevents the replica repair path from changing a duplicate-key error into a missing-table replication error.
Require internal local admin read-only privileges before role decisions.

Track primary read/write readiness after local root unlock and read_only repair.

Repair syncer primary reconciliation when the listener is already exposed but local write readiness is missing.
KB kbagent enforces a hardcoded `maxActionCallTimeout = 60 * time.Second`
in `pkg/kbagent/service/action_utils.go::actionCallTimeoutContext`, so any
CmpD `switchover.timeoutSeconds` greater than 60 is silently truncated.
alpha.58 declared 240; live-test evidence (cost=60060ms result=timedOut
on the kbagent action HTTP call) confirmed the action script was killed
mid-flight at exactly 60 seconds.
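
A minimal sketch of the clamping described above, assuming the effective timeout is simply the smaller of the declared value and the 60-second ceiling. `effective_action_timeout` and `steps_fit_budget` are hypothetical helper names used only for illustration; they are not part of the chart or of kbagent. The per-step figures are the approximate estimates this commit message gives for `run_switchover`.

```shell
#!/bin/sh
# Illustrative only: models the 60s kbagent ceiling quoted in this commit message.
KBAGENT_MAX_ACTION_TIMEOUT=60

# Print the timeout kbagent would effectively honor for a declared value.
effective_action_timeout() {
  if [ "$1" -gt "$KBAGENT_MAX_ACTION_TIMEOUT" ]; then
    echo "$KBAGENT_MAX_ACTION_TIMEOUT"
  else
    echo "$1"
  fi
}

# Print the summed per-step estimates (seconds); return 0 when they fit
# under the ceiling, non-zero otherwise.
steps_fit_budget() {
  total=0
  for s in "$@"; do
    total=$((total + s))
  done
  echo "total=${total}s"
  [ "$total" -le "$KBAGENT_MAX_ACTION_TIMEOUT" ]
}

effective_action_timeout 240   # prints 60
steps_fit_budget 3 5 1 8       # prints total=17s
```

With these estimates the synchronous steps use roughly a third of the ceiling, which leaves headroom for slow DCS writes.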

alpha.59 contract:

* CmpD `switchover.timeoutSeconds: 240 -> 60` in cmpd-semisync.yaml and
  cmpd-replication.yaml so the declared contract reflects what kbagent
  actually enforces.
* `run_switchover` shrinks to four required steps that all must fit
  inside the 60s ceiling:
    1. `prepare_current_primary_for_switchover` (local prep, ~3s)
    2. `syncerctl_switchover` (DCS record, ~5s)
    3. `fence_current_primary_local_writes_after_dcs` (local read_only
       fence, ~1s) - retained synchronously because it is the
       double-writable defense and must be true before action returns
    4. `wait_candidate_remote_root_write_ready` (bounded ~8s probe,
       fail-closed) - the third leg of the action-success contract:
       never return 0 with a non-writable candidate.
* `wait_switchover_done`, `wait_post_switchover_stabilization`,
  `wait_primary_service_routes_candidate`,
  `wait_current_secondary_remote_root_fenced` are no longer invoked
  from `run_switchover`. Post-DCS convergence is delegated to roleProbe
  + KB endpoint controller. The negative assertion that none of these
  helpers fires lives in the new
  `replication_switchover_spec.sh` "alpha.59 contract" tests.
* `kubeblocks.kb_health_check` 1062/1146 repair migrates from the
  switchover action wait loops into the secondary roleProbe path
  (`secondary_kb_health_check_repair_attempt` in
  `replication-roleprobe.sh`). The repair has a precise signature
  (`Last_(SQL_)?Errno: 1062|1146` AND `kubeblocks.kb_health_check` in
  the slave error text), uses `kb_internal_root` (READ_ONLY ADMIN), is
  best-effort, idempotent, and logs each attempt with rc. Other SQL
  errors are NOT swallowed.
* ShellSpec gains six new examples for
  `secondary_kb_health_check_repair_attempt` covering the precise
  signature, the cli-user choice, the wrong-table negative case, the
  wrong-errno negative case, and the empty-status negative case, plus
  six examples for `slave_status_has_kb_health_check_repairable_error`.
* Two new `run_switchover` examples exercise the alpha.59 contract:
  the negative assertion that the wait_* helpers are never invoked,
  and the fail-closed path when the candidate write probe does not
  close inside the bounded budget. Three obsolete examples (which
  exercised `wait_switchover_done` directly) are removed.
* The runner-side post-OpsRequest convergence gate is
  test-runner-owned (separate change; out of this addon patch).
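
The repair signature in the list above can be sketched as a two-condition match. This is a hypothetical re-implementation for illustration, not the chart's `slave_status_has_kb_health_check_repairable_error`; it assumes the errno alternation is meant as `(1062|1146)` and, as a simplification, searches the whole status text for the table name. The status strings are made-up sample input.

```shell
#!/bin/sh
# Hypothetical sketch of the signature check; not the chart's actual helper.
# Input: the text of SHOW SLAVE STATUS\G. Returns 0 only when BOTH hold:
#   1. Last_Errno or Last_SQL_Errno is exactly 1062 or 1146
#   2. the status text mentions kubeblocks.kb_health_check
status_is_repairable() {
  printf '%s\n' "$1" | grep -Eq 'Last_(SQL_)?Errno: (1062|1146)([^0-9]|$)' || return 1
  printf '%s\n' "$1" | grep -Fq 'kubeblocks.kb_health_check' || return 1
}

# Made-up sample inputs for the demonstration.
dup_key="Last_SQL_Errno: 1062
Last_SQL_Error: Duplicate entry '1' for key 'PRIMARY' on table kubeblocks.kb_health_check"
other_table="Last_SQL_Errno: 1062
Last_SQL_Error: Duplicate entry '1' for key 'PRIMARY' on table app.orders"

status_is_repairable "$dup_key" && echo "repairable"
status_is_repairable "$other_table" || echo "not repairable: left for normal handling"
```

Requiring both conditions is what keeps other SQL errors from being swallowed: a 1062 on any other table falls through untouched.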

References:
  - apecloud/kubeblocks `pkg/kbagent/service/action_utils.go:64`
    (`maxActionCallTimeout = 60 * time.Second`)
  - addon-test-runner-write-after-bounded-role-gate guide
  - bootstrap-runner-preload-after-bounded-role-gate-case
Two design-contract gaps caught in pair review:

1. fence_current_primary_local_writes_after_dcs previously verified
   only @@global.read_only=1, never that a user-facing root INSERT
   was actually rejected by the read-only fence. The contract field
   was non-empty but unenforced at the write site (xp design-contract
   class 2). Add verify_post_dcs_local_root_write_fenced: runs a
   localhost user-facing root INSERT into
   kubeblocks.kb_post_dcs_fence_probe and classifies the result:
   rc=0 means the fence is not enforced (fail closed); rc!=0 with
   stderr containing 1290/read-only means the fence is verified.
   Other failure modes (no client, unrelated SQL error) also fail
   closed. Documentation in the function header records the
   contract change.

2. secondary_kb_health_check_repair_attempt previously did
   SET GLOBAL read_only=OFF -> DELETE -> SET GLOBAL read_only=ON,
   creating a small but real write window during which any client
   could have written to the secondary. This contradicts the
   double_writable=0 invariant the post-OpsRequest convergence test
   is meant to prove. Remove the read_only flip entirely: the repair
   now uses kb_internal_root (which holds READ_ONLY ADMIN from the
   addon's remote-root-fence path) and writes through while
   @@global.read_only=1 stays in place. If kb_internal_root cannot
   write for any reason, log rc and return; the next roleProbe tick
   re-evaluates.
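
The acceptance rule for the fence probe in item 1 can be sketched as a pure classifier over the INSERT's exit code and stderr. `classify_fence_probe` and its output strings are hypothetical names chosen for illustration; only the rule itself (rc=0 fails, 1290/read-only passes, everything else fails closed) comes from the text above.

```shell
#!/bin/sh
# Hypothetical classifier for the post-DCS fence probe result.
#   $1 = exit code of the user-facing root INSERT
#   $2 = stderr captured from that INSERT
# Prints a verdict and returns 0 only when the fence is verified.
classify_fence_probe() {
  if [ "$1" -eq 0 ]; then
    # The write went through, so read_only is not actually fencing root.
    echo "fail:write_succeeded"
    return 1
  fi
  case $2 in
    *1290*|*read-only*)
      echo "pass:fence_verified"
      return 0
      ;;
    *)
      # No client, unrelated SQL error, etc.: never treated as success.
      echo "fail:unrelated_error"
      return 1
      ;;
  esac
}

classify_fence_probe 1 "ERROR 1290 (HY000): server is running with the --read-only option"
classify_fence_probe 0 "" || true
classify_fence_probe 127 "mariadb: command not found" || true
```

Only one branch returns success, so any failure mode not anticipated here lands in the fail-closed default.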

ShellSpec changes:

* New Describe "verify_post_dcs_local_root_write_fenced()" with 4
  examples: 1290 rejection -> success; rc=0 -> fail-closed;
  unrelated error -> fail-closed; no client binary -> fail-closed.
* secondary_kb_health_check_repair_attempt "alpha.59 invariant"
  example now negative-asserts on SET GLOBAL read_only=OFF and
  SET GLOBAL read_only=ON. The earlier "fires repair" example drops
  the now-incorrect positive assertions for those two SQL statements.
* Existing happy-path run_switchover examples gain a
  verify_post_dcs_local_root_write_fenced stub (return 0) so the
  fence verification still passes inside the SQL-mock environment.

Total: 150 examples, 0 failures, 0 warnings.
…witchover

alpha.59 switchover N=1 RED with first-blocker = addon product / switchover
post-DCS root fence contract. Triple-source evidence: kbagent action cost
2.793s (NOT 60s cap; alpha.59 contract truncation works), action stderr
"post-DCS local-root write fence not enforced; user-facing root INSERT
succeeded after read_only=ON", SHOW GRANTS FOR root@% contains READ_ONLY
ADMIN, mysql.user shows root@127.0.0.1/root@localhost Insert_priv=Y
Super_priv=Y. Causal chain: addon apply_remote_root_fence "primary" granted
ALL PRIVILEGES (which in MariaDB 10.11+ bundles READ_ONLY ADMIN / SUPER /
BINLOG ADMIN), so user-facing root bypassed @@global.read_only=ON; the
alpha.59 verify_post_dcs_local_root_write_fenced caught it. This gap
existed in alpha.58 too but was masked by the absence of a verify probe.

alpha.60 hard contract (per Jack 23:28 8-class XP review):

* New revoke_user_facing_root_admin_privileges_for_secondary in
  replication-switchover.sh:
  - Enumerates mysql.user for actual root host rows (does not hardcode
    %/127.0.0.1/localhost; covers whatever the live DB actually has)
  - For each host: SHOW GRANTS first; if READ_ONLY ADMIN / SUPER /
    BINLOG ADMIN / ALL PRIVILEGES is present, REVOKE each bypass priv
    by name (never REVOKE ALL PRIVILEGES, never REVOKE GRANT OPTION as
    a privilege)
  - Distinct sentinel reasons per Jack class 4 (root_account_not_found,
    privilege_absent_already_fenced, revoked, revoke_failed) so closeout
    can attribute precisely
  - 1141 (no such grant) on REVOKE is treated as already-fenced; any
    other REVOKE error is fail-closed (Jack class 1: never silent
    fallback)
  - kb_internal_root is intentionally OUT of scope; it must keep
    READ_ONLY ADMIN for the alpha.59 secondary roleProbe 1062 repair
    path
  - All SQL is via the kb_internal_root client (ROOT_LOCAL bypass not
    used; revoking your own privilege mid-statement is risky)
  - FLUSH PRIVILEGES + mysql.user snapshot logged at end

* fence_current_primary_local_writes_after_dcs gains the revoke step
  between local_read_only_is "1" and verify_post_dcs_local_root_write_fenced.
  Failed revoke -> immediate return 1; no partial fence.

* apply_remote_root_fence "primary" in replication-roleprobe.sh: the
  GRANT ALL PRIVILEGES is replaced with an explicit privilege list that
  EXCLUDES SUPER / READ_ONLY ADMIN / BINLOG ADMIN. GRANT OPTION is now
  only via the trailing WITH GRANT OPTION clause (per Jack: putting it
  in the comma-separated privilege list is a syntax error in some
  MariaDB versions). This prevents alpha.61 from re-introducing the same
  bypass through normal role transitions.
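
The detect-then-revoke logic above can be sketched without a live server by working on captured `SHOW GRANTS` text. Both function names are hypothetical, the substring match is a simplification, and the grant strings are made-up sample input; the sentinel names (`revoked`, `privilege_absent_already_fenced`, `revoke_failed`) are the ones listed in this commit message.

```shell
#!/bin/sh
# Hypothetical helpers; not the chart's actual implementation.

# Print, one per line, each read_only-bypass privilege found in the
# SHOW GRANTS text passed as $1.
bypass_privs_in_grants() {
  for priv in "READ_ONLY ADMIN" "SUPER" "BINLOG ADMIN" "ALL PRIVILEGES"; do
    case $1 in
      *"$priv"*) printf '%s\n' "$priv" ;;
    esac
  done
}

# Map a REVOKE result ($1 = rc, $2 = stderr) to a sentinel reason.
# Error 1141 (no such grant) counts as already fenced; any other
# failure is fail-closed.
classify_revoke_result() {
  if [ "$1" -eq 0 ]; then
    echo "revoked"
    return 0
  fi
  case $2 in
    *1141*) echo "privilege_absent_already_fenced"; return 0 ;;
    *)      echo "revoke_failed"; return 1 ;;
  esac
}

# Made-up sample grant line for the demonstration.
bypass_privs_in_grants 'GRANT SELECT, INSERT, SUPER, BINLOG ADMIN ON *.* TO `root`@`%`'
classify_revoke_result 1 "ERROR 1141 (42000): There is no such grant defined"
classify_revoke_result 1 "ERROR 2002 (HY000): Can't connect" || true
```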

ShellSpec increments (10 new examples, 0 failures, 0 warnings, 157 total):

* Describe "revoke_user_facing_root_admin_privileges_for_secondary()"
  6 examples covering each sentinel: account-not-found skip, multi-host
  revoke success, multi-host with one fail-closed, 1141 already-fenced,
  no-bypass-priv already-fenced, no-client fail-closed
* Describe "fence_current_primary_local_writes_after_dcs() revoke
  fail-closed" 1 example asserts verify probe is NOT called when revoke
  fails (negative trip-wire)
* Existing happy-path run_switchover examples gain
  revoke_user_facing_root_admin_privileges_for_secondary stub (return 0)
  alongside the existing verify_post_dcs stub
* roleprobe primary fence example asserts the new grant: REVOKE ALL
  PRIVILEGES present, GRANT ALL PRIVILEGES NOT present, SUPER NOT
  present, READ_ONLY ADMIN NOT present, BINLOG ADMIN NOT present,
  ", GRANT OPTION," (in the privilege list) NOT present, WITH GRANT
  OPTION (trailing clause) present

Caveat: cmpd-semisync.yaml's set_local_root_account_state and
set_remote_root_account_state UNLOCK paths still re-grant ALL PRIVILEGES;
those are runtime sql-listener-fence transitions, not switchover-time
operations. Post-switchover their re-grant would have to be revoked again
on next switchover. Cleaning those up is alpha.61+ scope; alpha.60 trusts
switchover-time revoke as the immediate fix.

References:
  - alpha.59 RED closeout msg 80e3b77c (4-source confirmation)
  - alpha.59 design contract review msg 9e722fa8 (8-class)
  - addon-test-runner-write-after-bounded-role-gate-guide.md (companion
    methodology for the fence-correctness invariant)
weicao and others added 27 commits May 13, 2026 01:33
…R to avoid mariadb image entrypoint side-effect (alpha.74 v1)

alpha.73 v1 N=1 partial first-blocker: pod-1 Slave_SQL_Running=No,
Last_SQL_Errno=1396 "Operation CREATE USER failed for kb_replicator@%"
on binlog replay. Direct evidence in n1h tar
3a8eccf7ef75ea4dd4214a8630a7b796afa33325a1832037cc6a8c5940254b00:
Query: CREATE USER 'kb_replicator'@'%' IDENTIFIED BY '<redacted>'
(no IF NOT EXISTS clause).

Root cause: mariadb 11.4 image entrypoint
/usr/local/bin/docker-entrypoint.sh examines MARIADB_REPLICATION_USER
env at initdb time. If set, it runs CREATE USER + GRANT REPLICATION
SLAVE WITHOUT IF NOT EXISTS, WITHOUT SET SESSION sql_log_bin=0
wrapper. Statement is binlogged. pod-1 ALSO runs entrypoint initdb
which creates kb_replicator locally; then pod-1 START SLAVE replays
pod-0 binlog -> 1396.

Fix: STOP setting MARIADB_REPLICATION_USER env entirely (so mariadb
entrypoint doesn't trigger that CREATE/GRANT path). Introduce a
renamed MARIADB_REPL_USER shell variable for chart scripts. syncer
Go binary continues to read MYSQL_REPLICATION_USER (engines/mariadb/
config.go:80) -- this env name is NOT consumed by mariadb image
entrypoint, so syncer path stays converged to kb_replicator with no
entrypoint side-effect. Chart's ensure_internal_local_admin still
creates kb_replicator@'%' via INTERNAL_LOCAL with sql_log_bin=0,
so chart's CREATE USER is NOT binlogged. Both pods bootstrap
kb_replicator locally (idempotent) and no binlog event needs to
replay through START SLAVE.

Changes:
- Chart.yaml: bump alpha.73 -> alpha.74 + alpha.74 v1 comment block
- cmpd-semisync.yaml env: remove MARIADB_REPLICATION_USER +
  MARIADB_REPLICATION_PASSWORD; add MARIADB_REPL_USER +
  MARIADB_REPL_PASSWORD. Keep MYSQL_REPLICATION_USER +
  MYSQL_REPLICATION_PASSWORD.
- cmpd-semisync.yaml inline CHANGE MASTER (4 sites): MASTER_USER=
  '${MARIADB_REPL_USER:-kb_replicator}'
- cmpd-semisync.yaml ensure_internal_local_admin shell var sourcing:
  replication_user="$(sql_quote "${MARIADB_REPL_USER:-kb_replicator}")"
- replication-member-join.sh (2 sites): MASTER_USER=
  '${MARIADB_REPL_USER:-kb_replicator}'
- replication_user_convergence_spec.sh: update Gate 2 (semisync
  MARIADB_REPL_USER positive + MARIADB_REPLICATION_USER negative),
  Gate 3 (member-join + inline assertions use REPL_USER), Gate 4
  (shell var sourcing), Gate 5 (MARIADB_REPL_PASSWORD positive +
  MARIADB_REPLICATION_PASSWORD negative). Total 25 examples (was 23).
- replication_switchover_spec.sh: bump alpha version literals
  (alpha.73 -> alpha.74).

Static gates: helm lint PASS, bash -n / dash -n PASS, ShellSpec
309/0. Live N=0 (alpha.74 new patch-version window, does not
inherit alpha.73 partial).

westonnnn `df3c94b0` 01:28 enabled 12h autopilot. TL self-determined
without blocking on Jack XP review (Jack review parallel, msg
`100a28e1` + revised `bd7c5dff`). Cindy PM boundary kept at msg
`28cf6bfe`.
…in fallback (alpha.74 v1)

Jack XP review HOLD at msg 42405f6d 01:32 caught: replication-member-
join.sh is shared by semisync and replication topology. alpha.72 v1
Option 1 scope-cap requires replication topology fallback to
MARIADB_ROOT_USER (root) when MARIADB_REPL_USER env is absent. My
first push used '${MARIADB_REPL_USER:-kb_replicator}' which broke that
contract for replication-only pods.

Fix: restore chained fallback '${MARIADB_REPL_USER:-${MARIADB_ROOT_USER}}'.
Semisync pods set MARIADB_REPL_USER=kb_replicator -> use kb_replicator.
Replication-topology pods don't set the env -> fall through to
MARIADB_ROOT_USER (root) -> pre-alpha.72 behavior preserved.
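
A short sketch of how the chained fallback resolves in the two topologies. `resolve_master_user` is a hypothetical wrapper added for illustration; the parameter expansion inside it is the one quoted in the fix above.

```shell
#!/bin/sh
# Hypothetical wrapper around the chained fallback from this commit.
# Uses MARIADB_REPL_USER when it is set and non-empty, otherwise
# falls through to MARIADB_ROOT_USER.
resolve_master_user() {
  printf '%s\n' "${MARIADB_REPL_USER:-${MARIADB_ROOT_USER}}"
}

# Semisync-style environment: the dedicated replication user wins.
( MARIADB_REPL_USER=kb_replicator; MARIADB_ROOT_USER=root; resolve_master_user )
# prints kb_replicator

# Replication-topology environment: the variable is absent, so root is used.
( unset MARIADB_REPL_USER; MARIADB_ROOT_USER=root; resolve_master_user )
# prints root
```

Because `:-` also treats an empty value as unset, a pod that exports `MARIADB_REPL_USER=""` resolves to root as well.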

ShellSpec Gate 3 member-join assertion updated to reflect chained
fallback.

Also adds documented evidence (in commit msg, not code) that mariadb
11.4 entrypoint /usr/local/bin/docker-entrypoint.sh greps for
MARIADB_REPLICATION_USER only:
- Line 178: if [ -n "$MARIADB_REPLICATION_USER" ]; then
- Line 278-280: file_env 'MARIADB_REPLICATION_USER' / _PASSWORD / _HASH
- Line 333 + 338: CREATE USER '$MARIADB_REPLICATION_USER'@'%' (no IF NOT EXISTS, binlogged)
- Line 464: if [ -n "$MARIADB_REPLICATION_USER" ]; -> Creating user
- Zero (0) grep matches for MYSQL_REPLICATION_USER -- this env name is NOT consumed by the mariadb 11.4 image entrypoint, so setting MYSQL_REPLICATION_USER for syncer Go binary doesn't trigger entrypoint side-effect.

This satisfies Jack blocker #2 (direct entrypoint evidence) and
blocker #1 (member-join fallback preserves scope-cap).

ShellSpec: 309/0. helm lint PASS. bash -n / dash -n PASS.
…ntract fix

alpha.74 v1 switchover idle-state N=1 RED on n1u revealed inherited
contract drift between alpha.61 v3 secondary fence (REVOKE ALL +
GRANT non-bypass minimum list, NO BINLOG ADMIN) and the verifier
verify_post_dcs_local_root_write_fenced() preamble. The preamble ran
SET SESSION sql_log_bin=0 + DDL as user-facing root; after demote
root lacks BINLOG ADMIN so the verifier failed with 1227 on the
preamble before reaching the actual fence test.

Fix (Jack XP A2 -> B final):
- Strip preamble in verifier; only user-facing root INSERT runs
- Move probe table create to bootstrap-time ensure_internal_local_admin
  in cmpd-semisync.yaml (INTERNAL_LOCAL, binlog-replay-safe)
- Acceptance contract narrows: rc=0 FAIL; 1146/1227/1044 FAIL with
  distinct regression-guard sentinels; 1290/read-only PASS only

alpha.61 fence safety model preserved: user-facing root still NOT
granted BINLOG ADMIN. Fix is in the verifier contract, not in root
privileges.

ShellSpec hard gates (4 new + 2 contract):
- verifier body must not contain SET SESSION sql_log_bin=0
- verifier body must not contain CREATE DATABASE / CREATE TABLE
- 1146/1227/1044 FAIL with distinct regression-guard sentinels
- 1290 PASS
- cmpd-semisync.yaml ensure_internal_local_admin contains probe
  table create
- Chart.yaml version = 1.1.1-alpha.75; no stale 1.1.1-alpha.74

Static gates: helm lint / bash -n / dash -n / ShellSpec 318/0 PASS.

Boundary: alpha.74 fresh-bootstrap N=3 GREEN unchanged; alpha.74
switchover idle-state N=1 RED preserved on n1u as canonical anchor;
alpha.75 N=0 fresh start for switchover idle-state axis.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…= PD addition only; bump-to-escape-CmpD-immutability)

Bundles alpha.76 through alpha.85 chart work that was carried in working
tree across the recent autopilot cycles. Chart.yaml comment block
documents each alpha bump rationale individually (alpha.76 marker race
defense through alpha.83 reconfigure account switch through alpha.84 v2
semisync ParametersDefinition completion, then alpha.85 pure version
bump to escape KB CmpD immutability after v1 / v2 dry-run cycle on
alpha.84).

alpha.84 v1 first draft attempted to address two compounding causes of
the alpha.83 clean-r1 false-success: (A) semisync had no
ParametersDefinition so KB Configure controller defaulted to rolling
restart, and (B) reconfigureAction only ran SET GLOBAL with no
persistence so any mariadbd restart erased the runtime state. v1 added
both a semisync PD AND a defense-in-depth persistence layer.

v1 N=1 fresh dry-run found that the KB Configure controller parses
ConfigMap-stored my.cnf with a strict INI parser
(fileFormatConfig.format: ini); MariaDB !includedir directive is not
valid INI key=value syntax and made the parser throw a key-value
delimiter not found error before reconfigureAction could run.
OpsRequest entered Failed at pre-action parse step. v1 evidence ns
mariadb-t6-alpha84-n1-0547 frozen as parse-failure evidence
(attachment tar sha 9a7a74fecb42ed9547a6eab381856f98bc113a3f1322f8672c5542a25e233469).

alpha.84 v2 amend reverted the persistence layer entirely and kept
only the semisync PD addition. v2 dry-run failed at the deploy gate
because the v1 install had already created an alpha.84 CmpD in the
test cluster, and KB CmpD immutability rule rejected the v2 install
with phase=Unavailable / immutable fields can't be updated (evidence
tar sha bf50831ad968f9354c62e5909dae126d4df39c4f6ce3c6e1b9a4f4546293b7af).

alpha.85 (this version) is a pure version bump on top of alpha.84 v2
to escape that immutability lock. Chart source is byte-identical to
alpha.84 v2 except the version literal; functional scope remains the
minimal semisync PD addition (cause A from alpha.83 clean-r1).
Persistence redesign (cause B) is deferred to alpha.86 with a probable
approach of mariadbd --defaults-extra-file pointing at a single PVC
backed override file outside KB ParametersDefinition scope.

Fix in alpha.85:
  - templates/paramsdef.yaml: new mariadb-semisync-pd declaration with
    componentDef regex caret mariadb-semisync- and templateName
    mariadb-semisync-config (matches cmpd-semisync.yaml configs[0].name,
    not the helm template object name; Galera comment block already
    documents the same trap). Uses the same staticParameters /
    dynamicParameters source as the other topology PDs
    (config/mariadb-config-effect-scope.yaml). With this PD in place,
    the KB Configure controller can classify slow_query_log /
    long_query_time / etc. as dynamic and skip rolling restart.

Test deltas (kept from v1/v2 because they are correct independent of
the persistence rollback and the alpha.85 bump):
  - scripts-ut-spec/replication_switchover_spec.sh: chart version
    literal assertions bumped to alpha.85; @ percent grant allowlist
    adds SLAVE MONITOR (alpha.81 contract).
  - scripts-ut-spec/replication_user_convergence_spec.sh: version
    literal bumped to alpha.85; prior-version negative check updated
    to alpha.84.
  - scripts-ut-spec/semisync_rejoin_fence_template_spec.sh:
    CMPD_SECONDARY_FENCE_GRANT_BODY literal adds SLAVE MONITOR
    (no version-keyed change).

Verification: helm lint PASS; helm template PASS (no !includedir, no
OVERRIDES_DIR, no runtime-overrides references in rendered output;
semisync PD rendered with correct componentDef regex and templateName;
chart label rendered as mariadb-1.1.1-alpha.85); shellspec
addons/mariadb/scripts-ut-spec/ 333 examples 0 failures 7 pendings
(7 pendings are pre-existing obsolete tests, not new debt).

Runtime closeout is NOT included in this commit. alpha.85 still
needs fresh N>=3 to clear the 3-item runtime gate: (1) semisync PD
dynamic hit, (2) no rolling restart / switchover triggered by
reconfigure, (3) both pods reflect new values within bounded wait.
The gate verifying that a forced mariadbd process restart preserves
the runtime state is deferred to alpha.86 along with the persistence
redesign.

Peer reviewed by Jack across four rounds (v1: 3 blockers closed
WARN-only persist to fail-closed, shared helper to semisync-only
persisted variant, ShellSpec version + grant assertions synced;
v1 release-note: WARN-only language to fail-closed; v2: persistence
layer reverted after parser blocker discovered in dry-run; alpha.85
bump after CmpD immutability lock blocked v2 deploy).
…riadbd --defaults-extra-file (semisync only)

Builds on alpha.85 (commit 61cc382 PD addition only). alpha.85 dry-run
by Jack confirmed PD addition alone is necessary but insufficient:
KB Configure controller dynamic-classifies the params correctly, but
the InstanceSet update reconciler (pkg/controller/instanceset/
reconciler_update.go) unconditionally triggers
lifecycleActions.switchover before any in-place Pod template update,
including config-hash-only annotation syncs from dynamic reconfigure.
That bug is owned by Lily / Edward / Rocco in PR #10252.

But mariadbd process restart can also happen from paths unrelated to
the controller bug (OOMKill, node drain, manual restart, syncer
failover after health probe loss). In any of those paths, SET GLOBAL
runtime state from reconfigureAction is wiped and the runtime
invariant reverts to chart defaults. alpha.86 adds a defense-in-depth
persistence layer that is independent of #10252 and works for any
restart path.

Design (avoids the alpha.84 v1 INI parser FAIL by keeping !includedir
out of KB-managed config):
  - init-syncer (cmpd-semisync.yaml):
      mkdir -p /var/lib/mysql/runtime-overrides.d  (per-param .cnf files)
      chgrp 1000 + chmod 0770  (g+rwx for kbagent gid 1000 write access)
      create loader file /var/lib/mysql/runtime-overrides.cnf with one
      line: !includedir /var/lib/mysql/runtime-overrides.d/
      chgrp 1000 + chmod 0660 on loader file (mariadbd-readable,
      kbagent-writable)
  - start_mariadbd_process (cmpd-semisync.yaml line 1418):
      mariadbd --defaults-extra-file=/var/lib/mysql/runtime-overrides.cnf
      ... (rest of args). --defaults-extra-file is the FIRST mariadbd
      option per MariaDB requirement; mariadbd silently accepts a
      missing directory so fresh bootstrap works before any reconfigure.
  - _helpers.tpl: NEW mariadb.config.reconfigureAction.persisted helper
    (semisync only). After each successful SET GLOBAL, writes per-param
    /var/lib/mysql/runtime-overrides.d/<name>.cnf via temp file + atomic
    rename. Fail-closed on mkdir / tmp write / mv / parse smoke failure
    (exit 1 after tmp cleanup or bad-file removal).
  - cmpd-semisync.yaml configs[].reconfigure: include the .persisted
    variant instead of the base helper.
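
The temp-file-plus-rename write in the design above can be sketched as follows. `persist_override` is a hypothetical function, and both the `[mysqld]` section header and the temp-file naming are assumptions about the file layout; the real helper in `_helpers.tpl` may differ.

```shell
#!/bin/sh
# Hypothetical sketch of a fail-closed, atomic per-parameter write.
#   $1 = overrides directory, $2 = parameter name, $3 = value
persist_override() {
  dir=$1
  name=$2
  value=$3
  mkdir -p "$dir" || return 1
  # The temp name does not end in .cnf, so an !includedir scan of the
  # directory would not pick up a half-written file.
  tmp="$dir/.$name.cnf.tmp.$$"
  if ! printf '[mysqld]\n%s=%s\n' "$name" "$value" > "$tmp"; then
    rm -f "$tmp"
    return 1
  fi
  # rename(2) within one filesystem is atomic: readers see either the
  # old file or the complete new one.
  if ! mv -f "$tmp" "$dir/$name.cnf"; then
    rm -f "$tmp"
    return 1
  fi
}

demo_dir=$(mktemp -d)
persist_override "$demo_dir" slow_query_log ON && cat "$demo_dir/slow_query_log.cnf"
rm -rf "$demo_dir"
```

Every failure path removes the temp file before returning non-zero, which matches the "exit 1 after tmp cleanup" wording in the design.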

Jack 5-guard enforcement (peer review msg a13b8850 + 06:49 follow-up):
  1. kbagent write permission: chgrp 1000 + chmod 0770 (g+rwx) on
     runtime-overrides.d; chmod 0660 on loader file.
  2. --defaults-extra-file is the FIRST mariadbd option: positional
     assertion in ShellSpec (not just grep), checking source-order
     adjacency to docker-entrypoint.sh mariadbd line.
  3. param name/value injection defense: name regex ^[A-Za-z0-9_.-]+$;
     value rejects newline (\n \r), NUL and other control chars
     (\x00-\x1f \x7f), bracketed section markers like [mysqld].
  4. fail-closed sentinels: mkdir / tmp write / mv / parse smoke all
     exit 1 with cleanup. WARN-only regression guard in ShellSpec
     ensures no degradation back to alpha.84 v1 first-draft semantics.
  5. parse smoke after each persist: runs mariadbd
     --defaults-extra-file=<loader> --print-defaults >/dev/null to
     catch mariadb-syntax-invalid input that would crash the engine
     on next restart. Failure removes the corrupted param file and
     exits 1.
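
Guard 3 can be sketched as two small validators. These are hypothetical implementations written for illustration; the chart's own `is_safe_param_name` / `is_safe_param_value` may differ in detail. The name rule is the regex quoted above; the value rule assumes "control chars" means bytes `\x00-\x1f` and `\x7f`.

```shell
#!/bin/sh
# Hypothetical sketches of the injection-defense validators.
LC_ALL=C
export LC_ALL

# Name must match ^[A-Za-z0-9_.-]+$ (non-empty, no other characters).
is_safe_param_name() {
  case $1 in
    ''|*[!A-Za-z0-9_.-]*) return 1 ;;
  esac
}

# Value must contain no control characters and no [section] marker.
is_safe_param_value() {
  # Strip control bytes; any difference means the value contained one.
  # Command substitution also drops trailing newlines, so a trailing
  # newline is caught by the same comparison.
  stripped=$(printf '%s' "$1" | tr -d '\000-\037\177')
  [ "$stripped" = "$1" ] || return 1
  case $1 in
    *\[*\]*) return 1 ;;
  esac
}

is_safe_param_name slow_query_log && echo "name ok"
is_safe_param_name 'x; rm -rf /' || echo "name rejected"
is_safe_param_value '[mysqld]' || echo "value rejected"
```

Rejecting newlines matters most here: without it a single value could smuggle an extra `key=value` line into the persisted file.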

Tests:
  - scripts-ut-spec/reconfigure_persisted_alpha86_spec.sh (NEW): 33
    source-level contract tests covering 5 guards + semisync-only scope
    + KB-managed config has NO !includedir (alpha.84 v1 regression
    guard) + effect-scope still includes T6 target params.
  - replication_switchover_spec.sh / replication_user_convergence_spec.sh:
    chart version literal assertions bumped from alpha.85 to alpha.86;
    prior-version negative checks updated accordingly.

Verification: helm lint PASS; helm template PASS (--defaults-extra-file
and OVERRIDES_DIR appear ONLY in cmpd-semisync.yaml rendered output);
shellspec addons/mariadb/scripts-ut-spec/ 386 examples 0 failures 7
pendings (333 pre-existing + 53 new focused (5 guards static + B1 behavioral subshell tests + B2 main-container chown-R-survival regression); 7 pendings are
pre-existing obsolete tests, not new debt).

Runtime closeout is NOT included in this commit. alpha.86 still needs
fresh N>=3 to clear the 4-group runtime gate:
  Group 1 - render/static: PD + loader + dir + first-arg + fail-closed
    contracts.
  Group 2 - runtime reconfigure: OpsRequest Succeed + both pods SHOW
    GLOBAL ON/3 within bounded wait.
  Group 3 - process restart: kill -TERM mariadbd; verify new PID and
    SHOW GLOBAL still ON/3 (proves --defaults-extra-file loads the
    persisted override on startup; independent of #10252).
  Group 4 - controller combo: after PR #10252 patch image lands,
    verify config-hash-only update does NOT trigger switchover; Rocco
    gate 1 (controller image identity) + Rocco gate 2 (pod YAML diff
    is config-hash-only).

Peer reviewed by Jack in two rounds (design ack msg a13b8850 with 5
hard guards + 06:49 follow-up adding g+rwx on dir and parse smoke
first-option constraint). All guards encoded in code and ShellSpec
regression tests.

alpha.87 v1 amend (Helen 2026-05-19 07:38) — chart version literal
bumped from 1.1.1-alpha.86 to 1.1.1-alpha.87 to escape KB CmpD
immutability after the alpha.86 amend added parse smoke stderr
capture in the persisted helper. The test vcluster already had an
alpha.86 CmpD installed from Jack's first alpha.86 dry-run; KB
rejected the upgrade with phase=Unavailable / immutable fields
can't be updated. Same KB CmpD immutability rule the Chart.yaml
comment block has tracked since alpha.65. ShellSpec version
literal assertions also bumped to alpha.87.

alpha.88 v1 amend (Helen 2026-05-19 07:48) — DROP parse smoke
entirely after alpha.86 + alpha.87 dry-runs (Jack msg e6afaa1a).
Two root causes made parse smoke unworkable in kbagent action
runtime:
  1. mariadbd is NOT on PATH in the kbagent action context
     (the smoke command-substitution returned rc=127).
  2. set -e + var=$(failing_cmd) caused the shell to exit
     immediately on the failed substitution, bypassing the
     stderr-print and bad-file cleanup that should have surfaced the
     diagnosis and removed the orphan .cnf file.
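
Root cause 2 is a general shell pitfall and can be reproduced without
mariadbd. The sketch below is illustrative only: the inner
`sh -c "exit 127"` is a stand-in for the missing binary, and
run_unsafe / run_safe are hypothetical names, not functions from the
persisted helper.

```sh
#!/bin/sh
# Pitfall: under `set -e`, a plain assignment from a failing command
# substitution terminates the shell before any cleanup line runs.
run_unsafe() {
  sh -c '
    set -e
    smoke_out=$(sh -c "exit 127" 2>&1)   # shell exits here with 127
    echo "cleanup reached"               # never printed
  '
}

# Safer shape: use the assignment as an `if` condition. `set -e` is
# suspended for the condition, and $? is still available in `else`.
run_safe() {
  sh -c '
    set -e
    if smoke_out=$(sh -c "echo parse error >&2; exit 127" 2>&1); then
      echo "smoke ok"
    else
      rc=$?
      echo "smoke failed rc=$rc: $smoke_out"
      echo "cleanup reached"
    fi
  '
}
```

run_unsafe exits 127 with no output, which matches the silent failure
described above; run_safe reports the status and still reaches its
cleanup line.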

Remaining defenses (Guards 1-4) cover the threat model:
  - injection defense (is_safe_param_name + is_safe_param_value)
    rejects unsafe names / values / control chars / section markers.
  - atomic temp + mv guarantees no half-written file.
  - mariadbd's own error log on next restart is the authoritative
    validation surface; it sees the file in its real runtime
    context, not the kbagent-synthesized PATH.
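
A minimal sketch of how the first two defenses can combine. The
bodies of is_safe_param_name / is_safe_param_value are not shown in
this commit message, so the character allow-lists below are
assumptions, and write_override_atomic is a hypothetical wrapper
name.

```sh
#!/bin/sh
# Assumed allow-list: letters, digits, underscore. Rejects empty
# names, whitespace, '=', '[' / ']' section markers, control chars.
is_safe_param_name() {
  case "$1" in
    ''|*[!A-Za-z0-9_]*) return 1 ;;
  esac
  return 0
}

# Assumed allow-list for values: adds '.' and '-' to the name set.
is_safe_param_value() {
  case "$1" in
    ''|*[!A-Za-z0-9_.-]*) return 1 ;;
  esac
  return 0
}

# write_override_atomic TARGET NAME VALUE
# Validates first, writes to a temp file beside TARGET, then renames.
# A rename within one filesystem is atomic, so readers see either the
# old file or the complete new one, never a partial write.
write_override_atomic() {
  target=$1 name=$2 value=$3
  is_safe_param_name "$name" || return 1
  is_safe_param_value "$value" || return 1
  tmp=$(mktemp "${target}.XXXXXX") || return 1
  if printf '[mysqld]\n%s = %s\n' "$name" "$value" > "$tmp"; then
    mv -f "$tmp" "$target"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

Validation runs before mktemp, so a rejected name or value leaves no
file behind, which is the property the dropped parse smoke cleanup
was meant to provide.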

Test deltas in alpha.88 v1:
  - chart version literal assertions bumped from alpha.87 to
    alpha.88.
  - reconfigure_persisted_alpha86_spec.sh: parse smoke presence
    tests REPLACED by parse smoke absence regression guards
    (must NOT contain mariadbd --defaults-extra-file= /
    --print-defaults / smoke_out=$( / Parse smoke failed). Guard 5
    section kept inline as archeology comment.

Same KB CmpD immutability rule applies (rendered cmpd-semisync.yaml
content changed because the persisted helper body shrank).
…on topology + merged CmpD

First scaffolding commit for the replication+semisync topology merge
under weston Option B (2026-05-19 14:12 msg d8ecfae7). Folds Jack
design review v3.1 (15:36) 4 blockers + 3 non-blockers, and the
scaffolding review (15:42) 2 blockers + 1 caveat.

User-facing API change (breaking):
- clusterdefinition.yaml lists only the `replication` topology for
  primary/secondary. The old `async` and `semisync` topology names
  are removed. Users with existing Cluster CRs must actively migrate
  to spec.topology=replication plus a ComponentSpec parameter
  replicationMode=async|semisync. The migration is documented as
  the main release-notes route; any KB auto-rebind on a topology
  compDef rename is exploratory and verified in a separate upgrade
  gate N=1 (not promised in release notes).

New artifacts:
- templates/_helpers.tpl: helpers for the merged CmpD name and a
  narrow PD regex pattern `^mariadb-replication-merged-`. Also
  narrows the old `mariadb-replication` regex from
  `^mariadb-replication-` to `^mariadb-replication-[0-9]` so the
  old PD does not silently double-match the new merged CmpD name
  (scaffolding review Blocker 1).
- templates/cmpd-replication-merged.yaml: a new ComponentDefinition.
  Starts as a near-verbatim copy of cmpd-semisync.yaml. The header
  comment now correctly describes the Option B layout: old CmpDs
  are kept as compat resources for already-bound Clusters, NOT as
  ClusterDefinition topology fallback (scaffolding review Blocker 2).
- templates/paramsdef.yaml: adds a single ParametersDefinition
  `mariadb-replication-merged-pd` bound to the merged CmpD via the
  narrow regex.
- templates/pcr.yaml: adds the corresponding ParamConfigRenderer.
  Also fixes a pre-existing bug the scaffolding review surfaced: the
  semisync PCR's parametersDefs list referenced `mariadb-replication-pd`
  instead of `mariadb-semisync-pd`. Under the previous wide regex
  this was harmless; under the narrowed regex it would silently
  leave the deprecated semisync CmpD's PCR pointing at an unrelated
  PD (scaffolding review caveat).
- scripts-ut-spec/replication_merged_pd_regex_disambiguation_spec.sh:
  25-test ShellSpec contract that locks the regex disambiguation
  property — each of the five rendered CmpD names matches its own
  regex and none of the other four. Prevents future edits from
  silently reintroducing the overlap.
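
The disambiguation property can be sketched with plain grep -E. The
three regexes are the ones quoted above; the two CmpD names are
assumed examples (the version suffix is illustrative), and matches /
check_disambiguation are hypothetical helper names, not the ShellSpec
itself.

```sh
#!/bin/sh
# PD regexes quoted in this commit message.
old_pd_regex='^mariadb-replication-[0-9]'
merged_pd_regex='^mariadb-replication-merged-'
wide_regex='^mariadb-replication-'        # pre-narrowing pattern

# Assumed rendered CmpD names; the version suffix is illustrative.
old_name='mariadb-replication-1.1.1-alpha.89'
merged_name='mariadb-replication-merged-1.1.1-alpha.89'

# matches NAME REGEX: succeeds when NAME matches the extended regex.
matches() { printf '%s\n' "$1" | grep -Eq -- "$2"; }

# Succeeds only when each name matches its own regex and not the
# other one.
check_disambiguation() {
  matches "$old_name" "$old_pd_regex"       || return 1
  matches "$merged_name" "$merged_pd_regex" || return 1
  ! matches "$old_name" "$merged_pd_regex"  || return 1
  ! matches "$merged_name" "$old_pd_regex"  || return 1
}
```

With the pre-narrowing pattern, `matches "$merged_name" "$wide_regex"`
succeeds, which is the double-match Blocker 1 describes.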

Deprecated artifacts kept for the alpha.89 cycle (not listed in
clusterdefinition.yaml.topologies, scheduled for removal once the
upgrade gate closes):
- templates/cmpd-replication.yaml (old async CmpD).
- templates/cmpd-semisync.yaml (old semisync CmpD).

What this commit does NOT do (subsequent commits in the same PR):
- Parameterize the merged CmpD on `replicationMode`. Today the
  merged CmpD behaves identically to mariadb-semisync.
- Rename the merged CmpD's configspec from `mariadb-semisync-config`
  to `mariadb-replication-config`.
- Update the switchover script to detect mode at runtime via
  @@rpl_semi_sync_master_enabled.
- Add scripts/validate-replication-mode.sh (read-only two-source
  consistency: DB SHOW VARIABLES vs ComponentParameter).
- Update existing ShellSpec tests for the new topology and the new
  `replicationMode=async` track.
- Run the upgrade gate N=1 against existing async / semisync
  Cluster CRs.

These are sequenced in subsequent commits in this same PR.

Static verification:
- helm template renders cleanly: 5 ComponentDefinition objects
  (standalone, galera, two deprecated old async/semisync, plus the
  new merged one), 5 ParametersDefinition objects with mutually
  exclusive regexes, ClusterDefinition.topologies = [standalone,
  replication, galera].
- shellspec replication_merged_pd_regex_disambiguation_spec.sh
  passes 25 of 25 examples.

No engine version change. appVersion stays at 11.4.10.
…o mariadb-replication-config

Renames the merged CmpD's configspec name from the legacy
`mariadb-semisync-config` (inherited from the verbatim copy in
commit 1's scaffolding) to the canonical `mariadb-replication-config`,
so the merged CmpD does not carry the legacy semisync identifier in
its addressable surface. The configspec name now matches the merged
CmpD's user-facing identity and the eventual single-template target.

KB Configure resolves configspec name by triple binding:

  cmpd-replication-merged.yaml   spec.configs[].name
  paramsdef.yaml (merged PD)      spec.templateName
  pcr.yaml (merged PCR)           spec.configs[].templateName

A mismatch at any of the three sites would silently break parameter
resolution. This commit updates all three and adds a ShellSpec
(replication_merged_configspec_consistency_spec.sh, 3 examples) to
lock the invariant so future edits cannot silently reintroduce
drift.

The underlying ConfigMap object template (mariadb-semisync-config-
template) is unchanged in this commit. It will be migrated to a
unified mariadb-replication-config-template once `replicationMode`
parameterization lands and async / semisync defaults can be expressed
in a single template. Renaming the rendered ConfigMap before
parameter-driven defaults would silently change cluster behavior.

Also folds Jack scaffolding review (15:50) non-blocking clarification:
the _helpers.tpl block comment for `mariadb.replication.merged
.cmpdName` previously said "not reusing `mariadb-replication-`",
which read as if the new name had a different prefix. It actually
shares the prefix; disambiguation comes from the `-merged-` infix
plus the narrowed `^mariadb-replication-[0-9]` regex on the old
CmpD's PD. The comment now states this directly and points at the
disambiguation ShellSpec from commit 1.

Static verification:
- helm template renders cleanly (5 CmpD / 5 PD / 5 PCR; merged
  CmpD's configs[].name and the merged PD/PCR templateNames all
  resolve to `mariadb-replication-config`).
- shellspec replication_merged_pd_regex_disambiguation_spec.sh:
  25 of 25 examples pass.
- shellspec replication_merged_configspec_consistency_spec.sh:
  3 of 3 examples pass.

What this commit does NOT do:
- Parameterize the merged CmpD on `replicationMode`. Today the
  merged CmpD still uses the semisync ConfigMap content; semisync
  variables still default to ON.
- Rename or unify the underlying ConfigMap object template.
- Add enum validation or fail-closed handling for invalid
  `replicationMode` values.
- Add the switchover script's runtime mode-detection branch.
- Add validate-replication-mode.sh (read-only two-source check).

These are scheduled in the next commit, which will close Class 2
(write-site) and Class 4 (sentinel) of Jack's design contract.

No engine version change. Chart version unchanged at 1.1.1-alpha.89.
…I section binding for fail-closed (C1 path)

C1-path implementation per Jack design-review enum-research
(2026-05-19 16:41) and weston bounded-poll default (16:45 escalation
+ two unanswered nudges through 18:35 → defaulted to C1 with explicit
override window). Folds Jack scaffolding-review commit 3 v1
fail-closed blocker B1 (2026-05-19 18:48): the v1 CUE file declared
the #MariaDBParameter type but did not bind it to the parsed INI
config structure, so KB's `ValidateConfigWithCue()` did not actually
use the constraints and invalid values such as
`rpl_semi_sync_master_enabled = MAYBE` passed through. v2 adds the
`[SectionName=_]: #MariaDBParameter` binding pattern that the MySQL
and ApeCloud MySQL CUE files already use, so KB binds the type to
every parsed INI section and rejects out-of-enum / out-of-range
values at the controller parameter reconcile path. Closes the
Class 4 sentinel requirement of the topology-merge design contract.

New CUE constraint file: addons/mariadb/config/mariadb-config-constraint.cue
  - Declares the #MariaDBParameter top-level type.
  - Constrains the four semisync engine variables that are the real
    source-of-truth for replication mode:
      * rpl_semi_sync_master_enabled : string & "ON" | "OFF" (default OFF)
      * rpl_semi_sync_slave_enabled  : string & "ON" | "OFF" (default OFF)
      * rpl_semi_sync_master_wait_for_slave_count : int 1..65535 (default 1)
      * rpl_semi_sync_master_timeout : int 1..2147483647 ms  (default 10000)
  - v2 adds the `[SectionName=_]: #MariaDBParameter` binding at the
    end of the file. KB's CUE validator walks every parsed INI
    section and applies #MariaDBParameter to each, so unknown enum
    values / out-of-range ints surface as CUE conflicts before the
    rendered ConfigMap is applied.
  - Does NOT declare a synthetic `replicationMode` key. KB has no
    transform from a logical mode key to multiple engine variables
    (ParamConfigRenderer has no transform hook). A synthetic key
    would either be ignored by the renderer or land in my.cnf
    verbatim and break mariadbd. The unified-switch UX is delivered
    by addon docs and (later) helper scripts that emit the
    four-parameter block from a single user-facing choice.
  - The merged PD ShellSpec asserts the absence of a
    `replicationMode` key in the CUE so a future edit cannot
    silently reintroduce it without forcing the design back to the
    surface.

Updated ParametersDefinition for the merged CmpD (paramsdef.yaml):
  - mariadb-replication-merged-pd now declares
    spec.parametersSchema:
      topLevelKey: MariaDBParameter
      cue: |-
        ...Files.Get "config/mariadb-config-constraint.cue"...
  - v2 paramsdef.yaml comment updated to call out the section
    binding explicitly: fail-closed depends on both the
    top-level definition AND the [SectionName=_] binding being
    present, not on either alone.

Updated dynamicParameters classification (mariadb-config-effect-scope.yaml):
  - Adds the four semisync engine variables to dynamicParameters.
    MariaDB documents all four as dynamic system variables, so a
    reconfigure can be applied at runtime via the alpha.88
    reconfigureAction (SET GLOBAL + persisted override file) and
    does NOT need a rolling restart. Without this classification
    the KB Configure controller falls back to rolling restart,
    which on semisync triggers switchover/promote and reopens the
    SET-GLOBAL-without-persist race the alpha.84 → alpha.88 chain
    already closed.

ShellSpec contract: 14 examples (was 13 in v1):
  - scripts-ut-spec/replication_merged_pd_schema_enum_spec.sh
  - Locks: the CUE file exists, declares MariaDBParameter, constrains
    each of the four variables with the documented enum/range, does
    NOT add a synthetic `replicationMode` key, AND binds the type
    to every INI section via `[SectionName=_]: #MariaDBParameter`
    (v2 new test guarding against B1 regression).
  - Locks: the merged PD block declares parametersSchema with the
    correct top-level key and references the CUE file via
    .Files.Get.
  - Locks: dynamicParameters lists all four semisync variables.

Behavioral validation reference: Jack's KB-validator reproduction
(2026-05-19 18:48) demonstrated that with the v1 CUE (no section
binding) `ValidateConfigWithCue()` returned <nil> for an invalid
config (`MAYBE` / `0` / `0`), and that adding the
`[SectionName=_]: #MariaDBParameter` line caused the same invalid
config to surface as a CUE conflict on `mysqld.rpl_semi_sync_*`.
This commit lands the same fix his reproduction validated.

What this commit does NOT do (sequenced for later commits):
  - Change the ConfigMap template defaults. The merged CmpD still
    inherits the semisync ConfigMap (`mariadb-semisync-config-template`),
    so the default behavior for a new cluster remains semisync.
  - Update the switchover script's runtime mode-detection branch.
  - Add scripts/validate-replication-mode.sh (read-only two-source
    check: SHOW VARIABLES vs ComponentParameter).
  - Update existing ShellSpec tests for the new `dynamicParameters`
    entries (those specs do not assert the full list).
  - Run the upgrade gate N=1 against existing async / semisync
    Cluster CRs.

Static verification:
  - helm template renders cleanly. The merged PD now emits a
    parametersSchema block with the inline CUE content including
    the [SectionName=_]: #MariaDBParameter binding.
  - shellspec replication_merged_pd_regex_disambiguation_spec.sh:
    25 of 25 examples pass.
  - shellspec replication_merged_configspec_consistency_spec.sh:
    3 of 3 examples pass.
  - shellspec replication_merged_pd_schema_enum_spec.sh:
    14 of 14 examples pass (was 13 in v1; +1 for B1 guard).
  - Combined: 42 of 42 examples pass.

No engine version change. Chart version unchanged at 1.1.1-alpha.89.

If weston later reverses the bounded-poll default and chooses C2,
the CUE schema, section binding, and dynamicParameters
classification landed in this commit are still required and
continue to apply; only the addon-internal mapper layer would be
added on top (the mapper would emit the same four parameters under
the hood, and the schema + dynamic classification would gate values
arriving from the mapper).
… read-only two-source consistency check (C1 path)

Implements the read-only mode-consistency helper from the v3.1 design
document §3 (Jack design review Gate 5 v3 — annotation writes are
deferred; this commit covers the two-source check the design promises
to land in the first PR). Used by test runners and (later, optionally)
by a kbagent action at OpsRequest closeout to fail-closed when the
engine's in-memory state and the rendered ConfigMap-mounted my.cnf
disagree on replication mode.

Folds Jack commit-4 review (2026-05-19 20:13) blocker B1 and
prepares for the B2 lifecycle-wiring follow-up.

B1 fix: the v1 helper only inspected the master read's return code,
so a slave-side observability failure silently fell through and was
classified as `invariant_violated` (when master=ON) or `disagree`,
not the correct `engine_missing` / `configmap_missing`. v2 captures
the slave read's return code separately and treats a failed OR empty
read on either master or slave as the same engine/configmap-missing
signal. The engine_var() and configmap_var() helpers now return a
non-zero rc when the underlying read succeeded but returned an empty
string, so the caller does not have to duplicate the empty-check.

B2 prep: add MYSQL_SOCKET, MYSQL_USER, MYSQL_PASSWORD, and
MYSQL_EXTRA_ARGS env passthroughs to the mysql client invocation.
When a value is empty the corresponding flag is omitted, so the
default behavior (TCP connect to 127.0.0.1:3306 with no auth) stays
identical to v1 for local introspection and the existing tests.
The lifecycle wiring commit can supply these env vars from the
chart's existing mysqld socket path and the kbagent-mounted
credentials Secret without further changes to the helper.
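
A sketch of the "omit the flag when the value is empty" pattern. The
env var names come from this commit; build_client_args is a
hypothetical helper name, and the explicit --host / --port fallback
is an assumption about how the TCP default is expressed. --socket,
--user, --password, --host and --port are standard mysql / mariadb
client options.

```sh
#!/bin/sh
# Prints one client argument per line, built from the optional env
# passthroughs. Empty or unset variables contribute no flag.
build_client_args() {
  set --
  if [ -n "${MYSQL_SOCKET:-}" ]; then
    set -- "$@" "--socket=${MYSQL_SOCKET}"
  else
    # Assumed default: TCP to the local engine.
    set -- "$@" --host=127.0.0.1 --port=3306
  fi
  if [ -n "${MYSQL_USER:-}" ]; then
    set -- "$@" "--user=${MYSQL_USER}"
  fi
  if [ -n "${MYSQL_PASSWORD:-}" ]; then
    set -- "$@" "--password=${MYSQL_PASSWORD}"
  fi
  if [ -n "${MYSQL_EXTRA_ARGS:-}" ]; then
    # Intentionally unquoted so several extra flags split into words.
    # shellcheck disable=SC2086
    set -- "$@" ${MYSQL_EXTRA_ARGS}
  fi
  printf '%s\n' "$@"
}
```

With all four variables unset the output is just the TCP default, so
existing callers see no change. Note that a password passed as a
command-line flag is visible in the process list; an option file may
be preferable where that matters.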

Non-blocking cleanup: removed the `BEGIN { IGNORECASE = 1 }` block
from configmap_var's awk — the comparison already lowercases both
sides via tolower(), and IGNORECASE is a gawk extension that not
all busybox awks honor consistently.
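
A sketch of a case-insensitive key lookup that relies only on
tolower(), plus the "empty result is a failed read" rule from the B1
fix. This is an illustrative reconstruction, not the helper's actual
body; the comment-skipping rule and the last-match-wins behavior are
assumptions.

```sh
#!/bin/sh
# configmap_var FILE KEY
# Prints the value assigned to KEY in FILE. Returns non-zero when the
# file is unreadable, the key is absent, or the value is empty.
configmap_var() {
  [ -r "$1" ] || return 1
  val=$(awk -F= -v key="$2" '
    /^[[:space:]]*[#;]/ { next }          # skip comment lines
    {
      k = $1
      gsub(/[[:space:]]/, "", k)
      # tolower() on both sides; no gawk-only IGNORECASE needed.
      if (tolower(k) == tolower(key)) {
        v = $2
        gsub(/^[[:space:]]+|[[:space:]]+$/, "", v)
        found = v                         # last assignment wins
      }
    }
    END { print found }
  ' "$1") || return 1
  [ -n "$val" ] || return 1
  printf '%s\n' "$val"
}
```

Because the empty check lives inside the function, a caller only has
to test the return code.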

Helper script: addons/mariadb/scripts/validate-replication-mode.sh
  - /bin/sh shebang (kbagent ships busybox sh).
  - Reads `rpl_semi_sync_master_enabled` and `rpl_semi_sync_slave_enabled`
    from two sources (engine SHOW VARIABLES + ConfigMap-mounted my.cnf).
  - Normalizes MariaDB boolean representations (`ON`/`OFF`/`1`/`0`/
    `true`/`false`) to canonical `ON`/`OFF` before comparing.
  - Read-only by design (Jack design review Gate 5 v3). Annotation
    writes deferred to a future controller-side writer.
  - Exit codes:
      0 OK / sources agree
      1 disagree (closeout FAIL)
      2 engine_missing (transient; caller should bounded-retry) —
        triggered by master OR slave read failure or empty result
      3 configmap_missing (transient; caller should bounded-retry) —
        triggered by master OR slave key missing or file unreadable
      4 invariant_violated (master=ON but slave=OFF; MariaDB semisync
        silently degrades to async on such an asymmetric setting)
  - Output format: per-key `key=value` tokens on stdout for grep
    by tests / kbagent attestation; `mode_consistency=<state>` line
    gives the verdict.
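
The normalization and exit-code mapping above can be sketched as two
small functions. The exit codes and verdict names are taken from this
commit message; the function names, the order in which the checks
run, and the choice to treat an unrecognized literal as "missing" are
assumptions.

```sh
#!/bin/sh
# normalize_bool VALUE: print canonical ON/OFF, fail on anything else
# (including an empty string).
normalize_bool() {
  case "$1" in
    ON|on|1|true|TRUE)     echo ON ;;
    OFF|off|0|false|FALSE) echo OFF ;;
    *) return 1 ;;
  esac
}

# classify_mode ENGINE_MASTER ENGINE_SLAVE CM_MASTER CM_SLAVE
# Prints the verdict line and returns the matching exit code.
classify_mode() {
  if ! em=$(normalize_bool "$1") || ! es=$(normalize_bool "$2"); then
    echo "mode_consistency=engine_missing"; return 2
  fi
  if ! cm=$(normalize_bool "$3") || ! cs=$(normalize_bool "$4"); then
    echo "mode_consistency=configmap_missing"; return 3
  fi
  if [ "$em" = ON ] && [ "$es" = OFF ]; then
    echo "mode_consistency=invariant_violated"; return 4
  fi
  if [ "$em" != "$cm" ] || [ "$es" != "$cs" ]; then
    echo "mode_consistency=disagree"; return 1
  fi
  echo "mode_consistency=ok"
}
```

Checking both master and slave on each source before comparing is
what keeps an empty slave read from being reported as
invariant_violated or disagree.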

ShellSpec contract: addons/mariadb/scripts-ut-spec/validate_replication_mode_spec.sh
  - 9 behavioral examples (was 7 in v1; +2 for B1 regression
    coverage):
      * both sources ON -> ok / exit 0
      * both sources OFF -> ok / exit 0
      * normalization `1` <-> `ON` -> ok / exit 0
      * engine ON but ConfigMap OFF -> disagree / exit 1
      * mysql client fails -> engine_missing / exit 2
      * NEW: slave engine read returns empty row -> engine_missing /
        exit 2 (was misclassified as invariant_violated in v1)
      * NEW: ConfigMap missing the slave key -> configmap_missing /
        exit 3 (was misclassified as disagree in v1)
      * my.cnf unreadable -> configmap_missing / exit 3
      * master=ON slave=OFF -> invariant_violated / exit 4

What this commit does NOT do (sequenced for later commits):
  - Wire `validate-replication-mode.sh` into a cmpd lifecycle
    action (the B2 env-var surface is in place so that commit only
    has to add the action declaration + reference the env vars from
    Secret / socket path).
  - Flip the ConfigMap template default to async.
  - Add the switchover script's runtime mode-detection branch.
  - Update existing per-topology ShellSpec specs to the alpha.89
    chart-version literal.
  - Run the upgrade gate N=1.

Static verification:
  - shellspec validate_replication_mode_spec.sh: 9 of 9 examples pass.
  - Combined across 4 alpha.89 specs: 51 of 51 examples pass.
  - helm template not affected (no template changes in this commit).

No chart version change; no engine version change.

Per the new autonomous-addon-development-loop guidance (Cindy
2026-05-19 20:03 kubeblocks-addon-docs main commit 2395b53), human
review of this branch is an upstreaming gate, not a testing gate.
The merged-branch chart is patchable today (chart-build + sideload)
for any caller that wants to validate the C1 path end-to-end before
the upstream PR lands.
….sh in replication script ConfigMap (C1 path)

Wires the read-only mode-consistency helper from commit 4 v2
(scripts/validate-replication-mode.sh) into the replication-tier
script ConfigMap (mariadb-replication-scripts-...) so it lands at
/scripts/validate-replication-mode.sh inside the mariadb container
alongside roleprobe / member-join / switchover.

Concrete effect:
- Test runners and ad-hoc operators can call the helper directly via
    kubectl exec <pod> -c mariadb -- /scripts/validate-replication-mode.sh
  No further chart change required to use it at closeout.
- A later commit can add a kbagent lifecycle action that references
  the same /scripts path without re-touching this ConfigMap. The
  B2 env passthroughs added in commit 4 v2
  (MYSQL_SOCKET / MYSQL_USER / MYSQL_PASSWORD / MYSQL_EXTRA_ARGS)
  cover the connection context that a kbagent action would supply
  from the chart's existing Secret / socket path.

New ShellSpec contract: addons/mariadb/scripts-ut-spec/replication_merged_validate_script_mount_spec.sh
- 2 examples lock the ConfigMap surface:
    1. data key `validate-replication-mode.sh:` is present
    2. its body is pulled via .Files.Get "scripts/validate-replication-mode.sh"
- Prevents a future ConfigMap edit from silently dropping the mount
  before the kbagent lifecycle action lands.

What this commit does NOT do:
- Add a kbagent lifecycle action declaration for the helper. The
  helper is reachable via kubectl exec today; wiring it into a
  formal `customActions` or `reconfigure` post-hook is a separate
  design choice that depends on whether mode-consistency should
  block a reconfigure OpsRequest or only surface in test closeout.
- Flip the ConfigMap template default to async.
- Add the switchover script's runtime mode-detection branch.
- Update existing per-topology ShellSpec specs to alpha.89.
- Run the upgrade gate N=1.

Static verification:
- helm template renders cleanly. The replication script ConfigMap
  now embeds validate-replication-mode.sh under data, with the body
  visible in the rendered output.
- shellspec replication_merged_validate_script_mount_spec.sh: 2 of 2
  examples pass.
- Combined across 5 alpha.89 specs: 53 of 53 examples pass
  (regex 25 + configspec 3 + PD schema 14 + validate helper 9 +
  script mount 2).

No chart version change; no engine version change.

Per the autonomous-addon-development-loop guidance (Cindy
2026-05-19 20:03 commit 2395b53), human review on this branch is
an upstreaming gate, not a testing gate. After this commit a
sideloaded chart on a live cluster has the helper available for
two-source consistency checks during test closeout.
…ersion literals to alpha.89

Updates the version-tracking assertions in the existing per-feature
ShellSpec suites so they validate against the current alpha.89 chart,
not the alpha.88 the suites were last frozen at. The previous run
showed 4 failures all on `version: 1.1.1-alpha.88` literal compares;
this commit bumps them to `version: 1.1.1-alpha.89` and rewrites
each test description to call out the alpha.89 scope.

Files touched:
- scripts-ut-spec/replication_switchover_spec.sh (4 sites bumped
  to alpha.89 — one in the Chart.yaml literal-version gate, three
  in the alpha.65/.66/.67 immutability-rule "current bumped to"
  contract checks).
- scripts-ut-spec/replication_user_convergence_spec.sh (2 sites
  bumped — Gate 1's literal-version check, and the "no stale
  prior literal" check; the latter is now alpha.88 since alpha.88
  is what we just bumped away from).

Pending-marked obsolete tests (alpha.79 / alpha.80 cleanup debt)
remain unchanged: they document tech debt against alpha.80 cleanup
and do not assert against the current chart.

Combined test status (full mariadb scripts-ut-spec/ directory):
437 examples / 0 failures / 7 pendings.

What this commit does NOT do:
- Switchover script runtime mode detection. Deferred — the existing
  1974-line switchover script needs careful surgery best done as a
  focused commit with its own review pass.
- ConfigMap template default flip to async (behavior-changing).
- Wrap validate-replication-mode.sh in a kbagent lifecycle action.
- Upgrade gate N=1.
- Runtime PASS re-validation.

No chart version change; no engine version change.
…n replication-switchover.sh (C1 path, no caller change)

Stages a small read-only helper inside the existing 1974-line
replication-switchover.sh so a focused follow-up commit can wire it
into the specific switchover stages that today unconditionally wait
for semisync ACK. The merged CmpD can now run in either async or
semisync mode under the C1 path, and the existing switchover script
needs a way to distinguish the two at runtime before its ACK-wait
logic can be made conditional.

Per the agreed scope (Jack 2026-05-19 21:53 commit 6 PASS handoff),
this commit only adds the helper and its ShellSpec; no caller change.
That keeps the existing switchover behavior identical on alpha.89,
preserves the in-flight Jack PASS verdicts on commits 1-6, and lets
the caller wiring land in its own focused commit with its own review
pass.

New helper: is_semisync_mode()
- Located next to the existing query_local_value() helper, sharing
  the same MARIADB_CLIENT_BIN / MARIADB_ROOT_USER / MARIADB_ROOT_PASSWORD
  env surface and the same /bin/sh + busybox compatibility envelope.
- Queries the engine's in-memory @@rpl_semi_sync_master_enabled via
  the same mariadb client invocation pattern the surrounding helpers
  use.
- Return contract:
    0 — semisync ON (value is 1 / ON / on)
    1 — semisync OFF (value is 0 / OFF / off)
    2 — undetermined: client failure, empty row, or any value outside
        the six recognized literals. Future callers MUST treat 2 as
        conservative fail-closed (assume semisync and keep the safety
        wait) so a transient client failure cannot silently flip
        behavior to async during switchover.
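
A minimal sketch of the three-way contract and a fail-closed caller.
It is not the helper's actual body: credential flags are left out for
brevity, the `mariadb` fallback for MARIADB_CLIENT_BIN is an
assumption, and should_wait_for_ack is a hypothetical caller name
(this commit adds no caller). -N, -B and -e are standard client
options for skipping column names, batch output, and executing a
statement.

```sh
#!/bin/sh
# is_semisync_mode: 0 = semisync ON, 1 = semisync OFF, 2 = undetermined.
is_semisync_mode() {
  val=$("${MARIADB_CLIENT_BIN:-mariadb}" -N -B \
        -e 'SELECT @@rpl_semi_sync_master_enabled' 2>/dev/null) || return 2
  case "$val" in
    1|ON|on)   return 0 ;;
    0|OFF|off) return 1 ;;
    *)         return 2 ;;   # empty row or unrecognized literal
  esac
}

# Hypothetical caller: skip the ACK wait only on a confirmed OFF.
# rc=2 (undetermined) is treated like semisync, so a transient client
# failure cannot flip switchover behavior to async.
should_wait_for_ack() {
  rc=0
  is_semisync_mode || rc=$?
  [ "$rc" -ne 1 ]
}
```

Keeping "undetermined" as a distinct code, instead of folding it into
OFF, is what lets the caller choose the conservative branch.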

New ShellSpec contract:
scripts-ut-spec/replication_switchover_is_semisync_mode_spec.sh
- 9 behavioral examples covering:
    * ON / 1 / on -> rc=0
    * OFF / 0 / off -> rc=1
    * MAYBE (unrecognized literal) -> rc=2
    * empty row -> rc=2
    * client exit non-zero -> rc=2
- Each test stubs MARIADB_CLIENT_BIN with a controllable shell script
  in a tmp PATH and sources replication-switchover.sh via the
  existing __SOURCED__=1 ShellSpec convention, so the helper is
  exercised without running main().

What this commit does NOT do:
- Wire is_semisync_mode() into any switchover stage. The caller
  patch lands in a follow-up commit with its own contract review.
- Change ConfigMap template defaults to async.
- Add a kbagent lifecycle action wrapping validate-replication-mode.sh.
- Run the upgrade gate N=1 against existing async / semisync
  Cluster CRs.
- Modify any of the in-flight Jack-PASSed commits 1-6.

Static verification:
- sh -n / bash -n PASS on replication-switchover.sh.
- shellspec replication_switchover_is_semisync_mode_spec.sh:
  9 of 9 examples pass.
- Full mariadb scripts-ut-spec/ directory:
  446 examples / 0 failures / 7 pendings (was 437 / 0 / 7 before
  this commit; +9 from the new helper spec, no regressions).

No chart version change; no engine version change.

Per the autonomous-addon-development-loop guidance (Cindy 2026-05-19
20:03 commit 2395b53), human review of this branch is an upstreaming
gate, not a testing gate. The branch is patchable today for any
caller that wants to validate the C1 path end-to-end before the
upstream PR lands.
…nc via mariadb-replication-config-template (C1 path closure)

Closes the v3.1 design §1 / §8 commitment that the merged
`mariadb-replication-merged` CmpD defaults to async replication.
The four `rpl_semi_sync_*` engine variables are absent from the
default my.cnf and therefore take their engine defaults (OFF / 0).
Users who want semisync set the four variables via
`spec.componentSpecs[].parameters` at create time; the PD CUE
schema added in commit 3 v2 validates the values, and the
alpha.88 persistence layer (still wired in this CmpD via the init
container's `--defaults-extra-file=/var/lib/mysql/runtime-overrides.cnf`
loader path and the `runtime-overrides.d/` directory) carries the
runtime values through process restarts.

Change:
- templates/cmpd-replication-merged.yaml: the configspec
  `template:` field flips from `mariadb-semisync-config-template`
  (loads `config/mariadb-semisync.tpl`, which hardcodes
  `rpl_semi_sync_master_enabled = ON` and the auxiliary semisync
  variables) to `mariadb-replication-config-template` (loads
  `config/mariadb-replication.tpl`, which omits all four semisync
  variables and lets engine defaults take effect).

The deprecated `cmpd-replication.yaml` and `cmpd-semisync.yaml`
files still render and still reference their respective
ConfigMap templates, so existing Cluster CRs that bound to one of
those CmpDs continue to receive the configspec they originally
resolved against; only the merged CmpD's default behavior changes.

Why async as the default:

Async is safer for new clusters than semisync because semisync's
wait_for_slave gate can silently degrade under partial secondary
failure (the master waits and then falls back to async after the
timeout), which is harder to observe at the cluster level than an
always-async cluster. Users who explicitly want semisync set the
four parameters at create time; the PD CUE schema fail-closes on
invalid values per commit 3 v2.

New ShellSpec contract:
scripts-ut-spec/replication_merged_default_async_configmap_spec.sh
- 4 examples lock the default-async invariant:
    1. merged CmpD `configs[].template` references the async
       ConfigMap template name (not the semisync one).
    2. merged CmpD does NOT reference the semisync ConfigMap
       template (negative guard against a silent flip-back).
    3. async template my.cnf does not set
       `rpl_semi_sync_master_enabled = ON` in defaults.
    4. async template my.cnf does not set
       `rpl_semi_sync_slave_enabled = ON` in defaults.
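
Examples 3 and 4 amount to a negative grep over the async template.
A rough plain-shell equivalent is sketched below; sets_on and
assert_default_async are hypothetical names, and treating `1` as
equivalent to `ON` is an assumption beyond what the spec asserts.

```sh
#!/bin/sh
# sets_on FILE VAR: succeeds when FILE assigns ON (or 1) to VAR on an
# uncommented line. Matching is case-insensitive.
sets_on() {
  grep -Eiq -- \
    "^[[:space:]]*$2[[:space:]]*=[[:space:]]*(ON|1)[[:space:]]*\$" "$1"
}

# assert_default_async FILE: succeeds when FILE is readable and
# neither *_enabled variable is switched on in it.
assert_default_async() {
  [ -r "$1" ] || return 1
  ! sets_on "$1" rpl_semi_sync_master_enabled || return 1
  ! sets_on "$1" rpl_semi_sync_slave_enabled  || return 1
}
```

The readability check matters for a negative guard: without it a
missing file would make grep fail and the assertion pass vacuously.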

What this commit does NOT do:
- Add a kbagent lifecycle action wrapping validate-replication-mode.sh
  (still pending design decision: should mode mismatch block reconfigure
  OpsRequest, or only surface in test closeout?).
- Wire is_semisync_mode() helper into actual switchover caller sites
  (no obvious caller site identified yet; the helper remains staged).
- Run the upgrade gate N=1 against existing async / semisync
  Cluster CRs (test environment dependency).
- Add semisync auxiliary variables (rpl_semi_sync_master_wait_no_slave,
  rpl_semi_sync_master_wait_point) to the PD schema (users today
  override the four core variables; adding the auxiliary set is a
  follow-up if/when soak testing shows they are reachable via
  parameter override).

Static verification:
- helm template renders cleanly; the merged CmpD's `configs[].template`
  field is `mariadb-replication-config-template`.
- shellspec replication_merged_default_async_configmap_spec.sh:
  4 of 4 examples pass.
- Full mariadb scripts-ut-spec/ directory:
  450 examples / 0 failures / 7 pendings (was 446 / 0 / 7;
  +4 from the new default-async spec, no regressions).

No chart version change; no engine version change.
…t 8 default-async flip

Non-blocking cleanup noted in Jack commit 8 review (2026-05-20
00:04, msg a4c1fc38): the merged PD comment block at paramsdef.yaml
L129-L132 still claimed the underlying ConfigMap object template
remained at mariadb-semisync-config-template. Commit 8 already
flipped that pointer to mariadb-replication-config-template, so
the comment was stale.

Updates the comment to reflect post-commit-8 reality: PD
parameter resolution operates against the async ConfigMap; the
four rpl_semi_sync_* variables are absent from the default my.cnf
and only land when a Cluster CR explicitly sets them via
spec.componentSpecs[].parameters, validated by the CUE schema
declared below.

No runtime / render change; comment-only fix.

Static verification:
- helm template renders cleanly.
- Full mariadb scripts-ut-spec/: 450 examples / 0 failures / 7
  pendings (unchanged from post-commit-8 baseline).
…icationMode + conditional derivation (C3 path)

weston 2026-05-20 00:08 msg cb0afa37 directs that the merged
CmpD expose BOTH a single logical replicationMode switch AND the
four real rpl_semi_sync_* variables. Precedence rule: if
replicationMode is set, it overrides the four variables; if the
user also sets one of them with a conflicting literal, KB
rejects the assignment. If replicationMode is unset, the four
variables are freely settable and default to OFF (async).

Implementation layer 1 — PD CUE schema (this commit):

config/mariadb-config-constraint.cue now declares a
`replicationMode` field with enum `"async" | "semisync"` and two
conditional blocks that unify the two `*_enabled` variables with
the corresponding ON / OFF value:

```cue
if replicationMode == "semisync" {
  rpl_semi_sync_master_enabled: "ON"
  rpl_semi_sync_slave_enabled:  "ON"
}
if replicationMode == "async" {
  rpl_semi_sync_master_enabled: "OFF"
  rpl_semi_sync_slave_enabled:  "OFF"
}
```

CUE unification handles the consistency check natively:

- user sets only replicationMode=semisync -> CUE unifies the two
  *_enabled fields to ON (whether KB's renderer emits the
  derived values into my.cnf is a separate question that Jack
  is verifying in parallel; if it does not, layer 2.5 below
  applies).
- user sets only the four variables -> no replicationMode
  constraint applies; values are validated against their
  declared types and bounds as before.
- user sets replicationMode AND the four variables consistently
  -> CUE unifies cleanly.
- user sets replicationMode=semisync AND
  rpl_semi_sync_master_enabled=OFF (conflict) -> CUE
  unification fails on the `*_enabled` field; KB
  `ValidateConfigWithCue()` returns a CUE conflict and rejects
  the assignment before it lands in the rendered ConfigMap.

The auxiliary `rpl_semi_sync_master_wait_for_slave_count` and
`rpl_semi_sync_master_timeout` fields are NOT constrained by
replicationMode — they remain user-tunable within their declared
int range. (The engine ignores those values when semisync is
OFF, so leaving them unconstrained does not cross-couple modes.)

Implementation layer 2 — KB renderer verification (parallel,
Jack-owned):

Jack 2026-05-20 00:11 confirmed he will run a KB-validator
behavioral test using ValidateConfigWithCue() on a fixture that
sets only replicationMode=semisync, to confirm whether KB
emits the derived `*_enabled = ON` values into the rendered
my.cnf. If yes, the CUE schema in this commit is sufficient
end-to-end. If no, a follow-up commit will add a thin addon-side
mapper in reconfigureAction that fills the four variables when
only replicationMode is set; CUE unification still rejects
conflicting explicit assignments either way.

Implementation layer 3 — ShellSpec updates:

scripts-ut-spec/replication_merged_pd_schema_enum_spec.sh

- Removed the C1 negative assertion that the CUE file does NOT
  declare a `replicationMode` key (under C3 it must declare it).
- Added 3 new examples:
  - replicationMode enum `"async" | "semisync"` present
  - if replicationMode == "semisync" conditional block present
  - if replicationMode == "async" conditional block present
- Net change: 14 -> 16 examples in this spec file.

What this commit does NOT do (sequenced for later commits):

- Verify the KB renderer emits CUE-derived values into the
  rendered my.cnf. Jack's parallel behavioral test owns this.
- Add the addon-side mapper if Jack's verification shows the
  renderer only validates and does not derive. The mapper would
  live in reconfigureAction and fill the four variables when
  only replicationMode is set.
- Add ShellSpec behavioral tests that exercise the four C3
  cases (only mode / only 4 vars / both consistent / both
  conflict) end-to-end through KB validator. Those tests
  require either a KB CUE harness in the addon repo or a
  reproduction of ValidateConfigWithCue logic in shell; deferred
  to follow-up commit alongside the mapper question.
- Modify any of the in-flight Jack-PASS-ed commits 1-9.

Static verification:

- helm template renders cleanly. The merged PD now emits a
  parametersSchema block whose inline CUE content includes the
  replicationMode field and the two conditional blocks.
- shellspec replication_merged_pd_schema_enum_spec.sh:
  16 of 16 examples pass (was 14 / 0 / 0 before this commit;
  +3 for the new C3 schema additions, -1 for the now-incorrect
  C1 negative assertion).
- Full mariadb scripts-ut-spec/ directory:
  452 examples / 0 failures / 7 pendings (was 450 / 0 / 7;
  +2 from the new C3 schema additions, no regressions).

No chart version change; no engine version change.
… replicationmode in CUE (Jack B1 fix)

Builds on commit 11 v1 (CUE revert + open struct) by closing the
behavioral hole Jack surfaced in the v1 review (2026-05-20 00:48
msg `f8e7e078`): the `[string]: _` open pattern alone allowed any
patch that included a `replicationMode=semisync` key to merge
into the rendered my.cnf as the lowercase `replicationmode=semisync`
key, which mariadbd does not recognize as a server variable. The
C3 design places `replicationMode` at the ComponentSpec-parameter
layer consumed by an addon mapper BEFORE my.cnf render; under no
path should the key appear in the rendered my.cnf.

Change:

- config/mariadb-config-constraint.cue: add an explicit
  `replicationmode?: _|_` (CUE bottom) declaration alongside the
  `[string]: _` open pattern. CUE unifies an explicit field with
  every pattern constraint that matches it, and `_ & _|_` is `_|_`,
  so the bottom value fires for the specific lowercase key while
  unrelated base my.cnf keys still flow through unchallenged.

- scripts-ut-spec/replication_merged_pd_schema_enum_spec.sh: add
  one positive assertion that the CUE file contains
  `replicationmode?: _|_`. Net change: 16 -> 17 examples in this
  spec file.

Behavioral expectation (matches Jack's locally-verified override
direction in his B1 finding):

- A merge patch including `replicationMode=...` or
  `replicationmode=...` (any case) is normalized to
  `replicationmode` by KB's INI parser, hits the explicit bottom
  declaration, and fails `ValidateConfigWithCue()` with a clear
  CUE conflict. The merge does not reach the rendered ConfigMap.
- A merge patch setting only base my.cnf keys (binlog_format,
  max_connections, slow_query_log, etc.) flows through the
  `[string]: _` open pattern unchallenged. Same for a merge that
  sets `rpl_semi_sync_master_enabled=ON`, which still goes through
  the existing field constraint.

Static verification:

- helm template renders cleanly.
- shellspec replication_merged_pd_schema_enum_spec.sh:
  17 of 17 examples pass.
- Full mariadb scripts-ut-spec/:
  453 examples / 0 failures / 7 pendings (was 452 / 0 / 7 before
  this v2; +1 from the new replicationmode-forbid assertion).

No chart version change; no engine version change.

What this commit does NOT do:

- Re-introduce `replicationMode` as a ComponentSpec parameter
  consumed by an addon mapper. That is commit 12's scope per
  Jack's pre-loaded review criteria (msg `3a0f5385`): single
  mapper write-site, mapper consumes `replicationMode` before
  my.cnf render, 5 ShellSpec cases (only mode / only 4 vars /
  both consistent / both conflict / mapper failure), fail-closed
  on mapper failure.
- Run the upgrade gate N=1 or live cluster validation.
…de → 4 engine vars (C3 design, Jack 5-case + 2-boundary contract)

Implements the C3 design mapper that translates the synthetic
`replicationMode` ComponentSpec parameter ("async" | "semisync") into
the four real MariaDB engine variables BEFORE the merged replication
CmpD's reconfigureAction.persisted main loop renders any my.cnf or
runtime-overrides.d/ file. CUE backstop in commit 11 v2
(`replicationmode?: _|_`) keeps the synthetic key from ever landing
in my.cnf; the mapper is the canonical write-site that owns the
translation.

Why a mapper instead of a CUE conditional
------------------------------------------
Jack's KB-validator behavioral test against commit `dc645466` proved
that KB's `pkg/parameters/validate/cue_util.go ValidateConfigWithCue()`
validates parameter values against a CUE schema but does NOT emit
CUE-derived field values back into the rendered my.cnf. Expressing
C3 precedence in CUE alone would either silently ignore
`replicationMode` or land the verbatim key in my.cnf (which mariadbd
rejects as unknown). The C3 design therefore places `replicationMode`
at the ComponentSpec-parameter layer consumed by this addon-side
mapper before my.cnf render.

Behavior contract (5 cases + 2 boundaries locked with Jack pre-loaded
review criteria msg `3a0f5385` and 2026-05-20 dm:@jack msg
`e8c80793` / `144afd93` / `2e93eb72`)
----------------------------------------------------------------------

1. mapper is the UNIQUE consumer / writer of `replicationMode`.
   Sourced exactly once from `mariadb.config.reconfigureAction.persisted`
   BEFORE the main loop processes any parameter. Synthetic key never
   reaches SET GLOBAL or runtime-overrides.d/.

2. Conflict detection runs BEFORE any file modification. When user
   simultaneously supplies `replicationMode=semisync` and any of the
   four real engine variables with a disagreeing value, the mapper
   exits non-zero (code 3) and leaves the parameter list as-is — no
   partial state.

3. Mapper failure (invalid mode → code 2, conflict → code 3, bad arg
   → code 4, IO failure → code 5) always produces non-zero exit and
   no partial state. The persisted helper exits 1 on any mapper
   non-zero return; main loop does not run.

4. Only-4-vars path: when `MARIADB_REPLICATION_MODE` is empty or
   unset, the mapper returns 0 immediately and the parameter list
   flows through unchanged. The four real engine variables continue
   to be processed exactly as before. Verified by sha256 invariance.

5. Both-consistent is idempotent: user supplying both
   `replicationMode=semisync` and matching real vars yields exactly
   one assignment per real var (no duplicates). Repeated mapper
   invocation produces byte-identical output. Verified via sha256
   compare across two passes.
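
Items 2 to 5 above can be illustrated with a minimal sketch. This is
not the shipped `replication-mode-mapper.sh`: the function name
`sketch_apply_mode`, the one-`name=value`-per-line parameter file
layout, and the restriction to the two `*_enabled` variables are all
assumptions made for illustration. Only the exit codes (2/3/4/5) and
the check-before-write ordering come from the contract.

```sh
# Hypothetical sketch; assumes one name=value assignment per line.
sketch_apply_mode() {
  param_file=$1
  mode=${MARIADB_REPLICATION_MODE:-}
  if [ -z "${param_file}" ] || [ ! -f "${param_file}" ]; then
    return 4                                  # bad arg
  fi
  if [ -z "${mode}" ]; then
    return 0                                  # only-4-vars path: no rewrite
  fi
  case "${mode}" in
    semisync) want=ON ;;
    async)    want=OFF ;;
    *)        echo "invalid replicationMode: ${mode}" >&2; return 2 ;;
  esac
  # Conflict check runs first: nothing is written until every
  # explicitly supplied value agrees with the requested mode.
  for name in rpl_semi_sync_master_enabled rpl_semi_sync_slave_enabled; do
    have=$(sed -n "s/^${name}=//p" "${param_file}" | tail -n 1)
    if [ -n "${have}" ] && [ "${have}" != "${want}" ]; then
      echo "conflict: ${name}=${have} vs replicationMode=${mode}" >&2
      return 3
    fi
  done
  # Rewrite via tmp + mv: drop old assignments, append exactly one each.
  tmp="${param_file}.tmp.$$"
  sed -e '/^rpl_semi_sync_master_enabled=/d' \
      -e '/^rpl_semi_sync_slave_enabled=/d' \
      "${param_file}" > "${tmp}" || { rm -f "${tmp}"; return 5; }
  printf 'rpl_semi_sync_master_enabled=%s\n' "${want}" >> "${tmp}" \
    || { rm -f "${tmp}"; return 5; }
  printf 'rpl_semi_sync_slave_enabled=%s\n' "${want}" >> "${tmp}" \
    || { rm -f "${tmp}"; return 5; }
  mv "${tmp}" "${param_file}" || { rm -f "${tmp}"; return 5; }
}
```

Because the derived assignments are always removed and re-appended in
the same order, a second pass over the sketch's own output produces
the same bytes, which is the property item 5 locks with sha256.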

Boundary 1 — call-site uniqueness: the persisted helper sources the
mapper exactly once (grep -c == 1) and gates the call on file
readability so non-merged topologies (e.g. cmpd-semisync.yaml using
the same persisted helper) are safe no-ops.

Boundary 2 — byte-equal short-circuit in main loop: when the new
tmp override file is byte-identical to the existing override file,
the helper skips `mv` so the on-disk mtime is preserved across
no-op reconfigures. Required removing the alpha.86 timestamp
comment line that forced every rewrite to differ. The skip branch
runs strictly after safety validation (is_safe_param_name +
is_safe_param_value) and after the mapper-driven conflict check
(mapper runs before main loop), but before the atomic mv. Conflict
cases never reach this point because the mapper exits non-zero
before any tmp file is written.
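
The byte-equal short-circuit in Boundary 2 could look roughly like the
following. The helper name `install_if_changed` and its two-argument
shape are hypothetical; the real logic lives inline in the persisted
helper's loop.

```sh
# Hypothetical helper: install tmp over target only when content differs.
install_if_changed() {
  tmp=$1
  target=$2
  if [ -f "${target}" ] && cmp -s "${tmp}" "${target}"; then
    rm -f "${tmp}"          # identical bytes: leave target and its mtime alone
    return 0
  fi
  mv "${tmp}" "${target}"   # differs or absent: rename into place
}
```

A per-write timestamp comment in the file would make `cmp -s` report a
difference on every pass, which is why that line had to go.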

Changes in this commit
-----------------------

- `scripts/replication-mode-mapper.sh` (new, 255 lines):
  - `apply_replication_mode_mapping <parameter_file>` is the single
    entry point. Returns 0 on success, 2 on invalid mode, 3 on
    conflict, 4 on bad arg, 5 on IO failure.
  - Defense-in-depth synthetic-key strip (covers both
    `replicationMode` and lowercase `replicationmode`).
  - Atomic in-place rewrite via tmp + mv; cleanup on any error path.
  - Source-time `__SOURCED__` guard for ShellSpec testability;
    standalone execution also supported with same contract.

- `templates/_helpers.tpl` `mariadb.config.reconfigureAction.persisted`:
  - Sources `/scripts/replication-mode-mapper.sh` (gated on file
    readability) and calls `apply_replication_mode_mapping
    "${parameter_file}"` BEFORE the parameter-empty check and main
    SET GLOBAL + persist loop.
  - On mapper non-zero return, exits 1 with a clear "replicationMode
    mapper failed" sentinel so the reconfigure OpsRequest fails
    closed and operator sees the diagnosis.
  - Adds `cmp -s` short-circuit in the persist loop: when the new
    tmp override file is byte-identical to the existing on-disk
    override file, skips the atomic mv. Removes the alpha.86
    timestamp comment line so byte-compare is meaningful for
    identical values.

- `templates/configmap-scripts-replication.yaml`:
  - Mounts `replication-mode-mapper.sh` at `/scripts/` via the
    replication scripts ConfigMap. Same mount as commit 5's
    `validate-replication-mode.sh`; no chart version bump needed
    because ConfigMap content is not subject to CmpD immutability.

- `scripts-ut-spec/replication_mode_mapper_spec.sh` (new, 332 lines,
  23 examples covering):
  - Case 1 (only mode): 3 examples — semisync + async + synthetic
    key strip.
  - Case 2 (only 4 vars / mode empty / mode unset): 2 examples,
    sha256-verified no-op.
  - Case 3 (both consistent): 3 examples — derived appended, no
    duplicates, sha256-verified idempotent across two passes.
  - Case 4 (both conflict): 3 examples — exit code 3, stderr clear,
    sha256-verified parameter list unchanged on conflict.
  - Case 5 (mapper failure): 4 examples — invalid mode (exit 2),
    bad arg (exit 4), sha256-verified parameter list unchanged on
    invalid mode.
  - Synthetic strip backstop: 2 examples — camelCase and lowercase.
  - Unique-call-site contract: 4 examples grep'ing _helpers.tpl.
  - Byte-equal short-circuit contract: 2 examples grep'ing
    _helpers.tpl for `cmp -s` presence and absent timestamp line.

Static verification
--------------------

- `helm lint addons/mariadb`: PASS
- `helm template test addons/mariadb`: renders cleanly (16464 lines)
- `bash -n` + `dash -n` on the mapper script: PASS both shells
- Full mariadb scripts-ut-spec/:
  476 examples / 0 failures / 7 pendings (was 453 / 0 / 7 before
  commit 12; +23 from the new mapper spec; existing alpha.86
  persisted helper tests all preserved).

What this commit does NOT do
-----------------------------

- Wire `MARIADB_REPLICATION_MODE` env injection from the user's
  ComponentSpec parameter into the kbagent action container. The
  mapper reads the env var; the actual plumbing (ParametersDefinition
  field or CmpD vars: entry) is left for a follow-up commit pending
  weston pace decision. Until plumbing lands, the mapper is a safe
  no-op (empty env → return 0).
- Live cluster validation. The persisted helper has not been
  exercised end-to-end through the kbagent action runtime with this
  commit's changes; that is N=1 RED gate work for the cluster lane.

No chart version change; no engine version change.
…(B1) and unconditional synthetic-strip (B2) per Jack contract review

Builds on commit 12 v1 (`1e9bc910`) by closing the two contract
blockers Jack surfaced in his 5-case + 2-boundary behavioral review
(2026-05-20 dm:@jack msg `008885e2`). Static gates all passed in v1,
but two contract claims were not actually enforced by the shipped
code. v2 makes them real with both code fixes and ShellSpec guards.

B1 — `_helpers.tpl` lost the mapper's original rc
---------------------------------------------------

Earlier code:
  if ! apply_replication_mode_mapping "${parameter_file}"; then
    mapper_rc=$?
    ...

The `!` inverts the exit code, so inside the then-block `$?` is 0
(the inverted value), not the mapper's 2/3/4/5. The fail-closed
sentinel still fired (exit 1), but it printed `rc=0`, hiding which
contract layer (invalid mode / conflict / IO / bad arg) actually
broke. First-blocker classification downstream cannot read the
correct layer from the action log.

Jack's minimal repro: `f(){ return 3; }; if ! f; then echo $?; fi`
prints `0`.

Fix:
  mapper_rc=0
  apply_replication_mode_mapping "${parameter_file}" || mapper_rc=$?
  if [ "${mapper_rc}" -ne 0 ]; then
    echo "replicationMode mapper failed (rc=${mapper_rc}); ..." >&2
    exit 1
  fi

The `|| <assign>` chain preserves the original rc and exempts the
mapper invocation from `set -e`, so the action stays alive long
enough to emit the rc-aware diagnostic.
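
The two forms can be compared side by side in any POSIX shell. In this
sketch `mapper_stub` is a hypothetical stand-in for a mapper that
reports a conflict; it is not part of the chart.

```sh
mapper_stub() { return 3; }   # hypothetical stand-in: mapper reports conflict

# rc-losing form: `!` turns status 3 into 0 before `$?` is read.
lossy_rc() {
  if ! mapper_stub; then
    echo "rc=$?"
  fi
}

# rc-preserving form: the assignment runs only on failure, while `$?`
# still holds the mapper's own status.
preserved_rc() {
  mapper_rc=0
  mapper_stub || mapper_rc=$?
  echo "rc=${mapper_rc}"
}

lossy_rc       # prints rc=0
preserved_rc   # prints rc=3
```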

B2 — unconditional synthetic-strip claim was false
--------------------------------------------------

Earlier code path in `apply_replication_mode_mapping`:
  if [ -z "${mode}" ]; then return 0; fi
  ... strip synthetic happens after this empty-mode early return ...

A parameter list containing `replicationMode=semisync` with
`MARIADB_REPLICATION_MODE` unset returned rc=0 and left the synthetic
key in the file — contradicting the script preamble and the v1
commit message that both claimed "mapper unconditionally strips any
replicationMode / replicationmode line".

In current product context this was not a runtime FAIL: KB's CUE
`replicationmode?: _|_` (commit 11 v2) blocks the synthetic key
from ever reaching the parameter list. But the mapper's
defense-in-depth contract was theatre, not real, and v1's commit
message and ShellSpec assertions misrepresented the behavior.

Fix: move the synthetic-strip BEFORE the empty-mode early return.
The strip uses a tmp file + `cmp -s` so clean only-4-vars input
stays byte-identical (mtime preserved); only inputs that actually
contain a synthetic key are rewritten. Jack's contract item 4
(only-4-vars unchanged on clean input) still holds.
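
A minimal sketch of the reordered flow, assuming a parameter file with
one `name=value` per line. The function names `strip_synthetic_key`
and `sketch_mapper_entry` are hypothetical and the mode handling after
the early return is elided.

```sh
# Hypothetical sketch: drop replicationMode / replicationmode lines,
# rewriting the file only when something was actually removed.
strip_synthetic_key() {
  param_file=$1
  tmp="${param_file}.strip.$$"
  sed '/^replication[Mm]ode=/d' "${param_file}" > "${tmp}" \
    || { rm -f "${tmp}"; return 5; }
  if cmp -s "${tmp}" "${param_file}"; then
    rm -f "${tmp}"              # clean input: file is left untouched
    return 0
  fi
  mv "${tmp}" "${param_file}" || { rm -f "${tmp}"; return 5; }
}

sketch_mapper_entry() {
  strip_synthetic_key "$1" || return $?
  if [ -z "${MARIADB_REPLICATION_MODE:-}" ]; then
    return 0                    # only-4-vars path, reached AFTER the strip
  fi
  # ... mode validation, conflict check and derivation would follow ...
  return 0
}
```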

Changes in this commit
-----------------------

- `scripts/replication-mode-mapper.sh`:
  - UNCONDITIONAL synthetic-strip block moved BEFORE the empty-mode
    early return. Uses tmpfile + cmp -s + atomic mv so clean
    only-4-vars input is byte-identical, only inputs with a
    synthetic key are rewritten.
  - Updated comment block above the empty-mode early return to
    document the new defense-in-depth ordering.

- `templates/_helpers.tpl` `mariadb.config.reconfigureAction.persisted`:
  - Replaced `if ! apply_replication_mode_mapping ...; then
    mapper_rc=$?` with `mapper_rc=0; apply_replication_mode_mapping
    ... || mapper_rc=$?; if [ "${mapper_rc}" -ne 0 ]; then ... fi`
    so the original exit code (2/3/4/5) flows into the diagnostic
    sentinel.

- `scripts-ut-spec/replication_mode_mapper_spec.sh` (+68 lines):
  - 3 new behavioral examples for B2:
    - synthetic strip when MARIADB_REPLICATION_MODE unset
    - synthetic strip when MARIADB_REPLICATION_MODE empty string
    - byte-identical preservation for clean only-4-vars input under
      the unconditional strip
  - 3 new contract examples for B1 on _helpers.tpl:
    - rejects the rc-losing `if ! apply_replication_mode_mapping`
      antipattern (grep regex excludes comment lines so the fix
      rationale comment does not false-positive)
    - locks the `|| mapper_rc=$?` rc-preservation form
    - locks the `if [ "${mapper_rc}" -ne 0 ]` rc-aware check

Static verification
--------------------

- `helm lint addons/mariadb`: PASS
- `helm template`: clean render; mapper invocation now shows
  `|| mapper_rc=$?` + `if [ "${mapper_rc}" -ne 0 ]` for both
  cmpd-semisync and cmpd-replication-merged.
- `bash -n` + `dash -n` on the mapper script: PASS both shells.
- Smoke reproduction in real shell:
  - unset mode + synthetic key in file → rc=0, synthetic stripped,
    real var preserved.
  - clean only-4-vars input + unset mode → sha256 byte-identical
    pre/post.
- Focused `replication_mode_mapper_spec.sh`: 29 examples / 0
  failures (was 23 in v1; +3 B2 fix examples + 3 B1 contract
  examples).
- Full mariadb scripts-ut-spec/:
  482 examples / 0 failures / 7 pendings (was 476 in v1; +6
  from B1+B2 lock examples).

8-class contract walk-through
------------------------------

- Class 4 (sentinel/rc): B1 fixed — sentinel now carries actual
  mapper rc; tests lock the `|| mapper_rc=$?` rc-preservation form.
- Class 1 (silent fallback): B2 fixed — synthetic key is now
  unconditionally stripped; tests cover unset + empty-string env.
- Other classes: no v2 regressions; commit 11 v2 CUE `_|_` backstop
  unchanged; alpha.86 persisted helper invariants unchanged.

No chart version change; no engine version change.
…ire via Helm value (C3 plumbing Option C)

Wires the addon-side mapper's input env var. The mapper from commit 12
has been a safe no-op so far because nothing set MARIADB_REPLICATION_MODE.
Commit 13 connects the merged CmpD container env to a top-level Helm
value mariadb.replication.mode so chart users can pick async or
semisync at install/upgrade time.

Why Helm value path (Option C) and not user OpsRequest path
-----------------------------------------------------------

The standard KB ParametersDefinition reconfigure flow validates user
parameter values against the PD CUE schema and renders them into the
target ConfigMap (my.cnf). The CUE schema declared in commit 11 v2
has replicationmode forbidden as a CUE bottom value, so a user
OpsRequest setting replicationMode=semisync is rejected by KB
ValidateConfigWithCue with a CUE conflict, before the mapper sees
anything. That backstop exists to keep the synthetic key out of
my.cnf at the engine layer (mariadbd does not recognize replicationmode
and rejects it as an unknown variable at startup).

The Helm value path is the conservative plumbing that does not require
speculative KB behavior verification. It sets the topology default at
chart install/upgrade time. Runtime mode flip via OpsRequest is
deferred to a future commit that wires either a Cluster annotation
plus addon-side reader OR a non-INI-bound PD, depending on KB
behavior research. The mapper interface in commit 12 is unchanged in
either case; only the env source differs.

Changes
-------

- addons/mariadb/templates/cmpd-replication-merged.yaml:
  Adds a MARIADB_REPLICATION_MODE env entry to the mariadb container
  env block, valued from .Values.replication.mode with empty-string
  default. Empty default preserves existing behavior on clusters whose
  values do not set this key.

- addons/mariadb/values.yaml:
  Adds top-level replication section with mode key defaulting to "".
  Accepted values are "", async, semisync. Invalid values fail the
  reconfigureAction with mapper rc=2 and exit 1 with no partial state.

- addons/mariadb/scripts-ut-spec/replication_merged_replication_mode_env_wire_spec.sh
  (new, 10 examples):
  Locks the wire-up at three layers:
  1. values.yaml declares replication.mode with empty default
  2. cmpd-replication-merged.yaml declares MARIADB_REPLICATION_MODE
     env via .Values.replication.mode pipe with empty-default and
     quoted output
  3. helm template produces the expected env declaration with default
     empty, semisync override, and async override
  Plus a cross-topology negative: standalone and galera CmpDs do NOT
  declare this env (they do not have the mapper wired in their
  reconfigureAction helpers).

Static verification
-------------------

- helm lint PASS
- helm template default produces MARIADB_REPLICATION_MODE with empty value
- helm template --set replication.mode=semisync produces semisync value
- helm template --set replication.mode=async produces async value
- Full mariadb scripts-ut-spec: 492 examples / 0 failures / 7 pendings
  (was 482 from commit 12 v2; +10 from new env wire spec; commit 12
  v2 mapper invariants all preserved)

What this commit does NOT do
----------------------------

- Wire runtime OpsRequest reconfigure to flip replication.mode. Helm
  value is install-time only. A future commit can add a Cluster
  annotation reader OR a non-INI-bound PD declaration; the mapper
  interface stays the same.
- Live cluster validation. The chart static gates pass; exercising
  the env-to-mapper path through the kbagent runtime on a real
  cluster is cluster-lane work.

No chart version change; no engine version change.
…elm template-time fail-closed (Jack B1+B2 fix)

Builds on commit 13 v1 (ae2698a) by closing the two contract gaps
Jack surfaced in his rendered-level review (msg f9433634):

B1 - The Helm value was an install-time API but had no install-time
write-site. The env var was plumbed, but no consumer existed before
the first reconfigureAction trigger. A chart user setting
mariadb.replication.mode=semisync at install would still boot
async until an OpsRequest reconfigure fired.

B2 - An invalid Helm value did not fail at render time. helm template
--set replication.mode=bogus rendered cleanly with a latent bad env;
the failure surfaced only at container startup (correctly fail-closed,
but the diagnosis loop was unnecessarily long and the bad value was
already embedded in the rendered CmpD).

B1 fix - install-time seeder
----------------------------

New scripts/seed-replication-mode-overrides.sh (135 lines):
- Reads MARIADB_REPLICATION_MODE from env at container startup
- For valid mode, writes 4 per-parameter .cnf files into
  runtime-overrides.d/ BEFORE the first mariadbd start
- Output is byte-identical to what reconfigureAction.persisted
  writes for the same env, so the two write-sites converge and a
  later reconfigure is a byte-equal no-op
- cmp -s short-circuit preserves mtime across kubelet restarts /
  pod re-creates
- Empty env: no-op (return 0); preserves existing behavior
- Invalid mode: stderr + return 2; container fails to start mariadbd
- Missing overrides dir: stderr + return 5

Wire-up in cmpd-replication-merged.yaml main container command body:
- Source the seeder via /scripts/seed-replication-mode-overrides.sh
- Run AFTER runtime-overrides.d permission reapplication and BEFORE
  the first start_mariadbd_process call
- Container exits 1 on seeder non-zero rc with a clear sentinel
- Wire is gated on file readability so non-merged topologies that
  do not mount the seeder are safe no-ops

Mount in configmap-scripts-replication.yaml:
- Added seed-replication-mode-overrides.sh to the same scripts
  ConfigMap that mounts replication-mode-mapper.sh

B2 fix - Helm template-time validator
--------------------------------------

New helper in templates/_helpers.tpl:
- mariadb.replication.mode.validate accepts "", async, semisync
- Any other value triggers Helm fail with a clear printf error
- Returns the validated value for the caller to consume
- Called from cmpd-replication-merged.yaml's MARIADB_REPLICATION_MODE
  env declaration; replaces the bare .Values.replication.mode pipe

Effect: helm template --set replication.mode=bogus now exits with
"invalid mariadb.replication.mode=bogus; expected one of ..." before
any manifest is produced. No bad value can ever land in a rendered
CmpD env.

ShellSpec
---------

New scripts-ut-spec/seed_replication_mode_overrides_spec.sh
(16 examples):
- empty / unset: no-op + zero override files written
- semisync: writes 4 correct .cnf files
- async: writes OFF / OFF / 1 / 10000
- invalid: rc=2 + zero override files
- missing dir: rc=5
- idempotency: mtime preserved across two invocations
- convergence: each .cnf is exactly 2 lines (no timestamp metadata)
- static contract: configmap-scripts references the seeder twice
  (data key + comment header); cmpd-replication-merged.yaml invokes
  the seeder via the source line and exits 1 on non-zero rc

Updated replication_merged_replication_mode_env_wire_spec.sh
(+6 net examples, 10 -> 16):
- Removed obsolete bare .Values.replication.mode grep (the env now
  flows through the validator helper, not a bare pipe)
- Added 4 examples for Helm template-time fail-closed: bogus /
  garbage / mixed-case ASYNC all rejected; empty "" still accepted
- Added 4 examples for the validator helper itself: defined in
  _helpers.tpl + uses default "" + calls fail() on bad value

Static verification
-------------------

- helm lint PASS
- helm template default: env value ""
- helm template --set replication.mode=semisync: env value "semisync"
- helm template --set replication.mode=async: env value "async"
- helm template --set replication.mode=bogus: FAILS at render with
  clear "invalid mariadb.replication.mode" sentinel
- bash -n + dash -n on seeder: PASS both shells
- Smoke reproduction in real shell:
  - empty env: no override files written
  - semisync: 4 files with correct content
  - repeated invocation: mtime preserved
  - bogus: rc=2 + no partial state
- Full mariadb scripts-ut-spec:
  514 examples / 0 failures / 7 pendings
  (was 492 from commit 13 v1; +16 from new seeder spec; +6 net
  from env-wire spec update)

What this commit does NOT do
----------------------------

- Wire runtime OpsRequest reconfigure to flip replication.mode at
  runtime. The Helm value remains install-time only; runtime mode
  flip is deferred to a future commit using a Cluster annotation
  reader OR a non-INI-bound PD. Once that wire lands, the mapper
  + seeder pair already in place handles both write-sites without
  further changes.
- Live cluster validation. The chart static + behavioral gates all
  pass; kbagent runtime exercising of the env-to-seeder-to-mariadbd
  path on a real cluster is the cluster lane work.

No chart version change; no engine version change.
…te contract gaps B3+B4+B5 (Jack HOLD review msg)

Builds on commit 13 v2 (1f2fe79) by closing three contract gaps
Jack surfaced in his install-time review:

B3 - silent fallback when the seeder script is missing and mode is
non-empty.
v2 used `if [ -r /scripts/seed...sh ]; then ... fi` which silently
skipped the seeder when the script was unreadable. A non-empty
MARIADB_REPLICATION_MODE could degrade to async (Class 1 silent
fallback).

B4 - target-as-directory bypass.
v2 wrote tmp + mv unconditionally. If a target path existed but was
NOT a regular file (e.g. a directory created by a prior buggy run or
out-of-band action), `mv tmp existing_dir` succeeded by moving tmp
INTO the directory. The target remained a directory, the override
content never landed in the expected file, but seeder returned rc=0.

B5 - partial-state on multi-file failure.
v2 wrote files sequentially. If file 3 failed, files 1 and 2 were
already committed to the PVC. After a partial failure, switching
mode back to "" would not clean the orphan files (seeder no-op
leaves them). Additionally a separate write-failure-detection bug
in v2 used `{ ... } > tmp` compound-command form whose redirection
failure does NOT propagate to the surrounding `if !` in bash, so
write failures were silently undetected.

B3 fix - container wire-up
--------------------------

cmpd-replication-merged.yaml: when MARIADB_REPLICATION_MODE is
non-empty AND seeder script is unreadable, container exits 1 with
sentinel before any mariadbd start. When the env is empty the
original lenient `if [ -r ]` form is preserved because the seeder
is a no-op anyway and a missing-script scenario on async clusters
should still let them boot.
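
The gating rule can be sketched as a small decision function. The name
`seeder_gate` is hypothetical, and it returns a status instead of
calling `exit` so it can be exercised outside a container; the actual
wire-up lives inline in the container command body.

```sh
# Hypothetical gate: prints "run" or "skip", or fails closed with rc=1.
seeder_gate() {
  seeder=$1
  mode=${MARIADB_REPLICATION_MODE:-}
  if [ -r "${seeder}" ]; then
    echo run                      # script present: always run the seeder
    return 0
  fi
  if [ -n "${mode}" ]; then
    # A requested mode with no seeder must not silently degrade to async.
    echo "MARIADB_REPLICATION_MODE=${mode} set but seeder script is missing or unreadable" >&2
    return 1
  fi
  echo skip                       # empty mode: lenient path, nothing to seed
}
```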

B4 fix - target type validation
-------------------------------

seed-replication-mode-overrides.sh: new
seed_replication_mode_validate_target_type helper checks each
target is absent OR a regular file BEFORE any tmp write. If any
target exists as a directory / symlink-to-dir / device / fifo /
socket the seeder returns rc=5 immediately and writes nothing.
Post-rename sanity check verifies each renamed target is now a
regular file.
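
A minimal sketch of the pre-write type check, assuming that "absent or
regular file" is the whole acceptance rule. The function name here is
shortened and hypothetical, and the shipped helper may treat symlinks
more strictly than `[ -f ]` does.

```sh
# Hypothetical sketch: accept a target only if it is absent or a regular file.
validate_target_type() {
  target=$1
  if [ -e "${target}" ] && [ ! -f "${target}" ]; then
    echo "${target} exists but is not a regular file" >&2
    return 5
  fi
  return 0
}
```

Running this for every target before the first tmp write is what
prevents `mv` from quietly moving a tmp file into a directory that
happens to share the target's name.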

B5 fix - multi-file all-or-nothing pattern
-------------------------------------------

seed-replication-mode-overrides.sh restructured into 5 phases:
- Phase A derive: compute all 4 (name, value) pairs into shell vars.
  Fail-closed on invalid mode BEFORE any disk write.
- Phase B pre-validate target types: run the B4 check on all 4
  targets BEFORE any tmp is written.
- Phase C write 4 tmp files: any write failure triggers
  cleanup_all_tmps and returns rc=5. The cleanup function removes
  any subset of tmp files that exist for the current pid suffix.
- Phase D byte-equal compare: each tmp vs target; targets already
  at the staged value skip rename to preserve mtime.
- Phase E atomic rename in tight sequence: minimizes the
  partial-commit window. Post-rename sanity check.

Also fixed the v2 write-failure-detection bug: `{ ... } > tmp`
replaced with `printf '...' > tmp` (a simple command) whose
redirect failure DOES propagate to `if !`. Added `[ -s tmp ]`
post-check as defense in depth against silent truncation.
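
The five phases can be sketched as below. This is an illustration
under stated assumptions, not the shipped seeder: the function name
`seed_sketch`, the two-line `[mysqld]` + `name=value` file layout, the
`.cnf.tmp.<pid>` naming, and the restriction to the two `*_enabled`
variables are all hypothetical.

```sh
# Hypothetical sketch of the derive / pre-validate / stage / compare / rename flow.
seed_sketch() {
  dir=$1
  mode=${MARIADB_REPLICATION_MODE:-}
  names="rpl_semi_sync_master_enabled rpl_semi_sync_slave_enabled"
  # Phase A: derive values; fail closed before touching disk.
  case "${mode}" in
    "")       return 0 ;;
    semisync) value=ON ;;
    async)    value=OFF ;;
    *)        echo "invalid mode: ${mode}" >&2; return 2 ;;
  esac
  [ -d "${dir}" ] || { echo "missing overrides dir: ${dir}" >&2; return 5; }
  # Phase B: every target must be absent or a regular file.
  for n in ${names}; do
    t="${dir}/${n}.cnf"
    if [ -e "${t}" ] && [ ! -f "${t}" ]; then
      echo "${t} exists but is not a regular file" >&2
      return 5
    fi
  done
  # Phase C: stage all tmp files; any failure removes every staged tmp.
  for n in ${names}; do
    tmp="${dir}/${n}.cnf.tmp.$$"
    if ! printf '[mysqld]\n%s=%s\n' "${n}" "${value}" > "${tmp}" \
        || [ ! -s "${tmp}" ]; then
      for m in ${names}; do rm -f "${dir}/${m}.cnf.tmp.$$"; done
      return 5
    fi
  done
  # Phases D and E: skip byte-equal targets, rename the rest back to back.
  for n in ${names}; do
    t="${dir}/${n}.cnf"
    tmp="${t}.tmp.$$"
    if [ -f "${t}" ] && cmp -s "${tmp}" "${t}"; then
      rm -f "${tmp}"              # already at staged value: keep mtime
    else
      mv "${tmp}" "${t}" || return 5
    fi
  done
}
```

Like the commit it illustrates, the sketch does not roll back renames
that already succeeded if a later rename fails.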

ShellSpec
---------

seed_replication_mode_overrides_spec.sh: 16 -> 21 examples
- B3 wire contract (2 examples):
  - cmpd-replication-merged.yaml contains the "set but seeder script
    is missing or unreadable" sentinel.
  - cmpd-replication-merged.yaml gates the missing-script check on
    non-empty MARIADB_REPLICATION_MODE (`if [ -n ... ]`).
- B4 behavior (2 examples):
  - Pre-create wait_for_slave_count target as a directory + seeder
    runs -> rc=5 with "exists but is not a regular file" sentinel.
  - With master_timeout target as directory: zero .cnf or .tmp
    residue after seeder runs (find -type f returns 0 for both).
- B5 behavior (1 example, condensed into single When call to handle
  ShellSpec stderr-capture mechanics):
  - chmod 0555 on overrides dir + seeder runs -> rc=5 + zero .cnf
    residue + zero .tmp residue.
- "fails the container on seeder non-zero rc" assertion updated:
  count is now 2 (original "seed-replication-mode-overrides failed"
  sentinel + new "set but seeder script is missing or unreadable"
  sentinel both match the regex).

Static verification
-------------------

- helm lint PASS
- helm template default/semisync/async: render cleanly
- helm template --set replication.mode=bogus: still fails with B2
  sentinel
- bash -n + dash -n on seeder: PASS both shells
- Smoke reproduction in real shell:
  - target-as-directory + semisync mode -> rc=5, zero .cnf written
  - chmod 0555 + semisync mode -> rc=5, zero residue, zero partial
  - clean valid case -> rc=0, all 4 files written with correct
    content
- Full mariadb scripts-ut-spec:
  519 examples / 0 failures / 7 pendings (was 514 from commit 13
  v2; +5 from B3/B4/B5 lock examples)

8-class contract walk-through update
------------------------------------

- Class 1 silent fallback: B3 + write-failure-detection both fixed;
  no remaining silent-fallback paths from non-empty mode.
- Class 3/5 single-commit-boundary partial state: B5 fixed via
  phased write-all-then-rename; combined with B4 pre-validation,
  the common failure modes (target type mismatch, write
  permission, disk full at write time) are all caught before any
  rename occurs.
- Class 4 sentinel rc: unchanged from v2; invalid mode -> rc=2,
  IO failure -> rc=5.

What this commit does NOT do
----------------------------

- Roll back already-committed renames on mid-batch rename failure.
  That requires saving prior content for restoration and is a
  narrower failure mode than the common cases addressed here. The
  seeder logs loudly and the container refuses to start mariadbd
  on any rename failure, so an Ops engineer sees the diagnosis.
- Fix the same `{ ... } > tmp` redirect-detection bug in the
  alpha.86 persisted helper (`_helpers.tpl`
  mariadb.config.reconfigureAction.persisted). That is a separate
  pre-existing issue out of commit 13 scope.

No chart version change; no engine version change.
…f feeding full helm template to ShellSpec matcher (Jack test-harness HOLD msg)

Closes the commit 13 v3 test-gate HOLD: the env-wire ShellSpec
previously fed the entire ~16k-line `helm template` output into
`When call ... The output should include ...`. ShellSpec's matcher
loads the captured stdout into memory and pattern-matches against
it, which was slow / unstable on macOS+Homebrew bash and timed out
in the test engineer's 34s budget. v3 v2 refactor:

- Render `helm template` once into a tmp file via a small helper
  (`render_to_tmp` / `render_stderr_to_tmp`)
- Grep the tmp file for the relevant 2-line shape via
  `grep -F -A1 'name: MARIADB_REPLICATION_MODE' | awk`
- ShellSpec matcher only sees a bounded result (~3 chars: "ok"),
  not the full manifest
- AfterEach cleans up the tmp file
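
A rough sketch of the render-once, bounded-matcher pattern described in the list above. The helper bodies are assumptions (only the names `render_to_tmp` and the `grep -F -A1 ... | awk` shape appear in the commit text), and the renderer is passed in as a command so the sketch does not depend on `helm` being installed.

```sh
#!/bin/sh
# Sketch only: helper bodies are illustrative assumptions.

# Run the given render command once and print the path of a tmp file
# holding its stdout. The caller removes the file (e.g. in AfterEach).
render_to_tmp() {
  out=$(mktemp) || return 1
  if "$@" > "$out"; then
    printf '%s\n' "$out"
  else
    rm -f "$out"
    return 1
  fi
}

# Reduce the manifest to a tiny verdict so the matcher never sees the
# full render: "ok" if the env name is followed by a value line.
env_wire_check() {
  grep -F -A1 'name: MARIADB_REPLICATION_MODE' "$1" |
    awk '/value:/ { found = 1 } END { print (found ? "ok" : "missing") }'
}
```

In a spec this would be driven by something like `manifest=$(render_to_tmp helm template test addons/mariadb)`, followed by an assertion that `env_wire_check "$manifest"` prints `ok`.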

Same contract assertions; faster matcher path:
- focused env-wire spec: 279s -> 1.29s (216x speedup)
- full mariadb scripts-ut-spec: ~5min -> ~57s

Static verification (unchanged from commit 13 v3 behavior layer):
- helm lint PASS
- helm template default/semisync/async renders correctly
- helm template --set replication.mode=bogus|garbage|ASYNC fails
  at render with `invalid mariadb.replication.mode` sentinel
- Focused env-wire ShellSpec: 16/0
- Full mariadb scripts-ut-spec: 519/0/7 (unchanged count)

No chart behavior change; only test harness refactor.
…CUE bottom that breaks live PD OpenAPI schema gen (live N=1 first-blocker fix)

Live N=1 verification in vcluster `mariadb-test5` with KubeBlocks
controller image `apecloud/kubeblocks:pr-10252-1c8723184` revealed
that commit 11 v2's `replicationmode?: _|_` declaration BREAKS the
live PD reconcile loop:

  failed to generate openAPISchema: failed to marshal cue-yaml:
  explicit error (_|_ literal) in source

The `mariadb-replication-merged-pd` ParametersDefinition never goes
Available; downstream Component reconcile reports the PD as
unavailable; Pods never get created; the chart is unusable.

Local KB validator fixtures and ShellSpec accepted the bottom
declaration because they exercise `ValidateConfigWithCue()` directly
on already-parsed CUE values. The live controller has a separate
codepath that generates an OpenAPI schema from the CUE source
before activating the PD; that codepath does NOT accept any CUE
bottom literal.

Fix
---

Remove `replicationmode?: _|_` from
`config/mariadb-config-constraint.cue`. The synthetic-key
defense-in-depth shifts entirely to the three layers that already
exist and are exercised by ShellSpec:

  1. Helm template-time validator (`mariadb.replication.mode.validate`
     helper in `_helpers.tpl`) rejects invalid
     `mariadb.replication.mode` Helm values at `helm template` time.
  2. Startup seeder (`scripts/seed-replication-mode-overrides.sh`,
     sourced from `cmpd-replication-merged.yaml` container command
     body) writes the four real `rpl_semi_sync_*` engine variables
     to `runtime-overrides.d/` BEFORE mariadbd starts. Synthetic
     key never appears in any rendered file.
  3. Reconfigure mapper (`scripts/replication-mode-mapper.sh`,
     sourced from `reconfigureAction.persisted`) unconditionally
     strips `replicationMode` / `replicationmode` from the parameter
     list BEFORE the main loop reaches `SET GLOBAL` or
     `runtime-overrides.d/` writes.
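
As an illustration of layer 3, a strip step of this kind might look like the sketch below. The function name and the input format (one `key=value` pair per line on stdin) are assumptions; the real mapper's parameter representation is not shown in this commit message.

```sh
#!/bin/sh
# Sketch only: assumes parameters arrive as key=value lines on stdin.
# Drops the synthetic key in both spellings and passes everything else on.
strip_synthetic_key() {
  while IFS= read -r line; do
    case "${line%%=*}" in
      replicationMode|replicationmode) continue ;;
    esac
    printf '%s\n' "$line"
  done
}
```

Because the strip runs before any per-parameter handling, the later loop never has to special-case the synthetic key.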

A user OpsRequest that includes `replicationmode=<value>` now
passes CUE validation via the `[string]: _` open pattern and lands
in the rendered my.cnf. This was expected to produce only an
unknown-variable warning at next restart, but mariadbd rejects an
unrecognized option-file variable unless it carries the `loose-`
prefix (the commit 16 parse probe hits the same `unknown variable`
failure), so this path is unverified until it is live-tested.
The reconfigure mapper still strips the synthetic key from the
SET GLOBAL / persist loop, so the engine never sees a
`replicationmode` runtime variable assignment.

Runtime synthetic-key OpsRequest fail-closed (KB-validator
rejecting `replicationmode=*`) is deferred to alpha.90 — needs
a KB-supported form that survives OpenAPI schema generation.
Candidates documented in the case appendix (PR
kubeblocks-addon-docs #258): a CUE pattern that compiles to OpenAPI
without bottom, a pre-flight admission webhook on Cluster CR
annotations, or a PD validation hook.

Changes
-------

- `config/mariadb-config-constraint.cue`: remove the
  `replicationmode?: _|_` declaration. The preceding multi-line
  comment is rewritten to document the live N=1 first-blocker
  finding, the three remaining defense layers, and the alpha.90
  deferred runtime fail-closed candidates.
- `scripts-ut-spec/replication_merged_pd_schema_enum_spec.sh`:
  the positive grep that locked `replicationmode?: _|_` is
  inverted into a negative-assertion that the file does NOT
  declare the bottom value as code (the grep regex excludes
  comment lines so the rationale comment does not false-positive).
  Same example count (17), same suite count.
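
The inverted assertion can be approximated with a comment-aware grep such as the sketch below. The function name is hypothetical and the exact regex used in the spec is not shown here; this version simply drops lines that start with `//` before counting declarations.

```sh
#!/bin/sh
# Sketch only: counts non-comment lines that declare the bottom value.
# In a basic regex, `?` and `|` are literal characters.
count_bottom_decl() {
  grep -v '^[[:space:]]*//' "$1" |
    grep -c 'replicationmode?:[[:space:]]*_|_' || true
}
```

The spec would then assert that `count_bottom_decl config/mariadb-config-constraint.cue` prints `0`, which holds even when the rationale comment quotes the removed declaration.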

Static verification
-------------------

- `helm lint addons/mariadb`: PASS
- `helm template test addons/mariadb` and
  `--set replication.mode=semisync`: render cleanly (CUE no longer
  contains a bottom literal, so future live PD reconcile will
  succeed).
- Focused `replication_merged_pd_schema_enum_spec.sh`: 17 examples
  / 0 failures.
- Full mariadb scripts-ut-spec: 519 examples / 0 failures / 7
  pendings (unchanged from commit 13 v3; the bottom-value lock
  example was inverted, not added).

What this commit does NOT do
----------------------------

- Add a runtime KB-validator path that rejects `replicationmode=*`
  user input. Deferred to alpha.90; case appendix records the
  candidate forms.
- Re-run live N=1 validation in `mariadb-test5`. The test cluster
  has the alpha.89 chart with the broken CUE already installed;
  Jack will need to uninstall + reinstall (or apply a CmpD update
  if the chart version supports it) and re-run the install-time
  semisync first-boot SQL ON/ON check.
- Roll forward to alpha.90 chart version. The `_|_` revert
  affects CUE content, not the CmpD spec shape, so within the
  alpha.89 chart we can keep iterating.

No chart version change; no engine version change.
…sion compatibilityRules (live N=1 second first-blocker fix)

Live N=1 retest on commit 14 chart in vcluster `mariadb-test5`
(release `mariadb-alpha69-5c`, ns `mariadb-alpha89-mode-n1b-040717`)
got past the commit 14 PD OpenAPI blocker (PD now Available) but
hit a NEW first-blocker: Pod create fails with

  spec.containers[0].image: Required value
  spec.containers[1].image: Required value
  spec.initContainers[0].image: Required value

Controller log shows `ImageUtil parse image failed, image=""`. The
live InstanceSet has empty image fields for `mariadb`, `exporter`,
and `init-syncer`, while `kbagent` / `kbagent-worker` have their
tools image.

Root cause
----------

`addons/mariadb/templates/cmpv.yaml` declares ComponentVersion
compatibilityRules that bind release images (10.6.15, 11.4.5,
11.4.8, 11.4.10, 11.8.4, 12.0.2) to CmpDs matching one of these
regexes:

  - ^mariadb-[0-9]              (standalone)
  - ^mariadb-replication-[0-9]  (replication, digit-anchored)
  - ^mariadb-semisync-          (semisync)
  - ^mariadb-galera-            (galera, in a separate rule)

The merged CmpD added in commit 1 is named
`mariadb-replication-merged-1.1.1-alpha.89`. None of the four
regexes above match this name:
  - `^mariadb-replication-[0-9]` requires a digit immediately after
    `mariadb-replication-`, but the merged CmpD has `merged-` next.
  - The other three regexes don't match either.

KB's ComponentVersion controller therefore did not bind any
release images to the merged CmpD, and the InstanceSet rendered
with empty image fields.

Fix
---

Add `mariadb.replication.merged.cmpdRegexpPattern` (which expands
to `^mariadb-replication-merged-`) to the same compDefs list that
already covers standalone / replication / semisync. The merged
regex does not overlap with the existing three (it requires the
literal `merged-` segment right after `mariadb-replication-`), and
the merged CmpD shares the same 6-release image set as the
replication/semisync chain.
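
The mismatch and the fix can be reproduced locally with a small matcher. This is a sketch: it uses `grep -E` as a stand-in for the controller's regex engine (an assumption that holds for patterns this simple), and the helper name is hypothetical.

```sh
#!/bin/sh
# Patterns quoted from the commit message above.
ORIGINAL_PATTERNS='^mariadb-[0-9]
^mariadb-replication-[0-9]
^mariadb-semisync-
^mariadb-galera-'
MERGED_PATTERN='^mariadb-replication-merged-'

# Read patterns (one per line) on stdin; print each one that matches NAME.
matching_patterns() {
  name=$1
  while IFS= read -r pat; do
    if printf '%s\n' "$name" | grep -Eq -- "$pat"; then
      printf '%s\n' "$pat"
    fi
  done
}

# No original pattern matches the merged CmpD name (prints nothing):
printf '%s\n' "$ORIGINAL_PATTERNS" |
  matching_patterns mariadb-replication-merged-1.1.1-alpha.89

# With the merged pattern added, exactly one pattern matches:
printf '%s\n%s\n' "$ORIGINAL_PATTERNS" "$MERGED_PATTERN" |
  matching_patterns mariadb-replication-merged-1.1.1-alpha.89
```

The second call prints only `^mariadb-replication-merged-`, so the merged CmpD name is bound by the new pattern alone and by none of the original four.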

Changes
-------

- `templates/cmpv.yaml`: add the merged regex to the first
  compatibilityRule's compDefs list, with a multi-line comment
  pinning the live N=1 evidence so a future edit cannot silently
  drop it.

- `scripts-ut-spec/cmpv_merged_compatibility_rule_spec.sh` (new,
  9 examples):
  - Template-level: cmpv.yaml references the merged regex helper
    exactly once; the original three regexes are still present
    (regression guard).
  - Helper-level: _helpers.tpl defines the merged regex helper
    and its value is the expected `^mariadb-replication-merged-`.
  - Rendered-manifest level: render once into tmp file via
    `render_to_tmp` helper, awk-extract the first
    compatibilityRule's compDefs list, assert all four expected
    regexes are present (one assertion each so a future
    regression points at the specific missing regex). Plus a
    bounded grep that the rendered manifest contains the
    `docker.io/mariadb:11.4` image line for the 11.4.10 release
    block (the merged CmpD's default serviceVersion).

Same render-to-tmp + bounded-matcher pattern as commit 13 v3 v2
env-wire spec — keeps the spec fast (1.03s focused) and stable on
macOS+Homebrew.

Static verification
-------------------

- helm lint PASS
- helm template renders the merged regex as
  `^mariadb-replication-merged-` in the first compatibilityRule's
  compDefs list, alongside the existing three
- Focused cmpv-compat spec: 9/0 in 1.03s
- Full mariadb scripts-ut-spec: 528 examples / 0 failures / 7
  pendings (commit 14 base 519 + 9 new locks)

What this commit does NOT do
----------------------------

- Bump chart version. The compatibilityRules change is additive
  to an existing CmpV resource; no new CmpD spec mutation. Same
  alpha.89 chart can keep iterating.
- Re-run live N=1 in `mariadb-test5`. The test cluster has the
  alpha.89 chart with the missing-image-binding already installed;
  Jack will need to uninstall + reinstall (or apply a CmpV update)
  and re-run the install-time semisync first-boot SQL ON/ON
  check.

No engine version change.
…nc_master_wait_for_slave_count (MariaDB 11.4 unsupported, live N=1 third first-blocker fix)

Live N=1 third round on commit 15 chart in vcluster `mariadb-test5`
ns `mariadb-alpha89-mode-n1c2-041720`. Cleared the previous two
invalid-run blockers (PD Available, CmpV image binding bound) and
landed in the actual MariaDB runtime layer for the first time. The
seeder ran and wrote the 4 expected override files. mariadbd then
CrashLooped because one of those four variables is not recognized
by MariaDB.

Empirical evidence (Jack's same-image parse probe):
  - With all 4 overrides loaded via --defaults-extra-file:
    `mariadbd --verbose --help` exits rc=7 with stderr containing
    `unknown variable 'rpl_semi_sync_master_wait_for_slave_count=1'`
  - Removing only that one file: same probe exits rc=0.

Root cause: `rpl_semi_sync_master_wait_for_slave_count` is a MySQL
extension (added in MySQL 5.7.3). MariaDB 11.4 does NOT recognize
it. MariaDB semisync waits for exactly one secondary
acknowledgement and has no configurable count variable. The
original 4-variable picture in this addon came from a MySQL-flavored
reference and was never live-validated before commit 15 — local
ShellSpec only exercises shell parse / file-write behavior, not
mariadbd startup parse.
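
A per-file variant of the parse probe could be scripted as in the sketch below, which isolates the offending override by probing each file on its own. The function name and the one-file-at-a-time approach are assumptions; the original probe loaded all overrides together. The server binary is passed in as a parameter so the helper can be exercised without a MariaDB image.

```sh
#!/bin/sh
# Sketch only: probe_overrides BIN DIR prints each override file that the
# server binary rejects, and returns 1 if any file was rejected.
probe_overrides() {
  bin=$1
  dir=$2
  bad=0
  for f in "$dir"/*.cnf; do
    [ -f "$f" ] || continue
    # --defaults-extra-file must be the first option on the command line.
    if ! "$bin" --defaults-extra-file="$f" --verbose --help >/dev/null 2>&1; then
      printf 'rejected: %s\n' "$f"
      bad=1
    fi
  done
  return "$bad"
}
```

Inside the engine image this would be invoked as `probe_overrides mariadbd <overrides-dir>`, where the directory is wherever the seeder wrote its `.cnf` files.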

Changes
-------

- `config/mariadb-config-constraint.cue`: remove the field
  declaration `rpl_semi_sync_master_wait_for_slave_count?: int &
  >=1 & <=65535 | *1`. Comment rewritten to record the live N=1
  third first-blocker evidence and the MariaDB-vs-MySQL semisync
  variable delta.

- `config/mariadb-config-effect-scope.yaml`: remove
  `rpl_semi_sync_master_wait_for_slave_count` from
  dynamicParameters so the reconfigureAction.persisted helper
  does not attempt to `SET GLOBAL` the unknown variable.

- `scripts/seed-replication-mode-overrides.sh`: drop the variable
  from the 5-phase write loop (was 4 vars, now 3). Phase B
  pre-validate, Phase C tmp-write, Phase E rename, and the
  cleanup_all_tmps helper are all updated. The header comment in
  the derive helper preserves the rationale.

- `scripts/replication-mode-mapper.sh`: drop the variable from the
  reconfigure-time derive helper. Script preamble extended with
  the live N=1 finding and the MariaDB-vs-MySQL delta. async
  branch comment trimmed because the 4-var deterministic
  rationale was specifically about both flips of the variable.

- ShellSpec updates (528 examples / 0 failures / 7 pendings —
  same count, three assertion polarities inverted, one fixture
  swapped):
  - `replication_merged_pd_schema_enum_spec.sh`:
    - "constrains rpl_semi_sync_master_wait_for_slave_count to a
      positive int range" → "does NOT declare ... (commit 16
      MariaDB-unsupported drop)". Match only code lines (skip `//`
      comments).
    - "lists rpl_semi_sync_master_wait_for_slave_count in
      dynamicParameters" → "does NOT list ... (commit 16)".
  - `replication_mode_mapper_spec.sh`: two "all 4 derived vars
    for mode=...semisync/async" assertions now check the 3
    MariaDB-supported vars and add a negative for
    wait_for_slave_count. Same for the both-consistent example.
  - `seed_replication_mode_overrides_spec.sh`: "writes
    rpl_semi_sync_master_wait_for_slave_count=1" → "does NOT
    write ... (commit 16 MariaDB-unsupported drop)". The B4
    dir-target validation tests pivot from
    `rpl_semi_sync_master_wait_for_slave_count.cnf` (no longer
    written) to `rpl_semi_sync_master_timeout.cnf`.

Static verification
-------------------

- helm lint PASS
- bash -n + dash -n on both scripts: PASS
- Smoke reproduction in real shell: seeder with mode=semisync now
  writes exactly 3 files (master_enabled, slave_enabled,
  master_timeout). master_enabled.cnf contains
  `[mysqld]\nrpl_semi_sync_master_enabled = ON`.
- Focused 3 specs (pd_schema + mapper + seeder): 67/0 in 3.92s
- Full mariadb scripts-ut-spec: 528 / 0 / 7 (same total as commit
  15)

What this commit does NOT do
----------------------------

- Bump chart version. The 3-variable variant is still alpha.89 v1
  iteration; no new CmpD spec shape that would require chart
  version bump.
- Run live N=1 in `mariadb-test5`. Jack will need to update to
  this commit and re-run the install-time semisync first-boot
  SQL ON/ON check.
- Address the wait-count semantics gap. MariaDB semisync waits
  for exactly one acknowledgement; no equivalent variable.

No engine version change.
…tability live N=1 fourth first-blocker fix)

Live N=1 fourth round on commit 16 chart in vcluster `mariadb-test5`
got past the MariaDB-unsupported-variable blocker (engine doesn't
CrashLoop anymore) but stalled on a setup blocker: the live
`mariadb-replication-merged-1.1.1-alpha.89` ComponentDefinition
stayed Unavailable with condition `immutable fields can't be
updated`.

Root cause: KubeBlocks treats the rendered ComponentDefinition spec
as immutable. alpha.89 commits 13/14/15/16 each mutated the merged
CmpD spec — env block (replicationMode env wire), container command
body (startup seeder source), CmpV regex (in cmpv.yaml, also tracked
under same chart version), dynamicParameters list, configmap-scripts
content. KB sees each upgrade attempt as a same-version update and
refuses because the spec has changed. The CmpD stays Unavailable;
the Cluster reconcile reuses the broken CmpD; further N=1 samples
are invalid until a fresh CmpD identity exists.

Same KB immutability rule that drove the historical alpha.65 /
alpha.66 / alpha.70 / earlier bumps. Once any cmpd-*.yaml mutation
happens within an alpha cycle, the chart version MUST bump so KB
creates a NEW CmpD (`mariadb-replication-merged-1.1.1-alpha.90`)
instead of attempting an immutable-field update on the old one.

Changes
-------

- `Chart.yaml`:
  - `version: 1.1.1-alpha.89` → `version: 1.1.1-alpha.90`
  - Prepended a new comment block at the top of the changelog stack
    documenting the live N=1 fourth first-blocker, the alpha.89
    commit chain that mutated the merged CmpD, and the immutability
    rule citation back to alpha.65.

- ShellSpec literal bumps (no behavior change, just version string
  updates so existing assertions still match the rendered chart):
  - `replication_switchover_spec.sh`: 4 occurrences of
    `1.1.1-alpha.89` → `1.1.1-alpha.90` in `grep -c '^version: ...'`
    assertions and explicit "version: ..." output expectations.
  - `replication_user_convergence_spec.sh`: 2 occurrences in Gate 1
    chart-version check. Test description updated to reference the
    immutability-bump rationale instead of the topology-merge
    rationale.
  - `replication_merged_pd_regex_disambiguation_spec.sh`: 6
    occurrences in the canonical-name fixture set (STANDALONE_NAME,
    GALERA_NAME, OLD_REPL_NAME, OLD_SEMISYNC_NAME, MERGED_NAME, plus
    one comment line).

Static verification
-------------------

- helm lint PASS
- Full mariadb scripts-ut-spec: 528 / 0 / 7 (same total; all 12
  version-literal references now match the new chart version).

What this commit does NOT do
----------------------------

- Mutate any cmpd-*.yaml, _helpers.tpl, scripts/*, or config/* file.
  Only Chart.yaml version + version literals in ShellSpec. The
  merged CmpD spec content is unchanged from commit 16; the new
  CmpD identity `mariadb-replication-merged-1.1.1-alpha.90` has the
  same shape, just a new name.
- Run live N=1 in `mariadb-test5`. Jack will need to update to this
  commit and the test cluster needs the new CmpD created (KB's
  CmpV controller will bind release images to it via the merged
  regex from commit 15; old `mariadb-replication-merged-1.1.1-alpha.89`
  CmpD stays Unavailable but doesn't block new clusters).

No engine version change.
@weicao weicao changed the title feat(mariadb): add 11.4 replication and semisync hardening feat(mariadb): topology merge + addon-api conformance (alpha.86 → alpha.90) May 23, 2026