feat(mariadb): topology merge + addon-api conformance (alpha.86 → alpha.90)#2633
Open
weicao wants to merge 79 commits into
Add MariaDB 11.4 standalone, replication, semisync, and Galera chart resources. Harden semisync startup, role publication, switchover fencing, and script distribution. Add shell specs for replication member join, role probe, switchover, and standalone template mapping.
Codecov Report
@@           Coverage Diff            @@
##            main    #2633     +/-   ##
=======================================
  Coverage   0.00%    0.00%
=======================================
  Files         73       79      +6
  Lines       9197    12535   +3338
=======================================
- Misses      9197    12535   +3338
Keep the KubeBlocks health-check table schema on fresh replicas and clear only local rows before starting or repairing SQL replication. This prevents the replica repair path from changing a duplicate-key error into a missing-table replication error.
Require internal local admin read-only privileges before role decisions. Track primary read/write readiness after local root unlock and read_only repair. Repair syncer primary reconciliation when the listener is already exposed but local write readiness is missing.
KB kbagent enforces a hardcoded `maxActionCallTimeout = 60 * time.Second`
in `pkg/kbagent/service/action_utils.go::actionCallTimeoutContext`, so any
CmpD `switchover.timeoutSeconds` greater than 60 is silently truncated.
alpha.58 declared 240; live-test evidence (cost=60060ms result=timedOut
on the kbagent action HTTP call) confirmed the action script was killed
mid-flight at exactly 60 seconds.
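A minimal sketch of the clamp described above, assuming only the 60-second ceiling quoted from kbagent. `KBAGENT_MAX_ACTION_TIMEOUT` and `effective_action_timeout` are illustrative names, not identifiers from KubeBlocks or this chart:

```shell
#!/bin/sh
# Models the kbagent behaviour described above: any declared action timeout
# above the hardcoded ceiling is reduced to the ceiling.
# Both names below are hypothetical and exist only for this illustration.
KBAGENT_MAX_ACTION_TIMEOUT=60

effective_action_timeout() {
  declared="$1"
  if [ "$declared" -gt "$KBAGENT_MAX_ACTION_TIMEOUT" ]; then
    echo "$KBAGENT_MAX_ACTION_TIMEOUT"
  else
    echo "$declared"
  fi
}

effective_action_timeout 240   # the alpha.58 declaration
effective_action_timeout 60    # the alpha.59 declaration
```

Both calls yield the same effective budget, which is why the declared value is lowered to 60 rather than left at 240.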
alpha.59 contract:
* CmpD `switchover.timeoutSeconds: 240 -> 60` in cmpd-semisync.yaml and
cmpd-replication.yaml so the declared contract reflects what kbagent
actually enforces.
* `run_switchover` shrinks to four required steps that all must fit
inside the 60s ceiling:
1. `prepare_current_primary_for_switchover` (local prep, ~3s)
2. `syncerctl_switchover` (DCS record, ~5s)
3. `fence_current_primary_local_writes_after_dcs` (local read_only
fence, ~1s) - retained synchronously because it is the
double-writable defense and must be true before action returns
4. `wait_candidate_remote_root_write_ready` (bounded ~8s probe,
fail-closed) - the third leg of the action-success contract:
never return 0 with a non-writable candidate.
* `wait_switchover_done`, `wait_post_switchover_stabilization`,
`wait_primary_service_routes_candidate`,
`wait_current_secondary_remote_root_fenced` are no longer invoked
from `run_switchover`. Post-DCS convergence is delegated to roleProbe
+ KB endpoint controller. The negative assertion that none of these
helpers fires lives in the new
`replication_switchover_spec.sh` "alpha.59 contract" tests.
* `kubeblocks.kb_health_check` 1062/1146 repair migrates from the
switchover action wait loops into the secondary roleProbe path
(`secondary_kb_health_check_repair_attempt` in
`replication-roleprobe.sh`). The repair has a precise signature
(`Last_(SQL_)?Errno: 1062|1146` AND `kubeblocks.kb_health_check` in
the slave error text), uses `kb_internal_root` (READ_ONLY ADMIN), is
best-effort, idempotent, and logs each attempt with rc. Other SQL
errors are NOT swallowed.
* ShellSpec gains six new examples for
`secondary_kb_health_check_repair_attempt` covering the precise
signature, the cli-user choice, the wrong-table negative case, the
wrong-errno negative case, and the empty-status negative case, plus
six examples for `slave_status_has_kb_health_check_repairable_error`.
* Two new `run_switchover` examples exercise the alpha.59 contract:
the negative assertion that the wait_* helpers are never invoked,
and the fail-closed path when the candidate write probe does not
close inside the bounded budget. Three obsolete examples (which
exercised `wait_switchover_done` directly) are removed.
* The runner-side post-OpsRequest convergence gate is
test-runner-owned (separate change; out of this addon patch).
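The repair signature in the list above could be sketched roughly as follows. This is not the chart's `slave_status_has_kb_health_check_repairable_error`; the function name and the exact status-text layout are assumptions. The alternation is parenthesised here so that a bare `1146` elsewhere in the text cannot satisfy the errno check on its own:

```shell
#!/bin/sh
# Hypothetical matcher: succeeds only when the replica status text carries
# errno 1062 or 1146 AND mentions kubeblocks.kb_health_check.
status_has_repairable_kb_health_check_error() {
  status="$1"
  # Errno check: the number must end at a non-digit or end of line.
  printf '%s\n' "$status" |
    grep -Eq 'Last_(SQL_)?Errno: (1062|1146)([^0-9]|$)' || return 1
  # Table check: the error must concern the health-check table.
  printf '%s\n' "$status" |
    grep -Fq 'kubeblocks.kb_health_check' || return 1
  return 0
}
```

Any other errno, any other table, or an empty status falls through to a non-zero return, which matches the "other SQL errors are NOT swallowed" rule.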
References:
- apecloud/kubeblocks `pkg/kbagent/service/action_utils.go:64`
(`maxActionCallTimeout = 60 * time.Second`)
- addon-test-runner-write-after-bounded-role-gate guide
- bootstrap-runner-preload-after-bounded-role-gate-case
Two design-contract gaps caught in pair review:
1. fence_current_primary_local_writes_after_dcs previously verified only
@@global.read_only=1, never that a user-facing root INSERT was actually
rejected by the read-only fence. The contract field was non-empty but
unenforced at the write site (xp design-contract class 2). Add
verify_post_dcs_local_root_write_fenced: it runs a localhost user-facing
root INSERT into kubeblocks.kb_post_dcs_fence_probe and passes only when
rc!=0 with stderr containing 1290/read-only (fence verified). rc=0 means
the fence is not enforced and fails closed. Other failure modes (no
client, unrelated SQL error) also fail closed. Documentation in the
function header records the contract change.
2. secondary_kb_health_check_repair_attempt previously did SET GLOBAL
read_only=OFF -> DELETE -> SET GLOBAL read_only=ON, creating a small but
real write window during which any client could have written to the
secondary. This contradicts the double_writable=0 invariant the
post-OpsRequest convergence test is meant to prove. Remove the read_only
flip entirely: the repair now uses kb_internal_root (which holds
READ_ONLY ADMIN from the addon's remote-root-fence path) and writes
through while @@global.read_only=1 stays in place. If kb_internal_root
cannot write for any reason, log rc and return; the next roleProbe tick
re-evaluates.
ShellSpec changes:
* New Describe "verify_post_dcs_local_root_write_fenced()" with 4
examples: 1290 rejection -> success; rc=0 -> fail-closed; unrelated
error -> fail-closed; no client binary -> fail-closed.
* secondary_kb_health_check_repair_attempt "alpha.59 invariant" example
now negative-asserts on SET GLOBAL read_only=OFF and SET GLOBAL
read_only=ON. The earlier "fires repair" example drops the
now-incorrect positive assertions for those two SQL statements.
* Existing happy-path run_switchover examples gain a
verify_post_dcs_local_root_write_fenced stub (return 0) so the fence
verification still passes inside the SQL-mock environment.
Total: 150 examples, 0 failures, 0 warnings.
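As a rough illustration of the verify contract in this commit, the decision can be reduced to a function of the probe's exit code and stderr. `fence_probe_verdict` and its verdict strings are hypothetical; the real verifier also runs the INSERT itself, which is omitted here:

```shell
#!/bin/sh
# Hypothetical decision table for the post-DCS write probe.
# $1 = exit code of the user-facing root INSERT, $2 = its stderr text.
fence_probe_verdict() {
  rc="$1"
  err="$2"
  if [ "$rc" -eq 0 ]; then
    # The INSERT succeeded, so read_only did not stop this account.
    echo "not_fenced"
    return 1
  fi
  case "$err" in
    *1290*|*read-only*)
      # Rejected by the read-only fence: the only passing outcome.
      echo "fenced"
      return 0
      ;;
  esac
  # Missing client, unrelated SQL error, and so on: fail closed.
  echo "unverified"
  return 1
}
```

Only one of the three branches returns 0, so every unexpected failure mode is treated the same as a missing fence.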
…witchover
alpha.59 switchover N=1 RED with first-blocker = addon product / switchover
post-DCS root fence contract. Triple-source evidence: kbagent action cost
2.793s (NOT 60s cap; alpha.59 contract truncation works), action stderr
"post-DCS local-root write fence not enforced; user-facing root INSERT
succeeded after read_only=ON", SHOW GRANTS FOR root@% contains READ_ONLY
ADMIN, mysql.user shows root@127.0.0.1/root@localhost Insert_priv=Y
Super_priv=Y. Causal chain: addon apply_remote_root_fence "primary" granted
ALL PRIVILEGES (which in MariaDB 10.11+ bundles READ_ONLY ADMIN / SUPER /
BINLOG ADMIN), so user-facing root bypassed @@global.read_only=ON; the
alpha.59 verify_post_dcs_local_root_write_fenced caught it. This gap
existed in alpha.58 too but was masked by the absence of a verify probe.
alpha.60 hard contract (per Jack 23:28 8-class XP review):
* New revoke_user_facing_root_admin_privileges_for_secondary in
replication-switchover.sh:
- Enumerates mysql.user for actual root host rows (does not hardcode
%/127.0.0.1/localhost; covers whatever the live DB actually has)
- For each host: SHOW GRANTS first; if READ_ONLY ADMIN / SUPER /
BINLOG ADMIN / ALL PRIVILEGES is present, REVOKE each bypass priv
by name (never REVOKE ALL PRIVILEGES, never REVOKE GRANT OPTION as
a privilege)
- Distinct sentinel reasons per Jack class 4 (root_account_not_found,
privilege_absent_already_fenced, revoked, revoke_failed) so closeout
can attribute precisely
- 1141 (no such grant) on REVOKE is treated as already-fenced; any
other REVOKE error is fail-closed (Jack class 1: never silent
fallback)
- kb_internal_root is intentionally OUT of scope; it must keep
READ_ONLY ADMIN for the alpha.59 secondary roleProbe 1062 repair
path
- All SQL is via the kb_internal_root client (ROOT_LOCAL bypass not
used; revoking your own privilege mid-statement is risky)
- FLUSH PRIVILEGES + mysql.user snapshot logged at end
* fence_current_primary_local_writes_after_dcs gains the revoke step
between local_read_only_is "1" and verify_post_dcs_local_root_write_fenced.
Failed revoke -> immediate return 1; no partial fence.
* apply_remote_root_fence "primary" in replication-roleprobe.sh: the
GRANT ALL PRIVILEGES is replaced with an explicit privilege list that
EXCLUDES SUPER / READ_ONLY ADMIN / BINLOG ADMIN. GRANT OPTION is now
only via the trailing WITH GRANT OPTION clause (per Jack: putting it
in the comma-separated privilege list is a syntax error in some
MariaDB versions). This prevents alpha.61 from re-introducing the same
bypass through normal role transitions.
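The per-host revoke logic above can be sketched as two small helpers, assuming SHOW GRANTS output and the REVOKE client's stderr are already captured as text. The function names are hypothetical; the sentinel strings are the ones listed in this commit:

```shell
#!/bin/sh
# Hypothetical helper: print each read_only-bypass privilege that appears
# in the given SHOW GRANTS text, one per line.
grants_bypass_privs() {
  grants="$1"
  for priv in 'READ_ONLY ADMIN' 'SUPER' 'BINLOG ADMIN' 'ALL PRIVILEGES'; do
    case "$grants" in
      *"$priv"*) echo "$priv" ;;
    esac
  done
}

# Hypothetical helper: map a REVOKE result to a sentinel.
# $1 = client exit code, $2 = client stderr.
revoke_outcome() {
  rc="$1"
  err="$2"
  if [ "$rc" -eq 0 ]; then
    echo "revoked"
    return 0
  fi
  case "$err" in
    *"ERROR 1141"*)
      # No such grant: the privilege was already absent.
      echo "privilege_absent_already_fenced"
      return 0
      ;;
  esac
  # Any other REVOKE error fails closed.
  echo "revoke_failed"
  return 1
}
```

Revoking by name from the detected list keeps the statement set minimal, and the distinct sentinels let a closeout tell "nothing to do" apart from "revoke broke".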
ShellSpec increments (10 new examples, 0 failures, 0 warnings, 157 total):
* Describe "revoke_user_facing_root_admin_privileges_for_secondary()"
6 examples covering each sentinel: account-not-found skip, multi-host
revoke success, multi-host with one fail-closed, 1141 already-fenced,
no-bypass-priv already-fenced, no-client fail-closed
* Describe "fence_current_primary_local_writes_after_dcs() revoke
fail-closed" 1 example asserts verify probe is NOT called when revoke
fails (negative trip-wire)
* Existing happy-path run_switchover examples gain
revoke_user_facing_root_admin_privileges_for_secondary stub (return 0)
alongside the existing verify_post_dcs stub
* roleprobe primary fence example asserts the new grant: REVOKE ALL
PRIVILEGES present, GRANT ALL PRIVILEGES NOT present, SUPER NOT
present, READ_ONLY ADMIN NOT present, BINLOG ADMIN NOT present,
", GRANT OPTION," (in the privilege list) NOT present, WITH GRANT
OPTION (trailing clause) present
Caveat: cmpd-semisync.yaml's set_local_root_account_state and
set_remote_root_account_state UNLOCK paths still re-grant ALL PRIVILEGES;
those are runtime sql-listener-fence transitions, not switchover-time
operations. Post-switchover their re-grant would have to be revoked again
on next switchover. Cleaning those up is alpha.61+ scope; alpha.60 trusts
switchover-time revoke as the immediate fix.
References:
- alpha.59 RED closeout msg 80e3b77c (4-source confirmation)
- alpha.59 design contract review msg 9e722fa8 (8-class)
- addon-test-runner-write-after-bounded-role-gate-guide.md (companion
methodology for the fence-correctness invariant)
…R to avoid mariadb image entrypoint side-effect (alpha.74 v1)
alpha.73 v1 N=1 partial first-blocker: pod-1 Slave_SQL_Running=No,
Last_SQL_Errno=1396 "Operation CREATE USER failed for kb_replicator@%"
on binlog replay. Direct evidence in n1h tar
3a8eccf7ef75ea4dd4214a8630a7b796afa33325a1832037cc6a8c5940254b00:
Query: CREATE USER 'kb_replicator'@'%' IDENTIFIED BY '<redacted>'
(no IF NOT EXISTS clause).
Root cause: mariadb 11.4 image entrypoint
/usr/local/bin/docker-entrypoint.sh examines MARIADB_REPLICATION_USER
env at initdb time. If set, it runs CREATE USER + GRANT REPLICATION
SLAVE WITHOUT IF NOT EXISTS, WITHOUT SET SESSION sql_log_bin=0
wrapper. Statement is binlogged. pod-1 ALSO runs entrypoint initdb
which creates kb_replicator locally; then pod-1 START SLAVE replays
pod-0 binlog -> 1396.
Fix: STOP setting MARIADB_REPLICATION_USER env entirely (so mariadb
entrypoint doesn't trigger that CREATE/GRANT path). Introduce a
renamed MARIADB_REPL_USER shell variable for chart scripts. syncer
Go binary continues to read MYSQL_REPLICATION_USER (engines/mariadb/
config.go:80) -- this env name is NOT consumed by mariadb image
entrypoint, so syncer path stays converged to kb_replicator with no
entrypoint side-effect. Chart's ensure_internal_local_admin still
creates kb_replicator@'%' via INTERNAL_LOCAL with sql_log_bin=0,
so chart's CREATE USER is NOT binlogged. Both pods bootstrap
kb_replicator locally (idempotent) and no binlog event needs to
replay through START SLAVE.
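A sketch of the binlog-safe bootstrap statement described above, as a shell function that only prints SQL. The `sql_quote` here is a simplified stand-in for the chart's helper (it doubles single quotes and does not handle backslashes), and the GRANT line is an assumption about what the replication account needs:

```shell
#!/bin/sh
# Simplified stand-in for the chart's sql_quote: doubles single quotes.
# It does not handle backslash escapes.
sql_quote() {
  printf '%s' "$1" | sed "s/'/''/g"
}

# Hypothetical builder: prints idempotent, non-binlogged account SQL.
# $1 = replication password.
build_replication_user_sql() {
  user="$(sql_quote "${MARIADB_REPL_USER:-kb_replicator}")"
  pass="$(sql_quote "$1")"
  cat <<EOF
SET SESSION sql_log_bin=0;
CREATE USER IF NOT EXISTS '${user}'@'%' IDENTIFIED BY '${pass}';
GRANT REPLICATION SLAVE ON *.* TO '${user}'@'%';
EOF
}
```

With `sql_log_bin=0` set first, neither statement reaches the binlog, and `IF NOT EXISTS` makes the local bootstrap repeatable on every pod.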
Changes:
- Chart.yaml: bump alpha.73 -> alpha.74 + alpha.74 v1 comment block
- cmpd-semisync.yaml env: remove MARIADB_REPLICATION_USER +
MARIADB_REPLICATION_PASSWORD; add MARIADB_REPL_USER +
MARIADB_REPL_PASSWORD. Keep MYSQL_REPLICATION_USER +
MYSQL_REPLICATION_PASSWORD.
- cmpd-semisync.yaml inline CHANGE MASTER (4 sites): MASTER_USER=
'${MARIADB_REPL_USER:-kb_replicator}'
- cmpd-semisync.yaml ensure_internal_local_admin shell var sourcing:
replication_user="$(sql_quote "${MARIADB_REPL_USER:-kb_replicator}")"
- replication-member-join.sh (2 sites): MASTER_USER=
'${MARIADB_REPL_USER:-kb_replicator}'
- replication_user_convergence_spec.sh: update Gate 2 (semisync
MARIADB_REPL_USER positive + MARIADB_REPLICATION_USER negative),
Gate 3 (member-join + inline assertions use REPL_USER), Gate 4
(shell var sourcing), Gate 5 (MARIADB_REPL_PASSWORD positive +
MARIADB_REPLICATION_PASSWORD negative). Total 25 examples (was 23).
- replication_switchover_spec.sh: bump alpha version literals
(alpha.73 -> alpha.74).
Static gates: helm lint PASS, bash -n / dash -n PASS, ShellSpec
309/0. Live N=0 (alpha.74 new patch-version window, does not
inherit alpha.73 partial).
westonnnn `df3c94b0` 01:28 enabled 12h autopilot. TL self-determined
without blocking on Jack XP review (Jack review parallel, msg
`100a28e1` + revised `bd7c5dff`). Cindy PM boundary kept at msg
`28cf6bfe`.
…in fallback (alpha.74 v1)
Jack XP review HOLD at msg 42405f6d 01:32 caught: replication-member-
join.sh is shared by semisync and replication topology. alpha.72 v1
Option 1 scope-cap requires replication topology fallback to
MARIADB_ROOT_USER (root) when MARIADB_REPL_USER env is absent. My
first push used '${MARIADB_REPL_USER:-kb_replicator}' which broke that
contract for replication-only pods.
Fix: restore chained fallback '${MARIADB_REPL_USER:-${MARIADB_ROOT_USER}}'.
Semisync pods set MARIADB_REPL_USER=kb_replicator -> use kb_replicator.
Replication-topology pods don't set the env -> fall through to
MARIADB_ROOT_USER (root) -> pre-alpha.72 behavior preserved.
ShellSpec Gate 3 member-join assertion updated to reflect chained
fallback.
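The chained fallback can be checked in isolation with a few lines of shell. `resolve_master_user` is a hypothetical wrapper around the same parameter expansion:

```shell
#!/bin/sh
# Hypothetical wrapper around the chained fallback used for MASTER_USER:
# prefer MARIADB_REPL_USER, otherwise fall through to MARIADB_ROOT_USER.
resolve_master_user() {
  printf '%s\n' "${MARIADB_REPL_USER:-${MARIADB_ROOT_USER}}"
}

# Semisync-style environment: the dedicated replication user is set.
( MARIADB_REPL_USER=kb_replicator; MARIADB_ROOT_USER=root; resolve_master_user )

# Replication-topology-style environment: only the root user is set.
( unset MARIADB_REPL_USER; MARIADB_ROOT_USER=root; resolve_master_user )
```

Because `:-` treats an empty value like an unset one, an empty MARIADB_REPL_USER also falls through to the root user.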
Also adds documented evidence (in commit msg, not code) that mariadb
11.4 entrypoint /usr/local/bin/docker-entrypoint.sh greps for
MARIADB_REPLICATION_USER only:
- Line 178: if [ -n "$MARIADB_REPLICATION_USER" ]; then
- Line 278-280: file_env 'MARIADB_REPLICATION_USER' / _PASSWORD / _HASH
- Line 333 + 338: CREATE USER '$MARIADB_REPLICATION_USER'@'%' (no IF NOT EXISTS, binlogged)
- Line 464: if [ -n "$MARIADB_REPLICATION_USER" ]; -> Creating user
- Zero (0) grep matches for MYSQL_REPLICATION_USER -- this env name is NOT consumed by the mariadb 11.4 image entrypoint, so setting MYSQL_REPLICATION_USER for syncer Go binary doesn't trigger entrypoint side-effect.
This satisfies Jack blocker #2 (direct entrypoint evidence) and
blocker #1 (member-join fallback preserves scope-cap).
ShellSpec: 309/0. helm lint PASS. bash -n / dash -n PASS.
…ntract fix
alpha.74 v1 switchover idle-state N=1 RED on n1u revealed inherited
contract drift between alpha.61 v3 secondary fence (REVOKE ALL + GRANT
non-bypass minimum list, NO BINLOG ADMIN) and the verifier
verify_post_dcs_local_root_write_fenced() preamble. The preamble ran
SET SESSION sql_log_bin=0 + DDL as user-facing root; after demote root
lacks BINLOG ADMIN so the verifier failed with 1227 on the preamble
before reaching the actual fence test.
Fix (Jack XP A2 -> B final):
- Strip preamble in verifier; only user-facing root INSERT runs
- Move probe table create to bootstrap-time ensure_internal_local_admin
in cmpd-semisync.yaml (INTERNAL_LOCAL, binlog-replay-safe)
- Acceptance contract narrows: rc=0 FAIL; 1146/1227/1044 FAIL with
distinct regression-guard sentinels; 1290/read-only PASS only
alpha.61 fence safety model preserved: user-facing root still NOT
granted BINLOG ADMIN. Fix is in the verifier contract, not in root
privileges.
ShellSpec hard gates (4 new + 2 contract):
- verifier body must not contain SET SESSION sql_log_bin=0
- verifier body must not contain CREATE DATABASE / CREATE TABLE
- 1146/1227/1044 FAIL with distinct regression-guard sentinels
- 1290 PASS
- cmpd-semisync.yaml ensure_internal_local_admin contains probe table
create
- Chart.yaml version = 1.1.1-alpha.75; no stale 1.1.1-alpha.74
Static gates: helm lint / bash -n / dash -n / ShellSpec 318/0 PASS.
Boundary: alpha.74 fresh-bootstrap N=3 GREEN unchanged; alpha.74
switchover idle-state N=1 RED preserved on n1u as canonical anchor;
alpha.75 N=0 fresh start for switchover idle-state axis.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
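The first two hard gates are source-level checks, and one way to express them is a text scan over the verifier's function body. This sketch assumes the body has already been extracted into a string; `verifier_body_is_preamble_free` is a hypothetical name and not one of the ShellSpec examples in this commit:

```shell
#!/bin/sh
# Hypothetical source-level gate: succeeds only when the given function
# body contains neither a sql_log_bin=0 preamble nor CREATE DATABASE/TABLE.
verifier_body_is_preamble_free() {
  body="$1"
  forbidden='SET[[:space:]]+SESSION[[:space:]]+sql_log_bin[[:space:]]*=[[:space:]]*0'
  forbidden="$forbidden|CREATE[[:space:]]+(DATABASE|TABLE)"
  if printf '%s\n' "$body" | grep -Eiq -- "$forbidden"; then
    return 1
  fi
  return 0
}
```

A gate like this fails as soon as someone reintroduces the preamble, before any live switchover has to hit error 1227 again.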
…= PD addition only; bump-to-escape-CmpD-immutability)
Bundles alpha.76 through alpha.85 chart work that was carried in working
tree across the recent autopilot cycles. Chart.yaml comment block
documents each alpha bump rationale individually (alpha.76 marker race
defense through alpha.83 reconfigure account switch through alpha.84 v2
semisync ParametersDefinition completion, then alpha.85 pure version
bump to escape KB CmpD immutability after v1 / v2 dry-run cycle on
alpha.84).
alpha.84 v1 first draft attempted to address two compounding causes of
the alpha.83 clean-r1 false-success: (A) semisync had no
ParametersDefinition so KB Configure controller defaulted to rolling
restart, and (B) reconfigureAction only ran SET GLOBAL with no
persistence so any mariadbd restart erased the runtime state. v1 added
both a semisync PD AND a defense-in-depth persistence layer.
v1 N=1 fresh dry-run found that the KB Configure controller parses
ConfigMap-stored my.cnf with a strict INI parser
(fileFormatConfig.format: ini); MariaDB !includedir directive is not
valid INI key=value syntax and made the parser throw a key-value
delimiter not found error before reconfigureAction could run.
OpsRequest entered Failed at pre-action parse step. v1 evidence ns
mariadb-t6-alpha84-n1-0547 frozen as parse-failure evidence
(attachment tar sha 9a7a74fecb42ed9547a6eab381856f98bc113a3f1322f8672c5542a25e233469).
alpha.84 v2 amend reverted the persistence layer entirely and kept
only the semisync PD addition. v2 dry-run failed at the deploy gate
because the v1 install had already created an alpha.84 CmpD in the
test cluster, and KB CmpD immutability rule rejected the v2 install
with phase=Unavailable / immutable fields can't be updated (evidence
tar sha bf50831ad968f9354c62e5909dae126d4df39c4f6ce3c6e1b9a4f4546293b7af).
alpha.85 (this version) is a pure version bump on top of alpha.84 v2
to escape that immutability lock. Chart source is byte-identical to
alpha.84 v2 except the version literal; functional scope remains the
minimal semisync PD addition (cause A from alpha.83 clean-r1).
Persistence redesign (cause B) is deferred to alpha.86 with a probable
approach of mariadbd --defaults-extra-file pointing at a single PVC
backed override file outside KB ParametersDefinition scope.
Fix in alpha.85:
- templates/paramsdef.yaml: new mariadb-semisync-pd declaration with
componentDef regex `^mariadb-semisync-` and templateName
mariadb-semisync-config (matches cmpd-semisync.yaml configs[0].name,
not the helm template object name; Galera comment block already
documents the same trap). Uses the same staticParameters /
dynamicParameters source as the other topology PDs
(config/mariadb-config-effect-scope.yaml). With this PD in place,
the KB Configure controller can classify slow_query_log /
long_query_time / etc. as dynamic and skip rolling restart.
Test deltas (kept from v1/v2 because they are correct independent of
the persistence rollback and the alpha.85 bump):
- scripts-ut-spec/replication_switchover_spec.sh: chart version
literal assertions bumped to alpha.85; the @'%' grant allowlist
adds SLAVE MONITOR (alpha.81 contract).
- scripts-ut-spec/replication_user_convergence_spec.sh: version
literal bumped to alpha.85; prior-version negative check updated
to alpha.84.
- scripts-ut-spec/semisync_rejoin_fence_template_spec.sh:
CMPD_SECONDARY_FENCE_GRANT_BODY literal adds SLAVE MONITOR
(no version-keyed change).
Verification: helm lint PASS; helm template PASS (no !includedir, no
OVERRIDES_DIR, no runtime-overrides references in rendered output;
semisync PD rendered with correct componentDef regex and templateName;
chart label rendered as mariadb-1.1.1-alpha.85); shellspec
addons/mariadb/scripts-ut-spec/ 333 examples 0 failures 7 pendings
(7 pendings are pre-existing obsolete tests, not new debt).
Runtime closeout is NOT included in this commit. alpha.85 still
needs fresh N>=3 to clear the 3-item runtime gate: (1) semisync PD
dynamic hit, (2) no rolling restart / switchover triggered by
reconfigure, (3) both pods reflect new values within bounded wait.
The gate that checks values are preserved across a forced mariadbd
process restart is deferred to alpha.86 along with the persistence
redesign.
Peer reviewed by Jack across four rounds (v1: 3 blockers closed
WARN-only persist to fail-closed, shared helper to semisync-only
persisted variant, ShellSpec version + grant assertions synced;
v1 release-note: WARN-only language to fail-closed; v2: persistence
layer reverted after parser blocker discovered in dry-run; alpha.85
bump after CmpD immutability lock blocked v2 deploy).
…riadbd --defaults-extra-file (semisync only)
Builds on alpha.85 (commit 61cc382 PD addition only). alpha.85 dry-run
by Jack confirmed PD addition alone is necessary but insufficient: KB
Configure controller dynamic-classifies the params correctly, but the
InstanceSet update reconciler
(pkg/controller/instanceset/reconciler_update.go) unconditionally
triggers lifecycleActions.switchover before any in-place Pod template
update, including config-hash-only annotation syncs from dynamic
reconfigure. That bug is owned by Lily / Edward / Rocco in PR #10252.
But mariadbd process restart can also happen from paths unrelated to
the controller bug (OOMKill, node drain, manual restart, syncer
failover after health probe loss). In any of those paths, SET GLOBAL
runtime state from reconfigureAction is wiped and the runtime invariant
reverts to chart defaults. alpha.86 adds a defense-in-depth persistence
layer that is independent of #10252 and works for any restart path.
Design (avoids the alpha.84 v1 INI parser FAIL by keeping !includedir
out of KB-managed config):
- init-syncer (cmpd-semisync.yaml):
  mkdir -p /var/lib/mysql/runtime-overrides.d (per-param .cnf files)
  chgrp 1000 + chmod 0770 (g+rwx for kbagent gid 1000 write access)
  create loader file /var/lib/mysql/runtime-overrides.cnf with one line:
  !includedir /var/lib/mysql/runtime-overrides.d/
  chgrp 1000 + chmod 0660 on loader file (mariadbd-readable,
  kbagent-writable)
- start_mariadbd_process (cmpd-semisync.yaml line 1418):
  mariadbd --defaults-extra-file=/var/lib/mysql/runtime-overrides.cnf ...
  (rest of args). --defaults-extra-file is the FIRST mariadbd option per
  MariaDB requirement; mariadbd silently accepts a missing directory so
  fresh bootstrap works before any reconfigure.
- _helpers.tpl: NEW mariadb.config.reconfigureAction.persisted helper
(semisync only). After each successful SET GLOBAL, writes per-param
/var/lib/mysql/runtime-overrides.d/<name>.cnf via temp file + atomic
rename. Fail-closed on mkdir / tmp write / mv / parse smoke failure
(exit 1 after tmp cleanup or bad-file removal).
- cmpd-semisync.yaml configs[].reconfigure: include the .persisted
variant instead of the base helper.
Jack 5-guard enforcement (peer review msg a13b8850 + 06:49 follow-up):
1. kbagent write permission: chgrp 1000 + chmod 0770 (g+rwx) on
runtime-overrides.d; chmod 0660 on loader file.
2. --defaults-extra-file is the FIRST mariadbd option: positional
assertion in ShellSpec (not just grep), checking source-order adjacency
to docker-entrypoint.sh mariadbd line.
3. param name/value injection defense: name regex ^[A-Za-z0-9_.-]+$;
value rejects newline (\n \r), NUL and other control chars
(\x00-\x1f \x7f), bracketed section markers like [mysqld].
4. fail-closed sentinels: mkdir / tmp write / mv / parse smoke all exit
1 with cleanup. WARN-only regression guard in ShellSpec ensures no
degradation back to alpha.84 v1 first-draft semantics.
5. parse smoke after each persist: runs mariadbd
--defaults-extra-file=<loader> --print-defaults >/dev/null to catch
mariadb-syntax-invalid input that would crash the engine on next
restart. Failure removes the corrupted param file and exits 1.
Tests:
- scripts-ut-spec/reconfigure_persisted_alpha86_spec.sh (NEW): 33
source-level contract tests covering 5 guards + semisync-only scope +
KB-managed config has NO !includedir (alpha.84 v1 regression guard) +
effect-scope still includes T6 target params.
- replication_switchover_spec.sh / replication_user_convergence_spec.sh:
chart version literal assertions bumped from alpha.85 to alpha.86;
prior-version negative checks updated accordingly.
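Guards 3 and 4 plus the temp-file-and-rename write can be sketched as below. This is not the chart's `.persisted` helper: `persist_param` is a hypothetical name, the `[mysqld]` header written into each file is an assumption, and the checks follow the rules listed above rather than the actual source:

```shell
#!/bin/sh
# Name guard: non-empty and limited to [A-Za-z0-9_.-].
is_safe_param_name() {
  case "$1" in
    ''|*[!A-Za-z0-9_.-]*) return 1 ;;
  esac
  return 0
}

# Value guard: no control characters and no bracketed section marker.
is_safe_param_value() {
  case "$1" in
    *[[:cntrl:]]*) return 1 ;;   # newline, CR, tab, other control chars
    *'['*']'*) return 1 ;;       # e.g. [mysqld]
  esac
  return 0
}

# Hypothetical persist step: $1 = override dir, $2 = name, $3 = value.
# Writes a temp file that does not end in .cnf, then renames it into place
# so a reader never sees a half-written .cnf file. Fails closed throughout.
persist_param() {
  dir="$1"; name="$2"; value="$3"
  is_safe_param_name "$name" || return 1
  is_safe_param_value "$value" || return 1
  mkdir -p "$dir" || return 1
  tmp="$dir/.$name.tmp.$$"
  if ! printf '[mysqld]\n%s=%s\n' "$name" "$value" > "$tmp"; then
    rm -f "$tmp"
    return 1
  fi
  if ! mv -f "$tmp" "$dir/$name.cnf"; then
    rm -f "$tmp"
    return 1
  fi
}

demo_dir="$(mktemp -d)"
persist_param "$demo_dir" long_query_time 3 && cat "$demo_dir/long_query_time.cnf"
persist_param "$demo_dir" 'bad name' 1 || echo "rejected unsafe name"
rm -rf "$demo_dir"
```

Validating before any filesystem call means a rejected name or value leaves no trace in the override directory.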
Verification: helm lint PASS; helm template PASS (--defaults-extra-file
and OVERRIDES_DIR appear ONLY in cmpd-semisync.yaml rendered output);
shellspec addons/mariadb/scripts-ut-spec/ 386 examples 0 failures 7
pendings (333 pre-existing + 53 new focused (5 guards static + B1
behavioral subshell tests + B2 main-container chown-R-survival
regression); 7 pendings are pre-existing obsolete tests, not new debt).
Runtime closeout is NOT included in this commit. alpha.86 still needs
fresh N>=3 to clear the 4-group runtime gate:
Group 1 - render/static: PD + loader + dir + first-arg + fail-closed
contracts.
Group 2 - runtime reconfigure: OpsRequest Succeed + both pods SHOW
GLOBAL ON/3 within bounded wait.
Group 3 - process restart: kill -TERM mariadbd; verify new PID and SHOW
GLOBAL still ON/3 (proves --defaults-extra-file loads the persisted
override on startup; independent of #10252).
Group 4 - controller combo: after PR #10252 patch image lands, verify
config-hash-only update does NOT trigger switchover; Rocco gate 1
(controller image identity) + Rocco gate 2 (pod YAML diff is
config-hash-only).
Peer reviewed by Jack in two rounds (design ack msg a13b8850 with 5
hard guards + 06:49 follow-up adding g+rwx on dir and parse smoke
first-option constraint). All guards encoded in code and ShellSpec
regression tests.
alpha.87 v1 amend (Helen 2026-05-19 07:38): chart version literal
bumped from 1.1.1-alpha.86 to 1.1.1-alpha.87 to escape KB CmpD
immutability after the alpha.86 amend added parse smoke stderr capture
in the persisted helper. The test vcluster already had an alpha.86 CmpD
installed from Jack's first alpha.86 dry-run; KB rejected the upgrade
with phase=Unavailable / immutable fields can't be updated. Same KB
CmpD immutability rule the Chart.yaml comment block has tracked since
alpha.65. ShellSpec version literal assertions also bumped to alpha.87.
alpha.88 v1 amend (Helen 2026-05-19 07:48): DROP parse smoke entirely
after alpha.86 + alpha.87 dry-runs (Jack msg e6afaa1a). Two root causes
made parse smoke unworkable in kbagent action runtime:
1. mariadbd is NOT on PATH in the kbagent action context (the smoke
command-substitution returned rc=127).
2. set -e + var=$(failing_cmd) caused shell to exit immediately on the
failed substitution, bypassing the stderr-print and bad-file cleanup
that should have surfaced the diagnosis and removed the orphan .cnf
file.
Remaining defenses (Guards 1-4) cover the threat model:
- injection defense (is_safe_param_name + is_safe_param_value) rejects
unsafe names / values / control chars / section markers.
- atomic temp + mv guarantees no half-written file.
- mariadbd own error log on next restart is the authoritative
validation surface; it sees the file in its real runtime context, not
the kbagent synthesized PATH.
Test deltas in alpha.88 v1:
- chart version literal assertions bumped from alpha.87 to alpha.88.
- reconfigure_persisted_alpha86_spec.sh: parse smoke presence tests
REPLACED by parse smoke absence regression guards (must NOT contain
mariadbd --defaults-extra-file= / --print-defaults / smoke_out=$( /
Parse smoke failed). Guard 5 section kept inline as archeology comment.
Same KB CmpD immutability rule applies (rendered cmpd-semisync.yaml
content changed because the persisted helper body shrunk).
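Root cause 2 is a general shell behaviour and can be reproduced without MariaDB. The sketch below is not what the chart did (alpha.88 removed the smoke step instead); it only shows the pitfall and one guarded form. `failing_check` and `run_guarded` are hypothetical names:

```shell
#!/bin/sh
set -e

# Stand-in for a command that is missing or fails, like the rc=127 case.
failing_check() {
  echo "diagnostic on stderr" >&2
  return 127
}

# Pitfall: under set -e the next line would terminate the script at once,
# so any stderr print or cleanup written after it would never run:
#   out=$(failing_check 2>&1)

# Guarded form: an assignment used as an `if` condition does not trigger
# set -e, so the exit code and captured output both stay available.
run_guarded() {
  if out=$("$@" 2>&1); then
    rc=0
  else
    rc=$?
  fi
  printf 'rc=%s out=%s\n' "$rc" "$out"
}

run_guarded failing_check
```

The guarded form keeps the diagnosis and cleanup reachable, but it would not have fixed root cause 1, since the binary was still absent from PATH.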
…on topology + merged CmpD
First scaffolding commit for the replication+semisync topology merge
under weston Option B (2026-05-19 14:12 msg d8ecfae7). Folds Jack
design review v3.1 (15:36) 4 blockers + 3 non-blockers, and the
scaffolding review (15:42) 2 blockers + 1 caveat.
User-facing API change (breaking):
- clusterdefinition.yaml lists only the `replication` topology for
primary/secondary. The old `async` and `semisync` topology names are
removed. Users with existing Cluster CRs must actively migrate to
spec.topology=replication plus a ComponentSpec parameter
replicationMode=async|semisync. The migration is documented as the main
release-notes route; any KB auto-rebind on a topology compDef rename is
exploratory and verified in a separate upgrade gate N=1 (not promised
in release notes).
New artifacts:
- templates/_helpers.tpl: helpers for the merged CmpD name and a narrow
PD regex pattern `^mariadb-replication-merged-`. Also narrows the old
`mariadb-replication` regex from `^mariadb-replication-` to
`^mariadb-replication-[0-9]` so the old PD does not silently
double-match the new merged CmpD name (scaffolding review Blocker 1).
- templates/cmpd-replication-merged.yaml: a new ComponentDefinition.
Starts as a near-verbatim copy of cmpd-semisync.yaml. The header
comment now correctly describes the Option B layout: old CmpDs are kept
as compat resources for already-bound Clusters, NOT as
ClusterDefinition topology fallback (scaffolding review Blocker 2).
- templates/paramsdef.yaml: adds a single ParametersDefinition
`mariadb-replication-merged-pd` bound to the merged CmpD via the narrow
regex.
- templates/pcr.yaml: adds the corresponding ParamConfigRenderer. Also
fixes a pre-existing bug the scaffolding review surfaced: the semisync
PCR's parametersDefs list referenced `mariadb-replication-pd` instead
of `mariadb-semisync-pd`. Under the previous wide regex this was
harmless; under the narrowed regex it would silently leave the
deprecated semisync CmpD's PCR pointing at an unrelated PD (scaffolding
review caveat).
- scripts-ut-spec/replication_merged_pd_regex_disambiguation_spec.sh:
25-test ShellSpec contract that locks the regex disambiguation
property: each of the five rendered CmpD names matches its own regex
and none of the other four. Prevents future edits from silently
reintroducing the overlap.
Deprecated artifacts kept for the alpha.89 cycle (not listed in
clusterdefinition.yaml.topologies, scheduled for removal once the
upgrade gate closes):
- templates/cmpd-replication.yaml (old async CmpD).
- templates/cmpd-semisync.yaml (old semisync CmpD).
What this commit does NOT do (subsequent commits in the same PR):
- Parameterize the merged CmpD on `replicationMode`. Today the merged
CmpD behaves identically to mariadb-semisync.
- Rename the merged CmpD's configspec from `mariadb-semisync-config` to
`mariadb-replication-config`.
- Update the switchover script to detect mode at runtime via
@@rpl_semi_sync_master_enabled.
- Add scripts/validate-replication-mode.sh (read-only two-source
consistency: DB SHOW VARIABLES vs ComponentParameter).
- Update existing ShellSpec tests for the new topology and the new
`replicationMode=async` track.
- Run the upgrade gate N=1 against existing async / semisync Cluster
CRs.
These are sequenced in subsequent commits in this same PR.
Static verification:
- helm template renders cleanly: 5 ComponentDefinition objects
(standalone, galera, two deprecated old async/semisync, plus the new
merged one), 5 ParametersDefinition objects with mutually exclusive
regexes, ClusterDefinition.topologies = [standalone, replication,
galera].
- shellspec replication_merged_pd_regex_disambiguation_spec.sh passes
25 of 25 examples.
No engine version change. appVersion stays at 11.4.10.
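The double-match that Blocker 1 describes is easy to reproduce with grep. The three regexes are the ones named in this commit; the two rendered CmpD names are hypothetical examples of a version-suffixed name, and `name_matches` is an illustrative helper:

```shell
#!/bin/sh
# Illustrative helper: $1 = extended regex, $2 = candidate CmpD name.
name_matches() {
  printf '%s\n' "$2" | grep -Eq -- "$1"
}

OLD_WIDE='^mariadb-replication-'
OLD_NARROW='^mariadb-replication-[0-9]'
MERGED='^mariadb-replication-merged-'

# Hypothetical rendered names, assuming a version suffix after the prefix.
old_name='mariadb-replication-1.1.1-alpha.89'
merged_name='mariadb-replication-merged-1.1.1-alpha.89'

for re in "$OLD_WIDE" "$OLD_NARROW" "$MERGED"; do
  for name in "$old_name" "$merged_name"; do
    if name_matches "$re" "$name"; then
      echo "match:    $re  $name"
    else
      echo "no match: $re  $name"
    fi
  done
done
```

The wide regex matches both names, while the narrowed and merged regexes each match exactly one, which is the property the disambiguation spec locks in.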
…o mariadb-replication-config
Renames the merged CmpD's configspec name from the legacy
`mariadb-semisync-config` (inherited from the verbatim copy in
commit 1's scaffolding) to the canonical `mariadb-replication-config`,
so the merged CmpD does not carry the legacy semisync identifier in
its addressable surface. The configspec name now matches the merged
CmpD's user-facing identity and the eventual single-template target.
KB Configure resolves configspec name by triple binding:
cmpd-replication-merged.yaml    spec.configs[].name
paramsdef.yaml (merged PD)      spec.templateName
pcr.yaml (merged PCR)           spec.configs[].templateName
A mismatch at any of the three sites would silently break parameter
resolution. This commit updates all three and adds a ShellSpec
(replication_merged_configspec_consistency_spec.sh, 3 examples) to
lock the invariant so future edits cannot silently reintroduce a
drift.
The underlying ConfigMap object template
(mariadb-semisync-config-template) is unchanged in this commit. It
will be migrated to a unified mariadb-replication-config-template
once `replicationMode` parameterization lands and async / semisync
defaults can be expressed in a single template. Renaming the rendered
ConfigMap before parameter-driven defaults would silently change
cluster behavior.
Also folds Jack scaffolding review (15:50) non-blocking clarification:
the _helpers.tpl block comment for
`mariadb.replication.merged.cmpdName` previously said "not reusing
`mariadb-replication-`", which read as if the new name had a
different prefix. It actually shares the prefix; disambiguation comes
from the `-merged-` infix plus the narrowed
`^mariadb-replication-[0-9]` regex on the old CmpD's PD. The comment
now states this directly and points at the disambiguation ShellSpec
from commit 1.
Static verification:
- helm template renders cleanly (5 CmpD / 5 PD / 5 PCR; merged CmpD's
configs[].name and the merged PD/PCR templateNames all resolve to
`mariadb-replication-config`).
- shellspec replication_merged_pd_regex_disambiguation_spec.sh:
25 of 25 examples pass.
- shellspec replication_merged_configspec_consistency_spec.sh:
3 of 3 examples pass.
What this commit does NOT do:
- Parameterize the merged CmpD on `replicationMode`. Today the merged
CmpD still uses the semisync ConfigMap content; semisync variables
still default to ON.
- Rename or unify the underlying ConfigMap object template.
- Add enum validation or fail-closed handling for invalid
`replicationMode` values.
- Add the switchover script's runtime mode-detection branch.
- Add validate-replication-mode.sh (read-only two-source check).
These are scheduled in the next commit, which will close Class 2
(write-site) and Class 4 (sentinel) of Jack design contract.
No engine version change. Chart version unchanged at 1.1.1-alpha.89.
…I section binding for fail-closed (C1 path)
C1-path implementation per Jack design-review enum-research
(2026-05-19 16:41) and weston bounded-poll default (16:45 escalation
+ two unanswered nudges through 18:35 → defaulted to C1 with explicit
override window). Folds Jack scaffolding-review commit 3 v1
fail-closed blocker B1 (2026-05-19 18:48): the v1 CUE file declared
the #MariaDBParameter type but did not bind it to the parsed INI
config structure, so KB's `ValidateConfigWithCue()` did not actually
use the constraints and invalid values such as
`rpl_semi_sync_master_enabled = MAYBE` passed through. v2 adds the
`[SectionName=_]: #MariaDBParameter` binding pattern that the MySQL
and ApeCloud MySQL CUE files already use, so KB binds the type to
every parsed INI section and rejects out-of-enum / out-of-range
values at the controller parameter reconcile path. Closes the
Class 4 sentinel requirement of the topology-merge design contract.
New CUE constraint file: addons/mariadb/config/mariadb-config-constraint.cue
- Declares the #MariaDBParameter top-level type.
- Constrains the four semisync engine variables that are the real
source-of-truth for replication mode:
* rpl_semi_sync_master_enabled : string & "ON" | "OFF" (default OFF)
* rpl_semi_sync_slave_enabled : string & "ON" | "OFF" (default OFF)
* rpl_semi_sync_master_wait_for_slave_count : int 1..65535 (default 1)
* rpl_semi_sync_master_timeout : int 1..2147483647 ms (default 10000)
- v2 adds the `[SectionName=_]: #MariaDBParameter` binding at the
end of the file. KB's CUE validator walks every parsed INI
section and applies #MariaDBParameter to each, so unknown enum
values / out-of-range ints surface as CUE conflicts before the
rendered ConfigMap is applied.
- Does NOT declare a synthetic `replicationMode` key. KB has no
transform from a logical mode key to multiple engine variables
(ParamConfigRenderer has no transform hook). A synthetic key
would either be ignored by the renderer or land in my.cnf
verbatim and break mariadbd. The unified-switch UX is delivered
by addon docs and (later) helper scripts that emit the
four-parameter block from a single user-facing choice.
- The merged PD ShellSpec asserts the absence of a
`replicationMode` key in the CUE so a future edit cannot
silently reintroduce it without forcing the design back to the
surface.
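For readers without a CUE toolchain at hand, the four constraints listed above can be approximated in plain shell. This is only an analogue of what the schema enforces, not KB's `ValidateConfigWithCue()`; the function names `check_param`, `in_range`, and `validate_ini` are hypothetical, and the INI parsing is deliberately naive (no quoting, no hyphen/underscore aliasing).

```shell
#!/bin/sh
# in_range VALUE MIN MAX: succeed if VALUE is a decimal integer in [MIN, MAX].
in_range() {
  case "$1" in ''|*[!0-9]*) return 1 ;; esac
  [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# check_param KEY VALUE: apply the enum/range documented for the four
# semisync variables; keys without a constraint pass through.
check_param() {
  case "$1" in
    rpl_semi_sync_master_enabled|rpl_semi_sync_slave_enabled)
      case "$2" in ON|OFF) return 0 ;; *) return 1 ;; esac ;;
    rpl_semi_sync_master_wait_for_slave_count)
      in_range "$2" 1 65535 ;;
    rpl_semi_sync_master_timeout)
      in_range "$2" 1 2147483647 ;;
    *) return 0 ;;
  esac
}

# validate_ini FILE: check every key=value line in every section,
# mirroring the "bind the type to each parsed section" idea.
validate_ini() {
  ini_rc=0
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in ''|'#'*|';'*|'['*) continue ;; esac
    case "$line" in *=*) ;; *) continue ;; esac
    key=$(printf '%s' "${line%%=*}" | tr -d ' \t')
    val=$(printf '%s' "${line#*=}" | tr -d ' \t')
    if ! check_param "$key" "$val"; then
      echo "invalid: $key=$val"
      ini_rc=1
    fi
  done < "$1"
  return "$ini_rc"
}
```

Calling `validate_ini /path/to/my.cnf` prints one `invalid:` line per offending key and returns non-zero if any were found, which is the fail-closed shape the commit describes.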
Updated ParametersDefinition for the merged CmpD (paramsdef.yaml):
- mariadb-replication-merged-pd now declares
spec.parametersSchema:
topLevelKey: MariaDBParameter
cue: |-
...Files.Get "config/mariadb-config-constraint.cue"...
- v2 paramsdef.yaml comment updated to call out the section
binding explicitly: fail-closed depends on both the
top-level definition AND the [SectionName=_] binding being
present, not on either alone.
Updated dynamicParameters classification (mariadb-config-effect-scope.yaml):
- Adds the four semisync engine variables to dynamicParameters.
MariaDB documents all four as dynamic system variables, so a
reconfigure can be applied at runtime via the alpha.88
reconfigureAction (SET GLOBAL + persisted override file) and
does NOT need a rolling restart. Without this classification
the KB Configure controller falls back to rolling restart,
which on semisync triggers switchover/promote and reopens the
SET-GLOBAL-without-persist race the alpha.84 → alpha.88 chain
already closed.
ShellSpec contract: 14 examples (was 13 in v1):
- scripts-ut-spec/replication_merged_pd_schema_enum_spec.sh
- Locks: the CUE file exists, declares MariaDBParameter, constrains
each of the four variables with the documented enum/range, does
NOT add a synthetic `replicationMode` key, AND binds the type
to every INI section via `[SectionName=_]: #MariaDBParameter`
(v2 new test guarding against B1 regression).
- Locks: the merged PD block declares parametersSchema with the
correct top-level key and references the CUE file via
.Files.Get.
- Locks: dynamicParameters lists all four semisync variables.
Behavioral validation reference: Jack's KB-validator reproduction
(2026-05-19 18:48) demonstrated that with the v1 CUE (no section
binding) `ValidateConfigWithCue()` returned <nil> for an invalid
config (`MAYBE` / `0` / `0`), and that adding the
`[SectionName=_]: #MariaDBParameter` line caused the same invalid
config to surface as a CUE conflict on `mysqld.rpl_semi_sync_*`.
This commit lands the same fix his reproduction validated.
What this commit does NOT do (sequenced for later commits):
- Change the ConfigMap template defaults. The merged CmpD still
inherits the semisync ConfigMap (`mariadb-semisync-config-template`),
so the default behavior for a new cluster remains semisync.
- Update the switchover script's runtime mode-detection branch.
- Add scripts/validate-replication-mode.sh (read-only two-source
check: SHOW VARIABLES vs ComponentParameter).
- Update existing ShellSpec tests for the new `dynamicParameters`
entries (those specs do not assert the full list).
- Run the upgrade gate N=1 against existing async / semisync
Cluster CRs.
Static verification:
- helm template renders cleanly. The merged PD now emits a
parametersSchema block with the inline CUE content including
the [SectionName=_]: #MariaDBParameter binding.
- shellspec replication_merged_pd_regex_disambiguation_spec.sh:
25 of 25 examples pass.
- shellspec replication_merged_configspec_consistency_spec.sh:
3 of 3 examples pass.
- shellspec replication_merged_pd_schema_enum_spec.sh:
14 of 14 examples pass (was 13 in v1; +1 for B1 guard).
- Combined: 42 of 42 examples pass.
No engine version change. Chart version unchanged at 1.1.1-alpha.89.
If weston later reverses the bounded-poll default and chooses C2,
the CUE schema, section binding, and dynamicParameters
classification landed in this commit are still required and continue
to apply; only the addon-internal mapper layer would be added on
top (the mapper would emit the same four parameters under the hood,
and the schema + dynamic classification would gate values arriving
from the mapper).
… read-only two-source consistency check (C1 path)
Implements the read-only mode-consistency helper from the v3.1 design
document §3 (Jack design review Gate 5 v3 — annotation writes are
deferred; this commit covers the two-source check the design promises
to land in the first PR). Used by test runners and (later, optionally)
by a kbagent action at OpsRequest closeout to fail-closed when the
engine's in-memory state and the rendered ConfigMap-mounted my.cnf
disagree on replication mode.
Folds Jack commit-4 review (2026-05-19 20:13) blocker B1 and
prepares for the B2 lifecycle-wiring follow-up.
B1 fix: the v1 helper only inspected the master read's return code,
so a slave-side observability failure silently fell through and was
classified as `invariant_violated` (when master=ON) or `disagree`,
not the correct `engine_missing` / `configmap_missing`. v2 captures
the slave read's return code separately and treats a failed OR empty
read on either master or slave as the same engine/configmap-missing
signal. The engine_var() and configmap_var() helpers now return a
non-zero rc when the underlying read succeeded but returned an empty
string, so the caller does not have to duplicate the empty-check.
B2 prep: add MYSQL_SOCKET, MYSQL_USER, MYSQL_PASSWORD, and
MYSQL_EXTRA_ARGS env passthroughs to the mysql client invocation.
When a value is empty the corresponding flag is omitted, so the
default behavior (TCP connect to 127.0.0.1:3306 with no auth) stays
identical to v1 for local introspection and the existing tests.
The lifecycle wiring commit can supply these env vars from the
chart's existing mysqld socket path and the kbagent-mounted
credentials Secret without further changes to the helper.
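The "omit the flag when the value is empty" behavior can be sketched as below. This is an assumption-laden illustration, not the helper's actual code: `build_client_args` is a hypothetical name, and the exact flags the real script passes (including how it expresses the 127.0.0.1:3306 default) are not shown in this commit message.

```shell
#!/bin/sh
# build_client_args: print the client flags on one line, omitting any
# flag whose env var is empty or unset. The fallback host/port pair is
# an assumption standing in for the helper's TCP default.
build_client_args() {
  if [ -n "${MYSQL_SOCKET:-}" ]; then
    args="--socket=$MYSQL_SOCKET"
  else
    args="--host=127.0.0.1 --port=3306"
  fi
  if [ -n "${MYSQL_USER:-}" ]; then
    args="$args --user=$MYSQL_USER"
  fi
  if [ -n "${MYSQL_PASSWORD:-}" ]; then
    args="$args --password=$MYSQL_PASSWORD"
  fi
  if [ -n "${MYSQL_EXTRA_ARGS:-}" ]; then
    args="$args $MYSQL_EXTRA_ARGS"
  fi
  printf '%s\n' "$args"
}
```

Joining the flags into one string keeps the sketch short but would mis-split values containing spaces; a production helper would need to build the argument list more carefully.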
Non-blocking cleanup: removed the `BEGIN { IGNORECASE = 1 }` block
from configmap_var's awk — the comparison already lowercases both
sides via tolower(), and IGNORECASE is a gawk extension that not
all busybox awks honor consistently.
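A portable case-insensitive lookup that relies only on `tolower()` might look like the following. It is a sketch of the idea rather than the helper's real `configmap_var`: the argument order and the "last occurrence wins" choice are assumptions.

```shell
#!/bin/sh
# configmap_var FILE KEY: print the value of KEY from an INI-style file,
# comparing keys case-insensitively via tolower() (no gawk IGNORECASE).
# Returns non-zero when the file is unreadable or the value is empty,
# so the caller does not need a separate empty-check.
configmap_var() {
  [ -r "$1" ] || return 1
  cm_val=$(awk -v want="$2" '
    index($0, "=") == 0 { next }          # skip lines without an = sign
    {
      key = substr($0, 1, index($0, "=") - 1)
      gsub(/[ \t]/, "", key)
      if (tolower(key) == tolower(want)) {
        val = substr($0, index($0, "=") + 1)
        gsub(/^[ \t]+|[ \t]+$/, "", val)
        found = val                       # last occurrence wins
      }
    }
    END { print found }
  ' "$1")
  [ -n "$cm_val" ] || return 1
  printf '%s\n' "$cm_val"
}
```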
Helper script: addons/mariadb/scripts/validate-replication-mode.sh
- /bin/sh shebang (kbagent ships busybox sh).
- Reads `rpl_semi_sync_master_enabled` and `rpl_semi_sync_slave_enabled`
from two sources (engine SHOW VARIABLES + ConfigMap-mounted my.cnf).
- Normalizes MariaDB boolean representations (`ON`/`OFF`/`1`/`0`/
`true`/`false`) to canonical `ON`/`OFF` before comparing.
- Read-only by design (Jack design review Gate 5 v3). Annotation
writes deferred to a future controller-side writer.
- Exit codes:
0 OK / sources agree
1 disagree (closeout FAIL)
2 engine_missing (transient; caller should bounded-retry) —
triggered by master OR slave read failure or empty result
3 configmap_missing (transient; caller should bounded-retry) —
triggered by master OR slave key missing or file unreadable
4 invariant_violated (master=ON but slave=OFF; MariaDB semisync
silently degrades to async on such an asymmetric setting)
- Output format: per-key `key=value` tokens on stdout for grep
by tests / kbagent attestation; `mode_consistency=<state>` line
gives the verdict.
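The normalization step and the exit-code table above can be condensed into a small classifier. This is a sketch under stated assumptions, not the shipped script: `normalize_bool` and `classify_mode` are hypothetical names, unrecognized literals are treated the same as a missing value, and the invariant check is evaluated on the engine side before the disagree check (the commit message does not spell out that ordering).

```shell
#!/bin/sh
# normalize_bool VALUE: map ON/OFF/1/0/true/false (any case) to ON or OFF.
normalize_bool() {
  case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
    on|1|true)   echo ON ;;
    off|0|false) echo OFF ;;
    *)           return 1 ;;
  esac
}

# classify_mode ENGINE_MASTER ENGINE_SLAVE CM_MASTER CM_SLAVE
# Prints the verdict line and returns the documented exit code.
classify_mode() {
  eng_m=$(normalize_bool "$1") && eng_s=$(normalize_bool "$2") || {
    echo "mode_consistency=engine_missing"; return 2; }
  cm_m=$(normalize_bool "$3") && cm_s=$(normalize_bool "$4") || {
    echo "mode_consistency=configmap_missing"; return 3; }
  if [ "$eng_m" = ON ] && [ "$eng_s" = OFF ]; then
    echo "mode_consistency=invariant_violated"; return 4
  fi
  if [ "$eng_m" != "$cm_m" ] || [ "$eng_s" != "$cm_s" ]; then
    echo "mode_consistency=disagree"; return 1
  fi
  echo "mode_consistency=ok"
}
```

For example, `classify_mode ON "" ON ON` reports `engine_missing` with code 2 rather than falling through to `invariant_violated`, which is the B1 behavior this commit fixes.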
ShellSpec contract: addons/mariadb/scripts-ut-spec/validate_replication_mode_spec.sh
- 9 behavioral examples (was 7 in v1; +2 for B1 regression
coverage):
* both sources ON -> ok / exit 0
* both sources OFF -> ok / exit 0
* normalization `1` <-> `ON` -> ok / exit 0
* engine ON but ConfigMap OFF -> disagree / exit 1
* mysql client fails -> engine_missing / exit 2
* NEW: slave engine read returns empty row -> engine_missing /
exit 2 (was misclassified as invariant_violated in v1)
* NEW: ConfigMap missing the slave key -> configmap_missing /
exit 3 (was misclassified as disagree in v1)
* my.cnf unreadable -> configmap_missing / exit 3
* master=ON slave=OFF -> invariant_violated / exit 4
What this commit does NOT do (sequenced for later commits):
- Wire `validate-replication-mode.sh` into a cmpd lifecycle
action (the B2 env-var surface is in place so that commit only
has to add the action declaration + reference the env vars from
Secret / socket path).
- Flip the ConfigMap template default to async.
- Add the switchover script's runtime mode-detection branch.
- Update existing per-topology ShellSpec specs to the alpha.89
chart-version literal.
- Run the upgrade gate N=1.
Static verification:
- shellspec validate_replication_mode_spec.sh: 9 of 9 examples pass.
- Combined across 4 alpha.89 specs: 51 of 51 examples pass.
- helm template not affected (no template changes in this commit).
No chart version change; no engine version change.
Per the new autonomous-addon-development-loop guidance (Cindy
2026-05-19 20:03 kubeblocks-addon-docs main commit 2395b53), human
review of this branch is an upstreaming gate, not a testing gate.
The merged-branch chart is patchable today (chart-build + sideload)
for any caller that wants to validate the C1 path end-to-end before
the upstream PR lands.
….sh in replication script ConfigMap (C1 path)
Wires the read-only mode-consistency helper from commit 4 v2
(scripts/validate-replication-mode.sh) into the replication-tier
script ConfigMap (mariadb-replication-scripts-...) so it lands at
/scripts/validate-replication-mode.sh inside the mariadb container
alongside roleprobe / member-join / switchover.
Concrete effect:
- Test runners and ad-hoc operators can call the helper directly via
kubectl exec <pod> -c mariadb -- /scripts/validate-replication-mode.sh
No further chart change required to use it at closeout.
- A later commit can add a kbagent lifecycle action that references
the same /scripts path without re-touching this ConfigMap. The
B2 env passthroughs added in commit 4 v2
(MYSQL_SOCKET / MYSQL_USER / MYSQL_PASSWORD / MYSQL_EXTRA_ARGS)
cover the connection context that a kbagent action would supply
from the chart's existing Secret / socket path.
New ShellSpec contract: addons/mariadb/scripts-ut-spec/replication_merged_validate_script_mount_spec.sh
- 2 examples lock the ConfigMap surface:
1. data key `validate-replication-mode.sh:` is present
2. its body is pulled via .Files.Get "scripts/validate-replication-mode.sh"
- Prevents a future ConfigMap edit from silently dropping the mount
before the kbagent lifecycle action lands.
What this commit does NOT do:
- Add a kbagent lifecycle action declaration for the helper. The
helper is reachable via kubectl exec today; wiring it into a
formal `customActions` or `reconfigure` post-hook is a separate
design choice that depends on whether mode-consistency should
block a reconfigure OpsRequest or only surface in test closeout.
- Flip the ConfigMap template default to async.
- Add the switchover script's runtime mode-detection branch.
- Update existing per-topology ShellSpec specs to alpha.89.
- Run the upgrade gate N=1.
Static verification:
- helm template renders cleanly. The replication script ConfigMap
now embeds validate-replication-mode.sh under data, with the body
visible in the rendered output.
- shellspec replication_merged_validate_script_mount_spec.sh: 2 of 2
examples pass.
- Combined across 5 alpha.89 specs: 53 of 53 examples pass
(regex 25 + configspec 3 + PD schema 14 + validate helper 9 +
script mount 2).
No chart version change; no engine version change.
Per the autonomous-addon-development-loop guidance (Cindy
2026-05-19 20:03 commit 2395b53), human review on this branch is
an upstreaming gate, not a testing gate. After this commit a
sideloaded chart on a live cluster has the helper available for
two-source consistency checks during test closeout.
…ersion literals to alpha.89
Updates the version-tracking assertions in the existing per-feature
ShellSpec suites so they validate against the current alpha.89 chart,
not the alpha.88 the suites were last frozen at. The previous run
showed 4 failures all on `version: 1.1.1-alpha.88` literal compares;
this commit bumps them to `version: 1.1.1-alpha.89` and rewrites each
test description to call out the alpha.89 scope.
Files touched:
- scripts-ut-spec/replication_switchover_spec.sh (4 sites bumped to
alpha.89: one in the Chart.yaml literal-version gate, three in the
alpha.65/.66/.67 immutability-rule "current bumped to" contract
checks).
- scripts-ut-spec/replication_user_convergence_spec.sh (2 sites
bumped: Gate 1's literal-version check, and the "no stale prior
literal" check; the latter is now alpha.88 since alpha.88 is what we
just bumped away from).
Pending-marked obsolete tests (alpha.79 / alpha.80 cleanup debt)
remain unchanged: they document tech debt against alpha.80 cleanup
and do not assert against the current chart.
Combined test status (full mariadb scripts-ut-spec/ directory):
437 examples / 0 failures / 7 pendings.
What this commit does NOT do:
- Switchover script runtime mode detection. Deferred: the existing
1974-line switchover script needs careful surgery best done as a
focused commit with its own review pass.
- ConfigMap template default flip to async (behavior-changing).
- Wrap validate-replication-mode.sh in a kbagent lifecycle action.
- Upgrade gate N=1.
- Runtime PASS re-validation.
No chart version change; no engine version change.
…n replication-switchover.sh (C1 path, no caller change)
Stages a small read-only helper inside the existing 1974-line
replication-switchover.sh so a focused follow-up commit can wire it
into the specific switchover stages that today unconditionally wait
for semisync ACK. The merged CmpD can now run in either async or
semisync mode under the C1 path, and the existing switchover script
needs a way to distinguish the two at runtime before its ACK-wait
logic can be made conditional.
Per the agreed scope (Jack 2026-05-19 21:53 commit 6 PASS handoff),
this commit only adds the helper and its ShellSpec; no caller change.
That keeps the existing switchover behavior identical on alpha.89,
preserves the in-flight Jack PASS verdicts on commits 1-6, and lets
the caller wiring land in its own focused commit with its own review
pass.
New helper: is_semisync_mode()
- Located next to the existing query_local_value() helper, sharing
the same MARIADB_CLIENT_BIN / MARIADB_ROOT_USER / MARIADB_ROOT_PASSWORD
env surface and the same /bin/sh + busybox compatibility envelope.
- Queries the engine's in-memory @@rpl_semi_sync_master_enabled via
the same mariadb client invocation pattern the surrounding helpers
use.
- Return contract:
0 — semisync ON (value is 1 / ON / on)
1 — semisync OFF (value is 0 / OFF / off)
2 — undetermined: client failure, empty row, or any value outside
the six recognized literals. Future callers MUST treat 2 as
conservative fail-closed (assume semisync and keep the safety
wait) so a transient client failure cannot silently flip
behavior to async during switchover.
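The three-way return contract, together with the fail-closed caller pattern it asks for, can be sketched as follows. This is an illustration, not the code in replication-switchover.sh: the client flags are an assumption, the real helper also passes MARIADB_ROOT_USER / MARIADB_ROOT_PASSWORD (omitted here), and `should_wait_for_ack` is a hypothetical caller.

```shell
#!/bin/sh
# is_semisync_mode: 0 = semisync ON, 1 = semisync OFF, 2 = undetermined.
is_semisync_mode() {
  value=$("${MARIADB_CLIENT_BIN:-mariadb}" -N -s \
    -e 'SELECT @@rpl_semi_sync_master_enabled' 2>/dev/null) || return 2
  case "$value" in
    1|ON|on)   return 0 ;;
    0|OFF|off) return 1 ;;
    *)         return 2 ;;   # empty row or unrecognized literal
  esac
}

# should_wait_for_ack (hypothetical caller): skip the semisync ACK wait
# only when the engine positively reports OFF. An undetermined result
# (2) is treated like semisync so a transient client failure cannot
# flip switchover behavior to async.
should_wait_for_ack() {
  mode_rc=0
  is_semisync_mode || mode_rc=$?
  [ "$mode_rc" -ne 1 ]
}
```

Collapsing 0 and 2 into "keep the wait" is the conservative choice the contract requires: the cost of a wrong guess is an unnecessary wait, not a skipped safety gate.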
New ShellSpec contract:
scripts-ut-spec/replication_switchover_is_semisync_mode_spec.sh
- 9 behavioral examples covering:
* ON / 1 / on -> rc=0
* OFF / 0 / off -> rc=1
* MAYBE (unrecognized literal) -> rc=2
* empty row -> rc=2
* client exit non-zero -> rc=2
- Each test stubs MARIADB_CLIENT_BIN with a controllable shell script
in a tmp PATH and sources replication-switchover.sh via the
existing __SOURCED__=1 ShellSpec convention, so the helper is
exercised without running main().
What this commit does NOT do:
- Wire is_semisync_mode() into any switchover stage. The caller
patch lands in a follow-up commit with its own contract review.
- Change ConfigMap template defaults to async.
- Add a kbagent lifecycle action wrapping validate-replication-mode.sh.
- Run the upgrade gate N=1 against existing async / semisync
Cluster CRs.
- Modify any of the in-flight Jack-PASSed commits 1-6.
Static verification:
- sh -n / bash -n PASS on replication-switchover.sh.
- shellspec replication_switchover_is_semisync_mode_spec.sh:
9 of 9 examples pass.
- Full mariadb scripts-ut-spec/ directory:
446 examples / 0 failures / 7 pendings (was 437 / 0 / 7 before
this commit; +9 from the new helper spec, no regressions).
No chart version change; no engine version change.
Per the autonomous-addon-development-loop guidance (Cindy 2026-05-19
20:03 commit 2395b53), human review of this branch is an upstreaming
gate, not a testing gate. The branch is patchable today for any
caller that wants to validate the C1 path end-to-end before the
upstream PR lands.
…nc via mariadb-replication-config-template (C1 path closure)
Closes the v3.1 design §1 / §8 commitment that the merged
`mariadb-replication-merged` CmpD defaults to async replication.
The four `rpl_semi_sync_*` engine variables are absent from the
default my.cnf and therefore take their engine defaults (OFF / 0).
Users who want semisync set the four variables via
`spec.componentSpecs[].parameters` at create time; the PD CUE
schema added in commit 3 v2 validates the values, and the
alpha.88 persistence layer (still wired in this CmpD via the init
container's `--defaults-extra-file=/var/lib/mysql/runtime-overrides.cnf`
loader path and the `runtime-overrides.d/` directory) carries the
runtime values through process restarts.
Change:
- templates/cmpd-replication-merged.yaml: the configspec
`template:` field flips from `mariadb-semisync-config-template`
(loads `config/mariadb-semisync.tpl`, which hardcodes
`rpl_semi_sync_master_enabled = ON` and the auxiliary semisync
variables) to `mariadb-replication-config-template` (loads
`config/mariadb-replication.tpl`, which omits all four semisync
variables and lets engine defaults take effect).
The deprecated `cmpd-replication.yaml` and `cmpd-semisync.yaml`
files still render and still reference their respective
ConfigMap templates, so existing Cluster CRs that bound to one of
those CmpDs continue to receive the configspec they originally
resolved against; only the merged CmpD's default behavior changes.
Why async as the default:
Async is safer for new clusters than semisync because semisync's
wait_for_slave gate can silently degrade under partial secondary
failure (the master waits and then falls back to async after the
timeout), which is harder to observe at the cluster level than an
always-async cluster. Users who explicitly want semisync set the
four parameters at create time; the PD CUE schema fail-closes on
invalid values per commit 3 v2.
New ShellSpec contract:
scripts-ut-spec/replication_merged_default_async_configmap_spec.sh
- 4 examples lock the default-async invariant:
1. merged CmpD `configs[].template` references the async
ConfigMap template name (not the semisync one).
2. merged CmpD does NOT reference the semisync ConfigMap
template (negative guard against a silent flip-back).
3. async template my.cnf does not set
`rpl_semi_sync_master_enabled = ON` in defaults.
4. async template my.cnf does not set
`rpl_semi_sync_slave_enabled = ON` in defaults.
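Examples 3 and 4 amount to a negative grep over the async template. A rough shell equivalent is shown below; `enables_semisync` is a hypothetical helper and the pattern is an assumption about how the template would spell the setting (it does not cover the hyphenated `rpl-semi-sync-...` spelling that my.cnf also accepts).

```shell
#!/bin/sh
# enables_semisync FILE: succeed if any uncommented line turns on either
# rpl_semi_sync_*_enabled variable (ON or 1, any case).
enables_semisync() {
  grep -Eiq '^[[:space:]]*rpl_semi_sync_(master|slave)_enabled[[:space:]]*=[[:space:]]*(ON|1)[[:space:]]*$' "$1"
}
```

A default-async guard would then assert that `enables_semisync` fails for `config/mariadb-replication.tpl`.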
What this commit does NOT do:
- Add a kbagent lifecycle action wrapping validate-replication-mode.sh
(still pending design decision: should mode mismatch block reconfigure
OpsRequest, or only surface in test closeout?).
- Wire is_semisync_mode() helper into actual switchover caller sites
(no obvious caller site identified yet; the helper remains staged).
- Run the upgrade gate N=1 against existing async / semisync
Cluster CRs (test environment dependency).
- Add semisync auxiliary variables (rpl_semi_sync_master_wait_no_slave,
rpl_semi_sync_master_wait_point) to the PD schema (users today
override the four core variables; adding the auxiliary set is a
follow-up if/when soak testing shows they are reachable via
parameter override).
Static verification:
- helm template renders cleanly; the merged CmpD's `configs[].template`
field is `mariadb-replication-config-template`.
- shellspec replication_merged_default_async_configmap_spec.sh:
4 of 4 examples pass.
- Full mariadb scripts-ut-spec/ directory:
450 examples / 0 failures / 7 pendings (was 446 / 0 / 7;
+4 from the new default-async spec, no regressions).
No chart version change; no engine version change.
…t 8 default-async flip
Non-blocking cleanup noted in Jack commit 8 review (2026-05-20 00:04,
msg a4c1fc38): the merged PD comment block at paramsdef.yaml
L129-L132 still claimed the underlying ConfigMap object template
remained at mariadb-semisync-config-template. Commit 8 already
flipped that pointer to mariadb-replication-config-template, so the
comment was stale.
Updates the comment to reflect post-commit-8 reality: PD parameter
resolution operates against the async ConfigMap; the four
rpl_semi_sync_* variables are absent from the default my.cnf and only
land when a Cluster CR explicitly sets them via
spec.componentSpecs[].parameters, validated by the CUE schema
declared below.
No runtime / render change; comment-only fix.
Static verification:
- helm template renders cleanly.
- Full mariadb scripts-ut-spec/: 450 examples / 0 failures /
7 pendings (unchanged from post-commit-8 baseline).
…icationMode + conditional derivation (C3 path)
weston 2026-05-20 00:08 msg cb0afa37 directs that the merged
CmpD expose BOTH a single logical replicationMode switch AND the
four real rpl_semi_sync_* variables. Precedence rule: if
replicationMode is set, it overrides the four variables; if the
user also sets one of them with a conflicting literal, KB
rejects the assignment. If replicationMode is unset, the four
variables are freely settable and default to OFF (async).
Implementation layer 1 — PD CUE schema (this commit):
config/mariadb-config-constraint.cue now declares a
`replicationMode` field with enum `"async" | "semisync"` and two
conditional blocks that unify the two `*_enabled` variables with
the corresponding ON / OFF value:
```cue
if replicationMode == "semisync" {
rpl_semi_sync_master_enabled: "ON"
rpl_semi_sync_slave_enabled: "ON"
}
if replicationMode == "async" {
rpl_semi_sync_master_enabled: "OFF"
rpl_semi_sync_slave_enabled: "OFF"
}
```
CUE unification handles the consistency check natively:
- user sets only replicationMode=semisync -> CUE unifies the two
*_enabled fields to ON (whether KB's renderer emits the
derived values into my.cnf is a separate question that Jack
is verifying in parallel; if it does not, layer 2.5 below
applies).
- user sets only the four variables -> no replicationMode
constraint applies; values are validated against their
declared types and bounds as before.
- user sets replicationMode AND the four variables consistently
-> CUE unifies cleanly.
- user sets replicationMode=semisync AND
rpl_semi_sync_master_enabled=OFF (conflict) -> CUE
unification fails on the `*_enabled` field; KB
`ValidateConfigWithCue()` returns a CUE conflict and rejects
the assignment before it lands in the rendered ConfigMap.
The auxiliary `rpl_semi_sync_master_wait_for_slave_count` and
`rpl_semi_sync_master_timeout` fields are NOT constrained by
replicationMode — they remain user-tunable within their declared
int range. (The engine ignores those values when semisync is
OFF, so leaving them unconstrained does not cross-couple modes.)
Implementation layer 2 — KB renderer verification (parallel,
Jack-owned):
Jack 2026-05-20 00:11 confirmed he will run a KB-validator
behavioral test using ValidateConfigWithCue() on a fixture that
sets only replicationMode=semisync, to confirm whether KB
emits the derived `*_enabled = ON` values into the rendered
my.cnf. If yes, the CUE schema in this commit is sufficient
end-to-end. If no, a follow-up commit will add a thin addon-side
mapper in reconfigureAction that fills the four variables when
only replicationMode is set; CUE unification still rejects
conflicting explicit assignments either way.
Implementation layer 3 — ShellSpec updates:
scripts-ut-spec/replication_merged_pd_schema_enum_spec.sh
- Removed the C1 negative assertion that the CUE file does NOT
declare a `replicationMode` key (under C3 it must declare it).
- Added 3 new examples:
- replicationMode enum `"async" | "semisync"` present
- if replicationMode == "semisync" conditional block present
- if replicationMode == "async" conditional block present
- Net change: 14 -> 16 examples in this spec file.
What this commit does NOT do (sequenced for later commits):
- Verify the KB renderer emits CUE-derived values into the
rendered my.cnf. Jack's parallel behavioral test owns this.
- Add the addon-side mapper if Jack's verification shows the
renderer only validates and does not derive. The mapper would
live in reconfigureAction and fill the four variables when
only replicationMode is set.
- Add ShellSpec behavioral tests that exercise the four C3
cases (only mode / only 4 vars / both consistent / both
conflict) end-to-end through KB validator. Those tests
require either a KB CUE harness in the addon repo or a
reproduction of ValidateConfigWithCue logic in shell; deferred
to follow-up commit alongside the mapper question.
- Modify any of the in-flight Jack-PASS-ed commits 1-9.
Static verification:
- helm template renders cleanly. The merged PD now emits a
parametersSchema block whose inline CUE content includes the
replicationMode field and the two conditional blocks.
- shellspec replication_merged_pd_schema_enum_spec.sh:
16 of 16 examples pass (was 14 / 0 / 0 before this commit;
+3 for the new C3 schema additions, -1 for the now-incorrect
C1 negative assertion).
- Full mariadb scripts-ut-spec/ directory:
452 examples / 0 failures / 7 pendings (was 450 / 0 / 7;
net +2: three new C3 schema examples minus the removed C1 negative
assertion; no regressions).
No chart version change; no engine version change.
… replicationmode in CUE (Jack B1 fix)
Builds on commit 11 v1 (CUE revert + open struct) by closing the
behavioral hole Jack surfaced in the v1 review (2026-05-20 00:48 msg
`f8e7e078`): the `[string]: _` open pattern alone allowed any patch
that included a `replicationMode=semisync` key to merge into the
rendered my.cnf as the lowercase `replicationmode=semisync` key,
which mariadbd does not recognize as a server variable. The C3 design
places `replicationMode` at the ComponentSpec-parameter layer
consumed by an addon mapper BEFORE my.cnf render; under no path
should the key appear in the rendered my.cnf.
Change:
- config/mariadb-config-constraint.cue: add an explicit
`replicationmode?: _|_` (CUE bottom) declaration alongside the
`[string]: _` open pattern. CUE unifies an explicit field
declaration with every pattern constraint that matches it, and
unifying bottom with `_` is still bottom, so the bottom-value
declaration fires for the specific lowercase key while unrelated
base my.cnf keys still flow through unchallenged.
- scripts-ut-spec/replication_merged_pd_schema_enum_spec.sh: add one
positive assertion that the CUE file contains
`replicationmode?: _|_`. Net change: 16 -> 17 examples in this spec
file.
Behavioral expectation (matches Jack's locally-verified override
direction in his B1 finding):
- A merge patch including `replicationMode=...` or
`replicationmode=...` (any case) is normalized to `replicationmode`
by KB's INI parser, hits the explicit bottom declaration, and fails
`ValidateConfigWithCue()` with a clear CUE conflict. The merge does
not reach the rendered ConfigMap.
- A merge patch setting only base my.cnf keys (binlog_format,
max_connections, slow_query_log, etc.) flows through the
`[string]: _` open pattern unchallenged. Same for a merge that sets
`rpl_semi_sync_master_enabled=ON`, which still goes through the
existing field constraint.
Static verification:
- helm template renders cleanly.
- shellspec replication_merged_pd_schema_enum_spec.sh: 17 of 17
examples pass.
- Full mariadb scripts-ut-spec/: 453 examples / 0 failures /
7 pendings (was 452 / 0 / 7 before this v2; +1 from the new
replicationmode-forbid assertion).
No chart version change; no engine version change.
What this commit does NOT do:
- Re-introduce `replicationMode` as a ComponentSpec parameter
consumed by an addon mapper. That is commit 12's scope per Jack's
pre-loaded review criteria (msg `3a0f5385`): single mapper
write-site, mapper consumes `replicationMode` before my.cnf render,
5 ShellSpec cases (only mode / only 4 vars / both consistent / both
conflict / mapper failure), fail-closed on mapper failure.
- Run the upgrade gate N=1 or live cluster validation.
…de → 4 engine vars (C3 design, Jack 5-case + 2-boundary contract)
Implements the C3 design mapper that translates the synthetic
`replicationMode` ComponentSpec parameter ("async" | "semisync") into
the four real MariaDB engine variables BEFORE the merged replication
CmpD's reconfigureAction.persisted main loop renders any my.cnf or
runtime-overrides.d/ file. CUE backstop in commit 11 v2
(`replicationmode?: _|_`) keeps the synthetic key from ever landing
in my.cnf; the mapper is the canonical write-site that owns the
translation.
Why a mapper instead of a CUE conditional
------------------------------------------
Jack's KB-validator behavioral test against commit `dc645466` proved
that KB's `pkg/parameters/validate/cue_util.go ValidateConfigWithCue()`
validates parameter values against a CUE schema but does NOT emit
CUE-derived field values back into the rendered my.cnf. Expressing
C3 precedence in CUE alone would either silently ignore
`replicationMode` or land the verbatim key in my.cnf (which mariadbd
rejects as unknown). The C3 design therefore places `replicationMode`
at the ComponentSpec-parameter layer consumed by this addon-side
mapper before my.cnf render.
Behavior contract (5 cases + 2 boundaries locked with Jack pre-loaded
review criteria msg `3a0f5385` and 2026-05-20 dm:@jack msg
`e8c80793` / `144afd93` / `2e93eb72`)
----------------------------------------------------------------------
1. mapper is the UNIQUE consumer / writer of `replicationMode`.
Sourced exactly once from `mariadb.config.reconfigureAction.persisted`
BEFORE the main loop processes any parameter. Synthetic key never
reaches SET GLOBAL or runtime-overrides.d/.
2. Conflict detection runs BEFORE any file modification. When the
user supplies both `replicationMode=semisync` and any of the four
real engine variables with a disagreeing value, the mapper exits
non-zero (code 3) and leaves the parameter list as-is, with no
partial state.
3. Mapper failure (invalid mode → code 2, conflict → code 3, bad arg
→ code 4, IO failure → code 5) always produces non-zero exit and
no partial state. The persisted helper exits 1 on any mapper
non-zero return; main loop does not run.
4. Only-4-vars path: when `MARIADB_REPLICATION_MODE` is empty or
unset, the mapper returns 0 immediately and the parameter list
flows through unchanged. The four real engine variables continue
to be processed exactly as before. Verified by sha256 invariance.
5. Both-consistent is idempotent: user supplying both
`replicationMode=semisync` and matching real vars yields exactly
one assignment per real var (no duplicates). Repeated mapper
invocation produces byte-identical output. Verified via sha256
compare across two passes.
Boundary 1 — call-site uniqueness: the persisted helper sources the
mapper exactly once (grep -c == 1) and gates the call on file
readability so non-merged topologies (e.g. cmpd-semisync.yaml using
the same persisted helper) are safe no-ops.
Boundary 2 — byte-equal short-circuit in main loop: when the new
tmp override file is byte-identical to the existing override file,
the helper skips `mv` so the on-disk mtime is preserved across
no-op reconfigures. Required removing the alpha.86 timestamp
comment line that forced every rewrite to differ. The skip branch
runs strictly after safety validation (is_safe_param_name +
is_safe_param_value) and after the mapper-driven conflict check
(mapper runs before main loop), but before the atomic mv. Conflict
cases never reach this point because the mapper exits non-zero
before any tmp file is written.
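The byte-equal short-circuit in Boundary 2 can be sketched as a small POSIX shell helper. This is a minimal illustration, not the code in `_helpers.tpl`: the function name `install_if_changed`, its printed status words, and the return code 5 on a failed `mv` are assumptions made for the example.

```shell
#!/bin/sh
# install_if_changed <tmp_file> <target_file>   (hypothetical helper name)
# Moves the staged tmp file over the target only when the contents differ.
# When they are byte-identical, the tmp file is discarded and the target is
# left untouched, so its mtime survives a no-op reconfigure.
install_if_changed() {
  tmp=$1
  target=$2
  if [ -f "${target}" ] && cmp -s "${tmp}" "${target}"; then
    rm -f "${tmp}"
    echo "unchanged"
    return 0
  fi
  # Illustrative IO-failure code; the real helper may report differently.
  mv -f "${tmp}" "${target}" || return 5
  echo "updated"
}
```

The comparison is only meaningful when the staged content carries no per-write metadata, which is why the timestamp comment line had to go.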
Changes in this commit
-----------------------
- `scripts/replication-mode-mapper.sh` (new, 255 lines):
- `apply_replication_mode_mapping <parameter_file>` is the single
entry point. Returns 0 on success, 2 on invalid mode, 3 on
conflict, 4 on bad arg, 5 on IO failure.
- Defense-in-depth synthetic-key strip (covers both
`replicationMode` and lowercase `replicationmode`).
- Atomic in-place rewrite via tmp + mv; cleanup on any error path.
- Source-time `__SOURCED__` guard for ShellSpec testability;
standalone execution also supported with same contract.
- `templates/_helpers.tpl` `mariadb.config.reconfigureAction.persisted`:
- Sources `/scripts/replication-mode-mapper.sh` (gated on file
readability) and calls `apply_replication_mode_mapping
"${parameter_file}"` BEFORE the parameter-empty check and main
SET GLOBAL + persist loop.
- On mapper non-zero return, exits 1 with a clear "replicationMode
mapper failed" sentinel so the reconfigure OpsRequest fails
closed and operator sees the diagnosis.
- Adds `cmp -s` short-circuit in the persist loop: when the new
tmp override file is byte-identical to the existing on-disk
override file, skips the atomic mv. Removes the alpha.86
timestamp comment line so byte-compare is meaningful for
identical values.
- `templates/configmap-scripts-replication.yaml`:
- Mounts `replication-mode-mapper.sh` at `/scripts/` via the
replication scripts ConfigMap. Same mount as commit 5's
`validate-replication-mode.sh`; no chart version bump needed
because ConfigMap content is not subject to CmpD immutability.
- `scripts-ut-spec/replication_mode_mapper_spec.sh` (new, 332 lines,
23 examples covering):
- Case 1 (only mode): 3 examples — semisync + async + synthetic
key strip.
- Case 2 (only 4 vars / mode empty / mode unset): 2 examples,
sha256-verified no-op.
- Case 3 (both consistent): 3 examples — derived appended, no
duplicates, sha256-verified idempotent across two passes.
- Case 4 (both conflict): 3 examples — exit code 3, stderr clear,
sha256-verified parameter list unchanged on conflict.
- Case 5 (mapper failure): 4 examples — invalid mode (exit 2),
bad arg (exit 4), sha256-verified parameter list unchanged on
invalid mode.
- Synthetic strip backstop: 2 examples — camelCase and lowercase.
- Unique-call-site contract: 4 examples grep'ing _helpers.tpl.
- Byte-equal short-circuit contract: 2 examples grep'ing
_helpers.tpl for `cmp -s` presence and absent timestamp line.
Static verification
--------------------
- `helm lint addons/mariadb`: PASS
- `helm template test addons/mariadb`: renders cleanly (16464 lines)
- `bash -n` + `dash -n` on the mapper script: PASS both shells
- Full mariadb scripts-ut-spec/:
476 examples / 0 failures / 7 pendings (was 453 / 0 / 7 before
commit 12; +23 from the new mapper spec; existing alpha.86
persisted helper tests all preserved).
What this commit does NOT do
-----------------------------
- Wire `MARIADB_REPLICATION_MODE` env injection from the user's
ComponentSpec parameter into the kbagent action container. The
mapper reads the env var; the actual plumbing (ParametersDefinition
field or CmpD vars: entry) is left for a follow-up commit pending
weston pace decision. Until plumbing lands, the mapper is a safe
no-op (empty env → return 0).
- Live cluster validation. The persisted helper has not been
exercised end-to-end through the kbagent action runtime with this
commit's changes; that is N=1 RED gate work for the cluster lane.
No chart version change; no engine version change.
…(B1) and unconditional synthetic-strip (B2) per Jack contract review
Builds on commit 12 v1 (`1e9bc910`) by closing the two contract
blockers Jack surfaced in his 5-case + 2-boundary behavioral review
(2026-05-20 dm:@jack msg `008885e2`). Static gates all passed in v1,
but two contract claims were not actually enforced by the shipped
code. v2 makes them real with both code fixes and ShellSpec guards.
B1 - `_helpers.tpl` lost the mapper's original rc
---------------------------------------------------
Earlier code:
    if ! apply_replication_mode_mapping "${parameter_file}"; then
      mapper_rc=$?
      ...
The `!` inverts the exit code, so inside the then-block `$?` is 0
(the inverted value), not the mapper's 2/3/4/5. The fail-closed
sentinel still fired (exit 1), but it printed `rc=0`, hiding which
contract layer (invalid mode / conflict / IO / bad arg) actually
broke. First-blocker classification downstream cannot read the
correct layer from the action log. Jack's minimal repro:
`f(){ return 3; }; if ! f; then echo $?; fi` prints `0`.
Fix:
    mapper_rc=0
    apply_replication_mode_mapping "${parameter_file}" || mapper_rc=$?
    if [ "${mapper_rc}" -ne 0 ]; then
      echo "replicationMode mapper failed (rc=${mapper_rc}); ..." >&2
      exit 1
    fi
The `|| <assign>` chain preserves the original rc and disables
`set -e` for the mapper invocation, so the action stays alive long
enough to emit the rc-aware diagnostic.
B2 - unconditional synthetic-strip claim was false
--------------------------------------------------
Earlier code path in `apply_replication_mode_mapping`:
    if [ -z "${mode}" ]; then return 0; fi
    ... strip synthetic happens after this empty-mode early return ...
A parameter list containing `replicationMode=semisync` with
`MARIADB_REPLICATION_MODE` unset returned rc=0 and left the synthetic
key in the file, contradicting the script preamble and the v1 commit
message that both claimed "mapper unconditionally strips any
replicationMode / replicationmode line".
In current product context this was not a runtime FAIL: KB's CUE
`replicationmode?: _|_` (commit 11 v2) blocks the synthetic key from
ever reaching the parameter list. But the mapper's defense-in-depth
contract was theatre, not real, and v1's commit message and ShellSpec
assertions misrepresented the behavior.
Fix: move the synthetic-strip BEFORE the empty-mode early return. The
strip uses a tmp file + `cmp -s` so clean only-4-vars input is
byte-identical (mtime preserved); only inputs that actually contain a
synthetic key are rewritten. The Jack contract item 4 (only-4-vars
unchanged on clean input) holds.
Changes in this commit
-----------------------
- `scripts/replication-mode-mapper.sh`:
- UNCONDITIONAL synthetic-strip block moved BEFORE the empty-mode
early return. Uses tmpfile + cmp -s + atomic mv so clean
only-4-vars input is byte-identical, only inputs with a synthetic
key are rewritten.
- Updated comment block above the empty-mode early return to
document the new defense-in-depth ordering.
- `templates/_helpers.tpl` `mariadb.config.reconfigureAction.persisted`:
- Replaced `if ! apply_replication_mode_mapping ...; then
mapper_rc=$?` with `mapper_rc=0; apply_replication_mode_mapping ...
|| mapper_rc=$?; if [ "${mapper_rc}" -ne 0 ]; then ... fi` so the
original exit code (2/3/4/5) flows into the diagnostic sentinel.
- `scripts-ut-spec/replication_mode_mapper_spec.sh` (+68 lines):
- 3 new behavioral examples for B2:
- synthetic strip when MARIADB_REPLICATION_MODE unset
- synthetic strip when MARIADB_REPLICATION_MODE empty string
- byte-identical preservation for clean only-4-vars input under
the unconditional strip
- 3 new contract examples for B1 on _helpers.tpl:
- rejects the rc-losing `if ! apply_replication_mode_mapping`
antipattern (grep regex excludes comment lines so the fix
rationale comment does not false-positive)
- locks the `|| mapper_rc=$?` rc-preservation form
- locks the `if [ "${mapper_rc}" -ne 0 ]` rc-aware check
Static verification
--------------------
- `helm lint addons/mariadb`: PASS
- `helm template`: clean render; mapper invocation now shows
`|| mapper_rc=$?` + `if [ "${mapper_rc}" -ne 0 ]` for both
cmpd-semisync and cmpd-replication-merged.
- `bash -n` + `dash -n` on the mapper script: PASS both shells.
- Smoke reproduction in real shell:
- unset mode + synthetic key in file → rc=0, synthetic stripped,
real var preserved.
- clean only-4-vars input + unset mode → sha256 byte-identical
pre/post.
- Focused `replication_mode_mapper_spec.sh`: 29 examples / 0 failures
(was 23 in v1; +3 B2 fix examples + 3 B1 contract examples).
- Full mariadb scripts-ut-spec/: 482 examples / 0 failures / 7
pendings (was 476 in v1; +6 from B1+B2 lock examples).
8-class contract walk-through
------------------------------
- Class 4 (sentinel/rc): B1 fixed. The sentinel now carries the actual
mapper rc; tests lock the `|| mapper_rc=$?` rc-preservation form.
- Class 1 (silent fallback): B2 fixed. The synthetic key is now
unconditionally stripped; tests cover unset + empty-string env.
- Other classes: no v2 regressions; commit 11 v2 CUE `_|_` backstop
unchanged; alpha.86 persisted helper invariants unchanged.
No chart version change; no engine version change.
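The rc-losing and rc-preserving forms from B1 can be compared side by side in a standalone script. This is a minimal sketch built on Jack's repro; `fake_mapper`, `lossy_rc`, and `preserved_rc` are hypothetical names standing in for the real mapper call and the helper body.

```shell
#!/bin/sh
# Hypothetical stand-in for the mapper: always reports a conflict (rc=3).
fake_mapper() { return 3; }

# rc-losing form: inside the then-block, $? holds the status of the
# negated command (0), not the mapper's own exit code.
lossy_rc() {
  if ! fake_mapper; then
    echo "$?"
  fi
}

# rc-preserving form: the right-hand side of || runs only on failure,
# and $? there is still the mapper's original exit status.
preserved_rc() {
  mapper_rc=0
  fake_mapper || mapper_rc=$?
  echo "${mapper_rc}"
}

lossy_rc      # prints 0
preserved_rc  # prints 3
```

A command on the left of `||` is also exempt from `set -e`, so the second form keeps the script alive long enough to print the diagnostic.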
…ire via Helm value (C3 plumbing Option C)
Wires the addon-side mapper's input env var. The mapper from commit 12
has been a safe no-op so far because nothing set MARIADB_REPLICATION_MODE.
Commit 13 connects the merged CmpD container env to a top-level Helm
value mariadb.replication.mode so chart users can pick async or
semisync at install/upgrade time.
Why Helm value path (Option C) and not user OpsRequest path
-----------------------------------------------------------
The standard KB ParametersDefinition reconfigure flow validates user
parameter values against the PD CUE schema and renders them into the
target ConfigMap (my.cnf). The CUE schema declared in commit 11 v2
has replicationmode forbidden as a CUE bottom value, so a user
OpsRequest setting replicationMode=semisync is rejected by KB
ValidateConfigWithCue with a CUE conflict, before the mapper sees
anything. That backstop exists to keep the synthetic key out of
my.cnf at the engine layer (mariadbd does not recognize replicationmode
and would log unknown-variable warnings at startup).
The Helm value path is the conservative plumbing that does not require
speculative KB behavior verification. It sets the topology default at
chart install/upgrade time. Runtime mode flip via OpsRequest is
deferred to a future commit that wires either a Cluster annotation
plus addon-side reader OR a non-INI-bound PD, depending on KB
behavior research. The mapper interface in commit 12 is unchanged in
either case; only the env source differs.
Changes
-------
- addons/mariadb/templates/cmpd-replication-merged.yaml:
Adds a MARIADB_REPLICATION_MODE env entry to the mariadb container
env block, valued from .Values.replication.mode with empty-string
default. Empty default preserves existing behavior on clusters whose
values do not set this key.
- addons/mariadb/values.yaml:
Adds top-level replication section with mode key defaulting to "".
Accepted values are "", async, semisync. Invalid values fail the
reconfigureAction with mapper rc=2 and exit 1 with no partial state.
- addons/mariadb/scripts-ut-spec/replication_merged_replication_mode_env_wire_spec.sh
(new, 10 examples):
Locks the wire-up at three layers:
1. values.yaml declares replication.mode with empty default
2. cmpd-replication-merged.yaml declares MARIADB_REPLICATION_MODE
env via .Values.replication.mode pipe with empty-default and
quoted output
3. helm template produces the expected env declaration with default
empty, semisync override, and async override
Plus a cross-topology negative: standalone and galera CmpDs do NOT
declare this env (they do not have the mapper wired in their
reconfigureAction helpers).
Static verification
-------------------
- helm lint PASS
- helm template default produces MARIADB_REPLICATION_MODE with empty value
- helm template --set replication.mode=semisync produces semisync value
- helm template --set replication.mode=async produces async value
- Full mariadb scripts-ut-spec: 492 examples / 0 failures / 7 pendings
(was 482 from commit 12 v2; +10 from new env wire spec; commit 12
v2 mapper invariants all preserved)
What this commit does NOT do
----------------------------
- Wire runtime OpsRequest reconfigure to flip replication.mode. Helm
value is install-time only. A future commit can add a Cluster
annotation reader OR a non-INI-bound PD declaration; the mapper
interface stays the same.
- Live cluster validation. The chart static gates pass; kbagent
runtime exercising of the env-to-mapper path on a real cluster is
the cluster lane work.
No chart version change; no engine version change.
…elm template-time fail-closed (Jack B1+B2 fix)
Builds on commit 13 v1 (ae2698a) by closing the two contract gaps
Jack surfaced in his rendered-level review (msg f9433634):
B1 - Helm value was install-time API but had no install-time
write-site. The env var was plumbed but no consumer existed before
the first reconfigureAction trigger. A chart user setting
mariadb.replication.mode=semisync at install would still boot async
until OpsRequest reconfigure fired.
B2 - Invalid Helm value did not fail at render time. helm template
--set replication.mode=bogus rendered cleanly with a latent bad env;
the failure surfaced only at container startup (correctly
fail-closed, but diagnosis loop unnecessarily long and the bad value
already embedded in the rendered CmpD).
B1 fix - install-time seeder
----------------------------
New scripts/seed-replication-mode-overrides.sh (135 lines):
- Reads MARIADB_REPLICATION_MODE from env at container startup
- For valid mode, writes 4 per-parameter .cnf files into
runtime-overrides.d/ BEFORE the first mariadbd start
- Output is byte-identical to what reconfigureAction.persisted writes
for the same env, so the two write-sites converge and a later
reconfigure is a byte-equal no-op
- cmp -s short-circuit preserves mtime across kubelet restarts / pod
re-creates
- Empty env: no-op (return 0); preserves existing behavior
- Invalid mode: stderr + return 2; container fails to start mariadbd
- Missing overrides dir: stderr + return 5
Wire-up in cmpd-replication-merged.yaml main container command body:
- Source the seeder via /scripts/seed-replication-mode-overrides.sh
- Run AFTER runtime-overrides.d permission reapplication and BEFORE
the first start_mariadbd_process call
- Container exits 1 on seeder non-zero rc with a clear sentinel
- Wire is gated on file readability so non-merged topologies that do
not mount the seeder are safe no-ops
Mount in configmap-scripts-replication.yaml:
- Added seed-replication-mode-overrides.sh to the same scripts
ConfigMap that mounts replication-mode-mapper.sh
B2 fix - Helm template-time validator
--------------------------------------
New helper in templates/_helpers.tpl:
- mariadb.replication.mode.validate accepts "", async, semisync
- Any other value triggers Helm fail with a clear printf error
- Returns the validated value for the caller to consume
- Called from cmpd-replication-merged.yaml's MARIADB_REPLICATION_MODE
env declaration; replaces the bare .Values.replication.mode pipe
Effect: helm template --set replication.mode=bogus now exits with
"invalid mariadb.replication.mode=bogus; expected one of ..." before
any manifest is produced. No bad value can ever land in a rendered
CmpD env.
ShellSpec
---------
New scripts-ut-spec/seed_replication_mode_overrides_spec.sh
(16 examples):
- empty / unset: no-op + zero override files written
- semisync: writes 4 correct .cnf files
- async: writes OFF / OFF / 1 / 10000
- invalid: rc=2 + zero override files
- missing dir: rc=5
- idempotency: mtime preserved across two invocations
- convergence: each .cnf is exactly 2 lines (no timestamp metadata)
- static contract: configmap-scripts mounts seeder twice (data key +
comment header); cmpd-replication-merged.yaml invokes seeder via
the source line and exits 1 on non-zero rc
Updated replication_merged_replication_mode_env_wire_spec.sh
(+6 net examples, 10 -> 16):
- Removed obsolete bare .Values.replication.mode grep (the env now
flows through the validator helper, not a bare pipe)
- Added 4 examples for Helm template-time fail-closed: bogus /
garbage / mixed-case ASYNC all rejected; empty "" still accepted
- Added 4 examples for the validator helper itself: defined in
_helpers.tpl + uses default "" + calls fail() on bad value
Static verification
-------------------
- helm lint PASS
- helm template default: env value ""
- helm template --set replication.mode=semisync: env value "semisync"
- helm template --set replication.mode=async: env value "async"
- helm template --set replication.mode=bogus: FAILS at render with
clear "invalid mariadb.replication.mode" sentinel
- bash -n + dash -n on seeder: PASS both shells
- Smoke reproduction in real shell:
- empty env: no override files written
- semisync: 4 files with correct content
- repeated invocation: mtime preserved
- bogus: rc=2 + no partial state
- Full mariadb scripts-ut-spec: 514 examples / 0 failures / 7
pendings (was 492 from commit 13 v1; +16 from new seeder spec; +6
net from env-wire spec update)
What this commit does NOT do
----------------------------
- Wire runtime OpsRequest reconfigure to flip replication.mode at
runtime. The Helm value remains install-time only; runtime mode
flip is deferred to a future commit using a Cluster annotation
reader OR a non-INI-bound PD. Once that wire lands, the mapper +
seeder pair already in place handles both write-sites without
further changes.
- Live cluster validation. The chart static + behavioral gates all
pass; kbagent runtime exercising of the env-to-seeder-to-mariadbd
path on a real cluster is the cluster lane work.
No chart version change; no engine version change.
…te contract gaps B3+B4+B5 (Jack HOLD review msg)
Builds on commit 13 v2 (1f2fe79) by closing three contract gaps Jack
surfaced in his install-time review:
B3 - script missing silent fallback when mode is non-empty. v2 used
`if [ -r /scripts/seed...sh ]; then ... fi` which silently skipped
the seeder when the script was unreadable. A non-empty
MARIADB_REPLICATION_MODE could degrade to async (Class 1 silent
fallback).
B4 - target as directory bypass. v2 wrote tmp + mv unconditionally.
If a target path existed but was NOT a regular file (e.g. a directory
created by a prior buggy run or out-of-band action),
`mv tmp existing_dir` succeeded by moving tmp INTO the directory. The
target remained a directory, the override content never landed in the
expected file, but seeder returned rc=0.
B5 - partial-state on multi-file failure. v2 wrote files
sequentially. If file 3 failed, files 1 and 2 were already committed
to the PVC. After a partial failure, switching mode back to "" would
not clean the orphan files (seeder no-op leaves them).
Additionally a separate write-failure-detection bug in v2 used
`{ ... } > tmp` compound-command form whose redirection failure does
NOT propagate to the surrounding `if !` in bash, so write failures
were silently undetected.
B3 fix - container wire-up
--------------------------
cmpd-replication-merged.yaml: when MARIADB_REPLICATION_MODE is
non-empty AND seeder script is unreadable, container exits 1 with
sentinel before any mariadbd start. When the env is empty the
original lenient `if [ -r ]` form is preserved because the seeder is
a no-op anyway and a missing-script scenario on async clusters should
still let them boot.
B4 fix - target type validation
-------------------------------
seed-replication-mode-overrides.sh: new
seed_replication_mode_validate_target_type helper checks each target
is absent OR a regular file BEFORE any tmp write. If any target
exists as a directory / symlink-to-dir / device / fifo / socket the
seeder returns rc=5 immediately and writes nothing. Post-rename
sanity check verifies each renamed target is now a regular file.
B5 fix - multi-file all-or-nothing pattern
-------------------------------------------
seed-replication-mode-overrides.sh restructured into 5 phases:
- Phase A derive: compute all 4 (name, value) pairs into shell vars.
Fail-closed on invalid mode BEFORE any disk write.
- Phase B pre-validate target types: run the B4 check on all 4
targets BEFORE any tmp is written.
- Phase C write 4 tmp files: any write failure triggers
cleanup_all_tmps and returns rc=5. The cleanup function removes any
subset of tmp files that exist for the current pid suffix.
- Phase D byte-equal compare: each tmp vs target; targets already at
the staged value skip rename to preserve mtime.
- Phase E atomic rename in tight sequence: minimizes the
partial-commit window. Post-rename sanity check.
Also fixed the v2 write-failure-detection bug: `{ ... } > tmp`
replaced with `printf '...' > tmp` (a simple command) whose redirect
failure DOES propagate to `if !`. Added `[ -s tmp ]` post-check as
defense in depth against silent truncation.
ShellSpec
---------
seed_replication_mode_overrides_spec.sh: 16 -> 21 examples
- B3 wire contract (2 examples):
- cmpd-replication-merged.yaml contains the "set but seeder script
is missing or unreadable" sentinel.
- cmpd-replication-merged.yaml gates the missing-script check on
non-empty MARIADB_REPLICATION_MODE (`if [ -n ... ]`).
- B4 behavior (2 examples):
- Pre-create wait_for_slave_count target as a directory + seeder
runs -> rc=5 with "exists but is not a regular file" sentinel.
- With master_timeout target as directory: zero .cnf or .tmp residue
after seeder runs (find -type f returns 0 for both).
- B5 behavior (1 example, condensed into single When call to handle
ShellSpec stderr-capture mechanics):
- chmod 0555 on overrides dir + seeder runs -> rc=5 + zero .cnf
residue + zero .tmp residue.
- "fails the container on seeder non-zero rc" assertion updated:
count is now 2 (original "seed-replication-mode-overrides failed"
sentinel + new "set but seeder script is missing or unreadable"
sentinel both match the regex).
Static verification
-------------------
- helm lint PASS
- helm template default/semisync/async: render cleanly
- helm template --set replication.mode=bogus: still fails with B2
sentinel
- bash -n + dash -n on seeder: PASS both shells
- Smoke reproduction in real shell:
- target-as-directory + semisync mode -> rc=5, zero .cnf written
- chmod 0555 + semisync mode -> rc=5, zero residue, zero partial
- clean valid case -> rc=0, all 4 files written with correct content
- Full mariadb scripts-ut-spec: 519 examples / 0 failures / 7
pendings (was 514 from commit 13 v2; +5 from B3/B4/B5 lock
examples)
8-class contract walk-through update
------------------------------------
- Class 1 silent fallback: B3 + write-failure-detection both fixed;
no remaining silent-fallback paths from non-empty mode.
- Class 3/5 single-commit-boundary partial state: B5 fixed via phased
write-all-then-rename; combined with B4 pre-validation, the common
failure modes (target type mismatch, write permission, disk full at
write time) are all caught before any rename occurs.
- Class 4 sentinel rc: unchanged from v2; invalid mode -> rc=2, IO
failure -> rc=5.
What this commit does NOT do
----------------------------
- Roll back already-committed renames on mid-batch rename failure.
That requires saving prior content for restoration and is a
narrower failure mode than the common cases addressed here. The
seeder logs loudly and the container refuses to start mariadbd on
any rename failure, so an Ops engineer sees the diagnosis.
- Fix the same `{ ... } > tmp` redirect-detection bug in the alpha.86
persisted helper (`_helpers.tpl`
mariadb.config.reconfigureAction.persisted). That is a separate
pre-existing issue out of commit 13 scope.
No chart version change; no engine version change.
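The B4 target-type check and the B5 stage-then-rename ordering can be sketched together in POSIX shell. This is an illustration under stated assumptions, not the shipped seeder: `stage_and_commit` is a hypothetical name, the two-line `[mysqld]` file body is a guess at the `.cnf` shape, and phases D and E are folded into one loop for brevity.

```shell
#!/bin/sh
# stage_and_commit <dir> <name=value>...   (hypothetical helper)
# Returns 0 on success and 5 on a type or IO problem. Nothing is renamed
# until every target has been validated and every tmp file staged.
stage_and_commit() {
  dir=$1
  shift
  [ -d "${dir}" ] || return 5

  # Pre-validate: each target must be absent or a regular file, otherwise
  # `mv tmp target` could move the tmp file INTO a directory.
  for pair in "$@"; do
    target="${dir}/${pair%%=*}.cnf"
    if [ -e "${target}" ] && [ ! -f "${target}" ]; then
      echo "${target} exists but is not a regular file" >&2
      return 5
    fi
  done

  # Stage: write every tmp file first; on any failure remove all of them.
  for pair in "$@"; do
    name=${pair%%=*}
    value=${pair#*=}
    tmp="${dir}/${name}.cnf.tmp.$$"
    # File body is an assumed format, used only for this sketch.
    if ! printf '[mysqld]\n%s=%s\n' "${name}" "${value}" > "${tmp}" ||
       [ ! -s "${tmp}" ]; then
      rm -f "${dir}"/*.cnf.tmp.$$
      return 5
    fi
  done

  # Commit: skip byte-identical targets (mtime preserved), rename the rest.
  for pair in "$@"; do
    name=${pair%%=*}
    tmp="${dir}/${name}.cnf.tmp.$$"
    target="${dir}/${name}.cnf"
    if [ -f "${target}" ] && cmp -s "${tmp}" "${target}"; then
      rm -f "${tmp}"
    else
      mv -f "${tmp}" "${target}" || return 5
    fi
  done
}
```

A call such as `stage_and_commit /tmp/overrides rpl_semi_sync_master_enabled=ON` writes one file. As in the commit, a rename that fails midway is not rolled back.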
…f feeding full helm template to ShellSpec matcher (Jack test-harness HOLD msg)
Closes the commit 13 v3 test-gate HOLD: the env-wire ShellSpec
previously fed the entire ~16k-line `helm template` output into
`When call ... The output should include ...`. ShellSpec's matcher
loads the captured stdout into memory and pattern-matches against it,
which was slow / unstable on macOS+Homebrew bash and timed out in the
test engineer's 34s budget.
v3 v2 refactor:
- Render `helm template` once into a tmp file via a small helper
(`render_to_tmp` / `render_stderr_to_tmp`)
- Grep the tmp file for the relevant 2-line shape via
`grep -F -A1 'name: MARIADB_REPLICATION_MODE' | awk`
- ShellSpec matcher only sees a bounded result (~3 chars: "ok"), not
the full manifest
- AfterEach cleans up the tmp file
Same contract assertions; faster matcher path:
- focused env-wire spec: 279s -> 1.29s (216x speedup)
- full mariadb scripts-ut-spec: ~5min -> ~57s
Static verification (unchanged from commit 13 v3 behavior layer):
- helm lint PASS
- helm template default/semisync/async renders correctly
- helm template --set replication.mode=bogus|garbage|ASYNC fails at
render with `invalid mariadb.replication.mode` sentinel
- Focused env-wire ShellSpec: 16/0
- Full mariadb scripts-ut-spec: 519/0/7 (unchanged count)
No chart behavior change; only test harness refactor.
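The bounded-result idea can be sketched without Helm by pointing the grep at an already rendered file. This is a minimal illustration: `env_value_matches` is a hypothetical name, and the assumed manifest shape is a `name:` line followed directly by a quoted `value:` line.

```shell
#!/bin/sh
# env_value_matches <rendered_manifest_file> <expected_value>  (hypothetical)
# Prints "ok" when the line after `name: MARIADB_REPLICATION_MODE` carries
# value: "<expected>", and "mismatch" otherwise. The matcher downstream only
# ever sees this one short word, never the full manifest.
env_value_matches() {
  manifest=$1
  expected=$2
  grep -F -A1 'name: MARIADB_REPLICATION_MODE' "${manifest}" |
    awk -v want="value: \"${expected}\"" '
      index($0, want) { found = 1 }
      END { print (found ? "ok" : "mismatch") }
    '
}
```

In a spec, the manifest would be rendered once into a tmp file, the helper called per assertion, and the file removed afterwards.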
…CUE bottom that breaks live PD OpenAPI schema gen (live N=1 first-blocker fix)
Live N=1 verification in vcluster `mariadb-test5` with KubeBlocks
controller image `apecloud/kubeblocks:pr-10252-1c8723184` revealed
that commit 11 v2's `replicationmode?: _|_` declaration BREAKS the
live PD reconcile loop:
failed to generate openAPISchema: failed to marshal cue-yaml:
explicit error (_|_ literal) in source
The `mariadb-replication-merged-pd` ParametersDefinition never goes
Available; downstream Component reconcile reports the PD as
unavailable; Pods never get created; the chart is unusable.
Local KB validator fixtures and ShellSpec accepted the bottom
declaration because they exercise `ValidateConfigWithCue()` directly
on already-parsed CUE values. The live controller has a separate
codepath that generates an OpenAPI schema from the CUE source
before activating the PD; that codepath does NOT accept any CUE
bottom literal.
Fix
---
Remove `replicationmode?: _|_` from `config/mariadb-config-constraint
.cue`. The synthetic-key defense-in-depth shifts entirely to the
three layers that already exist and are exercised by ShellSpec:
1. Helm template-time validator (`mariadb.replication.mode.validate`
helper in `_helpers.tpl`) rejects invalid `mariadb.replication
.mode` Helm values at `helm template` time.
2. Startup seeder (`scripts/seed-replication-mode-overrides.sh`,
sourced from `cmpd-replication-merged.yaml` container command
body) writes the four real `rpl_semi_sync_*` engine variables
to `runtime-overrides.d/` BEFORE mariadbd starts. Synthetic
key never appears in any rendered file.
3. Reconfigure mapper (`scripts/replication-mode-mapper.sh`,
sourced from `reconfigureAction.persisted`) unconditional strip
of `replicationMode` / `replicationmode` from the parameter
list BEFORE the main loop reaches `SET GLOBAL` or
`runtime-overrides.d/` writes.
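The third layer, the unconditional strip, can be sketched as follows. This is an illustration rather than the mapper's code: `strip_synthetic_key` is a hypothetical name, the parameter list is assumed to hold one `key=value` pair per line, and the return codes 4 and 5 mirror the bad-arg and IO-failure codes only for consistency.

```shell
#!/bin/sh
# strip_synthetic_key <parameter_file>   (hypothetical helper)
# Removes any replicationMode / replicationmode assignment line. A file
# without such a line is left untouched so its mtime is preserved.
strip_synthetic_key() {
  file=$1
  [ -f "${file}" ] || return 4
  tmp="${file}.strip.$$"
  grep -v -E '^[[:space:]]*replication[Mm]ode[[:space:]]*=' "${file}" > "${tmp}"
  # grep exits 1 when no line survives the filter; only >1 is a real error.
  [ $? -le 1 ] || { rm -f "${tmp}"; return 5; }
  if cmp -s "${tmp}" "${file}"; then
    rm -f "${tmp}"
  else
    mv -f "${tmp}" "${file}" || { rm -f "${tmp}"; return 5; }
  fi
}
```

Running it before any mode check means the synthetic key is removed even when `MARIADB_REPLICATION_MODE` is empty or unset.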
A user OpsRequest that includes `replicationmode=<value>` now
passes CUE validation via the `[string]: _` open pattern, lands
in the rendered my.cnf, and produces a mariadbd unknown-variable
warning at next restart. That is noise, not a fail-closed product
break — mariadbd ignores unrecognized server variables (does not
refuse startup). Reconfigure mapper still strips the synthetic
key from the SET GLOBAL / persist loop, so the engine never sees
a `replicationmode` runtime variable assignment.
Runtime synthetic-key OpsRequest fail-closed (KB-validator
rejecting `replicationmode=*`) is deferred to alpha.90 — needs
a KB-supported form that survives OpenAPI schema generation.
Candidates documented in the case appendix (PR kubeblocks-addon
-docs #258): a CUE pattern that compiles to OpenAPI without
bottom, a pre-flight admission webhook on Cluster CR annotations,
or a PD validation hook.
Changes
-------
- `config/mariadb-config-constraint.cue`: remove the
`replicationmode?: _|_` declaration. The preceding multi-line
comment is rewritten to document the live N=1 first-blocker
finding, the three remaining defense layers, and the alpha.90
deferred runtime fail-closed candidates.
- `scripts-ut-spec/replication_merged_pd_schema_enum_spec.sh`:
the positive grep that locked `replicationmode?: _|_` is
inverted into a negative-assertion that the file does NOT
declare the bottom value as code (the grep regex excludes
comment lines so the rationale comment does not false-positive).
Same example count (17), same suite count.
Static verification
-------------------
- `helm lint addons/mariadb`: PASS
- `helm template test addons/mariadb` and `--set replication.mode=
semisync`: render cleanly (CUE no longer contains a bottom
literal, so future live PD reconcile will succeed).
- Focused `replication_merged_pd_schema_enum_spec.sh`: 17 examples
/ 0 failures.
- Full mariadb scripts-ut-spec: 519 examples / 0 failures / 7
pendings (unchanged from commit 13 v3; the bottom-value lock
example was inverted, not added).
What this commit does NOT do
----------------------------
- Add a runtime KB-validator path that rejects `replicationmode=*`
user input. Deferred to alpha.90; case appendix records the
candidate forms.
- Re-run live N=1 validation in `mariadb-test5`. The test cluster
has the alpha.89 chart with the broken CUE already installed;
Jack will need to uninstall + reinstall (or apply a CmpD update
if the chart version supports it) and re-run the install-time
semisync first-boot SQL ON/ON check.
- Roll forward to alpha.90 chart version. The `_|_` revert
affects CUE content, not the CmpD spec shape, so within the
alpha.89 chart we can keep iterating.
No chart version change; no engine version change.
…sion compatibilityRules (live N=1 second first-blocker fix)
Live N=1 retest on commit 14 chart in vcluster `mariadb-test5`
(release `mariadb-alpha69-5c`, ns `mariadb-alpha89-mode-n1b-040717`)
got past the commit 14 PD OpenAPI blocker (PD now Available) but
hit a NEW first-blocker: Pod create fails with
    spec.containers[0].image: Required value
    spec.containers[1].image: Required value
    spec.initContainers[0].image: Required value
Controller log shows `ImageUtil parse image failed, image=""`. The
live InstanceSet has empty image fields for `mariadb`, `exporter`,
and `init-syncer`, while `kbagent` / `kbagent-worker` have their
tools image.
Root cause
----------
`addons/mariadb/templates/cmpv.yaml` declares ComponentVersion
compatibilityRules that bind release images (10.6.15, 11.4.5,
11.4.8, 11.4.10, 11.8.4, 12.0.2) to CmpDs matching one of these
regexes:
- ^mariadb-[0-9] (standalone)
- ^mariadb-replication-[0-9] (replication, digit-anchored)
- ^mariadb-semisync- (semisync)
- ^mariadb-galera- (galera, in a separate rule)
The merged CmpD added in commit 1 is named
`mariadb-replication-merged-1.1.1-alpha.89`. None of the four
regexes above match this name:
- `^mariadb-replication-[0-9]` requires a digit immediately after
`mariadb-replication-`, but the merged CmpD has `merged-` next.
- The other three regexes don't match either.
KB's ComponentVersion controller therefore did not bind any
release images to the merged CmpD, and the InstanceSet rendered
with empty image fields.
Fix
---
Add `mariadb.replication.merged.cmpdRegexpPattern` (which expands
to `^mariadb-replication-merged-`) to the same compDefs list that
already covers standalone / replication / semisync. The merged
regex does not over-match the names covered by the existing three
(it requires the literal `merged-` segment right after
`mariadb-replication-`), and the merged CmpD shares the same
6-release image set as the replication/semisync chain.
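The mismatch and the fix can be reproduced locally with a small sketch. It uses `grep -E` as a stand-in for the controller's regex engine, which behaves the same for these simple anchored patterns. The helper `matches_any` and the legacy name `mariadb-replication-1.1.1-alpha.89` are illustrative assumptions.

```shell
#!/bin/sh
# Return 0 if the CmpD name (first argument) matches any following pattern.
matches_any() {
  name=$1; shift
  for pattern in "$@"; do
    printf '%s\n' "$name" | grep -Eq -- "$pattern" && return 0
  done
  return 1
}

merged='mariadb-replication-merged-1.1.1-alpha.89'
legacy='mariadb-replication-1.1.1-alpha.89'   # illustrative legacy name

# The four pre-existing patterns all miss the merged name...
if matches_any "$merged" '^mariadb-[0-9]' '^mariadb-replication-[0-9]' \
     '^mariadb-semisync-' '^mariadb-galera-'; then
  echo "merged: bound by existing rules"
else
  echo "merged: no image binding"
fi

# ...while the new pattern matches it and leaves the legacy name alone.
matches_any "$merged" '^mariadb-replication-merged-' && echo "merged: bound by new rule"
matches_any "$legacy" '^mariadb-replication-merged-' || echo "legacy: untouched by new rule"
```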
Changes
-------
- `templates/cmpv.yaml`: add the merged regex to the first
compatibilityRule's compDefs list, with a multi-line comment
pinning the live N=1 evidence so a future edit cannot silently
drop it.
- `scripts-ut-spec/cmpv_merged_compatibility_rule_spec.sh` (new,
9 examples):
- Template-level: cmpv.yaml references the merged regex helper
exactly once; the original three regexes are still present
(regression guard).
- Helper-level: _helpers.tpl defines the merged regex helper
and its value is the expected `^mariadb-replication-merged-`.
- Rendered-manifest level: render once into tmp file via
`render_to_tmp` helper, awk-extract the first
compatibilityRule's compDefs list, assert all four expected
regexes are present (one assertion each so a future
regression points at the specific missing regex). Plus a
bounded grep that the rendered manifest contains the
`docker.io/mariadb:11.4` image line for the 11.4.10 release
block (the merged CmpD's default serviceVersion).
Same render-to-tmp + bounded-matcher pattern as commit 13 v3 v2
env-wire spec — keeps the spec fast (1.03s focused) and stable on
macOS+Homebrew.
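The awk extraction step could be sketched roughly as follows. The function name `first_rule_compdefs` and the sample manifest are assumptions made for illustration. The sample only mirrors the `compatibilityRules` / `compDefs` / `releases` shape and is not the rendered chart output.

```shell
#!/bin/sh
# Sketch: print the entries of the first compDefs list in a rendered manifest.
# YAML comment lines inside the list are skipped; the first non-item line
# after the list (for example `releases:`) ends the extraction.
first_rule_compdefs() {
  awk '
    !found && /^[ \t]*(-[ \t]*)?compDefs:/ { found = 1; next }
    found && /^[ \t]*#/                    { next }
    found && /^[ \t]*-[ \t]/               { sub(/^[ \t]*-[ \t]*/, ""); print; next }
    found                                  { exit }
  ' "$1"
}

# Illustrative manifest fragment (not the real rendered chart).
sample=$(mktemp)
cat > "$sample" <<'EOF'
spec:
  compatibilityRules:
    - compDefs:
        - ^mariadb-[0-9]
        - ^mariadb-replication-[0-9]
        - ^mariadb-semisync-
        # merged CmpD, pinned by live N=1 evidence
        - ^mariadb-replication-merged-
      releases:
        - 11.4.10
    - compDefs:
        - ^mariadb-galera-
      releases:
        - 11.4.10
EOF
first_rule_compdefs "$sample"
rm -f "$sample"
```

Rendering once to a temporary file and running bounded matchers over the extracted list keeps each assertion pointed at a single regex, which is the property the spec relies on.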
Static verification
-------------------
- helm lint PASS
- helm template renders the merged regex as
`^mariadb-replication-merged-` in the first compatibilityRule's
compDefs list, alongside the existing three
- Focused cmpv-compat spec: 9/0 in 1.03s
- Full mariadb scripts-ut-spec: 528 examples / 0 failures / 7
pendings (commit 14 base 519 + 9 new locks)
What this commit does NOT do
----------------------------
- Bump chart version. The compatibilityRules change is additive
to an existing CmpV resource; no new CmpD spec mutation. Same
alpha.89 chart can keep iterating.
- Re-run live N=1 in `mariadb-test5`. The test cluster has the
alpha.89 chart with the missing-image-binding already installed;
Jack will need to uninstall + reinstall (or apply a CmpV update)
and re-run the install-time semisync first-boot SQL ON/ON
check.
No engine version change.
…nc_master_wait_for_slave_count (MariaDB 11.4 unsupported, live N=1 third first-blocker fix)
Live N=1 third round on commit 15 chart in vcluster `mariadb-test5`
ns `mariadb-alpha89-mode-n1c2-041720`. Cleared the previous two
invalid-run blockers (PD Available, CmpV image binding bound) and
landed in the actual MariaDB runtime layer for the first time. The
seeder ran and wrote the 4 expected override files. mariadbd then
CrashLooped because one of those four variables is not recognized
by MariaDB.
Empirical evidence (Jack's same-image parse probe):
- With all 4 overrides loaded via --defaults-extra-file:
`mariadbd --verbose --help` exits rc=7 with stderr containing
`unknown variable 'rpl_semi_sync_master_wait_for_slave_count=1'`
- Removing only that one file: same probe exits rc=0.
Root cause: `rpl_semi_sync_master_wait_for_slave_count` is a MySQL
extension (added in MySQL 5.7.3). MariaDB 11.4 does NOT recognize
it. MariaDB semisync waits for exactly one secondary
acknowledgement and has no configurable count variable. The
original 4-variable picture in this addon came from a MySQL-flavored
reference and was never live-validated before commit 15 — local
ShellSpec only exercises shell parse / file-write behavior, not
mariadbd startup parse.
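A parse probe of the kind described above might be wrapped as in this sketch. The wrapper name `parse_probe` is hypothetical. It assumes the overrides are reachable through a single file passed to `--defaults-extra-file`, and that the probe runs inside the same image as the server.

```shell
#!/bin/sh
# Usage: parse_probe <mariadbd-command> <extra-defaults-file>
# Returns mariadbd's own exit status. stderr is left attached so the caller
# can read any "unknown variable" message.
parse_probe() {
  # --defaults-extra-file has to be the first option on the command line.
  "$1" --defaults-extra-file="$2" --verbose --help >/dev/null
}
```

Something like `parse_probe mariadbd /path/to/overrides.cnf || echo "engine rejected the overrides"` would catch this class of failure before a Pod ever starts, which local ShellSpec runs on the shell scripts alone cannot do.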
Changes
-------
- `config/mariadb-config-constraint.cue`: remove the field
declaration `rpl_semi_sync_master_wait_for_slave_count?: int &
>=1 & <=65535 | *1`. Comment rewritten to record the live N=1
third first-blocker evidence and the MariaDB-vs-MySQL semisync
variable delta.
- `config/mariadb-config-effect-scope.yaml`: remove
`rpl_semi_sync_master_wait_for_slave_count` from
dynamicParameters so the reconfigureAction.persisted helper
does not attempt to `SET GLOBAL` the unknown variable.
- `scripts/seed-replication-mode-overrides.sh`: drop the variable
from the 5-phase write loop (was 4 vars, now 3). Phase B
pre-validate, Phase C tmp-write, Phase E rename, and the
cleanup_all_tmps helper all updated. The header comment in the
derive helper still records the rationale.
- `scripts/replication-mode-mapper.sh`: drop the variable from the
reconfigure-time derive helper. Script preamble extended with
the live N=1 finding and the MariaDB-vs-MySQL delta. async
branch comment trimmed because the 4-var deterministic
rationale was specifically about both flips of the variable.
- ShellSpec updates (528 examples / 0 failures / 7 pendings —
same count, three assertion polarities inverted, one fixture
swapped):
- `replication_merged_pd_schema_enum_spec.sh`:
- "constrains rpl_semi_sync_master_wait_for_slave_count to a
positive int range" → "does NOT declare ... (commit 16
MariaDB-unsupported drop)". Match only code lines (skip `//`
comments).
- "lists rpl_semi_sync_master_wait_for_slave_count in
dynamicParameters" → "does NOT list ... (commit 16)".
- `replication_mode_mapper_spec.sh`: two "all 4 derived vars
for mode=...semisync/async" assertions now check the 3
MariaDB-supported vars and add a negative for
wait_for_slave_count. Same for the both-consistent example.
- `seed_replication_mode_overrides_spec.sh`: "writes
rpl_semi_sync_master_wait_for_slave_count=1" → "does NOT
write ... (commit 16 MariaDB-unsupported drop)". The B4
dir-target validation tests pivot from
`rpl_semi_sync_master_wait_for_slave_count.cnf` (no longer
written) to `rpl_semi_sync_master_timeout.cnf`.
Static verification
-------------------
- helm lint PASS
- bash + dash -n on both scripts: PASS
- Smoke reproduction in real shell: seeder with mode=semisync now
writes exactly 3 files (master_enabled, slave_enabled,
master_timeout). master_enabled.cnf contains
`[mysqld]\nrpl_semi_sync_master_enabled = ON`.
- Focused 3 specs (pd_schema + mapper + seeder): 67/0 in 3.92s
- Full mariadb scripts-ut-spec: 528 / 0 / 7 (same total as commit
15)
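The 3-variable write path can be illustrated with a simplified sketch of a tmp-write-then-rename seeder. This is not the actual `seed-replication-mode-overrides.sh`. The function name, the async values (`OFF`), and the timeout value (`10000`) are assumptions chosen for illustration.

```shell
#!/bin/sh
# Hypothetical sketch of a seeder: validate, write *.tmp, then rename.
# Usage: seed_overrides <override-dir> <async|semisync>
seed_overrides() {
  dir=$1
  mode=$2
  case "$mode" in
    semisync) enabled=ON ;;
    async)    enabled=OFF ;;  # assumed async values, not confirmed by the source
    *) echo "unsupported replication mode: $mode" >&2; return 1 ;;
  esac
  [ -d "$dir" ] || { echo "override dir missing: $dir" >&2; return 1; }

  # The three MariaDB-supported variables; the timeout value is illustrative.
  set -- \
    "rpl_semi_sync_master_enabled=$enabled" \
    "rpl_semi_sync_slave_enabled=$enabled" \
    "rpl_semi_sync_master_timeout=10000"

  # Pass 1: write every override under a temporary name.
  for pair in "$@"; do
    name=${pair%%=*}
    printf '[mysqld]\n%s = %s\n' "$name" "${pair#*=}" > "$dir/$name.cnf.tmp" || {
      rm -f "$dir"/*.cnf.tmp
      return 1
    }
  done
  # Pass 2: rename into place only after all writes succeeded.
  for pair in "$@"; do
    name=${pair%%=*}
    mv -f "$dir/$name.cnf.tmp" "$dir/$name.cnf" || return 1
  done
}
```

Writing all temporary files before renaming any of them means a failed write leaves no partially seeded set of overrides for mariadbd to load.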
What this commit does NOT do
----------------------------
- Bump chart version. The 3-variable variant is still alpha.89 v1
iteration; no new CmpD spec shape that would require chart
version bump.
- Run live N=1 in `mariadb-test5`. Jack will need to update to
this commit and re-run the install-time semisync first-boot
SQL ON/ON check.
- Address the wait-count semantics gap. MariaDB semisync waits
for exactly one acknowledgement; no equivalent variable.
No engine version change.
…tability live N=1 fourth first-blocker fix)
Live N=1 fourth round on commit 16 chart in vcluster `mariadb-test5`
got past the MariaDB-unsupported-variable blocker (engine doesn't
CrashLoop anymore) but stalled on a setup blocker: the live
`mariadb-replication-merged-1.1.1-alpha.89` ComponentDefinition
stayed Unavailable with condition `immutable fields can't be
updated`.
Root cause: KubeBlocks treats the rendered ComponentDefinition spec
as immutable. alpha.89 commits 13/14/15/16 each mutated the merged
CmpD spec — env block (replicationMode env wire), container command
body (startup seeder source), CmpV regex (in cmpv.yaml, also tracked
under same chart version), dynamicParameters list, configmap-scripts
content. KB sees each upgrade attempt as a same-version update and
refuses because the spec has changed. The CmpD stays Unavailable;
the Cluster reconcile reuses the broken CmpD; further N=1 samples
are invalid until a fresh CmpD identity exists.
This is the same KB immutability rule that drove the historical
alpha.65 / alpha.66 / alpha.70 / earlier bumps. Once any cmpd-*.yaml mutation
happens within an alpha cycle, the chart version MUST bump so KB
creates a NEW CmpD (`mariadb-replication-merged-1.1.1-alpha.90`)
instead of attempting an immutable-field update on the old one.
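The rule lends itself to a simple pre-merge guard. The sketch below assumes a list of changed paths is available on stdin. The function names are hypothetical, and the check only covers `cmpd-*.yaml` files, not every input that feeds the rendered CmpD spec.

```shell
#!/bin/sh
# Reads changed file paths on stdin; succeeds if any of them is a
# cmpd-*.yaml file, meaning the chart version has to move.
requires_chart_bump() {
  grep -Eq '(^|/)cmpd-[^/]*\.yaml$'
}

# The merged CmpD identity is derived from the chart version.
merged_cmpd_name() {
  printf 'mariadb-replication-merged-%s\n' "$1"
}
```

It could be fed from `git diff --name-only <base> -- addons/mariadb`, for instance, with the base ref chosen by the caller; `merged_cmpd_name 1.1.1-alpha.90` prints the new identity named above.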
Changes
-------
- `Chart.yaml`:
- `version: 1.1.1-alpha.89` → `version: 1.1.1-alpha.90`
- Prepended a new comment block at the top of the changelog stack
documenting the live N=1 fourth first-blocker, the alpha.89
commit chain that mutated the merged CmpD, and the immutability
rule citation back to alpha.65.
- ShellSpec literal bumps (no behavior change, just version string
updates so existing assertions still match the rendered chart):
- `replication_switchover_spec.sh`: 4 occurrences of
`1.1.1-alpha.89` → `1.1.1-alpha.90` in `grep -c '^version: ...'`
assertions and explicit "version: ..." output expectations.
- `replication_user_convergence_spec.sh`: 2 occurrences in Gate 1
chart-version check. Test description updated to reference the
immutability-bump rationale instead of the topology-merge
rationale.
- `replication_merged_pd_regex_disambiguation_spec.sh`: 6
occurrences in the canonical-name fixture set (STANDALONE_NAME,
GALERA_NAME, OLD_REPL_NAME, OLD_SEMISYNC_NAME, MERGED_NAME, plus
one comment line).
Static verification
-------------------
- helm lint PASS
- Full mariadb scripts-ut-spec: 528 / 0 / 7 (same total; all 12
version-literal references now match the new chart version).
What this commit does NOT do
----------------------------
- Mutate any cmpd-*.yaml, _helpers.tpl, scripts/*, or config/* file.
Only Chart.yaml version + version literals in ShellSpec. The
merged CmpD spec content is unchanged from commit 16; the new
CmpD identity `mariadb-replication-merged-1.1.1-alpha.90` has the
same shape, just a new name.
- Run live N=1 in `mariadb-test5`. Jack will need to update to this
commit and the test cluster needs the new CmpD created (KB's
CmpV controller will bind release images to it via the merged
regex from commit 15; old `mariadb-replication-merged-1.1.1-alpha.89`
CmpD stays Unavailable but doesn't block new clusters).
No engine version change.
…a scripts CM name (#2663)
Summary
Rolls up the MariaDB addon evolution from the alpha.37 semisync fencing baseline through alpha.86–90 topology merge and addon-api/12b source-traceability conformance. Final head `dd8f2c43` is the live-validated alpha.90 chart with the source-traceability fixes from PR #2657 and PR #2663 already merged in.
Key changes
- `replication` topology with merged ComponentDefinition (`cmpd-replication-merged.yaml`) covers both async and semi-sync; legacy `cmpd-semisync.yaml` and `cmpd-replication.yaml` retained for in-place upgrade compatibility.
- `replicationMode={async,semisync}` is the user-facing switch; CUE schema validates, addon-side mapper translates into the four real `rpl_semi_sync_*` engine variables at reconfigure-time, install-time seeder writes the same overrides at first boot. Three-layer fail-closed: Helm template, install-time, reconfigure-time.
- `rpl_semi_sync_master_wait_for_slave_count` (unsupported in MariaDB 11.4); chart bumped alpha.89 → alpha.90 to escape CmpD immutability lock.
- `target` block added (role: secondary, fallbackRole: primary, account: root).
- `target` block comment expanded with per-topology resolution table, MySQL cross-engine comparison, PITR forward note; new `mariadb.galera.scriptConfigMapName` helper replaces three hard-coded literal references.
Validation
- `mdb-alpha90-conformance-test` (2-replica replication topology, role labels primary/secondary, runtime BPT verified): evidence sha256 `518595924deb6ed53a1881353d814c137030c4331999e428ddecbe538d24c747`
Follow-up
Full kubeblocks-tests suite (83 sections × 4 topologies) is in progress against the alpha.90 chart at this branch head to validate release-standard readiness.