schemadiff, tablegc: reject foreign-key shadow-table conflicts and reclaim FK-linked GC tables safely#20352
Conversation
Detect and reject OnlineDDL batches where a foreign key present in the source schema would survive on a held table (`_vt_hld_...`) while the referenced parent is concurrently altered into an incompatible signature. Such batches keep the final schema valid yet corrupt the held table's stale FK; no ordering can resolve this, so we introduce a new DiffDependencyImpossibleExecution dependency type and reject in OrderedDiffs. Detection is gated behind the new ForeignKeyShadowConflictStrategy hint (default = current behavior), since the corruption is specific to OnlineDDL's held-table mechanism and is safe under direct strategy. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
… detection Two review fixes for foreign-key shadow-table conflict detection: - Self-referencing foreign keys were wrongly rejected. A single-table ALTER that changes a self-referenced column type is safe: the held original is internally consistent (its FK references its own columns at the same version) and never references the new table. Skip the check when the referenced table is the child itself. - The error message rendered an empty referenced-column identifier (`p`.``) for the dropped-parent-table and dropped-covering-index cases. Default the referenced column to the FK's first referenced column so the message always names a column; column-specific branches still override it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #20352 +/- ##
===========================================
+ Coverage 69.67% 85.08% +15.41%
===========================================
Files 1614 18 -1596
Lines 216793 4565 -212228
===========================================
- Hits 151044 3884 -147160
+ Misses 65749 681 -65068
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR extends schemadiff to detect and (optionally) reject OnlineDDL “shadow/held table” foreign-key conflicts that make a multi-DDL batch impossible to execute safely under any ordering. It introduces a new dependency severity for “impossible execution”, records these dependencies during SchemaDiff(), and makes OrderedDiffs() fail fast with a clear, actionable error when the opt-in hint is enabled.
Changes:
- Add a new
DiffHints.ForeignKeyShadowConflictStrategy(default allow) to gate this OnlineDDL-specific validation. - Detect “held table FK would become invalid” situations and record them as
DiffDependencyImpossibleExecution, returning a newForeignKeyShadowConflictErrorfromOrderedDiffs(). - Add targeted unit tests covering multiple conflict scenarios and the opt-in/opt-out behavior.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| go/vt/schemadiff/types.go | Adds ForeignKeyShadowConflictStrategy* constants and a new DiffHints field to gate the behavior. |
| go/vt/schemadiff/schema.go | Implements shadow/held-table FK conflict detection and records impossible-execution dependencies during SchemaDiff(). |
| go/vt/schemadiff/schema_diff.go | Introduces DiffDependencyImpossibleExecution and fails fast in OrderedDiffs() with a precise FK shadow conflict error. |
| go/vt/schemadiff/schema_diff_test.go | Adds coverage for shadow conflict detection across multiple scenarios, including self-FK exclusion and strategy gating. |
| go/vt/schemadiff/errors.go | Adds ForeignKeyShadowConflictError with an operator-actionable message. |
The table GC reclaims held tables (`_vt_HOLD_…` → PURGE → DROP) one at a time, ordered by age, with no awareness of foreign keys and with foreign_key_checks left ON. When two FK-related tables are garbage collected together (e.g. a child and its parent dropped in one batch), this can wedge: purging the parent fails under ON DELETE RESTRICT while the child still has referencing rows, and dropping the referenced parent fails outright. Circular FKs can never be reclaimed at all. Disable foreign_key_checks on the dedicated purge and drop connections so reclamation succeeds regardless of order or FK relationship (the data in these doomed tables no longer matters). The connections are closed after use, but we restore the value defensively in case they ever become pooled. With the GC able to reclaim FK-linked held tables cleanly, dropping a child and its parent in a single batch is now safe, so the schemadiff shadow-conflict check no longer rejects it. The parent-is-dropped branch is removed; the genuinely-unsafe cases (a parent that survives in a broken FK relationship after an ALTER) are unchanged, including when the child itself is dropped while the parent is altered incompatibly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
schemadiff: reject foreign-key shadow-table conflicts…omment Address review follow-ups: - Add TestDropTableDisablesForeignKeyChecks, a fakesqldb-backed unit test that asserts dropTable issues `SET SESSION foreign_key_checks = 0` before the DROP and restores it afterwards. This covers the previously-untested GC foreign-key behavior (the purge path uses the same mechanism but is harder to isolate due to its throttler dependency). - Clarify the schemadiff DropTableEntityDiff comment: a dropped child conflicts only when a referenced parent survives this batch but is altered incompatibly. A dropped parent is intentionally not a conflict — the table GC reclaims the held pair with foreign key checks disabled. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
mattlord
left a comment
There was a problem hiding this comment.
LGTM! Just a couple of tiny nits.
A couple of other minor things...
In go/vt/vttablet/tabletserver/gc/tablegc.go:584, go/vt/vttablet/tabletserver/gc/tablegc_test.go:35 we could add real/purge coverage for FK-linked tablegc reclamation. The new fake DB test verifies that dropTable sends SET SESSION foreign_key_checks = 0 before DROP, but it still does not cover the purge path or exercise InnoDB FK behavior. One of the hazards fixed here is purging parent rows under ON DELETE RESTRICT while a doomed child still references them; a regression in purge would still wedge before DROP and would not fail this test. Or am I missing something? If not, then maybe we could add an end-to-end tablegc test (ideally adding to an existing e2e test) with FK-linked GC tables, preferably parent-first or circular, and assert GC can complete both PURGE and DROP. I don't consider this a blocker though as I think this is pretty straightforward and well understood behavior.
In go/vt/schemadiff/schema.go:995 I think that we can remove stale dropped-parent wording. The shadow-conflict helper comment still says a dropped parent table corrupts the held table and records an impossible dependency, but the final code intentionally removed the DropTableEntityDiff conflict and the tests now allow child+parent drop in one batch. Right?
Thanks, @shlomi-noach ! ❤️
|
|
||
| func (e *ForeignKeyShadowConflictError) Error() string { | ||
| return fmt.Sprintf( | ||
| "foreign key constraint %s in table %s references %s.%s, which is altered incompatibly in the same batch; the original table would survive with this foreign key while %s is altered, corrupting it. Split into separate migrations", |
There was a problem hiding this comment.
I would say "... These changes must be split into separate migrations."
| defer func() { | ||
| if !conn.IsClosed() { | ||
| if _, err := conn.Conn.ExecuteFetch("SET SESSION foreign_key_checks = 1", 0, false); err != nil { | ||
| log.Error(fmt.Sprintf("TableGC: error setting foreign_key_checks = 1: %+v", err)) |
There was a problem hiding this comment.
| log.Error(fmt.Sprintf("TableGC: error setting foreign_key_checks = 1: %+v", err)) | |
| log.Error("TableGC: error setting foreign_key_checks = 1", slog.Any("error", err)) |
…ent wording Review follow-ups from PR vitessio#20352: - Add TestPurgeDisablesForeignKeyChecks: a fakesqldb-backed test that the purge path also disables foreign_key_checks before deleting rows, so a regression there (the ON DELETE RESTRICT wedge) is caught — not just the DROP path. Shared setup extracted into newFakeDBTableGC. - Use structured logging (slog.Any) for the foreign_key_checks restore error in purge and dropTable. - Reword ForeignKeyShadowConflictError: "These changes must be split into separate migrations." - Drop the stale "dropped parent table" wording from the shadow-conflict helper comment; a dropped parent is reclaimed by the GC and is not a conflict. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Wrote that in a unit test. An
Done. All review comments addressed. |
| // Disable foreign key checks: the rows we purge belong to a doomed table, and another doomed | ||
| // table (e.g. a child table also being garbage collected) may still reference them. We must not | ||
| // let an ON DELETE RESTRICT constraint block reclamation. The connection is dedicated and closed | ||
| // when we're done, but we restore the value defensively, in case this ever becomes pooled. | ||
| if _, err := conn.ExecuteFetch("SET SESSION foreign_key_checks = 0", 0, false); err != nil { | ||
| return tableName, err | ||
| } | ||
| defer func() { | ||
| if !conn.IsClosed() { | ||
| if _, err := conn.ExecuteFetch("SET SESSION foreign_key_checks = 1", 0, false); err != nil { | ||
| log.Error("TableGC: error setting foreign_key_checks = 1", slog.Any("error", err)) | ||
| } | ||
| } | ||
| }() |
There was a problem hiding this comment.
Reject — false premise. Copilot's worry ("leak unexpected session state to later users of the connection") requires the connection to have later users. It doesn't. Both purge and dropTable open a dedicated dbconnpool.NewDBConnection and defer conn.Close(); DBConnection embeds *mysql.Conn, whose Close() tears down the socket. The session — foreign_key_checks and all — dies with it.
The restore we added is already pure belt-and-suspenders (we discussed this; it mirrors the existing sql_log_bin restore on the same closed connection, which is equally redundant). On a closed connection, hardcoded-1 vs capture-and-restore is unobservable — there's no next user to observe it.
One grain of truth: the OnlineDDL precedent (executor.go:496-503) does capture @@foreign_key_checks and restore the prior value rather than assuming 1. If you wanted byte-for-byte symmetry with that pattern we could, but it's gold-plating for a connection that's about to be discarded.
| // Disable foreign key checks so we can drop a table that is still referenced by another | ||
| // (also-doomed) table's foreign key, regardless of the order in which the GC drops them. The | ||
| // connection is dedicated and closed when we're done, but we restore the value defensively, in | ||
| // case this ever becomes pooled. | ||
| if _, err := conn.Conn.ExecuteFetch("SET SESSION foreign_key_checks = 0", 0, false); err != nil { | ||
| return err | ||
| } | ||
| defer func() { | ||
| if !conn.IsClosed() { | ||
| if _, err := conn.Conn.ExecuteFetch("SET SESSION foreign_key_checks = 1", 0, false); err != nil { | ||
| log.Error("TableGC: error setting foreign_key_checks = 1", slog.Any("error", err)) | ||
| } | ||
| } | ||
| }() |
| // IsSequential returns true if this is a sequential dependency | ||
| func (d *DiffDependency) IsSequential() bool { | ||
| return d.typ >= DiffDependencySequentialExecution | ||
| } | ||
|
|
||
| // IsImpossible returns true if this is an impossible-execution dependency, ie the two diffs | ||
| // cannot be safely applied together in the same batch under any ordering. | ||
| func (d *DiffDependency) IsImpossible() bool { | ||
| return d.typ >= DiffDependencyImpossibleExecution | ||
| } |
There was a problem hiding this comment.
Factually correct, but not a bug — and "fixing" it would make things worse. The >= ladder is deliberate (// the higher the value, the stricter), and IsInOrder()/IsSequential() all use >= by design: a stricter dependency satisfies every weaker predicate. An impossible dep is "at least sequential."
Why it's not a hazard:
- The runnable-plan entry point is guarded.
OrderedDiffs()checksHasImpossibleExecutionDependencies()first and returns the error before any sequential/ordering logic runs. Anyone getting an execution order goes through it. - No in-tree consumer branches on the sequential predicates (they're API surface for the future PlanetScale consumer).
- Banding would be strictly worse. If I made
HasSequentialExecutionDependencies()exclude impossible (>= Sequential && < Impossible), then a consumer who only checks "no sequential deps → run in parallel" would run an impossible batch in parallel — worse than running it sequentially. Either way, a consumer that ignoresHasImpossibleExecutionDependencies()is unsafe; the band doesn't rescue them, and it breaks consistency withIsInOrder().
So the real safety comes from OrderedDiffs() rejecting + the dedicated HasImpossibleExecutionDependencies() — both of which exist. The >= semantics are correct as-is.
| // AllSequentialExecutionDependencies returns all diffs that are of "sequential execution" type. | ||
| func (d *SchemaDiff) AllSequentialExecutionDependencies() (deps []*DiffDependency) { | ||
| for _, dep := range d.dependencies { | ||
| if dep.typ >= DiffDependencySequentialExecution { | ||
| deps = append(deps, dep) | ||
| } | ||
| } | ||
| return deps | ||
| } | ||
|
|
||
| // HasSequentialExecutionDependencies return `true` if there is at least one "subsequential execution" type diff. | ||
| // If not, that means all diffs can be applied in parallel. | ||
| func (d *SchemaDiff) HasSequentialExecutionDependencies() bool { | ||
| for _, dep := range d.dependencies { | ||
| if dep.typ >= DiffDependencySequentialExecution { | ||
| return true | ||
| } | ||
| } | ||
| return false |
There was a problem hiding this comment.
same as above.
What's this?
Two related foreign-key safety fixes for OnlineDDL with
--unsafe-allow-foreign-keys.OnlineDDL doesn't alter/drop a table in place — it builds a shadow table (or renames the original to a
_vt_HOLD_…table for the GC), and the original sticks around as a held table until table GC reclaims it. With the fork'srename_table_preserve_foreign_key, that held table keeps its original foreign key. That opens two distinct hazards.1. schemadiff: reject shadow-table FK conflicts (data-dictionary corruption)
Consider, in a single
ApplySchema:childdrops its FK toparent, andparent.id(the referenced column) changesint→varchar(32).The final schema is valid (no FK), so schemadiff allows both changes today. But while
childis migrating, its held original still carries the old FK referencingparent.id, and on the fork theparent.idtype change is allowed to proceed. The held table's FK is now type-incompatible, the data dictionary breaks, and the instance has to be decommissioned. Reordering within the batch doesn't help — the held table lingers — so this is genuinely impossible to do safely in one batch.schemadiff now detects this family and rejects it with a clear, actionable
ForeignKeyShadowConflictError. The conflict is any source-schema FK that would survive on a held table while the surviving parent is altered into an incompatible state: referenced column type/charset/collation/unsigned/zerofill change, a dropped/renamed referenced column, or a dropped covering index. This fires whether the child is beingALTERed orDROPped, as long as the parent is altered (and survives) incompatibly.Self-referencing FKs are excluded (the held copy references its own columns at the same version — internally consistent).
2. tablegc: disable
foreign_key_checkswhen reclaiming GC tables (reclamation hazard)A separate, weaker hazard: when two FK-related tables are garbage collected together (e.g. a child and its parent both dropped in one batch), the GC could wedge. It reclaims held tables one at a time, oldest-first, FK-unaware, with
foreign_key_checksleft ON — so purging the parent fails underON DELETE RESTRICTwhile the child still has referencing rows, and dropping the still-referenced parent fails outright. Circular FKs could never be reclaimed.The fix disables
foreign_key_checkson the GC's dedicated purge and drop connections — the data in these doomed tables no longer matters, so reclamation should succeed regardless of order or FK relationship. (Connections are closed after use; the value is restored defensively in case they ever become pooled.)Because the GC can now reclaim an FK-linked held pair cleanly, dropping a child and its parent in a single batch is safe, so schemadiff no longer rejects that case — only the genuinely-unsafe ALTER cases above remain rejected.
How it works
DiffDependencyImpossibleExecution, withHas/AllImpossibleExecutionDependencies()mirroring the existing sequential-dependency API.SchemaDiff()records the conflict;OrderedDiffs()rejects withForeignKeyShadowConflictError.DiffHints.ForeignKeyShadowConflictStrategy(default = today's behavior). The corruption is specific to OnlineDDL's held-table mechanism and is safe underdirectstrategy, so only that consumer opts in. Zero-value preserves existing behavior.tablegc:SET SESSION foreign_key_checks = 0inpurgeanddropTable, restored defensively viadefer.Scope / notes
go/vt/schemadiffandgo/vt/vttablet/tabletserver/gc) because the GC fix is what makes the drop+drop case safe and therefore lets schemadiff stop rejecting it.ForeignKeyShadowConflictStrategyRejectinto the actual ApplySchema / OnlineDDL path (when strategy is online +--unsafe-allow-foreign-keys) is a separate consumer change — this PR delivers the detection + rejection capability and surfaces it via the hint and the new dependency type.TestSchemaDiffForeignKeyShadowConflict(positive cases for each conflict shape; negatives for the gating hint, self-reference, char↔varchar lockstep, and drop+drop now being allowed). Thetablegcforeign_key_checksbehavior is best validated end-to-end against a real server; the existing GC unit tests pass, but an e2e test exercising FK-linked GC reclamation would be the ideal follow-up.Related to #20353
Most of this was written by Claude Code — I just provided direction and review.