Skip to content

schemadiff, tablegc: reject foreign-key shadow-table conflicts and reclaim FK-linked GC tables safely#20352

Open
shlomi-noach wants to merge 5 commits into
vitessio:mainfrom
planetscale:schemadiff-reject-fk-shadow-conflicts
Open

schemadiff, tablegc: reject foreign-key shadow-table conflicts and reclaim FK-linked GC tables safely#20352
shlomi-noach wants to merge 5 commits into
vitessio:mainfrom
planetscale:schemadiff-reject-fk-shadow-conflicts

Conversation

@shlomi-noach

@shlomi-noach shlomi-noach commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

What's this?

Two related foreign-key safety fixes for OnlineDDL with --unsafe-allow-foreign-keys.

OnlineDDL doesn't alter/drop a table in place — it builds a shadow table (or renames the original to a _vt_HOLD_… table for the GC), and the original sticks around as a held table until table GC reclaims it. With the fork's rename_table_preserve_foreign_key, that held table keeps its original foreign key. That opens two distinct hazards.

1. schemadiff: reject shadow-table FK conflicts (data-dictionary corruption)

Consider, in a single ApplySchema:

  • child drops its FK to parent, and
  • parent.id (the referenced column) changes intvarchar(32).

The final schema is valid (no FK), so schemadiff allows both changes today. But while child is migrating, its held original still carries the old FK referencing parent.id, and on the fork the parent.id type change is allowed to proceed. The held table's FK is now type-incompatible, the data dictionary breaks, and the instance has to be decommissioned. Reordering within the batch doesn't help — the held table lingers — so this is genuinely impossible to do safely in one batch.

schemadiff now detects this family and rejects it with a clear, actionable ForeignKeyShadowConflictError. The conflict is any source-schema FK that would survive on a held table while the surviving parent is altered into an incompatible state: referenced column type/charset/collation/unsigned/zerofill change, a dropped/renamed referenced column, or a dropped covering index. This fires whether the child is being ALTERed or DROPped, as long as the parent is altered (and survives) incompatibly.

Self-referencing FKs are excluded (the held copy references its own columns at the same version — internally consistent).

2. tablegc: disable foreign_key_checks when reclaiming GC tables (reclamation hazard)

A separate, weaker hazard: when two FK-related tables are garbage collected together (e.g. a child and its parent both dropped in one batch), the GC could wedge. It reclaims held tables one at a time, oldest-first, FK-unaware, with foreign_key_checks left ON — so purging the parent fails under ON DELETE RESTRICT while the child still has referencing rows, and dropping the still-referenced parent fails outright. Circular FKs could never be reclaimed.

The fix disables foreign_key_checks on the GC's dedicated purge and drop connections — the data in these doomed tables no longer matters, so reclamation should succeed regardless of order or FK relationship. (Connections are closed after use; the value is restored defensively in case they ever become pooled.)

Because the GC can now reclaim an FK-linked held pair cleanly, dropping a child and its parent in a single batch is safe, so schemadiff no longer rejects that case — only the genuinely-unsafe ALTER cases above remain rejected.

How it works

  • New strictest dependency type DiffDependencyImpossibleExecution, with Has/AllImpossibleExecutionDependencies() mirroring the existing sequential-dependency API.
  • Detection in SchemaDiff() records the conflict; OrderedDiffs() rejects with ForeignKeyShadowConflictError.
  • Gated behind a new DiffHints.ForeignKeyShadowConflictStrategy (default = today's behavior). The corruption is specific to OnlineDDL's held-table mechanism and is safe under direct strategy, so only that consumer opts in. Zero-value preserves existing behavior.
  • tablegc: SET SESSION foreign_key_checks = 0 in purge and dropTable, restored defensively via defer.

Scope / notes

  • This spans two subsystems (go/vt/schemadiff and go/vt/vttablet/tabletserver/gc) because the GC fix is what makes the drop+drop case safe and therefore lets schemadiff stop rejecting it.
  • Wiring ForeignKeyShadowConflictStrategyReject into the actual ApplySchema / OnlineDDL path (when strategy is online + --unsafe-allow-foreign-keys) is a separate consumer change — this PR delivers the detection + rejection capability and surfaces it via the hint and the new dependency type.
  • User-visible: a new rejection under the opt-in hint, plus the GC now reclaims FK-linked tables. Worth a release-note callout when the consumer wiring lands.
  • Testing: schemadiff changes are covered by TestSchemaDiffForeignKeyShadowConflict (positive cases for each conflict shape; negatives for the gating hint, self-reference, char↔varchar lockstep, and drop+drop now being allowed). The tablegc foreign_key_checks behavior is best validated end-to-end against a real server; the existing GC unit tests pass, but an e2e test exercising FK-linked GC reclamation would be the ideal follow-up.

Related to #20353


Most of this was written by Claude Code — I just provided direction and review.

shlomi-noach and others added 2 commits June 18, 2026 12:06
Detect and reject OnlineDDL batches where a foreign key present in the
source schema would survive on a held table (`_vt_hld_...`) while the
referenced parent is concurrently altered into an incompatible signature.
Such batches keep the final schema valid yet corrupt the held table's
stale FK; no ordering can resolve this, so we introduce a new
DiffDependencyImpossibleExecution dependency type and reject in
OrderedDiffs.

Detection is gated behind the new ForeignKeyShadowConflictStrategy hint
(default = current behavior), since the corruption is specific to
OnlineDDL's held-table mechanism and is safe under direct strategy.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
… detection

Two review fixes for foreign-key shadow-table conflict detection:

- Self-referencing foreign keys were wrongly rejected. A single-table ALTER
  that changes a self-referenced column type is safe: the held original is
  internally consistent (its FK references its own columns at the same
  version) and never references the new table. Skip the check when the
  referenced table is the child itself.

- The error message rendered an empty referenced-column identifier
  (`p`.``) for the dropped-parent-table and dropped-covering-index cases.
  Default the referenced column to the FK's first referenced column so the
  message always names a column; column-specific branches still override it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 18, 2026 10:17
@github-actions github-actions Bot added this to the v25.0.0 milestone Jun 18, 2026
@vitess-bot vitess-bot Bot added NeedsWebsiteDocsUpdate What it says NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels Jun 18, 2026
@vitess-bot

vitess-bot Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@shlomi-noach shlomi-noach removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels Jun 18, 2026
@codecov

codecov Bot commented Jun 18, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 81.63265% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.08%. Comparing base (70c7a72) to head (d29acc4).
⚠️ Report is 343 commits behind head on main.

Files with missing lines Patch % Lines
go/vt/schemadiff/schema_diff.go 44.44% 10 Missing ⚠️
go/vt/schemadiff/schema.go 92.45% 4 Missing ⚠️
go/vt/vttablet/tabletserver/gc/tablegc.go 66.66% 4 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main   #20352       +/-   ##
===========================================
+ Coverage   69.67%   85.08%   +15.41%     
===========================================
  Files        1614       18     -1596     
  Lines      216793     4565   -212228     
===========================================
- Hits       151044     3884   -147160     
+ Misses      65749      681    -65068     
Flag Coverage Δ
partial 85.08% <81.63%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends schemadiff to detect and (optionally) reject OnlineDDL “shadow/held table” foreign-key conflicts that make a multi-DDL batch impossible to execute safely under any ordering. It introduces a new dependency severity for “impossible execution”, records these dependencies during SchemaDiff(), and makes OrderedDiffs() fail fast with a clear, actionable error when the opt-in hint is enabled.

Changes:

  • Add a new DiffHints.ForeignKeyShadowConflictStrategy (default allow) to gate this OnlineDDL-specific validation.
  • Detect “held table FK would become invalid” situations and record them as DiffDependencyImpossibleExecution, returning a new ForeignKeyShadowConflictError from OrderedDiffs().
  • Add targeted unit tests covering multiple conflict scenarios and the opt-in/opt-out behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file
File Description
go/vt/schemadiff/types.go Adds ForeignKeyShadowConflictStrategy* constants and a new DiffHints field to gate the behavior.
go/vt/schemadiff/schema.go Implements shadow/held-table FK conflict detection and records impossible-execution dependencies during SchemaDiff().
go/vt/schemadiff/schema_diff.go Introduces DiffDependencyImpossibleExecution and fails fast in OrderedDiffs() with a precise FK shadow conflict error.
go/vt/schemadiff/schema_diff_test.go Adds coverage for shadow conflict detection across multiple scenarios, including self-FK exclusion and strategy gating.
go/vt/schemadiff/errors.go Adds ForeignKeyShadowConflictError with an operator-actionable message.

@shlomi-noach shlomi-noach added Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: Online DDL Online DDL (vitess/native/gh-ost/pt-osc) Component: schema management schemadiff and schema changes and removed NeedsIssue A linked issue is missing for this Pull Request labels Jun 18, 2026
The table GC reclaims held tables (`_vt_HOLD_…` → PURGE → DROP) one at a
time, ordered by age, with no awareness of foreign keys and with
foreign_key_checks left ON. When two FK-related tables are garbage
collected together (e.g. a child and its parent dropped in one batch),
this can wedge: purging the parent fails under ON DELETE RESTRICT while
the child still has referencing rows, and dropping the referenced parent
fails outright. Circular FKs can never be reclaimed at all.

Disable foreign_key_checks on the dedicated purge and drop connections so
reclamation succeeds regardless of order or FK relationship (the data in
these doomed tables no longer matters). The connections are closed after
use, but we restore the value defensively in case they ever become pooled.

With the GC able to reclaim FK-linked held tables cleanly, dropping a
child and its parent in a single batch is now safe, so the schemadiff
shadow-conflict check no longer rejects it. The parent-is-dropped branch
is removed; the genuinely-unsafe cases (a parent that survives in a
broken FK relationship after an ALTER) are unchanged, including when the
child itself is dropped while the parent is altered incompatibly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach shlomi-noach removed the NeedsWebsiteDocsUpdate What it says label Jun 18, 2026
@shlomi-noach shlomi-noach changed the title schemadiff: reject foreign-key shadow-table conflicts schemadiff, tablegc: reject foreign-key shadow-table conflicts and reclaim FK-linked GC tables safely Jun 18, 2026
@shlomi-noach shlomi-noach marked this pull request as ready for review June 18, 2026 16:20
Copilot AI review requested due to automatic review settings June 18, 2026 16:20
@shlomi-noach shlomi-noach requested a review from mhamza15 June 18, 2026 16:20

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.

…omment

Address review follow-ups:

- Add TestDropTableDisablesForeignKeyChecks, a fakesqldb-backed unit test
  that asserts dropTable issues `SET SESSION foreign_key_checks = 0` before
  the DROP and restores it afterwards. This covers the previously-untested
  GC foreign-key behavior (the purge path uses the same mechanism but is
  harder to isolate due to its throttler dependency).

- Clarify the schemadiff DropTableEntityDiff comment: a dropped child
  conflicts only when a referenced parent survives this batch but is altered
  incompatibly. A dropped parent is intentionally not a conflict — the table
  GC reclaims the held pair with foreign key checks disabled.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

@mattlord mattlord left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just a couple of tiny nits.

A couple of other minor things...

In go/vt/vttablet/tabletserver/gc/tablegc.go:584, go/vt/vttablet/tabletserver/gc/tablegc_test.go:35 we could add real/purge coverage for FK-linked tablegc reclamation. The new fake DB test verifies that dropTable sends SET SESSION foreign_key_checks = 0 before DROP, but it still does not cover the purge path or exercise InnoDB FK behavior. One of the hazards fixed here is purging parent rows under ON DELETE RESTRICT while a doomed child still references them; a regression in purge would still wedge before DROP and would not fail this test. Or am I missing something? If not, then maybe we could add an end-to-end tablegc test (ideally adding to an existing e2e test) with FK-linked GC tables, preferably parent-first or circular, and assert GC can complete both PURGE and DROP. I don't consider this a blocker though as I think this is pretty straightforward and well understood behavior.

In go/vt/schemadiff/schema.go:995 I think that we can remove stale dropped-parent wording. The shadow-conflict helper comment still says a dropped parent table corrupts the held table and records an impossible dependency, but the final code intentionally removed the DropTableEntityDiff conflict and the tests now allow child+parent drop in one batch. Right?

Thanks, @shlomi-noach ! ❤️

Comment thread go/vt/schemadiff/errors.go Outdated

func (e *ForeignKeyShadowConflictError) Error() string {
return fmt.Sprintf(
"foreign key constraint %s in table %s references %s.%s, which is altered incompatibly in the same batch; the original table would survive with this foreign key while %s is altered, corrupting it. Split into separate migrations",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say "... These changes must be split into separate migrations."

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

defer func() {
if !conn.IsClosed() {
if _, err := conn.Conn.ExecuteFetch("SET SESSION foreign_key_checks = 1", 0, false); err != nil {
log.Error(fmt.Sprintf("TableGC: error setting foreign_key_checks = 1: %+v", err))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log.Error(fmt.Sprintf("TableGC: error setting foreign_key_checks = 1: %+v", err))
log.Error("TableGC: error setting foreign_key_checks = 1", slog.Any("error", err))

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

…ent wording

Review follow-ups from PR vitessio#20352:

- Add TestPurgeDisablesForeignKeyChecks: a fakesqldb-backed test that the
  purge path also disables foreign_key_checks before deleting rows, so a
  regression there (the ON DELETE RESTRICT wedge) is caught — not just the
  DROP path. Shared setup extracted into newFakeDBTableGC.
- Use structured logging (slog.Any) for the foreign_key_checks restore error
  in purge and dropTable.
- Reword ForeignKeyShadowConflictError: "These changes must be split into
  separate migrations."
- Drop the stale "dropped parent table" wording from the shadow-conflict
  helper comment; a dropped parent is reclaimed by the GC and is not a
  conflict.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 21, 2026 05:37
@shlomi-noach

Copy link
Copy Markdown
Contributor Author

we could add real/purge coverage for FK-linked tablegc reclamation.

Wrote that in a unit test. An endtoend test would require the MySQL fork, which we don't test against in this repo.

I think that we can remove stale dropped-parent wording.

Done.

All review comments addressed.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Comment on lines +588 to +601
// Disable foreign key checks: the rows we purge belong to a doomed table, and another doomed
// table (e.g. a child table also being garbage collected) may still reference them. We must not
// let an ON DELETE RESTRICT constraint block reclamation. The connection is dedicated and closed
// when we're done, but we restore the value defensively, in case this ever becomes pooled.
if _, err := conn.ExecuteFetch("SET SESSION foreign_key_checks = 0", 0, false); err != nil {
return tableName, err
}
defer func() {
if !conn.IsClosed() {
if _, err := conn.ExecuteFetch("SET SESSION foreign_key_checks = 1", 0, false); err != nil {
log.Error("TableGC: error setting foreign_key_checks = 1", slog.Any("error", err))
}
}
}()

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reject — false premise. Copilot's worry ("leak unexpected session state to later users of the connection") requires the connection to have later users. It doesn't. Both purge and dropTable open a dedicated dbconnpool.NewDBConnection and defer conn.Close(); DBConnection embeds *mysql.Conn, whose Close() tears down the socket. The session — foreign_key_checks and all — dies with it.

The restore we added is already pure belt-and-suspenders (we discussed this; it mirrors the existing sql_log_bin restore on the same closed connection, which is equally redundant). On a closed connection, hardcoded-1 vs capture-and-restore is unobservable — there's no next user to observe it.

One grain of truth: the OnlineDDL precedent (executor.go:496-503) does capture @@foreign_key_checks and restore the prior value rather than assuming 1. If you wanted byte-for-byte symmetry with that pattern we could, but it's gold-plating for a connection that's about to be discarded.

Comment on lines +642 to +655
// Disable foreign key checks so we can drop a table that is still referenced by another
// (also-doomed) table's foreign key, regardless of the order in which the GC drops them. The
// connection is dedicated and closed when we're done, but we restore the value defensively, in
// case this ever becomes pooled.
if _, err := conn.Conn.ExecuteFetch("SET SESSION foreign_key_checks = 0", 0, false); err != nil {
return err
}
defer func() {
if !conn.IsClosed() {
if _, err := conn.Conn.ExecuteFetch("SET SESSION foreign_key_checks = 1", 0, false); err != nil {
log.Error("TableGC: error setting foreign_key_checks = 1", slog.Any("error", err))
}
}
}()

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

Comment on lines 79 to +88
// IsSequential returns true if this is a sequential dependency
func (d *DiffDependency) IsSequential() bool {
return d.typ >= DiffDependencySequentialExecution
}

// IsImpossible returns true if this is an impossible-execution dependency, ie the two diffs
// cannot be safely applied together in the same batch under any ordering.
func (d *DiffDependency) IsImpossible() bool {
return d.typ >= DiffDependencyImpossibleExecution
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Factually correct, but not a bug — and "fixing" it would make things worse. The >= ladder is deliberate (// the higher the value, the stricter), and IsInOrder()/IsSequential() all use >= by design: a stricter dependency satisfies every weaker predicate. An impossible dep is "at least sequential."

Why it's not a hazard:

  • The runnable-plan entry point is guarded. OrderedDiffs() checks HasImpossibleExecutionDependencies() first and returns the error before any sequential/ordering logic runs. Anyone getting an execution order goes through it.
  • No in-tree consumer branches on the sequential predicates (they're API surface for the future PlanetScale consumer).
  • Banding would be strictly worse. If I made HasSequentialExecutionDependencies() exclude impossible (>= Sequential && < Impossible), then a consumer who only checks "no sequential deps → run in parallel" would run an impossible batch in parallel — worse than running it sequentially. Either way, a consumer that ignores HasImpossibleExecutionDependencies() is unsafe; the band doesn't rescue them, and it breaks consistency with IsInOrder().

So the real safety comes from OrderedDiffs() rejecting + the dedicated HasImpossibleExecutionDependencies() — both of which exist. The >= semantics are correct as-is.

Comment on lines 308 to 326
// AllSequentialExecutionDependencies returns all diffs that are of "sequential execution" type.
func (d *SchemaDiff) AllSequentialExecutionDependencies() (deps []*DiffDependency) {
for _, dep := range d.dependencies {
if dep.typ >= DiffDependencySequentialExecution {
deps = append(deps, dep)
}
}
return deps
}

// HasSequentialExecutionDependencies return `true` if there is at least one "subsequential execution" type diff.
// If not, that means all diffs can be applied in parallel.
func (d *SchemaDiff) HasSequentialExecutionDependencies() bool {
for _, dep := range d.dependencies {
if dep.typ >= DiffDependencySequentialExecution {
return true
}
}
return false

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Component: Online DDL Online DDL (vitess/native/gh-ost/pt-osc) Component: schema management schemadiff and schema changes Type: Enhancement Logical improvement (somewhere between a bug and feature)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants