
Conversation

@bkirwi (Contributor) commented Nov 10, 2025

#34027 changed our approach to versioning, so that newer versions were responsible for maintaining compat with the latest version in shard state.

However, the actual updating of the version on upgrades was left to future work... this is that!

Motivation

https://github.com/MaterializeInc/database-issues/issues/9870

@bkirwi bkirwi requested review from a team as code owners November 10, 2025 16:38
@bkirwi bkirwi changed the title from "[persist] [wip] erform upgrade for all shards" to "[persist] [wip] Actually perform upgrade for all shards" Nov 10, 2025
@bkirwi bkirwi marked this pull request as draft November 10, 2025 16:38
@bkirwi bkirwi force-pushed the do-upgrade branch 2 times, most recently from e64e1ac to 0284e94 Compare November 10, 2025 23:02
@bkirwi bkirwi changed the title from "[persist] [wip] Actually perform upgrade for all shards" to "[persist] Actually perform upgrade for all shards" Nov 11, 2025
@bkirwi bkirwi marked this pull request as ready for review November 11, 2025 01:07
@bkirwi bkirwi requested review from a team and aljoscha as code owners November 11, 2025 01:07
@bkirwi (Contributor, Author) commented Nov 11, 2025

@def- - In nightlies, a few tests are failing with:

Test succeeded, but unknown errors found in logs, marking as failed
where the relevant log is:
platform-checks-mz_2-1               | cluster-u1-replica-u2-gen-1: 2025-11-11T00:10:38.843870Z  WARN mz_persist_client::internal::encoding: halting process: code at version 26.0.0-dev.0 cannot read data with version 26.1.0

I've looked into a couple of examples, and this seems to be the new process correctly fencing out the old. (The behaviour has changed here: older versions would not fence at all after an upgrade, but since the fencing happens after the new instance takes over, this shouldn't change any observable behaviour.) It should be safe to relax this check!
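
For readers outside persist: the fence in that log reduces to a version comparison when decoding shard state. Code that is at least as new as the state's recorded version proceeds (and is responsible for compatibility, per #34027); older code halts rather than misinterpret newer state. A minimal self-contained sketch of that shape, using a simplified hypothetical Version type (this is not the actual mz_persist_client logic):

use std::cmp::Ordering;

// Simplified stand-in for a build/state version such as "26.0.0-dev.0".
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Version(u64, u64, u64);

fn check_can_read(code: &Version, state: &Version) -> Result<(), String> {
    match code.cmp(state) {
        // Code at least as new as the state is responsible for reading it.
        Ordering::Equal | Ordering::Greater => Ok(()),
        // Older code must not interpret state written by newer code; the real
        // process logs a WARN like the one above and halts at this point.
        Ordering::Less => Err(format!(
            "code at version {:?} cannot read data with version {:?}",
            code, state
        )),
    }
}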

@bkirwi bkirwi requested a review from teskje November 11, 2025 01:11
@bkirwi (Contributor, Author) commented Nov 11, 2025

@teskje - Tagged you as a reviewer here! The criteria for where to make the upgrade call are:

  • Somewhere that's guaranteed to be run once per instance start. (Even if the shard is not receiving writes or similar.)
  • Ideally, somewhere that's only called a small number of times per process. (To avoid a bunch of unnecessary writes.)
  • Somewhere after the point where the upgrade has definitely fenced out the old process. (But ideally fairly close to that point.)

I'd especially appreciate it if you could check my work on these!
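
To make those criteria concrete, the call being placed is the upgrade_version call quoted in the review comments below. A hedged sketch of the intended shape only; the wrapping function here is a hypothetical stand-in for the real call sites, and imports/module paths are omitted:

// Hypothetical wrapper, not an actual call site added in this PR.
async fn on_instance_start_after_fence(
    persist_client: &PersistClient,
    shard_id: ShardId,
    diagnostics: Diagnostics,
) {
    // Runs exactly once per instance start, even for shards that receive no
    // writes, and only after the old process has definitely been fenced out.
    persist_client
        .upgrade_version::<TableKey, ShardId, Timestamp, StorageDiff>(shard_id, diagnostics)
        .await
        .expect("valid usage");
}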

@def- def- requested a review from a team as a code owner November 11, 2025 10:35
@def- (Contributor) commented Nov 11, 2025

It should be safe to relax this check!

Done, pushed into this PR and triggered a new upgrade test run: https://buildkite.com/materialize/nightly/builds/14051 & https://buildkite.com/materialize/release-qualification/builds/982

@teskje (Contributor) left a comment

LGTM! I have two suggestions for moving upgrades to (imo) more obvious places, but nothing blocking.

Comment on lines +553 to +556
persist_client
    .upgrade_version::<TableKey, ShardId, Timestamp, StorageDiff>(shard_id, diagnostics)
    .await
    .expect("valid usage");
@teskje (Contributor) commented Nov 11, 2025

I think this could happen earlier, namely when a leader (i.e. non-read-only) environment opens the migration shard. But also all this code gets replaced by #34011 anyway. I'll make a note to add an upgrade_version call in that PR too.

Edit: Done

Comment on lines +1715 to +1722
async fn mark_bootstrap_complete(&mut self) {
    self.bootstrap_complete = true;
    if matches!(self.mode, Mode::Writable) {
        self.since_handle
            .upgrade_version()
            .await
            .expect("invalid usage")
    }
@teskje (Contributor) commented

TIL about mark_bootstrap_complete! Not sure I like it; why does the durable catalog need to know about an adapter concept?

Anyway... what do you think about doing the upgrade of the catalog shard in PersistHandle::open_inner, immediately after we do the initial commit that fences out old versions in leader mode? That seems like a good place, and it's where we bump the version of the upgrade shard today.

@bkirwi (Contributor, Author) commented

I'll try it! (Originally I avoided it since it made the old version get fenced out more aggressively, but with Dennis' testing changes maybe it will all be fine.)

@bkirwi (Contributor, Author) commented

Oh right: it's because doing the fencing there makes the older envd instance halt! deep in Persist instead of at the current place, which makes logging slightly less useful.

I'll merge without making this change, but if you feel strongly about it tomorrow I'm happy to do a follow-up.

@teskje (Contributor) commented

It's fine for now, but when we remove the upgrade shard I'd feel better having the version upgrade moved there instead.

Although, I'm curious how this makes a difference for the old instance. How does it get fenced out now, if not through persist?

@bkirwi (Contributor, Author) commented

Currently the old instance shuts down with:

environmentd: 2025-11-11T10:44:40.134392Z  INFO coord::advance_timelines_interval:coord::group_commit: mz_adapter::util: exiting process (0): unable to confirm leadership: Catalog(Error { kind: Durable(Fence(DeployGeneration { current_generation: 0, fence_generation: 1 })) })

This exits with a zero exit code.
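
For context, the fence that stops the old instance today is this deploy-generation check in the durable catalog rather than persist's version check. A minimal sketch of that shape (illustrative only; confirm_leadership_or_exit is a hypothetical name, not the actual mz_adapter code):

// An old instance whose deploy generation has been superseded by a newer
// deployment's fence generation exits cleanly with status 0, so the handover
// is not treated as a crash.
fn confirm_leadership_or_exit(current_generation: u64, fence_generation: u64) {
    if fence_generation > current_generation {
        println!(
            "exiting process (0): unable to confirm leadership: \
             fenced by generation {fence_generation} (current {current_generation})"
        );
        std::process::exit(0);
    }
}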

@def- (Contributor) commented Nov 11, 2025

I guess this one in Legacy upgrade tests (last version from git) is also expected?

legacy-upgrade-materialized-1     | cluster-s2-replica-s2-gen-0: 2025-11-11T10:43:10.186386Z  WARN mz_persist_client::internal::encoding: halting process: 0.164.0 received persist state from the future 26.0.0-dev.0

Will add another ignore. New run: https://buildkite.com/materialize/nightly/builds/14052

@def- (Contributor) left a comment

All good from the testing side.

@bkirwi bkirwi merged commit d62fc82 into MaterializeInc:main Nov 11, 2025
228 checks passed