
Conversation

@bkirwi (Contributor) commented Nov 10, 2025

#34027 changed our approach to versioning, so that newer versions were responsible for maintaining compat with the latest version in shard state.

However, the actual updating of the version on upgrades was left to future work... this is that!

Motivation

https://github.com/MaterializeInc/database-issues/issues/9870

@bkirwi bkirwi requested review from a team as code owners November 10, 2025 16:38
@bkirwi bkirwi changed the title from "[persist] [wip] erform upgrade for all shards" to "[persist] [wip] Actually perform upgrade for all shards" Nov 10, 2025
@bkirwi bkirwi marked this pull request as draft November 10, 2025 16:38
@bkirwi bkirwi force-pushed the do-upgrade branch 2 times, most recently from e64e1ac to 0284e94 Compare November 10, 2025 23:02
@bkirwi bkirwi changed the title from "[persist] [wip] Actually perform upgrade for all shards" to "[persist] Actually perform upgrade for all shards" Nov 11, 2025
@bkirwi bkirwi marked this pull request as ready for review November 11, 2025 01:07
@bkirwi bkirwi requested review from a team and aljoscha as code owners November 11, 2025 01:07
@bkirwi (Contributor, Author) commented Nov 11, 2025

@def- - In nightlies, a few tests are failing with:

Test succeeded, but unknown errors found in logs, marking as failed
where the relevant log is:
platform-checks-mz_2-1               | cluster-u1-replica-u2-gen-1: 2025-11-11T00:10:38.843870Z  WARN mz_persist_client::internal::encoding: halting process: code at version 26.0.0-dev.0 cannot read data with version 26.1.0

I've looked into a couple of examples, and this seems to be the new process correctly fencing out the old. (The behaviour has changed here: older versions would not fence at all after an upgrade, but since the fencing happens after the new instance takes over, this shouldn't change any observable behaviour.) It should be safe to relax this check!
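
For readers outside persist: the fence in that log reduces to a version comparison when decoding shard state. Code that is at least as new as the state's recorded version proceeds (and is responsible for compatibility, per #34027); older code halts rather than misinterpret newer state. A minimal self-contained sketch of that shape, using a simplified hypothetical Version type (this is not the actual mz_persist_client logic):

use std::cmp::Ordering;

// Simplified stand-in for a build/state version such as "26.0.0-dev.0".
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Version(u64, u64, u64);

fn check_can_read(code: &Version, state: &Version) -> Result<(), String> {
    match code.cmp(state) {
        // Code at least as new as the state is responsible for reading it.
        Ordering::Equal | Ordering::Greater => Ok(()),
        // Older code must not interpret state written by newer code; the real
        // process logs a WARN like the one above and halts at this point.
        Ordering::Less => Err(format!(
            "code at version {:?} cannot read data with version {:?}",
            code, state
        )),
    }
}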

@bkirwi bkirwi requested a review from teskje November 11, 2025 01:11
@bkirwi (Contributor, Author) commented Nov 11, 2025

@teskje - Tagged you as a reviewer here! The criteria for where to make the upgrade call are:

  • Somewhere that's guaranteed to be run once per instance start. (Even if the shard is not receiving writes or similar.)
  • Ideally, somewhere that's only called a small number of times per process. (To avoid a bunch of unnecessary writes.)
  • Somewhere after the point where the upgrade has definitely fenced out the old process. (But ideally fairly close to that point.)

I'd especially appreciate it if you could check my work on these!
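
To make those criteria concrete, the call being placed is the upgrade_version call quoted in the review comments below. A hedged sketch of the intended shape only; the wrapping function here is a hypothetical stand-in for the real call sites, and imports/module paths are omitted:

// Hypothetical wrapper, not an actual call site added in this PR.
async fn on_instance_start_after_fence(
    persist_client: &PersistClient,
    shard_id: ShardId,
    diagnostics: Diagnostics,
) {
    // Runs exactly once per instance start, even for shards that receive no
    // writes, and only after the old process has definitely been fenced out.
    persist_client
        .upgrade_version::<TableKey, ShardId, Timestamp, StorageDiff>(shard_id, diagnostics)
        .await
        .expect("valid usage");
}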

@def- def- requested a review from a team as a code owner November 11, 2025 10:35
@def- (Contributor) commented Nov 11, 2025

It should be safe to relax this check!

Done, pushed into this PR and triggered a new upgrade test run: https://buildkite.com/materialize/nightly/builds/14051 & https://buildkite.com/materialize/release-qualification/builds/982

@teskje (Contributor) left a comment

LGTM! I have two suggestions for moving upgrades to (imo) more obvious places, but nothing blocking.

Comment on lines +553 to +556
persist_client
    .upgrade_version::<TableKey, ShardId, Timestamp, StorageDiff>(shard_id, diagnostics)
    .await
    .expect("valid usage");
@teskje (Contributor) commented Nov 11, 2025

I think this could happen earlier, namely when a leader (i.e. non-read-only) environment opens the migration shard. But also all this code gets replaced by #34011 anyway. I'll make a note to add an upgrade_version call in that PR too.

Edit: Done

Comment on lines +1715 to +1722
async fn mark_bootstrap_complete(&mut self) {
    self.bootstrap_complete = true;
    if matches!(self.mode, Mode::Writable) {
        self.since_handle
            .upgrade_version()
            .await
            .expect("invalid usage")
    }
@teskje (Contributor) commented

TIL about mark_bootstrap_complete! Not sure I like it; why does the durable catalog need to know about an adapter concept?

Anyway... what do you think about doing the upgrade of the catalog shard in PersistHandle::open_inner, immediately after we do the initial commit that fences out old versions in leader mode? That seems like a good place, and it's where we bump the version of the upgrade shard today.

@bkirwi (Contributor, Author) commented

I'll try it! (Originally I avoided it since it made the old version get fenced out more aggressively, but with Dennis' testing changes maybe it will all be fine.)

@bkirwi (Contributor, Author) commented

Oh right: it's because doing the fencing there makes the older envd instance halt! deep in Persist instead of at the current place, which makes logging slightly less useful.

I'll merge without making this change, but if you feel strongly about it tomorrow I'm happy to do a follow-up.

@teskje (Contributor) commented

It's fine for now, but when we remove the upgrade shard I'd feel better having the version upgrade moved there instead.

Although, I'm curious how this makes a difference for the old instance. How does it get fenced out now, if not through persist?

@bkirwi (Contributor, Author) commented

Currently the old instance shuts down with:

environmentd: 2025-11-11T10:44:40.134392Z  INFO coord::advance_timelines_interval:coord::group_commit: mz_adapter::util: exiting process (0): unable to confirm leadership: Catalog(Error { kind: Durable(Fence(DeployGeneration { current_generation: 0, fence_generation: 1 })) })

This exits with a zero exit code.
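
For context, the fence that stops the old instance today is this deploy-generation check in the durable catalog rather than persist's version check. A minimal sketch of that shape (illustrative only; confirm_leadership_or_exit is a hypothetical name, not the actual mz_adapter code):

// An old instance whose deploy generation has been superseded by a newer
// deployment's fence generation exits cleanly with status 0, so the handover
// is not treated as a crash.
fn confirm_leadership_or_exit(current_generation: u64, fence_generation: u64) {
    if fence_generation > current_generation {
        println!(
            "exiting process (0): unable to confirm leadership: \
             fenced by generation {fence_generation} (current {current_generation})"
        );
        std::process::exit(0);
    }
}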

@def- (Contributor) commented Nov 11, 2025

I guess this one in Legacy upgrade tests (last version from git) is also expected?

legacy-upgrade-materialized-1     | cluster-s2-replica-s2-gen-0: 2025-11-11T10:43:10.186386Z  WARN mz_persist_client::internal::encoding: halting process: 0.164.0 received persist state from the future 26.0.0-dev.0

Will add another ignore. New run: https://buildkite.com/materialize/nightly/builds/14052

@def- (Contributor) left a comment

All good from the testing side.

@bkirwi bkirwi merged commit d62fc82 into MaterializeInc:main Nov 11, 2025
228 checks passed