-
Notifications
You must be signed in to change notification settings - Fork 482
[persist] Actually perform upgrade for all shards #34075
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
e64e1ac to
0284e94
Compare
|
@def- - In nightlies, a few tests are failing with -
I've looked into a couple examples, and this seems to be the new process correctly fencing out the old. (The behaviour has changed here, since older versions would not fence at all after the upgrade... but since the fencing happens after the new instance takes over this shouldn't change any observable behaviour.) It should be safe to relax this check! |
|
@teskje - Tagged you as a reviewer here! The criteria for where to do the upgrade call is:
I'd especially appreciate if you could check my work on these! |
Done, pushed into this PR and triggered a new upgrade test run: https://buildkite.com/materialize/nightly/builds/14051 & https://buildkite.com/materialize/release-qualification/builds/982 |
teskje
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I have two suggestions for moving upgrades to (imo) more obvious places, but nothing blocking.
| persist_client | ||
| .upgrade_version::<TableKey, ShardId, Timestamp, StorageDiff>(shard_id, diagnostics) | ||
| .await | ||
| .expect("valid usage"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| async fn mark_bootstrap_complete(&mut self) { | ||
| self.bootstrap_complete = true; | ||
| if matches!(self.mode, Mode::Writable) { | ||
| self.since_handle | ||
| .upgrade_version() | ||
| .await | ||
| .expect("invalid usage") | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL about mark_bootstrap_complete! Not sure I like it, why does the durable catalog need to know about an adapter concept?
Anyway... What do you think about doing the upgrade of the catalog shard in PersistHandle::open_inner, immediately after we do the initial commit that fences out old versions in leader mode? That seems to be a good place and it's where we bump the version of the upgrade shard today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try it! (Originally I avoided it since it made the old version get fenced out more aggressively, but with Dennis' testing changes maybe it will all be fine.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh right: it's because doing the fencing there makes the older envd instance halt! deep in Persist instead of at the current place, which makes logging slightly less useful.
I'll merge without making this change, but if you feel strongly about it tomorrow I'm happy to do a followup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's fine for now, but when we remove the upgrade shard I'd feel better to have the version upgrade moved there instead.
Although, I'm curious how this makes a difference for the old instance? How does it get fenced out now if not through persist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently the old instance shuts down with:
environmentd: 2025-11-11T10:44:40.134392Z INFO coord::advance_timelines_interval:coord::group_commit: mz_adapter::util: exiting process (0): unable to confirm leadership: Catalog(Error { kind: Durable(Fence(DeployGeneration { current_generation: 0, fence_generation: 1 })) })
This exits with a zero exit code.
|
I guess this one in Legacy upgrade tests (last version from git) is also expected? Will add another ignore. New run: https://buildkite.com/materialize/nightly/builds/14052 |
def-
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All good from testing side
#34027 changed our approach to versioning, so that newer versions were responsible for maintaining compat with the latest version in shard state.
However, the actual updating of the version on upgrades was left to future work... this is that!
Motivation
https://github.com/MaterializeInc/database-issues/issues/9870