chunked builtin backup engine by rvrangel · Pull Request #20167 · vitessio/vitess

rvrangel · 2026-05-22T18:28:24Z

Description

This is the first PR as part of #20159

This PR adds chunked parallel backup/restore to the builtin backup engine. Files larger than a configurable threshold are split into independently-compressed chunks during backup, which can then be restored in parallel using writes at known offsets.

Changes:

Two new flags: --builtinbackup-file-chunk-threshold (default 0, disabled) and --builtinbackup-file-chunk-size (default 1GiB)
During backup, files exceeding the threshold are split into chunks, each stored as a separate object in backup storage
During restore, chunks of the same file are written concurrently via offsetWriter (pwrite semantics)
Failed chunks are retried using the same mechanism as whole-file retries
Backward compatible: threshold=0 disables chunking, and old manifests (no Chunks field) restore identically to before
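
As a rough illustration of the splitting rule described above, the sketch below computes chunk offsets and sizes for a file. The names `chunkSpec` and `planChunks` are hypothetical and not taken from the PR; the sketch assumes a file is chunked only when it is strictly larger than the threshold, and that a threshold of 0 disables chunking.

```go
package main

import "fmt"

// chunkSpec is a hypothetical description of one chunk: where it starts in
// the source file and how many bytes it covers.
type chunkSpec struct {
	Offset int64
	Size   int64
}

// planChunks returns nil when the file should be stored whole (chunking
// disabled, or the file is not larger than the threshold). Otherwise it
// returns consecutive chunks of at most chunkSize bytes covering the file.
func planChunks(fileSize, threshold, chunkSize int64) []chunkSpec {
	if threshold <= 0 || chunkSize <= 0 || fileSize <= threshold {
		return nil
	}
	var chunks []chunkSpec
	for off := int64(0); off < fileSize; off += chunkSize {
		size := chunkSize
		if remaining := fileSize - off; remaining < size {
			size = remaining // the last chunk may be shorter
		}
		chunks = append(chunks, chunkSpec{Offset: off, Size: size})
	}
	return chunks
}

func main() {
	const gib = int64(1) << 30
	// A 2.5 GiB file with a 1 GiB threshold and 1 GiB chunks.
	for i, c := range planChunks(5*gib/2, gib, gib) {
		fmt.Printf("chunk %d: offset=%d size=%d\n", i, c.Offset, c.Size)
	}
}
```

Because every chunk carries its own offset, a restore can write the chunks in any order.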

Related Issue(s)

Feature Request: Improved builtin backup engine #20159

Checklist

"Backport to:" labels have been added if this change should be back-ported to release branches
If this change is to be back-ported to previous releases, a justification is included in the PR description
Tests were added or are not required
Did the new or modified tests pass consistently locally and on CI?
Documentation was added or is not required

Deployment Notes

AI Disclosure

PR created by me with support of Claude, fully tested by me before publishing on our own branch and tested with unit tests and e2e on the main branch

Signed-off-by: Renan Rangel <rrangel@slack-corp.com>

vitess-bot · 2026-05-22T18:28:55Z

Copilot

Pull request overview

This PR adds optional chunking to the builtinbackupengine so that large MySQL files can be backed up and restored as independently-compressed pieces, enabling higher parallelism (especially beneficial for object stores like S3) and improving restore throughput.

Changes:

Introduces chunk metadata in the backup manifest (FileEntry.Chunks) and new flags to control chunking (--builtinbackup-file-chunk-threshold, --builtinbackup-file-chunk-size).
Updates builtin backup/restore to split large files into chunks for parallel backup and to restore chunked files via parallel WriteAt (pwrite-style) writes into a pre-sized destination.
Adds unit and end-to-end tests validating chunk name parsing and verifying chunked vs non-chunked backups via MANIFEST inspection.
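
The pwrite-style restore mentioned above can be sketched with the standard library alone. This is a simplified illustration, not the PR's `offsetWriter`: `piece`, `restoreChunks`, and `demo` are hypothetical names, and the chunks are assumed to be already decompressed and non-overlapping.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// piece is a hypothetical decompressed chunk together with its byte offset.
type piece struct {
	Offset int64
	Data   []byte
}

// restoreChunks pre-sizes the destination and writes all pieces concurrently
// with WriteAt. The close error is returned because it can be the place
// where a failed flush is reported.
func restoreChunks(path string, size int64, pieces []piece) (err error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o644)
	if err != nil {
		return err
	}
	defer func() {
		if cerr := f.Close(); err == nil {
			err = cerr
		}
	}()
	if err = f.Truncate(size); err != nil {
		return err
	}

	var wg sync.WaitGroup
	errs := make([]error, len(pieces))
	for i, p := range pieces {
		wg.Add(1)
		go func(i int, p piece) {
			defer wg.Done()
			_, errs[i] = f.WriteAt(p.Data, p.Offset)
		}(i, p)
	}
	wg.Wait()
	return errors.Join(errs...)
}

// demo restores three pieces into a temporary file and returns its content.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "chunked-restore")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "restored.bin")
	pieces := []piece{
		{Offset: 8, Data: []byte("cc")},
		{Offset: 0, Data: []byte("aaaa")},
		{Offset: 4, Data: []byte("bbbb")},
	}
	if err := restoreChunks(path, 10, pieces); err != nil {
		return "", err
	}
	got, err := os.ReadFile(path)
	return string(got), err
}

func main() {
	got, err := demo()
	if err != nil {
		fmt.Println("restore failed:", err)
		return
	}
	fmt.Printf("%q\n", got)
}
```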

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
go/vt/mysqlctl/file_close_test.go	Updates tests for the new `backupFile(..., chunkIndex)` signature.
go/vt/mysqlctl/builtinbackupengine.go	Core implementation: chunking flags, manifest schema, chunked backup work scheduling, and parallel chunk restore.
go/vt/mysqlctl/builtinbackupengine_test.go	Adds unit tests for parsing storage names (`parseBackupName`).
go/test/endtoend/backup/vtctlbackup/backup_utils.go	Adds helpers to verify chunking by reading MANIFEST and counting chunks.
go/test/endtoend/backup/vtctlbackup/backup_test.go	Adds end-to-end tests for chunked and non-chunked builtin backups with forced small thresholds/sizes.
go/flags/endtoend/vttestserver.txt	Documents new builtinbackup chunking flags in end-to-end flag snapshots.
go/flags/endtoend/vttablet.txt	Documents new builtinbackup chunking flags in end-to-end flag snapshots.
go/flags/endtoend/vtctld.txt	Documents new builtinbackup chunking flags in end-to-end flag snapshots.
go/flags/endtoend/vtcombo.txt	Documents new builtinbackup chunking flags in end-to-end flag snapshots.
go/flags/endtoend/vtbackup.txt	Documents new builtinbackup chunking flags in end-to-end flag snapshots.

Comments suppressed due to low confidence (2)

go/vt/mysqlctl/builtinbackupengine.go:1372

Chunk restore goroutines close over loop variables j and fe (and use dest/fe.Name inside the closure). This can result in writing the wrong chunk offset/data and misreporting errors/logs. Rebind the loop variables (e.g. j := j, feLocal := fe) before starting each goroutine.

			for j := range fe.Chunks {
				g.Go(func() error {
					chunk := &fe.Chunks[j]

					select {

go/vt/mysqlctl/builtinbackupengine.go:1394

Non-chunked restore goroutine closes over i/fe from the enclosing for-loop. This can cause it to restore the wrong file index and log/record errors under the wrong name. Capture locals (e.g. iLocal := i, feLocal := fe) before g.Go.

			// Non-chunked file: restore as before.
			g.Go(func() error {
				name := strconv.Itoa(i)

				select {
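
For context on the two suppressed comments above: since Go 1.22, a `for` loop creates a fresh variable per iteration when the module's `go` directive is 1.22 or later, so the capture problem described only applies to older language versions. The sketch below shows the explicit rebinding that is safe under either semantics; `collectOffsets` is a hypothetical helper, not code from the PR.

```go
package main

import (
	"fmt"
	"sync"
)

// collectOffsets starts one goroutine per offset and records which offset
// each goroutine observed at its own index.
func collectOffsets(offsets []int64) []int64 {
	seen := make([]int64, len(offsets))
	var wg sync.WaitGroup
	for j, off := range offsets {
		j, off := j, off // explicit per-iteration copies; redundant from Go 1.22 on
		wg.Add(1)
		go func() {
			defer wg.Done()
			seen[j] = off
		}()
	}
	wg.Wait()
	return seen
}

func main() {
	fmt.Println(collectOffsets([]int64{0, 1024, 2048}))
}
```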

promptless · 2026-05-25T13:09:27Z

Promptless prepared a documentation update related to this change.

Triggered by PR #20167

Added documentation for the new --builtinbackup-file-chunk-threshold and --builtinbackup-file-chunk-size flags to the backup and restore overview guide. These flags enable parallel backup and restore of large files by splitting them into independently-compressed chunks.

Review: Document builtin backup chunking flags

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

codecov · 2026-05-25T13:19:57Z

Codecov Report

❌ Patch coverage is 80.58608% with 53 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.95%. Comparing base (70c7a72) to head (29ad57c).
⚠️ Report is 352 commits behind head on main.

Files with missing lines	Patch %	Lines
go/vt/mysqlctl/builtinbackupengine.go	80.58%	53 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (70c7a72) and HEAD (29ad57c): HEAD has 1 upload less than BASE (BASE: 1, HEAD: 0).

Additional details and impacted files

@@             Coverage Diff             @@
##             main   #20167       +/-   ##
===========================================
- Coverage   69.67%   52.95%   -16.72%     
===========================================
  Files        1614       46     -1568     
  Lines      216793     7315   -209478     
===========================================
- Hits       151044     3874   -147170     
+ Misses      65749     3441    -62308

Flag	Coverage Δ
partial	`52.95% <80.58%> (?)`

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

mattlord

go/vt/mysqlctl/builtinbackupengine.go:1386-1390 ignores final close errors for chunked restore destinations. Non-chunked restore propagates destination close failures because they can mean data was not safely flushed; chunked restore only logs them and can report success, then later attempt to start MySQL on an incomplete/corrupt file. Please collect these close errors and return them, and also check the dest.Close() in createChunkedDestinations at line 1367.
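
One possible shape for collecting close errors instead of only logging them is sketched below. `closeAll`, `stubCloser`, and `demoClose` are hypothetical names for illustration; the real destinations in the PR are files, not stubs.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

var errNoSpace = errors.New("no space left on device")

// closeAll closes every destination and returns all close failures joined
// together, so the caller cannot report success after a failed flush.
func closeAll(dests map[string]io.Closer) error {
	var errs []error
	for name, d := range dests {
		if err := d.Close(); err != nil {
			errs = append(errs, fmt.Errorf("closing %s: %w", name, err))
		}
	}
	return errors.Join(errs...) // nil when nothing failed
}

// stubCloser simulates a destination whose Close returns a fixed result.
type stubCloser struct{ err error }

func (s stubCloser) Close() error { return s.err }

// demoClose closes one healthy and one failing destination.
func demoClose() error {
	return closeAll(map[string]io.Closer{
		"ibdata1": stubCloser{},                // closes cleanly
		"t1.ibd":  stubCloser{err: errNoSpace}, // reports a delayed write failure
	})
}

func main() {
	err := demoClose()
	fmt.Println(errors.Is(err, errNoSpace))
}
```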

go/vt/mysqlctl/builtinbackupengine.go:196-198 / :691-694 has no bound on the chunk count. A small --builtinbackup-file-chunk-size typo, e.g. 1, can allocate one FileChunk and one work item per byte of a large InnoDB file before backup starts. Please enforce a sane minimum chunk size or a max chunks-per-file limit before allocating.

go/vt/mysqlctl/builtinbackupengine.go:281-283 validates chunk size even when chunking is disabled or when taking an incremental backup that may not use chunking. I’d validate chunk-size > 0 only when chunk-threshold > 0, reject negative thresholds explicitly, and fix the message to say must be > 0.
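
The validation order suggested in the two points above could look roughly like this. The function name and the 1 MiB minimum are assumptions made for the example; the PR may use different limits or a per-file chunk cap instead.

```go
package main

import (
	"errors"
	"fmt"
)

// minChunkSize is a hypothetical lower bound chosen for illustration only.
const minChunkSize = int64(1) << 20 // 1 MiB

// validateChunkFlags checks the chunk size only when chunking is enabled.
func validateChunkFlags(threshold, chunkSize int64) error {
	if threshold < 0 {
		return errors.New("file chunk threshold must be >= 0")
	}
	if threshold == 0 {
		return nil // chunking disabled: the chunk size is irrelevant
	}
	if chunkSize <= 0 {
		return errors.New("file chunk size must be > 0")
	}
	if chunkSize < minChunkSize {
		return fmt.Errorf("file chunk size must be at least %d bytes, got %d", minChunkSize, chunkSize)
	}
	return nil
}

func main() {
	fmt.Println(validateChunkFlags(0, 0))         // disabled, so any size passes
	fmt.Println(validateChunkFlags(1<<30, 1))     // rejected: chunk size too small
	fmt.Println(validateChunkFlags(1<<30, 1<<30)) // accepted
}
```

Rejecting tiny chunk sizes up front avoids allocating one work item per byte before the backup starts.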

I agree with the compatibility caveat too: once chunking is enabled, those backups are not restorable by older Vitess versions because old restore code ignores Chunks and looks for whole-file objects. That should be called out in release notes summary.

rvrangel · 2026-06-03T14:12:08Z

@mattlord PR updated!

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

mattlord

Just one (yay!) note... In go/vt/mysqlctl/builtinbackupengine.go:1329-1359 we should preserve chunked-destination close failures across retries. restoreFileEntries can return both retryable per-chunk errors from bh.Error() and direct errors from closing the shared chunked destination file (:1419-1424, after return bh.Error() at :1507-1508). restoreFiles then retries only bh.GetFailedFiles() and returns success if that retry succeeds, which drops the first-pass close failure. A close error can be where delayed writeback/ENOSPC is reported, so retrying only one failed chunk does not prove the already-written chunks are safe. Please keep direct restore errors separate and return them after retry, or mark the whole affected file/chunks for retry. A regression test should combine one retryable chunk failure with a chunked destination close failure.

Thanks, @rvrangel !

rvrangel · 2026-06-10T18:03:00Z

@mattlord if we hit a close error (especially for ENOSPC, but also for failed flushes to disk) it might be better to skip retries altogether, as it is very likely that a retry will fail the same way.

What do you think about wrapping finalErr in an errRestoreFatal in the deferred function (builtinbackupengine.go#L1419-L1426)? That would prevent the restore from going forward.
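
The idea of marking close errors as fatal so that retries are skipped can be sketched as follows. `errFatal` is a hypothetical sentinel standing in for the `errRestoreFatal` mentioned above, and `markFatal` and `withRetries` are illustrative helpers; the actual types and retry loop in Vitess may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// errFatal is a hypothetical stand-in for errRestoreFatal.
var errFatal = errors.New("fatal restore error")

var errNoSpace = errors.New("no space left on device")

// markFatal wraps err so that errors.Is(err, errFatal) reports true while
// the original cause stays reachable.
func markFatal(err error) error {
	return fmt.Errorf("%w: %w", errFatal, err)
}

// withRetries runs attempt up to maxAttempts times, but stops immediately
// on success or on an error marked as fatal.
func withRetries(maxAttempts int, attempt func() error) (calls int, err error) {
	for calls < maxAttempts {
		calls++
		if err = attempt(); err == nil || errors.Is(err, errFatal) {
			return calls, err
		}
	}
	return calls, err
}

func main() {
	calls, err := withRetries(3, func() error { return markFatal(errNoSpace) })
	fmt.Println(calls, errors.Is(err, errNoSpace)) // prints: 1 true
}
```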

rvrangel · 2026-06-11T12:43:15Z

added the above fix and a test for it too

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

mattlord

In `go/vt/mysqlctl/builtinbackupengine.go:717` I think that we should preserve first-pass `backupWorkItems` errors. The first backup pass still seems to discard the `backupWorkItems` return value, but the retry path treats that same return as significant. If `bh.EndBackup(ctx)` returns a direct error that is not represented in `bh.GetFailedFiles()`, this can continue into MANIFEST upload and report a successful backup. Please capture the first-pass error and return it when there are no per-file failures to retry, or preserve it across retry if needed. Unless I'm missing something?
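
A minimal sketch of the first option suggested here (return the first-pass error when there is nothing to retry) might look like this. `pass` and `backupFiles` are hypothetical; they only model a backup pass that reports both retryable per-file failures and a direct error.

```go
package main

import (
	"errors"
	"fmt"
)

var errDirect = errors.New("direct error from finishing the backup")

// pass is a hypothetical stand-in for one backup pass: it returns the files
// that failed (and can be retried) plus any direct error.
type pass func(files []string) (failed []string, err error)

// backupFiles runs a first pass and retries only the failed files.
func backupFiles(run pass, files []string) error {
	failed, err := run(files)
	if len(failed) == 0 {
		// Nothing to retry: a direct error must be returned, not discarded.
		return err
	}
	// Retry only the failed files; the retry result decides the outcome.
	failed, err = run(failed)
	if err != nil {
		return err
	}
	if len(failed) > 0 {
		return fmt.Errorf("%d file(s) still failing after retry", len(failed))
	}
	return nil
}

func main() {
	err := backupFiles(func([]string) ([]string, error) { return nil, errDirect }, []string{"ibdata1"})
	fmt.Println(err)
}
```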

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated no new comments.

frouioui · 2026-06-24T14:02:25Z

+- `--builtinbackup-file-chunk-threshold` (default `0`, chunking disabled): files larger than this size in bytes are split into chunks during backup.
+- `--builtinbackup-file-chunk-size` (default `1073741824` / 1 GiB): the target size in bytes for each chunk.
+
+**Compatibility note:** Backups created with chunking enabled are **not restorable by older Vitess versions** that do not understand the `Chunks` field in the backup MANIFEST. Non-chunked backups (the default) remain fully compatible with older versions.


This is forward compatible though right? Taking as an example a vtbackup job running periodically, if vtbackup starts being configured to use chunks will it be able to restore the previous non-chunked backup?

yes, that is correct. if the Chunks field in the MANIFEST is empty, it will use the same approach and functions as before.
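
To illustrate why an old manifest restores as before, the sketch below decodes two manifest entries with and without a `Chunks` field. The struct shapes are simplified and hypothetical: only the `Chunks` field name comes from this discussion, and the other fields and the sample storage name are invented for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FileChunk and FileEntry are simplified, hypothetical shapes.
type FileChunk struct {
	StorageName string
	Offset      int64
	Size        int64
}

type FileEntry struct {
	Name   string
	Chunks []FileChunk `json:",omitempty"`
}

// isChunked reports whether the entry must be restored chunk by chunk.
// An entry from an old manifest has no Chunks and takes the whole-file path.
func isChunked(fe FileEntry) bool { return len(fe.Chunks) > 0 }

// decodeEntry parses a single manifest entry from JSON.
func decodeEntry(raw string) (FileEntry, error) {
	var fe FileEntry
	err := json.Unmarshal([]byte(raw), &fe)
	return fe, err
}

func main() {
	old, _ := decodeEntry(`{"Name":"ibdata1"}`)
	chunked, _ := decodeEntry(`{"Name":"t1.ibd","Chunks":[{"StorageName":"example-chunk","Offset":0,"Size":1024}]}`)
	fmt.Println(isChunked(old), isChunked(chunked)) // prints: false true
}
```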

I added a new e2e test in 29ad57c that takes a backup without chunking and restores it on a replica with chunking enabled.

frouioui · 2026-06-24T14:24:23Z

+	// How many times we will retry file close operations. Note that a file operation that
+	// returns a vtrpc.Code_FAILED_PRECONDITION error is considered fatal and we will
+	// not retry.
+	maxFileCloseRetries = 20


Any reason to make this a non-constant?

this was done to avoid the exponential back-off on one of the tests by changing the retry value:

https://github.com/rvrangel/vitess/blob/27325e4a9d13cde5add405e3c1d1d227825d25cf/go/vt/mysqlctl/builtinbackupengine_test.go#L494-L509

frouioui · 2026-06-24T15:03:20Z

+				workItems = append(workItems, restoreWorkItem{
+					fe:         fe,
+					chunk:      &fe.Chunks[j],
+					chunkIndex: j,


We infer the chunk index via j, the position of the chunk in the slice of chunks (fe.Chunks). fes can be built during a retry in the function restoreFiles before being sent down to restoreFileEntries.

The retry-failed-files code path in restoreFiles builds a new fes using only failed files/chunks. A chunk that was at index 10 in the first pass, for example, can end up at index 0 in the retry. This happens whenever only a subset of chunks fails and is retried.

To keep the chunk index consistent, can we infer its index another way than by looking at the slice index?

ah, good point! We don't actually use the index for restoring, only for logging, so we can drop it and use the StorageName of the chunk instead.
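
The index shift described in this thread is easy to reproduce in isolation. In the sketch below, `chunk` and `failedSubset` are hypothetical, and the storage names are made-up labels rather than the PR's naming scheme.

```go
package main

import "fmt"

// chunk carries a stable identifier assigned at backup time.
type chunk struct {
	StorageName string
}

// failedSubset builds the retry list, mimicking a retry pass that only sees
// the chunks that failed the first time.
func failedSubset(all []chunk, failed map[string]bool) []chunk {
	var out []chunk
	for _, c := range all {
		if failed[c.StorageName] {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	all := []chunk{{"chunk-a"}, {"chunk-b"}, {"chunk-c"}}
	retry := failedSubset(all, map[string]bool{"chunk-c": true})
	for j, c := range retry {
		// j is 0 here although this chunk sat at index 2 in the first pass;
		// StorageName is unchanged and therefore safe to log.
		fmt.Printf("retry position %d, chunk %s\n", j, c.StorageName)
	}
}
```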

frouioui

I would like to see more E2E tests for this with these characteristics:

Assertions on a deterministic amount of chunks and files
Large amount of chunks
With chunked and unchunked files
With different storage handlers, like S3 using MinIO (e.g. go/test/endtoend/backup/s3/s3_builtin_test.go)
Using vtbackup
With failure scenarios
Showing the forward compatibility of restoring an unchunked backup with chunks parameters
Showing the failure scenario of what happens when a chunked backup is restored without the chunks parameters

Can you also provide a performance comparison between chunking and not chunking? Mostly to get an idea of the performance impact on a stable cluster and hardware.

rvrangel · 2026-06-24T21:16:40Z

@frouioui ack, let me take a look to see if I can cover most of that this week

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated no new comments.

chunked builtin backup engine

3c46fbe

Signed-off-by: Renan Rangel <rrangel@slack-corp.com>

Copilot AI review requested due to automatic review settings May 22, 2026 18:28

github-actions Bot added this to the v25.0.0 milestone May 22, 2026

Copilot started reviewing on behalf of rvrangel May 22, 2026 18:29 View session

Copilot AI reviewed May 22, 2026

small refactor of how files are opened

b3e9314

Signed-off-by: Renan Rangel <rrangel@slack-corp.com>

rvrangel marked this pull request as ready for review May 25, 2026 13:06

rvrangel requested a review from mattlord as a code owner May 25, 2026 13:06

Copilot AI review requested due to automatic review settings May 25, 2026 13:06

rvrangel requested a review from frouioui as a code owner May 25, 2026 13:06

Copilot started reviewing on behalf of rvrangel May 25, 2026 13:07 View session

Copilot AI reviewed May 25, 2026

linter and other improvements

dd8bb8c

Signed-off-by: Renan Rangel <rrangel@slack-corp.com>

frouioui reviewed May 26, 2026

add some comments after PR feedback

22620c5

Signed-off-by: Renan Rangel <rrangel@slack-corp.com>

Copilot AI review requested due to automatic review settings May 26, 2026 15:39

Copilot AI reviewed May 26, 2026

mattlord reviewed May 26, 2026

rvrangel added 2 commits June 3, 2026 07:18

improve unit test

3a11ef6

Signed-off-by: Renan Rangel <rrangel@slack-corp.com>

Merge remote-tracking branch 'origin/main' into builtin-backup-chunking

618d837

Signed-off-by: Renan Rangel <rrangel@slack-corp.com>

Copilot AI review requested due to automatic review settings June 3, 2026 16:56

Copilot AI reviewed Jun 3, 2026

add backup fatal error

26c261d

Signed-off-by: Renan Rangel <rrangel@slack-corp.com>

rvrangel force-pushed the builtin-backup-chunking branch from 90da58e to 26c261d Compare June 3, 2026 18:25

rvrangel mentioned this pull request Jun 5, 2026

Feature Request: Improved builtin backup engine #20159

Open

mattlord reviewed Jun 10, 2026

wrap close errors as fatal

4d0b4c6

Signed-off-by: Renan Rangel <rrangel@slack-corp.com>

Copilot AI review requested due to automatic review settings June 11, 2026 12:42

Copilot AI reviewed Jun 11, 2026

revert backup fatal error

4cd2137

Signed-off-by: Renan Rangel <rrangel@slack-corp.com>

mattlord reviewed Jun 23, 2026

feedback

3fa9ff0

Signed-off-by: Renan Rangel <rrangel@slack-corp.com>

Copilot AI review requested due to automatic review settings June 23, 2026 18:33

Copilot started reviewing on behalf of rvrangel June 23, 2026 18:34 View session

Copilot AI reviewed Jun 23, 2026

merge from main

27325e4

Signed-off-by: Renan Rangel <rrangel@slack-corp.com>

frouioui reviewed Jun 24, 2026

maxenglander assigned mattlord and frouioui Jun 24, 2026

frouioui reviewed Jun 24, 2026

remove chunkIndex

29ad57c

Signed-off-by: Renan Rangel <rrangel@slack-corp.com>

Copilot AI review requested due to automatic review settings June 24, 2026 21:15

Copilot started reviewing on behalf of rvrangel June 24, 2026 21:15 View session

Copilot AI reviewed Jun 24, 2026
