feat: mysql chunking optimisation#797

Open
saksham-datazip wants to merge 41 commits into staging from
feat/mysql-chunking-optimization

Conversation

@saksham-datazip
Collaborator

@saksham-datazip saksham-datazip commented Jan 27, 2026

Description

This PR improves the MySQL chunking strategy with the primary goal of significantly reducing chunk generation time for large tables during incremental reads.

To achieve this, two mathematical chunking strategies were introduced based on the primary key type, replacing repeated database-based chunk discovery.

Numeric Primary Keys

The numeric range [min, max] is divided using an arithmetic progression to generate evenly spaced chunk boundaries. This allows chunk boundaries to be computed mathematically instead of relying on repeated database lookups, significantly reducing chunking time.
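As a rough sketch of this strategy (function and variable names are illustrative, not the actual implementation), the arithmetic-progression boundary generation might look like:

```go
package main

import "fmt"

// numericBoundaries splits [min, max] into roughly chunkCount evenly sized
// ranges using an arithmetic progression and returns the interior boundaries.
// This avoids repeated database lookups: only min/max need to be queried.
func numericBoundaries(min, max, chunkCount int64) []int64 {
	if chunkCount <= 1 || max <= min {
		return nil
	}
	step := (max - min) / chunkCount
	if step == 0 {
		step = 1
	}
	var out []int64
	prev := min
	for next := min + step; next < max; next += step {
		if next <= prev { // guard against int64 overflow / precision collapse
			break
		}
		out = append(out, next)
		prev = next
	}
	return out
}

func main() {
	fmt.Println(numericBoundaries(0, 100, 4)) // prints [25 50 75]
}
```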

String Primary Keys

String values are mapped into a numeric space using Unicode encoding (big.Int) and then split into balanced ranges. These candidate boundaries are then aligned with actual database values using collation-aware queries to maintain correct ordering.
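A minimal sketch of the idea, assuming a fixed byte-prefix encoding for brevity (the real implementation encodes Unicode values and then aligns the candidates with actual rows via collation-aware queries):

```go
package main

import (
	"fmt"
	"math/big"
)

const prefixLen = 4 // hypothetical fixed prefix length for this sketch

// encode maps a string's first prefixLen bytes into a big.Int (base 256),
// right-padding with zero bytes so all values share the same scale.
func encode(s string) *big.Int {
	buf := make([]byte, prefixLen)
	copy(buf, s)
	return new(big.Int).SetBytes(buf)
}

// decode converts a big.Int back into a candidate boundary string,
// trimming the trailing zero padding.
func decode(n *big.Int) string {
	buf := n.FillBytes(make([]byte, prefixLen))
	end := len(buf)
	for end > 0 && buf[end-1] == 0 {
		end--
	}
	return string(buf[:end])
}

// stringBoundaries splits [min, max] into chunkCount balanced ranges in the
// numeric space and returns the interior candidate boundaries.
func stringBoundaries(min, max string, chunkCount int64) []string {
	lo, hi := encode(min), encode(max)
	span := new(big.Int).Sub(hi, lo)
	step := new(big.Int).Div(span, big.NewInt(chunkCount))
	if step.Sign() <= 0 {
		return nil
	}
	var out []string
	cur := new(big.Int).Add(lo, step)
	for cur.Cmp(hi) < 0 {
		out = append(out, decode(cur))
		cur = new(big.Int).Add(cur, step)
	}
	return out
}

func main() {
	fmt.Println(stringBoundaries("a", "z", 5)) // prints [f k p u]
}
```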

These strategies substantially reduce the number of database round trips required for chunk discovery, resulting in faster chunk generation and improved performance for large datasets.

As part of this work, several edge cases in chunk boundary calculation were also addressed, particularly around MySQL collation-aware ordering for string primary keys. The implementation aligns generated boundaries with actual database values using collation-aware queries, ensuring correct range generation and preventing missing or overlapping chunks.

Additionally, a small compatibility fix was introduced in refractor.go. Previously, some queries used hardcoded SQL strings, which caused MySQL to return numeric values as uint64. After switching to parameterized queries, the Go MySQL driver began returning these values as []uint8 (byte slices).

To handle this change correctly, an additional []uint8 case was added in ReformatInt64 so that these values are properly parsed and converted to int64. This ensures consistent behavior regardless of how the query result is returned by the driver.
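A sketch of the extra case (the function name follows the PR description; the surrounding cases are illustrative assumptions):

```go
package main

import (
	"fmt"
	"strconv"
)

// reformatInt64 sketches the fix described above: with parameterized queries,
// the Go MySQL driver can return numeric columns as []uint8 (the ASCII text
// of the number), which must be parsed back into an int64.
func reformatInt64(v any) (int64, error) {
	switch val := v.(type) {
	case int64:
		return val, nil
	case uint64:
		return int64(val), nil
	case []uint8: // driver returned the number as a byte slice
		return strconv.ParseInt(string(val), 10, 64)
	default:
		return 0, fmt.Errorf("unsupported type %T", v)
	}
}

func main() {
	n, err := reformatInt64([]uint8("12345"))
	fmt.Println(n, err) // prints: 12345 <nil>
}
```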

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

  • Tested MySQL chunking with INT32 primary keys

  • Tested MySQL chunking with INT64 primary keys

  • Tested MySQL chunking with FLOAT / DOUBLE primary keys

  • Verified no data loss or overlap across chunk boundaries

  • Tested with different kinds of string primary keys for full refresh and CDC

  • Confirmed performance improvement on large datasets

Performance Stats (Different PK Types)

The following stats.json outputs were collected from runs on different MySQL tables, each containing 10M records, using different primary key types.

🔢 Table with INT32 Primary Key

  • Seconds Elapsed: 184.00
  • Speed: 54,347.30 rps
  • Memory: 96 MB

🔣 Table with FLOAT64 Primary Key

  • Seconds Elapsed: 54.00
  • Speed: 185,179.58 rps
  • Memory: 36 MB

Screenshots or Recordings

https://datazip.atlassian.net/wiki/x/AYCVDg

Documentation

  • Documentation Link: [link to README, olake.io/docs, or olake-docs]
  • N/A (bug fix, refactor, or test changes only)

Related PR's (If Any):

N/A

Collaborator Author

@saksham-datazip saksham-datazip left a comment


self review

@saksham-datazip saksham-datazip force-pushed the feat/mysql-chunking-optimization branch from 0f4e0da to 86a2d91 Compare March 7, 2026 19:13
Collaborator Author

@saksham-datazip saksham-datazip left a comment


self review

@@ -28,9 +28,13 @@ package constants
//
// - Version 4: (Current Version) Unsigned int/integer/bigint map to Int64.
Collaborator Author


remove current version from it

Collaborator Author


change this todo


// checks if the pk column is numeric and evenly distributed
func IsNumericAndEvenDistributed(minVal any, maxVal any, approxRowCount int64, chunkSize int64, dataType string) (int64, int64, int64) {
icebergDataType := mysqlTypeToDataTypes[strings.ToLower(dataType)]
Collaborator


Suggested change
icebergDataType := mysqlTypeToDataTypes[strings.ToLower(dataType)]
destinationDataType := mysqlTypeToDataTypes[strings.ToLower(dataType)]

Collaborator Author


Changed

// 2. If not numeric, check for supported String strategy
if chunkStepSize == 0 {
switch strings.ToLower(dataType) {
case "char", "varchar":
Collaborator


aren't we handling other string datatypes?

Collaborator Author


We had a discussion and decided to handle only char and varchar.

case "char", "varchar":
stringSupportedPk = true
logger.Infof("%s is a string type PK", pkColumns[0])
if dataMaxLength.Valid {
Collaborator


We already have the same check in the splitEvenlyForString function.

Collaborator Author


removed

switch strings.ToLower(dataType) {
case "char", "varchar":
stringSupportedPk = true
logger.Infof("%s is a string type PK", pkColumns[0])
Collaborator


Why do we need this log?

Collaborator Author

@saksham-datazip saksham-datazip Mar 27, 2026


removed

Comment on lines +421 to +424
case len(pkColumns) == 1 && chunkStepSize > 0:
logger.Infof("Using splitEvenlyForInt Method for stream %s", stream.ID())
err = splitEvenlyForInt(chunks, chunkStepSize)
case len(pkColumns) == 1 && stringSupportedPk:
Collaborator


You already check for len(pkColumns) == 1 while computing chunkStepSize and stringSupportedPk; do we need this check here as well?

Collaborator Author


removed

for next := minBoundary + chunkStepSize; next <= maxBoundary; next += chunkStepSize {
// condition to protect from infinite loop
if next <= prev {
logger.Warnf("int precision collapse detected, falling back to SplitViaPrimaryKey for stream %s", stream.ID())
Collaborator


Can we simplify the log message, e.g. "int64 arithmetic overflow"?

Collaborator Author


changed

rangeSlice = rangeSlice[:0]
// Some chunks generated might be completely empty when boundaries greater
// than the max value and smaller than the min value exists
for rows.Next() {
Collaborator


add defer rows.Close()

Collaborator Author

@saksham-datazip saksham-datazip Mar 29, 2026


Wouldn't that be redundant, since I already call rows.Close() when an error is thrown?


// Counting the number of valid chunks generated i.e., between min and max
query, args = jdbc.MySQLCountGeneratedInRange(rangeSlice, columnCollationType, minValPadded, maxValPadded)
err = m.client.QueryRowContext(ctx, query, args...).Scan(&validChunksCount)
Collaborator


validChunksCount will always be equal to len(rangeSlice). What do you think?

Collaborator Author

@saksham-datazip saksham-datazip Mar 29, 2026


No, this is an edge case: some generated boundaries can sort below the min or above the max under the column's collation, which is why I added the retry logic.

Example:

Min = 'M' (ASCII code 77)
Max = 'z' (ASCII code 122)

The generated boundary slice might look like:

[M, V, a, f, k, p, u, z]

After sorting under a case-insensitive collation, it becomes:

[a, f, k, M, p, u, V, z]

Clearly the effective min and max have changed, and the number of valid chunks is smaller than expected, so validChunksCount is not equal to len(rangeSlice) and we do need this function. With the retry logic we can simply generate more valid chunk boundaries.
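The reordering in this example can be reproduced with a small sketch, approximating a case-insensitive MySQL collation (e.g. a *_ci collation) by comparing lowercased values:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// ciSort sorts boundary strings the way a case-insensitive (*_ci) MySQL
// collation would order them, approximated here by lowercased comparison.
func ciSort(vals []string) []string {
	out := append([]string(nil), vals...)
	sort.Slice(out, func(i, j int) bool {
		return strings.ToLower(out[i]) < strings.ToLower(out[j])
	})
	return out
}

func main() {
	// Boundaries generated in the byte-wise (ASCII) numeric space.
	fmt.Println(ciSort([]string{"M", "V", "a", "f", "k", "p", "u", "z"}))
	// prints [a f k M p u V z]
}
```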

Collaborator


Since we are expecting the chunks between [a, f, k, M] to be empty, can we remove them?

Comment on lines +393 to +414
prev := rangeSlice[0]
chunks.Insert(types.Chunk{
Min: nil,
Max: prev,
})

for idx := range rangeSlice {
if idx == 0 {
continue
}
currVal := rangeSlice[idx]
chunks.Insert(types.Chunk{
Min: prev,
Max: currVal,
})
prev = currVal
}

chunks.Insert(types.Chunk{
Min: prev,
Max: nil,
})
Collaborator


can we simplify this?
for example

// Open-ended first chunk
chunks.Insert(types.Chunk{Min: nil, Max: rangeSlice[0]})

// Middle chunks
for i := 1; i < len(rangeSlice); i++ {
    chunks.Insert(types.Chunk{Min: rangeSlice[i-1], Max: rangeSlice[i]})
}

// Open-ended last chunk
chunks.Insert(types.Chunk{Min: rangeSlice[len(rangeSlice)-1], Max: nil})

Collaborator Author


changed

lowerCond, lowerArgs := buildBound(lowerValues, true)
upperCond, upperArgs := buildBound(upperValues, false)
if lowerCond != "" && upperCond != "" {
chunkCond = fmt.Sprintf("(%s) AND (%s)", lowerCond, upperCond)
Collaborator


buildBound already wraps its return in ()

Collaborator Author


I changed this function from hardcoded values to accepting arguments, with minimal changes:

buildLexicographicChunkCondition

sort.Strings(pkColumns)

if len(pkColumns) > 0 {
minVal, maxVal, err = m.getTableExtremes(ctx, stream, pkColumns)
Collaborator


Can you check this once? Previously, this query was part of a transaction that acquired a repeatable read lock, will it be okay now, or could it still cause any issues?

pkg/jdbc/jdbc.go Outdated
FROM (
%s
) AS t
ORDER BY val COLLATE %s;
Collaborator


Redundant COLLATE, since val in the ORDER BY already refers to the aliased expression.

Collaborator Author


removed

pkg/jdbc/jdbc.go Outdated
func MySQLDistinctValuesWithCollationQuery(values []string, columnCollationType string) (string, []any) {
unionParts := make([]string, 0, len(values))
args := make([]any, 0, len(values))
for _, v := range values {
Collaborator


Is len(values) == 0 possible here?

Collaborator Author


No; the table has at least one row, so min and max exist, and therefore len(values) > 0.

