feat: mysql chunking optimisation#797

Open
saksham-datazip wants to merge 41 commits into staging from
feat/mysql-chunking-optimization

Conversation

@saksham-datazip
Collaborator

@saksham-datazip saksham-datazip commented Jan 27, 2026

Description

This PR improves the MySQL chunking strategy with the primary goal of significantly reducing chunk generation time for large tables during incremental reads.

To achieve this, two mathematical chunking strategies were introduced based on the primary key type, replacing repeated database-based chunk discovery.

Numeric Primary Keys

The numeric range [min, max] is divided using an arithmetic progression to generate evenly spaced chunk boundaries. This allows chunk boundaries to be computed mathematically instead of relying on repeated database lookups, significantly reducing chunking time.
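As a rough sketch of this strategy (function and variable names are illustrative, not the actual implementation), the arithmetic-progression boundary generation might look like:

```go
package main

import "fmt"

// numericBoundaries splits [min, max] into roughly chunkCount evenly sized
// ranges using an arithmetic progression and returns the interior boundaries.
// This avoids repeated database lookups: only min/max need to be queried.
func numericBoundaries(min, max, chunkCount int64) []int64 {
	if chunkCount <= 1 || max <= min {
		return nil
	}
	step := (max - min) / chunkCount
	if step == 0 {
		step = 1
	}
	var out []int64
	prev := min
	for next := min + step; next < max; next += step {
		if next <= prev { // guard against int64 overflow / precision collapse
			break
		}
		out = append(out, next)
		prev = next
	}
	return out
}

func main() {
	fmt.Println(numericBoundaries(0, 100, 4)) // prints [25 50 75]
}
```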

String Primary Keys

String values are mapped into a numeric space using Unicode encoding (big.Int) and then split into balanced ranges. These candidate boundaries are then aligned with actual database values using collation-aware queries to maintain correct ordering.
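A minimal sketch of the idea, assuming a fixed byte-prefix encoding for brevity (the real implementation encodes Unicode values and then aligns the candidates with actual rows via collation-aware queries):

```go
package main

import (
	"fmt"
	"math/big"
)

const prefixLen = 4 // hypothetical fixed prefix length for this sketch

// encode maps a string's first prefixLen bytes into a big.Int (base 256),
// right-padding with zero bytes so all values share the same scale.
func encode(s string) *big.Int {
	buf := make([]byte, prefixLen)
	copy(buf, s)
	return new(big.Int).SetBytes(buf)
}

// decode converts a big.Int back into a candidate boundary string,
// trimming the trailing zero padding.
func decode(n *big.Int) string {
	buf := n.FillBytes(make([]byte, prefixLen))
	end := len(buf)
	for end > 0 && buf[end-1] == 0 {
		end--
	}
	return string(buf[:end])
}

// stringBoundaries splits [min, max] into chunkCount balanced ranges in the
// numeric space and returns the interior candidate boundaries.
func stringBoundaries(min, max string, chunkCount int64) []string {
	lo, hi := encode(min), encode(max)
	span := new(big.Int).Sub(hi, lo)
	step := new(big.Int).Div(span, big.NewInt(chunkCount))
	if step.Sign() <= 0 {
		return nil
	}
	var out []string
	cur := new(big.Int).Add(lo, step)
	for cur.Cmp(hi) < 0 {
		out = append(out, decode(cur))
		cur = new(big.Int).Add(cur, step)
	}
	return out
}

func main() {
	fmt.Println(stringBoundaries("a", "z", 5)) // prints [f k p u]
}
```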

These strategies substantially reduce the number of database round trips required for chunk discovery, resulting in faster chunk generation and improved performance for large datasets.

As part of this work, several edge cases in chunk boundary calculation were also addressed, particularly around MySQL collation-aware ordering for string primary keys. The implementation aligns generated boundaries with actual database values using collation-aware queries, ensuring correct range generation and preventing missing or overlapping chunks.

Additionally, a small compatibility fix was introduced in refractor.go. Previously, some queries used hardcoded SQL strings, which caused MySQL to return numeric values as uint64. After switching to parameterized queries, the Go MySQL driver began returning these values as []uint8 (byte slices).

To handle this change correctly, an additional []uint8 case was added in ReformatInt64 so that these values are properly parsed and converted to int64. This ensures consistent behavior regardless of how the query result is returned by the driver.
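A sketch of the extra case (the function name follows the PR description; the surrounding cases are illustrative assumptions):

```go
package main

import (
	"fmt"
	"strconv"
)

// reformatInt64 sketches the fix described above: with parameterized queries,
// the Go MySQL driver can return numeric columns as []uint8 (the ASCII text
// of the number), which must be parsed back into an int64.
func reformatInt64(v any) (int64, error) {
	switch val := v.(type) {
	case int64:
		return val, nil
	case uint64:
		return int64(val), nil
	case []uint8: // driver returned the number as a byte slice
		return strconv.ParseInt(string(val), 10, 64)
	default:
		return 0, fmt.Errorf("unsupported type %T", v)
	}
}

func main() {
	n, err := reformatInt64([]uint8("12345"))
	fmt.Println(n, err) // prints: 12345 <nil>
}
```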

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

  • Tested MySQL chunking with INT32 primary keys

  • Tested MySQL chunking with INT64 primary keys

  • Tested MySQL chunking with FLOAT / DOUBLE primary keys

  • Verified no data loss or overlap across chunk boundaries

  • Tested with different kinds of string primary keys for full refresh and CDC

  • Confirmed performance improvement on large datasets

Performance Stats (Different PK Types)

The following stats.json outputs were collected from runs on different MySQL tables, each containing 10M records, using different primary key types.

🔢 Table with INT32 Primary Key

  • Seconds Elapsed: 184.00
  • Speed: 54,347.30 rps
  • Memory: 96 MB

🔣 Table with FLOAT64 Primary Key

  • Seconds Elapsed: 54.00
  • Speed: 185,179.58 rps
  • Memory: 36 MB

Screenshots or Recordings

https://datazip.atlassian.net/wiki/x/AYCVDg

Documentation

  • Documentation Link: [link to README, olake.io/docs, or olake-docs]
  • N/A (bug fix, refactor, or test changes only)

Related PR's (If Any):

N/A

Collaborator Author

@saksham-datazip saksham-datazip left a comment


self review

@saksham-datazip saksham-datazip force-pushed the feat/mysql-chunking-optimization branch from 0f4e0da to 86a2d91 Compare March 7, 2026 19:13
Collaborator Author

@saksham-datazip saksham-datazip left a comment


self review

@@ -28,9 +28,13 @@ package constants
//
// - Version 4: (Current Version) Unsigned int/integer/bigint map to Int64.
Collaborator Author


remove current version from it

Collaborator Author


change this todo


// checks if the pk column is numeric and evenly distributed
func IsNumericAndEvenDistributed(minVal any, maxVal any, approxRowCount int64, chunkSize int64, dataType string) (int64, int64, int64) {
icebergDataType := mysqlTypeToDataTypes[strings.ToLower(dataType)]
Collaborator


Suggested change
icebergDataType := mysqlTypeToDataTypes[strings.ToLower(dataType)]
destinationDataType := mysqlTypeToDataTypes[strings.ToLower(dataType)]

Collaborator Author


Changed

// 2. If not numeric, check for supported String strategy
if chunkStepSize == 0 {
switch strings.ToLower(dataType) {
case "char", "varchar":
Collaborator


aren't we handling other string datatypes?

Collaborator Author


We had a discussion and decided to handle only char and varchar.

case "char", "varchar":
stringSupportedPk = true
logger.Infof("%s is a string type PK", pkColumns[0])
if dataMaxLength.Valid {
Collaborator


We already have the same check in the splitEvenlyForString function.

Collaborator Author


removed

switch strings.ToLower(dataType) {
case "char", "varchar":
stringSupportedPk = true
logger.Infof("%s is a string type PK", pkColumns[0])
Collaborator


Why do we need this log?

Collaborator Author

@saksham-datazip saksham-datazip Mar 27, 2026


removed

Comment on lines +421 to +424
case len(pkColumns) == 1 && chunkStepSize > 0:
logger.Infof("Using splitEvenlyForInt Method for stream %s", stream.ID())
err = splitEvenlyForInt(chunks, chunkStepSize)
case len(pkColumns) == 1 && stringSupportedPk:
Collaborator


You already check for len(pkColumns) == 1 while computing chunkStepSize and stringSupportedPk; do we need this check here as well?

Collaborator Author


removed

for next := minBoundary + chunkStepSize; next <= maxBoundary; next += chunkStepSize {
// condition to protect from infinite loop
if next <= prev {
logger.Warnf("int precision collapse detected, falling back to SplitViaPrimaryKey for stream %s", stream.ID())
Collaborator


Can we simplify the log message, e.g. "int64 arithmetic overflow"?

Collaborator Author


changed

rangeSlice = rangeSlice[:0]
// Some chunks generated might be completely empty when boundaries greater
// than the max value and smaller than the min value exists
for rows.Next() {
Collaborator


add defer rows.Close()

Collaborator Author

@saksham-datazip saksham-datazip Mar 29, 2026


Wouldn't that be redundant, since I already call rows.Close() when an error is thrown?


// Counting the number of valid chunks generated i.e., between min and max
query, args = jdbc.MySQLCountGeneratedInRange(rangeSlice, columnCollationType, minValPadded, maxValPadded)
err = m.client.QueryRowContext(ctx, query, args...).Scan(&validChunksCount)
Collaborator


validChunksCount will always be equal to len(rangeSlice). What do you think?

Collaborator Author

@saksham-datazip saksham-datazip Mar 29, 2026


No, this is an edge case: some generated boundaries can sort below the min or above the max under the column's collation, which is why I added the retry logic.

Example:

Min = 'M' (ASCII code 77)
Max = 'z' (ASCII code 122)

The generated boundary slice might look like:

[M, V, a, f, k, p, u, z]

After sorting under a case-insensitive collation, it becomes:

[a, f, k, M, p, u, V, z]

Clearly the effective min and max have changed, and the number of valid chunks is smaller than expected, so validChunksCount is not equal to len(rangeSlice) and we do need this function. With the retry logic we can simply generate more valid chunk boundaries.
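The reordering in this example can be reproduced with a small sketch, approximating a case-insensitive MySQL collation (e.g. a *_ci collation) by comparing lowercased values:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// ciSort sorts boundary strings the way a case-insensitive (*_ci) MySQL
// collation would order them, approximated here by lowercased comparison.
func ciSort(vals []string) []string {
	out := append([]string(nil), vals...)
	sort.Slice(out, func(i, j int) bool {
		return strings.ToLower(out[i]) < strings.ToLower(out[j])
	})
	return out
}

func main() {
	// Boundaries generated in the byte-wise (ASCII) numeric space.
	fmt.Println(ciSort([]string{"M", "V", "a", "f", "k", "p", "u", "z"}))
	// prints [a f k M p u V z]
}
```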

Collaborator


Since we are expecting the chunks between [a, f, k, M] to be empty, can we remove them?

Comment on lines +393 to +414
prev := rangeSlice[0]
chunks.Insert(types.Chunk{
Min: nil,
Max: prev,
})

for idx := range rangeSlice {
if idx == 0 {
continue
}
currVal := rangeSlice[idx]
chunks.Insert(types.Chunk{
Min: prev,
Max: currVal,
})
prev = currVal
}

chunks.Insert(types.Chunk{
Min: prev,
Max: nil,
})
Collaborator


can we simplify this?
for example

// Open-ended first chunk
chunks.Insert(types.Chunk{Min: nil, Max: rangeSlice[0]})

// Middle chunks
for i := 1; i < len(rangeSlice); i++ {
    chunks.Insert(types.Chunk{Min: rangeSlice[i-1], Max: rangeSlice[i]})
}

// Open-ended last chunk
chunks.Insert(types.Chunk{Min: rangeSlice[len(rangeSlice)-1], Max: nil})

Collaborator Author


changed

lowerCond, lowerArgs := buildBound(lowerValues, true)
upperCond, upperArgs := buildBound(upperValues, false)
if lowerCond != "" && upperCond != "" {
chunkCond = fmt.Sprintf("(%s) AND (%s)", lowerCond, upperCond)
Collaborator


buildBound already wraps its return in ()

Collaborator Author


I changed this function from hardcoded values to accepting arguments, with minimal changes:

buildLexicographicChunkCondition

sort.Strings(pkColumns)

if len(pkColumns) > 0 {
minVal, maxVal, err = m.getTableExtremes(ctx, stream, pkColumns)
Collaborator


Can you check this once? Previously, this query was part of a transaction that acquired a repeatable read lock, will it be okay now, or could it still cause any issues?

pkg/jdbc/jdbc.go Outdated
FROM (
%s
) AS t
ORDER BY val COLLATE %s;
Collaborator


Redundant COLLATE, since val in the ORDER BY already refers to the aliased expression.

Collaborator Author


removed

pkg/jdbc/jdbc.go Outdated
func MySQLDistinctValuesWithCollationQuery(values []string, columnCollationType string) (string, []any) {
unionParts := make([]string, 0, len(values))
args := make([]any, 0, len(values))
for _, v := range values {
Collaborator


Is len(values) == 0 possible here?

Collaborator Author


No; the table has at least one row, so min and max exist, and therefore len(values) > 0.

