Conversation
saksham-datazip
left a comment
self review
Force-pushed from 0f4e0da to 86a2d91
saksham-datazip
left a comment
self review
constants/state_version.go
Outdated
@@ -28,9 +28,13 @@ package constants
//
// - Version 4: (Current Version) Unsigned int/integer/bigint map to Int64.
remove current version from it
utils/typeutils/reformat.go
Outdated
Change this TODO.
drivers/mysql/internal/backfill.go
Outdated
// checks if the pk column is numeric and evenly distributed
func IsNumericAndEvenDistributed(minVal any, maxVal any, approxRowCount int64, chunkSize int64, dataType string) (int64, int64, int64) {
	icebergDataType := mysqlTypeToDataTypes[strings.ToLower(dataType)]
Suggested change:
-	icebergDataType := mysqlTypeToDataTypes[strings.ToLower(dataType)]
+	destinationDataType := mysqlTypeToDataTypes[strings.ToLower(dataType)]
// 2. If not numeric, check for supported String strategy
if chunkStepSize == 0 {
	switch strings.ToLower(dataType) {
	case "char", "varchar":
Aren't we handling other string data types?
We had a discussion and decided to handle only char and varchar.
drivers/mysql/internal/backfill.go
Outdated
case "char", "varchar":
	stringSupportedPk = true
	logger.Infof("%s is a string type PK", pkColumns[0])
	if dataMaxLength.Valid {
We already have the same check in the splitEvenlyForString function.
drivers/mysql/internal/backfill.go
Outdated
switch strings.ToLower(dataType) {
case "char", "varchar":
	stringSupportedPk = true
	logger.Infof("%s is a string type PK", pkColumns[0])
drivers/mysql/internal/backfill.go
Outdated
case len(pkColumns) == 1 && chunkStepSize > 0:
	logger.Infof("Using splitEvenlyForInt Method for stream %s", stream.ID())
	err = splitEvenlyForInt(chunks, chunkStepSize)
case len(pkColumns) == 1 && stringSupportedPk:
You have already checked len(pkColumns) == 1 while computing chunkStepSize and stringSupportedPk; do we need it here as well?
drivers/mysql/internal/backfill.go
Outdated
for next := minBoundary + chunkStepSize; next <= maxBoundary; next += chunkStepSize {
	// condition to protect from infinite loop
	if next <= prev {
		logger.Warnf("int precision collapse detected, falling back to SplitViaPrimaryKey for stream %s", stream.ID())
Can we simplify the log message to something like "int64 arithmetic overflow"?
rangeSlice = rangeSlice[:0]
// Some chunks generated might be completely empty when boundaries greater
// than the max value and smaller than the min value exist
for rows.Next() {
Wouldn't it be redundant, since I already added a rows.Close() at the point where the error is thrown?
drivers/mysql/internal/backfill.go
Outdated
// Counting the number of valid chunks generated i.e., between min and max
query, args = jdbc.MySQLCountGeneratedInRange(rangeSlice, columnCollationType, minValPadded, maxValPadded)
err = m.client.QueryRowContext(ctx, query, args...).Scan(&validChunksCount)
validChunksCount will always be equal to len(rangeSlice). What do you think?
No, this is an edge case: some of the generated chunk boundaries may be smaller than the min or larger than the max, which is why I added the retry logic.
Example:
Min = 'M' (ASCII code 77)
Max = 'z' (ASCII code 122)
The generated chunk boundary slice can be something like:
[M, V, a, f, k, p, u, z]
After sorting under a case-insensitive collation, it will look like:
[a, f, k, M, p, u, V, z]
Clearly our min and max have changed, and the number of valid chunks is smaller than expected. So we do need this function, as validChunksCount is not equal to len(rangeSlice). With the retry logic we can simply get more valid chunk boundaries.
Since we expect the chunks between [a, f, k, M] to be empty, can we remove these?
drivers/mysql/internal/backfill.go
Outdated
prev := rangeSlice[0]
chunks.Insert(types.Chunk{
	Min: nil,
	Max: prev,
})

for idx := range rangeSlice {
	if idx == 0 {
		continue
	}
	currVal := rangeSlice[idx]
	chunks.Insert(types.Chunk{
		Min: prev,
		Max: currVal,
	})
	prev = currVal
}

chunks.Insert(types.Chunk{
	Min: prev,
	Max: nil,
})
Can we simplify this? For example:
// Open-ended first chunk
chunks.Insert(types.Chunk{Min: nil, Max: rangeSlice[0]})
// Middle chunks
for i := 1; i < len(rangeSlice); i++ {
chunks.Insert(types.Chunk{Min: rangeSlice[i-1], Max: rangeSlice[i]})
}
// Open-ended last chunk
chunks.Insert(types.Chunk{Min: rangeSlice[len(rangeSlice)-1], Max: nil})
lowerCond, lowerArgs := buildBound(lowerValues, true)
upperCond, upperArgs := buildBound(upperValues, false)
if lowerCond != "" && upperCond != "" {
	chunkCond = fmt.Sprintf("(%s) AND (%s)", lowerCond, upperCond)
buildBound already wraps its return in ()
I changed this function, buildLexicographicChunkCondition, from hardcoded values to accepting arguments, with minimal changes.
sort.Strings(pkColumns)

if len(pkColumns) > 0 {
	minVal, maxVal, err = m.getTableExtremes(ctx, stream, pkColumns)
Can you check this once? Previously, this query was part of a transaction that acquired a repeatable read lock, will it be okay now, or could it still cause any issues?
pkg/jdbc/jdbc.go
Outdated
FROM (
	%s
) AS t
ORDER BY val COLLATE %s;
Redundant COLLATE (val in the ORDER BY already refers to the aliased expression).
pkg/jdbc/jdbc.go
Outdated
func MySQLDistinctValuesWithCollationQuery(values []string, columnCollationType string) (string, []any) {
	unionParts := make([]string, 0, len(values))
	args := make([]any, 0, len(values))
	for _, v := range values {
Is the case len(values) == 0 possible?
No; the table has at least one row, so min and max exist in the schema and len(values) > 0.
Description
This PR improves the MySQL chunking strategy with the primary goal of significantly reducing chunk generation time for large tables during incremental reads.
To achieve this, two mathematical chunking strategies were introduced based on the primary key type, replacing repeated database-based chunk discovery.
Numeric Primary Keys
The numeric range [min, max] is divided using an arithmetic progression to generate evenly spaced chunk boundaries. This allows chunk boundaries to be computed mathematically instead of relying on repeated database lookups, significantly reducing chunking time.
String Primary Keys
String values are mapped into a numeric space using Unicode encoding (big.Int) and then split into balanced ranges. These candidate boundaries are then aligned with actual database values using collation-aware queries to maintain correct ordering.
These strategies substantially reduce the number of database round trips required for chunk discovery, resulting in faster chunk generation and improved performance for large datasets.
As part of this work, several edge cases in chunk boundary calculation were also addressed, particularly around MySQL collation-aware ordering for string primary keys. The implementation aligns generated boundaries with actual database values using collation-aware queries, ensuring correct range generation and preventing missing or overlapping chunks.
Additionally, a small compatibility fix was introduced in reformat.go. Previously, some queries used hardcoded SQL strings, which caused MySQL to return numeric values as uint64. After switching to parameterized queries, the Go MySQL driver began returning these values as []uint8 (byte slices).
To handle this change correctly, an additional []uint8 case was added in ReformatInt64 so that these values are properly parsed and converted to int64. This ensures consistent behavior regardless of how the query result is returned by the driver.
Type of change
How Has This Been Tested?
Tested MySQL chunking with INT32 primary keys
Tested MySQL chunking with INT64 primary keys
Tested MySQL chunking with FLOAT / DOUBLE primary keys
Verified no data loss or overlap across chunk boundaries
Tested different kinds of string PKs for full refresh and CDC
Confirmed performance improvement on large datasets
Performance Stats (Different PK Types)
The following stats.json outputs were collected from runs on different MySQL tables, each containing 10M records, using different primary key types.

🔢 Table with INT32 Primary Key

🔣 Table with FLOAT64 Primary Key

Screenshots or Recordings
https://datazip.atlassian.net/wiki/x/AYCVDg
Documentation
Related PR's (If Any):
N/A