Skip to content

[All]Fix compression#385

Open
elenagaljak-db wants to merge 3 commits into
mainfrom
elenagaljak-db-fix_compression
Open

[All]Fix compression#385
elenagaljak-db wants to merge 3 commits into
mainfrom
elenagaljak-db-fix_compression

Conversation

@elenagaljak-db

@elenagaljak-db elenagaljak-db commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

What changes are proposed in this pull request?

Arrow Flight ingestion was over-splitting every batch decoded from Arrow IPC bytes into many tiny FlightData messages, which inflated message counts and made IPC compression ineffective (compressing each tiny chunk independently yields almost nothing). This patches the Flight encoder to size batches correctly, fixing
compression and message overhead with no extra data copy.

Vendor arrow-flight 58.2.0 into rust/third_party/arrow-flight and change the size calculation in split_batch_for_grpc_response to be slice-aware:

// before
.map(|col| col.get_buffer_memory_size())
// after
.map(|col| col.to_data().get_slice_memory_size().unwrap_or_else(|_| col.get_buffer_memory_size()))
## How is this tested?

New unit tests. Tested manually

@elenagaljak-db elenagaljak-db marked this pull request as ready for review June 18, 2026 09:20
Comment thread rust/sdk/src/arrow_stream.rs Outdated
.collect();
RecordBatch::try_new(batch.schema(), columns)
.expect("compacted batch preserves the original schema and row count")
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This must not be forgotten, it will impact perf on SDKs, as will not fixing it. It should be removed as soon as it is fixed in the arrow. Do we know the rough performance impact of this change?

@teodordelibasic-db teodordelibasic-db left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, let's just revert the Python/Java changes like discussed offline. Also, do we know if fix for this issue is in progress for Arrow?

Comment thread rust/jni/src/arrow_stream.rs Outdated
stream.ingest_batch(batches.remove(0)).await
// The IPC reader is zero-copy and slices all columns out of one shared
// allocation, which makes the Flight encoder massively over-estimate the
// batch size and over-split it (arrow-rs#9388). Compact to restore accurate

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: It seems it's better for these to either be apache/arrow-rs#9388 or the full URL https://github.com/apache/arrow-rs/issues/9388.

Comment thread rust/sdk/src/arrow_stream.rs Outdated
/// each `Buffer::capacity()` reports the whole message-body size. This makes the
/// Flight encoder over-estimate the batch and over-split it (and defeats IPC
/// compression). Re-allocating restores accurate size estimates. Workaround for
/// arrow-rs#9388 / #5352.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same applies here.

Comment thread rust/sdk/src/arrow_stream.rs Outdated
match reader.next() {
None => Ok(batch),
// Compact so the Flight encoder estimates batch size correctly (see
// `compact_record_batch` and arrow-rs#9388).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

@elenagaljak-db elenagaljak-db force-pushed the elenagaljak-db-fix_compression branch from 06c42bd to d5791d8 Compare June 18, 2026 14:21
Signed-off-by: elenagaljak-db <elena.galjak@databricks.com>
@elenagaljak-db elenagaljak-db force-pushed the elenagaljak-db-fix_compression branch from d5791d8 to 77bfec5 Compare June 18, 2026 14:33
Signed-off-by: elenagaljak-db <elena.galjak@databricks.com>
Signed-off-by: elenagaljak-db <elena.galjak@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants