[All]Fix compression by elenagaljak-db · Pull Request #385 · databricks/zerobus-sdk

elenagaljak-db · 2026-06-18T09:18:04Z

What changes are proposed in this pull request?

Arrow Flight ingestion was over-splitting every batch decoded from Arrow IPC bytes into many tiny FlightData messages, which inflated message counts and made IPC compression ineffective (compressing each tiny chunk independently yields almost nothing). This patches the Flight encoder to size batches correctly, fixing compression and message overhead with no extra data copy.

Vendor arrow-flight 58.2.0 into rust/third_party/arrow-flight and change the size calculation in split_batch_for_grpc_response to be slice-aware:

// before
.map(|col| col.get_buffer_memory_size())
// after
.map(|col| col.to_data().get_slice_memory_size().unwrap_or_else(|_| col.get_buffer_memory_size()))
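
To illustrate why the capacity-based estimate over-splits, here is a simplified, std-only model. It does not use arrow-rs: `Column`, `decode_columns`, `totals`, and `estimated_messages` are hypothetical names, and the split rule (estimated size divided by a per-message target, rounded up) is an assumption about how the encoder picks a message count, not a copy of its code.

```rust
use std::rc::Rc;

/// Toy stand-in for a column decoded zero-copy from an IPC message body:
/// an (offset, len) view into one allocation shared by all columns.
struct Column {
    shared: Rc<Vec<u8>>,
    offset: usize,
    len: usize,
}

impl Column {
    /// The bytes this column actually views.
    fn bytes(&self) -> &[u8] {
        &self.shared[self.offset..self.offset + self.len]
    }

    /// Capacity-style estimate: reports the whole shared allocation.
    fn capacity_based_size(&self) -> usize {
        self.shared.len()
    }

    /// Slice-aware estimate: reports only the viewed bytes.
    fn slice_aware_size(&self) -> usize {
        self.bytes().len()
    }
}

/// Build `n` equal-sized columns over one shared body (hypothetical helper;
/// assumes `n > 0` and that `n` divides `body_len`).
fn decode_columns(body_len: usize, n: usize) -> Vec<Column> {
    let body = Rc::new(vec![0u8; body_len]);
    let per = body_len / n;
    (0..n)
        .map(|i| Column { shared: Rc::clone(&body), offset: i * per, len: per })
        .collect()
}

/// Sum both estimates over all columns: (capacity-based, slice-aware).
fn totals(columns: &[Column]) -> (usize, usize) {
    let naive: usize = columns.iter().map(Column::capacity_based_size).sum();
    let aware: usize = columns.iter().map(Column::slice_aware_size).sum();
    (naive, aware)
}

/// Assumed split rule: ceil(estimated_size / target), at least one message.
fn estimated_messages(estimated_size: usize, target: usize) -> usize {
    ((estimated_size + target - 1) / target).max(1)
}
```

In this model, eight columns sharing an 8 KiB body total 64 KiB under the capacity-based sum but 8 KiB under the slice-aware sum. With a 4 KiB target that is 16 estimated messages versus 2, so the over-count grows with the number of columns sharing the allocation.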

## How is this tested?

New unit tests; also tested manually.

nikolaobradovic-db · 2026-06-18T10:05:30Z

+        .collect();
+    RecordBatch::try_new(batch.schema(), columns)
+        .expect("compacted batch preserves the original schema and row count")
+}


This must not be forgotten: it will impact performance in the SDKs, as will leaving the issue unfixed. It should be removed as soon as the issue is fixed in Arrow. Do we know the rough performance impact of this change?
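
As context for the performance question, compaction of this kind costs one copy of each column's live bytes. The following is a simplified, std-only model, not the `compact_record_batch` from this diff and not arrow-rs: `View`, `view_of`, `backing_size`, and `compact` are hypothetical names chosen for illustration.

```rust
use std::rc::Rc;

/// Toy stand-in for a buffer sliced out of a larger shared allocation.
struct View {
    shared: Rc<Vec<u8>>,
    offset: usize,
    len: usize,
}

impl View {
    /// The live bytes this view refers to.
    fn bytes(&self) -> &[u8] {
        &self.shared[self.offset..self.offset + self.len]
    }

    /// Size of the allocation backing this view (what a capacity-based
    /// estimate would report).
    fn backing_size(&self) -> usize {
        self.shared.len()
    }

    /// Copy the live bytes into a fresh allocation of exactly `len` bytes,
    /// so the backing size matches the data size. Costs one copy of `len` bytes.
    fn compact(&self) -> View {
        View { shared: Rc::new(self.bytes().to_vec()), offset: 0, len: self.len }
    }
}

/// Hypothetical helper: a view of `len` bytes at `offset` inside a
/// `body_len`-byte body filled with a repeating byte pattern.
/// Assumes `offset + len <= body_len`.
fn view_of(body_len: usize, offset: usize, len: usize) -> View {
    let body: Vec<u8> = (0..body_len).map(|i| (i % 251) as u8).collect();
    View { shared: Rc::new(body), offset, len }
}
```

In this model, a 1 KiB view into an 8 KiB body reports a backing size of 8 KiB before `compact()` and 1 KiB after, with identical contents. The price is the extra copy, which is the overhead the comment above asks about.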

teodordelibasic-db

LGTM, let's just revert the Python/Java changes as discussed offline. Also, do we know if a fix for this issue is in progress in Arrow?

teodordelibasic-db · 2026-06-18T12:39:50Z

-        stream.ingest_batch(batches.remove(0)).await
+        // The IPC reader is zero-copy and slices all columns out of one shared
+        // allocation, which makes the Flight encoder massively over-estimate the
+        // batch size and over-split it (arrow-rs#9388). Compact to restore accurate


nit: These references would be better written as either apache/arrow-rs#9388 or the full URL https://github.com/apache/arrow-rs/issues/9388.

teodordelibasic-db · 2026-06-18T12:40:00Z

+/// each `Buffer::capacity()` reports the whole message-body size. This makes the
+/// Flight encoder over-estimate the batch and over-split it (and defeats IPC
+/// compression). Re-allocating restores accurate size estimates. Workaround for
+/// arrow-rs#9388 / #5352.


Same applies here.

teodordelibasic-db · 2026-06-18T12:40:20Z

    match reader.next() {
-        None => Ok(batch),
+        // Compact so the Flight encoder estimates batch size correctly (see
+        // `compact_record_batch` and arrow-rs#9388).


Signed-off-by: elenagaljak-db <elena.galjak@databricks.com>

elenagaljak-db marked this pull request as ready for review June 18, 2026 09:20

elenagaljak-db requested a review from nikolaobradovic-db June 18, 2026 09:20

nikolaobradovic-db reviewed Jun 18, 2026

teodordelibasic-db reviewed Jun 18, 2026

elenagaljak-db force-pushed the elenagaljak-db-fix_compression branch from 06c42bd to d5791d8 June 18, 2026 14:21

fork arrow

77bfec5

Signed-off-by: elenagaljak-db <elena.galjak@databricks.com>

elenagaljak-db force-pushed the elenagaljak-db-fix_compression branch from d5791d8 to 77bfec5 June 18, 2026 14:33

elenagaljak-db added 2 commits June 18, 2026 18:24

fix

b1661c8

Signed-off-by: elenagaljak-db <elena.galjak@databricks.com>

fix

443ee13

Signed-off-by: elenagaljak-db <elena.galjak@databricks.com>

elenagaljak-db wants to merge 3 commits into main from elenagaljak-db-fix_compression
