Conversation

@jecsand838 (Contributor)
Which issue does this PR close?

  • Closes #8934 ([arrow-avro] RecordEncoder Bugs).

Rationale for this change

The arrow-avro writer currently fails on two classes of valid Arrow inputs:

  1. Nullable Struct with non‑nullable children + row‑wise sliced encoding

    When encoding a RecordBatch row by row, a nullable Struct field whose child is non‑nullable can cause the writer to fail with "Invalid argument error: Avro site '{field}' is non-nullable, but array contains nulls", even when the parent Struct is null at that row and the child value should therefore be ignored (see the reproduction sketch at the end of this section).

  2. Dense UnionArray with non‑zero, non‑consecutive type ids

    A dense UnionArray whose UnionFields use type ids such as 2 and 5 will currently fail with a SchemaError("Binding and field mismatch"), even though this layout is valid per Arrow’s union semantics.

This PR updates the RecordEncoder to resolve both of these issues and better respect Arrow’s struct/union semantics.
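
The first failure mode can be reproduced directly with the arrow APIs. Below is a minimal sketch, assuming the current arrow-array / arrow-buffer / arrow-schema crates; the field names and values are illustrative, not taken from the PR:

```rust
use std::sync::Arc;

use arrow_array::{Array, ArrayRef, Int32Array, RecordBatch, StructArray};
use arrow_buffer::NullBuffer;
use arrow_schema::{DataType, Field, Fields, Schema};

fn main() {
    // Non-nullable child whose slot at row 1 is null; that row is masked
    // out by the parent struct's null bitmap below, so the null is legal.
    let child_field = Arc::new(Field::new("x", DataType::Int32, false));
    let child: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), None, Some(3)]));

    // Parent struct is null at row 1, so the child value there is "don't care".
    let parent_nulls = NullBuffer::from(vec![true, false, true]);
    let s = StructArray::new(
        Fields::from(vec![child_field]),
        vec![child],
        Some(parent_nulls),
    );

    let schema = Arc::new(Schema::new(vec![Field::new(
        "s",
        s.data_type().clone(),
        true,
    )]));
    let batch = RecordBatch::try_new(schema, vec![Arc::new(s) as ArrayRef]).unwrap();

    // Row-wise sliced encoding, as in the issue's reproducer: before this PR,
    // handing each one-row slice to the arrow-avro writer failed with the
    // "non-nullable, but array contains nulls" error once the slice covered
    // the masked null at row 1.
    for i in 0..batch.num_rows() {
        let _row = batch.slice(i, 1);
        // writer.write(&_row) would go here.
    }
}
```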

What changes are included in this PR?

This PR touches only the arrow-avro writer implementation, specifically arrow-avro/src/writer/encoder.rs and arrow-avro/src/writer/mod.rs.

1. Fix nullable struct + non‑nullable child handling

  • Adjusts the RecordEncoder / StructEncoder path so that child‑field null validation is masked by the parent Struct’s null bitmap (a conceptual sketch of the rule follows this list).
  • For rows where the parent Struct value is null, the encoder now skips encoding the non‑nullable children for that row, instead of treating any child‑side nulls as a violation of the Avro site’s nullability.
  • This ensures that row‑wise encoding of a sliced RecordBatch, such as the one in the issue’s reproduction test, now succeeds without triggering "Invalid argument error: Avro site '{field}' is non-nullable, but array contains nulls".
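
Conceptually, the masking rule amounts to the check sketched below. This is an illustration of the idea only, not the encoder's actual code, and the function name is hypothetical:

```rust
use arrow_buffer::NullBuffer;

/// Returns true only if the child has a null at a row where the parent
/// struct itself is valid, i.e. a "real" null that a non-nullable Avro
/// site must reject. Child nulls hidden behind parent nulls are ignored.
fn has_unmasked_null(
    child_nulls: Option<&NullBuffer>,
    parent_nulls: Option<&NullBuffer>,
) -> bool {
    match (child_nulls, parent_nulls) {
        // No child nulls at all: nothing to reject.
        (None, _) => false,
        // Parent is never null: every child null is visible.
        (Some(c), None) => c.null_count() > 0,
        // Otherwise a child null only counts where the parent is valid.
        (Some(c), Some(p)) => (0..c.len()).any(|i| c.is_null(i) && p.is_valid(i)),
    }
}
```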

2. Support dense unions with non‑zero, non‑consecutive type ids

  • Updates the union encoding path (UnionEncoder) so that it no longer assumes Arrow dense union type IDs are 0..N-1.
  • The encoder now builds an explicit mapping from Arrow type_ids (as declared in UnionFields) to Avro union branch indices (a sketch of the mapping follows this list), and uses it when:
    • constructing the union’s Avro schema binding, and
    • writing out the branch index and value for each union element.
  • As a result, dense unions with type ids such as 2 and 5 now encode successfully, matching Arrow’s semantics, which only require type ids to be consistent with UnionFields, not contiguous or zero‑based.
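
A minimal sketch of that lookup, assuming arrow's UnionFields iterator; the function name and table shape are illustrative rather than the encoder's actual implementation:

```rust
use arrow_schema::UnionFields;

/// Map each declared Arrow type id (0..=127 per the Arrow spec) to the
/// positional Avro union branch index, in UnionFields declaration order.
fn branch_lookup(fields: &UnionFields) -> [Option<usize>; 128] {
    let mut lut = [None; 128];
    for (branch, (type_id, _field)) in fields.iter().enumerate() {
        lut[type_id as usize] = Some(branch);
    }
    lut
}
```

At write time the encoder can then translate each element through the table, e.g. lut[array.type_id(row) as usize], instead of assuming the type id equals the branch index.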

3. Regression tests for both bugs

Adds targeted regression tests under arrow-avro/src/writer/mod.rs’s test module to validate the fixes:

  1. test_nullable_struct_with_nonnullable_field_sliced_encoding
  • Builds the nullable Struct + non‑nullable child scenario from the issue.
  • Encodes the RecordBatch one row at a time via WriterBuilder::new(schema).with_fingerprint_strategy(FingerprintStrategy::Id(1)).build::<_, AvroSoeFormat>(...) and asserts that every row encodes successfully.
  2. test_nullable_struct_with_decimal_and_timestamp_sliced
  • Constructs a RecordBatch containing nullable Struct fields populated with Decimal128 and TimestampMicrosecond values to verify encoding of complex nested data.
  • Encodes the RecordBatch one row at a time using AvroSoeFormat and FingerprintStrategy::Id(1), asserting that each sliced row encodes successfully.
  3. non_nullable_child_in_nullable_struct_should_encode_per_row
  • Builds a test case with a nullable Struct column containing a non‑nullable child field, alongside a timestamp column.
  • Slices a single row from the batch and writes it via AvroSoeFormat, asserting that writer.write returns Ok to confirm the fix for sliced encoding.
  4. test_union_nonzero_type_ids
  • Constructs a dense UnionArray whose UnionFields use type ids [2, 5] and a mix of string/int values (a construction sketch follows below).
  • Encodes via AvroWriter and asserts that writing and finishing the writer both succeed without error.

Together these tests reproduce the failures described in #8934 and confirm that the new encoder behavior handles them correctly.
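
For reference, the dense union exercised by test_union_nonzero_type_ids can be constructed along the following lines; a sketch assuming arrow's UnionArray::try_new API, with illustrative values:

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, Int32Array, StringArray, UnionArray};
use arrow_buffer::ScalarBuffer;
use arrow_schema::{DataType, Field, UnionFields};

fn build_union() -> UnionArray {
    // Type ids 2 and 5: valid per Arrow's union semantics, but previously
    // rejected by the writer with SchemaError("Binding and field mismatch").
    let fields = UnionFields::new(
        vec![2i8, 5],
        vec![
            Field::new("str", DataType::Utf8, false),
            Field::new("int", DataType::Int32, false),
        ],
    );
    // Three slots: string "a", int 7, string "b".
    let type_ids = ScalarBuffer::from(vec![2i8, 5, 2]);
    // Dense layout: each offset points into the matching child array.
    let offsets = ScalarBuffer::from(vec![0i32, 0, 1]);
    let children: Vec<ArrayRef> = vec![
        Arc::new(StringArray::from(vec!["a", "b"])),
        Arc::new(Int32Array::from(vec![7])),
    ];
    UnionArray::try_new(fields, type_ids, Some(offsets), children).unwrap()
}
```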

Are these changes tested?

Yes.

  • New unit tests are added for both regression scenarios (nullable struct + non‑nullable child, and dense union with non‑zero & non‑consecutive type ids).
  • Existing writer / reader integration tests (round‑trip tests, nested record tests, etc.) continue to pass unchanged, confirming that previously tested behavior and the public API remain intact.

Are there any user-facing changes?

  1. Behavioral change (bug fix):
  • Previously, valid and supported Arrow inputs could cause the Avro writer to error or panic in the two scenarios described above.
  • After this change, those inputs encode successfully and produce Avro output consistent with the generated or provided Avro schema.
  2. APIs and configuration:
  • No public APIs, types, or configuration options are added, removed, or renamed.
  • The on‑wire Avro representation for already‑supported layouts is unchanged; the encoder simply now accepts valid Arrow layouts that previously failed.

This change is strictly a backwards‑compatible bug fix that makes the arrow-avro writer behave as expected.

@github-actions bot added the arrow and arrow-avro labels on Nov 30, 2025.
@jecsand838 force-pushed the fix-encoder-nullability-bug branch from 81cdda2 to 6bcba74 on November 30, 2025 at 09:37.