fix: Resolve Avro RecordEncoder bugs related to nullable Struct fields and Union type ids #8935
Which issue does this PR close?
Rationale for this change
The arrow-avro writer currently fails on two classes of valid Arrow inputs:

- Nullable Struct with non-nullable children + row-wise sliced encoding

  When encoding a RecordBatch row by row, a nullable Struct field whose child is non-nullable can cause the writer to error with "Invalid argument error: Avro site '{field}' is non-nullable, but array contains nulls", even when the parent Struct is null at that row and the child value should be ignored.

- Dense UnionArray with non-zero, non-consecutive type ids

  A dense UnionArray whose UnionFields use type ids such as 2 and 5 will currently fail with a SchemaError("Binding and field mismatch"), even though this layout is valid per Arrow's union semantics.

This PR updates the RecordEncoder to resolve both of these issues and better respect Arrow's struct/union semantics.

What changes are included in this PR?
This PR touches only the arrow-avro writer implementation, specifically arrow-avro/src/writer/encoder.rs and arrow-avro/src/writer/mod.rs.

1. Fix nullable struct + non-nullable child handling
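The masking idea behind this fix can be sketched in plain Rust without any Arrow dependency. The function name and the boolean-slice representation of validity below are illustrative assumptions, not the arrow-avro implementation: a null in a non-nullable child counts as a violation only at rows where the parent Struct itself is valid.

```rust
/// Hypothetical helper: validates a non-nullable child of a nullable struct.
/// `parent_valid[i]` and `child_valid[i]` are `true` when row `i` is non-null.
/// Both slices are assumed to have the same length.
fn check_non_nullable_child(
    field: &str,
    parent_valid: &[bool],
    child_valid: &[bool],
) -> Result<(), String> {
    for (row, (&parent, &child)) in parent_valid.iter().zip(child_valid).enumerate() {
        // A child null only matters when the parent struct is present at this row.
        // Under a null parent the child slot is a placeholder and is skipped.
        if parent && !child {
            return Err(format!(
                "Avro site '{field}' is non-nullable, but row {row} is null"
            ));
        }
    }
    Ok(())
}
```

With parent validity `[true, false, true]`, a child validity of `[true, false, true]` passes, because the only child null sits under a null parent. A child validity of `[true, false, false]` is rejected, because row 2 has a valid parent and a null child.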
- Updates the RecordEncoder/StructEncoder path so that child field null validation is masked by the parent Struct's null bitmap.
- When the parent Struct value is null, the encoder now skips encoding the non-nullable children for that row, instead of treating any child-side nulls as a violation of the Avro site's nullability.
- Row-by-row encoding of a RecordBatch, like the one in the issue's reproducing test, now succeeds without triggering "Invalid argument error: Avro site '{field}' is non-nullable, but array contains nulls".

2. Support dense unions with non-zero, non-consecutive type ids
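As a rough illustration of the type id handling this section concerns, the sketch below builds a lookup table from Arrow union type ids to Avro union branch positions. It uses only the standard library; the function names are hypothetical, and it assumes the Avro branches follow the order in which the type ids are declared. It is not the arrow-avro code.

```rust
/// Hypothetical helper: maps each declared Arrow union type id to the position
/// of its branch, assuming branches follow declaration order.
/// Arrow type ids are `i8`; negative or duplicate ids are rejected here.
fn branch_index_by_type_id(type_ids: &[i8]) -> Result<[Option<usize>; 128], String> {
    let mut table = [None; 128];
    for (branch, &id) in type_ids.iter().enumerate() {
        if id < 0 {
            return Err(format!("negative type id {id}"));
        }
        let slot = &mut table[id as usize];
        if slot.is_some() {
            return Err(format!("duplicate type id {id}"));
        }
        *slot = Some(branch);
    }
    Ok(table)
}

/// Looks up the branch index for one row's type id.
/// Returns `None` for ids that were never declared.
fn branch_for(table: &[Option<usize>; 128], type_id: i8) -> Option<usize> {
    usize::try_from(type_id).ok().and_then(|i| table[i])
}
```

For declared type ids `[2, 5]`, id 2 resolves to branch 0 and id 5 to branch 1, while an undeclared id such as 0 resolves to `None` instead of being treated as the first branch.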
- Updates the union encoder (UnionEncoder) so that it no longer assumes Arrow dense union type ids are 0..N-1.
- Builds a mapping from Arrow type_ids (as declared in UnionFields) to Avro union branch indices, and uses this mapping during encoding.
- Dense unions with type ids such as 2 and 5 now encode successfully, matching Arrow's semantics, which only require type ids to be consistent with UnionFields, not contiguous or zero-based.

3. Regression tests for both bugs
Adds targeted regression tests under arrow-avro/src/writer/mod.rs's test module to validate the fixes:

- test_nullable_struct_with_nonnullable_field_sliced_encoding
  - Reproduces the nullable Struct + non-nullable child scenario from the issue.
  - Encodes the RecordBatch one row at a time via WriterBuilder::new(schema).with_fingerprint_strategy(FingerprintStrategy::Id(1)).build::<_, AvroSoeFormat>(...) and asserts all rows encode successfully.
- test_nullable_struct_with_decimal_and_timestamp_sliced
  - Builds a RecordBatch containing nullable Struct fields populated with Decimal128 and TimestampMicrosecond types to verify encoding of complex nested data.
  - Encodes the RecordBatch one row at a time using AvroSoeFormat and FingerprintStrategy::Id(1), asserting that each sliced row encodes successfully.
- non_nullable_child_in_nullable_struct_should_encode_per_row
  - Builds a nullable Struct column containing a non-nullable child field, alongside a timestamp column.
  - Encodes it with AvroSoeFormat, asserting that writer.write returns Ok to confirm the fix for sliced encoding constraints.
- test_union_nonzero_type_ids
  - Builds a UnionArray whose UnionFields use type ids [2, 5] and a mix of string/int values.
  - Encodes it via AvroWriter and asserts that writing and finishing the writer both succeed without error.

Together these tests reproduce the failures described in #8934 and confirm that the new encoder behavior handles them correctly.
Are these changes tested?
Yes.
Are there any user-facing changes?
This change is strictly a non-breaking, backwards-compatible bug fix that makes the arrow-avro writer function as expected.