Conversation

@jecsand838 (Contributor)
Which issue does this PR close?

  • Closes #8934 ([arrow-avro] RecordEncoder Bugs).

Rationale for this change

The arrow-avro writer currently fails on two classes of valid Arrow inputs:

  1. Nullable Struct with non‑nullable children + row‑wise sliced encoding

    When encoding a RecordBatch row by row, a nullable Struct field whose child is non‑nullable can cause the writer to fail with "Invalid argument error: Avro site '{field}' is non-nullable, but array contains nulls", even when the parent Struct is null at that row and the child value should therefore be ignored (see the reproduction sketch at the end of this section).

  2. Dense UnionArray with non‑zero, non‑consecutive type ids

    A dense UnionArray whose UnionFields use type ids such as 2 and 5 will currently fail with a SchemaError("Binding and field mismatch"), even though this layout is valid per Arrow’s union semantics.

This PR updates the RecordEncoder to resolve both of these issues and better respect Arrow’s struct/union semantics.
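
The first failure mode can be reproduced directly with the arrow APIs. Below is a minimal sketch, assuming the current arrow-array / arrow-buffer / arrow-schema crates; the field names and values are illustrative, not taken from the PR:

```rust
use std::sync::Arc;

use arrow_array::{Array, ArrayRef, Int32Array, RecordBatch, StructArray};
use arrow_buffer::NullBuffer;
use arrow_schema::{DataType, Field, Fields, Schema};

fn main() {
    // Non-nullable child whose slot at row 1 is null; that row is masked
    // out by the parent struct's null bitmap below, so the null is legal.
    let child_field = Arc::new(Field::new("x", DataType::Int32, false));
    let child: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), None, Some(3)]));

    // Parent struct is null at row 1, so the child value there is "don't care".
    let parent_nulls = NullBuffer::from(vec![true, false, true]);
    let s = StructArray::new(
        Fields::from(vec![child_field]),
        vec![child],
        Some(parent_nulls),
    );

    let schema = Arc::new(Schema::new(vec![Field::new(
        "s",
        s.data_type().clone(),
        true,
    )]));
    let batch = RecordBatch::try_new(schema, vec![Arc::new(s) as ArrayRef]).unwrap();

    // Row-wise sliced encoding, as in the issue's reproducer: before this PR,
    // handing each one-row slice to the arrow-avro writer failed with the
    // "non-nullable, but array contains nulls" error once the slice covered
    // the masked null at row 1.
    for i in 0..batch.num_rows() {
        let _row = batch.slice(i, 1);
        // writer.write(&_row) would go here.
    }
}
```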

What changes are included in this PR?

This PR touches only the arrow-avro writer implementation, specifically arrow-avro/src/writer/encoder.rs and arrow-avro/src/writer/mod.rs.

1. Fix nullable struct + non‑nullable child handling

  • Adjusts the RecordEncoder / StructEncoder path so that child‑field null validation is masked by the parent Struct’s null bitmap (a conceptual sketch of the rule follows this list).
  • For rows where the parent Struct value is null, the encoder now skips encoding the non‑nullable children for that row, instead of treating any child‑side nulls as a violation of the Avro site’s nullability.
  • This ensures that row‑wise encoding of a sliced RecordBatch, such as the one in the issue’s reproduction test, now succeeds without triggering "Invalid argument error: Avro site '{field}' is non-nullable, but array contains nulls".
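
Conceptually, the masking rule amounts to the check sketched below. This is an illustration of the idea only, not the encoder's actual code, and the function name is hypothetical:

```rust
use arrow_buffer::NullBuffer;

/// Returns true only if the child has a null at a row where the parent
/// struct itself is valid, i.e. a "real" null that a non-nullable Avro
/// site must reject. Child nulls hidden behind parent nulls are ignored.
fn has_unmasked_null(
    child_nulls: Option<&NullBuffer>,
    parent_nulls: Option<&NullBuffer>,
) -> bool {
    match (child_nulls, parent_nulls) {
        // No child nulls at all: nothing to reject.
        (None, _) => false,
        // Parent is never null: every child null is visible.
        (Some(c), None) => c.null_count() > 0,
        // Otherwise a child null only counts where the parent is valid.
        (Some(c), Some(p)) => (0..c.len()).any(|i| c.is_null(i) && p.is_valid(i)),
    }
}
```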

2. Support dense unions with non‑zero, non‑consecutive type ids

  • Updates the union encoding path (UnionEncoder) so that it no longer assumes Arrow dense union type IDs are 0..N-1.
  • The encoder now builds an explicit mapping from Arrow type_ids (as declared in UnionFields) to Avro union branch indices (a sketch of the mapping follows this list), and uses it when:
    • constructing the union’s Avro schema binding, and
    • writing out the branch index and value for each union element.
  • As a result, dense unions with type ids such as 2 and 5 now encode successfully, matching Arrow’s semantics, which only require type ids to be consistent with UnionFields, not contiguous or zero‑based.
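
A minimal sketch of that lookup, assuming arrow's UnionFields iterator; the function name and table shape are illustrative rather than the encoder's actual implementation:

```rust
use arrow_schema::UnionFields;

/// Map each declared Arrow type id (0..=127 per the Arrow spec) to the
/// positional Avro union branch index, in UnionFields declaration order.
fn branch_lookup(fields: &UnionFields) -> [Option<usize>; 128] {
    let mut lut = [None; 128];
    for (branch, (type_id, _field)) in fields.iter().enumerate() {
        lut[type_id as usize] = Some(branch);
    }
    lut
}
```

At write time the encoder can then translate each element through the table, e.g. lut[array.type_id(row) as usize], instead of assuming the type id equals the branch index.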

3. Regression tests for both bugs

Adds targeted regression tests under arrow-avro/src/writer/mod.rs’s test module to validate the fixes:

  1. test_nullable_struct_with_nonnullable_field_sliced_encoding
  • Builds the nullable Struct + non‑nullable child scenario from the issue.
  • Encodes the RecordBatch one row at a time via WriterBuilder::new(schema).with_fingerprint_strategy(FingerprintStrategy::Id(1)).build::<_, AvroSoeFormat>(...) and asserts that every row encodes successfully.
  2. test_nullable_struct_with_decimal_and_timestamp_sliced
  • Constructs a RecordBatch containing nullable Struct fields populated with Decimal128 and TimestampMicrosecond values to verify encoding of complex nested data.
  • Encodes the RecordBatch one row at a time using AvroSoeFormat and FingerprintStrategy::Id(1), asserting that each sliced row encodes successfully.
  3. non_nullable_child_in_nullable_struct_should_encode_per_row
  • Builds a test case with a nullable Struct column containing a non‑nullable child field, alongside a timestamp column.
  • Slices a single row from the batch and writes it via AvroSoeFormat, asserting that writer.write returns Ok to confirm the fix for sliced encoding.
  4. test_union_nonzero_type_ids
  • Constructs a dense UnionArray whose UnionFields use type ids [2, 5] and a mix of string/int values (a construction sketch follows below).
  • Encodes via AvroWriter and asserts that writing and finishing the writer both succeed without error.

Together these tests reproduce the failures described in #8934 and confirm that the new encoder behavior handles them correctly.
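
For reference, the dense union exercised by test_union_nonzero_type_ids can be constructed along the following lines; a sketch assuming arrow's UnionArray::try_new API, with illustrative values:

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, Int32Array, StringArray, UnionArray};
use arrow_buffer::ScalarBuffer;
use arrow_schema::{DataType, Field, UnionFields};

fn build_union() -> UnionArray {
    // Type ids 2 and 5: valid per Arrow's union semantics, but previously
    // rejected by the writer with SchemaError("Binding and field mismatch").
    let fields = UnionFields::new(
        vec![2i8, 5],
        vec![
            Field::new("str", DataType::Utf8, false),
            Field::new("int", DataType::Int32, false),
        ],
    );
    // Three slots: string "a", int 7, string "b".
    let type_ids = ScalarBuffer::from(vec![2i8, 5, 2]);
    // Dense layout: each offset points into the matching child array.
    let offsets = ScalarBuffer::from(vec![0i32, 0, 1]);
    let children: Vec<ArrayRef> = vec![
        Arc::new(StringArray::from(vec!["a", "b"])),
        Arc::new(Int32Array::from(vec![7])),
    ];
    UnionArray::try_new(fields, type_ids, Some(offsets), children).unwrap()
}
```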

Are these changes tested?

Yes.

  • New unit tests are added for both regression scenarios (nullable struct + non‑nullable child, and dense union with non‑zero & non‑consecutive type ids).
  • Existing writer / reader integration tests (round‑trip tests, nested record tests, etc.) continue to pass unchanged, confirming that previously tested behavior and the public API remain intact.

Are there any user-facing changes?

  1. Behavioral change (bug fix):
  • Previously, valid and supported Arrow inputs could cause the Avro writer to error or panic in the two scenarios described above.
  • After this change, those inputs encode successfully and produce Avro output consistent with the generated or provided Avro schema.
  2. APIs and configuration:
  • No public APIs, types, or configuration options are added, removed, or renamed.
  • The on‑wire Avro representation for already‑supported layouts is unchanged; the encoder simply now accepts valid Arrow layouts that previously failed.

This change is strictly a backwards‑compatible bug fix that makes the arrow-avro writer behave as expected.

@github-actions bot added the arrow and arrow-avro labels on Nov 30, 2025.
@jecsand838 force-pushed the fix-encoder-nullability-bug branch from 81cdda2 to 6bcba74 on November 30, 2025 at 09:37.