Fix: Support nested struct field filtering with PyArrow (#953) #2628
+118
−21
Fixes #953
Rationale for this change
Fixes filtering on nested struct fields when using PyArrow for scan operations.
Are these changes tested?
Yes, the full test suite + new tests
Are there any user-facing changes?
Yes. Filtering a scan on a nested struct field now works.
Problem
When filtering on nested struct fields (e.g., `parentField.childField == 'value'`), PyArrow would fail with a field-not-found error.
The issue occurred because PyArrow requires nested field references as tuples (e.g., `("parent", "child")`) rather than dotted strings (e.g., `"parent.child"`).
Solution
- `_ConvertToArrowExpression` now accepts an optional `Schema` parameter.
- A `_get_field_name()` method converts dotted field paths to tuples for nested struct fields.
- `expression_to_pyarrow()` accepts the schema parameter and passes it on.
Changes
`pyiceberg/io/pyarrow.py`:
- `_ConvertToArrowExpression` class: handles nested field paths
- `expression_to_pyarrow()` signature: accepts schema
- `_expression_to_complementary_pyarrow()` signature
`pyiceberg/table/__init__.py`:
- `_expression_to_complementary_pyarrow()`: passes schema
Tests:
- `test_ref_binding_nested_struct_field()` for comprehensive nested field testing
- `test_nested_fields()` with scenarios from issue #953 ("Query on nested struct field with PyIceberg?")
Example
The fix converts the field reference from:
- `FieldRef.Name(run_id)` (fails: field not found)
to:
- `FieldRef.Nested(FieldRef.Name(mazeMetadata) FieldRef.Name(run_id))` (works!)