
Conversation

@martin-augment
Owner

19003: To review by AI

@coderabbitai

coderabbitai bot commented Dec 1, 2025

Walkthrough

This pull request upgrades Arrow ecosystem crates from version 57.0.0 to 57.1.0, including arrow, arrow-buffer, arrow-flight, arrow-ipc, arrow-ord, arrow-schema, and parquet crates. A new boolean configuration option force_filter_selections is introduced to ParquetOptions, controllable via datafusion.execution.parquet.force_filter_selections, which enforces row selection policies during Parquet filter pushdown. The feature is integrated throughout the codebase including proto definitions, ParquetOpener, and ParquetSource. Numerous test expectations are updated to reflect non-nullable type representations for nested types like List(Int64) instead of List(nullable Int64). Documentation is updated to describe the adaptive filter representation strategy.


@claude

claude bot commented Dec 1, 2025

Pull Request Review

Summary

This PR adds a new configuration option force_filter_selections to restore the previous behavior of pushdown_filters in Parquet reading. The change is necessitated by Arrow 57.1.0's new adaptive filter strategy that automatically chooses between RowSelection and Bitmap representations based on performance characteristics.

✅ Code Quality and Best Practices

Strengths:

  • Clean implementation following DataFusion's configuration pattern
  • Consistent naming convention (force_filter_selections) matches existing options
  • Proper threading of the configuration through all layers (config → source → opener → builder)
  • Good documentation in comments explaining the purpose
  • Comprehensive upgrade guide added to upgrading.md

Minor Suggestions:

  1. The comment at datafusion/core/tests/parquet/filter_pushdown.rs:641 is helpful but could be more precise about why 7 records are read (bitmap overhead).
  2. Consider adding a link to the Arrow-RS issue in the code comment at config.rs:697-700 for future maintainers.

✅ Potential Bugs and Issues

No critical issues found. The implementation is solid:

  • Proper default value (false) keeps the new adaptive behavior as the default; users must opt in explicitly to force row selections
  • All code paths correctly handle the new field
  • Proto serialization includes the new field with correct field number (34)
  • Test assertions updated to reflect the new behavior

Observations:

  • The proto field number 34 appears out of sequence in the diff but this is acceptable in protobuf
  • All test file updates correctly reflect the change in type display format (e.g., List(nullable Int32) → List(Int32)), which comes from the Arrow upgrade, not this feature

⚡ Performance Considerations

Positive:

  • This feature provides an escape hatch for users who may experience performance regressions with the new adaptive strategy
  • The new default (adaptive) is generally better for most workloads
  • Test results show the difference: 7 records read with adaptive vs 4 with forced selections in predicate_cache_pushdown_default

Note:

  • The choice between bitmap and selection representations has trade-offs. The 7 vs 4 records read in the test shows the adaptive strategy may read more rows in some cases, but this is expected: the bitmap representation can decode those rows more cheaply, so it is often more efficient overall.
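The trade-off can be made concrete with a back-of-the-envelope cost model. This is illustrative only; arrow-rs 57.1.0's actual heuristic is internal and may differ. A bitmap costs one bit per row regardless of selection shape, while a selector list costs a fixed amount per contiguous run, so fragmented selections favor bitmaps and a few long runs favor selectors:

```rust
/// Illustrative cost model only -- not the heuristic arrow-rs actually uses.
/// A bitmap costs 1 bit per row.
fn bitmap_bytes(num_rows: usize) -> usize {
    num_rows.div_ceil(8)
}

/// A RowSelection costs roughly one (skip, select) pair per contiguous
/// run of selected rows; assume ~16 bytes per RowSelector here.
fn selectors_bytes(num_runs: usize) -> usize {
    num_runs * 16
}

fn main() {
    // 1M rows, highly fragmented selection (100k runs): bitmap wins.
    assert!(bitmap_bytes(1_000_000) < selectors_bytes(100_000));
    // 1M rows, a handful of long runs: selectors win.
    assert!(selectors_bytes(10) < bitmap_bytes(1_000_000));
}
```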

🔒 Security Concerns

No security issues identified.

  • This is a read-only performance optimization flag
  • No exposure of sensitive data or attack vectors
  • Configuration is properly validated through the standard config system

🧪 Test Coverage

Excellent test coverage:

  • New test predicate_cache_pushdown_default_selections_only specifically validates the new flag behavior
  • Existing test predicate_cache_pushdown_default updated to document the new default behavior
  • Integration tests updated across multiple files (array.slt, struct.slt, etc.)
  • CLI tests updated to reflect Arrow upgrade metadata size changes

Suggestions:

  1. Consider adding a test that validates the config serialization/deserialization through proto
  2. Test coverage for the source.rs:force_filter_selections() method would be beneficial

📝 Additional Notes

Dependency Updates:

  • Arrow upgraded from 57.0.0 → 57.1.0 (appropriate for this feature)
  • lz4_flex upgraded 0.11.5 → 0.12.0 (appears unrelated but included in Cargo.lock)
  • Windows dependency consolidation in Cargo.lock is clean

Documentation:

  • Excellent user-facing documentation in upgrading.md
  • Config documentation in configs.md updated
  • Consider adding an example use case to the upgrade guide

🎯 Recommendation

APPROVE - This is a well-implemented feature that:

  1. Provides necessary backward compatibility control
  2. Follows DataFusion conventions consistently
  3. Has comprehensive test coverage
  4. Includes good documentation

The code is production-ready. The minor suggestions above are optional enhancements, not blockers.

Review generated by Claude Code


@augmentcode augmentcode bot left a comment


Review completed. No suggestions at this time.

Comment augment review to trigger a new review at any time.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
docs/source/user-guide/configs.md (1)

81-92: force_filter_selections config doc is consistent with intended behavior

The new entry correctly documents the flag (default false) and its relationship to pushdown_filters and the RowSelection/Bitmap choice. The wording is clear enough and aligned with the code changes; no required edits.

datafusion/datasource-parquet/src/source.rs (1)

413-416: force_filter_selections plumbing into ParquetOpener looks consistent

The new accessor mirroring table_parquet_options.global.force_filter_selections and passing it into ParquetOpener follows the existing pattern for other Parquet flags; no issues from this side.

If you expect callers to tweak this per‑scan (not only via TableParquetOptions / session config), consider adding a with_force_filter_selections(bool) builder for parity with with_pushdown_filters and with_reorder_filters, but that’s optional.

Also applies to: 588-605
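Such a builder method would mirror the existing flag setters. A minimal standalone sketch of the pattern (ParquetSource here is a stand-in struct, and with_force_filter_selections is the hypothetical method being suggested, not current API):

```rust
/// Stand-in for DataFusion's ParquetSource, showing only the
/// builder-style flag pattern used by with_pushdown_filters etc.
#[derive(Default, Clone)]
struct ParquetSource {
    pushdown_filters: bool,
    force_filter_selections: bool,
}

impl ParquetSource {
    /// Hypothetical per-scan override, mirroring with_pushdown_filters.
    fn with_force_filter_selections(mut self, force: bool) -> Self {
        self.force_filter_selections = force;
        self
    }

    /// Accessor matching the one added in source.rs.
    fn force_filter_selections(&self) -> bool {
        self.force_filter_selections
    }
}

fn main() {
    let source = ParquetSource::default().with_force_filter_selections(true);
    assert!(source.force_filter_selections());
}
```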

docs/source/library-user-guide/upgrading.md (1)

33-53: Tighten wording and fix minor style in the new Parquet filter section

The new section is clear and matches the behavior of force_filter_selections. Two small nits you may want to address:

  • Line 35–38: consider “pushed-down filters” instead of “pushed down filters”.
  • Lines 41–46: to align wording with the flag name, consider making it explicit that setting datafusion.execution.parquet.force_filter_selections = true restores the pre‑57.1.0 behavior by forcing row selections instead of the new adaptive representation.

These are editorial only; the technical description looks good.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c4b1361 and 6ae5d68.

⛔ Files ignored due to path filters (4)
  • Cargo.lock is excluded by !**/*.lock
  • datafusion/proto-common/src/generated/pbjson.rs is excluded by !**/generated/**
  • datafusion/proto-common/src/generated/prost.rs is excluded by !**/generated/**
  • datafusion/proto/src/generated/datafusion_proto_common.rs is excluded by !**/generated/**
📒 Files selected for processing (24)
  • Cargo.toml (2 hunks)
  • datafusion-cli/src/main.rs (2 hunks)
  • datafusion/common/src/config.rs (1 hunks)
  • datafusion/common/src/file_options/parquet_writer.rs (3 hunks)
  • datafusion/core/src/physical_planner.rs (1 hunks)
  • datafusion/core/tests/dataframe/dataframe_functions.rs (1 hunks)
  • datafusion/core/tests/parquet/filter_pushdown.rs (1 hunks)
  • datafusion/core/tests/sql/select.rs (1 hunks)
  • datafusion/datasource-parquet/src/opener.rs (10 hunks)
  • datafusion/datasource-parquet/src/source.rs (2 hunks)
  • datafusion/optimizer/src/analyzer/type_coercion.rs (1 hunks)
  • datafusion/proto-common/proto/datafusion_common.proto (1 hunks)
  • datafusion/proto-common/src/from_proto/mod.rs (1 hunks)
  • datafusion/proto-common/src/to_proto/mod.rs (1 hunks)
  • datafusion/proto/src/logical_plan/file_formats.rs (2 hunks)
  • datafusion/sqllogictest/test_files/array.slt (12 hunks)
  • datafusion/sqllogictest/test_files/arrow_typeof.slt (3 hunks)
  • datafusion/sqllogictest/test_files/coalesce.slt (1 hunks)
  • datafusion/sqllogictest/test_files/information_schema.slt (2 hunks)
  • datafusion/sqllogictest/test_files/map.slt (1 hunks)
  • datafusion/sqllogictest/test_files/struct.slt (9 hunks)
  • datafusion/sqllogictest/test_files/window.slt (2 hunks)
  • docs/source/library-user-guide/upgrading.md (1 hunks)
  • docs/source/user-guide/configs.md (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-06T13:37:46.028Z
Learnt from: martin-augment
Repo: martin-augment/datafusion PR: 5
File: datafusion/core/tests/user_defined/user_defined_scalar_functions.rs:997-1005
Timestamp: 2025-11-06T13:37:46.028Z
Learning: In datafusion/core/tests/user_defined/user_defined_scalar_functions.rs, when implementing ScalarFunctionWrapper::replacement, always bounds-check placeholder_position against defaults.len() before indexing into the defaults array to prevent panics when SQL UDF bodies reference placeholders like $N where N exceeds the declared parameter count.

Applied to files:

  • datafusion/core/tests/sql/select.rs
  • datafusion/core/src/physical_planner.rs
🧬 Code graph analysis (3)
datafusion/common/src/config.rs (1)
datafusion/datasource-parquet/src/source.rs (1)
  • force_filter_selections (414-416)
datafusion/core/tests/parquet/filter_pushdown.rs (1)
datafusion/datasource-parquet/src/metrics.rs (1)
  • new (83-183)
datafusion/datasource-parquet/src/opener.rs (1)
datafusion/datasource-parquet/src/source.rs (1)
  • force_filter_selections (414-416)
🪛 LanguageTool
docs/source/library-user-guide/upgrading.md

[grammar] ~36-~36: Use a hyphen to join words.
Context: ...e filter strategy when evaluating pushed down filters for Parquet files. This new...

(QB_NEW_EN_HYPHEN)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Analyze (rust)
🔇 Additional comments (31)
datafusion/optimizer/src/analyzer/type_coercion.rs (1)

2468-2468: LGTM! Test expectation correctly updated for explicit non-null type representation.

The updated expectation accurately reflects the schema definition where the "entries" field and the "key" field within the Struct are non-nullable, while the "value" field remains nullable. This change aligns with the Arrow ecosystem upgrade to version 57.1.0 mentioned in the PR summary.

datafusion-cli/src/main.rs (1)

595-597: No verification needed—metadata sizes are intrinsic parquet file properties unrelated to the feature.

The metadata_size_bytes values (8882, 269266, 1347) are static properties of the parquet files themselves, not affected by the force_filter_selections configuration. The test correctly verifies that the metadata cache tracks these values and that hit counts increase appropriately with repeated queries. The snapshot assertions are accurate as written.

datafusion/sqllogictest/test_files/coalesce.slt (1)

196-217: Array coalesce type expectations match updated type rendering

Switching the arrow_typeof expectations here to List(Int64) for the array coalesce cases is consistent with the argument types and the rest of the suite’s new non-nullable list representations. No issues.

Cargo.toml (1)

94-106: Arrow/Parquet version bumps look coherent; ensure tests pass with 57.1.0

Bumping the Arrow family crates and parquet to 57.1.0 with the existing feature sets is consistent and should keep the workspace on a single Arrow/Parquet version. Nothing obviously problematic in the feature flags.

Please confirm the full test suite (including Parquet and sqllogictest suites) passes against these new versions in your CI environment.

Also applies to: 169-173

datafusion/sqllogictest/test_files/struct.slt (1)

56-58: Struct and list-of-struct arrow_typeof expectations align with schemas

All the updated arrow_typeof expectations for structs and arrays of structs now match the defined schemas and literals (field names, Int32/Int64/Float32/Float64, Utf8 vs Utf8View, and non-null element representations). This is consistent with the broader non-nullable type rendering changes; nothing stands out as incorrect.

Also applies to: 232-237, 416-416, 467-472, 489-490, 522-524, 549-551, 586-586, 609-615

datafusion/sqllogictest/test_files/map.slt (1)

44-48: Map describe output updated to match non-null entry/key representation

The new describe data expectations for ints and strings correctly reflect a Map with non-null entries and non-null key/value fields, in line with the upstream type representation and the rest of the suite’s non-nullability changes. Looks good.

datafusion/sqllogictest/test_files/window.slt (2)

5907-5909: Updated error expectation matches new list type formatting

The expected error text now refers to List(Null), which is consistent with the updated type pretty-printing for list literals in comparison coercion errors. This keeps the regression test aligned with the current planner diagnostics.


5935-5937: Array_agg window field type updated to nullable List(Int64)

The physical plan expectation for array_agg(test.c2) now uses nullable List(Int64) rather than a list with a nullable element type. This matches the new schema representation for list outputs from window aggregates and keeps the EXPLAIN-based test in sync with the engine.

datafusion/sqllogictest/test_files/array.slt (10)

710-719: Updated arrow_typeof(arrays.*) outputs to non-null nested element types

Changing List(nullable Int64) to List(Int64) and similar for Float64/Utf8 matches the new Arrow/DataFusion type display that omits inner-nullability in these contexts. Looks consistent with the rest of the file.


1182-1186: arrow_typeof for mixed make_array now returns List(LargeList(Int64))

The type List(LargeList(Int64)) for make_array(make_array(1), arrow_cast(make_array(-1), 'LargeList(Int8)')) correctly reflects unifying to a common element representation (LargeList(Int64)). Expectation update looks correct.


2028-2035: LargeListView TODO and error expectation are aligned

Here both the TODO and the query refer to LargeListView(...) types, and the expected planner error is explicit. This matches the intended “not yet supported” state and mirrors the earlier ListView case.


3320-3326: array_concat result type updated to List(Utf8View)

The mixed Utf8 + Utf8View concat now returning List(Utf8View) (rather than List(Utf8)) is consistent with coercing to a view type when any argument is Utf8View. The value output [1, 2, 3] is unchanged; type change looks intentional.


4614-4619: make_array forcing string array element type to Utf8View

The test explicitly asserts that make_array(arrow_cast('a', 'Utf8View'), 'b', 'c', 'd') yields List(Utf8View), which matches the surrounding comment and newer string coercion rules. This expectation is coherent.


7710-7714: Fixed-size list element type made non-nullable in fixed_size_list_array

arrow_typeof(f0) now prints FixedSizeList(2 x Int64) instead of including nullability on the inner Int64. This lines up with other changes in the file where nested element types are printed without nullable. Looks good.


7740-7744: Nested fixed-size -> list cast type string normalized

arrow_typeof(make_array(arrow_cast(f0, 'List(Int64)'))) producing List(List(Int64)) (without nullable) matches the updated convention for nested types. The value behavior is unchanged; just the display string updated.


7797-7802: flatten over mixed LargeList/fixed-size lists: result types now LargeList(...)

The updated expectations:

  • values unchanged
  • types LargeList(Int64) and LargeList(FixedSizeList(1 x Float64))

fit the semantics that flattening a LargeList should keep the LargeList representation. This looks internally consistent with other flatten tests.


8340-8347: int[] column array types now use Int32 in arrow_typeof

For test_create_array_table, arrow_typeof(a) and arrow_typeof(c) now show List(Int32) and List(List(Int32)), which is consistent with int mapping to Arrow Int32 in this dialect. The expectations align with that mapping.


8352-8355: text[] cast now reported as List(Utf8View)

The arrow_typeof([]::text[]) expectation changed to List(Utf8View), matching the broader switch to Utf8View for textual arrays elsewhere in this file. No behavioral change beyond display; the adjustment looks correct.

datafusion/core/tests/sql/select.rs (1)

223-228: Snapshot updated to match non-null List(Int32) display

The new expected error text (List(Int32) = Int32) matches the updated type rendering; test remains semantically correct.

datafusion/proto-common/proto/datafusion_common.proto (1)

515-523: New ParquetOptions.force_filter_selections field is wire‑compatible

Adding bool force_filter_selections = 34; is compatible with existing proto3 usage and matches the documented default‑false semantics; assuming the Rust from/to‑proto mappings and TableParquetOptions::default were updated accordingly, this looks correct.

Please double‑check that:

  • force_filter_selections is included in all ParquetOptions from/to‑proto conversions, and
  • the default value in Rust config matches the intended default here.
datafusion/core/src/physical_planner.rs (1)

3108-3115: Struct literal incompatibility snapshot matches new non‑null display

Updating the expected error text to Struct("foo": non-null Boolean) correctly tracks the new type formatter; no planner behavior concerns here.

datafusion/core/tests/dataframe/dataframe_functions.rs (1)

310-320: arrow_typeof(List) snapshot aligned with new List(Int32) rendering

The updated expectation to List(Int32) for all rows matches the revised type display; the test remains valid.

datafusion/sqllogictest/test_files/information_schema.slt (1)

247-247: SHOW ALL expectations updated correctly for force_filter_selections.

Key name, default value, placement among other parquet options, and verbose description all match the new ParquetOptions::force_filter_selections semantics.

Also applies to: 374-374

datafusion/common/src/config.rs (1)

697-701: New ParquetOptions::force_filter_selections field is well-documented and safely defaulted.

The option is grouped with related pushdown settings, its description is precise about when it applies, and default = false preserves existing behavior unless explicitly enabled.

datafusion/core/tests/parquet/filter_pushdown.rs (1)

639-645: Predicate cache tests cleanly cover default vs force_filter_selections behavior.

Adjusting the default expectations to 8/7 and adding a dedicated 8/4 test with force_filter_selections = true provides good regression coverage for how the predicate cache interacts with selection strategy under filter pushdown.

Also applies to: 647-666

datafusion/proto-common/src/to_proto/mod.rs (1)

851-885: Protobuf mapping for force_filter_selections completes ParquetOptions round-trip.

Including force_filter_selections here aligns the proto representation with the Rust struct and prevents silent loss of this flag across (de)serialization.

datafusion/sqllogictest/test_files/arrow_typeof.slt (1)

357-365: Updated arrow_typeof expectations match new list/FixedSizeList type rendering.

Switching to non-nullable inner types in the printed forms for List, LargeList, and FixedSizeList keeps these fixtures consistent with the engine’s current Arrow type pretty-printer.

Also applies to: 380-388, 417-420

datafusion/proto/src/logical_plan/file_formats.rs (1)

368-400: Parquet proto wiring for force_filter_selections looks correct

force_filter_selections is passed from global_options.global into ParquetOptionsProto and back into ParquetOptions via the From<&ParquetOptionsProto> impl, matching how other boolean fields are handled. No additional logic seems required here.

Also applies to: 463-476

datafusion/common/src/file_options/parquet_writer.rs (1)

169-212: Writer options correctly ignore force_filter_selections while keeping tests in sync

  • The destructuring in into_writer_properties_builder explicitly ignores force_filter_selections, which is correct since this flag only affects read‑time filter behavior, not WriterProperties.
  • Test helpers (parquet_options_with_non_defaults and session_config_from_writer_props) now carry force_filter_selections through using defaults, keeping identity/round‑trip tests valid without pretending it affects writer configuration.

No further changes needed here.

Also applies to: 433-477, 546-591

datafusion/proto-common/src/from_proto/mod.rs (1)

927-947: force_filter_selections mapping is bidirectionally consistent

The new field is correctly mapped in both directions: from_proto (line 946 in from_proto/mod.rs) and to_proto (line 859 in to_proto/mod.rs) both use identical simple bool assignments (force_filter_selections: value.force_filter_selections), ensuring round-trip consistency.
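The round-trip property described here can be expressed as a simple identity check. The sketch below uses simplified stand-in structs for ParquetOptions and its proto counterpart, not the real generated types:

```rust
/// Simplified stand-ins for ParquetOptions and ParquetOptionsProto,
/// showing the round-trip property the from/to-proto mappings must hold.
#[derive(Debug, PartialEq, Clone)]
struct ParquetOptions {
    pushdown_filters: bool,
    force_filter_selections: bool,
}

struct ParquetOptionsProto {
    pushdown_filters: bool,
    force_filter_selections: bool, // field 34; proto3 default = false
}

fn to_proto(o: &ParquetOptions) -> ParquetOptionsProto {
    ParquetOptionsProto {
        pushdown_filters: o.pushdown_filters,
        force_filter_selections: o.force_filter_selections,
    }
}

fn from_proto(p: &ParquetOptionsProto) -> ParquetOptions {
    ParquetOptions {
        pushdown_filters: p.pushdown_filters,
        force_filter_selections: p.force_filter_selections,
    }
}

fn main() {
    let opts = ParquetOptions { pushdown_filters: true, force_filter_selections: true };
    // The round trip must be the identity for every field, including the new one.
    assert_eq!(from_proto(&to_proto(&opts)), opts);
}
```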

datafusion/datasource-parquet/src/opener.rs (1)

56-58: force_filter_selections integration with Arrow reader is correct and API-verified

  • Importing RowSelectionPolicy and threading force_filter_selections through ParquetOpener into open is consistent with how other Parquet options are handled.
  • The if force_filter_selections { builder = builder.with_row_selection_policy(RowSelectionPolicy::Selectors); } block is placed appropriately so it affects row selections created later (from predicates and access plans) without touching the default path when the flag is false.
  • Tests correctly initialize the new field to false, preserving existing behavior in all existing scenarios.
  • Confirmed: RowSelectionPolicy::Selectors and with_row_selection_policy() are available in parquet 57.1.0 and correctly applied here.
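The gating described above reduces to a small conditional. Sketched here with stand-in types (the real ones are parquet's ArrowReaderBuilder and RowSelectionPolicy, assumed from the review rather than reproduced):

```rust
/// Stand-in for parquet's RowSelectionPolicy (added in 57.1.0).
#[derive(Debug, PartialEq, Clone, Copy)]
enum RowSelectionPolicy {
    Auto,      // adaptive: reader picks selections or bitmaps
    Selectors, // always materialize RowSelections
}

/// Stand-in for the Arrow reader builder used in ParquetOpener.
struct ReaderBuilder {
    policy: RowSelectionPolicy,
}

impl ReaderBuilder {
    fn new() -> Self {
        Self { policy: RowSelectionPolicy::Auto }
    }
    fn with_row_selection_policy(mut self, policy: RowSelectionPolicy) -> Self {
        self.policy = policy;
        self
    }
}

/// Mirrors the flag handling in ParquetOpener::open: only override
/// the default (adaptive) policy when the flag is set.
fn configure(force_filter_selections: bool) -> ReaderBuilder {
    let mut builder = ReaderBuilder::new();
    if force_filter_selections {
        builder = builder.with_row_selection_policy(RowSelectionPolicy::Selectors);
    }
    builder
}

fn main() {
    assert_eq!(configure(false).policy, RowSelectionPolicy::Auto);
    assert_eq!(configure(true).policy, RowSelectionPolicy::Selectors);
}
```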

Comment on lines +1981 to 1988
# TODO: Enable once array_slice supports LargeListView types.
# Expected output (once supported):
# ----
# [1, 2, 3, 4, 5] [h, e, l, l, o]
query error DataFusion error: Execution error: Unsupported type 'ListView\(Int64\)'. Must be a supported arrow type name such as 'Int32' or 'Timestamp\(ns\)'. Error unknown token: ListView
query error Failed to coerce arguments to satisfy a call to 'array_slice' function:
select array_slice(arrow_cast(make_array(1, 2, 3, 4, 5), 'ListView(Int64)'), 0, 6),
array_slice(arrow_cast(make_array('h', 'e', 'l', 'l', 'o'), 'ListView(Utf8)'), 0, 5);


⚠️ Potential issue | 🟡 Minor

Clarify TODO wording vs actual ListView usage

The TODO mentions enabling array_slice for “LargeListView types”, but the failing query here uses ListView(Int64)/ListView(Utf8). That mismatch may confuse readers about which logical type is missing support.

Consider rephrasing to “ListView types” (or explicitly mentioning both ListView and LargeListView) to match the query below.

🤖 Prompt for AI Agents
In datafusion/sqllogictest/test_files/array.slt around lines 1981 to 1988, the
TODO text refers to "LargeListView types" while the failing query immediately
below uses ListView(Int64) and ListView(Utf8); update the comment to accurately
reflect the types in use by rephrasing to "ListView types" or explicitly mention
"ListView and LargeListView" depending on which is unsupported, so the note
matches the query and avoids confusion.

Owner Author

value:useful; category:documentation; feedback:The CodeRabbit AI reviewer is correct that the comment does not match the actual test below and should be updated to mention the tested types (ListView(Int64) and ListView(Utf8))
