19003: Add force_filter_selections to restore pushdown_filters behavior #87

base: main

Conversation
This reverts commit 1b1f46b.
Walkthrough

This pull request upgrades Arrow ecosystem crates from version 57.0.0 to 57.1.0, including the arrow, arrow-buffer, arrow-flight, arrow-ipc, arrow-ord, arrow-schema, and parquet crates. A new boolean configuration option, `force_filter_selections`, is introduced.
Pull Request Review

Summary

This PR adds a new configuration option, `force_filter_selections`.

✅ Code Quality and Best Practices

Strengths:

Minor Suggestions:

✅ Potential Bugs and Issues

No critical issues found. The implementation is solid.

Observations:

⚡ Performance Considerations

Positive:

Note:

🔒 Security Concerns

No security issues identified.

🧪 Test Coverage

Excellent test coverage.

Suggestions:

📝 Additional Notes

Dependency Updates:

Documentation:

🎯 Recommendation

APPROVE - This is a well-implemented feature. The code is production-ready. The minor suggestions above are optional enhancements, not blockers.

References

Review generated by Claude Code
Review completed. No suggestions at this time.
Comment `augment review` to trigger a new review at any time.
Actionable comments posted: 1
🧹 Nitpick comments (3)
docs/source/user-guide/configs.md (1)
81-92: `force_filter_selections` config doc is consistent with intended behavior

The new entry correctly documents the flag (default `false`) and its relationship to `pushdown_filters` and the RowSelection/Bitmap choice. The wording is clear enough and aligned with the code changes; no required edits.

datafusion/datasource-parquet/src/source.rs (1)

413-416: `force_filter_selections` plumbing into ParquetOpener looks consistent

The new accessor mirroring `table_parquet_options.global.force_filter_selections` and passing it into `ParquetOpener` follows the existing pattern for other Parquet flags; no issues from this side.

If you expect callers to tweak this per-scan (not only via `TableParquetOptions` / session config), consider adding a `with_force_filter_selections(bool)` builder for parity with `with_pushdown_filters` and `with_reorder_filters`, but that's optional.

Also applies to: 588-605
docs/source/library-user-guide/upgrading.md (1)
33-53: Tighten wording and fix minor style in the new Parquet filter section

The new section is clear and matches the behavior of `force_filter_selections`. Two small nits you may want to address:

- Lines 35-38: consider "pushed-down filters" instead of "pushed down filters".
- Lines 41-46: to align wording with the flag name, consider making it explicit that setting `datafusion.execution.parquet.force_filter_selections = true` restores the pre-57.1.0 behavior by forcing row selections instead of the new adaptive representation.

These are editorial only; the technical description looks good.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (4)

- `Cargo.lock` is excluded by `!**/*.lock`
- `datafusion/proto-common/src/generated/pbjson.rs` is excluded by `!**/generated/**`
- `datafusion/proto-common/src/generated/prost.rs` is excluded by `!**/generated/**`
- `datafusion/proto/src/generated/datafusion_proto_common.rs` is excluded by `!**/generated/**`
📒 Files selected for processing (24)

- Cargo.toml (2 hunks)
- datafusion-cli/src/main.rs (2 hunks)
- datafusion/common/src/config.rs (1 hunks)
- datafusion/common/src/file_options/parquet_writer.rs (3 hunks)
- datafusion/core/src/physical_planner.rs (1 hunks)
- datafusion/core/tests/dataframe/dataframe_functions.rs (1 hunks)
- datafusion/core/tests/parquet/filter_pushdown.rs (1 hunks)
- datafusion/core/tests/sql/select.rs (1 hunks)
- datafusion/datasource-parquet/src/opener.rs (10 hunks)
- datafusion/datasource-parquet/src/source.rs (2 hunks)
- datafusion/optimizer/src/analyzer/type_coercion.rs (1 hunks)
- datafusion/proto-common/proto/datafusion_common.proto (1 hunks)
- datafusion/proto-common/src/from_proto/mod.rs (1 hunks)
- datafusion/proto-common/src/to_proto/mod.rs (1 hunks)
- datafusion/proto/src/logical_plan/file_formats.rs (2 hunks)
- datafusion/sqllogictest/test_files/array.slt (12 hunks)
- datafusion/sqllogictest/test_files/arrow_typeof.slt (3 hunks)
- datafusion/sqllogictest/test_files/coalesce.slt (1 hunks)
- datafusion/sqllogictest/test_files/information_schema.slt (2 hunks)
- datafusion/sqllogictest/test_files/map.slt (1 hunks)
- datafusion/sqllogictest/test_files/struct.slt (9 hunks)
- datafusion/sqllogictest/test_files/window.slt (2 hunks)
- docs/source/library-user-guide/upgrading.md (1 hunks)
- docs/source/user-guide/configs.md (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-06T13:37:46.028Z
Learnt from: martin-augment
Repo: martin-augment/datafusion PR: 5
File: datafusion/core/tests/user_defined/user_defined_scalar_functions.rs:997-1005
Timestamp: 2025-11-06T13:37:46.028Z
Learning: In datafusion/core/tests/user_defined/user_defined_scalar_functions.rs, when implementing ScalarFunctionWrapper::replacement, always bounds-check placeholder_position against defaults.len() before indexing into the defaults array to prevent panics when SQL UDF bodies reference placeholders like $N where N exceeds the declared parameter count.
Applied to files:

- datafusion/core/tests/sql/select.rs
- datafusion/core/src/physical_planner.rs
🧬 Code graph analysis (3)

datafusion/common/src/config.rs (1)
- datafusion/datasource-parquet/src/source.rs (1)
  - force_filter_selections (414-416)

datafusion/core/tests/parquet/filter_pushdown.rs (1)
- datafusion/datasource-parquet/src/metrics.rs (1)
  - new (83-183)

datafusion/datasource-parquet/src/opener.rs (1)
- datafusion/datasource-parquet/src/source.rs (1)
  - force_filter_selections (414-416)
🪛 LanguageTool
docs/source/library-user-guide/upgrading.md
[grammar] ~36-~36: Use a hyphen to join words.
Context: ...e filter strategy when evaluating pushed down filters for Parquet files. This new...
(QB_NEW_EN_HYPHEN)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (rust)
🔇 Additional comments (31)
datafusion/optimizer/src/analyzer/type_coercion.rs (1)

2468-2468: LGTM! Test expectation correctly updated for explicit non-null type representation.

The updated expectation accurately reflects the schema definition where the `"entries"` field and the `"key"` field within the Struct are non-nullable, while the `"value"` field remains nullable. This change aligns with the Arrow ecosystem upgrade to version 57.1.0 mentioned in the PR summary.

datafusion-cli/src/main.rs (1)

595-597: No verification needed: metadata sizes are intrinsic parquet file properties unrelated to the feature.

The `metadata_size_bytes` values (8882, 269266, 1347) are static properties of the parquet files themselves, not affected by the `force_filter_selections` configuration. The test correctly verifies that the metadata cache tracks these values and that hit counts increase appropriately with repeated queries. The snapshot assertions are accurate as written.

datafusion/sqllogictest/test_files/coalesce.slt (1)
196-217: Array `coalesce` type expectations match updated type rendering

Switching the `arrow_typeof` expectations here to `List(Int64)` for the array `coalesce` cases is consistent with the argument types and the rest of the suite's new non-nullable list representations. No issues.

Cargo.toml (1)

94-106: Arrow/Parquet version bumps look coherent; ensure tests pass with 57.1.0

Bumping the Arrow family crates and `parquet` to `57.1.0` with the existing feature sets is consistent and should keep the workspace on a single Arrow/Parquet version. Nothing obviously problematic in the feature flags.

Please confirm the full test suite (including Parquet and sqllogictest suites) passes against these new versions in your CI environment.
Also applies to: 169-173
datafusion/sqllogictest/test_files/struct.slt (1)
56-58: Struct and list-of-struct `arrow_typeof` expectations align with schemas

All the updated `arrow_typeof` expectations for structs and arrays of structs now match the defined schemas and literals (field names, Int32/Int64/Float32/Float64, Utf8 vs Utf8View, and non-null element representations). This is consistent with the broader non-nullable type rendering changes; nothing stands out as incorrect.

Also applies to: 232-237, 416-416, 467-472, 489-490, 522-524, 549-551, 586-586, 609-615
datafusion/sqllogictest/test_files/map.slt (1)
44-48: Map `describe` output updated to match non-null entry/key representation

The new `describe data` expectations for `ints` and `strings` correctly reflect a Map with non-null `entries` and non-null `key`/`value` fields, in line with the upstream type representation and the rest of the suite's non-nullability changes. Looks good.

datafusion/sqllogictest/test_files/window.slt (2)
5907-5909: Updated error expectation matches new list type formatting

The expected error text now refers to `List(Null)`, which is consistent with the updated type pretty-printing for list literals in comparison coercion errors. This keeps the regression test aligned with the current planner diagnostics.

5935-5937: array_agg window field type updated to `nullable List(Int64)`

The physical plan expectation for `array_agg(test.c2)` now uses `nullable List(Int64)` rather than a list with a nullable element type. This matches the new schema representation for list outputs from window aggregates and keeps the EXPLAIN-based test in sync with the engine.

datafusion/sqllogictest/test_files/array.slt (10)
710-719: Updated `arrow_typeof(arrays.*)` outputs to non-null nested element types

Changing `List(nullable Int64)` to `List(Int64)` and similar for `Float64`/`Utf8` matches the new Arrow/DataFusion type display that omits inner nullability in these contexts. Looks consistent with the rest of the file.

1182-1186: `arrow_typeof` for mixed `make_array` now returns `List(LargeList(Int64))`

The type `List(LargeList(Int64))` for `make_array(make_array(1), arrow_cast(make_array(-1), 'LargeList(Int8)'))` correctly reflects unifying to a common element representation (`LargeList(Int64)`). Expectation update looks correct.

2028-2035: LargeListView TODO and error expectation are aligned

Here both the TODO and the query refer to `LargeListView(...)` types, and the expected planner error is explicit. This matches the intended "not yet supported" state and mirrors the earlier ListView case.

3320-3326: `array_concat` result type updated to `List(Utf8View)`

The mixed `Utf8` + `Utf8View` concat now returning `List(Utf8View)` (rather than `List(Utf8)`) is consistent with coercing to a view type when any argument is `Utf8View`. The value output `[1, 2, 3]` is unchanged; the type change looks intentional.

4614-4619: `make_array` forcing string array element type to `Utf8View`

The test explicitly asserts that `make_array(arrow_cast('a', 'Utf8View'), 'b', 'c', 'd')` yields `List(Utf8View)`, which matches the surrounding comment and newer string coercion rules. This expectation is coherent.

7710-7714: Fixed-size list element type made non-nullable in `fixed_size_list_array`

`arrow_typeof(f0)` now prints `FixedSizeList(2 x Int64)` instead of including nullability on the inner Int64. This lines up with other changes in the file where nested element types are printed without `nullable`. Looks good.

7740-7744: Nested fixed-size -> list cast type string normalized

`arrow_typeof(make_array(arrow_cast(f0, 'List(Int64)')))` producing `List(List(Int64))` (without nullable) matches the updated convention for nested types. The value behavior is unchanged; just the display string updated.

7797-7802: `flatten` over mixed LargeList/fixed-size lists: result types now `LargeList(...)`

The updated expectations:

- values unchanged
- types `LargeList(Int64)` and `LargeList(FixedSizeList(1 x Float64))`

fit the semantics that flattening a `LargeList` should keep the `LargeList` representation. This looks internally consistent with other flatten tests.

8340-8347: `int[]` column array types now use `Int32` in `arrow_typeof`

For `test_create_array_table`, `arrow_typeof(a)` and `arrow_typeof(c)` now show `List(Int32)` and `List(List(Int32))`, which is consistent with `int` mapping to Arrow `Int32` in this dialect. The expectations align with that mapping.

8352-8355: `text[]` cast now reported as `List(Utf8View)`

The `arrow_typeof([]::text[])` expectation changed to `List(Utf8View)`, matching the broader switch to Utf8View for textual arrays elsewhere in this file. No behavioral change beyond display; the adjustment looks correct.

datafusion/core/tests/sql/select.rs (1)
223-228: Snapshot updated to match non-null List(Int32) display

The new expected error text (`List(Int32) = Int32`) matches the updated type rendering; the test remains semantically correct.

datafusion/proto-common/proto/datafusion_common.proto (1)

515-523: New `ParquetOptions.force_filter_selections` field is wire-compatible

Adding `bool force_filter_selections = 34;` is compatible with existing proto3 usage and matches the documented default-false semantics; assuming the Rust from/to-proto mappings and `TableParquetOptions::default` were updated accordingly, this looks correct.

Please double-check that:

- `force_filter_selections` is included in all `ParquetOptions` from/to-proto conversions, and
- the default value in Rust config matches the intended default here.
datafusion/core/src/physical_planner.rs (1)
3108-3115: Struct literal incompatibility snapshot matches new non-null display

Updating the expected error text to `Struct("foo": non-null Boolean)` correctly tracks the new type formatter; no planner behavior concerns here.

datafusion/core/tests/dataframe/dataframe_functions.rs (1)

310-320: arrow_typeof(List) snapshot aligned with new `List(Int32)` rendering

The updated expectation to `List(Int32)` for all rows matches the revised type display; the test remains valid.

datafusion/sqllogictest/test_files/information_schema.slt (2)
247-247: SHOW ALL expectations updated correctly for `force_filter_selections`.

Key name, default value, placement among other parquet options, and verbose description all match the new `ParquetOptions::force_filter_selections` semantics.

Also applies to: 374-374

datafusion/common/src/config.rs (1)

697-701: New `ParquetOptions::force_filter_selections` field is well-documented and safely defaulted.

The option is grouped with related pushdown settings, its description is precise about when it applies, and `default = false` preserves existing behavior unless explicitly enabled.

datafusion/core/tests/parquet/filter_pushdown.rs (1)

639-645: Predicate cache tests cleanly cover default vs `force_filter_selections` behavior.

Adjusting the default expectations to 8/7 and adding a dedicated 8/4 test with `force_filter_selections = true` provides good regression coverage for how the predicate cache interacts with selection strategy under filter pushdown.

Also applies to: 647-666
datafusion/proto-common/src/to_proto/mod.rs (1)
851-885: Protobuf mapping for `force_filter_selections` completes ParquetOptions round-trip.

Including `force_filter_selections` here aligns the proto representation with the Rust struct and prevents silent loss of this flag across (de)serialization.

datafusion/sqllogictest/test_files/arrow_typeof.slt (1)

357-365: Updated `arrow_typeof` expectations match new list/FixedSizeList type rendering.
357-365: Updatedarrow_typeofexpectations match new list/FixedSizeList type rendering.Switching to non-nullable inner types in the printed forms for List, LargeList, and FixedSizeList keeps these fixtures consistent with the engine’s current Arrow type pretty-printer.
Also applies to: 380-388, 417-420
datafusion/proto/src/logical_plan/file_formats.rs (1)
368-400: Parquet proto wiring for `force_filter_selections` looks correct

`force_filter_selections` is passed from `global_options.global` into `ParquetOptionsProto` and back into `ParquetOptions` via the `From<&ParquetOptionsProto>` impl, matching how other boolean fields are handled. No additional logic seems required here.

Also applies to: 463-476
datafusion/common/src/file_options/parquet_writer.rs (1)
169-212: Writer options correctly ignore `force_filter_selections` while keeping tests in sync

- The destructuring in `into_writer_properties_builder` explicitly ignores `force_filter_selections`, which is correct since this flag only affects read-time filter behavior, not `WriterProperties`.
- Test helpers (`parquet_options_with_non_defaults` and `session_config_from_writer_props`) now carry `force_filter_selections` through using defaults, keeping identity/round-trip tests valid without pretending it affects writer configuration.

No further changes needed here.
Also applies to: 433-477, 546-591
datafusion/proto-common/src/from_proto/mod.rs (1)
927-947: `force_filter_selections` mapping is bidirectionally consistent

The new field is correctly mapped in both directions: `from_proto` (line 946 in from_proto/mod.rs) and `to_proto` (line 859 in to_proto/mod.rs) both use identical simple bool assignments (`force_filter_selections: value.force_filter_selections`), ensuring round-trip consistency.

datafusion/datasource-parquet/src/opener.rs (1)
56-58: `force_filter_selections` integration with Arrow reader is correct and API-verified

- Importing `RowSelectionPolicy` and threading `force_filter_selections` through `ParquetOpener` into `open` is consistent with how other Parquet options are handled.
- The `if force_filter_selections { builder = builder.with_row_selection_policy(RowSelectionPolicy::Selectors); }` block is placed appropriately so it affects row selections created later (from predicates and access plans) without touching the default path when the flag is `false`.
- Tests correctly initialize the new field to `false`, preserving existing behavior in all existing scenarios.
- Confirmed: `RowSelectionPolicy::Selectors` and `with_row_selection_policy()` are available in parquet 57.1.0 and correctly applied here.
# TODO: Enable once array_slice supports LargeListView types.
# Expected output (once supported):
# ----
# [1, 2, 3, 4, 5] [h, e, l, l, o]
query error DataFusion error: Execution error: Unsupported type 'ListView\(Int64\)'. Must be a supported arrow type name such as 'Int32' or 'Timestamp\(ns\)'. Error unknown token: ListView
query error Failed to coerce arguments to satisfy a call to 'array_slice' function:
select array_slice(arrow_cast(make_array(1, 2, 3, 4, 5), 'ListView(Int64)'), 0, 6),
  array_slice(arrow_cast(make_array('h', 'e', 'l', 'l', 'o'), 'ListView(Utf8)'), 0, 5);
Clarify TODO wording vs actual ListView usage
The TODO mentions enabling array_slice for “LargeListView types”, but the failing query here uses ListView(Int64)/ListView(Utf8). That mismatch may confuse readers about which logical type is missing support.
Consider rephrasing to “ListView types” (or explicitly mentioning both ListView and LargeListView) to match the query below.
🤖 Prompt for AI Agents
In datafusion/sqllogictest/test_files/array.slt around lines 1981 to 1988, the
TODO text refers to "LargeListView types" while the failing query immediately
below uses ListView(Int64) and ListView(Utf8); update the comment to accurately
reflect the types in use by rephrasing to "ListView types" or explicitly mention
"ListView and LargeListView" depending on which is unsupported, so the note
matches the query and avoids confusion.
value: useful; category: documentation; feedback: The CodeRabbit AI reviewer is correct that the documentation does not match the actual test below and should be updated to mention the tested types (ListView(Int64) and ListView(Utf8)).
19003: To review by AI