Enable placeholders with extension types #17986
Conversation
@timsaucer @alamb I'm happy to add tests for all these components but wanted to make sure this is vaguely headed in the right direction before I do so!
-    pub data_type: Option<DataType>,
+    pub field: Option<FieldRef>,
This is the main change. We could switch to a less severely breaking option (e.g., just add metadata: FieldMetadata to the struct); I started with the most breaking version to identify its use in as many places as possible.
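For context, roughly what the struct looks like after this change (a sketch only; the id field comes from the existing struct and the derive list is trimmed to what I'm sure of, everything else follows the diff above):

```rust
use arrow::datatypes::FieldRef;

/// Sketch of the placeholder expression after this change.
#[derive(Clone, PartialEq, Debug)]
pub struct Placeholder {
    /// The identifier of the parameter, e.g. "$1" or "$name"
    pub id: String,
    /// Previously `Option<DataType>`; now a full `FieldRef` so that
    /// extension-type metadata survives parameter binding.
    pub field: Option<FieldRef>,
}
```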
-    pub(crate) fn convert_data_type(&self, sql_type: &SQLDataType) -> Result<DataType> {
+    pub(crate) fn convert_data_type(&self, sql_type: &SQLDataType) -> Result<FieldRef> {
I think this change would also enable supporting UUIDs and other SQL types that map to extension types
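For example, a SQL UUID could resolve to a field whose storage is FixedSizeBinary(16) plus the canonical Arrow extension metadata (a sketch of the idea, not something this PR implements):

```rust
use std::collections::HashMap;
use std::sync::Arc;
use arrow::datatypes::{DataType, Field, FieldRef};

/// Sketch: map SQL `UUID` to a FixedSizeBinary(16) storage field tagged with
/// the canonical Arrow extension-type metadata. Illustrative only.
fn uuid_field(name: &str) -> FieldRef {
    let metadata = HashMap::from([(
        "ARROW:extension:name".to_string(),
        "arrow.uuid".to_string(),
    )]);
    Arc::new(Field::new(name, DataType::FixedSizeBinary(16), true).with_metadata(metadata))
}
```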
Ok(Expr::Cast(Cast::new(
    Box::new(expr),
    dt.data_type().clone(),
)))
Another place where metadata is dropped that I didn't update (casts)
I made #18060 for this one
#[derive(Clone, PartialEq, Eq, PartialOrd, Hash, Debug)]
pub struct FieldMetadata {
I just moved this from the expr crate so I could use it in ParamValues.
Thank you @paleolimbot -- I think this is definitely the right direction. I had some small API suggestions, but overall this looks good.
The largest open question in my mind is what you have highlighted for customizing behavior for different extension types (e.g. comparing two fields for "equality", printing them, casting them, etc.).
@findepi brought up the same thing many months ago when discussing adding new types.
One idea is to create a TypeRegistry (similar to a FunctionRegistry) and some sort of ExtensionType trait that encapsulates these behaviors.
The challenge would then be to thread the registry to all places that need it, though that is likely largely an API design / plumbing exercise, if you think that is an idea worth exploring.
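A minimal sketch of what such a registry could look like, just to make the idea concrete (all names and signatures here are assumptions, not an existing DataFusion API):

```rust
use std::collections::HashMap;
use std::sync::Arc;
use arrow::datatypes::{DataType, FieldRef};

/// Sketch of an extension-type trait; names and methods are assumptions.
pub trait ExtensionType: Send + Sync {
    /// Unique name, e.g. the value of `ARROW:extension:name`
    fn name(&self) -> &str;
    /// The underlying storage type
    fn storage_type(&self) -> DataType;
    /// Human-readable rendering for plans and error messages
    fn display(&self, field: &FieldRef) -> String {
        format!("{}<{}>", self.name(), field.data_type())
    }
}

/// Sketch of a registry keyed by extension name, analogous to a FunctionRegistry.
#[derive(Default)]
pub struct TypeRegistry {
    types: HashMap<String, Arc<dyn ExtensionType>>,
}

impl TypeRegistry {
    pub fn register(&mut self, t: Arc<dyn ExtensionType>) {
        self.types.insert(t.name().to_string(), t);
    }
    pub fn lookup(&self, name: &str) -> Option<Arc<dyn ExtensionType>> {
        self.types.get(name).cloned()
    }
}
```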
datafusion/common/src/param_value.rs
 impl ParamValues {
     /// Verify parameter list length and type
-    pub fn verify(&self, expect: &[DataType]) -> Result<()> {
+    pub fn verify(&self, expect: &[FieldRef]) -> Result<()> {
One thing that would be nice to help people upgrade would be to add a new function and deprecate this one -- perhaps something like what is suggested in https://datafusion.apache.org/contributor-guide/api-health.html#api-health-policy:

#[deprecated(note = "use verify_fields instead")]
pub fn verify(&self, expect: &[DataType]) -> Result<()> {
    // make dummy Fields from the DataTypes (the names don't matter here)
    let expect: Vec<FieldRef> = expect
        .iter()
        .map(|dt| Arc::new(Field::new("", dt.clone(), true)))
        .collect();
    self.verify_fields(&expect)
}

// new function that has the new signature
pub fn verify_fields(&self, expect: &[FieldRef]) -> Result<()> {
    ...
}
datafusion/common/src/param_value.rs
}

if let Some(expected_metadata) = maybe_metadata {
    // Probably too strict of a comparison (this is an example of where
Yeah, I agree that straight-up comparing strings is probably not ideal.
If we wanted to introduce type equality, I think the bigger question is how to thread it through (you would have to have some way to register your types / methods to check equality and ensure that somehow ended up here 🤔).
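For example, a softer comparison might only consider the Arrow extension-type keys rather than requiring the full metadata maps to match (a sketch; where such a helper would live and how it gets registered is exactly the open question above):

```rust
use std::collections::HashMap;

/// Sketch: compare only the Arrow extension-type keys instead of requiring
/// every metadata entry to match exactly.
fn extension_type_equal(a: &HashMap<String, String>, b: &HashMap<String, String>) -> bool {
    ["ARROW:extension:name", "ARROW:extension:metadata"]
        .iter()
        .all(|k| a.get(*k) == b.get(*k))
}
```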
I've stumbled across this PR and remembered that I implemented such a thing in #15106 for defining custom sort orders for certain types. I now have a bit more time on my hands and I'll try to revive that effort, as we are still interested in that.
Thank you for the review! I will clean this up today/tomorrow and ensure we can test all the spots.
I'd love that! Perhaps the first step from the non-sorting angle might be to just consolidate all the places we print, cast, or test equality (which are currently scattered around the code base). A vague thought, based mostly on this PR and SedonaDB's approach:

pub fn resolve_df_data_type(
    session: Option<SessionSomething>,
    storage_type: &DataType,
    metadata: impl MetadataViewOfSomeSort,
) -> Box<dyn DFDataType> {
    // sanity check to avoid hitting the lock on any registry whatsoever when not needed
    // If the session is specified, use a session registry;
    // otherwise, use a static or compile-time populated registry until such a time as the
    // session is made available everywhere and/or DataType::Extension is a thing
}

pub trait DFDataType: Clone {
    fn storage_type(&self) -> DataType;

    // e.g., the sanity check on the return field of a UDxF
    fn type_identical_to(&self, other: &dyn DFDataType) -> Result<bool>;

    // for when this has to be a storage field (it might already be one)
    fn storage_field(&self, name: Option<&str>, nullability: Option<bool>) -> FieldRef;
    fn add_to_existing_storage_field(&self /* ... */);

    // for when this has to be just metadata (e.g., the Expr::Literal)
    fn storage_field_metadata(&self) -> Option<FieldMetadata>;

    // to avoid dumping a Field debug string in user-facing errors
    fn display_type(&self) -> String;

    // for any hope of ever being able to deal with shredded/unshredded variants in UNION ALL
    // slash listing tables. This is useful for some geoarrow types too (wkb/wkb_view -> wkb)
    fn common_castable_type(&self, other: &dyn DFDataType) -> Result<Box<dyn DFDataType>> {
        /* self if types are identical or error */
    }

    // usually printing the storage is OK but not ideal (e.g., variants should be JSON-ish)
    fn format_proxy(&self, storage: ArrayRef) -> Result<ArrayRef> { Ok(storage) }

    // Or something more advanced like the dynamic comparator in the linked PR
    fn sort_proxy(&self, storage: ArrayRef) -> Result<ArrayRef> { Ok(storage) }

    // It's a shame to bleed execution in here but casting shows up a lot
    fn cast_storage(&self, storage: ArrayRef, to: Box<dyn DFDataType>);
}

// All the structures that are currently communicating that type information
impl DFDataType for FieldMetadata {
    // implement strict equality based on all metadata (or just extension name/extension metadata?)
    // printing based on metadata
    // everything else based on datatype
    // ...basically everything datafusion does right now
}

impl DFDataType for FieldRef {
    // defer to FieldMetadata
}

impl DFDataType for DataType {
    // defer to DataTypes
}
Btw, if you want another set of eyes, you can ping me once it's ready for another review.
Thank you...I'd love that! I just need to finish up a few tests...I didn't quite get there today but I'm confident I can get this ready for eyes tomorrow.
/// A [`ScalarValue`] with optional [`FieldMetadata`]
#[derive(Debug, Clone)]
pub struct Literal {
    pub value: ScalarValue,
    pub metadata: Option<FieldMetadata>,
}

impl Literal {
    /// Create a new Literal from a scalar value with optional [`FieldMetadata`]
    pub fn new(value: ScalarValue, metadata: Option<FieldMetadata>) -> Self {
        Self { value, metadata }
    }
I am not sure Literal is the best name here (we already have the trait datafusion_expr::Literal and datafusion_physical_expr::Literal), but ScalarValueAndMetadata seemed verbose, and (ScalarValue, Option<FieldMetadata>) has been used in enough places that I think having its own struct makes sense.
Other places where we pair a ScalarValue with metadata are Expr::Literal and ConstSimplifyResult::Simplified.
I don't have strong opinions here and am happy to have this be whatever.
My personal opinion is that Literal is OK and that it should be used in Expr::Literal and ConstSimplifyResult::Simplified. However, I believe this could be done in a follow-up PR and an issue such that others can provide input on that.
pub fn check_metadata_with_storage_equal(
    actual: (
        &DataType,
        Option<&std::collections::HashMap<String, String>>,
    ),
    expected: (
        &DataType,
        Option<&std::collections::HashMap<String, String>>,
    ),
    what: &str,
    context: &str,
) -> Result<(), DataFusionError> {
This is kind of ugly but there are two places in this PR that do this check (type equality). I figured at least collecting the checks in one place would make it easier to do something smarter later if it comes up.
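For readers following along, a sketch of what this consolidated check presumably does; the exact error wording is an assumption and the PR's actual implementation may differ:

```rust
use std::collections::HashMap;
use arrow::datatypes::DataType;
use datafusion_common::DataFusionError;

/// Sketch: error out if either the storage type or the field metadata differs.
pub fn check_metadata_with_storage_equal(
    actual: (&DataType, Option<&HashMap<String, String>>),
    expected: (&DataType, Option<&HashMap<String, String>>),
    what: &str,
    context: &str,
) -> Result<(), DataFusionError> {
    if actual.0 != expected.0 || actual.1 != expected.1 {
        return Err(DataFusionError::Plan(format!(
            "{what} mismatch in {context}: expected {:?}, got {:?}",
            expected, actual
        )));
    }
    Ok(())
}
```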
I agree, I think it's better than duplicating the logic. I think a good solution for this problem will require support for registering instances of logical types.
Should we track that somewhere?
pub fn format_type_and_metadata(
    data_type: &DataType,
    metadata: Option<&std::collections::HashMap<String, String>>,
) -> String {
This is to keep the LogicalPlan::Statement(Prepare {..}) output in the plan text from dumping the debug output of Field, so that it is mostly unchanged (unless there is, in fact, metadata). It also helps with the error message above.
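Roughly, the helper's behavior is something like the following (a sketch; the exact formatting in the PR may differ):

```rust
use std::collections::HashMap;
use arrow::datatypes::DataType;

/// Sketch: render just the data type when there is no metadata, and append
/// the metadata otherwise.
pub fn format_type_and_metadata(
    data_type: &DataType,
    metadata: Option<&HashMap<String, String>>,
) -> String {
    match metadata {
        Some(m) if !m.is_empty() => format!("{data_type} {m:?}"),
        _ => data_type.to_string(),
    }
}
```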
I think the best solution would be to delegate the formatting to a registered implementation of a logical type / extension type. This way we can really use type-specific formatting. But I think this is beyond the scope of this PR.
Expr::Placeholder(Placeholder { field, .. }) => match &field {
    // TODO: This seems to be pulling metadata/types from the schema
    // based on the Alias destination name? Is this correct?
    None => schema
        .field_from_column(&Column::from_name(name))
        .map(|f| f.clone().with_name(&schema_name)),
I don't understand this first branch and why it would pull metadata from a schema based on the destination name?
For me the special handling of Alias(Placeholder(...)) in general seems to be a bit fishy. Have you tried removing the special case?
This is ready for 👀! @tobixdev The CI failure is one I'm happy to take a look at if it keeps showing up.
I believe that is due to this issue (which was fixed yesterday). It should be fixed with a merge up from main.
For me this looks like a very good change; I have just some minor comments. Thank you!
Note that I am by no means a DataFusion Pro, so feel free to take my comments with a grain of salt. :)
/// This operation assumes that if the underlying [ScalarValue] can be casted
/// to a given type that any extension type represented by the metadata is also
/// valid.
pub fn cast_storage_to(
Could you elaborate on why we need this function? It seems like it is used to cast the provided parameter instances to the type of the placeholder. Can't we for now assume that these types match and return an Err if this is not the case? It just seems like we are trying to support logical types / extension types here but then ignore any restrictions they might have on converting to a particular physical type.
Just some thoughts; I don't know if this is actually a problem.
OK, I see the cast was there before, so this should be OK for now. Maybe in the future we can consider the allowed casts of a logical type.
pub fn check_metadata_with_storage_equal(
    actual: (
        &DataType,
        Option<&std::collections::HashMap<String, String>>,
Any reason why we're not using FieldMetadata here?
let bytes = logical_plan_to_bytes(&prepared)?;
let logical_round_trip = logical_plan_from_bytes(&bytes, &ctx)?;
assert_eq!(format!("{prepared}"), format!("{logical_round_trip}"));
// Also check exact equality in case metadata isn't reflected in the display string
Is there a reason why assert_eq!(&prepared, &logical_round_trip); is not enough and the test also asserts the equality of the serialized version?
Maybe we could also use insta for this test if asserting the serialized version is important. 🤔
}

-    pub(crate) fn convert_data_type(&self, sql_type: &SQLDataType) -> Result<DataType> {
+    pub(crate) fn convert_data_type(&self, sql_type: &SQLDataType) -> Result<FieldRef> {
Maybe this should be renamed to something like convert_data_type_to_field?
    let inner_data_type = self.convert_data_type(inner_sql_type)?;
    Ok(DataType::new_list(inner_data_type, true))

    // Lists are allowed to have an arbitrarily named field;
Should we create an issue for creating DataType::new_list_from_field in arrow-rs?
-    LogicalPlan::Statement(Statement::Prepare(Prepare { data_types, .. })) => {
-        data_types.iter().join(", ").to_string()
+    LogicalPlan::Statement(Statement::Prepare(Prepare { fields, .. })) => {
+        fields.iter().map(|f| f.data_type()).join(", ").to_string()
Does it make sense to use format_type_and_metadata here?
}

#[test]
fn test_update_infer_with_metadata() {
👍
Which issue does this PR close?
Rationale for this change
Most logical plan expressions now propagate metadata; however, parameters with extension types or other field metadata cannot participate in placeholder/parameter binding.
What changes are included in this PR?
The DataType in the Placeholder struct was replaced with a FieldRef, along with anything else that stored the "DataType" of a parameter.
Strictly speaking, one could bind parameters with an extension type by copy/pasting the placeholder replacer, which I figured out towards the end of this change. I still think this change makes sense and opens the door for things like handling UUID in SQL with full parameter binding support.
Are these changes tested?
Yes
Are there any user-facing changes?
Yes, one new function was added to extract the placeholder fields from a plan.
This is a breaking change for code that specifically interacts with the pub fields of the modified structs (ParamValues, Placeholder, and Prepare are the main ones).
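For downstream code the migration is mostly mechanical; a hedged sketch assuming the data_type → field rename shown in the diff above:

```rust
use arrow::datatypes::DataType;
use datafusion_expr::expr::Placeholder;

// Before this PR, downstream code typically read `placeholder.data_type`.
// Afterwards the type lives behind the field, with any extension metadata
// available alongside it.
fn placeholder_type(placeholder: &Placeholder) -> Option<&DataType> {
    placeholder.field.as_ref().map(|f| f.data_type())
}
```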