Adds ExtensionType for Parquet geospatial WKB arrays #8943
Conversation
Force-pushed from 39c84f2 to d264731.
- Implements the ExtensionType for Parquet geospatial logical types
- Adds GeoArrow compatible Metadata type to hold geospatial metadata
- Adds basic tests around geospatial metadata serialization/deserialization
- Integrates geospatial logical type parsing into schema extensions
paleolimbot left a comment:
This is awesome!
I have a few suggestions of more tests that would probably be good to add now...I think you can mostly hijack existing tests. Because you've nicely tested the low-level conversion, I think all we need at a higher level is to check that it's plugged in when we use the Arrow reader/writer.
```rust
impl FromStr for EdgeInterpolationAlgorithm {
    type Err = ParquetError;

    fn from_str(s: &str) -> Result<Self> {
        match s.to_ascii_uppercase().as_str() {
            "SPHERICAL" => Ok(EdgeInterpolationAlgorithm::SPHERICAL),
```
Is there a good spot in this file to add a test for this? (I think there are also some tests for testing time unit serialization somewhere)
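For reference, such a round-trip test could look like the following minimal, self-contained sketch. The enum, its variant spelling, and the error type here are hypothetical stand-ins for the real items in the parquet crate; the variant set is assumed from the Parquet Geospatial spec.

```rust
use std::str::FromStr;

// Hypothetical stand-in for the enum in the parquet crate; the variant set is
// assumed from the Parquet Geospatial spec's edge interpolation algorithms.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum EdgeInterpolationAlgorithm {
    Spherical,
    Vincenty,
    Thomas,
    Andoyer,
    Karney,
}

impl FromStr for EdgeInterpolationAlgorithm {
    type Err = String;

    // Case-insensitive parse, mirroring the to_ascii_uppercase() match above.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_uppercase().as_str() {
            "SPHERICAL" => Ok(Self::Spherical),
            "VINCENTY" => Ok(Self::Vincenty),
            "THOMAS" => Ok(Self::Thomas),
            "ANDOYER" => Ok(Self::Andoyer),
            "KARNEY" => Ok(Self::Karney),
            other => Err(format!("unknown edge interpolation algorithm: {other}")),
        }
    }
}
```

A test would then assert that mixed-case input parses and that unknown names error, similar to the time-unit serialization tests mentioned above.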
```rust
#[cfg(feature = "geospatial")]
pub(crate) fn logical_type_for_binary(field: &Field) -> Option<LogicalType> {
    use parquet_geospatial::WkbType;
    use parquet_geospatial::WkbTypeHint;

    match field.extension_type_name() {
```
I think with this added in we should be able to test this in parquet/tests/geospatial.rs by tacking on the expected field metadata here:
`arrow-rs/parquet/tests/geospatial.rs` (lines 41 to 44 at b93fa52):

```rust
#[test]
fn test_read_logical_type() {
    // Some crs values are short strings
    let expected_logical_type = [
```
...and by using the Arrow schema converter to create the Parquet schema instead of doing it by hand here:
`arrow-rs/parquet/tests/geospatial.rs` (lines 409 to 420 at b93fa52):

```rust
fn parquet_schema_geometry() -> Type {
    Type::group_type_builder("root")
        .with_fields(vec![
            Type::primitive_type_builder("geo", parquet::basic::Type::BYTE_ARRAY)
                .with_logical_type(Some(LogicalType::Geometry { crs: None }))
                .build()
                .unwrap()
                .into(),
        ])
        .build()
        .unwrap()
}
```
`parquet-geospatial/src/types.rs` (outdated):

```rust
// specific language governing permissions and limitations
// under the License.

use arrow::error::Result;
```
I think this is also available from arrow_schema? (I have no idea what the conventions are in these crates)
I think in general we prefer using the smallest possible dependency crate surface to minimize compile times. So instead of adding the full arrow dependency, it's preferred to use arrow_schema instead
I'm having trouble finding this in arrow_schema, can someone point me to it?
https://docs.rs/arrow-schema/57.1.0/arrow_schema/index.html?search=Result
The only similar thing I've found is a private type alias here:
```rust
type ArrowResult<T> = Result<T, ArrowError>;
```
If we want to avoid the dependency altogether there aren't that many `Result<T>` uses here, and I can easily type-alias or use `std::result::Result<T, ArrowError>` for those as well.
Either is OK by me! I probably would have typed Result<T, ArrowError> but also the only place that this crate is used is by parquet which depends on arrow anyway.
For some reason ArrowError is in arrow_schema but the Result typedef is in the arrow crate:
https://docs.rs/arrow/latest/src/arrow/error.rs.html#20-23

> If we want to avoid the dependency altogether there aren't that many Result uses here and I can easily type-alias or use `std::result::Result<T, ArrowError>` for those as well.

This is what I would suggest
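The suggested alias is a one-liner; here is a self-contained sketch of the pattern, with a placeholder error type standing in for `arrow_schema::ArrowError` (the stand-in and the `parse_crs` helper are illustrative, not the PR's actual API):

```rust
// Placeholder standing in for arrow_schema::ArrowError in this sketch.
#[derive(Debug, PartialEq)]
pub struct ArrowError(pub String);

// Crate-local alias with a defaulted error parameter, so plain `Result<T>`
// works without pulling in the full `arrow` crate just for its typedef.
pub type Result<T, E = ArrowError> = std::result::Result<T, E>;

// Example consumer of the alias.
pub fn parse_crs(raw: &str) -> Result<String> {
    if raw.is_empty() {
        return Err(ArrowError("crs must be non-empty".to_string()));
    }
    Ok(raw.to_string())
}
```

The `E = ArrowError` default is the same trick the arrow crate's own typedef uses, so call sites can write `Result<T>` or override the error type when needed.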
alamb left a comment:
Thanks @BlakeOrth @kylebarron and @paleolimbot
I think this code looks good to me and could be merged as is, though I think the comments from @paleolimbot and @kylebarron are definitely worth reviewing
As I am not an expert in Geometry types I would like to see an approval from @kylebarron before we merge it, but as seems to be your pattern, I think this is a well thought out, well structured and well tested PR. Thank you again
`parquet-geospatial/src/types.rs` (outdated):

```rust
/// Well-Known Binary (WKB) [`ExtensionType`] for geospatial data.
///
/// Represents the canonical Arrow Extension Type for storing GeoArrow data.
```
Maybe it is worth a link to GeoArrow here https://github.com/geoarrow/geoarrow
```rust
}

impl ExtensionType for WkbType {
    const NAME: &'static str = "geoarrow.wkb";
```
Is "geoarrow" the right extension type for the parquet reader to use? I ask this in ignorance as now there is a geometry type in Parquet, but it seems like the geoarrow is still listed as a "Community Extension Type" https://arrow.apache.org/docs/format/CanonicalExtensions.html#community-extension-types
I have no particular preference here, FWIW, and if we can use this crate to drive standardization on "geoarrow.wkb" that seems good to me but I wanted to ask
@paleolimbot or @kylebarron are both probably a more authoritative resource on this than I am, but geoarrow.wkb is what the C++ parquet/arrow library uses as well:
I've probably already rocked the boat enough with @paleolimbot in my POC for this PR (#8801), but having one extension type name map to two parquet LogicalTypes does end up causing some headaches, and there is one case that can't really be represented (however, nobody seems very concerned about this case). If the extension names were arrow.parquet.geometry and arrow.parquet.geography that could be a bit more clear in my opinion, but using geoarrow.wkb seems to be the current community standard, so it's probably best to just stick with it.
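To illustrate the headache Blake describes, here is a toy sketch (names and signatures are hypothetical, not the PR's actual API): with a single extension name serving two Parquet logical types, the reverse mapping has to consult the GeoArrow edges metadata to decide which one to emit.

```rust
// Toy model of the two Parquet geospatial logical types.
#[derive(Debug, PartialEq)]
pub enum GeoLogicalType {
    Geometry,
    Geography,
}

// Hypothetical resolver: both logical types serialize to "geoarrow.wkb", so
// reading back requires the "edges" metadata field to disambiguate.
pub fn logical_type_for_extension(name: &str, edges: Option<&str>) -> Option<GeoLogicalType> {
    if name != "geoarrow.wkb" {
        return None;
    }
    match edges {
        // Absent or planar edges read back as GEOMETRY here; this boundary is
        // where the ambiguity Blake mentions shows up.
        None | Some("planar") => Some(GeoLogicalType::Geometry),
        Some(_) => Some(GeoLogicalType::Geography),
    }
}
```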
Blake is correct here...the main one that's hard to go back and change is GDAL (the version that's in Ubuntu 24.04 LTS produces geoarrow.wkb arrays)...everybody else we could probably convince to update if there's a good reason to do so and the software involved is reasonably easy to run current versions of. I think if we were designing it again we'd definitely shoot for better alignment with the Parquet ecosystem.
> I've probably already rocked the boat enough

Loving the rocking!
> I've probably already rocked the boat enough
>
> Loving the rocking!

Ditto
@alamb @paleolimbot Thank you both for the thorough review! I believe the most recent commit addresses all of the feedback for improvements here. Please let me know if there's any additional feedback! @paleolimbot I believe the geospatial extension test additions are what you were looking for, but let me know if that's not what you had in mind and I'm happy to change those.
paleolimbot left a comment:
+1 on Andrew's comments about this PR...I've learned a lot of Rust by reviewing this and the POC. There's a lot going on with these two specs and you were kind to consider my comments.
`parquet-geospatial/src/types.rs` (outdated):

```rust
let metadata = Metadata::new(Some(String::from("srid:1234")), None);
let wkb = WkbType::new(Some(metadata));

let serialized = wkb.serialize_metadata().unwrap();
assert_eq!(serialized, r#"{"crs":"srid:1234"}"#);
```
If we are really splitting hairs, a Parquet CRS of "srid:1234" can also be `{"crs": "1234", "crs_type": "srid"}` in GeoArrow land. I have also never seen either of those outside of test data I created myself, so I think we should defer considering that as a low-priority follow-up!
@alamb It looks like my most recent test additions to …

I think it's fine to have it gated behind that feature...I think there is a large portion of that test file already that is.
```rust
pub crs: Option<serde_json::Value>,
/// The edge interpolation algorithm of the [`WkbType`], if present.
#[serde(skip_serializing_if = "Option::is_none")]
pub algorithm: Option<String>,
```
Is there a known set of algorithms in the spec? Does the spec allow new algorithms without an update of the spec? Or should we use an Enum here?
In geoarrow-schema I define this Edges enum: https://github.com/geoarrow/geoarrow-rs/blob/f1deee4549760c7cd8476511525115c185515272/rust/geoarrow-schema/src/edges.rs#L3-L74
Everything I can see in the Parquet spec for GEOGRAPHY types ends up being an exhaustive enum for edge interpolation:
https://github.com/apache/parquet-format/blob/master/Geospatial.md#edge-interpolation-algorithm
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#geography
GeoArrow also seems exhaustive with the same set of options:
https://geoarrow.org/extension-types.html#extension-metadata
However, during the POC for this PR it seemed like the general desire was for this to more-or-less be a "data tube" and defer validation/correctness of the file metadata to users or some higher level library. I'm more than happy to increase the rigor here if that's the direction we'd like to move.
I think either is fine but a string is easier because we already have an enum in the parquet crate (but we can't use it here or else we'd have a circular dependency, and we can't define it here because parquet-geospatial is an optional dependency). I think we ultimately do end up validating the string (either because this came from Parquet thrift where we generated it ourselves, or because we try to parse it into Parquet thrift, where I think it will error if it's a value we don't recognize).
Yes, due to the dependency graph we would have to effectively duplicate this enum within the parquet-geospatial crate:
https://github.com/apache/arrow-rs/pull/8943/files#diff-77b1a58271f532bba48f784fc34eefd52639735547f9e32187dbfbf485ae179dL998
I agree that we already validate the value on each leg of the round-trip, so I don't think we're likely to get unexpected values from the raw string interpretation. It seems like it's mostly a matter of how we'd like to interact with this downstream. An enum results in some duplicated code, but it's probably a bit more idiomatic from a user's perspective. @kylebarron I suspect geoarrow-rs is likely to be the primary consumer of this metadata, so I'd probably generally defer to you on what you think the best way to interact with this metadata is.
I don't love the bare string, but I'm not as in the weeds as you two are with the state of this implementation, so I'm happy to go with what @paleolimbot thinks is right
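One way to keep the "data tube" behavior discussed above while still catching typos is to store the raw string and validate it lazily at the Parquet conversion boundary. A minimal sketch, assuming the five algorithms listed in the Parquet Geospatial spec (the `Metadata` stand-in and `validate_algorithm` helper are illustrative, not the PR's API):

```rust
// Known edge interpolation algorithms, assumed from the Parquet spec.
const KNOWN_ALGORITHMS: [&str; 5] = ["spherical", "vincenty", "thomas", "andoyer", "karney"];

// Simplified stand-in for the PR's Metadata type.
pub struct Metadata {
    pub algorithm: Option<String>,
}

// Validation deferred to the point where metadata is converted toward Parquet
// thrift; unknown strings pass through serialization untouched until then.
pub fn validate_algorithm(md: &Metadata) -> Result<(), String> {
    match md.algorithm.as_deref() {
        None => Ok(()),
        Some(a) if KNOWN_ALGORITHMS.contains(&a.to_ascii_lowercase().as_str()) => Ok(()),
        Some(a) => Err(format!("unknown edge interpolation algorithm: {a}")),
    }
}
```

This avoids duplicating the parquet crate's enum while still ensuring an error surfaces before an unrecognized value reaches the thrift layer.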
```rust
/// Note: Common lon/lat CRS representations (EPSG:4326, OGC:CRS84) are canonicalized
/// to `None` during serialization to match Parquet conventions.
#[serde(skip_serializing_if = "Option::is_none")]
pub crs: Option<serde_json::Value>,
```
Likewise I think having a plain serde_json::Value here leads to opportunities for invalid formatting of CRS. I don't know whether we expect this code to validate input or whether it's expected for validation to be somewhere else down the pipeline.
In particular, in geoarrow-schema I split up the underlying crs and crs_type: https://github.com/geoarrow/geoarrow-rs/blob/f1deee4549760c7cd8476511525115c185515272/rust/geoarrow-schema/src/crs.rs#L35-L62
Or is the encoding in the Parquet geometry spec not the same as the encoding in GeoArrow?
Unlike GeoArrow, the Parquet spec for this metadata does not carry any notion of crs_type. There are some "recommended" conventions for how to communicate the type, but they are guidelines as opposed to actual spec defined rules:
https://github.com/apache/parquet-format/blob/master/Geospatial.md#crs-customization
My reading of the Parquet spec suggests that technically any string (or lack thereof) is a valid value for crs. Again, I think that the general sense was CRS validation (outside of avoiding double-escaped json strings) was not really the job of this code and that should be handled by users. If we want some additional validation here I'd be happy to add this.
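As a concrete sketch of the canonicalization described in the quoted doc comment above (simplified here to plain strings; the real field is a `serde_json::Value`, and this helper is illustrative rather than the PR's actual code):

```rust
// Collapse the common lon/lat CRS spellings to None, since the Parquet
// logical type treats an absent crs as the lon/lat default.
pub fn canonicalize_crs(crs: Option<&str>) -> Option<String> {
    match crs {
        Some("EPSG:4326") | Some("OGC:CRS84") => None,
        other => other.map(str::to_string),
    }
}
```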
`parquet-geospatial/src/types.rs` (outdated):

```rust
let md = Metadata {
    crs: (*crs).cloned(),
    algorithm: self.0.algorithm.clone(),
};

serde_json::to_string(&md).ok()
```
It would be nice if there were a way to not clone the crs and algorithm, when we only want the reference anyways to pass to serde_json
So I pushed a solution to avoid a clone here, although I'll be honest and say it does feel a bit dubious. If you have any implementation recommendations on a way you'd prefer to see this clone avoided I'm happy to implement something else!
Well in this function you're never constructing a new CRS; you're only sometimes removing the CRS. So how about instead of constructing a new object, just serialize different things to JSON:
(pseudocode)

```rust
match crs {
    // Do checks for whether to remove CRS
    // (restore the Value::Object(_) checks here as well)
    Some(Value::String(s)) if s == "EPSG:4326" || s == "OGC:CRS84" => {
        let new_metadata = Metadata { crs: None, algorithm: self.algorithm.clone() };
        serde_json::to_string(&new_metadata).ok()
    }
    // Serialize existing metadata
    _ => serde_json::to_string(&self.0).ok(),
}
```

- Better clarity for some comments
- Avoids clone during metadata serialization
Which issue does this PR close?
This does not fully close, but is an initial component of:

- `GEOMETRY` and `GEOGRAPHY` #8717

Rationale for this change
To keep PR size digestible, this implements the bi-directional extension type metadata parsing for Parquet geospatial arrays. An actual array type that allows users to easily interact with the data can come as a follow-on PR.
What changes are included in this PR?
Are these changes tested?
Yes. Targeted unit tests for the metadata and associated parsing have been implemented. Higher level tests that show a full round-trip to and from arrays can be included in the array implementation PR.
Are there any user-facing changes?
Yes. All new public items have been documented, and no API breaking changes have been made.
cc @alamb @paleolimbot @kylebarron