Conversation

@BlakeOrth

Which issue does this PR close?

This does not fully close, but is an initial component of:

Rationale for this change

To keep PR size digestible, this implements the bi-directional extension type metadata parsing for Parquet geospatial arrays. An actual array type that allows users to easily interact with the data can come as a follow-on PR.

What changes are included in this PR?

  • Implements the ExtensionType for Parquet geospatial logical types
  • Adds GeoArrow compatible Metadata type to hold geospatial metadata
  • Adds basic tests around geospatial metadata serialization/deserialization
  • Integrates geospatial logical type parsing into schema extensions

Are these changes tested?

Yes. Targeted unit tests for the metadata and associated parsing have been implemented. Higher level tests that show a full round-trip to and from arrays can be included in the array implementation PR.

Are there any user-facing changes?

Yes. All new public items have been documented, and no API breaking changes have been made.

cc @alamb @paleolimbot @kylebarron

@github-actions github-actions bot added the parquet label (Changes to the parquet crate) Dec 3, 2025
@BlakeOrth BlakeOrth force-pushed the feature/geospatial_extension_types branch from 39c84f2 to d264731 on December 3, 2025 at 17:59
Member

@paleolimbot paleolimbot left a comment

This is awesome!

I have a few suggestions for more tests that would probably be good to add now...I think you can mostly hijack existing tests. Because you've nicely tested the low-level conversion, I think all we need at a higher level is to check that it's plugged in when we use the Arrow reader/writer.

Comment on lines +1020 to +1025
impl FromStr for EdgeInterpolationAlgorithm {
    type Err = ParquetError;

    fn from_str(s: &str) -> Result<Self> {
        match s.to_ascii_uppercase().as_str() {
            "SPHERICAL" => Ok(EdgeInterpolationAlgorithm::SPHERICAL),
Member

Is there a good spot in this file to add a test for this? (I think there are also some tests for time unit serialization somewhere)
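
A round-trip test along these lines might look like the following. This is a hedged, standalone sketch: the enum and error type are minimal stand-ins (the real EdgeInterpolationAlgorithm and ParquetError live in the parquet crate), and the variant list is taken from the Parquet Geospatial spec, so it should be checked against the actual impl before use:

```rust
// Standalone sketch of a parse test for a FromStr impl like the one above.
// NOTE: this enum is a stand-in mirroring the Parquet Geospatial spec's
// edge interpolation algorithms; the real type uses ParquetError as Err.
use std::str::FromStr;

#[allow(non_camel_case_types)]
#[derive(Debug, PartialEq)]
enum EdgeInterpolationAlgorithm {
    SPHERICAL,
    VINCENTY,
    THOMAS,
    ANDOYER,
    KARNEY,
}

impl FromStr for EdgeInterpolationAlgorithm {
    type Err = String; // the real impl uses ParquetError

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        use EdgeInterpolationAlgorithm::*;
        match s.to_ascii_uppercase().as_str() {
            "SPHERICAL" => Ok(SPHERICAL),
            "VINCENTY" => Ok(VINCENTY),
            "THOMAS" => Ok(THOMAS),
            "ANDOYER" => Ok(ANDOYER),
            "KARNEY" => Ok(KARNEY),
            other => Err(format!("unknown edge interpolation algorithm: {other}")),
        }
    }
}

fn main() {
    // parsing is case-insensitive thanks to to_ascii_uppercase
    assert_eq!(
        "spherical".parse::<EdgeInterpolationAlgorithm>(),
        Ok(EdgeInterpolationAlgorithm::SPHERICAL)
    );
    // unknown values are rejected with an error
    assert!("great_circle".parse::<EdgeInterpolationAlgorithm>().is_err());
    println!("ok");
}
```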

Comment on lines +152 to +157
#[cfg(feature = "geospatial")]
pub(crate) fn logical_type_for_binary(field: &Field) -> Option<LogicalType> {
    use parquet_geospatial::WkbType;
    use parquet_geospatial::WkbTypeHint;

    match field.extension_type_name() {
Member

I think with this added in we should be able to test this in parquet/tests/geospatial.rs by tacking on the expected field metadata here:

#[test]
fn test_read_logical_type() {
    // Some crs values are short strings
    let expected_logical_type = [

...and by using the Arrow schema converter to create the Parquet schema instead of doing it by hand here:

fn parquet_schema_geometry() -> Type {
    Type::group_type_builder("root")
        .with_fields(vec![
            Type::primitive_type_builder("geo", parquet::basic::Type::BYTE_ARRAY)
                .with_logical_type(Some(LogicalType::Geometry { crs: None }))
                .build()
                .unwrap()
                .into(),
        ])
        .build()
        .unwrap()
}

// specific language governing permissions and limitations
// under the License.

use arrow::error::Result;
Member

I think this is also available from arrow_schema? (I have no idea what the conventions are in these crates)

Member

I think in general we prefer using the smallest possible dependency crate surface to minimize compile times. So instead of adding the full arrow dependency, it's preferred to use arrow_schema instead

Author

I'm having trouble finding this in arrow_schema, can someone point me to it?

https://docs.rs/arrow-schema/57.1.0/arrow_schema/index.html?search=Result

The only similar thing I've found is a private type alias here:

type ArrowResult<T> = Result<T, ArrowError>;

If we want to avoid the dependency altogether there aren't that many Result<T> uses here and I can easily type alias or use std::Result<T, ArrowError> for those as well.

Member

Either is OK by me! I probably would have typed Result<T, ArrowError> but also the only place that this crate is used is by parquet which depends on arrow anyway.

Contributor

For some reason ArrowError is in arrow_schema but the Result typedef is in the arrow crate:

https://docs.rs/arrow/latest/src/arrow/error.rs.html#20-23

If we want to avoid the dependency altogether there aren't that many Result uses here and I can easily type alias or use std::Result<T, ArrowError> for those as well.

This is what I would suggest
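
For reference, the alias being discussed is a one-liner. This is a standalone sketch with an invented stand-in for ArrowError (the real type lives in arrow_schema) and a hypothetical `require_crs` function just to show the alias in use:

```rust
// Stand-in for arrow_schema::ArrowError so this sketch compiles on its own.
#[derive(Debug)]
struct ArrowError(String);

// Crate-local alias that avoids pulling in the full `arrow` crate:
type Result<T> = std::result::Result<T, ArrowError>;

// Hypothetical function, purely to demonstrate the alias in a signature.
fn require_crs(s: &str) -> Result<&str> {
    if s.is_empty() {
        Err(ArrowError("empty crs".to_string()))
    } else {
        Ok(s)
    }
}

fn main() {
    assert!(require_crs("OGC:CRS84").is_ok());
    assert!(require_crs("").is_err());
    println!("ok");
}
```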

Contributor

@alamb alamb left a comment

Thanks @BlakeOrth @kylebarron and @paleolimbot

I think this code looks good to me and could be merged as is, though I think the comments from @paleolimbot and @kylebarron are definitely worth reviewing

As I am not an expert in Geometry types I would like to see an approval from @kylebarron before we merge it, but, as seems to be your pattern, I think this is a well thought out, well structured, and well tested PR. Thank you again


/// Well-Known Binary (WKB) [`ExtensionType`] for geospatial data.
///
/// Represents the canonical Arrow Extension Type for storing GeoArrow data.
Contributor

Maybe it is worth a link to GeoArrow here https://github.com/geoarrow/geoarrow

}

impl ExtensionType for WkbType {
    const NAME: &'static str = "geoarrow.wkb";
Contributor

Is "geoarrow" the right extension type for the parquet reader to use? I ask this in ignorance as now there is a geometry type in Parquet, but it seems like the geoarrow is still listed as a "Community Extension Type" https://arrow.apache.org/docs/format/CanonicalExtensions.html#community-extension-types

I have no particular preference here, FWIW, and if we can use this crate to drive standardization on "geoarrow.wkb" that seems good to me but I wanted to ask

Author

@BlakeOrth BlakeOrth Dec 3, 2025

@paleolimbot or @kylebarron are both probably a more authoritative resource on this than I am, but geoarrow.wkb is what the C++ parquet/arrow library uses as well:

https://github.com/apache/arrow/blob/e90bacded2ae5a34be11c764e00751c18ab3b355/docs/source/cpp/parquet.rst#logical-types

I've probably already rocked the boat enough with @paleolimbot in my POC for this PR (#8801), but having one extension type name map to two parquet LogicalTypes does end up causing some headaches, and there is one case that can't really be represented (however, nobody seems very concerned about this case). If the extension names were arrow.parquet.geometry and arrow.parquet.geography that could be a bit more clear in my opinion, but using geoarrow.wkb seems to be the current community standard, so it's probably best to just stick with it.

Member

Blake is correct here...the main one that's hard to go back and change is GDAL (the version that's in Ubuntu 24.04 LTS produces geoarrow.wkb arrays)...everybody else we could probably convince to update if there's a good reason to do so and the software involved is reasonably easy to run current versions of. I think if we were designing it again we'd definitely shoot for better alignment with the Parquet ecosystem.

I've probably already rocked the boat enough

Loving the rocking!

Contributor

I've probably already rocked the boat enough

Loving the rocking!

Ditto

@BlakeOrth
Author

@alamb @paleolimbot Thank you both for the thorough review! I believe the most recent commit addresses all of the review feedback. Please let me know if there's anything else you'd like to see improved!

@paleolimbot I believe the geospatial extension test additions are what you were looking for, but let me know if that's not what you had in mind and I'm happy to change those.

Member

@paleolimbot paleolimbot left a comment

+1 on Andrew's comments about this PR...I've learned a lot of Rust by reviewing this and the POC. There's a lot going on with these two specs and you were kind to consider my comments.


Comment on lines 182 to 186
let metadata = Metadata::new(Some(String::from("srid:1234")), None);
let wkb = WkbType::new(Some(metadata));

let serialized = wkb.serialize_metadata().unwrap();
assert_eq!(serialized, r#"{"crs":"srid:1234"}"#);
Member

If we are really splitting hairs a Parquet CRS of "srid:1234" can also be {"crs": "1234", "crs_type": "srid"} in GeoArrow land. I have also never seen either of those outside of test data I created myself and so I think we should defer considering that as a low-priority follow-up!
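
A small, dependency-free sketch of the mapping described here. The function name is invented for illustration, and only the "srid:" convention is handled; it is not part of this PR:

```rust
// Hypothetical helper: a Parquet CRS of "srid:1234" corresponds to GeoArrow
// crs = "1234" with crs_type = "srid". Anything else passes through with no
// crs_type, since Parquet does not carry that field.
fn split_srid_crs(crs: &str) -> (&str, Option<&str>) {
    match crs.strip_prefix("srid:") {
        Some(code) => (code, Some("srid")),
        None => (crs, None),
    }
}

fn main() {
    assert_eq!(split_srid_crs("srid:1234"), ("1234", Some("srid")));
    assert_eq!(split_srid_crs("OGC:CRS84"), ("OGC:CRS84", None));
    println!("ok");
}
```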

@BlakeOrth
Author

@alamb It looks like my most recent test additions to parquet/tests/geospatial.rs broke the "standard" build because it's now using imports from the feature-gated parquet-geospatial crate (that'll teach me to always use the --all-features flag). Would it be ok to gate this entire test behind that feature, or maybe I should move the affected tests into the parquet-geospatial crate?

@paleolimbot
Member

Would it be ok to gate this entire test behind that feature

I think it's fine to have it gated behind that feature...I think there is a large portion of that test file already that is.

pub crs: Option<serde_json::Value>,
/// The edge interpolation algorithm of the [`WkbType`], if present.
#[serde(skip_serializing_if = "Option::is_none")]
pub algorithm: Option<String>,
Member

Is there a known set of algorithms in the spec? Does the spec allow new algorithms without an update of the spec? Or should we use an Enum here?

In geoarrow-schema I define this Edges enum: https://github.com/geoarrow/geoarrow-rs/blob/f1deee4549760c7cd8476511525115c185515272/rust/geoarrow-schema/src/edges.rs#L3-L74

Author

Everything I can see in the Parquet spec for GEOGRAPHY types ends up being an exhaustive enum for edge interpolation:
https://github.com/apache/parquet-format/blob/master/Geospatial.md#edge-interpolation-algorithm
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#geography

GeoArrow also seems exhaustive with the same set of options:
https://geoarrow.org/extension-types.html#extension-metadata

However, during the POC for this PR it seemed like the general desire was for this to more-or-less be a "data tube" and defer validation/correctness of the file metadata to users or some higher level library. I'm more than happy to increase the rigor here if that's the direction we'd like to move.

Member

I think either is fine but a string is easier because we already have an enum in the parquet crate (but we can't use it here or else we'd have a circular dependency, and we can't define it here because parquet-geospatial is an optional dependency). I think we ultimately do end up validating the string (either because this came from Parquet thrift where we generated it ourselves, or because we try to parse it into Parquet thrift, where I think it will error if it's a value we don't recognize).

Author

Yes, due to the dependency graph we would have to effectively duplicate this enum within the parquet-geospatial crate:
https://github.com/apache/arrow-rs/pull/8943/files#diff-77b1a58271f532bba48f784fc34eefd52639735547f9e32187dbfbf485ae179dL998

I agree that we already validate the value on each leg of the round-trip, so I don't think we're likely to get unexpected values from the raw string interpretation. It seems like it's mostly a matter of how we'd like to interact with this downstream. An enum results in some duplicated code, but it's probably a bit more idiomatic from a user's perspective. @kylebarron I suspect geoarrow-rs is likely to be the primary consumer of this metadata, so I'd probably generally defer to you on what you think the best way to interact with this metadata is.

Member

@kylebarron kylebarron Dec 5, 2025

I don't love the bare string, but I'm not as in the weeds as you two are with the state of this implementation, so I'm happy to go with what @paleolimbot thinks is right

/// Note: Common lon/lat CRS representations (EPSG:4326, OGC:CRS84) are canonicalized
/// to `None` during serialization to match Parquet conventions.
#[serde(skip_serializing_if = "Option::is_none")]
pub crs: Option<serde_json::Value>,
Member

Likewise I think having a plain serde_json::Value here leads to opportunities for invalid formatting of CRS. I don't know whether we expect this code to validate input or whether it's expected for validation to be somewhere else down the pipeline.

In particular, in geoarrow-schema I split up the underlying crs and crs_type: https://github.com/geoarrow/geoarrow-rs/blob/f1deee4549760c7cd8476511525115c185515272/rust/geoarrow-schema/src/crs.rs#L35-L62

Or is the encoding in the Parquet geometry spec not the same as the encoding in GeoArrow?

Author

Unlike GeoArrow, the Parquet spec for this metadata does not carry any notion of crs_type. There are some "recommended" conventions for how to communicate the type, but they are guidelines as opposed to actual spec defined rules:
https://github.com/apache/parquet-format/blob/master/Geospatial.md#crs-customization

My reading of the Parquet spec suggests that technically any string (or lack thereof) is a valid value for crs. Again, I think that the general sense was CRS validation (outside of avoiding double-escaped json strings) was not really the job of this code and that should be handled by users. If we want some additional validation here I'd be happy to add this.
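
If advisory classification were ever wanted, it could stay this lightweight. This is a hypothetical sketch based on the *recommended* prefixes in the Parquet Geospatial spec's CRS customization section ("srid:" and "projjson:"); the spec allows any string, so "custom" is a pass-through, not an error:

```rust
// Hypothetical, advisory-only classifier for the Parquet spec's recommended
// CRS conventions. The function name and return values are invented here.
fn crs_convention(crs: &str) -> &'static str {
    if crs.strip_prefix("srid:").is_some() {
        "srid"
    } else if crs.strip_prefix("projjson:").is_some() {
        "projjson"
    } else {
        // Any other string is still valid per the spec
        "custom"
    }
}

fn main() {
    assert_eq!(crs_convention("srid:1234"), "srid");
    assert_eq!(crs_convention("projjson:my_crs"), "projjson");
    assert_eq!(crs_convention("OGC:CRS84"), "custom");
    println!("ok");
}
```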

Comment on lines 126 to 131
let md = Metadata {
    crs: (*crs).cloned(),
    algorithm: self.0.algorithm.clone(),
};

serde_json::to_string(&md).ok()
Member

It would be nice if there were a way to not clone the crs and algorithm, when we only want the reference anyways to pass to serde_json

Author

So I pushed a solution to avoid a clone here, although I'll be honest and say it does feel a bit dubious. If you have any implementation recommendations on a way you'd prefer to see this clone avoided I'm happy to implement something else!

Member

Well in this function you're never constructing a new CRS; you're only sometimes removing the CRS. So how about instead of constructing a new object, just serialize different things to JSON:

(pseudocode)

match crs {
    // Check whether to remove the CRS
    // (restore the Some(Value::Object(_)) checks alongside this arm as well)
    Some(Value::String(s)) if s == "EPSG:4326" || s == "OGC:CRS84" => {
        let new_metadata = Metadata { crs: None, algorithm: self.algorithm.clone() };
        serde_json::to_string(&new_metadata).ok()
    }
    // Serialize the existing metadata unchanged
    _ => serde_json::to_string(&self.0).ok(),
}
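
A runnable, dependency-free illustration of the same branching idea. This is a sketch, not the PR's implementation: the real code uses serde_json, while here the JSON is hand-rolled and the function signature is invented so the example stands alone:

```rust
// Sketch of clone-free metadata serialization: the CRS is only inspected by
// reference, and common lon/lat defaults are canonicalized to "absent" per
// the convention described in the PR.
fn serialize_metadata(crs: Option<&str>, algorithm: Option<&str>) -> String {
    // Canonicalize well-known lon/lat CRS values to None instead of cloning
    let crs = match crs {
        Some("EPSG:4326") | Some("OGC:CRS84") | None => None,
        other => other,
    };
    let mut fields = Vec::new();
    if let Some(c) = crs {
        fields.push(format!(r#""crs":"{c}""#));
    }
    if let Some(a) = algorithm {
        fields.push(format!(r#""algorithm":"{a}""#));
    }
    format!("{{{}}}", fields.join(","))
}

fn main() {
    // default lon/lat CRS is dropped entirely
    assert_eq!(serialize_metadata(Some("OGC:CRS84"), None), "{}");
    // anything else round-trips as-is
    assert_eq!(
        serialize_metadata(Some("srid:1234"), Some("spherical")),
        r#"{"crs":"srid:1234","algorithm":"spherical"}"#
    );
    println!("ok");
}
```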

 - Better clarity for some comments
 - Avoids clone during metadata serialization