Skip to content

Add documentation for Dask/Multiprocessing support and Xarray accessor#878

Open
rhugonnet wants to merge 3 commits into
GlacioHack:mainfrom
rhugonnet:add_accessor_daskmp_doc
Open

Add documentation for Dask/Multiprocessing support and Xarray accessor#878
rhugonnet wants to merge 3 commits into
GlacioHack:mainfrom
rhugonnet:add_accessor_daskmp_doc

Conversation

@rhugonnet
Copy link
Copy Markdown
Member

@rhugonnet rhugonnet commented Mar 6, 2026

This PR finally adds the documentation for the Xarray accessor rst, as well as the Dask and Multiprocessing support through chunked implementations that we have been steadily developing for the last 2-3 years! 🥳

Thanks in particular to @vschaffn @ameliefroessl for their big contributions at various stages!

Link to the new doc (landing on the Scalability page with schematics that was a bit of work!): https://geoutils-rhugonnet.readthedocs.io/en/add_accessor_daskmp_doc/scalability_logic.html

Details

The documentation changes are the following:

  • Added a whole new "Scalability" section in "Features" with 4 sub-pages "Usage and good practices" (intro with summary and short examples; for those who want to get directly into it), "Concept definition" (explanation of scalability concepts with small examples; for those who want to learn), "Supported operations" (detailed table of scalability support across methods/object types; for those who want to know scalability exactly), and "Implementation strategies" with diagrams and details of implementation (for interested users debugging their memory or who want to learn, or developers/contributors to get a grasp of the code logic),
  • Re-worked a page "Feature and scalability overview" in "Getting started" that is very high-level (no code run, just tables and summaries),
  • Added a new page "Cheatsheet: From GDAL" + "Ecosystem" pages in a new "Resources" area (not so happy with the ecosystem page yet, need to buff it up a bit more),
  • Edited page "The georeferenced raster" to work for both Raster and accessor,
  • Removed page "Implicit lazy loading" in "Fundamentals" (as now covered in new "Scalability/Concept definition" page),
  • Updated the "API reference" to explain the mirrored API + to have a custom template showing Raster.method() or ds.rst.method() for each method linked through a single RasterBase call with RasterBase itself not being visible (for users using the search button). Also added full Raster + RasterAccessor full autoclass summary for those also curious to search through the class details. But those two pages are "hidden" in the table-of-content structure to avoid duplicating the main API page (users can only land there through search or clicking the class object name); and the page starts with a link pointing back to the main API, if that's not where they wanted to land.

I haven't yet updated the "Quick start", but I think we really need a better example there, and we should use the Xarray/Pandas accessor directly.
Same for the "Fundamentals" section, there's some editing to do there so that's it's not too "GeoUtils object-focused", but balanced whether it's about accessors or GeoUtils objects.
Finally, we have to see for the "Examples" (in feature pages, and in the Sphinx gallery), do we switch all to Xarray/Pandas accessor?

In particular, as I was already doing diagrams in Python code for something else, I kept my inertia and decided to add some to the PR to describe our implementations visually! 😄
We might think of doing something similar to describe functions themselves in the future (interpolation, etc).

Finally, I realized that it was annoying to explain the mirror for Raster/rst but not for vector/point cloud at the same time, even though the accessors for those 2 are much easier to add.
So I might add them when I get the chance in the next weeks while this PR is being reviewed 😉

Resolves #677
Resolves #673

@rhugonnet
Copy link
Copy Markdown
Member Author

rhugonnet commented Mar 10, 2026

@belletva @adehecq @atedstone @adebardo @marinebcht @erikmannerfelt @ould-a This new documentation draft is ready for your review! 😄
@remi-braun @guillaumeeb @fmaussion @friedrichknuth If you have the time to provide comments, that'd be amazing as well!

The link is here: https://geoutils-rhugonnet.readthedocs.io/en/add_accessor_daskmp_doc/feature_overview.html
I think I'll wait for 2-3 weeks to leave time to get everyone's feedback, then consolidate.

I suggest you start by reading the new "Feature and scalability overview" in the "Getting started" section, then move on to the "Scalability" section which essentially contains all the novelty.
Then, there are smaller changes in specific pages. On GitHub, you can ignore the plotting scripts for the new diagrams, and simply comment about the rendering of those on the related page ("Implementation strategies"), that'll make it easier.

I also added this new "Cheasheet: From GDAL" page to help users make the link. We could also add other Python packages there (i.e. table that @remi-braun started)?
Finally, you'll notice that we (mostly unintentionally) converged almost exactly towards the same API as the new overhauled GDAL CLI, which is great news (even "footprint" or "info" are the same)! At this stage, we could almost think of updating the last few functions to match that API entirely. (Move crop to clip, for instance)
I let you comment on that as well 😉

Very happy to see this in a near-finalized stage after so many years of us working on it! 😊

@remi-braun
Copy link
Copy Markdown

remi-braun commented Mar 10, 2026

Hi @rhugonnet,
Your doc is simply amazing! I bow before such hard work 👏
I particularly love how you explained the different scalability concepts and the supported operations summary 💖

I have some comments, but honestly it's not much:

  • In Feature and scalability overview
    • In Summary paragraph, I would highlight the paragraph about you making sure that all methods produce identical outputs. This is extremely important and often not clearly stated in other libs
    • In Data operations, maybe a bit more explanations about what "scalable means" (just a hyperlink is enough), as it is explained after in the docs
  • In Supported operations, it seems lacking some methods (i.e. merge_rasters). Maybe adding the methods that will arrive in a near future could be useful too, but this is maybe a bad idea haha
  • In Cheatsheet (I also love this page)
    • I would go for a 100% match with GDAL honestly, IMO the field suffers from varying vocabulary across libraries
    • The footprint corresponds to the extent of the raster (=bbox in STAC) or the area without nodata ? (I am used to this definition). Maybe this is worth adding in the docs (here is what I tried to do)
  • In Ecosystem
    • Thank you for mentioning EOReader ❤️
    • Don't you want to mention rioxarray and rasterio here?
  • I think your Mission page is very important to define what this lib is about, maybe it would be useful to link it somewhere in the first pages as an hyperlink (even in ReadMe)

And again: such a wonderful job

@remi-braun
Copy link
Copy Markdown

I forgot a point :

  • In Cheatsheet
    • Maybe add the rasterio equivalence at least and / or maybe link the rioxarray / rasterio equivalences (here)

All pages of this documentation containing code cells can be **run interactively online without the need of setting up your own environment**. Simply click the top launch button!
(MyBinder can be a bit capricious: you might have to be patient, or restart it after the build is done the first time 😅)

Alternatively, start your own notebook to test GeoUtils at [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/GlacioHack/geoutils/main).
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the Binder not working anymore?

-

* - {meth}`~geoutils.Raster.reproject()`
- Reproject to other CRS. Default tolerance parameters ensure chunk-invariance.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to other CRS, but also to another grid (resolution, bounds...)


We first describe GeoUtils' core **data operations**, which operate on underlying arrays or geometries and can therefore benefit from **scalable execution**.

**Legend:** **“/”** indicates methods **shared across object types**, while **“⟷”** indicates methods **interfacing between two object types**.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The work you've done is exceptional as always.
I'll try to find some time to go through it all.
But just in case I forget, I'd like to suggest a nice way to display the captions.

.. admonition:: Legend

   - ``/`` — Methods **shared across object types**
   - ``↔`` — Methods **interfacing between two object types**
Image

- Rasterio / PyProj

* - {meth}`~geoutils.Raster.crop()`
- Crop to bounds, either intersecting (untouched) allowing efficient I/O, or clipped (data modified).
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand this part. I guess because the meaning is maybe different for raster vs vector? For example, I think there is never any resampling for raster, right? So "data modified" only applies to geometries, in case of clipping?

- ✅
- Rasterio / GeoPandas

* - {meth}`~geoutils.Vector.rasterize()`
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it strange that create_mask and rasterize fall under different categories, since they have basically the same underlying logic.

## Metadata properties and operations

In addition to data operations, GeoUtils exposes **metadata** properties and methods consistently across geospatial objects.
These operate only on metadata and therefore **do not load or modify underlying data arrays**.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The term "operate" is maybe not the most appropriate, as no operation is run, they are just properties. Maybe "Reading/accessing those properties does not require loading the whole data array. If the property is modified by one of the above operation (e.g. reproject), the actual operation is run only when needed (delayed operation)"
Not fully sure about the last sentence though.

-

* - {attr}`~geoutils.Raster.data`
- Data array (2D grid for raster, 1D for point cloud).
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2D or 3D (for multiband)

and `rast`'s metadata is sufficient to provide a georeferenced grid for {func}`~geoutils.Vector.proximity`. The array will only be loaded when necessary.
```

## Quick plotting
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These features and below are all good to be aware of. Has the description be moved elsewhere?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a more consistent way to describe the different features that are share among the different objects, so well done. I'm only a bit worried that some of the features that were introduced here such as arithmetic operations, plotting, saving etc, are lost, but maybe it's been moved somewhere else.

For a raster output, one typically wants to write to file lazily to avoid loading it in-memory:

```{code-cell} python
# ds_reproj.rst.to_file("reproj_rast.tif", compute=True)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line is commented!

- If **memory** is the limitating factor for you, use a **single-threaded scheduler** through Dask (```dask.config.set(scheduler='single-threaded')```) or Multiprocessing (default cluster),
- If **speed** is the limiting factor for you, use **parallelized processes** through Dask (see [Dask scheduler configuration](https://docs.dask.org/en/stable/scheduler-overview.html#scheduler-overview)) or Multiprocessing (see our Cluster configuration),
- Choose chunk sizes large enough to reduce scheduling overhead, but **small enough to fit comfortably in memory**,
- Check that your data files have **on-disk chunksizes** (otherwise loads everything) and use a multiple of it for optimal **in-memory chunking**,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's often the crux I find when working with Dask. Does it make sense to explain somewhere which file format allow chunks and how to create them (externally, e.g. zarr? but also with Geoutils, e.g. option "tiled" by default with Raster.to_file etc)?


# We open the dataset without Dask backend (same behaviour with Raster class)
filename_rast = gu.examples.get_path("exploradores_aster_dem")
ds = gu.open_raster(filename_rast)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a remark, I find it a bit confusing now that we have 2 options to read the rasterm either with gu.Raster or gu.open_raster. If I understand correctly, the latter is needed to instantiate a dask array instead of a Raster? Is there a way to homogenize both?
I'm quite attached to the gu.Raster way now, but I guess it would be strange to have an option "chunks" that would return a dask array rather than a Raster... In that case, should open_raster be the preferred option (and the other approach being less documented or deprecated). I guess this approach is more consistent with xarray and geopandas framework.

rast = gu.Raster(filename_rast)

# Create Multiprocessing config, output filepath optional (temporary file by default)
mp_config = gu.multiproc.MultiprocConfig(chunk_size=200)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a thought. Could we not bypass this step, by adding an argument "chunk_size" (and all other needed) to gu.Raster (or gu.load_raster) to specify that operations should be made in chunk, with backend="multiproc"?


Lazy execution refers to **deferring computation until results are explicitly requested**.

In GeoUtils, lazy execution is available through the Xarray {class}`rst <geoutils.RasterAccessor>` accessor with **Dask-backed arrays**.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dask-backed -> Dask-backend?

We can materialize it with `compute()`:

```{code-cell}
ds_interp.compute()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a question since I'm still not very familiar with Dask arrays. It looks like the compute method returns an array. Is it needed to capture that output, or is the data stored somewhere in the ds_interp object?


This enables **out-of-core execution**, allowing datasets larger than available RAM to be processed safely.

In GeoUtils, chunked execution is implemented through two backends:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and the examples below are mostly redundant with the "Good practice" page. Some repetition is not bad, but I wonder if we could reduce the redundancy of information a bit? Maybe the concepts should go first and only provide theoretical backgrounds with little/no example, and all examples should be in the "usage and good practice" page?

* - {meth}`~geoutils.Raster.reproject`
- {bdg-success}`Chunked`
- {bdg-success}`Chunked`
- ~4 (default), or ~downsampling²
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the exponent 2 here, a typo? I first thought it was a footnote but could not find it 😆

* - Method
- Input
- Output
- Memory usage (# chunks)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand what is the meaning of the memory usage column...

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same for me

-
-

* - {meth}`~geoutils.Raster.reproject`
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The table is great to know which steps need to be implemented in the future 😄


Several operations are easy to support for **chunked execution** as they directly re-use existing **Dask** methods:
- The {meth}`~geoutils.Raster.filter` function uses {func}`~dask.array.map_overlap` with a `depth` (overlap) half the `size` of the filter,
- The {meth}`~geoutils.Raster.proximity` function uses {func}`~dask.array.map_overlap` with a `max_distance` parameter.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strange, because in the table on the previous page, proximity is marked as "in-memory" rather than "chunk" implemented. is the table wrong?

- The {meth}`~geoutils.Raster.filter` function uses {func}`~dask.array.map_overlap` with a `depth` (overlap) half the `size` of the filter,
- The {meth}`~geoutils.Raster.proximity` function uses {func}`~dask.array.map_overlap` with a `max_distance` parameter.

Other operations are more complex and require specific logic {ref}`specific logic described further below<specific-logic>` and summarized as:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"specific logic" is repeated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, you made a real effort to even make the figure in pure Python and reproducible, well done!! 🙌

The diagram below illustrates this mapping procedure, with a new CRS and a downsampling of 2:

```{eval-rst}
.. plot:: code/diagram_chunked_reproject.py
Copy link
Copy Markdown
Member

@adehecq adehecq Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice diagram! Just a minor thought. With "resolution x2", do you mean that it is coarsened (e.g. from 5m to 10m)? If so, why are the chunks 2x smaller, should it not be the opposite?
If x2 means higher resolution, then should the pixel grid not be denser on the right panel?
In any case, there is something that i find confusing between the pixel and chunk grid, as I would expected approximately as many pixels in each chunk... Currently 16 on the left, 4 on the right.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, it's all clear when reading the text below, sorry! I guess what matters is the number of dark gray pixels in the right panel.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it is always difficult to explain the resolution!

:width: 100%
```

All strategies begin by processing individual raster chunks independently, then reconstruct continuous polygons that span chunk boundaries.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super interesting! I just did not understand why we have the 3 strategies and when they are used.
I'm also wondering, what happens if polygons extend over more than 2 chunks? Can it just "grow" iteratively? Or do you need to load all the chunks that overlap with the geometry?

- New GDAL CLI
- xDEM

* - <span class="gu-table-section">Terrain attributes</span>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can add the terrain attributes that we plan to add ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this page 😍

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here! It's nice to see that all packages convert towards a similar toponymy!

Comment thread doc/source/ecosystem.md
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice this page !!

Comment thread doc/source/ecosystem.md

Several platforms provide geospatial datasets, cloud environments and learning resources:

- **[STAC](https://stacspec.org)** — SpatioTemporal Asset Catalog standard for geospatial data discovery
Copy link
Copy Markdown
Contributor

@belletva belletva Mar 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose to add the following pages :

  • Geodes-Tools - CNES portal to share tools and ressources for satellite image processing
  • Geodes - CNES portal for distributing satellite products (Sentinel 1 and 2, Spot, Venus, etc.)

The diagram below illustrates this mapping procedure, with a new CRS and a downsampling of 2:

```{eval-rst}
.. plot:: code/diagram_chunked_reproject.py
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

really nice diagram but quite difficult to see that the destination chunk requires 5 source chunks ? the bold blue is quite difficult to see

@rhugonnet
Copy link
Copy Markdown
Member Author

Thanks!
@adehecq @belletva Don't forget to comment on the important generic aspects:

  • Should we mirror the new GDAL CLI exactly? (a few functions: bounds to bbox, crop to clip, etc)?
  • Should we update the main pages/Sphinx-gallery example with accessor examples instead of GeoUtils classes?
  • Reworking the Quick start fully?

@belletva
Copy link
Copy Markdown
Contributor

Thanks! @adehecq @belletva Don't forget to comment on the important generic aspects:

  • Should we mirror the new GDAL CLI exactly? (a few functions: bounds to bbox, crop to clip, etc)?
  • Should we update the main pages/Sphinx-gallery example with accessor examples instead of GeoUtils classes?
  • Reworking the Quick start fully?

And again, congratulation for your amazing work !!

* - Raster calculator
- `gdal_calc.py`
- `gdal raster calc`
- NumPy array interface on {attr}`~geoutils.Raster.data`)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: extra ) at the end

* - Edit referencing
- `gdal_edit` / `gdalmove.py`
- `gdal raster edit`
- {meth}`~geoutils.Raster.set_crs`, {meth}`~geoutils.Raster.set_transform`, {meth}`~geoutils.Raster.set_nodata`, {meth}`~geoutils.Raster.translate`
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes me think that we could also have a unified function "edit" that does all of that, e.g.,
rst.edit(crs="EPSG:4326", nodata=0)
rst.edit(transform=(...))
etc.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be very interesting, as working with xarray has some drawback: your array can lose all attributes if you do sth wrong. Some other times, you have no choice and must set it again by hand, so having a unified fct that does all that can solve a true usecase 👍

* - Fill nodata gaps
- `gdal_fillnodata.py`
- `gdal raster fill-nodata`
- Not implemented (planned)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will be straightforward if we transfer the functionality from xDEM :-)

Comment thread doc/source/ecosystem.md
# Ecosystem

GeoUtils integrates naturally with the broader **geospatial ecosystem**.
It extends commonly used tools and works alongside many other for geospatial data access, processing, and analysis.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: other -> others

Comment thread doc/source/ecosystem.md
**[xDEM](https://github.com/GlacioHack/xdem)** is the sister package of GeoUtils, focused on the **analysis of digital elevation models (DEMs) and elevation point clouds**, including terrain attributes, coregistration and uncertainty propagation.
```

## Related Python libraries
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have liked to see a quick list/description of the libraries on which we rely: rasterio, geopandas, proj etc.
I will share with you a scheme I made to illustrate the Python geospatial ecosystem and where geoutils and xDEM fit.

Comment thread doc/source/ecosystem.md
- **[EOReader](https://github.com/sertit/eoreader)** — Unified access to satellite imagery products
- **[stackstac](https://github.com/gjoseph92/stackstac)** — Load STAC datasets as large raster stacks

Some libraries focus on raster operations specifically:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also include the orfeo toolbox here too?
https://www.orfeo-toolbox.org/

# The georeferenced raster

Below, a summary of the {class}`~geoutils.Raster` object and its methods.
In GeoUtils, the georeferenced raster object is mirrored through two objects:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand why you use the term "mirrored" here. It could work if you say "the Raster class is mirrored through the rst accessor", but here it does not sound right. Maybe "raster objects are handled/instantiated through two object types:"?

In GeoUtils, the georeferenced raster object is mirrored through two objects:

- The Xarray {class}`rst <geoutils.RasterAccessor>` accessor for a {class}`xarray.DataArray`,
- The {class}`~geoutils.Raster`.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when printed in the doc, the term "class" is not visible, so I would write "The {class}~geoutils.Raster class"


3. A {class}`~geoutils.Raster` can have **ambiguous casting behaviour** for subclasses (e.g., a {class}`~xdem.DEM`) making maintenance difficult, while accessors' **data-structure-centered mechanism enables clearer interfacing**.

4. A {class}`~geoutils.Raster` currently only supports **Multiprocessing** as scalable backend which is not lazy, while the {class}`rst <geoutils.RasterAccessor>` accessor support **Dask** allowing lazy graph building.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: support -> supports

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also add a coma after Dask.

A **raster** has **four main attributes**:

1. a {class}`numpy.ma.MaskedArray` as {attr}`~geoutils.Raster.data`, of either {class}`~numpy.integer` or {class}`~numpy.floating` {class}`~numpy.dtype`,
1. a {class}`numpy.ma.MaskedArray` ({class}`~geoutils.Raster`) or {class}`np.ndarray` ({class}`rst <geoutils.RasterAccessor>` accessor) as {attr}`~geoutils.Raster.data`, of either {class}`~numpy.integer` or {class}`~numpy.floating` {class}`~numpy.dtype` (forced to {class}`~numpy.floating` for {class}`rst <geoutils.RasterAccessor>` accessor),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This issue was not introduced in this PR, but I think it is easier to read if one first name the attribute (data, transform, crs, nodata) before describing them. Currently it's the other way around so tha attribute name is at the end of the sentence... Also the description is very much focused on type, which is not always very user-friendly (e.g. not many people know/care about affine.Affine objects. It should first describe what is the meaning of each.

multi-band raster!
```

Finally, the remaining attributes are only relevant when instantiating from a **on-disk** file: {attr}`~geoutils.Raster.name`, {attr}`~geoutils.Raster.driver`,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: a on-disk -> an on-disk

@adehecq
Copy link
Copy Markdown
Member

adehecq commented Mar 18, 2026

I just finished my 2nd round of reviews. Very nice documentation ! 😍 Thanks for all the effort you made to explain the concepts and write a clear documentation!! 🙌

Replying to your main questions below:

* Should we mirror the new GDAL CLI exactly? (a few functions: `bounds` to `bbox`, `crop` to `clip`, etc)?

If not too much effort, yes I would encourage going in that direction! It is so much easier when different tools use the same naming convention and I like that we "accidentally" end up with most of the same terms 😆 I actually suggested having an edit method, but the other functions can also be useful.
Like @remi-braun, it is unclear to me if footprint refers to the bounding box, or the actual raster footprint. The latter would be useful, and I think we have a tool to calculate box of valid areas only.

* Should we update the main pages/Sphinx-gallery example with accessor examples instead of GeoUtils classes?

Since I am still a big user of the Geoutils classes, I would keep those examples. Also a lot of our base users are used to this system now, so I think it is too early to completely change the structure. Something to re-evaluate later?
We could of course provide some examples with the accessor in a few places (using tab, see below).

* Reworking the Quick start fully?

Maybe we could show the same example with the 2 approaches? Using tabs, like in xDEM CLI documentation could be a nice way to avoid long pages?
I also think we need a more interesting example than this one. But this would require new example datasets maybe. Should we have a brainstorming at our next meeting?

Comment thread doc/source/ecosystem.md
GeoUtils integrates naturally with the broader **geospatial ecosystem**.
It extends commonly used tools and works alongside many other for geospatial data access, processing, and analysis.

See the {ref}`accessors` page for details on GeoUtils' accessors.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the link does not work

Comment thread doc/source/ecosystem.md
See the {ref}`accessors` page for details on GeoUtils' accessors.

```{seealso}
**[xDEM](https://github.com/GlacioHack/xdem)** is the sister package of GeoUtils, focused on the **analysis of digital elevation models (DEMs) and elevation point clouds**, including terrain attributes, coregistration and uncertainty propagation.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

daughter ?

# Feature and scalability overview

# Feature overview
GeoUtils provides a unified API for manipulating **raster**, **vector**, and **point-cloud** data, and provides **scalable CPU execution** for most raster operations through Dask and Multiprocessing.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

raster, vector and point-cloud data

To facilitate the analysis process, GeoUtils includes quick plotting tools that support multiple colorbars and implicitly add layers to the current axis.
Those are build on top of {func}`rasterio.plot.show` and {func}`geopandas.GeoDataFrame.plot`, and relay any argument passed.
The **{ref}`summary tables<tables-overview>` directly below** lists the core features of GeoUtils, their scalability and available backends.
Further below, a series of **{ref}`illustrated examples<examples-overview>`** demonstrate these features.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the {ref} doest not work

- Scalable
- Backend

* - <span class="gu-table-section">Raster / Vector / Point</span>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PointCloud

## Using Dask through accessors

With **Dask**, raster operations are both **chunked** and **lazy**.
This behavior is enabled by opening a raster with the `chunks` argument, which returns an Xarray object backed by Dask arrays.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if its obvious or not that chunks is in pixels

```{code-cell} python
# ds_reproj.rst.to_file("reproj_rast.tif", compute=True)
```

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parameter compute=True ... triggers

```

If the output is a {class}`~geoutils.Raster`, it is written to disk out-of-memory, and the returned object is a {class}`~geoutils.Raster` of that file without data loaded.
This keeps syntax consistent with in-memory code, and allow to easily chain operations.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

allows

## Good practices with chunked and lazy operations

- If **memory** is the limitating factor for you, use a **single-threaded scheduler** through Dask (```dask.config.set(scheduler='single-threaded')```) or Multiprocessing (default cluster),
- If **speed** is the limiting factor for you, use **parallelized processes** through Dask (see [Dask scheduler configuration](https://docs.dask.org/en/stable/scheduler-overview.html#scheduler-overview)) or Multiprocessing (see our Cluster configuration),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Cluster generation and configuration" to fit the page content


For more guidance on chunk sizing and performance, see the [Dask array best practices](https://docs.dask.org/en/stable/array-best-practices.html).

Finally, note that currently, operations returning **point** or **vector** outputs are often **eager** and scalable execution applies mostly to the **raster input/output**.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand how one chooses between "point", "point cloud" without s or "PointCloud"

ds_cropped = ds.rst.icrop((0, 0, 100, 100))

# Neither input nor output dataset are loaded yet
print(f"Input loaded by deferred I/O? {ds.rst.is_loaded}")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ds.rst.get_stats()

# The dataset is now loaded
print(f"Loaded after data operation? {ds.rst.is_loaded}")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Image

:align: center
:class: tight-table

* - Method
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://geoutils-rhugonnet.readthedocs.io/en/add_accessor_daskmp_doc/feature_overview.html in this order maybe ? "Raster / Vector / Point" with plot, "Raster / Point", etc ... and after the -> list with only A -> B, A and B different

- {bdg-secondary}`In-memory`
- —

* - <span class="gu-table-section">Raster ⟶ Point</span>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment, why "point" and not "PointCloud" ?

(scalability-logic)=
# Implementation strategies

Implementing **chunked execution** requires developping substantial internal logic often invisible to the user, making it difficult to understand what is happening in the background and how to potentially address a scalability issue.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

developing

Copy link
Copy Markdown
Contributor

@marinebcht marinebcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

@rhugonnet
Copy link
Copy Markdown
Member Author

Thanks for the feedback! 😉
The reviewer distribution is currently biased pretty heavily towards non-Xarray users. It would be good to hear more from those experienced with Xarray/Dask for this PR specifically, if some of you have the time @atedstone @erikmannerfelt @friedrichknuth @fmaussion @scottyhq @guillaumeeb 🙂

@rhugonnet
Copy link
Copy Markdown
Member Author

Reminder: Last week for feedback, then I'll consolidate and merge!

@friedrichknuth
Copy link
Copy Markdown
Contributor

This is fantastic - thank you for all the hard work on bridging xDEM with Xarray! 🚀 The documentation is very helpful and really nice to see supported operations in the Table summary 🤩

After browsing the documentation I have a few comments:

  • There seem to be two methods for opening a file from disk (gu.Raster(fn) or gu.open_raster(fn)). Ideally, there is only one way. gu.Raster(fn) has been nice. To keep the API consistent, perhaps gu.Raster(fn) can adopt the functionality provided in gu.open_raster(fn) and simply convert numpy arrays to dask arrays in memory, if passed a rasterio dataset with chunks argument.
  • I frequently use dask.array.rechunk to determine optimal chunk sizes using the block_size_limit and balance arguments and rarely have to think about defining multiples of the pixel size. Adding a pointer to this API might be helpful to others as well in the Good practices with chunked and lazy operations section.
  • The APIs for chunk size definition are slightly different, e.g. gu.open_raster(filename_rast, chunks={"x": 200, "y": 200}) and gu.multiproc.MultiprocConfig(chunk_size=200). Given that pixels don't have to be square in dimension, gu.multiproc.MultiprocConfig(chunks={"x": 200, "y": 200}) could be more explicit, consistent, and give better control?
  • Is gu.multiproc.MultiprocConfig(chunk_size='auto') currently supported?
  • Is there an example for writing a COG to disk? With enhanced lazy IO and multiprocessing support in GeoUtils / xDEM it might be helpful to state which modern cloud optimized formats are supported and how to work with them.
  • While dask will parallelize tasks by default, setting up a cluster (even locally) is much more powerful. I see the MultiprocConfig page. Similarly, one might point users to the dask documentation to instantiate a cluster with dask, before using it under the hood in GeoUtils.
  • Xvec may be of interest when tackling vector support in the future https://xvec.readthedocs.io/en/stable/

@marinebcht
Copy link
Copy Markdown
Contributor

I did not find how do you set a mask in a Xarray (ds.rst.set_mask do not work) and how to you cast an Raster to an Xarray ?
Thanks :)

@rhugonnet
Copy link
Copy Markdown
Member Author

Thanks a lot @friedrichknuth! 😉

On @marinebcht's question:

I did not find how do you set a mask in a Xarray (ds.rst.set_mask do not work) and how to you cast an Raster to an Xarray ? Thanks :)

No masked arrays through Xarray, so the arrays are forced to floating type to support NaNs instead. So an equivalent would be simply: ds[mask] = np.nan (and you lose the original data under the mask).

For conversion: Raster.to_xarray() or ds.rst.to_geoutils() depending on the direction. If you want to compare them, note that raster_equal() and raster_allclose() already support both input types (Raster or xr.DataArray). 🙂

@remi-braun
Copy link
Copy Markdown

Hello,

I don't want to put pressure or whatever, but @rhugonnet do you have a release date in mind? 😇
Thanks a lot!

@rhugonnet
Copy link
Copy Markdown
Member Author

Hello,

I don't want to put pressure or whatever, but @rhugonnet do you have a release date in mind? 😇 Thanks a lot!

Yes, sorry, I was busy with deadlines on research projects the past weeks 😅
I finalized the related accessor in xDEM early April, so I should be able to merge both fairly quickly now. I'll dedicate some of the next weekends on this (to account for everyone's comments), so I think it'll be done around mid-June (I leave on holidays after that). 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Finalize Multiprocessing API ahead of documentation Multiprocessing/Tiling Documentation

7 participants