Add documentation for Dask/Multiprocessing support and Xarray accessor by rhugonnet · Pull Request #878 · GlacioHack/geoutils

rhugonnet · 2026-03-06T06:42:17Z

This PR finally adds the documentation for the Xarray accessor rst, as well as the Dask and Multiprocessing support through chunked implementations that we have been steadily developing for the last 2-3 years! 🥳

Thanks in particular to @vschaffn @ameliefroessl for their big contributions at various stages!

Link to the new doc (landing on the Scalability page with schematics that was a bit of work!): https://geoutils-rhugonnet.readthedocs.io/en/add_accessor_daskmp_doc/scalability_logic.html

Details

The documentation changes are the following:

Added a whole new "Scalability" section in "Features" with 4 sub-pages "Usage and good practices" (intro with summary and short examples; for those who want to get directly into it), "Concept definition" (explanation of scalability concepts with small examples; for those who want to learn), "Supported operations" (detailed table of scalability support across methods/object types; for those who want to know scalability exactly), and "Implementation strategies" with diagrams and details of implementation (for interested users debugging their memory or who want to learn, or developers/contributors to get a grasp of the code logic),
Re-worked a page "Feature and scalability overview" in "Getting started" that is very high-level (no code run, just tables and summaries),
Added a new page "Cheatsheet: From GDAL" + "Ecosystem" pages in a new "Resources" area (not so happy with the ecosystem page yet, need to buff it up a bit more),
Edited page "The georeferenced raster" to work for both Raster and accessor,
Removed page "Implicit lazy loading" in "Fundamentals" (as now covered in new "Scalability/Concept definition" page),
Updated the "API reference" to explain the mirrored API + to have a custom template showing Raster.method() or ds.rst.method() for each method linked through a single RasterBase call with RasterBase itself not being visible (for users using the search button). Also added full Raster + RasterAccessor autoclass summaries for those also curious to search through the class details. But those two pages are "hidden" in the table-of-contents structure to avoid duplicating the main API page (users can only land there through search or clicking the class object name); and each page starts with a link pointing back to the main API, if that's not where they wanted to land.

I haven't yet updated the "Quick start", but I think we really need a better example there, and we should use the Xarray/Pandas accessor directly.
Same for the "Fundamentals" section, there's some editing to do there so that it's not too "GeoUtils object-focused", but balanced whether it's about accessors or GeoUtils objects.
Finally, we have to see for the "Examples" (in feature pages, and in the Sphinx gallery), do we switch all to Xarray/Pandas accessor?

In particular, as I was already doing diagrams in Python code for something else, I kept my inertia and decided to add some to the PR to describe our implementations visually! 😄
We might think of doing something similar to describe functions themselves in the future (interpolation, etc).

Finally, I realized that it was annoying to explain the mirror for Raster/rst but not for vector/point cloud at the same time, even though the accessors for those 2 are much easier to add.
So I might add them when I get the chance in the next weeks while this PR is being reviewed 😉

Resolves #677
Resolves #673

rhugonnet · 2026-03-10T06:10:05Z

@belletva @adehecq @atedstone @adebardo @marinebcht @erikmannerfelt @ould-a This new documentation draft is ready for your review! 😄
@remi-braun @guillaumeeb @fmaussion @friedrichknuth If you have the time to provide comments, that'd be amazing as well!

The link is here: https://geoutils-rhugonnet.readthedocs.io/en/add_accessor_daskmp_doc/feature_overview.html
I think I'll wait for 2-3 weeks to leave time to get everyone's feedback, then consolidate.

I suggest you start by reading the new "Feature and scalability overview" in the "Getting started" section, then move on to the "Scalability" section which essentially contains all the novelty.
Then, there are smaller changes in specific pages. On GitHub, you can ignore the plotting scripts for the new diagrams, and simply comment about the rendering of those on the related page ("Implementation strategies"), that'll make it easier.

I also added this new "Cheatsheet: From GDAL" page to help users make the link. We could also add other Python packages there (i.e. table that @remi-braun started)?
Finally, you'll notice that we (mostly unintentionally) converged almost exactly towards the same API as the new overhauled GDAL CLI, which is great news (even "footprint" or "info" are the same)! At this stage, we could almost think of updating the last few functions to match that API entirely. (Move crop to clip, for instance)
I'll let you comment on that as well 😉

Very happy to see this in a near-finalized stage after so many years of us working on it! 😊

remi-braun · 2026-03-10T10:24:10Z

Hi @rhugonnet,
Your doc is simply amazing! I bow before such hard work 👏
I particularly love how you explained the different scalability concepts and the supported operations summary 💖

I have some comments, but honestly it's not much:

In Feature and scalability overview
- In Summary paragraph, I would highlight the paragraph about you making sure that all methods produce identical outputs. This is extremely important and often not clearly stated in other libs
- In Data operations, maybe a bit more explanation of what "scalable" means (just a hyperlink is enough), as it is explained later in the docs
In Supported operations, some methods seem to be missing (e.g. merge_rasters). Maybe adding the methods that will arrive in the near future could be useful too, but this is maybe a bad idea haha
In Cheatsheet (I also love this page)
- I would go for a 100% match with GDAL honestly, IMO the field suffers from varying vocabulary across libraries
- The footprint corresponds to the extent of the raster (=bbox in STAC) or the area without nodata ? (I am used to this definition). Maybe this is worth adding in the docs (here is what I tried to do)
In Ecosystem
- Thank you for mentioning EOReader ❤️
- Don't you want to mention rioxarray and rasterio here?
I think your Mission page is very important to define what this lib is about, maybe it would be useful to link it somewhere in the first pages as a hyperlink (even in ReadMe)

And again: such a wonderful job

remi-braun · 2026-03-10T11:23:13Z

I forgot a point :

In Cheatsheet
- Maybe add the rasterio equivalence at least and / or maybe link the rioxarray / rasterio equivalences (here)

adehecq · 2026-03-11T09:42:27Z

-All pages of this documentation containing code cells can be **run interactively online without the need of setting up your own environment**. Simply click the top launch button!
-(MyBinder can be a bit capricious: you might have to be patient, or restart it after the build is done the first time 😅)
-
-Alternatively, start your own notebook to test GeoUtils at [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/GlacioHack/geoutils/main).


Is the Binder not working anymore?

adehecq · 2026-03-11T09:44:02Z

+  -
+
+* - {meth}`~geoutils.Raster.reproject()`
+  - Reproject to other CRS. Default tolerance parameters ensure chunk-invariance.


to other CRS, but also to another grid (resolution, bounds...)

adebardo · 2026-03-11T09:47:40Z

+
+We first describe GeoUtils' core **data operations**, which operate on underlying arrays or geometries and can therefore benefit from **scalable execution**.
+
+**Legend:** **“/”** indicates methods **shared across object types**, while **“⟷”** indicates methods **interfacing between two object types**.


The work you've done is exceptional as always.
I'll try to find some time to go through it all.
But just in case I forget, I'd like to suggest a nice way to display the captions.

.. admonition:: Legend

   - ``/`` — Methods **shared across object types**
   - ``↔`` — Methods **interfacing between two object types**

adehecq · 2026-03-11T10:01:54Z

+  - Rasterio / PyProj
+
+* - {meth}`~geoutils.Raster.crop()`
+  - Crop to bounds, either intersecting (untouched) allowing efficient I/O, or clipped (data modified).


I don't fully understand this part. I guess because the meaning is maybe different for raster vs vector? For example, I think there is never any resampling for raster, right? So "data modified" only applies to geometries, in case of clipping?

adehecq · 2026-03-11T10:05:32Z

+  - ✅
+  - Rasterio / GeoPandas
+
+* - {meth}`~geoutils.Vector.rasterize()`


I find it strange that create_mask and rasterize fall under different categories, since they have basically the same underlying logic.

adehecq · 2026-03-11T10:38:56Z

+## Metadata properties and operations
+
+In addition to data operations, GeoUtils exposes **metadata** properties and methods consistently across geospatial objects. 
+These operate only on metadata and therefore **do not load or modify underlying data arrays**.


The term "operate" is maybe not the most appropriate, as no operation is run, they are just properties. Maybe "Reading/accessing those properties does not require loading the whole data array. If the property is modified by one of the above operation (e.g. reproject), the actual operation is run only when needed (delayed operation)"
Not fully sure about the last sentence though.

adehecq · 2026-03-11T10:40:19Z

+  -
+
+* - {attr}`~geoutils.Raster.data`
+  - Data array (2D grid for raster, 1D for point cloud).


2D or 3D (for multiband)

adehecq · 2026-03-11T10:44:26Z

-and `rast`'s metadata is sufficient to provide a georeferenced grid for {func}`~geoutils.Vector.proximity`. The array will only be loaded when necessary.
-```
-
-## Quick plotting


These features and below are all good to be aware of. Has the description been moved elsewhere?

adehecq · 2026-03-11T10:46:15Z

This is a more consistent way to describe the different features that are shared among the different objects, so well done. I'm only a bit worried that some of the features that were introduced here such as arithmetic operations, plotting, saving etc, are lost, but maybe it's been moved somewhere else.

adehecq · 2026-03-11T10:52:07Z

+For a raster output, one typically wants to write to file lazily to avoid loading it in-memory:
+
+```{code-cell} python
+# ds_reproj.rst.to_file("reproj_rast.tif", compute=True)


this line is commented!

adehecq · 2026-03-11T10:57:00Z

+- If **memory** is the limitating factor for you, use a **single-threaded scheduler** through Dask (```dask.config.set(scheduler='single-threaded')```) or Multiprocessing (default cluster),
+- If **speed** is the limiting factor for you, use **parallelized processes** through Dask (see [Dask scheduler configuration](https://docs.dask.org/en/stable/scheduler-overview.html#scheduler-overview)) or Multiprocessing (see our Cluster configuration),
+- Choose chunk sizes large enough to reduce scheduling overhead, but **small enough to fit comfortably in memory**,
+- Check that your data files have **on-disk chunksizes** (otherwise loads everything) and use a multiple of it for optimal **in-memory chunking**,


That's often the crux I find when working with Dask. Does it make sense to explain somewhere which file formats allow chunks and how to create them (externally, e.g. zarr? but also with GeoUtils, e.g. option "tiled" by default with Raster.to_file etc)?
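On the "use a multiple of the on-disk chunk size" advice quoted above, here is a minimal sketch of the arithmetic, assuming chunk sizes are counted in pixels. `align_chunk_to_blocks` is a hypothetical helper written for illustration only, not a GeoUtils function, and the block size of 256 is an assumed example value.

```python
def align_chunk_to_blocks(target: int, block: int) -> int:
    """Round a target in-memory chunk size (pixels) to a multiple of the on-disk block size.

    Hypothetical helper: always returns at least one full block.
    """
    n_blocks = max(1, round(target / block))
    return n_blocks * block


# Example: a file tiled in 256-pixel blocks (assumed value)
print(align_chunk_to_blocks(1000, 256))
print(align_chunk_to_blocks(100, 256))
```

An in-memory chunk that is a whole number of on-disk blocks avoids reading the same block for two different chunks.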

adehecq · 2026-03-11T11:56:05Z

+
+# We open the dataset without Dask backend (same behaviour with Raster class)
+filename_rast = gu.examples.get_path("exploradores_aster_dem")
+ds = gu.open_raster(filename_rast)


Just a remark, I find it a bit confusing now that we have 2 options to read the raster, either with gu.Raster or gu.open_raster. If I understand correctly, the latter is needed to instantiate a dask array instead of a Raster? Is there a way to homogenize both?
I'm quite attached to the gu.Raster way now, but I guess it would be strange to have an option "chunks" that would return a dask array rather than a Raster... In that case, should open_raster be the preferred option (and the other approach being less documented or deprecated). I guess this approach is more consistent with xarray and geopandas framework.

adehecq · 2026-03-11T12:01:31Z

+rast = gu.Raster(filename_rast)
+
+# Create Multiprocessing config, output filepath optional (temporary file by default)
+mp_config = gu.multiproc.MultiprocConfig(chunk_size=200)


Just a thought. Could we not bypass this step, by adding an argument "chunk_size" (and all other needed) to gu.Raster (or gu.load_raster) to specify that operations should be made in chunk, with backend="multiproc"?

adehecq · 2026-03-11T12:02:38Z

+
+Lazy execution refers to **deferring computation until results are explicitly requested**.
+
+In GeoUtils, lazy execution is available through the Xarray {class}`rst <geoutils.RasterAccessor>` accessor with **Dask-backed arrays**.


Dask-backed -> Dask-backend?

adehecq · 2026-03-11T12:05:25Z

+We can materialize it with `compute()`:
+
+```{code-cell}
+ds_interp.compute()


Just a question since I'm still not very familiar with Dask arrays. It looks like the compute method returns an array. Is it needed to capture that output, or is the data stored somewhere in the ds_interp object?
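For context on this question, a small sketch of generic Xarray behaviour, assuming `ds_interp` is a Dask-backed Xarray object as the page describes. The array below is a made-up stand-in, not GeoUtils output: `compute()` returns a new in-memory object and leaves the original lazy, while `load()` fills the original in place.

```python
import dask.array as da
import numpy as np
import xarray as xr

# Stand-in for a lazy result (hypothetical data, not from GeoUtils)
lazy = xr.DataArray(da.ones((4, 4), chunks=(2, 2)), dims=("y", "x"))

# compute() returns a NEW object holding NumPy data...
result = lazy.compute()
result_is_numpy = isinstance(result.data, np.ndarray)

# ...while the original object is still backed by Dask
lazy_was_still_dask = isinstance(lazy.data, da.Array)

# load() materializes the data inside the original object instead
lazy.load()
loaded_in_place = isinstance(lazy.data, np.ndarray)

print(result_is_numpy, lazy_was_still_dask, loaded_in_place)
```

So with `compute()` the output has to be captured to be reused; otherwise the work is redone at the next request.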

adehecq · 2026-03-11T12:09:15Z

+
+This enables **out-of-core execution**, allowing datasets larger than available RAM to be processed safely.
+
+In GeoUtils, chunked execution is implemented through two backends:


This and the examples below are mostly redundant with the "Good practice" page. Some repetition is not bad, but I wonder if we could reduce the redundancy of information a bit? Maybe the concepts should go first and only provide theoretical backgrounds with little/no example, and all examples should be in the "usage and good practice" page?

adehecq · 2026-03-11T12:11:24Z

+* - {meth}`~geoutils.Raster.reproject`
+  - {bdg-success}`Chunked`
+  - {bdg-success}`Chunked`
+  - ~4 (default), or ~downsampling²


Why the exponent 2 here, a typo? I first thought it was a footnote but could not find it 😆
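One plausible reading of the exponent, not confirmed in this thread: when downsampling by a factor f, a destination chunk covers f times more ground along each axis, so it overlaps about f² source chunks. The sketch below counts overlapped source chunks for a simplified axis-aligned case without any CRS change. `count_source_chunks` and its parameters are hypothetical names chosen for illustration.

```python
import math


def count_source_chunks(dst_origin, dst_chunk_size, factor, src_chunk_size):
    """Count source chunks overlapped by one destination chunk.

    dst_origin: (row, col) of the destination chunk's upper-left corner, in source pixels.
    dst_chunk_size: chunk edge length, in destination pixels.
    factor: downsampling factor (one destination pixel spans `factor` source pixels).
    src_chunk_size: chunk edge length, in source pixels.
    """
    per_axis = []
    for start in dst_origin:
        stop = start + dst_chunk_size * factor  # footprint end, in source pixels
        first = math.floor(start / src_chunk_size)
        last = math.ceil(stop / src_chunk_size) - 1
        per_axis.append(last - first + 1)
    return per_axis[0] * per_axis[1]


# Same resolution, grids offset by half a chunk: 2 x 2 source chunks
print(count_source_chunks((50, 50), 100, 1, 100))
# Downsampling of 2, aligned grids: 2 x 2 source chunks
print(count_source_chunks((0, 0), 100, 2, 100))
```

Under these assumptions, a misaligned grid at equal resolution already touches 4 source chunks, which may be what the "~4 (default)" entry refers to.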

adehecq · 2026-03-11T12:12:53Z

+* - Method
+  - Input
+  - Output
+  - Memory usage (# chunks)


I don't fully understand what is the meaning of the memory usage column...

same for me

adehecq · 2026-03-11T12:14:35Z

+  -
+  -
+
+* - {meth}`~geoutils.Raster.reproject`


The table is great to know which steps need to be implemented in the future 😄

adehecq · 2026-03-11T12:16:29Z

+
+Several operations are easy to support for **chunked execution** as they directly re-use existing **Dask** methods:
+- The {meth}`~geoutils.Raster.filter` function uses {func}`~dask.array.map_overlap` with a `depth` (overlap) half the `size` of the filter,
+- The {meth}`~geoutils.Raster.proximity` function uses {func}`~dask.array.map_overlap` with a `max_distance` parameter.


Strange, because in the table on the previous page, proximity is marked as "in-memory" rather than "chunk" implemented. is the table wrong?
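The `depth` rule quoted in the hunk above (an overlap of half the filter `size`) can be checked without Dask. The sketch below is a NumPy-only illustration, not GeoUtils' implementation: `box_filter` and `box_filter_chunked` are hypothetical names, and a mean filter with edge padding stands in for whichever filter is applied.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def box_filter(arr, size):
    """Mean filter over a size x size window (odd size), edge-padded at the borders."""
    depth = size // 2
    padded = np.pad(arr, depth, mode="edge")
    return sliding_window_view(padded, (size, size)).mean(axis=(-2, -1))


def box_filter_chunked(arr, size, chunk):
    """Same filter, applied chunk by chunk with an overlap of size // 2 pixels."""
    depth = size // 2
    padded = np.pad(arr, depth, mode="edge")  # handles the outer border once
    out = np.empty(arr.shape, dtype=float)
    for r0 in range(0, arr.shape[0], chunk):
        for c0 in range(0, arr.shape[1], chunk):
            r1 = min(r0 + chunk, arr.shape[0])
            c1 = min(c0 + chunk, arr.shape[1])
            # In padded coordinates, this slice is the chunk plus `depth` pixels on each side
            block = padded[r0 : r1 + 2 * depth, c0 : c1 + 2 * depth]
            out[r0:r1, c0:c1] = sliding_window_view(block, (size, size)).mean(axis=(-2, -1))
    return out


rng = np.random.default_rng(0)
demo = rng.random((50, 70))
same = np.allclose(box_filter(demo, 5), box_filter_chunked(demo, 5, 16))
print(same)
```

With an overlap of `size // 2`, every output pixel sees its full window even at chunk edges, which is why the chunked and in-memory results agree.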

adehecq · 2026-03-11T12:16:48Z

+- The {meth}`~geoutils.Raster.filter` function uses {func}`~dask.array.map_overlap` with a `depth` (overlap) half the `size` of the filter,
+- The {meth}`~geoutils.Raster.proximity` function uses {func}`~dask.array.map_overlap` with a `max_distance` parameter.
+
+Other operations are more complex and require specific logic {ref}`specific logic described further below<specific-logic>` and summarized as:


"specific logic" is repeated

adehecq · 2026-03-11T12:21:04Z

Wow, you made a real effort to even make the figure in pure Python and reproducible, well done!! 🙌

adehecq · 2026-03-11T12:25:49Z

+The diagram below illustrates this mapping procedure, with a new CRS and a downsampling of 2:
+
+```{eval-rst}
+.. plot:: code/diagram_chunked_reproject.py


Very nice diagram! Just a minor thought. With "resolution x2", do you mean that it is coarsened (e.g. from 5m to 10m)? If so, why are the chunks 2x smaller, should it not be the opposite?
If x2 means higher resolution, then should the pixel grid not be denser on the right panel?
In any case, there is something that I find confusing between the pixel and chunk grid, as I would expect approximately as many pixels in each chunk... Currently 16 on the left, 4 on the right.

Ok, it's all clear when reading the text below, sorry! I guess what matters is the number of dark gray pixels in the right panel.

yes it is always difficult to explain the resolution!

adehecq · 2026-03-11T12:32:16Z

+    :width: 100%
+```
+
+All strategies begin by processing individual raster chunks independently, then reconstruct continuous polygons that span chunk boundaries.


Super interesting! I just did not understand why we have the 3 strategies and when they are used.
I'm also wondering, what happens if polygons extend over more than 2 chunks? Can it just "grow" iteratively? Or do you need to load all the chunks that overlap with the geometry?
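To illustrate the "grow iteratively" idea in the question, here is a deliberately simplified 1D sketch. It is not one of the three strategies in the PR, whose details are not given in this thread: runs of valid pixels are extracted per chunk, then runs that touch at a chunk boundary are merged, so a feature spanning three or more chunks is rebuilt by successive merges. Both function names are hypothetical.

```python
def runs_in_chunk(values, offset):
    """Return (start, stop) pairs of consecutive truthy runs, in global indices."""
    runs, start = [], None
    for i, v in enumerate(values):
        if v and start is None:
            start = i
        elif not v and start is not None:
            runs.append((offset + start, offset + i))
            start = None
    if start is not None:
        runs.append((offset + start, offset + len(values)))
    return runs


def merge_across_chunks(mask, chunk):
    """Segment each chunk independently, then merge runs touching at chunk boundaries."""
    pieces = []
    for off in range(0, len(mask), chunk):
        pieces.extend(runs_in_chunk(mask[off : off + chunk], off))
    merged = []
    for start, stop in pieces:  # pieces are already ordered by position
        if merged and merged[-1][1] == start:
            merged[-1] = (merged[-1][0], stop)  # touching: extend the previous run
        else:
            merged.append((start, stop))
    return merged


mask = [0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1]
# The first run spans three chunks of size 3 and is rebuilt by two merges
print(merge_across_chunks(mask, 3))
```

In this sketch only the per-chunk runs are kept between steps, so no chunk has to be reloaded to merge a long feature.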

belletva · 2026-03-16T16:23:28Z

+  - New GDAL CLI
+  - xDEM
+
+* - <span class="gu-table-section">Terrain attributes</span>


maybe we can add the terrain attributes that we plan to add ?

belletva · 2026-03-16T16:24:43Z

I really like this page 😍

Same here! It's nice to see that all packages converge towards similar terminology!

belletva · 2026-03-16T16:26:28Z

Really nice this page !!

belletva · 2026-03-16T16:30:33Z

+
+Several platforms provide geospatial datasets, cloud environments and learning resources:
+
+- **[STAC](https://stacspec.org)** — SpatioTemporal Asset Catalog standard for geospatial data discovery


I propose to add the following pages :

Geodes-Tools - CNES portal to share tools and resources for satellite image processing

Geodes - CNES portal for distributing satellite products (Sentinel 1 and 2, Spot, Venus, etc.)

belletva · 2026-03-16T16:42:47Z

+The diagram below illustrates this mapping procedure, with a new CRS and a downsampling of 2:
+
+```{eval-rst}
+.. plot:: code/diagram_chunked_reproject.py


really nice diagram but quite difficult to see that the destination chunk requires 5 source chunks ? the bold blue is quite difficult to see

rhugonnet · 2026-03-16T19:31:33Z

Thanks!
@adehecq @belletva Don't forget to comment on the important generic aspects:

Should we mirror the new GDAL CLI exactly? (a few functions: bounds to bbox, crop to clip, etc)?
Should we update the main pages/Sphinx-gallery example with accessor examples instead of GeoUtils classes?
Reworking the Quick start fully?

belletva · 2026-03-17T09:28:29Z

Thanks! @adehecq @belletva Don't forget to comment on the important generic aspects:

Should we mirror the new GDAL CLI exactly? (a few functions: bounds to bbox, crop to clip, etc)?

Should we update the main pages/Sphinx-gallery example with accessor examples instead of GeoUtils classes?

Reworking the Quick start fully?

Good for me to mirror the new GDAL CLI
If we update the main pages with accessor examples, I think that we should keep the Geoutils classes
I really like the Quick Start but it could be interesting to add point clouds. In the Short example, we could add point cloud data alongside raster and vector. Besides, we could modify "Examples using rasters, vectors and point clouds" and add the following examples to the panel: https://geoutils-rhugonnet.readthedocs.io/en/add_accessor_daskmp_doc/handling_examples/raster_point/interpolation.html, https://geoutils-rhugonnet.readthedocs.io/en/add_accessor_daskmp_doc/handling_examples/raster_point/topoints.html for instance. What do you think?

And again, congratulations on your amazing work !!

adehecq · 2026-03-18T07:31:37Z

+* - Raster calculator
+  - `gdal_calc.py`
+  - `gdal raster calc`
+  - NumPy array interface on {attr}`~geoutils.Raster.data`)


typo: extra ) at the end

adehecq · 2026-03-18T07:33:21Z

+* - Edit referencing
+  - `gdal_edit` / `gdalmove.py`
+  - `gdal raster edit`
+  - {meth}`~geoutils.Raster.set_crs`, {meth}`~geoutils.Raster.set_transform`, {meth}`~geoutils.Raster.set_nodata`, {meth}`~geoutils.Raster.translate`


this makes me think that we could also have a unified function "edit" that does all of that, e.g.,
rst.edit(crs="EPSG:4326", nodata=0)
rst.edit(transform=(...))
etc.

This could be very interesting, as working with xarray has some drawbacks: your array can lose all attributes if you do something wrong. Other times, you have no choice and must set them again by hand, so having a unified function that does all that can solve a true use case 👍

adehecq · 2026-03-18T07:34:49Z

+* - Fill nodata gaps
+  - `gdal_fillnodata.py`
+  - `gdal raster fill-nodata`
+  - Not implemented (planned)


this will be straightforward if we transfer the functionality from xDEM :-)

adehecq · 2026-03-18T07:38:30Z

+# Ecosystem
+
+GeoUtils integrates naturally with the broader **geospatial ecosystem**.  
+It extends commonly used tools and works alongside many other for geospatial data access, processing, and analysis.


typo: other -> others

adehecq · 2026-03-18T07:41:24Z

+**[xDEM](https://github.com/GlacioHack/xdem)** is the sister package of GeoUtils, focused on the **analysis of digital elevation models (DEMs) and elevation point clouds**, including terrain attributes, coregistration and uncertainty propagation.
+```
+
+## Related Python libraries


I would have liked to see a quick list/description of the libraries on which we rely: rasterio, geopandas, proj etc.
I will share with you a scheme I made to illustrate the Python geospatial ecosystem and where geoutils and xDEM fit.

adehecq · 2026-03-18T07:45:51Z

+- **[EOReader](https://github.com/sertit/eoreader)** — Unified access to satellite imagery products
+- **[stackstac](https://github.com/gjoseph92/stackstac)** — Load STAC datasets as large raster stacks
+
+Some libraries focus on raster operations specifically:


should we also include the orfeo toolbox here too?
https://www.orfeo-toolbox.org/

adehecq · 2026-03-18T07:51:02Z

+# The georeferenced raster

-Below, a summary of the {class}`~geoutils.Raster` object and its methods.
+In GeoUtils, the georeferenced raster object is mirrored through two objects:


I don't really understand why you use the term "mirrored" here. It could work if you say "the Raster class is mirrored through the rst accessor", but here it does not sound right. Maybe "raster objects are handled/instantiated through two object types:"?

adehecq · 2026-03-18T07:51:58Z

+In GeoUtils, the georeferenced raster object is mirrored through two objects:
+
+- The Xarray {class}`rst <geoutils.RasterAccessor>` accessor for a {class}`xarray.DataArray`,
+- The {class}`~geoutils.Raster`.


when printed in the doc, the term "class" is not visible, so I would write "The {class}~geoutils.Raster class"

adehecq · 2026-03-18T07:54:39Z

+
+3. A {class}`~geoutils.Raster` can have **ambiguous casting behaviour** for subclasses (e.g., a {class}`~xdem.DEM`) making maintenance difficult, while accessors' **data-structure-centered mechanism enables clearer interfacing**.
+
+4. A {class}`~geoutils.Raster` currently only supports **Multiprocessing** as scalable backend which is not lazy, while the {class}`rst <geoutils.RasterAccessor>` accessor support **Dask** allowing lazy graph building.


typo: support -> supports

also add a comma after Dask.

adehecq · 2026-03-18T07:58:55Z

+A **raster** has **four main attributes**:

-1. a {class}`numpy.ma.MaskedArray` as {attr}`~geoutils.Raster.data`, of either {class}`~numpy.integer` or {class}`~numpy.floating` {class}`~numpy.dtype`,
+1. a {class}`numpy.ma.MaskedArray` ({class}`~geoutils.Raster`) or {class}`np.ndarray` ({class}`rst <geoutils.RasterAccessor>` accessor) as {attr}`~geoutils.Raster.data`, of either {class}`~numpy.integer` or {class}`~numpy.floating` {class}`~numpy.dtype` (forced to {class}`~numpy.floating` for {class}`rst <geoutils.RasterAccessor>` accessor),


This issue was not introduced in this PR, but I think it is easier to read if one first names the attribute (data, transform, crs, nodata) before describing it. Currently it's the other way around, so the attribute name is at the end of the sentence... Also, the description is very much focused on type, which is not always very user-friendly (e.g. not many people know/care about affine.Affine objects). It should first describe the meaning of each.

adehecq · 2026-03-18T08:02:26Z

 multi-band raster!
 ```

 Finally, the remaining attributes are only relevant when instantiating from a **on-disk** file: {attr}`~geoutils.Raster.name`, {attr}`~geoutils.Raster.driver`,


typo: a on-disk -> an on-disk

adehecq · 2026-03-18T08:22:12Z

I just finished my 2nd round of reviews. Very nice documentation ! 😍 Thanks for all the effort you made to explain the concepts and write a clear documentation!! 🙌

Replying to your main questions below:

* Should we mirror the new GDAL CLI exactly? (a few functions: `bounds` to `bbox`, `crop` to `clip`, etc)?

If not too much effort, yes I would encourage going in that direction! It is so much easier when different tools use the same naming convention and I like that we "accidentally" end up with most of the same terms 😆 I actually suggested having an edit method, but the other functions can also be useful.
Like @remi-braun, it is unclear to me if footprint refers to the bounding box, or the actual raster footprint. The latter would be useful, and I think we have a tool to calculate box of valid areas only.

* Should we update the main pages/Sphinx-gallery example with accessor examples instead of GeoUtils classes?

Since I am still a big user of the Geoutils classes, I would keep those examples. Also a lot of our base users are used to this system now, so I think it is too early to completely change the structure. Something to re-evaluate later?
We could of course provide some examples with the accessor in a few places (using tab, see below).

* Reworking the Quick start fully?

Maybe we could show the same example with the 2 approaches? Using tabs, like in xDEM CLI documentation could be a nice way to avoid long pages?
I also think we need a more interesting example than this one. But this would require new example datasets maybe. Should we have a brainstorming at our next meeting?

marinebcht · 2026-03-18T09:16:09Z

+GeoUtils integrates naturally with the broader **geospatial ecosystem**.  
+It extends commonly used tools and works alongside many other for geospatial data access, processing, and analysis.
+
+See the {ref}`accessors` page for details on GeoUtils' accessors.


the link does not work

marinebcht · 2026-03-18T09:16:32Z

+See the {ref}`accessors` page for details on GeoUtils' accessors.
+
+```{seealso}
+**[xDEM](https://github.com/GlacioHack/xdem)** is the sister package of GeoUtils, focused on the **analysis of digital elevation models (DEMs) and elevation point clouds**, including terrain attributes, coregistration and uncertainty propagation.


marinebcht · 2026-03-18T09:24:54Z

+# Feature and scalability overview

-# Feature overview
+GeoUtils provides a unified API for manipulating **raster**, **vector**, and **point-cloud** data, and provides **scalable CPU execution** for most raster operations through Dask and Multiprocessing.


raster, vector and point-cloud data

marinebcht · 2026-03-18T09:29:41Z

-To facilitate the analysis process, GeoUtils includes quick plotting tools that support multiple colorbars and implicitly add layers to the current axis.
-Those are build on top of {func}`rasterio.plot.show` and {func}`geopandas.GeoDataFrame.plot`, and relay any argument passed.
+The **{ref}`summary tables<tables-overview>` directly below** lists the core features of GeoUtils, their scalability and available backends.
+Further below, a series of **{ref}`illustrated examples<examples-overview>`** demonstrate these features. 


the {ref} does not work

marinebcht · 2026-03-18T09:37:35Z

+  - Scalable
+  - Backend
+
+* - <span class="gu-table-section">Raster / Vector / Point</span>


marinebcht · 2026-03-18T13:33:11Z

+## Using Dask through accessors
+
+With **Dask**, raster operations are both **chunked** and **lazy**.
+This behavior is enabled by opening a raster with the `chunks` argument, which returns an Xarray object backed by Dask arrays.


I don't know if it's obvious or not that chunks is in pixels
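For reference, in Dask itself chunk sizes count array elements along each dimension, which for a raster means pixels. The sketch below uses plain `dask.array` on a made-up array and assumes the `chunks` argument of `gu.open_raster` is forwarded to Dask with the same meaning, which this thread does not state explicitly.

```python
import dask.array as da

# A 500 x 300 array split into blocks of at most 200 x 200 elements (pixels)
arr = da.zeros((500, 300), chunks=(200, 200))

# Chunk sizes along each dimension; the last block of each axis is the remainder
chunks = arr.chunks
print(chunks)
```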

marinebcht · 2026-03-18T13:34:22Z

+```{code-cell} python
+# ds_reproj.rst.to_file("reproj_rast.tif", compute=True)
+```
+


The parameter compute=True ... triggers

marinebcht · 2026-03-18T13:40:53Z

+```
+
+If the output is a {class}`~geoutils.Raster`, it is written to disk out-of-memory, and the returned object is a {class}`~geoutils.Raster` of that file without data loaded.
+This keeps syntax consistent with in-memory code, and allow to easily chain operations. 


marinebcht · 2026-03-18T13:44:51Z

+## Good practices with chunked and lazy operations
+
+- If **memory** is the limitating factor for you, use a **single-threaded scheduler** through Dask (```dask.config.set(scheduler='single-threaded')```) or Multiprocessing (default cluster),
+- If **speed** is the limiting factor for you, use **parallelized processes** through Dask (see [Dask scheduler configuration](https://docs.dask.org/en/stable/scheduler-overview.html#scheduler-overview)) or Multiprocessing (see our Cluster configuration),


"Cluster generation and configuration" to fit the page content

marinebcht · 2026-03-18T13:52:54Z

+
+For more guidance on chunk sizing and performance, see the [Dask array best practices](https://docs.dask.org/en/stable/array-best-practices.html).
+
+Finally, note that currently, operations returning **point** or **vector** outputs are often **eager** and scalable execution applies mostly to the **raster input/output**.


I don't quite understand how one chooses between "point", "point cloud" without s or "PointCloud"

marinebcht · 2026-03-18T14:03:29Z

+ds_cropped = ds.rst.icrop((0, 0, 100, 100))
+
+# Neither input nor output dataset are loaded yet
+print(f"Input loaded by deferred I/O? {ds.rst.is_loaded}")


@belletva :)

marinebcht · 2026-03-18T14:05:57Z

+ds.rst.get_stats()
+
+# The dataset is now loaded
+print(f"Loaded after data operation? {ds.rst.is_loaded}")


marinebcht · 2026-03-18T14:53:36Z

+:align: center
+:class: tight-table
+
+* - Method


https://geoutils-rhugonnet.readthedocs.io/en/add_accessor_daskmp_doc/feature_overview.html in this order maybe ? "Raster / Vector / Point" with plot, "Raster / Point", etc ... and after the -> list with only A -> B, A and B different

marinebcht · 2026-03-18T14:57:04Z

+  - {bdg-secondary}`In-memory`
+  - —
+
+* - <span class="gu-table-section">Raster ⟶ Point</span>


Same comment, why "point" and not "PointCloud" ?

marinebcht · 2026-03-18T15:00:46Z

+(scalability-logic)=
+# Implementation strategies
+
+Implementing **chunked execution** requires developping substantial internal logic often invisible to the user, making it difficult to understand what is happening in the background and how to potentially address a scalability issue.



rhugonnet · 2026-03-18T19:39:35Z

Thanks for the feedback! 😉
The reviewer distribution is currently biased pretty heavily towards non-Xarray users. It would be good to hear more from those experienced with Xarray/Dask for this PR specifically, if some of you have the time @atedstone @erikmannerfelt @friedrichknuth @fmaussion @scottyhq @guillaumeeb 🙂

rhugonnet · 2026-03-24T00:43:05Z

Reminder: Last week for feedback, then I'll consolidate and merge!

friedrichknuth · 2026-03-30T13:21:10Z

This is fantastic - thank you for all the hard work on bridging xDEM with Xarray! 🚀 The documentation is very helpful and really nice to see supported operations in the Table summary 🤩

After browsing the documentation I have a few comments:

There seem to be two methods for opening a file from disk (gu.Raster(fn) or gu.open_raster(fn)). Ideally, there is only one way. gu.Raster(fn) has been nice. To keep the API consistent, perhaps gu.Raster(fn) can adopt the functionality provided in gu.open_raster(fn) and simply convert numpy arrays to dask arrays in memory, if passed a rasterio dataset with chunks argument.
I frequently use dask.array.rechunk to determine optimal chunk sizes using the block_size_limit and balance arguments and rarely have to think about defining multiples of the pixel size. Adding a pointer to this API might be helpful to others as well in the Good practices with chunked and lazy operations section.
The APIs for chunk size definition are slightly different, e.g. gu.open_raster(filename_rast, chunks={"x": 200, "y": 200}) and gu.multiproc.MultiprocConfig(chunk_size=200). Given that pixels don't have to be square in dimension, gu.multiproc.MultiprocConfig(chunks={"x": 200, "y": 200}) could be more explicit, consistent, and give better control?
Is gu.multiproc.MultiprocConfig(chunk_size='auto') currently supported?
Is there an example for writing a COG to disk? With enhanced lazy IO and multiprocessing support in GeoUtils / xDEM it might be helpful to state which modern cloud optimized formats are supported and how to work with them.
While dask will parallelize tasks by default, setting up a cluster (even locally) is much more powerful. I see the MultiprocConfig page. Similarly, one might point users to the dask documentation to instantiate a cluster with dask, before using it under the hood in GeoUtils.
Xvec may be of interest when tackling vector support in the future https://xvec.readthedocs.io/en/stable/
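
To make the chunk-definition point above concrete, here is a minimal sketch of how a scalar chunk size and a per-dimension mapping could be reconciled into a single form. `normalize_chunks` is a hypothetical helper written only for illustration; it is not part of GeoUtils, and the `"x"`/`"y"` keys simply follow the `chunks={"x": 200, "y": 200}` example.

```python
from typing import Mapping, Union

ChunkSpec = Union[int, Mapping[str, int]]


def normalize_chunks(chunks: ChunkSpec, dims=("y", "x")) -> dict:
    """Hypothetical helper: expand a chunk spec to one size per dimension."""
    if isinstance(chunks, int):
        # A scalar means square chunks, as with chunk_size=200
        sizes = {dim: chunks for dim in dims}
    else:
        missing = [dim for dim in dims if dim not in chunks]
        if missing:
            raise ValueError(f"Missing chunk size for dimension(s): {missing}")
        sizes = {dim: int(chunks[dim]) for dim in dims}
    if any(size <= 0 for size in sizes.values()):
        raise ValueError("Chunk sizes must be positive integers")
    return sizes


print(normalize_chunks(200))
print(normalize_chunks({"x": 200, "y": 100}))
```

With a helper of this kind, both call styles would resolve to the same internal representation, which is one possible way to keep the two APIs consistent.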

marinebcht · 2026-03-30T14:06:43Z

I could not find how to set a mask on an Xarray object (ds.rst.set_mask does not work), or how to cast a Raster to an Xarray object.
Thanks :)

rhugonnet · 2026-03-30T18:37:45Z

Thanks a lot @friedrichknuth! 😉

On @marinebcht's question:

I could not find how to set a mask on an Xarray object (ds.rst.set_mask does not work), or how to cast a Raster to an Xarray object. Thanks :)

Xarray has no masked arrays, so the arrays are cast to a floating type to support NaNs instead. An equivalent would simply be: ds[mask] = np.nan (and you lose the original data under the mask).

For conversion: Raster.to_xarray() or ds.rst.to_geoutils() depending on the direction. If you want to compare them, note that raster_equal() and raster_allclose() already support both input types (Raster or xr.DataArray). 🙂
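
The difference between the two masking approaches can be sketched with plain NumPy (not Xarray or GeoUtils objects; the array values below are made up for illustration). The masked array keeps the original value under the mask, while the NaN approach requires a float cast and overwrites it.

```python
import numpy as np

data = np.array([[1, 2], [3, 4]], dtype="uint8")
mask = np.array([[False, True], [False, False]])

# Masked-array approach: the original values survive under the mask
masked = np.ma.masked_array(data, mask=mask)

# NaN approach: integers cannot hold NaN, so cast to float first
as_float = data.astype("float32")
as_float[mask] = np.nan  # the value under the mask is overwritten

# Both give the same statistic when invalid values are ignored
print(masked.mean(), np.nanmean(as_float))
print("Original value still stored in masked array:", masked.data[0, 1])
```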

remi-braun · 2026-05-27T10:13:21Z

Hello,

I don't want to put pressure or whatever, but @rhugonnet do you have a release date in mind? 😇
Thanks a lot!

rhugonnet · 2026-05-27T19:11:06Z

Hello,

I don't want to put pressure or whatever, but @rhugonnet do you have a release date in mind? 😇 Thanks a lot!

Yes, sorry, I was busy with deadlines on research projects the past weeks 😅
I finalized the related accessor in xDEM early April, so I should be able to merge both fairly quickly now. I'll dedicate some of the next weekends on this (to account for everyone's comments), so I think it'll be done around mid-June (I leave on holidays after that). 🙂

rhugonnet added 2 commits March 5, 2026 21:41

First draft of scalability documentation

a1f3310

Incremental commit on doc

dd1701f

Remove commented code + add other changes than doc

1c86ffa

adehecq reviewed Mar 11, 2026


adebardo reviewed Mar 11, 2026


adehecq reviewed Mar 11, 2026


belletva reviewed Mar 16, 2026


Comment thread doc/source/ecosystem.md


Contributor

belletva Mar 16, 2026


Really nice this page !!

belletva reviewed Mar 16, 2026


adehecq reviewed Mar 18, 2026


marinebcht reviewed Mar 18, 2026


belletva mentioned this pull request May 29, 2026

Test the new xarray accessor for workflows GlacioHack/xdem#960

Open


		We first describe GeoUtils' core data operations, which operate on underlying arrays or geometries and can therefore benefit from scalable execution.

		Legend: “/” indicates methods shared across object types, while “⟷” indicates methods interfacing between two object types.


		Lazy execution refers to deferring computation until results are explicitly requested.

		In GeoUtils, lazy execution is available through the Xarray {class}`rst <geoutils.RasterAccessor>` accessor with Dask-backed arrays.
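
As a rough illustration of the deferral idea, the toy class below records steps and only runs them when `compute()` is called. It is a pure-Python sketch with hypothetical names (`Deferred`, `load`); it is neither Dask's nor GeoUtils' implementation.

```python
class Deferred:
    """Toy lazy value: stores a function and runs it only on compute()."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def map(self, func):
        # Build a new deferred step on top of this one; nothing runs yet
        return Deferred(func, self)

    def compute(self):
        # Resolve upstream steps first, then apply this step
        resolved = [a.compute() if isinstance(a, Deferred) else a for a in self.args]
        return self.func(*resolved)


calls = []


def load():
    """Stand-in for an expensive read; records that it was executed."""
    calls.append("load")
    return [1.0, 2.0, 3.0, 4.0]


graph = Deferred(load).map(lambda values: [v * 2 for v in values]).map(sum)
print("Executed before compute():", calls)
result = graph.compute()
print("Executed after compute():", calls, "result:", result)
```

Building `graph` only chains the steps; the read and the arithmetic happen once, when the result is requested.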


		This enables out-of-core execution, allowing datasets larger than available RAM to be processed safely.
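
The out-of-core principle can be sketched in a few lines: iterate over windows of the grid, read one window at a time, and accumulate a running result. The functions below are hypothetical stand-ins (`read_window` generates synthetic values instead of reading from disk) and do not reflect how GeoUtils implements either backend.

```python
def iter_windows(height, width, chunk):
    """Yield (row, col, n_rows, n_cols) windows tiling a height x width grid."""
    for row in range(0, height, chunk):
        for col in range(0, width, chunk):
            yield row, col, min(chunk, height - row), min(chunk, width - col)


def read_window(row, col, n_rows, n_cols):
    """Stand-in for reading one window from disk (synthetic values here)."""
    return [
        [float(r * 10 + c) for c in range(col, col + n_cols)]
        for r in range(row, row + n_rows)
    ]


def chunked_mean(height, width, chunk):
    """Mean over the whole grid, holding only one chunk in memory at a time."""
    total, count = 0.0, 0
    for window in iter_windows(height, width, chunk):
        block = read_window(*window)
        total += sum(sum(line) for line in block)
        count += sum(len(line) for line in block)
    return total / count


print(chunked_mean(height=5, width=7, chunk=3))
```

The result does not depend on the chunk size; only the peak memory does, which is why edge windows are allowed to be smaller than `chunk`.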

		In GeoUtils, chunked execution is implemented through two backends:


		Several platforms provide geospatial datasets, cloud environments and learning resources:

		- [STAC](https://stacspec.org) — SpatioTemporal Asset Catalog standard for geospatial data discovery


		3. A {class}`~geoutils.Raster` can have ambiguous casting behaviour for subclasses (e.g., a {class}`~xdem.DEM`) making maintenance difficult, while accessors' data-structure-centered mechanism enables clearer interfacing.

		4. A {class}`~geoutils.Raster` currently only supports Multiprocessing as a scalable backend, which is not lazy, while the {class}`rst <geoutils.RasterAccessor>` accessor supports Dask, allowing lazy graph building.
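
To illustrate the accessor idea from point 3, the sketch below attaches a namespace of methods to an existing container class instead of subclassing it. All names (`Accessor`, `Grid`, `GeoNamespace`, `geo`) are hypothetical; this is a toy descriptor, not Xarray's registration mechanism or the `rst` accessor itself.

```python
class Accessor:
    """Toy descriptor that exposes a namespace object on instances of a class."""

    def __init__(self, namespace_cls):
        self.namespace_cls = namespace_cls

    def __get__(self, obj, owner=None):
        if obj is None:
            # Accessed on the class itself: return the namespace class
            return self.namespace_cls
        # Accessed on an instance: wrap that instance
        return self.namespace_cls(obj)


class Grid:
    """Stand-in for a generic array container (hypothetical)."""

    def __init__(self, values, resolution):
        self.values = values
        self.resolution = resolution


class GeoNamespace:
    """Geospatial methods live here and only ever see the wrapped object."""

    def __init__(self, grid):
        self._grid = grid

    def width(self):
        # Number of columns times the pixel resolution
        return len(self._grid.values[0]) * self._grid.resolution


# Attach the namespace without touching or subclassing Grid
Grid.geo = Accessor(GeoNamespace)

grid = Grid([[1, 2, 3], [4, 5, 6]], resolution=30.0)
print(grid.geo.width())
```

Because the container type never changes, there is no question of which subclass an operation should return, which is the casting ambiguity that point 3 refers to.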


		For more guidance on chunk sizing and performance, see the [Dask array best practices](https://docs.dask.org/en/stable/array-best-practices.html).

		Finally, note that currently, operations returning point or vector outputs are often eager and scalable execution applies mostly to the raster input/output.
