Add documentation for Dask/Multiprocessing support and Xarray accessor#878
Add documentation for Dask/Multiprocessing support and Xarray accessor#878rhugonnet wants to merge 3 commits into
Conversation
|
@belletva @adehecq @atedstone @adebardo @marinebcht @erikmannerfelt @ould-a This new documentation draft is ready for your review! 😄 The link is here: https://geoutils-rhugonnet.readthedocs.io/en/add_accessor_daskmp_doc/feature_overview.html I suggest you start by reading the new "Feature and scalability overview" in the "Getting started" section, then move on to the "Scalability" section which essentially contains all the novelty. I also added this new "Cheasheet: From GDAL" page to help users make the link. We could also add other Python packages there (i.e. table that @remi-braun started)? Very happy to see this in a near-finalized stage after so many years of us working on it! 😊 |
|
Hi @rhugonnet, I have some comments, but honestly it's not much:
And again: such a wonderful job |
|
I forgot a point :
|
| All pages of this documentation containing code cells can be **run interactively online without the need of setting up your own environment**. Simply click the top launch button! | ||
| (MyBinder can be a bit capricious: you might have to be patient, or restart it after the build is done the first time 😅) | ||
|
|
||
| Alternatively, start your own notebook to test GeoUtils at [](https://mybinder.org/v2/gh/GlacioHack/geoutils/main). |
There was a problem hiding this comment.
Is the Binder not working anymore?
| - | ||
|
|
||
| * - {meth}`~geoutils.Raster.reproject()` | ||
| - Reproject to other CRS. Default tolerance parameters ensure chunk-invariance. |
There was a problem hiding this comment.
to other CRS, but also to another grid (resolution, bounds...)
|
|
||
| We first describe GeoUtils' core **data operations**, which operate on underlying arrays or geometries and can therefore benefit from **scalable execution**. | ||
|
|
||
| **Legend:** **“/”** indicates methods **shared across object types**, while **“⟷”** indicates methods **interfacing between two object types**. |
There was a problem hiding this comment.
The work you've done is exceptional as always.
I'll try to find some time to go through it all.
But just in case I forget, I'd like to suggest a nice way to display the captions.
.. admonition:: Legend
- ``/`` — Methods **shared across object types**
- ``↔`` — Methods **interfacing between two object types**
| - Rasterio / PyProj | ||
|
|
||
| * - {meth}`~geoutils.Raster.crop()` | ||
| - Crop to bounds, either intersecting (untouched) allowing efficient I/O, or clipped (data modified). |
There was a problem hiding this comment.
I don't fully understand this part. I guess because the meaning is maybe different for raster vs vector? For example, I think there is never any resampling for raster, right? So "data modified" only applies to geometries, in case of clipping?
| - ✅ | ||
| - Rasterio / GeoPandas | ||
|
|
||
| * - {meth}`~geoutils.Vector.rasterize()` |
There was a problem hiding this comment.
I find it strange that create_mask and rasterize fall under different categories, since they have basically the same underlying logic.
| ## Metadata properties and operations | ||
|
|
||
| In addition to data operations, GeoUtils exposes **metadata** properties and methods consistently across geospatial objects. | ||
| These operate only on metadata and therefore **do not load or modify underlying data arrays**. |
There was a problem hiding this comment.
The term "operate" is maybe not the most appropriate, as no operation is run, they are just properties. Maybe "Reading/accessing those properties does not require loading the whole data array. If the property is modified by one of the above operation (e.g. reproject), the actual operation is run only when needed (delayed operation)"
Not fully sure about the last sentence though.
| - | ||
|
|
||
| * - {attr}`~geoutils.Raster.data` | ||
| - Data array (2D grid for raster, 1D for point cloud). |
| and `rast`'s metadata is sufficient to provide a georeferenced grid for {func}`~geoutils.Vector.proximity`. The array will only be loaded when necessary. | ||
| ``` | ||
|
|
||
| ## Quick plotting |
There was a problem hiding this comment.
These features and below are all good to be aware of. Has the description be moved elsewhere?
There was a problem hiding this comment.
This is a more consistent way to describe the different features that are share among the different objects, so well done. I'm only a bit worried that some of the features that were introduced here such as arithmetic operations, plotting, saving etc, are lost, but maybe it's been moved somewhere else.
| For a raster output, one typically wants to write to file lazily to avoid loading it in-memory: | ||
|
|
||
| ```{code-cell} python | ||
| # ds_reproj.rst.to_file("reproj_rast.tif", compute=True) |
| - If **memory** is the limitating factor for you, use a **single-threaded scheduler** through Dask (```dask.config.set(scheduler='single-threaded')```) or Multiprocessing (default cluster), | ||
| - If **speed** is the limiting factor for you, use **parallelized processes** through Dask (see [Dask scheduler configuration](https://docs.dask.org/en/stable/scheduler-overview.html#scheduler-overview)) or Multiprocessing (see our Cluster configuration), | ||
| - Choose chunk sizes large enough to reduce scheduling overhead, but **small enough to fit comfortably in memory**, | ||
| - Check that your data files have **on-disk chunksizes** (otherwise loads everything) and use a multiple of it for optimal **in-memory chunking**, |
There was a problem hiding this comment.
That's often the crux I find when working with Dask. Does it make sense to explain somewhere which file format allow chunks and how to create them (externally, e.g. zarr? but also with Geoutils, e.g. option "tiled" by default with Raster.to_file etc)?
|
|
||
| # We open the dataset without Dask backend (same behaviour with Raster class) | ||
| filename_rast = gu.examples.get_path("exploradores_aster_dem") | ||
| ds = gu.open_raster(filename_rast) |
There was a problem hiding this comment.
Just a remark, I find it a bit confusing now that we have 2 options to read the rasterm either with gu.Raster or gu.open_raster. If I understand correctly, the latter is needed to instantiate a dask array instead of a Raster? Is there a way to homogenize both?
I'm quite attached to the gu.Raster way now, but I guess it would be strange to have an option "chunks" that would return a dask array rather than a Raster... In that case, should open_raster be the preferred option (and the other approach being less documented or deprecated). I guess this approach is more consistent with xarray and geopandas framework.
| rast = gu.Raster(filename_rast) | ||
|
|
||
| # Create Multiprocessing config, output filepath optional (temporary file by default) | ||
| mp_config = gu.multiproc.MultiprocConfig(chunk_size=200) |
There was a problem hiding this comment.
Just a thought. Could we not bypass this step, by adding an argument "chunk_size" (and all other needed) to gu.Raster (or gu.load_raster) to specify that operations should be made in chunk, with backend="multiproc"?
|
|
||
| Lazy execution refers to **deferring computation until results are explicitly requested**. | ||
|
|
||
| In GeoUtils, lazy execution is available through the Xarray {class}`rst <geoutils.RasterAccessor>` accessor with **Dask-backed arrays**. |
| We can materialize it with `compute()`: | ||
|
|
||
| ```{code-cell} | ||
| ds_interp.compute() |
There was a problem hiding this comment.
Just a question since I'm still not very familiar with Dask arrays. It looks like the compute method returns an array. Is it needed to capture that output, or is the data stored somewhere in the ds_interp object?
|
|
||
| This enables **out-of-core execution**, allowing datasets larger than available RAM to be processed safely. | ||
|
|
||
| In GeoUtils, chunked execution is implemented through two backends: |
There was a problem hiding this comment.
This and the examples below are mostly redundant with the "Good practice" page. Some repetition is not bad, but I wonder if we could reduce the redundancy of information a bit? Maybe the concepts should go first and only provide theoretical backgrounds with little/no example, and all examples should be in the "usage and good practice" page?
| * - {meth}`~geoutils.Raster.reproject` | ||
| - {bdg-success}`Chunked` | ||
| - {bdg-success}`Chunked` | ||
| - ~4 (default), or ~downsampling² |
There was a problem hiding this comment.
Why the exponent 2 here, a typo? I first thought it was a footnote but could not find it 😆
| * - Method | ||
| - Input | ||
| - Output | ||
| - Memory usage (# chunks) |
There was a problem hiding this comment.
I don't fully understand what is the meaning of the memory usage column...
| - | ||
| - | ||
|
|
||
| * - {meth}`~geoutils.Raster.reproject` |
There was a problem hiding this comment.
The table is great to know which steps need to be implemented in the future 😄
|
|
||
| Several operations are easy to support for **chunked execution** as they directly re-use existing **Dask** methods: | ||
| - The {meth}`~geoutils.Raster.filter` function uses {func}`~dask.array.map_overlap` with a `depth` (overlap) half the `size` of the filter, | ||
| - The {meth}`~geoutils.Raster.proximity` function uses {func}`~dask.array.map_overlap` with a `max_distance` parameter. |
There was a problem hiding this comment.
Strange, because in the table on the previous page, proximity is marked as "in-memory" rather than "chunk" implemented. is the table wrong?
| - The {meth}`~geoutils.Raster.filter` function uses {func}`~dask.array.map_overlap` with a `depth` (overlap) half the `size` of the filter, | ||
| - The {meth}`~geoutils.Raster.proximity` function uses {func}`~dask.array.map_overlap` with a `max_distance` parameter. | ||
|
|
||
| Other operations are more complex and require specific logic {ref}`specific logic described further below<specific-logic>` and summarized as: |
There was a problem hiding this comment.
Wow, you made a real effort to even make the figure in pure Python and reproducible, well done!! 🙌
| The diagram below illustrates this mapping procedure, with a new CRS and a downsampling of 2: | ||
|
|
||
| ```{eval-rst} | ||
| .. plot:: code/diagram_chunked_reproject.py |
There was a problem hiding this comment.
Very nice diagram! Just a minor thought. With "resolution x2", do you mean that it is coarsened (e.g. from 5m to 10m)? If so, why are the chunks 2x smaller, should it not be the opposite?
If x2 means higher resolution, then should the pixel grid not be denser on the right panel?
In any case, there is something that i find confusing between the pixel and chunk grid, as I would expected approximately as many pixels in each chunk... Currently 16 on the left, 4 on the right.
There was a problem hiding this comment.
Ok, it's all clear when reading the text below, sorry! I guess what matters is the number of dark gray pixels in the right panel.
There was a problem hiding this comment.
yes it is always difficult to explain the resolution!
| :width: 100% | ||
| ``` | ||
|
|
||
| All strategies begin by processing individual raster chunks independently, then reconstruct continuous polygons that span chunk boundaries. |
There was a problem hiding this comment.
Super interesting! I just did not understand why we have the 3 strategies and when they are used.
I'm also wondering, what happens if polygons extend over more than 2 chunks? Can it just "grow" iteratively? Or do you need to load all the chunks that overlap with the geometry?
| - New GDAL CLI | ||
| - xDEM | ||
|
|
||
| * - <span class="gu-table-section">Terrain attributes</span> |
There was a problem hiding this comment.
maybe we can add the terrain attributes that we plan to add ?
There was a problem hiding this comment.
I really like this page 😍
There was a problem hiding this comment.
Same here! It's nice to see that all packages convert towards a similar toponymy!
|
|
||
| Several platforms provide geospatial datasets, cloud environments and learning resources: | ||
|
|
||
| - **[STAC](https://stacspec.org)** — SpatioTemporal Asset Catalog standard for geospatial data discovery |
There was a problem hiding this comment.
I propose to add the following pages :
- Geodes-Tools - CNES portal to share tools and ressources for satellite image processing
- Geodes - CNES portal for distributing satellite products (Sentinel 1 and 2, Spot, Venus, etc.)
| The diagram below illustrates this mapping procedure, with a new CRS and a downsampling of 2: | ||
|
|
||
| ```{eval-rst} | ||
| .. plot:: code/diagram_chunked_reproject.py |
There was a problem hiding this comment.
really nice diagram but quite difficult to see that the destination chunk requires 5 source chunks ? the bold blue is quite difficult to see
|
Thanks!
|
And again, congratulation for your amazing work !! |
| * - Raster calculator | ||
| - `gdal_calc.py` | ||
| - `gdal raster calc` | ||
| - NumPy array interface on {attr}`~geoutils.Raster.data`) |
| * - Edit referencing | ||
| - `gdal_edit` / `gdalmove.py` | ||
| - `gdal raster edit` | ||
| - {meth}`~geoutils.Raster.set_crs`, {meth}`~geoutils.Raster.set_transform`, {meth}`~geoutils.Raster.set_nodata`, {meth}`~geoutils.Raster.translate` |
There was a problem hiding this comment.
this makes me think that we could also have a unified function "edit" that does all of that, e.g.,
rst.edit(crs="EPSG:4326", nodata=0)
rst.edit(transform=(...))
etc.
There was a problem hiding this comment.
This could be very interesting, as working with xarray has some drawback: your array can lose all attributes if you do sth wrong. Some other times, you have no choice and must set it again by hand, so having a unified fct that does all that can solve a true usecase 👍
| * - Fill nodata gaps | ||
| - `gdal_fillnodata.py` | ||
| - `gdal raster fill-nodata` | ||
| - Not implemented (planned) |
There was a problem hiding this comment.
this will be straightforward if we transfer the functionality from xDEM :-)
| # Ecosystem | ||
|
|
||
| GeoUtils integrates naturally with the broader **geospatial ecosystem**. | ||
| It extends commonly used tools and works alongside many other for geospatial data access, processing, and analysis. |
| **[xDEM](https://github.com/GlacioHack/xdem)** is the sister package of GeoUtils, focused on the **analysis of digital elevation models (DEMs) and elevation point clouds**, including terrain attributes, coregistration and uncertainty propagation. | ||
| ``` | ||
|
|
||
| ## Related Python libraries |
There was a problem hiding this comment.
I would have liked to see a quick list/description of the libraries on which we rely: rasterio, geopandas, proj etc.
I will share with you a scheme I made to illustrate the Python geospatial ecosystem and where geoutils and xDEM fit.
| - **[EOReader](https://github.com/sertit/eoreader)** — Unified access to satellite imagery products | ||
| - **[stackstac](https://github.com/gjoseph92/stackstac)** — Load STAC datasets as large raster stacks | ||
|
|
||
| Some libraries focus on raster operations specifically: |
There was a problem hiding this comment.
should we also include the orfeo toolbox here too?
https://www.orfeo-toolbox.org/
| # The georeferenced raster | ||
|
|
||
| Below, a summary of the {class}`~geoutils.Raster` object and its methods. | ||
| In GeoUtils, the georeferenced raster object is mirrored through two objects: |
There was a problem hiding this comment.
I don't really understand why you use the term "mirrored" here. It could work if you say "the Raster class is mirrored through the rst accessor", but here it does not sound right. Maybe "raster objects are handled/instantiated through two object types:"?
| In GeoUtils, the georeferenced raster object is mirrored through two objects: | ||
|
|
||
| - The Xarray {class}`rst <geoutils.RasterAccessor>` accessor for a {class}`xarray.DataArray`, | ||
| - The {class}`~geoutils.Raster`. |
There was a problem hiding this comment.
when printed in the doc, the term "class" is not visible, so I would write "The {class}~geoutils.Raster class"
|
|
||
| 3. A {class}`~geoutils.Raster` can have **ambiguous casting behaviour** for subclasses (e.g., a {class}`~xdem.DEM`) making maintenance difficult, while accessors' **data-structure-centered mechanism enables clearer interfacing**. | ||
|
|
||
| 4. A {class}`~geoutils.Raster` currently only supports **Multiprocessing** as scalable backend which is not lazy, while the {class}`rst <geoutils.RasterAccessor>` accessor support **Dask** allowing lazy graph building. |
| A **raster** has **four main attributes**: | ||
|
|
||
| 1. a {class}`numpy.ma.MaskedArray` as {attr}`~geoutils.Raster.data`, of either {class}`~numpy.integer` or {class}`~numpy.floating` {class}`~numpy.dtype`, | ||
| 1. a {class}`numpy.ma.MaskedArray` ({class}`~geoutils.Raster`) or {class}`np.ndarray` ({class}`rst <geoutils.RasterAccessor>` accessor) as {attr}`~geoutils.Raster.data`, of either {class}`~numpy.integer` or {class}`~numpy.floating` {class}`~numpy.dtype` (forced to {class}`~numpy.floating` for {class}`rst <geoutils.RasterAccessor>` accessor), |
There was a problem hiding this comment.
This issue was not introduced in this PR, but I think it is easier to read if one first name the attribute (data, transform, crs, nodata) before describing them. Currently it's the other way around so tha attribute name is at the end of the sentence... Also the description is very much focused on type, which is not always very user-friendly (e.g. not many people know/care about affine.Affine objects. It should first describe what is the meaning of each.
| multi-band raster! | ||
| ``` | ||
|
|
||
| Finally, the remaining attributes are only relevant when instantiating from a **on-disk** file: {attr}`~geoutils.Raster.name`, {attr}`~geoutils.Raster.driver`, |
|
I just finished my 2nd round of reviews. Very nice documentation ! 😍 Thanks for all the effort you made to explain the concepts and write a clear documentation!! 🙌 Replying to your main questions below:
If not too much effort, yes I would encourage going in that direction! It is so much easier when different tools use the same naming convention and I like that we "accidentally" end up with most of the same terms 😆 I actually suggested having an
Since I am still a big user of the Geoutils classes, I would keep those examples. Also a lot of our base users are used to this system now, so I think it is too early to completely change the structure. Something to re-evaluate later?
Maybe we could show the same example with the 2 approaches? Using tabs, like in xDEM CLI documentation could be a nice way to avoid long pages? |
| GeoUtils integrates naturally with the broader **geospatial ecosystem**. | ||
| It extends commonly used tools and works alongside many other for geospatial data access, processing, and analysis. | ||
|
|
||
| See the {ref}`accessors` page for details on GeoUtils' accessors. |
| See the {ref}`accessors` page for details on GeoUtils' accessors. | ||
|
|
||
| ```{seealso} | ||
| **[xDEM](https://github.com/GlacioHack/xdem)** is the sister package of GeoUtils, focused on the **analysis of digital elevation models (DEMs) and elevation point clouds**, including terrain attributes, coregistration and uncertainty propagation. |
| # Feature and scalability overview | ||
|
|
||
| # Feature overview | ||
| GeoUtils provides a unified API for manipulating **raster**, **vector**, and **point-cloud** data, and provides **scalable CPU execution** for most raster operations through Dask and Multiprocessing. |
There was a problem hiding this comment.
raster, vector and point-cloud data
| To facilitate the analysis process, GeoUtils includes quick plotting tools that support multiple colorbars and implicitly add layers to the current axis. | ||
| Those are build on top of {func}`rasterio.plot.show` and {func}`geopandas.GeoDataFrame.plot`, and relay any argument passed. | ||
| The **{ref}`summary tables<tables-overview>` directly below** lists the core features of GeoUtils, their scalability and available backends. | ||
| Further below, a series of **{ref}`illustrated examples<examples-overview>`** demonstrate these features. |
There was a problem hiding this comment.
the {ref} doest not work
| - Scalable | ||
| - Backend | ||
|
|
||
| * - <span class="gu-table-section">Raster / Vector / Point</span> |
| ## Using Dask through accessors | ||
|
|
||
| With **Dask**, raster operations are both **chunked** and **lazy**. | ||
| This behavior is enabled by opening a raster with the `chunks` argument, which returns an Xarray object backed by Dask arrays. |
There was a problem hiding this comment.
I don't know if its obvious or not that chunks is in pixels
| ```{code-cell} python | ||
| # ds_reproj.rst.to_file("reproj_rast.tif", compute=True) | ||
| ``` | ||
|
|
There was a problem hiding this comment.
The parameter compute=True ... triggers
| ``` | ||
|
|
||
| If the output is a {class}`~geoutils.Raster`, it is written to disk out-of-memory, and the returned object is a {class}`~geoutils.Raster` of that file without data loaded. | ||
| This keeps syntax consistent with in-memory code, and allow to easily chain operations. |
| ## Good practices with chunked and lazy operations | ||
|
|
||
| - If **memory** is the limitating factor for you, use a **single-threaded scheduler** through Dask (```dask.config.set(scheduler='single-threaded')```) or Multiprocessing (default cluster), | ||
| - If **speed** is the limiting factor for you, use **parallelized processes** through Dask (see [Dask scheduler configuration](https://docs.dask.org/en/stable/scheduler-overview.html#scheduler-overview)) or Multiprocessing (see our Cluster configuration), |
There was a problem hiding this comment.
"Cluster generation and configuration" to fit the page content
|
|
||
| For more guidance on chunk sizing and performance, see the [Dask array best practices](https://docs.dask.org/en/stable/array-best-practices.html). | ||
|
|
||
| Finally, note that currently, operations returning **point** or **vector** outputs are often **eager** and scalable execution applies mostly to the **raster input/output**. |
There was a problem hiding this comment.
I don't quite understand how one chooses between "point", "point cloud" without s or "PointCloud"
| ds_cropped = ds.rst.icrop((0, 0, 100, 100)) | ||
|
|
||
| # Neither input nor output dataset are loaded yet | ||
| print(f"Input loaded by deferred I/O? {ds.rst.is_loaded}") |
| ds.rst.get_stats() | ||
|
|
||
| # The dataset is now loaded | ||
| print(f"Loaded after data operation? {ds.rst.is_loaded}") |
| :align: center | ||
| :class: tight-table | ||
|
|
||
| * - Method |
There was a problem hiding this comment.
https://geoutils-rhugonnet.readthedocs.io/en/add_accessor_daskmp_doc/feature_overview.html in this order maybe ? "Raster / Vector / Point" with plot, "Raster / Point", etc ... and after the -> list with only A -> B, A and B different
| - {bdg-secondary}`In-memory` | ||
| - — | ||
|
|
||
| * - <span class="gu-table-section">Raster ⟶ Point</span> |
There was a problem hiding this comment.
Same comment, why "point" and not "PointCloud" ?
| (scalability-logic)= | ||
| # Implementation strategies | ||
|
|
||
| Implementing **chunked execution** requires developping substantial internal logic often invisible to the user, making it difficult to understand what is happening in the background and how to potentially address a scalability issue. |
|
Thanks for the feedback! 😉 |
|
Reminder: Last week for feedback, then I'll consolidate and merge! |
|
This is fantastic - thank you for all the hard work on bridging xDEM with Xarray! 🚀 The documentation is very helpful and really nice to see supported operations in the Table summary 🤩 After browsing the documentation I have a few comments:
|
|
I did not find how do you set a mask in a Xarray (ds.rst.set_mask do not work) and how to you cast an Raster to an Xarray ? |
|
Thanks a lot @friedrichknuth! 😉 On @marinebcht's question:
No masked arrays through Xarray, so the arrays are forced to floating type to support NaNs instead. So an equivalent would be simply: For conversion: |
|
Hello, I don't want to put pressure or whatever, but @rhugonnet do you have a release date in mind? 😇 |
Yes, sorry, I was busy with deadlines on research projects the past weeks 😅 |


This PR finally adds the documentation for the Xarray accessor
rst, as well as the Dask and Multiprocessing support through chunked implementations that we have been steadily developing for the last 2-3 years! 🥳Thanks in particular to @vschaffn @ameliefroessl for their big contributions at various stages!
Link to the new doc (landing on the Scalability page with schematics that was a bit of work!): https://geoutils-rhugonnet.readthedocs.io/en/add_accessor_daskmp_doc/scalability_logic.html
Details
The documentation changes are the following:
Raster.method() or ds.rst.method()for each method linked through a singleRasterBasecall withRasterBaseitself not being visible (for users using the search button). Also added fullRaster+RasterAccessorfull autoclass summary for those also curious to search through the class details. But those two pages are "hidden" in the table-of-content structure to avoid duplicating the main API page (users can only land there through search or clicking the class object name); and the page starts with a link pointing back to the main API, if that's not where they wanted to land.I haven't yet updated the "Quick start", but I think we really need a better example there, and we should use the Xarray/Pandas accessor directly.
Same for the "Fundamentals" section, there's some editing to do there so that's it's not too "GeoUtils object-focused", but balanced whether it's about accessors or GeoUtils objects.
Finally, we have to see for the "Examples" (in feature pages, and in the Sphinx gallery), do we switch all to Xarray/Pandas accessor?
In particular, as I was already doing diagrams in Python code for something else, I kept my inertia and decided to add some to the PR to describe our implementations visually! 😄
We might think of doing something similar to describe functions themselves in the future (interpolation, etc).
Finally, I realized that it was annoying to explain the mirror for
Raster/rstbut not for vector/point cloud at the same time, even though the accessors for those 2 are much easier to add.So I might add them when I get the chance in the next weeks while this PR is being reviewed 😉
Resolves #677
Resolves #673