Skip to content

[WIP] Apply matrix: add dask and multiprocess version#953

Draft
marinebcht wants to merge 22 commits into
GlacioHack:mainfrom
marinebcht:apply_matrix_from_geoutils_and_romain
Draft

[WIP] Apply matrix: add dask and multiprocess version#953
marinebcht wants to merge 22 commits into
GlacioHack:mainfrom
marinebcht:apply_matrix_from_geoutils_and_romain

Conversation

@marinebcht
Copy link
Copy Markdown
Contributor

@marinebcht marinebcht commented Apr 29, 2026

/!\ Don't waste your time rereading. Need to tidy all this up :)

This PR introduces the multiprocessing and xarray/dask uses for raster apply_matrix function, improving memory efficiency and processing speed.

Development

The development of this two backends, I was inspired by the development of the reproject function in geoutils : multiprocessing here and dask here.

Both backends rely on the same logic:

  1. Redirection into one or the other process:
  • if multiprcessing config given => mp backend
  • if dask is installed and the input elevation is a chunked RasterAccessor => dask backend

Note that it can't be both simultaneously (ValueError)

  1. Georeferenced tiling to identify source/destination blocks (chunk size) and the link between the two. To know what blocks gives/is needed to compute what blocks, the idea is to:

For each source blocks:
-> get bounds projected
-> apply_matrix of the four corners => polygon (*)
Later, verify intersection with each destination block bounds projected polygon.

(*) As we need to now elevation to compute the real output point after apply_matrix, the tricks is to:
-> find z_min and z_max on each source blocs (_wrapper_multiproc_zmin_zmax_per_blocks for multiprocess case + _delayed_zmin_zmax dask case)
-> compute apply_matrix(corners_x, corners_y, z_min)
-> compute apply_matrix(corners_x, corners_y, z_max)
-> destination bounds = intersection of the two polygons to obtain the maximum aera

Some others things are done for this step/in the function _build_geotiling_and_meta_apply_matrix , but this is the tricky part.

  1. Launch Tasks/Building the task graph

Multi process : for each destination block, a task is submitted to the cluster via mp_config.cluster.launch_task(), calling _wrapper_multiproc_apply_matrix_per_block(). This function:

  • Extracts the source blocks that intersect this destination block
  • Merges them into a single rectangular array (comb_src_arr)
  • Calls apply_matrix() on that sub-array ..... TO EXPLAIN

Dask : For each destination block, a task (node) is created for _delayed_apply_matrix_per_block() calling _apply_matrix_per_block() that do the same as _wrapper_multiproc_apply_matrix_per_block().

  1. Write/Construct results

Multi process : run _write_multiproc_result() with the results of each task => file that will be read => Raster

Dask : Constructs the concatenation of the results => array that will be use in gu.raster.xr_accessor.RasterAccessor.from_array() => RasterAccessor

Core functions

Tests

Apply Matrix

Parameters

See: docstring

  • matrix: matrix_identity, matrix_vertical, matrix_translations, matrix_rotations, matrix_all
  • invert: True/False
  • centroid: fixed from mean(values(x, y))
  • resample: True/False
  • resampling: nearest, linear, cubic, quintic
  • force_regrid_method: None, iterative, griddata

Input data:

  • Synthetic raster 10*12 (float) without nodata
  • Synthetic raster 10*12 (float) with several nodata (pixel, ligne) => ok si resample != cubic, quintic
  • Real data

As inputs size 10x12 (not square to have not alwars the same blocks size), chunk_size varies with:

  • 5 to have 6 blocks: 4 * (5,5) and 2* (5*2)
  • 8 to have 4 blocks: (8, 8), (8, 4), (4, 8) and (4, 4)
  • 12 to have 1 block : (10, 12)

Tests

  • apply matrix simple this version vs old version (instance, nodata, crs, transfom, mask and values)
  • apply matrix simple raster vs apply matrix simple xarray (instance, nodata, crs, transfom, mask and values)
  • apply matrix multi vs simple (instance, nodata, crs, transfom, mask and values, output file)
  • apply matrix dask vs simple (_in_memory before/after compute(), instance, nodata, crs, transfom, mask and values)

Compute z min and z max

Input data:

  • Synthetic raster 10*12 (float) without nodata
  • Synthetic raster 10*12 (float) with several nodata (pixel, ligne) => ok si resample != cubic, quintic
  • Real data

Differents Chunks Size :

Tests:

  • Multiprocess : test z_min/z_max on each input block
  • Dask : test z_min/z_max on each input block + test _in_memory before/after compute()

Old problem corrected

Remaining problems

Question dev here

  • Buffer value tu study
  • reproject_horizontal_shift_samecrs need for direct apply_matrix calls BUT not for coreg.apply()
  • Iinfluence of the new inperpolation via Raster.interp_point

Dev (easy)

  • Compute zz_min, zz_max for each scr block => need to do this outside the if
  • Optimise when matrix = matrix_vertical, matrix_translations)

@marinebcht
Copy link
Copy Markdown
Contributor Author

marinebcht commented May 12, 2026

Problem with inter_point

After replacing RegularGridInterpolator by Raster.interp_points() :
tz = dem_rst.interp_points(points=(tx, ty), method=resampling)["z"]

I saw that some of my tests don't pass anymore
After some checks, I saw that the results of the tz can be really different than the tz = z_interp((ty, tx)), even >~ 1.5 !
Here an example: pytest tests/test_coreg/test_base.py::TestAffineManipulation::test_apply_matrix_dask_multi[None-True-None-True-0-matrix4-real]

I have another error : "ValueError: cannot convert float NaN to integer" when tx ou ty contain nan values. Don't know how to change inputs properly without making more tz errors
Exemple : pytest tests/test_coreg/test_base.py::TestAffineManipulation::test_apply_matrix_dask_multi[None-False-quintic-True-0-matrix4-fake]

Opti apply_matrix_rst()

I started to move (*) the reproject_horizontal_shift_samecrs in _apply_matrix_rst(), but as this one is called in apply process here, I introduced a wrapper to do this here

(*) Need to optimize this part when all the problem will be resolved (also in apply_matrix())

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Apply rotation matrix on a DEM with nan values and resample = cubic or quintic

2 participants