# Refactor hf-sim to use xarray and Dask for larger-than-memory station processing (#102)
## Conversation
Commit:

- Add HostType(str, Enum) with 'local' and 'slurm' values
- Add dask_cluster context manager for LocalCluster/SLURMCluster
- Add load_hf_dataset: loads stations into chunked xarray Dataset
- Add process_hf_dataset: map_blocks-compatible station processor
- Refactor run_hf CLI to use --host option and Dask workflow
- Replace ThreadPoolExecutor with xr.map_blocks + dask.distributed
- Add dask[distributed] and dask-jobqueue dependencies
- Add tests for HostType, load_hf_dataset, process_hf_dataset

Agent-Logs-Url: https://github.com/ucgmsim/workflow/sessions/66b1316d-545e-44d8-96d2-317087688eed
Co-authored-by: lispandfound <12835929+lispandfound@users.noreply.github.com>
**lispandfound** left a comment:
I've changed my mind here. This is run in a container, so a SLURMCluster is a bad idea because it requires too many hacks. However, I do believe this template is still useful, because Dask allows us to process larger-than-memory datasets, which multiprocessing won't do. Take out the SLURM stuff and refactor this to focus on processing larger-than-memory outputs. In particular, you should not eagerly compute the whole array at any point (no `.values`, no `.result`, no `.compute`); instead let the graph flow all the way to `to_netcdf` before saving, so Dask can handle the memory pressure. These output files can be up to 100 GB.
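A minimal sketch of that lazy shape, with made-up sizes (this is not the hf-sim code; `h5netcdf` is the engine the PR settles on):

```python
import dask.array as da
import xarray as xr

# Lazy, chunked dataset: nothing is allocated or computed at this point.
ds = xr.Dataset(
    {"waveform": (("station", "time"),
                  da.zeros((10_000, 1_024), chunks=(200, -1)))}
)

# Anti-pattern: ds["waveform"].values or ds.compute() would materialise
# the entire array in RAM before a single byte hits disk.

# Pattern: hand the lazy graph straight to the writer. Dask evaluates and
# writes one chunk at a time, so peak memory stays near one chunk's size.
ds.to_netcdf("hf.nc", engine="h5netcdf")
```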
Commit (… output):

- Remove HostType enum, dask_cluster context manager, --host CLI option
- Remove dask-jobqueue dependency
- Use dask.array.empty for template to avoid allocating memory
- Remove all eager .compute()/.result()/.values on large data
- Let Dask graph flow lazily to to_netcdf for chunk-by-chunk writing
- Remove test_host_type_values test

Agent-Logs-Url: https://github.com/ucgmsim/workflow/sessions/dee90e3b-9cd0-4ac3-afcc-58a8a3d10c06
Co-authored-by: lispandfound <12835929+lispandfound@users.noreply.github.com>
**Copilot:** Done in 9953992. Removed all SLURM/HostType/dask-jobqueue code and the …
```python
assert "seed" in ds.data_vars
assert "vref" in ds.data_vars

np.testing.assert_allclose(ds["latitude"].values, [-43.5, -43.6, -43.7])
```
**lispandfound:** You should use xarray testing facilities to simplify this test.
**Copilot:** Refactored in 9b69839. `test_load_hf_dataset` now builds an expected `xr.Dataset` and uses `xr.testing.assert_allclose` for the data variable comparison; `test_process_hf_dataset_structure` uses `xr.testing.assert_equal` for coordinate comparisons instead of `np.testing.assert_array_equal`.
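For reference, a tiny sketch of the adopted style (made-up values, not the actual test data):

```python
import xarray as xr

expected = xr.Dataset(coords={"latitude": ("station", [-43.5, -43.6, -43.7])})
actual = expected.copy(deep=True)

# One call checks dims, coordinates, names, and values, and prints a
# readable dataset diff on failure.
xr.testing.assert_allclose(actual, expected)
```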
Commit:

Replace np.testing.assert_allclose/assert_array_equal with xr.testing.assert_allclose/assert_equal for dataset and coordinate comparisons in test_load_hf_dataset and test_process_hf_dataset_structure.

Agent-Logs-Url: https://github.com/ucgmsim/workflow/sessions/eb2dbecf-d37e-45e1-bc15-020c0d78111f
Co-authored-by: lispandfound <12835929+lispandfound@users.noreply.github.com>
## Summary
Refactors the `hf-sim` script to replace `ThreadPoolExecutor` with `xarray.map_blocks` + `dask.distributed` for parallelization, enabling scaling to 100k+ stations and supporting larger-than-memory output files (up to 100 GB+). The entire Dask graph flows lazily to `to_netcdf`, so the full waveform array never needs to reside in memory.

## Changes
### New Functions

- `load_hf_dataset(...)` — Reads the station CSV, computes seeds and vref, and returns a chunked `xr.Dataset` indexed by `station`. Chunk size is calculated as `max(1, total_stations // 500)` to keep the task graph between ~500 and 1000 tasks (see the sketch after this list).
- `process_hf_dataset(ds, *, hf_sim_path, hf_input_template)` — `map_blocks`-compatible function that iterates over the stations in a chunk and calls `hf_simulate_station`. Returns `waveform` (component × station × time) and `epicentre_distance` (station).
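A hypothetical sketch of that chunking arithmetic; the CSV columns and the helper name `load_stations` are assumptions, not the real module:

```python
import pandas as pd
import xarray as xr

def load_stations(station_csv: str) -> xr.Dataset:
    """Illustrative stand-in for load_hf_dataset's chunking step."""
    stations = pd.read_csv(station_csv)  # assumed columns: name, lon, lat
    ds = xr.Dataset(
        {
            "longitude": ("station", stations["lon"].to_numpy()),
            "latitude": ("station", stations["lat"].to_numpy()),
        },
        coords={"station": stations["name"].to_numpy()},
    )
    # max(1, n // 500) keeps the graph at roughly 500-1000 tasks,
    # e.g. 1500 stations -> chunk_size 3 -> 500 chunks of 3 stations.
    chunk_size = max(1, len(stations) // 500)
    return ds.chunk({"station": chunk_size})
```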
### CLI Refactor

- Pipeline: `load_hf_dataset` → build a template with `dask.array.empty` → `xr.map_blocks(process_hf_dataset)` → assign metadata lazily → `ds.to_netcdf(out_file, engine="h5netcdf")` (a sketch of the template step follows below).
- No `.compute()`, `.result()`, or `.values` on large data — Dask streams chunks to disk one at a time.
- Uses a `LocalCluster` for parallel subprocess execution within the container.
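A sketch of the empty-template pattern under assumed shapes; `simulate_block` stands in for `process_hf_dataset`:

```python
import dask.array as da
import numpy as np
import xarray as xr

n_comp, n_station, n_t, chunk = 3, 1_000, 512, 200  # assumed sizes

stations = xr.Dataset(
    {"seed": ("station", da.arange(n_station, chunks=chunk))},
    coords={"station": np.arange(n_station)},
)

def simulate_block(block: xr.Dataset) -> xr.Dataset:
    """Stand-in for process_hf_dataset: handles one chunk of stations."""
    n = block.sizes["station"]
    return xr.Dataset(
        {
            "waveform": (("component", "station", "time"),
                         np.zeros((n_comp, n, n_t))),
            "epicentre_distance": ("station", np.zeros(n)),
        },
        coords={"station": block["station"]},
    )

# dask.array.empty allocates nothing: the template only describes the
# shape, dtype, and chunking of the eventual output.
template = xr.Dataset(
    {
        "waveform": (("component", "station", "time"),
                     da.empty((n_comp, n_station, n_t),
                              chunks=(n_comp, chunk, n_t))),
        "epicentre_distance": ("station", da.empty(n_station, chunks=chunk)),
    },
    coords={"station": stations["station"]},
)

result = xr.map_blocks(simulate_block, stations, template=template)
result.to_netcdf("hf.nc", engine="h5netcdf")  # streams chunk by chunk
```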
### Dependencies

- Added `dask[distributed]` to `pyproject.toml`.
### Tests

- `test_load_hf_dataset` — Checks dataset structure, coordinates, and attributes.
- `test_load_hf_dataset_chunking` — Verifies the chunking logic with 1500 stations (chunk_size=3, no data loss); a hypothetical version is sketched below.
- `test_process_hf_dataset_structure` — Mocks `hf_simulate_station` and verifies the output dataset's dimensions.
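A hypothetical shape for that chunking test; the import path, `load_hf_dataset`'s signature, and the CSV columns are all assumptions:

```python
import pandas as pd

from hf_sim import load_hf_dataset  # assumed import path

def test_chunking_1500_stations(tmp_path):
    csv = tmp_path / "stations.csv"
    pd.DataFrame({
        "name": [f"st{i}" for i in range(1500)],
        "lon": 172.0,
        "lat": -43.5,
    }).to_csv(csv, index=False)

    ds = load_hf_dataset(csv)  # assumed signature

    # 1500 // 500 == 3, so every chunk holds exactly 3 stations...
    assert all(size == 3 for size in ds.chunks["station"])
    # ...and no stations are dropped.
    assert ds.sizes["station"] == 1500
```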
### Notes

- `hf_simulate_station` and `process_hf_dataset` are standalone (not nested) functions for pickle serialization to Dask workers (see the sketch below).
- Per-chunk memory use is bounded by the `hb_high_binmod` binary's memory footprint (documented in the module docstring).
- The `map_blocks` template uses `dask.array.empty` to avoid allocating memory for the full waveform array upfront.
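A short sketch of the serialization constraint behind the first note (function bodies elided):

```python
# Module level: pickled by reference, so any Dask worker that can import
# the module can run it.
def process_hf_dataset(ds, *, hf_sim_path, hf_input_template):
    ...

def run_hf():
    # Avoided: defining the processor here as a nested function would force
    # by-value serialisation of the closure and everything it captures.
    # Per-run settings travel as explicit kwargs instead, e.g.:
    # xr.map_blocks(process_hf_dataset, ds,
    #               kwargs={"hf_sim_path": ..., "hf_input_template": ...})
    ...
```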