Refactor hf-sim to use xarray and Dask for larger-than-memory station processing #102

Draft
Copilot wants to merge 4 commits into pegasus from copilot/refactor-hf-sim-using-xarray-dask

Conversation

Contributor

Copilot AI commented Apr 8, 2026

Summary

Refactors the hf-sim script to replace ThreadPoolExecutor with xarray.map_blocks + dask.distributed for parallelization, enabling scaling to 100k+ stations and supporting larger-than-memory output files (up to 100GB+). The entire Dask graph flows lazily to to_netcdf, so the full waveform array never needs to reside in memory.

Changes

New Functions

  • load_hf_dataset(...) — Reads the station CSV, computes seeds and vref, returns a chunked xr.Dataset indexed by station. Chunk size is calculated as max(1, total_stations // 500) to keep the task graph between ~500–1000 tasks.
  • process_hf_dataset(ds, *, hf_sim_path, hf_input_template) — map_blocks-compatible function that iterates over the stations in a chunk and calls hf_simulate_station. Returns waveform (component × station × time) and epicentre_distance (station).
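The chunk-size rule above is simple enough to sketch directly (the helper name is hypothetical, not from the PR):

```python
# Hypothetical helper capturing the chunk-size rule quoted above:
# max(1, total_stations // 500) keeps the Dask graph near 500-1000 tasks.
def station_chunk_size(total_stations: int) -> int:
    return max(1, total_stations // 500)

print(station_chunk_size(1500))     # 3   -> 500 chunks of 3 stations
print(station_chunk_size(100))      # 1   -> small runs: one station per task
print(station_chunk_size(100_000))  # 200 -> 500 chunks of 200 stations
```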

CLI Refactor

  • Workflow: load_hf_dataset → build template with dask.array.empty → xr.map_blocks(process_hf_dataset) → assign metadata lazily → ds.to_netcdf(out_file, engine="h5netcdf").
  • No eager .compute(), .result(), or .values on large data — Dask streams chunks to disk one at a time.
  • Uses LocalCluster for parallel subprocess execution within the container.
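The workflow above can be sketched on toy data, with a stand-in for process_hf_dataset (the real function shells out to hb_high_binmod per station; the dimension sizes here are illustrative):

```python
# Minimal sketch of the lazy template-plus-map_blocks pipeline.
import dask.array as da
import numpy as np
import xarray as xr

N_STATIONS, N_COMP, NT, CHUNK = 6, 3, 100, 2

# Input dataset, chunked along the station dimension as load_hf_dataset would.
ds = xr.Dataset(
    {"seed": ("station", np.arange(N_STATIONS))},
    coords={"station": np.arange(N_STATIONS)},
).chunk({"station": CHUNK})

# Template built with dask.array.empty: declares the output's shape, dtype,
# and chunks without allocating the full waveform array.
template = xr.Dataset(
    {
        "waveform": (
            ("component", "station", "time"),
            da.empty((N_COMP, N_STATIONS, NT), chunks=(N_COMP, CHUNK, NT)),
        ),
        "epicentre_distance": ("station", da.empty(N_STATIONS, chunks=CHUNK)),
    },
    coords={
        "component": ["090", "000", "ver"],
        "station": ds["station"],
        "time": np.arange(NT),
    },
)

def process_block(block: xr.Dataset) -> xr.Dataset:
    # Stand-in for process_hf_dataset: fills waveforms for one station chunk.
    n = block.sizes["station"]
    return xr.Dataset(
        {
            "waveform": (("component", "station", "time"), np.ones((N_COMP, n, NT))),
            "epicentre_distance": ("station", np.zeros(n)),
        },
        coords={
            "component": ["090", "000", "ver"],
            "station": block["station"],
            "time": np.arange(NT),
        },
    )

result = xr.map_blocks(process_block, ds, template=template)
# Still lazy here; in the real CLI the graph flows straight into
# result.to_netcdf(out_file, engine="h5netcdf"), which writes chunk by chunk.
print(float(result["waveform"].sum().compute()))  # 1800.0
```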

Dependencies

  • Added dask[distributed] to pyproject.toml.

Tests

  • test_load_hf_dataset — Checks dataset structure, coordinates, and attributes.
  • test_load_hf_dataset_chunking — Verifies chunking logic with 1500 stations (chunk_size=3, no data loss).
  • test_process_hf_dataset_structure — Mocks hf_simulate_station and verifies output dataset dimensions.
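The chunking test described above can be sketched roughly like this, with a toy dataset in place of the real CSV loader:

```python
# Rough sketch of the 1500-station chunking check: chunk size 3, no data loss.
import numpy as np
import xarray as xr

n = 1500
ds = xr.Dataset(
    {"seed": ("station", np.arange(n))},
    coords={"station": np.arange(n)},
).chunk({"station": max(1, n // 500)})

assert ds.chunks["station"] == (3,) * 500       # 500 chunks of 3 stations
assert int(ds["seed"].count().compute()) == n   # every station survives chunking
```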

Notes

  • hf_simulate_station and process_hf_dataset are standalone (not nested) functions for pickle serialization to Dask workers.
  • Dask worker memory limits should account for the external hb_high_binmod binary's memory footprint (documented in module docstring).
  • The map_blocks template uses dask.array.empty to avoid allocating memory for the full waveform array upfront.
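A hedged sketch of the worker sizing implied above — the worker count and the 2 GB limit are illustrative assumptions, not values from the PR:

```python
# Sketch: cap Dask worker memory so the external hb_high_binmod subprocess,
# which Dask's memory manager cannot see, still fits on the node.
import dask.array as da
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(
    n_workers=2,            # illustrative; size to the node
    threads_per_worker=1,   # one external subprocess per worker at a time
    memory_limit="2GB",     # per worker; leave headroom for the binary itself
)
client = Client(cluster)

# Any Dask work now runs on these capped workers.
total = da.ones((1000,), chunks=100).sum().compute()
print(total)  # 1000.0
client.close()
cluster.close()
```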

Copilot AI and others added 2 commits April 8, 2026 09:05
- Add HostType(str, Enum) with 'local' and 'slurm' values
- Add dask_cluster context manager for LocalCluster/SLURMCluster
- Add load_hf_dataset: loads stations into chunked xarray Dataset
- Add process_hf_dataset: map_blocks-compatible station processor
- Refactor run_hf CLI to use --host option and Dask workflow
- Replace ThreadPoolExecutor with xr.map_blocks + dask.distributed
- Add dask[distributed] and dask-jobqueue dependencies
- Add tests for HostType, load_hf_dataset, process_hf_dataset

Agent-Logs-Url: https://github.com/ucgmsim/workflow/sessions/66b1316d-545e-44d8-96d2-317087688eed

Co-authored-by: lispandfound <12835929+lispandfound@users.noreply.github.com>
… test

Agent-Logs-Url: https://github.com/ucgmsim/workflow/sessions/66b1316d-545e-44d8-96d2-317087688eed

Co-authored-by: lispandfound <12835929+lispandfound@users.noreply.github.com>
Contributor

@lispandfound left a comment


I've changed my mind here. This is run in a container, so a SLURMCluster is a bad idea because it requires too many hacks. However, I do believe this template is still useful, because Dask allows us to process larger-than-memory datasets, which multiprocessing won't do. Take out the SLURM stuff and refactor this to focus on being able to process larger-than-memory outputs. In particular, you should not eagerly compute the whole array at any point (no .values, no .result, no .compute); instead, let the graph flow all the way to to_netcdf before saving, so Dask handles the memory pressure. These output files can be up to 100 GB.

… output

- Remove HostType enum, dask_cluster context manager, --host CLI option
- Remove dask-jobqueue dependency
- Use dask.array.empty for template to avoid allocating memory
- Remove all eager .compute()/.result()/.values on large data
- Let Dask graph flow lazily to to_netcdf for chunk-by-chunk writing
- Remove test_host_type_values test

Agent-Logs-Url: https://github.com/ucgmsim/workflow/sessions/dee90e3b-9cd0-4ac3-afcc-58a8a3d10c06

Co-authored-by: lispandfound <12835929+lispandfound@users.noreply.github.com>
Contributor Author

Copilot AI commented Apr 8, 2026

I've changed my mind here. This is run in a container, so a slurmcluster is a bad idea because it requires too many hacks. However I do believe this template is still useful because dask allows us to ...

Done in 9953992. Removed all SLURM/HostType/dask-jobqueue code and the --host CLI option. The pipeline is now fully lazy:

  • Template uses dask.array.empty so no memory is allocated upfront for large waveform arrays
  • No .compute(), .result(), or .values on large data — the Dask graph flows from map_blocks all the way to to_netcdf
  • to_netcdf streams chunks to disk one at a time, so 100GB+ output files never need to reside fully in memory
  • LocalCluster is used for parallel subprocess execution within the container

Copilot AI changed the title from "Refactor hf-sim to use xarray and Dask for distributed station processing" to "Refactor hf-sim to use xarray and Dask for larger-than-memory station processing" on Apr 8, 2026
Copilot AI requested a review from lispandfound April 8, 2026 09:59
Comment thread: tests/test_hf.py (Outdated)
assert "seed" in ds.data_vars
assert "vref" in ds.data_vars

np.testing.assert_allclose(ds["latitude"].values, [-43.5, -43.6, -43.7])
Contributor


You should use xarray testing facilities to simplify this test

Contributor Author


Refactored in 9b69839. test_load_hf_dataset now builds an expected xr.Dataset and uses xr.testing.assert_allclose for the data variable comparison. test_process_hf_dataset_structure uses xr.testing.assert_equal for coordinate comparisons instead of np.testing.assert_array_equal.
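For reference, a minimal sketch of those xarray testing helpers on toy data (values taken from the diff hunk above):

```python
import xarray as xr

actual = xr.DataArray([-43.5, -43.6, -43.7], dims="station", name="latitude")
expected = xr.DataArray([-43.5, -43.6, -43.7], dims="station", name="latitude")

# Compare values, dims, and coords in one call; raises AssertionError on mismatch.
xr.testing.assert_allclose(actual, expected)
xr.testing.assert_equal(actual, expected)
print("ok")
```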

Replace np.testing.assert_allclose/.assert_array_equal with
xr.testing.assert_allclose/assert_equal for dataset and coordinate
comparisons in test_load_hf_dataset and test_process_hf_dataset_structure.

Agent-Logs-Url: https://github.com/ucgmsim/workflow/sessions/eb2dbecf-d37e-45e1-bc15-020c0d78111f

Co-authored-by: lispandfound <12835929+lispandfound@users.noreply.github.com>