Vs30 refactor by AndrewRidden-Harper · Pull Request #45 · ucgmsim/Vs30

AndrewRidden-Harper · 2026-05-13T00:36:17Z

Re-opened after history rewrite

This PR is a re-opened version of an earlier one. Between then and now, I used git filter-repo to remove some unused benchmark files from the repo's history. The rewrite worked as intended but changed every commit's SHA, which orphaned the previous branch from master and made the old PR unmergeable, so I closed it and re-targeted the same work against the rewritten history. The content is unchanged; only the underlying SHAs have shifted.

Important

If you have an existing local clone of this repository, delete it and re-clone from scratch before pulling these changes. Your local commits share no history with origin anymore - git pull on top of an old clone will fail or produce a broken state. A fresh git clone is the safe move.

Summary

This PR continues the Vs30 package refactor, addressing prior review feedback, adding two new bundled models and support for custom YAML configurations, and incorporating other improvements.

Addressed review feedback

Several review comments from the previous PR have all been addressed. Most do not warrant a specific call-out.

Two slightly more structural changes are worth flagging:

Shared parameters lifted from per-model configs into constants.py. Values that were previously declared in each of the bundled model config files are now defined once in code; the config files are correspondingly leaner.
Bundled model versions are specified by a StrEnum rather than their config file name. The four valid names live as FixedModelVersion members in vs30/constants.py, giving a single canonical source for the supported set.

Additional development work

Ensured exact correspondence between points and grid modes. New tests verify identical outputs at tricky locations like geological category boundaries, missing-value edges in the terrain raster, and similar edge cases.
Tightened grid-bounds semantics to ensure unambiguous pixel-edge interpretation and eliminate the potential for inconsistencies from GDAL tie-breaking behaviour.
Added the jaehwi_v1p0 model (in addition to the existing modified_foster_2019 and viktor_cpt_clustering), along with the gap-filling functionality it requires.
Added a near-reproduction of the Foster et al. (2019) model (foster_2019_approx). Exact reproduction wasn't pursued because the published model has known flaws.
Added support for custom YAML model configurations. Users can pass a YAML config path in place of a bundled model name to define their own model variants without modifying the package.
Added small end-to-end benchmark tests with reference outputs generated by the legacy code, to verify reproduction.
Removed the Python multiprocessing option. Testing across a wide range of scenarios showed it was consistently slower than the single-process path, which is already multi-threaded via the BLAS used by NumPy/SciPy.
Other performance touch-ups — einsum-based MVN distances, gapfill memory optimisation, Float32 coast-distance arrays, etc.
Added a brief GitHub wiki.

Why points and grid are separate pipelines

A natural design question is why the package maintains two separate pipelines rather than collapsing them into one. Both pipelines produce identical results at every location (verified by the correspondence tests above); they are kept separate purely for performance. Each direction of unification has its own performance problem:

Have vs30 grid call the points pipeline at each pixel. This loses the bulk array operations that the grid pipeline gets from GDAL: bulk raster reads, bulk sampling, bulk spatial adjustment. At national-grid sizes (millions of pixels) it would be impractically slow.
Have vs30 points run the grid pipeline and sample the desired sites from it. This would compute Vs30 across an entire grid regardless of how many sites are requested, making typical points-mode queries needlessly slow.

…test config loading - Consolidate mp.get_context("spawn") into parallel.py as single source - Extract _default_correlation_functions() in pipeline.py - Extract _collect_observation_csvs() in pipeline.py - Move shared test config loading to conftest.py Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Integrate gap-fill into points_pipeline so that on-land nodata points are filled via local grid sub-pipelines, matching the grid_pipeline gap-fill behaviour. Regenerate benchmark CSV to include the new vs30_before_gapfill and stdv_before_gapfill columns. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Now using Jaehwi's own sites_load_NSHM2022.py code

Evaluates approaches for testing grid/points pipeline consistency across the full NZ domain and all model versions. Settles on 3x3 local grids per test point with pre-computed deliberate + random point selection and a two-tier fast/slow test structure. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Five tasks: generate test points fixture, register slow marker, add config-loading helper, rewrite test with full NZ/all-model coverage, and verify existing tests still pass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…del versions Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Drop legacy-R-code references from the corr_zero and "no nearby obs" comments - Tighten pixel/obs_data param descriptions - Tighten corr_zero description (drop "i.e." preamble) - Update max_points doc with tie-behavior note - Align noisy/cov_reduc wording with build_covariance_matrix Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Tighten docstring (drop "In-memory" qualifier, "Model type:" prefix, trailing "Default is False") - Specify raster_data's purpose (provides transform for locating obs) - Drop the second comment line listing specific GID ranges (constants are referenced directly below) - Rewrite the within-grid/outside-grid sampling comment more concisely - Initialize coast_obs with np.full(np.nan) instead of np.empty, removing the dependency on downstream skipping uninitialized entries when coastal mod is off - Trim the 6-line legacy slope NODATA comment to 3 lines (the LEGACY rename comes in a follow-up commit) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…SENTINEL Drop "legacy" framing from the constant name and the three comment blocks that reference it. Behavior unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Specify raster_data fields used (valid-pixel mask, grid transform) - Tighten obs_data, model_type, max_dist_m descriptions - Switch Returns to named-entry style - Inline str(model_type).capitalize() at its two f-string sites; drop label Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Specify raster_data fields; tighten obs_data description - Reformat corr_fn line wrap - Update max_points doc with tie-behavior note - Align noisy/cov_reduc wording with build_covariance_matrix - Switch Returns to named-entry style Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rvation data" "Bundled" could be read as "comes from our bundled defaults", which is misleading — obs_data can come from a user-supplied CSV via YAML config. "Prepared" signals that residuals/noise_weights are populated (output of prepare_observation_data) without implying anything about source. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- compute_spatial_adjustments → compute_spatial_pixel_adjustments - compute_spatial_adjustment_at_points → compute_spatial_point_adjustments Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Switch to multi-line summary form (consistent with other functions) - Drop "Delegates to compute_spatial_adjustment_for_pixel" implementation detail - Replace "Pre-filtered observation data..." with "Prepared observation data"; move the noisy-coupling note to the noisy param where it belongs - Tighten max_dist_m wording (in meters); add tie-behaviour note to max_points - Align noisy/cov_reduc wording with build_covariance_matrix - Mark noisy and cov_reduc as optional (both have defaults) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Clarify model_values_df required columns - Align corr_fn with the distances→correlations wording used elsewhere - Expand noisy to cover both residual and covariance effects Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Drop leading underscore on _build_obs_data per org convention - Normalise docstring openers to one-line summaries - Add types to Parameters sections (DataFrame, bool, etc.) - Tighten Returns descriptions and drop duplicated "Returns" wording - Replace bare "points_pipeline" identifier with "the points pipeline" Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…dev scripts Audit confirmed DEFAULT_GEOLOGY_PHI/DEFAULT_TERRAIN_PHI are unused in production: phi is read from YAML configs by correlations.resolve_correlation, and the 1407/993 values are duplicated as literals in vs30/configs/*.yaml. The Python constants were referenced only by 6 dev scripts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Reorder so GEOSPATIAL_DIR derives from RESOURCE_PATH; correct RESOURCE_PATH comment - Fix MIN_DIST_ENFORCED comment (the "division by zero" claim was wrong) - Tighten K_VALUE comment, reference CombinationMethod enum - One-line HybridGeologyParams and ModelType docstrings - Normalise caps-y section headers - Wrap long GEOLOGY_VS30_MEAN_STDDEV_FILENAME literal to match TERRAIN counterpart - Lower GAPFILL_MAX_HALF_WIDTH_M 50050 → 50000 (behavioural): the loop now runs 9 iterations instead of 10, with a largest attempted half-width of 45050 m (~90 km grid edge), still well beyond plausible donor distances within NZ. The old 50050 value was contorted to preserve a 10-iteration count matching an earlier arbitrary ceiling. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Was a no-op wrapper for GridConfig(**data). Never called from production, tests, or dev — confirmed via package-wide function-usage audit. Dates from when grid params were YAML-loaded; current CLI passes typed kwargs directly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ntions Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request executes a comprehensive refactoring of the VS30 package, introducing a modular architecture, a Typer-based CLI, and a YAML-driven configuration system for model versions. Key enhancements include updated dependencies for Python 3.11+, a robust suite of benchmark and consistency tests, and a standardized coordinate system using NZTM2000. Feedback highlights several critical areas for optimization and correction: the MVN spatial adjustment in spatial.py should utilize a spatial index to resolve performance bottlenecks caused by high computational complexity; the Bayesian update logic in category.py must clip the posterior standard deviation to correctly maintain minimum uncertainty; the exponential correlation function should not enforce a minimum distance at zero to avoid incorrect variance calculations; and the gap-filling logic in pipeline.py should be batched to improve efficiency for large point sets.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drop unused threadpoolctl, declare pyproj as a test dep (used in test_benchmarks), and ignore test_benchmarks under DEP001 since it's imported across test files. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resolve the 19 non-qcore ty diagnostics on this branch: - gapfill.create_local_grid_config: ``round()`` the float bounds back to int so they match ``GridConfig`` annotations. - spatial.prepare_observation_data: narrow ``slope_array`` / ``coast_dist_array`` inside the GEOLOGY branch (function-entry check already raises if they're missing). - pipeline.compute_component_grid: restructure the ``if/elif/else`` over ``posterior_df`` into ``if/else { if/else }`` so the inner branches can narrow ``categorical_model_csv``. - pipeline.grid_pipeline: introduce ``GridPipelineResult`` TypedDict so each key has its precise value type, removing union noise at call-sites (``fill_one_point_via_local_grid``, tests). - pipeline.points_pipeline: convert the ``... if run_X else None`` conditional-expression assignments to ``if/else`` statements and add ``type guard`` asserts so ty can narrow downstream uses. Standardise type-guard asserts (including pre-existing ones) on a ``# type guard: <runtime invariant>`` comment so readers know they exist for the static checker. The 2 remaining ``unresolved-import: qcore`` diagnostics are environment-specific and persist in CI; they're unrelated to this fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CI runs ``deptry -ddg test,types``, classifying the ``test`` extras group as dev. DEP004 then fires on any dev-classified package imported in tests/. pytest is already on the ignore list for the same reason. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rior Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the per-pixel O(N_obs) np.partition scan in select_observations_for_pixel with a per-pixel KDTree query. ObservationData carries the tree, built lazily in __post_init__ from locations. ~22x per-call speedup on the viktor 35,706-obs benchmark (microbenchmark); pinned raster/points benchmarks and grid/points consistency tests pass unchanged across all four model versions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

AndrewRidden-Harper and others added 30 commits March 31, 2026 05:06

add gap-fill design spec for filling nodata gaps in VS30 output

58d8ff1

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

add gap-fill constants for nodata filling

172db84

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

add gapfill module with classify_nodata and fill_nodata_grid

b5bc512

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

return id_array from compute_model_grid for gap-fill

c9c85d7

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

add gap-fill Stage 6 to grid_pipeline

b0548e8

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

add gap-fill integration tests

9f747d4

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

gapfill improvements

04541d2

delete flawed Jaehwi data reconstruction

b85e431

Now using Jaehwi's own sites_load_NSHM2022.py code

tidied up constants.py

b19dc8f

gave tests more descriptive names

840f2a6

del docs for working

0b3b94d

add consistency test point generation script and fixture CSV

c23606e

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix docstring and hoist coast_boundary in point generation script

2567214

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

register slow pytest marker in pyproject.toml

33b83b8

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

add load_fixed_model_config helper to test conftest

44c0692

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

rewrite grid/points consistency test with full NZ coverage and all mo…

44c3427

…del versions Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

handle asymmetric NaN in consistency test failure messages

ecac661

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ruff linter fix

88550d7

numpydocs linter fix

4067843

ruff linter fix

3c02fb7

numpydoc linter fix

776ad48

ty type check fix

367b2fc

ty type check fix

a480f46

replace np.isclose with pytest.approx in tests

0105230

fix Bayesian update col type bug

cfdc8f6

skip slow tests by default

445bacf

AndrewRidden-Harper and others added 22 commits May 11, 2026 16:03

refactor: rename LEGACY_OBS_SLOPE_NODATA_SENTINEL → OBS_SLOPE_NODATA_…

ef309a5

…SENTINEL Drop "legacy" framing from the constant name and the three comment blocks that reference it. Behavior unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

refactor: rename inner-loop MVN functions to be explicit about target

0db2cd0

- compute_spatial_adjustments → compute_spatial_pixel_adjustments - compute_spatial_adjustment_at_points → compute_spatial_point_adjustments Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs(wiki): document custom YAML configs and refresh CLI references

e35dcce

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs(readme): defer CLI examples to wiki

4913740

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs(readme): drop "How It Works" section

006579f

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs(wiki): add Foster et al. (2019) citation under Supported Models

ff48c57

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs(wiki): link in-repo PDF in Foster 2019 reference

2cee036

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs(wiki): match repo style (Vs30) in Foster 2019 title

0ce2d87

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

refactor(tests): trim docstrings, drop low-value tests, and fix conve…

662ebf8

…ntions Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs(wiki): rewrite Home and Usage for CLI newcomers

3257dc3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

refactor: drop dev/ scripts (preserved on dev_scripts_archive)

5c4fcad

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist Bot reviewed May 13, 2026

View reviewed changes

Comment thread vs30/spatial.py

Comment thread vs30/category.py

Comment thread vs30/correlations.py

Comment thread vs30/pipeline.py

AndrewRidden-Harper and others added 7 commits May 13, 2026 12:59

refactor(gapfill): tidy docstrings and comments

61c1a69

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

docs: satisfy numpydoc-lint in constants/raster/spatial

ae3e17d

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs(constants): clarify MIN_SIGMA is applied to the prior, not poste…

9db6efe

…rior Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vs30 refactor#45

Vs30 refactor#45
AndrewRidden-Harper wants to merge 444 commits into
masterfrom
vs30_refactor

AndrewRidden-Harper commented May 13, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

AndrewRidden-Harper commented May 13, 2026

Re-opened after history rewrite

Summary

Addressed review feedback

Additional development work

Why points and grid are separate pipelines

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant