Skip to content

Vs30 refactor#45

Open
AndrewRidden-Harper wants to merge 444 commits into
masterfrom
vs30_refactor
Open

Vs30 refactor#45
AndrewRidden-Harper wants to merge 444 commits into
masterfrom
vs30_refactor

Conversation

@AndrewRidden-Harper
Copy link
Copy Markdown
Contributor

Re-opened after history rewrite

This PR is a re-opened version of an earlier one. Between then and now, I used git filter-repo to remove some unused benchmark files from the repo's history. The rewrite worked as intended but changed every commit's SHA, which orphaned the previous branch from master and made the old PR unmergeable, so I closed it and re-targeted the same work against the rewritten history. The content is unchanged; only the underlying SHAs have shifted.

Important

If you have an existing local clone of this repository, delete it and re-clone from scratch before pulling these changes. Your local commits share no history with origin anymore - git pull on top of an old clone will fail or produce a broken state. A fresh git clone is the safe move.

Summary

This PR continues the Vs30 package refactor, addressing prior review feedback, adding two new bundled models and support for custom YAML configurations, and incorporating other improvements.

Addressed review feedback

Several review comments from the previous PR have all been addressed. Most do not warrant a specific call-out.

Two slightly more structural changes are worth flagging:

  • Shared parameters lifted from per-model configs into constants.py. Values that were previously declared in each of the bundled model config files are now defined once in code; the config files are correspondingly leaner.
  • Bundled model versions are specified by a StrEnum rather than their config file name. The four valid names live as FixedModelVersion members in vs30/constants.py, giving a single canonical source for the supported set.

Additional development work

  • Ensured exact correspondence between points and grid modes. New tests verify identical outputs at tricky locations like geological category boundaries, missing-value edges in the terrain raster, and similar edge cases.
  • Tightened grid-bounds semantics to ensure unambiguous pixel-edge interpretation and eliminate the potential for inconsistencies from GDAL tie-breaking behaviour.
  • Added the jaehwi_v1p0 model (in addition to the existing modified_foster_2019 and viktor_cpt_clustering), along with the gap-filling functionality it requires.
  • Added a near-reproduction of the Foster et al. (2019) model (foster_2019_approx). Exact reproduction wasn't pursued because the published model has known flaws.
  • Added support for custom YAML model configurations. Users can pass a YAML config path in place of a bundled model name to define their own model variants without modifying the package.
  • Added small end-to-end benchmark tests with reference outputs generated by the legacy code, to verify reproduction.
  • Removed the Python multiprocessing option. Testing across a wide range of scenarios showed it was consistently slower than the single-process path, which is already multi-threaded via the BLAS used by NumPy/SciPy.
  • Other performance touch-ups — einsum-based MVN distances, gapfill memory optimisation, Float32 coast-distance arrays, etc.
  • Added a brief GitHub wiki.

Why points and grid are separate pipelines

A natural design question is why the package maintains two separate pipelines rather than collapsing them into one. Both pipelines produce identical results at every location (verified by the correspondence tests above); they are kept separate purely for performance. Each direction of unification has its own performance problem:

  • Have vs30 grid call the points pipeline at each pixel. This loses the bulk array operations that the grid pipeline gets from GDAL: bulk raster reads, bulk sampling, bulk spatial adjustment. At national-grid sizes (millions of pixels) it would be impractically slow.
  • Have vs30 points run the grid pipeline and sample the desired sites from it. This would compute Vs30 across an entire grid regardless of how many sites are requested, making typical points-mode queries needlessly slow.

AndrewRidden-Harper and others added 30 commits March 31, 2026 05:06
…test config loading

- Consolidate mp.get_context("spawn") into parallel.py as single source
- Extract _default_correlation_functions() in pipeline.py
- Extract _collect_observation_csvs() in pipeline.py
- Move shared test config loading to conftest.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Integrate gap-fill into points_pipeline so that on-land nodata points
are filled via local grid sub-pipelines, matching the grid_pipeline
gap-fill behaviour. Regenerate benchmark CSV to include the new
vs30_before_gapfill and stdv_before_gapfill columns.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Now using Jaehwi's own sites_load_NSHM2022.py code
Evaluates approaches for testing grid/points pipeline consistency
across the full NZ domain and all model versions. Settles on 3x3
local grids per test point with pre-computed deliberate + random
point selection and a two-tier fast/slow test structure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Five tasks: generate test points fixture, register slow marker,
add config-loading helper, rewrite test with full NZ/all-model
coverage, and verify existing tests still pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…del versions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
AndrewRidden-Harper and others added 22 commits May 11, 2026 16:03
- Drop legacy-R-code references from the corr_zero and "no nearby obs" comments
- Tighten pixel/obs_data param descriptions
- Tighten corr_zero description (drop "i.e." preamble)
- Update max_points doc with tie-behavior note
- Align noisy/cov_reduc wording with build_covariance_matrix

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Tighten docstring (drop "In-memory" qualifier, "Model type:" prefix,
  trailing "Default is False")
- Specify raster_data's purpose (provides transform for locating obs)
- Drop the second comment line listing specific GID ranges (constants
  are referenced directly below)
- Rewrite the within-grid/outside-grid sampling comment more concisely
- Initialize coast_obs with np.full(np.nan) instead of np.empty, removing
  the dependency on downstream skipping uninitialized entries when coastal
  mod is off
- Trim the 6-line legacy slope NODATA comment to 3 lines (the LEGACY
  rename comes in a follow-up commit)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…SENTINEL

Drop "legacy" framing from the constant name and the three comment blocks
that reference it. Behavior unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Specify raster_data fields used (valid-pixel mask, grid transform)
- Tighten obs_data, model_type, max_dist_m descriptions
- Switch Returns to named-entry style
- Inline str(model_type).capitalize() at its two f-string sites; drop label

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Specify raster_data fields; tighten obs_data description
- Reformat corr_fn line wrap
- Update max_points doc with tie-behavior note
- Align noisy/cov_reduc wording with build_covariance_matrix
- Switch Returns to named-entry style

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rvation data"

"Bundled" could be read as "comes from our bundled defaults", which is
misleading — obs_data can come from a user-supplied CSV via YAML config.
"Prepared" signals that residuals/noise_weights are populated (output of
prepare_observation_data) without implying anything about source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- compute_spatial_adjustments         → compute_spatial_pixel_adjustments
- compute_spatial_adjustment_at_points → compute_spatial_point_adjustments

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Switch to multi-line summary form (consistent with other functions)
- Drop "Delegates to compute_spatial_adjustment_for_pixel" implementation detail
- Replace "Pre-filtered observation data..." with "Prepared observation data";
  move the noisy-coupling note to the noisy param where it belongs
- Tighten max_dist_m wording (in meters); add tie-behaviour note to max_points
- Align noisy/cov_reduc wording with build_covariance_matrix
- Mark noisy and cov_reduc as optional (both have defaults)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Clarify model_values_df required columns
- Align corr_fn with the distances→correlations wording used elsewhere
- Expand noisy to cover both residual and covariance effects

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Drop leading underscore on _build_obs_data per org convention
- Normalise docstring openers to one-line summaries
- Add types to Parameters sections (DataFrame, bool, etc.)
- Tighten Returns descriptions and drop duplicated "Returns" wording
- Replace bare "points_pipeline" identifier with "the points pipeline"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dev scripts

Audit confirmed DEFAULT_GEOLOGY_PHI/DEFAULT_TERRAIN_PHI are unused in
production: phi is read from YAML configs by correlations.resolve_correlation,
and the 1407/993 values are duplicated as literals in vs30/configs/*.yaml.
The Python constants were referenced only by 6 dev scripts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Reorder so GEOSPATIAL_DIR derives from RESOURCE_PATH; correct RESOURCE_PATH comment
- Fix MIN_DIST_ENFORCED comment (the "division by zero" claim was wrong)
- Tighten K_VALUE comment, reference CombinationMethod enum
- One-line HybridGeologyParams and ModelType docstrings
- Normalise caps-y section headers
- Wrap long GEOLOGY_VS30_MEAN_STDDEV_FILENAME literal to match TERRAIN counterpart
- Lower GAPFILL_MAX_HALF_WIDTH_M 50050 → 50000 (behavioural):
  the loop now runs 9 iterations instead of 10, with a largest attempted
  half-width of 45050 m (~90 km grid edge), still well beyond plausible
  donor distances within NZ. The old 50050 value was contorted to
  preserve a 10-iteration count matching an earlier arbitrary ceiling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Was a no-op wrapper for GridConfig(**data). Never called from production,
tests, or dev — confirmed via package-wide function-usage audit. Dates
from when grid params were YAML-loaded; current CLI passes typed kwargs
directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ntions

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request executes a comprehensive refactoring of the VS30 package, introducing a modular architecture, a Typer-based CLI, and a YAML-driven configuration system for model versions. Key enhancements include updated dependencies for Python 3.11+, a robust suite of benchmark and consistency tests, and a standardized coordinate system using NZTM2000. Feedback highlights several critical areas for optimization and correction: the MVN spatial adjustment in spatial.py should utilize a spatial index to resolve performance bottlenecks caused by high computational complexity; the Bayesian update logic in category.py must clip the posterior standard deviation to correctly maintain minimum uncertainty; the exponential correlation function should not enforce a minimum distance at zero to avoid incorrect variance calculations; and the gap-filling logic in pipeline.py should be batched to improve efficiency for large point sets.

Comment thread vs30/spatial.py
Comment thread vs30/category.py
Comment thread vs30/correlations.py
Comment thread vs30/pipeline.py
AndrewRidden-Harper and others added 7 commits May 13, 2026 12:59
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop unused threadpoolctl, declare pyproj as a test dep (used in
test_benchmarks), and ignore test_benchmarks under DEP001 since it's
imported across test files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolve the 19 non-qcore ty diagnostics on this branch:

- gapfill.create_local_grid_config: ``round()`` the float bounds back to
  int so they match ``GridConfig`` annotations.
- spatial.prepare_observation_data: narrow ``slope_array`` /
  ``coast_dist_array`` inside the GEOLOGY branch (function-entry check
  already raises if they're missing).
- pipeline.compute_component_grid: restructure the
  ``if/elif/else`` over ``posterior_df`` into ``if/else { if/else }`` so
  the inner branches can narrow ``categorical_model_csv``.
- pipeline.grid_pipeline: introduce ``GridPipelineResult`` TypedDict so
  each key has its precise value type, removing union noise at
  call-sites (``fill_one_point_via_local_grid``, tests).
- pipeline.points_pipeline: convert the ``... if run_X else None``
  conditional-expression assignments to ``if/else`` statements and add
  ``type guard`` asserts so ty can narrow downstream uses.

Standardise type-guard asserts (including pre-existing ones) on a
``# type guard: <runtime invariant>`` comment so readers know they
exist for the static checker.

The 2 remaining ``unresolved-import: qcore`` diagnostics are
environment-specific and persist in CI; they're unrelated to this fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI runs ``deptry -ddg test,types``, classifying the ``test`` extras
group as dev. DEP004 then fires on any dev-classified package imported
in tests/. pytest is already on the ignore list for the same reason.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rior

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the per-pixel O(N_obs) np.partition scan in
select_observations_for_pixel with a per-pixel KDTree query.
ObservationData carries the tree, built lazily in __post_init__ from
locations. ~22x per-call speedup on the viktor 35,706-obs benchmark
(microbenchmark); pinned raster/points benchmarks and grid/points
consistency tests pass unchanged across all four model versions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant