Vs30 refactor#45
Conversation
…test config loading
- Consolidate mp.get_context("spawn") into parallel.py as single source
- Extract _default_correlation_functions() in pipeline.py
- Extract _collect_observation_csvs() in pipeline.py
- Move shared test config loading to conftest.py
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Integrate gap-fill into points_pipeline so that on-land nodata points are filled via local grid sub-pipelines, matching the grid_pipeline gap-fill behaviour. Regenerate benchmark CSV to include the new vs30_before_gapfill and stdv_before_gapfill columns. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Now using Jaehwi's own sites_load_NSHM2022.py code
Evaluates approaches for testing grid/points pipeline consistency across the full NZ domain and all model versions. Settles on 3x3 local grids per test point with pre-computed deliberate + random point selection and a two-tier fast/slow test structure. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Five tasks: generate test points fixture, register slow marker, add config-loading helper, rewrite test with full NZ/all-model coverage, and verify existing tests still pass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…del versions Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Drop legacy-R-code references from the corr_zero and "no nearby obs" comments - Tighten pixel/obs_data param descriptions - Tighten corr_zero description (drop "i.e." preamble) - Update max_points doc with tie-behavior note - Align noisy/cov_reduc wording with build_covariance_matrix Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Tighten docstring (drop "In-memory" qualifier, "Model type:" prefix, trailing "Default is False") - Specify raster_data's purpose (provides transform for locating obs) - Drop the second comment line listing specific GID ranges (constants are referenced directly below) - Rewrite the within-grid/outside-grid sampling comment more concisely - Initialize coast_obs with np.full(np.nan) instead of np.empty, removing the dependency on downstream skipping uninitialized entries when coastal mod is off - Trim the 6-line legacy slope NODATA comment to 3 lines (the LEGACY rename comes in a follow-up commit) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…SENTINEL Drop "legacy" framing from the constant name and the three comment blocks that reference it. Behavior unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Specify raster_data fields used (valid-pixel mask, grid transform) - Tighten obs_data, model_type, max_dist_m descriptions - Switch Returns to named-entry style - Inline str(model_type).capitalize() at its two f-string sites; drop label Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Specify raster_data fields; tighten obs_data description - Reformat corr_fn line wrap - Update max_points doc with tie-behavior note - Align noisy/cov_reduc wording with build_covariance_matrix - Switch Returns to named-entry style Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rvation data" "Bundled" could be read as "comes from our bundled defaults", which is misleading — obs_data can come from a user-supplied CSV via YAML config. "Prepared" signals that residuals/noise_weights are populated (output of prepare_observation_data) without implying anything about source. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- compute_spatial_adjustments → compute_spatial_pixel_adjustments - compute_spatial_adjustment_at_points → compute_spatial_point_adjustments Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Switch to multi-line summary form (consistent with other functions) - Drop "Delegates to compute_spatial_adjustment_for_pixel" implementation detail - Replace "Pre-filtered observation data..." with "Prepared observation data"; move the noisy-coupling note to the noisy param where it belongs - Tighten max_dist_m wording (in meters); add tie-behaviour note to max_points - Align noisy/cov_reduc wording with build_covariance_matrix - Mark noisy and cov_reduc as optional (both have defaults) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Clarify model_values_df required columns - Align corr_fn with the distances→correlations wording used elsewhere - Expand noisy to cover both residual and covariance effects Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Drop leading underscore on _build_obs_data per org convention - Normalise docstring openers to one-line summaries - Add types to Parameters sections (DataFrame, bool, etc.) - Tighten Returns descriptions and drop duplicated "Returns" wording - Replace bare "points_pipeline" identifier with "the points pipeline" Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dev scripts Audit confirmed DEFAULT_GEOLOGY_PHI/DEFAULT_TERRAIN_PHI are unused in production: phi is read from YAML configs by correlations.resolve_correlation, and the 1407/993 values are duplicated as literals in vs30/configs/*.yaml. The Python constants were referenced only by 6 dev scripts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Reorder so GEOSPATIAL_DIR derives from RESOURCE_PATH; correct RESOURCE_PATH comment - Fix MIN_DIST_ENFORCED comment (the "division by zero" claim was wrong) - Tighten K_VALUE comment, reference CombinationMethod enum - One-line HybridGeologyParams and ModelType docstrings - Normalise caps-y section headers - Wrap long GEOLOGY_VS30_MEAN_STDDEV_FILENAME literal to match TERRAIN counterpart - Lower GAPFILL_MAX_HALF_WIDTH_M 50050 → 50000 (behavioural): the loop now runs 9 iterations instead of 10, with a largest attempted half-width of 45050 m (~90 km grid edge), still well beyond plausible donor distances within NZ. The old 50050 value was contorted to preserve a 10-iteration count matching an earlier arbitrary ceiling. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Was a no-op wrapper for GridConfig(**data). Never called from production, tests, or dev — confirmed via package-wide function-usage audit. Dates from when grid params were YAML-loaded; current CLI passes typed kwargs directly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ntions Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request executes a comprehensive refactoring of the VS30 package, introducing a modular architecture, a Typer-based CLI, and a YAML-driven configuration system for model versions. Key enhancements include updated dependencies for Python 3.11+, a robust suite of benchmark and consistency tests, and a standardized coordinate system using NZTM2000. Feedback highlights several critical areas for optimization and correction: the MVN spatial adjustment in spatial.py should utilize a spatial index to resolve performance bottlenecks caused by high computational complexity; the Bayesian update logic in category.py must clip the posterior standard deviation to correctly maintain minimum uncertainty; the exponential correlation function should not enforce a minimum distance at zero to avoid incorrect variance calculations; and the gap-filling logic in pipeline.py should be batched to improve efficiency for large point sets.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop unused threadpoolctl, declare pyproj as a test dep (used in test_benchmarks), and ignore test_benchmarks under DEP001 since it's imported across test files. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolve the 19 non-qcore ty diagnostics on this branch:
- gapfill.create_local_grid_config: ``round()`` the float bounds back to
int so they match ``GridConfig`` annotations.
- spatial.prepare_observation_data: narrow ``slope_array`` /
``coast_dist_array`` inside the GEOLOGY branch (function-entry check
already raises if they're missing).
- pipeline.compute_component_grid: restructure the
``if/elif/else`` over ``posterior_df`` into ``if/else { if/else }`` so
the inner branches can narrow ``categorical_model_csv``.
- pipeline.grid_pipeline: introduce ``GridPipelineResult`` TypedDict so
each key has its precise value type, removing union noise at
call-sites (``fill_one_point_via_local_grid``, tests).
- pipeline.points_pipeline: convert the ``... if run_X else None``
conditional-expression assignments to ``if/else`` statements and add
``type guard`` asserts so ty can narrow downstream uses.
Standardise type-guard asserts (including pre-existing ones) on a
``# type guard: <runtime invariant>`` comment so readers know they
exist for the static checker.
The 2 remaining ``unresolved-import: qcore`` diagnostics are
environment-specific and persist in CI; they're unrelated to this fix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI runs ``deptry -ddg test,types``, classifying the ``test`` extras group as dev. DEP004 then fires on any dev-classified package imported in tests/. pytest is already on the ignore list for the same reason. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rior Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the per-pixel O(N_obs) np.partition scan in select_observations_for_pixel with a per-pixel KDTree query. ObservationData carries the tree, built lazily in __post_init__ from locations. ~22x per-call speedup on the viktor 35,706-obs benchmark (microbenchmark); pinned raster/points benchmarks and grid/points consistency tests pass unchanged across all four model versions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re-opened after history rewrite
This PR is a re-opened version of an earlier one. Between then and now, I used
git filter-repoto remove some unused benchmark files from the repo's history. The rewrite worked as intended but changed every commit's SHA, which orphaned the previous branch frommasterand made the old PR unmergeable, so I closed it and re-targeted the same work against the rewritten history. The content is unchanged; only the underlying SHAs have shifted.Important
If you have an existing local clone of this repository, delete it and re-clone from scratch before pulling these changes. Your local commits share no history with
originanymore -git pullon top of an old clone will fail or produce a broken state. A freshgit cloneis the safe move.Summary
This PR continues the Vs30 package refactor, addressing prior review feedback, adding two new bundled models and support for custom YAML configurations, and incorporating other improvements.
Addressed review feedback
Several review comments from the previous PR have all been addressed. Most do not warrant a specific call-out.
Two slightly more structural changes are worth flagging:
constants.py. Values that were previously declared in each of the bundled model config files are now defined once in code; the config files are correspondingly leaner.StrEnumrather than their config file name. The four valid names live asFixedModelVersionmembers invs30/constants.py, giving a single canonical source for the supported set.Additional development work
jaehwi_v1p0model (in addition to the existingmodified_foster_2019andviktor_cpt_clustering), along with the gap-filling functionality it requires.foster_2019_approx). Exact reproduction wasn't pursued because the published model has known flaws.Why points and grid are separate pipelines
A natural design question is why the package maintains two separate pipelines rather than collapsing them into one. Both pipelines produce identical results at every location (verified by the correspondence tests above); they are kept separate purely for performance. Each direction of unification has its own performance problem:
vs30 gridcall the points pipeline at each pixel. This loses the bulk array operations that the grid pipeline gets from GDAL: bulk raster reads, bulk sampling, bulk spatial adjustment. At national-grid sizes (millions of pixels) it would be impractically slow.vs30 pointsrun the grid pipeline and sample the desired sites from it. This would compute Vs30 across an entire grid regardless of how many sites are requested, making typical points-mode queries needlessly slow.