Add cache versioning and dataset-local pre-cache lookup#230

Draft
cbyrohl wants to merge 5 commits into main from precache-lookup

cbyrohl commented Feb 27, 2026

Summary

  • Introduces CACHE_FORMAT_VERSION = 1 constant that is stamped into every new cache file (HDF5 attribute + JSON field), enabling automatic invalidation when the format changes in future releases
  • Adds find_precached_file() to search ancestor directories for admin-pre-placed cache files at {ancestor}/postprocessing/scida/{hash}.{ext}, eliminating the need for every user to independently rebuild caches for shared datasets
  • Pre-cache is strictly read-only: scida never writes to or deletes pre-cache files
  • Precedence: user cache > dataset-local pre-cache > create from scratch
  • Invalid pre-caches (version mismatch, corrupt) fall through gracefully to user cache creation
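The lookup and precedence rules above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names `find_precached_file_sketch` and `resolve_cache_sketch` and their parameters are hypothetical, and whether the search starts at the dataset directory itself or only at its parent is an assumption. Only the `{ancestor}/postprocessing/scida/{hash}.{ext}` layout and the precedence order come from the description.

```python
from pathlib import Path
from typing import Optional

# Layout taken from the PR description: {ancestor}/postprocessing/scida/{hash}.{ext}
PRECACHE_SUBDIR = Path("postprocessing") / "scida"


def find_precached_file_sketch(dataset_path, cache_hash, ext) -> Optional[Path]:
    """Return the nearest pre-cache file found while walking up from dataset_path."""
    start = Path(dataset_path).resolve()
    # Assumption: check the dataset directory first, then every ancestor up to the root.
    for folder in (start, *start.parents):
        candidate = folder / PRECACHE_SUBDIR / f"{cache_hash}.{ext}"
        if candidate.is_file():
            return candidate
    return None


def resolve_cache_sketch(user_cache_file, dataset_path, cache_hash, ext):
    """Apply the precedence: user cache > dataset-local pre-cache > create."""
    user_cache_file = Path(user_cache_file)
    if user_cache_file.is_file():
        return "user", user_cache_file
    precached = find_precached_file_sketch(dataset_path, cache_hash, ext)
    if precached is not None:
        # Pre-cache files are only ever read, never written to or deleted.
        return "precache", precached
    # Nothing usable found: the caller builds a fresh cache in the user location.
    return "create", user_cache_file
```

A caller that finds an invalid pre-cache (wrong version or unreadable file) would treat it like the `"create"` case and write only to the user cache location.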

Files changed

  • src/scida/misc.py — CACHE_FORMAT_VERSION, find_precached_file()
  • src/scida/io/_base.py — Version write/validate in ChunkedHDF5Loader, pre-cache fallback in load() and load_metadata()
  • src/scida/series.py — JSON version write/validate in DatasetSeries, pre-cache fallback in __init__(), safe deletion logic (never deletes pre-cache)
  • tests/test_precache.py — 14 new tests covering versioning and pre-cache for HDF5 and JSON paths
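For the JSON path, version stamping and validation might look like the sketch below. The helper names and the `cache_format_version` field name are assumptions made for illustration; the PR only states that a version is written as a JSON field and that mismatched or corrupt caches are rejected.

```python
import json
from pathlib import Path

CACHE_FORMAT_VERSION = 1  # value stated in the PR summary


def write_json_cache_sketch(path, payload):
    """Write payload as JSON, stamped with the current cache format version."""
    record = dict(payload)
    # Assumed field name; the actual key used by scida may differ.
    record["cache_format_version"] = CACHE_FORMAT_VERSION
    Path(path).write_text(json.dumps(record))


def load_json_cache_sketch(path):
    """Return the cached dict, or None if it is missing, corrupt, or outdated."""
    try:
        record = json.loads(Path(path).read_text())
    except (OSError, ValueError):
        # Missing or unreadable file, or invalid JSON: treat as no cache.
        return None
    if not isinstance(record, dict):
        return None
    if record.get("cache_format_version") != CACHE_FORMAT_VERSION:
        # Unversioned or mismatched caches are ignored so they get rebuilt.
        return None
    return record
```

Returning `None` instead of raising lets the caller fall through to the next cache source without special-casing each failure mode.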

Closes #195

Test plan

  • 14 new unit/integration tests in tests/test_precache.py
  • 160 non-external tests pass (no regressions)
  • 345 external tests pass with real simulation data
  • Verify admin-placed pre-cache works on a shared filesystem (manual)

🤖 Generated with Claude Code

cbyrohl and others added 5 commits February 27, 2026 13:21
Introduces CACHE_FORMAT_VERSION to stamp every cache file (HDF5 and JSON),
enabling automatic invalidation when the format changes. Adds
find_precached_file() to search parent directories for admin-pre-placed
cache files at {ancestor}/postprocessing/scida/{hash}.{ext}, with strict
read-only semantics (pre-cache is never written to or deleted).

Closes #195

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add scida.devtools package with:
- _cache_deploy.py: deploy_precache() and deploy_series_precache() functions
  for copying local cache files to a shared basefolder, plus MPCDF_TARGETS list
- __main__.py: typer CLI (python -m scida.devtools) with build, deploy,
  build-deploy, and *-all commands for batch cache operations
- __init__.py: re-exports public API for backwards-compatible imports

Tests: 9 new tests covering deploy functions (single/multi path, error cases,
overwrite, round-trip loadability for both HDF5 and series JSON caches).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
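As a rough picture of what deploying a cache involves, the sketch below copies one local cache file into the pre-cache location under a shared basefolder. The function name, signature, and error behaviour are hypothetical and are not the actual `deploy_precache()` API; only the `postprocessing/scida` target layout is taken from the PR.

```python
import shutil
from pathlib import Path

# Target layout from the PR description: {basefolder}/postprocessing/scida/{hash}.{ext}
PRECACHE_SUBDIR = Path("postprocessing") / "scida"


def deploy_cache_file_sketch(local_cache_file, basefolder, overwrite=False):
    """Copy a local cache file into the pre-cache folder below basefolder."""
    src = Path(local_cache_file)
    if not src.is_file():
        raise FileNotFoundError(f"no local cache file at {src}")
    target_dir = Path(basefolder) / PRECACHE_SUBDIR
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / src.name  # keeps the {hash}.{ext} file name
    if target.exists() and not overwrite:
        # Assumed policy: refuse to replace an existing pre-cache by default.
        raise FileExistsError(f"{target} already exists; pass overwrite=True")
    shutil.copy2(src, target)
    return target
```

Keeping the file name unchanged matters because the lookup side locates the pre-cache by the same hash and extension.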

lazy=False on DatasetSeries doesn't force dataset init — delay_init
is always applied. Instead, explicitly access .data on each dataset
to trigger initialization and cache creation.

Verified with real data (TNGvariation_simulation): build creates
5 HDF5 caches + 1 series JSON, deploy copies all to basefolder.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

delay_init was always applied regardless of the lazy flag, so
lazy=False was a no-op. Now when lazy=False, evaluate_lazy() is
called on each dataset after construction, triggering full init
and cache creation.

This also simplifies the CLI back to using scida.load(lazy=False)
instead of the manual per-dataset workaround.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
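The behaviour change in this commit can be illustrated with a toy model. The classes below are hypothetical stand-ins, not scida's `DatasetSeries` or dataset classes; only the `lazy` flag and the `evaluate_lazy()` name come from the commit message.

```python
class LazyDatasetSketch:
    """Toy dataset whose initialization is always deferred at construction."""

    def __init__(self, path):
        self.path = path
        self.initialized = False

    def evaluate_lazy(self):
        # Real code would read metadata and create the cache at this point.
        if not self.initialized:
            self.initialized = True
        return self


class SeriesSketch:
    """Toy series: construction is lazy, lazy=False forces evaluation afterwards."""

    def __init__(self, paths, lazy=True):
        self.datasets = [LazyDatasetSketch(p) for p in paths]
        if not lazy:
            # Previously a no-op; now each dataset is fully initialized.
            for ds in self.datasets:
                ds.evaluate_lazy()
```

With `lazy=True` no dataset is initialized until it is first used, while `lazy=False` initializes all of them up front, which is what a batch cache build needs.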

Commands are now scoped under `python -m scida.devtools cache`:
  cache build, cache deploy, cache build-deploy,
  cache build-all, cache deploy-all, cache build-deploy-all

This makes the naming clearer (operations are on caches) and
leaves room for future devtools subcommand groups.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
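The nested command layout can be sketched with the standard library's `argparse` in place of typer, which the actual CLI uses. The parser below only mirrors the command names listed in the commit message; the function name is hypothetical and the real commands' arguments are omitted because they are not described here.

```python
import argparse

# Command names as listed in the commit message.
CACHE_COMMANDS = [
    "build",
    "deploy",
    "build-deploy",
    "build-all",
    "deploy-all",
    "build-deploy-all",
]


def build_parser_sketch():
    """Build a two-level parser: a 'cache' group containing the cache commands."""
    parser = argparse.ArgumentParser(prog="python -m scida.devtools")
    groups = parser.add_subparsers(dest="group", required=True)
    # Further groups could later be registered next to 'cache'.
    cache = groups.add_parser("cache", help="cache operations")
    commands = cache.add_subparsers(dest="command", required=True)
    for name in CACHE_COMMANDS:
        commands.add_parser(name)
    return parser
```

Parsing `["cache", "build-deploy"]` with this parser yields `group="cache"` and `command="build-deploy"`, and a new top-level group would not collide with the existing cache commands.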

Development

Successfully merging this pull request may close these issues.

Scida example notebook slow due to slow TNG-Cluster simulation caching
