# EOPF GeoZarr

GeoZarr-compliant data model for EOPF (Earth Observation Processing Framework) datasets.

Turn EOPF datasets into a GeoZarr-style Zarr v3 store while:
- Preserving the native CRS (no forced TMS reprojection)
- Adding CF- and GeoZarr-compliant metadata
- Building /2 multiscale overviews
- Writing band data with retries and validation

## Overview

This library converts EOPF datatrees into Zarr v3 stores aligned with GeoZarr spec 0.4, without forcing Web Mercator-style tiling. It focuses on scientific fidelity (native CRS), robust metadata (CF + GeoZarr), and operational resilience (retries and completeness auditing), while supporting /2 multiscale overviews.

For Argo / batch orchestration workflows, see https://github.com/EOPF-Explorer/data-model-pipeline.

## Key Features

- **GeoZarr Specification Compliance** (0.4 features implemented)
- **Native CRS Preservation** (UTM, polar, arbitrary projections)
- **Multiscale /2 Overviews** (COG-style hierarchy as child groups)
- **CF Conventions** (`standard_name`, `grid_mapping`, `_ARRAY_DIMENSIONS`)
- **Resilient Writing** (band-by-band with retries & auditing)
- **S3 & S3-Compatible Support** (AWS, OVH, MinIO, custom endpoints)
- **Optional Parallel Processing** (local Dask cluster)
- **Automatic Chunk Alignment** (prevents overlapping Dask/Zarr chunks)
- **HTML Summary & Validation Tools**
- **STAC & Benchmark Commands**
- **Consolidated Metadata** (faster open)

## GeoZarr Compliance Features

- `_ARRAY_DIMENSIONS` attributes on all arrays
- CF grid mapping variables with `GeoTransform`
- Per-variable `grid_mapping` references
- Multiscales metadata structure on parent groups
- Native CRS tile matrix logic (no forced EPSG:3857)
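
As a rough illustration of how these features surface in a converted store, a band array and its grid-mapping variable carry attributes along the following lines. This is a sketch only: the attribute names come from the bullets above and common CF/rioxarray conventions, while the values are hypothetical, not copied from a real product.

```python
# Illustrative only: hypothetical attributes on a band array and its
# grid-mapping ("spatial_ref") variable in a converted store.
band_attrs = {
    "_ARRAY_DIMENSIONS": ["y", "x"],       # explicit dimension names on every array
    "grid_mapping": "spatial_ref",         # per-variable reference to the CRS variable
    "standard_name": "toa_bidirectional_reflectance",  # CF standard name (hypothetical)
}

spatial_ref_attrs = {
    "crs_wkt": "PROJCS[...]",              # native CRS as WKT (truncated placeholder)
    "GeoTransform": "600000.0 10.0 0.0 5300040.0 0.0 -10.0",  # illustrative GDAL-style affine
}
```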

## Installation

Stable:
```bash
pip install eopf-geozarr
```

Development (uv):
```bash
uv sync --frozen
uv run eopf-geozarr --help
```

Editable install (pip):
```bash
pip install -e '.[dev]'
```

## Quick Start (CLI)

Convert local → local:
```bash
eopf-geozarr convert input.zarr output_geozarr.zarr --groups /measurements/r10m /measurements/r20m
```

Remote → local:
```bash
eopf-geozarr convert \
  "https://.../S2B_MSIL2A_... .zarr" \
  "/tmp/S2B_MSIL2A_..._geozarr.zarr" \
  --groups /measurements/reflectance --verbose
```

Notes:
- Parent groups auto-expand to leaf datasets.
- Overviews use /2 coarsening; multiscales metadata lives on the parent groups.
- Defaults: Blosc Zstd level 3 compression, conservative chunking, and metadata consolidation after write.

Info summary, HTML report, and validation:
```bash
eopf-geozarr info /tmp/..._geozarr.zarr --html report.html
eopf-geozarr validate /tmp/..._geozarr.zarr
```
`validate` counts only real data variables, skipping `spatial_ref`/`crs`.

## S3 Support

Environment variables:
```bash
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=eu-west-1
export AWS_ENDPOINT_URL=https://s3.your-endpoint.example  # optional: custom endpoint (OVH, MinIO, etc.)
```

Write directly to S3:
```bash
eopf-geozarr convert input.zarr s3://my-bucket/path/output_geozarr.zarr --groups /measurements/r10m
```

S3 handling includes:
- Credential validation before writing
- Custom endpoints (OVH, MinIO, etc.)
- Retry logic around object writes
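
The same flow from Python, as a minimal sketch using the documented `create_geozarr_dataset` entry point: the bucket, key, and group below are placeholders, and credentials are assumed to come from the environment variables shown above.

```python
import os

import xarray as xr
from eopf_geozarr import create_geozarr_dataset

# AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY (and optionally AWS_ENDPOINT_URL)
# are read from the environment; the region default here is only a fallback.
os.environ.setdefault("AWS_DEFAULT_REGION", "eu-west-1")

dt = xr.open_datatree("input.zarr", engine="zarr")
create_geozarr_dataset(
    dt_input=dt,
    groups=["/measurements/r10m"],                          # placeholder group
    output_path="s3://my-bucket/path/output_geozarr.zarr",  # placeholder bucket/key
    spatial_chunk=4096,
)
```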

## Parallel Processing with Dask

```bash
eopf-geozarr convert input.zarr out.zarr --dask-cluster --verbose
```

Benefits:
- A local Dask cluster is started and cleaned up automatically
- Chunks are aligned to prevent overlapping writes
- Memory is distributed across workers for large scenes
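
The `--dask-cluster` flag manages the cluster lifecycle for you; if you prefer to drive it from Python, a minimal sketch (assuming `dask[distributed]` is installed, with purely illustrative worker settings) looks like this:

```python
import xarray as xr
from dask.distributed import Client, LocalCluster

from eopf_geozarr import create_geozarr_dataset

# Small local cluster; worker counts and memory limits are illustrative.
cluster = LocalCluster(n_workers=4, threads_per_worker=2, memory_limit="4GB")
client = Client(cluster)
try:
    dt = xr.open_datatree("input.zarr", engine="zarr", chunks={})  # lazy, Dask-backed arrays
    create_geozarr_dataset(
        dt_input=dt,
        groups=["/measurements/r10m"],
        output_path="out_geozarr.zarr",
        spatial_chunk=4096,
    )
finally:
    client.close()
    cluster.close()
```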

## Python API

High-level dataset conversion:
```python
import xarray as xr
from eopf_geozarr import create_geozarr_dataset

dt = xr.open_datatree("path/to/eopf.zarr", engine="zarr")
out = create_geozarr_dataset(
    dt_input=dt,
    groups=["/measurements/r10m", "/measurements/r20m"],
    output_path="/tmp/out_geozarr.zarr",
    spatial_chunk=4096,
    min_dimension=256,
    tile_width=256,
)
```

Selective writer usage (advanced):
```python
from eopf_geozarr.conversion.geozarr import GeoZarrWriter

writer = GeoZarrWriter(output_path="/tmp/out.zarr", spatial_chunk=4096)
# writer.write_group(...)
```

## API Reference

`create_geozarr_dataset(dt_input, groups, output_path, spatial_chunk=4096, ...) -> xr.DataTree`
: Produce a GeoZarr-compliant hierarchy.

`setup_datatree_metadata_geozarr_spec_compliant(dt, groups) -> dict[str, xr.Dataset]`
: Apply CF + GeoZarr metadata to the selected groups.

`downsample_2d_array(source_data, target_h, target_w) -> np.ndarray`
: Block-averaging primitive used to build /2 overviews.

`calculate_aligned_chunk_size(dimension_size, target_chunk_size) -> int`
: Return a chunk size that evenly divides the dimension, avoiding overlapping chunks.
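
A short usage sketch of the two low-level helpers, based purely on the signatures listed above; the import path is assumed to mirror `create_geozarr_dataset` (top-level package) and may differ in the actual package layout.

```python
import numpy as np

from eopf_geozarr import calculate_aligned_chunk_size, downsample_2d_array  # assumed import path

# Find a chunk size near the 4096 target that evenly divides a 10980-pixel
# dimension, so Dask and Zarr chunks never overlap.
chunk = calculate_aligned_chunk_size(10980, 4096)
print(chunk)  # expected: a divisor of 10980 close to the target

# Produce one /2 overview level by block-averaging a full-resolution band.
band = np.random.rand(10980, 10980).astype("float32")
overview = downsample_2d_array(band, 10980 // 2, 10980 // 2)
print(overview.shape)  # (5490, 5490)
```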

## Architecture

```
eopf_geozarr/
  commands/    # CLI subcommands (convert, validate, info, stac, benchmark)
  conversion/  # Core GeoZarr pipeline, helpers, multiscales, encodings
  metrics.py   # Lightweight metrics hooks (optional)
```

## Contributing to the GeoZarr Specification

Upstream specification discussions have influenced this work in several areas:
- Preservation of arbitrary CRS
- Chunking strategies and performance
- Clarity of the multiscale hierarchy

## Benchmark & STAC Commands

Benchmark:
```bash
eopf-geozarr benchmark /tmp/out_geozarr.zarr --samples 8 --window 1024 1024
```

Generate draft STAC artifacts:
```bash
eopf-geozarr stac /tmp/out_geozarr.zarr /tmp/collection.json \
  --bbox "minx miny maxx maxy" --start 2025-01-01T00:00:00Z --end 2025-01-31T23:59:59Z
```

## What Gets Written

- `_ARRAY_DIMENSIONS` per variable, with a deterministic axis order
- Per-variable `grid_mapping` referencing a `spatial_ref` variable that holds the CRS/georeferencing
- Multiscales metadata on parent groups, with /2 overview levels
- Blosc Zstd compression and conservative chunking
- A consolidated metadata index
- Band attributes (including `grid_mapping`) propagated across overview levels
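
To spot-check these properties after a conversion, inspect the raw Zarr store (xarray consumes `_ARRAY_DIMENSIONS` when opening, so it is easiest to look at the store directly). The group and level paths below are hypothetical and depend on which `--groups` were converted.

```python
import zarr

store = zarr.open_group("/tmp/out_geozarr.zarr", mode="r")

# Hypothetical paths: a converted group and one of its resolution levels.
parent = store["measurements/r10m"]
print(dict(parent.attrs).get("multiscales"))  # multiscales metadata on the parent group

level = store["measurements/r10m/0"]
for name, arr in level.arrays():
    attrs = dict(arr.attrs)
    print(name, attrs.get("_ARRAY_DIMENSIONS"), attrs.get("grid_mapping"))
```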

## Consolidated Metadata

Consolidated metadata speeds up opening the store. It is not yet part of the core Zarr v3 specification, and the data remains valid without it; disable consolidation during writes (or remove the index afterwards) if you want a strictly minimal store.

## Troubleshooting

| Symptom | Cause | Fix |
|---------|-------|-----|
| Parent group appears empty | Only leaf groups hold arrays | Pass leaf groups to `--groups`, or rely on auto-expansion |
| Overlapping chunk error | Dask chunks misaligned with encoding chunks | Let automatic chunk alignment run, or reduce `spatial_chunk` |
| S3 authentication failure | Missing credentials or endpoint | Export the `AWS_*` variables and, for custom endpoints, `AWS_ENDPOINT_URL` |
| HTML report path is a directory | A directory was given instead of a file | A default filename is created inside the directory |

## Development & Contributing

```bash
git clone <repo-url>
cd eopf-geozarr
pip install -e '.[dev]'
pre-commit install
pytest
```

Quality tooling: Black, isort, Ruff, Mypy, Pytest, and Coverage.

## License & Acknowledgments

Licensed under Apache 2.0. Built on top of xarray, Zarr, and Dask, and tracking the evolving GeoZarr specification.

---
For questions or issues, please open a GitHub issue.