@jhamman jhamman commented Aug 11, 2025

This PR implements support for the ZEP 8 URL syntax in Zarr Python.

Some examples of what now works:

import xarray as xr
import zarr

root = zarr.open_group('s3://bucket/data.zip|zip:|zarr3:')  # S3 → ZIP → Zarr v3
arr = zarr.create_array('memory:|zarr2:group/array', shape=(10, ), dtype='i4')  # Memory → Zarr v2

# custom adapter for icechunk
ds = xr.open_zarr('s3://icechunk-public-data/v1/glad|icechunk:')  # icechunk (from xarray)

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.rst
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

closes #2943
fixes #2831
xref: zarr-developers/zeps#48

cc @jbms

- Add comprehensive ZEP 8 URL parsing and resolution system
- Implement StoreAdapter ABC for extensible storage adapters
- Add built-in adapters for file, memory, S3, GCS, HTTPS schemes
- Support pipe-chained URLs like s3://bucket/data.zip|zip:|zarr3:
- Add URLSegment parsing with validation
- Integrate with zarr.open_group and zarr.open_array APIs
- Include demo script and comprehensive test suite
- Pass all existing tests + 35 new ZEP 8-specific tests
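
To make the pipe chaining concrete, here is a minimal sketch of how a chained URL could be split into ordered segments. It is not the parser added in this PR: the `Segment` dataclass and `split_chain` function are hypothetical names, and treating a bare adapter name such as `zip` as shorthand for `zip:` is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One link in a pipe-chained URL (hypothetical, not the PR's URLSegment)."""

    scheme: str  # e.g. "s3", "zip", "zarr3"
    path: str  # text after the scheme separator; may be empty


def split_chain(url: str) -> list[Segment]:
    """Split a URL such as 's3://bucket/data.zip|zip:|zarr3:' into segments."""
    segments = []
    for index, part in enumerate(url.split("|")):
        if not part:
            raise ValueError(f"empty segment in {url!r}")
        scheme, sep, rest = part.partition(":")
        if not sep:
            # Assumption: a bare adapter name like "zip" means "zip:".
            scheme, rest = part, ""
        if not scheme:
            raise ValueError(f"segment without a scheme in {url!r}")
        if index == 0 and rest.startswith("//"):
            # Drop the "//" of the leading "scheme://..." form.
            rest = rest[2:]
        segments.append(Segment(scheme, rest))
    return segments


print(split_chain("s3://bucket/data.zip|zip:|zarr3:"))
```

Each segment is resolved left to right, so the first entry names the root store and later entries name the adapters layered on top of it.
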
@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Aug 11, 2025
@github-actions github-actions bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Aug 24, 2025

jbms commented Sep 11, 2025

One tricky thing about s3+https://endpoint/a/b is that it is ambiguous as to whether it is using "virtual host" syntax (i.e. endpoint refers to a single bucket and the path is "a/b") or "path" syntax (i.e. the bucket is "a" and the path is "b").

The "path" syntax is generally the default when running a regular s3-compatible server, but the "virtual host" syntax can commonly occur when someone defines a CNAME DNS entry to map their own domain or subdomain to an AWS S3 bucket.

When designing this syntax for Neuroglancer, it seemed like it would be annoying to require users to use separate syntax to disambiguate the two cases. Instead, for operations where it matters (namely List), Neuroglancer just automatically determines which of the two cases applies by trying both ways and seeing which one succeeds, and then caching the result so that subsequent list operations don't require two requests.
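
The probe-and-cache idea described above can be sketched roughly as follows. This is not Neuroglancer's code: `interpretations`, `detect_style`, and the `probe` callback are hypothetical names, the cache is keyed by endpoint as an assumption, and the two readings are tried one after the other rather than concurrently.

```python
from typing import Callable, Optional

# probe(bucket, prefix) returns True if a List request succeeds for that reading.
# bucket is None when the endpoint itself is taken to be the bucket.
Probe = Callable[[Optional[str], str], bool]

_style_cache: dict[str, str] = {}


def interpretations(path: str) -> dict[str, tuple[Optional[str], str]]:
    """Return both readings of the path part of s3+https://endpoint/<path>."""
    trimmed = path.strip("/")
    bucket, _, rest = trimmed.partition("/")
    return {
        "virtual-host": (None, trimmed),  # endpoint is the bucket
        "path": (bucket, rest),  # first path component is the bucket
    }


def detect_style(endpoint: str, path: str, probe: Probe) -> str:
    """Pick the addressing style that works and remember it per endpoint."""
    if endpoint in _style_cache:
        return _style_cache[endpoint]
    for style, (bucket, prefix) in interpretations(path).items():
        if probe(bucket, prefix):
            _style_cache[endpoint] = style
            return style
    raise LookupError(f"neither addressing style worked for {endpoint!r}")
```

After the first successful detection, later list operations against the same endpoint reuse the cached style and issue a single request.
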

codecov bot commented Sep 11, 2025

Codecov Report

❌ Patch coverage is 67.57246% with 179 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.87%. Comparing base (ee9c182) to head (35526a5).
⚠️ Report is 42 commits behind head on main.

Files with missing lines Patch % Lines
src/zarr/storage/_builtin_adapters.py 58.36% 97 Missing ⚠️
src/zarr/storage/_zep8.py 83.95% 30 Missing ⚠️
src/zarr/abc/store_adapter.py 39.02% 25 Missing ⚠️
src/zarr/registry.py 60.00% 8 Missing ⚠️
src/zarr/storage/_zip.py 78.94% 8 Missing ⚠️
src/zarr/storage/_common.py 85.00% 3 Missing ⚠️
src/zarr/storage/_register_adapters.py 62.50% 3 Missing ⚠️
src/zarr/abc/__init__.py 0.00% 2 Missing ⚠️
src/zarr/api/asynchronous.py 0.00% 2 Missing ⚠️
src/zarr/storage/__init__.py 0.00% 1 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (ee9c182) and HEAD (35526a5).

HEAD has 4 fewer uploads than BASE: 14 uploads for BASE (ee9c182), 10 for HEAD (35526a5).
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #3369       +/-   ##
===========================================
- Coverage   94.92%   60.87%   -34.06%     
===========================================
  Files          79       86        +7     
  Lines        9500    10231      +731     
===========================================
- Hits         9018     6228     -2790     
- Misses        482     4003     +3521     
Files with missing lines Coverage Δ
src/zarr/storage/_logging.py 61.94% <ø> (-38.06%) ⬇️
src/zarr/storage/__init__.py 9.52% <0.00%> (-85.48%) ⬇️
src/zarr/abc/__init__.py 0.00% <0.00%> (ø)
src/zarr/api/asynchronous.py 71.42% <0.00%> (-19.45%) ⬇️
src/zarr/storage/_common.py 68.57% <85.00%> (-21.96%) ⬇️
src/zarr/storage/_register_adapters.py 62.50% <62.50%> (ø)
src/zarr/registry.py 63.19% <60.00%> (-25.63%) ⬇️
src/zarr/storage/_zip.py 72.10% <78.94%> (-25.50%) ⬇️
src/zarr/abc/store_adapter.py 39.02% <39.02%> (ø)
src/zarr/storage/_zep8.py 83.95% <83.95%> (ø)
... and 1 more

... and 64 files with indirect coverage changes

@jhamman jhamman marked this pull request as ready for review September 12, 2025 03:03
Examples::

# Basic ZIP file storage
zarr.open_array("file:data.zip|zip", mode='w', shape=(10, 10), dtype="f4")

It seems like a bad idea to break file uris.

https://datatracker.ietf.org/doc/html/rfc8089

Contributor

This is considered in the spec doc this PR is building on: zarr-developers/zeps#48

https://github.com/jbms/zeps/blob/92bc64111c7612083560358efdd4450e061f3746/draft/ZEP0008.md?plain=1#L115-L119

And later it says:

  Implementations SHOULD not support `file://relative/path` since that
  is ambiguous with the `file://hostname/path` syntax defined by
  [RFC8089](https://datatracker.ietf.org/doc/html/rfc8089).

If you foresee serious issues here, I'd encourage commenting on that PR on the standard.
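
The ambiguity mentioned in the quoted text can be seen with Python's standard `urllib.parse`, which follows the generic URI syntax. This only illustrates how a generic parser reads the three spellings; it makes no claim about how this PR parses them. The helper name `host_and_path` is made up for the example.

```python
from urllib.parse import urlsplit


def host_and_path(url: str) -> tuple[str, str]:
    """Return the (authority, path) pair a generic URI parser sees."""
    parts = urlsplit(url)
    return parts.netloc, parts.path


for url in ("file:data.zip", "file:///data.zarr", "file://relative/path"):
    host, path = host_and_path(url)
    print(f"{url!r}: host={host!r} path={path!r}")
```

For `file://relative/path` the parser reports `relative` as the host and `/path` as the path, which is why that spelling cannot safely mean "a relative path".
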

@ianhi ianhi left a comment

I'm partway through the code at this point (currently following the logic of store resolution). Posting some comments now so I don't lose them (I've been burned by this before).

**In-memory storage:**

>>> # Create array in memory
>>> z = zarr.open_array("memory:", mode='w', shape=(5, 5), dtype="f4")
Contributor

Can I then access this from somewhere else using this syntax?

e.g. memory:aesr80s9e8ra?

.. warning::
   The :class:`zarr.storage.ObjectStore` class is experimental.

URL-based Storage (ZEP 8)
Contributor

I think what this section is missing is the equivalent zarr-python code, to put it in terms people are more familiar with.

So each section would be:

zarr.open_array("file:zep8-data.zip|zip" ....)

# is equivalent to

zarr.open_array(zarr.storage.ZipStore(...)...)

Comment on lines +223 to +225
.. note::
   When using ZEP 8 URLs with third-party libraries like xarray, the URL syntax allows
   seamless integration without requiring zarr-specific store creation.
Contributor

Suggested change
.. note::
   When using ZEP 8 URLs with third-party libraries like xarray, the URL syntax allows
   seamless integration without requiring zarr-specific store creation.

This is already effectively stated above.

Comment on lines +201 to +204
URL-based Storage (ZEP 8)
~~~~~~~~~~~~~~~~~~~~~~~~~

Zarr supports URL-based storage following the ZEP 8 specification, which allows you to specify storage locations using URLs with chained adapters::
Contributor

Needs a link to the more extensive docs.


- ``file:path.zip|zip`` - ZIP file on local filesystem
- ``s3://bucket/data.zip|zip`` - ZIP file in S3 bucket
- ``memory:`` - In-memory storage
Contributor

Suggested change
- ``memory:`` - In-memory storage

not an example of piping

Comment on lines +14 to +19
from zarr.abc.store_adapter import StoreAdapter
from zarr.storage._local import LocalStore
from zarr.storage._logging import LoggingStore
from zarr.storage._memory import MemoryStore
from zarr.storage._zep8 import URLStoreResolver
from zarr.storage._zip import ZipStore
Contributor

Is it worth considering making these lazy, or is the gain not big enough?
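
For reference, one common way to defer imports is to register each adapter as a `"module:attribute"` string and import it on first use. The sketch below is hypothetical and is not how this PR registers adapters; the standard-library targets stand in for the real store modules so the example runs on its own.

```python
import importlib
from functools import lru_cache

# Adapter name -> "module:attribute". These stdlib targets are stand-ins
# chosen for illustration, not the PR's adapter classes.
_LAZY_TARGETS = {
    "zip": "zipfile:ZipFile",
    "tar": "tarfile:TarFile",
}


@lru_cache(maxsize=None)
def resolve(name: str):
    """Import the module behind an adapter name only when it is first requested."""
    module_name, _, attribute = _LAZY_TARGETS[name].partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


zip_class = resolve("zip")  # the import happens here, not at module load
print(zip_class.__name__)
```

Whether this is worth the extra indirection depends on how expensive the deferred modules are to import, which would need measuring.
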

Comment on lines +218 to +221
>>> is_zep8_url("s3://bucket/data.zarr")
False
>>> is_zep8_url("file:///data.zarr")
False
Contributor

I don't really follow why these return False. It seems like a downgrade in functionality. Also, having looked at the spec a bit, I don't see where this is explicitly disallowed (though it's very possible I misread or misunderstood).

Comment on lines +343 to +348
if not is_zep8_url(url):
    # Check if it's a simple scheme URL that we can handle
    if "://" in url or ((":" in url) and not url.startswith("/")):
        # Parse as a single segment URL - the parser should handle this
        try:
            segments = self.parser.parse(url)
Contributor

Ahh, I think this answers my question above. Since these are valid URLs, I think ideally is_zep8_url would handle these simple cases correctly.
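
One possible shape for a check that accepts both simple scheme URLs and pipe-chained ones is sketched below. `looks_like_url` is a hypothetical helper, not the PR's `is_zep8_url`, and treating a one-letter scheme as a Windows drive letter is an assumption of this sketch.

```python
import re

# RFC 3986 scheme: a letter followed by letters, digits, "+", "-" or ".".
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")


def looks_like_url(value: str) -> bool:
    """Return True for simple scheme URLs and pipe-chained URLs alike."""
    first_segment = value.split("|", 1)[0]
    match = _SCHEME.match(first_segment)
    if match is None:
        return False
    # Assumption: a one-letter "scheme" such as "C:" is a drive letter, not a URL.
    return len(match.group(0)) > 2


for candidate in ("s3://bucket/data.zarr", "file:data.zip|zip", "/tmp/data.zarr"):
    print(candidate, looks_like_url(candidate))
```
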

Comment on lines +349 to +350
except Exception:
    raise ValueError(f"Not a valid URL: {url}") from None
Contributor

What are the expected exceptions we are catching here? As written, this might silently quash an exception that is a real bug.

I think this whole section can be consolidated and simplified a bit, especially as we don't actually do anything differently here than in the else branch that calls the same parse.
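
A narrower version of that handler might look like the sketch below. It assumes the parser signals malformed input with `ValueError`, which would need checking against the actual parser; `parse_url` and the `parse` callback are hypothetical names.

```python
from typing import Callable, List


def parse_url(url: str, parse: Callable[[str], List[str]]) -> List[str]:
    """Parse a URL, translating only the parser's own error type."""
    try:
        return parse(url)
    except ValueError as err:
        # Chain the original error so the root cause stays in the traceback.
        raise ValueError(f"Not a valid URL: {url}") from err
    # Any other exception type (TypeError, AttributeError, ...) propagates
    # unchanged, so a genuine bug in the parser is not hidden.
```

Because the same call serves both branches, a single helper like this could replace the duplicated `parse` invocation.
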

for i, segment in enumerate(segments):
    if segment.adapter in ("zarr2", "zarr3"):
        # Skip zarr format segments - they don't create stores
        # TODO: these should propagate to the open call somehow
Contributor

Calling out this TODO as important before merging.
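
One way the format segments could be carried forward instead of dropped is sketched below: separate them from the store adapters and return the implied format so the caller can pass it on (for example as the `zarr_format` argument of the open call). `split_format` is a hypothetical helper, and rejecting conflicting format segments is an assumption of this sketch.

```python
from typing import List, Optional, Sequence, Tuple

_FORMAT_ADAPTERS = {"zarr2": 2, "zarr3": 3}


def split_format(adapters: Sequence[str]) -> Tuple[List[str], Optional[int]]:
    """Separate store-creating adapters from zarr format markers."""
    store_adapters: List[str] = []
    zarr_format: Optional[int] = None
    for name in adapters:
        fmt = _FORMAT_ADAPTERS.get(name)
        if fmt is None:
            store_adapters.append(name)
        elif zarr_format is not None and zarr_format != fmt:
            raise ValueError("conflicting zarr format segments")
        else:
            zarr_format = fmt
    return store_adapters, zarr_format


print(split_format(["s3", "zip", "zarr3"]))
```
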

Successfully merging this pull request may close these issues.

Support ZEP 8 URL Syntax
Can't conveniently open zip store from path with zarr v3

5 participants