ianhi
Contributor

@ianhi ianhi commented Sep 30, 2025

Previously, pydap or netcdf4, if installed, would claim any remote URL, depending on the order of backend resolution.

@ianhi ianhi changed the title from "fix: be more more caution when claiming a backend can open a URL" to "fix: be more cautious when claiming a backend can open a URL" Sep 30, 2025
Member

@shoyer shoyer left a comment

Looks great, thanks @ianhi !

Comment on lines 213 to 220
    if not (isinstance(filename_or_obj, str) and is_remote_uri(filename_or_obj)):
        return False

    # Check file extension to avoid claiming non-OPeNDAP URLs (e.g., remote Zarr stores)
    _, ext = os.path.splitext(filename_or_obj.rstrip("/"))
    # Pydap handles OPeNDAP endpoints, which typically have no extension or .nc/.nc4
    # Reject URLs with non-OPeNDAP extensions like .zarr
    return ext not in {".zarr", ".zip", ".tar", ".gz"}
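
For readers without the diff open, here is a minimal standalone sketch of the check in the snippet above. The names `looks_like_remote_uri` and `pydap_might_open` are hypothetical; in particular, `looks_like_remote_uri` is only an assumed stand-in for xarray's `is_remote_uri` (here it accepts any `scheme://` prefix), so this illustrates the idea rather than the merged code.

```python
import os
import re

# Extensions treated as "clearly not OPeNDAP", copied from the snippet above.
NON_DAP_EXTENSIONS = {".zarr", ".zip", ".tar", ".gz"}


def looks_like_remote_uri(path: str) -> bool:
    """Hypothetical stand-in for is_remote_uri: True for any 'scheme://' prefix."""
    return re.match(r"^[a-zA-Z][a-zA-Z0-9+.\-]*://", path) is not None


def pydap_might_open(filename_or_obj) -> bool:
    """Return False for non-strings, local paths, and URLs with a non-DAP extension."""
    if not (isinstance(filename_or_obj, str) and looks_like_remote_uri(filename_or_obj)):
        return False
    # Strip a trailing slash so "store.zarr/" still yields the ".zarr" extension.
    _, ext = os.path.splitext(filename_or_obj.rstrip("/"))
    return ext not in NON_DAP_EXTENSIONS
```

With this sketch, a remote Zarr store such as `https://example.com/data/store.zarr/` is rejected, while an extension-less or `.nc` URL is still claimed.
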
Contributor Author

Not 100% sure on this. We could go further and require "dap" to be in the URL

Member

I don't think there's a standard extension for OpenDAP URLs. @Mikejmnez do you know?

Contributor Author

I checked with a co-worker on Slack. He said:

"There's no standard extension for DAP URLs. Explicitly excluding .zarr seems good enough for this disambiguation."

Contributor

@Mikejmnez Mikejmnez Oct 1, 2025

Interesting. Yes, there is no standard extension for opendap urls. OPeNDAP servers produce urls with the filename at the end, but for example NASA does something completely different. Excluding .zarr should be good.

What I am trying to push for is an OPeNDAP "protocol-ization" via the URL scheme: "dap2://<file_url>" vs "dap4://<file_url>". I already added this to the documentation a while back (dap2vdap4). Right now, if an OPeNDAP URL begins with http, it is assumed to be dap2. This is entirely on the client side and not a server thing. But pydap and python-netcdf4 support this, and some NASA subsetting tools do it too. Perhaps this may help separate OPeNDAP URLs from non-OPeNDAP URLs.
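
A minimal sketch of what scheme-based detection could look like, using only the standard library. The function name `dap_protocol_from_scheme` is hypothetical, and treating `dap2`/`dap4` as the only recognized schemes is an assumption based on this comment, not on xarray's code.

```python
from typing import Optional
from urllib.parse import urlsplit

# Client-side schemes described in the comment above (not valid HTTP schemes).
DAP_SCHEMES = {"dap2", "dap4"}


def dap_protocol_from_scheme(url: str) -> Optional[str]:
    """Return "dap2" or "dap4" if the URL scheme names the protocol, else None."""
    # urlsplit already lowercases the scheme; .lower() just makes that explicit.
    scheme = urlsplit(url).scheme.lower()
    return scheme if scheme in DAP_SCHEMES else None
```

A plain `http://` or `https://` URL returns `None` here, which is exactly the ambiguous case the rest of the thread is about.
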

Contributor

Well, actually, THREDDS (TDS) does have a "standard" way to specify the protocol, which may help distinguish OPeNDAP URLs from non-OPeNDAP URLs: a TDS dap2 URL will have dodsC in it, and a TDS dap4 URL will have dap4 in it (see here). However, an organization running an OPeNDAP server may decide how its own URLs are exposed.
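
As a rough illustration of the TDS convention just described, the sketch below looks for a `dodsC` or `dap4` path segment. The function name `tds_dap_hint` is hypothetical, and matching case-insensitively on whole path segments is an assumption made here; as noted above, servers are free to expose URLs differently, so a `None` result proves nothing.

```python
from urllib.parse import urlsplit


def tds_dap_hint(url: str):
    """Return "dap2"/"dap4" when a path segment follows the TDS convention, else None."""
    # Compare whole path segments so "dap4" inside a file name does not match.
    segments = [s.lower() for s in urlsplit(url).path.split("/") if s]
    if "dodsc" in segments:
        return "dap2"
    if "dap4" in segments:
        return "dap4"
    return None
```
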

@shoyer
Member

shoyer commented Oct 1, 2025

Conceptually, I think there is an ambiguity about guess_can_open. Does it mean that a backend possibly or definitively can open a dataset?

Based on how it's used in open_dataset, I think we should pick defaults closer to "definitely" (which is what you do in this PR). It's a better user experience to require an explicit engine than to guess wrong and raise a less informative downstream error.
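
To make the "possibly" vs "definitely" distinction concrete, here is a toy resolution loop. It is not xarray's actual engine-guessing code: `guess_engine_sketch` and both backend tables are hypothetical, and the lambdas are deliberately simplistic. The point is only that the first backend to answer True wins, so an over-eager guess hides every backend after it.

```python
from typing import Callable, Mapping


def guess_engine_sketch(target: str, backends: Mapping[str, Callable[[str], bool]]) -> str:
    """Return the name of the first backend whose guess function claims `target`."""
    for name, guess_can_open in backends.items():
        if guess_can_open(target):
            return name  # first claim wins; later backends are never asked
    raise ValueError(f"no backend claimed {target!r}; pass engine=... explicitly")


# Hypothetical guess functions, for illustration only.
permissive = {
    "netcdf4": lambda url: url.startswith("https://"),  # claims every remote URL
    "zarr": lambda url: url.rstrip("/").endswith(".zarr"),
}
strict = {
    "netcdf4": lambda url: url.startswith("https://")
    and not url.rstrip("/").endswith(".zarr"),
    "zarr": permissive["zarr"],
}
```

With the `permissive` table, `https://example.com/store.zarr` resolves to `netcdf4` and would fail later with an unrelated error; with the `strict` table it resolves to `zarr`, and an unclaimed target raises an error that tells the user to pass an engine explicitly.
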

@ianhi
Contributor Author

ianhi commented Oct 1, 2025

I think there is an ambiguity about guess_can_open. Does it mean that a backend possibly or definitively can open a dataset?

Based on how it's used in open_dataset, I think we should pick defaults closer to "definitely" (which is what you do in this PR). It's a better user experience to require an explicit engine than to guess wrong and raise a less informative downstream error.

I agree. I can make a follow-up PR to the backends page to try to make this more explicit.

For the case of the dap protocol I skimmed the specification https://opendap.github.io/dap4-specification/DAP4.html and didn't see anything in particular that made me feel confident about searching the URL for dap or opendap. Even if those are likely common, I think they are too restrictive on a URL. But this would also be a great question for @dcherian when he is back from vacation.

But for this PR I feel pretty happy with where it is and would defer further dap improvements to later, modulo a small fix to lower case the extension to be slightly more robust that I'll push shortly

@ianhi
Contributor Author

ianhi commented Oct 1, 2025

modulo a small fix to lower case the extension to be slightly more robust that I'll push shortly

Actually, I'm not sure that's reasonable. Other backends do a similar check without thinking about case:

    if isinstance(filename_or_obj, str | os.PathLike):
        _, ext = os.path.splitext(filename_or_obj)
        return ext in {".nc", ".nc4", ".cdf", ".gz"}

Is that more correct?
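
A small sketch of the two behaviours being weighed here. `ext_matches` and its `ignore_case` flag are hypothetical names used for comparison only; the extension set is the one quoted above.

```python
import os

NETCDF_EXTENSIONS = {".nc", ".nc4", ".cdf", ".gz"}


def ext_matches(path: str, *, ignore_case: bool) -> bool:
    """Check the file extension against NETCDF_EXTENSIONS, optionally ignoring case."""
    _, ext = os.path.splitext(path)
    if ignore_case:
        ext = ext.lower()
    return ext in NETCDF_EXTENSIONS
```

The only inputs on which the two modes disagree are ones like `DATA.NC`: the case-sensitive check rejects them, the lowercased check accepts them.
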

@shoyer
Member

shoyer commented Oct 1, 2025

modulo a small fix to lower case the extension to be slightly more robust that I'll push shortly

Actually, I'm not sure that's reasonable. Other backends do a similar check without thinking about case:

    if isinstance(filename_or_obj, str | os.PathLike):
        _, ext = os.path.splitext(filename_or_obj)
        return ext in {".nc", ".nc4", ".cdf", ".gz"}

Is that more correct?

Really hard to say.

Honestly, I think even this list of extensions is pretty generous:

  • Scipy doesn't read netCDF v4 files, so .nc4 seems unlikely
  • Scipy can only read .gz files if they contain a netCDF v3 file. So maybe .nc.gz would be more appropriate than allowing any .gz file.
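
A sketch of what the stricter rule suggested in these two bullets might look like. This is a hypothetical check, not what the scipy backend does today: `scipy_extension_ok` is a made-up name, and accepting only `.nc`/`.cdf` (optionally wrapped in `.gz`) is one possible reading of the bullets.

```python
from pathlib import PurePosixPath

NETCDF3_SUFFIXES = {".nc", ".cdf"}


def scipy_extension_ok(path: str) -> bool:
    """Accept .nc/.cdf, and .gz only when it wraps one of those (e.g. .nc.gz)."""
    suffixes = PurePosixPath(path).suffixes
    if not suffixes:
        return False
    if suffixes[-1] == ".gz":
        # A bare .gz or .tar.gz says nothing about the contents being netCDF3.
        return len(suffixes) >= 2 and suffixes[-2] in NETCDF3_SUFFIXES
    return suffixes[-1] in NETCDF3_SUFFIXES
```

Under this rule `data.nc.gz` is accepted, while `archive.tar.gz`, `data.gz`, and `data.nc4` are not.
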

@ianhi ianhi force-pushed the fix-netcdf4-remote-zarr-detection branch from b06a0a5 to 7ed1f0a Compare October 1, 2025 15:21
@Mikejmnez
Contributor

Mikejmnez commented Oct 1, 2025

For the case of the dap protocol I skimmed the specification https://opendap.github.io/dap4-specification/DAP4.html and didn't see anything in particular that made me feel confident about searching the URL for dap or opendap. Even if those are likely common, I think they are too restrictive on a URL. But this would also be a great question for @dcherian when he is back from vacation.

.das, .dds, .dods (standard dap2 extensions) and .dmr, .dap (standard dap4 extensions) trigger downloads of either metadata or the entire file's binary data (.dods and .dap). URLs should not have those extensions...
These extensions are added by the backend (pydap or python-netcdf4) to create the Python dataset objects. A URL exposed by a server like THREDDS or Hyrax is unlikely to have one of these extensions (it is possible only when a user manually adds it, which should not be entirely ruled out).

The only thing I can think of that would be OPeNDAP-specific, and part of the spec (dap4), appears when using a constraint expression in the URL. These can often be part of the user-provided URL. For example:

    url = "http://test.opendap.org/opendap/dap4/StaggeredGrid.nc4"  # full data
    url_ce1 = "http://test.opendap.org/opendap/dap4/StaggeredGrid.nc4?dap4.ce=/time[0:1:0];/Y[0:1:39];/X[0:1:39];/Eta[0:1:0][0:1:39][0:1:39]"
    url_ce2 = "http://test.opendap.org/opendap/dap4/StaggeredGrid.nc4?dap4.ce=/time=[0:1:0];/Y=[0:1:39];/X=[0:1:39];/time;/X;/Y;/Eta"

url_ce1 and url_ce2 make use of the two ways to subset via the URL, and as such dap4.ce= must appear in the query parameter. That is an exclusively OPeNDAP feature that is implemented by any dap4 server.
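
A minimal sketch of detecting the constraint expression described above with the standard library. `has_dap4_constraint` is a hypothetical helper; it only answers "is there a `dap4.ce` query key", so a False result does not mean the URL is not OPeNDAP.

```python
from urllib.parse import parse_qs, urlsplit


def has_dap4_constraint(url: str) -> bool:
    """True if the URL's query string carries a dap4.ce constraint expression."""
    # keep_blank_values keeps the key even if the expression itself is empty.
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return "dap4.ce" in query
```

Both example URLs with a constraint expression return True; the plain "full data" URL returns False.
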

@ianhi
Contributor Author

ianhi commented Oct 2, 2025

dap4.ce= must appear in the query parameter. That is an exclusively OPeNDAP feature that is implemented by any dap4 server.

These can often be part of the user-provided URL.

Since it is "often" but not "always", would you recommend conditioning a True guess on these? It would be a significant increase in strictness over what is there now. The current changes are a medium increase. I'm not a DAP user, so I don't have a strong opinion.

@shoyer
Member

shoyer commented Oct 3, 2025 via email

@Mikejmnez
Contributor

Mikejmnez commented Oct 3, 2025

If the URL has either dap2 or dap4 as its scheme, it will 100% be an OPeNDAP URL. For example, dap4://opendap.earthdata.nasa.gov/collections/... and dap2://opendap.earthdata.... Both pydap and netcdf4-python understand this (as do other non-Python tools). In NASA land, there is a push for all tutorials to have dap4 URLs, for example. (These are not valid HTTP schemes, but a client-side parseable way to specify the DAP protocol that somebody came up with some time ago.)

@ianhi
Contributor Author

ianhi commented Oct 3, 2025

Both pydap and netcdf4-python understand this

So the NETCDF4 backend should also say yes to urls with dap in them?

I tried out your URL with both the pydap and netcdf4 backends, but ran into an error for both, unfortunately.

backend=netcdf4

OSError: [Errno -70] NetCDF: DAP server error: 'http://test.opendap.org/opendap/dap4/StaggeredGrid.nc4'

backend=pydap

seems to work but then fails with

RetryError: HTTPConnectionPool(host='test.opendap.org', port=80): Max retries exceeded with url: /opendap/dap4/StaggeredGrid.nc4.dds (Caused by ResponseError('too many 503 error responses'))

So I wasn't able to manually verify.

@shoyer
Member

shoyer commented Oct 3, 2025

So the NETCDF4 backend should also say yes to urls with dap in them?

I'll defer to @Mikejmnez, but my impression is that we should definitely be preferring pydap to netCDF4 for DAP. I think DAP support is optional in netCDF-C.

So I would lean towards not claiming DAP urls in the netcdf4 backend, or maybe just being sure we try pydap before netcdf4.

@Mikejmnez
Contributor

So the NETCDF4 backend should also say yes to urls with dap in them?

I tried out your URL with both the pydap and netcdf4 backends, but ran into an error for both, unfortunately.

It looks like the test server was down overnight, and got restarted this morning. I tried it and it worked for me.

  • With pydap, all http, dap2, and dap4 work. http | https defaults to dap2.
  • With netcdf4 only http and dap4 work. dap2 does not, but http | https defaults to dap2.

My preference would be for any URL with a dap2 | dap4 scheme to be automatically assigned to pydap.
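
The scheme support reported in the two bullets above can be written down as a small lookup. This is a sketch only: the table restates what this comment reports (it is not independently verified here), and `dap_engine_candidates` and the pydap-first ordering are hypothetical, chosen to match the stated preference.

```python
# Scheme support as reported in this thread; treat it as an assumption.
SUPPORTED_SCHEMES = {
    "pydap": {"http", "https", "dap2", "dap4"},
    "netcdf4": {"http", "https", "dap4"},
}


def dap_engine_candidates(scheme: str, installed: set[str]) -> list[str]:
    """List installed engines that accept `scheme`, with pydap tried first."""
    order = ["pydap", "netcdf4"]
    return [
        engine
        for engine in order
        if engine in installed and scheme.lower() in SUPPORTED_SCHEMES[engine]
    ]
```

In an environment with only netcdf4 installed, a `dap4` URL still has a candidate, while a `dap2` URL has none.
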

@dopplershift
Contributor

dopplershift commented Oct 3, 2025

@shoyer While it's possible to build netcdf-c without DAP support, it's almost always built with it on, and it is frequently used. For instance, the netcdf4-python wheels are built with DAP support turned on in netCDF-C, as are the packages on conda-forge. So, PLEASE leave netcdf4-python able to read DAP by default.

EDIT: My default environment has neither h5netcdf nor pydap, just netcdf4. I'd really like for that environment to not suddenly start breaking my existing code/examples.

@Mikejmnez
Contributor

And I am up for keeping the defaults as is, and definitely NOT break people's workflows. I think the issue at hand is how to identify dap urls.

@ianhi
Contributor Author

ianhi commented Oct 3, 2025

So I would lean towards not claiming DAP urls in the netcdf4 backend, or maybe just being sure we try pydap before netcdf4.

I had a thought the other day when working on a custom zarr backend: it would be nice to have a robust, cross-file-type system for expressing user preference for backend resolution order. It currently seems to be alphabetical, but with a special case that reorders the 3 built-in netCDF backends. So the default guessing order will be:

  1. h5netcdf (from netcdf_engine_order[0])
  2. scipy (from netcdf_engine_order[1])
  3. netcdf4 (from netcdf_engine_order[2])
  4. pydap (alphabetically after netcdf4)
  5. store (alphabetically)
  6. zarr (alphabetically)

from the sorting here:

    for be_name in OPTIONS["netcdf_engine_order"]:
        if backend_entrypoints.get(be_name) is not None:
            ordered_backends_entrypoints[be_name] = backend_entrypoints.pop(be_name)
    ordered_backends_entrypoints.update(
        {name: backend_entrypoints[name] for name in sorted(backend_entrypoints)}
    )
    return ordered_backends_entrypoints

So if I add a zarr backend for a subtype of zarr, it will only come first if its name happens to sort alphabetically before zarr; a backend named something like zz-zarr would never be used.
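
The ordering rule in the snippet above can be tried out in isolation. This is a standalone sketch, not xarray's function: `sort_backends_sketch` is a hypothetical name, `NETCDF_ENGINE_ORDER` is assumed to match the order listed in this comment, and `a_zarr` / `zarr_custom` are made-up backend names.

```python
# Assumed to mirror OPTIONS["netcdf_engine_order"] as listed in the comment above.
NETCDF_ENGINE_ORDER = ["h5netcdf", "scipy", "netcdf4"]


def sort_backends_sketch(backend_entrypoints: dict) -> dict:
    """Put the netCDF engines first in the configured order, then the rest by name."""
    remaining = dict(backend_entrypoints)  # copy so the caller's dict is untouched
    ordered = {}
    for name in NETCDF_ENGINE_ORDER:
        if remaining.get(name) is not None:
            ordered[name] = remaining.pop(name)
    ordered.update({name: remaining[name] for name in sorted(remaining)})
    return ordered


# Hypothetical set of installed backends; the values are placeholders.
installed = {
    name: object()
    for name in ["zarr", "pydap", "netcdf4", "zarr_custom", "store", "scipy", "h5netcdf", "a_zarr"]
}
resolution_order = list(sort_backends_sketch(installed))
```

Here `a_zarr` is consulted before `zarr` and `zarr_custom` after it, purely because of how the names sort, which is the fragility being pointed out.
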

So, PLEASE leave netcdf4-python able to read DAP by default.

I will make sure to not remove the current situation where netcdf4 can report as being able to a.

This is just about the automatic engine resolution, not changing anything for an explicit engine=... but that definitely could break things

@ianhi
Contributor Author

ianhi commented Oct 3, 2025

This is also related to this issue, I guess: #10810 (comment)

@ianhi
Contributor Author

ianhi commented Oct 8, 2025

OK. Following the meeting today I am now happy with the state of this PR. It improves on the current state by having engines not claim things they cannot read. However, it does not change the resolution order in the case that a backend can read something. This means we are still left with the ambiguity of what a URL ending in .nc means. However, this is an existing ambiguity, not one introduced here.

I have also added a section to the io docs about the resolution order that briefly discusses the .nc ambiguity.

@ianhi ianhi force-pushed the fix-netcdf4-remote-zarr-detection branch from 66c9e1a to 079b290 Compare October 8, 2025 19:05
Comment on lines 7309 to 7310
("DAP4://example.com/dataset", "pydap"), # uppercase scheme
("https://example.com/services/DAP2/dataset", "pydap"), # uppercase in path
Contributor

Following #10804 (comment) can we check that netCDF4 gets chosen for these if pydap is not installed; you can use has_pydap from tests/__init__.py for the check and remove the requires_pydap decorator

Contributor

@dcherian dcherian left a comment

LGTM. Just one minor test request

@dcherian dcherian added the plan to merge Final call for comments label Oct 15, 2025
@ianhi
Contributor Author

ianhi commented Oct 15, 2025

The remaining failures are not caused by the content of this PR.

Docs build: failing due to an intersphinx issue.

Flaky tests: not caused by this PR, but related. The URL http://test.opendap.org/opendap/dap4/unaligned_simple_datatree.nc.h5.dmr is returning a 404. I think the test server is down? Or possibly the content of the server was modified. Looking at http://test.opendap.org/opendap/dap4/contents.html it seems that the file we were looking for is not present. @Mikejmnez maybe knows more here?

@dcherian
Contributor

Ok let's try this out!

@dcherian dcherian merged commit b5e4b0e into pydata:main Oct 16, 2025
44 of 47 checks passed
@ianhi ianhi deleted the fix-netcdf4-remote-zarr-detection branch October 16, 2025 15:35

Successfully merging this pull request may close these issues.

netcdf4 backend claims **all** remote files - preventing reading zarr

5 participants