Add support for NETCDF4_CLASSIC to h5netcdf engine #10686
…th h5netcdf engine
…ys with NETCDF4_CLASSIC
…cdf4 and h5netcdf
Xarray currently doesn't have any logic to build these metadata attributes. Currently this is all handled in h5netcdf. We should also make sure that trying to use NetCDF4-only features (e.g., groups) results in an error.
The last commit uses h5dump to display differences between the expected and actual content of the HDF5 file. I was also able to add a I can try to raise an error if groups are used. The remaining differences are related to the SUPERBLOCK version , the
If you really want to get metadata attributes and precise HDF5 types right, that should all be handled in h5netcdf. I think that's also the right place for h5dump tests. In Xarray, all we should be doing for NETCDF4_CLASSIC is coercing some dtypes (using Xarray's encoders) to NetCDF3 compatible types.
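As a rough illustration of the dtype coercion described here, the sketch below narrows a few NetCDF4-only dtypes to NetCDF3-compatible ones with NumPy and refuses lossy casts. The helper name `coerce_to_classic` and the mapping table are assumptions made for this example; they are not copied from Xarray's encoders.

```python
import numpy as np

# Hypothetical mapping for illustration: dtypes the classic data model lacks,
# paired with a classic-compatible replacement.
_CLASSIC_COERCIONS = {
    "int64": "int32",
    "uint16": "int16",
    "uint8": "int8",
    "bool": "int8",
}


def coerce_to_classic(arr):
    """Cast ``arr`` to a classic-compatible dtype, failing on lossy casts."""
    arr = np.asarray(arr)
    target = _CLASSIC_COERCIONS.get(arr.dtype.name)
    if target is None:
        return arr  # no coercion listed for this dtype (e.g. int32, float64)
    cast = arr.astype(target)
    # Round-trip check: the narrowed values must equal the originals.
    if not (cast == arr).all():
        raise ValueError(f"could not safely cast {arr.dtype} to {target}")
    return cast
```

Values that fit the narrower type pass through unchanged in value; anything that would wrap or truncate raises instead of being written silently.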
@huard Thanks for pushing this! For the superblock issue, please add kwarg For the
@kmuehlbauer Thanks for the references, this is really helpful! I'll remove the low-level stuff from this branch (
…o `netCDF4_.get_datatype` skips required conversions. Remove global attribute from create_test_data because it impacts other tests in other files.
This is ready for review. While I added a test doing a roundtrip between netCDF4 and h5netcdf CLASSIC format, it does not check how files are actually written inside the HDF5 file, just that they can be written and read consistently by xarray. Non-standard reading rules can hide non-standard writing rules. I'm planning to add "binary compatibility" tests within
xarray/backends/h5netcdf_.py
Outdated
    if isinstance(value, bytes):
        value = np.bytes_(value)
Why this special logic only for converting bytes? This seems unrelated to what we need for NETCDF4_CLASSIC.
To make sure strings are written as `NC_CHAR`, and not `NC_STRING`. See https://engee.com/helpcenter/stable/en/julia/NetCDF/strings.html
This is in fact the detail that our third party software in C++ choked on. The netCDF C library has both `nc_get_att_text` and `nc_get_att_string` functions. Calling `nc_get_att_text` on an `NC_STRING` raises an error.
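To illustrate the distinction at the NumPy level: wrapping a Python `bytes` value in `np.bytes_` yields a scalar with a fixed-width `S` dtype, which is what the cast under review relies on to get a fixed-length (`NC_CHAR`-style) attribute. The helper name `as_fixed_width` is hypothetical, and how a given h5netcdf/h5py version maps each Python type to an HDF5 string type is not shown here.

```python
import numpy as np


def as_fixed_width(value):
    """Wrap plain ``bytes`` in ``np.bytes_``; leave other values untouched."""
    if isinstance(value, bytes):
        return np.bytes_(value)
    return value


fixed = as_fixed_width(b"hello")
# The dtype is a fixed-width byte string whose width equals the byte length.
print(type(fixed).__name__, fixed.dtype)
```

Only `bytes` input is affected; `str` attributes pass through unchanged in this sketch.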
@huard I'm just going over this again. I'm on board with @shoyer here.
Even without this addition:

    if isinstance(value, bytes):
        value = np.bytes_(value)

ds = xr.Dataset(
data_vars=dict(temp=("x", [1, 2, 3])),
coords=dict(x=[0, 1, 2]),
attrs=dict(
plain_bytes=b"hello",
numpy_bytes=np.bytes_(b"hello"),
),
)
encodes properly for both engines to fixed size strings (aka NC_CHAR)
ATTRIBUTE "numpy_bytes" {
DATATYPE H5T_STRING {
STRSIZE 5;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
DATA {
(0): "hello"
}
}
ATTRIBUTE "plain_bytes" {
DATATYPE H5T_STRING {
STRSIZE 5;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
DATA {
(0): "hello"
}
}
I think removing the helper function and just conditionally running `encode_nc3_attr_value(value)` should be enough.
When I remove the cast and test with h5netcdf v1.6.4, `plain_bytes` is saved as a variable length string.
Is the plan to pin h5netcdf >=1.7 for the next xarray release?
@huard, you would need to test against h5netcdf main. Isn't the check for > 1.6.4?
No, h5netcdf is not pinned at all. But we have the version check in place. So all good, or am I missing something?
My objective was to have some basic CLASSIC functionality working with older h5netcdf releases. If we keep this line in, xarray is able to save CLASSIC "passing" files even without h5netcdf's main branch.
xarray (this branch) / h5netcdf 1.6.4 -> CLASSIC files won't be recognized as such by the netCDF library, but there is a fair chance 3rd party applications won't choke.
xarray (this branch minus the bytes_ cast) / h5netcdf 1.6.4 -> 3rd party apps likely to crash when reading attributes.
xarray (this branch minus the bytes_ cast) / h5netcdf 1.7.0 -> Fully compliant NETCDF4_CLASSIC format
    def test_string_attributes_stored_as_char(self, tmp_path):
        import h5netcdf

        original = Dataset(attrs={"foo": "bar"})
        store_path = tmp_path / "tmp.nc"
        original.to_netcdf(store_path, engine=self.engine, format=self.file_format)
        with h5netcdf.File(store_path, "r") as ds:
            # Check that the attribute is stored as a char array
            assert ds._h5file.attrs["foo"].dtype == np.dtype("S3")
NumPy's `S` dtype actually corresponds to `bytes`, not `str`. I don't think we want to use it for storing attributes in general.
Using fixed width chars replicates the behavior of the netCDF4 backend for the CLASSIC format. Again, this has to do with the `NC_CHAR` vs `NC_STRING` formats.
Sticking as close as possible to netCDF4 output increases my confidence that the h5netcdf outputs will be compatible with 3rd party software expecting the CLASSIC format.
xarray/backends/h5netcdf_.py
Outdated
    if format == "NETCDF4_CLASSIC" and group is not None:
        raise ValueError("Cannot create sub-groups in `NETCDF4_CLASSIC` format.")
Does h5netcdf give a suitable error message here already?
`h5netcdf.File` does not even have a `format` argument, so no.
Thanks for the review. Made the suggested changes, but I'm afraid the string attributes need to be saved as fixed width char arrays to be compliant with the CLASSIC file format.
I'm happy to remove the cast to bytes if the next xarray release pins h5netcdf >=1.7. If not, I think keeping the line is useful.
Can we simply require h5netcdf >= 1.7.0 for classic writes instead?
My original intent was to get as much mileage as possible within xarray, not knowing how the h5netcdf PR would fare. @dcherian if an h5netcdf release is planned before the next xarray release, I think your suggestion makes a lot of sense. Something like this?

    if Version(h5netcdf.__version__) > Version("1.6.4"):
        kwargs["format"] = format
    elif format == "NETCDF4_CLASSIC":
        raise ValueError("h5netcdf >= 1.7.0 is required to save output in NETCDF4_CLASSIC format.")
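A self-contained version of this check might look like the sketch below. The function name `classic_format_kwargs` and passing the installed version as a string are assumptions made so the snippet runs without h5netcdf installed; only `packaging.version.Version` is a real API here.

```python
from packaging.version import Version


def classic_format_kwargs(h5netcdf_version, format):
    """Return extra kwargs for opening the file, or raise if unsupported."""
    kwargs = {}
    if Version(h5netcdf_version) > Version("1.6.4"):
        # Newer h5netcdf is assumed to accept a ``format`` argument.
        kwargs["format"] = format
    elif format == "NETCDF4_CLASSIC":
        raise ValueError(
            "h5netcdf >= 1.7.0 is required to save output in NETCDF4_CLASSIC format."
        )
    return kwargs
```

With an older h5netcdf, non-classic formats still get an empty kwargs dict, so existing behavior is unchanged; only a classic write request fails early with a clear message.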
… to NETCDF4_CLASSIC format.
Maybe we can just remove `convert_string`?
Co-authored-by: Kai Mühlbauer <[email protected]>
@huard FYI: I'll have h5netcdf 1.7.0 out later today. Just waiting for this one here to get in.
Thanks @huard!
Happy to contribute, thanks for your support.
Added logic in the `h5netcdf` engine to write pseudo NETCDF4_CLASSIC files, reusing encoding logic used by the `netcdf4` engine.

The files generated with the PR using the latest `h5netcdf` release (1.6.4) won't be recognized by third party software as genuine NETCDF4_CLASSIC files, in part because they have no `_nc3_strict` hidden global attribute. There are other differences with netCDF4 generated files, including string attributes padding, how `_FillValue` is stored, etc. Changes to `h5netcdf` will be necessary to make netCDF files fully compliant with the CLASSIC format.

whats-new.rst