huard
Contributor

@huard huard commented Sep 3, 2025

Added logic in the h5netcdf engine to write pseudo NETCDF4_CLASSIC files, reusing the encoding logic of the netcdf4 engine.

The files generated with this PR using the latest h5netcdf release (1.6.4) won't be recognized by third-party software as genuine NETCDF4_CLASSIC files, in part because they have no _nc3_strict hidden global attribute. There are other differences from netCDF4-generated files, including string attribute padding, how _FillValue is stored, etc. Changes to h5netcdf will be necessary to make the files fully compliant with the CLASSIC format.

@shoyer
Member

shoyer commented Sep 3, 2025

Xarray currently doesn't have any logic to build these metadata attributes. Currently this is all handled in h5netcdf.

We should also make sure that trying to use NetCDF4-only features (e.g., groups) results in an error.

@huard
Contributor Author

huard commented Sep 3, 2025

The last commit uses h5dump to display differences between the expected and actual content of the HDF5 file. I was also able to add a _nc3_strict global attribute.

I can try to raise an error if groups are used.

The remaining differences are related to the SUPERBLOCK version, the STRPAD character, and the _FillValue. I'm not sure I'll be able to resolve those.

    SUPER_BLOCK {
  -    SUPERBLOCK_VERSION 0
  ?                       ^
  +    SUPERBLOCK_VERSION 2
  ?                       ^
...
         ATTRIBUTE "foo" {
             DATATYPE  H5T_STRING {
                STRSIZE 8;
  -             STRPAD H5T_STR_NULLPAD;
  ?                                ^^^
  +             STRPAD H5T_STR_NULLTERM;
  ?                                ^^^^
                CSET H5T_CSET_ASCII;
                CTYPE H5T_C_S1;
             }
...
          ATTRIBUTE "_FillValue" {
             DATATYPE  H5T_IEEE_F64LE
  -          DATASPACE  SCALAR
  +          DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
             DATA {
             (0): nan
             }
          }
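
The -/?/+ layout above resembles the output of Python's difflib.ndiff. As a rough sketch of how two h5dump text dumps could be compared that way (this is not the PR's actual test code; the function name dump_diff and the sample strings are made up for illustration, and the dumps are hard-coded rather than produced by running h5dump):

```python
import difflib


def dump_diff(expected: str, actual: str) -> list[str]:
    """Return only the ndiff lines that differ between two text dumps."""
    diff = difflib.ndiff(expected.splitlines(), actual.splitlines())
    # Keep removed (-), added (+) and hint (?) lines; drop unchanged context.
    return [line.rstrip("\n") for line in diff if line.startswith(("-", "+", "?"))]


# Hypothetical excerpts standing in for real h5dump output.
expected = "SUPER_BLOCK {\n   SUPERBLOCK_VERSION 0\n}"
actual = "SUPER_BLOCK {\n   SUPERBLOCK_VERSION 2\n}"

changes = dump_diff(expected, actual)
print("\n".join(changes))
```

An empty list would mean the two dumps are textually identical.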

@shoyer
Member

shoyer commented Sep 3, 2025

If you really want to get metadata attributes and precise HDF5 types right, that should all be handled in h5netcdf. I think that's also the right place for h5dump tests.

In Xarray, all we should be doing for NETCDF4_CLASSIC is coercing some dtypes (using Xarray's encoders) to NetCDF3 compatible types.

@kmuehlbauer
Contributor

kmuehlbauer commented Sep 3, 2025

@huard Thanks for pushing this!

For the superblock issue, please add kwarg libver="earliest" when opening the file for writing. This will create the file with superblock version 0 for maximum backwards compatibility.

For the NULLPAD vs. NULLTERM question, there is some reading material in PyTables/PyTables#264 and h5netcdf/h5netcdf#116. This one would need to be implemented in h5netcdf, if need be.

@huard
Contributor Author

huard commented Sep 4, 2025

@kmuehlbauer Thanks for the references, this is really helpful!

I'll remove the low-level stuff from this branch (_nc3_strict) and bring it into h5netcdf.

@huard
Contributor Author

huard commented Sep 9, 2025

This is ready for review.

While I added a test doing a roundtrip between the netCDF4 and h5netcdf CLASSIC formats, it does not check how data is actually written inside the HDF5 file, just that files can be written and read consistently by xarray. Non-standard reading rules can hide non-standard writing rules. I'm planning to add "binary compatibility" tests within h5netcdf.

Comment on lines 339 to 340
if isinstance(value, bytes):
    value = np.bytes_(value)
Member

Why this special logic only for converting bytes? This seems unrelated to what we need for NETCDF4_CLASSIC.

Contributor Author

To make sure strings are written as NC_CHAR, and not NC_STRING. See https://engee.com/helpcenter/stable/en/julia/NetCDF/strings.html

This is in fact the detail that our third party software in C++ choked on. The netCDF C library has both nc_get_att_text and nc_get_att_string functions. Calling nc_get_att_text on an NC_STRING raises an error.
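
To illustrate the fixed-width idea on the Python side only, here is a small sketch. The helper name as_fixed_width is hypothetical, and encoding str values as UTF-8 before the cast is an assumption rather than something this PR does. It only shows that np.bytes_ carries a fixed-width "S" dtype; how a given h5netcdf/h5py version maps that dtype to NC_CHAR on disk is not demonstrated here.

```python
import numpy as np


def as_fixed_width(value):
    """Hypothetical helper: coerce text to a fixed-width NumPy bytes scalar."""
    if isinstance(value, str):
        # Assumption: text attributes are encoded as UTF-8 before the cast.
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        # Same cast as in the snippet under review.
        value = np.bytes_(value)
    return value


attr = as_fixed_width("bar")
print(attr.dtype, attr.dtype.itemsize)

# Non-text values pass through unchanged.
number = as_fixed_width(1.5)
```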

Contributor

@huard I'm just going over this again. I'm on board with @shoyer here.

Even without this addition:

    if isinstance(value, bytes):
        value = np.bytes_(value)

this dataset:

    ds = xr.Dataset(
        data_vars=dict(temp=("x", [1, 2, 3])),
        coords=dict(x=[0, 1, 2]),
        attrs=dict(
            plain_bytes=b"hello",
            numpy_bytes=np.bytes_(b"hello"),
        ),
    )

encodes properly for both engines to fixed-size strings (aka NC_CHAR):

ATTRIBUTE "numpy_bytes" {
      DATATYPE  H5T_STRING {
         STRSIZE 5;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "hello"
      }
   }
   ATTRIBUTE "plain_bytes" {
      DATATYPE  H5T_STRING {
         STRSIZE 5;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "hello"
      }
   }

Contributor

I think removing the helper function and just conditionally running encode_nc3_attr_value(value) should be enough.

Contributor Author

When I remove the cast and test with h5netcdf v1.6.4, plain_bytes is saved as a variable length string.

Is the plan to pin h5netcdf >=1.7 for the next xarray release?

Contributor

@huard, you would need to test against h5netcdf main. Isn't the check for > 1.6.4?

Contributor

No, h5netcdf is not pinned at all. But we have the version check in place. So all good, or am I missing something?

Contributor Author

My objective was to have some basic CLASSIC functionality working with older h5netcdf releases. If we keep this line in, xarray is able to save CLASSIC "passing" files even without h5netcdf's main branch.

Contributor Author

xarray (this branch) / h5netcdf 1.6.4 -> CLASSIC files won't be recognized as such by the netCDF library, but there is a fair chance 3rd party applications won't choke.

xarray (this branch minus the bytes_ cast) / h5netcdf 1.6.4 -> 3rd party apps likely to crash when reading attributes.

xarray (this branch minus the bytes_ cast) / h5netcdf 1.7.0 -> Fully compliant NETCDF4_CLASSIC format

Comment on lines +4679 to +4687
def test_string_attributes_stored_as_char(self, tmp_path):
    import h5netcdf

    original = Dataset(attrs={"foo": "bar"})
    store_path = tmp_path / "tmp.nc"
    original.to_netcdf(store_path, engine=self.engine, format=self.file_format)
    with h5netcdf.File(store_path, "r") as ds:
        # Check that the attribute is stored as a char array
        assert ds._h5file.attrs["foo"].dtype == np.dtype("S3")
Member

NumPy's S dtype actually corresponds to bytes, not str. I don't think we want to use it for storing attributes in general.

Contributor Author

Using fixed width chars replicates the behavior of the netCDF4 backend for the CLASSIC format. Again, this has to do with the NC_CHAR vs NC_STRING formats.

Sticking as close as possible to netCDF4 output increases my confidence that the h5netcdf outputs will be compatible with 3rd party software expecting the CLASSIC format.

Comment on lines 145 to 146
if format == "NETCDF4_CLASSIC" and group is not None:
    raise ValueError("Cannot create sub-groups in `NETCDF4_CLASSIC` format.")
Member

Does h5netcdf give a suitable error message here already?

Contributor Author

h5netcdf.File does not even have a format argument, so no.
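
For reference, a standalone sketch of how the guard under review behaves. The function names check_classic_group and is_allowed are made up for this example; in the PR the check sits inline in the backend rather than in a helper.

```python
def check_classic_group(format, group):
    """Reject sub-groups when the CLASSIC data model is requested."""
    if format == "NETCDF4_CLASSIC" and group is not None:
        raise ValueError("Cannot create sub-groups in `NETCDF4_CLASSIC` format.")


def is_allowed(format, group):
    """Return True if the combination passes the check, False otherwise."""
    try:
        check_classic_group(format, group)
    except ValueError:
        return False
    return True


results = {
    "classic, root": is_allowed("NETCDF4_CLASSIC", None),
    "classic, sub-group": is_allowed("NETCDF4_CLASSIC", "child"),
    "netcdf4, sub-group": is_allowed("NETCDF4", "child"),
}
print(results)
```

Only the CLASSIC format combined with a non-root group is rejected; plain NETCDF4 writes into groups are unaffected.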

Contributor Author

@huard huard left a comment

Thanks for the review. Made the suggested changes, but I'm afraid the string attributes need to be saved as fixed width char arrays to be compliant with the CLASSIC file format.

@dcherian dcherian requested a review from kmuehlbauer October 13, 2025 20:44
Contributor

@kmuehlbauer kmuehlbauer left a comment

@huard Sorry for letting this wait for so long. Thanks @dcherian for the reminder. This is looking good to me; one minor change is needed, though.

@huard
Contributor Author

huard commented Oct 14, 2025

I'm happy to remove the cast to bytes if the next xarray release pins h5netcdf >=1.7. If not, I think keeping the line is useful.

@dcherian
Contributor

Can we simply require h5netcdf >= 1.7.0 for classic writes instead?

@huard
Contributor Author

huard commented Oct 14, 2025

My original intent was to try to get as much mileage as possible within xarray, not knowing how the h5netcdf PR would fare. @dcherian if an h5netcdf release is planned before the next xarray release, I think your suggestion makes a lot of sense.

Something like this?

        if Version(h5netcdf.__version__) > Version("1.6.4"):
            kwargs["format"] = format
        elif format == "NETCDF4_CLASSIC":
            raise ValueError("h5netcdf >= 1.7.0 is required to save output in NETCDF4_CLASSIC format.")
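
A self-contained sketch of the check above, for illustration only. The wrapper name classic_kwargs and passing the installed version in as a string are assumptions made so the snippet runs without h5netcdf; Version is taken to be packaging.version.Version.

```python
from packaging.version import Version


def classic_kwargs(format, h5netcdf_version):
    """Build the extra File kwargs, or fail if CLASSIC writes are unsupported."""
    kwargs = {}
    if Version(h5netcdf_version) > Version("1.6.4"):
        # Newer releases are handed the requested format.
        kwargs["format"] = format
    elif format == "NETCDF4_CLASSIC":
        raise ValueError(
            "h5netcdf >= 1.7.0 is required to save output in NETCDF4_CLASSIC format."
        )
    return kwargs


new_release = classic_kwargs("NETCDF4_CLASSIC", "1.7.0")
old_release_netcdf4 = classic_kwargs("NETCDF4", "1.6.4")

try:
    classic_kwargs("NETCDF4_CLASSIC", "1.6.4")
    old_release_error = None
except ValueError as err:
    old_release_error = str(err)
```

With this shape, older releases keep working for plain NETCDF4 output and only fail when CLASSIC is explicitly requested.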

Contributor

@kmuehlbauer kmuehlbauer left a comment

Maybe we can just remove convert_string?

huard and others added 3 commits October 15, 2025 08:22
@kmuehlbauer
Contributor

@huard FYI: I'll have h5netcdf 1.7.0 out later today. Just waiting for this one here to get in.

@kmuehlbauer kmuehlbauer merged commit 58f26f9 into pydata:main Oct 15, 2025
34 of 37 checks passed
@kmuehlbauer
Contributor

Thanks @huard!

@huard
Contributor Author

huard commented Oct 15, 2025

Happy to contribute, thanks for your support.

Development

Successfully merging this pull request may close these issues.

Support "NETCDF4_CLASSIC" format with engine h5netcdf
