@eschnett
Contributor

@eschnett eschnett commented Jun 9, 2025

This is a first stab at implementing a Blosc2 codec. I believe the implementation is correct. I am looking for feedback.

@codecov
codecov bot commented Jun 9, 2025

Codecov Report

Attention: Patch coverage is 86.55914% with 25 lines in your changes missing coverage. Please review.

Project coverage is 86.55%. Comparing base (4343c2a) to head (1a4b310).
Report is 4 commits behind head on main.

Files with missing lines   | Patch % | Lines
LibBlosc2/src/encode.jl    | 82.69%  | 18 Missing ⚠️
LibBlosc2/src/decode.jl    | 88.46%  | 6 Missing ⚠️
LibBlosc2/src/libblosc2.jl | 96.29%  | 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main      #54       +/-   ##
===========================================
- Coverage   98.24%   86.55%   -11.69%     
===========================================
  Files           5        4        -1     
  Lines         456      186      -270     
===========================================
- Hits          448      161      -287     
- Misses          8       25       +17     


@nhz2
Member

nhz2 commented Jun 10, 2025

If I understand correctly, the goal is to implement the HDF5 filter 32026 in https://github.com/silx-kit/hdf5plugin/blob/v5.1.0/src/PyTables/hdf5-blosc2/src/blosc2_filter.c

According to my reading of https://github.com/Blosc/c-blosc2/blob/v2.17.1/README_EXTENSION_FILENAMES.rst, there are also .b2frame and .b2nd formats. Therefore, the format here should be called Blosc2HDF5 to distinguish it from these.

@eschnett
Contributor Author

My goal is to implement a stand-alone blosc2 compressor/decompressor. I did not intend to connect it to HDF5, although that should be possible.

The format I am implementing uses "super-chunks", which were introduced in blosc2. They allow compressing more than 2 GByte of data. Blosc2 still supports the compression methods used by blosc1, with their size limit. It would be possible to support those in LibBlosc2 as well, e.g. by offering a choice when compressing and detecting the format automatically when decompressing.
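
To illustrate why splitting into chunks lifts the size limit, here is a minimal sketch in plain Julia. It does not call c-blosc2; `chunk_ranges` is a hypothetical helper, and `2^31 - 1` is only an assumed stand-in for the roughly 2 GByte per-buffer limit mentioned above.

```julia
# Hypothetical helper (not part of LibBlosc2 or c-blosc2):
# split `len` bytes into consecutive 1-based index ranges,
# each at most `max_chunk` bytes long.
function chunk_ranges(len::Integer, max_chunk::Integer)
    len >= 0 || throw(ArgumentError("len must be non-negative"))
    max_chunk > 0 || throw(ArgumentError("max_chunk must be positive"))
    return [i:min(i + max_chunk - 1, len) for i in 1:max_chunk:len]
end

# Small example: 10 bytes in chunks of at most 4 bytes.
ranges = chunk_ranges(10, 4)

# Assumed per-chunk limit; an input larger than this needs several chunks.
big_ranges = chunk_ranges(5_000_000_000, 2^31 - 1)
```

Each range could then be compressed independently and appended to the frame, so the total input size is no longer bounded by the per-chunk limit.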

@eschnett eschnett marked this pull request as draft June 10, 2025 16:52
@eschnett
Contributor Author

The b2nd format is for storing multi-dimensional arrays. The ChunkCodecs API doesn't easily give access to this information (everything is a stream of bytes) and thus I don't think this format is interesting here. The format I'm implementing is the cframe format, a "contiguous frame" holding the compressed data.

@eschnett eschnett marked this pull request as ready for review June 10, 2025 17:15
@eschnett
Contributor Author

ping


# There's more unused/unchecked data
c[end-50] = 0x40
# BROKEN @test_throws Blosc2DecodingError decode(Blosc2DecodeOptions(), c)
Member

It is okay for a format not to checksum everything.

Comment on lines +99 to +106
# Finally, this corruption has an effect
c[end-100] = 0x40
# Windows segfaults in this call with exit code 3221226356,
# indicating a heap corruption. That's clearly a bug in c-blosc2.
# It seems c-blosc2 does not checksum its compressed data.
if !Sys.iswindows()
    @test_throws Blosc2DecodingError decode(Blosc2DecodeOptions(), c)
end
Member

Here is the file that is causing the segfault.
bad_file.txt

It would be good to see if this crashes https://github.com/Blosc/c-blosc2/blob/main/examples/decompress_file.c as well.

Until this is resolved, the documentation for this package should have warnings not to use the package with potentially invalid inputs.

Contributor Author

Unfortunately, decompress_file does segfault on this input.

Member

It seems like a bug in blosc2 or the example. The issue isn't with checksums; blosc2 is probably missing a bounds check somewhere. Can you report this upstream?
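
For context, the kind of bounds check meant here can be sketched with a toy format in plain Julia. This is not the Blosc2 frame layout and not c-blosc2 code; `ToyDecodingError` and `read_payload` are hypothetical names used only for illustration.

```julia
# Toy format (NOT the Blosc2 layout): a 4-byte little-endian payload
# length, followed by that many payload bytes.
struct ToyDecodingError <: Exception
    msg::String
end

function read_payload(buf::Vector{UInt8})
    length(buf) >= 4 || throw(ToyDecodingError("truncated header"))
    n = Int(buf[1]) | (Int(buf[2]) << 8) | (Int(buf[3]) << 16) | (Int(buf[4]) << 24)
    # The bounds check: a length read from the input is never trusted
    # until it has been compared against the bytes actually available.
    n <= length(buf) - 4 || throw(ToyDecodingError("declared length exceeds input"))
    return buf[5:4+n]
end

good = UInt8[0x02, 0x00, 0x00, 0x00, 0xaa, 0xbb]
bad = copy(good)
bad[1] = 0x40   # corrupt the length field, similar to the test above
```

With the check in place, `read_payload(bad)` throws a `ToyDecodingError` instead of reading past the end of the buffer. A checksum is not needed for this; the decoder only has to validate every size and offset it reads.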

@nhz2
Member

nhz2 commented Sep 1, 2025

Question about the status of this PR. Should it be merged as is and then the problems fixed in future PRs before an initial release? Do you want to continue working on this PR branch, or do you want to close the PR?

@eschnett
Contributor Author

eschnett commented Sep 2, 2025

I am still interested in this PR. I don't know your development routine, but it might be best to keep this as a PR until it is ready.

What is missing? To my knowledge there are only a few minor items outstanding: tests and warnings to users.

There is a new version of Blosc2 available. I can check for the segfault with that version and, if it still occurs, report the problem upstream.

@nhz2
Member

nhz2 commented Sep 2, 2025

Sounds good. The other thing missing is using the documented Schunk overhead instead of a guess (or manually creating the schunks in Julia if the C library may go over the proposed limit).
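
As a rough sketch of what such a bound could look like, assuming a simplified model in which each chunk adds a fixed overhead and the frame adds a fixed overhead once. `encode_bound` is a hypothetical helper, and the numbers in the example are placeholders, not the documented c-blosc2 values.

```julia
# Hypothetical upper bound on the encoded size under a simplified model:
# every chunk may grow by `chunk_overhead` bytes, and the frame adds
# `frame_overhead` bytes once. The real values must be taken from the
# c-blosc2 documentation; nothing here is the library's actual formula.
function encode_bound(src_size::Integer, chunk_size::Integer,
                      chunk_overhead::Integer, frame_overhead::Integer)
    chunk_size > 0 || throw(ArgumentError("chunk_size must be positive"))
    nchunks = cld(src_size, chunk_size)   # number of chunks, rounded up
    return src_size + nchunks * chunk_overhead + frame_overhead
end

# Placeholder numbers, for illustration only.
bound = encode_bound(1000, 256, 32, 200)
```

If the C library can exceed such a bound, the alternative mentioned above (building the schunks manually in Julia) would keep the overhead under the package's control.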
