Rectilinear (variable-length) chunk grid #25

d-v-b · 2025-10-14T13:54:55Z

Opening this PR with a draft of a rectilinear chunk grid spec

LDeakin

Looks good to me. Do we want to include a recommendation that implementations SHOULD use run length encoding where appropriate when saving metadata?

Implementation here: zarrs/zarrs#284

chunk-grids/rectilinear/README.md

jbms · 2025-10-14T15:54:50Z

chunk-grids/rectilinear/README.md

+
+#### Chunk edge lengths
+
+The edge lengths of the chunks along an array axis `A` are represented by an array that can contain two types of elements:


A previous proposal for rectilinear chunking also allowed a plain integer here, to indicate uniform chunking along the dimension. That makes this strictly a generalization of regular chunking and seems like a good idea to include.

While the run-length encoding makes regular chunking along a dimension still efficient to specify, a plain integer indicating uniform chunking has the advantage of allowing the dimension to be resized without also specifying the new chunk sizes.

Potentially you could allow the last item in the chunk_shapes entry for a given dimension to be [n, -1] to mean that all remaining chunks are size n --- that would allow the dimension to be resized since it would indicate the chunk size for new chunks.

That would reduce the need for also allowing a plain integer to specify uniform chunking, but it is still probably good to allow a plain integer.

we removed the plain integer form because it was ambiguous -- for a shape (10,), does {"chunk_shapes": [3]} expand to [3, 3, 3, 1] or [3, 3, 3, 3]? I think we can restore the single integer form if we can resolve this ambiguity.

For drop-in compatibility with the regular chunk grid, we do need to support some way of expressing chunks that overhang the grid of the array. But we might consider supporting this compatibility via something more explicit than the single-integer form.

And for resizing, I agree that it's convenient if resizing doesn't require re-defining the chunk grid, but for many resize operations with rectilinear chunks the appended data will have different chunking, and so any resize API for rectilinear chunks will need to support declaring new chunk shapes. So perhaps a convenient method for resizing with an automatic default chunk shape could be offered by an application, even if the chunk_grid is modified in any case.

A related question is whether we support declaring chunk shapes that substantially overflow the shape of the array, e.g. for a shape (10,), [3, 3, 3, 3, 3] which has a length of 15. Without a clear use case, I'm inclined against supporting this here. Curious to hear what you think.

Without a clear use case, I'm inclined against supporting this here. Curious to hear what you think.

Some feedback from folks at the Zarr summit was that we could allow declaring a sequence of chunks that exceeds the shape of the array, but disallow a sequence that would underfill the array. That seems reasonable to me.

we removed the plain integer form because it was ambiguous -- for a shape (10,), does {"chunk_shapes": [3]} expand to [3, 3, 3, 1] or [3, 3, 3, 3]? I think we can restore the single integer form if we can resolve this ambiguity.

I don't see where the 1 would come from. I imagined it would expand to [3, 3, 3, 3]. The point is that it would work the same as the regular grid.

For drop-in compatibility with the regular chunk grid, we do need to support some way of expressing chunks that overhang the grid of the array. But we might consider supporting this compatibility via something more explicit than the single-integer form.

And for resizing, I agree that it's convenient if resizing doesn't require re-defining the chunk grid, but for many resize operations with rectilinear chunks the appended data will have different chunking, and so any resize API for rectilinear chunks will need to support declaring new chunk shapes. So perhaps a convenient method for resizing with an automatic default chunk shape could be offered by an application, even if the chunk_grid is modified in any case.

A related question is whether we support declaring chunk shapes that substantially overflow the shape of the array, e.g. for a shape (10,), [3, 3, 3, 3, 3] which has a length of 15. Without a clear use case, I'm inclined against supporting this here. Curious to hear what you think.

Yes, definitely should support overhanging the array bounds, for compatibility with regular grid, to achieve any necessary chunk size constraints, and to create room for resizing without specifying new chunk sizes.

Without a clear use case, I'm inclined against supporting this here. Curious to hear what you think.

Some feedback from folks at the Zarr summit was that we could allow declaring a sequence of chunks that exceeds the shape of the array, but disallow a sequence that would underfill the array. That seems reasonable to me.

Agreed.

I don't see where the 1 would come from. I imagined it would expand to [3, 3, 3, 3]. The point is that it would work the same as the regular grid.

The 1 would come from trimming the final chunk to match the array's shape exactly. I will bring back the single-integer form, and explain this behavior, which will remove the ambiguity.

as of a9ebbf2 I added the single-integer declaration of the chunk edge lengths, and included allowance for declaring a sequence of chunk edge lengths that overflows the array shape by multiple chunks.

d-v-b · 2025-10-18T18:50:12Z

This is ready for review. I would like to include a small, complete array that demonstrates this chunk grid, and a JSON schema for the metadata.

normanrz

Thanks, @d-v-b!
This PR fulfills all requirements to be merged and is a great addition to zarr-extensions.
Schema JSON and examples would be great. Do you wish add these in this PR or in a later one?

d-v-b · 2025-10-20T08:09:17Z

if @jbms, @LDeakin, and @manzt are all OK with how this look now then I'm happy to merge and add the schema + example data in a subsequent PR. But I am also wondering if we want to handle the origin of the chunk grid here, while this PR is still open. See #30.

normanrz · 2025-10-20T09:42:14Z

I leave that to you. Let me know, when you want this PR merged.

jbms · 2025-10-20T13:47:23Z

If we want to support negative chunk indices (in order to allow expansion in the negative direction) then we need to be able to specify the sizes of the negative chunks also.

For a regular grid it is sufficient to have a single grid_origin parameter that specifies the start of chunk 0 in array index space.

But for the rectilinear grid we need to specify both the position in array index space and the position in chunk index space of the start of the chunk size list.

jbms · 2025-10-20T13:58:13Z

Another thing that has been suggested in the past is to allow the logical size of a chunk to differ from the physical size, i.e. allow chunks to be stored with unused padding at both the start and end.

For every chunk, you need to specify the physical size, logical size, and offset of the start of the logical chunk within the physical chunk.

This would allow you to insert and remove elements in the middle of a dimension without having to re-encode chunks, just rename them. And by using kerchunk or icechunk or OCDBT the renaming could be done "virtually".

To avoid even having to rename you could allow an arbitrary virtual to physical chunk index map, where you remap the chunk index also.

d-v-b · 2025-10-20T14:03:14Z

Speaking for Zarr Python, these would require some work to implement -- in particular, we will need to introduce a new array indexing API to work with negative indices that aren't referenced to the end of an array. So for that reason I'm inclined to keep this chunk grid simple for now, provided we are confident we can safely extend it in the future.

I think using additional keys in the configuration + defining new variants of the "kind" field to overload the meaning of chunk_shapes should cover the additional flexibility. Defining a totally new chunk grid could also work, but I think as long as the changes generalize this chunk grid, we should aim to update this spec to a new version with changes that will be breaking for naive implementations.

jbms · 2025-10-20T14:12:04Z

The kind should arguably be per-dimension rather than apply to all dimensions. Perhaps the kind could be indicated by the type of the json value instead.

d-v-b · 2025-10-20T14:26:20Z

The kind should arguably be per-dimension rather than apply to all dimensions.

Our intention for "kind" was that it defines the semantics for the chunk_shapes field. E.g., if "kind" was set to "reference", then chunk_shapes might be a path or URI, and thus per-dimension metadata would not be meaningful without resolving that reference. We could use the type of the chunk_shapes field to express this, but I think an explicit "kind" field gives us more flexibility here.

jbms · 2025-10-20T16:03:14Z

The kind should arguably be per-dimension rather than apply to all dimensions.

Our intention for "kind" was that it defines the semantics for the chunk_shapes field. E.g., if "kind" was set to "reference", then chunk_shapes might be a path or URI, and thus per-dimension metadata would not be meaningful without resolving that reference. We could use the type of the chunk_shapes field to express this, but I think an explicit "kind" field gives us more flexibility here.

What I meant is that you may want the chunk sizes for one dimension to be stored in a separate 1-d array, but another dimension might have uniform chunking or sizes specified inline in the metadata.

initial commit of readme

892e761

LDeakin reviewed Oct 14, 2025

View reviewed changes

chunk-grids/rectilinear/README.md Outdated Show resolved Hide resolved

chunk-grids/rectilinear/README.md Outdated Show resolved Hide resolved

apply feedback from code review

2c56410

jbms reviewed Oct 14, 2025

View reviewed changes

d-v-b added 2 commits October 15, 2025 11:32

expand prior work section

84831bc

punctuation

bee5f0a

sharkinsspatial mentioned this pull request Oct 16, 2025

Question about loadable_variables for leap-year daily-data concatenation zarr-developers/VirtualiZarr#813

Closed

d-v-b mentioned this pull request Oct 18, 2025

declaring the origin of an array #30

Open

d-v-b added 2 commits October 18, 2025 20:25

restore single-integer chunk edge length declaration

a9ebbf2

fix prose and logic

945853a

d-v-b marked this pull request as ready for review October 18, 2025 18:47

normanrz approved these changes Oct 19, 2025

View reviewed changes

jhamman mentioned this pull request Oct 20, 2025

Feature: support rectilinear chunk grid extension zarr-developers/zarr-python#3534

Draft

6 tasks


		#### Chunk edge lengths

		The edge lengths of the chunks along an array axis `A` are represented by an array that can contain two types of elements:

Rectilinear (variable-length) chunk grid #25

Are you sure you want to change the base?

Rectilinear (variable-length) chunk grid #25

Uh oh!

Conversation

d-v-b commented Oct 14, 2025

Uh oh!

LDeakin left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

d-v-b commented Oct 18, 2025

Uh oh!

normanrz left a comment

Choose a reason for hiding this comment

Uh oh!

d-v-b commented Oct 20, 2025

Uh oh!

normanrz commented Oct 20, 2025

Uh oh!

jbms commented Oct 20, 2025

Uh oh!

jbms commented Oct 20, 2025

Uh oh!

d-v-b commented Oct 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jbms commented Oct 20, 2025

Uh oh!

d-v-b commented Oct 20, 2025

Uh oh!

jbms commented Oct 20, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

LDeakin left a comment •

edited

Loading

d-v-b commented Oct 20, 2025 •

edited

Loading