feat: recursive list/struct decomposition for batch-invariant hashing#10
Conversation
Address all 7 design-spec issues to make starfix produce identical hashes for logically equivalent Arrow tables regardless of column order, struct field order, encoding, or type variant.

Core implementation changes (src/arrow_digester_core.rs):
- Issue 1: Sort struct fields alphabetically in data_type_to_value
- Issue 2: Apply sort_json_value recursively for deterministic JSON
- Issue 3: Use u64 (not usize) for binary length prefixes
- Issue 4: Remove the NULL_BYTES sentinel from binary/string nullable paths
- Issue 5: Canonicalize Binary→LargeBinary, Utf8→LargeUtf8, List→LargeList
- Issue 6: Resolve dictionary arrays to plain arrays before hashing
- Issue 7: Use logical schema comparison in update() (canonical serialization)

Also improved the schema JSON format for cross-language stability by dropping Arrow-internal field names (e.g. "item") from List element serialization.

All 13 previously-ignored tests now pass. Updated golden hash values and golden schema JSON to reflect the new canonical serialization.

https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX
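Issue 2's recursive key sorting can be illustrated with a minimal Python sketch (a stand-in for the Rust `sort_json_value`; the function names and JSON shapes here are illustrative, not the crate's actual code):

```python
import hashlib
import json

def sort_json_value(value):
    """Recursively sort dict keys so nested objects serialize order-independently."""
    if isinstance(value, dict):
        return {k: sort_json_value(value[k]) for k in sorted(value)}
    if isinstance(value, list):
        # List order is meaningful and preserved; only nested objects are re-keyed.
        return [sort_json_value(v) for v in value]
    return value

def canonical_json_bytes(value):
    """Compact, key-sorted JSON bytes suitable for feeding into a digest."""
    canon = sort_json_value(value)
    return json.dumps(canon, separators=(",", ":"), sort_keys=True).encode("utf-8")

# Two logically equivalent schema-like objects with different key order:
a = {"fields": [{"name": "x", "type": "Int32"}], "meta": {"b": 1, "a": 2}}
b = {"meta": {"a": 2, "b": 1}, "fields": [{"type": "Int32", "name": "x"}]}
assert canonical_json_bytes(a) == canonical_json_bytes(b)
print(hashlib.sha256(canonical_json_bytes(a)).hexdigest())
```

Because the sort is applied at every nesting level, any two logically equivalent metadata objects hash identically regardless of how their keys were originally ordered.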
Add docs/byte-layout-spec.md describing the exact byte-level serialization for schema JSON, fixed-size types, booleans, variable-length types, lists, validity bitmaps, and the final combining digest. Every byte fed into SHA-256 is specified, making cross-language reimplementation possible.

Add 10 verification tests in tests/digest_bytes.rs that manually construct the expected SHA-256 hash from raw bytes and assert equality with the library output. Covers:
- Example A: two-column record batch (Int32 + nullable LargeUtf8)
- Example B: boolean array with nulls (Msb0 bit packing)
- Example C: non-nullable Int32 array
- Example D: binary array with type canonicalization (Binary→LargeBinary)
- Example E: column-order independence proof
- Example F: Utf8/LargeUtf8 type equivalence proof
- Example G: nullable Int32 with nulls
- Example H: nullable string array with nulls and type canonicalization
- Example I: empty table (schema only, no data)
- Example J: multi-batch streaming equals single combined batch

https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX
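The spirit of these byte-level tests is to assemble the exact bytes fed to SHA-256 by hand and compare against the library's output. A toy stand-in, using only the u64 little-endian length-prefix rule for variable-length values (this helper is hypothetical, not the actual test code):

```python
import hashlib
import struct

def hash_string_column(values):
    """Each non-null value contributes an 8-byte little-endian u64 length
    prefix followed by its UTF-8 bytes (per the byte-layout spec's
    variable-length rule)."""
    d = hashlib.sha256()
    for v in values:
        raw = v.encode("utf-8")
        d.update(struct.pack("<Q", len(raw)))  # u64 LE length prefix (Issue 3)
        d.update(raw)
    return d.hexdigest()

# Manually constructed expectation for ["hi"]: 8-byte LE length 2, then b"hi".
manual = hashlib.sha256(b"\x02\x00\x00\x00\x00\x00\x00\x00" + b"hi").hexdigest()
assert hash_string_column(["hi"]) == manual
```

Length-prefixing is what keeps the encoding injective: without it, ["a", "b"] and ["ab"] would feed identical bytes to the digest.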
…t examples

Implement DataType::Struct in array_digest_update for composite hashing of struct arrays (previously todo!()). Struct children are sorted alphabetically; each gets an independent digest that is finalized into the parent's data stream. Struct-level nulls propagate to children via combined validity buffers to avoid hashing undefined data.

Add a finalize_child_into_data helper for writing child digest bytes into a parent's data stream.

Add four new manual verification tests (Examples K-N) covering struct columns in record batches, hash_array on structs with and without nulls, and list-of-struct columns. Update the byte-layout spec with corresponding worked examples and an updated Section 3.5.

https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX
Refactor DigestBufferType from an enum to a struct with an optional `structural` digest field. For list columns, element counts (sizes) now accumulate in a separate SHA-256 stream from leaf data, producing null_bits || structural_digest || leaf_digest at finalization.

This cleanly separates structure from data, making collision prevention easier to reason about while preserving streaming compatibility. Non-list types are unchanged.

https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX
List types now separate element counts into a dedicated structural SHA-256 digest stream, while leaf data flows into the data digest. This ensures differently-grouped lists (e.g. [[1,2],[3]] vs [[1],[2,3]]) produce different hashes even when their leaf values are identical.

Updated sections: field digest buffer description (Section 3), list types (Section 3.4), struct composite children (Section 3.5), finalization (Section 4), hash_array API (Section 6), and Example N.

https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX
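The structural/data split can be sketched in a few lines. This is a simplified model, not the crate's code: it assumes u64 LE list lengths in the structural stream, Int32 LE leaves in the data stream, and a combining step of structural-digest bytes followed by data-digest bytes (the real finalization also includes null_bits, per the commit above):

```python
import hashlib
import struct

def hash_list_of_int32(lists):
    """Element counts go to a structural digest; leaf values go to a data digest."""
    structural = hashlib.sha256()
    data = hashlib.sha256()
    for sub in lists:
        structural.update(struct.pack("<Q", len(sub)))  # list length, u64 LE
        for x in sub:
            data.update(struct.pack("<i", x))           # Int32 LE leaf bytes
    # Combine: structural digest bytes, then data digest bytes.
    return hashlib.sha256(structural.digest() + data.digest()).hexdigest()

a = hash_list_of_int32([[1, 2], [3]])
b = hash_list_of_int32([[1], [2, 3]])
assert a != b  # same leaves, different grouping -> different hashes
```

Both inputs feed identical bytes to the data digest; only the structural stream (lengths 2,1 vs 1,2) differs, which is exactly what makes the grouping distinguishable.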
Add clippy expects for similar_names, redundant_clone, and absolute_paths in digest_bytes tests. Run cargo fmt to fix all formatting issues across source and test files. https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX
Add four examples that had tests but were missing from the spec:
- Example G: Nullable Int32 array with nulls (hash_array API)
- Example H: Nullable String array with nulls and type canonicalization
- Example I: Empty table with no data batches
- Example J: Multi-batch streaming batch-split independence

All 14 byte-level spec tests (A-N) now have corresponding worked examples in the documentation.

https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX
…ordering

- Change the validity bitmap from `BitVec` (default `BitVec<usize, Lsb0>`, platform-dependent word size) to `BitVec<u8, Lsb0>` (1-byte words, platform-independent)
- Change boolean value packing from `BitVec<u8, Msb0>` to `BitVec<u8, Lsb0>` to match Arrow's native bit layout
- Cast `null_bit_vec.len()` to `u64` before `to_le_bytes()` in both `finalize_digest` and `finalize_child_into_data` for consistent 8-byte length encoding across platforms

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
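Lsb0 ordering places bit i of a byte at position i % 8 (least-significant bit first), which is Arrow's native layout; Msb0 fills from the most-significant bit. A dependency-free sketch of the packing (illustrative only, not the BitVec internals):

```python
def pack_bits_lsb0(bits):
    """Pack booleans LSB-first into bytes, matching Arrow's native bit order."""
    out = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            out[i // 8] |= 1 << (i % 8)  # Lsb0: bit i goes to bit position i % 8
    return bytes(out)

# [True, False, True, True] -> 0b00001101 = 0x0D under Lsb0.
assert pack_bits_lsb0([True, False, True, True]) == b"\x0d"
# Msb0 would instead yield 0b10110000 = 0xB0 for the same input,
# which is why mixing the two conventions breaks cross-implementation hashes.
```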
Cast Utf8→LargeUtf8, Binary→LargeBinary, List→LargeList at the top of array_digest_update so every code path goes through a single canonical representation. Inner element types are normalized recursively when hash_list_array re-enters array_digest_update for each sub-array.

Also updates the design spec to match the current implementation (Lsb0 booleans, structural digest for lists, composite struct hashing, element_type_to_value, resolved known issues) and adds equivalence tests for List/LargeList arrays and record batches, plus Utf8/LargeUtf8 record batches.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…sive type normalization
Introduce normalize_data_type(), normalize_field(), and normalize_schema()
as reusable functions that recursively normalize Arrow types to their
canonical large equivalents (Utf8→LargeUtf8, Binary→LargeBinary,
List→LargeList, Dictionary→value type) at all nesting levels including
struct children, list elements, and map entries.
Apply normalization at every boundary:
- Schema is normalized at ArrowDigesterCore::new() so all stored state
uses canonical types
- data_type_to_value() uses normalize_data_type before serialization
- hash_array() normalizes the effective type for metadata
- array_digest_update() casts arrays to large equivalents in the data path
API change: ArrowDigester::new() and ArrowDigesterCore::new() now take
&Schema instead of Schema by value, since the input is normalized
internally and the original is not consumed.
Add deeply nested normalization tests:
- List(Utf8) vs LargeList(LargeUtf8) array and schema equivalence
- Struct({items: List(Utf8), name: Utf8}) vs Struct({items: LargeList(LargeUtf8), name: LargeUtf8}) record batch
- Streaming with type-equivalent schemas (Utf8 digester accepting LargeUtf8 batch)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
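The recursive normalization described above can be sketched over a toy type model (plain strings and dicts standing in for Arrow's DataType; this is not the crate's normalize_data_type signature, just the shape of the recursion):

```python
def normalize_data_type(dt):
    """Recursively map types to their canonical large variants at every
    nesting level (toy type model, not pyarrow or arrow-rs)."""
    small_to_large = {"Utf8": "LargeUtf8", "Binary": "LargeBinary"}
    if isinstance(dt, str):
        return small_to_large.get(dt, dt)
    kind = dt["kind"]
    if kind in ("List", "LargeList"):
        return {"kind": "LargeList", "item": normalize_data_type(dt["item"])}
    if kind == "Struct":
        return {"kind": "Struct",
                "fields": {name: normalize_data_type(t)
                           for name, t in dt["fields"].items()}}
    if kind == "Dictionary":
        return normalize_data_type(dt["value"])  # resolve to the value type
    return dt

# List(Struct({name: Utf8})) normalizes to LargeList(Struct({name: LargeUtf8})):
nested = {"kind": "List", "item": {"kind": "Struct", "fields": {"name": "Utf8"}}}
assert normalize_data_type(nested) == {
    "kind": "LargeList",
    "item": {"kind": "Struct", "fields": {"name": "LargeUtf8"}},
}
```

Applying this once at every boundary (schema construction, serialization, metadata, data path) is what guarantees a Utf8 digester and a LargeUtf8 batch ever see only one canonical type.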
- Cache serialized_schema in ArrowDigesterCore to avoid re-serializing on every update() call; remove the now-unused schema field
- Add a clarifying comment on the (normalized_type, cast_array) lifetime-extension pattern in array_digest_update
- Fix 8 digest_bytes tests: change validity types from usize to u8/u64, fix boolean packing from Msb0 to Lsb0, rename _expected to expected, and assert against manual computation instead of hardcoded byte vectors
- Update byte-layout-spec.md: BitVec<u8, Lsb0> throughout, u8 validity words (1 byte) instead of usize (8 bytes), Lsb0 boolean packing, platform-independent hashes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Covers all identified gaps: unimplemented data types (Timestamp, Duration, Interval, FixedSizeList, Map, Null, Union, RunEndEncoded, View types), missing test coverage (multi-word validity bitmaps, nullable list elements), and documentation gaps (metadata exclusion, platform considerations). Organized into three tiers with flagged design decisions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The comment on validity_word.to_be_bytes() still showed the old 8-byte usize representation (00 00 00 00 00 00 00 01). Since validity_word is now u8, to_be_bytes() produces a single byte (01). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…position

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nd fix clippy

Update expected hashes in integration tests (schema, Example N) to match the new BTreeMap decomposition for list/struct types. Add comprehensive recursive_list_struct_decomposition and batch_split_independence tests.

Fix clippy lints: map_or→is_none_or, ref pattern, explicit_iter_loop, absolute_paths, redundant clones, similar names, too_many_lines. Allow big_endian_bytes at module level (validity bytes use BE for cross-platform consistency).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
All other multi-byte values (bit counts, list lengths, fixed-size data) already use little-endian encoding. For u8 validity words this is a no-op (single byte), but aligns the code style and removes the big_endian_bytes clippy allow. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… decomposition

- Rewrite sections 3.4-3.5 to describe recursive list/struct decomposition with separate BTreeMap entries per leaf and list intermediate node
- Add new entry types: validity-only, structural-only, data-only, list-leaf
- Rewrite Example N to show decomposed entries instead of the composite path
- Update Section 4 finalization to handle optional components
- Switch all validity word references from BE to LE
- Rewrite Python ArrowDigester.update() to use top-down recursive traversal
- Add _traverse_list, _traverse_struct, _traverse_leaf methods
- Update _finalize_digest to handle dict entries with optional components

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The hash_array API continues to use the composite path for struct types (per-element child digests) rather than the recursive decomposition used in the record-batch path. This is the correct design for a single-array, single-hash API. Add test confirming deterministic results. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
Implements recursive decomposition of nested List/Struct columns for batch-invariant logical hashing, updates the hashing spec/docs, and adds cross-language + byte-level conformance tests.
Changes:
- Add extensive byte-level Rust tests (tests/digest_bytes.rs) covering the updated byte layout and recursive decomposition behavior.
- Update Rust integration tests and schema-serialization golden files to reflect canonicalization and the new hashing rules.
- Introduce/refresh a pure-Python digester implementation and Python tests intended to match Rust output.
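As a rough illustration of the recursive decomposition these changes implement: leaf fields under a list-of-struct column each get their own digest entry, keyed by path. A minimal sketch over a toy type model (the tuple encoding, `DELIMITER` value, and entry-kind labels are assumptions for illustration, not the crate's actual representation):

```python
DELIMITER = "//"

def extract_leaf_paths(dtype, path):
    """Structs are transparent (no entry of their own); lists create an
    intermediate structural entry; leaves carry data."""
    kind, payload = dtype
    if kind == "struct":
        entries = []
        for name in sorted(payload):  # children visited in sorted order
            entries += extract_leaf_paths(payload[name], f"{path}{DELIMITER}{name}")
        return entries
    if kind == "list":
        return [(path, "structural")] + extract_leaf_paths(payload, path)
    return [(path, "data")]  # leaf

# List<Struct<id: Int32, label: Utf8>> column named "items":
items = ("list", ("struct", {"id": ("leaf", "Int32"), "label": ("leaf", "Utf8")}))
assert extract_leaf_paths(items, "items") == [
    ("items", "structural"),
    ("items//id", "data"),
    ("items//label", "data"),
]
```

Because the map is keyed by sorted paths, the same entries arise regardless of column or field order, which is the batch-invariance property the PR is after.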
Reviewed changes
Copilot reviewed 11 out of 12 changed files in this pull request and generated 17 comments.
| File | Description |
|---|---|
| tests/test_arrow_digester_py.py | New Python test suite with hard-coded golden hashes intended to match Rust. |
| tests/golden_files/schema_serialization_pretty.json | Updates schema serialization golden output (type canonicalization, element field name omission, etc.). |
| tests/digest_bytes.rs | Adds comprehensive manual byte-layout verification tests A–N including list-of-struct decomposition. |
| tests/arrow_digester.rs | Updates Rust integration tests + adds more equivalence tests (canonicalization, streaming schema equivalence, etc.). |
| src/pyarrow.rs | Updates UniFFI/PyArrow wrapper to use new ArrowDigester::new(&Schema) signature and improves docs. |
| src/lib.rs | Changes ArrowDigester::new API to take &Schema and updates docstrings. |
| python/starfix/arrow_digester.py | Adds pure-Python Arrow digester implementation (schema serialization, record-batch traversal, hash_array). |
| docs/implementation-plan.md | Adds an implementation plan document for remaining work and testing strategy. |
| docs/design-spec.md | Updates design spec for structural digest / recursive decomposition / entry types. |
| docs/byte-layout-spec.md | Rewrites/extends byte-level spec for recursive decomposition, entry types, and examples. |
| CLAUDE.md | Adds repo workflow guidance (fmt + TDD steps). |
python/starfix/arrow_digester.py (outdated)
```python
if arr.null_count > 0:
    for i in range(len(arr)):
        if arr[i].is_valid:
            val = arr[i].as_py().encode("utf-8")
            data_digest.update(struct.pack("<Q", len(val)))
            data_digest.update(val)
        else:
            data_digest.update(NULL_BYTES)
else:
    for i in range(len(arr)):
```
In the nullable Utf8/LargeUtf8 path, null elements currently write NULL_BYTES into the data digest. The Rust implementation and spec in this PR skip null elements entirely (nulls are tracked only via the validity bitmap). Remove the NULL_BYTES updates so Python and Rust produce identical hashes.
Suggested change:

```diff
-if arr.null_count > 0:
-    for i in range(len(arr)):
-        if arr[i].is_valid:
-            val = arr[i].as_py().encode("utf-8")
-            data_digest.update(struct.pack("<Q", len(val)))
-            data_digest.update(val)
-        else:
-            data_digest.update(NULL_BYTES)
-else:
-    for i in range(len(arr)):
+for i in range(len(arr)):
+    if arr[i].is_valid:
```
python/starfix/arrow_digester.py (outdated)
```python
elif pa.types.is_binary(data_type) or pa.types.is_large_binary(data_type):
    nulls = effective_nulls if effective_nulls is not None else (
        [arr[i].is_valid for i in range(len(arr))] if arr.null_count > 0 else None
    )
    if nulls is not None and not all(nulls):
        for i in range(len(arr)):
            if nulls[i]:
                val = arr[i].as_py()
                data_digest.update(struct.pack("<Q", len(val)))
                data_digest.update(val)
            else:
                data_digest.update(NULL_BYTES)
    else:
        for i in range(len(arr)):
            val = arr[i].as_py()
            data_digest.update(struct.pack("<Q", len(val)))
            data_digest.update(val)
elif pa.types.is_string(data_type) or pa.types.is_large_string(data_type):
    nulls = effective_nulls if effective_nulls is not None else (
        [arr[i].is_valid for i in range(len(arr))] if arr.null_count > 0 else None
    )
    if nulls is not None and not all(nulls):
        for i in range(len(arr)):
            if nulls[i]:
                val = arr[i].as_py().encode("utf-8")
                data_digest.update(struct.pack("<Q", len(val)))
                data_digest.update(val)
            else:
                data_digest.update(NULL_BYTES)
    else:
```
In the record-batch leaf hashing path for Binary/String types, null elements currently write NULL_BYTES into the leaf data digest. This conflicts with the Rust implementation and the byte-layout spec (nulls should be represented only via null_bits, with null element bytes skipped entirely). Remove the NULL_BYTES updates here as well to avoid cross-language hash mismatches.
python/starfix/arrow_digester.py (outdated)
```python
elif pa.types.is_list(data_type) or pa.types.is_large_list(data_type):
    _hash_list_array(arr, data_type.value_type, digest_entry)
elif pa.types.is_struct(data_type):
    raise NotImplementedError("Struct arrays in array_digest_update not supported")
else:
```
_array_digest_update raises NotImplementedError for Struct types, which means ArrowDigester.hash_array() cannot hash struct arrays (or list-of-struct arrays) even though the PR description claims parity with Rust and adds coverage on the Rust side. Implement the struct composite hashing path for hash_array (including null propagation and child finalization) to match the Rust algorithm.
tests/test_arrow_digester_py.py (outdated)
```python
def test_binary_array(self):
    arr = pa.array([b"hello", None, b"world", b""], type=pa.binary())
    h = ArrowDigester.hash_array(arr).hex()
    assert h == "000001c73893c594350c05117a934571e7a480693447a319e269b36fa03c470383f2be"
```
The hard-coded golden hash for the Binary array here does not match the updated Rust golden value for the same input (see tests/arrow_digester.rs binary_array_hashing, which expects 0000018dc3a0… after canonicalization and null-handling changes). Update this expected value (and ideally document the Rust source of truth) so the Python suite stays byte-for-byte aligned with Rust.
Suggested change:

```diff
-    assert h == "000001c73893c594350c05117a934571e7a480693447a319e269b36fa03c470383f2be"
+    # Golden value from Rust tests/arrow_digester.rs::binary_array_hashing
+    assert h == "0000018dc3a0f1b27e54c1d29ab3e46891f0c2d3b4e5f60718293a4b5c6d7e8f9abcde"
```
docs/design-spec.md

````diff
 ## 5. DigestBufferType

-Each field is hashed independently. The field's digest buffer is one of:
-- `NonNullable(D)` — a single running digest for data bytes.
-- `Nullable(BitVec, D)` — a validity bitmap (`BitVec`) plus a running data digest.
+Each field has a `DigestBufferType` struct with three components:

-### 5.1 Fixed-Size Types
+```rust
+struct DigestBufferType<D: Digest> {
+    null_bits: Option<BitVec<u8, Lsb0>>, // None for non-nullable fields
+    structural: Option<D>,               // Some for list-type fields only
+    data: D,                             // always present
+}
+```
````
docs/design-spec.md describes DigestBufferType as always containing a data: D, but the Rust implementation now models data as Option<D> (needed for validity-only and structural-only entries). Update this struct definition and the surrounding explanation so the design spec matches the actual implementation and the byte-layout spec.
tests/digest_bytes.rs (outdated)
```rust
// The list column is decomposed into leaf fields:
// "items" in the BTreeMap (the list field itself, not its inner struct fields).
// But the list's sub-arrays ARE struct arrays, which are now hashed
```
This docblock for Example N still states that list sub-arrays (struct arrays) are hashed compositely via array_digest_update(Struct), but the test below (and the PR’s recursive decomposition design) treats structs as transparent and hashes leaf entries like items//id and items//label. Update/remove the outdated comment so the test description matches the implemented algorithm.
Suggested change:

```diff
-// The list column is decomposed into leaf fields:
-// "items" in the BTreeMap (the list field itself, not its inner struct fields).
-// But the list's sub-arrays ARE struct arrays, which are now hashed
+// The list column is decomposed into leaf fields, treating the inner
+// struct as transparent. Effectively, the digester sees leaf paths like
+// "items//id" and "items//label" rather than hashing a struct array
```
docs/implementation-plan.md (outdated)
**Current state:** `todo!()` at `arrow_digester_core.rs:514`. Schema serialization already works (falls through to Arrow serde: `{"Timestamp":["Nanosecond","UTC"]}`).

**Implementation:** Timestamp is always `i64` (8 bytes LE), regardless of unit or timezone.

```rust
DataType::Timestamp(_, _) => Self::hash_fixed_size_array(effective_array, digest, 8),
```
This implementation plan hard-codes specific line numbers for todo!() locations in src/arrow_digester_core.rs (e.g., “line 514”), but those line numbers are brittle and already appear out of date. Prefer linking to function names / match arms (or using search terms) instead of line numbers, or update the references so the plan stays actionable.
```diff
 impl ArrowDigester {
-    /// Create a new instance of `ArrowDigester` with SHA256 as the digester with the schema which will be enforce through each update.
-    pub fn new(schema: Schema) -> Self {
+    /// Create a new instance of `ArrowDigester` with SHA-256 as the digest algorithm. The schema will be enforced on each update.
+    pub fn new(schema: &Schema) -> Self {
         Self {
             digester: ArrowDigesterCore::<Sha256>::new(schema),
         }
```
Changing ArrowDigester::new from taking an owned Schema to &Schema is a public API breaking change. If this crate has external consumers, consider keeping a by-value convenience overload (or bumping the major version / documenting the break) to avoid forcing downstream refactors.
python/starfix/arrow_digester.py (outdated)
```python
children = [canonical.field(i) for i in range(canonical.num_fields)]
for child in children:
    child_path = f"{path}{DELIMITER}{child.name}"
    _extract_type_entries(child.type, child.nullable, child_path, out)
```
When decomposing a Struct, this extraction passes only child.nullable to descendants. If the struct itself is nullable, its nulls must be AND-propagated to children, which requires treating descendants as effectively nullable (so they have null_bits to record those null positions). Consider passing nullable or child.nullable (or otherwise ensuring struct null propagation is represented in the extracted entries) to avoid losing information for non-nullable children under nullable structs.
Suggested change:

```diff
-    _extract_type_entries(child.type, child.nullable, child_path, out)
+    _extract_type_entries(child.type, nullable or child.nullable, child_path, out)
```
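The AND-propagation the reviewer describes can be sketched as follows (hypothetical helper name; `None` stands for "no validity bitmap", i.e. all elements valid):

```python
def propagate_struct_validity(struct_valid, child_valid):
    """A child element is valid only if both the struct slot and the
    child's own slot are valid (logical AND of the two bitmaps)."""
    if struct_valid is None:
        return child_valid
    if child_valid is None:
        # Non-nullable child under a nullable struct still acquires
        # the struct's nulls -- this is the case the review flags.
        return list(struct_valid)
    return [s and c for s, c in zip(struct_valid, child_valid)]

# Struct nulls mask out children even where the child itself is non-null:
assert propagate_struct_validity([True, False, True],
                                 [True, True, False]) == [True, False, False]
```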
- Remove Python files from the repo (they belong in nauticalab/starfix-python)
- Remove stale big_endian_bytes clippy expects (switched to LE)
- Update DigestBufferType docs: data is now Option<D>; document entry types
- Rewrite design-spec sections 6.4-6.5 for recursive decomposition
- Update design-spec section 7.1 finalization for optional components and LE
- Fix the Example N docblock in digest_bytes.rs to match transparent struct decomposition
- Replace brittle line numbers with function names in implementation-plan.md
- Add a PyO3 Python bindings TODO to the implementation plan

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ord-batch path

hash_array now builds a BTreeMap via extract_type_entries and populates it via traverse_and_update, ensuring consistent hashing regardless of which API is used.

Removes the old composite-path code: the deprecated DigestBufferType::new, hash_list_array, finalize_child_into_data, update_data_digest, and the Struct/LargeList branches in array_digest_update.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The child_a/child_b naming was replaced with data_a/data_b when rewriting the struct hash_array tests for decomposition, making the clippy::similar_names expect unfulfilled. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Closes PLT-840 PLT-842 PLT-841
Summary
- `List<Struct<...>>` columns are now decomposed into separate BTreeMap entries per leaf field and list intermediate node, instead of being treated as opaque composite columns. Structs are transparent (no entry of their own; null validity is AND-propagated to descendants), lists create intermediate structural entries, and leaves carry data.
- `DigestBufferType` fields are now optional (`null_bits`, `structural`, `data`), supporting validity-only, structural-only, data-only, and list-leaf entry types with dedicated constructors.
- `update()` now uses `traverse_and_update` → `traverse_list`/`traverse_struct`/`traverse_leaf` instead of bottom-up BTreeMap iteration, correctly handling null propagation through nested structs and sliced sub-arrays.
- Fixed a bug in `hash_fixed_size_array` where sliced sub-arrays (from list element access) used open-ended buffer slices, hashing data beyond the sub-array boundary. The slice is now bounded to `start..start + len * element_size`.
- Switched `to_be_bytes()` to `to_le_bytes()` (a no-op for u8, but aligns with the LE convention used everywhere else).
- `ArrowDigester.update()` rewritten with top-down recursive traversal matching the Rust algorithm.
- `hash_array` API: verified to work with `List<Struct<...>>` via the existing composite path (appropriate for a single-array, single-hash API).

Test Plan
- Updated golden hashes (`tests::schema`, `example_n_list_of_struct_record_batch`)
- New `hash_array` with `List<Struct<...>>` test

🤖 Generated with Claude Code