Skip to content

Fix multi-source convert path collision (#442)#444

Open
kenibrewer wants to merge 3 commits intomainfrom
kenibrewer/issue-442
Open

Fix multi-source convert path collision (#442)#444
kenibrewer wants to merge 3 commits intomainfrom
kenibrewer/issue-442

Conversation

@kenibrewer
Copy link
Copy Markdown
Member

@kenibrewer kenibrewer commented Apr 29, 2026

Description

Fixes #442. When cytotable.convert() was called with multiple per-source subdirectories that share a parent directory name (e.g. analyses/{1,2,3}/analysis/), _source_pageset_to_parquet wrote each source's intermediate parquet to the same path, causing ArrowInvalid (concurrent write/read race) or FileNotFoundError (later writer overwrites). The fix inserts a short, stable SHA-1 hash of the source's parent path into the intermediate directory, guaranteeing per-source uniqueness; cleanup in _concat_source_group is extended to walk up the additional level. A regression test (test_convert_multi_source_colliding_parent_dir_names) reproduces the original failure on main and passes with the fix.

What is the nature of your change?

  • Bug fix (fixes an issue).

Checklist

  • My code follows the style guidelines of this project.
  • I have performed a self-review of my own code.
  • My changes generate no new warnings.
  • New and existing unit tests pass locally with my changes.
  • I have added tests that prove my fix is effective or that my feature works.

Summary by CodeRabbit

  • Bug Fixes

    • Improved handling of multi-source conversions with identical parent directory names, ensuring correct output file generation and cleanup.
  • Tests

    • Added regression test for multi-source conversion edge case.

Sibling sources whose parent directories share a name (e.g.
analyses/{1,2,3}/analysis) wrote to identical intermediate parquet paths,
causing ArrowInvalid (race) or FileNotFoundError (overwrite). Add a stable
SHA-1 hash of the source's parent path to the intermediate directory, and
extend cleanup in _concat_source_group to remove the new dir and its parent.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 29, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR adds SHA-1 hash-based path discrimination to parquet exports for multi-source conversions with colliding parent directory names, expands directory cleanup after concatenation, bumps mypy to v1.20.2, and introduces a regression test validating the multi-site scenario.

Changes

Multi-source collision handling

Layer / File(s) Summary
Pre-commit configuration
.pre-commit-config.yaml
mypy version pin bumped from v1.20.1 to v1.20.2.
Hash-based path discrimination
cytotable/convert.py
_source_pageset_to_parquet now computes a SHA-1 hash from source parent directory and incorporates it into destination path to ensure uniqueness when parent names collide.
Expanded directory cleanup
cytotable/convert.py
_concat_source_group cleanup logic now removes both the chunk parent directory and its parent directory after concatenation, with improved error handling.
Regression test
tests/test_convert_threaded.py
New test_convert_multi_source_colliding_parent_dir_names creates a multi-site structure with identical parent directories, runs multi-source conversion with join=False, and validates output compartment keys and parquet row counts scale to 3× single-site output.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 Three sites with matching names did gleam,
But collisions caused the parquet dream to scream!
With hashlib's magic hash so true,
We've tamed the chaos—hooray—brand new! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Fix multi-source convert path collision (#442)' clearly and concisely summarizes the main change: fixing a path collision bug in multi-source convert operations, with specific reference to the issue number.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch kenibrewer/issue-442

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@kenibrewer kenibrewer marked this pull request as draft April 29, 2026 16:24
pre-commit-ci-lite Bot and others added 2 commits April 29, 2026 16:25
- Mark sha1 path-discriminator hash as usedforsecurity=False (bandit B324).
- Narrow convert() return type with isinstance(dict) asserts in the new
  multi-source regression test so it passes mypy (test_convert.py is
  mypy-excluded; test_convert_threaded.py is not).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@kenibrewer kenibrewer marked this pull request as ready for review May 7, 2026 16:08
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.pre-commit-config.yaml:
- Line 66: The .pre-commit-config.yaml specifies an invalid mypy rev v1.20.2;
update the rev for the mypy hook to a valid release (e.g., change the rev value
from "v1.20.2" to "v1.20.1" or "v1.20.0") so the pre-commit hook points to an
existing mypy release.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c49fdcfb-7c00-4a65-814f-29ddea900389

📥 Commits

Reviewing files that changed from the base of the PR and between a957113 and 91d1d75.

📒 Files selected for processing (3)
  • .pre-commit-config.yaml
  • cytotable/convert.py
  • tests/test_convert_threaded.py

Comment thread .pre-commit-config.yaml
- id: vulture
- repo: https://github.com/pre-commit/mirrors-mypy
rev: v1.20.1
rev: v1.20.2
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

mypy version 1.20.2 release notes or changelog

💡 Result:

No evidence found for a mypy version 1.20.2 release as of 2026-05-07. The latest versions are mypy 1.20.0 (released March 31, 2026 [1][2]) and 1.20.1 (April 2026 [3][4][5]). GitHub tags confirm up to v1.20.0 [1], no v1.20.2 tag or PyPI page [2]. Changelog at v1.20.1 includes 1.20.1 fixes like SQLite cache sync disable and narrowing fixes [3][4]. Official release notes for 1.20.0 cover better type narrowing, Python 3.9 support drop, SQLite cache default, local partial types improvements, and Mypy 2.0 plans [6][7]. For full 1.20 details, see https://github.com/python/mypy/blob/v1.20.1/CHANGELOG.md [3] or https://mypy.readthedocs.io/en/stable/changelog.html [7].

Citations:


Change mypy version to v1.20.1 or v1.20.0.

mypy v1.20.2 does not exist. As of May 2026, the latest available versions are v1.20.1 (April 2026) and v1.20.0 (March 2026). Update the revision to one of these releases.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.pre-commit-config.yaml at line 66, The .pre-commit-config.yaml specifies an
invalid mypy rev v1.20.2; update the rev for the mypy hook to a valid release
(e.g., change the rev value from "v1.20.2" to "v1.20.1" or "v1.20.0") so the
pre-commit hook points to an existing mypy release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Multi-source convert(): intermediate parquet paths collide across sources, causing write/read race (ArrowInvalid / FileNotFoundError)

1 participant