
Conversation

@anna-parker
Contributor

@anna-parker anna-parker commented Dec 19, 2025

…d only check pipeline version of first record

partially resolves #5758


PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by appropriate, automated tests.
  • Any manual testing that has been done is documented (i.e. what exactly was tested?)

🚀 Preview: Add preview label to enable

@anna-parker anna-parker added the preview Triggers a deployment to argocd label Dec 19, 2025
@anna-parker anna-parker marked this pull request as ready for review December 19, 2025 06:57
@anna-parker anna-parker requested review from corneliusroemer and theosanderson and removed request for corneliusroemer December 19, 2025 06:57
@claude
Contributor

claude bot commented Dec 19, 2025

Code Review

Thank you for this performance optimization! The switch to orjsonl is a good approach to address the performance concerns in issue #5758. Here are my findings:

✅ Strengths

  1. Performance improvement: Using a specialized NDJSON parser (orjsonl) instead of manual line-by-line parsing should significantly improve performance
  2. Code simplification: The new implementation is cleaner with ~25 fewer lines of code
  3. Early exit optimization: Only parsing the first record's pipeline version (instead of collecting all versions) is a smart optimization that further improves performance

🐛 Critical Issues

1. Missing zstd decompression capability verification

The PR removes zstandard imports and direct decompression logic, relying entirely on orjsonl.stream(). However:

  • The docstring still says "Decompress and analyze a zstd-compressed NDJSON file" (line 24)
  • The file paths suggest data is zstd-compressed (.ndjson.zst extension mentioned in README)
  • Action needed: Verify that orjsonl.stream() can handle zstd-compressed files directly (see the test sketch after this list). If not, this is a breaking change that will cause runtime failures.
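
A quick check along these lines could settle it (a sketch: it assumes the zstandard package is available to write the fixture, and that orjsonl's compressed-file support covers .zst, which is exactly what needs verifying):

# Sketch: confirm orjsonl.stream() transparently reads .ndjson.zst files.
import orjsonl
import zstandard

def test_stream_reads_zstd_ndjson(tmp_path):
    path = tmp_path / "records.ndjson.zst"
    raw = b'{"metadata": {"pipelineVersion": 1}}\n{"metadata": {"pipelineVersion": 1}}\n'
    path.write_bytes(zstandard.ZstdCompressor().compress(raw))

    records = list(orjsonl.stream(str(path)))
    assert len(records) == 2
    assert records[0]["metadata"]["pipelineVersion"] == 1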

2. Overly broad exception handling

except Exception as exc:  # Line 44
    msg = f"Failed to decompress {path}: {exc}"
  • Previously caught specific zstandard.ZstdError and json.JSONDecodeError exceptions
  • Now catches all exceptions, which could mask bugs (e.g., AttributeError, TypeError)
  • Error message still says "Failed to decompress" even though decompression might not be the issue
  • Recommendation: Use more specific exception types from orjsonl if available, or at least update the error message to cover both reading and parsing (see the sketch after this list)
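
A narrower handler might look like this (a sketch; orjsonl's exact error surface should be checked, but orjson decode errors subclass ValueError, and I/O or decompression failures surface as OSError):

import orjsonl

def analyze(path: str) -> None:
    try:
        for record in orjsonl.stream(path):
            ...  # analysis elided
    except (ValueError, OSError) as exc:
        # ValueError covers malformed JSON, OSError covers failed reads or
        # decompression, while bugs like AttributeError still propagate.
        msg = f"Failed to read or parse {path}: {exc}"
        raise RuntimeError(msg) from exc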

3. Breaking API change without version bump

The NdjsonAnalysis dataclass changed from:

pipeline_versions: set[int]  # Old
pipeline_version: int | None  # New

This is a breaking change to the data structure. While all internal usages are updated, consider:

  • Whether this could affect any external consumers
  • Adding a migration note if this is a public API, or a compatibility shim like the sketch after this list
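
If external consumers do exist, one low-cost option is a read-only shim that keeps the old field name working (a sketch; any other fields of NdjsonAnalysis are omitted here):

from dataclasses import dataclass

@dataclass
class NdjsonAnalysis:
    pipeline_version: int | None

    @property
    def pipeline_versions(self) -> set[int]:
        # Backwards-compatible view matching the old set-valued field.
        return set() if self.pipeline_version is None else {self.pipeline_version}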

4. Semantic change in pipeline version handling

The old code checked if pipeline_version before adding to the set (line 29 in the diff), filtering out falsy values. The new code sets pipeline_version = record.get("metadata", {}).get("pipelineVersion"), which could be:

  • None (missing key)
  • 0 (valid integer)
  • Empty string
  • Other falsy values

The lineage.py check if not pipeline_version: (line 24) will incorrectly reject pipeline version 0 if that's ever valid.

  • Recommendation: Use explicit is None checks instead of truthiness checks (see the sketch after this list)
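
For instance, the lineage.py guard could become (a sketch; the helper name and message are illustrative):

def require_pipeline_version(pipeline_version: int | None) -> int:
    # `if not pipeline_version` would also reject a legitimate version 0;
    # only a missing value should be treated as an error.
    if pipeline_version is None:
        msg = "first record has no pipelineVersion"
        raise RuntimeError(msg)
    return pipeline_version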

⚠️ Potential Issues

5. Type safety concern

pipeline_version = record.get("metadata", {}).get("pipelineVersion")  # type: ignore

The # type: ignore comment suppresses type checking. This could hide issues:

  • record might not be a dict (depends on orjsonl's return type)
  • pipelineVersion might be a string "1" instead of int 1
  • Recommendation: Add explicit type conversion: int(version) if (version := ...) is not None else None (expanded in the sketch after this list)
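
Expanded into a helper, that might look like this (a sketch; the function name is hypothetical, and it deliberately tolerates non-dict records and string-encoded versions such as "1"):

def extract_pipeline_version(record: object) -> int | None:
    # Validate the record's shape instead of suppressing the type checker.
    if not isinstance(record, dict):
        return None
    metadata = record.get("metadata")
    if not isinstance(metadata, dict):
        return None
    version = metadata.get("pipelineVersion")
    # int("1") normalizes string-encoded versions; None stays None.
    return int(version) if version is not None else None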

6. Removed validation logic

The old code explicitly:

  • Checked for multiple pipeline versions and raised an error
  • Validated JSON on each line
  • Skipped empty lines

The new code:

  • Only reads the first record's pipeline version
  • Assumes all records have the same version (which might be fine, but should be documented)
  • Question: Is it intentional to assume all records share the same pipeline version? If so, this should be documented in the README or as a code comment (a sketch making the assumption explicit follows this list).
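
If the single-version assumption is intentional, the first-record read can be made explicit and self-documenting (a sketch; the function name is illustrative):

import orjsonl

def first_pipeline_version(path: str) -> int | None:
    # Assumption: all records in a file share one pipeline version,
    # so only the first record needs to be parsed. Returning from the
    # loop stops the generator before the rest of the file is read.
    for record in orjsonl.stream(path):
        return record.get("metadata", {}).get("pipelineVersion")
    return None  # empty file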

7. Error message mismatch

Line 46 still says "Failed to decompress" but the function now does both decompression and parsing, and the error could come from either step.

🧪 Testing Concerns

8. No new tests for orjsonl integration

  • The PR adds a new dependency but no tests verifying it works correctly
  • No tests for error handling with the new library
  • Recommendation: Add tests for:
    • Files with missing pipelineVersion
    • Files where first record has no pipelineVersion but later records do
    • Corrupted/invalid NDJSON data
    • Verifying zstd decompression works

9. Test data format assumption

All existing tests use records like {"metadata": {"pipelineVersion": "1"}} (string "1"), but the code appears to expect integers. This mismatch should be verified.

📝 Documentation Gaps

  1. Missing PR checklist completion
  • Documentation has not been adapted (README still references validation behavior that changed)
  • No mention of automated tests covering the feature
  • No manual testing documentation
  2. Incomplete commit messages
  • "testing" and "ignore" are not descriptive commit messages
  • Consider squashing commits before merge

🎯 Recommendations

Before merging:

  1. ✅ Verify orjsonl.stream() handles zstd-compressed files (add test or documentation)
  2. ✅ Use more specific exception handling or update error messages
  3. ✅ Add tests covering the new code path
  4. ✅ Fix type safety (remove # type: ignore, add proper type conversion)
  5. ✅ Use is None instead of truthiness checks for pipeline version
  6. ✅ Update README if the "multiple pipeline versions" validation is intentionally removed
  7. ✅ Clean up commit messages

Nice to have:

  • Add a benchmark or timing comparison showing the performance improvement
  • Consider logging when pipeline version is missing from the first record
  • Document the assumption that all records share the same pipeline version

Summary

The performance optimization is valuable, but there are several critical issues (especially around zstd decompression and error handling) that should be addressed before merging. The semantic changes to pipeline version handling need careful review to ensure they don't introduce bugs.

Contributor

@corneliusroemer corneliusroemer left a comment


Is orjsonl as fast as jq? If not, let's use jq to validate that the JSON is valid.

All we want to check: is the JSON valid, what's the pipeline version in the first record, and what's the line count.

raise RuntimeError(msg) from exc
for record in orjsonl.stream(path):
    record_count += 1
    if pipeline_version is None:
Contributor


That's not helping efficiency; we should count lines using a dedicated tool that's fast, like zstdcat and wc.

And parse the pipeline version just by reading the first line.
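
A sketch of that division of labor (it assumes zstdcat and wc are on PATH; names and error handling are illustrative):

import json
import subprocess

def count_lines(path: str) -> int:
    # zstdcat decompresses to stdout; wc -l counts the newlines.
    zstdcat = subprocess.Popen(["zstdcat", path], stdout=subprocess.PIPE)
    wc = subprocess.run(
        ["wc", "-l"], stdin=zstdcat.stdout, capture_output=True, check=True
    )
    zstdcat.wait()
    return int(wc.stdout)

def first_line_pipeline_version(path: str) -> int | None:
    # Read one decompressed line, then stop the decompressor early.
    proc = subprocess.Popen(["zstdcat", path], stdout=subprocess.PIPE)
    try:
        first_line = proc.stdout.readline()
    finally:
        proc.kill()
    if not first_line:
        return None
    return json.loads(first_line).get("metadata", {}).get("pipelineVersion")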

Member


All other things being equal, using subprocess is quite a bit more brittle than doing things in code, and orjsonl is indeed very quick so I'm a bit uncertain here

Member


(but yeah, sorry, agree we don't actually need to parse all the JSON at all)

Contributor Author


I agree that this is definitely an improvement over the current state, but potentially a subprocess with jq is more efficient.

@anna-parker anna-parker removed the preview Triggers a deployment to argocd label Dec 19, 2025