
Conversation

@anna-parker
Contributor

@anna-parker anna-parker commented Dec 19, 2025

…d only check pipeline version of first record

partially resolves #5758


PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by appropriate, automated tests.
  • Any manual testing that has been done is documented (i.e. what exactly was tested?)

🚀 Preview: Add preview label to enable

@anna-parker anna-parker added the preview Triggers a deployment to argocd label Dec 19, 2025
@anna-parker anna-parker marked this pull request as ready for review December 19, 2025 06:57
@anna-parker anna-parker requested review from corneliusroemer and theosanderson and removed request for corneliusroemer December 19, 2025 06:57
@claude
Contributor

claude bot commented Dec 19, 2025

Code Review

Thank you for this performance optimization! The switch to orjsonl is a good approach to address the performance concerns in issue #5758. Here are my findings:

✅ Strengths

  1. Performance improvement: Using a specialized NDJSON parser (orjsonl) instead of manual line-by-line parsing should significantly improve performance
  2. Code simplification: The new implementation is cleaner with ~25 fewer lines of code
  3. Early exit optimization: Only parsing the first record's pipeline version (instead of collecting all versions) is a smart optimization that further improves performance

🐛 Critical Issues

1. Missing zstd decompression capability verification

The PR removes zstandard imports and direct decompression logic, relying entirely on orjsonl.stream(). However:

  • The docstring still says "Decompress and analyze a zstd-compressed NDJSON file" (line 24)
  • The file paths suggest data is zstd-compressed (.ndjson.zst extension mentioned in README)
  • Action needed: Verify that orjsonl.stream() can handle zstd-compressed files directly (see the test sketch after this list). If not, this is a breaking change that will cause runtime failures.
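
A quick check along these lines could settle it (a sketch: it assumes the zstandard package is available to write the fixture, and that orjsonl's compressed-file support covers .zst, which is exactly what needs verifying):

# Sketch: confirm orjsonl.stream() transparently reads .ndjson.zst files.
import orjsonl
import zstandard

def test_stream_reads_zstd_ndjson(tmp_path):
    path = tmp_path / "records.ndjson.zst"
    raw = b'{"metadata": {"pipelineVersion": 1}}\n{"metadata": {"pipelineVersion": 1}}\n'
    path.write_bytes(zstandard.ZstdCompressor().compress(raw))

    records = list(orjsonl.stream(str(path)))
    assert len(records) == 2
    assert records[0]["metadata"]["pipelineVersion"] == 1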

2. Overly broad exception handling

except Exception as exc:  # Line 44
    msg = f"Failed to decompress {path}: {exc}"
  • Previously caught specific zstandard.ZstdError and json.JSONDecodeError exceptions
  • Now catches all exceptions, which could mask bugs (e.g., AttributeError, TypeError)
  • Error message still says "Failed to decompress" even though decompression might not be the issue
  • Recommendation: Use more specific exception types from orjsonl if available, or at least update the error message to cover both reading and parsing (see the sketch after this list)
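
A narrower handler might look like this (a sketch; orjsonl's exact error surface should be checked, but orjson decode errors subclass ValueError, and I/O or decompression failures surface as OSError):

import orjsonl

def analyze(path: str) -> None:
    try:
        for record in orjsonl.stream(path):
            ...  # analysis elided
    except (ValueError, OSError) as exc:
        # ValueError covers malformed JSON, OSError covers failed reads or
        # decompression, while bugs like AttributeError still propagate.
        msg = f"Failed to read or parse {path}: {exc}"
        raise RuntimeError(msg) from exc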

3. Breaking API change without version bump

The NdjsonAnalysis dataclass changed from:

pipeline_versions: set[int]  # Old
pipeline_version: int | None  # New

This is a breaking change to the data structure. While all internal usages are updated, consider:

  • Whether this could affect any external consumers
  • Adding a migration note if this is a public API, or a compatibility shim like the sketch after this list
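
If external consumers do exist, one low-cost option is a read-only shim that keeps the old field name working (a sketch; any other fields of NdjsonAnalysis are omitted here):

from dataclasses import dataclass

@dataclass
class NdjsonAnalysis:
    pipeline_version: int | None

    @property
    def pipeline_versions(self) -> set[int]:
        # Backwards-compatible view matching the old set-valued field.
        return set() if self.pipeline_version is None else {self.pipeline_version}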

4. Semantic change in pipeline version handling

The old code checked if pipeline_version before adding to the set (line 29 in the diff), filtering out falsy values. The new code sets pipeline_version = record.get("metadata", {}).get("pipelineVersion"), which could be:

  • None (missing key)
  • 0 (valid integer)
  • Empty string
  • Other falsy values

The lineage.py check if not pipeline_version: (line 24) will incorrectly reject pipeline version 0 if that's ever valid.

  • Recommendation: Use explicit is None checks instead of truthiness checks (see the sketch after this list)
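
For instance, the lineage.py guard could become (a sketch; the helper name and message are illustrative):

def require_pipeline_version(pipeline_version: int | None) -> int:
    # `if not pipeline_version` would also reject a legitimate version 0;
    # only a missing value should be treated as an error.
    if pipeline_version is None:
        msg = "first record has no pipelineVersion"
        raise RuntimeError(msg)
    return pipeline_version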

⚠️ Potential Issues

5. Type safety concern

pipeline_version = record.get("metadata", {}).get("pipelineVersion")  # type: ignore

The # type: ignore comment suppresses type checking. This could hide issues:

  • record might not be a dict (depends on orjsonl's return type)
  • pipelineVersion might be a string "1" instead of int 1
  • Recommendation: Add explicit type conversion: int(version) if (version := ...) is not None else None (expanded in the sketch after this list)
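
Expanded into a helper, that might look like this (a sketch; the function name is hypothetical, and it deliberately tolerates non-dict records and string-encoded versions such as "1"):

def extract_pipeline_version(record: object) -> int | None:
    # Validate the record's shape instead of suppressing the type checker.
    if not isinstance(record, dict):
        return None
    metadata = record.get("metadata")
    if not isinstance(metadata, dict):
        return None
    version = metadata.get("pipelineVersion")
    # int("1") normalizes string-encoded versions; None stays None.
    return int(version) if version is not None else None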

6. Removed validation logic

The old code explicitly:

  • Checked for multiple pipeline versions and raised an error
  • Validated JSON on each line
  • Skipped empty lines

The new code:

  • Only reads the first record's pipeline version
  • Assumes all records have the same version (which might be fine, but should be documented)
  • Question: Is it intentional to assume all records share the same pipeline version? If so, this should be documented in the README or as a code comment (a sketch making the assumption explicit follows this list).
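
If the single-version assumption is intentional, the first-record read can be made explicit and self-documenting (a sketch; the function name is illustrative):

import orjsonl

def first_pipeline_version(path: str) -> int | None:
    # Assumption: all records in a file share one pipeline version,
    # so only the first record needs to be parsed. Returning from the
    # loop stops the generator before the rest of the file is read.
    for record in orjsonl.stream(path):
        return record.get("metadata", {}).get("pipelineVersion")
    return None  # empty file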

7. Error message mismatch

Line 46 still says "Failed to decompress" but the function now does both decompression and parsing, and the error could come from either step.

🧪 Testing Concerns

8. No new tests for orjsonl integration

  • The PR adds a new dependency but no tests verifying it works correctly
  • No tests for error handling with the new library
  • Recommendation: Add tests for:
    • Files with missing pipelineVersion
    • Files where first record has no pipelineVersion but later records do
    • Corrupted/invalid NDJSON data
    • Verifying zstd decompression works

9. Test data format assumption

All existing tests use records like {"metadata": {"pipelineVersion": "1"}} (string "1"), but the code appears to expect integers. This mismatch should be verified.

📝 Documentation Gaps

  1. Missing PR checklist completion
  • Documentation has not been adapted (README still references validation behavior that changed)
  • No mention of automated tests covering the feature
  • No manual testing documentation
  2. Incomplete commit messages
  • "testing" and "ignore" are not descriptive commit messages
  • Consider squashing commits before merge

🎯 Recommendations

Before merging:

  1. ✅ Verify orjsonl.stream() handles zstd-compressed files (add test or documentation)
  2. ✅ Use more specific exception handling or update error messages
  3. ✅ Add tests covering the new code path
  4. ✅ Fix type safety (remove # type: ignore, add proper type conversion)
  5. ✅ Use is None instead of truthiness checks for pipeline version
  6. ✅ Update README if the "multiple pipeline versions" validation is intentionally removed
  7. ✅ Clean up commit messages

Nice to have:

  • Add a benchmark or timing comparison showing the performance improvement
  • Consider logging when pipeline version is missing from the first record
  • Document the assumption that all records share the same pipeline version

Summary

The performance optimization is valuable, but there are several critical issues (especially around zstd decompression and error handling) that should be addressed before merging. The semantic changes to pipeline version handling need careful review to ensure they don't introduce bugs.

Contributor

@corneliusroemer corneliusroemer left a comment


Is orjsonl as fast as jq? If not, let's use jq to validate that the JSON is valid.

All we want to check: is the JSON valid, what's the pipeline version in the first record, and what's the line count.

raise RuntimeError(msg) from exc
for record in orjsonl.stream(path):
    record_count += 1
    if pipeline_version is None:
Contributor


That's not helping efficiency; we should count lines using a dedicated tool that's fast, like zstdcat and wc.

And parse the pipeline version just by reading the first line.
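
A sketch of that division of labor (it assumes zstdcat and wc are on PATH; names and error handling are illustrative):

import json
import subprocess

def count_lines(path: str) -> int:
    # zstdcat decompresses to stdout; wc -l counts the newlines.
    zstdcat = subprocess.Popen(["zstdcat", path], stdout=subprocess.PIPE)
    wc = subprocess.run(
        ["wc", "-l"], stdin=zstdcat.stdout, capture_output=True, check=True
    )
    zstdcat.wait()
    return int(wc.stdout)

def first_line_pipeline_version(path: str) -> int | None:
    # Read one decompressed line, then stop the decompressor early.
    proc = subprocess.Popen(["zstdcat", path], stdout=subprocess.PIPE)
    try:
        first_line = proc.stdout.readline()
    finally:
        proc.kill()
    if not first_line:
        return None
    return json.loads(first_line).get("metadata", {}).get("pipelineVersion")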

Member


All other things being equal, using subprocess is quite a bit more brittle than doing things in code, and orjsonl is indeed very quick so I'm a bit uncertain here

Member


(but yeah, sorry, agree we don't actually need to parse all the JSON at all)

Contributor Author


I agree that this is definitely an improvement over the current state, but potentially a subprocess with jq is more efficient.

@anna-parker anna-parker removed the preview Triggers a deployment to argocd label Dec 19, 2025