feat(silo-import): use orjsonl to increase speed of parsing ndjson an… #5764
base: main
Code Review
Thank you for this performance optimization! The switch to
✅ Strengths
🐛 Critical Issues
1. Missing zstd decompression capability verification
The PR removes
2. Overly broad exception handling
except Exception as exc: # Line 44
    msg = f"Failed to decompress {path}: {exc}"
3. Breaking API change without version bump
The
pipeline_versions: set[int] # Old
pipeline_version: int | None # New
This is a breaking change to the data structure. While all internal usages are updated, consider:
4. Semantic change in pipeline version handling
The old code checked
The
corneliusroemer
left a comment
Is orjsonl as fast as jq? If not, let's use jq to validate the JSON.
All we want to check: is the JSON valid, what's the pipeline version in the first field, and what's the number of lines.
    raise RuntimeError(msg) from exc
for record in orjsonl.stream(path):
    record_count += 1
    if pipeline_version is None:
That's not helping efficiency. We should count lines using a dedicated tool that's fast, like zstdcat and wc,
and parse the pipeline version just by reading the first line.
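For illustration, the idea in this comment (count lines cheaply, parse only the first record) can also be sketched in pure Python without a subprocess. This is a rough sketch, not the code in this PR: the function name `summarize_ndjson` and the top-level key `pipelineVersion` are hypothetical, and the stream is assumed to be already decompressed by the caller (for example by a zstd stream reader).

```python
from __future__ import annotations

import io
import json
from typing import BinaryIO


def summarize_ndjson(
    stream: BinaryIO,
    version_key: str = "pipelineVersion",  # hypothetical key name, an assumption
    chunk_size: int = 1 << 20,
) -> tuple[int, int | None]:
    """Return (line_count, pipeline_version) for a decompressed NDJSON stream.

    Only the first line is parsed as JSON. The remainder is scanned for
    newline bytes, so blank lines are counted too (similar to `wc -l`).
    """
    first_line = stream.readline()
    if not first_line.strip():
        return 0, None

    first_record = json.loads(first_line)
    pipeline_version = (
        first_record.get(version_key) if isinstance(first_record, dict) else None
    )

    newlines = first_line.count(b"\n")
    last_byte = first_line[-1:]
    # Scan the rest in fixed-size chunks without decoding or parsing JSON.
    while chunk := stream.read(chunk_size):
        newlines += chunk.count(b"\n")
        last_byte = chunk[-1:]

    # A final line without a trailing newline still counts as a line.
    line_count = newlines + (0 if last_byte == b"\n" else 1)
    return line_count, pipeline_version


sample = b'{"pipelineVersion": 3, "id": "a"}\n{"id": "b"}\n{"id": "c"}'
print(summarize_ndjson(io.BytesIO(sample)))  # (3, 3)
```

Note the trade-off: this does not validate the JSON of any line after the first, so it only covers the counting and version checks, not the validity check discussed above.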
All other things being equal, using subprocess is quite a bit more brittle than doing things in code, and orjsonl is indeed very quick, so I'm a bit uncertain here.
(but yeah, sorry, agree we don't actually need to parse all the JSON at all)
I agree that this is definitely an improvement over the current state - but potentially a subprocess with jq is more efficient
…d only check pipeline version of first record
partially resolves #5758
Screenshot
PR Checklist
🚀 Preview: Add preview label to enable