-
Notifications
You must be signed in to change notification settings - Fork 9
chore(silo-import): upgrade to SILO 0.8, adapt to the new input data format in the import job #5733
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…ease time waiting for server to respond
674a9a4 to
7a8e6bf
Compare
cac2198 to
bef20a2
Compare
5d151f3 to
b78b800
Compare
Code Review: SILO 0.8 Upgrade and Data Format TransformationI've reviewed this PR and have the following feedback: ✅ Positive Aspects
🔴 Critical Issues1. Error Handling - Generic Exception (High Priority)# silo-import/src/silo_import/transformer.py:35
raise BaseException(msg) from eProblem: Raising Recommendation: Create a specific exception class or use a standard exception: class TransformationError(Exception):
"""Raised when data transformation fails"""
pass
# Then in transformer.py:
raise TransformationError(msg) from e2. Missing Error Handling for Transformation (High Priority)The transformation happens in Recommendation: Add proper error handling: # In download_manager.py around line 191-192
try:
transform_data_format(data_path, transformed_path)
except Exception as exc:
logger.error("Data transformation failed: %s", exc)
safe_remove(download_dir)
raise TransformationError from exc3. Missing stdout Capture (Medium Priority)# silo-import/src/silo_import/transformer.py:24-31
subprocess.run(
cmd,
shell=True,
check=True,
executable="/bin/bash",
stderr=subprocess.PIPE, # ✓ captured
text=True,
)Problem: Only Recommendation: Add 4. Missing Binary Verification (Medium Priority)The code assumes Recommendation: Add a check or better error message: def transform_data_format(data_path, transformed_path):
# Check if binary exists
if not shutil.which('legacy-ndjson-transformer'):
raise TransformationError(
"legacy-ndjson-transformer binary not found in PATH"
)
# ... rest of function
|
fengelniederhammer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you check how much faster this actually is? I'm a little sceptical about this change.
Pro:
- Reuses the existing implementation from SILO, avoiding duplication
Con:
- This reads the NDJSON now twice:
- first in:
# Decompress and analyze the data try: analysis = analyze_ndjson(data_path) - then in the transform script
- first in:
- The duplication is only a single function (something like 30 lines of code plus a unit test)
- This adds complexity (integrating Rust into the Python container, using a syscall instead of a native Python function).
I do see the benefits, but I'm not sure whether it's worth it here. What do the others think?
|
see my comments here: #5683 (comment) - we should test both versions on staging to see the influence on performance |
resolves #4868
Alternative to #5683
Reuses the Rust binary to transform data into the format required by SILO: https://github.com/GenSpectrum/LAPIS-SILO/tree/main/tools/legacyNdjsonTransformer.
The rust transformer requires certain fields in the input format such as
alignedAminoAcidSequences- these will be returned by the get-released-data endpoint so I decided it was best to just update the tests to include these fields. This also makes our tests more realistic.Screenshot
PR Checklist
🚀 Preview: Add
previewlabel to enable