chore(silo-import): upgrade to SILO 0.8, adapt to the new input data format in the import job #5733

anna-parker · 2025-12-17T10:22:03Z

resolves #4868

Alternative to #5683

Reuses the Rust binary to transform data into the format required by SILO: https://github.com/GenSpectrum/LAPIS-SILO/tree/main/tools/legacyNdjsonTransformer.

The Rust transformer requires certain fields in the input, such as alignedAminoAcidSequences. The get-released-data endpoint returns these fields, so I updated the tests to include them, which also makes our tests more realistic.

Screenshot

PR Checklist

All necessary documentation has been adapted.
The implemented feature is covered by appropriate, automated tests.
Any manual testing that has been done is documented (i.e. what exactly was tested?): I only tested that previews come up with the same number of sequences

🚀 Preview: Add preview label to enable


claude · 2025-12-17T14:53:13Z

Code Review: SILO 0.8 Upgrade and Data Format Transformation

I've reviewed this PR and have the following feedback:

✅ Positive Aspects

Good architectural approach: Reusing the existing Rust transformer from LAPIS-SILO is a smart decision that avoids code duplication and maintenance burden.
Comprehensive test updates: The test suite has been properly updated to include realistic data with all required fields (alignedAminoAcidSequences, nucleotideInsertions, etc.), making tests more representative of production data.
Multi-stage Docker build: The Dockerfile properly uses a multi-stage build to compile the Rust binary separately, keeping the final image clean.
Consistent naming: The separation between DATA_FILENAME (untransformed) and TRANSFORMED_DATA_FILENAME is clear and well-documented.

🔴 Critical Issues

1. Error Handling - Generic Exception (High Priority)

# silo-import/src/silo_import/transformer.py:35
raise BaseException(msg) from e

Problem: Raising BaseException is considered bad practice in Python. It bypasses the usual `except Exception` handlers, and any handler broad enough to catch it would also catch SystemExit and KeyboardInterrupt.

Recommendation: Create a specific exception class or use a standard exception:

class TransformationError(Exception):
    """Raised when data transformation fails"""
    pass

# Then in transformer.py:
raise TransformationError(msg) from e
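
To make the problem concrete, here is a minimal, self-contained sketch of how the two exception types behave under a typical `except Exception` guard. `run_and_report`, `raise_specific`, and `raise_base` are hypothetical helpers written for this illustration and are not part of the PR; `TransformationError` is the name proposed above.

```python
class TransformationError(Exception):
    """Raised when data transformation fails (name taken from the recommendation above)."""


def run_and_report(failing_call) -> str:
    """Mimic a caller that guards a step with the usual `except Exception`."""
    try:
        failing_call()
    except Exception as exc:  # does NOT match a bare BaseException
        return f"handled: {type(exc).__name__}"
    return "no error"


def raise_specific() -> None:
    raise TransformationError("transformer exited with status 1")


def raise_base() -> None:
    raise BaseException("transformer exited with status 1")


# The specific exception is caught by the ordinary handler.
print(run_and_report(raise_specific))  # prints "handled: TransformationError"

# The bare BaseException passes straight through `except Exception`.
try:
    run_and_report(raise_base)
except BaseException as exc:
    print(f"escaped the handler: {type(exc).__name__}")  # prints "escaped the handler: BaseException"
```

Any cleanup code in the import job that relies on `except Exception` would be skipped in the second case, which is why a dedicated `Exception` subclass is the safer choice.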

2. Missing Error Handling for Transformation (High Priority)

The transformation happens in download_manager.py:192 but there's no try-except block around it. If the transformer fails, the download directory cleanup logic may not trigger properly.

Recommendation: Add proper error handling:

# In download_manager.py around line 191-192
try:
    transform_data_format(data_path, transformed_path)
except Exception as exc:
    logger.error("Data transformation failed: %s", exc)
    safe_remove(download_dir)
    raise TransformationError("Data transformation failed") from exc

3. Missing stdout Capture (Medium Priority)

# silo-import/src/silo_import/transformer.py:24-31
subprocess.run(
    cmd,
    shell=True,
    check=True,
    executable="/bin/bash",
    stderr=subprocess.PIPE,  # ✓ captured
    text=True,
)

Problem: Only stderr is captured. If the transformer writes to stdout, it could clutter logs or cause issues.

Recommendation: Add stdout=subprocess.PIPE as well.
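
As a rough illustration of capturing both streams and surfacing stderr in the raised error, the sketch below uses `capture_output=True`, which is shorthand for setting both `stdout` and `stderr` to `subprocess.PIPE`. It is an assumption-laden stand-in rather than the PR's code: `run_captured` is a hypothetical helper, it takes an argument list instead of the `shell=True` command string used in transformer.py, and the Python interpreter plays the role of the transformer.

```python
import subprocess
import sys


class TransformationError(Exception):
    """Hypothetical error type, as proposed in issue 1."""


def run_captured(cmd: list[str]) -> str:
    """Run cmd, capture stdout and stderr, and include stderr in the error on failure."""
    try:
        result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as exc:
        msg = f"command failed with exit code {exc.returncode}: {exc.stderr.strip()}"
        raise TransformationError(msg) from exc
    return result.stdout


# Stand-in command: the Python interpreter instead of the real transformer.
print(run_captured([sys.executable, "-c", "print('ok')"]), end="")
```

Because stdout is returned rather than inherited, the caller decides whether to log it, discard it, or attach it to an error.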

4. Missing Binary Verification (Medium Priority)

The code assumes legacy-ndjson-transformer is available but doesn't verify it before attempting transformation.

Recommendation: Add a check or better error message:

import shutil

def transform_data_format(data_path, transformed_path):
    # Fail early with a clear message if the binary is not on PATH
    if shutil.which("legacy-ndjson-transformer") is None:
        raise TransformationError(
            "legacy-ndjson-transformer binary not found in PATH"
        )
    # ... rest of function

⚠️ Minor Issues

5. GitHub Actions: Deprecated Action

# .github/workflows/silo-import-tests.yml:25
uses: actions-rs/toolchain@v1

Note: The actions-rs organization is no longer maintained. Consider using dtolnay/rust-toolchain instead:

- name: Set up Rust
  uses: dtolnay/rust-toolchain@stable

6. Missing Type Hints

The new transform_data_format function lacks type hints:

def transform_data_format(data_path, transformed_path):  # Missing types

Recommendation:

def transform_data_format(data_path: Path | str, transformed_path: Path | str) -> None:

7. Dockerfile: Missing Newline at EOF

# silo-import/Dockerfile:30
ENTRYPOINT ["silo-import"]

The file is missing a newline at the end (cosmetic issue).

📝 Documentation & Testing

✅ Good:

README.md updated with link to transformer documentation
Tests updated with realistic data structures
Mock data helpers properly refactored

Suggestion: Consider adding a comment in download_manager.py explaining why transformation happens after decompression validation but before record count validation.

🔒 Security Considerations

✅ Good: Using shlex.quote() to properly escape file paths in the shell command prevents injection vulnerabilities.

⚠️ Note: The code uses shell=True, which is generally discouraged. Here it is the simplest way to express the pipe chain (zstdcat | transformer | zstd), although chaining subprocess.Popen objects would avoid the shell entirely. The use of shlex.quote() mitigates the injection risk.
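
For comparison, a shell-free pipe chain can be sketched with `subprocess.Popen`. This is only an illustration under stated assumptions: `run_pipeline` is a hypothetical helper, and because the exact arguments of zstdcat, legacy-ndjson-transformer, and zstd are not shown in this thread, the stages in the usage example are Python stand-ins.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_pipeline(commands: list[list[str]], input_path: Path, output_path: Path) -> None:
    """Run `commands[0] < input | commands[1] | ... > output` without a shell."""
    procs: list[subprocess.Popen] = []
    with open(input_path, "rb") as src, open(output_path, "wb") as dst:
        upstream = src
        for i, cmd in enumerate(commands):
            is_last = i == len(commands) - 1
            proc = subprocess.Popen(
                cmd,
                stdin=upstream,
                stdout=dst if is_last else subprocess.PIPE,
            )
            if procs:
                # Close our copy of the previous pipe so the upstream process
                # receives SIGPIPE if this stage exits early.
                upstream.close()
            upstream = proc.stdout
            procs.append(proc)
        returncodes = [proc.wait() for proc in procs]
    # Report the first stage that failed, if any.
    for cmd, code in zip(commands, returncodes):
        if code != 0:
            raise subprocess.CalledProcessError(code, cmd)


# Usage with stand-in stages: upper-case the input, then replace "A" with "X".
upper = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]
swap = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().replace('A', 'X'))"]
with tempfile.TemporaryDirectory() as tmp:
    source, target = Path(tmp) / "in.txt", Path(tmp) / "out.txt"
    source.write_text("abc")
    run_pipeline([upper, swap], source, target)
    print(target.read_text())
```

The trade-off is more code in exchange for no quoting concerns and a return code per stage; with `shell=True` and no `pipefail`, only the last command's exit status is checked.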

🎯 Performance Considerations

Question: Does the transformer load the whole input into memory, or does it stream? Unix pipes block the writer when the reader falls behind, so the pipe chain itself applies backpressure and the current implementation should be fine, but memory usage is still worth monitoring with large files.

📊 Test Coverage

✅ Tests cover:

Full import cycle with transformation
Multiple runs with state persistence
Hard refresh scenarios
Error recovery and cleanup

❌ Missing test coverage for:

Transformation failure scenarios
Binary not found errors
Corrupted intermediate data

Recommendation: Add a test that mocks transformer failure to ensure proper cleanup.
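
One possible shape for such a test is sketched below. Everything in it is hypothetical: `transform_and_cleanup` stands in for the relevant part of download_manager.py, the file names are invented, and the transformer is replaced by a `unittest.mock.Mock` that raises.

```python
import shutil
import tempfile
from pathlib import Path
from unittest import mock


class TransformationError(Exception):
    """Hypothetical error type, as proposed in issue 1."""


def transform_and_cleanup(download_dir: Path, transform) -> None:
    """Hypothetical stand-in for the download step: transform, clean up on failure."""
    data_path = download_dir / "data.ndjson.zst"  # invented file name
    transformed_path = download_dir / "transformed.ndjson.zst"  # invented file name
    try:
        transform(data_path, transformed_path)
    except Exception as exc:
        shutil.rmtree(download_dir, ignore_errors=True)
        raise TransformationError("data transformation failed") from exc


def test_failed_transformation_removes_download_dir() -> None:
    download_dir = Path(tempfile.mkdtemp())
    failing_transform = mock.Mock(side_effect=RuntimeError("transformer crashed"))
    try:
        transform_and_cleanup(download_dir, failing_transform)
    except TransformationError:
        pass
    else:
        raise AssertionError("expected TransformationError")
    # The transformer was invoked exactly once and the directory is gone.
    failing_transform.assert_called_once()
    assert not download_dir.exists()


test_failed_transformation_removes_download_dir()
```

In the real test suite the mock would more likely be installed with `mock.patch` on the module that imports `transform_data_format`, but the assertion pattern stays the same.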

Summary

This is a solid PR that follows good architectural patterns. The main concerns are:

Fix the exception handling (use specific exception, not BaseException)
Add error handling around the transformation call in download_manager.py
Capture stdout in subprocess call
Add binary verification before transformation

The changes are well-tested and the approach of reusing the Rust transformer is sound. Once the error handling is improved, this will be ready to merge.

Review completed with attention to code quality, security, performance, and test coverage as requested.

fengelniederhammer

Did you check how much faster this actually is? I'm a little sceptical about this change.

Pro:

Reuses the existing implementation from SILO, avoiding duplication

Con:

This now reads the NDJSON twice:

first in:

     # Decompress and analyze the data
     try:
         analysis = analyze_ndjson(data_path)

then in the transform script

The duplication is only a single function (something like 30 lines of code plus a unit test)
This adds complexity (integrating Rust into the Python container, using a subprocess call instead of a native Python function).

I do see the benefits, but I'm not sure whether it's worth it here. What do the others think?

anna-parker · 2025-12-19T08:27:32Z

See my comments here: #5683 (comment). We should test both versions on staging to measure the impact on performance.

anna-parker mentioned this pull request Dec 17, 2025

chore(kubernetes, silo-import): upgrade to SILO 0.8, adapt to the new input data format in the import job #5683

Open

3 tasks

anna-parker added 2 commits December 17, 2025 11:22

feat(deposition): add uuid, and logs if get-released-data fails, decrease time waiting for server to respond

638fe32

feat(silo-importer): add rust binary to dockerfile

7a8e6bf

anna-parker force-pushed the silo-importer-rust branch from 674a9a4 to 7a8e6bf on December 17, 2025 10:23

anna-parker commented Dec 17, 2025

ena-submission/src/ena_deposition/call_loculus.py (outdated, resolved)

Update ena-submission/src/ena_deposition/call_loculus.py

3ab8b01

anna-parker commented Dec 17, 2025

silo-import/Dockerfile (outdated, resolved)

anna-parker added 2 commits December 17, 2025 11:49

use transformed data

b4797fa

format

2187e3a

anna-parker added the preview label (triggers a deployment to argocd) Dec 17, 2025

anna-parker added 7 commits December 17, 2025 11:53

update tag

6ee4332

compare transformed paths

5a00ff6

testing

345cf80

fix workflow for tests

f0f1c77

testing

a8f07f8

fix linter errors

390eb5c

update tests

bef20a2

anna-parker force-pushed the silo-importer-rust branch from cac2198 to bef20a2 on December 17, 2025 12:44

anna-parker added 4 commits December 17, 2025 13:54

clean up

25b4fb5

fixup

442911d

simplify

a023a6f

update tests

b78b800

anna-parker force-pushed the silo-importer-rust branch from 5d151f3 to b78b800 on December 17, 2025 13:39

anna-parker added 6 commits December 17, 2025 14:44

use from path

0b6e4d0

testing

2734e02

fixup

009defe

fix tests

b2cc662

improve logging

9ff00d9

update docs

708fdc7

anna-parker changed the title from "Silo importer rust" to "chore(silo-import): upgrade to SILO 0.8, adapt to the new input data format in the import job" Dec 17, 2025

anna-parker marked this pull request as ready for review December 17, 2025 14:51

anna-parker requested review from corneliusroemer, fengelniederhammer and theosanderson December 17, 2025 14:52

improve errors

8767f54

fengelniederhammer reviewed Dec 19, 2025

anna-parker removed the preview label (triggers a deployment to argocd) Dec 19, 2025
