@anna-parker anna-parker commented Dec 17, 2025

resolves #4868

Alternative to #5683

Reuses the Rust binary to transform data into the format required by SILO: https://github.com/GenSpectrum/LAPIS-SILO/tree/main/tools/legacyNdjsonTransformer.

The Rust transformer requires certain fields in the input format, such as alignedAminoAcidSequences. The get-released-data endpoint returns these fields, so I decided it was best to update the tests to include them. This also makes our tests more realistic.

Screenshot

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by appropriate, automated tests.
  • Any manual testing that has been done is documented (i.e. what exactly was tested?) - I only tested that previews come up with the same number of sequences

🚀 Preview: Add preview label to enable

@anna-parker anna-parker added the preview label (Triggers a deployment to argocd) Dec 17, 2025
@anna-parker anna-parker changed the title from "Silo importer rust" to "chore(silo-import): upgrade to SILO 0.8, adapt to the new input data format in the import job" Dec 17, 2025
@anna-parker anna-parker marked this pull request as ready for review December 17, 2025 14:51
claude bot commented Dec 17, 2025

Code Review: SILO 0.8 Upgrade and Data Format Transformation

I've reviewed this PR and have the following feedback:

✅ Positive Aspects

  1. Good architectural approach: Reusing the existing Rust transformer from LAPIS-SILO is a smart decision that avoids code duplication and maintenance burden.

  2. Comprehensive test updates: The test suite has been properly updated to include realistic data with all required fields (alignedAminoAcidSequences, nucleotideInsertions, etc.), making tests more representative of production data.

  3. Multi-stage Docker build: The Dockerfile properly uses a multi-stage build to compile the Rust binary separately, keeping the final image clean.

  4. Consistent naming: The separation between DATA_FILENAME (untransformed) and TRANSFORMED_DATA_FILENAME is clear and well-documented.

🔴 Critical Issues

1. Error Handling - Generic Exception (High Priority)

# silo-import/src/silo_import/transformer.py:35
raise BaseException(msg) from e

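Problem: Raising BaseException is bad practice in Python. Callers' `except Exception` handlers will not catch it, so it bypasses normal error handling. BaseException is the root of the hierarchy that also contains SystemExit and KeyboardInterrupt, and application errors should derive from Exception instead.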

Recommendation: Create a specific exception class or use a standard exception:

class TransformationError(Exception):
    """Raised when data transformation fails."""

# Then in transformer.py:
raise TransformationError(msg) from e
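
To illustrate why the base class matters, here is a minimal sketch. The helper names `fail_with_base`, `fail_with_specific`, and `caught_by_except_exception` are hypothetical and not part of the PR. It shows that a typical `except Exception` handler never sees an error raised as `BaseException`, but does catch a subclass of `Exception`.

```python
class TransformationError(Exception):
    """Raised when data transformation fails."""


def fail_with_base() -> None:
    # Hypothetical stand-in for the current behaviour in transformer.py.
    raise BaseException("transformer failed")


def fail_with_specific() -> None:
    # Hypothetical stand-in for the recommended behaviour.
    raise TransformationError("transformer failed")


def caught_by_except_exception(func) -> bool:
    """Return True if a typical `except Exception` handler catches func's error."""
    try:
        func()
    except Exception:
        return True
    except BaseException:
        # Only a handler for BaseException (or a bare except) reaches this case.
        return False
    return False
```

With these definitions, `caught_by_except_exception(fail_with_specific)` returns `True` and `caught_by_except_exception(fail_with_base)` returns `False`, so cleanup code guarded by `except Exception` would be skipped in the second case.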

2. Missing Error Handling for Transformation (High Priority)

The transformation happens in download_manager.py:192 but there's no try-except block around it. If the transformer fails, the download directory cleanup logic may not trigger properly.

Recommendation: Add proper error handling:

# In download_manager.py around line 191-192
try:
    transform_data_format(data_path, transformed_path)
except Exception as exc:
    logger.error("Data transformation failed: %s", exc)
    safe_remove(download_dir)
    raise TransformationError from exc

3. Missing stdout Capture (Medium Priority)

# silo-import/src/silo_import/transformer.py:24-31
subprocess.run(
    cmd,
    shell=True,
    check=True,
    executable="/bin/bash",
    stderr=subprocess.PIPE,  # ✓ captured
    text=True,
)

Problem: Only stderr is captured. Anything the command writes to stdout goes straight to the parent process's stdout, where it can clutter the import job's logs.

Recommendation: Add stdout=subprocess.PIPE as well.
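
As a rough sketch of what capturing both streams could look like: `run_and_capture` below is a hypothetical helper, not the PR's code, and it takes an argument list rather than the shell pipeline used in transformer.py. `capture_output=True` is the standard-library shorthand for setting both `stdout` and `stderr` to `subprocess.PIPE`.

```python
import subprocess


def run_and_capture(cmd: list[str]) -> subprocess.CompletedProcess:
    """Run cmd, capturing both stdout and stderr as text."""
    try:
        return subprocess.run(cmd, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as e:
        # e.stdout and e.stderr hold whatever the child wrote before failing.
        msg = f"command failed with exit code {e.returncode}: {e.stderr.strip()}"
        raise RuntimeError(msg) from e
```

Whether this matters in the PR depends on the actual command. If the last stage of the pipe chain already redirects its stdout to the output file, there may be little or nothing left to capture.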

4. Missing Binary Verification (Medium Priority)

The code assumes legacy-ndjson-transformer is available but doesn't verify it before attempting transformation.

Recommendation: Add a check or better error message:

import shutil

def transform_data_format(data_path, transformed_path):
    # Fail early with a clear message if the binary is not on PATH
    if shutil.which("legacy-ndjson-transformer") is None:
        raise TransformationError(
            "legacy-ndjson-transformer binary not found in PATH"
        )
    # ... rest of function

⚠️ Minor Issues

5. GitHub Actions: Deprecated Action

# .github/workflows/silo-import-tests.yml:25
uses: actions-rs/toolchain@v1

Note: The actions-rs organization is no longer maintained. Consider using dtolnay/rust-toolchain instead:

- name: Set up Rust
  uses: dtolnay/rust-toolchain@stable

6. Missing Type Hints

The new transform_data_format function lacks type hints:

def transform_data_format(data_path, transformed_path):  # Missing types

Recommendation:

def transform_data_format(data_path: Path | str, transformed_path: Path | str) -> None:

7. Dockerfile: Missing Newline at EOF

# silo-import/Dockerfile:30
ENTRYPOINT ["silo-import"]

The file is missing a newline at the end (cosmetic issue).

📝 Documentation & Testing

Good:

  • README.md updated with link to transformer documentation
  • Tests updated with realistic data structures
  • Mock data helpers properly refactored

Suggestion: Consider adding a comment in download_manager.py explaining why transformation happens after decompression validation but before record count validation.

🔒 Security Considerations

Good: Using shlex.quote() to properly escape file paths in the shell command prevents injection vulnerabilities.

⚠️ Note: The code uses shell=True, which is generally discouraged. Here it keeps the pipe chain (zstdcat | transformer | zstd) simple, although the same chain could also be built without a shell by connecting subprocess.Popen stages. The use of shlex.quote() mitigates the injection risk.
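
For comparison, a shell-free pipe chain can be sketched with `subprocess.Popen`. The helper `run_pipeline` is hypothetical and not part of the PR. It assumes every stage reads from stdin and writes to stdout, which the pipe chain above suggests but which has not been checked against the transformer's CLI.

```python
import subprocess
from pathlib import Path


def run_pipeline(commands: list[list[str]], input_path: Path, output_path: Path) -> None:
    """Run commands as a pipeline: input file -> cmd1 | cmd2 | ... -> output file."""
    procs: list[subprocess.Popen] = []
    with open(input_path, "rb") as src, open(output_path, "wb") as dst:
        prev_stdout = src
        for i, cmd in enumerate(commands):
            is_last = i == len(commands) - 1
            proc = subprocess.Popen(
                cmd,
                stdin=prev_stdout,
                stdout=dst if is_last else subprocess.PIPE,
            )
            # Close our copy of the previous pipe so only the child holds its read end.
            if procs:
                procs[-1].stdout.close()
            procs.append(proc)
            prev_stdout = proc.stdout
        # Wait for every stage before the files are closed.
        return_codes = [p.wait() for p in procs]
    # Fail if any stage exited non-zero, not only the last one.
    for cmd, code in zip(commands, return_codes):
        if code != 0:
            raise subprocess.CalledProcessError(code, cmd)
```

Because arguments are passed as lists, no quoting is needed, and a failure in any stage is reported rather than only the exit status of the last command. The trade-off is more code than a one-line shell command.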

🎯 Performance Considerations

Question: Is the transformation streaming or in-memory? Because the stages are connected by Unix pipes, a fast producer blocks once the pipe buffer is full, so backpressure should be handled by the OS as long as the transformer itself processes records incrementally. It is still worth monitoring memory usage with large files.

📊 Test Coverage

✅ Tests cover:

  • Full import cycle with transformation
  • Multiple runs with state persistence
  • Hard refresh scenarios
  • Error recovery and cleanup

❌ Missing test coverage for:

  • Transformation failure scenarios
  • Binary not found errors
  • Corrupted intermediate data

Recommendation: Add a test that mocks transformer failure to ensure proper cleanup.
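
Such a test could look roughly like the sketch below. Everything in it is a simplified, hypothetical stand-in rather than the project's real code: `download_and_transform`, the file names, and the injected `transform` parameter are assumptions made so the example is self-contained.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable
from unittest import mock


class TransformationError(Exception):
    """Raised when data transformation fails."""


def download_and_transform(
    download_dir: Path, transform: Callable[[Path, Path], None]
) -> Path:
    """Hypothetical, simplified stand-in for the download manager step."""
    data_path = download_dir / "data.ndjson.zst"  # assumed file name
    transformed_path = download_dir / "transformed_data.ndjson.zst"  # assumed file name
    try:
        transform(data_path, transformed_path)
    except Exception as exc:
        # Remove the partially processed download before propagating the error.
        shutil.rmtree(download_dir, ignore_errors=True)
        raise TransformationError("data transformation failed") from exc
    return transformed_path


def test_cleanup_on_transformer_failure() -> None:
    download_dir = Path(tempfile.mkdtemp())
    (download_dir / "data.ndjson.zst").write_bytes(b"dummy")
    # A mock that raises simulates the transformer binary failing.
    failing_transform = mock.Mock(side_effect=RuntimeError("boom"))
    try:
        download_and_transform(download_dir, failing_transform)
    except TransformationError:
        pass
    else:
        raise AssertionError("expected TransformationError")
    failing_transform.assert_called_once()
    assert not download_dir.exists()
```

In the real test suite, the transformer would more likely be replaced with `mock.patch` or pytest's `monkeypatch` on the module attribute instead of being passed in as a parameter.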

Summary

This is a solid PR that follows good architectural patterns. The main concerns are:

  1. Fix the exception handling (use specific exception, not BaseException)
  2. Add error handling around the transformation call in download_manager.py
  3. Capture stdout in subprocess call
  4. Add binary verification before transformation

The changes are well-tested and the approach of reusing the Rust transformer is sound. Once the error handling is improved, this will be ready to merge.


Review completed with attention to code quality, security, performance, and test coverage as requested.

@fengelniederhammer fengelniederhammer left a comment

Did you check how much faster this actually is? I'm a little sceptical about this change.

Pro:

  • Reuses the existing implementation from SILO, avoiding duplication

Con:

  • This now reads the NDJSON twice:
    • first in:
          # Decompress and analyze the data
          try:
              analysis = analyze_ndjson(data_path)
    • then in the transform script
  • The duplication is only a single function (something like 30 lines of code plus a unit test)
  • This adds complexity (integrating Rust into the Python container, spawning a subprocess instead of calling a native Python function).

I do see the benefits, but I'm not sure whether it's worth it here. What do the others think?

@anna-parker
Contributor Author

See my comments here: #5683 (comment). We should test both versions on staging to compare their impact on performance.

@anna-parker anna-parker removed the preview label (Triggers a deployment to argocd) Dec 19, 2025
Development

Successfully merging this pull request may close these issues.

Update to SILO 0.8 and adapt input data for SILO