DNA/RNA – Duplicate Sequence Detection #8

@hansen-maria

Is your feature request related to a problem? Please describe.
BioProfileKit already detects reverse-complement duplicates but does not report exact sequence duplicates. Duplicate sequences in training or analysis datasets can bias statistical analyses, make a dataset look larger and more diverse than it really is, and indicate data integrity issues (e.g. the same entry present twice after a database merge).

Describe the solution you'd like
In sequence_data.py, add duplicate sequence statistics to DNARNAColumns:

  • n_exact_duplicates: int - count of sequences that appear more than once
  • exact_duplicate_ratio: float - share of total sequences that are exact duplicates, stored as a fraction between 0 and 1 and shown as a percentage in the report
  • duplicate_length_distribution: dict - breakdown of duplicates by sequence length bucket (to detect whether duplicates cluster at specific lengths)
  • top_duplicated_sequences: list[tuple[str, int]] - top-N most duplicated sequences and their counts
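
A minimal sketch of how these four statistics could be computed, assuming the column is available as a plain iterable of strings. The names DuplicateStats, length_bucket and exact_duplicate_stats, the bucket edges (10, 50, 200) and the top_n default are hypothetical and not part of BioProfileKit. The sketch also assumes that n_exact_duplicates counts every row whose sequence occurs more than once, that the ratio is expressed as a fraction between 0 and 1, and that sequences are compared case-insensitively.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass
class DuplicateStats:  # hypothetical container, not an existing BioProfileKit class
    n_exact_duplicates: int
    exact_duplicate_ratio: float
    duplicate_length_distribution: dict[str, int]
    top_duplicated_sequences: list[tuple[str, int]]


def length_bucket(length: int, edges: tuple[int, ...] = (10, 50, 200)) -> str:
    """Map a sequence length to a label such as '0-9', '10-49' or '200+'."""
    lower = 0
    for edge in edges:
        if length < edge:
            return f"{lower}-{edge - 1}"
        lower = edge
    return f"{lower}+"


def exact_duplicate_stats(sequences: Iterable[str], top_n: int = 5) -> DuplicateStats:
    # Assumption: sequences are compared case-insensitively after trimming whitespace.
    counts = Counter(seq.strip().upper() for seq in sequences)
    total = sum(counts.values())

    # Keep only the sequences that occur more than once.
    duplicated = Counter({seq: n for seq, n in counts.items() if n > 1})

    # Assumption: every row of a repeated sequence counts as a duplicate.
    n_duplicates = sum(duplicated.values())

    # Count duplicate rows per length bucket.
    by_length = Counter()
    for seq, n in duplicated.items():
        by_length[length_bucket(len(seq))] += n

    return DuplicateStats(
        n_exact_duplicates=n_duplicates,
        exact_duplicate_ratio=n_duplicates / total if total else 0.0,
        duplicate_length_distribution=dict(by_length),
        top_duplicated_sequences=duplicated.most_common(top_n),
    )
```

Calling exact_duplicate_stats(column_values) on the raw column would return one DuplicateStats object whose fields could be copied into DNARNAColumns. If n_exact_duplicates is instead meant to count distinct repeated sequences, len(duplicated) would replace the sum.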

Display in a new "Duplicates" sub-tab in the DNA/RNA section of columns.jinja, with a focus on length-specific duplicate patterns (short sequences are expected to repeat more frequently than long ones).

Describe alternatives you've considered
Hashing-based deduplication as a preprocessing step rather than reporting. Reporting is preferred because the decision to deduplicate should remain with the scientist.

Additional context
Length-stratified analysis is important: a 9-mer epitope appearing 500 times may be biologically meaningful (a dominant epitope), while a 200-nt sequence appearing twice is likely a data error.

Labels: enhancement