Is your feature request related to a problem? Please describe.
BioProfileKit already detects reverse-complement duplicates but does not report exact sequence duplicates. Duplicate sequences in training or analysis datasets can bias statistical analyses, inflate apparent sequence diversity, and indicate data integrity issues (e.g. the same entry present twice after a database merge).
Describe the solution you'd like
In sequence_data.py, add duplicate sequence statistics to DNARNAColumns:
n_exact_duplicates: int - number of sequences whose exact string appears more than once
exact_duplicate_ratio: float - n_exact_duplicates as a fraction of the total number of sequences (0.0 to 1.0)
duplicate_length_distribution: dict - breakdown of duplicates by sequence length bucket (to detect whether duplicates cluster at specific lengths)
top_duplicated_sequences: list[tuple[str, int]] - the N most duplicated sequences and their counts
Display in a new "Duplicates" sub-tab in the DNA/RNA section of columns.jinja, with a focus on length-specific duplicate patterns (short sequences are expected to repeat more frequently than long ones).
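To make the request concrete, here is a minimal sketch of how these four statistics could be computed from a plain list of sequence strings. It is illustrative only: `duplicate_stats`, `top_n`, `bucket_size`, and the bucket label format are hypothetical names chosen for this example, not existing BioProfileKit API. It also makes two assumptions. Sequences are compared case-insensitively, and "duplicates" counts every entry whose sequence occurs more than once, rather than counting distinct duplicated sequences.

```python
from collections import Counter


def duplicate_stats(sequences, top_n=5, bucket_size=50):
    """Summarise exact duplicates in a list of sequence strings.

    Hypothetical helper for illustration; not part of BioProfileKit.
    """
    # Assumption: compare case-insensitively, so "acgt" equals "ACGT".
    counts = Counter(seq.upper() for seq in sequences)
    duplicated = {seq: n for seq, n in counts.items() if n > 1}

    # Assumption: count every entry whose sequence occurs more than once.
    n_exact_duplicates = sum(duplicated.values())
    total = len(sequences)
    ratio = n_exact_duplicates / total if total else 0.0

    # Group duplicated entries into fixed-width length buckets, e.g. "0-49".
    length_distribution = Counter()
    for seq, n in duplicated.items():
        start = (len(seq) // bucket_size) * bucket_size
        length_distribution[f"{start}-{start + bucket_size - 1}"] += n

    # Most frequent first; ties are broken alphabetically for stable output.
    top = sorted(duplicated.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

    return {
        "n_exact_duplicates": n_exact_duplicates,
        "exact_duplicate_ratio": ratio,
        "duplicate_length_distribution": dict(length_distribution),
        "top_duplicated_sequences": top,
    }
```

Nothing is removed from the input here. The function only reports, which keeps the deduplication decision with the scientist. Fixed-width buckets are just one option; quantile-based or log-scaled buckets may suit datasets with a wide length range better.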
Describe alternatives you've considered
Hashing-based deduplication as a preprocessing step rather than reporting. Reporting is preferred because the decision to deduplicate should remain with the scientist.
Additional context
Length-stratified analysis is important: a 9-mer epitope appearing 500 times may be biologically meaningful (a dominant epitope), while a 200-nt sequence appearing twice is likely a data error.