DNA/RNA – Duplicate Sequence Detection

**Is your feature request related to a problem? Please describe.**
BioProfileKit already detects reverse-complement duplicates but does not report exact sequence duplicates. Duplicate sequences in training or analysis datasets can bias statistical analyses, inflate apparent sequence diversity, and indicate data integrity issues (e.g. the same entry present twice after a database merge).

**Describe the solution you'd like**
In `sequence_data.py`, add duplicate sequence statistics to `DNARNAColumns`:
- `n_exact_duplicates: int` — number of sequence entries whose exact sequence string occurs more than once in the column
- `exact_duplicate_ratio: float` — `n_exact_duplicates` as a percentage of the total number of sequences
- `duplicate_length_distribution: dict` — breakdown of duplicates by sequence length bucket (to detect whether duplicates cluster at specific lengths)
- `top_duplicated_sequences: list[tuple[str, int]]` — top-N most duplicated sequences and their counts
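A minimal sketch of how these four statistics could be computed, to make the proposal concrete. It is not the existing BioProfileKit code: `DuplicateStats`, `duplicate_stats`, `length_bucket`, the 50-nt bucket width and the `top_n` default are all hypothetical names and values chosen for illustration. It also assumes that `n_exact_duplicates` counts every entry belonging to a duplicated sequence, and that comparison is on the raw string with no case or whitespace normalization.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class DuplicateStats:
    """Hypothetical container; the real fields would live on DNARNAColumns."""
    n_exact_duplicates: int = 0
    exact_duplicate_ratio: float = 0.0
    duplicate_length_distribution: dict = field(default_factory=dict)
    top_duplicated_sequences: list = field(default_factory=list)


def length_bucket(length: int, width: int = 50) -> str:
    """Label the fixed-width length bucket a sequence falls into, e.g. '0-49'."""
    start = (length // width) * width
    return f"{start}-{start + width - 1}"


def duplicate_stats(sequences, top_n: int = 5, bucket_width: int = 50) -> DuplicateStats:
    """Report exact-duplicate statistics without modifying the input."""
    sequences = list(sequences)
    counts = Counter(sequences)

    # Keep only sequences that occur more than once.
    duplicated = {seq: n for seq, n in counts.items() if n > 1}

    # Assumption: every entry of a duplicated sequence is counted.
    n_dup = sum(duplicated.values())
    total = len(sequences)
    ratio = 100.0 * n_dup / total if total else 0.0

    # Number of duplicated entries per length bucket.
    buckets = Counter()
    for seq, n in duplicated.items():
        buckets[length_bucket(len(seq), bucket_width)] += n

    # Most frequent first; ties broken alphabetically for stable output.
    top = sorted(duplicated.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

    return DuplicateStats(n_dup, ratio, dict(buckets), top)


if __name__ == "__main__":
    stats = duplicate_stats(["ACGT", "ACGT", "ACGT", "GGCC", "TTTT", "GGCC"])
    print(stats)
```

Counting with `Counter` keeps the computation to a single pass over the column and leaves the data untouched, which fits the goal of reporting rather than deduplicating.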

Display in a new "Duplicates" sub-tab in the DNA/RNA section of `columns.jinja`, with a focus on length-specific duplicate patterns (short sequences are expected to repeat more frequently than long ones).

**Describe alternatives you've considered**
An alternative is hashing-based deduplication as a preprocessing step instead of reporting. Reporting is preferred because the decision to deduplicate should remain with the scientist.

**Additional context**
Length-stratified analysis is important: a 9-mer epitope appearing 500 times may be biologically meaningful (a dominant epitope), while a 200-nt sequence appearing twice is likely a data error.