Is your feature request related to a problem? Please describe.
BioProfileKit validates GO terms and COG categories but has no support for UniProt accession numbers, which are ubiquitous in proteomics datasets. Malformed or retired UniProt IDs (e.g. obsolete accessions that have been merged or deleted) pass through undetected.
Describe the solution you'd like
Add a UniProt ID validation module to functional_annotation.py (or a new uniprot.py):
- Format validation: UniProt accessions follow a strict regex pattern ([OPQ][0-9][A-Z0-9]{3}[0-9] or [A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}). Flag any IDs that do not match.
- Existence validation (optional, requires network): Query the UniProt REST API (https://rest.uniprot.org/uniprotkb/{id}/) to check whether the accession is active. Cache results locally to avoid repeated requests.
- Obsolete ID detection: Detect IDs that redirect to a merged entry (HTTP 301) and report the current canonical accession.
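
As a rough illustration of the format check, here is a minimal sketch that uses the two patterns quoted above, joined into one alternation and matched against the whole string. The names `UNIPROT_ACCESSION_RE`, `is_valid_uniprot_accession` and `find_malformed_accessions` are hypothetical and not part of BioProfileKit today. Stripping surrounding whitespace before matching is also an assumption about how lenient the check should be.

```python
import re

# The two accession patterns quoted in this issue, combined with "|".
# fullmatch() is used below so that partial matches are rejected.
UNIPROT_ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def is_valid_uniprot_accession(value: str) -> bool:
    """Return True if the value is a well-formed UniProt accession.

    Hypothetical helper: surrounding whitespace is ignored (an assumption).
    """
    return UNIPROT_ACCESSION_RE.fullmatch(value.strip()) is not None


def find_malformed_accessions(values):
    """Return the values that do not match the accession pattern, in order."""
    return [v for v in values if not is_valid_uniprot_accession(str(v))]
```

With this sketch, `find_malformed_accessions(["P12345", "A0A023GPI8", "bad"])` would flag only `"bad"`. Isoform suffixes such as `P12345-2` would also be flagged, since the pattern covers bare accessions only.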
Surface results in a new "UniProt" tab in columns.jinja, consistent with the existing GO/COG annotation tabs. Activate via the --func uniprot CLI option.
Describe alternatives you've considered
Downloading the full UniProt ID list locally (as done for taxonomy). The current compressed list is ~2 GB, so network validation with caching is a more practical approach for accession columns.
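
To make the caching idea concrete, here is a minimal sketch of a lookup that stores results in a local JSON file and only contacts the network for accessions it has not seen before. Everything here is an assumption for illustration: the names `fetch_is_active` and `check_accessions`, the JSON cache format, and in particular the rule that an HTTP 200 response means "active" while any HTTP error means "not active". The URL is the one given in this issue, and the real API behaviour for merged or deleted entries would need to be checked before relying on this.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterable
from urllib.error import HTTPError
from urllib.request import urlopen


def fetch_is_active(accession: str) -> bool:
    """Hypothetical fetcher: ask the REST endpoint named in this issue.

    Assumption: HTTP 200 means the accession exists; any HTTP error
    (for example 404) is treated as "not active".
    """
    url = f"https://rest.uniprot.org/uniprotkb/{accession}/"
    try:
        with urlopen(url, timeout=10) as response:
            return response.status == 200
    except HTTPError:
        return False


def check_accessions(
    accessions: Iterable[str],
    cache_path: Path,
    fetch: Callable[[str], bool] = fetch_is_active,
) -> Dict[str, bool]:
    """Return {accession: is_active}, reusing results stored in cache_path."""
    accessions = list(accessions)
    # Load earlier results, if a cache file already exists.
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    for accession in accessions:
        if accession not in cache:
            # Only unseen accessions trigger a lookup.
            cache[accession] = fetch(accession)
    # Persist the updated cache for the next run.
    cache_path.write_text(json.dumps(cache, indent=2))
    return {accession: cache[accession] for accession in accessions}
```

Passing the fetcher in as a parameter keeps the cache logic testable offline, and would let a batch lookup replace the per-ID request later without changing callers.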
Additional context
UniProt accession format regex is documented at: https://www.uniprot.org/help/accession_numbers. The REST API supports batch lookups of up to 500 IDs per request, making validation of large columns feasible.