Biological – UniProt ID Validation #12

@hansen-maria

Description

Is your feature request related to a problem? Please describe.
BioProfileKit validates GO terms and COG categories but has no support for UniProt accession numbers, which are ubiquitous in proteomics datasets. Malformed or retired UniProt IDs (e.g. obsolete accessions that have been merged or deleted) pass through undetected.

Describe the solution you'd like
Add a UniProt ID validation module to functional_annotation.py (or a new uniprot.py):

  1. Format validation: UniProt accessions follow a strict regex pattern ([OPQ][0-9][A-Z0-9]{3}[0-9] or [A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}). Flag any IDs that do not match.
  2. Existence validation (optional, requires network): Query the UniProt REST API (https://rest.uniprot.org/uniprotkb/{id}/) to check whether the accession is active. Cache results locally to avoid repeated requests.
  3. Obsolete ID detection: Detect IDs that redirect to a merged entry (HTTP 301) and report the current canonical accession.
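A minimal sketch of the format check in step 1, using the two patterns quoted above joined into one alternation. The names `is_valid_accession` and `find_malformed` are hypothetical and not part of BioProfileKit; the sketch also assumes IDs are compared case-sensitively, so lowercase input is flagged rather than normalised.

```python
import re

# The pattern from step 1. fullmatch() anchors it at both ends, so an ID with
# extra leading or trailing characters is rejected.
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def is_valid_accession(value: str) -> bool:
    """Return True if value is a syntactically valid UniProt accession."""
    return UNIPROT_ACCESSION.fullmatch(value.strip()) is not None


def find_malformed(values):
    """Return the entries of a column that do not match the accession format."""
    return [v for v in values if not is_valid_accession(v)]
```

This only checks syntax. A well-formed accession can still be obsolete, which is what steps 2 and 3 cover.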

Surface the results in a new "UniProt" tab in columns.jinja, consistent with the existing GO/COG annotation tabs. Activate the check via the --func uniprot CLI option.

Describe alternatives you've considered
Downloading the full UniProt ID list locally (as is done for taxonomy). The current compressed list is ~2 GB, so network validation with caching is a more practical approach for accession columns.
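A rough sketch of what cached network validation could look like, using only the standard library. The status-code handling is an assumption based on the description in this issue (200 for an active entry, a redirect for a merged one) and has not been verified against the live API. `classify`, `http_status`, and the result dictionaries are hypothetical names, and the cache is a plain dict here; persisting it to disk is left out.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit

API_URL = "https://rest.uniprot.org/uniprotkb/{id}"  # endpoint named in this issue
REDIRECTS = {301, 302, 303, 307, 308}  # assumption: any redirect means "merged"


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # do not follow; urllib then raises HTTPError for the 3xx


def http_status(accession):
    """Return (status, Location header or None) without following redirects."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(API_URL.format(id=accession), timeout=10) as resp:
            return resp.status, None
    except urllib.error.HTTPError as err:
        return err.code, err.headers.get("Location")


def classify(accession, cache, fetch=http_status):
    """Classify one accession, consulting the cache before the network."""
    if accession in cache:
        return cache[accession]
    status, location = fetch(accession)
    if status == 200:
        result = {"state": "active"}
    elif status in REDIRECTS and location:
        # Assumption: the last path segment of the target is the new accession.
        current = urlsplit(location).path.rstrip("/").rsplit("/", 1)[-1]
        result = {"state": "merged", "current": current}
    elif status == 404:
        result = {"state": "not_found"}
    else:
        # Transient failures (e.g. 5xx) are reported but not cached.
        return {"state": "unknown", "status": status}
    cache[accession] = result
    return result
```

Passing `fetch` as a parameter keeps the function testable offline and lets the network layer be swapped for whatever HTTP client the project already uses.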

Additional context
The UniProt accession format regex is documented at https://www.uniprot.org/help/accession_numbers. The REST API supports batch lookups of up to 500 IDs per request, making validation of large columns feasible.
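To illustrate how a large column could be split into batches, here is a small sketch. The limit of 500 is the figure quoted above and is not verified here. `lookup_batch` is a hypothetical callable that takes a list of accessions and returns a dict mapping each one to its status; the actual batch request is deliberately left out.

```python
from itertools import islice

BATCH_SIZE = 500  # per-request limit quoted in this issue (assumption)


def batched(ids, size=BATCH_SIZE):
    """Yield consecutive lists of at most `size` items."""
    it = iter(ids)
    while chunk := list(islice(it, size)):
        yield chunk


def validate_column(ids, lookup_batch, cache=None, size=BATCH_SIZE):
    """Look up each distinct, uncached ID once, in batches of `size`."""
    cache = {} if cache is None else cache
    # dict.fromkeys drops duplicates while keeping first-seen order.
    pending = list(dict.fromkeys(i for i in ids if i not in cache))
    for chunk in batched(pending, size):
        cache.update(lookup_batch(chunk))
    # IDs the lookup did not report come back as None.
    return {i: cache.get(i) for i in ids}
```

Deduplicating before batching matters for accession columns, where the same protein often appears in many rows.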

Labels

enhancement (New feature or request)
