Gene symbol mapping in final_data_clean.py drops valid results #1

@dglemos

Hi,
I was reviewing the script annotate_pubmed/final_data_clean.py and noticed an issue that may be causing valid results to be dropped.

In the step where PubTator gene annotations are mapped to NCBI gene information, the code uses only the ['GeneID', 'Symbol'] columns from the gene_info.gz file. However, if PubTator reports a gene synonym or a previous symbol, the mention does not match the current symbol and a valid mapping can be excluded.
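
To illustrate the mismatch, here is a minimal sketch of a lookup that accepts synonyms as well as the current symbol. It is not the code from either script. The function names `build_symbol_lookup` and `resolve` are hypothetical, the rows are illustrative rather than a real extract, and the pipe-separated synonym string with "-" for "none" follows the gene_info convention.

```python
from collections import defaultdict


def build_symbol_lookup(gene_rows):
    """Map every known name (current symbol or synonym) to the GeneIDs using it.

    gene_rows: iterable of (gene_id, symbol, synonyms) tuples, where synonyms
    is a pipe-separated string and "-" means the gene has no synonyms.
    """
    lookup = defaultdict(set)
    for gene_id, symbol, synonyms in gene_rows:
        lookup[symbol].add(gene_id)
        if synonyms and synonyms != "-":
            for alias in synonyms.split("|"):
                lookup[alias].add(gene_id)
    return lookup


def resolve(name, lookup):
    """Return the sorted GeneIDs for a mention, or [] when nothing matches."""
    # .get() avoids creating an empty entry in the defaultdict on a miss.
    return sorted(lookup.get(name, ()))


# Illustrative rows only (hypothetical input, not read from gene_info.gz).
rows = [
    ("4297", "KMT2A", "MLL|MLL1|HRX"),
    ("672", "BRCA1", "-"),
]
lookup = build_symbol_lookup(rows)

print(resolve("KMT2A", lookup))  # current symbol
print(resolve("MLL", lookup))    # previous symbol, missed by a Symbol-only join
```

A synonym can belong to more than one gene, so the lookup stores a set of GeneIDs per name and leaves it to the caller to decide how to handle ambiguous matches.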

I reimplemented the script to use both the current and the previous symbols from the G2P download file.
The script also had a memory usage issue, which I addressed in my version.
You can find my version here: https://github.com/dglemos/LitDD_mining/blob/main/annotate_pubmed/final_data_clean_v2.py
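
On the memory side, one common approach is to stream the gene file row by row instead of loading it whole. The sketch below assumes that loading the full file is the source of the memory pressure, which may not match what the linked script does. The function name `iter_gene_rows` and the `tax_id` filter are hypothetical, and the column names follow the gene_info header.

```python
import csv
import gzip


def iter_gene_rows(path, tax_id="9606"):
    """Yield (GeneID, Symbol, Synonyms) one row at a time from a gene_info-style file.

    Only one row is held in memory at a time; rows for other taxa are skipped.
    Assumes a tab-separated file whose header starts with '#tax_id'.
    """
    with gzip.open(path, "rt", newline="") as handle:
        # QUOTE_NONE: treat quote characters in free-text columns literally.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if row["#tax_id"] == tax_id:
                yield row["GeneID"], row["Symbol"], row["Synonyms"]


# Usage (hypothetical path):
# for gene_id, symbol, synonyms in iter_gene_rows("gene_info.gz"):
#     ...
```

Because the function is a generator, the caller can build whatever lookup it needs from the stream without first materialising every column of every row.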

Best wishes,
Diana
