Gene symbol mapping in final_data_clean.py drops valid results

Hi,
I was reviewing the script [annotate_pubmed/final_data_clean.py](https://github.com/biomedicalinformaticsgroup/LitDD_mining/blob/main/annotate_pubmed/final_data_clean.py) and noticed an issue that may be causing valid results to be dropped.

In the step where PubTator gene annotations are mapped to NCBI gene information, the code uses only the `['GeneID','Symbol']` columns from the `gene_info.gz` file. However, when PubTator reports a gene under a synonym or a previous symbol, the lookup against `Symbol` finds no match, so valid mappings are excluded.
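To illustrate the mismatch, here is a minimal sketch that is not taken from either script. It assumes alias information in the `gene_info` convention (a pipe-separated `Synonyms` value, with `-` meaning empty). The rows are made-up placeholders, and `build_symbol_lookup` is a hypothetical helper name.

```python
def build_symbol_lookup(rows):
    """Map every current symbol and every alias to its GeneID.

    rows: iterable of (gene_id, symbol, synonyms) tuples, where synonyms is a
    pipe-separated string, or "-" when there are none.
    """
    rows = list(rows)
    lookup = {}
    # First pass: current symbols take priority over any alias.
    for gene_id, symbol, _ in rows:
        lookup[symbol] = gene_id
    # Second pass: add aliases without overwriting an existing entry.
    # If two genes share an alias, the first one seen is kept.
    for gene_id, _, synonyms in rows:
        if synonyms == "-":
            continue
        for alias in synonyms.split("|"):
            lookup.setdefault(alias, gene_id)
    return lookup


# Made-up rows for illustration only.
rows = [
    ("1001", "GENEA", "OLDA|GA1"),
    ("1002", "GENEB", "-"),
    ("1003", "GENEC", "GENEB|GC1"),
]

symbol_only = {symbol: gene_id for gene_id, symbol, _ in rows}
with_aliases = build_symbol_lookup(rows)

print(symbol_only.get("OLDA"))   # None: the annotation would be dropped
print(with_aliases.get("OLDA"))  # 1001: the annotation is kept
```

Ambiguous aliases (one alias shared by several genes) probably deserve explicit handling. This sketch simply keeps the first match and never lets an alias override a current symbol.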

I reimplemented the script to use both the current and the previous gene symbols from the G2P download file.
The original script also has a memory usage issue, which I addressed in my version.
You can find my version here: https://github.com/dglemos/LitDD_mining/blob/main/annotate_pubmed/final_data_clean_v2.py
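
For reference, one general way to keep memory low when reading a large annotation file is to stream it row by row instead of loading it whole. The sketch below shows that idea for `gene_info.gz` and is not necessarily what the linked script does. It assumes the standard NCBI header names `#tax_id`, `GeneID`, `Symbol` and `Synonyms`; the function name `iter_gene_rows` and the default taxon `9606` (human) are illustrative choices.

```python
import csv
import gzip


def iter_gene_rows(path, tax_id="9606"):
    """Yield (GeneID, Symbol, Synonyms) for one taxon, one row at a time."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        # The file is tab-separated; disable quote handling so stray quote
        # characters in description fields are read literally.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if row["#tax_id"] == tax_id:
                yield row["GeneID"], row["Symbol"], row["Synonyms"]
```

Because this is a generator, only the rows for the requested taxon are ever held by the caller, for example while building a symbol-to-GeneID dictionary.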

Best wishes,
Diana
