Hi,
I was reviewing the script annotate_pubmed/final_data_clean.py and noticed an issue that may be causing valid results to be dropped.
In the step where PubTator gene annotations are mapped to NCBI gene information, the code only uses the ['GeneID','Symbol'] columns from the gene_info.gz file. However, if PubTator reports a gene synonym or an old symbol, it will not match the current symbol, so a valid mapping can be excluded.
I reimplemented the script to use the latest + old symbols from the G2P download file.
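To illustrate the idea, here is a minimal sketch (not the code from my script): build a lookup from every known name of a gene, current symbol or synonym, to its GeneID, and match mentions against that instead of against the current symbol only. The function names, the (GeneID, Symbol, Synonyms) row shape and the toy values are assumptions made up for this example. The pipe-separated synonyms with "-" for "none" follow the convention of the Synonyms column in gene_info; the G2P download file may be laid out differently.

```python
from collections import defaultdict


def build_alias_index(gene_rows):
    """Map each gene name (current symbol or synonym) to the set of GeneIDs using it.

    gene_rows: iterable of (gene_id, symbol, synonyms), where synonyms is a
    pipe-separated string and "-" means the gene has no synonyms.
    """
    index = defaultdict(set)
    for gene_id, symbol, synonyms in gene_rows:
        index[symbol.strip()].add(gene_id)
        if synonyms and synonyms != "-":
            for alias in synonyms.split("|"):
                index[alias.strip()].add(gene_id)
    return index


def map_mention(mention, index):
    """Return the GeneID for a mention, or None if it is unknown or ambiguous."""
    hits = index.get(mention.strip(), set())
    # A synonym shared by several genes is ambiguous, so it is not mapped.
    return next(iter(hits)) if len(hits) == 1 else None


# Hypothetical toy rows, not real gene data.
rows = [
    ("1", "GENEA", "OLDA|ALIAS1"),
    ("2", "GENEB", "-"),
    ("3", "GENEC", "ALIAS1"),
]
index = build_alias_index(rows)

print(map_mention("OLDA", index))    # old symbol resolves to GENEA's ID
print(map_mention("ALIAS1", index))  # shared by two genes, left unmapped
```

Leaving ambiguous synonyms unmapped is one possible design choice; another would be to prefer the gene whose current symbol matches and fall back to synonyms only when no current symbol does.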
There is also an issue with memory usage, which I have improved in my script.
You can find my version here: https://github.com/dglemos/LitDD_mining/blob/main/annotate_pubmed/final_data_clean_v2.py
Best wishes,
Diana