This repository applies and evaluates the HUNER pre-trained model ("disease_all") on the "BC5CDR-Disease" data set.
- Install Docker.
- Download the pretrained model ("disease_all") from here, place it into the huner/models directory, and untar it using
tar xzf disease_all.tar.gz

To run prediction on the BC5CDR-Disease data set, we need to remove the labels from the .tsv file and convert it to a pre-tokenized .txt file in which tokens are separated by whitespace.
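The conversion described above (drop the labels, join the tokens with whitespace) could look roughly like the sketch below. This is only an illustration, not the repository's own preprocessing script: the function name `tsv_to_tokenized_lines` is hypothetical, and the sketch assumes one token and label per line with a blank line between sentences.

```python
from typing import Iterable, List


def tsv_to_tokenized_lines(tsv_lines: Iterable[str]) -> List[str]:
    """Drop the label column and join each sentence's tokens with spaces.

    Assumption: each non-blank line is "token<whitespace>label" and a blank
    line marks the end of a sentence. Adjust if your .tsv differs.
    """
    sentences: List[str] = []
    tokens: List[str] = []
    for raw in tsv_lines:
        line = raw.strip()
        if not line:
            # Blank line: close the current sentence, if any.
            if tokens:
                sentences.append(" ".join(tokens))
                tokens = []
            continue
        # Keep only the first column (the token), discard the label.
        tokens.append(line.split()[0])
    if tokens:
        sentences.append(" ".join(tokens))
    return sentences


if __name__ == "__main__":
    sample = ["Torsade\tB", "de\tI", "pointes\tI", "", "heart\tB", "failure\tI"]
    for sentence in tsv_to_tokenized_lines(sample):
        print(sentence)
```

Each returned string is one pre-tokenized sentence and can be written to the .txt file as one line.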
-
Use tokenized_txt.py in the helper folder to preprocess your .tsv data and make it ready to use as model input, e.g. tokenized_test.txt:
Selegiline - induced postural hypotension in Parkinson ' s disease : a longitudinal study on the effects of drug withdrawal .
-
Start the HUNER server using
./start_server.sh disease_all
The model must reside in the models directory.
-
While the server is running, use another terminal tab to tag the input data using
python client.py --name disease_all --assume_tokenized /path/to/tokenized_test.txt OUTPUT.CONLL
The output will then be written to OUTPUT.CONLL.
A sample of OUTPUT.CONLL for tokenized_test.txt looks like this:
Torsade POS B-NP
de POS I-NP
pointes POS I-NP
ventricular POS I-NP
tachycardia POS I-NP
during POS O
low POS O
dose POS O
intermittent POS O
dobutamine POS O
treatment POS O
in POS O
a POS O
patient POS O
with POS O
dilated POS B-NP
cardiomyopathy POS I-NP
and POS O
congestive POS B-NP
heart POS I-NP
failure POS I-NP
. POS O
The POS O
authors POS O
describe POS O
the POS O
case POS O
of POS O
a POS O
56 POS O
- POS O
year POS O
- POS O
old POS O
woman POS O
with POS O
chronic POS O
, POS O
severe POS O
heart POS B-NP
failure POS I-NP
secondary POS O
to POS O
dilated POS B-NP
cardiomyopathy POS I-NP
and POS O
absence POS O
of POS O
significant POS O
ventricular POS B-NP
arrhythmias POS I-NP
who POS O
developed POS O
QT POS B-NP
prolongation POS I-NP
and POS O
torsade POS B-NP
de POS I-NP
pointes POS I-NP
ventricular POS I-NP
tachycardia POS I-NP
during POS O
one POS O
cycle POS O
of POS O
intermittent POS O
low POS O
dose POS O
( POS O
2 POS O
. POS O
5 POS O
mcg POS O
/ POS O
kg POS O
per POS O
min POS O
) POS O
dobutamine POS O
. POS O
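To inspect which mentions the model tagged, the B-NP/I-NP tags in this output can be grouped into spans. The sketch below is one way to do that; the function name `extract_mentions` is hypothetical, and it assumes three whitespace-separated columns per line (token, POS placeholder, tag) as in the sample.

```python
from typing import Iterable, List


def extract_mentions(conll_lines: Iterable[str]) -> List[str]:
    """Group consecutive B-NP/I-NP tokens into mention strings.

    Assumption: the token is the first column and the tag is the last
    column. Blank lines are treated like an "O" tag and end a mention.
    """
    mentions: List[str] = []
    current: List[str] = []

    def flush() -> None:
        # Store the mention collected so far and start a new one.
        if current:
            mentions.append(" ".join(current))
            current.clear()

    for raw in conll_lines:
        parts = raw.split()
        tag = parts[-1] if len(parts) >= 2 else "O"
        if tag.startswith("B-"):
            flush()
            current.append(parts[0])
        elif tag.startswith("I-"):
            current.append(parts[0])
        else:
            flush()
    flush()
    return mentions


if __name__ == "__main__":
    sample = [
        "dilated POS B-NP",
        "cardiomyopathy POS I-NP",
        "and POS O",
        "congestive POS B-NP",
        "heart POS I-NP",
        "failure POS I-NP",
        ". POS O",
    ]
    print(extract_mentions(sample))
```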
We use the seqeval classification_report(y_true, y_pred) metric to evaluate the HUNER model.
-
Create a Conda environment called "seqeval" with Python 3.7.6:
conda create -n seqeval python=3.7.6
-
Activate the Conda environment:
conda activate seqeval
To install seqeval, simply run:
$ pip install seqeval[cpu]
If you want to install seqeval in a GPU environment, run:
$ pip install seqeval[gpu]
Requirement:
- numpy >= 1.14.0
Since the OUTPUT.CONLL format differs slightly from the BC5CDR-Disease IOB scheme, we need to modify our BC5CDR-Disease data.
-
BC5CDR-Disease
Torsade B
de I
pointes I
ventricular B
tachycardia I
during O
low O
dose O
intermittent O
dobutamine O
treatment O
in O
a O
patient O
with O
dilated B
cardiomyopathy I
and O
congestive B
heart I
failure I
. O
-
OUTPUT.CONLL
Torsade POS B-NP
de POS I-NP
pointes POS I-NP
ventricular POS I-NP
tachycardia POS I-NP
during POS O
low POS O
dose POS O
intermittent POS O
dobutamine POS O
treatment POS O
in POS O
a POS O
patient POS O
with POS O
dilated POS B-NP
cardiomyopathy POS I-NP
and POS O
congestive POS B-NP
heart POS I-NP
failure POS I-NP
. POS O
Take test.tsv, or whichever BC5CDR-Disease file you used for prediction, and replace all B tags with B-NP and all I tags with I-NP using Excel.
For example, test.tsv should look like this after modification:
Torsade B-NP
de I-NP
pointes I-NP
ventricular B-NP
tachycardia I-NP
during O
low O
dose O
intermittent O
dobutamine O
treatment O
in O
a O
patient O
with O
dilated B-NP
cardiomyopathy I-NP
and O
congestive B-NP
heart I-NP
failure I-NP
. O
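The same relabeling could also be scripted instead of done by hand. A minimal sketch follows, assuming each line holds a token and a B, I or O tag separated by whitespace; the function name `relabel_iob` is hypothetical, and the output uses a tab as separator, which is an assumption about the expected file layout.

```python
from typing import Iterable, List


def relabel_iob(tsv_lines: Iterable[str]) -> List[str]:
    """Rewrite bare B/I tags as B-NP/I-NP; leave every other line unchanged."""
    relabeled: List[str] = []
    for raw in tsv_lines:
        parts = raw.split()
        if len(parts) == 2 and parts[1] in ("B", "I"):
            # "Torsade B" becomes "Torsade<TAB>B-NP".
            relabeled.append(f"{parts[0]}\t{parts[1]}-NP")
        else:
            # O tags, blank lines and anything unexpected pass through.
            relabeled.append(raw.rstrip("\n"))
    return relabeled


if __name__ == "__main__":
    for line in relabel_iob(["Torsade\tB", "de\tI", "during\tO"]):
        print(line)
```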
Now use evaluation.py in the helper/evaluation folder to evaluate the model.
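As an illustration of what such an evaluation involves, the sketch below reads the tag column from both files and passes the result to seqeval's `classification_report`. It is not the repository's evaluation script: the helper names `read_tags` and `evaluate` are hypothetical, and the sketch assumes both files contain the same tokens in the same order, with a blank line ending each sentence.

```python
from typing import Iterable, List


def read_tags(lines: Iterable[str]) -> List[List[str]]:
    """Collect the last column of each line, one inner list per sentence.

    Works for the two-column gold file and the three-column OUTPUT.CONLL.
    Assumption: a blank line ends a sentence.
    """
    sentences: List[List[str]] = []
    current: List[str] = []
    for raw in lines:
        parts = raw.split()
        if not parts:
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(parts[-1])
    if current:
        sentences.append(current)
    return sentences


def evaluate(gold_path: str, pred_path: str) -> str:
    """Return seqeval's text report for a gold file and a prediction file."""
    # Imported here so read_tags can be used without seqeval installed.
    from seqeval.metrics import classification_report

    with open(gold_path, encoding="utf-8") as gold_file:
        y_true = read_tags(gold_file)
    with open(pred_path, encoding="utf-8") as pred_file:
        y_pred = read_tags(pred_file)
    return classification_report(y_true, y_pred)
```

With the modified gold file and the model output in place, a call such as `print(evaluate("test.tsv", "OUTPUT.CONLL"))` would print the report. seqeval expects each predicted sentence to have the same length as its gold counterpart, so mismatched sentence boundaries need to be fixed first.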