A machine learning model for elucidating molecular structures from tandem mass spectrometry data.
This is the official code release for our paper:
E. Dorigatti, J. Groß, J. Kühlborn, R. Möckel, F. Maier and J. Keupp, Enhancing automated drug substance impurity structure elucidation from tandem mass spectra through transfer learning and domain knowledge, Digital Discovery, 2025, DOI: 10.1039/D5DD00115C.
Installation via Poetry:
poetry install
The supplementary material for our publication is available on Zenodo and includes, among other things, the model checkpoints, the tokenizer, the training and test datasets, and our model's predictions.
Predictions for new MS/MS spectra can be obtained via the following command:
> python seismiq/prediction/predict.py --help
Usage: predict.py [OPTIONS] MODEL_NAME INPUT_FILE RESULT_FILE
Options:
--num-beams INTEGER Number of beams for beam search
--max-sampling-steps INTEGER Maximum number of sampling steps
--peak-mz-noise FLOAT Noise level for peak m/z values
--skip-wrong-atom-count / --keep-wrong-atom-count
Skip samples with wrong heavy atom count
--keep-partial-samples / --skip-partial-samples
Keep samples that were not fully generated
--match-hydrogen-count / --no-match-hydrogen-count
Match number of hydrogen atoms in generated
samples
--help Show this message and exit.
Here, MODEL_NAME is either an explicit path to a checkpoint file or the name (without extension) of a checkpoint file located in the folder given by the environment variable SEISMIQ_CHECKPOINTS_FOLDER.
We provide checkpoints for the pretrained model as well as a model finetuned on CASMI in our Zenodo dataset:
mkdir -p dev
wget -O dev/checkpoints.zip "https://zenodo.org/records/16438770/files/checkpoints.zip?download=1" \
&& unzip dev/checkpoints.zip -d dev/ \
&& rm dev/checkpoints.zip \
&& export SEISMIQ_CHECKPOINTS_FOLDER=dev/checkpoints
wget -O dev/tokenizer.pkl "https://zenodo.org/records/16438770/files/tokenizer.pkl?download=1" \
&& export SEISMIQ_TOKENIZER_OVERRIDE=dev/tokenizer.pkl
Then, to obtain predictions from the pretrained model (make sure to have enough RAM or a capable GPU):
python seismiq/prediction/predict.py \
seismiq_pretrained \
resources/examples/casmi_2016_5.json \
predictions.csv
The input file should contain a JSON list of challenges including, at minimum, the spectrum and sum formula, for example:
[
{
"sum_formula": "C7H10N2",
"spectrum": [
[
79.0414, // mass
1767018.1 // unnormalized intensity
],
// etc. ...
]
// additional optional keys
// - smiles_prefix (str) : prompt for the model
// - adduct_shift (float) : adduct shift, defaults to M+H
// - true_smiles (str) : actual SMILES used to evaluate the predictions (if given)
// - max_sampling_steps (int) : maximum number of tokens to sample
},
// etc ...
]
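Such a file can also be assembled programmatically; the following sketch reproduces the example above (challenges.json is an arbitrary file name):
import json

# Minimal sketch: the peak is the one from the example above.
challenges = [
    {
        "sum_formula": "C7H10N2",
        "spectrum": [
            [79.0414, 1767018.1],  # [m/z, unnormalized intensity]
            # ... more peaks ...
        ],
        # optional keys: "smiles_prefix", "adduct_shift", "true_smiles", "max_sampling_steps"
    }
]

with open("challenges.json", "w") as f:
    json.dump(challenges, f, indent=2)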
And the output file will contain the predictions in CSV format, for example:
index,perplexity,tanimoto,pred_smiles,generation_count
0,2.434374139169987,0.0,CCC=NNc1ncnc(N)n1,1
1,2.45338491081399,0.0,N#CNC(=NCCCN)NC#N,1
2,2.200015703998429,0.0,C=CCN=C(N)Nc1ncn[nH]1,1
3,2.3174031691083026,0.0,C=CCNC(N)=Nc1ncn[nH]1,1
Here, pred_smiles contains the predicted SMILES and generation_count the number of times the model generated this molecule, possibly via different SMILES strings (only the lowest-perplexity SMILES is returned).
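For a quick look at the results, the CSV can be loaded with pandas, for example (the column names are those shown above):
import pandas as pd

# Load the predictions and pick the one with the lowest perplexity.
preds = pd.read_csv("predictions.csv")
best = preds.sort_values("perplexity").iloc[0]
print(best["pred_smiles"], best["perplexity"], best["generation_count"])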
Data preparation and model training are handled via Lightning and configured through yaml files passed to the training script:
python seismiq/prediction/train.py fit --config configs/seismiq_pretrained.yaml
This configuration trains the pretrained model as described in our publication, using the dataset on Zenodo:
mkdir -p dev
wget -O dev/training_data.csv.gz "https://zenodo.org/records/16438770/files/training_data.csv.gz?download=1" \
&& gunzip dev/training_data.csv.gz
# Test datasets, necessary to remove test molecules from training data
wget -O dev/test_data.zip "https://zenodo.org/records/16438770/files/test_datasets.zip?download=1" \
&& unzip dev/test_data.zip -d dev/ \
&& rm dev/test_data.zip \
&& export SEISMIQ_TEST_DATA_FOLDER=dev/test_datasets
wget -O dev/tokenizer.pkl "https://zenodo.org/records/16438770/files/tokenizer.pkl?download=1" \
&& export SEISMIQ_TOKENIZER_OVERRIDE=dev/tokenizer.pkl
On first launch, the CSV dataset will be converted to pickle files. It will take a while.
Data preparation is performed automatically upon invocation of the training script whenever the base folder indicated in the data storage configuration does not exist:
data:
  class_path: seismiq.prediction.llm.data_module.EncoderDecoderLlmDataModule
  init_args:
    storage:
      class_path: seismiq.prediction.data.storage.OnDiskBlockDataStorage
      init_args:
        base_folder: dev/training_data
        block_size: 10000
    preparer:
      class_path: seismiq.prediction.data.preparation.CsvDataPreparer
      init_args:
        csv_file: dev/training_data.csv
    # other data module parameters ...
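Each class_path names a class and its init_args are passed to the constructor, so the data configuration above roughly corresponds to the following Python (a sketch only, assuming the constructors accept exactly the arguments shown in the YAML):
from seismiq.prediction.data.preparation import CsvDataPreparer
from seismiq.prediction.data.storage import OnDiskBlockDataStorage
from seismiq.prediction.llm.data_module import EncoderDecoderLlmDataModule

# Sketch only: assumes each class accepts exactly the keyword arguments
# listed under init_args in the YAML above.
data_module = EncoderDecoderLlmDataModule(
    storage=OnDiskBlockDataStorage(
        base_folder="dev/training_data",
        block_size=10000,
    ),
    preparer=CsvDataPreparer(csv_file="dev/training_data.csv"),
    # other data module parameters ...
)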
Data for training is saved as a sequence of pickle files, and the storage object takes care of saving and loading samples in this format. The preparer object reads data from some source and transforms it into SeismiqSample objects, which eventually end up in the pickle files and contain all information necessary to train the model. The default CsvDataPreparer reads the provided CSV.
We also provide a SyntheticDataPreparer class which can be used to produce a dataset of simulated mass spectra given a list of molecules in SMILES format. It is meant to be subclassed or encapsulated by another class that loads these molecules from a source of your choice.
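As an illustration, a hypothetical preparer loading molecules from a plain-text file could look as follows (the import path and the load_molecules hook are assumptions; check the SyntheticDataPreparer source for the actual method to override):
from seismiq.prediction.data.preparation import SyntheticDataPreparer  # assumed module path

# Hypothetical sketch: reads molecules from a plain-text file (one SMILES per
# line) and hands them to SyntheticDataPreparer for spectrum simulation.
class FileSyntheticDataPreparer(SyntheticDataPreparer):
    def __init__(self, smiles_file: str, **kwargs):
        super().__init__(**kwargs)
        self.smiles_file = smiles_file

    # Assumed hook name; the actual SyntheticDataPreparer API may differ.
    def load_molecules(self) -> list[str]:
        with open(self.smiles_file) as f:
            return [line.strip() for line in f if line.strip()]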
To simulate the mass spectra for training, please install CFM-ID and FragGenie, which are available from their respective repositories. The two predictors are wrapped by two scripts indicated by the environment variables SEISMIQ_FRAGGENIE_PROGRAM and SEISMIQ_CFM_ID_PROGRAM. Examples for these scripts are provided in the ./scripts folder.
To fine-tune the model on a dataset, use the pretrained configuration in conjunction with the dataset configuration and the finetuning configuration. The latter also specifies a default path for the pre-trained checkpoint, which can be overridden. If using another checkpoint, also make sure that it is compatible with the specified pretrained configuration.
Concretely, to fine-tune on simulated spectra of the CASMI challenges:
wget -O dev/all_casmi_simulated.pkl "https://zenodo.org/records/16438770/files/all_casmi_simulated.pkl?download=1"
python seismiq/prediction/train.py fit --config configs/seismiq_pretrained.yaml \
--config configs/seismiq_finetuned.yaml \
--config configs/data_casmi.yaml \
--ckpt_path dev/checkpoints/seismiq_pretrained.ckpt
Evaluation is performed by two scripts, one for de novo generation and one for fragmentation, for example:
# or seismiq/prediction/eval_on_fragments.py
python seismiq/prediction/eval_on_test_datasets.py run-single \
seismiq_pretrained casmi_2016 dev/test_results/pretrained-c16.pkl
These scripts can also launch a parallel evaluation of all models on all datasets on SLURM:
python seismiq/prediction/eval_on_test_datasets.py \
make-slurm-command --slurm-flags "--partition gpu --mem 96G --gres gpu:1" | bash
To use the public test datasets, define the environment variable SEISMIQ_TEST_DATA_FOLDER pointing to a folder with the appropriate JSON files, and SEISMIQ_CHECKPOINTS_FOLDER pointing to a folder with the checkpoints.
Our model's predictions are provided on Zenodo, and the code to download them and generate the result figures in the paper is in the Jupyter notebook notebooks/figures.ipynb. In the same folder, the notebook confidence_intervals.ipynb computes the confidence intervals reported in the supplementary information S5, and sirius_ranking.ipynb compares ranking by perplexity and ranking by CSI:FingerID score, producing figure S2 in the supplementary information.
Exemplary scripts and files are provided to demonstrate how to extract reaction templates from a given reaction SMILES string and apply them for forward prediction on reactants, as was done for the results in the manuscript.
A template can be extracted from a reaction SMILES string, for example:
python seismiq/impurity_simulation/extract_template.py \
resources/examples/reaction_smiles.txt ./templates.txt
By providing reactants in SMILES notation and reaction templates, possible impurity structures can be generated, for example:
python seismiq/impurity_simulation/template_impurity_prediction.py \
resources/examples/reaction_smiles.txt resources/examples/reaction_templates.txt ./impurities.py
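For illustration only (this is not the repository's code), the same forward-prediction idea can be reproduced with RDKit by applying a reaction SMARTS template to reactant molecules; the esterification template and reactants below are toy examples:
from rdkit import Chem
from rdkit.Chem import AllChem

# Apply a reaction template (reaction SMARTS) to reactants to enumerate products.
template = AllChem.ReactionFromSmarts("[C:1](=[O:2])O.[O:3][C:4]>>[C:1](=[O:2])[O:3][C:4]")
acid = Chem.MolFromSmiles("CC(=O)O")   # acetic acid
alcohol = Chem.MolFromSmiles("OCC")    # ethanol
for products in template.RunReactants((acid, alcohol)):
    for product in products:
        Chem.SanitizeMol(product)
        print(Chem.MolToSmiles(product))  # CCOC(C)=O (ethyl acetate)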
If you use our model or dataset, we would be grateful if you acknowledged our publication:
@article{Dorigatti_2025_seismiq,
title = {Enhancing automated drug substance impurity structure elucidation from tandem mass spectra through transfer learning and domain knowledge.},
ISSN = {2635-098X},
url = {http://dx.doi.org/10.1039/D5DD00115C},
DOI = {10.1039/d5dd00115c},
journal = {Digital Discovery},
publisher = {Royal Society of Chemistry (RSC)},
author = {Dorigatti, Emilio and Groß, Jonathan and K\"{u}hlborn, Jonas and M\"{o}ckel, Robert and Maier, Frank and Keupp, Julian},
year = {2025}
}