In this project, I test whether a general-purpose large language model (LLM) can be adapted to handle chemistry tasks well.
I start with OLMo-7B, a general LLM pre-trained on the Dolma dataset, and then perform continued pre-training using QLoRA.
I evaluate the model on the MoleculeNet benchmark, skipping big datasets like QM9 and SIDER to save time and resources.
I train the model on 2.1 million raw SMILES strings from the smiles-molecules-chembl dataset.
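As a rough illustration of how raw SMILES strings can be prepared for causal-LM continued pre-training, here is a minimal sketch that packs newline-separated molecules into fixed-size text chunks. The chunk size and example molecules are illustrative assumptions, not the project's actual preprocessing.

```python
# Minimal sketch: pack raw SMILES strings into fixed-size text chunks
# for continued pre-training. Newlines separate molecules.
# NOTE: chunk size and example molecules are illustrative only.

def pack_smiles(smiles_list, chunk_chars=64):
    """Concatenate SMILES (newline-separated) and split into chunks."""
    corpus = "\n".join(smiles_list)
    return [corpus[i:i + chunk_chars] for i in range(0, len(corpus), chunk_chars)]

smiles = [
    "CCO",                    # ethanol
    "c1ccccc1",               # benzene
    "CC(=O)Oc1ccccc1C(=O)O",  # aspirin
]
chunks = pack_smiles(smiles, chunk_chars=16)
print(chunks[0])  # first 16-character chunk of the packed corpus
```

In practice each chunk would then be tokenized and fed to the model as an ordinary language-modeling sequence.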
I use QLoRA for pre-training with these settings:

- Target modules: all_linear
- Rank: 64
- Alpha: 128
- Learning rate: 5e-5
- Training script: RawSmiles.py (for ChemOlmo-7B)
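The settings above correspond roughly to the following PEFT/bitsandbytes configuration. This is a sketch, not the project's actual script: it assumes the Hugging Face `transformers` and `peft` libraries, and the model hub ID and NF4 quantization details are my assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization for QLoRA (NF4 compute setup is assumed here).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "allenai/OLMo-7B",  # base model (hub ID assumed)
    quantization_config=bnb_config,
)

# LoRA adapter matching the settings listed above.
lora_config = LoraConfig(
    r=64,                         # rank
    lora_alpha=128,               # alpha
    target_modules="all-linear",  # apply LoRA to every linear layer
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# Training then proceeds with a learning rate of 5e-5
# (e.g., via transformers.Trainer with TrainingArguments(learning_rate=5e-5)).
```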
All benchmark code is in the Notebooks folder.

