LACAN filter: Leveraging adjacent co-occurrence of atomic neighborhoods for molecular filtering
"All sorts of things in the world behave like mirrors" -Jacques Lacan
Some molecular fragments are common on their own but tend not to occur together. For example, alkyloxy radicals are frequent motifs in medicinal chemistry datasets, whereas the linkage of two such radicals into a peroxide is rather uncommon. Likewise, halides and amines are some of the most commonly occurring atomic neighborhoods, and yet their pairing results in the unstable and toxic haloamine motif. We apply this concept using co-occurrences of ECFP2-like atomic neighborhoods at the bond interface, and leverage these co-occurrence patterns to construct a molecular filter that highlights uncommon linkages.
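The idea can be illustrated with a toy sketch in plain Python. It is not LACAN's actual implementation: the environment labels are hand-written strings rather than hashed ECFP2-like neighborhoods, and the function names and thresholds (`build_profile`, `is_unusual`, `min_env`, `max_pair`) are hypothetical. It only shows how counting environments and environment pairs across bonds lets us flag a linkage whose two sides are each common but are rarely or never seen bonded to each other.

```python
from collections import Counter

def build_profile(bonds):
    """Count single environments and unordered environment pairs over all bonds."""
    env_counts = Counter()
    pair_counts = Counter()
    for a, b in bonds:
        env_counts[a] += 1
        env_counts[b] += 1
        pair_counts[frozenset((a, b))] += 1
    return env_counts, pair_counts

def is_unusual(a, b, profile, min_env=3, max_pair=0):
    """Flag a bond whose two environments are common but rarely paired.

    min_env and max_pair are illustrative thresholds, not LACAN defaults.
    """
    env_counts, pair_counts = profile
    both_common = env_counts[a] >= min_env and env_counts[b] >= min_env
    rarely_paired = pair_counts[frozenset((a, b))] <= max_pair
    return both_common and rarely_paired

# Toy reference set: each tuple is one bond, described by the
# environments of its two atoms (labels are made up for illustration).
reference_bonds = (
    [("alkoxy O", "alkyl C")] * 5
    + [("halide", "aryl C")] * 4
    + [("amine N", "alkyl C")] * 6
)
profile = build_profile(reference_bonds)

print(is_unusual("alkoxy O", "alkoxy O", profile))  # peroxide-like linkage
print(is_unusual("halide", "amine N", profile))     # haloamine-like linkage
print(is_unusual("amine N", "alkyl C", profile))    # ordinary linkage
```

In this toy reference set the peroxide-like and haloamine-like pairs are flagged because neither pairing was ever observed, while the amine to alkyl carbon bond is not.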
This is version 0.0.2alpha. It is still experimental, and breaking changes are to be expected. Several changes have been made since 0.0.1alpha, including manually hashing the environments instead of relying on a hacky use of the RDKit Morgan fingerprint generator. This version also introduces functionality for molecule generation, which is currently unoptimized and subject to change.
Clone this repo, activate your environment, navigate to the root directory, and run:
pip install .
Some notebooks with typical use cases are provided in lacan/example_notebooks. Note that these notebooks require jupyter to be installed in the Python environment. The molecule generation notebook additionally requires scikit-learn.
Import lacan and inspect a molecule by running the following commands:
from lacan import lacan
from rdkit import Chem
p = lacan.load_profile("chembl")
m = Chem.MolFromSmiles("c1ccccc1CCN(OCCc1occc1)")
score,info = lacan.score_mol(m,p)
print(info["bad_bonds"])
This will output a dictionary with an entry for every bond in the molecule. Currently the filter is binary, so the score is 1 if the molecule passes the filter and 0 if it doesn't. The reported problem bonds follow RDKit bond numbering, which means we can easily visualize problem bonds in our molecules as follows:
from rdkit.Chem import Draw
d = Draw.MolToImage(m,highlightBonds=info["bad_bonds"])
display(d)
In the resulting image, the N-O linkage is highlighted, correctly identified as problematic.
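Because the score is binary, applying the filter to a batch of molecules comes down to keeping those that score 1. The sketch below assumes only what is described above: a scoring function that takes a molecule and a profile and returns a `(score, info)` pair, with `info["bad_bonds"]` listing the problem bonds. The helper `keep_passing` and the stand-in scorer `fake_score` are hypothetical and exist only so the example runs without LACAN or RDKit installed.

```python
def keep_passing(mols, profile, score_fn):
    """Return the molecules for which score_fn reports a score of 1."""
    kept = []
    for mol in mols:
        score, _info = score_fn(mol, profile)
        if score == 1:
            kept.append(mol)
    return kept

# Stand-in scorer (NOT part of LACAN): works on SMILES strings and
# pretends that any "NO" substring marks a problem bond.
def fake_score(smiles, profile):
    bad = [0] if "NO" in smiles else []
    return (0 if bad else 1), {"bad_bonds": bad}

candidates = ["CCO", "CNOC", "c1ccccc1"]
print(keep_passing(candidates, None, fake_score))
```

With LACAN installed, the same helper would presumably be called as `keep_passing(mols, p, lacan.score_mol)`, where `mols` is a list of RDKit molecules and `p` is the loaded profile.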
This filter enables us to recombine fragments and filter out linkages that are rare in the reference set. LACAN has a "breeding" or crossover functionality in which two molecules are fragmented and recombined. By subjecting the recombinations to the LACAN filter, we can retain only decent-looking "median molecules".
Example:
from lacan import breed
from rdkit import Chem
from rdkit.Chem import Draw
m1 = Chem.MolFromSmiles("c1cc(ccc1[C@@H]2CCNC[C@H]2COc3ccc4c(c3)OCO4)F")
m2 = Chem.MolFromSmiles("CNCCC(C1=CC=CC=C1)OC2=CC=C(C=C2)C(F)(F)F")
median_molecules = breed.breed(m1,m2,p,nmols=9)
This outputs the following molecules that are "in between" their parents, paroxetine and fluoxetine:
d = Draw.MolsToGridImage(median_molecules)
display(d)
Random molecules can be generated simply using:
from lacan import gen  # assumed import path, by analogy with `from lacan import breed`
ms = gen.generate_filtered_molecules(n_jobs=-1,
                                     n_molecules=9,
                                     profile=p,
                                     seed=456,
                                     min_atoms=20)
For generation towards a goal, see the example notebooks, which showcase this functionality.
If you want to build a custom profile using your own reference data set, you can do so through the LACAN CLI as follows:
python lacan.py -i your_dataset_here.smi -m profile -p my_new_profile
This will create a pickled profile in the data folder which you can then invoke using:
p = lacan.load_profile("my_new_profile")
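If your reference set lives in Python rather than in a file, a small helper can write it out before calling the CLI. This is a minimal sketch that assumes the conventional `.smi` layout of one SMILES string per line; whether the LACAN CLI expects exactly this layout (for example, with or without a name column) is an assumption and should be checked against your installed version. The helper name `write_smi` is hypothetical.

```python
import tempfile
from pathlib import Path

def write_smi(smiles_list, path):
    """Write non-empty SMILES strings to `path`, one per line. Returns the count."""
    lines = [s.strip() for s in smiles_list if s.strip()]
    Path(path).write_text("\n".join(lines) + "\n")
    return len(lines)

# Demonstration in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "your_dataset_here.smi"
    n_written = write_smi(["CCO", "  c1ccccc1 ", ""], out)
    contents = out.read_text()

print(n_written)
print(contents)
```

The resulting file can then be passed to the CLI through the `-i` option shown above.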

