Recently @dr_greg_landrum shared very cool Knime work flow which can enumerate possible molecules from given dataset. You can get the Work flow from following URL.
https://workflows.knime.com/knime/hub/workflows/99_Community%3A03_RDKit%3A10_Enumerate_Extra_Patent_Compounds
This work flow is really useful and elegant. If you have interested in, I recommend to use it!!!
And Patrick Walters disclosed nice code for Free-Wilson analysis on github.
https://github.com/PatWalters/Free-Wilson
Inspired these nice code, I want to write code for molecular enumeration from given molecules in python. ;)
At frist, import libraries.
import os import numpy as np import pandas as pd from rdkit import Chem from rdkit.Chem import rdFMCS from rdkit.Chem import rdFingerprintGenerator from rdkit.Chem import AllChem from rdkit.Chem import DataStructs from rdkit.Chem import rdRGroupDecomposition from rdkit.Chem import rdmolops from rdkit.Chem import RDConfig from rdkit.Chem import PandasTools from rdkit.Chem.Draw import IPythonConsole from rdkit.Chem.Scaffolds import MurckoScaffold from collections import defaultdict from itertools import product import igraph from rdkit.Chem import Draw from rdkit.Chem.Draw import IPythonConsole import re IPythonConsole.ipython_useSVG = True
Then define functions. To do retrieve set of similar compounds from the dataset, I used molecular similarity graph. I used python-igraph. To use igraph, I can pick up the largest subgraph from parent graph.
In this post, I used cdk2.sdf as demo data.
def gengraph(mols,fpgen , threshold=0.7): fps = [fpgen.GetFingerprint(m) for m in mols] num_v = len(mols) graph = igraph.Graph() graph.add_vertices(num_v) for i in range(num_v): for j in range(i): if DataStructs.TanimotoSimilarity(fps[i], fps[j]) >= threshold: graph.add_edge(i, j) return graph fpgen = rdFingerprintGenerator.GetMorganGenerator(2) mols = [m for m in Chem.SDMolSupplier(os.path.join(RDConfig.RDDocsDir,'Book/data/cdk2.sdf'))] for mol in mols: AllChem.Compute2DCoords(mol) fps = [fpgen.GetFingerprint(m) for m in mols] graph = gengraph(mols, fpgen, 0.4) blks=graph.blocks() simmols_idx = sorted(list(blks), key=lambda x: len(x), reverse=True) simmols = [mols[i] for i in simmols_idx[0]] scaff = [MurckoScaffold.GetScaffoldForMol(m) for m in simmols]
Then, get MCS. I investigated some conditions and I selected mcs2.
mcs1 = rdFMCS.FindMCS(scaff, threshold=0.7) mcs2 = rdFMCS.FindMCS(scaff, threshold=0.7, completeRingsOnly=True, matchValences=True, bondCompare=rdFMCS.BondCompare.CompareOrderExact, atomCompare=rdFMCS.AtomCompare.CompareElements, ) mcs3 = rdFMCS.FindMCS(scaff, threshold=0.7, ringMatchesRingOnly=True, completeRingsOnly=True, atomCompare=rdFMCS.AtomCompare.CompareAny)
Next, select molecule which has mcs2 as substructure from dataset.
And there is a hacky method, I define getMCSSmiles because following process needs to molecule which can sanitize as core.
The idea came from rdkit cook book.
https://www.rdkit.org/docs/Cookbook.html
mols_has_core = [] core = Chem.MolFromSmarts(mcs2.smartsString) for mol in mols: if mol.HasSubstructMatch(core): AllChem.Compute2DCoords(mol) mols_has_core.append(mol) def getMCSSmiles(mol, mcs): mcsp = Chem.MolFromSmarts(mcs.smartsString) match = mol.GetSubstructMatch(mcsp) smi = Chem.MolFragmentToSmiles(mol, atomsToUse=match) return smi mcs_smi = getMCSSmiles(mols_has_core[0], mcs2) core = Chem.MolFromSmiles(mcs_smi) core
Now I could get core for RGroup decomposition. So I did RGDecomposition with RDKit function. And make dataset and core for molecule generation.
rgp = rdRGroupDecomposition.RGroupDecompositionParameters() rgp.removeHydrogensPostMatch = True rgp.alignment =True rgp.removeAllHydrogenRGroups=True rg = rdRGroupDecomposition.RGroupDecomposition(core, rgp) for mol in mols_has_core: rg.Add(mol) rg.Process() dataset = rg.GetRGroupsAsColumns() core = Chem.RemoveHs(dataset["Core"][0])
Almost there. To generate possible combination of molecules I need following steps. 1) RGroup decomposition, 2) retrieve information of side chains, 3) Generate new molecules from all possible combination of side chains.
To do that I define two functions named makebond and enumeratemol.
def makebond(target, chain): newmol = Chem.RWMol(rdmolops.CombineMols(target, chain)) atoms = newmol.GetAtoms() mapper = defaultdict(list) for idx, atm in enumerate(atoms): atom_map_num = atm.GetAtomMapNum() mapper[atom_map_num].append(idx) for idx, a_list in mapper.items(): if len(a_list) == 2: atm1, atm2 = a_list rm_atoms = [newmol.GetAtomWithIdx(atm1),newmol.GetAtomWithIdx(atm2)] nbr1 = [x.GetOtherAtom(newmol.GetAtomWithIdx(atm1)) for x in newmol.GetAtomWithIdx(atm1).GetBonds()][0] nbr1.SetAtomMapNum(idx) nbr2 = [x.GetOtherAtom(newmol.GetAtomWithIdx(atm2)) for x in newmol.GetAtomWithIdx(atm2).GetBonds()][0] nbr2.SetAtomMapNum(idx) newmol.AddBond(nbr1.GetIdx(), nbr2.GetIdx(), order=Chem.rdchem.BondType.SINGLE) nbr1.SetAtomMapNum(0) nbr2.SetAtomMapNum(0) newmol.RemoveAtom(rm_atoms[0].GetIdx()) newmol.RemoveAtom(rm_atoms[1].GetIdx()) newmol = newmol.GetMol() return newmol def enumeratemol(core,rg, maxmol=10000): dataset = rg.GetRGroupsAsColumns() labels = list(dataset.keys()) pat = re.compile("R\d+") labels = [label for label in labels if pat.match(label)] rgs = np.asarray([dataset[label] for label in labels]) i, j = rgs.shape combs = [k for k in product(range(j), repeat=i)] res = [] for i in combs: mol = core for idx,j in enumerate(i): mol = makebond(mol, rgs[idx][j]) AllChem.Compute2DCoords(mol) mol = Chem.RemoveHs(mol) res.append(mol) return res
Now ready! Let’s go generation step. And check structures.
res = enumeratemol(core,rg) Draw.MolsToGridImage(res[:20])
It seems work fine. At last, check number of generated molecules.
print(14**3, len(res)) > 2744 2744
Yah! The code seems work fine. It is not so efficient way because if data has many RGroups, huge number of molecules are generated.
So I think combination of predictive model or / and molecular filter is useful for undesired molecule cutting.
I could learn useful function of RDKit today. Really useful toolkit for chemoinformatics!
Whole code of the post can view from following URL.
https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/enum_cpds.ipynb
Any suggestion and advices are appreciated. ;)
Hi Taka,
This code is not working for me I have issues to reproduce your jupyter notebook example.
BR
Guillaume
Hi Guillaume,
I’ll check my code and back to here. Pls give me a time.
Best
Me again,
I confirmed the issue and now I fixed issue.
Could you please try again following code?
enum_cpds.ipynb
hosted with ❤ by GitHub
Thanks