Generate possible molecules from a dataset #Chemoinformatics #RDKit

Recently @dr_greg_landrum  shared very cool Knime work flow which can enumerate possible molecules from given dataset. You can get the Work flow from following URL.
https://workflows.knime.com/knime/hub/workflows/99_Community%3A03_RDKit%3A10_Enumerate_Extra_Patent_Compounds
This work flow is really useful and elegant. If you have interested in, I recommend to use it!!!
And Patrick Walters disclosed nice code for Free-Wilson analysis on github.
https://github.com/PatWalters/Free-Wilson

Inspired these nice code, I want to write code for molecular enumeration from given molecules in python. ;)
At frist, import libraries.

import os
import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import rdFMCS
from rdkit.Chem import rdFingerprintGenerator
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem import rdRGroupDecomposition
from rdkit.Chem import rdmolops
from rdkit.Chem import RDConfig
from rdkit.Chem import PandasTools
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem.Scaffolds import MurckoScaffold

from collections import defaultdict
from itertools import product
import igraph
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
import re
IPythonConsole.ipython_useSVG = True

Then define functions. To do retrieve set of similar compounds from the dataset, I used molecular similarity graph. I used python-igraph. To use igraph, I can pick up the largest subgraph from parent graph.
In this post, I used cdk2.sdf as demo data.

def gengraph(mols,fpgen , threshold=0.7):
    fps = [fpgen.GetFingerprint(m) for m in mols]
    num_v = len(mols)
    graph = igraph.Graph()
    graph.add_vertices(num_v)
    for i in range(num_v):
        for j in range(i):
            if DataStructs.TanimotoSimilarity(fps[i], fps[j]) >= threshold:
                graph.add_edge(i, j)
    return graph
fpgen = rdFingerprintGenerator.GetMorganGenerator(2)
mols = [m for m in Chem.SDMolSupplier(os.path.join(RDConfig.RDDocsDir,'Book/data/cdk2.sdf'))]
for mol in mols:
    AllChem.Compute2DCoords(mol)
fps = [fpgen.GetFingerprint(m) for m in mols]
graph = gengraph(mols, fpgen, 0.4)
blks=graph.blocks()
simmols_idx = sorted(list(blks), key=lambda x: len(x), reverse=True)
simmols = [mols[i] for i in simmols_idx[0]]
scaff = [MurckoScaffold.GetScaffoldForMol(m) for m in simmols]

Then, get MCS. I investigated some conditions and I selected mcs2.

mcs1 = rdFMCS.FindMCS(scaff, threshold=0.7)
mcs2 = rdFMCS.FindMCS(scaff, threshold=0.7, 
                      completeRingsOnly=True,
                      matchValences=True,
                      bondCompare=rdFMCS.BondCompare.CompareOrderExact,
                      atomCompare=rdFMCS.AtomCompare.CompareElements,
                      )
mcs3 = rdFMCS.FindMCS(scaff, threshold=0.7,
                      ringMatchesRingOnly=True,
                      completeRingsOnly=True,
                      atomCompare=rdFMCS.AtomCompare.CompareAny)

Next, select molecule which has mcs2 as substructure from dataset.
And there is a hacky method, I define getMCSSmiles because following process needs to molecule which can sanitize as core.
The idea came from rdkit cook book.
https://www.rdkit.org/docs/Cookbook.html

mols_has_core = []
core = Chem.MolFromSmarts(mcs2.smartsString)
for mol in mols:
    if mol.HasSubstructMatch(core):
        AllChem.Compute2DCoords(mol)
        mols_has_core.append(mol)
def getMCSSmiles(mol, mcs):
    mcsp = Chem.MolFromSmarts(mcs.smartsString)
    match = mol.GetSubstructMatch(mcsp)
    smi = Chem.MolFragmentToSmiles(mol, atomsToUse=match)
    return smi
mcs_smi = getMCSSmiles(mols_has_core[0], mcs2)
core = Chem.MolFromSmiles(mcs_smi)
core

Now I could get core for RGroup decomposition. So I did RGDecomposition with RDKit function. And make dataset and core for molecule generation.

rgp = rdRGroupDecomposition.RGroupDecompositionParameters()
rgp.removeHydrogensPostMatch = True
rgp.alignment =True
rgp.removeAllHydrogenRGroups=True
rg = rdRGroupDecomposition.RGroupDecomposition(core, rgp)
for mol in mols_has_core:
    rg.Add(mol)
rg.Process()
dataset = rg.GetRGroupsAsColumns()
core =  Chem.RemoveHs(dataset["Core"][0])

Almost there. To generate possible combination of molecules I need following steps. 1) RGroup decomposition, 2) retrieve information of side chains, 3) Generate new molecules from all possible combination of side chains.
To do that I define two functions named makebond and enumeratemol.

def makebond(target, chain):
    newmol = Chem.RWMol(rdmolops.CombineMols(target, chain))
    atoms = newmol.GetAtoms()
    mapper = defaultdict(list)
    for idx, atm in enumerate(atoms):
        atom_map_num = atm.GetAtomMapNum()
        mapper[atom_map_num].append(idx)
    for idx, a_list in mapper.items():
        if len(a_list) == 2:
            atm1, atm2 = a_list
            rm_atoms = [newmol.GetAtomWithIdx(atm1),newmol.GetAtomWithIdx(atm2)]
            nbr1 = [x.GetOtherAtom(newmol.GetAtomWithIdx(atm1)) for x in newmol.GetAtomWithIdx(atm1).GetBonds()][0]
            nbr1.SetAtomMapNum(idx)
            nbr2 = [x.GetOtherAtom(newmol.GetAtomWithIdx(atm2)) for x in newmol.GetAtomWithIdx(atm2).GetBonds()][0]
            nbr2.SetAtomMapNum(idx)
    newmol.AddBond(nbr1.GetIdx(), nbr2.GetIdx(), order=Chem.rdchem.BondType.SINGLE)
    nbr1.SetAtomMapNum(0)
    nbr2.SetAtomMapNum(0)
    newmol.RemoveAtom(rm_atoms[0].GetIdx())
    newmol.RemoveAtom(rm_atoms[1].GetIdx())
    newmol = newmol.GetMol()
    return newmol

def enumeratemol(core,rg, maxmol=10000):
    dataset = rg.GetRGroupsAsColumns()
    labels = list(dataset.keys())
    pat = re.compile("R\d+")
    labels = [label for label in labels if pat.match(label)]
    rgs = np.asarray([dataset[label] for label in labels])
    i, j = rgs.shape
    combs = [k for k in product(range(j), repeat=i)]
    res = []
    for i in combs:
        mol = core
        for idx,j in enumerate(i):
            mol = makebond(mol, rgs[idx][j])
        AllChem.Compute2DCoords(mol)
        mol = Chem.RemoveHs(mol)
        res.append(mol)
    return res

Now ready! Let’s go generation step. And check structures.

res = enumeratemol(core,rg)
Draw.MolsToGridImage(res[:20])

It seems work fine. At last, check number of generated molecules.

print(14**3, len(res))
> 2744 2744

Yah! The code seems work fine. It is not so efficient way because if data has many RGroups, huge number of molecules are generated.
So I think combination of predictive model or / and molecular filter is useful for undesired molecule cutting.
I could learn useful function of RDKit today. Really useful toolkit for chemoinformatics!
Whole code of the post can view from following URL.
https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/enum_cpds.ipynb

Any suggestion and advices are appreciated. ;)

Published by iwatobipen

I'm medicinal chemist in mid size of pharmaceutical company. I love chemoinfo, cording, organic synthesis, my family.

4 thoughts on “Generate possible molecules from a dataset #Chemoinformatics #RDKit

  1. Hi Taka,

    This code is not working for me I have issues to reproduce your jupyter notebook example.

    BR

    Guillaume

    1. Me again,
      I confirmed the issue and now I fixed issue.
      Could you please try again following code?


      Sorry, something went wrong. Reload?
      Sorry, we cannot display this file.
      Sorry, this file is invalid so it cannot be displayed.

      view raw

      enum_cpds.ipynb

      hosted with ❤ by GitHub

      Thanks

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.