Make RGroup decompose table with old ver. RDKit #RDKit #Chemoinformatics

Recent version of RDKit has rdRGroupDecomposition module for R Group decomposition. You know, it is very useful module for SAR analysis. However this function is not implemented in old version of rdkit (i.e. rdkit for 32 bit windows). I would like to implement similar module for 32bit windows system.

My approach with code
So, I tried to write RGroup decomposition function. Following code is slow, it needs more improvement for large molecular data sets.
At first import required packages and dataset.

import pandas as pd
from rdkit.Chem import Draw
from rdkit.Chem import PandasTools
from rdkit.Chem.Draw import IPythonConsole
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import rdFMCS
from rdkit.Chem import RWMol
from collections import defaultdict
from rdkit.Chem.rdDepictor import SetPreferCoordGen
from rdkit import RDPaths
import os
mols = [m for m in Chem.SDMolSupplier(os.path.join(RDPaths.RDDocsDir,'Book/data/cdk2.sdf'))]
for m in mols:

Then find MCS for molecules. I would like to extract core structure so set some options for FindMCS.

mcs = rdFMCS.FindMCS(mols[:4], completeRingsOnly=True, ringMatchesRingOnly=True)
core = Chem.MolFromSmarts(mcs.smartsString)

Then define RGroupDecomposer class for rgroup decomposition. I remove not in ring atoms from core structure because MedChem often uses it for a scaffold. To do it, RWmol module is used. It is useful for molecular editing. Following code automatically remove molecule which does not have core structure.

class RGroupDecomposer():

    def __init__(self, mols, core):
        self.mols = mols
        self.core = core
    def get_scaffold(self):
        emol = RWMol(self.core)
        atms = [a for a in emol.GetAtoms()]
        for a in atms:
            if not a.IsInRing():
        return emol.GetMol()
    def rg_decompose(self):
        self.coremol = self.get_scaffold()
        self.match_mols = [mol for mol in self.mols if mol.HasSubstructMatch(self.coremol)]
        self.hmols = [Chem.AddHs(mol) for mol in self.match_mols]
        [AllChem.Compute2DCoords(mol) for mol in self.hmols]
        self.side_chains = [Chem.ReplaceCore(mol, self.coremol) for mol in self.hmols]
        self.core_info = Chem.ReplaceSidechains(self.hmols[0], self.coremol)
    def get_df(self):
        data = defaultdict(list)
        for i, side_chains in enumerate(self.side_chains):
            # Thank you for comment! fixed it
            side_chains = Chem.MolToSmiles(side_chains).split('.')
            for idx, side_chain in enumerate(side_chains):
                side_chain = Chem.MolToSmiles(Chem.RemoveHs(Chem.MolFromSmiles(side_chain)))
        df = pd.DataFrame(data)
        return df

# this is helper function to convert dataframe of smiles to dataframe of ROMol.
def allsmi2rdmol(df):
    cols = df.columns.to_list()
    for col in cols:
        PandasTools.AddMoleculeColumnToFrame(df, smilesCol=str(col), molCol=str(col))
    return df

Now I finished all tasks.
Let’s check it. I used CDK2.sdf for test.

rgd = RGroupDecomposer(mols[:40], core)
df = rgd.get_df()
newdf = allsmi2rdmol(df)
from IPython.display import display_html
from IPython.display import HTML

It seems work. I used HTML for svg image visualization. Usually this step is not required but my environment can’t render mol image without the step even if RenderImagesInAllDataFrames set True……

I’m not sure how to fix the problem ;-(

Today’s code is uploaded my gist.

code example for RGr decompose

Any comments and suggestions will be greatly appreciated.