Objective
Recent version of RDKit has rdRGroupDecomposition module for R Group decomposition. You know, it is very useful module for SAR analysis. However this function is not implemented in old version of rdkit (i.e. rdkit for 32 bit windows). I would like to implement similar module for 32bit windows system.
My approach with code
So, I tried to write RGroup decomposition function. Following code is slow, it needs more improvement for large molecular data sets.
At first import required packages and dataset.
import pandas as pd from rdkit.Chem import Draw from rdkit.Chem import PandasTools from rdkit.Chem.Draw import IPythonConsole from rdkit import Chem from rdkit.Chem import AllChem from rdkit.Chem import rdFMCS from rdkit.Chem import RWMol from collections import defaultdict from rdkit.Chem.rdDepictor import SetPreferCoordGen SetPreferCoordGen(True) from rdkit import RDPaths import os mols = [m for m in Chem.SDMolSupplier(os.path.join(RDPaths.RDDocsDir,'Book/data/cdk2.sdf'))] for m in mols: AllChem.Compute2DCoords(m)
Then find MCS for molecules. I would like to extract core structure so set some options for FindMCS.
mcs = rdFMCS.FindMCS(mols[:4], completeRingsOnly=True, ringMatchesRingOnly=True) core = Chem.MolFromSmarts(mcs.smartsString)

Then define RGroupDecomposer class for rgroup decomposition. I remove not in ring atoms from core structure because MedChem often uses it for a scaffold. To do it, RWmol module is used. It is useful for molecular editing. Following code automatically remove molecule which does not have core structure.
class RGroupDecomposer(): def __init__(self, mols, core): self.mols = mols self.core = core def get_scaffold(self): emol = RWMol(self.core) atms = [a for a in emol.GetAtoms()] for a in atms: if not a.IsInRing(): emol.RemoveAtom(a.GetIdx()) return emol.GetMol() def rg_decompose(self): self.coremol = self.get_scaffold() self.match_mols = [mol for mol in self.mols if mol.HasSubstructMatch(self.coremol)] self.hmols = [Chem.AddHs(mol) for mol in self.match_mols] [AllChem.Compute2DCoords(mol) for mol in self.hmols] self.side_chains = [Chem.ReplaceCore(mol, self.coremol) for mol in self.hmols] self.core_info = Chem.ReplaceSidechains(self.hmols[0], self.coremol) def get_df(self): data = defaultdict(list) for i, side_chains in enumerate(self.side_chains): # Thank you for comment! fixed it data['mol'].apped(Chem.MolToSmiles(self.match_mols[i])) side_chains = Chem.MolToSmiles(side_chains).split('.') for idx, side_chain in enumerate(side_chains): side_chain = Chem.MolToSmiles(Chem.RemoveHs(Chem.MolFromSmiles(side_chain))) data['R_{}'.format(idx+1)].append(side_chain) data['core'].append(Chem.MolToSmiles(self.core_info)) df = pd.DataFrame(data) return df # this is helper function to convert dataframe of smiles to dataframe of ROMol. def allsmi2rdmol(df): cols = df.columns.to_list() for col in cols: PandasTools.AddMoleculeColumnToFrame(df, smilesCol=str(col), molCol=str(col)) return df
Now I finished all tasks.
Let’s check it. I used CDK2.sdf for test.
rgd = RGroupDecomposer(mols[:40], core) rgd.rg_decompose() df = rgd.get_df() newdf = allsmi2rdmol(df) from IPython.display import display_html from IPython.display import HTML HTML(df.to_html())

It seems work. I used HTML for svg image visualization. Usually this step is not required but my environment can’t render mol image without the step even if RenderImagesInAllDataFrames set True……
I’m not sure how to fix the problem ;-(
Today’s code is uploaded my gist.
Any comments and suggestions will be greatly appreciated.
Thanks for this post!
One change I would suggest to fix the error where mol column is not showing the correct molecule structure in each row by adding the append command: data[‘mol’].append(Chem.MolToSmiles(self.match_mols[i]))
or modifying RGroupDecomposer class, in the function get_df:
Hi ,
Thanks for your comment! I’ll fix it ASAP.
Done ;)