Draw RDKit mol/reaction object on HTML without static png image #RDKit #memo

I think this post might not useful for many people. Just for my memorandum.

When I make web app, I use draw SVG function for rdkit object rendering. But from last my post and Greg’s gist, I could learn draw png image without writing static png files. So I wonder that can I draw PNG image on HTML without png file.

Fortunately I could find the solution!

To do that, at first draw rdkit object and convert it png byte text.
Then encode base64 and decode UTF8, that’s all!

Generated text can render in HTML with <img src=’data:image/png;base64> tag.

Today’s code is below. ;)

Model interporation with new drawing code of RDKit #RDKit #Machine learning #chemoinformatics

Following code does not use new drawing code but it revised one of my old post. :)

I think, everyone who visits my blog has already read Gregs nice blog post about the drawing similarity map with new code. If you don’t read it I recommend to read it soon. URL is below. ;)

Now I like the drawing code because it can draw only atom property of own molecule but also compare molecular similarity and fingerprint contribution of predictive model.

Interpolation of the machine learning model is very important because it is use for next design of the new molecule.

So let’s write code with new drawing code. Following code is almost same as RDKit cook book and Greg’s blog post. Import some packages and check the version of rdkit. I used the newest version.

%matplotlib inline
import matplotlib.pyplot as plt
import os
from functools import partial
from rdkit import Chem
from rdkit import rdBase
from rdkit import RDPaths
from rdkit.Chem import Draw
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem.Draw import SimilarityMaps, IPythonConsole
from IPython.display import SVG
import io
from PIL import Image
import rdkit
import numpy as np
from sklearn.ensemble import RandomForestClassifier

Load solubility data which is provided from rdkit data folder for example.

trainpath = os.path.join(RDPaths.RDDocsDir,'Book/data/solubility.train.sdf')
testpath =  os.path.join(RDPaths.RDDocsDir,'Book/data/solubility.test.sdf')
train_mols = [m for m in Chem.SDMolSupplier(trainpath) if m is not None]
test_mols = [m for m in Chem.SDMolSupplier(testpath) if m is not None]
val_dict = {'(A) low':0,
           '(B) medium':1,
           '(C) high':2}

Then define some helper function.

def mol2arr(mol, fpfunc):
    fp = fpfunc(mol)
    arr = np.zeros((0,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr
# https://rdkit.blogspot.com/2020/01/similarity-maps-with-new-drawing-code.html
def show_png(data):
    bio = io.BytesIO(data)
    img = Image.open(bio)
    return img
fpfunc = partial(SimilarityMaps.GetMorganFingerprint, nBits=1024, radius=2)

Then calculate fingerprint and get solubility class. And make randomforest classifier object and fit it. It’s very common way for sunday chemoinformatician. ;)

trainX = [fpfunc(m) for m in train_mols]
trainY = [val_dict[m.GetProp('SOL_classification')] for m in train_mols]
testX = [fpfunc(m) for m in test_mols]
testY = [val_dict[m.GetProp('SOL_classification')] for m in test_mols]
rfc = RandomForestClassifier(random_state=794)
rfc.fit(trainX, trainY)

Next define getProba function the code is borrowed from rdkit cookbook. One difference to cookbook is I used third probability because I would like to visualize effect of the each fingerprint to the high solubility. And defined draw function with some optional arguments. ‘alpha’ is useful because it can control the blending value for the contour lines. It is key for beautiful visualization.

# https://www.rdkit.org/docs/Cookbook.html
# Following example I would like to get probability of High solubility
def getProba(fp, predctionFunction):
    return predctionFunction((fp,))[0][2]
def drawmol(mols, idx):
    d = Draw.MolDraw2DCairo(1,1)
    _, maxWeight = SimilarityMaps.GetSimilarityMapForModel(mols[idx],fpfunc, 
                                                           lambda x: getProba(x, rfc.predict_proba),

Now almost there. Let’s visualize High and Low solubility molecules with the drawing code!

high_test_mols = [mol for mol in test_mols if mol.GetProp('SOL_classification') == '(C) high']
low_test_mols = [mol for mol in test_mols if mol.GetProp('SOL_classification') == '(A) low']
drawmol(high_test_mols, 0)

Hmm good, the model detects hydroxyl group is positive effect to solubility. How about second example?

drawmol(high_test_mols, 10)

OK, carbonyl group detected as positive group for solubility.

drawmol(low_test_mols, 5)

In this case it is difficult because all atoms are lipophilic. So remove any atoms are positive effect to solubility in my understanding.

The code investigates the effect of the each atom by removing one by one. And calculate difference of probability of prob molecule and atom remove molecule.

So, the magnitude of color shows relative effect of the each atom not absolute effect. This is reason why low solubility molecule shows many positive (red) colors in this example.

Any way, model interporation with 2D color mapping on the molecule is quite nice tool for QSAR analysis.

I would like to write more example codes.

Today’s code can be found on my gist and github.


Any comments suggestions and requests will be greatly appreciated. ;)


Visualise dataset using seaborn.

I often use scatter plot to analyse relationship of 2 valuables.
There are a lots of tools for visualise dataset.
Now, I challenged to use seaborn.
Seaborn is python library based on matplotlib. And easy to make cool visualisation.

Following snippet, I read sample data from rdkit install folder and plot MolWt vs LogP using jointplot.

Jointplot can show not only relation ship but also distribution of dataset.
If you want to draw it, you need only type ‘sns.jointplot’! ;-)
Also this library is compatible with Pandas.
Seaborn seems to young library (most resent version is ver 0.6.) but very useful and cool library I think.
Here is sample. This is jointplot using hexbinplot.
Screen Shot 2015-08-24 at 10.47.04 PM

I pushed snippet to git-hub.

#this is not .py format
#code from ipynb
#using pybal inline mode

get_ipython().magic(u'matplotlib inline')
from rdkit import Chem
from rdkit.Chem import PandasTools
from rdkit.Chem import Descriptors
from rdkit import rdBase
from rdkit import RDConfig
import seaborn as sns
import os
datapath = os.path.join(RDConfig.RDDocsDir,"Book/data/cdk2.sdf")
moldf = PandasTools.LoadSDF(datapath)
def molwt(mol):
    mw = Descriptors.MolWt(mol)
    return mw
def logp(mol):
    logp = Descriptors.MolLogP(mol)
    return logp
moldf['MW'] = moldf.ROMol.apply(molwt)
moldf['LOGP'] = moldf.ROMol.apply(logp)
from scipy.stats import kendalltau
x = moldf.MW
y = moldf.LOGP
sns.jointplot(x,y, kind='hex', stat_func=kendalltau)
sns.jointplot(x,y, kind='kde', stat_func=kendalltau)
sns.jointplot(x,y, kind='resid', stat_func=kendalltau)