Compare the view point of different QSAR models #RDKit #visualize #chemoinformatics

Some days ago, I posted how to visualize SVG images horizontally and ngboost for QSAR problem. It worked well. And I found that different models showed different performance. So my question is that which point each model detects important for molecular properties.

Fortunately rdkit has GetSimilarityMapForModel method which can render the probe molecule with model’s feature importance.

Today I tried to combine previous method and this useful rdkit method to compare the models feature. At first import required packages.

import os
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem.Draw import rdDepictor
from rdkit.Chem.Draw import rdMolDraw2D
import numpy as np
import io
from PIL import Image
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from ngboost import NGBClassifier
from ngboost.ngboost import NGBoost
from ngboost.learners import default_tree_learner
from ngboost.distns import Normal, LogNormal
from ngboost.distns import k_categorical
from sklearn.metrics import classification_report
from functools import partial
from rdkit.Chem import Draw
from IPython.display import SVG

Then read train and test data, calculate fingerprint and get target values.

train = os.path.join(RDConfig.RDDocsDir, 'Book/data/solubility.train.sdf')
train_mols = [m for m in Chem.SDMolSupplier(train)]
test = os.path.join(RDConfig.RDDocsDir, 'Book/data/solubility.test.sdf')
test_mols = [m for m in Chem.SDMolSupplier(test)]

val_dict = {'(A) low':0,
           '(B) medium':1,
           '(C) high':2}


def mol2arr(mol, radi=2, nBits=1024):
    arr = np.zeros((1,))
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radi, nBits)
    DataStructs.ConvertToNumpyArray(fp, arr)

    return arr

trainX = np.array([mol2arr(m) for m in train_mols])
trainY = np.array([val_dict[m.GetProp('SOL_classification')] for m in train_mols])

testX = np.array([mol2arr(m) for m in test_mols])
testY = np.array([val_dict[m.GetProp('SOL_classification')] for m in test_mols])

Next, define classifier models. And train them.

adbcls = AdaBoostClassifier()
svc = SVC(probability=True)
ngbc = NGBClassifier(Dist=k_categorical(3))
clsset = [adbcls, svc, ngbc]
for clsfier in clsset:
    clsfier.fit(trainX, trainY)

Then I made subgroup which is grouped by solubility class.

high_test_mols = [mol for mol in test_mols if mol.GetProp('SOL_classification') == '(C) high']
low_test_mols = [mol for mol in test_mols if mol.GetProp('SOL_classification') == '(A) low']

Then I defined drawmol function, the function makes SVG text of similaritymap for model. I passed MolDraw2DSVG to draw2d options, it means that GetSimilarityMapForModel draw the output to SVG drawer. And defined HorizontalDisplay class as same as previous post.

def drawmol(mols, idx, predictfunc):
    d = Draw.MolDraw2DSVG(200,250)
    fig, maxWeight = SimilarityMaps.GetSimilarityMapForModel(mols[idx],fpfunc, 
                                                           lambda x: getProba(x, predictfunc.predict_proba),
                                                           draw2d = d,
                                                           colorMap='coolwarm', #this option wasn't applied why?? 
                                                           size=(100,100),
                                                           contourLines=0,
                                                           step=0.001,
                                                           alpha=0.2,
                                                           )
    d.FinishDrawing()
    txt = d.GetDrawingText()
    return txt

class HorizontalDisplay:
    def __init__(self, svgtxtlist):
        self.svgtxtlist = svgtxtlist

    def _repr_html_(self):
        template = '<div style="float:left;padding:10px;">{0}</div>'
        return "\n".join(template.format(svgtxt)
                         for svgtxt in self.svgtxtlist)

Now ready. Let’s display molecules with these methods.

Following example are selected from high/low test molecules.

#Adaboost, SVC, NGBOOST
svgset = []
for cls in clsset:
svgset.append(drawmol(high_test_mols, 3, cls))
HorizontalDisplay(svgset)

Adaboost classifier strongly responded tert-methyl group compared to other models. Ngboost classifier says methyl groups are not good for solubility.

Next example, Adaboost couldn’t phenol OH groups are positive for solubility but SVC and NGBoost said OH is important for solubility. But these two modes showed different response to aromatic carbon.

By using the method, user can reveal that where is important point for the model. So it will be good material for discussion with chemists I think.

But the method is open to some debate. Coloring, normalize atom features, etc…

QSAR model should not be used as just a black box which return positive or negative, it should be used for discussion and good decision making.

Today’s code can be found following URL. Thanks for reading.

https://github.com/iwatobipen/playground/blob/master/ngboost2.ipynb

Draw molecules as SVG in horizontal layout #Drawing #RDKit #memo

As you know, Greg posted cool code about new drawing code options of rdkit 202003. You can read details of them in following URL
http://rdkit.blogspot.com/2020/04/new-drawing-options-in-202003-release.html

It’s really cool! New version of rdkit can render molecule with many options in high quarity.

In the post, molecules are rendered as SVG image one molecule per one cell.

I would like to draw molecules in one cells like Draw.MolsToGridImage.

So I traced the blog post and try to render molecules in one cell. I found good solution to do it, the strategy is that get svg text of molecules and embed them in to the div tag and then draw it as HTML.

OK let’s try it. Following code is almost same but example molecules are different. Defined HorizontalDisplay class for displaying multiple molecules at later.

import os
from rdkit import Chem
from rdkit.Chem import rdBase
from rdkit.Chem import RDConfig
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import rdMolDraw2D
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import rdDepictor
import rdkit
from IPython.display import SVG

from rdkit.Chem import ChemicalFeatures
fdef = os.path.join(RDConfig.RDContribDir, 'M_Kossner/BaseFeatures_DIP2_NoMicrospecies.fdef')
ffact = ChemicalFeatures.BuildFeatureFactory(fdef)
rdDepictor.SetPreferCoordGen(True)

# the code is bollowed from following url.
# https://bit.ly/3aTUBCU
class HorizontalDisplay:
    def __init__(self, *args):
        self.args = args

    def _repr_html_(self):
        template = '<div data-carousel-extra='{"blog_id":39198697,"permalink":"https:\/\/iwatobipen.wordpress.com\/2020\/05\/01\/draw-molecules-as-svg-in-horizontal-layout-drawing-rdkit-memo\/"}' style="float:left;padding:10px;">{0}</div>'
        return "\n".join(template.format(arg)
                         for arg in self.args)

Then draw molecules one by one and get SVGdrawing text of each drawing setting. First two codes are draw molecule with / without atom indices. And 3rd, 4th codes are draw molecules with atom features.

d2d = rdMolDraw2D.MolDraw2DSVG(250, 200)
d2d.DrawMolecule(mol1)
d2d.FinishDrawing()
text1 = d2d.GetDrawingText()

d2d = rdMolDraw2D.MolDraw2DSVG(250, 200)
d2d.drawOptions().addAtomIndices = True
d2d.DrawMolecule(mol1)
d2d.FinishDrawing()
text2 = d2d.GetDrawingText()

from collections import defaultdict
feats = ffact.GetFeaturesForMol(mol1)
colors = {'SingleAtomDonor':(1,0.5,0.5),
          'SingleAtomAcceptor':(0.5,0.5,1),
          'BasicGroup':(0,1,1),
          'Arom6':(1,0.8,0.5),
          'Hphobe':(0.7,0.7,0.3)}

atomHighlighs = defaultdict(list)
highlightRads = {}
for feat in feats:
    if feat.GetType() in colors:
        clr = colors[feat.GetType()]
        for aid in feat.GetAtomIds():
            atomHighlighs[aid].append(clr)
            highlightRads[aid] = 0.4

d2d = rdMolDraw2D.MolDraw2DSVG(250, 200)
d2d.DrawMoleculeWithHighlights(mol1, '', dict(atomHighlighs), {}, highlightRads, {})
d2d.drawOptions().addAtomIndices = True
d2d.FinishDrawing()
text3 = d2d.GetDrawingText()

d2d = rdMolDraw2D.MolDraw2DSVG(250, 200)
dos = d2d.drawOptions()
dos.atomHighlightsAreCricle = True
dos.fillHighlights = False
d2d.DrawMoleculeWithHighlights(mol1, '', dict(atomHighlighs), {}, highlightRads, {})
d2d.drawOptions().addAtomIndices = True
d2d.FinishDrawing()
text4 = d2d.GetDrawingText()

Now I had 4 svg texts. Finally call HorizontalDisplay.

HorizontalDisplay(text1, text2, text3, text4)

Now I could get multiple molecules in one cell ;).

It can’t control the number of the molecules in one row, so it seems there is room for improvement. Any comments and suggestions will be greatly appreciated.

I uploaded the code my github repo and gist. And hope reader will have a nice weekend. (gist couldn’t draw molecules per row)

https://github.com/iwatobipen/playground/blob/master/Horizontal_display_SVGmols.ipynb

https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/Horizontal_display_SVGmols.ipynb

Rendering molecular orbital on Jupyter notebook #psikit #py3dmol #rdkit #memo

@fmkz___ and I( @iwatobipen ) are developing psikit which is a thin wrapper of psi4 and rdkit. I hope the package integrates quantum chemistry (Psi4) and chemoinformatics (RDKit).

By using psikit, user can make molecular orbital data very convinienlry.

Rendering MO is useful for understanding molecular electrostatic shape and nature, but sometime it is difficult to medicinal chemist because it requires CADD skills for rendering it.

Psikit has useful method named ‘create_cube_files’, the method make cube files of HOMO, LUMO and ESP. And also psikit can call pymol to rendering them. However it can’t render MO directory on jupyter notebook so I would like to show you how to render MO on jupyter notebook today.

Fortunately, py3Dmol is easy tool for rendering CUBE file!

Sample code is below. I used psikit for quantum calculation and used py3dmol for rendering. First, calculate energy and create cube files. Default gridspace value is set 0.3 but I recommend it should be changed more large r value. Because cube file with small gridspace value will become huge size file.

import py3Dmol
from psikit import Psikit
from rdkit import Chem
pk = Psikit()
pk.read_from_smiles('O')
pk.energy()
pk.create_cube_files(gridspace=0.5)

After running the method, cube files are generated. So read the file.

homo_voldata = open("Psi_a_5_1-A\"_HOMO.cube", "r").read()
lumo_voldata = open("Psi_a_6_5-A\'_LUMO.cube", "r").read()

Now ready to render. Make view object and attach some dataset. addVolumetricData method loads cube data with options. Then call addModel with molblock data. RDKit can convert molobject to molblock by using MolToMolBlock method.

v = py3Dmol.view()
v.addVolumetricData(homo_voldata, "cube", {'isoval': -0.03, 'color': "red", 'opacity': 0.75})
v.addVolumetricData(homo_voldata, "cube", {'isoval': 0.03, 'color': "blue", 'opacity': 0.75})
v.addModel(Chem.MolToMolBlock(pk.mol), 'mol')
v.setStyle({'stick':{}})
v.zoomTo()
v.show()

Works fine. The molecule object is rotatable. LUMO can be rendered with same manner.

In summary, psi4, rdkit and py3dmol integration can render molecule with MO at the touch of a button. I uploaded today’s code on my repository.

https://github.com/iwatobipen/playground/blob/master/mo/RenderMO.ipynb

Embed interactive plot in jupyter notebook with panel #chemoinformatics #RDKit #memo #panel

As you know Jupyter notebook is very useful tool for data scientist. It can analyze scientific data with nice view. And there are lots of packages for data visualization. And I often use matplotlib and seaborn for my task. However few days ago, I found an interesting package named Panel which is high level app and dashbording app for python. I posted another package dash before but I’ve never used panel. So I tried to use panel with rdkit.

Panel can install via conda from pyviz channel or pip. I installed panel by using conda.

After installed panel I tested it.

At first I tried to visualize rdkit mol object. Import packages and molecules.

import numpy as np
import pandas as pd
import os
import panel as pn
from rdkit import Chem
from rdkit import RDPaths
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit.Chem import AllChem
import matplotlib.pyplot as plt
plt.style.use('ggplot')

sdf = os.path.join(RDPaths.RDDocsDir, 'Book/data/cdk2.sdf')
mols = [m for m in Chem.SDMolSupplier(sdf)]
for m in mols:
    AllChem.Compute2DCoords(m)

Then made IntSlider to set index of molecule which would like to render. Following example uses depends decorator for making interactive view.

pn.extension()
slider = pn.widgets.IntSlider(start=0, end=len(mols), step=1, value=0)
@pn.depends(slider.param.value)
def callback(value):
    return Draw.MolToImage(mols[value])
row = pn.Column(slider, callback)
row

Now molecule was rendered with slider widget.

It seems work well, next I tried to add range slider to draw molecules. Code is almost same as above.

pn.extension()
rangeslider = pn.widgets.IntRangeSlider(start=0, end=len(mols), step=1)
@pn.depends(rangeslider.param.value)
def callback(value):
    return Draw.MolsToGridImage(mols[value[0]: value[1]], molsPerRow=5)
pn.Column(rangeslider, callback)

IntRangeSlider returns tuple of user defined value. So I could select range which would like to render on the notebook.

Next example, I tried to make interactive scatter plot of molecular properties.

To do it, I calculated molecular descriptors with rdkit Descriptors class.

from rdkit.Chem import Descriptors
from collections import defaultdict
dlist = Descriptors._descList
desc_dec = defaultdict(list)
for mol in mols:
    for k, f  in dlist:
        desc_dec[k].append(f(mol))
df = pd.DataFrame(desc_dec)

Following example used Select widget to select x and y axis and FloatSlider for setting alpha of scatter plot.

from matplotlib.figure import Figure, FigureCanvasBase
columns = df.columns.to_list()
x = pn.widgets.Select(options=columns, name='x_', value='MolWt')
y = pn.widgets.Select(options=columns, name='y_', value='qed')
alpha = pn.widgets.FloatSlider(name='alpha', value=0.5)

@pn.depends(x.param.value, y.param.value, alpha.param.value)
def get_plot(x_v, y_v, a):
    with plt.xkcd():
        fig = Figure()
        ax = fig.subplots()
        FigureCanvasBase(fig)
        ax.set_xlabel(x_v)
        ax.set_ylabel(y_v)
        ax.scatter(df[x_v], df[y_v], c='blue', alpha = a)
        #fig = df.plot.scatter(x_v, x_v)
        return fig
pn.Column(pn.Row(x, y, alpha), get_plot)

I updated matplotlib version to 3.1.3. From version 3.1.2 matplotlib can make plot with xkcd taste. ;) It’s easy to do it just make plot in with plt.xkcd() line.

Now the scatter plot can interactively select axis and alpha params. I would like to know how to get index of each point and get the value. If I can it I could render molecule when I mouse over the point.

I uploaded today’s code my repo.

https://github.com/iwatobipen/playground/blob/master/Panel_interactive_plot.ipynb

Interactive plot can’t see from githubrepo or nbvier so if reader has interest the package I recommend to try it your own PC. ;)

Draw RDKit mol/reaction object on HTML without static png image #RDKit #memo

I think this post might not useful for many people. Just for my memorandum.

When I make web app, I use draw SVG function for rdkit object rendering. But from last my post and Greg’s gist, I could learn draw png image without writing static png files. So I wonder that can I draw PNG image on HTML without png file.

Fortunately I could find the solution!

To do that, at first draw rdkit object and convert it png byte text.
Then encode base64 and decode UTF8, that’s all!

Generated text can render in HTML with <img src=’data:image/png;base64> tag.

Today’s code is below. ;)

Model interporation with new drawing code of RDKit #RDKit #Machine learning #chemoinformatics

Following code does not use new drawing code but it revised one of my old post. :)

I think, everyone who visits my blog has already read Gregs nice blog post about the drawing similarity map with new code. If you don’t read it I recommend to read it soon. URL is below. ;)
http://rdkit.blogspot.com/2020/01/similarity-maps-with-new-drawing-code.html

Now I like the drawing code because it can draw only atom property of own molecule but also compare molecular similarity and fingerprint contribution of predictive model.

Interpolation of the machine learning model is very important because it is use for next design of the new molecule.

So let’s write code with new drawing code. Following code is almost same as RDKit cook book and Greg’s blog post. Import some packages and check the version of rdkit. I used the newest version.

%matplotlib inline
import matplotlib.pyplot as plt
import os
from functools import partial
from rdkit import Chem
from rdkit import rdBase
from rdkit import RDPaths
from rdkit.Chem import Draw
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem.Draw import SimilarityMaps, IPythonConsole
from IPython.display import SVG
import io
from PIL import Image
import rdkit
import numpy as np
from sklearn.ensemble import RandomForestClassifier
print(rdkit.__version__)
>2019.09.2

Load solubility data which is provided from rdkit data folder for example.

trainpath = os.path.join(RDPaths.RDDocsDir,'Book/data/solubility.train.sdf')
testpath =  os.path.join(RDPaths.RDDocsDir,'Book/data/solubility.test.sdf')
train_mols = [m for m in Chem.SDMolSupplier(trainpath) if m is not None]
test_mols = [m for m in Chem.SDMolSupplier(testpath) if m is not None]
val_dict = {'(A) low':0,
           '(B) medium':1,
           '(C) high':2}

Then define some helper function.

def mol2arr(mol, fpfunc):
    fp = fpfunc(mol)
    arr = np.zeros((0,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr
# https://rdkit.blogspot.com/2020/01/similarity-maps-with-new-drawing-code.html
def show_png(data):
    bio = io.BytesIO(data)
    img = Image.open(bio)
    return img
fpfunc = partial(SimilarityMaps.GetMorganFingerprint, nBits=1024, radius=2)

Then calculate fingerprint and get solubility class. And make randomforest classifier object and fit it. It’s very common way for sunday chemoinformatician. ;)

SOL_classification
trainX = [fpfunc(m) for m in train_mols]
trainY = [val_dict[m.GetProp('SOL_classification')] for m in train_mols]
​
testX = [fpfunc(m) for m in test_mols]
testY = [val_dict[m.GetProp('SOL_classification')] for m in test_mols]
rfc = RandomForestClassifier(random_state=794)
rfc.fit(trainX, trainY)

Next define getProba function the code is borrowed from rdkit cookbook. One difference to cookbook is I used third probability because I would like to visualize effect of the each fingerprint to the high solubility. And defined draw function with some optional arguments. ‘alpha’ is useful because it can control the blending value for the contour lines. It is key for beautiful visualization.

# https://www.rdkit.org/docs/Cookbook.html
# Following example I would like to get probability of High solubility
def getProba(fp, predctionFunction):
    return predctionFunction((fp,))[0][2]
def drawmol(mols, idx):
    d = Draw.MolDraw2DCairo(1,1)
    _, maxWeight = SimilarityMaps.GetSimilarityMapForModel(mols[idx],fpfunc, 
                                                           lambda x: getProba(x, rfc.predict_proba),
                                                           colorMap='coolwarm',
                                                          size=(200,200),
                                                          step=0.001,
                                                          alpha=0.2)
    d.FinishDrawing()
    show_png(d.GetDrawingText())
    print(mols[idx].GetProp('SOL_classification'))

Now almost there. Let’s visualize High and Low solubility molecules with the drawing code!

high_test_mols = [mol for mol in test_mols if mol.GetProp('SOL_classification') == '(C) high']
low_test_mols = [mol for mol in test_mols if mol.GetProp('SOL_classification') == '(A) low']
drawmol(high_test_mols, 0)

Hmm good, the model detects hydroxyl group is positive effect to solubility. How about second example?

drawmol(high_test_mols, 10)

OK, carbonyl group detected as positive group for solubility.

drawmol(low_test_mols, 5)

In this case it is difficult because all atoms are lipophilic. So remove any atoms are positive effect to solubility in my understanding.

The code investigates the effect of the each atom by removing one by one. And calculate difference of probability of prob molecule and atom remove molecule.

So, the magnitude of color shows relative effect of the each atom not absolute effect. This is reason why low solubility molecule shows many positive (red) colors in this example.

Any way, model interporation with 2D color mapping on the molecule is quite nice tool for QSAR analysis.

I would like to write more example codes.

Today’s code can be found on my gist and github.

https://github.com/iwatobipen/playground/blob/master/modelviz.ipynb

Any comments suggestions and requests will be greatly appreciated. ;)

Thanks,

Visualise dataset using seaborn.

I often use scatter plot to analyse relationship of 2 valuables.
There are a lots of tools for visualise dataset.
Now, I challenged to use seaborn.
Seaborn is python library based on matplotlib. And easy to make cool visualisation.

Following snippet, I read sample data from rdkit install folder and plot MolWt vs LogP using jointplot.

Jointplot can show not only relation ship but also distribution of dataset.
If you want to draw it, you need only type ‘sns.jointplot’! ;-)
Also this library is compatible with Pandas.
Seaborn seems to young library (most resent version is ver 0.6.) but very useful and cool library I think.
Here is sample. This is jointplot using hexbinplot.
Screen Shot 2015-08-24 at 10.47.04 PM

I pushed snippet to git-hub.
https://github.com/iwatobipen/mishimasyk/blob/master/sample_seaborn.ipynb

#this is not .py format
#code from ipynb
#using pybal inline mode

get_ipython().magic(u'matplotlib inline')
from rdkit import Chem
from rdkit.Chem import PandasTools
from rdkit.Chem import Descriptors
from rdkit import rdBase
from rdkit import RDConfig
import seaborn as sns
import os
datapath = os.path.join(RDConfig.RDDocsDir,"Book/data/cdk2.sdf")
moldf = PandasTools.LoadSDF(datapath)
def molwt(mol):
    mw = Descriptors.MolWt(mol)
    return mw
def logp(mol):
    logp = Descriptors.MolLogP(mol)
    return logp
moldf['MW'] = moldf.ROMol.apply(molwt)
moldf['LOGP'] = moldf.ROMol.apply(logp)
moldf.head(5)
from scipy.stats import kendalltau
sns.set(style='ticks')
x = moldf.MW
y = moldf.LOGP
sns.jointplot(x,y, kind='hex', stat_func=kendalltau)
sns.jointplot(x,y, kind='kde', stat_func=kendalltau)
sns.jointplot(x,y, kind='resid', stat_func=kendalltau)