Coloring molecule with RESP charge on pymol #Pymol

To visualize 3D structure of molecule, PyMol is nice tool. What I would like to write on the post is how to visualize calculated RESP charge on pymol ;)
One idea is embed calculated RESP charge to b_factor of molecule pdb. PDB file can make easily from rdkit mol object.
A problem for me is how to edit b_factor of PDB file. I found good tool to solve it today, python package ‘BioPandas’! It can install from pypi or anaconda.
By using biopandas, user can handle pdb file like pandas dataframe.

At first I would like to show examples of biopandas. If you often working with PDB you feel it is very useful I think.
Following code is an example for loading PDB and retrieve protein and ligand data as pandas dataframe.

from biopandas.pdb import PandasPdb
pbdobj = PandasPdb('1atp.pdb')
#Check Atom data
#Check ligand data

Now I can access pdb file as pandas data frame, it means that it is easy to edit and analyze pdb data! More examples are described in official site. So I move to next.

Following example is …
Calculate RESP charge with Psikit and then make pdf file from rdkit mol object and finally replace the b_factor to calculated RESP charge.
At frist load packages.

import os
import sys
from biopandas.pdb import PandasPdb
import pandas as pd
sys.path.append('path for /psikit')
from psikit import Psikit

Then define the function which make pdb file.

def make_pdb_with_Resp(smi, fname=None):
    pk = Psikit()
    charge = pk.resp_charge
    RESP = charge['Restrained Electrostatic Potential Charges']
    if fname == None:
        fname = 'mol'
        fname = fname
    Chem.MolToPDBFile(pk.mol, '{}.pdb'.format(fname))
    pdbobj = PandasPdb().read_pdb('{}.pdb'.format(fname))
   # Following step is editing state, you can see the approach like a pandas method ;)
    pdbobj.df['HETATM']['b_factor'] = RESP

Run the function! After running the code, I could get two files named aceticacid.pdb and tetrazole.pdb

It can load from pymol.


Now I would like to color the molecule with edited b_factor. So I use spectrum command on pymol command line.

spectrum b, blue_red, minimum=-1., maximum=1.
Acetic acid with RESP charge

Acetic acid has negative charge on two oxygen atoms. And carbonyl carbon has the most positive charge.

Tetrazole with RESP charge

The result of tetrazole is strange for me because the nitrogen at position 2 which has hydrogen shows positive. And the nitrogen at position 1 is the most negative. I investigated two protonated state of the tetrazole, 1H-tetrazole and 2H-tetrazole.
But both results shows the nitrogen atom which has hydrogen has positive charge than other nitrogen atoms. I found that BioPandas is useful. But I could not get solution for the last problem. Hmm…

Draw HOMO LUMO with psikit-2 #chemoinformatics

Yesterday I wrote a post about psikit function for HOMO LUMO drawing. And @fmkz__ gave me very suggestive comment.

He performed QM calculation about tetrazole and disclosed the its distribution of negative charge.

I had interested in his suggestion and I tried it. By using psikit, I calculated energy of acetic acid and 5-methyl-1H-tetrazole and got HOMO LUMO cube files from the results.

Results is below. At first HOMO and LUMO of acetic acid.

HOMO of acetic acid
LUMO of acetic acid

These results shows that carbonyl C=O has large MO and it means carbonyl is most reactive position of the acetic acid. And also LUMO around the alpha carbon is large. It indicates that the site is reactive with electrophile I think.

Next, let’s see result of tetrazole. Tetrazole is often used in medicinal chemistry as a bioisoster of carboxylic acid.

HOMO of tetrazole
LUMO of tetrazole

These results shows 3-N position has large MO compared to other nitrogen atoms. I think it indicates that 3-N position is more reactive than other nitrogen. And I found good example to explain the result in slide share. URL is below.
Page 21 of the slide shows example of 5-sub tetrazoles alkylation.
The page shows example of reaction between 5-Me-1H-tetrazole and trityl-Cl. And most of alkylation occurred at n-3 position.

Borrowed from slide share

It is very exciting for me because I could know that QM is very useful tool for reactivity prediction. The result motivates me to enhance of psikit. ;-)

Any suggestion and advices are highly appreciated.

Draw HOMO LUMO with psikit #RDKit #Psi4 #PyMol

Now I and @fmkz___ are developing a thin wrapper library for Psi4 and RDKit named psikit.
Today I added new function for viewing molecular orbital HOMO and LUMO with pymol. Psi4 has the function which can output cube format file of MO named ‘cubeprop’. So I try to implement the function in to psikit.
By using the library, you can get HOMO/LUMO cube file easily.
Example is below.

from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole
from psikit import Psikit
smi = 'COc1cccnc1'
mol = Chem.MolFromSmiles(smi)

Just call ‘getMOview’ after optimization. It took a little bit long time on my PC.

pk = Psikit()
%time pk.optimize()
>Optimizer: Optimization complete!
>CPU times: user 1min 54s, sys: 3.97 s, total: 1min 58s
>Wall time: 31.5 s
# get MO view!

After executing the method, psikit makes 5 files named target.mol, Psi_a_n_n-A.cube, Psi_b_n_n-A.cube, Psi_a_n+1_n+1-A.cube, Psi_b_n+1_n+1-A.cube. First file is mol file. Second and third is information of its HOMO and forth, fifth is information of LUMO.

Then load all files from pymol. Change HOMO/LUMO object show setting as dot. I could get following image.

Optimized molecule
Molecule and HOMO
Molecule and LUMO
Molecule and HOMO/LUMO

The function and pymol seem to work fine. But I want to compare data between this result and another QM’s result.
Current version of psikit is still under development and some bugs. I would like to fix and tune the code.

Call Knime from Jupyter notebook! #Chemoinformatics #RDKit #Knime

I read exiting blog post yesterday! URL is below.
@dr_greg_landrum developed very cool tools which can call knime from jupyter notebook and can execute jupyter notebool from knime.

Details of the tool is described in the Knime blog post. I am interested the tool and I can’t wait to try it in myself. So I used it from my mac book pro. At first I installed python knime package via pypi.

iwatobipen$ pip install knime

Now ready. I tried to make sample work flow. My workflow receives SMILES strings from jupyter and calculates RDKit descriptors, normalize then and return the result to notebook.

After that, I build regression model for solubility and apply it to test data. Dataset is supplied from rdkit Book/data folder.

OK, let’s go to the code. Following code is referenced Gregs blog post URL is above. First import packages.

%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import style
import os
import pandas as pd
import numpy as np
from rdkit import Chem
from rdkit.Chem import RDConfig
from rdkit.Chem import rdMolDescriptors
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score
import knime

Next, prepare dataset for training and test. Type of data which passes Knime is pandas dataframe.

train_path = os.path.join(RDConfig.RDDocsDir, 'Book/data/solubility.train.sdf')
test_path = os.path.join(RDConfig.RDDocsDir, 'Book/data/solubility.test.sdf')

train_mols = [m for m in Chem.SDMolSupplier(train_path)]
train_y = np.asarray([m.GetProp('SOL') for m in train_mols], dtype=np.float32)
train_table = {'smiles':[Chem.MolToSmiles(m) for m in train_mols]}
train_df = pd.DataFrame(train_table)

test_mols =  [m for m in Chem.SDMolSupplier(test_path)]
test_y = np.asanyarray([m.GetProp('SOL') for m in test_mols], dtype=np.float32)
test_table = {'smiles':[Chem.MolToSmiles(m) for m in test_mols]}
test_df = pd.DataFrame(test_table)

Next, define Knime executable path and workspace path. My env is Mac so it is a little bit different to original blog post.

#My Knime env uploaded to 3.7 from 3.6.
knime.executable_path = '/Applications/KNIME'
workspace = '/Users/iwatobipen/knime-workspace/'

Then check workflow. I made following workflow in advance. And I could see image of the WF on notebook. ;)

workflow = 'jupyter_integration'
knime.Workflow(workflow_path=workflow, workspace_path=workspace)

Now ready, let’s run the WF for descriptor calculation!

# training data
with knime.Workflow(workflow_path=workflow, workspace_path=workspace) as wf:
    wf.data_table_inputs[0] = train_df
train_x = wf.data_table_outputs[0]

# test data
with knime.Workflow(workflow_path=workflow, workspace_path=workspace) as wf:
    wf.data_table_inputs[0] = test_df
test_x = wf.data_table_outputs[0]

I could get dataset for build regression model and test. So I fit SVR of sklearn and test the model performance.

svr = SVR()
svr.gamma = 'auto', train_y)

pred = svr.predict(test_x)
print(r2_score(test_y, pred))
print(mean_squared_error(test_y, pred))

It seems not so bat. Check the performance with matplotlib.

a, b = min(test_y), max(test_y)
data = np.linspace(a,b, num=100)
plt.scatter(pred, test_y, c='b', alpha=0.5)
plt.plot(data, data, c='r')

The model can predict solubility of test molecules with high accuracy. In summary, integration Knime and Jupyter notebook has high potential for chemoinformatics I think. Because jupyter has flexibility and knime is powerful tool for routine work.
Whole code can view from following URL.

Alignment free 3D molecular descriptor #Chemoinformatics

3D based QSAR has big challenges, first, conformation and second is alignment of 3D molecules. Major tool for 3D QSAR is ROCS which is developed by OpenEye. It is commercial for industrial. And there are some open source tool kit to do it such as shape-it and rdkit.
Recently researcher from AZ and University of Sheffield disclosed new approach for alignment free 3D molecular descriptor.
URL is below.
Alignment free approach is worth to solve the problems described above I think.

Their developed descriptor is based on solvent accessible surface. The surface is converted triangulated mesh with TMSmesh then converted Laplace-Beltrami spectrum finally converted local and global geometry descriptors. This descriptor like a PCA of the molecule. It computes eigenvalue of the molecule mesh. I have not understood the theory of the descriptor yet :/ but have interest.

In this article they investigated the performance of the method compared to shape-it with DUD-E as an experimental dataset. And they found that the descriptor is useful for virtual 3D screening and the spectral geometry
descriptors outperform the alignment-free CDK D-Moments, an open source implementation of UFSR.

I would like to try to use it. So I try it.
At first pull the script from their github repo.

iwatobipen $git clone

To use the code, mayavi which is library for 3D visualization is required. I installed it at first. For mac, to install mayavi, gcc>=6.0 is required.

Now ready. Let’s try. Unfortunately mesh generator named ‘TMSmesh’ is provided for Linux. I can not run on OSX. So following example code is using default dataset.

import sys
import numpy as np
from mayavi import mlab
# visualize result on jupyter notebook
from molsg import laplacemesh as lp
# define show mol
def plot_mol(mesh, scalars=np.zeros(1)):
    """Use mayavi to plot a molecule mesh."""
    vertices, faces = mesh
    x, y, z = vertices[:, 0], vertices[:, 1], vertices[:, 2]
    fig = mlab.figure(size=(600, 600), bgcolor=(0, 0, 0))
    mesh = mlab.triangular_mesh(x, y, z, faces, scalars=scalars)
    return fig, mesh
# load molecule and plot
mol = np.load('data/ampc/npy/actives18.pqr0.50.41.2.npy')
eigs = lp.compute_lb_fem(vertices=mol[0], faces=mol[1], k=100)
f,m = plot_mol(mol, scalars=eigs.phi[:, 1])

Then I could get following image on notebook. Mayavi can visualize eigenvalue of molecular surface.

Also I could get different view with different layer.

f,m = plot_mol(mol, scalars=eigs.phi[:, 5])

Then calculate descriptors.

#calculate descriptor
from scipy.spatial import distance

from molsg.laplacemesh import compute_lb_fem
from molsg.localgeometry import WKS_descriptor
import molsg.covariance as cov
evals = 32

mol_names = sorted(os.listdir('data/ampc/npy/'))
mols = (np.load('data/ampc/npy/{}'.format(m)) for m in mol_names)
eigs = (compute_lb_fem(vertices=m[0], faces=m[1], k=100) for m in mols)
wks = (WKS_descriptor(e, evals=evals) for e in eigs)
covs = np.asarray([cov.covariance_descriptor(w).flatten() for w in wks])

# use first molecule as reference
similarity = distance.cdist(covs[:10], covs,
ranking = np.argsort(similarity)
labels = np.asarray([1 if m[0] == 'a' else 0 for m in mol_names])
>(500, 1024)
array([[3.09230117e-05, 2.71652856e-05, 2.37224905e-05, ...,
        1.46243371e-04, 1.51304765e-04, 1.56195547e-04],
       [4.23007865e-05, 3.51307176e-05, 2.88876041e-05, ...,
        8.37931399e-05, 8.68547261e-05, 8.98288537e-05],
       [4.89053334e-05, 4.14805807e-05, 3.47848596e-05, ...,
        8.90841519e-05, 9.20655185e-05, 9.49510127e-05],
       [6.53880098e-05, 5.46076646e-05, 4.49903276e-05, ...,
        8.51860637e-05, 8.78986016e-05, 9.05161684e-05],
       [4.03589638e-05, 3.60930158e-05, 3.19959977e-05, ...,
        8.99413803e-05, 9.29632995e-05, 9.59047178e-05],
       [8.57205191e-05, 7.30392856e-05, 6.15872617e-05, ...,
        4.91097210e-05, 5.07646684e-05, 5.23721508e-05]])

Alignment free 3D descriptor is useful for chemoinformatics. And the descriptor is useful for early stage of drug discovery project I think. Because shape based descriptor is fuzzy to describe feature of ligand and protein interaction. However it will be useful for scaffold hopping.

TMSmesh requires pqr format. It is not familiar for medchem. It can easily get from SDF with openbabel! Command example is below. The option “-m” means split molecules, generate pqr format file for each morelue.

iwatobipen$ babel -isdf yoursdf.sdf -opqr -m

The descripor is very interesting for me but to use the tool I need to make workflow because it required many toolkit such as TMSmesh, openbabel, and rdkit…

I uploaded the code on my repo.

Plot Chemical space with d3js based library #RDKit #Chemoinformatics

Recently I posted making interactive plot on jupyter notebook.
I used altair for doing it. Today, I used d3js and matplotlib based package to make scatter plot.
mlpd3 is another tool for making interactive plot with python.

In the original site, many examples are provided. It seems easy to make any kinds of plot with tooltip. ;) So, I want to try whether the module can SVG image as tooltip.
From original site document mlpd3 supports following version of python. The mpld3 project is compatible with Python 2.6-2.7 and 3.3-3.4. But I want to run my code on python3.6. So I installed mpld3 to py3.6 env.
And current version of mpld3 causes error. I referred SOF and installed bug fixed version from github repo.

python -m pip install --user "git+"

Now ready. Let’s write code! To run the code on jupyter notebook, mpld3.enable_notebook() option is recommended.

%matplotlib inline
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mpld3
from rdkit import Chem
from rdkit.Chem import RDConfig
from rdkit.Chem import rdFingerprintGenerator
from rdkit.Chem import DataStructs
from sklearn.decomposition import PCA
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import rdDepictor
from rdkit.Chem.Draw import rdMolDraw2D
from mpld3 import plugins

Then perform PCA with morgan fingerprint. This is not so difficult. And make moltosvg function for tooltip. I browed the function from rdkit blog post. I used svg strings as tooltip.

def fp2arr(fp):
    arr = np.zeros((1,))
    return arr

# Original code is described in the rdkit blog post.

def moltosvg(mol,molSize=(450,15),kekulize=True):
    mc = Chem.Mol(mol.ToBinary())
    if kekulize:
            mc = Chem.Mol(mol.ToBinary())
    if not mc.GetNumConformers():
    drawer = rdMolDraw2D.MolDraw2DSVG(molSize[0],molSize[1])
    svg = drawer.GetDrawingText()
    return svg.replace('svg:','')

fpgen = rdFingerprintGenerator.GetMorganGenerator(2)
mols = [m for m in Chem.SDMolSupplier(os.path.join(RDConfig.RDDocsDir,'Book/data/cdk2.sdf'))]
for m in mols:
fps = [fpgen.GetFingerprint(m) for m in mols]
X = np.asarray([fp2arr(fp) for fp in fps])

pca = PCA(n_components=3)
res = pca.fit_transform(X)
# make set of SVG
svgs = [moltosvg(m) for m in mols]

Finally make scatter plot. This can do with almost same way to matplotlib. To use mlpd3, user do not need to write Javascript for making tooltip. Of course you can write your own javascript for implementation of more complex features .

fig, ax = plt.subplots()
ax.set_title('Viz chemical space!')
points = ax.scatter(res[:,0], res[:,1])
# This is key point for making tooltip!
tooltip = plugins.PointHTMLTooltip(points, svgs)
plugins.connect(fig, tooltip)

After running the code, I could get interactive plot which has chemical structures as a tooltip. ;)
I would like to make frame work for chemical space visualizer.

Scatter plot

Whole code can view from nb viewer.

Generate possible molecules from a dataset #Chemoinformatics #RDKit

Recently @dr_greg_landrum  shared very cool Knime work flow which can enumerate possible molecules from given dataset. You can get the Work flow from following URL.
This work flow is really useful and elegant. If you have interested in, I recommend to use it!!!
And Patrick Walters disclosed nice code for Free-Wilson analysis on github.

Inspired these nice code, I want to write code for molecular enumeration from given molecules in python. ;)
At frist, import libraries.

import os
import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import rdFMCS
from rdkit.Chem import rdFingerprintGenerator
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem import rdRGroupDecomposition
from rdkit.Chem import rdmolops
from rdkit.Chem import RDConfig
from rdkit.Chem import PandasTools
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem.Scaffolds import MurckoScaffold

from collections import defaultdict
from itertools import product
import igraph
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
import re
IPythonConsole.ipython_useSVG = True

Then define functions. To do retrieve set of similar compounds from the dataset, I used molecular similarity graph. I used python-igraph. To use igraph, I can pick up the largest subgraph from parent graph.
In this post, I used cdk2.sdf as demo data.

def gengraph(mols,fpgen , threshold=0.7):
    fps = [fpgen.GetFingerprint(m) for m in mols]
    num_v = len(mols)
    graph = igraph.Graph()
    for i in range(num_v):
        for j in range(i):
            if DataStructs.TanimotoSimilarity(fps[i], fps[j]) >= threshold:
                graph.add_edge(i, j)
    return graph
fpgen = rdFingerprintGenerator.GetMorganGenerator(2)
mols = [m for m in Chem.SDMolSupplier(os.path.join(RDConfig.RDDocsDir,'Book/data/cdk2.sdf'))]
for mol in mols:
fps = [fpgen.GetFingerprint(m) for m in mols]
graph = gengraph(mols, fpgen, 0.4)
simmols_idx = sorted(list(blks), key=lambda x: len(x), reverse=True)
simmols = [mols[i] for i in simmols_idx[0]]
scaff = [MurckoScaffold.GetScaffoldForMol(m) for m in simmols]

Then, get MCS. I investigated some conditions and I selected mcs2.

mcs1 = rdFMCS.FindMCS(scaff, threshold=0.7)
mcs2 = rdFMCS.FindMCS(scaff, threshold=0.7, 
mcs3 = rdFMCS.FindMCS(scaff, threshold=0.7,

Next, select molecule which has mcs2 as substructure from dataset.
And there is a hacky method, I define getMCSSmiles because following process needs to molecule which can sanitize as core.
The idea came from rdkit cook book.

mols_has_core = []
core = Chem.MolFromSmarts(mcs2.smartsString)
for mol in mols:
    if mol.HasSubstructMatch(core):
def getMCSSmiles(mol, mcs):
    mcsp = Chem.MolFromSmarts(mcs.smartsString)
    match = mol.GetSubstructMatch(mcsp)
    smi = Chem.MolFragmentToSmiles(mol, atomsToUse=match)
    return smi
mcs_smi = getMCSSmiles(mols_has_core[0], mcs2)
core = Chem.MolFromSmiles(mcs_smi)

Now I could get core for RGroup decomposition. So I did RGDecomposition with RDKit function. And make dataset and core for molecule generation.

rgp = rdRGroupDecomposition.RGroupDecompositionParameters()
rgp.removeHydrogensPostMatch = True
rgp.alignment =True
rg = rdRGroupDecomposition.RGroupDecomposition(core, rgp)
for mol in mols_has_core:
dataset = rg.GetRGroupsAsColumns()
core =  Chem.RemoveHs(dataset["Core"][0])

Almost there. To generate possible combination of molecules I need following steps. 1) RGroup decomposition, 2) retrieve information of side chains, 3) Generate new molecules from all possible combination of side chains.
To do that I define two functions named makebond and enumeratemol.

def makebond(target, chain):
    newmol = Chem.RWMol(rdmolops.CombineMols(target, chain))
    atoms = newmol.GetAtoms()
    mapper = defaultdict(list)
    for idx, atm in enumerate(atoms):
        atom_map_num = atm.GetAtomMapNum()
    for idx, a_list in mapper.items():
        if len(a_list) == 2:
            atm1, atm2 = a_list
            rm_atoms = [newmol.GetAtomWithIdx(atm1),newmol.GetAtomWithIdx(atm2)]
            nbr1 = [x.GetOtherAtom(newmol.GetAtomWithIdx(atm1)) for x in newmol.GetAtomWithIdx(atm1).GetBonds()][0]
            nbr2 = [x.GetOtherAtom(newmol.GetAtomWithIdx(atm2)) for x in newmol.GetAtomWithIdx(atm2).GetBonds()][0]
    newmol.AddBond(nbr1.GetIdx(), nbr2.GetIdx(), order=Chem.rdchem.BondType.SINGLE)
    newmol = newmol.GetMol()
    return newmol

def enumeratemol(core,rg, maxmol=10000):
    dataset = rg.GetRGroupsAsColumns()
    labels = list(dataset.keys())
    pat = re.compile("R\d+")
    labels = [label for label in labels if pat.match(label)]
    rgs = np.asarray([dataset[label] for label in labels])
    i, j = rgs.shape
    combs = [k for k in product(range(j), repeat=i)]
    res = []
    for i in combs:
        mol = core
        for idx,j in enumerate(i):
            mol = makebond(mol, rgs[idx][j])
        mol = Chem.RemoveHs(mol)
    return res

Now ready! Let’s go generation step. And check structures.

res = enumeratemol(core,rg)

It seems work fine. At last, check number of generated molecules.

print(14**3, len(res))
> 2744 2744

Yah! The code seems work fine. It is not so efficient way because if data has many RGroups, huge number of molecules are generated.
So I think combination of predictive model or / and molecular filter is useful for undesired molecule cutting.
I could learn useful function of RDKit today. Really useful toolkit for chemoinformatics!
Whole code of the post can view from following URL.

Any suggestion and advices are appreciated. ;)