3D Alignment function of RDKit #RDKit

During the UGM, I was interested in Ben Tehan & Rob Smith’s great work.
They showed me a nice example of molecular alignment with RDKit.
RDKit has several function to perform 3D alignment. In the Drug Discovery 3D alignment of ligands is important not only Comp Chem but also Med Chem. After their presentation, I talked them and they told me that GetCrippenO3A is useful for 3D alignment.
Hmm, that’s sounds interesting.
I tried to use the function.
My example code is below. Following code run on ipython notebook. To visualize 3D structure of molecules, I used py3Dmol. It can visualize multiple 3D molecules and easy to use!

import py3Dmol
import copy
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import AllChem
from rdkit.Chem import rdBase
from rdkit.Chem import rdMolAlign
from rdkit.Chem import rdMolDescriptors
import numpy as np
p = AllChem.ETKDGv2()
p.verbose = True

If p.verbose set True, user can get SMARTS patterns which hit the definition of ETKDG like below.

[cH0:1]c:2!@;-[O:3][C:4]: 8 7 6 5, (100, 0, 0, 0, 0, 0)
[C:1][CX4H2:2]!@;-[OX2:3][c:4]: 3 5 6 7, (0, 0, 4, 0, 0, 0)
[O:1][CX4:2]!@;-[CX3:3]=[O:4]: 6 5 3 4, (0, 3, 0, 0, 0, 0)
[C:1][CX4:2]!@;-[CX3:3]=[O:4]: 0 1 3 4, (0, 0, 1, 0, 0, 0)
[cH0:1]c:2!@;-[O:3][C:4]: 3 5 10 11, (100, 0, 0, 0, 0, 0)
[C:1][CX4H2:2]!@;-[OX2:3][c:4]: 12 11 10 5, (0, 0, 4, 0, 0, 0)
[OX2:1][CX4:2]!@;-[CX4:3][OX2:4]: 10 11 12 16, (0, 0, 3, 0, 0, 0)
[cH0:1]c:2!@;-[O:3][C:4]: 3 5 10 11, (100, 0, 0, 0, 0, 0)
[C:1][CX4H2:2]!@;-[OX2:3][c:4]: 12 11 10 5, (0, 0, 4, 0, 0, 0)
[OX2:1][CX4:2]!@;-[CX4:3][N:4]: 10 11 12 17, (0, 0, 4, 0, 0, 0)
[cH0:1]c:2!@;-[O:3][C:4]: 3 5 10 11, (100, 0, 0, 0, 0, 0)

Next load molecules and generate conformers. I used cdk2.sdf which is provided in rdkit as sample.

mols = [m for m in Chem.SDMolSupplier('cdk2.sdf') if m != None][:6]
for mol in mols:
hmols_1 = [Chem.AddHs(m) for m in mols]
hmols_2 = copy.deepcopy(hmols_1)
# Generate 100 conformers per each molecule
for mol in hmols_1:
    AllChem.EmbedMultipleConfs(mol, 100, p)

for mol in hmols_2:
    AllChem.EmbedMultipleConfs(mol, 100, p)
# for Ipython notebook

To conduct GetCrippenO3A and GetO3A, I calculate crippen_contrib of each atom and MMFF params of molecules.

crippen_contribs = [rdMolDescriptors._CalcCrippenContribs(mol) for mol in hmols_1]
crippen_ref_contrib = crippen_contribs[0]
crippen_prob_contribs = crippen_contribs[1:]
ref_mol1 = hmols_1[0]
prob_mols_1 = hmols_1[1:]

mmff_params = [AllChem.MMFFGetMoleculeProperties(mol) for mol in hmols_2]
mmff_ref_param = mmff_params[0]
mmff_prob_params = mmff_params[1:]
ref_mol2 = hmols_2[0]
prob_mols_2 = hmols_2[1:]

OK Let’s align molecules and visualize them!
I retrieved the best score index from multi conformers of each molecule and added viewer.
For crippenO3A…

p_crippen = py3Dmol.view(width=600, height=400)
p_crippen.addModel(Chem.MolToMolBlock(ref_mol1), 'sdf')
crippen_score = []
for idx, mol in enumerate(prob_mols_1):
    tempscore = []
    for cid in range(100):
        crippenO3A = rdMolAlign.GetCrippenO3A(mol, ref_mol1, crippen_prob_contribs[idx], crippen_ref_contrib, cid, 0)
    best = np.argmax(tempscore)
    p_crippen.addModel(Chem.MolToMolBlock(mol, confId=int(best)), 'sdf')

For O3A…

p_O3A = py3Dmol.view(width=600, height=400)
p_O3A.addModel(Chem.MolToMolBlock(ref_mol2), 'sdf')
pyO3A_score = []
for idx, mol in enumerate(prob_mols_2):
    tempscore = []
    for cid in range(100):
        pyO3A = rdMolAlign.GetO3A(mol, ref_mol2, mmff_prob_params[idx], mmff_ref_param, cid, 0)
    best = np.argmax(tempscore)
    p_O3A.addModel(Chem.MolToMolBlock(mol, confId=int(best)), 'sdf')

In my example, both methods shows good results. To check the details, I will calculate ShapeTanimoto and/or RMSD etc.

In summary, rdkit has many useful functions not only 2D but also 3D. I would like to use this function in my project.

All code of the post, I uploaded my github repo.


Enjoyed RDKitUGM2018 #RDKit

I got back Japan from Cambridge today. Time flies when you’re having fun.
This is the first time I participate RDKit UGM and RDKit Hackathon. It was amazing experience for me.
Twitter Hash Tag was very attractive. If reader who is interested in, I recommend to search #RDKitUGM2018 in twitter.
I could talk face to face many people who I’m following twitter. Thank you for talking with me. ;-)
There were many exiting topics in the meeting. Especially I enjoyed Dr. Segler’s talk about computer aided synthesis planning. It was nice work! His system can analyze retrosynthetic route very efficiently.

And Gregs’s presentation gave me very important message.
Pros of open source is good community and contribution of users. I agree his opinion. I would like to have ab internal discussion about it.

I really surprised that I could meet my blog reader and could get many positive comment about the blog.
I was so happy that I cried….

This blog is memo for myself, but it make me happy that my blog be someone’s help.

I really thank the meeting organizer and participants.
I hope I can meet everyone next year

what is probabilistic programming?

I did not know what PPL is.
Recently I knew probabilistic programing and found nice article in arxiv.
A deep probabilistic programming language is a language for specifying both deep NN and probabilistic models.
Probabilistic programming creates systems that help make decisions in the face of uncertainty.

In this article, author describes pros and cons of Deep learning and probabilistic programming.
Advantages from probabilistic programming(PP) is that PP can give not prediction but also a probability. It means PP give probabilistic models to my understanding.
On the other hand, advantages from Deep Learning is feature extraction from large dataset. DL gives high accuracy.
Stan, PyMC is major package for PPL. If you have used tensorflow for DL, I recommend consider edward as one option.
Edward does not support recent version of tensorflow. For new version of tensorflow, tensorflow-probability is recommended.

Today, I used edward for PP.
Following example is very simple case.

At first linear regression.
y = wx + b
Define data builder and define variable.
For PP, w and b is stochastic variable, define each variable with Normal().

%matplotlib inline
import edward as ed
import numpy as np
import tensorflow as tf
from edward.models import Normal
from matplotlib import pyplot as plt
def buildtodayta(N, w):
    D = len(w)
    x = np.random.normal(0., 2., size=(N,D))
    y = np.dot(x, w) + np.random.normal(0., 0.01, size=N)
    return x, y
N = 40
D = 10
w_true = np.random.randn(D)*0.4
xt, yt = buildtodayta(N, w_true)
yt.shape, xt.shape

X = tf.placeholder(tf.float32, [N,D])
w = Normal(loc=tf.zeros(D), scale=tf.ones(D))
b = Normal(loc=tf.zeros(1), scale=tf.ones(1))
y = Normal(loc=ed.dot(X, w)+b, scale=tf.ones(N))

Then define theta. Theta is set of latent variables. In the following code, qw and qb is theta.

qw = Normal(loc=tf.get_variable("qw/loc", [D]), scale=tf.nn.softplus(tf.get_variable("qw/scale", [D])))
qb = Normal(loc=tf.get_variable("qb/loc", [1]), scale=tf.nn.softplus(tf.get_variable("qb/scale", [1])))

Finally, to find optimal theta conduct minimization of Kullback-Leibler divergence of q(x|z) and p(x|z).

inference = ed.KLqp({w:qw, b:qb}, data={X:xt, y:yt})
inference.run(n_samples=5, n_iter=100)
y_post = ed.copy(y, {w: qw, b: qb})
def visualise(X_data, y_data, w, b, n_samples=10):
  w_samples = w.sample(n_samples)[:, 0].eval()
  b_samples = b.sample(n_samples).eval()
  plt.scatter(X_data[:, 0], y_data)
  plt.ylim([-10, 10])
  inputs = np.linspace(-8, 8, num=400)
  for ns in range(n_samples):
    output = inputs * w_samples[ns] + b_samples[ns]
    plt.plot(inputs, output)

After learning, I can sampling from qw and qb.
Let’s compare before and after.

Before learning, w and b is not optimized, regression lines are unreasonable.

visualise(xt, yt, w, b, n_samples=3)

After learning the regression lines catch the trend of the dataset.

visualise(xt, yt, qw, qb, n_samples=3)

Next, try to cosine curve.

%matplotlib inline
import edward as ed
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from edward.models import Normal
def build_toy_dataset(N=50, noise_std=0.1):
    x = np.linspace(-14,14, num=N)
    y = np.cos(x) + np.random.normal(0, noise_std, size=N)
    x = x.astype(np.float32).reshape((N, 1))
    y = y.astype(np.float32)
    return x, y
plt.scatter(build_toy_dataset()[0], build_toy_dataset()[1], c="b")

Next, define DNN (2layers not so deeeeeeeeep ;-)).

def neural_network(x, W_0, W_1, b_0, b_1):
    h = tf.tanh(tf.matmul(x, W_0) + b_0)
    h = tf.matmul(h, W_1) + b_1
    return tf.reshape(h, [-1])
N = 50
D = 1
x_train, y_train = build_toy_dataset(N)
W_0 = Normal(loc=tf.zeros([D,30]), scale=tf.ones([D,30]))
W_1 = Normal(loc=tf.zeros([30,1]), scale=tf.ones([30,1]))
b_0 = Normal(loc=tf.zeros(30), scale=tf.ones(30))
b_1 = Normal(loc=tf.zeros(1), scale=tf.ones(1))
x = x_train
y = Normal(loc = neural_network(x, W_0, W_1, b_0, b_1), scale=tf.ones(N) * 0.1)
qW_0 = Normal(loc=tf.get_variable("qW_0/loc", [D,30]), scale=tf.nn.softplus(tf.get_variable("qW_0/scale", [D,30])))
qW_1 = Normal(loc=tf.get_variable("qW_1/loc", [30,1]), scale=tf.nn.softplus(tf.get_variable("qW_1/scale", [30,1])))
qb_0 = Normal(loc=tf.get_variable("qb_0/loc", [30]), scale=tf.nn.softplus(tf.get_variable("qb_0/scale", [30])))
qb_1 = Normal(loc=tf.get_variable("qb_1/loc", [1]), scale=tf.nn.softplus(tf.get_variable("qb_1/scale", [1])))

rs = np.random.RandomState(0)
inputs = np.linspace(-9, 9, num=400, dtype=np.float32)
x = tf.expand_dims(inputs, 1)
mus = tf.stack([neural_network(x, qW_0.sample(), qW_1.sample(), qb_0.sample(), qb_1.sample()) for _ in range(10)])
sess = ed.get_session()
outputs = mus.eval()

fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(111)
ax.set_title('iteration: 0')
ax.plot(x_train, y_train, 'ks', alpha=0.5, label=('x', 'y'))
ax.plot(inputs, outputs[0].T, 'b', lw=2, alpha=0.5, label='prior draws')
ax.plot(inputs, outputs[1:].T, 'y', lw=2, alpha=0.5)
ax.set_xlim([-10, 10])

inference = ed.KLqp({W_0: qW_0, b_0:qb_0, W_1:qW_1, b_1:qb_1}, data={y: y_train})
inference.run(n_iter=2000, n_samples=5)
outputs = mus.eval()
fig = plt.figure(figsize=(10,6))
ax = fig.add_subplot(111)
ax.set_title('iteration: 2000')
ax.plot(x_train, y_train, 'ks', alpha=0.5, label=('x', 'y'))
ax.plot(inputs, outputs[0].T, 'b', alpha=0.5, lw=2, label='post draws')
ax.plot(inputs, outputs[1:].T, 'y', alpha=0.5, lw=2, label='post draws')
ax.set_xlim([-14, 14])