Rotate molecule and visualize it #RDKit #Chemoinformatics

Some days ago, I found an interesting question in rdkit mailing list. The question is how to rotate molecule around an axis. I do not have any idea to do it. But RDKit has a function to do it. I read API and try to do it.

rdMolTransforms.TransformConformer function can rotate molecue with transform matrix. So, I think I can do it if rotate matrix is defined.

Rotation matrix is a simple. Like below.

def rot_ar_x(radi):
    return  np.array([[1, 0, 0, 0],
                      [0, np.cos(radi), -np.sin(radi), 0],
                      [0, np.sin(radi), np.cos(radi), 0],
                     [0, 0, 0, 1]], dtype=np.double)

def rot_ar_y(radi):
    return  np.array([[np.cos(radi), 0, np.sin(radi), 0],
                      [0, 1, 0, 0],
                      [-np.sin(radi), 0, np.cos(radi), 0],
                     [0, 0, 0, 1]], dtype=np.double)

def rot_ar_z(radi):
    return  np.array([[np.cos(radi), -np.sin(radi), 0, 0],
                      [np.sin(radi), np.cos(radi), 0, 0],
                      [0, 0, 1, 0],
                     [0, 0, 0, 1]], dtype=np.double)

Let’s write code!

import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import rdMolTransforms
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit import rdBase
from IPython import display
import copy
import py3Dmol

def rot_ar_x(radi):
    return  np.array([[1, 0, 0, 0],
                      [0, np.cos(radi), -np.sin(radi), 0],
                      [0, np.sin(radi), np.cos(radi), 0],
                     [0, 0, 0, 1]], dtype=np.double)

def rot_ar_y(radi):
    return  np.array([[np.cos(radi), 0, np.sin(radi), 0],
                      [0, 1, 0, 0],
                      [-np.sin(radi), 0, np.cos(radi), 0],
                     [0, 0, 0, 1]], dtype=np.double)

def rot_ar_z(radi):
    return  np.array([[np.cos(radi), -np.sin(radi), 0, 0],
                      [np.sin(radi), np.cos(radi), 0, 0],
                      [0, 0, 1, 0],
                     [0, 0, 0, 1]], dtype=np.double)
tforms = {0: rot_ar_x, 1: rot_ar_y, 2: rot_ar_z}

I used simple molecule for test.

mol = Chem.AddHs(Chem.MolFromSmiles("COC1CN(C)CC1"))
AllChem.EmbedMolecule(mol)

Then I defined draw function. The function borrowed from rdkit-blog. ;)

# ref  http://rdkit.blogspot.com/2016/07/using-ipywidgets-and-py3dmol-to-browse.html
def drawit2(m,p,confId=-1):
    mb = Chem.MolToMolBlock(m,confId=confId)
    p.addModel(mb,'sdf')
    p.setStyle({'stick':{}})
    p.setBackgroundColor('0xeeeeee')
    p.zoomTo()

Now ready. Let’s rotate molecule around x, y, z axis.

# around X axis
p = py3Dmol.view(width=400, height=400)
for i in range(11):
    rdMolTransforms.TransformConformer(mol.GetConformer(0), tforms[0](2*np.pi/10))
    drawit2(mol,p)
p.show()

Looks good.

# Before align
v = PyMol.MolViewer()
v.DeleteAll()
process = subprocess.Popen(['python', showfeatpath, '--writeFeats','./before_align_cdk2.sdf'], stdout=subprocess.PIPE)
stdout = process.communicate()[0]
png=v.GetPNG()
display.display(png)

Rotate Z axis last!

# around Y axis
p = py3Dmol.view(width=400, height=400)
for i in range(11):
    rdMolTransforms.TransformConformer(mol.GetConformer(0), tforms[1](2*np.pi/10))
    drawit2(mol,p)
p.show()

Works fine. I do not understand why transformation matrix is 4 x 4. I wonder if you have any comments about it.

The code is uploaded to my repo and you can check from following URL.
https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/rotation_mol2.ipynbh

Advertisements

Visualize pharmacophore with RDKit #RDKit #Pymol #Chemoinformatics

Some years ago, I wrote a post about how to communicate pymol and RDKit. In the post, I demonstrated how to visualize Phamarcophore in rdkit.

And recently I got a query about the post and think about more efficient way.
What I want to write in the post is new approach to visualize pharmacophore with RDKit.

To do it, I ran pymol in server mode. The environment of pymol session is not need to same in rdkit env. I installed pymol via homebrew. So this version is python 3.7. And my environment of rdkit is python3.6.

iwatobipen$ pymol -R

Then write code on jupyter notebook. I picked up several mols from cdk2.sdf which is provided from rdkit Docs/Book/data folder.

from IPython import display
import os
import subprocess
from rdkit import Chem
from rdkit import RDConfig
from rdkit.Chem import AllChem
from rdkit.Chem import rdMolAlign
from rdkit.Chem import rdShapeHelpers
from rdkit.Chem import rdMolDescriptors
from rdkit.Chem import PyMol
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
import copy
import pprint
mols = Chem.SDMolSupplier('./before_align_cdk2.sdf', removeHs=False)
cdk2mol = [m for m in mols]
for m in cdk2mol:
    AllChem.EmbedMolecule(m, AllChem.ETKDGv2())

Then aligned molecules with crippenO3A method. I used first molecule as reference. And after align I saved molecule to align_cdk2.sdf file.

cdk2mol2 = copy.deepcopy(cdk2mol)
crippen_contribs = [rdMolDescriptors._CalcCrippenContribs(mol) for mol in cdk2mol2]
ref = cdk2mol2[0]
ref_contrib = crippen_contribs[0]
targets = cdk2mol2[1:]
targets_contrib = crippen_contribs[1:]

for i, target in enumerate(targets):
    crippenO3A = rdMolAlign.GetCrippenO3A(target, ref, targets_contrib[i], ref_contrib)
    crippenO3A.Align()

w = Chem.SDWriter('./align_cdk2.sdf')
w.write(ref)
for mol in targets:
    w.write(mol)
w.close()

Key function of ShowFeatures is ‘rdkit/Chem/Features/ShowFeats.py’
To use the function, it is needed to run the code like below.

iwatobipen$ python ShowFeats.py 

I want to call the function in jupyter notebook. So I used subprocess.
Following example shows before alignment and after alignment.

showfeatpath = os.path.join(RDConfig.RDCodeDir, 'Chem/Features/ShowFeats.py')

At first visualize molecules before align. ‘display’ function can display png image on notebook.

# Before align
v = PyMol.MolViewer()
v.DeleteAll()
process = subprocess.Popen(['python', showfeatpath, '--writeFeats','./before_align_cdk2.sdf'], stdout=subprocess.PIPE)
stdout = process.communicate()[0]
png=v.GetPNG()
display.display(png)

Hmm, the function can detect pharmacophores of each molecule but not aligned form. Next visualize molecules

#After align
v = PyMol.MolViewer()
v.DeleteAll()
process = subprocess.Popen(['python', showfeatpath,'--writeFeats','./align_cdk2.sdf'], stdout=subprocess.PIPE)
stdout = process.communicate()[0]
png=v.GetPNG()
display.display(png)

Now I could get following image.

And I set –writeFeats option. The option provides position of each pharmacophores.

res = stdout.decode('utf-8').replace('\t', ' ').split('\n')
pprint.pprint(res)

['# Family    X     Y    Z    Radius  # Atom_ids',
 'Donor -4.447 -0.580 -2.127 1.0 # 11',
 'Donor -2.369 -0.730 -2.637 1.0 # 13',
 'Donor -4.121 -0.009 0.275 1.0 # 14',
 'Donor -3.602 0.536 2.585 1.0 # 17',
 'Acceptor 2.288 -0.534 -1.549 1.0 # 5',
 'Acceptor -0.161 -0.294 -0.684 1.0 # 7',
 'Acceptor -2.369 -0.730 -2.637 1.0 # 13',
 'Acceptor -4.121 -0.009 0.275 1.0 # 14',
 'Acceptor -1.919 0.112 0.908 1.0 # 16',
 'PosIonizable -3.325 -0.576 -2.046 1.0 # 9 13 12 11 10',
 'Aromatic -3.325 -0.576 -2.046 1.0 # 9 10 11 12 13',
 'Aromatic -2.826 -0.102 -0.038 1.0 # 8 9 10 14 15 16',
 'Hydrophobe 3.473 -0.083 0.406 1.0 # 2',
 'LumpedHydrophobe 3.899 0.272 0.319 1.0 # 2 1 3',
 'Donor -4.491 -0.607 -2.118 1.0 # 2',
 'Donor -2.390 -0.682 -2.613 1.0 # 5',
 'Donor -4.131 -0.045 0.283 1.0 # 9',
 'Donor -3.541 0.496 2.576 1.0 # 10',
 'Acceptor -2.390 -0.682 -2.613 1.0 # 5',
 'Acceptor -1.879 0.136 0.883 1.0 # 7',
 'Acceptor -4.131 -0.045 0.283 1.0 # 9',
 'Acceptor -0.163 -0.211 -0.758 1.0 # 11',
 'Acceptor 2.410 -1.330 -1.015 1.0 # 17',
 'PosIonizable -3.355 -0.566 -2.032 1.0 # 3 2 1 5 4',
 'Aromatic -3.355 -0.566 -2.032 1.0 # 1 2 3 4 5',
 'Aromatic -2.828 -0.097 -0.047 1.0 # 3 4 6 7 8 9',
 'Donor -4.513 -0.557 -2.079 1.0 # 2',
 'Donor -2.426 -0.726 -2.623 1.0 # 5',
 'Donor -4.148 -0.000 0.288 1.0 # 9',
 'Donor -3.550 0.529 2.554 1.0 # 10',
 'Donor 2.549 0.697 -1.275 1.0 # 18',
 'Acceptor -2.426 -0.726 -2.623 1.0 # 5',
 'Acceptor -1.881 0.097 0.872 1.0 # 7',
 'Acceptor -4.148 -0.000 0.288 1.0 # 9',
 'Acceptor -0.137 -0.319 -0.735 1.0 # 11',
 'Acceptor 4.371 2.116 -1.804 1.0 # 17',
 'PosIonizable -3.375 -0.565 -2.021 1.0 # 3 2 1 5 4',
 'Aromatic -3.375 -0.565 -2.021 1.0 # 1 2 3 4 5',
 'Aromatic -2.824 -0.106 -0.053 1.0 # 3 4 6 7 8 9',
 '']

I want to do with Features.ShowFeats function directly from ipython notebook but it was difficult. So I call this function from other process.

BTW, it is interesting and cool code for visualize pharmacophores. RDKit is nice tool for chemoinformatics.

All code is uploaded to my repo.
https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/viz_p4core/display_pharmacophore.ipynb
https://github.com/iwatobipen/playground/tree/master/viz_p4core

New functionality of psikit #Chemoinformatics #RDKit #Psi4

Happy new year! ‘Akemashite Omedetou Gozaimasu’ in Japanese!

I and @kzfm_ -san are developing a library which uses a rdkit & psi4 named psikit. You can find brief introduction the concept following URL.
http://blog.kzfmix.com/entry/1536978824

Today I added new function that can calculate RESP charge of give molecule. RESP charge means ‘Restraints for Deriving
Atomic Charges’. Details can read from this link.

OK, I would like to show how to call new function. It is easy to use but it needs long time for large molecule. I used acetic acid as example. To calculate RESP Charge, user need to run optimization.

############################################
# This code came from RESP branch, not master branch.
############################################

from psikit import Psikit
smi = 'CC(=O)O'
# load smiles and perform optimzation
pk = Psikit()
pk.read_from_smiles(smi)
%time pk.optimize()
>Optimizer: Optimization complete!
>CPU times: user 21.6 s, sys: 920 ms, total: 22.5 s
>Wall time: 6.23 s
>-227.82082530474275

Now I can get HOMO, LUMO and RESP Charge. RESP charges are set as atomic properties of pk.mol object. Let’s check before calculation and after calculation.

# Get Atoms from pk.mol object and check atom properties
atoms = pk.mol.GetAtoms()
print(list(atoms[0].GetPropNames()))
> []
# run the calculation
charges = pk.resp_charge
atoms = pk.mol.GetAtoms()
print(list(atoms[0].GetPropNames()))
> ['EP_C', 'RESP_C']  # EP_C means electro static potential, RESP_C means RESP charge.

Finally draw molecule with RESP charges.

from rdkit.Chem.Draw import MolDrawOptions
from rdkit.Chem.Draw import rdMolDraw2D
from IPython.display import SVG
view = rdMolDraw2D.MolDraw2DSVG(600,800)
op = view.drawOptions()
for idx, atom in enumerate(pk.mol.GetAtoms()):
    num = float(atom.GetProp('RESP_C'))
    op.atomLabels[idx] = "{}({:,.2f})".format(atom.GetSymbol(), num)
AllChem.Compute2DCoords(pk.mol)
view.DrawMolecule(pk.mol)
view.FinishDrawing()
svg = view.GetDrawingText().replace('svg:', '')
SVG(svg)
acetic acid with RESP charge 6-31G** basis

The results indicates Oxygen atom has the most negative charge and alpha carbon shows more negative than carbonyl carbon.
The library is now under development. We would like to release the package near the feature.

All code of the post can get from the URL below.

https://nbviewer.jupyter.org/github/Mishima-syk/psikit/blob/resp/psikit/example.ipynb

Make interactive chemical space plot in jupyter notebook #cheminformatics #Altair

I often use seaborn for data visualization. With the library, user can make beautiful visualization.
BTW, today I tried to use another library that can make interactive plot in jupyter notebook.
Name of the library is ‘altair’.
https://altair-viz.github.io/index.html
The library can be installed from pip or conda and this package based vega and vega-lite. Vega is a python package for data visualization.

Regarding the tutorial, Altair can make beautiful plow with very simple code. I wrote an example that plot chemical space of cdk2.sdf which is provided by RDKit.

Following code is conducted in Google colab.
At first install rdkit in my environment. Fortunately Altair can call directly without installation to colab.

!wget https://repo.anaconda.com/miniconda/Miniconda3-4.5.1-Linux-x86_64.sh
!chmod +x Miniconda3-4.5.1-Linux-x86_64.sh
!time bash ./Miniconda3-4.5.1-Linux-x86_64.sh -b -f -p /usr/local
!time conda install -q -y -c conda-forge rdkit

After installation of RDKit, append rdkit path to sys path,

import sys
import os
sys.path.append('/usr/local/lib/python3.6/site-packages/')

Now ready. Let’s import some libraries.

import pandas as pd
import numpy as np
import altair as alt

from rdkit import Chem
from rdkit import rdBase
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem import PandasTools
from rdkit.Chem import RDConfig
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole

from sklearn.decomposition import PCA

Then load dataset

moldf = PandasTools.LoadSDF(os.path.join(RDConfig.RDDocsDir, 'Book/data/cdk2.sdf'))
moldf['SMILES'] = moldf.ROMol.apply(Chem.MolToSmiles)
def mol2fparr(mol):
    arr = np.zeros((0,))
    fp = AllChem.GetMorganFingerprintAsBitVect(mol,2)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

The conduct PCA with molecular fingerprint for plot chemical space.
And make new dataset. cdk2.sdf has Cluster number, so I used the annotation for coloring.

pca = PCA(n_components=2)
X = np.asarray([mol2fparr(mol) for mol in moldf.ROMol])
print(X.shape)
res = pca.fit_transform(X)
print(res.shape)
moldf['PCA1'] = res[:,0]
moldf['PCA2'] = res[:,1]
moldf2 = moldf[['ID', 'PCA1', 'PCA2', 'SMILES' ]]
moldf2['Cluster'] = ["{:0=2}".format(int(cls)) for cls in moldf.loc[:,'Cluster']]

To make scatter plot in Altair, it is easy just call ‘alt.Cahrt.mark_point(pandas data frame)’
mark_* is the method which can access many kids of plots.
From the document, following plots are provided.

Mark Name Method Description Example
area mark_area() A filled area plot. Simple Stacked Area Chart
bar mark_bar() A bar plot. Simple Bar Chart
circle mark_circle() A scatter plot with filled circles. One Dot Per Zipcode
geoshape mark_geoshape() A geographic shape Choropleth Map
line mark_line() A line plot. Simple Line Chart
point mark_point() A scatter plot with configurable point shapes. Multi-panel Scatter Plot with Linked Brushing
rect mark_rect() A filled rectangle, used for heatmaps Simple Heatmap
rule mark_rule() A vertical or horizontal line spanning the axis. Candlestick Chart
square mark_square() A scatter plot with filled squares. N/A
text mark_text() A scatter plot with points represented by text. Bar Chart with Labels
tick mark_tick() A vertical or horizontal tick mark. Simple Strip Plot

Now I would like to make scatter plot, so I call mark_point.
“interactive()” method returns interactive plot in jupyter.
So After run the code, I can see interactive plot in notebook the plot returns tooltip when mouse over the point.

alt.Chart(moldf2).mark_point().encode(
           x = 'PCA1',
           y = 'PCA2',
           color = 'Cluster',
           tooltip = ['ID', 'SMILES']).interactive()

This library is interesting for me because it is easy to implement tooltip. I tried to embed SVG image to tooltip but it did not work. I would like to know how to embed image to the tooltip if it possible.

How to visualize your data is important because it receives different impression from different visualization even if data is same.

Reader who is interested in the post can found whole code from google colab or github. ;-)
https://colab.research.google.com/drive/1hKcWRBcQG51eGsbpDBF2gl6CoZsmVvTs
https://github.com/iwatobipen/playground/blob/master/plot_chemicalspace.ipynb

Build stacking Classification QSAR model with mlxtend #chemoinformatics #mlxtend #RDKit

I posed about the ML method named ‘blending’ somedays ago. And reader recommended me that how about try to use “mlxtend”.
When I learned ensemble learning package in python I had found it but never used.
So try to use the library to build model.
Mlxtend is easy to install and good document is provided from following URL.
http://rasbt.github.io/mlxtend/

Following code is example for stacking.
In ipython notebook…
Use base.csv for test and load some functions.

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
import numpy as np
import pandas as pd

df = pd.read_csv("bace.csv")

The dataset has pIC50 for objective value.

mols = [Chem.MolFromSmiles(smi) for smi in df.mol]
fps = [AllChem.GetMorganFingerprintAsBitVect(mol,2, nBits=1024) for mol in mols]
pIC50 = [i for i in df.pIC50]
Draw.MolsToGridImage(mols[:10], legends=["pIC50 "+str(i) for i in pIC50[:10]], molsPerRow=5)

Images of compounds is below.


Then calculate molecular fingerprint. And make binary activity array as y_bin.

X = []
for fp in fps:
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    X.append(arr)
X = np.array(X)
y = np.array(pIC50)
y_bin = np.asarray(y>7, dtype=np.int)

Then load some classifier model from sklearn and split data for training and testing.

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix
from sklearn.decomposition import PCA
from xgboost import XGBClassifier
from mlxtend.classifier import StackingClassifier
from mlxtend.plotting import plot_decision_regions
from mlxtend.plotting import plot_confusion_matrix
import numpy as np
x_train, x_test, y_train, y_test = train_test_split(X,y_bin, test_size=0.2)

To make stacking classifier, it is very simple just call StackingClassifier and set classifier and meta_classifier as arguments.
I use SVC as meta_classifier.

clf1 = RandomForestClassifier(random_state=794)
clf2 = GaussianNB()
clf3 = XGBClassifier(random_state=0)
clf4 = SVC(random_state=0)
clflist = ["RF", "GNB", "XGB", "SVC", "SCLF"]

sclf = StackingClassifier(classifiers=[clf1,clf2,clf3], meta_classifier=clf4)

Then let’s learn the data!

skf = StratifiedKFold(n_splits=5)
for j, (train_idx,test_idx) in enumerate(skf.split(x_train, y_train)):
    for i, clf in enumerate([clf1, clf2, clf3, clf4, sclf]):
        clf.fit(x_train[train_idx],y_train[train_idx])
        ypred = clf.predict(x_train[test_idx])
        acc = accuracy_score(y_train[test_idx], ypred)
        b_acc = balanced_accuracy_score(y_train[test_idx], ypred)
        print("round {}".format(j))
        print(clflist[i])
        print("accuracy {}".format(acc))
        print("balanced accuracy {}".format(b_acc))
        print("="*20)

> output
round 0
RF
accuracy 0.8148148148148148
balanced accuracy 0.8026786943947115
====================
round 0
GNB
accuracy 0.6625514403292181
balanced accuracy 0.680450351191296
====================
round 0
XGB
accuracy 0.8271604938271605
balanced accuracy 0.8136275995042005
====================
round 0
SVC
accuracy 0.7325102880658436
balanced accuracy 0.7072717256576229
====================
round 0
SCLF
accuracy 0.8148148148148148
balanced accuracy 0.8026786943947115
====================
round 1
RF
accuracy 0.7603305785123967
balanced accuracy 0.7534683684794672
====================
round 1
GNB
accuracy 0.640495867768595
balanced accuracy 0.6634988901220866
====================
round 1
XGB
accuracy 0.8140495867768595
balanced accuracy 0.8127081021087681
====================
round 1
SVC
accuracy 0.756198347107438
balanced accuracy 0.7414678135405106
===================
.....

Reader who is interested in stacking, you can find nice document here
http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/#example-1-simple-stacked-classification

And my all code is uploaded to myrepo on github.
http://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/mlxtend_test.ipynb

Mlxtend has many functions for building, analyzing and visualizing machine learning model and data. I will use the package more and more.

Vote Vote Vote #chemoinformatics

Somedays ago, I posted about ensemble classification method named ‘blending’. The method is not implemented in scikit-learn. So I am implementing the function now.
By the way, scikit-learn has an ensemble classification method named ‘VotingClassifer’.

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html#sklearn.ensemble.VotingClassifier
Following explanation from sklearn document.

The idea behind the VotingClassifier is to combine conceptually different machine learning classifiers and use a majority vote or the average predicted probabilities (soft vote) to predict the class labels. Such a classifier can be useful for a set of equally well performing model in order to balance out their individual weaknesses.

The classifier can combine many classifiers very easily.
The function has two modes, one is hard and the other is soft.
From document…
If ‘hard’, uses predicted class labels for majority rule voting. Else if ‘soft’, predicts the class label based on the argmax of the sums of the predicted probabilities, which is recommended for an ensemble of well-calibrated classifiers.

I used the classifier for QSAR modeling.

Following code, I used four classifier and BASE.csv from molecule net as test dataset.
Code is very simple! Just pass defined dictionary to VotingClassifier.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from xgboost import XGBClassifier

clf_dict = {'RF': RandomForestClassifier(n_estimators=100),
        'ETC': ExtraTreesClassifier(n_estimators=100),
        'GBC': GradientBoostingClassifier(learning_rate=0.05),
        'XGB': XGBClassifier(n_estimators=100),
        'SVC': SVC(probability=True, gamma='auto')}

voting_clf = VotingClassifier(estimators=[("RF", clf_dict["RF"]),
                                        ("GBC", clf_dict["GBC"]),
                                        ("XGB", clf_dict["XGB"]),
                                        ("SVC", clf_dict["SVC"])        
                                    ], voting='hard')

dataset = np.load("train.npz")['arr_0']
X = dataset[:,:-1]
y = dataset[:,-1]
idx = np.random.permutation(y.size)
X = X[idx]
y = y[idx]
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=794)
nfolds = 10
skf = StratifiedKFold(nfolds)
for i, (train, val) in enumerate(skf.split(train_X, train_y)):
    #print('fold {}'.format(i))
    X_train = train_X[train]
    y_train = train_y[train]
    X_val = train_X[val]
    y_val = train_y[val]
    voting_clf.fit(X_train, y_train)
y_pred = voting_clf.predict(test_X)
print("Voting!")
print(confusion_matrix(test_y, y_pred))
print(classification_report(test_y, y_pred))

rf_clf = RandomForestClassifier(n_estimators=100)
 
for i, (train, val) in enumerate(skf.split(train_X, train_y)):
    #print('fold {}'.format(i))
    X_train = train_X[train]
    y_train = train_y[train]
    X_val = train_X[val]
    y_val = train_y[val]
    rf_clf.fit(X_train, y_train)
y_pred = rf_clf.predict(test_X)
print("Random Forest!")
print(confusion_matrix(test_y, y_pred))
print(classification_report(test_y, y_pred))

svc_clf = SVC(probability=True, gamma='auto')
for i, (train, val) in enumerate(skf.split(train_X, train_y)):
    #print('fold {}'.format(i))
    X_train = train_X[train]
    y_train = train_y[train]
    X_val = train_X[val]
    y_val = train_y[val]
    svc_clf.fit(X_train, y_train)
y_pred = svc_clf.predict(test_X)
print("SV!")
print(confusion_matrix(test_y, y_pred))
print(classification_report(test_y, y_pred)) 

Then run the code.
In this example voting method does not outperform to random forest, support vector classifier. But it is worth to know that sklearn provides useful feature for ensemble learning I think. ;-)

iwatobipen$ python voting.py 
Voting!
[[10  0  0]
 [ 0 11  0]
 [ 0  1  8]]
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        10
         1.0       0.92      1.00      0.96        11
         2.0       1.00      0.89      0.94         9

   micro avg       0.97      0.97      0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30

Random Forest!
[[10  0  0]
 [ 0 11  0]
 [ 0  1  8]]
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        10
         1.0       0.92      1.00      0.96        11
         2.0       1.00      0.89      0.94         9

   micro avg       0.97      0.97      0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30

SV!
[[10  0  0]
 [ 0 11  0]
 [ 0  1  8]]
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        10
         1.0       0.92      1.00      0.96        11
         2.0       1.00      0.89      0.94         9

   micro avg       0.97      0.97      0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30

Visualize pharmacophore in RDKit #RDKit

RDKit has pharmacophore feature assignment function. The function can retrieve molecular features based on pre-defined ph4core.
And RDKit IPythonconsole can draw molecules on ipython notebook.
Today I tried to visualize ph4core on notebook.
Code is very simple.

from rdkit import Chem
from rdkit.Chem import ChemicalFeatures
from rdkit import rdBase
from rdkit.RDPaths import RDDocsDir
from rdkit.RDPaths import RDDataDir
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit.Chem import AllChem
import os
print(rdBase.rdkitVersion)
IPythonConsole.ipython_useSVG=True
> 2018.09.1

First, load feature definition.

fdefFile = os.path.join(RDDataDir,'BaseFeatures.fdef')
featFact = ChemicalFeatures.BuildFeatureFactory(fdefFile)

Then calculate pharmacophore. And compute 2D cordes.

mols = [m for m in Chem.SDMolSupplier(os.path.join(RDDocsDir,"Book/data/cdk2.sdf"))]
featslists = [featFact.GetFeaturesForMol(mol) for mol in mols]
for mol in mols:
    AllChem.Compute2DCoords(mol)

Next I defined drawing function. To highlight ph4core, highlightAtomLists and legends are used as optional arguments of MolsToGridImage.

def drawp4core(mol, feats):
    atoms_list = {}
    for feat in feats:
        atom_ids = feat.GetAtomIds()
        feat_type = feat.GetType()
        atoms_list[feat_type] = atom_ids
    return Draw.MolsToGridImage([mol]*len(atoms_list), legends=list(atoms_list.keys()), highlightAtomLists=list(atoms_list.values()))

Test the function.

im = drawp4core(mols[1], featslists[1])
im

I could get following image.

The function worked and it could pick up pharmacophore of given molecule. ;-)