How to get molecular graph features

Lately, I have been interested in DeepChem, which is an open-source deep learning toolkit for drug discovery. DeepChem supports many featurization methods for chemoinformatics.
One of the interesting features is the calculation of molecular graphs, which is more primitive than a hashed fingerprint. I tried to calculate it.

Currently the toolkit supports only Linux, so I installed DeepChem via Docker.
The installation was very easy.

iwatobipen$ docker pull deepchemio/deepchem
# wait a moment.... 😉
iwatobipen$ docker run -i -t deepchemio/deepchem
iwatobipen$ pip install jupyter
# the following step is optional
iwatobipen$ apt-get install vim

That’s all.
Next, I worked in the Docker environment.

import deepchem as dc
from deepchem.feat import graph_features
from rdkit import Chem
convmol=graph_features.ConvMolFeaturizer()
mol = Chem.MolFromSmiles('c1ccccc1')
# convmol needs list of molecules
fs = convmol.featurize( [mol] )
f = fs[ 0 ]
# check method
dir( f )
Out[41]:
[ .....
 'agglomerate_mols',
 'atom_features',
 'canon_adj_list',
 'deg_adj_lists',
 'deg_block_indices',
 'deg_id_list',
 'deg_list',
 'deg_slice',
 'deg_start',
 'get_adjacency_list',
 'get_atom_features',
 'get_atoms_with_deg',
 'get_deg_adjacency_lists',
 'get_deg_slice',
 'get_null_mol',
 'get_num_atoms',
 'get_num_atoms_with_deg',
 'max_deg',
 'membership',
 'min_deg',
 'n_atoms',
 'n_feat']

To get the atom features, use ‘get_atom_features’.
To get the edge information, use ‘get_adjacency_list’.

f.get_atom_features()
Out[42]:
array([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 1]])
f.get_adjacency_list()
Out[43]: [[1, 5], [0, 2], [1, 3], [2, 4], [3, 5], [4, 0]]

Each row of the atom feature array is a one-hot encoding of atom properties: a carbon atom with degree 2, SP2 hybridization, and aromaticity.
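As a quick sanity check that needs no DeepChem at all, the adjacency list above should describe a 6-cycle: every carbon has graph degree 2 and every edge is listed symmetrically.

```python
# adjacency list for benzene, as returned by get_adjacency_list() above
adj = [[1, 5], [0, 2], [1, 3], [2, 4], [3, 5], [4, 0]]

# every ring carbon has exactly two neighbours
degrees = [len(neighbours) for neighbours in adj]
assert degrees == [2] * 6

# edges are symmetric: if j is a neighbour of i, then i is a neighbour of j
for i, neighbours in enumerate(adj):
    for j in neighbours:
        assert i in adj[j]

print("benzene graph looks consistent")
```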

As a next step, I will try to build a model using molecular graphs.

What exactly is my job?

I find myself wondering, as always.
These days, synthesis resources can be secured outside the company as well as inside it. To put it in extreme terms, a CMO may even have more time to devote to synthesis, since they have fewer miscellaneous tasks to deal with.
Of course there are budget considerations, but I feel we are increasingly required to compare calmly the merits of working in-house against those of outsourcing, and to decide accordingly.
In my university days I was deeply absorbed in natural product total synthesis, and after joining the company I still loved the bench, but now I have hardly any (in fact, almost no) time to stand at the bench anymore.
That said, it is not as if I am spending much time on molecular design either.
Drug discovery is a team sport.
I really need to think about my own role…
http://onlinelibrary.wiley.com/doi/10.1002/anie.201210006/abstract

Compile LeapPython and manipulate protein structures with Leap Motion.

I have been interested in the Kinect as an input device, because it can detect motion, which feels futuristic. 😉
And some days ago, I got a new input device, the “Leap Motion”.
The Leap Motion controller is a small USB peripheral device designed to be placed on a physical desktop, facing upward (from Wikipedia).
I want to use this device with PyMOL!
Let’s try it.
First, to use the Leap Motion from Python, you need the Leap Motion SDK. My environment is OSX, so I got SDK v2.3 from the developer site.
https://developer.leapmotion.com/sdk/v2
The installation was a little bit complicated for me, because I wanted to use the LeapPython library from a virtual environment, but the library is linked against the native Python.
So I changed the dynamic link, or rebuilt the library, with the following procedures.

For Python 2.7, I changed the link with the following commands.

iwatobipen$ cd /somepath/LeapSDK/lib
iwatobipen$ otool -L LeapPython.so
LeapPython.so:
	/Library/Frameworks/Python.framework/Versions/2.7/Python (compatibility version 2.7.0, current version 2.7.0)
	@loader_path/libLeap.dylib (compatibility version 0.7.0, current version 2.3.1)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1197.1.1)
	/usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 120.0.0)
# change Python link from native to anaconda
iwatobipen$ install_name_tool -change /Library/Frameworks/Python.framework/Versions/2.7/Python /Users/iwatobipen/.pyenv/versions/anaconda-2.4.0/lib/libpython2.7.dylib LeapPython.so

For Python 3.5, I compiled LeapPython.so using SWIG.
The newest SWIG causes an error, so I used SWIG version 3.0.3.
I referred to the following URL: https://support.leapmotion.com/hc/en-us/articles/223784048-Generating-a-Python-3-3-0-Wrapper-with-SWIG-2-0-9
OK, let’s go.

# Copy Leap.h, LeapMath.h, Leap.i, and libLeap.dylib into one folder, then run the following commands.
# Generate LeapPython.cpp
iwatobipen$ swig -c++ -python -o LeapPython.cpp -interface LeapPython Leap.i
# Compile and link
iwatobipen$ clang++ -arch x86_64 -I/Users/iwatobipen/.pyenv/versions/anaconda-2.4.0/include/python2.7 LeapPython.cpp libLeap.dylib /Users/iwatobipen/.pyenv/versions/anaconda-2.4.0/lib/libpython2.7.dylib -shared -o LeapPython.so

Now I have LeapPython.so linked against the Python in my Anaconda environment.

Unfortunately, I could not install PyMOL in Python 3.5, so the following code was run in Python 2.7. I installed PyMOL using the conda install command.
The PyMOL Wiki provides sample code for using the Leap Motion from PyMOL. I used that example, and added the path where LeapPython.so is placed to sys.path.
test.py

import sys
import math
from pymol import cmd
sys.path.insert( 0, '/Users/iwatobipen/develop/py2env/LeapSDK/build/')

import Leap
from Leap import Matrix, Vector

class PymolListener(Leap.Listener):
    def __init__(self, *args, **kwargs):
        super(PymolListener, self).__init__(*args, **kwargs)

        self.prev_frame = None

        self.controller = Leap.Controller()
        self.controller.add_listener(self)

    def __del__(self):
        self.controller.remove_listener(self)

        super(PymolListener, self).__del__()

    def update_view(self, frame):
        if not self.prev_frame:
            return

        view = list(cmd.get_view())

        if frame.rotation_probability(self.prev_frame) > 0.1:
            m = frame.rotation_matrix(self.prev_frame)
            m *= Matrix(Vector(*view[0:3]),
                        Vector(*view[3:6]),
                        Vector(*view[6:9]))
            view[:9] = m.to_array_3x3()

        if frame.scale_probability(self.prev_frame) > 0.1:
            s = frame.scale_factor(self.prev_frame)
            delta_z = math.log(s) * 100.0
            view[11] += delta_z
            view[15] -= delta_z
            view[16] -= delta_z

        cmd.set_view(view)

    def on_frame(self, controller):
        frame = controller.frame()

        self.update_view(frame)

        self.prev_frame = frame

listener = PymolListener()
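One detail in the listener worth spelling out: the scale factor Leap reports is mapped to a camera translation through a logarithm, so zooming in by some factor and back out by its inverse cancels exactly. A minimal sketch of that mapping (the gain of 100.0 is the constant from the script above):

```python
import math

def zoom_delta(scale_factor, gain=100.0):
    # scale_factor > 1 (hands moving apart) zooms in, < 1 zooms out
    return math.log(scale_factor) * gain

print(zoom_delta(1.0))              # no hand movement -> no zoom
assert zoom_delta(2.0) > 0          # zoom in
assert zoom_delta(0.5) < 0          # zoom out
# the log makes the mapping symmetric: x2 then x0.5 returns to the start
assert abs(zoom_delta(2.0) + zoom_delta(0.5)) < 1e-9
```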

Then I ran PyMOL and fetched a sample PDB.

iwatobipen$ pymol
# from pymolconsole.
pymol> fetch 1atp
pymol> run test.py

The result is… it seems to work well. Now I am reading the API reference to write more code.

Target prediction using local ChEMBL

Yesterday, I posted about target prediction using the ChEMBL web API.
If I want to predict many molecules, it takes a long time, so I changed the code to use a local ChEMBL DB.
I used SQLAlchemy, because the library is powerful and flexible enough to work with any RDB.
The test code is below. The sample script takes a SMILES string as input and returns the top 10 predicted targets.
I think Python is a very powerful language. I can do chemical structure handling, RDB searching, data merging, etc., all with Python!

from sqlalchemy import create_engine, MetaData
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from sklearn.externals import joblib

import pandas as pd
import numpy as np
import sys

smiles = sys.argv[ 1 ]
morgan_nb = joblib.load( 'models_22/10uM/mNB_10uM_all.pkl' )
classes = list( morgan_nb.targets )

mol = Chem.MolFromSmiles( smiles )
fp = AllChem.GetMorganFingerprintAsBitVect( mol, 2, nBits = 2048 )
res = np.zeros( len(fp), np.int32 )
DataStructs.ConvertToNumpyArray( fp, res )

probas = list( morgan_nb.predict_proba( res.reshape(1,-1))[0] )
predictions = pd.DataFrame(  list(zip(classes, probas)), columns=[ 'id', 'proba' ])

top10_pred = predictions.sort_values( by = 'proba', ascending = False ).head( 10 )
db = create_engine( 'postgres+psycopg2://<username>:<password>@localhost/chembl_22' )
conn = db.connect()

def getprefname( chemblid ):
    res = conn.execute( "select chembl_id, pref_name,organism from target_dictionary where chembl_id='{0}'".format( chemblid ))
    res = res.fetchall()
    return res[0]

plist = []
for i, e in enumerate( top10_pred['id'] ):
    plist.append( list(getprefname(e)) )
conn.close()
target_info = pd.DataFrame( plist, columns = ['id', 'name', 'organism'] )
summary_df = pd.merge( top10_pred, target_info, on='id')

print( summary_df )
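A side note on the raw SQL above: building the query with str.format works here, but a parameterized query is generally safer (no quoting problems, no SQL injection). The sketch below illustrates the idea with an in-memory SQLite stand-in for the target_dictionary table; with SQLAlchemy the parameters can be passed in a similar way.

```python
import sqlite3

# in-memory stand-in for the ChEMBL target_dictionary table (illustration only)
conn = sqlite3.connect(":memory:")
conn.execute("create table target_dictionary (chembl_id text, pref_name text, organism text)")
conn.execute("insert into target_dictionary values (?, ?, ?)",
             ("CHEMBL2835", "Tyrosine-protein kinase JAK1", "Homo sapiens"))

def getprefname(chemblid):
    # '?' placeholders let the driver handle quoting safely
    cur = conn.execute(
        "select chembl_id, pref_name, organism from target_dictionary "
        "where chembl_id = ?", (chemblid,))
    return cur.fetchone()

print(getprefname("CHEMBL2835"))
# ('CHEMBL2835', 'Tyrosine-protein kinase JAK1', 'Homo sapiens')
```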

OK, let’s check the performance from the shell.

Tofacitinib:

iwatobipen$ python targetprediction.py 'CC1CCN(CC1N(C)C2=NC=NC3=C2C=CN3)C(=O)CC#N'
           id     proba                                   name      organism
0  CHEMBL2835  1.000000           Tyrosine-protein kinase JAK1  Homo sapiens
1  CHEMBL2148  1.000000           Tyrosine-protein kinase JAK3  Homo sapiens
2  CHEMBL2971  1.000000           Tyrosine-protein kinase JAK2  Homo sapiens
3  CHEMBL5073  1.000000                     CaM kinase I delta  Homo sapiens
4  CHEMBL3553  0.999986           Tyrosine-protein kinase TYK2  Homo sapiens
5  CHEMBL4147  0.999966                    CaM kinase II alpha  Homo sapiens
6  CHEMBL4924  0.999896    Ribosomal protein S6 kinase alpha 6  Homo sapiens
7  CHEMBL5698  0.999871         NUAK family SNF1-like kinase 2  Homo sapiens
8  CHEMBL3032  0.999684                      Protein kinase N2  Homo sapiens
9  CHEMBL5683  0.999640  Serine/threonine-protein kinase DCLK1  Homo sapiens

Imatinib:

iwatobipen$ python targetprediction.py 'CN1CCN(CC2=CC=C(C=C2)C(=O)NC2=CC(NC3=NC=CC(=N3)C3=CN=CC=C3)=C(C)C=C2)CC1'
           id     proba                                               name  \
0  CHEMBL1862  1.000000                        Tyrosine-protein kinase ABL   
1  CHEMBL5145  1.000000              Serine/threonine-protein kinase B-raf   
2  CHEMBL1936  1.000000                   Stem cell growth factor receptor   
3  CHEMBL2007  1.000000      Platelet-derived growth factor receptor alpha   
4  CHEMBL5122  1.000000             Discoidin domain-containing receptor 2   
5  CHEMBL1974  0.999999              Tyrosine-protein kinase receptor FLT3   
6  CHEMBL3905  0.999999                        Tyrosine-protein kinase Lyn   
7  CHEMBL4722  0.999994           Serine/threonine-protein kinase Aurora-A   
8   CHEMBL279  0.999991      Vascular endothelial growth factor receptor 2   
9  CHEMBL5319  0.999988  Epithelial discoidin domain-containing receptor 1   

       organism  
0  Homo sapiens  
1  Homo sapiens  
2  Homo sapiens  
3  Homo sapiens  
4  Homo sapiens  
5  Homo sapiens  
6  Homo sapiens  
7  Homo sapiens  
8  Homo sapiens  
9  Homo sapiens  

Known molecules are predicted with high accuracy. Next I want to try unknown molecules. 😉

Target prediction using ChEMBL

You know, there are some publicly available databases in the chemoinformatics area.
ChEMBL DB is one of the most useful. George Papadatos introduced a useful tool for target prediction using ChEMBL. He provides ChEMBL target prediction models via an FTP server!
So everyone can use the models.
I used the model and tried target prediction.
First, I got the model from the FTP server and launched a Jupyter notebook. 😉

iwatobipen$ wget ftp://ftp.ebi.ac.uk/pub/databases/chembl/target_predictions/chembl_22_models.tar.gz
iwatobipen$ tar -vzxf chembl_22_models.tar.gz
iwatobipen$ jupyter notebook

It’s ready! Let’s go ahead.

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import PandasTools
from rdkit import DataStructs
import pandas as pd
from pandas import concat
from collections import OrderedDict
import requests
import numpy
from sklearn.externals import joblib
from rdkit import rdBase
print( rdBase.rdkitVersion )
>2016.09.2

I tried this in a Python 3.5 environment.

morgan_nb = joblib.load( 'models_22/10uM/mNB_10uM_all.pkl' )
classes = list( morgan_nb.targets )
len( classes )
> 1524 # model has 1524 targets ( classes )

I used sitagliptin as input molecule.

smiles = 'C1CN2C(=NN=C2C(F)(F)F)CN1C(=O)C[C@@H](CC3=CC(=C(C=C3F)F)F)N'
mol = Chem.MolFromSmiles( smiles )
mol

Next, calculate the Morgan fingerprint and convert it to a numpy array.

fp = AllChem.GetMorganFingerprintAsBitVect( mol, 2, nBits=2048 )
res = numpy.zeros( len(fp), numpy.int32 )
DataStructs.ConvertToNumpyArray( fp, res )

Predict the targets and sort the results by probability.

probas = list( morgan_nb.predict_proba( res.reshape(1,-1) )[0] )
predictions = pd.DataFrame( list(zip(classes, probas)), columns=[ 'id', 'proba' ] )
top_pred = predictions.sort_values( by='proba', ascending=False ).head(10)
top_pred
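The sorting step itself is plain pandas; with a few made-up ChEMBL IDs and probabilities (for illustration only) it looks like this:

```python
import pandas as pd

# dummy ChEMBL ids and probabilities, for illustration only
classes = ["CHEMBL2835", "CHEMBL2148", "CHEMBL5073"]
probas = [0.91, 0.99, 0.15]

predictions = pd.DataFrame(list(zip(classes, probas)), columns=["id", "proba"])
top_pred = predictions.sort_values(by="proba", ascending=False).head(2)
print(list(top_pred["id"]))  # ['CHEMBL2148', 'CHEMBL2835']
```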

Jupyter notebook ver5 changed table view!

Then convert from chembl ID to target name.

def fetch_WS( trgt ):
    # parse the JSON response once, then pick out the fields we need
    data = requests.get( 'https://www.ebi.ac.uk/chembl/api/data/target/{0}.json'.format(trgt) ).json()
    return ( trgt, data['pref_name'], data['organism'] )
plist = []
for i , e in enumerate( top_pred['id'] ):
    plist.append( fetch_WS(e) )
target_info = pd.DataFrame( plist, columns=['id', 'name', 'organism'])
pd.merge( top_pred, target_info )

The model predicted that sitagliptin is a DPP4 modulator! I think this work is interesting. I will try to predict other molecules and integrate the local ChEMBL DB to improve performance.
😉

The original source code is at the following URLs. Thanks for the useful information!
https://github.com/madgpap/notebooks
http://chembl.blogspot.jp/2016/03/target-prediction-models-update.html

Python Sudoku solver ;-D

Recently, I have been learning about linear optimization using Python.
From Wikipedia:
Sudoku is a logic-based, combinatorial number-placement puzzle. The objective is to fill a 9 x 9 grid with digits so that each column, each row, and each of the 3 x 3 subgrids that compose the grid contains all of the digits from 1 to 9.

I used a Python library named “pulp” to solve the problem.
https://pythonhosted.org/PuLP/index.html

For convenience, I used a 4 x 4 grid containing 2 x 2 subgrids.
First, here is the partially filled grid.

+----+----+
|    |4   |
|    |  1 |
+----+----+
|4   |    |
|2 1 |    |
+----+----+



To use pulp, you need to set an objective function first. In this problem the objective function is set to 0, because Sudoku does not require minimizing or maximizing anything; we only need a feasible solution.
Vals, Rows, and Cols are lists, and Boxes represents the 2 x 2 subgrids.
LpVariable.dicts creates a dictionary of LP variables.
So the following code,
LpVariable.dicts( “test”, ( [1,2],[3,4] ), 0, 1, LpInteger )
returns
{1: {3: test_1_3, 4: test_1_4}, 2: {3: test_2_3, 4: test_2_4}}.
The first argument is a name used as a prefix, the second is the indices, and the third and fourth are the lower and upper bounds of the variables.
I used LpInteger with bounds 0 and 1, so each variable takes the value 0 or 1.
To solve the problem, the code makes 4 x 4 x 4 = 64 variables.
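Since the structure returned by LpVariable.dicts can be hard to picture, here is a plain-Python imitation of the nested dictionary, with variable names standing in for the LP variables:

```python
# plain-dict imitation of LpVariable.dicts("choice", (Vals, Rows, Cols), 0, 1, LpInteger)
Vals = Rows = Cols = [str(i) for i in range(1, 5)]

choices = {v: {r: {c: "choice_%s_%s_%s" % (v, r, c) for c in Cols}
               for r in Rows}
           for v in Vals}

print(choices["1"]["2"]["3"])  # choice_1_2_3
# 4 values x 4 rows x 4 columns = 64 variables in total
assert sum(len(choices[v][r]) for v in Vals for r in Rows) == 64
```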

from pulp import *
Sequence = [ str(i) for i in range(1,5) ]
Vals = Sequence
Rows = Sequence
Cols = Sequence
Boxes = []
for i in range( 2 ):
    for j in range( 2 ):
        Boxes += [ [ (Rows[ 2*i + k ], Cols[ 2*j + l]) for k in range( 2) for l in range( 2 ) ] ]

prob = LpProblem( "sudoku4x4", LpMinimize  )
choices = LpVariable.dicts( "choice",( Vals, Rows, Cols ), 0,1, LpInteger )
# set problem
prob += 0, 'objective function'

Next, set the constraints.
There are 4 variables for each square, and exactly one of them can take the value 1. The same holds for each value within each row, each column, and each box. Finally, input the given clues.

# set constrain

for r in Rows:
    for c in Cols:
        prob += lpSum( [choices[v][r][c] for v in Vals] ) == 1, ''

for v in Vals:
    for r in Rows:
        prob += lpSum( [choices[v][r][c] for c in Cols] ) == 1, ''
    for c in Cols:
        prob += lpSum( [choices[v][r][c] for r in Rows] ) == 1, ''
    for b in Boxes:
        prob += lpSum( [choices[v][r][c] for (r,c) in b] ) == 1, ''

prob += choices['4']['1']['3'] == 1
prob += choices['1']['2']['4'] == 1
prob += choices['4']['3']['1'] == 1
prob += choices['2']['4']['1'] == 1
prob += choices['1']['4']['2'] == 1

Now we are ready to solve the problem.

prob.solve()

print( "status:", LpStatus[prob.status] )

f = open( "sudokuout.txt", "w" )
for r in Rows:
    if r == "1" or r == "3":
        f.write( "+----+----+\n" )
    for c in Cols:
        for v in Vals:
            if value( choices[v][r][c] ) == 1:
                if c == "1" or c == "3":
                    f.write( "|" )
                f.write(  v+" " )
                if c == "4":
                    f.write('|\n')
f.write( "+----+----+\n" )
f.close()

Check the output file.
The problem was solved.


iwatobipen$ cat sudokuout.txt
+----+----+
|1 2 |4 3 |
|3 4 |2 1 |
+----+----+
|4 3 |1 2 |
|2 1 |3 4 |
+----+----+

Cross entropy clustering

My kids love mickey mouse. 😉
If you have mickey mouse plot, can you cluster the data reasonably ?
R package named gmum.r provides interesting solution to do it.
The package has many features for data analysis. If reader who has interest it please use it.
I used the package to cluster mickey mouse data!

library(gmum.r)
data("cec.mouse1.spherical")
plot(  cec.mouse1.spherical )

Mickey mouse !

Next, compare k-means and CEC.

k <- kmeans( cec.mouse1.spherical, 10 )
plot( cec.mouse1.spherical, col=k$cluster )

Oh! colorful mickey !!!
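For readers without R at hand, the hard-assignment behaviour of k-means can be sketched in plain NumPy. A few Lloyd iterations on a toy two-blob dataset (made up here, unrelated to the mouse data) recover the blobs:

```python
import numpy as np

rng = np.random.default_rng(0)
# two well-separated Gaussian blobs, 50 points each
X = np.vstack([rng.normal((0.0, 0.0), 0.1, size=(50, 2)),
               rng.normal((5.0, 5.0), 0.1, size=(50, 2))])

centers = np.array([[0.5, 0.5], [4.5, 4.5]])
for _ in range(10):
    # assign each point to its nearest centre, then recompute the centres
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    centers = np.array([X[labels == k].mean(axis=0) for k in range(2)])

assert set(labels[:50]) == {0} and set(labels[50:]) == {1}
print("both blobs recovered")
```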

Next use CEC.

c <- CEC(k=3, x=cec.mouse1.spherical, control.nstart=10, method.type='spherical')
plot(c)


The result fits my intuition.

I think CEC is similar to EM-based clustering.