Graph convolution classification with deepchem

I previously posted about graph convolution regression using deepchem, and today I tried graph convolution classification with deepchem.
The code is almost the same as the regression model; the only difference is using dc.models.MultitaskGraphClassifier instead of dc.models.MultitaskGraphRegressor.
I got sample (JAK3 inhibitor) data from ChEMBL and tried to build a model.

At first I used pandas to convert the activity class (active, non-active) to a 0/1 label. It is easy to do.

import pandas as pd

df = pd.read_table('jak3_chembl.txt', header=0)
# pd.factorize returns (codes, uniques); keep the integer codes as the 0/1 class label
df['activity_class'] = pd.factorize( df.ACTIVITY_COMMENT )[0]

df.to_csv('./preprocessed_jak3.csv', index=False)
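
To double-check the mapping before featurization, the factorized column can be inspected. A quick sketch using the dataframe above:

# count how many rows fall into each class label
print( df['activity_class'].value_counts() )
# and confirm which activity comment maps to which label
print( df[['ACTIVITY_COMMENT', 'activity_class']].drop_duplicates() )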

Next, I wrote the model and tested it.

import tensorflow as tf
import deepchem as dc
import numpy as np
import pandas as pd

graph_featurizer = dc.feat.graph_features.ConvMolFeaturizer()
loader = dc.data.data_loader.CSVLoader( tasks=['activity_class'], smiles_field="CANONICAL_SMILES", id_field="CMPD_CHEMBLID", featurizer=graph_featurizer )
dataset = loader.featurize( './preprocessed_jak3.csv' )

splitter = dc.splits.splitters.RandomSplitter()
trainset,testset = splitter.train_test_split( dataset )

hp = dc.molnet.preset_hyper_parameters
param = hp.hps[ 'graphconv' ]
print(param['batch_size'])
g = tf.Graph()
graph_model = dc.nn.SequentialGraph( 75 )
graph_model.add( dc.nn.GraphConv( int(param['n_filters']), 75, activation='relu' ))
graph_model.add( dc.nn.BatchNormalization( epsilon=1e-5, mode=1 ))
graph_model.add( dc.nn.GraphPool() )
graph_model.add( dc.nn.GraphConv( int(param['n_filters']), int(param['n_filters']), activation='relu' ))
graph_model.add( dc.nn.BatchNormalization( epsilon=1e-5, mode=1 ))
graph_model.add( dc.nn.GraphPool() )
graph_model.add( dc.nn.Dense( int(param['n_fully_connected_nodes']), int(param['n_filters']), activation='relu' ))
graph_model.add( dc.nn.BatchNormalization( epsilon=1e-5, mode=1 ))
graph_model.add( dc.nn.GraphGather( 10 , activation='tanh'))

with tf.Session() as sess:
    model_graphconv = dc.models.MultitaskGraphClassifier( graph_model,
                                                      1,
                                                      75,
                                                     batch_size=10,
                                                     learning_rate = param['learning_rate'],
                                                     optimizer_type = 'adam',
                                                     beta1=.9,beta2=.999)
    model_graphconv.fit( trainset, nb_epoch=50 )

train_scores = {}
#regression_metric = dc.metrics.Metric( dc.metrics.pearson_r2_score, np.mean )
classification_metric = dc.metrics.Metric( dc.metrics.roc_auc_score, np.mean )
train_scores['graphconvreg'] = model_graphconv.evaluate( trainset,[ classification_metric ]  )
p=model_graphconv.predict( testset )

for i in range( len(p )):
    print( p[i], testset.y[i] )

print(train_scores) 
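
The script above only scores the training set. The held-out split can be scored with the same metric as well; a minimal sketch, assuming trainset, testset and classification_metric from the code above are in scope:

test_scores = {}
# evaluate the trained classifier on the held-out split with the same ROC-AUC metric
test_scores['graphconvreg'] = model_graphconv.evaluate( testset, [ classification_metric ] )
print( test_scores )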

root@08d8f729f78b:/deepchem/pen_test# python graphconv_jak3.py

And the output log is as follows.

Loading raw samples now.
shard_size: 8192
About to start loading CSV from ./preprocessed_jak3.csv
Loading shard 1 of size 8192.
Featurizing sample 0
TIMING: featurizing shard 0 took 2.023 s
TIMING: dataset construction took 3.830 s
Loading dataset from disk.
TIMING: dataset construction took 2.263 s
Loading dataset from disk.
TIMING: dataset construction took 1.147 s
Loading dataset from disk.
50
Training for 50 epochs
Starting epoch 0
On batch 0
...............
On batch 0
On batch 50
computed_metrics: [0.97176380945032259]
{'graphconvreg': {'mean-roc_auc_score': 0.97176380945032259}}

Not so bad.
The classification model gives a better result than the regression model.
All the code is pushed to my GitHub repository:
https://github.com/iwatobipen/deeplearning

Graph convolution regression with deepchem

Some days ago, I posted a blog entry about deepchem. I am still playing with it, and today I tried a graph convolution regression model.
Deepchem provides a graph convolution regressor. Cool.
I used solubility data provided by AstraZeneca. https://www.ebi.ac.uk/chembl/assay/inspect/CHEMBL3301364
My test code is below; it is almost the same as deepchem's example code.
The CSVLoader method is very useful because it not only reads the data but also calculates the graph features of each molecule.
Next, define the graph convolution network.

import tensorflow as tf
import deepchem as dc
import numpy as np
graph_featurizer = dc.feat.graph_features.ConvMolFeaturizer()
loader = dc.data.data_loader.CSVLoader( tasks=['LogS'], smiles_field="CANONICAL_SMILES", id_field="CMPD_CHEMBLID", featurizer=graph_featurizer )
dataset = loader.featurize( './bioactivity.csv' )

splitter = dc.splits.splitters.RandomSplitter()
trainset,testset = splitter.train_test_split( dataset )

hp = dc.molnet.preset_hyper_parameters
param = hp.hps[ 'graphconvreg' ]
print(param['batch_size'])
g = tf.Graph()
graph_model = dc.nn.SequentialGraph( 75 )
graph_model.add( dc.nn.GraphConv( int(param['n_filters']), 75, activation='relu' ))
graph_model.add( dc.nn.BatchNormalization( epsilon=1e-5, mode=1 ))
graph_model.add( dc.nn.GraphPool() )
graph_model.add( dc.nn.GraphConv( int(param['n_filters']), int(param['n_filters']), activation='relu' ))
graph_model.add( dc.nn.BatchNormalization( epsilon=1e-5, mode=1 ))
graph_model.add( dc.nn.GraphPool() )
graph_model.add( dc.nn.Dense( int(param['n_fully_connected_nodes']), int(param['n_filters']), activation='relu' ))
graph_model.add( dc.nn.BatchNormalization( epsilon=1e-5, mode=1 ))
#graph_model.add( dc.nn.GraphGather(param['batch_size'], activation='tanh'))
graph_model.add( dc.nn.GraphGather( 10 , activation='tanh'))

with tf.Session() as sess:
    model_graphconv = dc.models.MultitaskGraphRegressor( graph_model,
                                                      1,
                                                      75,
                                                     batch_size=10,
                                                     learning_rate = param['learning_rate'],
                                                     optimizer_type = 'adam',
                                                     beta1=.9,beta2=.999)
    model_graphconv.fit( trainset, nb_epoch=30 )

train_scores = {}
regression_metric = dc.metrics.Metric( dc.metrics.pearson_r2_score, np.mean )
train_scores['graphconvreg'] = model_graphconv.evaluate( trainset,[ regression_metric ]  )
p=model_graphconv.predict( testset )

print(train_scores) 

Next, run the code.

root@08d8f729f78b:/deepchem/pen_test# python graphconv_test.py > datalog

And the datalog file is as follows.

Loading raw samples now.
shard_size: 8192
About to start loading CSV from ./bioactivity.csv
Loading shard 1 of size 8192.
Featurizing sample 0
Featurizing sample 1000
...
Starting epoch 29
On batch 0
On batch 50
On batch 100
computed_metrics: [0.52744994044080606]
{'graphconvreg': {'mean-pearson_r2_score': 0.52744994044080606}}

The r2 score is still low, but I think it can be improved by changing nb_epoch.
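
Before tuning nb_epoch, it is also worth scoring the held-out split with the same metric and comparing it with the training R2. A minimal sketch, assuming the variables from the script above are still in scope:

test_scores = {}
# compare the held-out R2 with the training R2 to see how much the model overfits
test_scores['graphconvreg'] = model_graphconv.evaluate( testset, [ regression_metric ] )
print( test_scores )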

All the sample code was uploaded to GitHub.
https://github.com/iwatobipen/deeplearning/blob/master/datalog

integration of spotfire and pdb viewer

Some years ago, I heard a presentation at JCUP about an implementation of a PDB viewer in Spotfire. It was really impressive to me because Spotfire cannot handle PDB files on its own. You know, Spotfire is one of the popular tools for data visualization. I like the tool.

Recently I found a unique library for Spotfire named ‘JSViz’. It is not a native library, but users can get it from the community site. With JSViz, Spotfire can communicate with JS libraries such as D3.js, Highcharts.js, etc. 😉

Lots of examples are provided on the site.
I thought, “Hmm… if there is a PDB viewer written in JavaScript, I can implement a PDB viewer in Spotfire.”

So, I tried it.
Install JSViz first.
Then I wrote a PDB loader script based on the template. I used pv.js as the PDB viewer.
JSViz passes data from Spotfire to the visualization as sfdata, a JSON object (in the code below I use sfdata.columns, sfdata.data, and each row's items and hints.marked fields). If you need more details about the data structure, I recommend reading the original documentation (or leave a comment on this post).
My data format is as follows.
#, pdb_id, ligandname
1,1ATP, ANP

And I used pdb_id and ligandname as the sfdata columns.
My strategy is:
1. Build an external PDB supply server (a simple HTTP server written in Python).
2. Access the URL, get the PDB file from the server, and render it with JSViz.

The following code is the JSViz script. It renders the protein as a cartoon and the ligand as balls and sticks.

/*
 Copyright (c) 2016 TIBCO Software Inc

 THIS SOFTWARE IS PROVIDED BY TIBCO SOFTWARE INC. ''AS IS'' AND ANY EXPRESS OR
 IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
 MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT
 SHALL TIBCO SOFTWARE BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
 EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
 SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
 SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

//////////////////////////////////////////////////////////////////////////////
// #region Drawing Code

var pv = require("bio-pv");

//
//
// Main Drawing Method
//

function renderCore(sfdata)
{
    if (resizing) {
        return;
    }

    // Log entering renderCore
    log ( "Entering renderCore" );

    // Extract the columns
    var columns = sfdata.columns;
    // Extract the data array section
	var chartdata = sfdata.data;

    // count the marked rows in the data set, needed later for marking rendering logic
    var markedRows = 0;
    for (var i = 0; i < chartdata.length; i++) {
        if (chartdata[i].hints.marked) {
            markedRows = markedRows + 1;
        }
    }
    var width = window.innerWidth;
    var height = window.innerHeight;

    //
    // Replace the following code with actual Visualization code
    // This code just displays a summary of the data passed in to renderCore
    //
    //displayWelcomeMessage ( document.getElementById ( "viewer" ), sfdata );

    displaypdb(document.getElementById('js_chart'), chartdata);
    wait ( sfdata.wait, sfdata.static );
};

//
// #endregion Drawing Code
//////////////////////////////////////////////////////////////////////////////

//////////////////////////////////////////////////////////////////////////////
// #region Marking Code
//

//
// This method receives the marking mode and marking rectangle coordinates
// on mouse-up when drawing a marking rectangle
//
function markModel(markMode, rectangle)
{
	// Implementation of logic to call markIndices or markIndices2 goes here
}

//
// Legacy mark function 2014 HF2
//
function mark(event)
{
}

//
// #endregion Marking Code
//////////////////////////////////////////////////////////////////////////////

//////////////////////////////////////////////////////////////////////////////
// #region Resizing Code
//

var resizing = false;

window.onresize = function (event) {
    resizing = true;
    if ($("#js_chart")) {
    }
    resizing = false;
};

//
// #endregion Resizing Code
//////////////////////////////////////////////////////////////////////////////

//
// This is a sample visualization that indicates that JSViz is installed
// and configured correctly.  It is an example of how to draw standard
// HTML objects based on the data sent from JSViz.
//

function displaypdb( div, chartdata ){
	div.innerHTML = "<div id='viewer'>pdb</div>";
    var options = {
	  background: 'lightgrey',
      width: 800,
      height: 600,
      antialias: true,
       quality : 'medium'
       };
    // insert the viewer under the DOM element with id 'viewer'.
    var viewer = pv.Viewer(document.getElementById('viewer'), options);
	var pdb_id = ( chartdata[0].items[0] );
	var ligand_name = ( chartdata[0].items[1] );
	var url = 'http://localhost:9000/' + pdb_id + '.pdb';
	$.ajax( url )
    .done(function(data) {
    var structure = pv.io.pdb(data);
	var ligand = structure.select({rnames : [ ligand_name ]});
	viewer.ballsAndSticks('ligand', ligand);

	viewer.cartoon('protein', structure, { color : pv.color.ssSuccession() });
	viewer.centerOn(structure);
});
};

Next, the following code is a simple HTTP server. Place the PDB files to be served in the same folder as the script.

import os
import sys
import http.server
import socketserver
PORT = 9000
class HTTPRequestHandler(http.server.SimpleHTTPRequestHandler):
    def end_headers(self):
        self.send_header('Access-Control-Allow-Origin', '*')
        http.server.SimpleHTTPRequestHandler.end_headers(self)

def server(port):
    httpd = socketserver.TCPServer(('', port), HTTPRequestHandler)
    return httpd

if __name__ == "__main__":
    port = PORT
    httpd = server(port)
    try:
        httpd.serve_forever()
    except KeyboardInterrupt:
        print("\n...shutting down http server")
        httpd.shutdown()
        sys.exit()
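
To make sure the server and the CORS header work before wiring things into Spotfire, a quick check can be run from Python. A minimal sketch, assuming the server above is running and a file named 1atp.pdb sits in the same folder:

import urllib.request

# fetch a PDB file from the local server and show the CORS header it returns
with urllib.request.urlopen('http://localhost:9000/1atp.pdb') as res:
    print(res.status, res.headers.get('Access-Control-Allow-Origin'))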

This is a very brief introduction. With JSViz, you can also capture user events such as clicking on a ligand, a residue, etc.
It seems very interesting. But do I need to develop new visualizations in Spotfire? ;-p

ref
https://community.tibco.com/wiki/javascript-visualization-framework-jsviz-and-tibco-spotfire
https://biasmv.github.io/pv/

how to get molecular graph features

Belatedly, I have become interested in deepchem, an open-source deep learning toolkit for drug discovery. Deepchem supports many features for chemoinformatics.
One of the interesting features is the calculation of molecular graph features. They are more primitive than hashed fingerprints. I tried to calculate them.

Currently the toolkit supports only Linux, so I installed deepchem via docker.
The installation was very easy.

iwatobipen$ docker pull deepchemio/deepchem
# wait a moment.... 😉
iwatobipen$ docker run -i -t deepchemio/deepchem
iwatobipen$ pip install jupyter
# the following command is not necessary.
iwatobipen$ apt-get install vim

That’s all.
Next, I worked inside the docker environment.

import deepchem as dc
from deepchem.feat import graph_features
from rdkit import Chem
convmol=graph_features.ConvMolFeaturizer()
mol = Chem.MolFromSmiles('c1ccccc1')
# convmol needs list of molecules
fs = convmol.featurize( [mol] )
f = fs[ 0 ]
# check method
dir( f )
Out[41]:
[ .....
 'agglomerate_mols',
 'atom_features',
 'canon_adj_list',
 'deg_adj_lists',
 'deg_block_indices',
 'deg_id_list',
 'deg_list',
 'deg_slice',
 'deg_start',
 'get_adjacency_list',
 'get_atom_features',
 'get_atoms_with_deg',
 'get_deg_adjacency_lists',
 'get_deg_slice',
 'get_null_mol',
 'get_num_atoms',
 'get_num_atoms_with_deg',
 'max_deg',
 'membership',
 'min_deg',
 'n_atoms',
 'n_feat']

To get the atom features, use ‘get_atom_features’.
To get the edge information, use ‘get_adjacency_list’.

f.get_atom_features()
Out[42]:
array([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 1]])
f.get_adjacency_list()
Out[43]: [[1, 5], [0, 2], [1, 3], [2, 4], [3, 5], [4, 0]]

Each row of the atom feature array is a one-hot style vector encoding that the atom is a carbon with degree 2, SP2 hybridization, and aromaticity.
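
To see which bits are actually set, and to cross-check the adjacency list against the bonds RDKit reports, here is a small sketch using the objects from above:

import numpy as np

# indices of the hot bits in the 75-length feature vector of the first atom
print( np.nonzero( f.get_atom_features()[0] )[0] )   # -> [ 0 46 56 61 70 74 ]
# each hot index falls in one of the one-hot blocks defined in
# deepchem.feat.graph_features (atom type, degree, hybridization, aromaticity, ...)

# the adjacency list should match the bonds RDKit reports for benzene
for bond in mol.GetBonds():
    print( bond.GetBeginAtomIdx(), bond.GetEndAtomIdx() )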

As a next step, I will try to build a model using these molecular graph features.

What is my job, anyway?

I keep asking myself this, as always.
These days, synthesis resources can be secured outside the company as well as inside. To put it in extreme terms, I sometimes feel that CMOs can devote even more time to synthesis because they have fewer miscellaneous tasks to deal with.
Of course budget is a factor, but I think these are times that increasingly demand a calm comparison of the merits of doing the work in-house versus outsourcing it.
In my university days I was completely absorbed in the total synthesis of natural products, and after joining the company I kept enjoying bench work, but now I have hardly any time (practically none, in fact) to stand at the bench.
That said, it is not as if I am spending a great deal of time on molecular design either.
Drug discovery is an all-around competition.
I really need to think about what my own role should be…
http://onlinelibrary.wiley.com/doi/10.1002/anie.201210006/abstract

Compile LeapPython and Manipulate protein structure with Leap Motion.

I have been interested in the Kinect as an input device, because it can detect motion and use it as input, which I find a fascinating feature. 😉
And some days ago, I got a new input device, the Leap Motion.
The Leap Motion controller is a small USB peripheral device designed to be placed on a physical desktop, facing upward (quoted from Wikipedia).
I want to use this device with PyMOL!
Let's try it.
First, to use the Leap Motion from Python, you need the Leap Motion SDK. My environment is OSX, so I got SDK v2.3 from the developer site.
https://developer.leapmotion.com/sdk/v2
The installation was a little bit complicated for me, because I wanted to use the LeapPython library from a virtual environment, but the library is linked against the native Python.
So, I rebuilt it or changed the dynamic link with the following procedures.

For Python 2.7, I changed the link with the following commands.

iwatobipen$ cd /somepath/LeapSDK/lib
iwatobipen$ otool -L LeapPython.so
LeapPython.so:
	/Library/Frameworks/Python.framework/Versions/2.7/Python (compatibility version 2.7.0, current version 2.7.0)
	@loader_path/libLeap.dylib (compatibility version 0.7.0, current version 2.3.1)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1197.1.1)
	/usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 120.0.0)
# change Python link from native to anaconda
iwatobipen$ install_name_tool -change /Library/Frameworks/Python.framework/Versions/2.7/Python /Users/iwatobipen/.pyenv/versions/anaconda-2.4.0/lib/libpython2.7.dylib LeapPython.so

For Python 3.5, I compiled LeapPython.so using SWIG.
The newest SWIG caused an error, so I used SWIG version 3.0.3.
I referred to the following URL: https://support.leapmotion.com/hc/en-us/articles/223784048-Generating-a-Python-3-3-0-Wrapper-with-SWIG-2-0-9
OK, let's go.

# Copy Leap.h, LeapMath.h, Leap.i, and libLeap.dylib into one folder, then run the following commands.
# Generate LeapPython.cpp with swig
iwatobipen$ swig -c++ -python -o LeapPython.cpp -interface LeapPython Leap.i
# Compile and link 
clang++ -arch x86_64 -I/Users/iwatobipen/.pyenv/versions/anaconda-2.4.0/include/python2.7 LeapPython.cpp libLeap.dylib /Users/iwatobipen/.pyenv/versions/anaconda-2.4.0/lib/libpython2.7.dylib -shared -o LeapPython.so

Now I have LeapPython.so linked against each Python version in anaconda.

Unfortunately, I could not install PyMOL for Python 3.5, so the following code was run in Python 2.7. I installed PyMOL using the conda install command.
The PyMOL Wiki provides sample code for using the Leap Motion from PyMOL. I used that example and added the path where LeapPython.so is placed.
test.py

import sys
import math
from pymol import cmd
sys.path.insert( 0, '/Users/iwatobipen/develop/py2env/LeapSDK/build/')

import Leap
from Leap import Matrix, Vector

class PymolListener(Leap.Listener):
    def __init__(self, *args, **kwargs):
        super(PymolListener, self).__init__(*args, **kwargs)

        self.prev_frame = None

        self.controller = Leap.Controller()
        self.controller.add_listener(self)

    def __del__(self):
        self.controller.remove_listener(self)

        super(PymolListener, self).__del__()

    def update_view(self, frame):
        if not self.prev_frame:
            return

        view = list(cmd.get_view())

        if frame.rotation_probability(self.prev_frame) > 0.1:
            m = frame.rotation_matrix(self.prev_frame)
            m *= Matrix(Vector(*view[0:3]),
                        Vector(*view[3:6]),
                        Vector(*view[6:9]))
            view[:9] = m.to_array_3x3()

        if frame.scale_probability(self.prev_frame) > 0.1:
            s = frame.scale_factor(self.prev_frame)
            delta_z = math.log(s) * 100.0
            view[11] += delta_z
            view[15] -= delta_z
            view[16] -= delta_z

        cmd.set_view(view)

    def on_frame(self, controller):
        frame = controller.frame()

        self.update_view(frame)

        self.prev_frame = frame

listener = PymolListener()

Then I ran PyMOL and fetched a sample PDB.

iwatobipen$ pymol
# from the pymol console.
pymol> fetch 1atp
pymol> run test.py

The result: it seems to work well. Now I am reading the API reference to write more code.

Target prediction using local ChEMBL

Yesterday, I posted about target prediction using the ChEMBL web API.
If I want to predict many molecules, it takes a lot of time, so I changed the code to use a local ChEMBL DB.
I used SQLAlchemy, because the library is powerful and flexible enough to work with any RDB.
The test code is below. It takes a SMILES string as input and returns the top 10 predicted targets.
I think Python is a very powerful language. I can do chemical structure handling, RDB searching, data merging, and so on, using only Python!

from sqlalchemy import create_engine, MetaData
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from sklearn.externals import joblib

import pandas as pd
import numpy as np
import sys

smiles = sys.argv[ 1 ]
morgan_nb = joblib.load( 'models_22/10uM/mNB_10uM_all.pkl' )
classes = list( morgan_nb.targets )

mol = Chem.MolFromSmiles( smiles )
fp = AllChem.GetMorganFingerprintAsBitVect( mol, 2, nBits = 2048 )
res = np.zeros( len(fp), np.int32 )
DataStructs.ConvertToNumpyArray( fp, res )

probas = list( morgan_nb.predict_proba( res.reshape(1,-1))[0] )
predictions = pd.DataFrame(  list(zip(classes, probas)), columns=[ 'id', 'proba' ])

top10_pred = predictions.sort_values( by = 'proba', ascending = False ).head( 10 )
db = create_engine( 'postgresql+psycopg2://<username>:<password>@localhost/chembl_22' )
conn = db.connect()

def getprefname( chemblid ):
    res = conn.execute( "select chembl_id, pref_name,organism from target_dictionary where chembl_id='{0}'".format( chemblid ))
    res = res.fetchall()
    return res[0]

plist = []
for i, e in enumerate( top10_pred['id'] ):
    plist.append( list(getprefname(e)) )
conn.close()
target_info = pd.DataFrame( plist, columns = ['id', 'name', 'organism'] )
summary_df = pd.merge( top10_pred, target_info, on='id')

print( summary_df )
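
As a side note, the raw string formatting in getprefname could be replaced with a bound parameter. A small sketch using SQLAlchemy's text construct, assuming the same conn as above:

from sqlalchemy import text

def getprefname( chemblid ):
    # bound parameter instead of manual string interpolation in the SQL
    q = text( "select chembl_id, pref_name, organism from target_dictionary where chembl_id = :cid" )
    return conn.execute( q, {"cid": chemblid} ).fetchall()[0]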

OK, let's check the performance from the shell.

First, tofacitinib (Tofa).

iwatobipen$ python targetprediction.py 'CC1CCN(CC1N(C)C2=NC=NC3=C2C=CN3)C(=O)CC#N'
           id     proba                                   name      organism
0  CHEMBL2835  1.000000           Tyrosine-protein kinase JAK1  Homo sapiens
1  CHEMBL2148  1.000000           Tyrosine-protein kinase JAK3  Homo sapiens
2  CHEMBL2971  1.000000           Tyrosine-protein kinase JAK2  Homo sapiens
3  CHEMBL5073  1.000000                     CaM kinase I delta  Homo sapiens
4  CHEMBL3553  0.999986           Tyrosine-protein kinase TYK2  Homo sapiens
5  CHEMBL4147  0.999966                    CaM kinase II alpha  Homo sapiens
6  CHEMBL4924  0.999896    Ribosomal protein S6 kinase alpha 6  Homo sapiens
7  CHEMBL5698  0.999871         NUAK family SNF1-like kinase 2  Homo sapiens
8  CHEMBL3032  0.999684                      Protein kinase N2  Homo sapiens
9  CHEMBL5683  0.999640  Serine/threonine-protein kinase DCLK1  Homo sapiens

Next, imatinib.

iwatobipen$ python targetprediction.py 'CN1CCN(CC2=CC=C(C=C2)C(=O)NC2=CC(NC3=NC=CC(=N3)C3=CN=CC=C3)=C(C)C=C2)CC1'
           id     proba                                               name  \
0  CHEMBL1862  1.000000                        Tyrosine-protein kinase ABL   
1  CHEMBL5145  1.000000              Serine/threonine-protein kinase B-raf   
2  CHEMBL1936  1.000000                   Stem cell growth factor receptor   
3  CHEMBL2007  1.000000      Platelet-derived growth factor receptor alpha   
4  CHEMBL5122  1.000000             Discoidin domain-containing receptor 2   
5  CHEMBL1974  0.999999              Tyrosine-protein kinase receptor FLT3   
6  CHEMBL3905  0.999999                        Tyrosine-protein kinase Lyn   
7  CHEMBL4722  0.999994           Serine/threonine-protein kinase Aurora-A   
8   CHEMBL279  0.999991      Vascular endothelial growth factor receptor 2   
9  CHEMBL5319  0.999988  Epithelial discoidin domain-containing receptor 1   

       organism  
0  Homo sapiens  
1  Homo sapiens  
2  Homo sapiens  
3  Homo sapiens  
4  Homo sapiens  
5  Homo sapiens  
6  Homo sapiens  
7  Homo sapiens  
8  Homo sapiens  
9  Homo sapiens  

Known molecules are predicted with high accuracy. Next I want to try unknown molecules. 😉