Target prediction using local ChEMBL

Yesterday, I posted about target prediction using the ChEMBL web API.
Predicting many molecules that way takes a long time, so I changed the code to use a local ChEMBL database instead.
I used SQLAlchemy because the library is powerful and flexible enough to work with any RDB.
The test code is below. The sample script takes a SMILES string as input and returns the top 10 predicted targets.
I think Python is a very powerful language. I can handle chemical structures, search an RDB, merge data, and so on, all using only Python!

from sqlalchemy import create_engine, text
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
# sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone joblib package
import joblib

import pandas as pd
import numpy as np
import sys

smiles = sys.argv[ 1 ]
# Load the pickled target prediction model
morgan_nb = joblib.load( 'models_22/10uM/mNB_10uM_all.pkl' )
classes = list( morgan_nb.targets )

# Convert the input SMILES to a 2048-bit Morgan fingerprint (radius 2) as a numpy array
mol = Chem.MolFromSmiles( smiles )
if mol is None:
    sys.exit( 'Could not parse SMILES: {0}'.format( smiles ) )
fp = AllChem.GetMorganFingerprintAsBitVect( mol, 2, nBits = 2048 )
res = np.zeros( len(fp), np.int32 )
DataStructs.ConvertToNumpyArray( fp, res )

# Predict a probability for every target and keep the 10 highest
probas = list( morgan_nb.predict_proba( res.reshape(1,-1))[0] )
predictions = pd.DataFrame(  list(zip(classes, probas)), columns=[ 'id', 'proba' ])

top10_pred = predictions.sort_values( by = 'proba', ascending = False ).head( 10 )
# Recent SQLAlchemy versions only accept the 'postgresql' dialect name, not 'postgres'
db = create_engine( 'postgresql+psycopg2://<username>:<password>@localhost/chembl_22' )
conn = db.connect()

def getprefname( chemblid ):
    # Bind the ID as a parameter instead of formatting it into the SQL string
    query = text( "select chembl_id, pref_name, organism from target_dictionary where chembl_id = :chembl_id" )
    res = conn.execute( query, { 'chembl_id': chemblid } )
    res = res.fetchall()
    return res[0]

# Look up the name and organism of each predicted target in the local ChEMBL DB
plist = []
for e in top10_pred['id']:
    plist.append( list(getprefname(e)) )
target_info = pd.DataFrame( plist, columns = ['id', 'name', 'organism'] )
summary_df = pd.merge( top10_pred, target_info, on='id')

print( summary_df )
conn.close()
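
The loop above sends one query per predicted target. As a possible refinement, all IDs could be looked up in a single query with an IN clause. The sketch below only illustrates that idea: it uses Python's built-in sqlite3 with a tiny in-memory stand-in for target_dictionary (three rows taken from the results in this post) rather than the real PostgreSQL ChEMBL database, and the helper name fetch_target_info and the ID CHEMBL0000 are hypothetical.

```python
import sqlite3

def fetch_target_info(conn, chembl_ids):
    """Return {chembl_id: (pref_name, organism)} using one query for all IDs.

    Hypothetical helper for illustration; IDs missing from the table are
    simply absent from the returned dict.
    """
    chembl_ids = list(chembl_ids)
    if not chembl_ids:
        return {}
    # One "?" placeholder per ID, so the values are bound, not string-formatted
    placeholders = ", ".join("?" for _ in chembl_ids)
    sql = (
        "SELECT chembl_id, pref_name, organism "
        "FROM target_dictionary "
        "WHERE chembl_id IN ({0})".format(placeholders)
    )
    rows = conn.execute(sql, chembl_ids).fetchall()
    return {cid: (name, organism) for cid, name, organism in rows}

# Toy stand-in for the ChEMBL target_dictionary table (assumption: only
# the three columns used in this post are modeled)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE target_dictionary ("
    "chembl_id TEXT PRIMARY KEY, pref_name TEXT, organism TEXT)"
)
conn.executemany(
    "INSERT INTO target_dictionary VALUES (?, ?, ?)",
    [
        ("CHEMBL2835", "Tyrosine-protein kinase JAK1", "Homo sapiens"),
        ("CHEMBL2148", "Tyrosine-protein kinase JAK3", "Homo sapiens"),
        ("CHEMBL2971", "Tyrosine-protein kinase JAK2", "Homo sapiens"),
    ],
)

# CHEMBL0000 is a made-up ID to show that unknown IDs are skipped
info = fetch_target_info(conn, ["CHEMBL2835", "CHEMBL2971", "CHEMBL0000"])
for cid in sorted(info):
    print(cid, *info[cid])
```

The returned dict can be turned into a DataFrame and merged with top10_pred in the same way as plist above. The same pattern should carry over to the SQLAlchemy connection, although the placeholder syntax differs there.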

OK, let's check the performance.
I ran the script from the shell.


iwatobipen$ python 'CC1CCN(CC1N(C)C2=NC=NC3=C2C=CN3)C(=O)CC#N'
           id     proba                                   name      organism
0  CHEMBL2835  1.000000           Tyrosine-protein kinase JAK1  Homo sapiens
1  CHEMBL2148  1.000000           Tyrosine-protein kinase JAK3  Homo sapiens
2  CHEMBL2971  1.000000           Tyrosine-protein kinase JAK2  Homo sapiens
3  CHEMBL5073  1.000000                     CaM kinase I delta  Homo sapiens
4  CHEMBL3553  0.999986           Tyrosine-protein kinase TYK2  Homo sapiens
5  CHEMBL4147  0.999966                    CaM kinase II alpha  Homo sapiens
6  CHEMBL4924  0.999896    Ribosomal protein S6 kinase alpha 6  Homo sapiens
7  CHEMBL5698  0.999871         NUAK family SNF1-like kinase 2  Homo sapiens
8  CHEMBL3032  0.999684                      Protein kinase N2  Homo sapiens
9  CHEMBL5683  0.999640  Serine/threonine-protein kinase DCLK1  Homo sapiens


iwatobipen$ python 'CN1CCN(CC2=CC=C(C=C2)C(=O)NC2=CC(NC3=NC=CC(=N3)C3=CN=CC=C3)=C(C)C=C2)CC1'
           id     proba                                               name  \
0  CHEMBL1862  1.000000                        Tyrosine-protein kinase ABL   
1  CHEMBL5145  1.000000              Serine/threonine-protein kinase B-raf   
2  CHEMBL1936  1.000000                   Stem cell growth factor receptor   
3  CHEMBL2007  1.000000      Platelet-derived growth factor receptor alpha   
4  CHEMBL5122  1.000000             Discoidin domain-containing receptor 2   
5  CHEMBL1974  0.999999              Tyrosine-protein kinase receptor FLT3   
6  CHEMBL3905  0.999999                        Tyrosine-protein kinase Lyn   
7  CHEMBL4722  0.999994           Serine/threonine-protein kinase Aurora-A   
8   CHEMBL279  0.999991      Vascular endothelial growth factor receptor 2   
9  CHEMBL5319  0.999988  Epithelial discoidin domain-containing receptor 1   

0  Homo sapiens  
1  Homo sapiens  
2  Homo sapiens  
3  Homo sapiens  
4  Homo sapiens  
5  Homo sapiens  
6  Homo sapiens  
7  Homo sapiens  
8  Homo sapiens  
9  Homo sapiens  
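
The second table is split into two parts because pandas wraps wide frames to fit the terminal width. One way to keep every column on a single row is to print the result of DataFrame.to_string(), which does not wrap by default. A minimal sketch, using a two-row frame built from values in the output above (the variable names are only for illustration):

```python
import pandas as pd

# Two rows copied from the wrapped output above
df = pd.DataFrame({
    "id": ["CHEMBL1862", "CHEMBL5319"],
    "proba": [1.000000, 0.999988],
    "name": [
        "Tyrosine-protein kinase ABL",
        "Epithelial discoidin domain-containing receptor 1",
    ],
    "organism": ["Homo sapiens", "Homo sapiens"],
})

# to_string() renders the whole frame without wrapping columns
table_text = df.to_string()
print(table_text)
```

In the script itself this would mean replacing print( summary_df ) with print( summary_df.to_string() ). Setting the pandas option display.expand_frame_repr to False is another way to get a similar effect.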

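Known molecules are predicted with high accuracy: the first query (tofacitinib) returns the JAK family at the top, and the second (imatinib) returns ABL, KIT (stem cell growth factor receptor) and PDGFR alpha among its top hits. Next I want to try unknown molecules. 😉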


