Yesterday, I posted about target prediction using the ChEMBL web API.
Predicting many molecules that way would take a long time. So I changed the code to use a local ChEMBL database.
I used SQLAlchemy because the library is powerful and flexible enough to work with any RDB.
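To illustrate the "any RDB" point, here is a minimal sketch that runs the same kind of lookup against an in-memory SQLite database instead of PostgreSQL. The table is a tiny stand-in for ChEMBL's target_dictionary, filled with two rows taken from the results later in this post, and the helper name get_pref_name is my own for this example. Only the URL passed to create_engine would need to change to point at a real local ChEMBL database.

```python
from sqlalchemy import create_engine, text

# In-memory SQLite stands in for the local ChEMBL PostgreSQL database (assumption
# for this sketch; the real script connects to chembl_22 on localhost).
engine = create_engine("sqlite://")


def get_pref_name(conn, chembl_id):
    """Return (chembl_id, pref_name, organism) for one target, using a bound parameter."""
    query = text(
        "SELECT chembl_id, pref_name, organism "
        "FROM target_dictionary WHERE chembl_id = :cid"
    )
    return tuple(conn.execute(query, {"cid": chembl_id}).fetchone())


with engine.connect() as conn:
    # Toy table with the same three columns the script reads.
    conn.execute(text(
        "CREATE TABLE target_dictionary "
        "(chembl_id TEXT, pref_name TEXT, organism TEXT)"
    ))
    conn.execute(
        text("INSERT INTO target_dictionary VALUES (:cid, :name, :org)"),
        [
            {"cid": "CHEMBL2835", "name": "Tyrosine-protein kinase JAK1", "org": "Homo sapiens"},
            {"cid": "CHEMBL1862", "name": "Tyrosine-protein kinase ABL", "org": "Homo sapiens"},
        ],
    )
    result = get_pref_name(conn, "CHEMBL2835")

print(result)
```

The bound parameter (:cid) lets the database driver handle quoting, which is safer than building the SQL string by hand.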
The test code follows. The script takes a SMILES string as input and returns the top 10 predicted targets.
I think Python is a very powerful language. I can handle chemical structures, search an RDB, merge data, and so on, using only Python!
from sqlalchemy import create_engine, text
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
# On recent scikit-learn versions, use "import joblib" instead.
from sklearn.externals import joblib
import pandas as pd
import numpy as np
import sys

# SMILES string of the query molecule, given on the command line
smiles = sys.argv[1]

# Load the pre-trained model and the list of target ChEMBL IDs it predicts
morgan_nb = joblib.load('models_22/10uM/mNB_10uM_all.pkl')
classes = list(morgan_nb.targets)

# Morgan fingerprint (radius 2, 2048 bits) converted to a numpy array
mol = Chem.MolFromSmiles(smiles)
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
res = np.zeros(len(fp), np.int32)
DataStructs.ConvertToNumpyArray(fp, res)

# Predict a probability for every target and keep the 10 highest
probas = list(morgan_nb.predict_proba(res.reshape(1, -1))[0])
predictions = pd.DataFrame(list(zip(classes, probas)), columns=['id', 'proba'])
top10_pred = predictions.sort_values(by='proba', ascending=False).head(10)

# Connect to the local ChEMBL database
db = create_engine('postgresql+psycopg2://<username>:<password>@localhost/chembl_22')
conn = db.connect()

def getprefname(chemblid):
    # Use a bound parameter instead of formatting the ID into the SQL string
    query = text("select chembl_id, pref_name, organism "
                 "from target_dictionary where chembl_id = :chemblid")
    res = conn.execute(query, {'chemblid': chemblid})
    res = res.fetchall()
    return res[0]

# Look up name and organism for each predicted target
plist = []
for e in top10_pred['id']:
    plist.append(list(getprefname(e)))
conn.close()

# Merge the predictions with the target annotations
target_info = pd.DataFrame(plist, columns=['id', 'name', 'organism'])
summary_df = pd.merge(top10_pred, target_info, on='id')
print(summary_df)
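The ranking and merge steps of the script can be tried without the model or the database. The following is a minimal sketch under stated assumptions: the probabilities are made-up numbers, not model output, the target rows are copied from the results in this post, and the function name summarize is only for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for morgan_nb.targets and the predict_proba output.
classes = ["CHEMBL2835", "CHEMBL2148", "CHEMBL1862"]
probas = [0.9, 0.2, 0.7]

# Rows shaped like the results of the target_dictionary lookup.
target_rows = [
    ["CHEMBL2835", "Tyrosine-protein kinase JAK1", "Homo sapiens"],
    ["CHEMBL2148", "Tyrosine-protein kinase JAK3", "Homo sapiens"],
    ["CHEMBL1862", "Tyrosine-protein kinase ABL", "Homo sapiens"],
]


def summarize(classes, probas, target_rows, n=2):
    """Rank targets by probability, keep the top n, and attach name and organism."""
    predictions = pd.DataFrame(list(zip(classes, probas)), columns=["id", "proba"])
    top = predictions.sort_values(by="proba", ascending=False).head(n)
    target_info = pd.DataFrame(target_rows, columns=["id", "name", "organism"])
    # how="left" keeps the rows in the probability ranking order of `top`.
    return pd.merge(top, target_info, on="id", how="left")


summary_df = summarize(classes, probas, target_rows)
print(summary_df)
```

With these toy values the summary keeps CHEMBL2835 and CHEMBL1862, in that order, and drops the low-probability CHEMBL2148.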
OK, let's check how it performs.
I ran the script from the shell.
Tofacitinib:
iwatobipen$ python targetprediction.py 'CC1CCN(CC1N(C)C2=NC=NC3=C2C=CN3)C(=O)CC#N'
           id     proba                                   name      organism
0  CHEMBL2835  1.000000           Tyrosine-protein kinase JAK1  Homo sapiens
1  CHEMBL2148  1.000000           Tyrosine-protein kinase JAK3  Homo sapiens
2  CHEMBL2971  1.000000           Tyrosine-protein kinase JAK2  Homo sapiens
3  CHEMBL5073  1.000000                     CaM kinase I delta  Homo sapiens
4  CHEMBL3553  0.999986           Tyrosine-protein kinase TYK2  Homo sapiens
5  CHEMBL4147  0.999966                    CaM kinase II alpha  Homo sapiens
6  CHEMBL4924  0.999896    Ribosomal protein S6 kinase alpha 6  Homo sapiens
7  CHEMBL5698  0.999871         NUAK family SNF1-like kinase 2  Homo sapiens
8  CHEMBL3032  0.999684                      Protein kinase N2  Homo sapiens
9  CHEMBL5683  0.999640  Serine/threonine-protein kinase DCLK1  Homo sapiens
Imatinib:
iwatobipen$ python targetprediction.py 'CN1CCN(CC2=CC=C(C=C2)C(=O)NC2=CC(NC3=NC=CC(=N3)C3=CN=CC=C3)=C(C)C=C2)CC1'
           id     proba                                               name  \
0  CHEMBL1862  1.000000                        Tyrosine-protein kinase ABL
1  CHEMBL5145  1.000000              Serine/threonine-protein kinase B-raf
2  CHEMBL1936  1.000000                   Stem cell growth factor receptor
3  CHEMBL2007  1.000000      Platelet-derived growth factor receptor alpha
4  CHEMBL5122  1.000000             Discoidin domain-containing receptor 2
5  CHEMBL1974  0.999999              Tyrosine-protein kinase receptor FLT3
6  CHEMBL3905  0.999999                        Tyrosine-protein kinase Lyn
7  CHEMBL4722  0.999994           Serine/threonine-protein kinase Aurora-A
8   CHEMBL279  0.999991      Vascular endothelial growth factor receptor 2
9  CHEMBL5319  0.999988  Epithelial discoidin domain-containing receptor 1

       organism
0  Homo sapiens
1  Homo sapiens
2  Homo sapiens
3  Homo sapiens
4  Homo sapiens
5  Homo sapiens
6  Homo sapiens
7  Homo sapiens
8  Homo sapiens
9  Homo sapiens
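For both known drugs, the well-known targets come out on top with probabilities close to 1: the JAK family for the first molecule and ABL for imatinib. Next I want to try unknown molecules. ;-)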