Target prediction using ChEMBL

Several publicly available databases exist in the chemoinformatics field, and ChEMBL is one of the most useful.
George Papadatos introduced a handy tool for target prediction built on ChEMBL, and he provides the ChEMBL target prediction models via an FTP server!
So anyone can use the models.
I downloaded a model and tried target prediction with it.
First, I got the models from the FTP server and launched Jupyter Notebook. 😉

iwatobipen$ wget wget
iwatobipen$ tar -vzxf chembl_22_models.tar.gz
iwatobipen$ jupyter notebook

Everything is ready! Let's go ahead.

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import PandasTools
from rdkit import DataStructs
import pandas as pd
from pandas import concat
from collections import OrderedDict
import requests
import numpy
from sklearn.externals import joblib
from rdkit import rdBase
print( rdBase.rdkitVersion )

I ran this in a Python 3.5 environment.

morgan_nb = joblib.load( 'models_22/10uM/mNB_10uM_all.pkl' )
classes = list( morgan_nb.targets )
len( classes )
> 1524 # model has 1524 targets ( classes )

I used sitagliptin as the input molecule.

smiles = 'C1CN2C(=NN=C2C(F)(F)F)CN1C(=O)C[C@@H](CC3=CC(=C(C=C3F)F)F)N'
mol = Chem.MolFromSmiles( smiles )

Next, calculate the Morgan fingerprint and convert it to a numpy array.

fp = AllChem.GetMorganFingerprintAsBitVect( mol, 2, nBits=2048 )
res = numpy.zeros( len(fp), numpy.int32 )
DataStructs.ConvertToNumpyArray( fp, res )
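
To make the conversion step a bit more concrete, here is a minimal sketch in plain Python of what a dense 0/1 vector looks like when it is built from a set of on-bit positions. The function name `on_bits_to_dense` and the example bit positions are hypothetical, and this is not how RDKit implements `ConvertToNumpyArray`; it only illustrates the shape of the data that is passed to the model.

```python
def on_bits_to_dense(on_bits, n_bits=2048):
    """Return a list of 0/1 ints of length n_bits with 1 at each index in on_bits."""
    dense = [0] * n_bits
    for idx in on_bits:
        if not 0 <= idx < n_bits:
            raise ValueError("bit index {0} out of range".format(idx))
        dense[idx] = 1
    return dense

# Hypothetical on-bit positions, for illustration only (not a real fingerprint)
example = on_bits_to_dense({1, 80, 2047})
print(len(example), sum(example))
```

The vector length matches the `nBits=2048` used above, so each molecule becomes one fixed-length row regardless of its size.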

Predict the targets and sort the results by probability.

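probas = list( morgan_nb.predict_proba( res.reshape(1,-1))[0] )
# Pair each target ID with its predicted probability
predictions = pd.DataFrame( list( zip( classes, probas ) ), columns=['id', 'proba'] )
top_pred = predictions.sort_values( by='proba', ascending = False).head(10)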
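
The ranking step itself does not depend on pandas. As a rough sketch, the same "pair each class with its probability, then keep the best ten" idea can be written with the standard library. The helper name `top_k`, the placeholder labels, and the probabilities below are made up for illustration and are not output from the ChEMBL model.

```python
import heapq

def top_k(classes, probas, k=10):
    """Return the k (class, probability) pairs with the highest probability, best first."""
    if len(classes) != len(probas):
        raise ValueError("classes and probas must have the same length")
    return heapq.nlargest(k, zip(classes, probas), key=lambda pair: pair[1])

# Placeholder labels and made-up probabilities, not real model output
demo_classes = ['T1', 'T2', 'T3', 'T4']
demo_probas = [0.10, 0.70, 0.05, 0.15]
demo_top = top_k(demo_classes, demo_probas, k=2)
print(demo_top)
```

For a single molecule and about 1500 classes either approach is fast; the DataFrame version is simply more convenient when the result is merged with other tables afterwards.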

Jupyter Notebook version 5 changed the table view!

Then convert the ChEMBL IDs to target names.

def fetch_WS( trgt ):
    # Look up the preferred name and organism for one ChEMBL target ID
    resp = requests.get( '{0}.json'.format(trgt) )
    data = resp.json()  # parse the response body once
    return ( trgt, data['pref_name'], data['organism'] )
plist = []
for e in top_pred['id']:
    plist.append( fetch_WS(e) )
target_info = pd.DataFrame( plist, columns=['id', 'name', 'organism'])
# Join predictions and target info on the shared 'id' column
pd.merge( top_pred, target_info )
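
When predicting for many molecules, the same target ID is likely to come up repeatedly, and a web lookup can fail. One possible way to handle both is sketched below: it calls a lookup function once per distinct ID and skips IDs whose lookup raises. The names `fetch_all` and `fake_fetch` are hypothetical, and `fake_fetch` is only a stand-in that returns dummy values so the example runs without network access; in the notebook, `fetch_WS` would be passed instead.

```python
def fetch_all(ids, fetch_one):
    """Call fetch_one once per distinct ID; return rows in input order, skipping failures."""
    cache = {}
    rows = []
    for target_id in ids:
        if target_id not in cache:
            try:
                cache[target_id] = fetch_one(target_id)
            except Exception:
                cache[target_id] = None  # remember the failure, do not retry
        if cache[target_id] is not None:
            rows.append(cache[target_id])
    return rows

def fake_fetch(target_id):
    """Stand-in for a web service call; returns dummy values, not real ChEMBL data."""
    if target_id == 'BAD':
        raise RuntimeError("lookup failed")
    return (target_id, 'name of ' + target_id, 'unknown organism')

rows = fetch_all(['A', 'BAD', 'A', 'B'], fake_fetch)
print(rows)
```

The returned rows have the same `(id, name, organism)` shape as `plist`, so they could be passed to `pd.DataFrame` in the same way.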

The model predicted that sitagliptin is a DPP4 modulator! I think this work is interesting. I will try predicting other molecules and integrating a local ChEMBL DB to improve performance.

The original source code is at the following URL. Thanks for the useful information!

