Target prediction using ChEMBL

There are several publicly available databases in the chemoinformatics area.
ChEMBL is one of the useful ones. George Papadatos introduced a useful tool for target prediction using ChEMBL, and he provides the ChEMBL target prediction models via an FTP server!
So everyone can use the models.
I used one of them to try target prediction.
First, I got the models from the FTP server and launched Jupyter Notebook. ;-)

iwatobipen$ wget wget
iwatobipen$ tar -vzxf chembl_22_models.tar.gz
iwatobipen$ jupyter notebook

Now it is ready! Let's go ahead.

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import PandasTools
from rdkit import DataStructs
import pandas as pd
from pandas import concat
from collections import OrderedDict
import requests
import numpy
from sklearn.externals import joblib
from rdkit import rdBase
print( rdBase.rdkitVersion )

I ran this in a Python 3.5 environment.

morgan_nb = joblib.load( 'models_22/10uM/mNB_10uM_all.pkl' )
classes = list( morgan_nb.targets )
len( classes )
> 1524 # model has 1524 targets ( classes )

I used sitagliptin as the input molecule.

smiles = 'C1CN2C(=NN=C2C(F)(F)F)CN1C(=O)C[C@@H](CC3=CC(=C(C=C3F)F)F)N'
mol = Chem.MolFromSmiles( smiles )

Next, calculate the Morgan fingerprint and convert it to a numpy array.

fp = AllChem.GetMorganFingerprintAsBitVect( mol, 2, nBits=2048 )
res = numpy.zeros( len(fp), numpy.int32 )
DataStructs.ConvertToNumpyArray( fp, res )
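
To make the conversion step concrete, here is a minimal sketch in plain Python of the kind of dense 0/1 vector that the model receives. The function name `bits_to_dense` and the on-bit positions are made up for illustration; in the notebook the positions come from the RDKit fingerprint.

```python
def bits_to_dense(on_bits, n_bits=2048):
    """Return a list of n_bits zeros with ones at the given positions."""
    dense = [0] * n_bits
    for idx in on_bits:
        if not 0 <= idx < n_bits:
            raise ValueError("bit index {0} is outside 0..{1}".format(idx, n_bits - 1))
        dense[idx] = 1
    return dense


# Hypothetical on-bit positions, not taken from a real molecule.
example_bits = [1, 80, 1057]
vector = bits_to_dense(example_bits)
print(len(vector), sum(vector))
```

The vector length matches the `nBits` value used for the fingerprint, so every molecule maps to an input of the same size.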

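Predict the targets, pair each class with its probability, and sort the result by probability.

probas = list( morgan_nb.predict_proba( res.reshape(1,-1))[0] )
predictions = pd.DataFrame( list( zip( classes, probas ) ), columns=['id', 'proba'] )
top_pred = predictions.sort_values( by='proba', ascending = False).head(10)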

Jupyter Notebook version 5 changed the table view!
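
The ranking itself is a small operation. As a rough sketch, the same top-N selection can also be written in plain Python without pandas; `top_n_targets` is a hypothetical helper, and the labels and probabilities below are placeholders rather than real model output.

```python
def top_n_targets(classes, probas, n=10):
    """Pair each class label with its probability and keep the n most probable."""
    if len(classes) != len(probas):
        raise ValueError("classes and probas must have the same length")
    ranked = sorted(zip(classes, probas), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Placeholder labels and scores for illustration only.
demo_classes = ["TARGET_A", "TARGET_B", "TARGET_C"]
demo_probas = [0.10, 0.85, 0.05]
demo_top = top_n_targets(demo_classes, demo_probas, n=2)
print(demo_top)
```

The length check is there because a mismatch between labels and probabilities would otherwise be silently truncated by `zip`.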

Then convert the ChEMBL IDs to target names.

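def fetch_WS( trgt ):
    # Look up one target by its ChEMBL ID and return ( id, name, organism ).
    response = requests.get( '{0}.json'.format(trgt) )
    data = response.json()  # decode the JSON body once
    return ( trgt, data['pref_name'], data['organism'] )
# Collect the target info for each of the top predictions.
plist = []
for e in top_pred['id']:
    plist.append( fetch_WS(e) )
target_info = pd.DataFrame( plist, columns=['id', 'name', 'organism'])
# Join the names onto the predictions via the shared 'id' column.
pd.merge( top_pred, target_info )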
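
A web service response may not always contain every field, so it can help to separate the parsing from the request. The sketch below is only an assumption about how that could look: `parse_target_record` is a hypothetical helper, the key names `pref_name` and `organism` follow the code above, and the record is a made-up dictionary rather than a real response.

```python
def parse_target_record(trgt, record, missing="unknown"):
    """Build the (id, name, organism) tuple from an already decoded JSON record."""
    # Fall back to a placeholder when a key is absent or its value is empty.
    name = record.get("pref_name") or missing
    organism = record.get("organism") or missing
    return (trgt, name, organism)


# Made-up record for illustration; a real one would be the decoded JSON of the response.
demo_record = {"pref_name": "Example target", "organism": None}
demo_row = parse_target_record("EXAMPLE_ID", demo_record)
print(demo_row)
```

With this split, one incomplete record gives a row with a placeholder instead of stopping the whole loop with a KeyError.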

The model predicted that sitagliptin is a DPP4 modulator! I think this work is interesting. I will try to predict targets for other molecules and integrate a local ChEMBL DB to improve performance.

The original source code is at the following URL. Thanks for the useful information!

Published by iwatobipen

I'm a medicinal chemist at a mid-size pharmaceutical company. I love chemoinformatics, coding, organic synthesis, and my family.

3 thoughts on “Target prediction using ChEMBL”

  1. Thank you for posting this!

    Have you also tried to build any kind of machine learning model for target prediction (instead of using this given model)? If you have please direct me to where I can find it (or any kind of machine learning based target prediction model building script). I want to see if I can increase the number of targets and play with the input data that is used for training the model. Thank you for all your material!
