There are several publicly available databases in the chemoinformatics field.
ChEMBL is one of the most useful. George Papadatos introduced a useful tool for target prediction using ChEMBL, and he provides the ChEMBL target prediction models via an FTP server!
So anyone can use the models.
I downloaded a model and tried target prediction with it.
First, I got the models from the FTP server and launched Jupyter Notebook. ;-)
iwatobipen$ wget ftp://ftp.ebi.ac.uk/pub/databases/chembl/target_predictions/chembl_22_models.tar.gz
iwatobipen$ tar -vzxf chembl_22_models.tar.gz
iwatobipen$ jupyter notebook
Ready! Let's go ahead.
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import PandasTools
from rdkit import DataStructs
import pandas as pd
from pandas import concat
from collections import OrderedDict
import requests
import numpy
# sklearn.externals.joblib was removed in scikit-learn 0.23; newer versions need "import joblib"
from sklearn.externals import joblib
from rdkit import rdBase
print( rdBase.rdkitVersion )
> 2016.09.2
I ran this in a Python 3.5 environment.
morgan_nb = joblib.load( 'models_22/10uM/mNB_10uM_all.pkl' )
classes = list( morgan_nb.targets )
len( classes )
> 1524
# model has 1524 targets ( classes )
I used sitagliptin as the input molecule.
smiles = 'C1CN2C(=NN=C2C(F)(F)F)CN1C(=O)C[C@@H](CC3=CC(=C(C=C3F)F)F)N'
mol = Chem.MolFromSmiles( smiles )
mol
Next, calculate the Morgan fingerprint and convert it to a numpy array.
fp = AllChem.GetMorganFingerprintAsBitVect( mol, 2, nBits=2048 )
res = numpy.zeros( len(fp), numpy.int32 )
DataStructs.ConvertToNumpyArray( fp, res )
Predict the targets and sort the results by probability.
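probas = list( morgan_nb.predict_proba( res.reshape(1,-1))[0] )
# pair each target ID with its predicted probability
predictions = pd.DataFrame( list(zip(classes, probas)), columns=['id', 'proba'] )
top_pred = predictions.sort_values( by='proba', ascending = False).head(10)
top_pred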
Jupyter Notebook version 5 changed the table view!
Then convert the ChEMBL IDs to target names.
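The ranking step above pairs each target ID with the probability at the same position and keeps the highest-scoring entries. Here is a minimal sketch of the same idea without pandas. The `TARGET_*` IDs and the probabilities are made-up placeholders, and `top_n` is a hypothetical helper name, not something from the ChEMBL notebook.

```python
import heapq

def top_n(classes, probas, n=10):
    """Return the n (target_id, probability) pairs with the highest probability."""
    if len(classes) != len(probas):
        raise ValueError("classes and probas must have the same length")
    # zip pairs each label with the probability stored at the same index
    return heapq.nlargest(n, zip(classes, probas), key=lambda pair: pair[1])

# Placeholder data for illustration only (not real model output)
example_classes = ["TARGET_A", "TARGET_B", "TARGET_C", "TARGET_D"]
example_probas = [0.05, 0.80, 0.10, 0.30]

example_top = top_n(example_classes, example_probas, n=2)
print(example_top)
```

With the real model, `classes` and `probas` from the notebook could be passed in directly, assuming they have the same length and order.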
def fetch_WS( trgt ):
    # look up one target in the ChEMBL web service
    response = requests.get( 'https://www.ebi.ac.uk/chembl/api/data/target/{0}.json'.format(trgt) )
    data = response.json()
    return ( trgt, data['pref_name'], data['organism'] )

# fetch name and organism for each of the top predicted targets
plist = [ fetch_WS(e) for e in top_pred['id'] ]
target_info = pd.DataFrame( plist, columns=['id', 'name', 'organism'])
pd.merge( top_pred, target_info )
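The lookup above assumes every response contains both `pref_name` and `organism`. Here is a minimal sketch of a more defensive variant that separates URL construction from JSON parsing, so the parsing can be checked without network access. `target_url` and `parse_target` are hypothetical helper names, and the sample payload is a made-up placeholder rather than a real ChEMBL response.

```python
# URL pattern taken from the lookup code in this post
CHEMBL_TARGET_URL = "https://www.ebi.ac.uk/chembl/api/data/target/{0}.json"

def target_url(chembl_id):
    """Build the web service URL for one target ID."""
    return CHEMBL_TARGET_URL.format(chembl_id)

def parse_target(chembl_id, payload):
    """Extract (id, name, organism) from a decoded JSON dict.

    Missing fields become None instead of raising KeyError.
    """
    return (chembl_id, payload.get("pref_name"), payload.get("organism"))

# Placeholder payload for illustration only
sample_payload = {"pref_name": "Example target", "organism": "Example organism"}
sample_row = parse_target("CHEMBL_PLACEHOLDER", sample_payload)
print(sample_row)

# With requests installed, one lookup could then be written as:
#   parse_target(cid, requests.get(target_url(cid)).json())
```

Keeping the parsing separate also makes it easier to cache responses or skip failed requests later.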
The model predicted that sitagliptin is a DPP4 modulator! I think this work is interesting. I will try predicting other molecules and integrating a local ChEMBL DB to improve performance.
;-)
The original source code is at the following URLs. Thanks for the useful information!!!!
https://github.com/madgpap/notebooks
http://chembl.blogspot.jp/2016/03/target-prediction-models-update.html
Thank you for posting this!
Have you also tried to build any kind of machine learning model for target prediction (instead of using this given model)? If you have, please direct me to where I can find it (or any machine-learning-based script for building a target prediction model). I want to see if I can increase the number of targets and play with the input data that is used for training the model. Thank you for all your material!
Hi Sue_chem,
There are many ways to do it.
One great example is provided in the ChEMBL GitHub repo.
https://github.com/chembl/chembl_target_predictions
It uses sklearn for learning. If you would like to try another method, please let me know. ;-)
Thank you so much!