Target prediction by conformal prediction with ChEMBL data #chemoinformatics #docker #memo

Recently I’m learning conformal prediction and today I used the ready to use model for target prediction which is trained by chembl_24 (little bit old) with lightGBM.

ChEMBL team provides nice package. You can get the model as docker image. URL is below.
https://github.com/chembl/of_conformal
And original article is here.
https://jcheminf.biomedcentral.com/track/pdf/10.1186/s13321-018-0325-4

To use the image in your local environment it is very easy just type following code.

docker run -p 8080:8080 chembl/mcp

After typing the code, docker image will run and server will be launched. Now target prediction server is ready to use. ;)

Original github repo describes how to use the server from terminal with curl command.

BTW, I would like to use it from python. It is easy to do it by using requests package.

My example is below.

example code


The model predicts that the compound has activity against Xanthine dehydrogenase. (Maybe the compound was included training data…)

It is interesting point that conformal prediction returns not only active/non active label but also both/empty label. It means the model predicts posi and nega with uncertainly.

To curate data and build model requires several steps and time. So ready to use system is user friendly. If one is allowed to wish so much I would like to build model with new version of chembl so I will be happy if source trainig code will be available. ;)

Communicate ChEMBL27 with rdkit postgres cartridge and sqlalchemy #RDKit #ChEMBL #Postgres #razi

As you know ChEMBL 27 was released recently, thanks great effort for EBI ;)

Fortunately ChEMBL provides common DB format dump file and RDKit has postgres DB cartridge. It means that you can search compound with rdkit functionality in postgres.

BTW, to handle the database, sqlalchemy which is ORMapper is very useful. So is it nice if we can communicate ChEMBL DB with sqlalchemy and rdkit function? I think yes! Fortunately razi provides us the methods to communicate postgres DB with chemoinformatics methods.

However, I couldn’t find how to communicate DB which has rdk schema already.

So I tried to do it today. Luckly, @ymasaKit_ provides very useful tutorial in his repo! The document is written in Japanese but it’s useful because code is written in python. ;)

OK let’s start today’s code! At first, install postgresql, rdkit-cartridge and ChEMBL27. I used conda for convenience.

$ conda install -c conda-forge postgresql
$ conda install -c rdkit rdkit-postgresql
$ initdb -D ~/postgresdata
$ postgers -D ~/postgresdaa
$ cd ~/Downloads
$ wget ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/chembl_27_postgresql.tar.gz
$ tar -vxf  chembl_27_postgresql.tar.gz
$ cd chembl_27_postgresql/chembl_27/chemb_27_postgresql
$ psql -U postgres
pgdb=# create database chembl_27;
pgdb=# \q;
$ pg_restore --no-owner -h localhost -p 5432 -U iwatobipen -d chembl_27 chembl_27_postgresql.dmp

Now I installed chembl27 to my postgresql DB. Then make rdk schema. Details are described in rdkit original document. URL is below.
https://www.rdkit.org/docs/Cartridge.html

chembl_27=# create extension if not exists rdkit;
chembl_27=# create schema rdk;
chembl_27=# select * into rdk.mols from (select molregno,mol_from_ctab(molfile::cstring) m  from compound_structures) tmp where m is not null;
chembl_27=# create index molidx on rdk.mols using gist(m);
CREATE INDEX
chembl_27=# alter table rdk.mols add primary key (molregno);
ALTER TABLE
chembl_27=# select molregno,torsionbv_fp(m) as torsionbv,morganbv_fp(m) as mfp2,featmorganbv_fp(m) as ffp2 into rdk.fps from rdk.mols;
chembl_27=# create index fps_ttbv_idx on rdk.fps using gist(torsionbv);
CREATE INDEX
chembl_27=# create index fps_mfp2_idx on rdk.fps using gist(mfp2);
CREATE INDEX
chembl_27=# create index fps_ffp2_idx on rdk.fps using gist(ffp2);
CREATE INDEX
chembl_27=# alter table rdk.fps add primary key (molregno);
ALTER TABLE

Now I could make rdk.mols and rdk.fps table in my chembl_27 DB.

Let’s move to python!. To communicate postgresql with sqlalchemy, psycopg2 should be installed at first. Map mols and fps table at first. rdk schema has custom column such as rdkit mol object and fingerprints so autoload option should not use. And extend_existing=True is used for mapping mol column.

from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from sqlalchemy import create_engine, MetaData
from sqlalchemy.orm import sessionmaker
from sqlalchemy import Table
from sqlalchemy import Column, BIGINT, Index
from sqlalchemy.orm import mapper
from sqlalchemy.ext.declarative import declarative_base
from razi import rdkit_postgresql
from razi.rdkit_postgresql.types import Mol
from razi.rdkit_postgresql.types import Bfp
db = create_engine('postgres://iwatobipen@localhost:5432/chembl_27',
                   #echo=True
                  )
metadata = MetaData(schema='rdk')
Base = declarative_base()


class Mols(Base):
    __table__ = Table('mols',
                      metadata,
                      Column('molregno', BIGINT, primary_key=True),
                      Column('m', Mol),
                      extend_existing=True,
                      )
class Fps(Base):
    __table__ = Table('fps',
                     metadata,
                     Column('molregno', BIGINT, primary_key=True),
                     Column('torsionbv', Bfp),
                     Column('mfp2', Bfp),
                     Column('ffp2', Bfp),
                     extend_existing=True)

Now I could map rdk.mol, rdk.fps table to my defined class. Next I tested some chmoinformatics related query.

First example is retrieve data from rdk.mols table.

ession = sessionmaker(bind=db)
session = Session()
cpds = session.query(Mols)
from IPython.display import display
for row in cpds[:4]:
    display(row.m)
    print(row.molregno)

Next, retrieve compounds with substructure query.

cpds = session.query(Mols)
cpds = cpds.filter(Mols.m.hassubstruct('n1cccc2ccccc12'))
for row in cpds[:4]:
    display(row.m)
    print(row.molregno)

Seems work fine. ;) Next example is query combination substructure and molecular weight. Razi supports queries which are defined in rdkit-postgres cartridge. Substructure query is same as above.

from razi.rdkit_postgresql.functions import mol_amw
cpds = session.query(Mols)
cpds = cpds.filter(Mols.m.hassubstruct('n1cccc2ccccc12')).filter(mol_amw(Mols.m) <= 200)
for row in cpds[:4]:
    display(row.m)
    print(row.molregno)

And last example is similarity search with rdk.fps table. At first I changed threshold with SQL statement. And then search fps table and display matched compounds.

session.execute('set rdkit.tanimoto_threshold=0.7')
fps = session.query(Fps)

from razi.rdkit_postgresql.functions import featmorganbv_fp
smiles = 'COC(=O)Nc1nc2cc(C(=O)Nc3cccc(CO)n3)ccc2[nH]1'
mol = Chem.MolFromSmiles(smiles)
fp = featmorganbv_fp(smiles)
mol
res = fps.filter(Fps.ffp2.tanimoto_sml(fp))
regno = []
for row in res:
    print(row.molregno)
    regno.append(row.molregno)

cpds = session.query(Mols)
cpds = cpds.filter(Mols.molregno.in_(regno))
for cpd in cpds[:10]:
    display(cpd.m)

All examples seems work fine. Razi and posgresql combination is useful because user can handle database without SQL statement and more pythonic.

Another work, @fmkz__ developed pychembldb. It is useful package to communicate ChEMBLDB with python. So I think it is interesting that pychembldb enhancement with razi. ;)

I uploaded today’s code on my gist and repo.
https://github.com/iwatobipen/playground/blob/master/rdk_razi.ipynb

Thanks for reading.

Conformal prediction with python and rdkit_2 #RDKit #QSAR #Conformal_prediction

I posted about conformal prediction with python and rdkit some days ago. After that I could get very informative advice from @kjelljorner. Thanks a lot! His advice was below.


Kjell Jorner @kjelljorner3dReplying to @iwatobipen I can recommend the cross conformal prediction or bootstrapped conformal prediction (also in nonconformist) to avoid having to put aside data for calibration. There is also a related method called the jackknife+ which can be applied to any method:

Fortunately nonconformist has some implementations of ACP. So I tried to use acp module with same dataset.

For convenience, shared my gist link below instead of pasting code on the blog post.

In the post, I tested AggregatedCp and BootstrapConfomralClassifier. These methods internally make train and calibration set with defined sampling method. In the default setting, aggregatedCp uses bootstrapsampler. So both method is almost same I think. And results are almost same.

By using acp method, used don’t need to separate train data to train and calibration data before training. It seems natural and useful way to use conformal prediction against chemoinformatics problems.

BTW, in the these area, there are many skillful people in the world and they have an open mind. I feel writing the post is useful for me because it keeps my idea, tips code(this is main reason for keeping the blog) but also gives opportunity for having a discussion about chemoinformatics and drug discovery. ;)

Conformal prediction with python and rdkit #RDKit #QSAR #Conformal_prediction

Recently Greg shared nice webiner about conformal prediction in Youtube. https://www.youtube.com/watch?v=_ZVuEWEfwuw

He introduced basic concept of conformal prediction and demonstration with excellent work flow of KNIME. I recommend to check the site. ;)

Conformal prediction is not new method. It can estimate confidence of predicted values. Traditional predictive model can predict probability, or class of given task (classification task) but can’t estimate how confidence the predicted values are.

Also good document is freely accessible from following URL.
http://jmlr.csail.mit.edu/papers/volume9/shafer08a/shafer08a.pdf

I like KIME but I love python too so I searched package to conduct conformal prediction with python. The package name is ‘nonconformist‘. The packge depends scikit-learn but it don’t support recent version of scikit-learn. So some modification is required (PR is original repo, hope it will marge soon). So, I modified original code and uploaded my repository.
https://github.com/iwatobipen/nonconformist

To install the package, I used pip.

$ git clone https://github.com/iwatobipen/nonconformist.git
$ cd nonconformist
$ pip install .

Now ready to use nonconformist. Following example data comes from rdkit provided solubility data set and use ICP (inductive conformal prediction). At first import packages.

import os
import pandas as pd
import numpy as np
from rdkit import Chem
from rdkit.Chem import RDConfig
from rdkit.Chem import DataStructs
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from sklearn.ensemble import RandomForestClassifier
from nonconformist.nc import ClassifierNc
from nonconformist.nc import ClassifierAdapter
from nonconformist.icp import IcpClassifier
from nonconformist.evaluation import ClassIcpCvHelper
from nonconformist.evaluation import cross_val_score

Then load train and test data set. For comformal prediction, calibration data is required so I divided train and calibration data. And calculate molecular fingerprint. In the knime tutorial I mentioned above, lots of train calibration data is generated for train model. Following code, I made one train – calibration table(data).

train = os.path.join(RDConfig.RDDocsDir, 'Book/data/solubility.train.sdf')
test =  os.path.join(RDConfig.RDDocsDir, 'Book/data/solubility.test.sdf')
trainmol = [m for m in Chem.SDMolSupplier(train)]
testmol = [m for m in Chem.SDMolSupplier(test)]
label2cls = {'(A) low':0, '(B) medium':1, '(C) high':2}
def fp2arr(fp):
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr
trainfps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, 1024) for m in trainmol]
trainfps = np.array([fp2arr(fp) for fp in trainfps])

testfps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, 1024) for m in testmol]
testfps = np.array([fp2arr(fp) for fp in testfps])

train_cls = [label2cls[m.GetProp('SOL_classification')] for m in trainmol]
train_cls = np.array(train_cls)
test_cls = [label2cls[m.GetProp('SOL_classification')] for m in testmol]
test_cls = np.array(test_cls)
print(trainfps.shape, train_cls.shape, testfps.shape, test_cls.shape)

#train data is divided to train and calibration data
ids = np.random.permutation(train_cls.size)
# Use first 700 data for train and second set is used for calibration
trainX, calibX = trainfps[ids[:700],:],trainfps[ids[700:],:] 
trainY, calibY = train_cls[ids[:700]],train_cls[ids[700:]] 

testX = testfps
testY = test_cls

Now ready to train with nonconformist. I used RandomForestClassifier for base classifier. And made IcpClassifier object with random forest.

rf = RandomForestClassifier(n_estimators=500, random_state=794)
nc = ClassifierNc(ClassifierAdapter(rf))
icp = IcpClassifier(nc)

train and predict method is same as scikit-learn. So if you are familiar to scikit-learn, it’s easy to understand!

icp.fit(trainX, trainY)
icp.calibrate(calibX, calibY)

Nonconformist can predict with significance. So you can predict value with any confidence settings.

pred = icp.predict(testX)
pred95 = icp.predict(testX, significance=0.05).astype(np.int32)
pred80 = icp.predict(testX, significance=0.2).astype(np.int32)

Interestingly, classification with conformal prediction result has many combination of classes. My data set has 3 classes, so traditional classifier will returns 3 classes [1,0,0], [0,1,0], [0,0,1]. However conformal predictor compares significanse and each predicted p-value of class. And if the p-value > significance, it labeled true. So the predictor has possibility of return following combinations. [0,0,0], [0,0,1], [0,1,1], [1,1,1], [1,0,0], [0,1,0],[1,1,0],[1,0,1]..

OK let’s check model results.

tp = 0
for idx, j in enumerate(testY):
    print(j, np.argmax(pred[idx]), j == np.argmax(pred[idx]) , pred80[idx], ":", pred95[idx])
    if j == np.argmax(pred[idx]):
        tp += 1
> output
0 2 False [0 1 1] : [1 1 1]
0 1 False [0 1 1] : [1 1 1]
1 1 True [0 1 0] : [1 1 0]
1 1 True [0 1 0] : [1 1 0]
1 1 True [0 1 0] : [0 1 0]
1 1 True [0 1 0] : [1 1 1]
0 0 True [1 0 0] : [1 0 0]
1 0 False [1 1 0] : [1 1 1]

print(tp/testY.size)
> 0.6848249027237354

Fortunately nonconformist has cross validation method. To do that, ClassIcpCvHelper is used.

icpmodel = ClassIcpCvHelper(icp)
res = cross_val_score(icpmodel,
                trainfps,
                train_cls,
                iterations=10,
                scoring_funcs=[class_avg_c],
                significance_levels=[0.05, 0.1, 0.2],
               )
#it will take few sec-min
res.head(10)

	iter	fold	significance	class_avg_c
0	0	0	0.05	2.135922
1	0	0	0.10	1.708738
2	0	0	0.20	1.330097
3	0	1	0.05	2.495146
4	0	1	0.10	1.922330
5	0	1	0.20	1.281553
6	0	2	0.05	2.194175
....

Evaluation method is difficult to under stand for me. For example, class_n_correct method returns ratio of predicted true / ground truth. But, conformal prediction return multiple true labels, so max values will be over number of true labels.

Conformal prediction is good tool for QSAR, I would like to learn knime workflow of CP in this week end.

Today’s code is uploaded my gist. ;)

Deep learning based reaction mapper #rdkit #deeplearning #AAM #chemoinformatics

Here is a great article about AtomAtom Mapping with Deep Learning!
https://chemrxiv.org/articles/Unsupervised_Attention-Guided_Atom-Mapping/12298559
The author use attention method and train model for reaction mapping. AAM is important technology but there are few tools to do it.

The author shared their code! Thanks. I have interest the code. So I installed the code and used the package. Following code is almost same as original code but reactions are different.

The model can annotate complex multi component ‘Ugi’ reaction. But failed to map simple Diels Alder reaction and Claisen Rearrangement.

I would like to test more reactions for evaluate the model. It seems very useful for me.

And now I’m reading the article. Thanks for sharing nice code and article!

Think about de novo molecule generation #memo #journal #RDKit #CReM

Recently there are many publications about de-novo molecular generator which mainly use Deep Learning. One problem of the approach is that generated molecules are not systematic so it’s difficult to synthesis them with parallel chemistry. So sometime chemists dislike the proposal from generated form the method I think.

Rule or Rxn or MMP based molecule generation is another approach to do that. It’s based on more chemist friend rules. They are not new but useful method and also related approaches are still reported in these days.

Some days ago I found new article in J. Cheminform. The title was ‘CReM: chemically reasonable mutation framework for structure generation’. URL is below.
https://jcheminf.biomedcentral.com/track/pdf/10.1186/s13321-020-00431-w

The author proposed new workflow for molecular framework mutation it seems like MMP approach, it degrade molecule to fragment with local context (radi1-5) for making interchangable fragment database, like MMP key-value structures are stored. And the data is used for ‘MUTATE’, ‘GROW’ and ‘LINK’ for new structure generation.

I felt that the article is very similar to following ACS article reported by Kawai et al.
https://pubs.acs.org/doi/pdf/10.1021/ci400418c

They proposed similar approach for molecular generation with fragment database.

Compared these approach, I think main difference is that CreM can set context radius. The setting affects feature of generated molecules.

In the fig4, and fig5 of the J. Cheminform article, the author shows properties of generated molecules with different radius. For example novelty, diversity score is decreased when large radius(5) is used. It means that more context similar compounds are generated with the setting.

As you know, CreM author disclosed the implementation so let’s use it. It is easy to install crem has very few dependency, just rdkit and gaucamol(optional). At first I installed CReM with pip and get ready to use DB.

$ git clone https://github.com/DrrDom/crem.git
$ cd crem
$ pip install .
# get data set Thanks for providing the db!
$ wget http://www.qsar4u.com/files/cremdb/replacements02_sc2.5.db.gz
$ gzip -d replacements02_sc2.5.db.gz

And I uploaded an example code on my gist.

By using the dataset, it took few minutes for structure generation. After generating the molecule rdkit can render mols with Drawing function.

It seems that radius=1 generates more diverse compounds set. It is easy to use for molecular generation.

Ok now we can use deep learning based and rule based structure generator. Each methods has pros and cons. As author said that CReM can generate chemistry reasonable structure but can’t generate new rings which isn’t fragment db.

Which is good proposal for medchem new structure constructed from know fragments or new structure with novel fragments?

It’s depends on situation but novel fragments requires new chemistry or many wet experiments. AI driven drug discovery can’t replace all wet experiments to dry experiments. Which molecules do you make at first and next, experimental design is key for the many projects.

Have a nice weekend. ;)

Molecules drawing code memo (Highlight Functional groups) #memo

Here is an example for draw molecules with FG highlighting. Improved version of the code can be found in last part of the post.

RDKit has fdef file for common functional groups. So we can use it for highlighting atoms as same as pharmacohore. Today’s example is just for my memo. Because I often forget many useful options.

MolDraw2DSVG object can draw molecules with DrawMolecules method. So it is easy to draw molecules as high quality SVG image.

Greg suggested me following improved version of the code! Thanks a lot!!