Conformal prediction with python and rdkit #RDKit #QSAR #Conformal_prediction

Recently Greg shared nice webiner about conformal prediction in Youtube. https://www.youtube.com/watch?v=_ZVuEWEfwuw

He introduced basic concept of conformal prediction and demonstration with excellent work flow of KNIME. I recommend to check the site. ;)

Conformal prediction is not new method. It can estimate confidence of predicted values. Traditional predictive model can predict probability, or class of given task (classification task) but can’t estimate how confidence the predicted values are.

Also good document is freely accessible from following URL.
http://jmlr.csail.mit.edu/papers/volume9/shafer08a/shafer08a.pdf

I like KIME but I love python too so I searched package to conduct conformal prediction with python. The package name is ‘nonconformist‘. The packge depends scikit-learn but it don’t support recent version of scikit-learn. So some modification is required (PR is original repo, hope it will marge soon). So, I modified original code and uploaded my repository.
https://github.com/iwatobipen/nonconformist

To install the package, I used pip.

$ git clone https://github.com/iwatobipen/nonconformist.git
$ cd nonconformist
$ pip install .

Now ready to use nonconformist. Following example data comes from rdkit provided solubility data set and use ICP (inductive conformal prediction). At first import packages.

import os
import pandas as pd
import numpy as np
from rdkit import Chem
from rdkit.Chem import RDConfig
from rdkit.Chem import DataStructs
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from sklearn.ensemble import RandomForestClassifier
from nonconformist.nc import ClassifierNc
from nonconformist.nc import ClassifierAdapter
from nonconformist.icp import IcpClassifier
from nonconformist.evaluation import ClassIcpCvHelper
from nonconformist.evaluation import cross_val_score

Then load train and test data set. For comformal prediction, calibration data is required so I divided train and calibration data. And calculate molecular fingerprint. In the knime tutorial I mentioned above, lots of train calibration data is generated for train model. Following code, I made one train – calibration table(data).

train = os.path.join(RDConfig.RDDocsDir, 'Book/data/solubility.train.sdf')
test =  os.path.join(RDConfig.RDDocsDir, 'Book/data/solubility.test.sdf')
trainmol = [m for m in Chem.SDMolSupplier(train)]
testmol = [m for m in Chem.SDMolSupplier(test)]
label2cls = {'(A) low':0, '(B) medium':1, '(C) high':2}
def fp2arr(fp):
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr
trainfps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, 1024) for m in trainmol]
trainfps = np.array([fp2arr(fp) for fp in trainfps])

testfps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, 1024) for m in testmol]
testfps = np.array([fp2arr(fp) for fp in testfps])

train_cls = [label2cls[m.GetProp('SOL_classification')] for m in trainmol]
train_cls = np.array(train_cls)
test_cls = [label2cls[m.GetProp('SOL_classification')] for m in testmol]
test_cls = np.array(test_cls)
print(trainfps.shape, train_cls.shape, testfps.shape, test_cls.shape)

#train data is divided to train and calibration data
ids = np.random.permutation(train_cls.size)
# Use first 700 data for train and second set is used for calibration
trainX, calibX = trainfps[ids[:700],:],trainfps[ids[700:],:] 
trainY, calibY = train_cls[ids[:700]],train_cls[ids[700:]] 

testX = testfps
testY = test_cls

Now ready to train with nonconformist. I used RandomForestClassifier for base classifier. And made IcpClassifier object with random forest.

rf = RandomForestClassifier(n_estimators=500, random_state=794)
nc = ClassifierNc(ClassifierAdapter(rf))
icp = IcpClassifier(nc)

train and predict method is same as scikit-learn. So if you are familiar to scikit-learn, it’s easy to understand!

icp.fit(trainX, trainY)
icp.calibrate(calibX, calibY)

Nonconformist can predict with significance. So you can predict value with any confidence settings.

pred = icp.predict(testX)
pred95 = icp.predict(testX, significance=0.05).astype(np.int32)
pred80 = icp.predict(testX, significance=0.2).astype(np.int32)

Interestingly, classification with conformal prediction result has many combination of classes. My data set has 3 classes, so traditional classifier will returns 3 classes [1,0,0], [0,1,0], [0,0,1]. However conformal predictor compares significanse and each predicted p-value of class. And if the p-value > significance, it labeled true. So the predictor has possibility of return following combinations. [0,0,0], [0,0,1], [0,1,1], [1,1,1], [1,0,0], [0,1,0],[1,1,0],[1,0,1]..

OK let’s check model results.

tp = 0
for idx, j in enumerate(testY):
    print(j, np.argmax(pred[idx]), j == np.argmax(pred[idx]) , pred80[idx], ":", pred95[idx])
    if j == np.argmax(pred[idx]):
        tp += 1
> output
0 2 False [0 1 1] : [1 1 1]
0 1 False [0 1 1] : [1 1 1]
1 1 True [0 1 0] : [1 1 0]
1 1 True [0 1 0] : [1 1 0]
1 1 True [0 1 0] : [0 1 0]
1 1 True [0 1 0] : [1 1 1]
0 0 True [1 0 0] : [1 0 0]
1 0 False [1 1 0] : [1 1 1]

print(tp/testY.size)
> 0.6848249027237354

Fortunately nonconformist has cross validation method. To do that, ClassIcpCvHelper is used.

icpmodel = ClassIcpCvHelper(icp)
res = cross_val_score(icpmodel,
                trainfps,
                train_cls,
                iterations=10,
                scoring_funcs=[class_avg_c],
                significance_levels=[0.05, 0.1, 0.2],
               )
#it will take few sec-min
res.head(10)

	iter	fold	significance	class_avg_c
0	0	0	0.05	2.135922
1	0	0	0.10	1.708738
2	0	0	0.20	1.330097
3	0	1	0.05	2.495146
4	0	1	0.10	1.922330
5	0	1	0.20	1.281553
6	0	2	0.05	2.194175
....

Evaluation method is difficult to under stand for me. For example, class_n_correct method returns ratio of predicted true / ground truth. But, conformal prediction return multiple true labels, so max values will be over number of true labels.

Conformal prediction is good tool for QSAR, I would like to learn knime workflow of CP in this week end.

Today’s code is uploaded my gist. ;)

Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
view raw CP_RDKit.ipynb hosted with ❤ by GitHub
Advertisement

Published by iwatobipen

I'm medicinal chemist in mid size of pharmaceutical company. I love chemoinfo, cording, organic synthesis, my family.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: