Detect outlier with Isolationforest #chemoinformatics #scikit-learn #RDKit

Applicability domain(AD) is often discussed in chemoinformatics for QSAR.

Because current QSAR often can’t predict unsimilar compounds against training set. The definition of unsimilar is often used with tanimoto similarity with molecular fingerprint.

So QSAR model user need to take care about where new designed molecule in the model AD or not.

Some days ago, I read interesting article reported in arxiv about synthetic route design with machine learning. The article used isolationforest for outlier detection.

Isolation forest is a method for outlier detection. It seems useful for AD definition in QSAR I think. So I tried to use the method for molecules.

Following code is very simple. At first calculate Morganfingerprint.

Second, conducted PCA for dimension reduction.

Third, fit the data to isolation forest model. After building the model, I can use predict method which returns 1(not outlier) or -1 (outlier).

OK let’s go coding!

Import library and load SDF.

%matplotlib inline
import os
import matplotlib.pyplot as plt
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit import RDConfig
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
import numpy as np
plt.style.use('ggplot')
sdf = os.path.join(RDConfig.RDDocsDir, 'Book/data/cdk2.sdf')
mols = [m for m in Chem.SDMolSupplier(sdf)]
for mol in mols:
    AllChem.Compute2DCoords(mol)

Then calculate molecular fingerprint and perform PCA. It is very easy to do it ;)

X = [list(AllChem.GetMorganFingerprintAsBitVect(mol, 2)) for mol in mols]
from sklearn.ensemble import IsolationForest
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pcares = pca.fit_transform(X)

Next, train isolationforest wish calculated fingerprint.

isoforest = IsolationForest(contamination='legacy')
isoforest.fit(pcares)
isoforest_res = isoforest.fit_predict(pcares)

Now almost ready. Map in or out with meshed grid and plot training data on the sheet. Isolation class has decision function which decides where the input data is outlier or not.

xx, yy = np.meshgrid(np.linspace(-5,5, 50), np.linspace(-5,5, 50))
cmap = {1:'blue', -1:'red'}
cs = [cmap[k] for k in isoforest_res]
z = isoforest.decision_function(np.c_[xx.ravel(), yy.ravel()])
plt.title("IsolationForest")
plt.contourf(xx, yy, z, cmap=plt.cm.Blues_r)

b1 = plt.scatter(pcares[:, 0], pcares[:, 1], c=cs,
                 s=20, edgecolor='k')
plt.axis('tight')
plt.xlim((-5, 5))
plt.ylim((-5, 5))

plt.show()

The most outer points are assigned as outlier(red point) however the results is not so clear.

You can see more clear data in Scikit-learn document.

Apply isolation forest against PCA result is not suitable for AD detection I think.

But the method is useful for outlier detection.

By the way, medicinal chemist can predict molecular property with high accuracy when their project is late stage. So QSAR model doesn’t work well. But early stage of drug discovery is limited data. So it is difficult to build suitable QSAR model for the project.

So I think design experiment space is very difficult. How to use QSAR model and to build experiment space are key point of current medicinal chemistry.

How do you design SAR space and take data?

Today’s code URL is below.

https://github.com/iwatobipen/chemobinder