An ML visualization tool for Python #machine_learning #visualization

Machine learning is widely used for QSAR in drug discovery. After building a model, it is important to analyze its performance. scikit-learn, one of the ML packages for Python, has many functions for computing model performance metrics, but it offers few built-in visualization tools. So I have been making plots with matplotlib, seaborn, and so on.

A few days ago, however, I found a useful visualization package for ML. Its name is Yellowbrick.
https://www.scikit-yb.org/en/latest/index.html
It can be installed via conda or pip.

With this package, you can make many kinds of plots very efficiently. The following code shows some brief examples.

1st step) Import packages.



%matplotlib inline
import numpy as np
import os
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import RDConfig
from rdkit.Chem import DataStructs
from rdkit.Chem.Draw import DrawMorganBit
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from yellowbrick.features import FeatureImportances
from yellowbrick.classifier import PrecisionRecallCurve
from yellowbrick.classifier import ROCAUC
from yellowbrick.classifier import ClassificationReport
from yellowbrick.classifier import ClassPredictionError
from IPython import display
le = LabelEncoder()

2nd step) Make the fingerprint matrix and the label indices for ML. I used the solubility data that ships with the RDKit package.

def fp2arr(fp):
    arr = np.zeros((0,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr
train_mols = [m for m in Chem.SDMolSupplier(os.path.join(RDConfig.RDDocsDir, 'Book/data/solubility.train.sdf'))]
test_mols = [m for m in Chem.SDMolSupplier(os.path.join(RDConfig.RDDocsDir, 'Book/data/solubility.test.sdf'))]
cls = set([m.GetProp('SOL_classification') for m in train_mols])
n_cls = len(cls)
train_fp_bitifo = [{} for _ in range(len(train_mols))]
train_fp = np.asarray([fp2arr(AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=512, bitInfo=train_fp_bitifo[idx])) for idx, m in enumerate(train_mols)])
train_y = [m.GetProp('SOL_classification') for m in train_mols]
train_y_le = le.fit_transform(train_y)

# Size the bitInfo list by the test set, not the training set.
test_fp_bitifo = [{} for _ in range(len(test_mols))]
test_fp = np.asarray([fp2arr(AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=512, bitInfo=test_fp_bitifo[idx])) for idx, m in enumerate(test_mols)])
test_y = [m.GetProp('SOL_classification') for m in test_mols]
# Reuse the encoder fitted on the training labels so class indices stay consistent.
test_y_le = le.transform(test_y)
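The label encoding step has one easy-to-miss pitfall: the test labels should be encoded with the encoder that was fitted on the training labels. Refitting on the test labels happens to give the same mapping when the test set contains every class, but not otherwise. Here is a minimal sketch with made-up label lists (the strings are illustrative assumptions, not read from the SDF files):

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels only; the real ones come from the 'SOL_classification' property.
train_labels = ["(A) low", "(C) high", "(B) medium", "(A) low"]
test_labels = ["(B) medium", "(C) high"]  # this toy test set lacks "(A) low"

le = LabelEncoder()
train_enc = le.fit_transform(train_labels)  # learns the class -> index mapping
test_enc = le.transform(test_labels)        # reuses that mapping

# Refitting on the test labels silently renumbers the classes.
refit_enc = LabelEncoder().fit_transform(test_labels)

print(list(le.classes_))
print("transform:    ", list(test_enc))
print("fit_transform:", list(refit_enc))
```

With the reused encoder, "(B) medium" keeps index 1; after refitting it becomes index 0, so the scores would be computed against shifted labels.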

3rd step) OK, let’s look at feature importance. The general Yellowbrick flow is:

viz(model) => viz.fit(x, y) => viz.score(x,y) => viz.poof()

In the feature importance example below, bits 49, 356, and 295 are the three most important features.

rf = RandomForestClassifier(n_estimators=200)
viz = FeatureImportances(rf)
viz.size = np.array([600, 6000])
viz.fit(train_fp, train_y_le)
viz.poof()
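With 512 bits the plot gets very tall, so it can help to read the top-ranked bits directly from the fitted model as well. This is a small sketch using the random forest's feature_importances_ attribute on toy data; the 16-bit random "fingerprints" and the rule that the label equals bit 3 are assumptions made only so that the result is easy to check.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 16))  # toy 16-bit "fingerprints" (assumption)
y = X[:, 3]                             # label depends only on bit 3 by construction

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Sort bit indices by importance, largest first, and keep the top three.
top_n = 3
top_bits = np.argsort(rf.feature_importances_)[::-1][:top_n]
for bit in top_bits:
    print(f"bit {bit}: importance {rf.feature_importances_[bit]:.3f}")
```

On the real data, the same two lines applied to a forest fitted on train_fp and train_y_le should list the bits that appear at the top of the Yellowbrick plot, up to the randomness of the forest.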

Let’s look at the structures.

Draw.MolsToGridImage(test_mols[50:80], legends=test_y[50:80], molsPerRow=5)

print(test_fp_bitifo[57])
test_mols[57]
> {33: ((3, 0), (4, 0), (5, 0)), 54: ((2, 2),), 80: ((1, 0),), 114: ((2, 0),), 115: ((1, 1),), 222: ((0, 1),), 295: ((0, 0),), 392: ((3, 1), (4, 1), (5, 1)), 485: ((2, 1),)}
# Draw bit 295 of test molecule 57.
IPythonConsole.DrawMorganBit(test_mols[57], 295, bitInfo=test_fp_bitifo[57])

Bit 295 corresponds to a hydroxyl group. Hmm, that is reasonable.
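Each entry in a bitInfo dictionary maps a bit number to (atom index, radius) pairs, so the fragment behind a bit can also be listed as text instead of drawn. The sketch below does this for ethanol, chosen here only as a small molecule with a hydroxyl group; it is not one of the molecules in the solubility set, and its bit numbers need not match the ones above.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CCO")  # ethanol, an illustrative molecule (assumption)
bit_info = {}
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=512, bitInfo=bit_info)

rows = []
for bit, envs in sorted(bit_info.items()):
    for atom_idx, radius in envs:
        if radius == 0:
            # Radius 0 is the central atom alone.
            frag = mol.GetAtomWithIdx(atom_idx).GetSymbol()
        else:
            # Collect the bonds within `radius` of the atom and cut out that fragment.
            env = Chem.FindAtomEnvironmentOfRadiusN(mol, radius, atom_idx)
            frag = Chem.MolToSmiles(Chem.PathToSubmol(mol, env))
        rows.append((bit, atom_idx, radius, frag))
        print(f"bit {bit}: atom {atom_idx}, radius {radius}, fragment {frag}")
```

For the radius-0 case this prints only the element symbol, so it loses the connectivity information that the drawing shows; it is meant as a quick look, not a replacement for DrawMorganBit.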

Visualizing other metrics is just as easy. The coding style is almost the same.

# Precision-recall curve
rf = RandomForestClassifier(n_estimators=200)
viz_cls = PrecisionRecallCurve(rf)
viz_cls.size = np.array([600, 400])
viz_cls.fit(train_fp, train_y_le)
viz_cls.score(test_fp, test_y_le)
viz_cls.poof()

# ROC curve and AUC
rf = RandomForestClassifier(n_estimators=200)
viz_rocauc = ROCAUC(rf)
viz_rocauc.size = np.array([600, 400])
viz_rocauc.fit(train_fp, train_y_le)
viz_rocauc.score(test_fp, test_y_le)
viz_rocauc.poof()

# Classification report (precision, recall, and F1 per class)
rf = RandomForestClassifier(n_estimators=200)
viz_cls = ClassificationReport(rf)
viz_cls.size = np.array([600, 400])
viz_cls.fit(train_fp, train_y_le)
viz_cls.score(test_fp, test_y_le)
viz_cls.poof()

# Class prediction error
rf = RandomForestClassifier(n_estimators=200)
viz_ce = ClassPredictionError(rf)
viz_ce.size = np.array([600, 400])
viz_ce.fit(train_fp, train_y_le)
viz_ce.score(test_fp, test_y_le)
viz_ce.poof()

Nice, isn’t it?

Yellowbrick is a very useful package for ML.

All of the code is uploaded to my gist.
