Building a stacking classification QSAR model with mlxtend #chemoinformatics #mlxtend #RDKit

I posted about the ML method named ‘blending’ some days ago, and a reader suggested that I try “mlxtend”.
I had come across it while learning about ensemble learning packages in Python, but I had never used it.
So I tried using the library to build a model.
Mlxtend is easy to install, and good documentation is provided at the following URL.

The following code is an example of stacking.
In an IPython notebook…
I use bace.csv as the test dataset and import some functions first.

%matplotlib inline

import warnings
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
import numpy as np
import pandas as pd

df = pd.read_csv("bace.csv")

The dataset has pIC50 as the objective variable.

mols = [Chem.MolFromSmiles(smi) for smi in df.mol]
fps = [AllChem.GetMorganFingerprintAsBitVect(mol,2, nBits=1024) for mol in mols]
pIC50 = [i for i in df.pIC50]
Draw.MolsToGridImage(mols[:10], legends=["pIC50 "+str(i) for i in pIC50[:10]], molsPerRow=5)

Images of the compounds are shown below.

Then convert the molecular fingerprints to numpy arrays and make a binary activity array, y_bin.

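X = []
for fp in fps:
    arr = np.zeros((1,))
    # Copy the bit vector into a numpy array (arr is resized to nBits)
    DataStructs.ConvertToNumpyArray(fp, arr)
    X.append(arr)
X = np.array(X)
y = np.array(pIC50)
# Binary label: 1 if pIC50 > 7 (active), else 0.
# The dtype is an assumption; the original line was cut off after "y>7,".
y_bin = np.asarray(y > 7, dtype=int)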
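
To make the binarization step concrete, here is a minimal sketch that applies the same pIC50 > 7 cutoff to a handful of made-up values. The numbers and the names pic50_demo, y_demo and counts are hypothetical and are not taken from bace.csv. Counting the two classes this way is a quick check of how balanced the labels are before training.

```python
import numpy as np

# Hypothetical pIC50 values, for illustration only (not taken from bace.csv)
pic50_demo = np.array([9.2, 7.0, 6.4, 7.5, 5.1, 8.3])

# Same rule as y_bin: strictly greater than 7 is active (1), otherwise inactive (0)
y_demo = np.asarray(pic50_demo > 7, dtype=int)

# Count the members of each class; index 0 = inactive, index 1 = active
counts = np.bincount(y_demo, minlength=2)
print(y_demo)
print(counts)
```

Note that a compound with pIC50 exactly 7 falls into the inactive class, because the comparison is strict.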

Then load some classifier models from sklearn and split the data for training and testing.

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix
from sklearn.decomposition import PCA
from xgboost import XGBClassifier
from mlxtend.classifier import StackingClassifier
from mlxtend.plotting import plot_decision_regions
from mlxtend.plotting import plot_confusion_matrix
import numpy as np
x_train, x_test, y_train, y_test = train_test_split(X,y_bin, test_size=0.2)

Making a stacking classifier is very simple: just call StackingClassifier and set classifiers and meta_classifier as arguments.
I use SVC as the meta_classifier.

clf1 = RandomForestClassifier(random_state=794)
clf2 = GaussianNB()
clf3 = XGBClassifier(random_state=0)
clf4 = SVC(random_state=0)
clflist = ["RF", "GNB", "XGB", "SVC", "SCLF"]

sclf = StackingClassifier(classifiers=[clf1,clf2,clf3], meta_classifier=clf4)
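
For readers who want to see what stacking does under the hood, the following is a rough hand-written sketch using only sklearn. It is not mlxtend's implementation; it only illustrates the general idea, as I understand it, that the base classifiers are fit first and their predicted labels become the input features of the meta-classifier. The synthetic data and the names X_demo, y_demo, base_models, meta_model and stacked_predict are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic binary data standing in for the fingerprint matrix (an assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

# Level 1: fit each base classifier on the training data
base_models = [RandomForestClassifier(n_estimators=50, random_state=0), GaussianNB()]
for model in base_models:
    model.fit(X_demo, y_demo)

# The predicted labels of the base classifiers become the meta-features
meta_features = np.column_stack([m.predict(X_demo) for m in base_models])

# Level 2: the meta-classifier learns how to combine the base predictions
meta_model = SVC(random_state=0)
meta_model.fit(meta_features, y_demo)


def stacked_predict(X_new):
    """Predict by passing base-classifier outputs through the meta-classifier."""
    new_meta = np.column_stack([m.predict(X_new) for m in base_models])
    return meta_model.predict(new_meta)


pred = stacked_predict(X_demo[:5])
print(meta_features.shape, pred.shape)
```

Because the meta-classifier in this sketch is trained on predictions for the same data the base classifiers were fit on, it can overfit. Cross-validated variants of stacking exist to reduce that risk.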

Then let’s train the models!

skf = StratifiedKFold(n_splits=5)
for j, (train_idx,test_idx) in enumerate(skf.split(x_train, y_train)):
    # Classifiers are evaluated in the same order as clflist
    for i, clf in enumerate([clf1, clf2, clf3, clf4, sclf]):
        clf.fit(x_train[train_idx], y_train[train_idx])
        ypred = clf.predict(x_train[test_idx])
        acc = accuracy_score(y_train[test_idx], ypred)
        b_acc = balanced_accuracy_score(y_train[test_idx], ypred)
        print("round {}".format(j))
        print("accuracy {}".format(acc))
        print("balanced accuracy {}".format(b_acc))

> output
round 0
accuracy 0.8148148148148148
balanced accuracy 0.8026786943947115
round 0
accuracy 0.6625514403292181
balanced accuracy 0.680450351191296
round 0
accuracy 0.8271604938271605
balanced accuracy 0.8136275995042005
round 0
accuracy 0.7325102880658436
balanced accuracy 0.7072717256576229
round 0
accuracy 0.8148148148148148
balanced accuracy 0.8026786943947115
round 1
accuracy 0.7603305785123967
balanced accuracy 0.7534683684794672
round 1
accuracy 0.640495867768595
balanced accuracy 0.6634988901220866
round 1
accuracy 0.8140495867768595
balanced accuracy 0.8127081021087681
round 1
accuracy 0.756198347107438
balanced accuracy 0.7414678135405106
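
The loop reports both accuracy and balanced accuracy. As a small sketch of why the two can differ, the example below uses made-up labels (not results from this dataset) where most compounds are inactive and the classifier predicts inactive for everything. Balanced accuracy is the mean of the per-class recalls, so it penalizes ignoring the minority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Made-up labels: 8 inactives (0) and 2 actives (1), for illustration only
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# A classifier that predicts "inactive" for every compound
y_pred = np.zeros(10, dtype=int)

acc = accuracy_score(y_true, y_pred)
b_acc = balanced_accuracy_score(y_true, y_pred)

# Per-class recall; balanced accuracy is the mean of these values
recalls = recall_score(y_true, y_pred, average=None)
print(acc, b_acc, recalls)
```

Here the plain accuracy is 0.8 while the balanced accuracy is only 0.5, since the recall is 1.0 for the inactive class and 0.0 for the active class.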

Readers who are interested in stacking can find a nice document here.

All of my code is uploaded to my repo on GitHub.

Mlxtend has many functions for building, analyzing, and visualizing machine learning models and data. I will use the package more and more.