Vote Vote Vote #chemoinformatics

Some days ago, I posted about an ensemble classification method named ‘blending’. The method is not implemented in scikit-learn, so I am implementing the function now.
By the way, scikit-learn has an ensemble classification method named ‘VotingClassifier’.

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html#sklearn.ensemble.VotingClassifier
The following explanation is from the sklearn documentation.

The idea behind the VotingClassifier is to combine conceptually different machine learning classifiers and use a majority vote or the average predicted probabilities (soft vote) to predict the class labels. Such a classifier can be useful for a set of equally well performing models in order to balance out their individual weaknesses.

The classifier can combine many classifiers very easily.
It has two voting modes: one is ‘hard’ and the other is ‘soft’.
From the documentation…
If ‘hard’, uses predicted class labels for majority rule voting. Else if ‘soft’, predicts the class label based on the argmax of the sums of the predicted probabilities, which is recommended for an ensemble of well-calibrated classifiers.
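
For reference, here is a minimal sketch of the ‘soft’ mode (the estimator choices and weights below are arbitrary examples, not part of the script later in this post). Every estimator must implement predict_proba, which is why SVC needs probability=True.

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC

# Soft voting averages the predicted class probabilities and takes the argmax.
soft_clf = VotingClassifier(
    estimators=[('RF', RandomForestClassifier(n_estimators=100)),
                ('SVC', SVC(probability=True, gamma='auto'))],
    voting='soft',
    weights=[2, 1])  # optional weights applied to each estimator's probabilities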

I used the classifier for QSAR modeling.

In the following code, I used four classifiers and BASE.csv from MoleculeNet as the test dataset.
The code is very simple! Just pass the estimators defined in the dictionary to VotingClassifier as a list of (name, estimator) tuples.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from xgboost import XGBClassifier

# Candidate base classifiers (ExtraTrees is defined here but not used in the ensemble below)
clf_dict = {'RF': RandomForestClassifier(n_estimators=100),
            'ETC': ExtraTreesClassifier(n_estimators=100),
            'GBC': GradientBoostingClassifier(learning_rate=0.05),
            'XGB': XGBClassifier(n_estimators=100),
            'SVC': SVC(probability=True, gamma='auto')}

# Hard voting: majority rule over the predicted class labels
voting_clf = VotingClassifier(estimators=[("RF", clf_dict["RF"]),
                                          ("GBC", clf_dict["GBC"]),
                                          ("XGB", clf_dict["XGB"]),
                                          ("SVC", clf_dict["SVC"])
                                          ], voting='hard')

dataset = np.load("train.npz")['arr_0']
X = dataset[:,:-1]
y = dataset[:,-1]
idx = np.random.permutation(y.size)
X = X[idx]
y = y[idx]
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=794)
nfolds = 10
skf = StratifiedKFold(nfolds)
# NOTE: the classifier is refit on every fold; only the model fitted on the
# last fold is kept and evaluated on the hold-out test set below.
for i, (train, val) in enumerate(skf.split(train_X, train_y)):
    #print('fold {}'.format(i))
    X_train = train_X[train]
    y_train = train_y[train]
    X_val = train_X[val]
    y_val = train_y[val]
    voting_clf.fit(X_train, y_train)
y_pred = voting_clf.predict(test_X)
print("Voting!")
print(confusion_matrix(test_y, y_pred))
print(classification_report(test_y, y_pred))

rf_clf = RandomForestClassifier(n_estimators=100)
 
for i, (train, val) in enumerate(skf.split(train_X, train_y)):
    #print('fold {}'.format(i))
    X_train = train_X[train]
    y_train = train_y[train]
    X_val = train_X[val]
    y_val = train_y[val]
    rf_clf.fit(X_train, y_train)
y_pred = rf_clf.predict(test_X)
print("Random Forest!")
print(confusion_matrix(test_y, y_pred))
print(classification_report(test_y, y_pred))

svc_clf = SVC(probability=True, gamma='auto')
for i, (train, val) in enumerate(skf.split(train_X, train_y)):
    #print('fold {}'.format(i))
    X_train = train_X[train]
    y_train = train_y[train]
    X_val = train_X[val]
    y_val = train_y[val]
    svc_clf.fit(X_train, y_train)
y_pred = svc_clf.predict(test_X)
print("SV!")
print(confusion_matrix(test_y, y_pred))
print(classification_report(test_y, y_pred)) 

Then run the code.
In this example the voting method does not outperform the random forest or the support vector classifier, but I think it is worth knowing that sklearn provides such a useful feature for ensemble learning. ;-)

iwatobipen$ python voting.py 
Voting!
[[10  0  0]
 [ 0 11  0]
 [ 0  1  8]]
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        10
         1.0       0.92      1.00      0.96        11
         2.0       1.00      0.89      0.94         9

   micro avg       0.97      0.97      0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30

Random Forest!
[[10  0  0]
 [ 0 11  0]
 [ 0  1  8]]
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        10
         1.0       0.92      1.00      0.96        11
         2.0       1.00      0.89      0.94         9

   micro avg       0.97      0.97      0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30

SV!
[[10  0  0]
 [ 0 11  0]
 [ 0  1  8]]
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        10
         1.0       0.92      1.00      0.96        11
         2.0       1.00      0.89      0.94         9

   micro avg       0.97      0.97      0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30
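
As a side note, an easier way to compare the ensemble with its individual members is cross-validated scoring. This is only a rough sketch that reuses clf_dict, voting_clf, train_X and train_y from the script above (the fold count and metric are my own arbitrary choices):

from sklearn.model_selection import cross_val_score

# Score each base classifier and the voting ensemble with 5-fold cross-validation.
for name, clf in list(clf_dict.items()) + [('Voting', voting_clf)]:
    scores = cross_val_score(clf, train_X, train_y, cv=5, scoring='accuracy')
    print('{}: {:.3f} +/- {:.3f}'.format(name, scores.mean(), scores.std()))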

Visualize important features of machine learning #RDKit

As you know, the RDKit 2018.09.1 release has very exciting methods named ‘DrawMorganBit’ and ‘DrawMorganBits’. They can render the bit information of a Morgan fingerprint.
This is described in the following blog post.
http://rdkit.blogspot.com/2018/10/using-new-fingerprint-bit-rendering-code.html
And if you can read Japanese, excellent posts are also available.
https://future-chem.com/rdkit-fp-visualize/
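
Before going into the model, here is a minimal sketch of DrawMorganBit on a single molecule, just to show what the new rendering code does (the aspirin SMILES is my own arbitrary example):

from rdkit import Chem
from rdkit.Chem import AllChem, Draw

mol = Chem.MolFromSmiles('CC(=O)Oc1ccccc1C(=O)O')  # aspirin, as an example
bitinfo = {}
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024, bitInfo=bitinfo)
# Draw the atom environment (central atom plus radius) behind one of the on bits.
Draw.DrawMorganBit(mol, list(bitinfo.keys())[0], bitinfo)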

What I want to do in this blog post is visualize the important features of a random forest classifier.
With a fingerprint-based predictor it is sometimes difficult to understand the feature importance of each bit. I think that, using the new method, I can visualize the important features of active molecules.

Let’s try it.
At first, import several packages.

import os
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import DataStructs
from rdkit.Chem import RDConfig
from rdkit.Chem import rdBase
print(rdBase.rdkitVersion)

Then load the dataset and define the mol2fp function.

trainpath = os.path.join(RDConfig.RDDocsDir, 'Book/data/solubility.train.sdf')
testpath = os.path.join(RDConfig.RDDocsDir, 'Book/data/solubility.test.sdf')
solclass = {'(A) low':0, '(B) medium': 1, '(C) high': 2}
n_splits = 10
def mol2fp(mol, nBits=1024):
    bitInfo = {}
    # Pass nBits explicitly; otherwise the default (2048 bits) is used,
    # which would not match the 1024-bit indexing used later.
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=nBits, bitInfo=bitInfo)
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr, bitInfo
trainmols = [m for m in Chem.SDMolSupplier(trainpath)]
testmols = [m for m in Chem.SDMolSupplier(testpath)]

Then get X and Y.
In this blog post I used the solubility classification data.

trainX = np.array([mol2fp(m)[0] for m in trainmols])
trainY = np.array([solclass[m.GetProp('SOL_classification')] for m in trainmols], dtype=int)

testX = np.array([mol2fp(m)[0] for m in testmols])
testY = np.array([solclass[m.GetProp('SOL_classification')] for m in testmols], dtype=int)

Then train the RandomForestClassifier.
I used StratifiedKFold instead of KFold.
The difference between the two methods is well described at the following URL.
https://www.kaggle.com/ogrellier/kfold-or-stratifiedkfold
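
To see the difference at a glance, here is a small sketch on a toy label vector (not part of the solubility workflow): StratifiedKFold keeps the class ratio in every fold, while plain KFold does not.

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

y_toy = np.array([0] * 9 + [1] * 3)   # imbalanced toy labels
X_toy = np.zeros((12, 1))
for cv in (KFold(n_splits=3), StratifiedKFold(n_splits=3)):
    print(type(cv).__name__)
    for _, test in cv.split(X_toy, y_toy):
        print(np.bincount(y_toy[test], minlength=2))  # class counts in each test fold

Now, back to the solubility model.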

rfclf = RandomForestClassifier(n_estimators=50)
skf = StratifiedKFold(n_splits=n_splits)
for train, test in skf.split(trainX, trainY):
    trainx = trainX[train]
    trainy = trainY[train]
    testx = trainX[test]
    testy = trainY[test]
    rfclf.fit(trainx, trainy)
    predy = rfclf.predict(testx)
    print(accuracy_score(testy, predy))

Then get the important features. It seems a little bit tricky. Then pick the indices of the high-solubility molecules.

fimportance = rfclf.feature_importances_
fimportance_dict = dict(zip(range(1024), fimportance))
sorteddata = sorted(fimportance_dict.items(), key=lambda x:-x[1])
top50feat = [x[0] for x in sorteddata][:50]
solblemol_idx = []
for i, v in enumerate(testY):
    if v == 2:
        solblemol_idx.append(i)

Now ready.
Test!!!!

testidx = solblemol_idx[-1] 
testmols[testidx]

rfclf.predict([testX[testidx]])
array([2])

Get the fingerprint and bit information, and extract the important features from the on bits.

fp, bitinfo = mol2fp(testmols[testidx])
onbit = [bit for bit in bitinfo.keys()]
importantonbits = list(set(onbit) & set(top50feat))

Finally, visualize the important bit information together with the corresponding substructures.

tpls = [(testmols[testidx], x, bitinfo) for x in importantonbits]
Draw.DrawMorganBits(tpls, legends = [str(x) for x in importantonbits])

This classifier can detect substructures with oxygen atoms and sp3 carbons.
The example molecule has amino functional groups, but the predictor does not detect nitrogen as an important feature.
I think this is because the model is not well optimized.
Still, it is useful to be able to visualize the features of a molecule in this way.
All the code is uploaded to Google Colab.
Readers who are interested in the code can run it on Colab!
https://colab.research.google.com/drive/16NmSqHMD3Tw562PeCh0u22OB9oPyqCsw

about scaling

Some days ago, I wanted to scale some data.
I found that scikit-learn is the best way to do it.
Like this….

import numpy as np
from sklearn import preprocessing

min_max_scaler = preprocessing.MinMaxScaler()
arr = np.array([[1, 10, 200],
                [4, 30, 100],
                [3, 20, 300]], dtype="float")

# Scale each column to the [0, 1] range.
arr_minmax = min_max_scaler.fit_transform(arr)
arr_minmax

Then, I got….

array([[ 0.        ,  0.        ,  0.5       ],
       [ 1.        ,  1.        ,  0.        ],
       [ 0.66666667,  0.5       ,  1.        ]])

It was easy!
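
For reference, MinMaxScaler just maps each column to (x - column_min) / (column_max - column_min). Here is a quick sanity check with plain numpy, reusing arr and arr_minmax from above:

# Reproduce the MinMaxScaler result column by column with numpy.
manual = (arr - arr.min(axis=0)) / (arr.max(axis=0) - arr.min(axis=0))
print(np.allclose(manual, arr_minmax))  # True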