Python package for machine learning with imbalanced data #machine_learning #chemoinformatics

Recently I have been struggling with imbalanced data. I didn't have a good idea of how to handle it, so my predictive models showed poor performance.

A few days ago, I found a useful package for learning from imbalanced data, named 'imbalanced-learn'.

It can be installed from conda. The package provides methods for over-sampling and under-sampling.
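For example, a typical install command (assuming the conda-forge channel) looks like this:

conda install -c conda-forge imbalanced-learn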

I was interested in it and tried to use it on a drug discovery dataset.

The following example uses two methods, SMOTE (Synthetic Minority Over-sampling Technique) and ADASYN (Adaptive Synthetic sampling approach).

Both methods are over-sampling approaches, so they generate artificial data points for the minority class.
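Roughly speaking, SMOTE creates a new minority sample by interpolating between a real minority sample and one of its k nearest minority neighbors; ADASYN additionally decides how many samples to generate around each point based on how hard it is to learn. The following is only a minimal sketch of the SMOTE idea with NumPy and scikit-learn, not the imbalanced-learn implementation:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Rough sketch of SMOTE: new point = x + u * (neighbor - x), u in [0, 1)."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)          # idx[:, 0] is the sample itself
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))       # pick a random minority sample
        j = rng.choice(idx[i, 1:])         # pick one of its k nearest neighbors
        u = rng.random()
        new_points.append(X_min[i] + u * (X_min[j] - X_min[i]))
    return np.array(new_points)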

First, import the packages and load the dataset.


%matplotlib inline
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import ADASYN
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython import display
from sklearn.decomposition import PCA
df = pd.read_csv('chembl_5HT.csv')
df = df.dropna()

Next, make an imbalanced class label.

# define class: pIC50 > 9 is active (1) and the rest is inactive (0).
df['CLS'] = np.array(df.pchembl_value > 9, dtype=int)
df.CLS.hist()

Then calculate fingerprints with RDKit.

mols = [Chem.MolFromSmiles(smi) for smi in df.canonical_smiles]
fps = [AllChem.GetMorganFingerprintAsBitVect(mol, 2) for mol in mols]

def fp2np(fp):
    arr = np.zeros((0,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

X = np.array([fp2np(fp) for fp in fps])
Y = df.CLS.to_numpy()
train_X, test_X, train_Y, test_Y = train_test_split(X, Y, random_state=123, test_size=0.2)

Finally, apply three approaches to the data: no resampling (default), SMOTE, and ADASYN.

print(train_X.shape)
print(train_Y.shape)
print(sum(train_Y)/len(train_Y))
>(11340, 2048)
>(11340,)
>0.08686067019400352
rf = RandomForestClassifier(n_estimators=10)
rf.fit(train_X, train_Y)
pred_Y = rf.predict(test_X)
print(classification_report(test_Y, pred_Y))
print(confusion_matrix(test_Y, pred_Y))
----out

              precision    recall  f1-score   support

           0       0.95      0.97      0.96      2586
           1       0.57      0.42      0.48       250

    accuracy                           0.92      2836
   macro avg       0.76      0.69      0.72      2836
weighted avg       0.91      0.92      0.92      2836

[[2506   80]
 [ 145  105]]

Then try resampling. After resampling, the ratio of negative to positive samples is fifty-fifty.

X_resampled, Y_resampled = SMOTE().fit_resample(train_X, train_Y)
print(X_resampled.shape)
print(Y_resampled.shape)
print(sum(Y_resampled)/len(Y_resampled))
>(20710, 2048)
>(20710,)
>0.5

rf = RandomForestClassifier(n_estimators=10)
rf.fit(X_resampled, Y_resampled)
pred_Y = rf.predict(test_X)
print(classification_report(test_Y, pred_Y))
print(confusion_matrix(test_Y, pred_Y))
----out

              precision    recall  f1-score   support

           0       0.95      0.95      0.95      2586
           1       0.47      0.48      0.48       250

    accuracy                           0.91      2836
   macro avg       0.71      0.72      0.71      2836
weighted avg       0.91      0.91      0.91      2836

[[2451  135]
 [ 129  121]]
X_resampled, Y_resampled = ADASYN().fit_resample(train_X, train_Y)
print(X_resampled.shape)
print(Y_resampled.shape)
print(sum(Y_resampled)/len(Y_resampled))
>(20884, 2048)
>(20884,)
>0.5041658686075464
rf = RandomForestClassifier(n_estimators=10)
rf.fit(X_resampled, Y_resampled)
pred_Y = rf.predict(test_X)
print(classification_report(test_Y, pred_Y))
print(confusion_matrix(test_Y, pred_Y))
----out

              precision    recall  f1-score   support

           0       0.95      0.94      0.95      2586
           1       0.44      0.47      0.46       250

    accuracy                           0.90      2836
   macro avg       0.70      0.71      0.70      2836
weighted avg       0.90      0.90      0.90      2836

[[2437  149]
 [ 132  118]]

Hmm, the resampling results are not improved...

Let's check the chemical space with PCA.

pca = PCA(n_components=3)
res = pca.fit_transform(X)

col = {0:'blue', 1:'yellow'}
color = [col[int(i)] for i in Y]
plt.figure(figsize=(10,7))
plt.scatter(res[:,0], res[:,1], c=color, alpha=0.5)

The negative (blue) and positive (yellow) classes are located very close to each other.

In my example, the generated fingerprints do not correspond to real molecules. Finally, I did PCA on the resampled data.

pca = PCA(n_components=3)
res = pca.fit_transform(X_resampled)
col = {0:'blue', 1:'yellow'}
color = [col[int(i)] for i in Y_resampled]
plt.figure(figsize=(10,7))
plt.scatter(res[:,0], res[:,1], c=color, alpha=0.5)

Wow, both classes are located in the same chemical space... It looks difficult to classify to me, but the random forest showed better performance than I expected.

I need to learn much more about how to handle imbalanced data in drug discovery...

Imbalanced-learn also has under-sampling methods. I would like to try them if I have time.
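As a pointer for later, a minimal under-sampling sketch with imbalanced-learn's RandomUnderSampler (reusing train_X and train_Y from above) could look like this:

from imblearn.under_sampling import RandomUnderSampler

# Randomly drop majority-class samples until both classes are balanced
rus = RandomUnderSampler(random_state=123)
X_under, Y_under = rus.fit_resample(train_X, train_Y)
print(X_under.shape, sum(Y_under) / len(Y_under))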

https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/imbalanced_learn.ipynb

Vote Vote Vote #chemoinformatics

Some days ago, I posted about an ensemble classification method named 'blending'. The method is not implemented in scikit-learn, so I am implementing the function now.
By the way, scikit-learn has an ensemble classification method named 'VotingClassifier'.

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html#sklearn.ensemble.VotingClassifier
The following explanation is from the sklearn documentation.

The idea behind the VotingClassifier is to combine conceptually different machine learning classifiers and use a majority vote or the average predicted probabilities (soft vote) to predict the class labels. Such a classifier can be useful for a set of equally well performing model in order to balance out their individual weaknesses.

The classifier can combine many classifiers very easily.
It has two voting modes: one is 'hard' and the other is 'soft'.
From the documentation...
If ‘hard’, uses predicted class labels for majority rule voting. Else if ‘soft’, predicts the class label based on the argmax of the sums of the predicted probabilities, which is recommended for an ensemble of well-calibrated classifiers.
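To make the difference concrete, here is a tiny illustration with made-up probabilities for a single sample, chosen so that the two modes disagree (just a sketch, not scikit-learn code):

import numpy as np

# Toy predicted probabilities from three classifiers (made-up numbers)
proba = np.array([[0.10, 0.90],   # classifier 1
                  [0.60, 0.40],   # classifier 2
                  [0.55, 0.45]])  # classifier 3

hard = np.bincount(proba.argmax(axis=1)).argmax()  # majority of predicted labels -> 0
soft = proba.mean(axis=0).argmax()                 # argmax of averaged probabilities -> 1
print(hard, soft)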

I used the classifier for QSAR modeling.

In the following code, I used four classifiers and BASE.csv from MoleculeNet as the test dataset.
The code is very simple! Just pass the estimators defined in the dictionary to VotingClassifier.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from xgboost import XGBClassifier

clf_dict = {'RF': RandomForestClassifier(n_estimators=100),
        'ETC': ExtraTreesClassifier(n_estimators=100),
        'GBC': GradientBoostingClassifier(learning_rate=0.05),
        'XGB': XGBClassifier(n_estimators=100),
        'SVC': SVC(probability=True, gamma='auto')}

voting_clf = VotingClassifier(estimators=[("RF", clf_dict["RF"]),
                                        ("GBC", clf_dict["GBC"]),
                                        ("XGB", clf_dict["XGB"]),
                                        ("SVC", clf_dict["SVC"])        
                                    ], voting='hard')

dataset = np.load("train.npz")['arr_0']
X = dataset[:,:-1]
y = dataset[:,-1]
idx = np.random.permutation(y.size)
X = X[idx]
y = y[idx]
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=794)
nfolds = 10
skf = StratifiedKFold(nfolds)
for i, (train, val) in enumerate(skf.split(train_X, train_y)):
    #print('fold {}'.format(i))
    X_train = train_X[train]
    y_train = train_y[train]
    X_val = train_X[val]
    y_val = train_y[val]
    voting_clf.fit(X_train, y_train)
y_pred = voting_clf.predict(test_X)
print("Voting!")
print(confusion_matrix(test_y, y_pred))
print(classification_report(test_y, y_pred))

rf_clf = RandomForestClassifier(n_estimators=100)
 
for i, (train, val) in enumerate(skf.split(train_X, train_y)):
    #print('fold {}'.format(i))
    X_train = train_X[train]
    y_train = train_y[train]
    X_val = train_X[val]
    y_val = train_y[val]
    rf_clf.fit(X_train, y_train)
y_pred = rf_clf.predict(test_X)
print("Random Forest!")
print(confusion_matrix(test_y, y_pred))
print(classification_report(test_y, y_pred))

svc_clf = SVC(probability=True, gamma='auto')
for i, (train, val) in enumerate(skf.split(train_X, train_y)):
    #print('fold {}'.format(i))
    X_train = train_X[train]
    y_train = train_y[train]
    X_val = train_X[val]
    y_val = train_y[val]
    svc_clf.fit(X_train, y_train)
y_pred = svc_clf.predict(test_X)
print("SV!")
print(confusion_matrix(test_y, y_pred))
print(classification_report(test_y, y_pred)) 

Then run the code.
In this example, the voting method does not outperform the random forest or the support vector classifier. But it is worth knowing that sklearn provides such a useful feature for ensemble learning, I think. ;-)

iwatobipen$ python voting.py 
Voting!
[[10  0  0]
 [ 0 11  0]
 [ 0  1  8]]
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        10
         1.0       0.92      1.00      0.96        11
         2.0       1.00      0.89      0.94         9

   micro avg       0.97      0.97      0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30

Random Forest!
[[10  0  0]
 [ 0 11  0]
 [ 0  1  8]]
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        10
         1.0       0.92      1.00      0.96        11
         2.0       1.00      0.89      0.94         9

   micro avg       0.97      0.97      0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30

SV!
[[10  0  0]
 [ 0 11  0]
 [ 0  1  8]]
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        10
         1.0       0.92      1.00      0.96        11
         2.0       1.00      0.89      0.94         9

   micro avg       0.97      0.97      0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30

Visualize important features of machine learning models #RDKit

As you know, RDKit 2018.09.1 has very exciting methods named 'DrawMorganBit' and 'DrawMorganBits'. They can render the bit information of a Morgan fingerprint.
They are described in the following blog post.
http://rdkit.blogspot.com/2018/10/using-new-fingerprint-bit-rendering-code.html
And if you can read Japanese, excellent posts are also available.
https://future-chem.com/rdkit-fp-visualize/
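Before going further, here is a minimal usage sketch of DrawMorganBit on an arbitrary molecule (the SMILES below is just a hypothetical example, not from the dataset used later):

from rdkit import Chem
from rdkit.Chem import AllChem, Draw

mol = Chem.MolFromSmiles('c1ccccc1CC(=O)O')  # hypothetical example molecule
bi = {}
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024, bitInfo=bi)
bit = list(bi.keys())[0]                     # pick one of the on bits
Draw.DrawMorganBit(mol, bit, bi)             # renders the atom environment for that bit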

What I want to do in this blog post is visualize the important features of a random forest classifier.
With a fingerprint-based predictor, it is sometimes difficult to understand the feature importance of each bit. I think that using the new method, I can visualize the important features of active molecules.

Let's try it.
First, import several packages.

import os
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import DataStructs
from rdkit.Chem import RDConfig
from rdkit.Chem import rdBase
print(rdBase.rdkitVersion)

Then load the dataset and define the mol2fp function.

trainpath = os.path.join(RDConfig.RDDocsDir, 'Book/data/solubility.train.sdf')
testpath = os.path.join(RDConfig.RDDocsDir, 'Book/data/solubility.test.sdf')
solclass = {'(A) low':0, '(B) medium': 1, '(C) high': 2}
n_splits = 10
def mol2fp(mol,nBits=1024):
    bitInfo={}
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=nBits, bitInfo=bitInfo)
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr, bitInfo
trainmols = [m for m in Chem.SDMolSupplier(trainpath)]
testmols = [m for m in Chem.SDMolSupplier(testpath)]

Then get X and Y.
In this blog post I used the solubility classification data.

trainX = np.array([mol2fp(m)[0] for m in trainmols])
trainY = np.array([solclass[m.GetProp('SOL_classification')] for m in trainmols], dtype=int)

testX = np.array([mol2fp(m)[0] for m in testmols])
testY = np.array([solclass[m.GetProp('SOL_classification')] for m in testmols], dtype=int)

Then train a RandomForestClassifier.
I used StratifiedKFold instead of KFold.
The difference between the two methods is well described at the following URL.
https://www.kaggle.com/ogrellier/kfold-or-stratifiedkfold
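A quick toy comparison with made-up labels (90 zeros, 10 ones) shows the point: StratifiedKFold keeps the class ratio in every validation fold, while plain KFold may not.

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))
for name, cv in [("KFold", KFold(5, shuffle=True, random_state=0)),
                 ("StratifiedKFold", StratifiedKFold(5, shuffle=True, random_state=0))]:
    ratios = [y[val].mean() for _, val in cv.split(X, y)]
    print(name, np.round(ratios, 2))  # StratifiedKFold gives ~0.1 in every fold

Back to the solubility model.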

rfclf = RandomForestClassifier(n_estimators=50)
skf = StratifiedKFold(n_splits=n_splits)
for train, test in skf.split(trainX, trainY):
    trainx = trainX[train]
    trainy = trainY[train]
    testx = trainX[test]
    testy = trainY[test]
    rfclf.fit(trainx, trainy)
    predy = rfclf.predict(testx)
    print(accuracy_score(testy, predy))

Then get the important features. It seems a little bit tricky. And pick the indices of high-solubility molecules.

fimportance = rfclf.feature_importances_
fimportance_dict = dict(zip(range(1024), fimportance))
sorteddata = sorted(fimportance_dict.items(), key=lambda x:-x[1])
top50feat = [x[0] for x in sorteddata][:50]
solblemol_idx = []
for i, v in enumerate(testY):
    if v == 2:
        solblemol_idx.append(i)

Now ready.
Test!!!!

testidx = solblemol_idx[-1] 
testmols[testidx]

rfclf.predict([testX[testidx]])
array([2])

Get the fingerprint and bit information, and extract the important features from the on bits.

fp, bitinfo = mol2fp(testmols[testidx])
onbit = [bit for bit in bitinfo.keys()]
importantonbits = list(set(onbit) & set(top50feat))

Finally visualize important bit information with structures.

tpls = [(testmols[testidx], x, bitinfo) for x in importantonbits]
Draw.DrawMorganBits(tpls, legends = [str(x) for x in importantonbits])

This classifier can detect substructures with oxygen atoms and sp3 carbons.
The example molecule has amino functional groups, but the predictor does not detect nitrogen as an important group.
I think this is because the model is not well optimized.
Still, it is useful to be able to visualize the features of a molecule.
All the code is uploaded to Google Colab.
Readers who are interested in the code can run it on Colab!
https://colab.research.google.com/drive/16NmSqHMD3Tw562PeCh0u22OB9oPyqCsw

about scaling

Some days ago, I wanted to scale my data.
I found that scikit-learn is the best practice for this.
Like this...

import numpy as np
from sklearn import preprocessing

min_max_scaler = preprocessing.MinMaxScaler()
arr = np.array([[1, 10, 200],
                [4, 30, 100],
                [3, 20, 300]], dtype="float")

arr_minmax = min_max_scaler.fit_transform(arr)
print(arr_minmax)

Then, I got….

array([[ 0.        ,  0.        ,  0.5       ],
       [ 1.        ,  1.        ,  0.        ],
       [ 0.66666667,  0.5       ,  1.        ]])

It was easy!
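For reference, MinMaxScaler scales each column to the [0, 1] range as (x - min) / (max - min). Here is a quick manual check of the same computation in plain NumPy (just a sketch to show what the scaler does, not part of the original example):

import numpy as np

arr = np.array([[1, 10, 200],
                [4, 30, 100],
                [3, 20, 300]], dtype="float")

# column-wise (x - min) / (max - min); this reproduces MinMaxScaler's output
manual_minmax = (arr - arr.min(axis=0)) / (arr.max(axis=0) - arr.min(axis=0))
print(manual_minmax)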