A Python package for machine learning with imbalanced data #machine_learning #chemoinformatics

Recently I have been struggling with imbalanced data. I didn't have a good way to handle it, so my predictive models showed poor performance.

A few days ago, I found a useful package for learning from imbalanced data called ‘imbalanced-learn‘.

It can be installed from conda. The package provides methods for over-sampling and under-sampling.
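For example, via the conda-forge channel:

conda install -c conda-forge imbalanced-learn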

I was interested in it and tried it on a drug discovery dataset.

The following example uses two methods: SMOTE (Synthetic Minority Over-sampling Technique) and ADASYN (Adaptive Synthetic sampling approach).

Both are over-sampling approaches, so they generate artificial minority-class data. A rough sketch of the core idea is shown below.
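As an illustration of what SMOTE does, here is a simplified sketch of the interpolation idea (not imbalanced-learn's actual implementation; smote_sketch is a name I made up): pick a minority sample, pick one of its k nearest minority neighbors, and interpolate a synthetic point between them.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, seed=0):
    # minimal SMOTE idea: interpolate between a minority sample and
    # one of its k nearest minority neighbors
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)           # idx[:, 0] is each point itself
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))        # a random minority sample
        j = idx[i, rng.integers(1, k + 1)]  # one of its k nearest neighbors
        lam = rng.random()                  # interpolation factor in [0, 1)
        new_points.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(new_points)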

First, import the packages and load the dataset.


%matplotlib inline
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import ADASYN
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython import display
from sklearn.decomposition import PCA
df = pd.read_csv('chembl_5HT.csv')
df = df.dropna()

Next, make an imbalanced label.

# define class: pChEMBL value > 9 is active (1), otherwise inactive (0)
df['CLS'] = (df.pchembl_value > 9).astype(int)
df.CLS.hist()

Then calculate fingerprints with RDKit.

mols = [Chem.MolFromSmiles(smi) for smi in df.canonical_smiles]
# Morgan fingerprints with radius 2 (2048 bits by default)
fps = [AllChem.GetMorganFingerprintAsBitVect(mol, 2) for mol in mols]

def fp2np(fp):
    # convert an RDKit bit vector into a numpy array
    arr = np.zeros((0,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

X = np.array([fp2np(fp) for fp in fps])
Y = df.CLS.to_numpy()
train_X, test_X, train_Y, test_Y = train_test_split(X, Y, random_state=123, test_size=0.2)

Finally, apply three approaches to the data: no resampling (default), SMOTE, and ADASYN.

print(train_X.shape)
print(train_Y.shape)
print(sum(train_Y)/len(train_Y))
>(11340, 2048)
>(11340,)
>0.08686067019400352
rf = RandomForestClassifier(n_estimators=10)
rf.fit(train_X, train_Y)
pred_Y = rf.predict(test_X)
print(classification_report(test_Y, pred_Y))
print(confusion_matrix(test_Y, pred_Y))
----out

              precision    recall  f1-score   support

           0       0.95      0.97      0.96      2586
           1       0.57      0.42      0.48       250

    accuracy                           0.92      2836
   macro avg       0.76      0.69      0.72      2836
weighted avg       0.91      0.92      0.92      2836

[[2506   80]
 [ 145  105]]

Then try resampling. After resampling, the ratio of negative to positive classes is fifty-fifty.

X_resampled, Y_resampled = SMOTE().fit_resample(train_X, train_Y)
print(X_resampled.shape)
print(Y_resampled.shape)
print(sum(Y_resampled)/len(Y_resampled))
>(20710, 2048)
>(20710,)
>0.5
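As an aside, the 1:1 result above is just the default (sampling_strategy='auto'). For binary problems, SMOTE also accepts a float giving the desired ratio of minority to majority samples after resampling, e.g.:

# resample so the minority class reaches half the size of the majority class
smote = SMOTE(sampling_strategy=0.5, random_state=42)
X_half, Y_half = smote.fit_resample(train_X, train_Y)
print(sum(Y_half) / len(Y_half))  # about 0.33: minority is half of majority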

rf = RandomForestClassifier(n_estimators=10)
rf.fit(X_resampled, Y_resampled)
pred_Y = rf.predict(test_X)
print(classification_report(test_Y, pred_Y))
print(confusion_matrix(test_Y, pred_Y))
----out

              precision    recall  f1-score   support

           0       0.95      0.95      0.95      2586
           1       0.47      0.48      0.48       250

    accuracy                           0.91      2836
   macro avg       0.71      0.72      0.71      2836
weighted avg       0.91      0.91      0.91      2836

[[2451  135]
 [ 129  121]]
X_resampled, Y_resampled = ADASYN().fit_resample(train_X, train_Y)
print(X_resampled.shape)
print(Y_resampled.shape)
print(sum(Y_resampled)/len(Y_resampled))
>(20884, 2048)
>(20884,)
>0.5041658686075464
rf = RandomForestClassifier(n_estimators=10)
rf.fit(X_resampled, Y_resampled)
pred_Y = rf.predict(test_X)

print(classification_report(test_Y, pred_Y))
print(confusion_matrix(test_Y, pred_Y))
----out

              precision    recall  f1-score   support

           0       0.95      0.94      0.95      2586
           1       0.44      0.47      0.46       250

    accuracy                           0.90      2836
   macro avg       0.70      0.71      0.70      2836
weighted avg       0.90      0.90      0.90      2836

[[2437  149]
 [ 132  118]]

Hmm, resampling did not improve the results…
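Before digging further, another simple baseline worth comparing (with no resampling at all) is to let the random forest reweight the classes itself. This is just a sketch, not tuned:

# class_weight='balanced' weights samples inversely to class frequencies
rf_w = RandomForestClassifier(n_estimators=10, class_weight='balanced')
rf_w.fit(train_X, train_Y)
print(classification_report(test_Y, rf_w.predict(test_X)))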

Let’s check the chemical space with PCA.

pca = PCA(n_components=3)
res = pca.fit_transform(X)

col = {0: 'blue', 1: 'yellow'}
color = [col[int(i)] for i in Y]
plt.figure(figsize=(10,7))
plt.scatter(res[:,0], res[:,1], c=color, alpha=0.5)

The negative (blue) and positive (yellow) classes are located very close together.

In my example, the generated fingerprints do not correspond to real molecules. Finally, run PCA on the resampled data.

pca = PCA(n_components=3)
res = pca.fit_transform(X_resampled)
col = {0: 'blue', 1: 'yellow'}
color = [col[int(i)] for i in Y_resampled]
plt.figure(figsize=(10,7))
plt.scatter(res[:,0], res[:,1], c=color, alpha=0.5)

Wow, both classes occupy the same chemical space… It looks difficult to classify to me, but the random forest still performed better than I expected.
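One quick way to confirm that the synthetic samples are not real fingerprints: Morgan fingerprint bits are 0/1, but SMOTE and ADASYN interpolate between samples, so the resampled matrix should contain fractional values.

print(np.isin(train_X, [0, 1]).all())      # True: real fingerprints are binary
print(np.isin(X_resampled, [0, 1]).all())  # expect False: interpolated values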

I need to learn much more about how to handle imbalanced data in drug discovery…

Imbalanced-learn also provides under-sampling methods. I would like to try them when I have time.
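For example, a minimal sketch with RandomUnderSampler, which simply drops majority samples until the classes balance:

from imblearn.under_sampling import RandomUnderSampler

X_under, Y_under = RandomUnderSampler(random_state=123).fit_resample(train_X, train_Y)
print(X_under.shape)                # far fewer rows than the over-sampled sets
print(sum(Y_under) / len(Y_under))  # 0.5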

https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/imbalanced_learn.ipynb
