Vote Vote Vote #chemoinformatics

Some days ago, I posted about an ensemble classification method named ‘blending’. The method is not implemented in scikit-learn, so I am implementing it myself now.
By the way, scikit-learn has an ensemble classification method named ‘VotingClassifier’.

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html#sklearn.ensemble.VotingClassifier
The following explanation is from the scikit-learn documentation.

The idea behind the VotingClassifier is to combine conceptually different machine learning classifiers and use a majority vote or the average predicted probabilities (soft vote) to predict the class labels. Such a classifier can be useful for a set of equally well performing model in order to balance out their individual weaknesses.

VotingClassifier makes it very easy to combine many classifiers.
It has two voting modes: hard and soft.
From the documentation:
If ‘hard’, uses predicted class labels for majority rule voting. Else if ‘soft’, predicts the class label based on the argmax of the sums of the predicted probabilities, which is recommended for an ensemble of well-calibrated classifiers.
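
To make the difference concrete, here is a minimal pure-Python sketch of the two rules. It is not scikit-learn's actual implementation; the helper names `hard_vote` and `soft_vote` and the probability values are made up for illustration, and sending ties to the lowest class index is a choice of this sketch.

```python
def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def hard_vote(probas):
    """Majority rule: each classifier casts one vote for its top class."""
    counts = [0] * len(probas[0])
    for p in probas:
        counts[argmax(p)] += 1
    return argmax(counts)


def soft_vote(probas):
    """Argmax of the per-class sums of the predicted probabilities."""
    sums = [sum(column) for column in zip(*probas)]
    return argmax(sums)


# Hypothetical class probabilities from three classifiers for one sample.
probas = [
    [0.90, 0.10],  # classifier A is very sure about class 0
    [0.40, 0.60],  # classifier B leans towards class 1
    [0.45, 0.55],  # classifier C leans towards class 1
]

print("hard:", hard_vote(probas))  # two of the three votes go to class 1
print("soft:", soft_vote(probas))  # summed probabilities favour class 0
```

In this toy case the two modes disagree: one confident classifier outweighs two hesitant ones under soft voting, which is why soft voting is recommended only when the probabilities are well calibrated.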

I used the classifier for QSAR modeling.

In the following code, I used four classifiers and BASE.csv from MoleculeNet as the test dataset.
The code is very simple! Just pass a list of (name, estimator) tuples, built here from the dictionary of classifiers, to VotingClassifier.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from xgboost import XGBClassifier

clf_dict = {'RF': RandomForestClassifier(n_estimators=100),
        'ETC': ExtraTreesClassifier(n_estimators=100),
        'GBC': GradientBoostingClassifier(learning_rate=0.05),
        'XGB': XGBClassifier(n_estimators=100),
        'SVC': SVC(probability=True, gamma='auto')}

voting_clf = VotingClassifier(estimators=[("RF", clf_dict["RF"]),
                                        ("GBC", clf_dict["GBC"]),
                                        ("XGB", clf_dict["XGB"]),
                                        ("SVC", clf_dict["SVC"])        
                                    ], voting='hard')

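# Load the data; the last column holds the class label.
dataset = np.load("train.npz")['arr_0']
X = dataset[:,:-1]
y = dataset[:,-1]
# Shuffle the samples (unseeded, so the order differs between runs).
idx = np.random.permutation(y.size)
X = X[idx]
y = y[idx]
# Hold out 20% of the data as the test set.
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=794)
nfolds = 10
skf = StratifiedKFold(nfolds)
# Note: fit() retrains from scratch on every call, so after this loop
# voting_clf holds only the model fitted on the last fold's training split.
for i, (train, val) in enumerate(skf.split(train_X, train_y)):
    #print('fold {}'.format(i))
    X_train = train_X[train]
    y_train = train_y[train]
    # The validation split is prepared but not scored in this script.
    X_val = train_X[val]
    y_val = train_y[val]
    voting_clf.fit(X_train, y_train)
# Evaluate on the held-out test set.
y_pred = voting_clf.predict(test_X)
print("Voting!")
print(confusion_matrix(test_y, y_pred))
print(classification_report(test_y, y_pred))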

rf_clf = RandomForestClassifier(n_estimators=100)
 
for i, (train, val) in enumerate(skf.split(train_X, train_y)):
    #print('fold {}'.format(i))
    X_train = train_X[train]
    y_train = train_y[train]
    X_val = train_X[val]
    y_val = train_y[val]
    rf_clf.fit(X_train, y_train)
y_pred = rf_clf.predict(test_X)
print("Random Forest!")
print(confusion_matrix(test_y, y_pred))
print(classification_report(test_y, y_pred))

svc_clf = SVC(probability=True, gamma='auto')
for i, (train, val) in enumerate(skf.split(train_X, train_y)):
    #print('fold {}'.format(i))
    X_train = train_X[train]
    y_train = train_y[train]
    X_val = train_X[val]
    y_val = train_y[val]
    svc_clf.fit(X_train, y_train)
y_pred = svc_clf.predict(test_X)
print("SV!")
print(confusion_matrix(test_y, y_pred))
print(classification_report(test_y, y_pred))

Then run the code.
In this example, the voting method does not outperform the random forest or the support vector classifier: all three produce the same confusion matrix on the test set. Still, I think it is worth knowing that sklearn provides this useful feature for ensemble learning. ;-)

iwatobipen$ python voting.py 
Voting!
[[10  0  0]
 [ 0 11  0]
 [ 0  1  8]]
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        10
         1.0       0.92      1.00      0.96        11
         2.0       1.00      0.89      0.94         9

   micro avg       0.97      0.97      0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30

Random Forest!
[[10  0  0]
 [ 0 11  0]
 [ 0  1  8]]
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        10
         1.0       0.92      1.00      0.96        11
         2.0       1.00      0.89      0.94         9

   micro avg       0.97      0.97      0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30

SV!
[[10  0  0]
 [ 0 11  0]
 [ 0  1  8]]
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        10
         1.0       0.92      1.00      0.96        11
         2.0       1.00      0.89      0.94         9

   micro avg       0.97      0.97      0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30
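
In the script above, each fit call inside the fold loop retrains the model, so only the last fold's model reaches the test set. If the goal is to compare the models across all folds, one possible approach is cross_val_score. The sketch below is only an illustration under my own assumptions: it uses synthetic data from make_classification instead of train.npz, leaves out XGBoost to avoid the extra dependency, and switches to soft voting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; the post loads train.npz instead).
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

estimators = [
    ("RF", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("GBC", GradientBoostingClassifier(learning_rate=0.05, random_state=0)),
    ("SVC", SVC(probability=True, gamma="auto", random_state=0)),
]

# Score every base model and the soft-voting ensemble on the same folds.
models = dict(estimators)
models["Voting"] = VotingClassifier(estimators=estimators, voting="soft")

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
mean_scores = {}
for name, model in models.items():
    # cross_val_score clones the model for each fold, so nothing leaks
    # from one fold to the next.
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    mean_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Each model gets one mean accuracy over the five folds, which is easier to compare than a single test-set confusion matrix from the last fold's model.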

2 thoughts on “Vote Vote Vote #chemoinformatics”

Jérémy says:

29/11/2018 at 23:07

Did you check the mlxtend library for combining models? But maybe it’s not doing what you are looking for. I don’t know the library in detail but saw it on Sebastian Raschka’s Twitter

1. iwatobipen says:
  
  29/11/2018 at 23:17
  
  Yah! I have checked mlxtend. It seems useful, but I have not tried using it yet. I will use it ASAP. Thanks.

Published by iwatobipen
