Make original sklearn classifier #sklearn #chemoinfo

I previously posted code about 'blending', which is one of the strategies for ensemble learning. But that code had many hard-coded parts, so it was difficult to use in my work. In this post, I tried to make a new scikit-learn classifier class for ensemble learning and test the code.

Most of the code comes from my old post.
To make the blending classifier, the class inherits from BaseEstimator and ClassifierMixin, and defines the fit, predict and predict_proba methods.
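Before the class itself, the core idea (out-of-fold probabilities from the level-1 models become the features of the level-2 model) can be sketched in a few lines with `cross_val_predict`. The toy data and model choices here are my own for illustration, not part of the class below:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Toy 3-class data standing in for real descriptors.
X, y = make_classification(n_samples=300, n_classes=3, n_informative=5,
                           random_state=794)
l1_clfs = [RandomForestClassifier(n_estimators=50, random_state=0),
           LogisticRegression(max_iter=1000)]
# Each model's out-of-fold class probabilities, stacked side by side,
# become the input features of the level-2 classifier.
meta_features = np.hstack([cross_val_predict(clf, X, y, cv=5,
                                             method='predict_proba')
                           for clf in l1_clfs])
l2_clf = LogisticRegression(max_iter=1000).fit(meta_features, y)
print(meta_features.shape)  # (300, 6): 2 models x 3 classes
```

The class below does the same thing by hand with StratifiedKFold, which makes the intermediate arrays explicit.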

Main code is below.

from sklearn.base import BaseEstimator
from sklearn.base import ClassifierMixin
from sklearn.base import TransformerMixin
from sklearn.base import clone
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split

class BlendingClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, l1_clfs, l2_clf, 
                 n_hold=5, test_size=0.2, verbose=0,
                 use_clones=True, random_state=794):
        self.l1_clfs = l1_clfs
        self.l2_clf = l2_clf

        self.n_hold = n_hold
        self.test_size = test_size
        self.verbose = verbose
        self.use_clones = use_clones
        self.random_state = random_state
        self.num_cls = None

    def fit(self, X, y):
        skf = StratifiedKFold(n_splits=self.n_hold, shuffle=True, random_state=self.random_state)
        if self.use_clones:
            self.l1_clfs_ = [clone(clf) for clf in self.l1_clfs]
            self.l2_clf_ = clone(self.l2_clf)
        else:
            self.l1_clfs_ = self.l1_clfs
            self.l2_clf_ = self.l2_clf

        self.num_cls = len(set(y))
        if self.verbose > 0:
            print("Fitting {} l1_classifiers...".format(len(self.l1_clfs)))
            print("{} classes classification".format(self.num_cls))
        dataset_blend_train = np.zeros((X.shape[0], len(self.l1_clfs_), self.num_cls))

        for j, clf in enumerate(self.l1_clfs_):
            for i, (train_idx, test_idx) in enumerate(skf.split(X, y)):
                if self.verbose > 0:
                    print('{}-{}th hold, {} classifier'.format(j+1, i+1, type(clf)))
                train_i_x, train_i_y  = X[train_idx], y[train_idx]
                test_i_x, test_i_y = X[test_idx], y[test_idx]
      , train_i_y)
                dataset_blend_train[test_idx, j, :] = clf.predict_proba(test_i_x)
        if self.verbose > 0:
            print('--- Blending ---')
        dataset_blend_train = dataset_blend_train.reshape((dataset_blend_train.shape[0], -1))
, y)
        return self

    def predict(self, X):
        l1_output = np.zeros((X.shape[0], len(self.l1_clfs_), self.num_cls))
        for i, clf in enumerate(self.l1_clfs_):
            pred_y = clf.predict_proba(X)
            l1_output[:, i, :] = pred_y
        l1_output = l1_output.reshape((X.shape[0], -1))
        return self.l2_clf_.predict(l1_output)

    def predict_proba(self, X):
        l1_output = np.zeros((X.shape[0], len(self.l1_clfs_), self.num_cls))
        for i, clf in enumerate(self.l1_clfs_):
            pred_y = clf.predict_proba(X)
            l1_output[:, i, :] = pred_y
        l1_output = l1_output.reshape((X.shape[0], -1))
        return self.l2_clf_.predict_proba(l1_output)

Now we're ready, so let's test the code! It works like this:

from blending_classification import BlendingClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
from xgboost import XGBClassifier

rf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
et = ExtraTreesClassifier(n_estimators=100, n_jobs=-1)
gbc = GradientBoostingClassifier(learning_rate=0.01)
xgbc = XGBClassifier(n_estimators=100, n_jobs=-1)
# To use SVC, probability option must be True
svc = SVC(probability=True, gamma='auto')
# The class is two layers blending classifier.
# layer one is set of classifier.
# layer two is final classifier which uses output of layer one.

l1_clfs = [rf, et, gbc, xgbc]
l2_clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
blendclf = BlendingClassifier(l1_clfs, l2_clf, verbose=1), train_y)
pred_y = blendclf.predict(test_X)
print(classification_report(test_y, pred_y))
              precision    recall  f1-score   support

           0       0.76      0.79      0.78       102
           1       0.69      0.61      0.65       115
           2       0.56      0.70      0.62        40

   micro avg       0.70      0.70      0.70       257
   macro avg       0.67      0.70      0.68       257
weighted avg       0.70      0.70      0.70       257
cm = confusion_matrix(test_y, pred_y)

Next, run same task with RandomForest.

mono_rf = RandomForestClassifier(n_estimators=100, n_jobs=-1), train_y)
pred_y2 = mono_rf.predict(test_X)
print(classification_report(test_y, pred_y2))
              precision    recall  f1-score   support

           0       0.81      0.81      0.81       102
           1       0.77      0.75      0.76       115
           2       0.72      0.78      0.75        40

   micro avg       0.78      0.78      0.78       257
   macro avg       0.77      0.78      0.77       257
weighted avg       0.78      0.78      0.78       257
cm2 = confusion_matrix(test_y, pred_y2)

Hmm… RandomForest alone shows better performance than the blending classifier. I think ensemble methods tend to return robust models, but they need parameter tuning. (I know that is true for every machine learning method ;-) )
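As a starting point for that tuning, the standard GridSearchCV pattern applies. Here is a minimal sketch on toy data, tuning a bare RandomForest; the grid values and toy data are arbitrary choices for illustration, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy 3-class data standing in for real descriptors.
X, y = make_classification(n_samples=300, n_classes=3, n_informative=5,
                           random_state=794)
param_grid = {'n_estimators': [50, 100], 'max_depth': [None, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

Because BlendingClassifier inherits from BaseEstimator, get_params and set_params work on it too, so nested parameters of the level-2 model should be reachable as e.g. `l2_clf__n_estimators` in the same pattern, though I have not benchmarked this.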

Anyway, I wrote a blending classifier class today and uploaded the code to my repo.
Any comments and/or suggestions are appreciated.

*All code can be found in the following repo.

*And the notebook can be seen at the following URL.

Two days to 2019!


Published by iwatobipen

I'm a medicinal chemist at a mid-size pharmaceutical company. I love chemoinfo, coding, organic synthesis, and my family.
