I previously posted code about 'blending', which is one of the strategies for ensemble learning. But the code had many hard-coded parts, so it was difficult to use in my work. In this post, I tried to make a new scikit-learn-style classifier class for ensemble learning and tested the code.
Most of the code comes from my old post.
To make the blending classifier, the class inherits from BaseEstimator and ClassifierMixin, and defines fit, predict and predict_proba methods.
The main code is below.
from sklearn.base import BaseEstimator
from sklearn.base import ClassifierMixin
from sklearn.base import clone
import numpy as np
from sklearn.model_selection import StratifiedKFold


class BlendingClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, l1_clfs, l2_clf, n_hold=5, verbose=0,
                 use_clones=True, random_state=794):
        self.l1_clfs = l1_clfs
        self.l2_clf = l2_clf
        self.n_hold = n_hold
        self.verbose = verbose
        self.use_clones = use_clones
        self.random_state = random_state
        self.num_cls = None

    def fit(self, X, y):
        # shuffle=True is required when random_state is set
        skf = StratifiedKFold(n_splits=self.n_hold, shuffle=True,
                              random_state=self.random_state)
        if self.use_clones:
            self.l1_clfs_ = [clone(clf) for clf in self.l1_clfs]
            self.l2_clf_ = clone(self.l2_clf)
        else:
            self.l1_clfs_ = self.l1_clfs
            self.l2_clf_ = self.l2_clf
        self.num_cls = len(set(y))
        if self.verbose > 0:
            print("Fitting {} l1_classifiers...".format(len(self.l1_clfs)))
            print("{} classes classification".format(self.num_cls))
        # out-of-fold class probabilities for each level-1 classifier
        dataset_blend_train = np.zeros((X.shape[0], len(self.l1_clfs_), self.num_cls))
        for j, clf in enumerate(self.l1_clfs_):
            for i, (train_idx, test_idx) in enumerate(skf.split(X, y)):
                if self.verbose > 0:
                    print('{}-{}th hold, {} classifier'.format(j+1, i+1, type(clf)))
                train_i_x, train_i_y = X[train_idx], y[train_idx]
                test_i_x = X[test_idx]
                clf.fit(train_i_x, train_i_y)
                dataset_blend_train[test_idx, j, :] = clf.predict_proba(test_i_x)
            # note: after the loop each clf stays fitted on the last fold
            # and is reused as-is in predict / predict_proba
        if self.verbose > 0:
            print('--- Blending ---')
            print(dataset_blend_train.shape)
        dataset_blend_train = dataset_blend_train.reshape((dataset_blend_train.shape[0], -1))
        self.l2_clf_.fit(dataset_blend_train, y)
        return self

    def predict(self, X):
        l1_output = np.zeros((X.shape[0], len(self.l1_clfs_), self.num_cls))
        for i, clf in enumerate(self.l1_clfs_):
            l1_output[:, i, :] = clf.predict_proba(X)
        l1_output = l1_output.reshape((X.shape[0], -1))
        return self.l2_clf_.predict(l1_output)

    def predict_proba(self, X):
        l1_output = np.zeros((X.shape[0], len(self.l1_clfs_), self.num_cls))
        for i, clf in enumerate(self.l1_clfs_):
            l1_output[:, i, :] = clf.predict_proba(X)
        l1_output = l1_output.reshape((X.shape[0], -1))
        return self.l2_clf_.predict_proba(l1_output)
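As a side note on the data layout: the level-2 classifier sees one column per (level-1 classifier, class) pair. Here is a minimal sketch of the reshape, with hypothetical shapes that are not from the post:

import numpy as np

n_samples, n_l1_clfs, n_classes = 6, 4, 3
# stacked out-of-fold probabilities, as built in fit()
blend = np.zeros((n_samples, n_l1_clfs, n_classes))
# flatten to (n_samples, n_l1_clfs * n_classes) before fitting the level-2 classifier
flat = blend.reshape((n_samples, -1))
print(flat.shape)  # (6, 12)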
Now we're ready, let's test the code! It works like this …
from blending_classification import BlendingClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
from xgboost import XGBClassifier

rf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
et = ExtraTreesClassifier(n_estimators=100, n_jobs=-1)
gbc = GradientBoostingClassifier(learning_rate=0.01)
xgbc = XGBClassifier(n_estimators=100, n_jobs=-1)
# To use SVC, the probability option must be True
svc = SVC(probability=True, gamma='auto')

# The class is a two-layer blending classifier.
# Layer one is a set of classifiers.
# Layer two is the final classifier, which uses the output of layer one.
l1_clfs = [rf, et, gbc, xgbc]
l2_clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)

blendclf = BlendingClassifier(l1_clfs, l2_clf, verbose=1)
blendclf.fit(train_X, train_y)
pred_y = blendclf.predict(test_X)
print(classification_report(test_y, pred_y))

              precision    recall  f1-score   support

           0       0.76      0.79      0.78       102
           1       0.69      0.61      0.65       115
           2       0.56      0.70      0.62        40

   micro avg       0.70      0.70      0.70       257
   macro avg       0.67      0.70      0.68       257
weighted avg       0.70      0.70      0.70       257

cm = confusion_matrix(test_y, pred_y)
plot_confusion_matrix(cm)
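plot_confusion_matrix is a small helper defined in the linked notebook. A minimal sketch with matplotlib, assuming it just draws the matrix with cell counts, could look like this:

import matplotlib.pyplot as plt
import numpy as np

def plot_confusion_matrix(cm):
    # draw the matrix and annotate each cell with its count
    fig, ax = plt.subplots()
    ax.imshow(cm, cmap='Blues')
    for (i, j), v in np.ndenumerate(cm):
        ax.text(j, i, str(v), ha='center', va='center')
    ax.set_xlabel('Predicted label')
    ax.set_ylabel('True label')
    plt.show()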
Next, run the same task with RandomForest alone.
mono_rf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
mono_rf.fit(train_X, train_y)
pred_y2 = mono_rf.predict(test_X)
print(classification_report(test_y, pred_y2))

              precision    recall  f1-score   support

           0       0.81      0.81      0.81       102
           1       0.77      0.75      0.76       115
           2       0.72      0.78      0.75        40

   micro avg       0.78      0.78      0.78       257
   macro avg       0.77      0.78      0.77       257
weighted avg       0.78      0.78      0.78       257

cm2 = confusion_matrix(test_y, pred_y2)
plot_confusion_matrix(cm2)
Hmm…… RandomForest shows better performance than the blending classifier. I think ensemble methods tend to return robust models, but parameter tuning is still needed. (I know it's the same for every machine learning method ;-) )
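By the way, because BlendingClassifier inherits from BaseEstimator, it exposes get_params/set_params, so the tuning could be done with GridSearchCV using the usual double-underscore syntax. A minimal sketch, with an example grid that is not from the post:

from sklearn.model_selection import GridSearchCV

# 'l2_clf__n_estimators' reaches into the level-2 classifier via set_params
param_grid = {'l2_clf__n_estimators': [50, 100, 200]}
gs = GridSearchCV(blendclf, param_grid, cv=3)
gs.fit(train_X, train_y)
print(gs.best_params_)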
Anyway, I wrote a blending classifier class today and uploaded the code to my repo.
Any comments and/or suggestions are appreciated.
*All code can be found in the following repo.
https://github.com/iwatobipen/skensemble
*And the notebook can be viewed at the following URL.
https://nbviewer.jupyter.org/github/iwatobipen/skensemble/blob/master/solubility.ipynb
Two days to 2019!