Ensemble learning with scikit-learn and XGBoost

I often post about deep learning, but today I would like to write about ensemble learning.
There are lots of documents that describe ensemble learning, and I found the following one very informative.

Kaggle Ensembling Guide

I am interested in one of the methods, named ‘blending’.
According to the guide above, the merits of ‘blending’ are:

Blending has a few benefits:

It is simpler than stacking.
It wards against an information leak: The generalizers and stackers use different data.
You do not need to share a seed for stratified folds with your teammates. Anyone can throw models in the ‘blender’ and the blender decides if it wants to keep that model or not.
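
The second benefit, that generalizers and stackers use different data, can be sketched with a single holdout split. This is only a minimal illustration under my own assumptions: the iris data, the two tree ensembles, the LogisticRegression blender and the 50/50 split are not taken from the guide.

```python
# Minimal holdout-blending sketch (data, models and split ratio are assumptions).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Split once: one part for the base models, one holdout part for the blender.
X_base, X_hold, y_base, y_hold = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

base_models = [RandomForestClassifier(n_estimators=50, random_state=0),
               ExtraTreesClassifier(n_estimators=50, random_state=0)]

# First layer: fit on the base part only, predict probabilities on the holdout part.
hold_features = np.hstack([m.fit(X_base, y_base).predict_proba(X_hold)
                           for m in base_models])

# Second layer: the blender never sees the rows the base models were fitted on.
blender = LogisticRegression(max_iter=1000)
blender.fit(hold_features, y_hold)

print(hold_features.shape)  # (75, 6): 75 holdout rows, 2 models x 3 classes
```

Because the blender is fitted only on holdout predictions, the labels used to fit the base models cannot leak into it.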

There are two layers in blending. The first layer is a set of classifiers trained on the training data. The second layer is a classifier trained on the predictions of the first layer (in my code, the out-of-fold predictions for the training data; the first layer's predictions for the test set are used only for evaluation).
I tried to write code for blending. In the following code I used scikit-learn and XGBoost.
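
Before the full script, here is a short sketch of what "out-of-fold predictions" means for a single first-layer classifier. It uses scikit-learn's cross_val_predict rather than a hand-written fold loop; the iris data, the 5 folds and the random forest settings are assumptions chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = load_iris(return_X_y=True)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0)

# Each row is predicted by a model that was fitted without that row,
# so these probabilities can serve as training features for a second layer.
oof_proba = cross_val_predict(clf, X, y, cv=skf, method='predict_proba')

print(oof_proba.shape)  # (150, 3): one probability per class for every sample
```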

First, import the libraries and define dictionaries of classifiers for my convenience.

import click
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from xgboost import XGBClassifier
l1_clf_dict = {'RF': RandomForestClassifier(n_estimators=100),
        'ETC': ExtraTreesClassifier(n_estimators=100),
        'GBC': GradientBoostingClassifier(learning_rate=0.05),
        'XGB': XGBClassifier(n_estimators=100),
        'SVC': SVC(probability=True, gamma='auto')}

l2_clf_dict = {'RF': RandomForestClassifier(n_estimators=100),
        'ETC': ExtraTreesClassifier(n_estimators=100),
        'GBC': GradientBoostingClassifier(learning_rate=0.05),
        'XGB': XGBClassifier(n_estimators=100),
        'SVC': SVC(probability=True, gamma='auto')}

Then I defined the model-building function. The following code can be applied to multiclass classification problems.
The code is a little complicated, and it only returns the set of trained classifiers. I would like to improve it in the near future.

@click.command()
@click.option('--l1', default='all', type=str, help='models of the first layer as comma-separated names: RF/ETC/GBC/XGB/SVC')
@click.option('--l2', default='XGB', type=str, help='model of second layer default is XGB')
@click.option('--nfolds', default=10, type=int, help='number of KFolds default is 10')
@click.option('--traindata', default='train.npz', type=str, help='data for training')
def buildmodel(l1, l2, nfolds, traindata):
    skf = StratifiedKFold(nfolds)
    dataset = np.load(traindata)['arr_0']
    X = dataset[:,:-1]
    y = dataset[:,-1]
    idx = np.random.permutation(y.size)
    X = X[idx]
    y = y[idx]
    num_cls = len(set(y))
    train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=794)

    if l1 == 'all':
        l1 = list(l1_clf_dict.keys())
        clfs = list(l1_clf_dict.values())
    else:
        # comma-separated model names, e.g. 'RF,XGB'
        l1 = l1.split(',')
        clfs = [l1_clf_dict[clf] for clf in l1]

    dataset_blend_train = np.zeros((train_X.shape[0], len(clfs), num_cls ))
    dataset_blend_test = np.zeros((test_X.shape[0], len(clfs), num_cls ))

    for j, clf in enumerate(clfs):
        dataset_blend_test_j = np.zeros((test_X.shape[0], nfolds, num_cls))
        for i, (train, val) in enumerate(skf.split(train_X, train_y)):
            print('fold {}'.format(i))
            X_train = train_X[train]
            y_train = train_y[train]
            X_val = train_X[val]
            y_val = train_y[val]
            clf.fit(X_train, y_train)
            # use clfs predicted value for next layer's training
            y_pred = clf.predict_proba(X_val)
            dataset_blend_train[val, j, :] = y_pred
            dataset_blend_test_j[:, i, :] = clf.predict_proba(test_X)
        dataset_blend_test[:, j, :] = dataset_blend_test_j.mean(1)
    l2_clf = l2_clf_dict[l2]
    dataset_blend_train = dataset_blend_train.reshape((dataset_blend_train.shape[0], -1))
    l2_clf.fit(dataset_blend_train, train_y)

    dataset_blend_test = dataset_blend_test.reshape((dataset_blend_test.shape[0], -1))
    y_pred = l2_clf.predict_proba(dataset_blend_test)
    print(classification_report(test_y, np.argmax(y_pred, 1)))
    print(confusion_matrix(test_y, np.argmax(y_pred, 1)))

    for i, key in enumerate(l1):
        print('layer 1 {}'.format(l1[i]))
        l1_pred = clfs[i].predict_proba(test_X)
        print(classification_report(test_y, np.argmax(l1_pred, 1)))
        print(confusion_matrix(test_y, np.argmax(l1_pred, 1)))  
    return (clfs, l2_clf)

if __name__=='__main__':
    buildmodel()
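
The array handling in the function above (averaging test predictions over folds, then flattening the per-classifier probabilities into one feature row) can be checked in isolation with NumPy. This is a toy sketch: the sizes and the random probabilities are made up, and the variable names are my own.

```python
import numpy as np

n_samples, n_clfs, n_folds, n_cls = 4, 2, 3, 3
rng = np.random.default_rng(0)

# Per-fold test predictions of one classifier: (samples, folds, classes).
test_pred_j = rng.dirichlet(np.ones(n_cls), size=(n_samples, n_folds))

# Average over the fold axis -> (samples, classes); rows still sum to 1.
test_pred_mean = test_pred_j.mean(axis=1)

# Collect predictions of all classifiers: (samples, classifiers, classes).
blend = np.zeros((n_samples, n_clfs, n_cls))
blend[:, 0, :] = test_pred_mean
blend[:, 1, :] = test_pred_mean[:, ::-1]  # a second, made-up classifier

# Flatten so each row holds [clf0 probabilities..., clf1 probabilities...].
flat = blend.reshape((blend.shape[0], -1))

print(flat.shape)  # (4, 6)
```

The second-layer classifier therefore sees n_clfs * n_cls features per sample, which is 5 * 3 = 15 when all five first-layer models are used on a three-class problem.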

Now I have finished the base code.
Let’s make sample data and run it.

# make data
import numpy as np
from sklearn.datasets import load_iris 
x = load_iris().data
y = load_iris().target
data = np.hstack((x, y.reshape(y.size, 1)))
np.savez('train.npz', data)
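
As a small check of this file format, the sketch below saves a toy array the same way and reads it back as the script does. The toy values and the temporary path are assumptions; the point is that np.savez stores a positional array under the key 'arr_0', and that stacking float features with integer labels turns the labels into floats, which is why the classes are reported as 0.0, 1.0 and 2.0.

```python
import os
import tempfile
import numpy as np

# Toy stand-in for the iris array: 4 feature columns plus the label in the last column.
features = np.arange(12, dtype=float).reshape(3, 4)
labels = np.array([0, 1, 2])
data = np.hstack((features, labels.reshape(labels.size, 1)))

path = os.path.join(tempfile.mkdtemp(), 'train.npz')
np.savez(path, data)  # positional arrays are stored as 'arr_0', 'arr_1', ...

with np.load(path) as npz:
    loaded = npz['arr_0']

X, y = loaded[:, :-1], loaded[:, -1]
print(X.shape, y.dtype)  # (3, 4) float64
```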

Run the code :)

iwatobipen$ python blending.py
fold 0
fold 1
fold 2
fold 7
fold 8
fold 9
(120, 5, 3)
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        13
         1.0       1.00      0.83      0.91         6
         2.0       0.92      1.00      0.96        11

   micro avg       0.97      0.97      0.97        30
   macro avg       0.97      0.94      0.96        30
weighted avg       0.97      0.97      0.97        30

[[13  0  0]
 [ 0  5  1]
 [ 0  0 11]]
layer 1 RF
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        13
         1.0       0.86      1.00      0.92         6
         2.0       1.00      0.91      0.95        11

   micro avg       0.97      0.97      0.97        30
   macro avg       0.95      0.97      0.96        30
weighted avg       0.97      0.97      0.97        30

[[13  0  0]
 [ 0  6  0]
 [ 0  1 10]]
layer 1 ETC
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        13
         1.0       0.71      0.83      0.77         6
         2.0       0.90      0.82      0.86        11

   micro avg       0.90      0.90      0.90        30
   macro avg       0.87      0.88      0.88        30
weighted avg       0.91      0.90      0.90        30

[[13  0  0]
 [ 0  5  1]
 [ 0  2  9]]
layer 1 GBC
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        13
         1.0       0.75      1.00      0.86         6
         2.0       1.00      0.82      0.90        11

   micro avg       0.93      0.93      0.93        30
   macro avg       0.92      0.94      0.92        30
weighted avg       0.95      0.93      0.93        30

[[13  0  0]
 [ 0  6  0]
 [ 0  2  9]]
layer 1 XGB
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        13
         1.0       0.75      1.00      0.86         6
         2.0       1.00      0.82      0.90        11

   micro avg       0.93      0.93      0.93        30
   macro avg       0.92      0.94      0.92        30
weighted avg       0.95      0.93      0.93        30

[[13  0  0]
 [ 0  6  0]
 [ 0  2  9]]
layer 1 SVC
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        13
         1.0       1.00      1.00      1.00         6
         2.0       1.00      1.00      1.00        11

   micro avg       1.00      1.00      1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

[[13  0  0]
 [ 0  6  0]
 [ 0  0 11]]

All classifiers in the first layer showed good performance. This is a very simple case, and the data set is too small to judge the merit of blending. I will check other data next.


4 thoughts on "Ensemble learning with scikit-learn and XGBoost"

  1. Great example of blending, although I believe blending is stacking – one and the same.

    Thankfully, modern sklearn provides a stacking classifier we can use to directly blend and stack models. Using xgboost as the meta model works very well.

