Ensemble learning with scikit-learn and XGBoost #machine learning

I often post about deep learning, but today I would like to write about ensemble learning.
There are lots of documents that describe ensemble learning, and I found the following guide very informative:
https://mlwave.com/kaggle-ensembling-guide/
I was interested in one of the methods described there, named 'blending'.
According to the guide, the merits of 'blending' are as follows:

Blending has a few benefits:

It is simpler than stacking.
It wards against an information leak: The generalizers and stackers use different data.
You do not need to share a seed for stratified folds with your teammates. Anyone can throw models in the 'blender' and the blender decides if it wants to keep that model or not.

There are two layers in blending. The first layer is a set of classifiers trained on the training data, and the second layer is a classifier trained on the predictions that the first layer makes on data it did not see during fitting.
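Before showing my full code, here is a minimal sketch of the idea, assuming the simplest 'holdout' style of blending described in the guide (RandomForestClassifier and LogisticRegression are just stand-ins; my code below uses out-of-fold predictions instead of a single holdout):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
# split the data: the first layer never sees the holdout part
X_l1, X_hold, y_l1, y_hold = train_test_split(X, y, test_size=0.5, random_state=0)
# first layer: fit on the first part of the data
clf1 = RandomForestClassifier(n_estimators=100).fit(X_l1, y_l1)
# second layer: fit on the first layer's predicted probabilities for the holdout part
meta_X = clf1.predict_proba(X_hold)
clf2 = LogisticRegression(max_iter=1000).fit(meta_X, y_hold)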
I tried to write the full code for blending. In the following code I used scikit-learn and XGBoost.

First, import the libraries and define dictionaries of candidate classifiers for convenience. Note that SVC needs probability=True so that predict_proba is available for the second layer.

import click
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from xgboost import XGBClassifier
l1_clf_dict = {'RF': RandomForestClassifier(n_estimators=100),
               'ETC': ExtraTreesClassifier(n_estimators=100),
               'GBC': GradientBoostingClassifier(learning_rate=0.05),
               'XGB': XGBClassifier(n_estimators=100),
               'SVC': SVC(probability=True, gamma='auto')}

l2_clf_dict = {'RF': RandomForestClassifier(n_estimators=100),
               'ETC': ExtraTreesClassifier(n_estimators=100),
               'GBC': GradientBoostingClassifier(learning_rate=0.05),
               'XGB': XGBClassifier(n_estimators=100),
               'SVC': SVC(probability=True, gamma='auto')}

Then I defined the model-building function. The following code can be applied to multi-class classification problems.
The code is a little bit complicated, and it only returns the set of trained classifiers; I would like to improve it in the near future.

@click.command()
@click.option('--l1', default='all', type=str, help='first-layer models, comma separated: RF/ETC/GBC/XGB/SVC')
@click.option('--l2', default='XGB', type=str, help='second-layer model, default is XGB')
@click.option('--nfolds', default=10, type=int, help='number of KFolds default is 10')
@click.option('--traindata', default='train.npz', type=str, help='data for training')
def buildmodel(l1, l2, nfolds, traindata):
    skf = StratifiedKFold(nfolds)
    dataset = np.load(traindata)['arr_0']
    X = dataset[:,:-1]
    y = dataset[:,-1]
    idx = np.random.permutation(y.size)
    X = X[idx]
    y = y[idx]
    num_cls = len(set(y))
    train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=794)

    if l1 == 'all':
        l1 = list(l1_clf_dict.keys())
        clfs = list(l1_clf_dict.values())
    else:
        # e.g. --l1 RF,SVC
        l1 = l1.split(',')
        clfs = [l1_clf_dict[clf] for clf in l1]

    # meta-features: per-model class probabilities, out-of-fold for train, fold-averaged for test
    dataset_blend_train = np.zeros((train_X.shape[0], len(clfs), num_cls))
    dataset_blend_test = np.zeros((test_X.shape[0], len(clfs), num_cls))

    for j, clf in enumerate(clfs):
        dataset_blend_test_j = np.zeros((test_X.shape[0], nfolds, num_cls))
        for i, (train, val) in enumerate(skf.split(train_X, train_y)):
            print('fold {}'.format(i))
            X_train = train_X[train]
            y_train = train_y[train]
            X_val = train_X[val]
            y_val = train_y[val]
            clf.fit(X_train, y_train)
            # use clfs predicted value for next layer's training
            y_pred = clf.predict_proba(X_val)
            dataset_blend_train[val, j, :] = y_pred
            dataset_blend_test_j[:, i, :] = clf.predict_proba(test_X)
        # average each model's test-set predictions over the folds
        dataset_blend_test[:, j, :] = dataset_blend_test_j.mean(1)
    l2_clf = l2_clf_dict[l2]
    print('Blending')
    print(dataset_blend_train.shape)
    # flatten (n_samples, n_models, n_classes) into 2-D meta-features for the second layer
    dataset_blend_train = dataset_blend_train.reshape((dataset_blend_train.shape[0], -1))
    l2_clf.fit(dataset_blend_train, train_y)

    dataset_blend_test = dataset_blend_test.reshape((dataset_blend_test.shape[0], -1))
    y_pred = l2_clf.predict_proba(dataset_blend_test)
    print(classification_report(test_y, np.argmax(y_pred, 1)))
    print(confusion_matrix(test_y, np.argmax(y_pred, 1)))

    print("*"*50)
    for i, key in enumerate(l1):
        print('layer 1 {}'.format(key))
        l1_pred = clfs[i].predict_proba(test_X)
        print(classification_report(test_y, np.argmax(l1_pred, 1)))
        print(confusion_matrix(test_y, np.argmax(l1_pred, 1)))
        print("*"*50)
    return (clfs, l2_clf)

if __name__=='__main__':
    buildmodel()
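
Since buildmodel is a click command, the first-layer models, the second-layer model and the number of folds can all be changed from the command line, for example:

iwatobipen$ python blending.py --l1 RF,GBC,SVC --l2 XGB --nfolds 5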

Now the base code is done.
Let's make some sample data and run it.

# make data
import numpy as np
from sklearn.datasets import load_iris
iris = load_iris()
x = iris.data
y = iris.target
data = np.hstack((x, y.reshape(y.size, 1)))
np.savez('train.npz', data)
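
Note that because np.savez is called without a keyword argument, the array is stored under the default key 'arr_0'; that is why buildmodel reads np.load(traindata)['arr_0']. A quick check:

d = np.load('train.npz')
print(d.files)           # ['arr_0']
print(d['arr_0'].shape)  # (150, 5): 4 features + 1 target column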

Run the code :)

iwatobipen$ python blending.py
train.npz
fold 0
fold 1
fold 2
--snip--
fold 7
fold 8
fold 9
Blending
(120, 5, 3)
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        13
         1.0       1.00      0.83      0.91         6
         2.0       0.92      1.00      0.96        11

   micro avg       0.97      0.97      0.97        30
   macro avg       0.97      0.94      0.96        30
weighted avg       0.97      0.97      0.97        30

[[13  0  0]
 [ 0  5  1]
 [ 0  0 11]]
**************************************************
layer 1 RF
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        13
         1.0       0.86      1.00      0.92         6
         2.0       1.00      0.91      0.95        11

   micro avg       0.97      0.97      0.97        30
   macro avg       0.95      0.97      0.96        30
weighted avg       0.97      0.97      0.97        30

[[13  0  0]
 [ 0  6  0]
 [ 0  1 10]]
**************************************************
layer 1 ETC
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        13
         1.0       0.71      0.83      0.77         6
         2.0       0.90      0.82      0.86        11

   micro avg       0.90      0.90      0.90        30
   macro avg       0.87      0.88      0.88        30
weighted avg       0.91      0.90      0.90        30

[[13  0  0]
 [ 0  5  1]
 [ 0  2  9]]
**************************************************
layer 1 GBC
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        13
         1.0       0.75      1.00      0.86         6
         2.0       1.00      0.82      0.90        11

   micro avg       0.93      0.93      0.93        30
   macro avg       0.92      0.94      0.92        30
weighted avg       0.95      0.93      0.93        30

[[13  0  0]
 [ 0  6  0]
 [ 0  2  9]]
**************************************************
layer 1 XGB
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        13
         1.0       0.75      1.00      0.86         6
         2.0       1.00      0.82      0.90        11

   micro avg       0.93      0.93      0.93        30
   macro avg       0.92      0.94      0.92        30
weighted avg       0.95      0.93      0.93        30

[[13  0  0]
 [ 0  6  0]
 [ 0  2  9]]
**************************************************
layer 1 SVC
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        13
         1.0       1.00      1.00      1.00         6
         2.0       1.00      1.00      1.00        11

   micro avg       1.00      1.00      1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

[[13  0  0]
 [ 0  6  0]
 [ 0  0 11]]
**************************************************

All classifiers in the first layer showed good performance. This is a very simple case with very little data, so it is hard to estimate the merit of blending from it. I will try other datasets next.

Ref.
http://www.chioka.in/stacking-blending-and-stacked-generalization/
