Make original sklearn classifier-2 #sklearn #chemoinfo

After I posted ‘Make original sklearn classifier’, I got comments from my followers @yamasaKit_-san and @kzfm-san. (Thanks!)

So I checked the diversity of the models with principal component analysis (PCA).
The example is almost the same as yesterday's, but the last part is a little different.
The last part of my code is below. It extracts feature importances from the L1 layer classifiers and from a single (mono) random forest classifier, and then runs PCA on them.

labels = ["rf", "et", "gbc", "xgbc", "mono_rf"]
feature_importances_list = [clf.feature_importances_ for clf in blendclf.l1_clfs_]
feature_importances_list.append(mono_rf.feature_importances_)
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
res = pca.fit_transform(feature_importances_list)
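To make the PCA step concrete, here is a minimal NumPy-only sketch of what `PCA(n_components=2)` does to a stack of importance vectors: center each column, take the SVD, and project onto the first two right singular vectors. The importance values are invented for illustration and are not the values from my notebook. The coordinates should agree with scikit-learn's only up to the sign of each component.

```python
import numpy as np

# Hypothetical importance vectors: one row per model, one column per feature.
labels = ["rf", "et", "gbc", "xgbc", "mono_rf"]
importances = np.array([
    [0.40, 0.30, 0.20, 0.10],
    [0.38, 0.32, 0.19, 0.11],
    [0.10, 0.15, 0.25, 0.50],
    [0.05, 0.20, 0.30, 0.45],
    [0.41, 0.29, 0.21, 0.09],
])

# PCA by hand: center every feature column, then project the centered rows
# onto the top-2 right singular vectors.
centered = importances - importances.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
coords = centered @ vt[:2].T  # shape: (n_models, 2)

# Fraction of the total variance captured by each of the two components.
explained = singular_values[:2] ** 2 / np.sum(singular_values ** 2)

for name, (pc1, pc2) in zip(labels, coords):
    print(f"{name:8s} PC1={pc1:+.3f} PC2={pc2:+.3f}")
print("explained variance ratio:", explained)
```

Models whose rows are nearly identical end up close together in the projected plane, which is what the scatter plot is meant to show.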

Then I plotted the results with matplotlib, using the adjustText library to place the labels. adjustText is a powerful package for making labeled plots easy to read.

import matplotlib.pyplot as plt
from adjustText import adjust_text
x, y = res[:,0], res[:,1]
plt.plot(x, y, 'bo')
plt.xlim(-0.1, 0.3)
plt.ylim(-0.05, 0.1)

texts = [plt.text(x[i], y[i], '{}'.format(labels[i])) for i in range(len(labels))]
adjust_text(texts)

The plot indicates that two models, ET and RF, learned feature importances similar to those of the mono random forest, while XGBC and GBC learned different ones. So I think the combination of ET and RF is redundant for ensemble learning.
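One way to put numbers behind this visual impression is to compare the importance vectors directly, for example with cosine similarity. The sketch below is only an illustration with invented vectors (not my real results); with the real models, the rows would come from each classifier's `feature_importances_`.

```python
import numpy as np

# Hypothetical importance vectors: one row per model, one column per feature.
labels = ["rf", "et", "gbc", "xgbc", "mono_rf"]
importances = np.array([
    [0.40, 0.30, 0.20, 0.10],
    [0.38, 0.32, 0.19, 0.11],
    [0.10, 0.15, 0.25, 0.50],
    [0.05, 0.20, 0.30, 0.45],
    [0.41, 0.29, 0.21, 0.09],
])

# Cosine similarity: normalize each row to unit length, then take dot products.
unit = importances / np.linalg.norm(importances, axis=1, keepdims=True)
similarity = unit @ unit.T  # shape: (n_models, n_models)

print(" " * 8 + "".join(f"{name:>9s}" for name in labels))
for name, row in zip(labels, similarity):
    print(f"{name:8s}" + "".join(f"{value:9.3f}" for value in row))
```

Pairs with similarity close to 1 are candidates for being redundant in the ensemble.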

To build a good ensemble model, I need to think about which models to combine and how to tune their many hyperparameters.

You can see the whole code at the URL below.
http://nbviewer.jupyter.org/github/iwatobipen/skensemble/blob/master/solubility2.ipynb


Ensemble learning with scikit-learn and XGBoost #machine learning

I often post about deep learning topics, but today I would like to post about ensemble learning.
There are lots of documents that describe ensemble learning, and I found the following one very informative.
https://mlwave.com/kaggle-ensembling-guide/
I am interested in one of the methods, named ‘blending’.
According to the above URL, the merits of ‘blending’ are …

Blending has a few benefits:

It is simpler than stacking.
It wards against an information leak: The generalizers and stackers use different data.
You do not need to share a seed for stratified folds with your teammates. Anyone can throw models in the ‘blender’ and the blender decides if it wants to keep that model or not.

There are two layers in blending. The first layer is a set of multiple classifiers trained on the training data. The second layer is a classifier trained on the first layer's predictions for data that the first-layer models did not see while fitting (held-out or out-of-fold samples).
I tried to write code for blending. The following code uses scikit-learn and XGBoost.
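Before the full script, the two-layer idea can be sketched in a few lines. This is a simplified hold-out variant written only for illustration. The choice of two tree ensembles as the first layer and `LogisticRegression` as the second layer is my assumption here and differs from the script that follows.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Three disjoint parts: base-model training, hold-out for the blender, final test.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
X_base, X_hold, y_base, y_hold = train_test_split(
    X_rest, y_rest, test_size=0.3, random_state=0, stratify=y_rest)

# Layer 1: fit every base classifier on the base-training part only.
base_clfs = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    ExtraTreesClassifier(n_estimators=50, random_state=0),
]
for clf in base_clfs:
    clf.fit(X_base, y_base)


def first_layer_features(clfs, data):
    """Concatenate the class probabilities predicted by every base classifier."""
    return np.hstack([clf.predict_proba(data) for clf in clfs])


# Layer 2: the blender only sees predictions for samples unseen by layer 1.
blender = LogisticRegression(max_iter=1000)
blender.fit(first_layer_features(base_clfs, X_hold), y_hold)

accuracy = blender.score(first_layer_features(base_clfs, X_test), y_test)
print("blended accuracy on the test part: {:.2f}".format(accuracy))
```

Because the blender is fitted on the hold-out part, the first-layer models never predict on their own training samples when producing the second layer's inputs.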

First, import the libraries and define dictionaries of classifiers for convenience.

import click
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from xgboost import XGBClassifier
l1_clf_dict = {'RF': RandomForestClassifier(n_estimators=100),
        'ETC': ExtraTreesClassifier(n_estimators=100),
        'GBC': GradientBoostingClassifier(learning_rate=0.05),
        'XGB': XGBClassifier(n_estimators=100),
        'SVC': SVC(probability=True, gamma='auto')}

l2_clf_dict = {'RF': RandomForestClassifier(n_estimators=100),
        'ETC': ExtraTreesClassifier(n_estimators=100),
        'GBC': GradientBoostingClassifier(learning_rate=0.05),
        'XGB': XGBClassifier(n_estimators=100),
        'SVC': SVC(probability=True, gamma='auto')}

Then I defined the model-building function. The following code can be applied to multiclass classification problems.
The code looks a little complicated, and it only returns the set of trained classifiers. I would like to improve it in the near future.

@click.command()
@click.option('--l1', default='all', type=str, help='first-layer models as comma-separated keys from RF,ETC,GBC,XGB,SVC (default: all)')
@click.option('--l2', default='XGB', type=str, help='second-layer model key (default: XGB)')
@click.option('--nfolds', default=10, type=int, help='number of stratified K-folds (default: 10)')
@click.option('--traindata', default='train.npz', type=str, help='path to the .npz file with training data')
def buildmodel(l1, l2, nfolds, traindata):
    skf = StratifiedKFold(nfolds)
    dataset = np.load(traindata)['arr_0']
    X = dataset[:,:-1]
    y = dataset[:,-1]
    idx = np.random.permutation(y.size)
    X = X[idx]
    y = y[idx]
    num_cls = len(set(y))
    train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=794)

    if l1 == 'all':
        l1 = list(l1_clf_dict.keys())
        clfs = list(l1_clf_dict.values())
    else:
        # split the comma-separated names so the report loop iterates over names, not characters
        l1 = l1.split(',')
        clfs = [l1_clf_dict[name] for name in l1]

    dataset_blend_train = np.zeros((train_X.shape[0], len(clfs), num_cls ))
    dataset_blend_test = np.zeros((test_X.shape[0], len(clfs), num_cls ))

    for j, clf in enumerate(clfs):
        dataset_blend_test_j = np.zeros((test_X.shape[0], nfolds, num_cls))
        for i, (train, val) in enumerate(skf.split(train_X, train_y)):
            print('fold {}'.format(i))
            X_train = train_X[train]
            y_train = train_y[train]
            X_val = train_X[val]
            y_val = train_y[val]
            clf.fit(X_train, y_train)
            # use clfs predicted value for next layer's training
            y_pred = clf.predict_proba(X_val)
            dataset_blend_train[val, j, :] = y_pred
            dataset_blend_test_j[:, i, :] = clf.predict_proba(test_X)
        dataset_blend_test[:, j, :] = dataset_blend_test_j.mean(1)
    l2_clf = l2_clf_dict[l2]
    print('Blending')
    print(dataset_blend_train.shape)
    dataset_blend_train = dataset_blend_train.reshape((dataset_blend_train.shape[0], -1))
    l2_clf.fit(dataset_blend_train, train_y)

    dataset_blend_test = dataset_blend_test.reshape((dataset_blend_test.shape[0], -1))
    y_pred = l2_clf.predict_proba(dataset_blend_test)
    print(classification_report(test_y, np.argmax(y_pred, 1)))
    print(confusion_matrix(test_y, np.argmax(y_pred, 1)))

    print("*"*50)
    for i, key in enumerate(l1):
        print('layer 1 {}'.format(l1[i]))
        l1_pred = clfs[i].predict_proba(test_X)
        print(classification_report(test_y, np.argmax(l1_pred, 1)))
        print(confusion_matrix(test_y, np.argmax(l1_pred, 1)))  
        print("*"*50) 
    return (clfs, l2_clf)

if __name__=='__main__':
    buildmodel()
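The array bookkeeping in `buildmodel` is the part that is easiest to get wrong, so here is a small NumPy-only sketch of it. Random numbers stand in for the `predict_proba` output (the helper `fake_proba` is hypothetical), and plain K-fold index splitting replaces `StratifiedKFold` to keep the sketch short. The sizes mirror an Iris run: 120 training rows, 30 test rows, 5 classifiers, 3 classes.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_clfs, n_folds, n_cls = 120, 30, 5, 10, 3


def fake_proba(n_rows):
    """Stand-in for clf.predict_proba: random rows that each sum to 1."""
    p = rng.random((n_rows, n_cls))
    return p / p.sum(axis=1, keepdims=True)


blend_train = np.zeros((n_train, n_clfs, n_cls))
blend_test = np.zeros((n_test, n_clfs, n_cls))

# Plain (unstratified) K-fold: every training index is in exactly one fold.
folds = np.array_split(rng.permutation(n_train), n_folds)

for j in range(n_clfs):
    test_per_fold = np.zeros((n_test, n_folds, n_cls))
    for i, val in enumerate(folds):
        # Out-of-fold predictions fill only the validation rows of this fold.
        blend_train[val, j, :] = fake_proba(len(val))
        # Each fold's model also predicts the whole test set.
        test_per_fold[:, i, :] = fake_proba(n_test)
    # Average the per-fold test predictions into one block per classifier.
    blend_test[:, j, :] = test_per_fold.mean(axis=1)

# Flatten (classifier, class) into one feature axis for the second layer.
l2_train = blend_train.reshape(n_train, -1)
l2_test = blend_test.reshape(n_test, -1)
print(l2_train.shape, l2_test.shape)  # (120, 15) (30, 15)
```

After the loops, every training row holds a prediction from a model that was not fitted on that row, and the second layer sees 5 x 3 = 15 features per sample.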

Now the base code is finished.
Let’s make sample data and run it.

# make data
import numpy as np
from sklearn.datasets import load_iris 
x = load_iris().data
y = load_iris().target
data = np.hstack((x, y.reshape(y.size, 1)))
np.savez('train.npz', data)
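As a side note on the file format, the sketch below shows why the script reads the key `'arr_0'`: an array passed positionally to `np.savez` is stored under that name. It uses a tiny random array and a temporary directory instead of the Iris data, so those values are placeholders.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)
features = rng.random((6, 4))           # placeholder feature matrix
target = np.array([0, 0, 1, 1, 2, 2])   # integer class labels

# Same layout as above: features first, label in the last column.
data = np.hstack((features, target.reshape(-1, 1)))

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "train.npz")
    np.savez(path, data)  # positional arrays are stored as arr_0, arr_1, ...
    with np.load(path) as archive:
        keys = archive.files
        dataset = archive["arr_0"]

X, y = dataset[:, :-1], dataset[:, -1]
print(keys, X.shape, y.dtype)  # ['arr_0'] (6, 4) float64
```

Note that `np.hstack` upcasts the integer labels to float, which is why the classification report prints the classes as 0.0, 1.0 and 2.0.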

Run the code :)

iwatobipen$ python blending.py
train.npz
fold 0
fold 1
fold 2
--snip--
fold 7
fold 8
fold 9
Blending
(120, 5, 3)
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        13
         1.0       1.00      0.83      0.91         6
         2.0       0.92      1.00      0.96        11

   micro avg       0.97      0.97      0.97        30
   macro avg       0.97      0.94      0.96        30
weighted avg       0.97      0.97      0.97        30

[[13  0  0]
 [ 0  5  1]
 [ 0  0 11]]
**************************************************
layer 1 RF
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        13
         1.0       0.86      1.00      0.92         6
         2.0       1.00      0.91      0.95        11

   micro avg       0.97      0.97      0.97        30
   macro avg       0.95      0.97      0.96        30
weighted avg       0.97      0.97      0.97        30

[[13  0  0]
 [ 0  6  0]
 [ 0  1 10]]
**************************************************
layer 1 ETC
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        13
         1.0       0.71      0.83      0.77         6
         2.0       0.90      0.82      0.86        11

   micro avg       0.90      0.90      0.90        30
   macro avg       0.87      0.88      0.88        30
weighted avg       0.91      0.90      0.90        30

[[13  0  0]
 [ 0  5  1]
 [ 0  2  9]]
**************************************************
layer 1 GBC
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        13
         1.0       0.75      1.00      0.86         6
         2.0       1.00      0.82      0.90        11

   micro avg       0.93      0.93      0.93        30
   macro avg       0.92      0.94      0.92        30
weighted avg       0.95      0.93      0.93        30

[[13  0  0]
 [ 0  6  0]
 [ 0  2  9]]
**************************************************
layer 1 XGB
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        13
         1.0       0.75      1.00      0.86         6
         2.0       1.00      0.82      0.90        11

   micro avg       0.93      0.93      0.93        30
   macro avg       0.92      0.94      0.92        30
weighted avg       0.95      0.93      0.93        30

[[13  0  0]
 [ 0  6  0]
 [ 0  2  9]]
**************************************************
layer 1 SVC
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        13
         1.0       1.00      1.00      1.00         6
         2.0       1.00      1.00      1.00        11

   micro avg       1.00      1.00      1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

[[13  0  0]
 [ 0  6  0]
 [ 0  0 11]]
**************************************************

All classifiers in the first layer showed good performance. This is a very simple case, and the data set is too small to judge the merit of blending. I will check other data next.

Ref.
http://www.chioka.in/stacking-blending-and-stacked-generalization/

Make predictive models with small data and visualize it #Chemoinformatics

I enjoyed the chemoinformatics conference held in Kumamoto this week.
On the first day of the conference, I heard a very interesting lecture. It was a very basic tutorial on data handling and visualization, but useful for newcomers to chemoinformatics.
I wanted to reproduce the code example, so I tried it.

First, visualize the training data. It is important to know the properties of the training data.

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet
from sklearn.svm import SVC, SVR
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
points = [(1.5, 2), (2, 1), (2, 3), (3, 5), (4, 3), (7, 6), (9,  10)]
label = [1, -1, 1, 1, -1, -1, 1]
def color_map(label):
    if label > 0:
        return 'blue'
    return 'red'
train_color = list(map(color_map, label))
# check data
train_x = [ i[0] for i in points]
train_y = [ i[1] for i in points]
plt.scatter(x=train_x, y=train_y, c=train_color)
plt.xlim(0,15)
plt.ylim(0,15)

Hmm, there seems to be a linear relationship between x and y.

Next, I made test data and a helper function for visualizing the data.

test_x = np.linspace(0, 10, 20)
test_y = np.linspace(0, 10, 20)
xx, yy = np.meshgrid(test_x, test_y)
test_x = xx.ravel()
test_y = yy.ravel()
n_data = len(test_x)
test_data = [(test_x[i], test_y[i]) for i in range(n_data)]

def makeplot(test_x, test_y, predict_data):
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    ax.scatter(train_x, train_y, c=train_color)
    color = list(map(color_map, predict_data))
    ax.scatter(test_x, test_y, c = color, alpha=0.3)
    fig.show()
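The grid construction above can also be written without the Python loop. This is just an equivalent sketch using `np.column_stack`; the names `axis` and `grid` are mine.

```python
import numpy as np

axis = np.linspace(0, 10, 20)

# meshgrid gives two 20x20 arrays; ravel flattens them row by row,
# so x varies fastest and y changes every 20 points.
xx, yy = np.meshgrid(axis, axis)
grid = np.column_stack((xx.ravel(), yy.ravel()))

print(grid.shape)  # (400, 2)
print(grid[:3])    # first three (x, y) pairs, all with y = 0
```

Each row of `grid` is one (x, y) test point, so the array can be passed to `predict` directly in place of the list of tuples.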

OK, let’s build a model!

# Linear Regression
model = LinearRegression()
predictor = model.fit(points, label)
reg = predictor.predict(test_data)
makeplot(test_x, test_y, reg)
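Since `color_map` paints every prediction above 0 blue and everything else red, the regression is in effect used as a classifier whose boundary is the line where the predicted value crosses zero. Here is a minimal NumPy sketch of that idea with an ordinary least-squares fit on the same seven points. It is meant as an illustration of the thresholding, not as a replacement for `LinearRegression`.

```python
import numpy as np

points = np.array([(1.5, 2), (2, 1), (2, 3), (3, 5), (4, 3), (7, 6), (9, 10)],
                  dtype=float)
label = np.array([1, -1, 1, 1, -1, -1, 1], dtype=float)

# Least squares with an intercept: append a column of ones to the inputs.
design = np.column_stack((points, np.ones(len(points))))
coef, *_ = np.linalg.lstsq(design, label, rcond=None)

# Regression scores, then the same rule as color_map: > 0 means class +1.
scores = design @ coef
predicted = np.where(scores > 0, 1, -1)

a, b, c = coef
print("decision boundary: {:+.3f}*x {:+.3f}*y {:+.3f} = 0".format(a, b, c))
print("training accuracy:", np.mean(predicted == label))
```

The printed line equation is where the blue and red regions of the plot meet.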


Oh, the simple linear regressor works very well. ;-)
OK, next, how about random forest?

#Random forest
model = RandomForestClassifier(random_state=794)  # fixed integer seed for reproducibility
predictor = model.fit(points, label)
reg = predictor.predict(test_data)
makeplot(test_x, test_y, reg)


The result is quite different from that of the linear regressor.

Next, I checked nonlinear data.

points = [(1, 5), (2, 10), (2, 3), (3, 5), (5, 4), (7, 6), (9,  10), (11,2), (7, 3)]
label = [1, 1, 1, 1, -1, -1, 1, 1, -1]
def color_map(label):
    if label > 0:
        return 'blue'
    return 'red'

train_color = list(map(color_map, label))
train_x = [ i[0] for i in points]
train_y = [ i[1] for i in points]
plt.scatter(x=train_x, y=train_y, c=train_color)
plt.xlim(0,15)
plt.ylim(0,15)

In this case, a linear model does not work well.

#Ridge
model = Ridge()
predictor = model.fit(points, label)
reg = predictor.predict(test_data)
makeplot(test_x, test_y, reg)
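To see why a linear model struggles here, the ridge fit can be reproduced by hand. The sketch below solves the closed-form ridge equations on centered data with `alpha = 1.0` (scikit-learn's default), which should correspond to `Ridge()` with an unpenalized intercept. The red points sit between blue points on both sides, so no single straight line can separate them.

```python
import numpy as np

points = np.array([(1, 5), (2, 10), (2, 3), (3, 5), (5, 4), (7, 6),
                   (9, 10), (11, 2), (7, 3)], dtype=float)
label = np.array([1, 1, 1, 1, -1, -1, 1, 1, -1], dtype=float)
alpha = 1.0  # regularization strength

# Center inputs and targets so the intercept is not penalized.
x_mean, y_mean = points.mean(axis=0), label.mean()
Xc, yc = points - x_mean, label - y_mean

# Closed-form ridge solution: (Xc^T Xc + alpha * I) w = Xc^T yc
w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(2), Xc.T @ yc)
intercept = y_mean - x_mean @ w

# Threshold at zero, as color_map does.
predicted = np.where(points @ w + intercept > 0, 1, -1)
print("weights:", w, "intercept:", intercept)
print("training accuracy:", np.mean(predicted == label))
```

Because the classes are not linearly separable, the training accuracy stays below 1 no matter how the weights are chosen.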

How about RF and SVR?

#RandomForest
model = RandomForestRegressor(random_state=794)  # fixed integer seed for reproducibility
predictor = model.fit(points, label)
reg = predictor.predict(test_data)
makeplot(test_x, test_y, reg)

#SVR
model = SVR()
predictor = model.fit(points, label)
reg = predictor.predict(test_data)
makeplot(test_x, test_y, reg)

(Plots: RF result, followed by SVR result.)

Nonlinear regressors can fit the nonlinear data, but each one gives different output.
Model selection is important, and to select a model you need to check the training data carefully.
All my experiments can be checked on Google Colab.

https://colab.research.google.com/drive/1ywqRlcjEPm7pLP-IeawPTsclb9siuFI4
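As a rough way to quantify how differently the models behave, the sketch below fits Ridge, RF and SVR on the nonlinear points and counts how many grid points each pair of models colors differently. The `n_estimators=100` setting and the disagreement measure are my choices for this illustration and were not part of the experiments above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.svm import SVR

points = [(1, 5), (2, 10), (2, 3), (3, 5), (5, 4), (7, 6), (9, 10), (11, 2), (7, 3)]
label = [1, 1, 1, 1, -1, -1, 1, 1, -1]

# Same 20 x 20 grid as the test data above.
axis = np.linspace(0, 10, 20)
xx, yy = np.meshgrid(axis, axis)
grid = np.column_stack((xx.ravel(), yy.ravel()))

models = {
    "Ridge": Ridge(),
    "RF": RandomForestRegressor(n_estimators=100, random_state=794),
    "SVR": SVR(),
}

# True where a model would paint the grid point blue (prediction > 0).
blue = {name: model.fit(points, label).predict(grid) > 0
        for name, model in models.items()}

names = list(models)
for i, first in enumerate(names):
    print("{}: {:.0%} of the grid is blue".format(first, blue[first].mean()))
    for second in names[i + 1:]:
        disagreement = np.mean(blue[first] != blue[second])
        print("  differs from {} on {:.0%} of the grid".format(second, disagreement))
```

A large disagreement between two models that both fit the training points is a reminder that the training data alone does not decide which one to trust.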

Any comments and suggestions are appreciated.