Some days ago, I posted about an ensemble classification method named ‘blending’. The method is not implemented in scikit-learn, so I am implementing it myself now.
By the way, scikit-learn has an ensemble classification method named ‘VotingClassifier’.
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html#sklearn.ensemble.VotingClassifier
The following explanation is from the sklearn documentation:
The idea behind the VotingClassifier is to combine conceptually different machine learning classifiers and use a majority vote or the average predicted probabilities (soft vote) to predict the class labels. Such a classifier can be useful for a set of equally well performing models in order to balance out their individual weaknesses.
The VotingClassifier makes it very easy to combine many classifiers.
It has two voting modes: one is ‘hard’ and the other is ‘soft’.
From the documentation:
If ‘hard’, uses predicted class labels for majority rule voting. Else if ‘soft’, predicts the class label based on the argmax of the sums of the predicted probabilities, which is recommended for an ensemble of well-calibrated classifiers.
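To make the two modes concrete, here is a minimal sketch; the iris data and the particular base estimators are just assumptions for illustration, not the models used in the QSAR example below.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

estimators = [('lr', LogisticRegression(max_iter=1000)),
              ('rf', RandomForestClassifier(n_estimators=100)),
              ('svc', SVC(probability=True, gamma='auto'))]

# 'hard': each estimator votes with its predicted class label, the majority wins.
hard_clf = VotingClassifier(estimators=estimators, voting='hard').fit(X, y)

# 'soft': the predicted probabilities are summed (optionally weighted) and the argmax is taken.
# Every estimator must implement predict_proba, hence probability=True for SVC.
soft_clf = VotingClassifier(estimators=estimators, voting='soft',
                            weights=[1, 2, 1]).fit(X, y)

print(hard_clf.predict(X[:5]))
print(soft_clf.predict(X[:5]))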
I used the classifier for QSAR modeling.
In the following code, I used four classifiers, with BASE.csv from MoleculeNet as the test dataset.
The code is very simple! Just pass a list of (name, estimator) tuples built from the defined dictionary to VotingClassifier.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from xgboost import XGBClassifier

# Base classifiers (ETC is defined here but not added to the ensemble below).
clf_dict = {'RF': RandomForestClassifier(n_estimators=100),
            'ETC': ExtraTreesClassifier(n_estimators=100),
            'GBC': GradientBoostingClassifier(learning_rate=0.05),
            'XGB': XGBClassifier(n_estimators=100),
            'SVC': SVC(probability=True, gamma='auto')}

# Combine four classifiers with hard (majority-vote) voting.
voting_clf = VotingClassifier(estimators=[("RF", clf_dict["RF"]),
                                          ("GBC", clf_dict["GBC"]),
                                          ("XGB", clf_dict["XGB"]),
                                          ("SVC", clf_dict["SVC"])
                                          ], voting='hard')

# Features are all columns but the last; labels are in the last column.
dataset = np.load("train.npz")['arr_0']
X = dataset[:, :-1]
y = dataset[:, -1]
# Shuffle the data.
idx = np.random.permutation(y.size)
X = X[idx]
y = y[idx]

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=794)

nfolds = 10
skf = StratifiedKFold(nfolds)

# Voting ensemble: fit on each CV fold (only the last fit is kept), then evaluate on the test set.
for i, (train, val) in enumerate(skf.split(train_X, train_y)):
    #print('fold {}'.format(i))
    X_train = train_X[train]
    y_train = train_y[train]
    X_val = train_X[val]
    y_val = train_y[val]
    voting_clf.fit(X_train, y_train)
y_pred = voting_clf.predict(test_X)
print("Voting!")
print(confusion_matrix(test_y, y_pred))
print(classification_report(test_y, y_pred))

# Random forest alone, for comparison.
rf_clf = RandomForestClassifier(n_estimators=100)
for i, (train, val) in enumerate(skf.split(train_X, train_y)):
    #print('fold {}'.format(i))
    X_train = train_X[train]
    y_train = train_y[train]
    X_val = train_X[val]
    y_val = train_y[val]
    rf_clf.fit(X_train, y_train)
y_pred = rf_clf.predict(test_X)
print("Random Forest!")
print(confusion_matrix(test_y, y_pred))
print(classification_report(test_y, y_pred))

# Support vector classifier alone, for comparison.
svc_clf = SVC(probability=True, gamma='auto')
for i, (train, val) in enumerate(skf.split(train_X, train_y)):
    #print('fold {}'.format(i))
    X_train = train_X[train]
    y_train = train_y[train]
    X_val = train_X[val]
    y_val = train_y[val]
    svc_clf.fit(X_train, y_train)
y_pred = svc_clf.predict(test_X)
print("SV!")
print(confusion_matrix(test_y, y_pred))
print(classification_report(test_y, y_pred))
Then run the code.
In this example the voting method does not outperform the random forest or the support vector classifier. But I think it is worth knowing that sklearn provides this useful feature for ensemble learning. ;-)
iwatobipen$ python voting.py
Voting!
[[10  0  0]
 [ 0 11  0]
 [ 0  1  8]]
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        10
         1.0       0.92      1.00      0.96        11
         2.0       1.00      0.89      0.94         9

   micro avg       0.97      0.97      0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30

Random Forest!
[[10  0  0]
 [ 0 11  0]
 [ 0  1  8]]
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        10
         1.0       0.92      1.00      0.96        11
         2.0       1.00      0.89      0.94         9

   micro avg       0.97      0.97      0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30

SV!
[[10  0  0]
 [ 0 11  0]
 [ 0  1  8]]
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        10
         1.0       0.92      1.00      0.96        11
         2.0       1.00      0.89      0.94         9

   micro avg       0.97      0.97      0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30
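As a side note, cross_val_score gives a quick way to put the individual models and the voting ensemble side by side on the same folds. A minimal sketch, reusing clf_dict, voting_clf, train_X and train_y from the code above:

from sklearn.model_selection import cross_val_score

# Cross-validated accuracy for each base model and for the ensemble.
for name, clf in list(clf_dict.items()) + [('Voting', voting_clf)]:
    scores = cross_val_score(clf, train_X, train_y, cv=5, scoring='accuracy')
    print('{}: {:.3f} +/- {:.3f}'.format(name, scores.mean(), scores.std()))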
Did you check the mlxtend library for combining models? Maybe it’s not doing what you are looking for – I don’t know the library in detail, but I saw it on Sebastian Raschka’s Twitter.
Yah! I have checked mlxtend. It seems useful, but I have not tried it yet. I will use it ASAP. Thanks.