Machine learning usually requires optimizing hyperparameters.
For example, scikit-learn provides a grid search method, and there are several packages for the task, such as hyperopt and GPyOpt. But how do you manage the models you have built? I find it difficult.
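As a reminder of what a grid search looks like, here is a minimal sketch with scikit-learn's GridSearchCV. The synthetic data from make_regression and the grid values are assumptions chosen only for illustration; they are not taken from the wine example later in this post.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic regression data (assumption: stands in for a real data set).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Every combination of these values is tried: 4 x 3 = 12 candidates.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-1, 1e-2, 1e-3]}

# 5-fold cross-validation, scored by R2 on each held-out fold.
search = GridSearchCV(SVR(), param_grid, cv=5, scoring="r2")
search.fit(X, y)

print(search.best_params_)  # the best combination found on this data
print(search.best_score_)   # mean cross-validated R2 of that combination
```

GridSearchCV keeps only the refit best estimator by default, which is one reason a separate tool for tracking every trained model is attractive.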
Recently I have become interested in MLflow. MLflow is an open source platform for managing the end-to-end machine learning lifecycle. It tackles three primary functions.
MLflow can track the hyperparameters of each model, serve the models, and it also provides a good web UI.
I tried it on a very simple example.
The code is below.
First, I got the sample data and visualized the data set with PCA.
# on the ipython notebook
%matplotlib inline
!wget https://raw.githubusercontent.com/mlflow/mlflow/master/examples/sklearn_elasticnet_wine/wine-quality.csv -P ./data/
import matplotlib.pyplot as plt
import matplotlib.colors as colors
import matplotlib.cm as cm
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA

data = pd.read_csv('./data/wine-quality.csv')
cmap = plt.get_cmap("Blues", len(data.quality.unique()))
pca = PCA()
wine_pca = pca.fit_transform(data.iloc[:,:-1])
plt.scatter(wine_pca[:,0], wine_pca[:,1], c=data.quality, cmap=cmap)
plt.xlim(np.min(wine_pca[:,0]), np.max(wine_pca[:,0]))
plt.ylim(np.min(wine_pca[:,1]), np.max(wine_pca[:,1]))
plt.colorbar()
Next, the train function is defined.
The mlflow.log_param function records each hyperparameter, mlflow.log_metric records each score, and mlflow.sklearn.log_model stores the model.
After running the code, an mlruns folder is generated in the current directory and the data is stored there.
def train():
    import warnings
    import pandas as pd
    import numpy as np
    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVR
    from sklearn.model_selection import StratifiedKFold
    from sklearn.model_selection import cross_val_score
    import mlflow
    import mlflow.sklearn

    def eval_metrics(actual, pred):
        rmse = np.sqrt(mean_squared_error(actual, pred))
        mae = mean_absolute_error(actual, pred)
        r2 = r2_score(actual, pred)
        return rmse, mae, r2

    warnings.filterwarnings("ignore")
    np.random.seed(40)
    data = pd.read_csv('./data/wine-quality.csv')
    # hold out a test split; "quality" is the target column
    train, test = train_test_split(data)
    train_x = train.drop(["quality"], axis=1)
    test_x = test.drop(["quality"], axis=1)
    train_y = train[["quality"]]
    test_y = test[["quality"]]
    param = {'C':[0.01, 0.1, 1, 10, 100, 1000, 10000 ],
             'gamma':[1.0, 1e-1, 1e-2, 1e-3, 1e-4]}
    # one MLflow run per (C, gamma) combination
    for c in param['C']:
        for g in param['gamma']:
            with mlflow.start_run():
                print(c,g)
                skf = StratifiedKFold(n_splits=5)
                svr = SVR(C=c, gamma=g)
                # cross-validation scores on the training split (computed but not logged)
                score = cross_val_score(svr, train_x, train_y, cv=skf, n_jobs=-1)
                svr.fit(train_x, train_y)
                predicted_qualities = svr.predict(test_x)
                (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)
                print(" RMSE: %s" % rmse)
                print(" MAE: %s" % mae)
                print(" R2: %s" % r2)
                # record the hyperparameters, the test metrics and the fitted model
                mlflow.log_param("C", c)
                mlflow.log_param("gamma", g)
                mlflow.log_metric("r2", r2)
                mlflow.log_metric("rmse", rmse)
                mlflow.log_metric("mae", mae)
                mlflow.sklearn.log_model(svr, "model")
Run the function.
train()
0.01 1.0
 RMSE: 0.876717955304063
 MAE: 0.6586558965180616
 R2: 0.007250505904323745
0.01 0.1
 RMSE: 0.872902609847314
 MAE: 0.6523680676966712
 R2: 0.015872299345786156
--snip--
10000 0.0001
 RMSE: 0.7902872331540974
 MAE: 0.570097398346025
 R2: 0.19334133272639453
After running the code, MLflow provides a very useful web UI. To access the UI, just type the following command in a terminal ;-).
iwatobipen$ mlflow server
Then access http://127.0.0.1:5000/#/.
I can check the summary of the training runs with their metrics, as shown below.
Each model is stored as well. I can see the details and access each model, as shown below.
I think it is useful for managing many experiments and many models.