Tracking progress of machine learning #MachineLearning

To do machine learning, you need to optimize hyperparameters.
For example, scikit-learn provides a grid search method, and there are several packages for this such as hyperopt and GPyOpt. But how do you manage the models you have built? It is difficult for me.
Recently I have been interested in MLflow. MLflow is an open source platform for managing the end-to-end machine learning lifecycle. It tackles three primary functions.
MLflow can track the hyperparameters and metrics of each model, serve the models, and it also provides a good web UI.
I tried it on a very simple example.
The code is below.
First I got the sample data and visualized the data set with PCA.

# on the ipython notebook
%matplotlib inline
!wget https://raw.githubusercontent.com/mlflow/mlflow/master/examples/sklearn_elasticnet_wine/wine-quality.csv -P ./data/
import matplotlib.pyplot as plt
import matplotlib.colors as colors
import matplotlib.cm as cm
import pandas as pd
import numpy as np

from sklearn.decomposition import PCA
data = pd.read_csv('./data/wine-quality.csv')

cmap = plt.get_cmap("Blues", len(data.quality.unique()))
pca = PCA()
wine_pca = pca.fit_transform(data.iloc[:,:-1])
plt.scatter(wine_pca[:,0], wine_pca[:,1], c=data.quality, cmap=cmap)
plt.xlim(np.min(wine_pca[:,0]), np.max(wine_pca[:,0]))
plt.ylim(np.min(wine_pca[:,1]), np.max(wine_pca[:,1]))
plt.colorbar()

Next, the train function is defined.
The mlflow.log_param function records the hyperparameters, mlflow.log_metric records the scores, and mlflow.sklearn.log_model stores the model.
After running the code, an mlruns folder is generated in the current directory and the data is stored there.


def train():

    import os
    import warnings
    import sys
    import pandas as pd
    import numpy as np
    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVR
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold
    from sklearn.model_selection import cross_val_score
    import mlflow
    import mlflow.sklearn

    def eval_metrics(actual, pred):
        rmse = np.sqrt(mean_squared_error(actual, pred))
        mae = mean_absolute_error(actual, pred)
        r2 = r2_score(actual, pred)
        return rmse, mae, r2

    warnings.filterwarnings("ignore")
    np.random.seed(40)
    data = pd.read_csv('./data/wine-quality.csv')
    train, test = train_test_split(data)

    train_x = train.drop(["quality"], axis=1)
    test_x = test.drop(["quality"], axis=1)
    train_y = train[["quality"]]
    test_y = test[["quality"]]
    param = {'C':[0.01, 0.1, 1, 10, 100, 1000, 10000 ],
             'gamma':[1.0, 1e-1, 1e-2, 1e-3, 1e-4]}
    for c in param['C']:
        for g in param['gamma']:
            with mlflow.start_run():
                print(c,g)
                skf = StratifiedKFold(n_splits=5)
                svr = SVR(C=c, gamma=g)
                score = cross_val_score(svr, train_x, train_y, cv=skf, n_jobs=-1)
                svr.fit(train_x, train_y)
                predicted_qualities = svr.predict(test_x)
                (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)
                print("  RMSE: %s" % rmse)
                print("  MAE: %s" % mae)
                print("  R2: %s" % r2)
                mlflow.log_param("C", c)
                mlflow.log_param("gamma", g)
                mlflow.log_metric("r2", r2)
                mlflow.log_metric("rmse", rmse)
                mlflow.log_metric("mae", mae)
                mlflow.sklearn.log_model(svr, "model")
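The two nested loops over C and gamma visit every combination in the grid, and each combination becomes one MLflow run. As a rough sketch of the same enumeration (the `grid` variable is only a name I chose for illustration), itertools.product gives the pairs in the same order as the loops above:

```python
from itertools import product

# Same hyperparameter grid as in train()
param = {
    "C": [0.01, 0.1, 1, 10, 100, 1000, 10000],
    "gamma": [1.0, 1e-1, 1e-2, 1e-3, 1e-4],
}

# product() varies the last argument fastest, like the inner "for g" loop
grid = list(product(param["C"], param["gamma"]))

print(len(grid))   # number of runs that train() will start
print(grid[:2])    # first two (C, gamma) pairs
```

So one call to train() should start 7 x 5 = 35 runs.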

Run the function.

train()
0.01 1.0
  RMSE: 0.876717955304063
  MAE: 0.6586558965180616
  R2: 0.007250505904323745
0.01 0.1
  RMSE: 0.872902609847314
  MAE: 0.6523680676966712
  R2: 0.015872299345786156
--snip--
10000 0.0001
  RMSE: 0.7902872331540974
  MAE: 0.570097398346025
  R2: 0.19334133272639453
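To read the numbers above, it may help to see what the three metrics compute. This is a minimal sketch in plain Python, not the scikit-learn implementation; `eval_metrics_plain` is a hypothetical helper and the toy values are made up, not taken from the wine data.

```python
import math

def eval_metrics_plain(actual, pred):
    """Return (rmse, mae, r2) for two equal-length sequences of numbers."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    rmse = math.sqrt(ss_res / n)                   # root mean squared error
    mae = sum(abs(e) for e in errors) / n          # mean absolute error
    r2 = 1.0 - ss_res / ss_tot                     # coefficient of determination
    return rmse, mae, r2

# Hypothetical toy values for illustration only
actual = [5, 6, 7, 5]
pred = [5, 5, 7, 6]
rmse, mae, r2 = eval_metrics_plain(actual, pred)

# A model that always predicts the mean of the true values gets R2 = 0
mean_pred = [sum(actual) / len(actual)] * len(actual)
_, _, r2_mean = eval_metrics_plain(actual, mean_pred)
```

Seen this way, an R2 near 0.007 for C=0.01 and gamma=1.0 means that model is hardly better than always predicting the mean quality, while the last run with R2 around 0.19 is clearly better.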

After running the code, MLflow provides a very useful web UI. To access the UI, just type the following command in a terminal ;-).
Then open http://127.0.0.1:5000/#/.

iwatobipen$ mlflow server

In the UI I can check a summary of the training runs together with their metrics.

Each model is also stored, so I can see its details and access it from the UI.

I think it is useful for managing many experiments and many models.
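The same summary can probably also be pulled into a notebook instead of the browser. The sketch below assumes a recent MLflow release where mlflow.search_runs is available and returns a pandas DataFrame (this may not match the version I used here), and that it is run from the directory that contains mlruns. `best_runs` is a hypothetical helper name.

```python
def best_runs(metric="rmse", n=5):
    """Return the n runs with the lowest value of the given logged metric."""
    # Imported here so the helper can be defined even where MLflow is missing
    import mlflow

    # Sort runs of the default experiment by the metric, smallest first
    runs = mlflow.search_runs(
        order_by=["metrics.%s ASC" % metric],
        max_results=n,
    )
    # Keep only the columns logged by train()
    return runs[["run_id", "params.C", "params.gamma", "metrics.%s" % metric]]
```

Calling best_runs() after train() should list the C and gamma pairs with the smallest RMSE.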
