Package for ML task management #chemoinformatics #memo #machine_learning #RDKit

Nowadays we can rapidly build lots of predictive models with useful ML tools such as keras, pytorch, scikit-learn, LightGBM, etc. The problem for me is how to manage these experimental results. I posted about this topic previously, using MLflow and optuna as examples. These tools have different features, but both are very useful. Today I would like to introduce another useful package, 'modeldb', which is 'An open-source system for Machine Learning model versioning, metadata, and experiment management.'

The package is OSS developed under the Apache 2.0 license, and it can be obtained from the GitHub repository. The installation is very easy.

# modeldb requires docker. To run the following commands, docker must be installed.
$ git clone https://github.com/VertaAI/modeldb.git
$ cd modeldb
$ docker-compose -f docker-compose-all.yaml up

Now I could access http://localhost:3000 from a web browser.

Then install the verta client package (I installed it from conda-forge; it can also be installed with pip).

$ conda install -c conda-forge verta

Now we are ready to test. The following code is an example of building a solubility predictive model. From a jupyter notebook, I created a client object and made my project and experiment.

import os
import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import RDConfig
from rdkit.Chem import DataStructs
from rdkit.Chem import AllChem
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

traindir = os.path.join(RDConfig.RDDocsDir,'Book/data/solubility.train.sdf')
testdir = os.path.join(RDConfig.RDDocsDir,'Book/data/solubility.test.sdf')
train_mols = [m for m in Chem.SDMolSupplier(traindir)]
test_mols = [m for m in Chem.SDMolSupplier(testdir)]
target = 'SOL_classification'

prop_dict = {
            '(A) low':0,
            '(B) medium':1,
            '(C) high':2
            }

def mol2fp(mol, radi=2, nBits=1024):
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radi, nBits=nBits)
    arr = np.zeros((0,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr
x_train = np.array([mol2fp(m) for m in train_mols])
x_test = np.array([mol2fp(m) for m in test_mols])
y_train = np.array([prop_dict[mol.GetProp(target)] for mol in train_mols])
y_test = np.array([prop_dict[mol.GetProp(target)] for mol in test_mols])

grid = {
    'C': [1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100, 1000],
    'kernel' : ['rbf', 'sigmoid', 'linear']
}

from verta import Client
from verta.utils import ModelAPI
client = Client('localhost:3000')
proj = client.set_project('chemoinfo_pj')
expt = client.set_experiment('SVC test')
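Incidentally, the grid above defines 8 × 3 = 24 hyperparameter combinations. The nested loops below walk the full Cartesian product; the same enumeration can be sketched with itertools.product from the standard library (this snippet is just an illustration, not part of the original workflow):

```python
from itertools import product

# same grid as defined above
grid = {
    'C': [1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100, 1000],
    'kernel': ['rbf', 'sigmoid', 'linear'],
}

# Cartesian product of all hyperparameter values
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combos))   # 8 C values x 3 kernels = 24 runs
print(combos[0])     # {'C': 0.0001, 'kernel': 'rbf'}
```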

As you know, scikit-learn has the useful class GridSearchCV for hyperparameter searching. However, I couldn't find a way to retrieve each individual model from that kind of experiment, so I didn't use the class and built each model manually instead. As you can see, the object returned by client.set_experiment_run() has log_xxxxx methods. These methods communicate with the server and store the experiment log, not only scores but also the model object itself. And the code is very simple ;)

for c in grid['C']:
    for kernel in grid['kernel']:
        svc = SVC(C=c, kernel=kernel)
        svc.fit(x_train, y_train)
        train_pred = svc.predict(x_train)
        train_acc = accuracy_score(y_train, train_pred)
        pred = svc.predict(x_test)
        test_acc = accuracy_score(y_test, pred)
        run = client.set_experiment_run()
        run.log_hyperparameter('C',c)
        run.log_hyperparameter('kernel', kernel)
        run.log_metric('test_acc', test_acc)
        run.log_metric('train_acc', train_acc)
        run.log_model(svc)

After that, I could see all the results in the web browser. I uploaded some screenshots.

Then, if I want to use the best-scoring model, I can fetch it from the management server and use it in only a few lines.

# best scored model's run id
run_id = 'b6322728-f056-467d-a6ac-7cfd8818b7f6'
experiment1 = client.get_experiment_run(id=run_id)
mdl = experiment1.get_artifact('model.pkl')
type(mdl)
> sklearn.svm._classes.SVC
pred = mdl.predict(x_test)
accuracy_score(y_test, pred)
> 0.7354085603112841
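Here I copied the run id from the web UI, but the best run could also be picked programmatically by collecting the scores during the training loop. A minimal pure-Python sketch, independent of the verta API (run_results is a hypothetical list that would be appended to inside the loop above, e.g. with run.id and test_acc):

```python
# hypothetical records collected inside the training loop, e.g.:
# run_results.append({'run_id': run.id, 'C': c, 'kernel': kernel, 'test_acc': test_acc})
run_results = [
    {'run_id': 'aaa', 'C': 1,   'kernel': 'rbf',     'test_acc': 0.71},
    {'run_id': 'bbb', 'C': 10,  'kernel': 'linear',  'test_acc': 0.74},
    {'run_id': 'ccc', 'C': 100, 'kernel': 'sigmoid', 'test_acc': 0.65},
]

# select the run with the highest test accuracy
best = max(run_results, key=lambda r: r['test_acc'])
print(best['run_id'])   # 'bbb'
```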

The ModelDB server also has a repository, so users can commit environment information, data sets, code, etc. to it. This is useful for version management.

ModelDB also provides a deployment method, so I think it's an interesting and useful package for chemoinformatics tasks.

Thanks for reading ;)

Published by iwatobipen

I'm a medicinal chemist at a mid-size pharmaceutical company. I love chemoinfo, coding, organic synthesis, and my family.
