Tool for machine learning model logging #chemoinformatics #machine_learning

Managing machine learning models is difficult: obtaining a good model requires many trials, known as parameter optimization, and each trial generates another model to keep track of.

Optuna is a useful package for model management and parameter optimization. I like it and have posted some code about Optuna before.

Today I would like to introduce another model management tool, modellogger. Modellogger not only stores model information but also visualizes it. It can be installed with pip.

Let’s go to the code. The following example uses a data set from RDKit.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.metrics import accuracy_score

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs

from modellogger.modellogger import ModelLogger
mlog = ModelLogger('mllog.db')

First, import the packages and create a ModelLogger object. Creating it makes an sqlite3 database.
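Because the log is an sqlite3 database, it can be inspected with Python's standard sqlite3 module. The sketch below lists the tables in a SQLite file. The helper name list_tables and the stand-in table demo_models are hypothetical; they are not part of modellogger, and the real schema of mllog.db is not described in this post.

```python
import os
import sqlite3
import tempfile


def list_tables(db_path):
    """Return the names of all tables in the SQLite file at db_path."""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return [name for (name,) in rows]


# Stand-in database so the sketch runs without modellogger installed.
# The table name `demo_models` is hypothetical.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.db")
con = sqlite3.connect(demo_path)
con.execute("CREATE TABLE demo_models (name TEXT, score REAL)")
con.commit()
con.close()

tables = list_tables(demo_path)
print(tables)
```

Pointing list_tables at 'mllog.db' after running the code in this post should show whatever tables modellogger created.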

Next, I made the data set and a scoring function. I used the accuracy score because this is a classification task.

def mol2arr(mol, radi=2, nBits=1024):
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radi, nBits=nBits)
    arr = np.zeros((0,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

class RDKit_sol_data():
    def __init__(self, trainsdf, testsdf):
        self.trainsdf = trainsdf
        self.testsdf = testsdf
        # Map the solubility class labels stored in the SDF files to integers
        self.cls_dict = {
            '(A) low': 0,
            '(B) medium': 1,
            '(C) high': 2
        }
        # SDF property holding the class label (attribute name `prop` is assumed)
        self.prop = 'SOL_classification'
        # Skip molecules that RDKit could not parse
        self.train_mols = [m for m in Chem.SDMolSupplier(self.trainsdf) if m is not None]
        self.test_mols = [m for m in Chem.SDMolSupplier(self.testsdf) if m is not None]

    def get_dataset(self):
        train_y = [self.cls_dict[m.GetProp(self.prop)] for m in self.train_mols]
        test_y = [self.cls_dict[m.GetProp(self.prop)] for m in self.test_mols]
        return {
            'train_mols': self.train_mols,
            'train_y': np.array(train_y),
            'test_mols': self.test_mols,
            'test_y': np.array(test_y)
        }

def scoringfunc(test_Y, pred_Y):
    return accuracy_score(test_Y, pred_Y)
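
To make clear what the scoring function measures, here is a plain-Python sketch of accuracy: the fraction of predictions that equal the true label. The function name accuracy and the label lists are made up for illustration; the post itself uses scikit-learn's accuracy_score.

```python
def accuracy(true_labels, pred_labels):
    """Fraction of positions where the predicted label equals the true label."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label lists must have the same length")
    if len(true_labels) == 0:
        raise ValueError("label lists must not be empty")
    hits = sum(1 for t, p in zip(true_labels, pred_labels) if t == p)
    return hits / len(true_labels)


# Made-up labels using the 0/1/2 encoding from cls_dict
true_y = [0, 1, 2, 2, 1]
pred_y = [0, 2, 2, 2, 1]
score = accuracy(true_y, pred_y)
print(score)  # 4 of the 5 labels match
```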

Almost there. Now I load the data and build the models: RandomForestClassifier, SVC and GaussianProcessClassifier.

## Data preparation
radi = 2
nBits = 1024

train_path = './data/solubility.train.sdf'
test_path = './data/solubility.test.sdf'

loader = RDKit_sol_data(train_path, test_path)
dataset = loader.get_dataset()

featlist = [f'fp_{i}' for i in range(nBits)]

trainFP = [mol2arr(m, radi=radi, nBits=nBits) for m in dataset['train_mols']]
train_X = np.array(trainFP)
train_X = pd.DataFrame(train_X, columns=featlist)
train_Y = dataset['train_y']

testFP = [mol2arr(m, radi=radi, nBits=nBits) for m in dataset['test_mols']]
test_X = np.array(testFP)
test_X = pd.DataFrame(test_X, columns=featlist)
test_Y = dataset['test_y']

rf_cls = RandomForestClassifier()
sv_cls = SVC(C=100, gamma='auto')
gp_cls = GaussianProcessClassifier()
modelnames = ['rf_cls', 'sv_c', 'gp_c']

for idx, mdl in enumerate([rf_cls, sv_cls, gp_cls]):
    # Fit each model, then log it with its accuracy on the test set
    mdl.fit(train_X, train_Y)
    pred_Y = mdl.predict(test_X)
    mlog.store_model(modelnames[idx], mdl, train_X, scoringfunc(test_Y, pred_Y))
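
The idea behind logging each model with its score can be sketched with the standard library alone. This is not how modellogger is implemented; the table model_log, its columns, and the helpers open_log, log_model and best_model are all hypothetical, and the scores are placeholders rather than results from the solubility data.

```python
import json
import sqlite3


def open_log(path=":memory:"):
    """Open (or create) a tiny model log; table and column names are made up."""
    con = sqlite3.connect(path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS model_log ("
        "name TEXT PRIMARY KEY, params TEXT, score REAL)"
    )
    return con


def log_model(con, name, params, score):
    """Store one model's name, hyperparameters (as JSON) and score."""
    con.execute(
        "INSERT OR REPLACE INTO model_log VALUES (?, ?, ?)",
        (name, json.dumps(params, sort_keys=True), float(score)),
    )
    con.commit()


def best_model(con):
    """Return (name, params, score) for the highest-scoring logged model."""
    row = con.execute(
        "SELECT name, params, score FROM model_log ORDER BY score DESC LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    name, params, score = row
    return name, json.loads(params), score


con = open_log()
# Placeholder scores, not results from the solubility data
log_model(con, "rf_cls", {"n_estimators": 100}, 0.70)
log_model(con, "sv_c", {"C": 100, "gamma": "auto"}, 0.65)
best = best_model(con)
print(best)
```

With scikit-learn models, the dict returned by mdl.get_params() could be passed as params, as long as its values are JSON-serializable.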


In the final part of the code, I called model_profiles(). After calling the method, a local server launches and the performance of the models becomes visible.

Modellogger is very interesting and easy to use. I uploaded today’s code to my GitHub repo.


Published by iwatobipen

I'm a medicinal chemist at a mid-size pharmaceutical company. I love chemoinfo, coding, organic synthesis, and my family.
