Open Source Lilly’s Chemoinformatics Package

In 2012, Lilly’s researchers published the Lilly-MedChem Rules in J. Med. Chem. and disclosed their code on GitHub. Since the publication, the rules have been used in many papers and chemoinformatics applications. The open source tool made a big impact on chemoinformatics. Several hours ago I found an interesting tweet from @jcheminf.

They reported an algorithm for retrosynthesis. Data-driven retrosynthetic analysis is a hot topic in the chemoinformatics area, I think.

The article was published by Lilly, and the authors uploaded the source code to GitHub. The URL is below.

Their implementation is different from Segler’s approach (‘Planning chemical syntheses with deep neural networks and symbolic AI’). They do not use a machine learning approach but a Reverse Reaction Template (RRT) based approach. An RRT defines a reaction rule, and the rules are extracted from mapped reaction data such as Lowe’s US patent dataset.

At first they built an RRT repository and used it for the analysis. After making the repository, a researcher inputs a query structure to the system; the system searches for RRTs applicable to the query and records them when matched.
The key point is how to make the RRTs, I think. More details are described in the article.
They benchmarked their system with 919 known drug structures from DrugBank. The performance seems to depend on the RRT settings, radius and support. A radius-0 RRT seems more general than a radius-2 RRT, similar to ECFP. (Table 2)
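
The RRT idea itself can be sketched with an ordinary reaction-SMARTS engine. Below is a minimal illustration using RDKit (which LillyMol itself does not depend on); the amide-disconnection template is a made-up example for illustration, not one of the templates extracted in the paper:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Hypothetical reverse reaction template (retro amide formation):
# amide -> carboxylic acid + amine
rrt = AllChem.ReactionFromSmarts(
    '[C:1](=[O:2])[NH:3][C:4]>>[C:1](=[O:2])[OH].[NH2:3][C:4]')

product = Chem.MolFromSmiles('CC(=O)NC1CCCCC1')  # N-cyclohexylacetamide
for precursors in rrt.RunReactants((product,)):
    # Each match of the template yields one proposed precursor set.
    print('.'.join(Chem.MolToSmiles(m) for m in precursors))
```

Running this prints the acetic acid / cyclohexylamine precursor pair; a real RRT system applies thousands of such templates, each annotated with the patents it was extracted from.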

After reading the article, I would like to use the code. OK let’s try it!

I checked the repo last year, but at that time the code supported Linux only. However, the code now supports not only Linux but also OS X. ;-)
It is easy to install the toolkit. OK let’s install the TK and use it.
For installation, gcc >= 6.2.0 and zlib >= 1.2.11 are required, so I installed them with Homebrew.

iwatobipen$ brew install zlib
iwatobipen$ brew install gcc

Then clone the repository and change the ZLIB part in makefile.public.OSX-gcc-8. I installed zlib via Homebrew, so I changed ZLIB to ‘/usr/local/Cellar/zlib….’.

All the code is implemented in C++, and it does not use any chemoinformatics packages such as RDKit, Open Babel or CDK!! @_@

iwatobipen$ git clone
iwatobipen$ cd LillyMol
iwatobipen$ vim makefile.public.OSX-gcc-8
-- ZLIB = /usr/local/opt/zlib/lib
++ ZLIB = /usr/local/Cellar/zlib/1.2.11/lib

Now ready! After changing the makefile, run make. After waiting several minutes, the installation will finish. All commands are generated in ./bin/OSX-gcc-8/. Many commands are provided.

iwatobipen$ cd bin/OSX-gcc-8/
iwatobipen$ ls
activity_consistency	iwcut			msort			ring_extraction		rxn_substructure_search	trxn
common_names		iwdemerit		preferred_smiles	ring_trimming		smiles_mutation		tsubstructure
concat_files		mol2qry			random_smiles		rotatable_bonds		sp3_filter		unique_molecules
fetch_smiles_quick	molecular_scaffold	retrosynthesis		rxn_signature		tautomer_generation	unique_rows
fileconv		molecule_subset		rgroup			rxn_standardize		tp_first_pass

Details of the commands are described in the wiki page.

I checked the retrosynthesis code with the example data. It was a little difficult for me to set the options.

iwatobipen$ cd ./example/retrosynthesis
iwatobipen$ cat 1Cmpds.smi 
> C(=O)(C)NC1=CC(=C(O)C=C1)CN1CCC(NC(=O)C2=CC=CC=C2)CC1.O
iwatobipen$  ../../bin/OSX-gcc-8/retrosynthesis -Y all -X kg -X kekule -X ersfrm -a 2 -q f -v -R 1 -I CentroidRxnSmi_1 -P UST:AZUCORS 1Cmpds.smi >log.txt 2>err.txt

Check log.txt and err.txt.

iwatobipen$ cat log.txt
O.Oc1ccc(NC(=O)C)cc1.C=O.O=C(NC1CCNCC1)c1ccccc1  via US03992389_NA CentroidRxnSmi_1 R 1 ALL
Oc1ccc(NC(=O)C)cc1.C=O.O=C(NC1CCNCC1)c1ccccc1  via US03992389_NA CentroidRxnSmi_1 R 1 SPFRM.1
Oc1ccc(NC(=O)C)cc1  via US03992389_NA CentroidRxnSmi_1 R 1
O=C  via US03992389_NA CentroidRxnSmi_1 R 1
O=C(NC1CCNCC1)c1ccccc1  via US03992389_NA CentroidRxnSmi_1 R 1

iwatobipen$ cat err.txt
Will not write product fragments with fewer than 2 atoms
Will keep going after an individual test failure
Will preserve Kekule forms
Will use the reaction file name as the reaction name
Reading reactions took 0 seconds
read mol smi eof
Read 1 molecules, 1 deconstructed
1 molecules deconstructed at radius 1
0 deconstructions done at radius 0
1 deconstructions done at radius 1
Set_of_Reactions: CentroidRxnSmi_1 with 164 reactions
2 molecules deconstructed at radius 1
2 molecules deconstructed
Set_of_Reactions: CentroidRxnSmi_1 with 164 reactions
 1 US03947458_NA 1 searches, 0 matches found
 1 US03947473_NA 1 searches, 0 matches found
 1 US03989717_NA 1 searches, 0 matches found
 ----- snip ;-) ------
 1 US20160002218A1_0322 1 searches, 0 matches found
 1 US20160200725A1_0864 1 searches, 0 matches found
2 molecules deconstructed at radius 1
2 molecules deconstructed
163 reactions had 0 hits
1 reactions had 1 hits

It is difficult for me to understand SMILES strings directly, so OK, let’s visualize them with RDKit!

from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
parent = Chem.MolFromSmiles('C(=O)(C)NC1=CC(=C(O)C=C1)CN1CCC(NC(=O)C2=CC=CC=C2)CC1.O')
mol1 = Chem.MolFromSmiles('O.Oc1ccc(NC(=O)C)cc1.C=O.O=C(NC1CCNCC1)c1ccccc1')
mol2 = Chem.MolFromSmiles('Oc1ccc(NC(=O)C)cc1.C=O.O=C(NC1CCNCC1)c1ccccc1')
mol3 = Chem.MolFromSmiles('Oc1ccc(NC(=O)C)cc1')
mol4 = Chem.MolFromSmiles('O=C')
mol5 = Chem.MolFromSmiles('O=C(NC1CCNCC1)c1ccccc1')
Draw.MolsToGridImage([parent, mol1, mol2, mol3, mol4, mol5])
Single step retrosynthesis with Lilly’s TK

The example code worked without any problems. But it failed when I used one of my own molecules as a query. One of the reasons is that I used very limited training data: I used the default data for my test, and it has only 164 reactions.

I would like to try to make RRTs with a large data set.


Make interactive chemical space plot in jupyter notebook #cheminformatics #Altair

I often use seaborn for data visualization. With the library, users can make beautiful visualizations.
BTW, today I tried another library that can make interactive plots in a Jupyter notebook.
The name of the library is ‘Altair’.
The library can be installed from pip or conda, and the package is based on Vega and Vega-Lite. Vega is a declarative grammar for data visualization.

According to the tutorial, Altair can make beautiful plots with very simple code. I wrote an example that plots the chemical space of cdk2.sdf, which is provided with RDKit.

The following code was run on Google Colab.
At first, install RDKit in the environment. Fortunately, Altair can be imported directly on Colab without any installation.

!chmod +x
!time bash ./ -b -f -p /usr/local
!time conda install -q -y -c conda-forge rdkit

After the installation of RDKit, append the RDKit path to sys.path.

import sys
import os
# NOTE: this site-packages path is an assumption; it depends on the Python
# version of the conda environment installed above.
sys.path.append('/usr/local/lib/python3.7/site-packages')

Now ready. Let’s import some libraries.

import pandas as pd
import numpy as np
import altair as alt

from rdkit import Chem
from rdkit import rdBase
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem import PandasTools
from rdkit.Chem import RDConfig
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole

from sklearn.decomposition import PCA

Then load the dataset.

moldf = PandasTools.LoadSDF(os.path.join(RDConfig.RDDocsDir, 'Book/data/cdk2.sdf'))
moldf['SMILES'] = moldf.ROMol.apply(Chem.MolToSmiles)
def mol2fparr(mol):
    arr = np.zeros((0,))
    fp = AllChem.GetMorganFingerprintAsBitVect(mol,2)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

Then conduct PCA on the molecular fingerprints to plot the chemical space,
and make a new data frame. cdk2.sdf has a Cluster number, so I used the annotation for coloring.

pca = PCA(n_components=2)
X = np.asarray([mol2fparr(mol) for mol in moldf.ROMol])
res = pca.fit_transform(X)
moldf['PCA1'] = res[:,0]
moldf['PCA2'] = res[:,1]
moldf2 = moldf[['ID', 'PCA1', 'PCA2', 'SMILES' ]]
moldf2['Cluster'] = ["{:0=2}".format(int(cls)) for cls in moldf.loc[:,'Cluster']]
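
The quality of such a 2D map can be sanity-checked with PCA’s explained_variance_ratio_, which reports how much of the fingerprint variance the two components capture (with binary Morgan fingerprints it is often low, so the map is only a rough view). A self-contained sketch, with random bit vectors standing in for the real fingerprints:

```python
import numpy as np
from sklearn.decomposition import PCA

# Random 2048-bit vectors stand in for the Morgan fingerprints above.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(60, 2048)).astype(float)

pca = PCA(n_components=2)
coords = pca.fit_transform(X)
print(coords.shape)                   # (60, 2)
print(pca.explained_variance_ratio_)  # variance captured by PCA1 / PCA2
```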

To make a scatter plot in Altair, it is easy: just call ‘alt.Chart(pandas_dataframe).mark_point()’.
mark_* are the methods that give access to many kinds of plots.
From the documentation, the following plots are provided.

Mark name | Method | Description | Example
area | mark_area() | A filled area plot. | Simple Stacked Area Chart
bar | mark_bar() | A bar plot. | Simple Bar Chart
circle | mark_circle() | A scatter plot with filled circles. | One Dot Per Zipcode
geoshape | mark_geoshape() | A geographic shape. | Choropleth Map
line | mark_line() | A line plot. | Simple Line Chart
point | mark_point() | A scatter plot with configurable point shapes. | Multi-panel Scatter Plot with Linked Brushing
rect | mark_rect() | A filled rectangle, used for heatmaps. | Simple Heatmap
rule | mark_rule() | A vertical or horizontal line spanning the axis. | Candlestick Chart
square | mark_square() | A scatter plot with filled squares. | N/A
text | mark_text() | A scatter plot with points represented by text. | Bar Chart with Labels
tick | mark_tick() | A vertical or horizontal tick mark. | Simple Strip Plot

Now I would like to make a scatter plot, so I call mark_point().
The “interactive()” method returns an interactive plot in Jupyter.
So after running the code, I can see an interactive plot in the notebook; the plot shows a tooltip on mouse-over of a point.

alt.Chart(moldf2).mark_point().encode(
           x = 'PCA1',
           y = 'PCA2',
           color = 'Cluster',
           tooltip = ['ID', 'SMILES']).interactive()

This library is interesting to me because it is easy to implement tooltips. I tried to embed an SVG image in the tooltip, but it did not work. I would like to know how to embed an image in the tooltip, if it is possible.

How data is visualized is important, because the same data can give a different impression with a different visualization.

Readers who are interested in this post can find the whole code on Google Colab or GitHub. ;-)

Build stacking Classification QSAR model with mlxtend #chemoinformatics #mlxtend #RDKit

I posted about the ML method named ‘blending’ some days ago. And a reader suggested that I try “mlxtend”.
I had come across it when I was learning about ensemble learning packages in Python, but had never used it.
So let’s use the library to build a model.
Mlxtend is easy to install, and good documentation is provided at the following URL.

The following code is an example of stacking.
In an IPython notebook…
Use bace.csv for the test and load some functions.

%matplotlib inline

import warnings
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
import numpy as np
import pandas as pd

df = pd.read_csv("bace.csv")

The dataset has pIC50 as the objective value.

mols = [Chem.MolFromSmiles(smi) for smi in df.mol]
fps = [AllChem.GetMorganFingerprintAsBitVect(mol,2, nBits=1024) for mol in mols]
pIC50 = [i for i in df.pIC50]
Draw.MolsToGridImage(mols[:10], legends=["pIC50 "+str(i) for i in pIC50[:10]], molsPerRow=5)

Images of the compounds are below.

Then convert the fingerprints to numpy arrays, and make a binary activity array as y_bin.

X = []
for fp in fps:
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    X.append(arr)
X = np.array(X)
y = np.array(pIC50)
y_bin = np.asarray(y > 7, dtype=np.int32)

Then load some classifier model from sklearn and split data for training and testing.

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix
from sklearn.decomposition import PCA
from xgboost import XGBClassifier
from mlxtend.classifier import StackingClassifier
from mlxtend.plotting import plot_decision_regions
from mlxtend.plotting import plot_confusion_matrix
import numpy as np
x_train, x_test, y_train, y_test = train_test_split(X,y_bin, test_size=0.2)

To make a stacking classifier is very simple: just call StackingClassifier and set classifiers and meta_classifier as the arguments.
I used SVC as the meta_classifier.

clf1 = RandomForestClassifier(random_state=794)
clf2 = GaussianNB()
clf3 = XGBClassifier(random_state=0)
clf4 = SVC(random_state=0)
clflist = ["RF", "GNB", "XGB", "SVC", "SCLF"]

sclf = StackingClassifier(classifiers=[clf1,clf2,clf3], meta_classifier=clf4)

Then let’s train on the data!

skf = StratifiedKFold(n_splits=5)
for j, (train_idx, test_idx) in enumerate(skf.split(x_train, y_train)):
    for i, clf in enumerate([clf1, clf2, clf3, clf4, sclf]):[train_idx], y_train[train_idx])
        ypred = clf.predict(x_train[test_idx])
        acc = accuracy_score(y_train[test_idx], ypred)
        b_acc = balanced_accuracy_score(y_train[test_idx], ypred)
        print("round {}".format(j))
        print("accuracy {}".format(acc))
        print("balanced accuracy {}".format(b_acc))

> output
round 0
accuracy 0.8148148148148148
balanced accuracy 0.8026786943947115
round 0
accuracy 0.6625514403292181
balanced accuracy 0.680450351191296
round 0
accuracy 0.8271604938271605
balanced accuracy 0.8136275995042005
round 0
accuracy 0.7325102880658436
balanced accuracy 0.7072717256576229
round 0
accuracy 0.8148148148148148
balanced accuracy 0.8026786943947115
round 1
accuracy 0.7603305785123967
balanced accuracy 0.7534683684794672
round 1
accuracy 0.640495867768595
balanced accuracy 0.6634988901220866
round 1
accuracy 0.8140495867768595
balanced accuracy 0.8127081021087681
round 1
accuracy 0.756198347107438
balanced accuracy 0.7414678135405106

Readers who are interested in stacking can find a nice document here.

And all my code is uploaded to my repo on GitHub.

Mlxtend has many functions for building, analyzing and visualizing machine learning models and data. I will use the package more and more.

Vote Vote Vote #chemoinformatics

Some days ago, I posted about an ensemble classification method named ‘blending’. The method is not implemented in scikit-learn, so I am implementing the function now.
By the way, scikit-learn has an ensemble classification method named ‘VotingClassifier’.
The following explanation is from the sklearn documentation.

The idea behind the VotingClassifier is to combine conceptually different machine learning classifiers and use a majority vote or the average predicted probabilities (soft vote) to predict the class labels. Such a classifier can be useful for a set of equally well performing model in order to balance out their individual weaknesses.

The classifier can combine many classifiers very easily.
The function has two modes: one is hard and the other is soft.
From the documentation…
If ‘hard’, uses predicted class labels for majority rule voting. Else if ‘soft’, predicts the class label based on the argmax of the sums of the predicted probabilities, which is recommended for an ensemble of well-calibrated classifiers.
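
A small self-contained sketch contrasting the two modes (synthetic data and illustrative estimators):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
estimators = [('lr', LogisticRegression(max_iter=1000)),
              ('rf', RandomForestClassifier(random_state=0)),
              ('gnb', GaussianNB())]

# 'hard': majority vote over the predicted class labels.
hard = VotingClassifier(estimators=estimators, voting='hard').fit(X, y)
# 'soft': argmax of the averaged predicted probabilities.
soft = VotingClassifier(estimators=estimators, voting='soft').fit(X, y)
print(hard.predict(X[:5]), soft.predict(X[:5]))
```

Soft voting needs every base estimator to implement predict_proba, which is why SVC is constructed with probability=True in the code below.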

I used the classifier for QSAR modeling.

In the following code, I used four classifiers and bace.csv from MoleculeNet as the test dataset.
The code is very simple! Just pass the defined classifiers to VotingClassifier.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from xgboost import XGBClassifier

clf_dict = {'RF': RandomForestClassifier(n_estimators=100),
        'ETC': ExtraTreesClassifier(n_estimators=100),
        'GBC': GradientBoostingClassifier(learning_rate=0.05),
        'XGB': XGBClassifier(n_estimators=100),
        'SVC': SVC(probability=True, gamma='auto')}

voting_clf = VotingClassifier(estimators=[("RF", clf_dict["RF"]),
                                        ("GBC", clf_dict["GBC"]),
                                        ("XGB", clf_dict["XGB"]),
                                        ("SVC", clf_dict["SVC"])        
                                    ], voting='hard')

dataset = np.load("train.npz")['arr_0']
X = dataset[:,:-1]
y = dataset[:,-1]
idx = np.random.permutation(y.size)
X = X[idx]
y = y[idx]
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=794)
nfolds = 10
skf = StratifiedKFold(nfolds)
for i, (train, val) in enumerate(skf.split(train_X, train_y)):
    #print('fold {}'.format(i))
    X_train = train_X[train]
    y_train = train_y[train]
    X_val = train_X[val]
    y_val = train_y[val], y_train)
y_pred = voting_clf.predict(test_X)
print(confusion_matrix(test_y, y_pred))
print(classification_report(test_y, y_pred))

rf_clf = RandomForestClassifier(n_estimators=100)
for i, (train, val) in enumerate(skf.split(train_X, train_y)):
    #print('fold {}'.format(i))
    X_train = train_X[train]
    y_train = train_y[train]
    X_val = train_X[val]
    y_val = train_y[val], y_train)
y_pred = rf_clf.predict(test_X)
print("Random Forest!")
print(confusion_matrix(test_y, y_pred))
print(classification_report(test_y, y_pred))

svc_clf = SVC(probability=True, gamma='auto')
for i, (train, val) in enumerate(skf.split(train_X, train_y)):
    #print('fold {}'.format(i))
    X_train = train_X[train]
    y_train = train_y[train]
    X_val = train_X[val]
    y_val = train_y[val], y_train)
y_pred = svc_clf.predict(test_X)
print(confusion_matrix(test_y, y_pred))
print(classification_report(test_y, y_pred)) 

Then run the code.
In this example the voting method does not outperform the random forest or the support vector classifier. But it is worth knowing that sklearn provides this useful feature for ensemble learning, I think. ;-)

iwatobipen$ python 
[[10  0  0]
 [ 0 11  0]
 [ 0  1  8]]
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        10
         1.0       0.92      1.00      0.96        11
         2.0       1.00      0.89      0.94         9

   micro avg       0.97      0.97      0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30

Random Forest!
[[10  0  0]
 [ 0 11  0]
 [ 0  1  8]]
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        10
         1.0       0.92      1.00      0.96        11
         2.0       1.00      0.89      0.94         9

   micro avg       0.97      0.97      0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30

[[10  0  0]
 [ 0 11  0]
 [ 0  1  8]]
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        10
         1.0       0.92      1.00      0.96        11
         2.0       1.00      0.89      0.94         9

   micro avg       0.97      0.97      0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30

Visualize pharmacophore in RDKit #RDKit

RDKit has a pharmacophore feature assignment function. The function can retrieve molecular features based on pre-defined feature definitions.
And RDKit’s IPythonConsole can draw molecules in an IPython/Jupyter notebook.
Today I tried to visualize pharmacophore features in the notebook.
The code is very simple.

from rdkit import Chem
from rdkit.Chem import ChemicalFeatures
from rdkit import rdBase
from rdkit.RDPaths import RDDocsDir
from rdkit.RDPaths import RDDataDir
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit.Chem import AllChem
import os
print(rdBase.rdkitVersion)
> 2018.09.1

First, load feature definition.

fdefFile = os.path.join(RDDataDir,'BaseFeatures.fdef')
featFact = ChemicalFeatures.BuildFeatureFactory(fdefFile)

Then calculate the pharmacophore features, and compute the 2D coordinates.

mols = [m for m in Chem.SDMolSupplier(os.path.join(RDDocsDir,"Book/data/cdk2.sdf"))]
featslists = [featFact.GetFeaturesForMol(mol) for mol in mols]
for mol in mols:
    AllChem.Compute2DCoords(mol)

Next I defined a drawing function. To highlight the pharmacophore features, highlightAtomLists and legends are used as optional arguments of MolsToGridImage.

def drawp4core(mol, feats):
    atoms_list = {}
    for feat in feats:
        atom_ids = feat.GetAtomIds()
        feat_type = feat.GetType()
        atoms_list[feat_type] = atom_ids
    return Draw.MolsToGridImage([mol]*len(atoms_list), legends=list(atoms_list.keys()), highlightAtomLists=list(atoms_list.values()))

Test the function.

im = drawp4core(mols[1], featslists[1])

I could get following image.

The function worked, and it could pick up the pharmacophore features of the given molecule. ;-)
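
As a minimal standalone check of the feature factory, the snippet below (a simplified variant of the code above, using phenol as a hypothetical input) lists the feature families RDKit assigns:

```python
import os
from rdkit import Chem, RDConfig
from rdkit.Chem import ChemicalFeatures

# Load the same base feature definitions used in the post.
fdef = os.path.join(RDConfig.RDDataDir, 'BaseFeatures.fdef')
factory = ChemicalFeatures.BuildFeatureFactory(fdef)

mol = Chem.MolFromSmiles('c1ccccc1O')  # phenol, a small test molecule
feats = factory.GetFeaturesForMol(mol)
for f in feats:
    print(f.GetFamily(), f.GetType(), f.GetAtomIds())
```

The hydroxyl group shows up as a donor and the ring as an aromatic feature, which is exactly the information the drawing function highlights.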

Generate possible lists of SMILES with RDKit #RDKit

In computer vision, data augmentation techniques are often used to build large data sets. On the other hand, canonical SMILES representations are the norm in the chemoinformatics area.
At the RDKit UGM last year, Dr. Esben proposed a new approach for RNNs with SMILES. He expanded 602 training molecules to almost 8000 molecules with this alternative-SMILES-representations technique.
This approach seems to work well.
In the UGM hackathon this year, this random SMILES generation function was implemented, and it can be called from the new version of RDKit!
I appreciate rdkit developers!

It is very easy to use; please see the code below.

from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit import rdBase

I used a kinase inhibitor as an example.

testsmi = 'CC(C1=C(C=CC(=C1Cl)F)Cl)OC2=C(N=CC(=C2)C3=CN(N=C3)C4CCNCC4)N'
mol = Chem.MolFromSmiles(testsmi)

The default output of MolToSmiles is in canonical form.

print(Chem.MolToSmiles(mol))
But if you set MolToSmiles with doRandom=True option, the function return random but valid SMILES.

mols = []
for _ in range(50):
  smi = Chem.MolToSmiles(mol, doRandom=True)
  m = Chem.MolFromSmiles(smi)
  mols.append(m)

#check molecules
Draw.MolsToGridImage(mols, molsPerRow = 10)

Different SMILES but same molecule!
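
A quick way to confirm that all of the generated strings encode the same molecule is to re-canonicalize each random SMILES and check that only one canonical form remains (doRandom requires a recent RDKit release):

```python
from rdkit import Chem

mol = Chem.MolFromSmiles(
    'CC(C1=C(C=CC(=C1Cl)F)Cl)OC2=C(N=CC(=C2)C3=CN(N=C3)C4CCNCC4)N')
randoms = [Chem.MolToSmiles(mol, doRandom=True) for _ in range(50)]

# Re-canonicalizing every random SMILES should give a single unique string.
canon = {Chem.MolToSmiles(Chem.MolFromSmiles(smi)) for smi in randoms}
print(len(canon))  # 1
```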

There are many deep learning approaches which use SMILES as input. This technique is useful for augmenting the input data of such models, I think.

I uploaded my example code to Google Colab and to my GitHub repo.




Tracking progress of machine learning #MachineLearning

To conduct machine learning, it is necessary to optimize hyperparameters.
For example, scikit-learn provides a grid search method. And you know there are several packages to do that, such as hyperopt or GPyOpt. But how do you manage the built models? It is difficult for me.
Recently I have been interested in mlflow. MLflow is an open source platform for managing the end-to-end machine learning lifecycle. It tackles three primary functions.
MLflow can track the hyperparameters of each model, serve the models, and also provide a good web UI.
I used it in a very simple example.
Code is below.
At first I got the sample data and visualized the data set with PCA.

# on the ipython notebook
%matplotlib inline
!wget -P ./data/
import matplotlib.pyplot as plt
import matplotlib.colors as colors
import as cm
import pandas as pd
import numpy as np

from sklearn.decomposition import PCA
data = pd.read_csv('./data/wine-quality.csv')

cmap = plt.get_cmap("Blues", len(data.quality.unique()))
pca = PCA()
wine_pca = pca.fit_transform(data.iloc[:,:-1])
plt.scatter(wine_pca[:,0], wine_pca[:,1], c=data.quality, cmap=cmap)
plt.xlim(np.min(wine_pca[:,0]), np.max(wine_pca[:,0]))
plt.ylim(np.min(wine_pca[:,1]), np.max(wine_pca[:,1]))

Next, the train function is defined.
The mlflow.log_param function can track the scoring parameters, and mlflow.sklearn.log_model can store the model.
After running the code, an mlruns folder is generated in the current directory and the data is stored there.

def train():

    import os
    import warnings
    import sys
    import pandas as pd
    import numpy as np
    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVR
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold
    from sklearn.model_selection import cross_val_score
    import mlflow
    import mlflow.sklearn

    def eval_metrics(actual, pred):
        rmse = np.sqrt(mean_squared_error(actual, pred))
        mae = mean_absolute_error(actual, pred)
        r2 = r2_score(actual, pred)
        return rmse, mae, r2

    data = pd.read_csv('./data/wine-quality.csv')
    train, test = train_test_split(data)

    train_x = train.drop(["quality"], axis=1)
    test_x = test.drop(["quality"], axis=1)
    train_y = train[["quality"]]
    test_y = test[["quality"]]
    param = {'C':[0.01, 0.1, 1, 10, 100, 1000, 10000 ],
             'gamma':[1.0, 1e-1, 1e-2, 1e-3, 1e-4]}
    for c in param['C']:
        for g in param['gamma']:
            with mlflow.start_run():
                skf = StratifiedKFold(n_splits=5)
                svr = SVR(C=c, gamma=g)
                score = cross_val_score(svr, train_x, train_y, cv=skf, n_jobs=-1)
      , train_y)
                predicted_qualities = svr.predict(test_x)
                (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)
                print("  RMSE: %s" % rmse)
                print("  MAE: %s" % mae)
                print("  R2: %s" % r2)
                mlflow.log_param("C", c)
                mlflow.log_param("gamma", g)
                mlflow.log_metric("r2", r2)
                mlflow.log_metric("rmse", rmse)
                mlflow.log_metric("mae", mae)
                mlflow.sklearn.log_model(svr, "model")

Run the function.

0.01 1.0
  RMSE: 0.876717955304063
  MAE: 0.6586558965180616
  R2: 0.007250505904323745
0.01 0.1
  RMSE: 0.872902609847314
  MAE: 0.6523680676966712
  R2: 0.015872299345786156
10000 0.0001
  RMSE: 0.7902872331540974
  MAE: 0.570097398346025
  R2: 0.19334133272639453

After running the code, MLflow provides a very useful web UI. To access the UI, just type the following command in a terminal ;-).
And then access the URL shown in the console.

iwatobipen$ mlflow server

I can check the summary of the training with the metrics, like below.

And each model is stored; I can see the details of each model and access it, like below.

It is useful for managing many experiments and many models, I think.