Useful ML tool for chemoinformatics #chemoinformatics #RDKit #Machine learning

Yesterday I moved my main PC from Ubuntu 18.04 to 20.04 LTS. It works well now, and I'm building a new (clean) environment for my coding.

Today I would like to share a useful machine learning package named PyCaret. A brief introduction to PyCaret is below.

—from original site—
PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows. It is an end-to-end machine learning and model management tool that speeds up the experiment cycle exponentially and makes you more productive.
In comparison with the other open-source machine learning libraries, PyCaret is an alternate low-code library that can be used to replace hundreds of lines of code with few words only.

It means that users don't need to write long code for model building and can focus on how to apply and utilize their models.

I read the tutorial section of the PyCaret documentation and tried to use it for a chemoinformatics problem.

The following code is an example of a simple classification task. The data is borrowed from the RDKit solubility dataset. Let's start coding.

I already have separate train and test data, so I didn't use PyCaret's train/test split function. I built the dataset as a pandas DataFrame as shown below.

import os
import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import PandasTools
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem import RDConfig
## following part is not so interesting ...... Routine task ;)
traindir = os.path.join(RDConfig.RDDocsDir,'Book/data/solubility.train.sdf')
testdir = os.path.join(RDConfig.RDDocsDir,'Book/data/solubility.test.sdf')
train_mols = [m for m in Chem.SDMolSupplier(traindir)]
test_mols = [m for m in Chem.SDMolSupplier(testdir)]
def mol2fp(mol, radi=2, nBits=1024):
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=radi, nBits=nBits)
    arr = np.zeros((0,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr
prop_dict = {
            '(A) low':0,
            '(B) medium':1,
            '(C) high':2
            }

## Make data as DataFrame
nBits = 1024  # fingerprint length; must match the nBits passed to mol2fp
columns = [f'fp_{idx}' for idx in range(nBits)] + ['target']
train_x = np.array([mol2fp(m, nBits=nBits) for m in train_mols])
test_x = np.array([mol2fp(m, nBits=nBits) for m in test_mols])
train_target = np.array([prop_dict[m.GetProp('SOL_classification')] for m in train_mols]).reshape(-1,1)
test_target = np.array([prop_dict[m.GetProp('SOL_classification')] for m in test_mols]).reshape(-1,1)
train_data = np.concatenate([train_x, train_target], axis=1)
test_data = np.concatenate([test_x, test_target], axis=1)
train_df = pd.DataFrame(train_data, columns=columns)
test_df = pd.DataFrame(test_data, columns=columns)

Now that data preparation is finished, I call setup to start the classification task. The function initializes the training environment and shows a summary of the dataset.

Then, getting the best model with default parameters is really simple: just call the compare_models function. That's all! After a few minutes, PyCaret returns a summary of useful metrics as a DataFrame.
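The notebook cell with these two calls is not reproduced in the text, so here is a minimal sketch of what they might look like, assuming the PyCaret 2.x classification API and the train_df built earlier. The target='target' and session_id=123 arguments mirror the setup call quoted in a reader comment on this post; the wrapper run_experiment is a hypothetical name used only for illustration.

```python
def run_experiment(train_df, target="target", seed=123):
    """Set up a PyCaret classification experiment and return the top-ranked model.

    Sketch only: assumes the PyCaret 2.x API. `run_experiment` is a
    hypothetical helper, not part of PyCaret.
    """
    # Imported here so that PyCaret is only required when the function is called.
    from pycaret.classification import setup, compare_models

    # setup() infers column types, holds out part of the data and prepares
    # the preprocessing pipeline; session_id fixes the random seed.
    setup(data=train_df, target=target, session_id=seed)

    # compare_models() cross-validates the available classifiers with their
    # default hyperparameters, shows the ranking table and (in 2.x) returns
    # the best-scoring model.
    return compare_models()
```

With the DataFrame from above this would be called as best_model = run_experiment(train_df). Fixing the seed is what makes the ranking reproducible from run to run.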


The experiment found that CatBoost was the best model. It is also easy to build individual models. Here I build a random forest model and tune it.

from pycaret.classification import create_model, tune_model
rfmodel = create_model('rf')
tuned_rf = tune_model(rfmodel)
# That's all!!

After building a model, we usually check its performance by making plots with matplotlib, seaborn, and so on. With PyCaret there is no need to write code with these packages, because it can generate useful visualizations directly.

Alternatively, calling the evaluate_model function gives an interactive plot selector.
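The plotting cells are not reproduced in the text either, so the following is only a rough sketch of how such calls might look, assuming the PyCaret 2.x classification API and the tuned_rf model from above. The plot names 'confusion_matrix', 'auc' and 'feature' are my choice of examples, and show_diagnostics and explore_interactively are hypothetical helper names.

```python
def show_diagnostics(model, plots=("confusion_matrix", "auc", "feature")):
    """Draw a few diagnostic plots for a fitted PyCaret classifier (sketch)."""
    # Imported lazily so PyCaret is only needed when the function runs.
    from pycaret.classification import plot_model

    for name in plots:
        # Each call renders one plot for the given model.
        plot_model(model, plot=name)


def explore_interactively(model):
    """Open PyCaret's interactive plot selector (intended for notebooks)."""
    from pycaret.classification import evaluate_model

    evaluate_model(model)
```

In a notebook, show_diagnostics(tuned_rf) would draw the three plots one after another, while explore_interactively(tuned_rf) lets you switch between plot types by clicking.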

OK, I have built the RF model, so as the next step I would like to finalize the model and use it on unseen data (the test data).

## finalize model and check the performance
from pycaret.classification import finalize_model, predict_model
final_rf = finalize_model(tuned_rf)
Model	Accuracy	AUC	Recall	Prec.	F1	Kappa	MCC
0	Random Forest Classifier	0.9188	0	0.9003	0.9219	0.9186	0.8723	0.874

## predict test data with the model
test_pred = predict_model(final_rf, data=test_df)
from pycaret.utils import check_metric
check_metric(test_pred['target'], test_pred['Label'], metric='Accuracy')
> 0.708
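The accuracy reported by check_metric is simply the fraction of rows where the predicted label equals the true one. As a small sketch of that arithmetic in plain Python (accuracy and find_label_column are hypothetical helpers, not PyCaret functions): the prediction column is called 'Label' in the version used here, and as far as I know newer PyCaret releases name it 'prediction_label', so treat the second name as an assumption.

```python
def find_label_column(columns, candidates=("Label", "prediction_label")):
    """Return the first candidate prediction column present in `columns`."""
    for name in candidates:
        if name in columns:
            return name
    raise KeyError(f"none of {candidates} found in the prediction columns")


def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    y_true, y_pred = list(y_true), list(y_pred)
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of empty inputs")
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)
```

With the DataFrame returned by predict_model, this would be used as accuracy(test_pred['target'], test_pred[find_label_column(test_pred.columns)]).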

In summary, I think PyCaret can improve the productivity of chemoinformatics/machine learning tasks, not only in model building but also in visualization.

PyCaret supports major ML packages such as scikit-learn, LightGBM, XGBoost, and CatBoost. This is my first time using PyCaret, but it seems really useful, so I'll keep playing with the package.

I uploaded today's code to my gist. It is almost the same as the original example except for the dataset.



Published by iwatobipen

I'm a medicinal chemist at a mid-size pharmaceutical company. I love chemoinfo, coding, organic synthesis, and my family.

4 thoughts on “Useful ML tool for chemoinformatics #chemoinformatics #RDKit #Machine learning”

  1. Wow! This is the first time I have heard about PyCaret, and it looks so useful and convenient. Thank you very much for sharing this information.

  2. hi, iwatobipen,

    I appreciate your sharing this. I'm a beginner in cheminfo.
    One question: where can I find the file “Book/data/solubility.train.sdf” that you mentioned in the blog?
    Many thanks,



  3. hi, iwatobipen

    I ran into a problem when I ran your code on Colab.
    The error happens at the line: exp_sol = setup(data=train_df, target='target', session_id=123)

    --> 153 type(estimator)))
    155 # Estimator interface

    TypeError: Last step of Pipeline should implement fit. 'passthrough' (type ) doesn't

    Could you please provide suggestions to fix that? Many thanks,

