Yesterday, I moved my main PC from Ubuntu 18.04 to 20.04 LTS. It works well now, and I'm building a new (clean) environment for my coding.

Today I would like to share a useful package for machine learning named PyCaret. A brief introduction of PyCaret is below.
—from original site—
PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows. It is an end-to-end machine learning and model management tool that speeds up the experiment cycle exponentially and makes you more productive.
In comparison with the other open-source machine learning libraries, PyCaret is an alternate low-code library that can be used to replace hundreds of lines of code with few words only.
——————————
It means that users don't need to write long code for model building and can focus on how to apply and utilize their models.
I read the tutorial section of the PyCaret documentation and tried to apply it to a chemoinformatics problem.
The following code is an example of a simple classification task. The data is borrowed from the RDKit solubility dataset. Let's start coding.
I have train and test data separately, so I didn't use PyCaret's train/test split function. I made the dataset as a pandas DataFrame as shown below.
import os
import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import PandasTools
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem import RDConfig
## following part is not so interesting ...... Routine task ;)
traindir = os.path.join(RDConfig.RDDocsDir,'Book/data/solubility.train.sdf')
testdir = os.path.join(RDConfig.RDDocsDir,'Book/data/solubility.test.sdf')
train_mols = [m for m in Chem.SDMolSupplier(traindir)]
test_mols = [m for m in Chem.SDMolSupplier(testdir)]
def mol2fp(mol, radi=2, nBits=1024):
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=radi, nBits=nBits)
    arr = np.zeros((0,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr
prop_dict = {
    '(A) low': 0,
    '(B) medium': 1,
    '(C) high': 2
}
## Make data as DataFrame
nBits=1024
columns = [f'fp_{idx}' for idx in range(nBits)] + ['target']
train_x = np.array([mol2fp(m, nBits=nBits) for m in train_mols])
test_x = np.array([mol2fp(m, nBits=nBits) for m in test_mols])
train_target = np.array([prop_dict[m.GetProp('SOL_classification')] for m in train_mols]).reshape(-1,1)
test_target = np.array([prop_dict[m.GetProp('SOL_classification')] for m in test_mols]).reshape(-1,1)
train_data = np.concatenate([train_x, train_target], axis=1)
test_data = np.concatenate([test_x, test_target], axis=1)
train_df = pd.DataFrame(train_data, columns=columns)
test_df = pd.DataFrame(test_data, columns=columns)
Now that data preparation is finished, I called setup to start the classification task. The function initializes the training environment and shows a summary of the dataset.
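The call itself is a one-liner. Here is a minimal sketch assuming the functional API of pycaret.classification (it matches the line quoted in the comments below; session_id is just a seed for reproducibility):
from pycaret.classification import *
# initialize the experiment; PyCaret infers column types and prints a setup summary
exp_sol = setup(data=train_df, target='target', session_id=123)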
Then, getting the best model with default parameters is really simple. Just call the compare_models function. That's all! After waiting a few minutes, PyCaret returns summary data with useful metrics as a dataframe, shown below.
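For example (assigning the return value is just to keep the winning estimator; the metrics table is displayed automatically):
# cross-validate all available classifiers with default parameters and return the best one
best_model = compare_models()
print(best_model)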

The experiment found that CatBoost is the best model. It is also easy to build individual models. I built a random forest model and optimized it.
rfmodel = create_model('rf')
tuned_rf = tune_model(rfmodel)
# That's all!!


After building the models, we usually need to check performance by making plots with matplotlib, seaborn, etc. But PyCaret doesn't require extra code with these packages. I show some examples of useful visualizations below.
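For example, plot_model draws common diagnostics directly from a trained estimator. The plot names below are from the PyCaret docs and may not be exactly the plots in my screenshots:
# confusion matrix of the tuned random forest
plot_model(tuned_rf, plot='confusion_matrix')
# feature (fingerprint bit) importance
plot_model(tuned_rf, plot='feature')
# class prediction error
plot_model(tuned_rf, plot='error')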




Or, by calling the evaluate_model function, I can get an interactive plot generator.
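evaluate_model wraps plot_model in an interactive widget, so all plots are available from a single call:
# interactive plot selector (works in Jupyter notebooks)
evaluate_model(tuned_rf)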


OK, I could build the RF model, so as the next step I would like to finalize the model and use it for unseen data (the test data).
## finalize model and check the performance
final_rf = finalize_model(tuned_rf)
predict_model(final_rf)
>
Model Accuracy AUC Recall Prec. F1 Kappa MCC
0 Random Forest Classifier 0.9188 0 0.9003 0.9219 0.9186 0.8723 0.874
## predict test data with the model
test_pred = predict_model(final_rf, data=test_df)
from pycaret.utils import check_metric
check_metric(test_pred['target'], test_pred['Label'], metric='Accuracy')
> 0.708
In summary, I think PyCaret can improve the productivity of chemoinformatics/machine learning tasks, not only in the model-building part but also in the visualization part.
PyCaret supports major ML packages such as scikit-learn, LightGBM, XGBoost, and CatBoost. This is the first time I've used PyCaret, but it seems really useful, so I'll play with the package more and more.
I uploaded today's code to my gist. It is almost the same as the original example except for the dataset.
wow! This is the first time I hear about PyCaret and it looks so useful and convenient. Thank you very much for sharing this information.
hi, iwatobipen,
appreciate your sharing that. I'm a beginner in cheminfo.
One question: where can I find the file "Book/data/solubility.train.sdf" that you mentioned in the blog?
many thanks,
best,
Sh-Y
Hi, if you have already installed rdkit in your env, you can get the sdf path with the following code.
os.path.join(RDConfig.RDDocsDir, 'Book/data/solubility.train.sdf')
RDConfig.RDDocsDir depends on where you installed rdkit.
And if you would like to get the raw data from the web, please access the rdkit GitHub repo.
https://github.com/rdkit/rdkit/tree/master/Docs/Book/data
Thanks
hi, iwatobipen
I come up with a problem when I run your code on Colab.
the error happens at the line: exp_sol = setup(data=train_df, target='target', session_id=123)
--> 153 type(estimator)))
154
155 # Estimator interface
TypeError: Last step of Pipeline should implement fit. 'passthrough' (type ) doesn't
could you please provide suggestions to fix that? many thanks,
Sh-Y