A useful Python package for QSAR-related tasks for chemoinformaticians #chemoinformatics #oloren-ai #RDKit

When I posted my memo about open science, @OlorenAI introduced me to a Python package named Oloren ChemEngine (OCE). I often use chemprop or an internally built system for QSAR tasks. ChemProp is one of my favorite packages because it is easy to use and includes a web application framework for users. I had never used OCE, so I tried it. Thanks @Oloren for developing and sharing the code.

First, I installed OCE. My CUDA version is a little old, so I needed to install PyTorch Geometric separately after running the original install script.

(base) $ conda create -c conda-forge -n oce python=3.8
(base) $ conda activate oce
(oce) $ bash <(curl -s https://raw.githubusercontent.com/Oloren-AI/olorenchemengine/master/install.sh)
(oce) $ conda install -c conda-forge jupyter mamba
# The following command depends on your environment. I use PyTorch 1.12 and CUDA 10.2
(oce) $ pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-1.12.0+cu102.html

After running the code above, I could use OCE.
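
The wheel URL passed to `-f` in the last command changes with the PyTorch and CUDA versions. Here is a small sketch of how that URL is put together, assuming the `torch-<version>+<tag>.html` naming seen in the command above. The helper name `pyg_wheel_index` is my own and is not part of OCE or PyG.

```python
def pyg_wheel_index(torch_version: str, cuda_version: str = "") -> str:
    """Build the PyG wheel index URL for a torch / CUDA combination.

    cuda_version is a string such as "10.2"; leave it empty for a CPU-only build.
    (Hypothetical helper for illustration only.)
    """
    # "10.2" -> "cu102"; CPU-only builds use the tag "cpu".
    tag = ("cu" + cuda_version.replace(".", "")) if cuda_version else "cpu"
    return f"https://data.pyg.org/whl/torch-{torch_version}+{tag}.html"


# Reproduces the URL used in the pip command above.
print(pyg_wheel_index("1.12.0", "10.2"))
```

If PyTorch is already installed, `torch.__version__` (the part before any `+`) and `torch.version.cuda` report the values to pass in.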

OK, let’s run the code. The following code is almost the same as in the original documentation. The main difference from other QSAR libraries is that OCE doesn’t require you to vectorize (featurize) molecules yourself; you just specify which representation the model should use.

This example builds a boosting ensemble model from two RandomForestModel objects, each with a different feature set. After creating the model object, the user only needs to pass the list of SMILES and the target values for training!!

OCE wraps the complicated molecular featurization process, so the user doesn’t need to do it. I think that’s cool, isn’t it? After creating the model object, the training and prediction steps are the same as in scikit-learn.

Users can build a complex model with a few lines of code ;)
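
To make the "featurization is wrapped" idea concrete, here is a toy sketch of the pattern: a featurizer and a learner are bundled into one object that takes raw SMILES in `fit` and `predict`. This is not OCE code. `toy_featurizer`, `NearestNeighbourRegressor` and `SmilesModel` are hypothetical names, the character-count featurizer is only a stand-in for a real descriptor, and the target values are made-up numbers.

```python
from collections import Counter


def toy_featurizer(smiles):
    """Toy stand-in for a real descriptor: counts of a few SMILES characters."""
    counts = Counter(smiles)
    return [counts[ch] for ch in "CNOc=()"]


class NearestNeighbourRegressor:
    """Tiny learner with a scikit-learn-like fit/predict interface."""

    def fit(self, X, y):
        self.X_, self.y_ = list(X), list(y)
        return self

    def predict(self, X):
        def sq_dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        # Return the target of the closest training vector for each query.
        return [
            self.y_[min(range(len(self.X_)), key=lambda i: sq_dist(x, self.X_[i]))]
            for x in X
        ]


class SmilesModel:
    """Bundles a featurizer with a learner so callers only pass SMILES."""

    def __init__(self, featurizer, learner):
        self.featurizer = featurizer
        self.learner = learner

    def fit(self, smiles, y):
        self.learner.fit([self.featurizer(s) for s in smiles], y)
        return self

    def predict(self, smiles):
        return self.learner.predict([self.featurizer(s) for s in smiles])


# Made-up training data, only to show the call pattern.
model = SmilesModel(toy_featurizer, NearestNeighbourRegressor())
model.fit(["CCO", "c1ccccc1", "CC(=O)O"], [1.0, 2.0, 3.0])
print(model.predict(["CCO", "c1ccccc1O"]))  # → [1.0, 2.0]
```

The caller never touches a feature vector, which is the convenience described above.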

import olorenchemengine as oce
df = oce.ExampleDataFrame()

model = oce.BaseBoosting([
            oce.RandomForestModel(oce.DescriptastorusDescriptor("rdkit2dnormalized"), n_estimators=1000),
            oce.RandomForestModel(oce.OlorenCheckpoint("default"), n_estimators=1000)])

model.fit(df["Smiles"], df["pChEMBL Value"])
oce.save(model, "model.oce")
model2 = oce.load("model.oce")
y_pred = model2.predict(["CC(=O)OC1=CC=CC=C1C(=O)O"])
y_pred
>>array([0.15748118])
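
For readers who wonder what a boosting ensemble does, here is a minimal sketch of the general idea, assuming "boosting" means that each learner is fitted to the residuals left by the learners before it and that the predictions are summed. OCE's BaseBoosting internals may differ. All class names below are hypothetical, and plain numbers replace SMILES to keep it short.

```python
class MeanModel:
    """Predicts the mean of the training targets."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class SlopeModel:
    """Least-squares line through the origin on a single numeric feature."""

    def fit(self, X, y):
        denom = sum(x * x for x in X)
        self.slope_ = sum(x * t for x, t in zip(X, y)) / denom if denom else 0.0
        return self

    def predict(self, X):
        return [self.slope_ * x for x in X]


class ResidualBoosting:
    """Fits learners one after another, each on the remaining residuals."""

    def __init__(self, learners):
        self.learners = learners

    def fit(self, X, y):
        residual = list(y)
        for learner in self.learners:
            learner.fit(X, residual)
            pred = learner.predict(X)
            # What is still unexplained becomes the next learner's target.
            residual = [r - p for r, p in zip(residual, pred)]
        return self

    def predict(self, X):
        # The ensemble prediction is the sum of all stage predictions.
        stage_preds = [learner.predict(X) for learner in self.learners]
        return [sum(col) for col in zip(*stage_preds)]


# Toy data following y = 5 + 2x.
booster = ResidualBoosting([MeanModel(), SlopeModel()])
booster.fit([-1.0, 0.0, 1.0], [3.0, 5.0, 7.0])
print(booster.predict([2.0]))  # → [9.0]
```

The first stage captures the mean (5.0) and the second stage captures the slope (2.0) from what is left over.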

OCE provides not only model building methods but also visualization methods. Let’s see them.

from olorenchemengine.visualizations import *
from olorenchemengine import *
import olorenchemengine as oce
df = oce.ExampleDataset().data
test_dataset = oce.BaseDataset(name='purple', data=df.to_csv(), structure_col='Smiles', property_col='pChEMBL Value')

model = oce.BaseBoosting([
            oce.RandomForestModel(oce.DescriptastorusDescriptor("rdkit2dnormalized"),n_estimators=1000),
            oce.BaseTorchGeometricModel(oce.TLFromCheckpoint("default"), preinitialized=True),
            oce.RandomForestModel(oce.OlorenCheckpoint("default"),n_estimators=1000)])


## Training the model
model.fit(df["Smiles"], df["pChEMBL Value"])
vis = oce.VisualizeModelSim(model=model, dataset=test_dataset)
vis.render_ipynb()

After creating the vis object and calling render_ipynb(), I got the interactive scatter plot shown below. When I mouse over a point on the plot, the compound structure pops up. This example is a 2-class classification model, but you can make the same plot for a regression task.

OCE offers not only traditional predictive models but also a chemprop model (D-MPNN!). That sounds nice, so let’s use it. It’s a really user-friendly API ;)

import io
import sys
import zipfile

import pandas as pd
import requests
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

import olorenchemengine as oce
data_dir = "./data"
data_url = "http://snap.stanford.edu/ogb/data/graphproppred/csv_mol_download/hiv.zip"
r = requests.get(data_url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall(data_dir)
df = pd.read_csv(f"{data_dir}/hiv/mapping/mol.csv.gz")

X_train, X_test, y_train, y_test = train_test_split(df["smiles"], df["HIV_active"], test_size=0.2, random_state=42)

## Building and training the chemprop model takes only two lines!!!
model = oce.ChemPropModel()
model.fit(X_train, y_train)

'''
print(X_train)
29361                     O=C(O)CC(NC(=O)OCc1ccccc1)C(=O)O
10448                 O=[N+]([O-])c1ccc(Nc2ccccc2)c2nonc12
31039      CCOC(=O)C(=NNc1ccc(C)cc1)N1C(=S)N(C)N=C(C)C=C1S
1311                       N#CSC1=C(SC#N)C(=O)c2ccccc2C1=O
27834    COc1cc(C2C3=C(COC3=O)OC(C)(C)Oc3cc4c(cc32)OCO4...
                               ...                        
6265                            Cc1ccc2nsnc2c1[N+](=O)[O-]
11284    CC1=CC(=C(c2cc(C)c(O)c(C(=O)O)c2)c2c(Cl)ccc(S(...
38158                            Oc1ncnc2c1sc1nc3ccccc3n12
860                 CCN(CCO)CCNc1ccc(C)c2sc3ccccc3c(=O)c12
15795    COc1cccc(NC(=O)CC(=O)N2N=C(N(CCC#N)c3ccc(Cl)cc...
Name: smiles, Length: 32901, dtype: object

model.model
Sequential(
  (0): DMPNNEncoder(
    (act_func): ReLU()
    (W1): Linear(in_features=165, out_features=300, bias=False)
    (W2): Linear(in_features=300, out_features=300, bias=False)
    (W3): Linear(in_features=451, out_features=300, bias=True)
  )
  (1): Sequential(
    (0): Dropout(p=0.0, inplace=False)
    (1): Linear(in_features=300, out_features=300, bias=True)
    (2): ReLU()
    (3): Dropout(p=0.0, inplace=False)
    (4): Linear(in_features=300, out_features=1, bias=True)
  )
)
'''
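
The snippet above imports roc_auc_score but stops after training. As a sketch of the evaluation step, here is what ROC AUC measures, written in plain Python: the fraction of (active, inactive) pairs in which the active compound gets the higher score. The function name `roc_auc` and the four toy labels and scores are mine. With scikit-learn, `roc_auc_score(y_true, y_score)` computes the same quantity. Whether the ChemPropModel `predict` output can be used directly as the score is something I would check in the OCE documentation first.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and scores, only to show the calculation.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # → 0.75
```

Three of the four positive/negative pairs are ranked correctly, hence 0.75.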

Additionally, OCE can visualize chemical space. Let’s plot the chemical space with t-SNE.

import olorenchemengine as oce
dataset = oce.BACEDataset() + oce.ScaffoldSplit()
vis = oce.ChemicalSpacePlot(dataset, oce.DescriptastorusDescriptor('morgan3counts'), opacity = 0.4, dim_reduction = "tsne")
vis.render_ipynb()

Next, make a plot with data from the trained model. The marker size represents the magnitude of the residuals, which is useful for evaluating model performance.

model = oce.BaseBoosting([oce.RandomForestModel(oce.DescriptastorusDescriptor("morgan3counts")),
                         oce.RandomForestModel(oce.OlorenCheckpoint("default"))])
model.fit(*dataset.train_dataset)
vis = oce.VisualizeDatasetSplit(dataset, oce.DescriptastorusDescriptor("morgan3counts"), 
                                model = model, opacity = 0.4)
vis.render_ipynb()
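
To illustrate what "marker size means residual magnitude" amounts to, here is a small sketch that maps absolute residuals onto marker sizes. It only shows the idea. I don't know how OCE scales its markers, and `marker_sizes`, `min_size` and `max_size` are hypothetical names with made-up values.

```python
def marker_sizes(y_true, y_pred, min_size=4.0, max_size=20.0):
    """Map absolute residuals |y_true - y_pred| linearly onto marker sizes."""
    residuals = [abs(t - p) for t, p in zip(y_true, y_pred)]
    largest = max(residuals, default=0.0)
    if largest == 0.0:
        # Perfect predictions (or no data): every marker gets the minimum size.
        return [min_size for _ in residuals]
    return [min_size + (max_size - min_size) * r / largest for r in residuals]


# Toy values: residuals are 0, 1 and 2.
print(marker_sizes([6.0, 7.0, 8.0], [6.0, 6.0, 10.0]))  # → [4.0, 12.0, 20.0]
```

Big markers then point straight at the compounds the model predicts worst.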

These are just a few examples from the original documentation. If you are interested in the package, please install it and use it in your real tasks!

Nice documentation is provided here: https://docs.oloren.ai/api/modules.html

Thanks!!!

Published by iwatobipen

I'm a medicinal chemist at a mid-size pharmaceutical company. I love chemoinformatics, coding, organic synthesis, and my family.