When I posted my memo about open science, @OlorenAI introduced a Python package named Oloren ChemEngine (OCE). I often use ChemProp or an internally built system for QSAR tasks. ChemProp is one of my favorite packages because it is easy to use and it includes a web application framework for users. I had never used OCE before, so I tried the package. Thanks @Oloren for developing and sharing the code.
First, I installed OCE. My CUDA version is a little bit old, so I needed to install PyTorch Geometric separately after running the original install script.
(base) $ conda create -c conda-forge -n oce python=3.8
(base) $ conda activate oce
(oce) $ bash <(curl -s https://raw.githubusercontent.com/Oloren-AI/olorenchemengine/master/install.sh)
(oce) $ conda install -c conda-forge jupyter mamba
# the following command depends on your environment; I use PyTorch 1.12 and CUDA 10.2
(oce) $ pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-1.12.0+cu102.html
After running the code above, I could use OCE.
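Just to be safe, I checked that both OCE and PyTorch Geometric import without errors. This is a minimal sanity check of my own, not part of the official install instructions.
(oce) $ python -c "import olorenchemengine as oce; import torch_geometric; print('imports OK')"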
OK, let’s run the code. The following code is almost the same as in the original documentation. The main difference compared to other QSAR libraries is that OCE doesn’t require you to vectorize (featurize) molecules yourself; you just define which featurization method you would like to use in the model.
This example builds a boosting ensemble model from two RandomForestModels with two different feature sets. After making the model object, the user only needs to pass a list of SMILES and the target values for training!!
OCE wraps the complicated molecular featurization process, so the user doesn’t need to do that. Cool, isn’t it? After making the model object, the training and prediction steps are the same as in scikit-learn.
You can build a complex model with a few lines of code ;)
import olorenchemengine as oce
df = oce.ExampleDataFrame()
model = oce.BaseBoosting([
    oce.RandomForestModel(oce.DescriptastorusDescriptor("rdkit2dnormalized"), n_estimators=1000),
    oce.RandomForestModel(oce.OlorenCheckpoint("default"), n_estimators=1000)
])
model.fit(df["Smiles"], df["pChEMBL Value"])
oce.save(model, "model.oce")
model2 = oce.load("model.oce")
y_pred = model2.predict(["CC(=O)OC1=CC=CC=C1C(=O)O"])
y_pred
>>array([0.15748118])
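Since the interface follows scikit-learn conventions, the usual scikit-learn utilities can be used for evaluation as well. Below is a minimal sketch of my own (not from the OCE docs) that holds out part of the example dataframe and scores the predictions with r2_score; the column names come from the example above.
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import olorenchemengine as oce

df = oce.ExampleDataFrame()
# hold out 20% of the example data for a quick sanity check
X_train, X_test, y_train, y_test = train_test_split(
    df["Smiles"], df["pChEMBL Value"], test_size=0.2, random_state=0)
model = oce.RandomForestModel(oce.DescriptastorusDescriptor("rdkit2dnormalized"), n_estimators=1000)
model.fit(X_train, y_train)
# predict() takes a list of SMILES and returns a numpy array of predictions
y_pred = model.predict(list(X_test))
print(r2_score(y_test, y_pred))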
OCE provides not only model building methods but also visualization methods. Let’s see them.
from olorenchemengine.visualizations import *
from olorenchemengine import *
import olorenchemengine as oce
df = oce.ExampleDataset().data
test_dataset = oce.BaseDataset(name='purple', data=df.to_csv(), structure_col='Smiles', property_col='pChEMBL Value')
model = oce.BaseBoosting([
    oce.RandomForestModel(oce.DescriptastorusDescriptor("rdkit2dnormalized"), n_estimators=1000),
    oce.BaseTorchGeometricModel(oce.TLFromCheckpoint("default"), preinitialized=True),
    oce.RandomForestModel(oce.OlorenCheckpoint("default"), n_estimators=1000)
])
## Training the model
model.fit(df["Smiles"], df["pChEMBL Value"])
vis = oce.VisualizeModelSim(model=model, dataset=test_dataset)
vis.render_ipynb()
After making the vis object and calling render_ipynb(), I got the interactive scatter plot shown below. When I mouse over a point on the plot, the compound structure is displayed. This example is a two-class classification model, but you can make the same plot for a regression task.

OCE offers not only traditional predictive models but also a ChemProp model (D-MPNN!). It sounds nice, so let’s use it. It’s a really user-friendly API ;)
import io
import sys
import zipfile
import pandas as pd
import requests
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
import olorenchemengine as oce
data_dir = "./data"
data_url = "http://snap.stanford.edu/ogb/data/graphproppred/csv_mol_download/hiv.zip"
r = requests.get(data_url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall(data_dir)
df = pd.read_csv(f"{data_dir}/hiv/mapping/mol.csv.gz")
X_train, X_test, y_train, y_test = train_test_split(df["smiles"], df["HIV_active"], test_size=0.2, random_state=42)
## building and training a chemprop model takes only two lines!!!
model = oce.ChemPropModel()
model.fit(X_train, y_train)
'''
print(X_train)
29361 O=C(O)CC(NC(=O)OCc1ccccc1)C(=O)O
10448 O=[N+]([O-])c1ccc(Nc2ccccc2)c2nonc12
31039 CCOC(=O)C(=NNc1ccc(C)cc1)N1C(=S)N(C)N=C(C)C=C1S
1311 N#CSC1=C(SC#N)C(=O)c2ccccc2C1=O
27834 COc1cc(C2C3=C(COC3=O)OC(C)(C)Oc3cc4c(cc32)OCO4...
...
6265 Cc1ccc2nsnc2c1[N+](=O)[O-]
11284 CC1=CC(=C(c2cc(C)c(O)c(C(=O)O)c2)c2c(Cl)ccc(S(...
38158 Oc1ncnc2c1sc1nc3ccccc3n12
860 CCN(CCO)CCNc1ccc(C)c2sc3ccccc3c(=O)c12
15795 COc1cccc(NC(=O)CC(=O)N2N=C(N(CCC#N)c3ccc(Cl)cc...
Name: smiles, Length: 32901, dtype: object
model.model
Sequential(
(0): DMPNNEncoder(
(act_func): ReLU()
(W1): Linear(in_features=165, out_features=300, bias=False)
(W2): Linear(in_features=300, out_features=300, bias=False)
(W3): Linear(in_features=451, out_features=300, bias=True)
)
(1): Sequential(
(0): Dropout(p=0.0, inplace=False)
(1): Linear(in_features=300, out_features=300, bias=True)
(2): ReLU()
(3): Dropout(p=0.0, inplace=False)
(4): Linear(in_features=300, out_features=1, bias=True)
)
)
'''
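The scikit-learn metrics imported above are not used in the snippet, so here is a short sketch of how I would score the trained ChemProp model on the held-out split. I’m assuming predict() returns a score/probability per SMILES for the binary HIV_active label; the 0.5 threshold for accuracy is my own choice.
# score the held-out split; assumes predict() returns a probability/score
# for the positive HIV_active class for each SMILES
y_pred = model.predict(list(X_test))
print("ROC-AUC:", roc_auc_score(y_test, y_pred))
# 0.5 is an arbitrary threshold for turning scores into hard labels
print("Accuracy:", accuracy_score(y_test, y_pred > 0.5))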
Additionally, OCE can visualize chemical space. Let’s plot the chemical space with t-SNE.
import olorenchemengine as oce
dataset = oce.BACEDataset() + oce.ScaffoldSplit()
vis = oce.ChemicalSpacePlot(dataset, oce.DescriptastorusDescriptor('morgan3counts'), opacity = 0.4, dim_reduction = "tsne")
vis.render_ipynb()

Next, let’s make the plot with trained model data. The size of each marker indicates the magnitude of the residual, which is useful for evaluating model performance.
model = oce.BaseBoosting([
    oce.RandomForestModel(oce.DescriptastorusDescriptor("morgan3counts")),
    oce.RandomForestModel(oce.OlorenCheckpoint("default"))
])
model.fit(*dataset.train_dataset)
vis = oce.VisualizeDatasetSplit(dataset, oce.DescriptastorusDescriptor("morgan3counts"),
                                model=model, opacity=0.4)
vis.render_ipynb()
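To relate the marker size to actual numbers, the residuals can also be computed by hand. This is my own sketch, and it assumes dataset.test_dataset unpacks to (SMILES, target values) the same way dataset.train_dataset does in the fit call above.
import numpy as np

# assumption: dataset.test_dataset mirrors dataset.train_dataset, i.e. (X, y)
X_test, y_test = dataset.test_dataset
y_pred = model.predict(list(X_test))
# absolute residuals: the quantity the marker size reflects in the plot above
residuals = np.abs(np.asarray(y_test) - np.asarray(y_pred))
print(residuals[:5])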

These code snippets are just a few examples from the original documentation. If you are interested in the package, please install it and use it in your real tasks!
Nice documentation is provided here: https://docs.oloren.ai/api/modules.html
Thanks!!!