It will take 3 years after changing job. Fortunately, I’m still working at early drug discovery field as chemoinformatician ;)
As many drug hunter know that bioisosteric replacement is one of the major way for designing novel molecules from known biologically active molecules.
Transformations from carboxylic acid to tetorazole, from amide to oxazole are major examples of bioisosteric replatecement.
Are these pair, “carboxylic & tetorazole” and “amide & oxazole” simlar ? And are there any metrics to compre these similarity instead of biological activities?
Espsim is one of the interesting OSS package for compare molecules with electro static potentials.
I tried to compare following tetrazole and carboxylic acid pair and amide and oxazole pair.
To clarify the similarity, I used not only neutoral state of carboxylic acid but also ionization state.
Let’s compare carboxylic acid and tetrazole with Morgan FP and ESPsim. As you can see that tetrazole and carboxylic acid shows low 2d finger print similarity but high similarity with ESPsim and Shapesim.
Next, let’s compare carboxylate and carboxylic acid metyl ester.
Interestingly ESPSIM between calboxylate and methyl ester is very low compared to their shape similarity.
How about similarity between calboxylic acid and methyl ester? Ah! there molecule shows high similarity not only shape but also electrostatic potential.
Next, let’s see amide and oxazole.
The result is as same as calboxylate case, the pair shows low 2D similarty but high shape and esp similarity.
In this experiments, ESPsim is interesting tool for predicting bioisosteric fragments. I think it worth to use espsim as a tool for drug design.
This is the last chemoinformatics blogpost in this year. Thanks for visiting and reading my blog. I hope you and your family have a great rest of the year and have a happy new year ;)
I hope all reader enjoying chemoinformatics ;) Unfortunately I’m struggling to make PPTX file in this year. So I can’t have enought time to coding. But It’s same for medicinal chemists. Because they often make presentation slide for project team, theier boss, senior etc….
They transfer SAR data from spread sheet to pptx. Sometime it seems time consuming task for me.
So I would like to write code for making automatic pptx generator with python ;)
As you know there are useful prior art in the topic. The URL is listed below.
The first one uses RDKit and the second one uses OETK.
In this post, I used RDKit and most of code is bollowed from the OE’s example code.
Here is a my example code. At first, import libraries which are required to make pptx and handling chemical structure.
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import rdDepictor
from rdkit.Chem.Draw import rdMolDraw2D
from io import BytesIO, StringIO
from pptx import Presentation
from pptx.util import Inches
# read example molecules
mols = [m for m in Chem.SDMolSupplier('./cdk2.sdf')]
for m in mols:
AllChem.Compute2DCoords(m)
Then define function to draw molecule and make pptx. I used BytesIO because by using the BytesIO I don’t need to make tempolary image file.
As many RDKitter know that rdSubstructLibrary is one of the cool tool for conductiong substructure search. Greg Landrum introduced how to use it in his great blog post.
I love the method because it works very fast for substructure searching. So I would like to make CLI tool for making substructure library database.
To do it, I used click which is useful package for making CLI tool.
After installing the package, three commands will be available. 1. make_rdssslib command makes sslib from sdf.gz 2. update_rdssslib which updates sslib with new sdf.gz 3. run_rdsss which run SSS with given smarts query.
The example is shown below.
# make ssslib from sdf.gz
$ make_rdssslib cdk2.sdf.gz cdk2.sslib.pkl
# search with ssslib from CLI
$ run_rdsss 'c1ccccc1' cdk2.sslib.pkl
After running the run_rdsss, hits.csv file will be generated.
All process can do from CLI with the code. But to handle learge sslib. I think user should run sss on interprinter. Because IO of SSLIB will take bottle neck of the code.
This code is stil underl development. Any advice or suggestion will be greatly appreciated.
I introduced MMPDB v3 few days ago. The package is CLI tool. So user can do MMPA by typing command line. However the tool doesn’t provide API interface.
I think, if we can use MMPDB on jupyter-notebook it’ll be more useful because output data can be used directly in coding process.
MMPDB Cli interface is implemented with Click which is python command line option parser. And click has CliRunner for testing. I found that CliRunner is nice solution to do that.
Following example is my first trial to call mmpdb on jupyter notebook. And most of SQL part is bollowed from RDKit-Blog post.
At first, connect postgresql chembl30 and retrive data.
import psycopg2
from rdkit.Chem.Draw import IPythonConsole
from rdkit import rdBase
from rdkit import Chem
from rdkit.Chem import PandasTools
import requests
import pandas as pd
from mmpdblib import cli
from click.testing import CliRunner
from IPython.display import HTML
%load_ext sql
print(rdBase.rdkitVersion)
data = %sql select canonical_smiles,molregno,activity_id,standard_value,standard_units from activities \
join assays using (assay_id) \
join compound_structures using (molregno) \
where tid=165 and standard_type='Ki' and standard_value is not null and standard_relation='=' \
and canonical_smiles not like '%.%';
df = data.DataFrame()
Next, calculate pKi and rename column for next data processing. And save compound data and property data separated as CSV format. It’s worth to know that properties data should be tab separated format.
# I used droperidol as an example molecule
smi = 'C1CN(CC=C1N2C3=CC=CC=C3NC2=O)CCCC(=O)C4=CC=C(C=C4)F'
droperidol = Chem.MolFromSmiles(smi)
smi = Chem.MolToSmiles(droperidol)
droperidol
Finally, convert smiles to rdkit molobjects with Chem.PandasTools. I used changeMoleculeRendering method because before using the method molecules in dataframe are not rendered.
By using the approach, we can call mmpdb function from python code without subprocess module. I think it is usful for calling function from the code with is click based CLI tool.
After the command avobe, I could call mmpdb command ;)
Make cdk2.smi from cdk2.sdf with rdkit.
from rdkit import Chem
mols = Chem.SDMolSupplier('cdk2.sdf')
with open('cdk2.smi') as of:
for m in mols:
molid = m.GetProp('_Name')
smi = Chem.MolToSmiles(m)
of.write(f'{smi} {molid}\n')
After making smi file, let’s make mmpdb.
$ mmpdb fragment cdk2.smi -o cdk2.fragment
$ mmpdb index cdk2.fragment -o cdk2.mmpd
60 rules are generated from cdk2.smi which contains 47 molecules.
One of the interesting feature of mmpdb v3 is generate command imprementation. The command allows to user for generating molecule with user defined substructure as query.
cdk2.mmpdb has metyl to thiazole rules ([:1]C >>> [:1]c1nccs1). So I tried to generate command with simple molecule.
I often use flask for my web app development because the package is light weight web framework and easy to write.
BTW rendering molecules as grid image is easy in jupyter notebook by using RDKit’s Draw.MolsToGridImage function. And also recently developed package named mols2grid is really cool for rendering molecules on jupyter notebook. mols2grid generaetes image data as HTML. So it means that we can embed these grid image to webpage.
So I tried to it ;)
Following code uses Flask, Flask-bootsrap, mols2grid. All packages are installed from conda-forge.
Directory structure is below.
root/app.py root/cdk2.sdf root/templates/top.html
And app.py
from flask import Flask
from flask import render_template
from flask_bootstrap import Bootstrap
from rdkit import Chem
import mols2grid
app = Flask(__name__)
Bootstrap(app)
@app.route('/')
def top():
im = mols2grid.display('./cdk2.sdf',
#tooltip=["id", "r_mmffld_Potential_Energy-OPLS_2005"]
)
return render_template('/top.html', data=im.data)
if __name__=="__main__":
app.run(debug=True)
Then access localhost:5000. I could get molecules as grid image.
The data has pagenation and check box. It can dowonload selected molecules. I think it is one of the convenient way to implement rendering molecules function in my web applications.
Last week I enjoyed RDKit UGM 2022. It was really great and exciting evenif I participated there from online. I hope I could participate RDKIT UGM 2023 locally ;)
As you know RDKit is one of the useful OSS package for chemoinformatician. It has nice community and be developed actively. I respect the community and developer’s effort!
And fortunately, we can also use FW from rdkit contrib package ;)
Now new version of rdkit(2022.09.1) isn’t available from conda-forge but I could get new version from original repo and FW package could install with pip command.
By using the package, user can conduct FW analysis really easily. Original repo shows another approach which uses FMCS to get scaffold for the dataset.
I love RDKit because it supports lots of chemoinformatics features and it is supported top level scientific community.
But for eary chemoinformacian, it’s required to learn python or C++ to write chemoinformatics code. As you know, Knime is good option for no-code chemoinformtatics. And another option for beginner is good wrapper of rdkit
Recently the package named datamol is intorduced in my TL.
And as you know, REINVENT is the one of famous compound generation tool for drug discovery. Its community provides lots of example codes as jupyternotebook. After building some model, I run sampling steps with configration file for sampling which is generated by the notebook. Today, I tried to generate molecules with python and process them with Knime. And I surprized that the process is really easy!
At first, I made simple workflow shown below.
To run the workflow, I used reinvent.v3.2 env. And next python script node defines compound generation process. The code is below.
from pandas import DataFrame
# Create empty table
output_table = DataFrame()
from rdkit import Chem
import sys
print(sys.path)
import reinvent_models.reinvent_core.models.model as reinvent
batchsize = 124
#model path where user would like to use
modelpath = '/hogehoge/ReinventCommunity/notebooks/models/random.prior.new'
model = reinvent.Model.load_from_file(modelpath, sampling_mode=True)
sampled_smi, likelyhood = model.sample_smiles(batchsize)
output_table['sampled_smi'] = sampled_smi
output_table['likelyhood'] = likelyhood
output_table['ROMol'] = output_table['sampled_smi'] .apply(Chem.MolFromSmiles)
To generate molecule from model, reinvent_models package is used. After loading the model, call sample_smiles is used for compound generation.
Post process is very simple. Calculate molecular fingerprint, run PCA and calculate molecular descirptors and convet molecule to PNG.
Finally, make javascript component with scatterplot and tile view.
I added interactive view for Tile view. So I could get tile view when I select plots ;)
This is a simple example to integrate KNIME and REINVENT. But I think it’s interesting that to use the approach, we can add lots of postprocess after generating molecules not only LBDD approach but also SBDD approach.
If readers who have interest knime and REINVENT, let’s make your ouwn work flow ;)
When I posted my memo about open science, @OlorenAI introduced python package named Oloren ChemEngine (OCE). I often use chemprop or interanly build system for QSAR tasks. ChemProp is the one of favorite package because it is easy to use and it includes web application flamework for users. I’ve never used OEC so I tried to use the package. Thaks @Oloren for developing and sharing the code.
Agreed! Please check out Oloren ChemEngine, a library for molecular property prediction! https://t.co/ZZ8zKGQzR8
At first, I installed OCE. My cuda version is little bit old, so I need to install Pytorch geometric after running the original install script .
(base) $ conda create -c conda-forge -n oce python=3.8
(base) $ conda activate oce
(oce) $ bash <(curl -s https://raw.githubusercontent.com/Oloren-AI/olorenchemengine/master/install.sh)
(oce) $ conda install -c conda-forge jupyter mamba
# following command depends on your env. I use pytorch1.12 and cuda 10.2
(oce) $ pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-1.12.0+cu102.html
After running the code above, I could use OCE.
OK, let’s run the code. Following code is almost same as original documentation. The main different point compared to other QSAR library, OCE don’t need vectorize(featurize) molecules just define which method user would like to use in the model.
This example make Boosing emsemble model with two RandomfolrestModel with two different features. After making model object, user need to pass the list of SMILES and target value for training !!
OCE wraps complicated molecular featureize process. So user don’t need do that. I think it is cool isn’t it? After making the model object, model training and prediction step is as same as scikit-learn.
User can build complex model with few lines of code ;)
OCE provides not only model building method but also provides visualization method. Let’s see it.
from olorenchemengine.visualizations import *
from olorenchemengine import *
import olorenchemengine as oce
df = oce.ExampleDataset().data
test_dataset = oce.BaseDataset(name='purple', data=df.to_csv(), structure_col='Smiles', property_col='pChEMBL Value')
model = oce.BaseBoosting([
oce.RandomForestModel(oce.DescriptastorusDescriptor("rdkit2dnormalized"),n_estimators=1000),
oce.BaseTorchGeometricModel(oce.TLFromCheckpoint("default"), preinitialized=True),
oce.RandomForestModel(oce.OlorenCheckpoint("default"),n_estimators=1000)])
## Training the model
model.fit(df["Smiles"], df["pChEMBL Value"])
vis = oce.VisualizeModelSim(model=loaded_model,dataset=test_dataset)
vis.render_ipynb()
After making vis object and call render_ipynb(), I could get interactive scatter plot shown below. When I mouse over on the plot, compound structure will be hovered. This example is 2 class classification model but you can make same plot with regression task.
OCE also have not only traditional predictive model but also chemprop model (MPDNN!). It sounds nice, let’s use it. It’s really user friendly API ;)
import io
import sys
import zipfile
import pandas as pd
import requests
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
import olorenchemengine as oce
data_dir = "./data"
data_url = "http://snap.stanford.edu/ogb/data/graphproppred/csv_mol_download/hiv.zip"
r = requests.get(data_url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall(data_dir)
df = pd.read_csv(f"{data_dir}/hiv/mapping/mol.csv.gz")
X_train, X_test, y_train, y_test = train_test_split(df["smiles"], df["HIV_active"], test_size=0.2, random_state=42)
## building and training chemprop model is finished only two lines!!!
model = oce.ChemPropModel()
model.fit(X_train, y_train)
'''
print(X_train)
29361 O=C(O)CC(NC(=O)OCc1ccccc1)C(=O)O
10448 O=[N+]([O-])c1ccc(Nc2ccccc2)c2nonc12
31039 CCOC(=O)C(=NNc1ccc(C)cc1)N1C(=S)N(C)N=C(C)C=C1S
1311 N#CSC1=C(SC#N)C(=O)c2ccccc2C1=O
27834 COc1cc(C2C3=C(COC3=O)OC(C)(C)Oc3cc4c(cc32)OCO4...
...
6265 Cc1ccc2nsnc2c1[N+](=O)[O-]
11284 CC1=CC(=C(c2cc(C)c(O)c(C(=O)O)c2)c2c(Cl)ccc(S(...
38158 Oc1ncnc2c1sc1nc3ccccc3n12
860 CCN(CCO)CCNc1ccc(C)c2sc3ccccc3c(=O)c12
15795 COc1cccc(NC(=O)CC(=O)N2N=C(N(CCC#N)c3ccc(Cl)cc...
Name: smiles, Length: 32901, dtype: object
model.model
Sequential(
(0): DMPNNEncoder(
(act_func): ReLU()
(W1): Linear(in_features=165, out_features=300, bias=False)
(W2): Linear(in_features=300, out_features=300, bias=False)
(W3): Linear(in_features=451, out_features=300, bias=True)
)
(1): Sequential(
(0): Dropout(p=0.0, inplace=False)
(1): Linear(in_features=300, out_features=300, bias=True)
(2): ReLU()
(3): Dropout(p=0.0, inplace=False)
(4): Linear(in_features=300, out_features=1, bias=True)
)
)
'''
Additionaly, OCE can visualize chemicalspace. Let’s plot chemicals space with TSNE.
import olorenchemengine as oce
dataset = oce.BACEDataset() + oce.ScaffoldSplit()
vis = oce.ChemicalSpacePlot(dataset, oce.DescriptastorusDescriptor('morgan3counts'), opacity = 0.4, dim_reduction = "tsne")
vis.render_ipynb()
Next, make plot with trained model data. The size of marker means the magnitude of the residuals. It’s useful to evaluate model performance.
model = oce.BaseBoosting([oce.RandomForestModel(oce.DescriptastorusDescriptor("morgan3counts")),
oce.RandomForestModel(oce.OlorenCheckpoint("default"))])
model.fit(*dataset.train_dataset)
vis = oce.VisualizeDatasetSplit(dataset, oce.DescriptastorusDescriptor("morgan3counts"),
model = model, opacity = 0.4)
vis.render_ipynb()
These codes are few examples from original document. If readers have interest the package please install and use it in your Real tasks!
My blogsite was started 2013, so I took almost 10 years ;) It’s amazing. I’m enjoying writing the blog and sharing my code snippet.
BTW, I’m working in pharmaceutical company as a chemoinformatician. So, is it risk to write chemoinformatics related post in my blog site….? I don’t think so. As many people know that AstraZeneca shares theire useful code on theire github repository<https://github.com/MolecularAI>. It means that sharing code is not differenciation point and it is good eco system to improve data science. It is worth to open science of non competitive area. Because to do it, we can get many feedback from the many communities. It is difficult to get these thins from interal community only. Now it is easy to communicate not only domestic area but also wold wide with IT tools.
I would like to keep writing the blog post enven if the frequency of the update is slow ;) And hope many reader will enjoy open science and get new idea from it.
Today is the last day of my summer vacation…. Due to COVID-19 pandemic, I and my family didn’t go travel during the vacation however, we’ll go to national championship of dodgeball game in next sturday. It will be exciting day!
BTW as you know, rdkit has lots of contrib packages. And I would like to introduce a package named CalcLigRMSD which is calculator of two molecules RMSD.
It is easy to two ligands RMSD with the package. In the original repositry, more useful examples are introduced. If reader has interest the code, pls check original repo ;)
Visualize chemical space is important task for chemoinformatitian. And there are lots of way to represent chemical space. One of the common approach is PCA. And recently tSNE and UMAP are used.
I wrote template code for plotting these data in my task but didn’t write code as a package.
Today I found useful package for making chemical space plot with python named ChemPlot. You can get and read the article from following URL.
After the installation, I tried to use chemplot and uploaded the code. It’s really easy to make chemical space plot with chemplot. I could make not only static plot but also interactive plot. Following code doesn’t render interactive plot but it worked on my PC. Interactive plot provides structure image when mouse over on the plot.
It’s worth to use for making your chemical space plot. In summary there are lots of useful packages are available in python. Thanks for sharing the code!
I think workflow tool is useful for many chemoinformatician such as Knime, Pipeline Pilot(PP) and Orange. In my knowledge, it’s difficult to use user defined conda env in PP. Of course these workflow tools provides enough components to do many tasks. But I would like to have more flexibility to build chemoinformatics pipelines.
Recently I updated Knime in my PC and found nice node for using user defined conda env in Knime.
Conda environment probacation node enable to define conda env in Knime pipeline.
I tried to use it in very simple workflow. Following image is overall view of workflow.
Conda environment probagation can select local conda env for workflow. User can select env very easily. I selected my chemoinfomatics env with python 3.9. The setting is passed as flow valiable.
Then, I defined flow valiable in python script node like below.
And wrote some code in the node.
# Copy input to output
output_table_1 = input_table_1.copy()
from rdkit import Chem
from rdkit.Chem import Descriptors
output_table_1['ROMol'] = output_table_1['Molecule'].apply(Chem.MolFromMolBlock)
output_table_1['SMILES'] = output_table_1['ROMol'].apply(Chem.MolToSmiles)
output_table_1['MolWt'] = output_table_1['ROMol'].apply(Descriptors.MolWt)
output_table_1['LogP'] = output_table_1['ROMol'].apply(Descriptors.MolLogP)
Data from SDF reader is processed in python script node under my defiened conda env. It worked fine.
In summary, Knime is really flexible and useful tool IMHO. And now there are lots of useful resource for learning Knime.
Reader who interested in Knime I recommend to check following blog post ;)
Virtual Screening is important task of drug discovery projects. There are lots of approach for example Finger print based, substructure based and shape based screening. All approaches listed above is not only used in SBDD but also LBDD.
And there are lots of apprications to do these tasks. I wrote scripts for these task and use then. But recently I found nice package for VS named VSflow which is developed by Paul Czodrowski’s group.
It seems interesting, so I tried to use it. At first, I prepared conda env and install it.
After running the code, I could use vsflow command.
Next, I prepared dabase. Database can be made from any kinds of dataset but I used default set. ‘-d pdb’ option means prepare database with smiles which cames from ligandexpo.
$ time vsflow preparedb -d pdb -o pdb_ligs -np 6
**************************
VV VV SSSSSSS VSFlow
VV VV SSS SS Virtual Screening
VV VV SSSS Workflow
VV VV SSSS
VVVV SS SSS
VV SSSSSSS
**************************
Start: 06/21/2022, 21:22:37
Running in parallel mode on 6 threads
Downloading database pdb ...
Finished downloading database
Generating database file ...
Finished in 168 seconds
real 2m48.611s
user 0m6.480s
sys 0m1.619s
Now I could get ‘pdb_ligs.vsdb’ which is pickled data for VSFlow. Next, I tried to substructure and fp sim search. I used SMILES as a query. The task done in a second.
$ vsflow substructure -smi 'c1ccnnc1' -d pdb_ligs.vsdb -o smi_sub_pdb.sdf
**************************
VV VV SSSSSSS VSFlow
VV VV SSS SS Virtual Screening
VV VV SSSS Workflow
VV VV SSSS
VVVV SS SSS
VV SSSSSSS
**************************
Start: 06/21/2022, 21:28:50
Running in single core mode
Loading database pdb_ligs.vsdb ...
Reading query ...
Finished substructure search in 0.83285 seconds
Generating output file(s) ...
313 matches found
Finished: 06/21/2022, 21:28:51
Finished in 0.91936 seconds
SSS hit compounds are below.
Following example is similarity sarch and I made similarity map as PDF.
$ vsflow fpsim -d pdb_ligs.vsdb -smi "CC1CCN(C(=O)CC#N)CC1N(C)c1ncnc2[nH]ccc12" -o sim.sdf --pdf --simmap
**************************
VV VV SSSSSSS VSFlow
VV VV SSS SS Virtual Screening
VV VV SSSS Workflow
VV VV SSSS
VVVV SS SSS
VV SSSSSSS
**************************
Start: 06/21/2022, 22:06:41
Running in single core mode
Loading database pdb_ligs.vsdb ...
Reading query input ...
Calculating fingerprints ...
Finished fingerprint generation in 6.04996 seconds
Calculating similarities ...
Finished calculating similarities in 0.08398 seconds
Writing 10 molecules to output file(s)
Generating output file(s) ...
Generating PDF file(s) ...
Calculating similarity maps for 10 matches ...
Finished: 06/21/2022, 22:06:56
Finished in 14.63942 seconds
Similarity map is nice approach to visualize similarity between query(tofacitinib) and hit compounds. This example used fcfp4 as FP however user can use other rdkit supported FP such as ECFP, RDKit, Atom etc.
Final example is shape similarity. To do it vsdb should have 3D structure information. So I got 3D data from ligand expo and made vsdb.
To use rdkit from Rust, I introduced rdkit-sys before. And fortunately recent version of rdkit-sys cleat supports rdkit-env. It’s worth to use conda-env to build rdkit-sys because user don’t need to build rdkit from source code.
Following code is almost same as my previous post but I would like to share it.
At first, I cloned rdkit from rdkit-rs.
$ gh repo clone rdkit-rs/rdkit
$ cd rdkit
Then edit Cargo.toml. I modified dependencies part as below. Added features=[“dynamic-linking-from-conda”] option. And then added LD_LIBRARY_PATH to use it.
[package]
name = "rdkit"
version = "0.2.11"
edition = "2021"
authors = ["Xavier Lange <xrlange@gmail.com>"]
license = "MIT"
description = "High level RDKit functionality for rust"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
bitvec = "1"
cxx = "1"
log = "0.4"
#iwatobipen modified
rdkit-sys = {version="0.2.14", features=["dynamic-linking-from-conda"]}
flate2 = "1"
[dev-dependencies]
env_logger = "0.9.0"
## from bash
(base) $ conda activate chemo_py310
(chemo_py310)$ export LD_LIBRARY_PATH=$CONDA_PREFIX/lib;$LD_LIBRARY_PATH
(chemo_py310)$ cargo test # all tests will pass
Then I wrote rust_rdkit_v3
$ cargo new rust_rdkit_v3
$ cd rdkit_rust
$ vim src/main.rs
use std::env;
use std::path::PathBuf;
use rdkit;
fn main() {
let args: Vec<String> = env::args().collect();
let sdgz_filename = &args[1];
println!("filename is {}", sdgz_filename);
let mol_block_iter =
rdkit::MolBlockIter::from_gz_file(sdgz_filename, false, false, false).unwrap();
let mol_list = mol_block_iter.collect::<Vec<_>>();
for m in mol_list {
let smi = m.unwrap().as_smile();
println!("{}", smi)
}
println!("Done");
}
$ vim Cargo.toml
[package]
name = "rust_rdkit_v3"
version = "0.1.0"
edition = "2021"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
#rdkit = "0.2.11"
rdkit={path="../rdkit"}
~
General molecular generator with deep learning approach is difficult to fix substructure. But common SAR expansion by medchem is focus on specific part such as terminal part, core and linker.
Linker is defined as a part which connect parts of molecules. Linker is important of drug design and is called scaffold when the linker connets two or three parts. And also important in PROTAC design. So it’s useful if deep learning based generator can design linker ;). As you know delinker(https://github.com/fimrie/DeLinker) is one of the useful and cool code for linker design.
Recently, AstraZeneca team disclosed cool linker generator code for REINVET named Link-INVENT which is implemented in REINVENT v3.2. Link-INVENT is new option of REINVENT for liker design with many useful scoring options.
Link-INVENT can use following linker design specific scoring functions. As you can see, the package can design specific linker.
I added ‘linker_graph_length’ scoring function for controlling length of molecules.
And after running the learnig process, I checked the learning process with tensorboad. It’s really useful to review the learning process becase I can see not only learning score but also structures of each learning process.
Here is the first output of the learning process.
And here is the molecules from last learning step .
The molecules which generated at last learning step has rigid and short linker which is defined by scoring function.
In summary, I think REINVENT is cool and flexible molecular generator code. I respect authors work and thank for sharing the nice code ;)
I introduced rdkit-sys which is wrapper of rdkit for rust. The package development is ongoing so code is chaged frequentry. To use the package, user need to build rdkit with static link ON option. Build rdkit-sys with static link (.a) is efficient way for portability of the code however it means that it’s difficult to use conda env rdkit because, conda env provides only dynamic link (.so).
So I would like to build rdkit-sys and call rdkit from rust with dylib.
After several try and error, I found solution to do it. I would like to share my trial.
At first, I cloned rdkit-sys in my PC.
$ gh repo clone rdkit-rs/rdkit-sys
$ cd rdkit-sys
Then edit build.rs script show below.
fn main() {
if std::env::var("DOCS_RS").is_ok() {
return;
}
env_logger::init();
let library_root = match (std::env::consts::OS, std::env::consts::ARCH) {
("macos", "x86_64") => "/usr/local",
("macos", "aarch64") => "/opt/homebrew",
//("linux", _) => "/usr",
//following path is my conda env's path
("linux", _) => "/home/iwatobipen/miniconda3/envs/chemo_py310",
(unsupported_os, unsupported_arch) => panic!(
"sorry, rdkit-sys doesn't support {} on {} at this time",
unsupported_os, unsupported_arch
),
};
let brew_lib_path = format!("{}/lib", library_root);
let include = format!("{}/include", library_root);
let rdkit_include = format!("{}/include/rdkit", library_root);
let dir = std::fs::read_dir("src/bridge").unwrap();
let rust_files = dir
.into_iter()
.filter_map(|p| match p {
Ok(p) => {
if p.metadata().unwrap().is_file() {
Some(p.path())
} else {
None
}
}
Err(_) => None,
})
.filter(|p| !p.ends_with("mod.rs"))
.collect::<Vec<_>>();
let mut cc_paths = vec![];
let wrapper_root = std::path::PathBuf::from("wrapper");
for file in &rust_files {
let file_name = file.file_name().unwrap();
let file_name = file_name.to_str().unwrap();
let base_name = &file_name[0..file_name.len() - 3];
let cc_path = wrapper_root.join("src").join(format!("{}.cc", base_name));
let meta = std::fs::metadata(&cc_path).unwrap();
if !meta.is_file() {
panic!("{} must exist", cc_path.display())
}
cc_paths.push(cc_path);
let h_path = wrapper_root
.join("include")
.join(format!("{}.h", base_name));
let meta = std::fs::metadata(&h_path).unwrap();
if !meta.is_file() {
panic!("{} must exist", h_path.display())
}
}
cxx_build::bridges(rust_files)
.files(cc_paths)
.include(include)
.include(rdkit_include)
.include(std::env::var("CARGO_MANIFEST_DIR").unwrap())
.flag("-std=c++14")
.warnings(false)
// rdkit has warnings that blow up our build. we could enumerate all those warnings and tell
// the compiler to allow them... .warnings_into_errors(true)
.compile("rdkit");
println!("cargo:rustc-link-search=native={}", brew_lib_path);
// println!("cargo:rustc-link-lib=static=c++");
for lib in &[
"Catalogs",
"ChemReactions",
"ChemTransforms",
"DataStructs",
"Descriptors",
"FileParsers",
"Fingerprints",
"GenericGroups",
"GraphMol",
"MolStandardize",
"RDGeneral",
"RDGeometryLib",
"RingDecomposerLib",
"SmilesParse",
"Subgraphs",
"SubstructMatch",
] {
//swich static link to dynamic link!!!
//println!("cargo:rustc-link-lib=static=RDKit{}_static", lib);
println!("cargo:rustc-link-lib=dylib=RDKit{}", lib);
}
println!("cargo:rustc-link-lib=static=boost_serialization");
}
After that, I could cargo build command in rdkit-sys folder ;). Then cloned rdkit-rs/rdkit with the crate. To do that I edited Cargo.toml like below.
$ gh repo clone rdkit-rs/rdkit
$ cd rdkit
$ vim Cargo.toml
[package]
name = "rdkit"
version = "0.2.6"
edition = "2021"
authors = ["Xavier Lange <xrlange@gmail.com>"]
license = "MIT"
description = "High level RDKit functionality for rust"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
cxx = "1"
log = "0.4"
#rdkit-sys = "0.2.7"
rdkit-sys = { path = "../rdkit-sys"} # I used modified version of rdkit-sys
flate2 = "1"
After that, I wrote code with these packages.
my rdkitest code is below.
#Cargo.toml
[package]
name = "rdkrust"
version = "0.1.0"
edition = "2021"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
rdkit = { path = "/home/iwatobipen/dev/sandbox/rusttest/rdkit" }
#src/main.rs
use std::env;
use rdkit;
fn main() {
let args: Vec<String> = env::args().collect();
let sdgz_filename = &args[1];
println!("filename is {}", sdgz_filename);
let mol_block_iter =
rdkit::MolBlockIter::from_gz_file(sdgz_filename, false, false, false).unwrap();
let mol_list = mol_block_iter.collect::<Vec<_>>();
for m in mol_list {
let smi = m.unwrap().as_smile();
println!("{}", smi)
}
println!("Done");
}
The code will read sdf.gz and retrieve molecules and then convert molecules to SMILES.
It seems work fine. In summary, rdkit-rs activity is really cool project for chemoinformatics because rust is really efficient language. I would like to learn rust more and more and develop useful apps for chemoinformatics.
I wrote lots of chemoinformatics apps as web app. To do that I often use Flask / Django. It’s python packge so most of part can be written in python however it’s difficult to embed python directry in html document.
Recently I found cool package for embedding python code in html, it’s name is pyscript. It’s really cool!
And code is below. To do he test, I used python http package.
#webserver.py
from http.server import HTTPServer, SimpleHTTPRequestHandler, os
os.chdir(os.path.join(os.path.dirname(__file__), 'contents'))
server = HTTPServer(('', 8000), SimpleHTTPRequestHandler)
server.serve_forever()
As described in the article, surge has a limitation. Current version doesn’t perform a Huckel aromaticity test. It means surge will generate dupilicates structure for kekule versions of aromatic rings.
However it works fast and interesting tool for molecular generation. BTW it’s difficult to filter from generated molecules with desired compound properties in the drug discovery field.