Generate molecules with REINVENT and make interactive plot #RDKit #Knime #REINVENT

I previously introduced how to use a user-defined conda env in KNIME. It's really useful, I think.

Make chemoinformatics workflow by KNIME with user defined conda environment #Knime #Chemoinformatics #memo

And as you know, REINVENT is one of the most famous compound generation tools for drug discovery. Its community provides lots of example code as Jupyter notebooks. After building a model, I ran the sampling step with a sampling configuration file generated by the notebook. Today, I tried to generate molecules with Python and process them in KNIME. And I was surprised that the process is really easy!

At first, I made the simple workflow shown below.

To run the workflow, I used the reinvent.v3.2 env. The next Python Script node defines the compound generation process. The code is below.

from pandas import DataFrame
from rdkit import Chem
import sys
print(sys.path)
import reinvent_models.reinvent_core.models.model as reinvent

# Create empty output table
output_table = DataFrame()
batchsize = 124
# model path the user would like to use
modelpath = '/hogehoge/ReinventCommunity/notebooks/models/random.prior.new'

model = reinvent.Model.load_from_file(modelpath, sampling_mode=True)
sampled_smi, likelihood = model.sample_smiles(batchsize)
output_table['sampled_smi'] = sampled_smi
output_table['likelihood'] = likelihood
output_table['ROMol'] = output_table['sampled_smi'].apply(Chem.MolFromSmiles)

To generate molecules from the model, the reinvent_models package is used. After loading the model, sample_smiles is called for compound generation.

The post-processing is very simple: calculate molecular fingerprints, run PCA, calculate molecular descriptors and convert the molecules to PNG.
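As a rough illustration of that post-processing outside KNIME, here is a minimal Python sketch assuming RDKit and scikit-learn are available (the SMILES are my own toy examples, not REINVENT output):

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors
from sklearn.decomposition import PCA

smiles = ["c1ccccc1", "CCO", "CC(=O)OC1=CC=CC=C1C(=O)O", "c1ccncc1"]
mols = [Chem.MolFromSmiles(smi) for smi in smiles]

# 2048-bit Morgan (ECFP4-like) fingerprints as a numpy matrix
fps = np.array([list(AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048))
                for m in mols])

# Project the fingerprints onto two principal components for the scatter plot
pca = PCA(n_components=2)
coords = pca.fit_transform(fps)

# A couple of simple descriptors for the tooltip / tile view
mw = [Descriptors.MolWt(m) for m in mols]
logp = [Descriptors.MolLogP(m) for m in mols]
print(coords.shape)  # (4, 2)
```

In the actual workflow each of these steps is a separate KNIME node; the sketch just shows the underlying calculations.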

Finally, I made a JavaScript component with a scatter plot and a tile view.

I added an interactive view for the tile view, so I can see the tile view when I select points on the plot ;)

This is a simple example of integrating KNIME and REINVENT. But I think it's interesting that with this approach we can add lots of post-processing after generating molecules, not only for LBDD but also for SBDD approaches.

If you are interested in KNIME and REINVENT, let's make your own workflow ;)

A useful Python package for QSAR-related tasks for chemoinformaticians #chemoinformatics #oloren-ai #RDKit

When I posted my memo about open science, @OlorenAI introduced a Python package named Oloren ChemEngine (OCE). I often use chemprop or an internally built system for QSAR tasks. ChemProp is one of my favorite packages because it is easy to use and includes a web application framework for users. I'd never used OCE, so I tried the package. Thanks @OlorenAI for developing and sharing the code.

At first, I installed OCE. My CUDA version is a little bit old, so I needed to install PyTorch Geometric after running the original install script.

(base) $ conda create -c conda-forge -n oce python=3.8
(base) $ conda activate oce
(oce) $ bash <(curl -s https://raw.githubusercontent.com/Oloren-AI/olorenchemengine/master/install.sh)
(oce) $ conda install -c conda-forge jupyter mamba
# following command depends on your env. I use pytorch1.12 and cuda 10.2
(oce) $ pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-1.12.0+cu102.html

After running the commands above, I could use OCE.

OK, let's run the code. The following code is almost the same as the original documentation. The main difference compared to other QSAR libraries is that OCE doesn't require you to vectorize (featurize) molecules; you just define which featurization method to use in the model.

This example makes a boosting ensemble model from two RandomForestModels with two different feature sets. After making the model object, you just pass a list of SMILES and target values for training!

OCE wraps the complicated molecular featurization process, so users don't need to do it themselves. It's cool, isn't it? After making the model object, the training and prediction steps are the same as in scikit-learn.

Users can build a complex model with a few lines of code ;)

import olorenchemengine as oce
df = oce.ExampleDataFrame()

model = oce.BaseBoosting([
            oce.RandomForestModel(oce.DescriptastorusDescriptor("rdkit2dnormalized"), n_estimators=1000),
            oce.RandomForestModel(oce.OlorenCheckpoint("default"), n_estimators=1000)])

model.fit(df["Smiles"], df["pChEMBL Value"])
oce.save(model, "model.oce")
model2 = oce.load("model.oce")
y_pred = model2.predict(["CC(=O)OC1=CC=CC=C1C(=O)O"])
y_pred
>>array([0.15748118])

OCE provides not only model building methods but also visualization methods. Let's see one.

from olorenchemengine.visualizations import *
from olorenchemengine import *
import olorenchemengine as oce
df = oce.ExampleDataset().data
test_dataset = oce.BaseDataset(name='purple', data=df.to_csv(), structure_col='Smiles', property_col='pChEMBL Value')

model = oce.BaseBoosting([
            oce.RandomForestModel(oce.DescriptastorusDescriptor("rdkit2dnormalized"),n_estimators=1000),
            oce.BaseTorchGeometricModel(oce.TLFromCheckpoint("default"), preinitialized=True),
            oce.RandomForestModel(oce.OlorenCheckpoint("default"),n_estimators=1000)])


## Training the model
model.fit(df["Smiles"], df["pChEMBL Value"])
vis = oce.VisualizeModelSim(model=model, dataset=test_dataset)
vis.render_ipynb()

After making the vis object and calling render_ipynb(), I could get the interactive scatter plot shown below. When I mouse over a point on the plot, the compound structure is displayed. This example is a two-class classification model, but you can make the same plot for a regression task.

OCE also has not only traditional predictive models but also a chemprop model (D-MPNN!). Sounds nice, so let's use it. It's a really user-friendly API ;)

import io
import sys
import zipfile

import pandas as pd
import requests
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

import olorenchemengine as oce
data_dir = "./data"
data_url = "http://snap.stanford.edu/ogb/data/graphproppred/csv_mol_download/hiv.zip"
r = requests.get(data_url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall(data_dir)
df = pd.read_csv(f"{data_dir}/hiv/mapping/mol.csv.gz")

X_train, X_test, y_train, y_test = train_test_split(df["smiles"], df["HIV_active"], test_size=0.2, random_state=42)

## building and training a chemprop model takes only two lines!!
model = oce.ChemPropModel()
model.fit(X_train, y_train)

'''
print(X_train)
29361                     O=C(O)CC(NC(=O)OCc1ccccc1)C(=O)O
10448                 O=[N+]([O-])c1ccc(Nc2ccccc2)c2nonc12
31039      CCOC(=O)C(=NNc1ccc(C)cc1)N1C(=S)N(C)N=C(C)C=C1S
1311                       N#CSC1=C(SC#N)C(=O)c2ccccc2C1=O
27834    COc1cc(C2C3=C(COC3=O)OC(C)(C)Oc3cc4c(cc32)OCO4...
                               ...                        
6265                            Cc1ccc2nsnc2c1[N+](=O)[O-]
11284    CC1=CC(=C(c2cc(C)c(O)c(C(=O)O)c2)c2c(Cl)ccc(S(...
38158                            Oc1ncnc2c1sc1nc3ccccc3n12
860                 CCN(CCO)CCNc1ccc(C)c2sc3ccccc3c(=O)c12
15795    COc1cccc(NC(=O)CC(=O)N2N=C(N(CCC#N)c3ccc(Cl)cc...
Name: smiles, Length: 32901, dtype: object

model.model
Sequential(
  (0): DMPNNEncoder(
    (act_func): ReLU()
    (W1): Linear(in_features=165, out_features=300, bias=False)
    (W2): Linear(in_features=300, out_features=300, bias=False)
    (W3): Linear(in_features=451, out_features=300, bias=True)
  )
  (1): Sequential(
    (0): Dropout(p=0.0, inplace=False)
    (1): Linear(in_features=300, out_features=300, bias=True)
    (2): ReLU()
    (3): Dropout(p=0.0, inplace=False)
    (4): Linear(in_features=300, out_features=1, bias=True)
  )
)
'''

Additionally, OCE can visualize chemical space. Let's plot the chemical space with t-SNE.

import olorenchemengine as oce
dataset = oce.BACEDataset() + oce.ScaffoldSplit()
vis = oce.ChemicalSpacePlot(dataset, oce.DescriptastorusDescriptor('morgan3counts'), opacity = 0.4, dim_reduction = "tsne")
vis.render_ipynb()

Next, make a plot with the trained model's data. The marker size represents the magnitude of the residuals. It's useful for evaluating model performance.

model = oce.BaseBoosting([oce.RandomForestModel(oce.DescriptastorusDescriptor("morgan3counts")),
                         oce.RandomForestModel(oce.OlorenCheckpoint("default"))])
model.fit(*dataset.train_dataset)
vis = oce.VisualizeDatasetSplit(dataset, oce.DescriptastorusDescriptor("morgan3counts"), 
                                model = model, opacity = 0.4)
vis.render_ipynb()

These are just a few examples from the original documentation. If you are interested in the package, please install it and use it in your real tasks!

Nice documentation is provided here: https://docs.oloren.ai/api/modules.html

Thanks!!!

Let’s enjoy open science! #memo #diary

I started my blog in 2013, so it's been almost 10 years ;) It's amazing. I'm enjoying writing the blog and sharing my code snippets.

BTW, I'm working at a pharmaceutical company as a chemoinformatician. So, is it a risk to write chemoinformatics-related posts on my blog…? I don't think so. As many people know, AstraZeneca shares their useful code in their GitHub repositories (https://github.com/MolecularAI). It means that sharing code is not a differentiation point, and it is a good ecosystem for improving data science. It is worth doing open science in non-competitive areas, because by doing so we can get feedback from many communities. It is difficult to get these things from an internal community only. And now it is easy to communicate not only domestically but also worldwide with IT tools.

I would like to keep writing blog posts even if the update frequency is slow ;) I hope many readers will enjoy open science and get new ideas from it.

Thanks for reading.

Calculate ligand RMSD with rdkit contrib package #RDKit #memo

Today is the last day of my summer vacation…. Due to the COVID-19 pandemic, my family and I didn't travel during the vacation; however, we'll go to the national dodgeball championship next Saturday. It will be an exciting day!

BTW, as you know, the RDKit has lots of contrib packages. I would like to introduce one named CalcLigRMSD, which calculates the RMSD between two molecules.

It is easy to use, and good example code is provided in the original repo.
https://github.com/rdkit/rdkit/tree/master/Contrib/CalcLigRMSD

I would like to use the package so I tried to run the code on my PC.

Here is my code. (Most of the code is borrowed from the original repository.)

https://nbviewer.org/gist/iwatobipen/7456f75e7d5f23607cd9c9885c7f5ff0

It is easy to calculate the RMSD of two ligands with the package. In the original repository, more useful examples are introduced. If you are interested in the code, please check the original repo ;)
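The contrib module has its own helper functions (see the gist), but the core idea can also be sketched with the RDKit's built-in symmetry-aware RMSD tool. This is a toy example of mine, not the CalcLigRMSD API itself: aspirin is embedded twice with the same random seed, so the two conformers are identical and the expected RMSD is ~0.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, rdMolAlign

# Build two copies of the same ligand and embed each with the same seed,
# so their coordinates are identical and the expected RMSD is ~0
ref = Chem.AddHs(Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O"))
probe = Chem.AddHs(Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O"))
AllChem.EmbedMolecule(ref, randomSeed=42)
AllChem.EmbedMolecule(probe, randomSeed=42)

# GetBestRMS considers symmetry-equivalent atom mappings, which matters
# for ligands with symmetric groups (the point of CalcLigRMSD as well)
rmsd = rdMolAlign.GetBestRMS(probe, ref)
print(rmsd)
```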

Thanks for reading.

Useful package for ploting chemical space rapidly #chemoinformatics #memo

Visualizing chemical space is an important task for chemoinformaticians, and there are lots of ways to represent chemical space. One of the most common approaches is PCA, and recently t-SNE and UMAP have also been used.

I wrote template code for plotting such data in my own work but never turned it into a package.

Today I found a useful Python package for making chemical space plots named ChemPlot. You can read the article at the following URL.

The article is open access! https://chemistry-europe.onlinelibrary.wiley.com/doi/full/10.1002/cmtd.202200005

And fortunately the authors shared the code on GitHub. People who are interested in the package can install ChemPlot via conda or pip.

I installed the package with conda.

$ mamba install -c conda-forge -c chemplot chemplot

After the installation, I tried ChemPlot and uploaded the code. It's really easy to make chemical space plots with ChemPlot. I could make not only static plots but also interactive plots. The code below doesn't render the interactive plot here, but it worked on my PC. The interactive plot shows the structure image when you mouse over a point.

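For readers curious what such a plot boils down to: ChemPlot featurizes the molecules and runs a dimensionality reduction over them. A minimal equivalent of that idea (not ChemPlot's API) with the RDKit and scikit-learn looks like this; the SMILES are toy examples of mine:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.manifold import TSNE

smiles = ["c1ccccc1", "c1ccncc1", "CCO", "CCN",
          "CCC", "CC(=O)O", "c1ccccc1O", "c1ccccc1N"]
mols = [Chem.MolFromSmiles(smi) for smi in smiles]
fps = np.array([list(AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=1024))
                for m in mols])

# t-SNE needs perplexity < n_samples; with 8 molecules keep it small
emb = TSNE(n_components=2, perplexity=3, init="pca",
           random_state=0).fit_transform(fps)
print(emb.shape)  # (8, 2)
```

The 2D embedding can then be handed to any plotting library; ChemPlot wraps all of this plus the structure-on-hover tooltips for you.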

It's worth using for making your own chemical space plots. In summary, there are lots of useful packages available in Python. Thanks for sharing the code!

Make chemoinformatics workflow by KNIME with user defined conda environment #Knime #Chemoinformatics #memo

I think workflow tools such as KNIME, Pipeline Pilot (PP) and Orange are useful for many chemoinformaticians. To my knowledge, it's difficult to use a user-defined conda env in PP. Of course, these workflow tools provide enough components for many tasks, but I would like more flexibility when building chemoinformatics pipelines.

Recently I updated KNIME on my PC and found a nice node for using a user-defined conda env in KNIME.

The Conda Environment Propagation node enables you to define a conda env in a KNIME pipeline.

I tried it in a very simple workflow. The following image is an overall view of the workflow.

The Conda Environment Propagation node lets you select a local conda env for the workflow, and selecting one is very easy. I selected my chemoinformatics env with Python 3.9. The setting is passed on as a flow variable.

Then, I defined the flow variable in the Python Script node like below.

And wrote some code in the node.

# Copy input to output
output_table_1 = input_table_1.copy()
from rdkit import Chem
from rdkit.Chem import Descriptors
output_table_1['ROMol'] = output_table_1['Molecule'].apply(Chem.MolFromMolBlock)
output_table_1['SMILES'] = output_table_1['ROMol'].apply(Chem.MolToSmiles)
output_table_1['MolWt'] = output_table_1['ROMol'].apply(Descriptors.MolWt)
output_table_1['LogP'] = output_table_1['ROMol'].apply(Descriptors.MolLogP)

Data from the SDF Reader is processed in the Python Script node under my defined conda env. It worked fine.

In summary, KNIME is a really flexible and useful tool IMHO. And now there are lots of useful resources for learning KNIME.

If you are interested in KNIME, I recommend checking the following blog posts ;)

Written in Japanese!
https://note.com/knimesupportteam/m/m041672f62384
https://sumtat.hatenablog.com/

KNIME blog
https://www.knime.com/blog

Thanks for reading.

Useful package for virtual screening #chemoinformatics #RDKit

Virtual screening is an important task in drug discovery projects. There are lots of approaches, for example fingerprint-based, substructure-based and shape-based screening. All the approaches listed above are used not only in SBDD but also in LBDD.

And there are lots of applications for these tasks. I've written scripts for them and use them, but recently I found a nice package for VS named VSFlow, which is developed by Paul Czodrowski's group.

It seemed interesting, so I tried it. At first, I prepared a conda env and installed it.

$ gh repo clone czodrowskilab/VSFlow
$ cd VSFlow
$ conda env create --quiet --force --file environment.yml
$ conda activate vsflow
$ pip install .

After running the commands, I could use the vsflow command.

Next, I prepared a database. A database can be made from any kind of dataset, but I used the default set. The '-d pdb' option means preparing the database with SMILES that come from Ligand Expo.

$ time vsflow preparedb -d pdb -o pdb_ligs -np 6
**************************

 VV        VV  SSSSSSS             VSFlow
  VV      VV  SSS    SS       Virtual Screening
   VV    VV    SSSS               Workflow
    VV  VV       SSSS
     VVVV     SS    SSS
      VV       SSSSSSS

**************************

Start: 06/21/2022, 21:22:37
Running in parallel mode on 6 threads
Downloading database pdb ...
Finished downloading database
Generating database file ...
Finished in 168 seconds

real	2m48.611s
user	0m6.480s
sys	0m1.619s

Now I have 'pdb_ligs.vsdb', which is pickled data for VSFlow. Next, I tried substructure and fingerprint similarity searches, using SMILES as the query. The task was done in a second.

$ vsflow substructure -smi 'c1ccnnc1' -d pdb_ligs.vsdb -o smi_sub_pdb.sdf
**************************

 VV        VV  SSSSSSS             VSFlow
  VV      VV  SSS    SS       Virtual Screening
   VV    VV    SSSS               Workflow
    VV  VV       SSSS
     VVVV     SS    SSS
      VV       SSSSSSS

**************************

Start: 06/21/2022, 21:28:50
Running in single core mode
Loading database pdb_ligs.vsdb ...
Reading query ...
Finished substructure search in 0.83285 seconds
Generating output file(s) ...
313 matches found
Finished: 06/21/2022, 21:28:51
Finished in 0.91936 seconds

The SSS hit compounds are below.
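The underlying operation is a standard RDKit substructure match. A minimal sketch with a toy in-memory database of my own (VSFlow's vsdb format and parallelization are of course more involved):

```python
from rdkit import Chem

# pyridazine, the same query SMILES as in the vsflow run above
query = Chem.MolFromSmarts("c1ccnnc1")

# Toy "database"; a real vsdb holds thousands of Ligand Expo entries
database = ["c1ccc2c(c1)nncc2", "c1ccccc1", "Cc1cccnn1", "CCO"]
hits = [smi for smi in database
        if Chem.MolFromSmiles(smi).HasSubstructMatch(query)]
print(hits)
```

Only the two pyridazine-containing entries survive the filter; benzene and ethanol are rejected.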

The following example is a similarity search, and I generated similarity maps as a PDF.

$ vsflow fpsim -d pdb_ligs.vsdb -smi "CC1CCN(C(=O)CC#N)CC1N(C)c1ncnc2[nH]ccc12" -o sim.sdf --pdf --simmap
**************************

 VV        VV  SSSSSSS             VSFlow
  VV      VV  SSS    SS       Virtual Screening
   VV    VV    SSSS               Workflow
    VV  VV       SSSS
     VVVV     SS    SSS
      VV       SSSSSSS

**************************

Start: 06/21/2022, 22:06:41
Running in single core mode
Loading database pdb_ligs.vsdb ...
Reading query input ...
Calculating fingerprints ...
Finished fingerprint generation in 6.04996 seconds
Calculating similarities ...
Finished calculating similarities in 0.08398 seconds
Writing 10 molecules to output file(s)
Generating output file(s) ...
Generating PDF file(s) ...
Calculating similarity maps for 10 matches ...
Finished: 06/21/2022, 22:06:56
Finished in 14.63942 seconds

A similarity map is a nice approach to visualize the similarity between the query (tofacitinib) and the hit compounds. This example used FCFP4 as the fingerprint; however, users can use other RDKit-supported fingerprints such as ECFP, RDKit, Atom, etc.
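The similarity score itself can be reproduced with the RDKit directly. A minimal sketch using FCFP-like Morgan fingerprints and Tanimoto similarity, with tofacitinib as the query; the candidate here is just an illustrative fragment I picked, not an actual hit from the run:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

query = Chem.MolFromSmiles("CC1CCN(C(=O)CC#N)CC1N(C)c1ncnc2[nH]ccc12")  # tofacitinib
candidate = Chem.MolFromSmiles("Cc1ncnc2[nH]ccc12")  # hypothetical database entry

# useFeatures=True gives FCFP-like invariants (cf. fcfp4 in the run above)
fp_q = AllChem.GetMorganFingerprintAsBitVect(query, 2, nBits=2048, useFeatures=True)
fp_c = AllChem.GetMorganFingerprintAsBitVect(candidate, 2, nBits=2048, useFeatures=True)
sim = DataStructs.TanimotoSimilarity(fp_q, fp_c)
print(sim)
```

Ranking the whole database by this score and keeping the top N is essentially what the fpsim subcommand automates, plus the PDF/similarity-map output.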

The final example is shape similarity. To do it, the vsdb should have 3D structure information, so I got 3D data from Ligand Expo and made a vsdb.

Data link is below.
http://ligand-expo.rcsb.org/dictionaries/Components-pub.sdf.gz

Then I ran the shape similarity search. It took a long time compared to commercial packages such as ROCS, but it generated nice output.

$ vsflow shape -smi "CC1CCN(C(=O)CC#N)CC1N(C)c1ncnc2[nH]ccc12" -d pdb_ligs3d.vsdb -o shapesmi -np 6 --pymol
**************************

 VV        VV  SSSSSSS             VSFlow
  VV      VV  SSS    SS       Virtual Screening
   VV    VV    SSSS               Workflow
    VV  VV       SSSS
     VVVV     SS    SSS
      VV       SSSSSSS

**************************

Start: 06/21/2022, 22:43:11
Running in parallel mode on 6 threads
Reading database ...
Reading query ...
Performing shape screening ...
Generating 3D conformer(s) for 1 query molecule(s)
Generating PyMOl file ...
Finished: 06/22/2022, 02:16:54
Finished in 12822.98383 seconds

Here is an example output of the shape similarity screening. Green is the query molecule. As you can see, VSFlow retrieved molecules with a similar 3D shape.
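For reference, the RDKit also exposes building blocks for this kind of shape comparison. A rough sketch of the idea (not VSFlow's implementation) using ShapeTanimotoDist, where a distance of 0 means identical shapes; here the same conformer is compared with itself, so the distance should be ~0:

```python
from rdkit import Chem
from rdkit.Chem import AllChem, rdMolAlign, rdShapeHelpers

# Embed the same molecule twice with the same seed -> identical conformers
ref = Chem.AddHs(Chem.MolFromSmiles("CC1CCN(C(=O)CC#N)CC1N(C)c1ncnc2[nH]ccc12"))
probe = Chem.AddHs(Chem.MolFromSmiles("CC1CCN(C(=O)CC#N)CC1N(C)c1ncnc2[nH]ccc12"))
AllChem.EmbedMolecule(ref, randomSeed=1)
AllChem.EmbedMolecule(probe, randomSeed=1)

# Align first, then compare the overlap of the two shapes on a grid
rdMolAlign.AlignMol(probe, ref)
dist = rdShapeHelpers.ShapeTanimotoDist(probe, ref)
print(dist)
```

In a real screen you would embed conformers for every database molecule and rank by this distance, which is why the run above took hours.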

In summary, VSFlow is a useful package for chemoinformatics.

More details are described in the ChemRxiv preprint and the repository's wiki.

https://chemrxiv.org/engage/chemrxiv/article-details/628c60215d9485a206cc8ecc

Call RDKit from Rust with conda env ver2 #RDKit #RDKit-sys #Rust #Chemoinformatics

To use the RDKit from Rust, I introduced rdkit-sys before. And fortunately, recent versions of rdkit-sys cleanly support building against a conda env. It's worth using a conda env to build rdkit-sys because you don't need to build the RDKit from source code.

The following code is almost the same as in my previous post, but I would like to share it.

At first, I cloned rdkit from rdkit-rs.

$ gh repo clone rdkit-rs/rdkit
$ cd rdkit

Then I edited Cargo.toml. I modified the dependencies part as below, adding the features=["dynamic-linking-from-conda"] option, and then added the conda lib directory to LD_LIBRARY_PATH to use it.

[package]
name = "rdkit"
version = "0.2.11"
edition = "2021"
authors = ["Xavier Lange <xrlange@gmail.com>"]
license = "MIT"
description = "High level RDKit functionality for rust"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
bitvec = "1"
cxx = "1"
log = "0.4"
#iwatobipen modified
rdkit-sys = {version="0.2.14", features=["dynamic-linking-from-conda"]}
flate2 = "1"

[dev-dependencies]
env_logger = "0.9.0"

## from bash
(base) $ conda activate chemo_py310
(chemo_py310)$ export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
(chemo_py310)$ cargo test # all tests will pass

Then I wrote rust_rdkit_v3.

$ cargo new rust_rdkit_v3
$ cd rust_rdkit_v3
$ vim src/main.rs
use std::env;
use std::path::PathBuf;
use rdkit;

fn main() {
    let args: Vec<String> = env::args().collect();
    let sdgz_filename = &args[1];

    println!("filename is {}", sdgz_filename);
    let mol_block_iter =
        rdkit::MolBlockIter::from_gz_file(sdgz_filename, false, false, false).unwrap();
    let mol_list = mol_block_iter.collect::<Vec<_>>();
    for m in mol_list {
        let smi = m.unwrap().as_smile();
        println!("{}", smi)
    }
    println!("Done");
}

$ vim Cargo.toml
[package]
name = "rust_rdkit_v3"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
#rdkit = "0.2.11"
rdkit={path="../rdkit"}

After writing the code, build it.

$ cargo build --release

Now I could run the code.

$ ./target/release/rust_rdkit_v3 cdk2.sdf.gz
filename is cdk2.sdf.gz
[H]C1=NC2=C(N=C(N([H])[H])N=C2OC([H])([H])C(=O)C([H])(C([H])([H])[H])C([H])([H])[H])N1[H]
[H]C1=NC2=C(OC([H])([H])C3([H])OC([H])([H])C([H])([H])C3([H])[H])N=C(N([H])[H])N=C2N1[H]
[H]C1=NC2=C(OC([H])([H])C3([H])N([H])C(=O)C([H])([H])C3([H])[H])N=C(N([H])[H])N=C2N1[H]
[H]C1=NC2=C(OC([H])([H])C3([H])C([H])([H])C([H])([H])C([H])([H])C([H])([H])C3([H])[H])N=C(N([H])[H])N=C2N1[H]
[H]C1=NC2=C(OC([H])([H])C3([H])C([H])([H])C([H])=C([H])C([H])([H])C3([H])[H])N=C(N([H])[H])N=C2N1[H]
[H]OC([H])([H])C([H])([H])N([H])C1=NC(N([H])C([H])([H])C2=C([H])C([H])=C([H])C([H])=C2[H])=C2N=C([H])N(C([H])([H])[H])C2=N1
....

In summary, the current rdkit-sys supports conda envs, which makes it easy to call the RDKit from Rust.

Thanks!

Generate molecules with Link-INVENT #RDKit #REINVENT #chemoinformatics

Molecular generators based on deep learning generally have difficulty keeping a substructure fixed. But common SAR expansion by medchem focuses on a specific part, such as a terminal group, the core, or the linker.

A linker is defined as a part which connects the fragments of a molecule. The linker is important in drug design (it is called a scaffold when it connects two or three parts) and is also important in PROTAC design. So it's useful if a deep learning based generator can design linkers ;). As you know, DeLinker (https://github.com/fimrie/DeLinker) is one useful and cool code for linker design.

Recently, the AstraZeneca team disclosed a cool linker generator for REINVENT named Link-INVENT, which is implemented in REINVENT v3.2. Link-INVENT is a new option of REINVENT for linker design with many useful scoring options.

Link-INVENT can use the following linker-design-specific scoring functions. As you can see, the package can design very specific linkers.

LinkerEffectiveLength
LinkerGraphLength
LinkerLengthRatio
LinkerNumRings
LinkerNumAliphaticRings
LinkerNumAromaticRings
LinkerNumSPAtoms
LinkerNumSP2Atoms
LinkerNumSP3Atoms
LinkerNumHBA
LinkerNumHBD
LinkerMolWeight
LinkerRatioRotatableBonds

I was interested in the code and tried it. Fortunately, the REINVENT community provides useful example code in ReinventCommunity. OK, let's test it!

At first I modified code from ReinventCommunity and added a scoring function.

I added the 'linker_graph_length' scoring function to control the length of the generated molecules.

After running the learning process, I checked it with TensorBoard. It's really useful to review the learning process because I can see not only the learning score but also the structures at each learning step.

Here is the first output of the learning process.

And here are the molecules from the last learning step.

The molecules generated at the last learning step have a rigid, short linker, as defined by the scoring function.

In summary, I think REINVENT is a cool and flexible molecular generator. I respect the authors' work and thank them for sharing the nice code ;)

Call RDKit from Rust with conda env #RDKit #Rust #Chemoinformatics

I previously introduced rdkit-sys, which is a Rust wrapper of the RDKit. The package is under active development, so the code changes frequently. To use the package, users needed to build the RDKit with the static-link option ON. Building rdkit-sys against static libraries (.a) is efficient for portability of the code; however, it means that it's difficult to use a conda env RDKit, because conda envs provide only dynamic libraries (.so).

So I would like to build rdkit-sys and call the RDKit from Rust with dylibs.

After some trial and error, I found a solution, and I would like to share it.

At first, I cloned rdkit-sys on my PC.

$ gh repo clone rdkit-rs/rdkit-sys
$ cd rdkit-sys

Then I edited the build.rs script as shown below.

fn main() {
    if std::env::var("DOCS_RS").is_ok() {
        return;
    }

    env_logger::init();

    let library_root = match (std::env::consts::OS, std::env::consts::ARCH) {
        ("macos", "x86_64") => "/usr/local",
        ("macos", "aarch64") => "/opt/homebrew",
        //("linux", _) => "/usr",
        //following path is my conda env's path
        ("linux", _) => "/home/iwatobipen/miniconda3/envs/chemo_py310",
        (unsupported_os, unsupported_arch) => panic!(
            "sorry, rdkit-sys doesn't support {} on {} at this time",
            unsupported_os, unsupported_arch
        ),
    };

    let brew_lib_path = format!("{}/lib", library_root);
    let include = format!("{}/include", library_root);
    let rdkit_include = format!("{}/include/rdkit", library_root);

    let dir = std::fs::read_dir("src/bridge").unwrap();
    let rust_files = dir
        .into_iter()
        .filter_map(|p| match p {
            Ok(p) => {
                if p.metadata().unwrap().is_file() {
                    Some(p.path())
                } else {
                    None
                }
            }
            Err(_) => None,
        })
        .filter(|p| !p.ends_with("mod.rs"))
        .collect::<Vec<_>>();

    let mut cc_paths = vec![];

    let wrapper_root = std::path::PathBuf::from("wrapper");
    for file in &rust_files {
        let file_name = file.file_name().unwrap();
        let file_name = file_name.to_str().unwrap();
        let base_name = &file_name[0..file_name.len() - 3];

        let cc_path = wrapper_root.join("src").join(format!("{}.cc", base_name));
        let meta = std::fs::metadata(&cc_path).unwrap();
        if !meta.is_file() {
            panic!("{} must exist", cc_path.display())
        }
        cc_paths.push(cc_path);

        let h_path = wrapper_root
            .join("include")
            .join(format!("{}.h", base_name));
        let meta = std::fs::metadata(&h_path).unwrap();
        if !meta.is_file() {
            panic!("{} must exist", h_path.display())
        }
    }

    cxx_build::bridges(rust_files)
        .files(cc_paths)
        .include(include)
        .include(rdkit_include)
        .include(std::env::var("CARGO_MANIFEST_DIR").unwrap())
        .flag("-std=c++14")
        .warnings(false)
        // rdkit has warnings that blow up our build. we could enumerate all those warnings and tell
        // the compiler to allow them... .warnings_into_errors(true)
        .compile("rdkit");

    println!("cargo:rustc-link-search=native={}", brew_lib_path);
    // println!("cargo:rustc-link-lib=static=c++");

    for lib in &[
        "Catalogs",
        "ChemReactions",
        "ChemTransforms",
        "DataStructs",
        "Descriptors",
        "FileParsers",
        "Fingerprints",
        "GenericGroups",
        "GraphMol",
        "MolStandardize",
        "RDGeneral",
        "RDGeometryLib",
        "RingDecomposerLib",
        "SmilesParse",
        "Subgraphs",
        "SubstructMatch",
    ] {
        //swich static link to dynamic link!!!
        //println!("cargo:rustc-link-lib=static=RDKit{}_static", lib);
        println!("cargo:rustc-link-lib=dylib=RDKit{}", lib);
    }
    println!("cargo:rustc-link-lib=static=boost_serialization");
}

After the modification, I set up LD_LIBRARY_PATH.

export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH

After that, I could run cargo build in the rdkit-sys folder ;). Then I cloned rdkit-rs/rdkit to build against the modified crate. To do that, I edited Cargo.toml like below.

$ gh repo clone rdkit-rs/rdkit
$ cd rdkit
$ vim Cargo.toml
[package]
name = "rdkit"
version = "0.2.6"
edition = "2021"
authors = ["Xavier Lange <xrlange@gmail.com>"]
license = "MIT"
description = "High level RDKit functionality for rust"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
cxx = "1"
log = "0.4"
#rdkit-sys = "0.2.7"
rdkit-sys = { path = "../rdkit-sys"} # I used modified version of rdkit-sys
flate2 = "1"

After that, I wrote code using these packages.

My test code is below.

#Cargo.toml
[package]
name = "rdkrust"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
rdkit = { path = "/home/iwatobipen/dev/sandbox/rusttest/rdkit" }
#src/main.rs
use std::env;
use rdkit;

fn main() {
    let args: Vec<String> = env::args().collect();
    let sdgz_filename = &args[1];

    println!("filename is {}", sdgz_filename);
    let mol_block_iter =
        rdkit::MolBlockIter::from_gz_file(sdgz_filename, false, false, false).unwrap();
    let mol_list = mol_block_iter.collect::<Vec<_>>();
    for m in mol_list {
        let smi = m.unwrap().as_smile();
        println!("{}", smi)
    }
    println!("Done");
}
    

The code reads an sdf.gz file, retrieves the molecules, and then converts them to SMILES.

OK, let’s check the code.

$ cargo build
#Then
$ target/debug/rdkrust cdk2.sdf.gz
filename is cdk2.sdf.gz
[H]C1=NC2=C(N=C(N([H])[H])N=C2OC([H])([H])C(=O)C([H])(C([H])([H])[H])C([H])([H])[H])N1[H]
[H]C1=NC2=C(OC([H])([H])C3([H])OC([H])([H])C([H])([H])C3([H])[H])N=C(N([H])[H])N=C2N1[H]
'''snip
'''
Done

It seems to work fine. In summary, rdkit-rs is a really cool project for chemoinformatics because Rust is a really efficient language. I would like to learn Rust more and develop useful apps for chemoinformatics.

Thanks for reading.

Run python script in html file #memo #coding

I have written lots of chemoinformatics apps as web apps. To do that I often use Flask / Django. They are Python packages, so most parts can be written in Python; however, it's difficult to embed Python directly in an HTML document.

Recently I found a cool package for embedding Python code in HTML named PyScript. It's really cool!

I tested the simple code shown below.

The directory layout is:

root/
  webserver.py
  contents/
    test1.html
    test2.html
    test3.html

And the code is below. For the test, I used Python's http package.

#webserver.py
import os
from http.server import HTTPServer, SimpleHTTPRequestHandler

# serve the HTML files from the contents/ directory
os.chdir(os.path.join(os.path.dirname(__file__), 'contents'))
server = HTTPServer(('', 8000), SimpleHTTPRequestHandler)
server.serve_forever()
#contents/test1.html
<html>
  <head>
    <link rel="stylesheet" href="https://pyscript.net/alpha/pyscript.css" />
    <script defer src="https://pyscript.net/alpha/pyscript.js"></script>
  </head>
  <body> <py-script> print('Hello, World!') </py-script> </body>
</html>
#contents/test2.html
<html>
    <head>
      <link rel="stylesheet" href="https://pyscript.net/alpha/pyscript.css" />
      <script defer src="https://pyscript.net/alpha/pyscript.js"></script>
      <py-env>
        - matplotlib
      </py-env>
    </head>

  <body>
    <h1>Let's plot random numbers</h1>
    <div id="plot"></div>
    <py-script output="plot">
import matplotlib.pyplot as plt
import numpy as np
x = np.random.randn(1000)
y = np.random.randn(1000)

fig, ax = plt.subplots()
ax.scatter(x, y)
fig
    </py-script>
  </body>
</html>
#contents/test3.html
<html>
    <head>
      <link rel="stylesheet" href="https://pyscript.net/alpha/pyscript.css" />
      <script defer src="https://pyscript.net/unstable/pyscript.js"></script>
      <py-env>
        - pandas
        - scikit-learn
      </py-env>
    </head>

  <body>
<div>
    <py-script>
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
data = load_iris()
df = pd.DataFrame(data.data)
df['target'] = data.target
df.head(5)
    </py-script>
</div>
  </body>
</html>

After writing the code above, run the server.

$ python webserver.py

Then a simple web server will run, and I could get the following view from each page.

As you can see, the main parts of these pages are built with pure Python code, without JavaScript. It's really interesting.

But there are limitations. PyScript doesn't support every package on PyPI, even though major packages are supported.

Unfortunately, RDKit is not supported either.

BTW, lots of packages are available. If readers are interested in PyScript, check the original documentation and examples!

https://github.com/pyscript/pyscript/tree/main/pyscriptjs/examples

Generate molecules from molecular formula #Chemoinformatics #memo #jcheminf

Most chemoinformaticians will think that C6H6 means benzene and that its SMILES string will be 'c1ccccc1'.

However, how many possible structures do you think can be generated from the molecular formula C6H6?

…..

Yah, it’s interesting but difficult question.

Recently I read an interesting article published in the Journal of Cheminformatics. The title is 'Surge: a fast open-source chemical graph generator'.

https://jcheminf.biomedcentral.com/articles/10.1186/s13321-022-00604-9

The authors developed a fast chemical graph generator which generates molecules from a molecular formula.

To generate chemical graphs from a formula, several steps are required: generating the underlying graphs, checking automorphisms, and assigning bond multiplicities.

In the case of C6H6, over 200 molecules are generated with surge!!!

Fortunately, a binary version of surge is provided at the following URL.
https://structuregenerator.github.io/

So I used it. At first, I got the program from the URL above and generated molecules from the formula C6H6.

$ ./surge-linux-v1.0 -o hoge.sdf C6H6

Then hoge.sdf was generated, and I checked the generated structures.

(I checked the generated structures in a notebook, C6H6.ipynb, embedded here as a GitHub gist.)

As described in the article, surge has a limitation: the current version doesn't perform a Hückel aromaticity test. It means surge will generate duplicate structures for kekulé versions of aromatic rings.
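One simple way to collapse such kekulé duplicates afterwards is to canonicalize the generated SMILES with RDKit, which perceives aromaticity. A minimal sketch (the difluorobenzene SMILES are just an illustrative example, not surge output):

```python
from rdkit import Chem

# Two different kekulé assignments of the same aromatic ring
# (1,2-difluorobenzene); RDKit canonicalization merges them.
kekule_forms = ['FC1=C(F)C=CC=C1', 'FC1=CC=CC=C1F']
canonical = {Chem.CanonSmiles(smi) for smi in kekule_forms}
print(canonical)  # a single aromatic SMILES remains
```

Applying the same canonicalization across an SDF generated by surge would deduplicate the kekulé variants.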

However, it works fast and is an interesting tool for molecular generation. BTW, it's still difficult to filter the generated molecules by desired compound properties in the drug discovery field.

Use RDKit from Rust v2 #RDKit #Rust

I enjoyed the 18th mishima.syk meeting last weekend. I think the community is really cool and worth joining to catch the cutting edge of chemo/bioinformatics ;) Feel free to participate and present if you are interested in the meeting.

In the meeting, yamasakit_ introduced "Rust basics" with live coding! Fortunately, his presentation material is available from the mishima.syk 18 repo (written in Japanese)!

I'm interested in Rust and have posted before about how to integrate RDKit and Rust. In the previous post, rdkitcffi was required to use RDKit functionality from Rust. I felt that the process was a little annoying.

Recently, Xavier Lange developed rdkit-sys and registered it on crates.io. By using the package, users don't need rdkitcffi for their Rust code development.

OK let’s write code!

At first I made example project.

$ cargo new rdkrust
$ cd rdkrust

Then, add rdkit-sys = "0.1.10" to Cargo.toml.

[package]
name = "rdkrust"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
rdkit-sys = "0.1.10"

Next, I wrote main.rs. The following code converts molecules from an SDF file to SMILES. mol_opt_list.into_iter().filter_map(|m| m).collect() is the same as mols = [mol for mol in mols if mol is not None] in Python.

use std::env;
use rdkit_sys::molecule::Molecule;
use rdkit_sys::molecule::read_sdfile;
fn main() {
    let args: Vec<String> = env::args().collect();
    let sd_filename = &args[1];
    println!("filename is {}", sd_filename);
    let mol_opt_list: Vec<Option<Molecule>> = read_sdfile(sd_filename);
    let mut mol_list: Vec<Molecule> = mol_opt_list.into_iter().filter_map(|m| m).collect();
    mol_list.iter_mut().for_each(|m| m.remove_all_hs());
    for m in mol_list {
        let smi = m.get_smiles("");
        println!("{}", smi)
    }
    println!("Done");
}

Then compile the code! The process will take a few minutes.

$ cargo build

After the process, the rdkrust command is built in the target/debug folder. Check it.

$ target/debug/rdkrust cdk2.sdf
filename is cdk2.sdf
CC(C)C(=O)COc1nc(N)nc2[nH]cnc12
Nc1nc(OCC2CCCO2)c2nc[nH]c2n1
Nc1nc(OCC2CCC(=O)N2)c2nc[nH]c2n1
Nc1nc(OCC2CCCCC2)c2nc[nH]c2n1
Nc1nc(OCC2CC=CCC2)c2nc[nH]c2n1
Cn1cnc2c(NCc3ccccc3)nc(NCCO)nc21
CCC(CO)Nc1nc(NCc2ccccc2)c2ncn(C(C)C)c2n1
COc1ccc(CNc2nc(N(CCO)CCO)nc3c2ncn3C(C)C)cc1
Nc1nc(N)c(N=O)c(OCC2CCCCC2)n1
COc1ccc2c(c1)/C(=C/c1cnc[nH]1)C(=O)N2
COc1cc[nH]c1/C=C1\C(=O)Nc2ccc([N+](=O)[O-])cc21
CCc1cnc(CSc2cnc(NC(=O)C(C)C)s2)o1
CCCCOc1c(C(=O)c2nccs2)cnc2[nH]ncc12
COc1cc(-c2ccc[nH]2)c2c3c(ccc(F)c13)NC2=O
[NH3+]CCSc1cc(-c2ccc[nH]2)c2c3c(ccc(F)c13)NC2=O
NC(=O)Nc1cccc2c1C(=O)c1c-2n[nH]c1-c1cccs1
COc1ccc2c(c1)/C(=C/c1[nH]cc3c1CCOC3=O)C(=O)N2
CNS(=O)(=O)c1ccc2c(c1)/C(=C/c1[nH]cc3c1CCNC3=O)C(=O)N2
CC(=O)Nc1cccc2c1C(=O)c1c-2n[nH]c1-c1ccncc1
COc1ccc(-c2[nH]nc3c2C(=O)c2c(NC(N)=O)cccc2-3)cc1
COc1ccc(-c2[nH]nc3c2C(=O)c2c(NC(=O)NN(C)C)cccc2-3)cc1
Cc1nc2ccccn2c1-c1ccnc(Nc2ccccc2)n1
Cc1nc2ccccn2c1-c1ccnc(Nc2ccc(OCC(O)C[NH+](C)C)cc2)n1
O=C(Nc1ccccn1)Nc1cccc2c1C1CCCN1C2=O
NS(=O)(=O)c1ccc(N/N=C2\C(=O)Nc3ccc(Br)cc32)cc1
CNS(=O)(=O)c1ccc(N/C=C2\C(=O)Nc3ccccc32)cc1
N=C(N)NS(=O)(=O)c1ccc(N/C=C2\C(=O)Nc3ccccc32)cc1
O=C1Nc2ccc(S(=O)(=O)O)cc2/C1=C1/Nc2ccccc2C1=O
c1ccc(Nc2nc(OCC3CCCCC3)c3nc[nH]c3n2)cc1
NS(=O)(=O)c1ccc(Nc2nc(OCC3CCCCC3)c3nc[nH]c3n2)cc1
O=C(c1ccccc1)c1cnc2n[nH]cc2c1OCc1ccccc1
NS(=O)(=O)c1ccc(Nc2cc(-c3ccc([N+](=O)[O-])cc3)[nH]n2)cc1
CCCCOc1c(C(=O)c2c(F)cc(Br)cc2F)cnc2[nH]ncc12
Cc1ccc(F)c(Nc2ccnc(Nc3ccc(S(N)(=O)=O)cc3)n2)c1
CC(C)C(CO)Nc1nc(Nc2cccc(Cl)c2)c2ncn(C(C)C)c2n1
CC(C)C(CO)Nc1nc(Nc2ccc(C(=O)[O-])c(Cl)c2)c2ncn(C(C)C)c2n1
[NH3+]C1CCC(Nc2nc(NCC3CC3)c3ncn(C4CCCC4)c3n2)CC1

It seems to work well. Due to my lack of C++ knowledge, I can't implement MCS in rdkitcffi. I think it would be great to call the FindMCS function from Rust, because the MCS search is computationally expensive and I would like to run it faster. Please let me know if any reader can implement it in rdkitcffi ;)
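For reference, the MCS functionality I'd like to call from Rust corresponds to rdFMCS.FindMCS on the Python side. A quick sketch with two toy molecules (not from the cdk2.sdf set above):

```python
from rdkit import Chem
from rdkit.Chem import rdFMCS

# Find the maximum common substructure of 2-phenylethanol and phenethylamine;
# the shared part is the phenyl ring plus the two-carbon linker.
mols = [Chem.MolFromSmiles('OCCc1ccccc1'), Chem.MolFromSmiles('NCCc1ccccc1')]
result = rdFMCS.FindMCS(mols)
print(result.numAtoms, result.smartsString)
```

On larger and more diverse sets this search gets slow quickly, which is why a fast Rust binding would be attractive.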

Thanks to Xavier Lange for developing a cool package for Rust!

Make curated Kinase inhibitor dataset from ChEMBL30 #memo #chemoinformatics

Kinases are attractive targets for drug discovery, so there is a lot of data available, not only for the proteins but also for their inhibitors.

ChEMBL is a useful public data source for kinase inhibitor data; however, to use the data, we need to retrieve it from the DB and curate it. Of course, there are commercial databases focused on kinases, but they are not freely available. I would like to use the data conveniently in my hobby ;).

If you think so too, I recommend checking the following repository, openkinome/kinodata. The URL is below.

https://github.com/openkinome/kinodata

The repository provides useful notebooks for getting kinase-related datasets from ChEMBL29. Fortunately, we can now use ChEMBL30! So I modified the notebooks for ChEMBL30 and made a kinase-inhibitor and activities dataset.

At first, I cloned the repository and downloaded chembl_30_sqlite.tar.gz (the SQLite dump) from the ChEMBL site. After that, I ran 'kinases_in_chembl.ipynb' with ChEMBL30.

human_kinases.aggregated.csv, which is provided in the original repo, was used in the following code. Most of the code in this blog post comes from the kinodata repo; I appreciate the authors' great work.
Comparing ChEMBL29 and ChEMBL30, there are more data in ver. 30.

After running the code, I could get 'human_kinases_and_chembl_targets.chembl_30.csv'.

As the next step, I ran kinase-bioactivities-in-chembl.ipynb to make a CSV which has structure and biological activity information.

Here is the code, embedded as a gist.


The code above is worth reading because it contains a good SQL example for data extraction from the ChEMBL DB, and a useful data curation process is provided.

Finally, I added a PCA analysis with fingerprints of the potent compounds in the dataset. I used useful_rdkit_utils to convert molecules to fingerprint arrays. By using the package, I could get molecular fingerprints in a few lines of code.

In summary, using these notebooks I could make a kinase inhibitor dataset conveniently. I'll use it for many chemoinformatics tasks ;)

Build peptide from monomer library from ChEMBL #RDKit #ChEMBL #Chemoinformatics

Recently, ChEMBL ver. 30 was released. I've installed it on my PC and added the RDKit schema ;) The current ChEMBL FTP site also provides a HELM monomer library.

@magattaca posted a really useful blog post in Japanese: https://magattaca.hatenablog.com/entry/2020/11/25/004829. The post describes HELM and its monomers, and shows how to render them.

I'm interested in how to build peptides from these monomers. And recently, @dr_greg_landrum introduced how to build a molecule from fragments with the molzip function.

So I thought that by using the molzip function and the monomer library provided by ChEMBL, it would be easy to build peptides from monomers.

To do that, I defined 3 functions:
1) combine_fragments, which combines two monomers at the N-terminal and C-terminal as an amide.
2) make_peptide, which builds a peptide from a list of monomers.
3) cap_terminal, which caps the termini of the peptide.

The main functions are shown below.

import copy
from rdkit import Chem
from rdkit.Chem import molzip

def combine_fragments(m1, m2):
    m1 = Chem.Mol(m1)
    m2 = Chem.Mol(m2)
    for atm in m1.GetAtoms():
        if atm.HasProp('atomLabel') and atm.GetProp('atomLabel') == '_R2':
            atm.SetAtomMapNum(1)
    for atm in m2.GetAtoms():
        if atm.HasProp('atomLabel') and atm.GetProp('atomLabel') == '_R1':
            atm.SetAtomMapNum(1)
    return molzip(m1, m2)

def make_peptide(monomerlist):
    monomerlist = copy.deepcopy(monomerlist)
    res = None
    for monomer in monomerlist:
        # skip monomers with only one attachment point
        if Chem.MolToSmiles(monomer).count("*") == 1:
            continue
        if res is None:
            res = monomer
        else:
            res = combine_fragments(res, monomer)
    return res

def cap_terminal(m):
    m = Chem.Mol(m)
    n_term = Chem.MolFromSmiles('CC(=O)[*:1]')
    c_term = Chem.MolFromSmiles('CO[*:2]')
    for atm in m.GetAtoms():
        if atm.HasProp('atomLabel') and atm.GetProp('atomLabel') == '_R1':
            atm.SetAtomMapNum(1)
        if atm.HasProp('atomLabel') and atm.GetProp('atomLabel') == '_R2':
            atm.SetAtomMapNum(2)
    res = molzip(m, n_term)
    res = molzip(res, c_term)
    return res
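A minimal self-contained example of the welding step: the glycine-like SMILES and the _R1/_R2 labels below just mimic the ChEMBL monomer convention, and the map-number trick is the same as in combine_fragments above.

```python
from rdkit import Chem

def make_monomer(smiles):
    # Label the first dummy atom _R1 (N-terminal attachment point) and
    # the second _R2 (C-terminal), mimicking the HELM monomer library.
    mol = Chem.MolFromSmiles(smiles)
    labels = iter(['_R1', '_R2'])
    for atm in mol.GetAtoms():
        if atm.GetAtomicNum() == 0:
            atm.SetProp('atomLabel', next(labels))
    return mol

gly1 = make_monomer('*NCC(=O)*')
gly2 = make_monomer('*NCC(=O)*')

# Pair gly1's C-terminal dummy with gly2's N-terminal dummy via atom map 1,
# then weld them into an amide with molzip.
for atm in gly1.GetAtoms():
    if atm.HasProp('atomLabel') and atm.GetProp('atomLabel') == '_R2':
        atm.SetAtomMapNum(1)
for atm in gly2.GetAtoms():
    if atm.HasProp('atomLabel') and atm.GetProp('atomLabel') == '_R1':
        atm.SetAtomMapNum(1)
dipeptide = Chem.molzip(gly1, gly2)
print(Chem.MolToSmiles(dipeptide))  # glycylglycine with two open attachment points
```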

Here is an example of a molecule built from two monomers.


And here is an example of a peptide built from more than two monomers. The bottom-right molecule is the final product. My function omits monomers which have only one "*" (attachment point).

It works fine. I'll add a ring-closure function for making macrocyclic peptides.

The monomer library has lots of amino acids, so it's an interesting dataset for chemoinformatics.

The whole code is uploaded to my gist.


Make interactive chemical space plot with Dash #RDKit #useful_rdkit_utils #Dash #Chemoinformatics

Recently I posted about a really useful package named molplotly. The package can make an interactive plot that shows the compound structure on hover.

Inspired by the activity, I tried to write code for chemical space mapping with interactive rendering of molecules.

The base code came from my previous post. There are small changes in the Dash part because I used a new version of Dash in this code.

Here is the code. The points of the code are below.

  1. I used BytesIO to read the uploaded file without saving it as a temporary file.
  2. I loaded an external CSS file to make a horizontal layout of the compound structure and the scatter plot.
import base64
import os
import io
import numpy as np
import pandas as pd
import flask
import dash
import dash_dangerously_set_inner_html as dhtml
from dash import dcc
from dash import html
from dash.dependencies import Input, Output, State
import plotly.express as px
import plotly.graph_objects as go
from rdkit import Chem
from rdkit.Chem import PandasTools
from rdkit.Chem import Descriptors
from rdkit.Chem import Draw
from rdkit.Chem.Draw import rdDepictor
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import molplotly4flask
import useful_rdkit_utils
import dash_bootstrap_components as dbc


def smi2svg(smi):
    mol = Chem.MolFromSmiles(smi)
    rdDepictor.Compute2DCoords(mol)
    mc = Chem.Mol(mol.ToBinary())
    Chem.Kekulize(mc)
    drawer = Draw.MolDraw2DSVG(200,200)
    drawer.DrawMolecule(mc)
    drawer.FinishDrawing()
    svg = drawer.GetDrawingText().replace('svg:','')
    return svg

upload_style = {
    "width": "50%",
    "height": "120px",
    "lineHeight": "60px",
    "borderWidth": "1px",
    "borderStyle": "dashed",
    "borderRadius": "5px",
    "textAlign": "center",
    "margin": "3% auto",
}

#https://stackoverflow.com/questions/63459424/how-to-add-multiple-graphs-to-dash-app-on-a-single-browser-page
external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']
app = dash.Dash(__name__, external_stylesheets=external_stylesheets)

app.config.suppress_callback_exceptions=True

vals = {'PC-1':'PC-1',
        'PC-2':'PC-2',
        'TSNE-1':'TSNE-1',
        'TSNE-2': 'TSNE-2'}


df_bar = pd.DataFrame({
    "Fruit": ["Apples", "Oranges", "Bananas", "Apples", "Oranges", "Bananas"],
    "Amount": [4, 1, 2, 2, 4, 5],
    "City": ["SF", "SF", "SF", "Montreal", "Montreal", "Montreal"]
})

fig2 = px.bar(df_bar, x="Fruit", y="Amount", color="City", barmode="group")


app.layout = html.Div(children=[
    html.H1(children='Hello Chemoinfo'),
        dcc.Upload(
            id='sdf',
            children=html.Div(['upload sdf']),
            style=upload_style,
        ),
     
    html.Div(children='''
    Dash : sample plot
    '''),
 
    html.Div([dcc.Dropdown(id='x-column',
                           value='PC-1',
                           options=[{'label': key, 'value': key} for key in vals.keys()],
                           style={'width':'48%', 'display':'inline-block'}),
              dcc.Dropdown(id='y-column',
                           value='PC-2',
                           options=[{'label': key, 'value': key} for key in vals.keys()],
                           style={'width':'48%', 'display':'inline-block'}),
                           ]),
    html.Div([
        html.Div([html.Div(id="molimg")], className="two columns"),
        html.Div([dcc.Graph(id='mol_graph')], className="eight columns")
        ], 
        className="row"
        ),
    
    #html.Div([dcc.Graph(id='chemical-space')])
    ])

def parse_content(contents, filename):

    content_type, content_string = contents.split(",")
    decoded = base64.b64decode(content_string)
    bio = io.BytesIO(decoded)
    bio.seek(0)
    try:
        df = PandasTools.LoadSDF(bio)
    except Exception as e:
        print(e)
        return html.Div([f"(unknown) error occurred during file reading"])
    X = [useful_rdkit_utils.mol2numpy_fp(m, 2, 1024) for m in df.ROMol]
    pca = PCA(n_components=2)
    tsne = TSNE()
    pca_res = pca.fit_transform(X)
    tsne_res = tsne.fit_transform(X)
    df['PC-1'] = pca_res[:,0]
    df['PC-2'] = pca_res[:,1]
    df['TSNE-1'] = tsne_res[:,0]
    df['TSNE-2'] = tsne_res[:,1]
    return df

@app.callback(
    Output('mol_graph', 'figure'),
    [Input('sdf', 'contents'),
    Input('x-column', 'value'),
    Input('y-column', 'value')],
    [State('sdf', 'filename')],
    prevent_initial_call=True
)
def drawgraph(contents,x_column_name, y_column_name ,filename):
    df = parse_content(contents, filename)
    return {'data':[go.Scatter(
        x=df[x_column_name],
        y=df[y_column_name],
        #text=['mol_{}'.format(i) for i in range(len(mols))],
        text=[Chem.MolToSmiles(mol) for mol in df.ROMol],
        mode='markers',
        marker={
            'size':15,
            'opacity':0.5
        }
    )],
    'layout':go.Layout(
        xaxis={'title':x_column_name},
        yaxis={'title':y_column_name}
    )}

@app.callback(
    Output('molimg', 'children'),
    [Input('mol_graph', 'hoverData'),
    ]
)
def update_img(hoverData):
    try:
        svg = smi2svg(hoverData['points'][0]['text'])
    except:
        svg = 'Select molecule'
    return dhtml.DangerouslySetInnerHTML(svg)

if __name__=="__main__":

    app.run_server(debug=True)
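The BytesIO trick in parse_content can be shown in isolation: Dash's dcc.Upload hands the callback a base64 data-URL string, which can be decoded into an in-memory buffer instead of a temporary file. A small sketch with a fake payload:

```python
import base64
import io

# Simulate the 'contents' string that dcc.Upload passes to the callback.
contents = "data:chemical/x-mdl-sdfile;base64," + base64.b64encode(b"fake sdf payload").decode()

# Split off the "data:...;base64" prefix, decode, and wrap in a file-like buffer.
content_type, content_string = contents.split(",")
bio = io.BytesIO(base64.b64decode(content_string))
bio.seek(0)
print(bio.read())  # b'fake sdf payload'
```

The resulting BytesIO object can be passed to anything expecting a file handle, such as PandasTools.LoadSDF above.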

After writing the code, run it with the following command.

$ python app.py

Then I could see an interactive chemical space plot of the uploaded file, like below ;)

Chemical space is represented with PCA/TSNE, and I used useful_rdkit_utils for calculating the molecular fingerprints. The package is worth using for chemoinformatics tasks.

With Dash, a Pythonista can make a really cool view in a short time, I think. It's really useful for chemoinformaticians. Enjoy OSS and coding!

Integration of molplotly and Flask for developing chemoinformatics web app #RDKit #molplotly #Flask

Some days ago, @WillMcCorki1 introduced a really cool package named molplotly. Thanks for sharing the information. Molplotly is an add-on to Plotly for rendering molecular images on mouseover, like TIBCO Spotfire.

The package (https://github.com/wjm41/molplotly) uses JupyterDash for Jupyter Notebook integration, so it's easy to embed an interactive chemoinformatics chart in a notebook. However, it's difficult to use the package outside of Jupyter Notebook, e.g. from Flask, Django, etc.

I'd never tried to integrate a Dash app into a Flask system, so I tried to integrate molplotly and a simple Dash app into Flask.

Dash uses Flask internally, so a slightly tricky approach is required to integrate Dash into Flask. Here is an example that serves multiple Dash apps from Flask.

DispatcherMiddleware organizes the multiple Dash apps. Note that run_simple is only suitable for testing, not for production usage. (I use gunicorn for production ;))

#app.py
from dash import Dash
from werkzeug.middleware.dispatcher import DispatcherMiddleware
from werkzeug.serving import run_simple
import flask

from flask import Flask
from dash import html
server = Flask(__name__)


dash_app1 = Dash(__name__, server = server, url_base_pathname='/dashboard/' )
dash_app2 = Dash(__name__, server = server, url_base_pathname='/reports/')
dash_app1.layout = html.Div([html.H1('Hi there, I am app1 for dashboards')])
dash_app2.layout = html.Div([html.H1('Hi there, I am app2 for reports')])

@server.route('/dashboard/')
def render_dashboard():
    return flask.redirect('/dash1')


@server.route('/reports/')
def render_reports():
    return flask.redirect('/dash2')

app = DispatcherMiddleware(server, {
    '/dash1': dash_app1.server,
    '/dash2': dash_app2.server,
})

run_simple('localhost', 8080, app, use_reloader=True, use_debugger=True)

The app shown above will run with 'python app.py'.

OK, let's integrate molplotly into Flask.

I changed the following line from 'JupyterDash' to 'Dash' and moved it outside of the function: https://github.com/wjm41/molplotly/blob/bde11c4a4861590e3151a7e5dd481e3c8d84b741/molplotly/main.py#L90

Most of the code is borrowed from the original molplotly repo, and my changed code is uploaded to my repository.

Here is molplotly4flask.py.

https://github.com/iwatobipen/molplotly4flask/blob/main/molplotly4flask.py

Next, write chemoinformatics web app with the code.

from dash import Dash
from werkzeug.middleware.dispatcher import DispatcherMiddleware
from werkzeug.serving import run_simple
import flask
from flask import Flask
from flask import render_template
from dash import html
import pandas as pd
import plotly
import plotly.express as px
import json
import molplotly4flask

server = Flask(__name__)

# Following example code is borrowed from molploty repo. Thanks!
# https://github.com/wjm41/molplotly
df_esol = pd.read_csv(
    'https://raw.githubusercontent.com/deepchem/deepchem/master/datasets/delaney-processed.csv')
df_esol['y_pred'] = df_esol['ESOL predicted log solubility in mols per litre']
df_esol['y_true'] = df_esol['measured log solubility in mols per litre']
df_esol['delY'] = df_esol["y_pred"] - df_esol["y_true"]

# This is a simple dash app
dash_app1 = Dash(__name__, server = server, url_base_pathname='/dashboard/' )
dash_app2 = Dash(__name__, server = server, url_base_pathname='/reports/')
dash_app1.layout = html.Div([html.H1('Hi there, I am app1 for dashboards')])
dash_app2.layout = html.Div([html.H1('Hi there, I am app2 for reports')])

# make scatter plot with plotly
fig_scatter = px.scatter(df_esol,
                         x="y_true",
                         y="y_pred",
                         color='delY',
                         title='ESOL Regression (default plotly)',
                         labels={'y_pred': 'Predicted Solubility',
                                 'y_true': 'Measured Solubility',
                                 'delY': 'dY'},
                         width=1200,
                         height=800)

# make chemoinformatics app with molploty ;)
molapp = Dash(__name__, server=server, url_base_pathname='/molplotly_test/')
molapp = app_scatter_with_captions = molplotly4flask.add_molecules(fig=fig_scatter,
                                                    df=df_esol,
                                                    app=molapp,
                                                    smiles_col='smiles',
                                                    title_col='Compound ID',
                                                    caption_cols=['Molecular Weight', 'Number of Rings'],
                                                    caption_transform={'Predicted Solubility': lambda x: f"{x:.2f}",
                                                                       'Measured Solubility': lambda x: f"{x:.2f}",
                                                                       'Molecular Weight': lambda x: f"{x:.2f}"
                                                                       },
                                                    show_coords=True)


@server.route('/plot1/')
def plot1():
    graphJSON = json.dumps(fig_scatter, cls=plotly.utils.PlotlyJSONEncoder)
    return render_template('plot1.html', graphJSON=graphJSON)

@server.route('/molplotly_test/')
def render_molplotly_test():
    print(type(molapp))
    return flask.redirect('/testmol')

@server.route('/dashboard/')
def render_dashboard():
    return flask.redirect('/dash1')


@server.route('/reports/')
def render_reports():
    return flask.redirect('/dash2')

app = DispatcherMiddleware(server, {
    '/dash1': dash_app1.server,
    '/dash2': dash_app2.server,
    '/testmol': molapp.server
})

run_simple('localhost', 8080, app, use_reloader=True, use_debugger=True)

After writing the code, run it.

$ python app.py

Access http://localhost:8080/plot1/ and the default Plotly chart will be rendered; then access http://localhost:8080/testmol/ and the same chart will be rendered with structure tooltips!

Here are gif animation of the web app.

default web app
web app with molplotly

molplotly4flask seems to work well ;)

By using this approach, it's easy to integrate rich chemoinformatics graphs into a Flask web app, I think.

The whole code, including templates, can be found in my repo.

https://github.com/iwatobipen/molplotly4flask

Easy way to visualize SMARTS #chemoinformatics #memo

SMARTS, a language for describing molecular patterns much like regular expressions for text, is really useful for chemoinformaticians. However, it's difficult to understand because SMARTS queries are hard to visualize.

As far as I know, there is little software which can visualize SMARTS patterns beautifully. BioSolveIT provides a unique SMARTS editor, but it requires a commercial license, so it's difficult to use personally.

So I used another free tool named 'SMARTS.plus'. The tool is managed by a team at the University of Hamburg.

The tool is easy to use from the web site, and the site also provides a REST service, so users can use SMARTS.plus from Python. I tried it. Here is an example; as you can see, SMARTS.plus provides rich information about a SMARTS pattern.


A tip for using SMARTS.plus via REST: to pass the query with the GET method, some characters must be URL-encoded. The details are described at the following URL.
https://smarts.plus/rest
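For example, the brackets, parentheses and '=' in a SMARTS have to be percent-encoded before going into a GET query. The standard library handles this; only the encoding step is shown here (the endpoint details are on the page above):

```python
from urllib.parse import quote

# Percent-encode a SMARTS pattern (carboxylic acid) for use in a GET query.
smarts = '[CX3](=O)[OX2H1]'
encoded = quote(smarts, safe='')
print(encoded)  # %5BCX3%5D%28%3DO%29%5BOX2H1%5D
```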

I think visualizing chemical structures and queries is really interesting and useful for learning chemoinformatics.

Let's update MMPDB #RDKit #mmpdb #Chemoinformatics

Matched molecular pairs are a popular approach to transforming molecules with prior knowledge of medicinal chemistry. MMPDB, derived from GSK under the 3-clause BSD license, is a useful open-source package for managing MMP datasets. It is easy to use and works really fast. I love the package. However, the current version of MMPDB supports only SQLite3. SQLite3 is easy to use and works without a server.

I would like to build MMPDB on PostgreSQL because the RDKit community provides a chemical cartridge for PostgreSQL. So it would be useful if it were possible to make a PostgreSQL MMPDB.

Fortunately, Andrew Dalke recently shared a new version of MMPDB which supports PostgreSQL!

The version isn't in the main branch yet, but you can get it from the v3-dev branch: https://github.com/adalke/mmpdb/tree/v3-dev/mmpdblib/cli

OK let’s use it.

At first, clone the mmpdb and install it.

$ gh repo clone adalke/mmpdb -- -b v3-dev
$ cd mmpdb
$ pip install -e .

Then modify mmpdblib/index_writers.py.

    # From L436-
    '''
    def add_rule_environment(self, rule_env_idx, rule_idx, env_fp_idx, radius):
        self._rule_environment_values.append(
            (rule_env_idx, rule_idx, env_fp_idx, radius))
        if next(self._check_flush):
            self.flush()
    '''
    # To (added num_pairs)
    def add_rule_environment(self, rule_env_idx, rule_idx, env_fp_idx, radius, num_pairs=0):
        self._rule_environment_values.append(
            (rule_env_idx, rule_idx, env_fp_idx, radius, num_pairs))
        if next(self._check_flush):
            self.flush()

Then create test mmpdb.

$ createdb mmpdbtest

Next, make fragmentdb with mmpdb fragment command.

$ head -n 10 herg_data.txt
canonical_smiles chembl_id molregno activity_id standard_value standard_units
O=S(=O)(c1ccccc1)C1(F)CCN(CCc2ccc(F)cc2F)CC1 CHEMBL175586 296708 1403965 2446.0 nM
N[C@H](C(=O)N1CC[C@H](F)C1)[C@H]1CC[C@H](NS(=O)(=O)c2ccc(F)cc2F)CC1 CHEMBL22310 29272 671631 49000.0 nM
N[C@H](C(=O)N1CCSC1)C1CCCCC1 CHEMBL23223 29758 674222 28000.0 nM
N[C@H](C(=O)N1CCSC1)[C@H]1CC[C@H](NC(=O)c2ccc(F)c(F)c2)CC1 CHEMBL22359 29449 675583 5900.0 nM
N[C@H](C(=O)N1CCCC1)[C@H]1CC[C@H](NS(=O)(=O)c2ccc(F)cc2F)CC1 CHEMBL25437 29244 675588 35000.0 nM
N[C@H](C(=O)N1CC[C@@H](F)C1)[C@H]1CC[C@H](NS(=O)(=O)c2ccc(OC(F)(F)F)cc2)CC1 CHEMBL281561 29265 679299 6000.0 nM
N[C@H](C(=O)N1CC[C@@H](F)C1)[C@H]1CC[C@H](NS(=O)(=O)c2ccc(F)cc2F)CC1 CHEMBL283309 29253 679302 52000.0 nM
N[C@H](C(=O)N1CCCC1)[C@H]1CC[C@H](NC(=O)c2ccc(F)c(F)c2)CC1 CHEMBL278558 29482 683566 29000.0 nM
N[C@H](C(=O)N1CCSC1)[C@H]1CC[C@H](NC(=O)c2ccccc2C(F)(F)F)CC1 CHEMBL283368 29340 685042 39000.0 nM

$ mmpdb fragment herg_data.txt -o herg_dataset.fragdb

Now I could get herg_dataset.fragdb. To build the MMPDB in PostgreSQL, type the following command. I run PostgreSQL on localhost and insert the tables into mmpdbtest.

$ mmpdb index herg_dataset.fragdb -o postgres://localhost/mmpdbtest

The command above will build the MMPDB in the PostgreSQL mmpdbtest database. Check the PostgreSQL database.

$ psql mmpdbtest
psql (12.9, server 12.2)
Type "help" for help.

mmpdbtest=# select * from rule_environment where num_pairs >1;

  id   | rule_id | environment_fingerprint_id | radius | num_pairs
-------+---------+----------------------------+--------+-----------
    43 |       8 |                          7 |      0 |         2
    49 |       9 |                          1 |      0 |         2
    55 |      10 |                          7 |      0 |         5
    61 |      11 |                          7 |      0 |         2
    67 |      12 |                          1 |      0 |         2
   127 |      22 |                          1 |      0 |         8
   133 |      23 |                          1 |      0 |         3
   139 |      24 |                          1 |      0 |         3
   187 |      32 |                          1 |      0 |         4
   193 |      33 |                          1 |      0 |         6
   211 |      36 |                          1 |      0 |         6
   217 |      37 |                          1 |      0 |         3
   229 |      39 |                          1 |      0 |         3
   253 |      43 |                          7 |      0 |         3
   259 |      44 |                          7 |      0 |         2
   265 |      45 |                          7 |      0 |         8
   343 |      58 |                         57 |      0 |         2
   361 |      61 |                         57 |      0 |         2
   439 |      74 |                          7 |      0 |         3
--More--

Next, do a transformation with mmpdb.

$ mmpdb transform --smiles 'c1cccnc1O' postgres://localhost/mmpdbtest --max-variable-size 5
ID	SMILES
1	CN(C)c1ccccn1
2	CNc1ccccn1
3	COC(=O)c1ccccn1
4	COCCNc1ccccn1
5	COc1ccccn1
6	CS(=O)(=O)Nc1ccccn1
7	Cc1ccccn1
8	Clc1ccccn1
9	Fc1ccccn1
...

It works fine.

In summary, the new version of mmpdb works with PostgreSQL. It provides an opportunity to integrate the RDKit chemical cartridge and to move from SQLite3 to PostgreSQL.

Thanks for continuous development of mmpdb!

If you are interested in the package, let's use it ;)

Define a function after the request #Flask #memo #python

I love Flask and Django for making web apps and often use Flask for web app development.

Sometimes an app serves files in response to a user request. In this case, static files generated by the app are stored in the static folder, and the folder accumulates lots of files. So I would like to delete these files after serving them to the user. How to do it?

I searched Google and found a good solution ;) Flask has request callback functions for this kind of task, and the 'after_this_request' method is suitable for my situation.

An example of using the method is below. after_this_request is used as a decorator.

from flask import Flask
from flask import render_template, after_this_request, request, send_file

app = Flask(__name__)

@app.route('/', methods=['GET', 'POST'])
def hello():
    if request.method == 'POST':
        # do something
        @after_this_request
        def rm_file(response):
            # remove generated file
            return response
        return send_file('generatedfile')
    return render_template('hoge.html')
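A runnable sketch of the callback firing, using Flask's test client, and setting a response header instead of actually deleting a file so the example stays self-contained:

```python
from flask import Flask, after_this_request

app = Flask(__name__)

@app.route('/')
def index():
    @after_this_request
    def cleanup(response):
        # In a real app, this is where the generated file would be removed.
        response.headers['X-Cleanup'] = 'done'
        return response
    return 'ok'

# The callback runs after the view returns but before the response is sent.
with app.test_client() as client:
    resp = client.get('/')
    print(resp.headers.get('X-Cleanup'))  # done
```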

Flask has many useful callback functions that I've never used. So I would like to read the documentation and use them in my products.