Calculate similarity of popular bioisosters #RDKit #espsim #memo

It will take 3 years after changing job. Fortunately, I’m still working at early drug discovery field as chemoinformatician ;)

As many drug hunter know that bioisosteric replacement is one of the major way for designing novel molecules from known biologically active molecules.

Transformations from carboxylic acid to tetorazole, from amide to oxazole are major examples of bioisosteric replatecement.

Are these pair, “carboxylic & tetorazole” and “amide & oxazole” simlar ? And are there any metrics to compre these similarity instead of biological activities?

Espsim is one of the interesting OSS package for compare molecules with electro static potentials.

I tried to compare following tetrazole and carboxylic acid pair and amide and oxazole pair.

To clarify the similarity, I used not only neutoral state of carboxylic acid but also ionization state.

Let’s compare carboxylic acid and tetrazole with Morgan FP and ESPsim. As you can see that tetrazole and carboxylic acid shows low 2d finger print similarity but high similarity with ESPsim and Shapesim.

Next, let’s compare carboxylate and carboxylic acid metyl ester.

Interestingly ESPSIM between calboxylate and methyl ester is very low compared to their shape similarity.

How about similarity between calboxylic acid and methyl ester? Ah! there molecule shows high similarity not only shape but also electrostatic potential.

Next, let’s see amide and oxazole.

The result is as same as calboxylate case, the pair shows low 2D similarty but high shape and esp similarity.

In this experiments, ESPsim is interesting tool for predicting bioisosteric fragments. I think it worth to use espsim as a tool for drug design.

Today’s code is uploaded my gist.

Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

This is the last chemoinformatics blogpost in this year. Thanks for visiting and reading my blog. I hope you and your family have a great rest of the year and have a happy new year ;)

Advertisement

Make pptx file from Python #chemoinformatics #memo #python-pptx

I hope all reader enjoying chemoinformatics ;)
Unfortunately I’m struggling to make PPTX file in this year. So I can’t have enought time to coding. But It’s same for medicinal chemists. Because they often make presentation slide for project team, theier boss, senior etc….

They transfer SAR data from spread sheet to pptx. Sometime it seems time consuming task for me.

So I would like to write code for making automatic pptx generator with python ;)

As you know there are useful prior art in the topic. The URL is listed below.

https://github.com/dkuhn/sdf2ppt
https://docs.eyesopen.com/toolkits/cookbook/python/depiction/csv2pptx.html?highlight=pptx

The first one uses RDKit and the second one uses OETK.

In this post, I used RDKit and most of code is bollowed from the OE’s example code.

Here is a my example code.
At first, import libraries which are required to make pptx and handling chemical structure.

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import rdDepictor
from rdkit.Chem.Draw import rdMolDraw2D
from io import BytesIO, StringIO
from pptx import Presentation
from pptx.util import Inches
# read example molecules
mols = [m for m in Chem.SDMolSupplier('./cdk2.sdf')]
for m in mols:
    AllChem.Compute2DCoords(m)

Then define function to draw molecule and make pptx. I used BytesIO because by using the BytesIO I don’t need to make tempolary image file.

def getmol_bio(mol, size=(200, 200), fontsize=-1):
    bio = BytesIO()
    drawer = rdMolDraw2D.MolDraw2DCairo(size[0], size[1])
    op = drawer.drawOptions()
    op.fixedFontSize = fontsize
    drawer.DrawMolecule(mol)
    drawer.FinishDrawing()
    bio.write(drawer.GetDrawingText())
    return bio

def make_pptx(mol):
    pres = Presentation()
    title_slide_layout = pres.slide_layouts[0]
    title_slide = pres.slides.add_slide(title_slide_layout)
    title = title_slide.shapes.title
    title.text = 'mol props'
    slide_layout = pres.slide_layouts[5]
    slide = pres.slides.add_slide(slide_layout)
    bio = getmol_bio(mol, size=(300,200), fontsize=18)
    slide.shapes.add_picture(bio, left=Inches(1.0), top=Inches(2.0), width=Inches(2.5))
    props = mol.GetPropsAsDict()
    rows = len(props)
    cols = 2
    table = slide.shapes.add_table(rows, cols, left=Inches(4.0), top=Inches(2.0),
                                   width=Inches(5.5), height=Inches(0.8)).table
    table.columns[0].width = Inches(2.0)
    table.columns[1].width = Inches(3.5)
    table.first_row = False
    for row, (k,v) in enumerate(props.items()):
        table.cell(row, 0).text = k[:20]+':'
        table.cell(row, 1).text = str(v)
    pres.save('mol.pptx')

The main part of code is done. Let’s make PPTX file.

make_pptx(mols[0])

After calling make_pptx, I could get pptx file shown below.

It worked fine. This is a smile example but by using the approach it will be easy to make many compound related pptx files (IMHO, IMHO).

Python-pptx package is really cool and useful tool Ithink. It’s interesting for me! Love coding!

CLI tool for making ssslib #chemoinformatics #rdkit

As many RDKitter know that rdSubstructLibrary is one of the cool tool for conductiong substructure search. Greg Landrum introduced how to use it in his great blog post.

I love the method because it works very fast for substructure searching. So I would like to make CLI tool for making substructure library database.

To do it, I used click which is useful package for making CLI tool.

This is an example to use the code.

$ gh repo clone iwatobipen/rdsss
$ cd rdsss
$ pip install -e .

After installing the package, three commands will be available.
1. make_rdssslib command makes sslib from sdf.gz
2. update_rdssslib which updates sslib with new sdf.gz
3. run_rdsss which run SSS with given smarts query.

The example is shown below.

# make ssslib from sdf.gz
$ make_rdssslib cdk2.sdf.gz cdk2.sslib.pkl

# search with ssslib from CLI
$ run_rdsss 'c1ccccc1' cdk2.sslib.pkl

After running the run_rdsss, hits.csv file will be generated.

$ cat hits.csv
Cn1cnc2c(NCc3ccccc3)nc(NCCO)nc21,ZINC01641925
CC[C@H](CO)Nc1nc(NCc2ccccc2)c2ncn(C(C)C)c2n1,ZINC01649340
COc1ccc(CNc2nc(N(CCO)CCO)nc3c2ncn3C(C)C)cc1,ZINC01487345
COc1ccc2c(c1)/C(=C/c1cnc[nH]1)C(=O)N2,ZINC03814467
COc1cc[nH]c1/C=C1\C(=O)Nc2ccc([N+](=O)[O-])cc21,ZINC03814470
COc1cc(-c2ccc[nH]2)c2c3c(ccc(F)c13)NC2=O,ZINC00003491
[NH3+]CCSc1cc(-c2ccc[nH]2)c2c3c(ccc(F)c13)NC2=O,ZINC03814473
NC(=O)Nc1cccc2c1C(=O)c1c-2n[nH]c1-c1cccs1,ZINC03814477

The csv file contains hit smiles and _Name props.

All process can do from CLI with the code. But to handle learge sslib. I think user should run sss on interprinter. Because IO of SSLIB will take bottle neck of the code.

This code is stil underl development. Any advice or suggestion will be greatly appreciated.

https://github.com/iwatobipen/rdsss

Use mmpdblib on jupyter notebook #rdkit #mmpdb #chemoinformatics

I introduced MMPDB v3 few days ago. The package is CLI tool. So user can do MMPA by typing command line. However the tool doesn’t provide API interface.

I think, if we can use MMPDB on jupyter-notebook it’ll be more useful because output data can be used directly in coding process.

MMPDB Cli interface is implemented with Click which is python command line option parser. And click has CliRunner for testing. I found that CliRunner is nice solution to do that.

Following example is my first trial to call mmpdb on jupyter notebook. And most of SQL part is bollowed from RDKit-Blog post.

At first, connect postgresql chembl30 and retrive data.

import psycopg2
from rdkit.Chem.Draw import IPythonConsole
from rdkit import rdBase
from rdkit import Chem
from rdkit.Chem import PandasTools
import requests
import pandas as pd
from mmpdblib import cli
from click.testing import CliRunner
from IPython.display import HTML
%load_ext sql
print(rdBase.rdkitVersion)

data = %sql select canonical_smiles,molregno,activity_id,standard_value,standard_units from activities \
  join assays using (assay_id) \
  join compound_structures using (molregno) \
  where tid=165 and standard_type='Ki' and standard_value is not null and standard_relation='=' \
    and canonical_smiles not like '%.%';

df = data.DataFrame()

Next, calculate pKi and rename column for next data processing. And save compound data and property data separated as CSV format. It’s worth to know that properties data should be tab separated format.

import numpy as np
def get_pki(val):
    val = float(val)+0.001
    return 9.-np.log10(val)

df['pKi_hERG'] = df.standard_value.apply(get_pki)
df.rename(columns={'molregno':'ID'}, inplace=True)

cpddf = df[['canonical_smiles','ID']].copy()
cpddf.drop_duplicates().to_csv('chembl_herg.smi', sep=' ', index=False, header=False)

propdf = df[['ID', 'pKi_hERG']].copy()
propdf.drop_duplicates().to_csv('chembl_herg_prop.csv', sep='\t', index=False, columns=['ID', 'pKi_hERG'])

OK, ready to run MMPDBlib. It is easy to use CliRunner, I show examples.

# make fragmentdb
runner = CliRunner(mix_stderr=False)
args = ("fragment", "chembl_herg.smi", "-o", "chembl_herg.fragmentdb")
runner.invoke(cli.main, args=args)

# do indexing
args = ("index", "chembl_herg.fragmentdb", "-o", "chembl_herg.mmpdb")
runner.invoke(cli.main, args=args)

# check data
!mmpdb rulecat chembl_herg.mmpdb
id	from_smiles	to_smiles
4	[*:1]c1ccc(F)cc1Cl	[*:1]c1ccc(F)cc1O
1	[*:1]c1ccc(F)cc1O	[*:1]c1ccccc1O
2	[*:1]c1ccc(F)cc1Cl	[*:1]c1ccccc1O
4183	[*:1]c1ccc(F)cc1Br	[*:1]c1ccc(F)cc1Cl
4184	[*:1]c1c(Cl)cccc1Cl	[*:1]c1ccc(F)cc1Cl
.....

# load properties
args = ("loadprops", "-p", "chembl_herg_prop.csv", "chembl_herg.mmpdb")
runner.invoke(cli.main, args=args)

It works fine. OK let’s use generate command.

# I used droperidol as an example molecule
smi = 'C1CN(CC=C1N2C3=CC=CC=C3NC2=O)CCCC(=O)C4=CC=C(C=C4)F'
droperidol = Chem.MolFromSmiles(smi)
smi = Chem.MolToSmiles(droperidol)
droperidol

args = ('generate', "--smiles", smi, "--query", "c1c(F)ccc(*)c1", "--radius", "1", "chembl_herg.mmpdb")
res = runner.invoke(cli.main, args)

Result object named ‘res’ has stdout as text format. I used StringIO and convert the data as pandas dataframe ;)

from io import StringIO
sio = StringIO()
sio.write(res.output)
sio.seek(0)
mmpdf = pd.read_csv(sio, sep='\t', skiprows=[1,2])
mmpdf.head(3)

#
start	constant	from_smiles	to_smiles	r	pseudosmiles	final	heavies_diff	#pairs	pair_from_id	pair_from_smiles	pair_to_id	pair_to_smiles
0	O=C(CCCN1CC=C(n2c(=O)[nH]c3ccccc32)CC1)c1ccc(F...	*C(=O)CCCN1CC=C(n2c(=O)[nH]c3ccccc32)CC1	[*:1]c1ccc(F)cc1	[*:1]C	1	[*:1]-[C](~*)(~*)	CC(=O)CCCN1CC=C(n2c(=O)[nH]c3ccccc32)CC1	-6	2	423277	O=C(NC1CCc2cc(CCN3CCN(c4nsc5ccccc45)CC3)ccc21)...	423269	CC(=O)NC1CCc2cc(CCN3CCN(c4nsc5ccccc45)CC3)ccc21
1	O=C(CCCN1CC=C(n2c(=O)[nH]c3ccccc32)CC1)c1ccc(F...	*C(=O)CCCN1CC=C(n2c(=O)[nH]c3ccccc32)CC1	[*:1]c1ccc(F)cc1	[*:1]C(C)C	1	[*:1]-[C](~*)(~*)	CC(C)C(=O)CCCN1CC=C(n2c(=O)[nH]c3ccccc32)CC1	-4	2	423277	O=C(NC1CCc2cc(CCN3CCN(c4nsc5ccccc45)CC3)ccc21)...	423275	CC(C)C(=O)NC1CCc2cc(CCN3CCN(c4nsc5ccccc45)CC3)...

Finally, convert smiles to rdkit molobjects with Chem.PandasTools. I used changeMoleculeRendering method because before using the method molecules in dataframe are not rendered.

PandasTools.AddMoleculeColumnToFrame(mmpdf, smilesCol="start", molCol="start")
PandasTools.AddMoleculeColumnToFrame(mmpdf, smilesCol="final", molCol="final")

from rdkit.Chem import PandasPatcher
PandasPatcher.changeMoleculeRendering(mmpdf)

Worked fine ;)

By using the approach, we can call mmpdb function from python code without subprocess module. I think it is usful for calling function from the code with is click based CLI tool.

I uploaded whole code of the post on my gist.

Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

Thanks for reading.

Let’s use mmpdb v3 #memo #rdkit #chemoinformatics

Matched molecular pair (MMP) is not AI based compound design method however it’s still useful and powerful approach to do compound design.

MMPDB is one of the cool package for MMP analysis. And Andrew who is developer of MMPDB preseted new version of MMPDB at RDKit UGM 2022.

link

Version 3 is still developing state but you can install it from andrew’s repo. I installed it and use mmpdb v3.

$ gh repo clone adalke/mmpdb -- -b v3-dev
$ cd mmpdb
$ pip install -e .

After the command avobe, I could call mmpdb command ;)

Make cdk2.smi from cdk2.sdf with rdkit.

from rdkit import Chem
mols = Chem.SDMolSupplier('cdk2.sdf')
with open('cdk2.smi') as of:
    for m in mols:
        molid = m.GetProp('_Name')
        smi = Chem.MolToSmiles(m)
        of.write(f'{smi} {molid}\n')

After making smi file, let’s make mmpdb.

$ mmpdb fragment cdk2.smi -o cdk2.fragment
$ mmpdb index cdk2.fragment -o cdk2.mmpd

Check rule list with rulecat command.

$ mmpdb rulecat cdk2.mmpdb
id	from_smiles	to_smiles
1	[*:1]c1[nH]cc2c1CCOC2=O	[*:1]c1cnc[nH]1
2	[*:1]c1c(F)cc(Br)cc1F	[*:1]c1nccs1
3	[*:1]Oc1nc([*:2])nc(N)c1N=O	[*:1]Oc1nc([*:2])nc2[nH]cnc12
4	[*:1]Oc1nc([*:2])nc2[nH]cnc12	[*:1]Oc1nc(N)nc([*:2])c1N=O
5	[*:1]C1CC1	[*:1]c1ccccc1
6	[*:1]C(=O)C(C)C	[*:1][C@@H]1CCCO1
7	[*:1]C(=O)C(C)C	[*:1][C@H]1CCC(=O)N1
8	[*:1]C(=O)C(C)C	[*:1]C1CCCCC1
9	[*:1]C(=O)C(C)C	[*:1][C@@H]1CC=CCC1
10	[*:1][C@@H]1CCCO1	[*:1][C@H]1CCC(=O)N1
11	[*:1]C1CCCCC1	[*:1][C@@H]1CCCO1
12	[*:1][C@@H]1CC=CCC1	[*:1][C@@H]1CCCO1
13	[*:1]C1CCCCC1	[*:1][C@H]1CCC(=O)N1
14	[*:1][C@@H]1CC=CCC1	[*:1][C@H]1CCC(=O)N1
15	[*:1]C1CCCCC1	[*:1][C@@H]1CC=CCC1
16	[*:1]c1ccc(C(=O)[O-])c([*:2])c1	[*:1]c1cccc([*:2])c1
17	[*:1]Nc1ccc(C(=O)[O-])c([*:2])c1	[*:1]Nc1cccc([*:2])c1
18	[*:1]c1nc([*:2])c(N=O)c(N)n1	[*:1]c1nc([*:2])c2nc[nH]c2n1
19	[*:1]c1nc([*:2])c2nc[nH]c2n1	[*:1]c1nc(N)nc([*:2])c1N=O
20	[*:1]C	[*:1]c1nccs1
21	[*:1]c1ccc(C(N)=O)cc1	[*:1][H]
22	[*:1]c1ccc(C(=O)[O-])c(Cl)c1	[*:1]c1cccc(Cl)c1
23	[*:1]CC1CC1	[*:1]Cc1ccccc1
24	[*:1]c1ccc(S(N)(=O)=O)cc1	[*:1]c1ccccc1
25	[*:1]c1ccccc1	[*:1][H]
26	[*:1]c1ccc(S(N)(=O)=O)cc1	[*:1][H]
27	[*:1]c1nc(N)nc(N)c1N=O	[*:1]c1nc(N)nc2[nH]cnc12
28	[*:1]CC(=O)C(C)C	[*:1]C[C@@H]1CCCO1
29	[*:1]CC(=O)C(C)C	[*:1]C[C@H]1CCC(=O)N1
30	[*:1]CC(=O)C(C)C	[*:1]CC1CCCCC1
31	[*:1]CC(=O)C(C)C	[*:1]C[C@@H]1CC=CCC1
32	[*:1]C[C@@H]1CCCO1	[*:1]C[C@H]1CCC(=O)N1
33	[*:1]CC1CCCCC1	[*:1]C[C@@H]1CCCO1
34	[*:1]C[C@@H]1CC=CCC1	[*:1]C[C@@H]1CCCO1
35	[*:1]CC1CCCCC1	[*:1]C[C@H]1CCC(=O)N1
36	[*:1]C[C@@H]1CC=CCC1	[*:1]C[C@H]1CCC(=O)N1
37	[*:1]CC1CCCCC1	[*:1]C[C@@H]1CC=CCC1
38	[*:1]c1ccc(OC)cc1	[*:1]c1cccs1
39	[*:1]N1CCC[C@H](C(N)=O)C1	[*:1][H]
40	[*:1]OC	[*:1]SCC[NH3+]
41	[*:1]S(=O)(=O)NC	[*:1]S(=O)(=O)NC(=N)N
42	[*:1]S(=O)(=O)NC	[*:1]S(=O)(=O)Nc1nccs1
43	[*:1]S(=O)(=O)NC(=N)N	[*:1]S(=O)(=O)Nc1nccs1
44	[*:1]C(=O)[O-]	[*:1][H]
45	[*:1]S(N)(=O)=O	[*:1][H]
46	[*:1]OC[C@@H](O)C[NH+](C)C	[*:1][H]
47	[*:1]NC(=O)NN(C)C	[*:1]NC(N)=O
48	[*:1]N	[*:1]Nc1ccc(C(N)=O)cc1
49	[*:1]OCC(=O)C(C)C	[*:1]OC[C@@H]1CCCO1
50	[*:1]OCC(=O)C(C)C	[*:1]OC[C@H]1CCC(=O)N1
51	[*:1]OCC(=O)C(C)C	[*:1]OCC1CCCCC1
52	[*:1]OCC(=O)C(C)C	[*:1]OC[C@@H]1CC=CCC1
53	[*:1]OC[C@@H]1CCCO1	[*:1]OC[C@H]1CCC(=O)N1
54	[*:1]OCC1CCCCC1	[*:1]OC[C@@H]1CCCO1
55	[*:1]OC[C@@H]1CC=CCC1	[*:1]OC[C@@H]1CCCO1
56	[*:1]OCC1CCCCC1	[*:1]OC[C@H]1CCC(=O)N1
57	[*:1]OC[C@@H]1CC=CCC1	[*:1]OC[C@H]1CCC(=O)N1
58	[*:1]OCC1CCCCC1	[*:1]OC[C@@H]1CC=CCC1
59	[*:1]NCC1CC1	[*:1]NCc1ccccc1
60	[*:1]N	[*:1]Nc1ccccc1

60 rules are generated from cdk2.smi which contains 47 molecules.

One of the interesting feature of mmpdb v3 is generate command imprementation. The command allows to user for generating molecule with user defined substructure as query.

cdk2.mmpdb has metyl to thiazole rules ([:1]C >>> [:1]c1nccs1). So I tried to generate command with simple molecule.

$ mmpdb generate --smiles 'Cc1ccccc1' --query '*C' cdk2.mmpdb
start	constant	from_smiles	to_smiles	r	pseudosmiles	final	heavies_diff	#pairs	pair_from_id	pair_from_smiles	pair_to_id	pair_to_smiles
GENERATE:
  EXEC: *C (22, 22, 1, 1)
Cc1ccccc1	*c1ccccc1	[*:1]C	[*:1]c1nccs1	0	[*:1](~*)	c1ccc(-c2nccs2)cc1	4	1	ZINC03814443	CNS(=O)(=O)c1ccc(N/C=C2\C(=O)Nc3ccccc32)cc1	ZINC03814447	O=C1Nc2ccccc2/C1=C/Nc1ccc(S(=O)(=O)Nc2nccs2)cc1

Generated molecule from “Cc1ccccc1” is ‘c1ccc(-c2nccs2)cc1’ and the image is shown below.

generate command can use radius option. It provides flexibility of user definition.

The new feature of mmpdb v3 is not limited generate command which is shown here. You should read Andrew’s presentation and his repo!

Thanks for reading ;)

Embed mols2grid to web page #mols2grid #RDKit #webapp

I often use flask for my web app development because the package is light weight web framework and easy to write.

BTW rendering molecules as grid image is easy in jupyter notebook by using RDKit’s Draw.MolsToGridImage function. And also recently developed package named mols2grid is really cool for rendering molecules on jupyter notebook. mols2grid generaetes image data as HTML. So it means that we can embed these grid image to webpage.

So I tried to it ;)

Following code uses Flask, Flask-bootsrap, mols2grid. All packages are installed from conda-forge.

Directory structure is below.

root/app.py
root/cdk2.sdf
root/templates/top.html

And app.py

from flask import Flask
from flask import render_template
from flask_bootstrap import Bootstrap
from rdkit import Chem
import mols2grid


app = Flask(__name__)
Bootstrap(app)

@app.route('/')
def top():
    im = mols2grid.display('./cdk2.sdf',
    #tooltip=["id", "r_mmffld_Potential_Energy-OPLS_2005"]
    )
    return render_template('/top.html', data=im.data)


if __name__=="__main__":
    app.run(debug=True)

And top.html is below.

{% extends "bootstrap/base.html" %}
{% block title %}This is an example page{% endblock %}


{% block navbar %}
  <div class="navbar navbar-inverse" role="navigation">
    <div class="container">
      <a class="navbar-brand" href="#">Navbar</a>
    </div>
</div>
{% endblock %}

{% block content %}
<div class="container">
  <div>{{data|safe}}</div>
</div>
{% endblock %}

That’s all. It is really simple isn’t it ;)

Let’s run the web app.

$ python app.py

Then access localhost:5000. I could get molecules as grid image.

The data has pagenation and check box. It can dowonload selected molecules. I think it is one of the convenient way to implement rendering molecules function in my web applications.

FW analysis make easily with rdkit contrib FW package #rdkit #chemoinformatics #memo

Last week I enjoyed RDKit UGM 2022. It was really great and exciting evenif I participated there from online. I hope I could participate RDKIT UGM 2023 locally ;)

As you know RDKit is one of the useful OSS package for chemoinformatician. It has nice community and be developed actively. I respect the community and developer’s effort!

I found new contrib package in rdkit github repo it was “FreeWilson“(FW). The logic of FW analyisis is simple but powerful approach to find the better conbination of R groups.
Original article is shown here. And PatWalter disclosed and described the code and usecase in his repo and blog post.
https://github.com/PatWalters/Free-Wilson
http://practicalcheminformatics.blogspot.com/2018/05/free-wilson-analysis.html

And fortunately, we can also use FW from rdkit contrib package ;)

Now new version of rdkit(2022.09.1) isn’t available from conda-forge but I could get new version from original repo and FW package could install with pip command.

(base)$ conda activate chemo_py309
(chemo_py309)$ gh repo clone rdkit/rdkit
(chemo_py309)$ cd rdkit/Contrib/FreeWilson
(chemo_py309)$ pip install .

After installation of FW, I run code with sample data. My code (almost same as readme) is uploaded to my gist which is show below.

Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

By using the package, user can conduct FW analysis really easily. Original repo shows another approach which uses FMCS to get scaffold for the dataset.

Ofcourse FW is not always propose correct combination however it is chance for thinking non additive SAR (e.g https://jcheminf.biomedcentral.com/articles/10.1186/s13321-021-00525-z)

I felt the package is useful and easy to use. So I would like to apply it in my internal project ;)

Make chemoinformatics coding more easily ! #RDKit #datamol

I love RDKit because it supports lots of chemoinformatics features and it is supported top level scientific community.

But for eary chemoinformacian, it’s required to learn python or C++ to write chemoinformatics code. As you know, Knime is good option for no-code chemoinformtatics. And another option for beginner is good wrapper of rdkit

Recently the package named datamol is intorduced in my TL.

You can install datamol via conda command. And datamol is inroduced at RDKitUGM 2021!
https://www.youtube.com/watch?v=ibE10Xy_5VY

I used datamol today. And I felt that datamol is useful tool for chemoinformatics because it wrapps lots of rdkit functions and it makes easy to use.

My example was uploaded on my gist.

Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

I think it is important for these kinds of package is Sustainability. So I hope datamol will be maintained long term.

If reader has interesd datamol, please install your environment and try it ;)

Generate molecules with REINVENT and make interactive plot #RDKit #Knime #REINVENT

I introduced to use user defined condaenv in Knime. It’s really useful I think.

Make chemoinformatics workflow by Kinme with user defined conda envrionment #Knime #Chemoinforamtics #memo

And as you know, REINVENT is the one of famous compound generation tool for drug discovery. Its community provides lots of example codes as jupyternotebook. After building some model, I run sampling steps with configration file for sampling which is generated by the notebook. Today, I tried to generate molecules with python and process them with Knime. And I surprized that the process is really easy!

At first, I made simple workflow shown below.

To run the workflow, I used reinvent.v3.2 env. And next python script node defines compound generation process. The code is below.

from pandas import DataFrame
# Create empty table
output_table = DataFrame()
from rdkit import Chem
import sys
print(sys.path)
import reinvent_models.reinvent_core.models.model as reinvent
batchsize = 124
#model path where user would like to use
modelpath = '/hogehoge/ReinventCommunity/notebooks/models/random.prior.new'

model = reinvent.Model.load_from_file(modelpath, sampling_mode=True)
sampled_smi, likelyhood = model.sample_smiles(batchsize)
output_table['sampled_smi'] = sampled_smi
output_table['likelyhood'] = likelyhood
output_table['ROMol'] = output_table['sampled_smi'] .apply(Chem.MolFromSmiles)

To generate molecule from model, reinvent_models package is used. After loading the model, call sample_smiles is used for compound generation.

Post process is very simple. Calculate molecular fingerprint, run PCA and calculate molecular descirptors and convet molecule to PNG.

Finally, make javascript component with scatterplot and tile view.

I added interactive view for Tile view. So I could get tile view when I select plots ;)

This is a simple example to integrate KNIME and REINVENT. But I think it’s interesting that to use the approach, we can add lots of postprocess after generating molecules not only LBDD approach but also SBDD approach.

If readers who have interest knime and REINVENT, let’s make your ouwn work flow ;)

Useful python package for QSAR related tasks of chemoinformatician #chemoinformatics #oloren-ai #RDKit

When I posted my memo about open science, @OlorenAI introduced python package named Oloren ChemEngine (OCE). I often use chemprop or interanly build system for QSAR tasks. ChemProp is the one of favorite package because it is easy to use and it includes web application flamework for users. I’ve never used OEC so I tried to use the package. Thaks @Oloren for developing and sharing the code.

At first, I installed OCE. My cuda version is little bit old, so I need to install Pytorch geometric after running the original install script .

(base) $ conda create -c conda-forge -n oce python=3.8
(base) $ conda activate oce
(oce) $ bash <(curl -s https://raw.githubusercontent.com/Oloren-AI/olorenchemengine/master/install.sh)
(oce) $ conda install -c conda-forge jupyter mamba
# following command depends on your env. I use pytorch1.12 and cuda 10.2
(oce) $ pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-1.12.0+cu102.html

After running the code above, I could use OCE.

OK, let’s run the code. Following code is almost same as original documentation. The main different point compared to other QSAR library, OCE don’t need vectorize(featurize) molecules just define which method user would like to use in the model.

This example make Boosing emsemble model with two RandomfolrestModel with two different features. After making model object, user need to pass the list of SMILES and target value for training !!

OCE wraps complicated molecular featureize process. So user don’t need do that. I think it is cool isn’t it? After making the model object, model training and prediction step is as same as scikit-learn.

User can build complex model with few lines of code ;)

import olorenchemengine as oce
df = oce.ExampleDataFrame()

model = oce.BaseBoosting([
            oce.RandomForestModel(oce.DescriptastorusDescriptor("rdkit2dnormalized"), n_estimators=1000),
            oce.RandomForestModel(oce.OlorenCheckpoint("default"), n_estimators=1000)])

model.fit(df["Smiles"], df["pChEMBL Value"])
oce.save(model, "model.oce")
model2 = oce.load("model.oce")
y_pred = model2.predict(["CC(=O)OC1=CC=CC=C1C(=O)O"])
y_pred
>>array([0.15748118])

OCE provides not only model building method but also provides visualization method. Let’s see it.

from olorenchemengine.visualizations import *
from olorenchemengine import *
import olorenchemengine as oce
df = oce.ExampleDataset().data
test_dataset = oce.BaseDataset(name='purple', data=df.to_csv(), structure_col='Smiles', property_col='pChEMBL Value')

model = oce.BaseBoosting([
            oce.RandomForestModel(oce.DescriptastorusDescriptor("rdkit2dnormalized"),n_estimators=1000),
            oce.BaseTorchGeometricModel(oce.TLFromCheckpoint("default"), preinitialized=True),
            oce.RandomForestModel(oce.OlorenCheckpoint("default"),n_estimators=1000)])


## Training the model
model.fit(df["Smiles"], df["pChEMBL Value"])
vis =  oce.VisualizeModelSim(model=loaded_model,dataset=test_dataset)
vis.render_ipynb()

After making vis object and call render_ipynb(), I could get interactive scatter plot shown below. When I mouse over on the plot, compound structure will be hovered. This example is 2 class classification model but you can make same plot with regression task.

OCE also have not only traditional predictive model but also chemprop model (MPDNN!). It sounds nice, let’s use it. It’s really user friendly API ;)

import io
import sys
import zipfile

import pandas as pd
import requests
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

import olorenchemengine as oce
data_dir = "./data"
data_url = "http://snap.stanford.edu/ogb/data/graphproppred/csv_mol_download/hiv.zip"
r = requests.get(data_url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall(data_dir)
df = pd.read_csv(f"{data_dir}/hiv/mapping/mol.csv.gz")

X_train, X_test, y_train, y_test = train_test_split(df["smiles"], df["HIV_active"], test_size=0.2, random_state=42)

## building and training chemprop model is finished only two lines!!!  
model = oce.ChemPropModel()
model.fit(X_train, y_train)

'''
print(X_train)
29361                     O=C(O)CC(NC(=O)OCc1ccccc1)C(=O)O
10448                 O=[N+]([O-])c1ccc(Nc2ccccc2)c2nonc12
31039      CCOC(=O)C(=NNc1ccc(C)cc1)N1C(=S)N(C)N=C(C)C=C1S
1311                       N#CSC1=C(SC#N)C(=O)c2ccccc2C1=O
27834    COc1cc(C2C3=C(COC3=O)OC(C)(C)Oc3cc4c(cc32)OCO4...
                               ...                        
6265                            Cc1ccc2nsnc2c1[N+](=O)[O-]
11284    CC1=CC(=C(c2cc(C)c(O)c(C(=O)O)c2)c2c(Cl)ccc(S(...
38158                            Oc1ncnc2c1sc1nc3ccccc3n12
860                 CCN(CCO)CCNc1ccc(C)c2sc3ccccc3c(=O)c12
15795    COc1cccc(NC(=O)CC(=O)N2N=C(N(CCC#N)c3ccc(Cl)cc...
Name: smiles, Length: 32901, dtype: object

model.model
Sequential(
  (0): DMPNNEncoder(
    (act_func): ReLU()
    (W1): Linear(in_features=165, out_features=300, bias=False)
    (W2): Linear(in_features=300, out_features=300, bias=False)
    (W3): Linear(in_features=451, out_features=300, bias=True)
  )
  (1): Sequential(
    (0): Dropout(p=0.0, inplace=False)
    (1): Linear(in_features=300, out_features=300, bias=True)
    (2): ReLU()
    (3): Dropout(p=0.0, inplace=False)
    (4): Linear(in_features=300, out_features=1, bias=True)
  )
)
'''

Additionaly, OCE can visualize chemicalspace. Let’s plot chemicals space with TSNE.

import olorenchemengine as oce
dataset = oce.BACEDataset() + oce.ScaffoldSplit()
vis = oce.ChemicalSpacePlot(dataset, oce.DescriptastorusDescriptor('morgan3counts'), opacity = 0.4, dim_reduction = "tsne")
vis.render_ipynb()

Next, make plot with trained model data. The size of marker means  the magnitude of the residuals. It’s useful to evaluate model performance.

model = oce.BaseBoosting([oce.RandomForestModel(oce.DescriptastorusDescriptor("morgan3counts")),
                         oce.RandomForestModel(oce.OlorenCheckpoint("default"))])
model.fit(*dataset.train_dataset)
vis = oce.VisualizeDatasetSplit(dataset, oce.DescriptastorusDescriptor("morgan3counts"), 
                                model = model, opacity = 0.4)
vis.render_ipynb()

These codes are few examples from original document. If readers have interest the package please install and use it in your Real tasks!

There is a nice documents are provided here. https://docs.oloren.ai/api/modules.html

Thanks!!!

Let’s enjoy open science! #memo #diary

My blogsite was started 2013, so I took almost 10 years ;) It’s amazing. I’m enjoying writing the blog and sharing my code snippet.

BTW, I’m working in pharmaceutical company as a chemoinformatician. So, is it risk to write chemoinformatics related post in my blog site….? I don’t think so. As many people know that AstraZeneca shares theire useful code on theire github repository<https://github.com/MolecularAI&gt;. It means that sharing code is not differenciation point and it is good eco system to improve data science. It is worth to open science of non competitive area. Because to do it, we can get many feedback from the many communities. It is difficult to get these thins from interal community only. Now it is easy to communicate not only domestic area but also wold wide with IT tools.

I would like to keep writing the blog post enven if the frequency of the update is slow ;) And hope many reader will enjoy open science and get new idea from it.

Thanks for reading.

Calculate ligand RMSD with rdkit contrib package #RDKit #memo

Today is the last day of my summer vacation…. Due to COVID-19 pandemic, I and my family didn’t go travel during the vacation however, we’ll go to national championship of dodgeball game in next sturday. It will be exciting day!

BTW as you know, rdkit has lots of contrib packages. And I would like to introduce a package named CalcLigRMSD which is calculator of two molecules RMSD.

It is easy to use it. And good example code is disclosed in original repo.
https://github.com/rdkit/rdkit/tree/master/Contrib/CalcLigRMSD

I would like to use the package so I tried to run the code on my PC.

Here is an my code. (Most of code is borrowed from original repository)

https://nbviewer.org/gist/iwatobipen/7456f75e7d5f23607cd9c9885c7f5ff0

It is easy to two ligands RMSD with the package. In the original repositry, more useful examples are introduced. If reader has interest the code, pls check original repo ;)

Thanks for reading.

Useful package for ploting chemical space rapidly #chemoinformatics #memo

Visualize chemical space is important task for chemoinformatitian. And there are lots of way to represent chemical space. One of the common approach is PCA. And recently tSNE and UMAP are used.

I wrote template code for plotting these data in my task but didn’t write code as a package.

Today I found useful package for making chemical space plot with python named ChemPlot. You can get and read the article from following URL.

The article is open access! https://chemistry-europe.onlinelibrary.wiley.com/doi/full/10.1002/cmtd.202200005

And fortunately the author shared the code on github. Pepople who have interest the pacakge, can install chemplot via conda or pip commmand.

I installed the package with conda.

$ mamba install -c conda-forge -c chemplot chemplot

After the installation, I tried to use chemplot and uploaded the code. It’s really easy to make chemical space plot with chemplot. I could make not only static plot but also interactive plot. Following code doesn’t render interactive plot but it worked on my PC. Interactive plot provides structure image when mouse over on the plot.

Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

It’s worth to use for making your chemical space plot. In summary there are lots of useful packages are available in python. Thanks for sharing the code!

Make chemoinformatics workflow by Kinme with user defined conda envrionment #Knime #Chemoinforamtics #memo

I think workflow tool is useful for many chemoinformatician such as Knime, Pipeline Pilot(PP) and Orange. In my knowledge, it’s difficult to use user defined conda env in PP. Of course these workflow tools provides enough components to do many tasks. But I would like to have more flexibility to build chemoinformatics pipelines.

Recently I updated Knime in my PC and found nice node for using user defined conda env in Knime.

Conda environment probacation node enable to define conda env in Knime pipeline.

I tried to use it in very simple workflow. Following image is overall view of workflow.

Conda environment probagation can select local conda env for workflow. User can select env very easily. I selected my chemoinfomatics env with python 3.9. The setting is passed as flow valiable.

Then, I defined flow valiable in python script node like below.

And wrote some code in the node.

# Copy input to output
output_table_1 = input_table_1.copy()
from rdkit import Chem
from rdkit.Chem import Descriptors
output_table_1['ROMol'] = output_table_1['Molecule'].apply(Chem.MolFromMolBlock)
output_table_1['SMILES'] = output_table_1['ROMol'].apply(Chem.MolToSmiles)
output_table_1['MolWt'] = output_table_1['ROMol'].apply(Descriptors.MolWt)
output_table_1['LogP'] = output_table_1['ROMol'].apply(Descriptors.MolLogP)

Data from SDF reader is processed in python script node under my defiened conda env. It worked fine.

In summary, Knime is really flexible and useful tool IMHO. And now there are lots of useful resource for learning Knime.

Reader who interested in Knime I recommend to check following blog post ;)

Written in Japanese!
https://note.com/knimesupportteam/m/m041672f62384
https://sumtat.hatenablog.com/

Kime blog
https://www.knime.com/blog

Thanks for reading.

Useful package for virtual screening #chemoinformatics #RDKit

Virtual Screening is important task of drug discovery projects. There are lots of approach for example Finger print based, substructure based and shape based screening. All approaches listed above is not only used in SBDD but also LBDD.

And there are lots of apprications to do these tasks. I wrote scripts for these task and use then. But recently I found nice package for VS named VSflow which is developed by Paul Czodrowski’s group.

It seems interesting, so I tried to use it. At first, I prepared conda env and install it.

$ gh repo clone czodrowskilab/VSFlow
$ cd VSFlow
$ conda env create --quiet --force --file environment.yml
$ conda activate vsflow
$ pip install .

After running the code, I could use vsflow command.

Next, I prepared dabase. Database can be made from any kinds of dataset but I used default set. ‘-d pdb’ option means prepare database with smiles which cames from ligandexpo.

$ time vsflow preparedb -d pdb -o pdb_ligs -np 6
**************************

 VV        VV  SSSSSSS             VSFlow
  VV      VV  SSS    SS       Virtual Screening
   VV    VV    SSSS               Workflow
    VV  VV       SSSS
     VVVV     SS    SSS
      VV       SSSSSSS

**************************

Start: 06/21/2022, 21:22:37
Running in parallel mode on 6 threads
Downloading database pdb ...
Finished downloading database
Generating database file ...
Finished in 168 seconds

real	2m48.611s
user	0m6.480s
sys	0m1.619s

Now I could get ‘pdb_ligs.vsdb’ which is pickled data for VSFlow. Next, I tried to substructure and fp sim search. I used SMILES as a query. The task done in a second.

$ vsflow substructure -smi 'c1ccnnc1' -d pdb_ligs.vsdb -o smi_sub_pdb.sdf
**************************

 VV        VV  SSSSSSS             VSFlow
  VV      VV  SSS    SS       Virtual Screening
   VV    VV    SSSS               Workflow
    VV  VV       SSSS
     VVVV     SS    SSS
      VV       SSSSSSS

**************************

Start: 06/21/2022, 21:28:50
Running in single core mode
Loading database pdb_ligs.vsdb ...
Reading query ...
Finished substructure search in 0.83285 seconds
Generating output file(s) ...
313 matches found
Finished: 06/21/2022, 21:28:51
Finished in 0.91936 seconds

SSS hit compounds are below.

Following example is similarity sarch and I made similarity map as PDF.

$ vsflow fpsim -d pdb_ligs.vsdb -smi "CC1CCN(C(=O)CC#N)CC1N(C)c1ncnc2[nH]ccc12" -o sim.sdf --pdf --simmap
**************************

 VV        VV  SSSSSSS             VSFlow
  VV      VV  SSS    SS       Virtual Screening
   VV    VV    SSSS               Workflow
    VV  VV       SSSS
     VVVV     SS    SSS
      VV       SSSSSSS

**************************

Start: 06/21/2022, 22:06:41
Running in single core mode
Loading database pdb_ligs.vsdb ...
Reading query input ...
Calculating fingerprints ...
Finished fingerprint generation in 6.04996 seconds
Calculating similarities ...
Finished calculating similarities in 0.08398 seconds
Writing 10 molecules to output file(s)
Generating output file(s) ...
Generating PDF file(s) ...
Calculating similarity maps for 10 matches ...
Finished: 06/21/2022, 22:06:56
Finished in 14.63942 seconds

Similarity map is nice approach to visualize similarity between query(tofacitinib) and hit compounds. This example used fcfp4 as FP however user can use other rdkit supported FP such as ECFP, RDKit, Atom etc.

Final example is shape similarity. To do it vsdb should have 3D structure information. So I got 3D data from ligand expo and made vsdb.

Data link is below.
http://ligand-expo.rcsb.org/dictionaries/Components-pub.sdf.gz

Then run shape sim search. I took long time compared to commercial package such as ROCS but could generate nice output.

$ vsflow shape -smi "CC1CCN(C(=O)CC#N)CC1N(C)c1ncnc2[nH]ccc12" -d pdb_ligs3d.vsdb -o shapesmi -np 6 --pymol
**************************

 VV        VV  SSSSSSS             VSFlow
  VV      VV  SSS    SS       Virtual Screening
   VV    VV    SSSS               Workflow
    VV  VV       SSSS
     VVVV     SS    SSS
      VV       SSSSSSS

**************************

Start: 06/21/2022, 22:43:11
Running in parallel mode on 6 threads
Reading database ...
Reading query ...
Performing shape screening ...
Generating 3D conformer(s) for 1 query molecule(s)
Generating PyMOl file ...
Finished: 06/22/2022, 02:16:54
Finished in 12822.98383 seconds

Here is an example output of shape similarity screening. Green is query molecule. As you can see, vsflow got molecules which has similar 3D shape.

In summary vsflow is useful package for chemoinformatics.

More detials are described the arxiv and repository’s wiki.

https://chemrxiv.org/engage/chemrxiv/article-details/628c60215d9485a206cc8ecc

Call RDKit from Rust with conda env ver2 #RDKit #RDKit-sys #Rust #Chemoinformatics

To use rdkit from Rust, I introduced rdkit-sys before. And fortunately recent version of rdkit-sys cleat supports rdkit-env. It’s worth to use conda-env to build rdkit-sys because user don’t need to build rdkit from source code.

Following code is almost same as my previous post but I would like to share it.

At first, I cloned rdkit from rdkit-rs.

$ gh repo clone rdkit-rs/rdkit
$ cd rdkit

Then edit Cargo.toml. I modified dependencies part as below. Added features=[“dynamic-linking-from-conda”] option. And then added LD_LIBRARY_PATH to use it.

[package]
name = "rdkit"
version = "0.2.11"
edition = "2021"
authors = ["Xavier Lange <xrlange@gmail.com>"]
license = "MIT"
description = "High level RDKit functionality for rust"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
bitvec = "1"
cxx = "1"
log = "0.4"
#iwatobipen modified
rdkit-sys = {version="0.2.14", features=["dynamic-linking-from-conda"]}
flate2 = "1"

[dev-dependencies]
env_logger = "0.9.0"

## from bash
(base) $ conda activate chemo_py310
(chemo_py310)$ export LD_LIBRARY_PATH=$CONDA_PREFIX/lib;$LD_LIBRARY_PATH
(chemo_py310)$ cargo test # all tests will pass

Then I wrote rust_rdkit_v3

$ cargo new rust_rdkit_v3
$ cd rdkit_rust
$ vim src/main.rs
use std::env;
use std::path::PathBuf;
use rdkit;

fn main() {
    let args: Vec<String> = env::args().collect();
    let sdgz_filename = &args[1];

    println!("filename is {}", sdgz_filename);
    let mol_block_iter =
        rdkit::MolBlockIter::from_gz_file(sdgz_filename, false, false, false).unwrap();
    let mol_list = mol_block_iter.collect::<Vec<_>>();
    for m in mol_list {
        let smi = m.unwrap().as_smile();
        println!("{}", smi)
    }
    println!("Done");
}

$ vim Cargo.toml
[package]
name = "rust_rdkit_v3"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
#rdkit = "0.2.11"
rdkit={path="../rdkit"}
~

After writing the code, build it.

$ cargo build --release

Now I could run code.

./target/release/rust_rdkit_v3 cdk2.sdf.gz
filename is cdk2.sdf.gz
[H]C1=NC2=C(N=C(N([H])[H])N=C2OC([H])([H])C(=O)C([H])(C([H])([H])[H])C([H])([H])[H])N1[H]
[H]C1=NC2=C(OC([H])([H])C3([H])OC([H])([H])C([H])([H])C3([H])[H])N=C(N([H])[H])N=C2N1[H]
[H]C1=NC2=C(OC([H])([H])C3([H])N([H])C(=O)C([H])([H])C3([H])[H])N=C(N([H])[H])N=C2N1[H]
[H]C1=NC2=C(OC([H])([H])C3([H])C([H])([H])C([H])([H])C([H])([H])C([H])([H])C3([H])[H])N=C(N([H])[H])N=C2N1[H]
[H]C1=NC2=C(OC([H])([H])C3([H])C([H])([H])C([H])=C([H])C([H])([H])C3([H])[H])N=C(N([H])[H])N=C2N1[H]
[H]OC([H])([H])C([H])([H])N([H])C1=NC(N([H])C([H])([H])C2=C([H])C([H])=C([H])C([H])=C2[H])=C2N=C([H])N(C([H])([H])[H])C2=N1
....

In summary current rdkit-sys supports conda env and it makes easy to call rdkit from rust.

Thanks!

Generate molecules with Link-invent #RDKit #RINVENT #chemoinformatics

General molecular generator with deep learning approach is difficult to fix substructure. But common SAR expansion by medchem is focus on specific part such as terminal part, core and linker.

Linker is defined as a part which connect parts of molecules. Linker is important of drug design and is called scaffold when the linker connets two or three parts. And also important in PROTAC design. So it’s useful if deep learning based generator can design linker ;). As you know delinker(https://github.com/fimrie/DeLinker) is one of the useful and cool code for linker design.

Recently, AstraZeneca team disclosed cool linker generator code for REINVET named Link-INVENT which is implemented in REINVENT v3.2. Link-INVENT is new option of REINVENT for liker design with many useful scoring options.

Link-INVENT can use following linker design specific scoring functions. As you can see, the package can design specific linker.

LinkerEffectiveLength
LinkerGraphLength
LinkerLengthRatio
LinkerNumRings
LinkerNumAliphaticRings
LinkerNumAromaticRings
LinkerNumSPAtoms
LinkerNumSP2Atoms
LinkerNumSP3Atoms
LinkerNumHBA
LinkerNumHBD
LinkerMolWeight
LinkerRatioRotatableBonds

I had interest the code and tried to use it. Fortunately Reinvent community provides useful example code in ReinventCommunity. Ok, let’s test it!

At first I modified code from ReinventCommunity, added scoring function.

I added ‘linker_graph_length’ scoring function for controlling length of molecules.

And after running the learnig process, I checked the learning process with tensorboad. It’s really useful to review the learning process becase I can see not only learning score but also structures of each learning process.

Here is the first output of the learning process.

And here is the molecules from last learning step .

The molecules which generated at last learning step has rigid and short linker which is defined by scoring function.

In summary, I think REINVENT is cool and flexible molecular generator code. I respect authors work and thank for sharing the nice code ;)

Call RDKit from Rust with conda env #RDKit #Rust #Chemoinformatics

I introduced rdkit-sys which is wrapper of rdkit for rust. The package development is ongoing so code is chaged frequentry. To use the package, user need to build rdkit with static link ON option. Build rdkit-sys with static link (.a) is efficient way for portability of the code however it means that it’s difficult to use conda env rdkit because, conda env provides only dynamic link (.so).

So I would like to build rdkit-sys and call rdkit from rust with dylib.

After several try and error, I found solution to do it. I would like to share my trial.

At first, I cloned rdkit-sys in my PC.

$ gh repo clone rdkit-rs/rdkit-sys
$ cd rdkit-sys

Then edit build.rs script show below.

fn main() {
    if std::env::var("DOCS_RS").is_ok() {
        return;
    }

    env_logger::init();

    let library_root = match (std::env::consts::OS, std::env::consts::ARCH) {
        ("macos", "x86_64") => "/usr/local",
        ("macos", "aarch64") => "/opt/homebrew",
        //("linux", _) => "/usr",
        //following path is my conda env's path
        ("linux", _) => "/home/iwatobipen/miniconda3/envs/chemo_py310",
        (unsupported_os, unsupported_arch) => panic!(
            "sorry, rdkit-sys doesn't support {} on {} at this time",
            unsupported_os, unsupported_arch
        ),
    };

    let brew_lib_path = format!("{}/lib", library_root);
    let include = format!("{}/include", library_root);
    let rdkit_include = format!("{}/include/rdkit", library_root);

    let dir = std::fs::read_dir("src/bridge").unwrap();
    let rust_files = dir
        .into_iter()
        .filter_map(|p| match p {
            Ok(p) => {
                if p.metadata().unwrap().is_file() {
                    Some(p.path())
                } else {
                    None
                }
            }
            Err(_) => None,
        })
        .filter(|p| !p.ends_with("mod.rs"))
        .collect::<Vec<_>>();

    let mut cc_paths = vec![];

    let wrapper_root = std::path::PathBuf::from("wrapper");
    for file in &rust_files {
        let file_name = file.file_name().unwrap();
        let file_name = file_name.to_str().unwrap();
        let base_name = &file_name[0..file_name.len() - 3];

        let cc_path = wrapper_root.join("src").join(format!("{}.cc", base_name));
        let meta = std::fs::metadata(&cc_path).unwrap();
        if !meta.is_file() {
            panic!("{} must exist", cc_path.display())
        }
        cc_paths.push(cc_path);

        let h_path = wrapper_root
            .join("include")
            .join(format!("{}.h", base_name));
        let meta = std::fs::metadata(&h_path).unwrap();
        if !meta.is_file() {
            panic!("{} must exist", h_path.display())
        }
    }

    cxx_build::bridges(rust_files)
        .files(cc_paths)
        .include(include)
        .include(rdkit_include)
        .include(std::env::var("CARGO_MANIFEST_DIR").unwrap())
        .flag("-std=c++14")
        .warnings(false)
        // rdkit has warnings that blow up our build. we could enumerate all those warnings and tell
        // the compiler to allow them... .warnings_into_errors(true)
        .compile("rdkit");

    println!("cargo:rustc-link-search=native={}", brew_lib_path);
    // println!("cargo:rustc-link-lib=static=c++");

    for lib in &[
        "Catalogs",
        "ChemReactions",
        "ChemTransforms",
        "DataStructs",
        "Descriptors",
        "FileParsers",
        "Fingerprints",
        "GenericGroups",
        "GraphMol",
        "MolStandardize",
        "RDGeneral",
        "RDGeometryLib",
        "RingDecomposerLib",
        "SmilesParse",
        "Subgraphs",
        "SubstructMatch",
    ] {
        //swich static link to dynamic link!!!
        //println!("cargo:rustc-link-lib=static=RDKit{}_static", lib);
        println!("cargo:rustc-link-lib=dylib=RDKit{}", lib);
    }
    println!("cargo:rustc-link-lib=static=boost_serialization");
}

After the modification, I set up LD_LIBRARY_PATH.

export LD_LIBRARY_PATH=$CONDA_PREFIX/lib;$LD_LIBRARY_PATH

After that, I could cargo build command in rdkit-sys folder ;). Then cloned rdkit-rs/rdkit with the crate. To do that I edited Cargo.toml like below.

$ gh repo clone rdkit-rs/rdkit
$ cd rdkit
$ vim Cargo.toml
[package]
name = "rdkit"
version = "0.2.6"
edition = "2021"
authors = ["Xavier Lange <xrlange@gmail.com>"]
license = "MIT"
description = "High level RDKit functionality for rust"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
cxx = "1"
log = "0.4"
#rdkit-sys = "0.2.7"
rdkit-sys = { path = "../rdkit-sys"} # I used modified version of rdkit-sys
flate2 = "1"

After that, I wrote code with these packages.

my rdkitest code is below.

#Cargo.toml
[package]
name = "rdkrust"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
rdkit = { path = "/home/iwatobipen/dev/sandbox/rusttest/rdkit" }
#src/main.rs
use std::env;
use rdkit;

fn main() {
    let args: Vec<String> = env::args().collect();
    let sdgz_filename = &args[1];

    println!("filename is {}", sdgz_filename);
    let mol_block_iter =
        rdkit::MolBlockIter::from_gz_file(sdgz_filename, false, false, false).unwrap();
    let mol_list = mol_block_iter.collect::<Vec<_>>();
    for m in mol_list {
        let smi = m.unwrap().as_smile();
        println!("{}", smi)
    }
    println!("Done");
}
    

The code will read sdf.gz and retrieve molecules and then convert molecules to SMILES.

OK, let’s check the code.

$ cargo build
#Then
$ target/debug/rdkrust cdk2.sdf.gz
filename is cdk2.sdf.gz
[H]C1=NC2=C(N=C(N([H])[H])N=C2OC([H])([H])C(=O)C([H])(C([H])([H])[H])C([H])([H])[H])N1[H]
[H]C1=NC2=C(OC([H])([H])C3([H])OC([H])([H])C([H])([H])C3([H])[H])N=C(N([H])[H])N=C2N1[H]
'''snip
'''
Done

It seems work fine. In summary, rdkit-rs activity is really cool project for chemoinformatics because rust is really efficient language. I would like to learn rust more and more and develop useful apps for chemoinformatics.

Thanks for reading.

Run python script in html_file #memo #cording

I wrote lots of chemoinformatics apps as web app. To do that I often use Flask / Django. It’s python packge so most of part can be written in python however it’s difficult to embed python directry in html document.

Recently I found cool package for embedding python code in html, it’s name is pyscript. It’s really cool!

I tested simple code as shown below.

The directory system is …

root/
webserve.py
conponents/test1.html
test2.html
test3.html

And code is below. To do he test, I used python http package.

#webserver.py
from http.server import HTTPServer, SimpleHTTPRequestHandler,  os
os.chdir(os.path.join(os.path.dirname(__file__), 'contents'))
server = HTTPServer(('', 8000), SimpleHTTPRequestHandler)
server.serve_forever()
#contents/test1.html
<html>
  <head>
    <link rel="stylesheet" href="https://pyscript.net/alpha/pyscript.css" />
    <script defer src="https://pyscript.net/alpha/pyscript.js"></script>
  </head>
  <body> <py-script> print('Hello, World!') </py-script> </body>
</html>
#contents/test2.html
<html>
    <head>
      <link rel="stylesheet" href="https://pyscript.net/alpha/pyscript.css" />
      <script defer src="https://pyscript.net/alpha/pyscript.js"></script>
      <py-env>
        - matplotlib
      </py-env>
    </head>

  <body>
    <h1>Let's plot random numbers</h1>
    <div id="plot"></div>
    <py-script output="plot">
import matplotlib.pyplot as plt
import numpy as np
x = np.random.randn(1000)
y = np.random.randn(1000)

fig, ax = plt.subplots()
ax.scatter(x, y)
fig
    </py-script>
  </body>
</html>
#contents/test3.html
<html>
    <head>
      <link rel="stylesheet" href="https://pyscript.net/alpha/pyscript.css" />
      <script defer src="https://pyscript.net/unstable/pyscript.js"></script>
      <py-env>
        - pandas
        - scikit-learn
      </py-env>
    </head>

  <body>
<div>
    <py-script>
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
data = load_iris()
df = pd.DataFrame(data.data)
df['target'] = data.target
df.head(5)
    </py-script>
</div>
  </body>
</html>

After writing code above, run the server.

$ python webserver.py

Then simple webserver will run. And I could get following veiw from each page.

As you can see, main part of these pages are built with pure python code without javascript. It’s really interesting.

But there are limitations. PyScript doesn’t support all packages which are provided from pypi evenif major packages are suppored.

Unfortunately RDkit is not suppored too.

BTW lots of packages are abailable. If reader has interest PyScript, let’s check the original document and examples!

https://github.com/pyscript/pyscript/tree/main/pyscriptjs/examples

Generate molecules from molecular formula #Chemoinformatics #memo #jcheminf

Most of chemoinformatitian will think that C6H6 means benzene and its SMILES strings will be ‘c1ccccc1’.

However how do you think that how many possible combinations will be generated from molecular formula C6H6?

…..

Yah, it’s interesting but difficult question.

Recently I read interesting article published from Jounral of chemoinformaitcs. The title is ‘Surge: a fast open-source chemical graph generator’.

https://jcheminf.biomedcentral.com/articles/10.1186/s13321-022-00604-9

The authors developed a fast chemical graph generator which generates molecules from formula.

To generate chemical graph from formula, several steps are required 1. generate graph generation and check automorphism, bond multipicity.

In the case of C6H6, over 200 molecules are generated with surge!!!

Fortunately, binay version of surge is provided from following URL.
https://structuregenerator.github.io/

So I used to it. At first, I got program from the URL above and generate molecules from formula C6H6.

$ $ ./surge-linux-v1.0 -o hoge.sdf C6H6

Then hoge.sdf was generated. And I checked generated structure.

Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
view raw C6H6.ipynb hosted with ❤ by GitHub

As described in the article, surge has a limitation. Current version doesn’t perform a Huckel aromaticity test. It means surge will generate dupilicates structure for kekule versions of aromatic rings.

However it works fast and interesting tool for molecular generation. BTW it’s difficult to filter from generated molecules with desired compound properties in the drug discovery field.