Visualize the torsion drive with different approach #openff #torchani #chemoinformatics #quantum_chemistry

Yesterday, I posted about quantum chemistry based predictive model named ‘torch-ani’. It’s really interesting module which build from lots of QC data. To use torchani API, we can visualize torsion drive with the trained model. It sounds interesting however, I would like to compare the torsion drive results from different approaches.

Fortunately, QCArchive provides very useful data and code. To use qcportal package, I could access some data source which is ready to visualize.

So I tried to get data and visualize them. Most of the following code is same as qcarchive example code. But different to data source. OK let’s do it. I used rdkit for compound visualization. And qcportal is used retrieving data and visualize the results. At first import packages and connect portal site.

from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import rdDepictor
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import PandasTools
import py3Dmol
import copy
from IPython.display import display

# ref
import qcportal as ptl
client = ptl.FractalClient()

> FractalClient
> Server:   The MolSSI QCArchive Server
> Address:
> Username:   None

Then I used “TorsionDriveDataset”, “OpenFF-benchmark-ligand-fragments-v1.0” dataset. This data set has torsion drive data which is calculated with different parameters such as openff, gaff, and ani2x. I got data which has ani2x results only. And added rdkit molobject to the dataframe.

ds = client.get_collection("TorsionDriveDataset", "OpenFF-benchmark-ligand-fragments-v1.0")

ani2xcomdf = ds.status('ani2x',collapse=False, status="COMPLETE")
ani2xcomdf['Mol'] = ani2xcomdf.index.to_list()
PandasTools.AddMoleculeColumnToFrame(ani2xcomdf, smilesCol='Mol', molCol='Mol')
ani2xcomdf['RowID'] = [i for i in range(ani2xcomdf.shape[0])]

pandas data frame

Now ready, when I call ds.visualize method with some arguments, I could visualize torsion drive data. To visualize some dataset, I used for loop. It’s really simple code.

import time
for idx in range(ani2xcomdf.shape[0]):
    m = copy.deepcopy(ani2xcomdf.Mol[idx])
    mh = Chem.AddHs(m)
    m = Chem.RemoveHs(mh)
           ['openff-1.0.0', 'ani2x'],
           units="kcal / mol")

The code iterate ani2xcomdf, and visualize the torsion drive of each row of ‘openff’ and ‘ani2x’.

It’s interesting that some compounds data showed very similar trend but some compounds showed quit different result.

Here are some examples….

The minimum point is same but trend seems little bit different
Ani2x suggested 0deg is the minimum but I don’t think so…
This example shows quite similar trend between openff and ani2x
This example also showed similar trend
Wow, the example quite different …..

I think that the simple example shows machine learning based approach is not perfect but works well if target molecules are in applicability domain(AD). It depends on training data.

I would like to check the training data of ani2x and compare molecules which are used in above code.

Today’s code was uploaded to my gist.

Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

Sample test for quantum ML #pytorch #psikit #RDKit

Recently I have many opportunities to read exciting articles about quantum-machine learning which means some models are trained with quantum chemistry based data such as ANI.

I’m interested in the area and fortunately we can use really cool code named torchani. It provides model zoo includes ani-2x ;) Regarding the original article about ANI2x, ANI model can predict conformer energy with low MAE value compared to MP2 calculation.

ANI1 supports molecules which are constructed from C, N, H and O, but ANI-2x supports molecules constructed from H, C, N, O, F, Cl, S. It means new model cover wide range of drug like molecules.

So I would like to test torchani today. To install torchani is very easy, it can install with pip ;)

$ pip install torchani

After installing the package, I compared the model and psi4 results with simple molecule.

Here is a today’s simple code…

At first load molecule from smiles and generate conformer, then calculate energy with psi4 basis set is scf/6-31g**

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole
import py3Dmol
import torch
import torchani
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
from psikit import Psikit
pk = Psikit()
smiles = 'c1ccccc1-c2c(C)cncc2'

> CPU times: user 1min 10s, sys: 1.79 s, total: 1min 12s
> Wall time: 18.5 s
> -518.5098949584047

Next, I tried to calculate(predict) energy with ANI-2x.

# at first time trained model will be downloaded it will take few minutes.
model = torchani.models.ANI2x(periodic_table_index=True).to(device)
def mol2arr(mols, device=device):
    coordinates = []
    spices = []
    for mol in mols:
        pos = mol.GetConformer().GetPositions().tolist()
        atomnums = [a.GetAtomicNum() for a in mol.GetAtoms()]
    coordinates = torch.tensor(coordinates,
    species = torch.tensor(spices, device=device)
    return coordinates, species
coordinates, species = mol2arr([pk.mol], device)

%time energy = model((species, coordinates)).energies
derivative = torch.autograd.grad(energy.sum(), coordinates)[0]
force = -derivative
>CPU times: user 19.5 ms, sys: 4.24 ms, total: 23.7 ms
>Wall time: 22.6 ms

print('Energy:', energy.item())
print('Force:', force.squeeze())

>Energy: -518.5092850483595
>Force: tensor([[ 1.3592e-02,  3.5025e-03,  2.0906e-02],
        [-1.4324e-02, -1.6894e-04,  1.4033e-02],
        [-1.3329e-02, -3.7380e-03,  2.3461e-04],
        [-9.2623e-03, -5.9881e-03, -1.5998e-02],
        [ 2.1978e-02,  4.3879e-03, -1.6717e-02],
        [ 1.4468e-02,  7.4550e-03,  3.5641e-03],
        [-2.8785e-02, -9.5195e-03, -5.3672e-03],
        [-2.6473e-03, -3.8258e-03,  8.0282e-04],
        [-9.3186e-04,  1.7754e-02, -1.3840e-03],
        [ 6.6303e-02, -7.4541e-02, -4.3813e-02],
        [-1.2795e-02, -1.2576e-03,  1.3008e-03],
        [-2.4707e-03,  7.7883e-02,  3.8845e-02],
        [-1.5167e-02,  6.1120e-03,  1.5764e-03],
        [-5.9885e-03, -2.7664e-03, -4.3181e-03],
        [ 2.6414e-03, -5.7584e-06, -5.2600e-03],
        [ 5.9540e-03,  1.8318e-03,  4.1510e-04],
        [ 1.0910e-03,  1.0532e-03,  4.8398e-03],
        [-7.2534e-03, -1.1427e-03,  3.0659e-03],
        [-1.5687e-03,  1.0146e-02, -1.6901e-02],
        [-1.1689e-02, -1.0652e-02,  1.5877e-02],
        [ 1.5274e-02, -1.3123e-02,  6.3923e-03],
        [-9.5655e-03, -1.6000e-03, -5.9008e-04],
        [-1.3352e-02, -2.3642e-03, -2.1185e-04],
        [ 7.8283e-03,  5.6739e-04, -1.2904e-03]], device='cuda:0')

Result, Energy from psi4 was -518.5098949584047,
and Energy from torch ani was -518.5092850483595.
(Epsi4 – Etorch_ani) * 627.5096080305927 was -0.38 kcal/mol.

1 Hartree = 627.5096080305927kcal/mol

It means that the model could predict the energy of molecule with high accuracy. And as you know, speed of the calculation is faster than QM method. It’s an exciting tool for chemoinformatics.

And torhani has many useful functions, I’m now reading original articles and code documents.

Today’s code was uploaded following link.

Benzenoid substitution pattern of drugs #memo #journal

Phenyl ring is very important parts in medicinal chemistry. So there are many drugs which have benzoid in the substructure. And do you know which is most popular substitution pattern in drugs?

Here is an interesting article to answer the question.

Here is a link to ACS publication

And following link is arxiv, it’s freely accessible ;)

The author retrieved data from drug bank and analyzed substitution pattern of approved drugs. The analysis is not only substitution pattern but also any other properties such as shape, drug likeness.

It is surprising for me that major trend of the substitution pattern is same compared to 1958-1988 and 1998-2018. It means that new synthetic reaction such as Pd mediated cross coupling doesn’t change the pattern. 1,4-substitution (-para) is the most major.

from Fig2 of arxiv article

The article focused on benzenoid, does not include hetero aromatic cycles.

After reading the article I have a question. Is the trend based on reactivity of benzenoid or protein-ligand recognition process?

Anyway, retrospective analysis from multiple point of view provides new insight of research I think.

Compound Generator with Graph Networks, GraphINVENT take2 #chemoinformatics #RDKit #PyTorch

I posted about graph based compound generator named ‘GraphINVENT’ some days ago.

Fortunately I could get response from author and could get useful information about their model.

The hyperparameter of the model is very important and difficult to optimize but I could get suitable learning rate to train the GDB-13 small set. I changed lr from 1e-3 to 5e-2 (currently original repository changed to same value). And tried model building again. Train model with same training molecules but different learning rate and more epochs (from 100 to 400), I could generate following molecules.

molecules from 10 epochs.

Model trained only 10 epochs generates many invalid molecules (Xe) and ….. not so interesting molecules.

molecules from 100 epochs.
molecules from 400 epochs

Hmm… After 400 epochs training molecules seem not so but for me. Most of molecules has suitable ring size.

In summary, it is difficult to optimize hyperparameter and it’s really improve compound quality. Does it mean loss function of model optimization still has space for improvement? I think so, because the model uses KL divergence to compare distribution of molecules. But generator performance should evaluated not only compound distribution but also any other druglikeness metrics (for drug like molecule generation). It’s worth to know for me that hyperparameters are very important for the generative model. ;)

Compound Generator with Graph Networks, GraphINVENT #chemoinformatics #RDKit #PyTorch

Here is a new article from Esben et. al. about de novo compound generator with graph network which is named GraphINVENT.


Graph based approach has advantage for compound generation compared to SMILES based approach. It doesn’t need to learn grammar of SMILES. Graph approach represents molecule as graph, atom is node and bond is edge. However currently SMILES based approach works very well even if the approach is required to additional task for learning compound structure. So I have an interest this work to check current status of Graph based compound generation.

Their proposed platform named ‘GraphINVENT’ used six different GNNs! Wow it seems very complex for me. And the implementation is used rdkit and pytorch without using any graph NN frame work such as PyG or DGL.

They compared several compound generation algorithm with MOSES and discussed advantage and disadvantage of GGNN based compound generator. GraphINVENT can generate valid molecules but difficult to tuning hyper parameters.

Fortunately the authors disclosed source code so I tried to use GraphINVENT.

At first I made new conda env for the test. The details are described in original repository.

$ git clone
$ cd GraphINVENT
$ conda env create -f environments/GraphINVENT-env.yml
$ conda activate GraphINVENT-env

Now I created test env for GraphINVENT and then I modified default setting because my PC has GPU (GeForce GTX 1650) but it has enough size for gpu based learning with default settings. And edited python path of

GraphINVENT$ vim graphinvent/parameters/
# edit batch_size and block_size to smaller size
# line 96
model_common_hp_dict = {
    "batch_size": 100, # 1000>>100
    "block_size": 1000, # 100000>>1000

GraphINVENT$ vim 
# set paths here
python_path = f"/home/iwatobipen/miniconda3/envs/GraphINVENT-env/bin/python"
graphinvent_path = f"./graphinvent/"
data_path = f"./data/"

Now ready to run. Move to graphinvent folder and run

GraphINVENT$ cd graphinvent
GraphINVENT/graphinvent$ python
* Running job using HDF datasets located at /home/iwatobipen/dev/GraphINVENT/data/gdb13_1K/
* Checking that the relevant parameters match those used in preprocessing the dataset.
-- Job parameters match preprocessing parameters.
* Run mode: 'train'
* Loading preprocessed training set.
-- time elapsed: 0.00075 s
* Loading preprocessed validation set.
-- time elapsed: 0.00123 s
* Loading training set properties.
* Touching output files.
* Defining model and optimizer.
-- Initializing model from scratch.
-- Defining optimizer.
* Beginning training.
* Training epoch 1.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:05<00:00,  2.05it/s]
* Training epoch 2.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:05<00:00,  2.10it/s]
* Training epoch 3.
 99%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 115/116 [00:06<00:00, 18.78it/s]Learning rate has reached minimum lr.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 116/116 [00:06<00:00, 17.93it/s]
* Generating 100 molecules.
Batch 0 of 0
102it [00:00, 325.18it/s]                                                                                                                                   
Generated 102 molecules in 0.3228 s
--316.01 molecules/s
* Evaluating model.
-- Calculating NLL statistics for validation set.
-- Calculating NLL statistics for training set.
* Saving model state at Epoch 100.
-- time elapsed: 653.27597 s

After training, generated molecules are stored in GraphINVENT/output/generation folder. OK let’s check the structure.

from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem.Draw import rdDepictor
## At first, I loaded training compound molecules.
trainsmi = []
with open('data/gdb13_1K/test.smi', 'r') as train:
    for i, l in enumerate(train):
        if i == 0:
        trainsmi.append(l.split(' ')[0])

## These compounds are generated from GGNN
gensmi = []
with open('output/generation/epoch100_0.smi', 'r') as gen:
    for i, l in enumerate(gen):
        if i == 0:
        gensmi.append(l.split(' ')[0])

trainmol = [Chem.MolFromSmiles(smi) for smi in trainsmi]
genmol =  [Chem.MolFromSmiles(smi) for smi in gensmi]

DrdImage(trainmol[:20], molsPerRow=5)
Draw.MolsToGridImage(genmol[50:], molsPerRow=5)

Here is a test molecules. GDB13 molecules seems undruggable structure.

Next compounds are generated molecules.

Hmm…. Some compounds have large size of rings (7~8 membered ring) and unstable substructures. Example molecules in the article shows more druggable structures so it depends on training dataset and which algorithm is used in the GraphINVENT. I could check that the code work well in my PC but I should train with more suitable dataset and GPUs for the code evaluation.

Which do you like RNN based compounds generation or GGNN based compounds generation? ;)

Difference between santize mol and not sanitize mol #memo #rdkit

I posted about fast compound search with rdkit. And in the post, I used patternfinger print in the post.

Today I checked behavior of the fingerprint. Patternfingerprint can calculate molecules which is not sanitized. However the fingerprint is different to the fingerprint which is calculated from sanitized mol.

Here is a simple example.

from rdkit import Chem
from rdkit.Chem import DataStructs
from rdkit.Chem import AllChem
m1 = Chem.MolFromSmiles('c1ccccc1')
m2 = Chem.MolFromSmiles('c1ccccc1', sanitize=False)
m3 = Chem.MolFromSmiles('C1=CC=CC=C1')
m4 = Chem.MolFromSmiles('C1=CC=CC=C1', sanitize=False)
mols = [m1, m2, m3, m4]
for m in mols:
fps = [Chem.PatternFingerprint(m) for m in mols]
for fp in fps:
    print(DataStructs.TanimotoSimilarity(fp, fps[0]))

The output is below.


I calculated fingerprint from same structure but different settings. And last one, kekulized and non sanitized molecule gave different fingerprint.

So I think that it is important to make fingerprint db with suitable format of SMILES.

Does Bigdata and chemists knowledge make good molecular representation? #memo #machine_learning #chemoinformatics

Here is an exciting article published by Alpha A. Lee et al in ACS journal. And the article is freely accessible. URL is below.

They compared the performance of several molecular fingerprints for QSAR task.

It is interesting for me that they used not only traditional ECFP like fingerprint but also ‘CAS fingerprint’. Regarding the article, cas finger print is an expert system produced by CAS. As you know CAS has huge amount of chemical data and scientific experts. Thus cas fingerprint developed from big data and experts knowledge.

They checked the performance against 88 different target prediction task and found that cas fingerprint outperforms most of the commonly used fingerprints with bit. (see Table 1, 2) And they discussed the reason. Their opinion is that cas fingerprint developed by experts and they can identify rare chemical features. I agree that point and feel interesting.

Because in recent image classification area, Deep Learning based approach outperforms traditional featurization such as SIFT. It means no human expert based feature works well in the task. But in the chemoinformatics area, expert based FP works well compared to systematic FP.

Unfortunately, CAS FP can’t use freely, so I can’t take bench mark on my private PC. But it worth to know that there is still room of improvement to improvement of compound representation.

It’s very hot topics in chemoinformatics!!!!

Relation ship between dihedral deg and atomic charge #psi4 #RDKit #psikit

Recently psikit repository got PR about RESP charge calculation. Thanks for PR. And I have question about the relation ship between compound conformation and partial charge.

Fortunately, psikit already has an example for torsion scan thank @fmkz___ for sharing useful code. The example code is here.

Following code is same as example code linked above. But I calculated not only energy but also RESP charge at each conformer.

Here is an code.

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
from rdkit.Chem import rdMolTransforms
from rdkit.Chem.Draw import IPythonConsole

def mol_with_atom_index(mol):
    for atom in mol.GetAtoms():
    return mol

import py3Dmol

I used 2-Propenoic acid methyl ester as an example.

mol = Chem.MolFromSmiles('C=CC(=O)OC')
hmol = Chem.AddHs(mol)

I generated conformers which have different dihedral angle of bond between carbonyl carbon and carbon of double bond. To generate conformer which has different dihedral angle can generate with SetDihedralDeg method of rdkit. B3LYP/6-31G* is used in following example.

from psikit import Psikit
pk = Psikit()

resp_charges = []
dihedral_energies = []
dihedral_degrees = [i for i in range(0, 360, 10)]

for deg in dihedral_degrees:
    conformer = pk.mol.GetConformer(0)
    rdMolTransforms.SetDihedralDeg(conformer, 0, 1, 2, 3, deg)
    degy ="b3lyp/6-31g*")
    print(deg, degy)
        resp_chage = pk.calc_resp_charges()
        resp_chage = []

OK let’s check results. At first check relative energy of the molecule with different angle.

import numpy as np
dihedral_energies_array = np.array(dihedral_energies)

rerative = (dihedral_energies_array - dihedral_energies_array.min()) * pk.psi4.constants.hartree2kcalmol
%matplotlib inline
import matplotlib.pyplot as plt'ggplot')
plt.plot(dihedral_degrees, rerative)
plt.xlabel('Dihedral Angle degree')
plt.ylabel('Relative energy kcal/mol')
relative energy

The compound is simple alpha-beta unsaturated ester so minimum energy is shown at angle at 0 and 180 degree seems reasonable.

How about resp charge of beta carbon?

chargeofO = []
chargeofC = []
for respc in resp_charges:
        oc = respc[3]
        cc = respc[0]
        oc = None
        cc = None
plt.plot(dihedral_degrees, chargeofC, c='b')
plt.xlabel('Dihedral Angle degree')
plt.ylabel('RESP Charge of Caron_0')

Then I got following figure about carbon which index is zero (beta carbon).

RESP charge of beta carbon at different dihedral angle.

The trend of resp charge of beta-carbon is totally opposite, 0 and 180 deg shows highest charge (90 deg shows most negative value). I think it indicates that electron of conformers which have dihedral angle 0, 180 deg are de-localized. And it is important that partial charge of atoms are strongly depends on molecular conformation. So conformation search and geometry optimization is important task for computational chemistry.

Today’s code can be found from following URL. With nbviewer, 3d compound is visible but gist can’t render 3d molecule.

Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
view raw torsion_resp.ipynb hosted with ❤ by GitHub

Build accurate model with small training data and quantum chemistry #memo #from_ChemRxiv

Recently I read the nice article from ChemRxiv.
Here is the link ;)

The title is ‘Machine Learning Meets Mechanistic Modelling for Accurate Prediction of Experimental Activation Energies’.

I don’t have experience there area but I found and read publications which use Mechanistic DFT. The author mentioned that DFT based approach has difficulties to calculate reaction such as ionic reactions in solution. On the other side, machine learning approach can solve the issue (soluvation) however many training data is required to build accurate model.

So the article proposed hybrid models, it means mechanistic DFT and QSPR combination approach. The article focused on SNAr reaction which is one of the popular reaction in pharma.

To build the predictive model, they used reaction SMILES as input then optimize ground state and transition state and calculate reaction features. Details are shown in Fig. 5.

They compared performance of several models and found that Gaussian process regressor with Full descriptor works very well. Also the model which is trained with descriptors without transition state(TS) data works well. It is interesting for me because I thought TS is key of reaction prediction but difficult to guess the TS.

In the research they revealed the key feature of the SNAr reaction from noTS features.

I think Fig. 8(a) shows the power of the proposed approach. This figure shows learning curve giving the MAE as a function of number of reactions in the training set. MAE of hybrid models lower than chemical accuracy even if the training samples are small (-150). Deep learning based model(BERT) also worked well if more data set is available (350-).

The article not only describes the model performance but also describes AD(applicability domain) and showed that the model has good accuracy to not only interpolate but also extrapolate.

I could learn lots from the publication. It really helpful for me.

転職してから5ヶ月位たった #Japanse_entry #diary











Get and Draw molecular fragment with user defined path #RDKit #memo

Chemical structure can represent as graph atoms as nodes, bonds as edges. And some compound fingerprints based on the graph. These algorithm extract fragment from molecule given radius of center atom.

To get atom environment of RadiusN, FindAtomEnvironmentOfRadiusN method or rdkit is useful.

It can get fragment from molecule with given radius of specific atom.

So I would like to show how to do it. At first import libraries.

from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import AllChem

Then define sample molecule and getSubmolRadN.

def getSubmolRadN(mol, radius):
    for atom in atoms:
        env=Chem.FindAtomEnvironmentOfRadiusN(mol, radius, atom.GetIdx())
        submol=Chem.PathToSubmol(mol, env, atomMap=amap)
        subsmi=Chem.MolToSmiles(submol, rootedAtAtom=amap[atom.GetIdx()], canonical=False)
        submols.append(Chem.MolFromSmiles(subsmi, sanitize=False))
    return submols
mol = Chem.MolFromSmiles('n1c(N3CCN(C(=O)C)CC3)c(C#N)cc2cc(OC)ccc12')

FindAtomEnvironmentOfRadiusN can get env around atom of radius N and then, call to PathToSubmol with env to get submol of the env. After that to keep query atom as atom index 0, I call MolFromSmiles with rootedAtAtom option.

Now ready, let’s check the submols.

submols = getSubmolRadN(mol, 1)
Draw.MolsToGridImage(submols, highlightAtomLists=[[0] for _ in range(len(submols))], molsPerRow=5)
submols = getSubmolRadN(mol, 3)
Draw.MolsToGridImage(submols, highlightAtomLists=[[0] for _ in range(len(submols))], molsPerRow=5)

It seems work.

Here is a whole code of the post.

Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
view raw mol2fragment.ipynb hosted with ❤ by GitHub
gist code

Submols are rdkit mol object so I can use the object any tasks.

Also as reader knows that current RDKit has cool method such as DrawMorganBits, DrawMorganEnv etc. It is useful to check bit information of molecule. Details are provided in original document.

The article about the guideline of RNN based molecular generation #memo #chemoinformatics

I’m in summer vacation from today. Due to pandemic, we don’t have plan to go travel in this summer vacation ;( Hope the situation will go soon….

As reader know recently SMILES based de novo design is used for not only material design but also drug discovery project. Some years ago, the approach generates many invalid molecules because it is difficult to learn grammar of SMILES. However recently RNN based approach works very well also other approaches works well too GAN, Graph Based and image based(???). And chemoinformatitian can generate focused compound set with RNN generator and transfer learning technique.

I would like to introduce a nice article about guide line of SMLIES based generator.

They investigated the effect of data set and number of epochs for transfer learning. They used REINVENT(RNN based generator) and made base model with ChEMBL data set. Then preformed transfer learning with some kinds of specific data set such as target focused data, patent data etc.

I don’t describe details about the article here if reader who has interest the article please check it ;)

Their results are interesting for me. It indicates that the model which is trained large and general compound data can generate diverse of valid molecule and also indicates that it can learn specific compound feature(distribution) with small amount of compound set. For example macro cyclic compound, spiro cyclic block containing compound.

It means that to build focused library generator, user don’t need to prepare large amount of focused training data set but need to prepare general data set for learning SMILES grammar and small data set for transfer learning.

Now we can use many open source based de novo compound generator algorithm and techniques. Is there best way to do de novo design? No, it depends on our situation and requirements ;)

…. There are many publication and codes are available in these days…. I need to keep studying and opening my eyes……

What is scaffold / Medicinal chemist feeling #memo

Recently I’m interested in the following article.

The author tried to detect chemical series (scaffold) like medicinal chemist. In the drug discovery project chemical series / scaffold is very important concept to analyze compounds SAR but it is fuzzy. As chemoinformatitian know Bemis-Murcko scaffold is one of the solution for systemic detection of chemical series but BM scaffold isn’t mach chemist feeling sometime.

So more chemist friendly definition of the chemical series flow seems useful.

The author used UPGMA (unweighted pair group method with arithmetic mean) to cluster the molecules structure and calculate MCS to detect scaffold and then analyze frequency of the scaffold. Final step is interesting because more frequent MCS such as benzene etc. is too common and not suitable for scaffold.

So their approach seems very reasonable for me.

BTW, this approach seems effective for retrospective analysis to detect chemical series of each project. But chemist defines chemical series at the beginning of their project. There are few compound data at the time. So chemist intuition/feeling/common sense are required to define the chemical serirs.

I feel it is very interesting point. Recently AI based drug discovery is raising in these are. But current AI(ML) can’t move 0 to 1 even if it can move 1 to 2 or 10 ;) (it is my personal opinion..)

So important point is The right person(tool) in the right place. It’s difficult things but I should think at that point.

Use Hash value in python code #memo

Recently when I reading a code and test it, the code didn’t work. ;( The code used python hash function for generating some values. This is the reason why the code didn’t work in python3. Python3 hash function returns same value in only same session.

So it should not use as permanent value. For example…

$ python -c "print(hash('a'));"

$ python -c "print(hash('a'));"

Now I god different value from same string object ‘a’. Next, I used hashlib.

$ python -c "import hashlib;m=hashlib.md5(b'a').digest();print(m)"

$ python -c "import hashlib;m=hashlib.md5(b'a').digest();print(m)"

Next I could got same hashed value from binary object.

To get same value with hash() function for testing, set PYTHONHASHSEED value in environment value is required.

$ python -c "print(hash('a'));"
$ python -c "print(hash('a'));"

It just memorandum for me….

FPSim2 for fast compound search #fpsim2 #rdkit #chemoinformatics

In the previous posts, I described various way to search compounds from data source such as ChEMBL. For example… using rdkit postgre cartridge, GPUsim which is developed by schrodinger and rdSubstructLibrary which is implemented in RDKit. All methods are very useful.

And today I tried to use FPSim2 which was described in ChEMBL-OG. The package can be installed with conda command.

$ conda install -c efelix fpsim2

If your PC has GPU and cupy, FPSim2 can search compounds with GPU power! And it is easy to use. OK let’s test it.

At the first step, I made two fingerprint databases, one is morgan_fp db which is used for similarity search and the other is rdk_pattern_fp db which is used for substructure search. To make DB, few line of coding is required.

# retrieve smiles from chembl27 and save them to text files.
from pychembldb import *
from rdkit import Chem
q = chembldb.query(CompoundStructure.canonical_smiles)
with open('chembl27.smi', 'w') as outf:
    for row in q:
            if Chem.MolFromSmiles(row[0]):
# make data bases.
from import create_db_file

create_db_file('chembl27.smi','chembl27_morgan2_2048.h5', 'Morgan', {'radius': 2, 'nBits': 2048})
create_db_file('chembl27.smi', 'chembl27_rdk_pat.h5', 'RDKPatternFingerprint'

It took about 30-40 min to make FP database on my PC. (Dell XPS 7500)

Now ready to search! I used aspirin as test molecule. It is same as original documentation code. ### FPSim2CudaEngine for GPU search, FPSim2Engine for CPU search.###

from FPSim2 import FPSim2Engine
from FPSim2 import FPSim2CudaEngine
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
import linecache
chemblsmi = 'chembl27.smi'

fpe = FPSim2Engine('chembl27_morgan2_2048.h5')
fpe_cuda = FPSim2CudaEngine('chembl27_morgan2_2048.h5')

query = 'CC(=O)Oc1ccccc1C(=O)O'
qmol = Chem.MolFromSmiles(query)

FPSim2 search engine returns index of hit molecules. So I defined simple converter index to smiles. This example compound id is same to db index.

$ wc chembl27.smi
1941410 1941410 112956700 chembl27.smi ## I have 1,941,410 molecules.

def getmol(index):
    smi = linecache.getline(chemblsmi, index).rstrip()
    return smi

OK try to similarity search with tanimoto coefficient 0.7 as threshold.

%time results = fpe.similarity(query, 0.7) #CPU search
> Wall time: 8.58 ms
%time results = fpe_cuda.similarity(query, 0.7) #GPU search
> Wall time: 19.6 ms

In the above case CPU search is faster the GPU search…. I think it is due to CPU>GPU overhead takes few msec… If I used more lower threshold to search, GPU engine worked faster than CPU engine.

i.e. threshold = 0.5 with same query molecule,
CPU; Wall time: 34 ms
GPU; Wall time: 20.9 ms

Next try to SSS. FPSim2Engine provides the method but FPSim2CudaEngine doesn’t have the method, so I tried SSS with only CPU Engine.

pattern_fpe = FPSim2Engine('chembl27_rdk_pat.h5')
%time results_pat = pattern_fpe.substructure(query)
> Wall time: 83.8 ms # search with single thread
%time results_pat = pattern_fpe.substructure(query, n_workers=16)
> Wall time: 33.2 ms # search with multiple threads

Search performance is improved by using multiple cores. Finally I tried to use rdkit rdSubstructLibrary.

import pickle
library = pickle.load(open('chembl27_ssslib.pkl', 'rb'))
%time res = library.GetMatches(qmol)
> Wall time: 39.9 ms

rdSubstructureLibrary perform search with multi threads. So the performance seems almost same to FPSim2 SSS.

In brief summary, rdkit rdSubstructLibrary and FPSim2 are useful for compound search. FPSim2 can take moleclar_id in FP db it seems good point however current version can take only integer. I’m happy to use many tools for compound search with very few lines of code.

Today’s experiments are uploaded in my gist.

Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

Fast substructure search module of RDKit #rdkit #memo

Recently I posted an example of substructure search with razi, rdkit postgres cartridge. It works well but sometime I would like to conduct SSS faster. And I could get useful information from Greg, yamasakit.

rdSubstructLibrary module can perform fast substructure search! The details are described in rdkit official blog post. I never tried to use the module so I tested the module and compared to razi with simple structure query. Simple code is below..

In my PC sss with rdSubstructLibrary almost 10 times faster than postgresq-rdkit cartridge.

So I think it is very useful for many chemoinformatitian. However the module can’t store compound ID so if I use the approach to ChEMBL DB, I think I need to make index-ChEMBL ID table for getting compound relative assay data.

Substructure search with SMARTS query against ChEMBLDB #rdkit #razi #pychembldb

Recently I often use razi for making structure search because it is very easy to integrate many workflow written in python.

Today I would like to show how to perform substructure search with SMARTS query in ChEMBL. Because I’m modifying pychembldb to integrate razi for enabling structure search in pychembldb.

To perform substructure seach with SMARTS in razi, SMARTS string should be convert query object with mol_from_smarts method.

So I wrote sample code which search MCS from given compound list and perform search with the MCS. Code is shown below.

Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

Modified version of pychembldb seems work fine. But sometime it takes long time for retrieving response, it depends on query smarts.

I would like to know way to improve the search speed if there is. Any suggestions comments are greatly appreciated ;)

Make pandas dataframe with r-group information #memo

I often forget many things …. So there are same topics will be posted in my blog. Sometime it’s updated due to change of package version or some reasons.

And I posted very similar code previously. But I posted again to remember the procedure for myself. It’s just memo…

PandasTools of RDKit makes easy to integrate rdkit and pandas. To use the module, user can render molecules on pandas dataframe, filter by substructure like postgres cartridge, calculate descriptors and do many things.

Also PandasTools can render rgroup decomposition data very easily. ‘PandasTools.RGroupDecompositionToFrame’ works well for the task. I tried to use the method in following test code. All code is uploaded my gist.

  • Load SDF to dataframe
  • Generate 2D code because original data has 3D code
  • get MCS from bi-cyclic compounds and pruned bonds which aren’t in ring. It’s defined as scaffold
  • Filter data which has scaffold as substructure
  • Then perform rgroup decomposition and load data to pandas dataframe
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

The function can automatically attach R-group number, it is useful for further analysis such as FW analysis etc.

I don’t have many examples which use rdRGroupDecomposition method in my code. I would like to think about new idea which uses the method.

Does the fastest computer find drug more efficiently?

Recently I read good news that the latest supercomputer developed by Japan named ‘Fugaku’ has the world’s fastest computing speed. Th e ULR of the japan times is below.

Recently computational chemists need to handle huge amount of virtual compounds and data (medicinal chemist too ;)). So many computer resources are required to drug discovery projects.

By using supercomputer, virtual screening with massive virtual compounds will performed in short time. So it provides many opportunity to pick up screening canditates.

Does it really mean supercomputer accelerates drug discovery process? Docking study, long time molecular dynamics and ab initio calculation in short time is useful for researchers, I agree that. However I think biological process is too complex to simulate current computing. And target protein and ligand interaction prediction is only one side of drug development process.

So some news papers describes ‘Fugaku for drug discovery’ looks overhype for me.

But I thought that it is difficult computer to win human in Japanese Chess or it will tooooo long time to develop computer which can win to human when I was high school student. But now, computer wins professional Japanese chess player.

So I looking forward to progress of the drug discovery project which uses Fugaku.

Optimize ML model with optuna and visualize the result with MLFlow #informatics #machine learning

As you know Optuna is very useful and powerful package for machine learning. I often use the package in my own task. And MLFLOW is also useful package. I posted about mlflow before. MLflow has many functions for visualize experiment results and manage models.

I think it will be useful if models can be optimized optuna and be managed with mlflow. Fortunately new version of optuna integrates mlflow ;)

It sounds nice doesn’t it. I used the function with simple Iris dataset.

At first, I optimized SVC model with optuna. The code is below. To integrate optuna and mlflow, MLflowCallback should be called from optuna.integration.mlflow. And pass the object to optimize method.

import os
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import optuna
from optuna.integration.mlflow import MLflowCallback

iris = load_iris()
trainx, testx, trainy, testy = train_test_split(,, test_size=0.2)

def objective(trial):
    gamma = trial.suggest_loguniform('gamma', 1e-3, 3.0)
    C = trial.suggest_loguniform('C', 1e+0, 1e+2/2)
    kernel = trial.suggest_categorical('kernel', ['linear','rbf','sigmoid'])
    svc = SVC(gamma=gamma, C=C, kernel=kernel), trainy)
    predy = svc.predict(testx)
    accuracy = accuracy_score(testy, predy)
    return accuracy

if __name__=='__main__':

    mlflc = MLflowCallback(tracking_uri='ml_exp',
    study = optuna.create_study(study_name='iris_test')
    study.optimize(objective, n_trials=50, callbacks=[mlflc])

The code above optimizes kernel, gamma and C of SVC. After optimization, all experiments data is stored in ‘ml_exp’ folder where is defined in MLflowCallback.

Then run mlflow ui command to check the results. And open localhost:5000 from web browser.

$  mlflow ui --backend-store-uri file:///home/user/dev/mlflowsample/test_optuna/ml_exp/

Flow webbrowser, user can check performance of each model and relation ship between model accuracy and hyper parameters.

MLflow not only visualize model performance but also serve model. Today I showed simple example of model performance visualization only but I’ll post more examples related chemoinformatics topics near the feature.