I posted new function about scaffold tree which is implemented in rdkit 2020 03 before.
In previous my post, I showed example to draw scaffold tree with networkx. It could draw the scaffold tree as a network but molecular structures are not shown on the node. For chemist, structure image is important so I tried to draw network with structure image on node.
Fortunately cyjupyter is very useful package to do it, it can call cytoscape.js internally so user can draw network with many options.
OK let’s try. At first import required packages. And set example strcutures.
from rdkit import Chem
from rdkit.Chem import rdBase
from rdkit.Chem import Draw
from rdkit.Chem.Draw import rdMolDraw2D
from rdkit.Chem.Draw import IPythonConsole
import networkx as nx
from networkx.readwrite import cytoscape_data
from cyjupyter import Cytoscape
from rdkit.Chem import AllChem
from rdkit.Chem.Scaffolds import rdScaffoldNetwork
from urllib import parse
smiles_lsit = [
Then create scaffold tree following code is almost as same as previous post.
Now beta version of rdkit is available from anaconda! So I would like to try it. However I would like to test without contaminating current my environment. So I tried new version of rdkit with Docker.
Fortunately rdkit can be installed via conda, so I made Dockerfie based on miniconda3. Following dockerfile used continuumio/miniconda3. By using the image I could install required packages via conda command. ;)
RUN pip install --upgrade pip
RUN conda install -c rdkit/label/beta rdkit -y
RUN conda install -c conda-forge seaborn pandas scikit-learn -y
RUN conda install -c conda-forge networkx jupyter jupyterlab -y
RUN pip install py2cytoscape
ENTRYPOINT ["jupyter-lab", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root", "--NotebookApp.token=''"]
Then jupyter-lab will launch and I can access notebook the following URL. ‘localhost:8888’
And I tried to use new function named rdScaffoldNetwork. It is the implementation of scaffold tree. If reader would like to know what is the scaffold tree, please check the original article.
It is very useful for analyze compounds scaffolds. Following code is only just used it with simple case. So it is not practical example for real drug discovery project. I would like to apply the approach against real project near the feature.
As you know, there many packages for quantum chemistry not only commercial software but also free tools. And each soft has own output format. So user need to understand how to use it. But it is very time consuming step I think.
ORBIKIT is a modular python toolbox for cross-platform post processing of quantum chemical wavefunction data.
Molecular fingerprint(FP) is a very important for chemoinformatics because it is used for building many predictive models not only ADMET but also biological activities.
As you know, ECFP (Morgan Fingerprint) is one of golden standard FP of chemoinformatics. Because it shows stable performance against any problems. After ECFP is reported, many new fingerprint algorithm is proposed but still ECFP is used in many case.
It means that more improvement of fingerprint is required but it’s too difficult task.
MAP of MAP4 means “MinHashed atom-pair fingerprint up to a diameter of four bonds“.
The calculation process is below.
First, the circular substructures surrounding each non-hydrogen atom in the molecule at radii 1 to r are written as canonical, non-isomeric, and rooted SMILES string .
Second, the minimum topological distance TPj,k separating each atom pair(j, k) in the input molecule is calculated.
Third,all atom-pair shingles are written for each atom pair (j, k) and each value of r, placing the two SMILES strings in lexicographical order.
Fourth, the resulting set of atom-pair shingles is hashed to a set of integers using the unique mapping SHA-1, and its corresponding transposed vector is finally MinHashed to form the MAP4 vector.
It seems similar approach to ECFP but this approach uses minhashing techniques. It works very fast and this fingerprint outperform compared to their developed MHFP, ECPF and other fingerprints which are impremented RDKit.
In their article Fig2 showed performance of virtual screening against various targets and MAP4FP outperformed in many targets.
In the Fig8 shows Jaccard distance between set of molecules. MAP4 shows better performance against not only small molecule but also large molecule such as peptide, compared to other finger print such as atom pair FP, ECFP etc.
So I would like to use the FP on my notebook, so I tried to use it.
The author disclosed source code so I could get code from github.
Yesterday I posted benchmarking platform named ‘moses’ and found it worked for test data. And then I could get comment from @Mufei Li, developer of DGL that how about to try guacamol. I checked guacamol before but didn’t try it. So I installed guacamol and used it.
From original repo, GuacaMol is an open source Python package for benchmarking of models for de novo molecular design.
This package is developed by BenevolentAI. This package can assess de novo molecular generator of following metrics.
Fr´echet ChemNet Distance
5 is different point to moses. KL divergence metrics is based on molecular descriptors. It is not fingerprint base. So different structure but similar molecular property sets will show high score.
For test the package I installed guacamol and used it.
Fortunately, guacamol can install via pip ;).
And I tested two metrics one was KLDivBenchmark and the other was UniquenessBenchmark. As you can see most of the following code is same as test code of original repository.
I uploaded my code to my gist.
Moses and Guacamol are both useful package for benchmarking.
It is important that we should get benchmark data compared to same baseline. However in the real drug discovery project, molecular properties which is required in the projects are depends on their situation. So it is not big problem of generator performance because current models can generate reasonable structures I think.
There are lots of publications about molecular generators. Each publication implements novel algorithms so we need tool for comparing these models that which is better for us.
I often use PCA, tSNE for chemical space visualization and calculate some scores such as QED, SA/SC Score and molecular properties. However I need the unified metrics. So I think Molecular Sets(MOSES) is nice tool to do it.
MOSES provides useful metrics shown below.
Fragment similarity (FRAG) which is defined as the cosine distance between vectors of fragment frequencies.
Scaffold similarity (SCAFF) which is defined as similar as FRAG but the metrics uses frequency of scaffolds instead of fragments.
Nearest neighbor similarity (SNN) which is the average Tanimoto similarity between test molecules and generated molecules.
Internal diversity (IntDiv_p) assesses the chemical diversity within the generated set of molecules. p=1 or 2
Freched ChemNet Distance (FCD) which uses the penultimate layer of ChemNet and measure distance of reference molecules and generated molecules.
Fortunately source code of MOSES are freely available. I installed moses in my PC and test it.
MOSES repository provides install script, so it is easy to install moses and required packages. I modified setup.py and install it because I use torch=1.2.0 but original setup.py requires ver 1.1.0. So I commented out the line 16 of setup.py.
After installed the package, I used moses from jupyter notebook. All molecules are borrowed from test script of the repository.
Following code is an example. It is easy to get metrics, just calling metrics.get_all_metrics(testmolecules, generatedmolecules). As shown below the method calculate all metrics which are implemented MOSES and show some additional properties such as ratio of valid molecule, qed, molwt etc.
import pandas as pd
from rdkit import Chem
from moses import metrics
test = ['Oc1ccccc1-c1cccc2cnccc12','COc1cccc(NC(=O)Cc2coc3ccc(OC)cc23)c1']
test_sf = ['COCc1nnc(NC(=O)COc2ccc(C(C)(C)C)cc2)s1',
gen = ['CNC', 'Oc1ccccc1-c1cccc2cnccc12',
metrics.get_all_metrics(test, gen, k=3)
I think moses is very useful tool for checking performance of molecular generator. Thanks the author for developing such as a nice tool for in silico drug discovery!
Install psi4, openmm and set PYTHONPASS. It was simple. And I need to modify some code to import the package correctly.
It uses PeriodicTable class of qcelemental, but the module can’t import _temp_symbol attribute from periodictable.py, so I added _temp_symbol line to qcelemental/periodic_table.py
"""Nuclear and mass data about chemical elements from NIST.
A : list of int
Mass number, number of protons and neutrons, starting with 0 for dummies.
Z : list of int
Atomic number, number of protons, starting with 0 for dummies.
E : list of str
Element symbol from periodic table, starting with "X" for dummies. "Fe" capitalization.
EA : list of str
Nuclide symbol in E + A form, e.g., "Li6".
List `EA` is a superset of `E`; that is, both "Li6" and "Li" present.
For hydrogen, "D" and "T" also included.
mass : list of :py:class:`decimal.Decimal`
Atomic mass [u].
For nuclides (e.g., "Li6"), the reported mass.
For stable elements (e.g., "Li"), the mass of the most abundant isotope ("Li7").
For unstable elements (e.g., "Pu"), the mass of the longest-lived isotope ("Pu244").
name : list of str
Element name from periodic table, starting with "Dummy". "Iron" capitalization.
from . import data
# Of length number of elements
self.Z = data.nist_2011_atomic_weights["Z"]
self.E = data.nist_2011_atomic_weights["E"]
self.name = data.nist_2011_atomic_weights["name"]
self._el2z = dict(zip(self.E, self.Z))
self._z2el = collections.OrderedDict(zip(self.Z, self.E))
self._element2el = dict(zip(self.name, self.E))
self._el2element = dict(zip(self.E, self.name))
# Of length number of isotopes
self._EE = data.nist_2011_atomic_weights["_EE"]
self.EA = data.nist_2011_atomic_weights["EA"]
self.A = data.nist_2011_atomic_weights["A"]
self.mass = data.nist_2011_atomic_weights["mass"]
self._eliso2mass = dict(zip(self.EA, self.mass))
self._eliso2el = dict(zip(self.EA, self._EE))
self._eliso2a = dict(zip(self.EA, self.A))
self._el2a2mass = collections.defaultdict(dict)
for EE, m, A in zip(self._EE, self.mass, self.A):
self._el2a2mass[EE][A] = float(m)
# following code was added
self._temp_symbol = [
"X", "H", "HE", "LI", "BE", "B", "C", "N", "O", "F", "NE", "NA", "MG", "AL", "SI", "P", "S", "CL", "AR", "K", "CA",
"SC", "TI", "V", "CR", "MN", "FE", "CO", "NI", "CU", "ZN", "GA", "GE", "AS", "SE", "BR", "KR", "RB", "SR", "Y",
"ZR", "NB", "MO", "TC", "RU", "RH", "PD", "AG", "CD", "IN", "SN", "SB", "TE", "I", "XE", "CS", "BA", "LA", "CE",
"PR", "ND", "PM", "SM", "EU", "GD", "TB", "DY", "HO", "ER", "TM", "YB", "LU", "HF", "TA", "W", "RE", "OS", "IR",
"PT", "AU", "HG", "TL", "PB", "BI", "PO", "AT", "RN", "FR", "RA", "AC", "TH", "PA", "U", "NP", "PU", "AM", "CM",
"BK", "CF", "ES", "FM", "MD", "NO", "LR", "RF", "DB", "SG", "BH", "HS", "MT", "DS", "RG", "UUB", "UUT", "UUQ",
"UUP", "UUH", "UUS", "UUO"
2. Line 81 of psiomm/molecule.py should be change below.
3. gaff2.xml which is forcefield parameters is placed in openmm installed path openmm/app/data/. Because for my env, gaff2.xml file couldn’t find in the folder.
4. Got cov_radii.py from qcdb/qcdb and placed in psi4/driver/qcdb folder.
Now ready. Let’s write code.
Import some packages at first I used psikit for molecular geometry generation.
import numpy as np
from psikit import Psikit
from psiomm import molecule
from psiomm import psi_omm as po
from simtk.openmm.app import Simulation
from simtk.openmm.app import ForceField
from simtk.openmm.app import Modeller
from simtk.openmm.app import PDBReporter
from simtk.openmm import Platform
from simtk.openmm import LangevinIntegrator
from simtk.openmm import Vec3
from simtk.unit import picosecond
from simtk.unit import kelvin
from simtk.unit import kilocalorie_per_mole
from simtk.unit import angstrom
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole
It worked fine. following code didn’t use psi4 based calculation, to do it I need to modify input file because last data has not only target molecule but also 50 water molecules. I’ll check the package and psi4 more.