New fingerprint/MinHash FingerPrint #RDKit #Chemoinformatics

Recently I found an article that describe new method for fast fingerprint calculation.
You can read the article from chemrxiv, URL is below.
https://chemrxiv.org/articles/A_Probabilistic_Molecular_Fingerprint_for_Big_Data_Settings/7176350
They used MinHash method.
MinHash method is the way to estimate jaccard similarity very efficiently. The authors developed MHFP (MinHash Fingerprint) and compared the performance with ECFP4.
”’
? MinHash ?
for example..
http://mccormickml.com/2015/06/12/minhash-tutorial-with-python-code/
”’
They discussed the performance of MFHP6 (6 means radius 3) and the FP generally outperforms MHFP4, ECFPxs.
In fig6. shows performance analysis of k-nearest neighbor search and MHFP6 works very nice and fast.

Fortunately, author disclosed source code on github. You can use it if you would like to use it.
https://github.com/reymond-group/mhfp

Now I tried to use it and compared similarity between ECFP and MHFP.
Code is below.

@jupyter notebook
Load packages.

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from mhfp.encoder import MHFPEncoder
mhfp_encoder = MHFPEncoder()
/sourcecode]

Calculate fingerprints!

mols = [mol for mol in Chem.SDMolSupplier('cdk2.sdf') if mol != None]
nmols = len(mols)
#Calc morgan fp
mg2fps = [AllChem.GetMorganFingerprintAsBitVect(mol, 3) for mol in mols]
#Calc min hash fp
mhfps = [mhfp_encoder.encode_mol(mol) for mol in mols]

Check them!

tanimoto_sim = []
for i in range(nmols):
    for j in range(i):
        tc = DataStructs.TanimotoSimilarity(mg2fps[i], mg2fps[j])
        tanimoto_sim.append(tc)
mhfps_sim = []
for i in range(nmols):
    for j in range(i):
        jaccard = 1. - MHFPEncoder.distance(mhfps[i], mhfps[j])
        mhfps_sim.append(jaccard)
a, b = np.polyfit(tanimoto_sim, mhfps_sim, 1)
y2 = np.int64(a) * tanimoto_sim + np.int64(b)
print(a, b)
> 1.033917242502858 -0.031604772419224866

This results shows ECFP6 and MHFP6 has good correlation I think.
Finally I made a simple scatter plot.

plt.scatter(tanimoto_sim, mhfps_sim)
plt.plot(tanimoto_sim, y2, color='black')
plt.xlabel('tanimoto')
plt.ylabel('mhfp sim')


All code is pushed to my repo.
https://nbviewer.jupyter.org/github/iwatobipen/chemo_info/blob/master/rdkit_notebook/MHFP_example.ipynb

In summary, I tried to use MHFP and it shows good correlation with ECFP.
I used very small dataset(47 molecules), so it can not check speed for large dataset.
I would like to check it near the future.

Last week, I participated CBI and a software UGM.
I am happy that I could have fruitful discussions. I could get many ideas for next challenge!
;-)

Advertisements

3D Alignment function of RDKit #RDKit

During the UGM, I was interested in Ben Tehan & Rob Smith’s great work.
They showed me a nice example of molecular alignment with RDKit.
RDKit has several function to perform 3D alignment. In the Drug Discovery 3D alignment of ligands is important not only Comp Chem but also Med Chem. After their presentation, I talked them and they told me that GetCrippenO3A is useful for 3D alignment.
Hmm, that’s sounds interesting.
I tried to use the function.
My example code is below. Following code run on ipython notebook. To visualize 3D structure of molecules, I used py3Dmol. It can visualize multiple 3D molecules and easy to use!

import py3Dmol
import copy
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import AllChem
from rdkit.Chem import rdBase
from rdkit.Chem import rdMolAlign
from rdkit.Chem import rdMolDescriptors
import numpy as np
p = AllChem.ETKDGv2()
p.verbose = True

If p.verbose set True, user can get SMARTS patterns which hit the definition of ETKDG like below.

[cH0:1]c:2!@;-[O:3][C:4]: 8 7 6 5, (100, 0, 0, 0, 0, 0)
[C:1][CX4H2:2]!@;-[OX2:3][c:4]: 3 5 6 7, (0, 0, 4, 0, 0, 0)
[O:1][CX4:2]!@;-[CX3:3]=[O:4]: 6 5 3 4, (0, 3, 0, 0, 0, 0)
[C:1][CX4:2]!@;-[CX3:3]=[O:4]: 0 1 3 4, (0, 0, 1, 0, 0, 0)
[cH0:1]c:2!@;-[O:3][C:4]: 3 5 10 11, (100, 0, 0, 0, 0, 0)
[C:1][CX4H2:2]!@;-[OX2:3][c:4]: 12 11 10 5, (0, 0, 4, 0, 0, 0)
[OX2:1][CX4:2]!@;-[CX4:3][OX2:4]: 10 11 12 16, (0, 0, 3, 0, 0, 0)
[cH0:1]c:2!@;-[O:3][C:4]: 3 5 10 11, (100, 0, 0, 0, 0, 0)
[C:1][CX4H2:2]!@;-[OX2:3][c:4]: 12 11 10 5, (0, 0, 4, 0, 0, 0)
[OX2:1][CX4:2]!@;-[CX4:3][N:4]: 10 11 12 17, (0, 0, 4, 0, 0, 0)
[cH0:1]c:2!@;-[O:3][C:4]: 3 5 10 11, (100, 0, 0, 0, 0, 0)

Next load molecules and generate conformers. I used cdk2.sdf which is provided in rdkit as sample.

mols = [m for m in Chem.SDMolSupplier('cdk2.sdf') if m != None][:6]
for mol in mols:
    mol.RemoveAllConformers()
hmols_1 = [Chem.AddHs(m) for m in mols]
hmols_2 = copy.deepcopy(hmols_1)
# Generate 100 conformers per each molecule
for mol in hmols_1:
    AllChem.EmbedMultipleConfs(mol, 100, p)

for mol in hmols_2:
    AllChem.EmbedMultipleConfs(mol, 100, p)
# for Ipython notebook
Draw.MolsToGridImage(mols)

To conduct GetCrippenO3A and GetO3A, I calculate crippen_contrib of each atom and MMFF params of molecules.

crippen_contribs = [rdMolDescriptors._CalcCrippenContribs(mol) for mol in hmols_1]
crippen_ref_contrib = crippen_contribs[0]
crippen_prob_contribs = crippen_contribs[1:]
ref_mol1 = hmols_1[0]
prob_mols_1 = hmols_1[1:]

mmff_params = [AllChem.MMFFGetMoleculeProperties(mol) for mol in hmols_2]
mmff_ref_param = mmff_params[0]
mmff_prob_params = mmff_params[1:]
ref_mol2 = hmols_2[0]
prob_mols_2 = hmols_2[1:]

OK Let’s align molecules and visualize them!
I retrieved the best score index from multi conformers of each molecule and added viewer.
For crippenO3A…

p_crippen = py3Dmol.view(width=600, height=400)
p_crippen.addModel(Chem.MolToMolBlock(ref_mol1), 'sdf')
crippen_score = []
for idx, mol in enumerate(prob_mols_1):
    tempscore = []
    for cid in range(100):
        crippenO3A = rdMolAlign.GetCrippenO3A(mol, ref_mol1, crippen_prob_contribs[idx], crippen_ref_contrib, cid, 0)
        crippenO3A.Align()
        tempscore.append(crippenO3A.Score())
    best = np.argmax(tempscore)
    p_crippen.addModel(Chem.MolToMolBlock(mol, confId=int(best)), 'sdf')
    crippen_score.append(tempscore[best])
p_crippen.setStyle({'stick':{}})
p_crippen.render()

For O3A…

p_O3A = py3Dmol.view(width=600, height=400)
p_O3A.addModel(Chem.MolToMolBlock(ref_mol2), 'sdf')
pyO3A_score = []
for idx, mol in enumerate(prob_mols_2):
    tempscore = []
    for cid in range(100):
        pyO3A = rdMolAlign.GetO3A(mol, ref_mol2, mmff_prob_params[idx], mmff_ref_param, cid, 0)
        pyO3A.Align()
        tempscore.append(pyO3A.Score())
    best = np.argmax(tempscore)
    p_O3A.addModel(Chem.MolToMolBlock(mol, confId=int(best)), 'sdf')
    pyO3A_score.append(tempscore[best])
p_O3A.setStyle({'stick':{'colorscheme':'cyanCarbon'}})
p_O3A.render()

In my example, both methods shows good results. To check the details, I will calculate ShapeTanimoto and/or RMSD etc.

In summary, rdkit has many useful functions not only 2D but also 3D. I would like to use this function in my project.

All code of the post, I uploaded my github repo.
https://nbviewer.jupyter.org/github/iwatobipen/chemo_info/blob/master/rdkit_notebook/rdkit_3d.ipynb

Development of new force field

Molecular force fields development is required human time and expertise.  Comp Chemists often uses FF for their task.  So force field is key parameter to conduct calculations.  But it has still room for improvement.
I have not know that major FF, AMBER-family does not have access to bond order information.

Recently I read an article for new FF development. The title is ‘Open Force Field Consortium: Escaping atom types using direct chemical perception with SMIRNOFF v0.1’
https://www.biorxiv.org/content/early/2018/07/13/286542.full.pdf+html

Fig1 in this article shows representative geometries of 1,2,3,4-tetraphenylbenzene. The geometries are quite different in different Force Fields.

The author proposed new approach called ‘direct chemical perception’ instead of atom typing.
This approach is based on SMIRKS patterns and named SMIRKS Native Open Force
Field (SMIRNOFF) format.
This new FF is now implemented via OpenMM and the Open Eye tools. And good news! The author has plan to implement the FF in RDKit.

If reader who can use OpenEye tool kits you are lucky I think!
Several bench mark is provided in the article and SMIRNOFF shows good performance.

New force field SMIRNOFF is simpler than other FF but powerful for describe feature of molecules.
I am looking forward to progress of the project and would like to think about the way deals with open source and industry.

If reader has interest in this FF, you can get it from the URL below.
https://github.com/openforcefield

Get 3D distance matrix with rdkit #RDKit

I updated rdkit of my env from 20180301 to 20180303 with anaconda. ;-)
When I want to get 3D distance matrix of the molecule I use Get3DDistanceMatrix method.
But I found that rdDistGeom.GetMoleculeBoundsMatrix returns almost same results.
3DDistance matrix is useful for feature of 3D QSAR.
I would like to use these method.

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import rdDistGeom as molDG

mol = Chem.MolFromSmiles('CCC')
bm = molDG.GetMoleculeBoundsMatrix(mol)
bm
Out[]: 
array([[0.        , 1.524     , 2.51279063],
       [1.504     , 0.        , 1.524     ],
       [2.43279063, 1.504     , 0.        ]])

AllChem.EmbedMolecule(mol)
Out[]: 0

dm=AllChem.Get3DDistanceMatrix(mol)
dm
Out[]: 
array([[0.        , 1.50401361, 2.47848679],
       [1.50401361, 0.        , 1.50913087],
       [2.47848679, 1.50913087, 0.        ]])

mol = Chem.MolFromSmiles('c1ccncc1')

bm = molDG.GetMoleculeBoundsMatrix(mol)
bm
Out[]: 
array([[0.        , 1.38925641, 2.42894217, 2.80158084, 2.42894217,
        1.38925641],
       [1.36925641, 0.        , 1.38925641, 2.39940019, 2.81851281,
        2.42894217],
       [2.34894217, 1.36925641, 0.        , 1.35507278, 2.3697344 ,
        2.81851281],
       [2.68158084, 2.31940019, 1.33507278, 0.        , 1.35507278,
        2.39940019],
       [2.34894217, 2.64739923, 2.2897344 , 1.33507278, 0.        ,
        1.38925641],
       [1.36925641, 2.34894217, 2.64739923, 2.31940019, 1.36925641,
        0.        ]])

IAllChem.EmbedMolecule(mol)
Out[31]: 0

dm=AllChem.Get3DDistanceMatrix(mol)
dm
Out[]: 
array([[0.        , 1.38579572, 2.3376015 , 2.70976216, 2.34726355,
        1.38818637],
       [1.38579572, 0.        , 1.39124965, 2.30040952, 2.68623726,
        2.37414223],
       [2.3376015 , 1.39124965, 0.        , 1.35847264, 2.28377475,
        2.61995998],
       [2.70976216, 2.30040952, 1.35847264, 0.        , 1.35652537,
        2.3746108 ],
       [2.34726355, 2.68623726, 2.28377475, 1.35652537, 0.        ,
        1.3905889 ],
       [1.38818637, 2.37414223, 2.61995998, 2.3746108 , 1.3905889 ,
        0.        ]])

And also I would like to extract 3D pharmacophore feature from molecule.
Example is below.

import os
from rdkit import Geometry
from rdkit import RDConfig
from rdkit.Chem import AllChem
from rdkit.Chem import ChemicalFeatures
from rdkit.Chem.Pharm3D import Pharmacophore
FEAT = os.path.join(RDConfig.RDDataDir, "BaseFeatures.fdef")
featfact = ChemicalFeatures.BuildFeatureFactory(FEAT)
mol = Chem.MolFromSmiles('c1cccnc1')
AllChem.EmbedMolecule(mol)
feats = featfact.GetFeauresFromMol(mol)
for feat in feats:
    ...:     print(feat.GetFamily())
    ...:     pos = feat.GetPos()
    ...:     print(pos.x, pos.y, pos.z)
    ...:     
    ...:     
Acceptor
-1.042491325161116 -0.7212337219142626 -0.7060687598321927
Aromatic
2.7755575615628914e-16 -1.942890293094024e-16 -0.2252839133689341

RDKit has many useful features for chemoinformatics!

Molecular set profiling with pandas_profiling #RDKit

Molecular descriptors are good indicator for molecular profiling. Visualize and analyze these descriptors are important to have a bird’s-eye view of given molecules set.

I often use “pandas” and “seaborn” to do it. Seaborn is powerful tool to make cool visualization but difficult to obtain statistics data.

Yesterday, I found interesting tool to analyze pandas data frame named “pandas_profiling”.
It seems very easy to make analyze report. It can be installed with conda/pip.
https://github.com/pandas-profiling/pandas-profiling

Let’s install the package and use!
First, call library.

import os
import pandas
import pandas_profiling
import pandas as pd
from rdkit import Chem
from rdkit import RDConfig
from rdkit.Chem import rdBase 
from rdkit.Chem import Descriptors
from rdkit.Chem.Descriptors import _descList
from rdkit.ML.Descriptors import MoleculeDescriptors
# I used cdk2.sdf dataset as test.
datadir =  os.path.join( RDConfig.RDDocsDir, "Book/data/cdk2.sdf" )

Then calculate descriptors and make dataframe

data = {}
for name in desc_name:
    data[name] = []
for descs in descs_list:
    for i, desc in enumerate(descs):
        data[desc_name[i]].append(desc)
df = pd.DataFrame(data)

Let’s make report. It is very very easy!!!! ;-)

pandas_profiling.ProfileReport(df)

Then you can get analyze repot with bar chart.
Snap shots are below.

This package provides not only summary of the dataset but also details of the data. It seems very cool package isn’t it?
You can check whole code is following URL.
https://nbviewer.jupyter.org/github/iwatobipen/chemo_info/blob/master/rdkit_notebook/pandas_profiling_test.ipynb

mol2graph and graph2mol #rdkit #igraph

I posted about mol to graph object before.
In the blog post, I wrote example that convert RDKit mol object to igraph object. It was one way. There was no method igraph to rdkit mol object.
So I wrote very simple converter from graph to molecule.

First, import modules.

import numpy as np
import pandas as pd
import igraph
from rdkit import Chem
from rdkit.Chem.rdchem import RWMol
from rdkit.Chem import Draw
from rdkit.Chem import rdmolops
from rdkit.Chem.Draw import IPythonConsole
IPythonConsole.ipython_useSVG = True

Then define two way function, mol2graph and graph2mol. It is very simple.I did not sanitize process because I could not handle some compounds. RWMol method is very useful to do this work.

def mol2graph(mol):
    atoms_info = [ (atom.GetIdx(), atom.GetAtomicNum(), atom.GetSymbol()) for atom in mol.GetAtoms()]
    bonds_info = [(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond.GetBondType(), bond.GetBondTypeAsDouble()) for bond in mol.GetBonds()]
    graph = igraph.Graph()
    for atom_info in atoms_info:
        graph.add_vertex(atom_info[0], AtomicNum=atom_info[1], AtomicSymbole=atom_info[2])
    for bond_info in bonds_info:
        graph.add_edge(bond_info[0], bond_info[1], BondType=bond_info[2], BondTypeAsDouble=bond_info[3])
    return graph

def graph2mol(graph): 
    emol = RWMol()
    for v in graph.vs():
        emol.AddAtom(Chem.Atom(v["AtomicNum"]))
    for e in graph.es():
        emol.AddBond(e.source, e.target, e['BondType'])
    mol = emol.GetMol()
    return mol

Finally, I checked my function on jupyter notebook. And It worked well.

All code is uploaded my repo and can check from following URL. https://nbviewer.jupyter.org/github/iwatobipen/chemo_info/blob/master/rdkit_notebook/mol2g_g2mol.ipynb

mol2graph and graph2mol #rdkit #igraph

I posted about mol to graph object before.
In the blog post, I wrote example that convert RDKit mol object to igraph object. It was one way. There was no method igraph to rdkit mol object.
So I wrote very simple converter from graph to molecule.

First, import modules.

import numpy as np
import pandas as pd
import igraph
from rdkit import Chem
from rdkit.Chem.rdchem import RWMol
from rdkit.Chem import Draw
from rdkit.Chem import rdmolops
from rdkit.Chem.Draw import IPythonConsole
IPythonConsole.ipython_useSVG = True

Then define two way function, mol2graph and graph2mol. It is very simple.I did not sanitize process because I could not handle some compounds. RWMol method is very useful to do this work.

def mol2graph(mol):
    atoms_info = [ (atom.GetIdx(), atom.GetAtomicNum(), atom.GetSymbol()) for atom in mol.GetAtoms()]
    bonds_info = [(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond.GetBondType(), bond.GetBondTypeAsDouble()) for bond in mol.GetBonds()]
    graph = igraph.Graph()
    for atom_info in atoms_info:
        graph.add_vertex(atom_info[0], AtomicNum=atom_info[1], AtomicSymbole=atom_info[2])
    for bond_info in bonds_info:
        graph.add_edge(bond_info[0], bond_info[1], BondType=bond_info[2], BondTypeAsDouble=bond_info[3])
    return graph

def graph2mol(graph): 
    emol = RWMol()
    for v in graph.vs():
        emol.AddAtom(Chem.Atom(v["AtomicNum"]))
    for e in graph.es():
        emol.AddBond(e.source, e.target, e['BondType'])
    mol = emol.GetMol()
    return mol

Finally, I checked my function on jupyter notebook. And It worked well.

All code is uploaded my repo and can check from following URL. https://nbviewer.jupyter.org/github/iwatobipen/chemo_info/blob/master/rdkit_notebook/mol2g_g2mol.ipynb