Molecular set profiling with pandas_profiling #RDKit

Molecular descriptors are good indicator for molecular profiling. Visualize and analyze these descriptors are important to have a bird’s-eye view of given molecules set.

I often use “pandas” and “seaborn” to do it. Seaborn is powerful tool to make cool visualization but difficult to obtain statistics data.

Yesterday, I found interesting tool to analyze pandas data frame named “pandas_profiling”.
It seems very easy to make analyze report. It can be installed with conda/pip.
https://github.com/pandas-profiling/pandas-profiling

Let’s install the package and use!
First, call library.

import os
import pandas
import pandas_profiling
import pandas as pd
from rdkit import Chem
from rdkit import RDConfig
from rdkit.Chem import rdBase 
from rdkit.Chem import Descriptors
from rdkit.Chem.Descriptors import _descList
from rdkit.ML.Descriptors import MoleculeDescriptors
# I used cdk2.sdf dataset as test.
datadir =  os.path.join( RDConfig.RDDocsDir, "Book/data/cdk2.sdf" )

Then calculate descriptors and make dataframe

data = {}
for name in desc_name:
    data[name] = []
for descs in descs_list:
    for i, desc in enumerate(descs):
        data[desc_name[i]].append(desc)
df = pd.DataFrame(data)

Let’s make report. It is very very easy!!!! ;-)

pandas_profiling.ProfileReport(df)

Then you can get analyze repot with bar chart.
Snap shots are below.

This package provides not only summary of the dataset but also details of the data. It seems very cool package isn’t it?
You can check whole code is following URL.
https://nbviewer.jupyter.org/github/iwatobipen/chemo_info/blob/master/rdkit_notebook/pandas_profiling_test.ipynb

Advertisements

mol2graph and graph2mol #rdkit #igraph

I posted about mol to graph object before.
In the blog post, I wrote example that convert RDKit mol object to igraph object. It was one way. There was no method igraph to rdkit mol object.
So I wrote very simple converter from graph to molecule.

First, import modules.

import numpy as np
import pandas as pd
import igraph
from rdkit import Chem
from rdkit.Chem.rdchem import RWMol
from rdkit.Chem import Draw
from rdkit.Chem import rdmolops
from rdkit.Chem.Draw import IPythonConsole
IPythonConsole.ipython_useSVG = True

Then define two way function, mol2graph and graph2mol. It is very simple.I did not sanitize process because I could not handle some compounds. RWMol method is very useful to do this work.

def mol2graph(mol):
    atoms_info = [ (atom.GetIdx(), atom.GetAtomicNum(), atom.GetSymbol()) for atom in mol.GetAtoms()]
    bonds_info = [(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond.GetBondType(), bond.GetBondTypeAsDouble()) for bond in mol.GetBonds()]
    graph = igraph.Graph()
    for atom_info in atoms_info:
        graph.add_vertex(atom_info[0], AtomicNum=atom_info[1], AtomicSymbole=atom_info[2])
    for bond_info in bonds_info:
        graph.add_edge(bond_info[0], bond_info[1], BondType=bond_info[2], BondTypeAsDouble=bond_info[3])
    return graph

def graph2mol(graph): 
    emol = RWMol()
    for v in graph.vs():
        emol.AddAtom(Chem.Atom(v["AtomicNum"]))
    for e in graph.es():
        emol.AddBond(e.source, e.target, e['BondType'])
    mol = emol.GetMol()
    return mol

Finally, I checked my function on jupyter notebook. And It worked well.

All code is uploaded my repo and can check from following URL. https://nbviewer.jupyter.org/github/iwatobipen/chemo_info/blob/master/rdkit_notebook/mol2g_g2mol.ipynb

mol2graph and graph2mol #rdkit #igraph

I posted about mol to graph object before.
In the blog post, I wrote example that convert RDKit mol object to igraph object. It was one way. There was no method igraph to rdkit mol object.
So I wrote very simple converter from graph to molecule.

First, import modules.

import numpy as np
import pandas as pd
import igraph
from rdkit import Chem
from rdkit.Chem.rdchem import RWMol
from rdkit.Chem import Draw
from rdkit.Chem import rdmolops
from rdkit.Chem.Draw import IPythonConsole
IPythonConsole.ipython_useSVG = True

Then define two way function, mol2graph and graph2mol. It is very simple.I did not sanitize process because I could not handle some compounds. RWMol method is very useful to do this work.

def mol2graph(mol):
    atoms_info = [ (atom.GetIdx(), atom.GetAtomicNum(), atom.GetSymbol()) for atom in mol.GetAtoms()]
    bonds_info = [(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond.GetBondType(), bond.GetBondTypeAsDouble()) for bond in mol.GetBonds()]
    graph = igraph.Graph()
    for atom_info in atoms_info:
        graph.add_vertex(atom_info[0], AtomicNum=atom_info[1], AtomicSymbole=atom_info[2])
    for bond_info in bonds_info:
        graph.add_edge(bond_info[0], bond_info[1], BondType=bond_info[2], BondTypeAsDouble=bond_info[3])
    return graph

def graph2mol(graph): 
    emol = RWMol()
    for v in graph.vs():
        emol.AddAtom(Chem.Atom(v["AtomicNum"]))
    for e in graph.es():
        emol.AddBond(e.source, e.target, e['BondType'])
    mol = emol.GetMol()
    return mol

Finally, I checked my function on jupyter notebook. And It worked well.

All code is uploaded my repo and can check from following URL. https://nbviewer.jupyter.org/github/iwatobipen/chemo_info/blob/master/rdkit_notebook/mol2g_g2mol.ipynb

Make Drug central ER diagram with python #chemoinfo

Recently I knew useful database “DrugCentral“.
From About.
DrugCentral provides information on active ingredients chemical entities, pharmaceutical products, drug mode of action, indications, pharmacologic action. We monitor FDA, EMA, and PMDA for new drug approval on regular basis to ensure currency of the resource.

By using the site, user can search many information on web browser. And also the site provides posgresql dump file with all data of DrugCentral.
I had interest the data so I got dump file and use it.
Afte download the dump file, I made local db in my postgres env and install the db.

iwatobipen$ psql -U postgres
postgres=# create database drugcentral;
postgres=# \q
iwatobipen$ psql drugcental < drugcentral.dump.08292017.sql

OK now I made local drugcentral db.
Next, I would like to know the structure of the database, schema. I could not find the schema in the site but I found good library named "eralchemy".
ERAlchemy generates Entity Relation (ER) diagram (like the one below) from databases or from SQLAlchemy models. I installed the package via pip and made ER diagram. ;-)

iwatobipen$ pip install eralchemy
iwatobipen$ eralchemy -i 'postgresql+psycopg2://postgres@127.0.0.1:5432/drugcentral' -o er.pdf

Second code of above generates ER diagram as PDF format. Let's check it.
er
Good! ;-)

First extract smiles and DDI risk.

iwatobipen$ psql -U postgres -D drugcentral
drugcentral=# SELECT STRUCTURES.SMILES, DDI.DESCRIPTION, DDI.DDI_RISK FROM STRUCTURES, DDI WHERE STRUCTURES.ID = DDI.ID LIMIT 10;
                                                 smiles                                                 |                    description                     |  ddi_risk
--------------------------------------------------------------------------------------------------------+----------------------------------------------------+-------------
 OC[C@H]1N[C@H]([C@H](O)[C@@H]1O)C1=CNC2=C1NC=NC2=O                                                     | FLUCONAZOLE/OSPEMIFENE [VA Drug Interaction]       | Significant
 CO[C@@H]1[C@@H](C[C@H]2O[C@]1(C)N1C3=CC=CC=C3C3=C4CNC(=O)C4=C4C5=CC=CC=C5N2C4=C13)N(C)C(=O)C1=CC=CC=C1 | MERCAPTOPURINE/TOFACITINIB [VA Drug Interaction]   | Critical
 CCS(=O)(=O)N1CC(CC#N)(C1)N1C=C(C=N1)C1=NC=NC2=C1C=CN2                                                  | FLUDROCORTISONE/RISPERIDONE [VA Drug Interaction]  | Significant
 CNCC1=CC=C(C=C1)C1=C2CCNC(=O)C3=CC(F)=CC(N1)=C23                                                       | ARIPIPRAZOLE/HYDROCORTISONE [VA Drug Interaction]  | Significant
 OC(CNC1=CC=CC=N1)C1=CC=CC=C1                                                                           | CISAPRIDE/ZIPRASIDONE [VA Drug Interaction]        | Significant
 C#CC1=CC=CC(NC2=NC=NC3=C2C=C2OCCOCCOCCOC2=C3)=C1                                                       | ARIPIPRAZOLE/PHENYTOIN [VA Drug Interaction]       | Significant
 CN1CCN(CC1)C1=CC=C(NC2=NC3=C(SC=C3)C(OC3=CC=CC(NC(=O)C=C)=C3)=N2)C=C1                                  | ARIPIPRAZOLE/PREDNISOLONE [VA Drug Interaction]    | Significant
 OB1OCC2=CC(OC3=CC=C(C=C3)C#N)=CC=C12                                                                   | ARIPIPRAZOLE/FLUDROCORTISONE [VA Drug Interaction] | Significant
 CC1=CN(C=N1)C1=CC(NC(=O)C2=CC=C(C)C(NC3=NC=CC(=N3)C3=CN=CC=N3)=C2)=CC(=C1)C(F)(F)F                     | AMOBARBITAL/RISPERIDONE [VA Drug Interaction]      | Significant

Second extract smiles and mode of action.

drugcentral=# SELECT ST.SMILES, AC.ACTION_TYPE, AC.DESCRIPTION FROM STRUCTURES AS ST, ACTION_TYPE AS AC WHERE ST.ID = AC.ID LIMIT 10;
                                     smiles                                     |           action_type            |                                                                          description
--------------------------------------------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------
 COC1=C2OC=CC2=CC2=C1OC(=O)C=C2                                                 | PHARMACOLOGICAL CHAPERONE        | Pharmaceutical chaperones may help stabilize the protein structure thereby restoring folding and/or preventing misfolding of the protein
 FC1=CNC(=O)NC1=O                                                               | MINIMUM INHIBITORY CONCENTRATION | The lowest concentration of an antimicrobial that will inhibit the visible growth of a microorganism
 CC(=O)OC[C@H]1O[C@H]([C@H](OC(C)=O)[C@@H]1OC(C)=O)N1N=CC(=O)NC1=O              | ANTIBODY BINDING                 | Antibody binding activity
 CCCCN1CCCC[C@H]1C(=O)NC1=C(C)C=CC=C1C                                          | ANTAGONIST                       | Binds to a receptor and prevents activation by an agonist through competing for the binding site
 COC(=O)C1=C(C)NC(C)=C([C@H]1C1=CC(=CC=C1)[N+]([O-])=O)C(=O)OCCN(C)CC1=CC=CC=C1 | ANTISENSE INHIBITOR              | Prevents translation of a complementary mRNA sequence through binding to it
 CCOC(=O)C1=C(C)NC(C)=C([C@@H]1C1=CC(=CC=C1)[N+]([O-])=O)C(=O)OC                | BINDING AGENT                    | Binds to a substance such as a cell surface antigen, targetting a drug to that location, but not necessarily affecting the functioning of the substance itself
 C[C@@H](CCC1=CC=C(O)C=C1)NCCC1=CC=C(O)C(O)=C1                                  | MODULATOR                        | Effects the normal functioning of a protein in some way e.g., mixed agonist/antagonist or unclear whether action is positive or negative
 NC1=NC2=NC=C(CNC3=CC=C(C=C3)C(=O)N[C@@H](CCC(O)=O)C(O)=O)N=C2C(N)=N1           | POSITIVE MODULATOR               | Positively effects the normal functioning of a protein e.g., receptor agonist, positive allosteric modulator, ion channel activator
 NCC1=CC=C(C=C1)C(O)=O                                                          | PROTEOLYTIC ENZYME               | Hydrolyses a protein substrate through enzymatic reaction
 OC(=O)CCCC1=CC=CC=C1                                                           | SUBSTRATE                        | Carried by a transporter, possibly competing with the natural substrate for transport
(10 rows)

This is very limited example of the DB. If reader who interested in the DB how about play and analyze with the DB? And ERAlchemy is very useful!!!
Enjoy!
* To use ERAlchemy with postgresql, you need to install psycopg2 at first.

Make MMP network and send to cytoscape #chemoinfo

Recently I use cytoscape in my laboratory. You know Cytoscape is nice tool for network visualization.
I often make data with python and import data from cytoscape. The work flow is not so bad but I am thinking that it will be nice if python can communicate with cytoscape.
Fortunately cytocape has REST plugin called cyREST and also python has py2cytoscape to do it!
It sounds nice. I tried to use these libraries.
At first I installed cyREST to my cytoscape (v3.5.1). appmanager => cyREST >
And also I installed chemviz for drawing chemical structure in cytoscape.
Then install py2cytoscape via pip. ;-)
You can access localhost:1234/v1 from web blowser when cytoscape launched if cyREST is successfully installed.

Ready.
I drew simple MMP network. Code is below.
First, I made MMP from SMILES file by using RDKit MMP script.

$ cat testdata.smi
Oc1ccccc1 phenol
Oc1ccccc1O catechol
Oc1ccccc1N 2-aminophenol
Oc1ccccc1Cl 2-chlorophenol
Nc1ccccc1N o-phenylenediamine
Nc1cc(O)ccc1N amidol
Oc1cc(O)ccc1O hydroxyquinol
Nc1ccccc1 phenylamine
C1CCCC1N cyclopentanol
$ python rfrag.py < testdata.smi> testdata.frag
$ python indexing.py < testdata.frag > testmmp.txt -r 0.2
$ cat testmmp.txt
Oc1ccccc1,Nc1ccccc1,phenol,phenylamine,O[*:1]>>N[*:1],c1ccc(cc1)[*:1]
Nc1cc(O)ccc1N,Nc1ccccc1N,amidol,o-phenylenediamine,O[*:1]>>[*:1][H],Nc1ccc(cc1N)[*:1]
Oc1ccccc1N,Nc1ccccc1N,2-aminophenol,o-phenylenediamine,O[*:1]>>N[*:1],Nc1ccccc1[*:1]
Oc1ccccc1N,Nc1ccccc1,2-aminophenol,phenylamine,O[*:1]>>[*:1][H],Nc1ccccc1[*:1]
Nc1ccccc1N,Nc1ccccc1,o-phenylenediamine,phenylamine,N[*:1]>>[*:1][H],Nc1ccccc1[*:1]
Oc1ccccc1O,Oc1ccccc1N,catechol,2-aminophenol,O[*:1]>>N[*:1],Oc1ccccc1[*:1]
Oc1ccccc1O,Oc1ccccc1Cl,catechol,2-chlorophenol,O[*:1]>>Cl[*:1],Oc1ccccc1[*:1]
Oc1ccccc1O,Oc1ccccc1,catechol,phenol,O[*:1]>>[*:1][H],Oc1ccccc1[*:1]
Oc1ccccc1N,Oc1ccccc1Cl,2-aminophenol,2-chlorophenol,N[*:1]>>Cl[*:1],Oc1ccccc1[*:1]
Oc1ccccc1N,Oc1ccccc1,2-aminophenol,phenol,N[*:1]>>[*:1][H],Oc1ccccc1[*:1]
Oc1ccccc1Cl,Oc1ccccc1,2-chlorophenol,phenol,Cl[*:1]>>[*:1][H],Oc1ccccc1[*:1]
Oc1cc(O)ccc1O,Oc1ccccc1O,hydroxyquinol,catechol,O[*:1]>>[*:1][H],Oc1ccc(cc1O)[*:1]

Now I got MMP data I used the data to make edge of my network and testdata.smi is used to make node data.

Next code is example for communication between python and cytoscape.
At first, import CyRestClient and make connection. Default URL is localhost and port is 1234. But if user would like to use another IP and Port, user can modify from cytoscape.
Edit => Preferences => add => rest.url xxxxx, rest.port xxxx

I used python-igraph for making graph but py2cytoscape handle data generated by networkx, geohi and something.

import igraph
from py2cytoscape.data.cyrest_client import CyRestClient
import py2cytoscape

cy = CyRestClient()
cy.session.delete()
G = igraph.Graph()

with open('testdata.smi', 'r') as vertexis:
    for v in vertexis:
        G.add_vertex(v.split(' ')[0], molname=v.split(' ')[1])

with open("testmmp.txt", "r") as edges:
    for edge in edges:
        G.add_edge(edge.split(",")[0], edge.split(",")[1], transform=edge.split(",")[4])

After making network, go to next step.
network_create_from_igraph method receives data from igraph and send to cytoscape.
Then I set network layout ‘force-directed’.
Finally I set some view style and update the graph settings.

g_cy = cy.network.create_from_igraph(G)
cy.layout.apply(name='force-directed', network=g_cy)

mystyle = cy.style.create('mystyle')

defaults = {
    'NODE_HIGHT': 100,
    'NODE_WIDTH': 100,
    'NODE_FILL_COLOR': "#87CEFA",
    'NODE_BORDER_WIDTH': 5,
    'NODE_BORDER_PAINT': '#FFFFFF',
    'NODE_LABEL_FONT_SIZE': 14,
    'NODE_LABEL_COLOR': '#555555',
    'EDGE_TRANSPARENCY': 100,
    'EDGE_WIDTH': 20,
    'EDGE_STROKE_UNSELECTED_PAINT': '#FFFFFF',
    'NETWORK_BACKGROUND_PAINT': '#3B426F'
}

mystyle.update_defaults(defaults)
cy.style.apply(mystyle, network=g_cy)

View screenshots.
Before run the code, there is no network in cytoscape.

After run the code I could see MMP network without chemical structure.

Finally I set chemviz setting and run paint structure from menu, I could see structure on each node.

And also each node and edge has their own attribute that is set by igraph.
It interesting and useful because all work is done by using only python!

This example is one way python => cytoscape. But the library can send data in both directions.
There are nice documents written in Japanese such like a following URL.
https://qiita.com/keiono/items/ed796643107bd03aff64
Enjoy!

inter and intra reaction handling in RDKit #RDKit

RDKit can handle reaction. Enumeration of many molecules with template reaction and building blocks are useful for library generation.
Recently I have a question about how to handle intramolecular reactions with RDKit such as micro cyclization etc.
In the case of amidation reaction that is often used for drug synthesis SMARTS query is below.
‘[C:1][C:2](=[O:6])[O:3].[N:4][C:5]>>[C:1][C:2](=[O:6])[N:4][C:5]’
The query is inter molecular reaction in RDKit. A + B => C. So this query can not apply to intramolecular reaction such like a “OC(=O)CCCCN => C(=O)1CCCCN1”
In the RDKit, intramolecular reaction query is represented by including reactants in parentheses.
You can found in the document.
http://www.rdkit.org/docs/RDKit_Book.html

By the way how to distinguish between intra and inter reaction in my code?🤔

I propose simple solution, multiple SMILES is handled as one mol object and perform reaction and then separate it.
OK my english is difficult to understand, let’s go to code.

In the blog post, I wrote an example for amidation.

from rdkit import Chem
from rdkit.Chem import rdChemReactions
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
IPythonConsole.ipython_useSVG=True

# define intra and inter molecular reaction
intra_rxn = AllChem.ReactionFromSmarts('([C:1][C:2](=[O:6])[O:3].[N:4][C:5])>>[C:1][C:2](=[O:6])[N:4][C:5]')
inter_rxn = AllChem.ReactionFromSmarts('[C:1][C:2](=[O:6])[O:3].[N:4][C:5]>>[C:1][C:2](=[O:6])[N:4][C:5]')

# basic acid / amine
acid = Chem.MolFromSmiles('CC(=O)O')
amine = Chem.MolFromSmiles('NC')
# intramolecular
aminoacid =  Chem.MolFromSmiles('N(C)CCC(O)CC(=O)O')

# two molecules in one mol object!
combmol = Chem.MolFromSmiles("CC(=O)O.N1CCC(C)1")

In the case of A + B => C is below

inter_rxn.RunReactants([acid, amine])[0][0]


In this case reaction seems good. By the way, in case of intra reaction is below.

# Intra reaction can not represent with inter molecular reactoin query
inter_rxn.RunReactant(aminoacid,0)[0][0]

But intra SMIRKS query works fine.

#inra moleclar query works fine
intra_rxn.RunReactant(aminoacid,0)[0][0]

Next run the reaction with one mol object in two molecules and intra molecular reaction object.

# paired molecular object also works but reactant two molecules is handled as one object
intra_rxn.RunReactant(combmol,0)[0][0]

Finally combined molecules are separated with Sep_mol function. The function convert molecule to SMILES and split by ‘.’ then transforms SMILES to molecules.

def sepMol(mol):
    smi_list = Chem.MolToSmiles(mol).split('.')
    mols = [ Chem.MolFromSmiles(smi) for smi in smi_list]
    return mols
ms = sepMol(combmol)
print(len(ms))
ms.append(intra_rxn.RunReactant(combmol,0)[0][0])
# out
  2

I tried the function only amidation and do not know whether the method is efficient or not.
Any comment or advice is appreciated.

All my code can check from following URL.

https://nbviewer.jupyter.org/github/iwatobipen/chemo_info/blob/master/rdkit_notebook/reaction_rdkit.ipynb
https://github.com/iwatobipen/chemo_info/tree/master/rdkit_notebook

Novel interaction FP for CADD #chemoinfo

I am not familiar with CADD such as Docking, MD and ab initio calculation but interested in.
Recently I read novel interaction FP between protein and ligand reported in Journal of chemoinformatics. It shows good performance.
URL is below.
https://jcheminf.springeropen.com/track/pdf/10.1186/s13321-018-0264-0

The author propose novel FP protein per atom score contributions derived interaction fingerprint (PADIF) based on GOLD docking score.
It is interesting for me because this score based on protein per atom score contributions. It is indicated that user can analyze the score by atom based. So by using the score Chemist can design molecules focused on specific atoms (positions). In the report there are no data about that.

BYW the algorithm shows good performance against nuclear hormone receptors such as ER, GR and PPAR. I do not think that good docking score correlates ligand affinity but it is useful for pose prediction. And PADIF shows good performance for pose prediction.

How about use the score with MD calculation ??

Pose prediction is very important task for computer aid drug design but very difficult to predict correct pose. Because protein is too flexible, PDB reflects only snap shot of molecular dynamics.
I am looking forward to improvement of these area.