Integration of RDKit and Neo4j #RDKit #Neo4j #GraphDB #Chemoinformatics

One of the most powerful storms of the year, named ‘Hagibis’, is approaching. All my weekend plans are cancelled, so I’m staying at home and writing code…

BTW, the neo4j-rdkit integration project was introduced at the RDKit UGM 2019. It was one of the Google Summer of Code projects. You can find it at the following URL: https://github.com/rdkit/neo4j-rdkit

Neo4j is a NoSQL graph database. Some years ago I wrote a post about using it to store MMP data, but native Neo4j functions cannot run chemistry-specific queries such as substructure or exact structure searches. With the neo4j-rdkit plugin, however, substructure search becomes possible inside Neo4j. It looks nice, so I would like to try it!

To use it, you need to build the plugin .jar file. The procedure is described in the project's README.md.

First, install Neo4j and Maven. This is easy; on Debian, apt is used.
Install Neo4j
https://neo4j.com/docs/operations-manual/current/installation/
Install Maven
https://linuxize.com/post/how-to-install-apache-maven-on-ubuntu-18-04/

$ wget -O - https://debian.neo4j.org/neotechnology.gpg.key | sudo apt-key add -
$ echo 'deb https://debian.neo4j.org/repo stable/' | sudo tee -a /etc/apt/sources.list.d/neo4j.list
$ sudo apt-get update
$ sudo apt-get install neo4j=1:3.5.11
$ sudo apt install maven

After installing Neo4j and Maven, build neo4j-rdkit! Clone the repo and install the bundled RDKit jar files into the local Maven repository with the following commands.

$ git clone https://github.com/rdkit/neo4j-rdkit.git
$ cd neo4j-rdkit
$ mvn org.apache.maven.plugins:maven-install-plugin:2.3.1:install-file \
                         -Dfile=lib/org.RDKit.jar -DgroupId=org.rdkit \
                         -DartifactId=rdkit -Dversion=1.0.0 \
                         -Dpackaging=jar

$ mvn org.apache.maven.plugins:maven-install-plugin:2.3.1:install-file \
                         -Dfile=lib/org.RDKitDoc.jar -DgroupId=org.rdkit \
                         -DartifactId=rdkit-doc -Dversion=1.0.0 \
                         -Dpackaging=jar

Then generate the .jar file with all dependencies by typing ‘mvn package’.

After that, copy the jar file to the Neo4j plugins folder. The commands below are from my environment.
Then launch Neo4j. Once it has started, localhost:7474 is accessible from a web browser.

$ mvn package
$ sudo cp ./target/rdkit-index-0.0.7.jar /var/lib/neo4j/plugins
$ sudo neo4j start

I imported MMP data using the LOAD CSV command. The data already contains matched pair information, so each line describes two nodes.

$ head mmp_data_sets/ChEMBL17_IC50_RECAP_MMP_List.dat 
Target_ChEMBLID	Target_Name	Cpd1_ChEMBLID	Cpd2_ChEMBLID	Cpd1_pIC50	Cpd2_pIC50	Num_Cuts	KeyFragment	Transformation	Cpd1_SMILES	Cpd2_SMILES
CHEMBL1075097	Arginase-1	CHEMBL2348488	CHEMBL2326095	6.28399665637	6.79588001734	1	[R1]CCC(CCCCB(O)O)(C(=O)O)N	[R1]N(CC)CC>>[R1]N1CCCC1	OB(O)CCCCC([NH3+])(CC[NH+](CC)CC)C(=O)[O-]	OB(O)CCCC[C@@]([NH3+])(CC[NH+]1CCCC1)C(=O)[O-]
CHEMBL1075097	Arginase-1	CHEMBL2326085	CHEMBL2348486	6.56863623584	6.49485002168	1	[R1]CCC(CCCCB(O)O)(C(=O)O)N	[R1]N(CC)CC>>[R1]N1CCCC1	OB(O)CCCC[C@@]([NH3+])(CC[NH+](CC)CC)C(=O)[O-]	OB(O)CCCCC([NH3+])(CC[NH+]1CCCC1)C(=O)[O-]
CHEMBL1075097	Arginase-1	CHEMBL2348488	CHEMBL2348486	6.28399665637	6.49485002168	1	[R1]CCC(CCCCB(O)O)(C(=O)O)N	[R1]N(CC)CC>>[R1]N1CCCC1	OB(O)CCCCC([NH3+])(CC[NH+](CC)CC)C(=O)[O-]	OB(O)CCCCC([NH3+])(CC[NH+]1CCCC1)C(=O)[O-]

Run the following commands from the web UI. Tip: the data file must be stored in the /var/lib/neo4j/import folder. To use the RDKit plugin, CALL the index creation procedure first.

#Cypher web browser
CALL org.rdkit.search.createIndex(['Structure', 'Chemical'])

Once the plugin is installed, the chemoinformatics-related property names listed below are reserved. When data is imported, these properties are calculated automatically.

  1. canonical_smiles
  2. inchi
  3. formula
  4. molecular_weight
  5. fp – bit-vector fingerprint in form of indexes of positive bits ("1 4 19 23")
  6. fp_ones – count of positive bits
  7. mdlmol

The MDL molblock, fingerprint, InChI, etc. are all added automatically!
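To make the `fp` and `fp_ones` properties more concrete, here is a minimal Python sketch of how a fingerprint stored as a string of on-bit indexes could be read back and compared. The second fingerprint is a made-up value for illustration, and the Tanimoto coefficient is used only because it is a common similarity measure for bit fingerprints; whether the plugin scores similarity this way is an assumption, not something confirmed here.

```python
def parse_fp(fp_text):
    """Turn a space-separated string of on-bit indexes into a set of ints."""
    return {int(token) for token in fp_text.split()}


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two index-string fingerprints (shared / union)."""
    a, b = parse_fp(fp_a), parse_fp(fp_b)
    if not a and not b:
        return 0.0  # convention chosen for this sketch: two empty fingerprints score 0
    return len(a & b) / len(a | b)


fp1 = "1 4 19 23"       # the example string from the property list above
fp2 = "1 4 20 23 42"    # hypothetical second fingerprint

print(len(parse_fp(fp1)))   # number of on bits, i.e. what fp_ones stores: 4
print(tanimoto(fp1, fp2))   # 3 shared bits / 6 distinct bits = 0.5
```

Storing only the indexes of the on bits keeps the property small for sparse fingerprints, and set operations are enough to compare two of them.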

#Cypher web browser
LOAD CSV WITH HEADERS FROM "file:///ChEMBL17_IC50_RECAP_MMP_10k.csv" AS line
FIELDTERMINATOR '\t'
MERGE (m1:Chemical:Structure { mol_id: line.Cpd1_ChEMBLID, smiles: line.Cpd1_SMILES })
MERGE (m2:Chemical:Structure { mol_id: line.Cpd2_ChEMBLID, smiles: line.Cpd2_SMILES })
MERGE (m1)-[r:MMP { transform: line.Transformation, targetid: line.Target_ChEMBLID }]->(m2); // Maybe MERGE is better...
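Why MERGE matters here: the same compound appears on many lines of the file, and MERGE reuses an existing node instead of creating a duplicate. The following Python sketch imitates that behaviour on the three sample rows shown earlier. It is a simplification: nodes are keyed by ChEMBL ID only, whereas the Cypher above matches on both mol_id and smiles.

```python
# (Cpd1_ChEMBLID, Cpd2_ChEMBLID, Transformation, Target_ChEMBLID) from the sample rows
rows = [
    ("CHEMBL2348488", "CHEMBL2326095", "[R1]N(CC)CC>>[R1]N1CCCC1", "CHEMBL1075097"),
    ("CHEMBL2326085", "CHEMBL2348486", "[R1]N(CC)CC>>[R1]N1CCCC1", "CHEMBL1075097"),
    ("CHEMBL2348488", "CHEMBL2348486", "[R1]N(CC)CC>>[R1]N1CCCC1", "CHEMBL1075097"),
]

nodes = set()   # one entry per distinct compound, like MERGE on the node pattern
edges = set()   # one entry per distinct (start, end, properties) relationship

for cpd1, cpd2, transform, target in rows:
    nodes.add(cpd1)
    nodes.add(cpd2)
    edges.add((cpd1, cpd2, transform, target))

# Three lines mention six compounds, but only four are distinct.
print(len(nodes), len(edges))   # 4 3
```

With CREATE instead of MERGE, the same three lines would produce six nodes, and the MMP relationships would no longer connect through shared compounds.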

OK, let’s do a substructure search! This is a truly chemistry-specific problem, and it is useful for researchers who want to find MMPs containing a user-defined substructure. The search procedure returns canonical_smiles and score, so if you want to omit low-score results, I think you should use a WHERE clause.

CALL org.rdkit.search.substructure.smiles(['Chemical', 'Structure'], 'C1CCCCC1') YIELD canonical_smiles, score
MATCH (n)
// WHERE score > 500 AND n.smiles = canonical_smiles
RETURN n

Of course, exact match search is also available.

Similarity search example is below.

Neo4j also provides a Python driver: https://neo4j.com/developer/python/
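As a rough idea of how the substructure search could be called from Python, here is a minimal sketch using the official `neo4j` driver package. The function names `substructure_query` and `run_substructure_search`, the Bolt URL and the credentials are all placeholders of mine; the Cypher text reuses the procedure call shown above, with query parameters instead of literal values.

```python
def substructure_query(labels, smiles):
    """Build the Cypher text and parameters for the substructure procedure."""
    cypher = (
        "CALL org.rdkit.search.substructure.smiles($labels, $smiles) "
        "YIELD canonical_smiles, score "
        "RETURN canonical_smiles, score"
    )
    return cypher, {"labels": list(labels), "smiles": smiles}


def run_substructure_search(uri, user, password, smiles):
    """Run the search against a live server and return the rows as dicts."""
    # Imported here so the query builder above works without the driver installed.
    from neo4j import GraphDatabase

    cypher, params = substructure_query(["Chemical", "Structure"], smiles)
    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            return [record.data() for record in session.run(cypher, params)]
    finally:
        driver.close()


# Example call (needs a running Neo4j with the plugin; values are placeholders):
# rows = run_substructure_search("bolt://localhost:7687", "neo4j", "password", "C1CCCCC1")
```

Passing the SMILES as a parameter rather than pasting it into the query string avoids quoting problems with characters such as brackets and backslashes.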

This is a very short example of neo4j-rdkit. The LOAD CSV approach in my code is not good for large data sets. I would like to read the Neo4j documentation and use it more deeply. This is a very cool plugin; thank you, developers!

May the typhoon pass away early!

Retrieve MMP square from MMP database.

I wrote a blog post about MMP and neo4j some days ago.
I thought that I could retrieve MMP squares from neo4j.
An MMP square is a set of four molecules linked by matched pairs in the following way.
mol1 => mol2 (MMP), mol2 => mol3 (MMP), mol3 => mol4 (MMP), mol4 => mol1 (MMP)
This relationship is important when thinking about additive or non-additive SAR.
And Cypher can search for such squares very simply.
Let’s try it.
First, I prepared a dataset from ChEMBL CYP3A4 inhibition data.

import pandas as pd
import numpy as np
df = pd.read_table( "bioactivity-16_12-40-28.txt", header=0, low_memory=False )
df.shape
Out[6]:(17143, 55)

df2 = df[["CANONICAL_SMILES", "MOLREGNO" ]]
df2.to_csv('chembl_cyp3a4.csv', index=False)

from rdkit import Chem
from rdkit import rdBase

!python mmpa/rfrag.py < ./chembl_cyp3a4.csv > ./cyp3a4_frag.txt
!python mmpa/indexing.py -s -r 0.1 < ./cyp3a4_frag.txt > ./cyp3a4_mmp.txt
mmps= pd.read_csv(  'cyp3a4_mmp.txt' , header=None, names = ('smi1','smi2','id1','id2','tform','core'))
mmps.shape

Out[23]:(45096, 6)
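
The import step that follows reads a file named chembl_cyp3a4mmp.csv and expects a header row, while cyp3a4_mmp.txt has none. How that file was produced is not shown here; a minimal sketch, assuming it is simply the MMP output with the six column names from above prepended, could look like this. The helper name `add_header` is mine.

```python
def add_header(src_path, dst_path,
               names=("smi1", "smi2", "id1", "id2", "tform", "core")):
    """Copy a header-less CSV to dst_path with a header line in front.

    Returns the number of data lines copied; blank lines are skipped.
    """
    copied = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(",".join(names) + "\n")
        for line in src:
            if not line.strip():
                continue
            dst.write(line if line.endswith("\n") else line + "\n")
            copied += 1
    return copied


# Example call (file names taken from this post):
# add_header("cyp3a4_mmp.txt", "chembl_cyp3a4mmp.csv")
```

Since the data is already in the `mmps` DataFrame, `mmps.to_csv('chembl_cyp3a4mmp.csv', index=False)` would give the same kind of file.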

Data preparation is finished.
Next, load the data from neo4j-shell.
I used LOAD CSV WITH HEADERS to do it.

neo4j-sh (?)$ LOAD CSV WITH HEADERS FROM 'file:///path/mmp_cyp3a4/chembl_cyp3a4mmp.csv' AS line
> MERGE (a:mol {smi:line.smi1, molregno: line.id1})
> MERGE (b:mol {smi:line.smi2, molregno: line.id2})
> MERGE (a)-[:MMP {tform:line.tform, core:line.core} ]->(b);
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 2463
Relationships created: 36388
Properties set: 77702
Labels added: 2463
191456 ms

OK, finally, search for MMP squares using Cypher.
Cypher did not accept a query with the same node variable appearing twice in one path, so I wrote a comma-separated pattern.

neo4j-sh (?)$ MATCH (n)-[r1]->(a)-[r2]->(b)-[r3]->(c), (c)-[r4]->(n) RETURN n.smi,r1.tform,a.smi,r2.tform,b.smi,r3.tform,c.smi,r4.tform LIMIT 1;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| n.smi                               | r1.tform          | a.smi                              | r2.tform          | b.smi                               | r3.tform          | c.smi                              | r4.tform          |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| "Cc1cccc(CNc2cc(ncn2)c3ccccc3Cl)c1" | "Cl[*:1]>>C[*:1]" | "Cc1cccc(CNc2cc(ncn2)c3ccccc3C)c1" | "C[*:1]>>CO[*:1]" | "COc1ccccc1c2cc(NCc3cccc(C)c3)ncn2" | "CO[*:1]>>C[*:1]" | "Cc1cccc(CNc2cc(ncn2)c3ccccc3C)c1" | "C[*:1]>>Cl[*:1]" |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row
14266 ms

Wow! It works fine!
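One caveat on the row above: a.smi and c.smi are the same SMILES, so this match walks out to the methoxy compound and back rather than around four different molecules. Cypher only guarantees that the relationships in a match are distinct, not the nodes. Adding something like `WHERE n <> b AND a <> c` to the query should exclude such rows. The same idea as a minimal Python sketch, on a toy edge list with hypothetical node names:

```python
from collections import defaultdict


def find_squares(edges):
    """Return directed 4-cycles n->a->b->c->n whose four corners are all distinct.

    Each cycle is reported once, starting from its smallest node name.
    """
    succ = defaultdict(set)
    for src, dst in edges:
        succ[src].add(dst)

    squares = []
    for n in sorted(succ):
        for a in sorted(succ[n]):
            for b in sorted(succ.get(a, ())):
                for c in sorted(succ.get(b, ())):
                    corners = (n, a, b, c)
                    if (n in succ.get(c, ())
                            and len(set(corners)) == 4
                            and n == min(corners)):
                        squares.append(corners)
    return squares


# Toy graph: a real square m1..m4 stored in both directions, plus a side branch m2<->m5.
toy_edges = [
    ("m1", "m2"), ("m2", "m1"), ("m2", "m3"), ("m3", "m2"),
    ("m3", "m4"), ("m4", "m3"), ("m4", "m1"), ("m1", "m4"),
    ("m2", "m5"), ("m5", "m2"),
]
print(find_squares(toy_edges))
# [('m1', 'm2', 'm3', 'm4'), ('m1', 'm4', 'm3', 'm2')]
```

The walk m1 -> m2 -> m5 -> m2 -> m1 uses four different edges, which is the same shape as the row returned above, and it is rejected here because m2 appears twice.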
Next, visualise the result using RDKit!
Recent versions of RDKit can draw molecules as SVG very easily.

from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
IPythonConsole.ipython_useSVG=True
tforms = ("Cc1cccc(CNc2cc(ncn2)c3ccccc3Cl)c1", "Cl[*:1]>>C[*:1]", "Cc1cccc(CNc2cc(ncn2)c3ccccc3C)c1", "C[*:1]>>CO[*:1]", "COc1ccccc1c2cc(NCc3cccc(C)c3)ncn2", "CO[*:1]>>C[*:1]", "Cc1cccc(CNc2cc(ncn2)c3ccccc3C)c1", "C[*:1]>>Cl[*:1]")
# Even positions hold molecule SMILES; odd positions hold transforms, which
# Chem.MolFromSmiles cannot read, so only the molecules are converted.
molobj = [Chem.MolFromSmiles(smi) for smi in tforms[0::2]]
# Draw them!
Draw.MolsToGridImage(molobj, molsPerRow=4)

[Image: mmp_square]

I want to develop chemoinformatics tools using rdkit and neo4j.