Integration of RDKit and Neo4j #RDKit #Neo4j #GraphDB #Chemoinformatics

One of the most powerful storms of the year, named ‘Hagibis’, is approaching now. All my weekend plans are cancelled, so I’m staying home and writing code…

BTW, at the RDKit UGM 2019, the neo4j-rdkit integration project was introduced. The project was one of the topics in Google Summer of Code. You can find the project at the following URL.

Neo4j is a NoSQL graph database. Some years ago I wrote a post about using it to store MMP data, but native Neo4j functions can’t run chemistry-specific queries such as substructure search or exact structure search. With the neo4j-rdkit plugin, however, we can do substructure search in Neo4j. It seems nice. I would like to try it!

To do so, you need to build the plugin .jar file. The procedure is described in

First, install Neo4j and Maven. This is easy to do. On Debian, apt is used.
install neo4j
install maven

$ wget -O - | sudo apt-key add -
$ echo 'deb stable/' | sudo tee -a /etc/apt/sources.list.d/neo4j.list
$ sudo apt-get update
$ sudo apt-get install neo4j=1:3.5.11
$ sudo apt install maven

After installing Neo4j and Maven, build neo4j-rdkit! Clone the repo and install the bundled RDKit jar files into the local Maven repository by typing the following commands.

$ git clone
$ cd neo4j-rdkit
$ mvn org.apache.maven.plugins:maven-install-plugin:2.3.1:install-file \
                         -Dfile=lib/org.RDKit.jar -DgroupId=org.rdkit \
                         -DartifactId=rdkit -Dversion=1.0.0 \
                         -Dpackaging=jar
$ mvn org.apache.maven.plugins:maven-install-plugin:2.3.1:install-file \
                         -Dfile=lib/org.RDKitDoc.jar -DgroupId=org.rdkit \
                         -DartifactId=rdkit-doc -Dversion=1.0.0 \
                         -Dpackaging=jar

Then generate the .jar file with all dependencies. Just type ‘mvn package’.

After that, copy the jar file to the Neo4j plugins folder. The following is an example from my environment.
Then launch Neo4j. After starting Neo4j, localhost:7474 is accessible from a web browser.

$ mvn package
$ sudo cp ./target/rdkit-index-0.0.7.jar /var/lib/neo4j/plugins
$ sudo neo4j start

I imported the MMP data using the LOAD CSV command. The data already contains matched pair information, so each line holds information for two nodes.

$ head mmp_data_sets/ChEMBL17_IC50_RECAP_MMP_List.dat 
Target_ChEMBLID	Target_Name	Cpd1_ChEMBLID	Cpd2_ChEMBLID	Cpd1_pIC50	Cpd2_pIC50	Num_Cuts	KeyFragment	Transformation	Cpd1_SMILES	Cpd2_SMILES
CHEMBL1075097	Arginase-1	CHEMBL2348488	CHEMBL2326095	6.28399665637	6.79588001734	1	[R1]CCC(CCCCB(O)O)(C(=O)O)N	[R1]N(CC)CC>>[R1]N1CCCC1	OB(O)CCCCC([NH3+])(CC[NH+](CC)CC)C(=O)[O-]	OB(O)CCCC[C@@]([NH3+])(CC[NH+]1CCCC1)C(=O)[O-]
CHEMBL1075097	Arginase-1	CHEMBL2326085	CHEMBL2348486	6.56863623584	6.49485002168	1	[R1]CCC(CCCCB(O)O)(C(=O)O)N	[R1]N(CC)CC>>[R1]N1CCCC1	OB(O)CCCC[C@@]([NH3+])(CC[NH+](CC)CC)C(=O)[O-]	OB(O)CCCCC([NH3+])(CC[NH+]1CCCC1)C(=O)[O-]
CHEMBL1075097	Arginase-1	CHEMBL2348488	CHEMBL2348486	6.28399665637	6.49485002168	1	[R1]CCC(CCCCB(O)O)(C(=O)O)N	[R1]N(CC)CC>>[R1]N1CCCC1	OB(O)CCCCC([NH3+])(CC[NH+](CC)CC)C(=O)[O-]	OB(O)CCCCC([NH3+])(CC[NH+]1CCCC1)C(=O)[O-]
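To make the "two nodes per line" layout concrete, here is a minimal Python sketch that reads the tab-separated file with the standard library only. The function name `read_pairs` and the output dict layout are my own choices for illustration, and the sample text is just the header plus the first record from the listing above.

```python
import csv
import io

# Header plus the first record, copied from the listing above.
SAMPLE = (
    "Target_ChEMBLID\tTarget_Name\tCpd1_ChEMBLID\tCpd2_ChEMBLID\t"
    "Cpd1_pIC50\tCpd2_pIC50\tNum_Cuts\tKeyFragment\tTransformation\t"
    "Cpd1_SMILES\tCpd2_SMILES\n"
    "CHEMBL1075097\tArginase-1\tCHEMBL2348488\tCHEMBL2326095\t"
    "6.28399665637\t6.79588001734\t1\t[R1]CCC(CCCCB(O)O)(C(=O)O)N\t"
    "[R1]N(CC)CC>>[R1]N1CCCC1\t"
    "OB(O)CCCCC([NH3+])(CC[NH+](CC)CC)C(=O)[O-]\t"
    "OB(O)CCCC[C@@]([NH3+])(CC[NH+]1CCCC1)C(=O)[O-]\n"
)


def read_pairs(handle):
    """Yield one dict per line: two compound nodes plus the MMP edge data."""
    # QUOTE_NONE: SMILES never need CSV quoting, so treat every character literally.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield {
            "cpd1": {"mol_id": row["Cpd1_ChEMBLID"], "smiles": row["Cpd1_SMILES"]},
            "cpd2": {"mol_id": row["Cpd2_ChEMBLID"], "smiles": row["Cpd2_SMILES"]},
            "transform": row["Transformation"],
            "targetid": row["Target_ChEMBLID"],
        }


pairs = list(read_pairs(io.StringIO(SAMPLE)))
print(pairs[0]["cpd1"]["mol_id"], "->", pairs[0]["cpd2"]["mol_id"])
```

For the real file, `open(path, newline="")` can be passed instead of the `io.StringIO` object.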

From the web UI, execute the following commands. TIP: the data should be stored in the /var/lib/neo4j/import folder. Also, to use the RDKit plugin, CALL the command below first.

// Cypher web browser
CALL[‘Structure’, ‘Chemical’])

After installing the plugin, some chemoinformatics-related property names are reserved, as listed below. After importing the data, several properties are calculated automatically.

  1. canonical_smiles
  2. inchi
  3. formula
  4. molecular_weight
  5. fp – bit-vector fingerprint in form of indexes of positive bits ("1 4 19 23")
  6. fp_ones – count of positive bits
  7. mdlmol
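As a small illustration of the `fp` and `fp_ones` formats listed above, the following Python sketch decodes an `fp` string into a set of on-bit indexes and counts them. The helper names are hypothetical. The Tanimoto function is my own addition to show what such a bit list can be used for; I have not checked which similarity metric the plugin itself uses.

```python
def parse_fp(fp):
    """Turn an fp string such as "1 4 19 23" into a set of on-bit indexes."""
    return {int(token) for token in fp.split()}


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fp strings (assumed metric, not confirmed for the plugin)."""
    a, b = parse_fp(fp_a), parse_fp(fp_b)
    union = len(a | b)
    # Two empty fingerprints share nothing, so report 0.0 instead of dividing by zero.
    return len(a & b) / union if union else 0.0


example = "1 4 19 23"
fp_ones = len(parse_fp(example))  # number of positive bits, like the fp_ones property
print(fp_ones, tanimoto(example, "4 19 40"))
```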

MDL mol block, fingerprint, InChI, etc. are added!!!!

// Cypher web browser
MERGE (m1:Chemical:Structure { mol_id: line.Cpd1_ChEMBLID, smiles: line.Cpd1_SMILES })
MERGE (m2:Chemical:Structure { mol_id: line.Cpd2_ChEMBLID, smiles: line.Cpd2_SMILES })
MERGE (m1)-[r:MMP { transform:line.Transformation, targetid:line.Target_ChEMBLID}]->(m2); // Maybe MERGE is better...
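As an alternative for larger files, the same MERGE pattern can be sent from Python in batches with `UNWIND`. This is only a sketch of that swapped-in technique, not what I ran for this post. `IMPORT_QUERY`, `chunks`, `import_pairs` and the row key names are hypothetical. The `run` argument is assumed to be any callable with the shape of the Python driver's `session.run(query, **parameters)`.

```python
# Same node/relationship pattern as the MERGE statements above, fed from a parameter list.
IMPORT_QUERY = """
UNWIND $rows AS row
MERGE (m1:Chemical:Structure { mol_id: row.cpd1_id, smiles: row.cpd1_smiles })
MERGE (m2:Chemical:Structure { mol_id: row.cpd2_id, smiles: row.cpd2_smiles })
MERGE (m1)-[r:MMP { transform: row.transform, targetid: row.targetid }]->(m2)
"""


def chunks(rows, size):
    """Split a list into consecutive slices of at most `size` items."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def import_pairs(run, rows, batch_size=1000):
    """Send rows to Neo4j one batch per statement; returns the number of rows sent.

    `run` is assumed to behave like session.run(query, rows=[...]).
    """
    sent = 0
    for batch in chunks(rows, batch_size):
        run(IMPORT_QUERY, rows=batch)
        sent += len(batch)
    return sent
```

With a real connection, `import_pairs(session.run, rows)` would be the call. Each row is a plain dict with the keys `cpd1_id`, `cpd1_smiles`, `cpd2_id`, `cpd2_smiles`, `transform` and `targetid`.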

OK, let’s do a substructure search! This is a really chemistry-specific problem. It is useful for researchers who want to search for MMPs that contain a user-defined substructure. The rdkit.*.search functions return canonical SMILES and a score, so if you want to omit low-scoring results, you should use a WHERE clause, I think.

CALL['Chemical', 'Structure'], 'C1CCCCC1') YIELD canonical_smiles, score
// WHERE score>500 and n.smiles=canonical_smiles

Of course, exact match search is also available.

A similarity search example is below.

Neo4j provides a Python driver.
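A minimal sketch of calling the database from Python could look like the following. It assumes the official `neo4j` package is installed (`pip install neo4j`). The helper names `run_search` and `filter_hits` are hypothetical. The Cypher text is passed in by the caller, so no plugin procedure name is hard-coded here. The score cut-off is applied on the client side, mirroring the WHERE idea mentioned earlier.

```python
def filter_hits(records, min_score):
    """Keep (canonical_smiles, score) pairs whose score is at least min_score."""
    return [
        (rec["canonical_smiles"], rec["score"])
        for rec in records
        if rec["score"] >= min_score
    ]


def run_search(uri, user, password, cypher, **params):
    """Run one Cypher statement and return its records as plain dicts."""
    # Imported lazily so the rest of the module works without the driver installed.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            return [record.data() for record in session.run(cypher, **params)]
    finally:
        driver.close()
```

Usage would be something like `hits = run_search("bolt://localhost:7687", user, password, cypher)` followed by `filter_hits(hits, 500)`, where `cypher` is a search CALL that yields `canonical_smiles` and `score`. The bolt URI shown is the Neo4j default and may differ in your setup.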

This is a very short example of neo4j-rdkit. The LOAD CSV approach in my code is not good for large data sets. I would like to read the Neo4j documentation and use it more deeply. This is a very cool plugin. Thank you, developers!

May the typhoon pass quickly!


Published by iwatobipen

I'm a medicinal chemist at a mid-size pharmaceutical company. I love chemoinfo, coding, organic synthesis, and my family.
