Integration of RDKit and Neo4j #RDKit #Neo4j #GraphDB #Chemoinformatics

One of the most powerful storms of the year, named ‘Hagibis’, is coming now. All my weekend plans are cancelled, so I’m staying at home and writing code…

BTW, the neo4j-rdkit integration project was introduced at the RDKit UGM 2019. The project was one of the topics in Google Summer of Code. You can find it at the following URL: https://github.com/rdkit/neo4j-rdkit

Neo4j is a NoSQL graph database. Some years ago I wrote a post about using it to store MMP (matched molecular pair) data, but native neo4j functions can’t run chemistry-specific queries such as substructure search or exact structure search. With the neo4j-rdkit plugin, however, we can do substructure search in neo4j. It seems nice. I would like to try it!

To do that, the plugin .jar file has to be built. The procedure is described in the README.md.

First, install neo4j and Maven. It is easy. On Debian, apt is used.
Install neo4j
https://neo4j.com/docs/operations-manual/current/installation/
Install Maven
https://linuxize.com/post/how-to-install-apache-maven-on-ubuntu-18-04/

$ wget -O - https://debian.neo4j.org/neotechnology.gpg.key | sudo apt-key add -
$ echo 'deb https://debian.neo4j.org/repo stable/' | sudo tee -a /etc/apt/sources.list.d/neo4j.list
$ sudo apt-get update
$ sudo apt-get install neo4j=1:3.5.11
$ sudo apt install maven

After installing neo4j and Maven, build neo4j-rdkit! Clone the repo and install the bundled RDKit jar files into the local Maven repository by typing the following commands.

$ git clone https://github.com/rdkit/neo4j-rdkit.git
$ cd neo4j-rdkit
$ mvn org.apache.maven.plugins:maven-install-plugin:2.3.1:install-file \
      -Dfile=lib/org.RDKit.jar -DgroupId=org.rdkit \
      -DartifactId=rdkit -Dversion=1.0.0 \
      -Dpackaging=jar

$ mvn org.apache.maven.plugins:maven-install-plugin:2.3.1:install-file \
      -Dfile=lib/org.RDKitDoc.jar -DgroupId=org.rdkit \
      -DartifactId=rdkit-doc -Dversion=1.0.0 \
      -Dpackaging=jar

Then generate the .jar file with all dependencies. Just type ‘mvn package’.

After that, copy the jar file to the neo4j plugins folder. The following is an example from my environment.
Then launch neo4j. After starting neo4j, localhost:7474 is accessible from a web browser.

$ mvn package
$ sudo cp ./target/rdkit-index-0.0.7.jar /var/lib/neo4j/plugins
$ sudo neo4j start

I imported MMP data with the LOAD CSV command. The data already has matched pair information, so each line contains information for two nodes.

$ head mmp_data_sets/ChEMBL17_IC50_RECAP_MMP_List.dat 
Target_ChEMBLID	Target_Name	Cpd1_ChEMBLID	Cpd2_ChEMBLID	Cpd1_pIC50	Cpd2_pIC50	Num_Cuts	KeyFragment	Transformation	Cpd1_SMILES	Cpd2_SMILES
CHEMBL1075097	Arginase-1	CHEMBL2348488	CHEMBL2326095	6.28399665637	6.79588001734	1	[R1]CCC(CCCCB(O)O)(C(=O)O)N	[R1]N(CC)CC>>[R1]N1CCCC1	OB(O)CCCCC([NH3+])(CC[NH+](CC)CC)C(=O)[O-]	OB(O)CCCC[C@@]([NH3+])(CC[NH+]1CCCC1)C(=O)[O-]
CHEMBL1075097	Arginase-1	CHEMBL2326085	CHEMBL2348486	6.56863623584	6.49485002168	1	[R1]CCC(CCCCB(O)O)(C(=O)O)N	[R1]N(CC)CC>>[R1]N1CCCC1	OB(O)CCCC[C@@]([NH3+])(CC[NH+](CC)CC)C(=O)[O-]	OB(O)CCCCC([NH3+])(CC[NH+]1CCCC1)C(=O)[O-]
CHEMBL1075097	Arginase-1	CHEMBL2348488	CHEMBL2348486	6.28399665637	6.49485002168	1	[R1]CCC(CCCCB(O)O)(C(=O)O)N	[R1]N(CC)CC>>[R1]N1CCCC1	OB(O)CCCCC([NH3+])(CC[NH+](CC)CC)C(=O)[O-]	OB(O)CCCCC([NH3+])(CC[NH+]1CCCC1)C(=O)[O-]
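
To make the "two nodes per line" layout concrete, here is a minimal Python sketch that splits the three sample rows above into unique compound nodes and MMP edges. The function name `split_pairs` is hypothetical, and keying nodes by ChEMBL ID alone is my assumption for illustration; it is not part of neo4j-rdkit.

```python
import csv
import io

# Header and three rows copied from the `head` output above.
HEADER = ["Target_ChEMBLID", "Target_Name", "Cpd1_ChEMBLID", "Cpd2_ChEMBLID",
          "Cpd1_pIC50", "Cpd2_pIC50", "Num_Cuts", "KeyFragment",
          "Transformation", "Cpd1_SMILES", "Cpd2_SMILES"]
KEY = "[R1]CCC(CCCCB(O)O)(C(=O)O)N"
TRANSFORM = "[R1]N(CC)CC>>[R1]N1CCCC1"
ROWS = [
    ["CHEMBL1075097", "Arginase-1", "CHEMBL2348488", "CHEMBL2326095",
     "6.28399665637", "6.79588001734", "1", KEY, TRANSFORM,
     "OB(O)CCCCC([NH3+])(CC[NH+](CC)CC)C(=O)[O-]",
     "OB(O)CCCC[C@@]([NH3+])(CC[NH+]1CCCC1)C(=O)[O-]"],
    ["CHEMBL1075097", "Arginase-1", "CHEMBL2326085", "CHEMBL2348486",
     "6.56863623584", "6.49485002168", "1", KEY, TRANSFORM,
     "OB(O)CCCC[C@@]([NH3+])(CC[NH+](CC)CC)C(=O)[O-]",
     "OB(O)CCCCC([NH3+])(CC[NH+]1CCCC1)C(=O)[O-]"],
    ["CHEMBL1075097", "Arginase-1", "CHEMBL2348488", "CHEMBL2348486",
     "6.28399665637", "6.49485002168", "1", KEY, TRANSFORM,
     "OB(O)CCCCC([NH3+])(CC[NH+](CC)CC)C(=O)[O-]",
     "OB(O)CCCCC([NH3+])(CC[NH+]1CCCC1)C(=O)[O-]"],
]
SAMPLE = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS) + "\n"


def split_pairs(handle):
    """Return (nodes, edges) from a tab-separated MMP file.

    nodes maps ChEMBL ID -> SMILES (first occurrence wins),
    edges is a list of dicts describing one matched pair each.
    """
    nodes = {}
    edges = []
    # QUOTE_NONE: SMILES never need CSV quoting, so treat every character literally.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        nodes.setdefault(row["Cpd1_ChEMBLID"], row["Cpd1_SMILES"])
        nodes.setdefault(row["Cpd2_ChEMBLID"], row["Cpd2_SMILES"])
        edges.append({
            "source": row["Cpd1_ChEMBLID"],
            "target": row["Cpd2_ChEMBLID"],
            "transform": row["Transformation"],
            "targetid": row["Target_ChEMBLID"],
        })
    return nodes, edges


nodes, edges = split_pairs(io.StringIO(SAMPLE))
print(len(nodes), "nodes,", len(edges), "edges")
```

Three lines give four distinct compounds and three pairs, because the same compound can appear in several pairs.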

From the web UI, execute the following commands. Tip: the data file should be stored in the /var/lib/neo4j/import folder. Also, to use the rdkit plugin, CALL the following procedure first.

// Cypher (web browser)
CALL org.rdkit.search.createIndex(['Structure', 'Chemical'])

After installing the plugin, the chemoinformatics-related property names listed below are reserved. After importing the data, these properties are calculated automatically.

  1. canonical_smiles
  2. inchi
  3. formula
  4. molecular_weight
  5. fp – bit-vector fingerprint in form of indexes of positive bits ("1 4 19 23")
  6. fp_ones – count of positive bits
  7. mdlmol

The MDL mol block, fingerprint, InChI, etc. are added!
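
To illustrate the `fp` and `fp_ones` format described in the list above, here is a small Python sketch that parses the space-separated bit indexes and compares two fingerprints with the Tanimoto coefficient. The helper names and the second fingerprint string are made up for illustration, and I am not claiming this is how the plugin itself computes its scores.

```python
def parse_fp(fp):
    """Turn a string such as "1 4 19 23" into a set of on-bit indexes."""
    return {int(token) for token in fp.split()}


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: shared on-bits divided by on-bits in either set."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


fp_a = parse_fp("1 4 19 23")       # example string from the list above
fp_b = parse_fp("1 4 20 23 42")    # hypothetical second fingerprint

fp_ones_a = len(fp_a)              # corresponds to the fp_ones property
similarity = tanimoto(fp_a, fp_b)  # 3 shared bits / 6 bits in total
print(fp_ones_a, similarity)
```

Storing only the indexes of the positive bits keeps the property compact, and set operations are enough to work with it.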

// Cypher (web browser)
LOAD CSV WITH HEADERS FROM "file:///ChEMBL17_IC50_RECAP_MMP_10k.csv" AS line
FIELDTERMINATOR '\t'
MERGE (m1:Chemical:Structure { mol_id: line.Cpd1_ChEMBLID, smiles: line.Cpd1_SMILES })
MERGE (m2:Chemical:Structure { mol_id: line.Cpd2_ChEMBLID, smiles: line.Cpd2_SMILES })
MERGE (m1)-[r:MMP { transform: line.Transformation, targetid: line.Target_ChEMBLID }]->(m2); // Maybe MERGE is better...

OK, let’s do a substructure search! It is a really chemistry-specific problem, and it is useful for researchers who search for MMPs containing a user-defined substructure. The org.rdkit.search.* procedures return canonical SMILES and a score, so if you want to omit low-score results, I think you should use a WHERE clause.

CALL org.rdkit.search.substructure.smiles(['Chemical', 'Structure'], 'C1CCCCC1') YIELD canonical_smiles, score
MATCH (n)
// WHERE score > 500 AND n.smiles = canonical_smiles
RETURN n

Of course, exact match search is also available.

Similarity search example is below.

Neo4j also provides a Python driver: https://neo4j.com/developer/python/
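
As a rough sketch of how the substructure search above might be called through the Python driver: the procedure name and the `canonical_smiles` / `score` columns are taken from the Cypher example earlier in this post, while the function names, the bolt URI, and the credentials are assumptions of mine. I have not run this against a live server, so please treat it as a starting point only.

```python
def build_substructure_call(labels, smiles):
    """Build the Cypher text and parameters for a substructure search.

    The procedure and its YIELD columns come from the Cypher example in
    this post; passing the arguments as query parameters is my choice.
    """
    query = (
        "CALL org.rdkit.search.substructure.smiles($labels, $smiles) "
        "YIELD canonical_smiles, score "
        "RETURN canonical_smiles, score"
    )
    return query, {"labels": list(labels), "smiles": smiles}


def run_substructure_search(uri, user, password, labels, smiles):
    """Run the search and return a list of dicts, one per result row."""
    # Imported here so the query builder works without the driver installed.
    from neo4j import GraphDatabase

    query, params = build_substructure_call(labels, smiles)
    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            return [record.data() for record in session.run(query, params)]
    finally:
        driver.close()


# Example call (needs a running neo4j with the plugin; credentials are placeholders):
# rows = run_substructure_search("bolt://localhost:7687", "neo4j", "password",
#                                ["Chemical", "Structure"], "C1CCCCC1")
```

Passing the SMILES as a parameter avoids quoting problems with characters such as brackets and backslashes in the query string.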

This is a very short example of neo4j-rdkit. The LOAD CSV approach in my code is not good for large data sets. I would like to read the neo4j documentation and use it more deeply. This is a very cool plugin. Thank you, developers!

May the typhoon pass quickly!
