Use Neo4j to store MMP data.

I read the news about the ‘Panama Papers’.
The Panama Papers data was analysed with a graph database! That was exciting for me.
So, I became interested in Neo4j.
Fortunately, Mac users can install Neo4j with Homebrew. 😉
Neo4j has its own SQL-like query language named Cypher.
I used Cypher to read the data in, for the following reason.
At first I used py2neo, but it was difficult to handle a large dataset. py2neo communicates with the Neo4j server over REST, so reading the data took a long time and caused timeouts.

The first step in getting started with Cypher is creating nodes and relationships.
It can be done in a simple way.

A node is created with the following command.
CREATE ( n: name { property:value, …. } )
And a relationship is created as follows.
CREATE (n)-[ r:name {property: value, ….} ]->(n1)
The pattern (n)-[r]->() represents a directed relationship, i.e. an edge of a directed graph.
It was also easy to set properties. 😉
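To make the two CREATE templates above a bit more concrete, here is a minimal Python sketch that only builds a Cypher statement and its parameters, without talking to a server. The function name and the sample values are made up for illustration; the label `mol`, the relationship type `MMP` and the property names are the ones I use later in this post. It assumes the `$name` parameter syntax of recent Neo4j versions (older versions wrote parameters as `{name}`).

```python
def build_create_pair(mol1, mol2, transform):
    """Return (query, params) for two nodes joined by one directed relationship.

    Hypothetical helper: it only builds text and a dict, nothing is sent
    to Neo4j here.
    """
    query = (
        "CREATE (m1:mol {molid: $id1, smi: $smi1})"
        "-[r:MMP {transform: $transform}]->"
        "(m2:mol {molid: $id2, smi: $smi2})"
    )
    params = {
        "id1": mol1["molid"],
        "smi1": mol1["smi"],
        "id2": mol2["molid"],
        "smi2": mol2["smi"],
        "transform": transform,
    }
    return query, params


# Made-up example values, not taken from the ChEMBL file.
query, params = build_create_pair(
    {"molid": "MOL_A", "smi": "c1ccccc1"},
    {"molid": "MOL_B", "smi": "Cc1ccccc1"},
    "H>>CH3",
)
print(query)
print(params)
```

Keeping the values in a separate parameter dict, rather than pasting them into the query string, avoids quoting problems with SMILES strings.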
Let’s make a sample dataset. I got the MMP data from the following URL.
I renamed the file and checked its contents.

iwatobipen$ head ChEMBL17_IC50_RECAP_MMP_list.csv 
iwatobipen$ wc ChEMBL17_IC50_RECAP_MMP_list.csv 
  240323  240323 60542670 ChEMBL17_IC50_RECAP_MMP_list.csv

The data contains MMP information from ChEMBL17.
Next, I loaded the data from neo4j-shell.
The CSV file has a header, so I used the command LOAD CSV WITH HEADERS FROM…
I took the first 10,000 lines (the header plus 9,999 pairs) from the original dataset.
iwatobipen$ head -n 10000 ChEMBL17_IC50_RECAP_MMP_list.csv > ChEMBL17_IC50_RECAP_MMP_10000list.csv
Loading even these 10,000 lines took a lot of time.

neo4j-sh (?)$ USING PERIODIC COMMIT 1000
> LOAD CSV WITH HEADERS FROM "file:////Users/iwatobipen/develop/py3env/neo4jtest/ChEMBL17_IC50_RECAP_MMP_10000list.csv" AS line
> MERGE (m1:mol { molid: line.Cpd1_ChEMBLID, smi: line.Cpd1_SMILES })
> MERGE (m2:mol { molid: line.Cpd2_ChEMBLID, smi: line.Cpd2_SMILES })
> CREATE (m1)-[r:MMP { transform:line.Transformation, targetid:line.Target_ChEMBLID}]->(m2); # Maybe MERGE is better...
| No data returned. |
Nodes created: 3522
Relationships created: 9999
Properties set: 27042
Labels added: 3522
63150 ms

Hmm, it took too long (about 63 seconds). I want to know a more efficient way.
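The counters printed by neo4j-shell can be sanity-checked with a small Python sketch that mimics what the query does: MERGE creates a node only if one with the same molid and smi does not exist yet, while CREATE adds one relationship per CSV row. The three sample rows below are made up (only the column names come from the query), and this is a simulation of the semantics, not of how Neo4j works internally. A likely reason for the slow load is that each MERGE has to search the existing nodes; an index or uniqueness constraint on molid is the usual suggestion, though that was not measured here.

```python
import csv
import io

# Made-up rows using the column names referenced in the LOAD CSV query.
SAMPLE = """Cpd1_ChEMBLID,Cpd1_SMILES,Cpd2_ChEMBLID,Cpd2_SMILES,Transformation,Target_ChEMBLID
MOL_A,c1ccccc1,MOL_B,Cc1ccccc1,H>>CH3,TARGET_1
MOL_B,Cc1ccccc1,MOL_C,CCc1ccccc1,CH3>>C2H5,TARGET_1
MOL_A,c1ccccc1,MOL_C,CCc1ccccc1,H>>C2H5,TARGET_1
"""


def simulate_load(rows):
    """Count nodes, relationships and properties the query would create."""
    nodes = {}  # (molid, smi) -> properties; plays the role of MERGE
    rels = []   # one entry per row; plays the role of CREATE
    for row in rows:
        for prefix in ("Cpd1", "Cpd2"):
            key = (row[prefix + "_ChEMBLID"], row[prefix + "_SMILES"])
            nodes.setdefault(key, {"molid": key[0], "smi": key[1]})
        rels.append({
            "transform": row["Transformation"],
            "targetid": row["Target_ChEMBLID"],
        })
    n_props = sum(len(p) for p in nodes.values()) + sum(len(p) for p in rels)
    return len(nodes), len(rels), n_props


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
n_nodes, n_rels, n_props = simulate_load(rows)
print(n_nodes, n_rels, n_props)

# The same bookkeeping applied to the real output above:
# 2 properties per node and 2 per relationship.
assert 2 * 3522 + 2 * 9999 == 27042
```

So the "Properties set: 27042" line is consistent with 3522 merged nodes and 9999 created relationships, each carrying two properties.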
Anyway, I loaded the data into the graph database.
Then I accessed http://localhost:7474 (the default setting).

I got the following image …

Next, I wrote a query that selects each node and counts its outgoing relationships as its degree.

$MATCH (node)-[r]->() RETURN node, count (r) AS degree ORDER BY degree DESC LIMIT 25;
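For readers who do not have Neo4j at hand, the query above can be mimicked in plain Python. This is only a sketch on a made-up edge list: each (source, target) tuple stands for one directed relationship, and the out-degree is simply how often a node appears as the source. Like the Cypher query, it leaves out nodes with no outgoing relationship.

```python
from collections import Counter

# Made-up directed edges (source, target); not taken from the ChEMBL data.
edges = [
    ("A", "B"),
    ("A", "C"),
    ("B", "C"),
    ("C", "A"),
]

# Count outgoing relationships per node, highest first, at most 25 rows
# (the same idea as ORDER BY degree DESC LIMIT 25).
out_degree = Counter(src for src, _dst in edges)
degree_table = out_degree.most_common(25)

for node, degree in degree_table:
    print(node, degree)
```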



I think a graph database is useful for detecting matched molecular series (MMS), because an MMS is represented as a path like nodeA -> nodeB -> nodeC ….
A matched molecular square is likewise represented as a closed path like nodeA->nodeB->nodeC->nodeA.
These paths are easy to detect with Cypher.

MATCH (n)-->(a)-->(b)
RETURN n, a, b


MATCH (n)-->(a)-->(b)-->(n)
RETURN n, a, b
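To show what these two patterns return, here is a rough Python sketch of the same idea on a made-up directed graph. The function names are hypothetical, and it assumes a simple graph without self-loops or parallel edges. As with the Cypher patterns, a closed path is reported once per starting node, so one cycle of three nodes shows up three times.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Map each source node to the set of nodes it points to."""
    adj = defaultdict(set)
    for src, dst in edges:
        adj[src].add(dst)
    return adj


def two_step_paths(adj):
    """All (n, a, b) with n -> a -> b, like MATCH (n)-->(a)-->(b)."""
    return [
        (n, a, b)
        for n in sorted(adj)
        for a in sorted(adj[n])
        for b in sorted(adj.get(a, ()))
    ]


def closed_paths(adj):
    """All (n, a, b) with n -> a -> b -> n, like MATCH (n)-->(a)-->(b)-->(n)."""
    return [(n, a, b) for n, a, b in two_step_paths(adj) if n in adj.get(b, ())]


# Made-up directed edges; not taken from the ChEMBL data.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]
adj = build_adjacency(edges)

paths = two_step_paths(adj)
cycles = closed_paths(adj)
print(paths)
print(cycles)
```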

Let’s try a query in the Neo4j browser…

$MATCH (n)-->(a)-->(n) RETURN n,a LIMIT 100;

Then I got the following result.

I think Neo4j is interesting for chemoinformatics, and I want to analyse in-house and public data ASAP.

