Use Neo4j to store MMP data.

I read the news about the ‘Panama Papers’.
The Panama Papers data was analysed with a graph database! That was exciting for me.
http://neo4j.com/blog/analyzing-panama-papers-neo4j/
So I became interested in Neo4j.
Fortunately, Mac users can install Neo4j with Homebrew. 😉
Neo4j has its own SQL-like query language named Cypher.
I used Cypher to load the data, for the following reason.
At first I used py2neo, but it was difficult to handle a large dataset with it. py2neo communicates with the Neo4j server over REST, so reading the data took a long time and caused a timeout.

The first step with Cypher is to create nodes and relationships.
This is simple to do.

A node is created with the following command.
CREATE ( n: name { property:value, …. } )
A relationship is created as follows.
CREATE (n)-[ r:name {property: value, ….} ]->(n1)
(n)-[r]->() represents a directed relationship, so the data forms a directed graph.
It is also easy to set properties. 😉
Let’s make a sample dataset. I got MMP (matched molecular pair) data from the following URL.
https://zenodo.org/record/8418#.VxLvURN97Uo
I renamed the file and checked the data.

iwatobipen$ head ChEMBL17_IC50_RECAP_MMP_list.csv 
Target_ChEMBLID,Cpd1_ChEMBLID,Cpd2_ChEMBLID,KeyFragment,Transformation,Cpd1_SMILES,Cpd2_SMILES
CHEMBL1075097,CHEMBL2348488,CHEMBL2326095,[R1]CCC(CCCCB(O)O)(C(=O)O)N,[R1]N(CC)CC>>[R1]N1CCCC1,OB(O)CCCCC([NH3+])(CC[NH+](CC)CC)C(=O)[O-],OB(O)CCCC[C@@]([NH3+])(CC[NH+]1CCCC1)C(=O)[O-]
CHEMBL1075097,CHEMBL2326085,CHEMBL2348486,[R1]CCC(CCCCB(O)O)(C(=O)O)N,[R1]N(CC)CC>>[R1]N1CCCC1,OB(O)CCCC[C@@]([NH3+])(CC[NH+](CC)CC)C(=O)[O-],OB(O)CCCCC([NH3+])(CC[NH+]1CCCC1)C(=O)[O-]
CHEMBL1075097,CHEMBL2348488,CHEMBL2348486,[R1]CCC(CCCCB(O)O)(C(=O)O)N,[R1]N(CC)CC>>[R1]N1CCCC1,OB(O)CCCCC([NH3+])(CC[NH+](CC)CC)C(=O)[O-],OB(O)CCCCC([NH3+])(CC[NH+]1CCCC1)C(=O)[O-]
iwatobipen$ wc ChEMBL17_IC50_RECAP_MMP_list.csv 
  240323  240323 60542670 ChEMBL17_IC50_RECAP_MMP_list.csv

The file contains MMP information derived from ChEMBL17.
Next, I loaded the data from neo4j-shell.
The CSV file has a header row, so I used the command LOAD CSV WITH HEADERS FROM…
I took the first 10,000 lines of the original dataset (the header plus 9,999 data rows).
iwatobipen$ head -n 10000 ChEMBL17_IC50_RECAP_MMP_list.csv > ChEMBL17_IC50_RECAP_MMP_10000list.csv
Loading these rows took a long time.

neo4j-sh (?)$ USING PERIODIC COMMIT 1000
> LOAD CSV WITH HEADERS FROM "file:////Users/iwatobipen/develop/py3env/neo4jtest/ChEMBL17_IC50_RECAP_MMP_10000list.csv" AS line
> MERGE (m1:mol { molid: line.Cpd1_ChEMBLID, smi: line.Cpd1_SMILES })
> MERGE (m2:mol { molid: line.Cpd2_ChEMBLID, smi: line.Cpd2_SMILES })
> CREATE (m1)-[r:MMP { transform:line.Transformation, targetid:line.Target_ChEMBLID}]->(m2); // Maybe MERGE is better...
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 3522
Relationships created: 9999
Properties set: 27042
Labels added: 3522
63150 ms

Hmm, it took too long (about 63 seconds). I would like to know a more efficient way.
Anyway, the data was loaded into the graph database.
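
To see why the load statistics above show far fewer nodes than relationships, here is a minimal Python sketch of what the MERGE / CREATE statement does with each row. It is not how Neo4j works internally, only an illustration of the semantics: MERGE reuses a node when every property in the pattern (here both molid and smi) matches, while CREATE adds one relationship per row. The three-row sample and the function name load_rows are hypothetical, made up for this example.

```python
import csv
import io

# Hypothetical sample in the same column layout as the MMP CSV.
# IDs and SMILES are shortened placeholders, not real ChEMBL records.
SAMPLE = """Target_ChEMBLID,Cpd1_ChEMBLID,Cpd2_ChEMBLID,KeyFragment,Transformation,Cpd1_SMILES,Cpd2_SMILES
T1,M1,M2,[R1]C,[R1]N>>[R1]O,CN,CO
T1,M3,M2,[R1]C,[R1]S>>[R1]O,CS,CO
T1,M1,M3,[R1]C,[R1]N>>[R1]S,CN,CS
"""


def load_rows(text):
    """Mimic MERGE for molecules and CREATE for MMP relationships."""
    nodes = {}          # keyed by (molid, smi), like MERGE on both properties
    relationships = []  # one entry per CSV row, like CREATE
    for line in csv.DictReader(io.StringIO(text)):
        m1 = (line["Cpd1_ChEMBLID"], line["Cpd1_SMILES"])
        m2 = (line["Cpd2_ChEMBLID"], line["Cpd2_SMILES"])
        # setdefault only adds the node if this key has not been seen before.
        nodes.setdefault(m1, {"molid": m1[0], "smi": m1[1]})
        nodes.setdefault(m2, {"molid": m2[0], "smi": m2[1]})
        relationships.append(
            (m1[0], m2[0], {"transform": line["Transformation"],
                            "targetid": line["Target_ChEMBLID"]})
        )
    return nodes, relationships


nodes, relationships = load_rows(SAMPLE)
print(len(nodes), "nodes,", len(relationships), "relationships")
```

Each molecule appears in several pairs, so the node count grows much more slowly than the relationship count, which matches the pattern in the statistics above.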
Then I opened http://localhost:7474 (the default setting).

I got the following image …
neo4jtop

Next, I wrote a query: “select each node and count its outgoing relationships as its degree”.

$ MATCH (node)-[r]->() RETURN node, count(r) AS degree ORDER BY degree DESC LIMIT 25;

query_res_cypher2

query_cypher1
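
For readers who want to check what the degree query computes, the same out-degree ranking can be sketched in plain Python. The edge list and the function name top_out_degree are hypothetical, chosen only for illustration. As with the Cypher pattern (node)-[r]->(), a node with no outgoing relationship does not appear in the result.

```python
from collections import Counter

# Hypothetical directed edges (source, target) standing in for MMP relationships.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("C", "A")]


def top_out_degree(edges, limit=25):
    """Count outgoing edges per node and return the highest counts first."""
    degree = Counter(source for source, _target in edges)
    return degree.most_common(limit)


ranking = top_out_degree(edges)
print(ranking)
```

Here node A has three outgoing edges and ranks first, while D only receives an edge and is therefore absent from the ranking.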

I think a graph database is useful for detecting matched molecular series (MMS), because an MMS is represented as a path like nodeA -> nodeB -> nodeC ….
A matched molecular square is likewise represented as a closed path like nodeA->nodeB->nodeC->nodeA.
These paths are easy to detect with Cypher.

MATCH (n)-->(a)-->(b)
RETURN n, a, b

or

MATCH (n)-->(a)-->(b)-->(n)
RETURN n, a, b

Try it…

$MATCH (n)-->(a)-->(n) RETURN n,a LIMIT 100;

Then I got the following result.
query_res_cypher3
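
The closed-path patterns used above can also be illustrated outside Neo4j. The following is a rough Python sketch, assuming a small hypothetical edge list, of what (n)-->(a)-->(n) and (n)-->(a)-->(b)-->(n) look for. It works on distinct edges and skips self-loops, so it is a simplification of Cypher's matching, not a replacement for it.

```python
# Hypothetical directed edges (source, target) standing in for MMP relationships.
edges = [("A", "B"), ("B", "A"), ("B", "C"), ("C", "A"), ("C", "D")]


def two_cycles(edges):
    """Pairs (n, a) where both n -> a and a -> n exist."""
    edge_set = set(edges)
    return sorted((n, a) for n, a in edge_set if n != a and (a, n) in edge_set)


def three_cycles(edges):
    """Triples (n, a, b) of distinct nodes with n -> a -> b -> n."""
    edge_set = set(edges)
    successors = {}
    for source, target in edge_set:
        successors.setdefault(source, set()).add(target)
    found = []
    for n, a in edge_set:
        for b in successors.get(a, ()):
            if (b, n) in edge_set and len({n, a, b}) == 3:
                found.append((n, a, b))
    return sorted(found)


print(two_cycles(edges))
print(three_cycles(edges))
```

As in the Cypher query, each cycle is reported once per starting node, so a single two-node cycle yields two rows and a single three-node cycle yields three.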

I think Neo4j is interesting for chemoinformatics, and I want to analyse in-house and public data with it ASAP.
