JP-clustering with Python and RDKit

A few days ago, a colleague asked me how to do Jarvis-Patrick clustering in Python.
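JP-clustering is an unsupervised learning method. Like k-means clustering it partitions the data into groups, but it does so by looking at the nearest neighbors that elements share rather than at distances to cluster centers.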
I found a good solution to his question. A library named “jarvispatrick” was developed recently and uploaded to PyPI. 😉
So the library can be installed with the pip command.
The library has only one class, so it’s very simple.
To do clustering, the user only needs to call JarvisPatrick and set two parameters: the number of nearest neighbors (k) and the minimum number of neighbors that must be in common (k-min).
I wrote a sample snippet.
I used RDKit for the fingerprint calculation. (In the GitHub document, the author used chemfp.)
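
To make the two parameters a bit more concrete, here is a minimal pure-Python sketch of the classic Jarvis-Patrick idea: two elements end up in the same cluster when each is in the other's k-nearest-neighbor list and they share at least k-min neighbors. This is only my illustration under that assumption, not the code of the jarvispatrick library, whose exact merging rule may differ. The function name `jarvis_patrick` and the toy 1-D points are made up for the example.

```python
from itertools import combinations


def jarvis_patrick(elements, distance, k, k_min):
    """Toy Jarvis-Patrick clustering; returns a list of clusters (lists of elements)."""
    n = len(elements)

    # k nearest neighbors of every element (as index sets), excluding the element itself
    neighbors = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: distance(elements[i], elements[j]))
        neighbors.append(set(others[:k]))

    # union-find structure used to merge elements that pass the test
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        mutual = i in neighbors[j] and j in neighbors[i]
        shared = len(neighbors[i] & neighbors[j])
        if mutual and shared >= k_min:
            parent[find(i)] = find(j)

    # collect elements by the root of their merged group
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(elements[i])
    return list(clusters.values())


# toy data: two tight groups and one outlier
points = [1.0, 1.1, 1.2, 1.3, 5.0, 5.1, 5.2, 5.3, 9.0]
clusters = jarvis_patrick(points, lambda a, b: abs(a - b), k=3, k_min=2)
for members in clusters:
    print(members)
```

With these toy points the two tight groups each form one cluster and 9.0 stays alone, because no other point has it among its three nearest neighbors. Raising k-min makes the test stricter, so clusters split up more easily.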

To use the JarvisPatrick class, I pass two arguments: first the list of elements, second a function that calculates the distance between two elements.
After making the cluster_gen object, I call it with the k and k-min args. That’s all!
It’s a very easy and simple method for clustering molecules.
The user can choose any fingerprint calculation method.

from __future__ import print_function
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs
import jarvispatrick
import sys
inF = sys.argv[ 1 ]
outF = Chem.SDWriter( 'morgan_res.sdf' )
# skip entries that RDKit could not parse (the supplier yields None for them)
mols = [ mol for mol in Chem.SDMolSupplier( inF ) if mol is not None ]
# calc ECFP4 like fp as bit vect.
# JarvisPatrick takes a distance function, so return 1 - Tanimoto similarity.
def get_ecfp_dist( m1, m2 ):
    fp1 = AllChem.GetMorganFingerprintAsBitVect( m1, 2 )
    fp2 = AllChem.GetMorganFingerprintAsBitVect( m2, 2 )
    tc = DataStructs.TanimotoSimilarity( fp1, fp2 )
    return 1.0 - tc

cluster_gen = jarvispatrick.JarvisPatrick( mols, get_ecfp_dist )
# cluster is a dictionary.
# key is the index of the cluster, value is the list of molecules (elements).
cluster = cluster_gen( 4, 2 ) # (k, k-min)
for k, v in cluster.items():
    for mol in v:
        # tag each molecule with its cluster index and write it to the SD file
        mol.SetProp( 'cluster_idx', str( k ) )
        outF.write( mol )
outF.close()
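
The Tanimoto coefficient computed by DataStructs.TanimotoSimilarity in the snippet is just the number of bits two fingerprints share divided by the number of bits set in either one, and one minus that value can serve as a distance. Here is a small stdlib-only sketch of that arithmetic on toy fingerprints written as sets of on-bit indices. The function names and the bit sets are made up for illustration and are not RDKit output.

```python
def tanimoto_similarity(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # convention chosen for this sketch when both fingerprints are empty
    return len(bits_a & bits_b) / float(union)


def tanimoto_distance(bits_a, bits_b):
    """Distance derived from the similarity: 0 for identical, 1 for disjoint."""
    return 1.0 - tanimoto_similarity(bits_a, bits_b)


fp_a = {1, 5, 9, 42}
fp_b = {1, 5, 9, 77}   # shares 3 of 5 distinct bits with fp_a
fp_c = {3, 8}          # shares nothing with fp_a

sim_ab = tanimoto_similarity(fp_a, fp_b)
dist_ab = tanimoto_distance(fp_a, fp_b)
dist_ac = tanimoto_distance(fp_a, fp_c)
print(sim_ab, dist_ab, dist_ac)
```

Any function of this shape (two elements in, one number out) can be handed to the clustering class, which is why the fingerprint type is free to choose.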

