A few days ago, a colleague asked me how to do Jarvis-Patrick clustering in Python.
Jarvis-Patrick (JP) clustering is an unsupervised learning method. Like k-means it needs no labels, but instead of centroids it groups elements that share enough of their nearest neighbors.
I found a good answer to his question: a library named “jarvispatrick” was developed recently and uploaded to PyPI. ;-)
So the library can be installed with pip (pip install jarvispatrick).
The library has only one class, so it’s very simple.
To cluster, the user only needs to call JarvisPatrick and set two parameters: k, the number of nearest neighbors, and k-min, the minimum number of neighbors that two elements must have in common.
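To make k and k-min concrete, here is a rough pure-Python sketch of the idea behind Jarvis-Patrick clustering. It is not the jarvispatrick library's code. The function name jarvis_patrick and the linking rule (two elements are joined when each is among the other's k nearest neighbors and they share at least k-min neighbors) are one common variant of the method that I am assuming for illustration.

```python
from itertools import combinations


def jarvis_patrick(items, dist, k, k_min):
    """Illustrative Jarvis-Patrick clustering; returns {cluster_idx: [items]}."""
    n = len(items)

    # k nearest neighbours of each item (stored as indices), excluding itself
    neighbours = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: dist(items[i], items[j]))
        neighbours.append(set(others[:k]))

    # union-find over item indices to merge linked items into clusters
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        mutual = i in neighbours[j] and j in neighbours[i]
        shared = len(neighbours[i] & neighbours[j])
        if mutual and shared >= k_min:
            parent[find(i)] = find(j)

    # collect members per root, then renumber clusters 0, 1, 2, ...
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(items[i])
    return dict(enumerate(groups.values()))


# toy data: two well separated groups of points on a line
points = [0, 1, 2, 3, 100, 101, 102, 103]
clusters = jarvis_patrick(points, lambda a, b: abs(a - b), k=3, k_min=2)
print(clusters)  # {0: [0, 1, 2, 3], 1: [100, 101, 102, 103]}
```

Raising k-min makes the rule stricter: with k=3 and k_min=3 on the same toy data, no pair shares three neighbors, so every point ends up in its own cluster.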
I wrote a sample script, shown below.
I used RDKit for the fingerprint calculation. (In the GitHub documentation, the author used chemfp.)
The JarvisPatrick class takes two arguments: first the list of elements, and second a function for distance calculation.
After making the cluster_gen object, call it with the k and k-min arguments. That’s all!
It’s a very easy and simple way to cluster molecules.
Users can choose any fingerprint calculation method.
from __future__ import print_function
import sys

from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs
import jarvispatrick

inF = sys.argv[ 1 ]
outF = Chem.SDWriter( 'morgan_res.sdf' )
# skip entries that RDKit could not parse
mols = [ mol for mol in Chem.SDMolSupplier( inF ) if mol is not None ]

# calc ECFP4 like fp as bit vect.
# JarvisPatrick takes a distance function, so return 1 - Tanimoto similarity.
def get_ecfp_dist( m1, m2 ):
    fp1 = AllChem.GetMorganFingerprintAsBitVect( m1, 2 )
    fp2 = AllChem.GetMorganFingerprintAsBitVect( m2, 2 )
    tc = DataStructs.TanimotoSimilarity( fp1, fp2 )
    return 1.0 - tc

cluster_gen = jarvispatrick.JarvisPatrick( mols, get_ecfp_dist )
# cluster is dictionary.
# key is index of cluster, value is list of molecules(elements).
cluster = cluster_gen( 4, 2 ) #(k and k-min)
for k, v in cluster.items():
    for mol in v:
        # tag each molecule with its cluster index before writing it out
        mol.SetProp( 'cluster_idx', str( k ) )
        outF.write( mol )
outF.close()
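Since the second argument of JarvisPatrick is just a two-argument function, swapping the fingerprint only means swapping what that function computes. Here is a small sketch of that idea in plain Python. The names tanimoto_distance, make_distance and bigram_fingerprint are my own, not part of RDKit or jarvispatrick, and the character-bigram "fingerprint" is only a toy stand-in for a real molecular fingerprint.

```python
def tanimoto_distance(bits_a, bits_b):
    """1 - Tanimoto similarity for fingerprints given as sets of on-bits."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # two empty fingerprints: treat them as identical
    return 1.0 - len(bits_a & bits_b) / float(union)


def make_distance(fingerprint):
    """Wrap any fingerprint function into a two-argument distance function."""
    def distance(a, b):
        return tanimoto_distance(fingerprint(a), fingerprint(b))
    return distance


def bigram_fingerprint(text):
    """Toy fingerprint (hypothetical): the set of character bigrams."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


dist = make_distance(bigram_fingerprint)
print(dist("CCO", "CCO"))  # identical inputs give 0.0
print(dist("CC", "NN"))    # no bigram in common gives 1.0
```

In a real script, fingerprint would be whichever RDKit fingerprint function you prefer (returning something set-like, or with tanimoto_distance replaced by an RDKit similarity call), and the resulting dist would be passed as the second argument to JarvisPatrick.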