# Scoring 3D diversity using RDKit #RDKit

Recently importance of 3D character of molecules are increasing.
If I design a libraries, I want to estimate not only 2D, but also 3D diversity.
Fortunately RDKit implemented function for characterize the 3D character of molecules named ‘plane of best fit’ (PBF).
You can call this function from rdkit/Contrib/PBF folder. ;-) Great!!!
And reference of PBF is following.
Plane of Best Fit: A Novel Method to Characterize the Three-Dimensionality of Molecules
http://pubs.acs.org/doi/abs/10.1021/ci300293f

In the paper, the authors compare with relationship between PBFscore and PMI( principal moment of inertia ). I’m interested in the algorithm, so I calculated two scores.
At first, I can calculate PBF by using RDKit Contrib/PBF code. But code for calculate PMI is not available. So, I write calculator.
My code is following. calc_pmi.py.

```import sys
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.Chem import AllChem
from numpy import linalg
from sklearn.preprocessing import normalize
import numpy as np

def inertia( mol,confId=-1,  principal = True, moments_only = True  ):
numatoms = mol.GetNumAtoms()
conf = mol.GetConformer( confId )
if not conf.Is3D():
return 0
#get cordinate of each atoms
pts =  np.array( [ list( conf.GetAtomPosition(atmidx) ) for atmidx in range( numatoms ) ]  )
atoms = [ atom for atom in mol.GetAtoms() ]
mass = Descriptors.MolWt( mol )
#get center of mass
center_of_mass = np.array(np.sum( atoms[i].GetMass() * pts[i] for i in range( numatoms ))) /mass

inertia = np.zeros( (3,3) )
for atmidx in range( numatoms ):
cmx,cmy,cmz = pts[ atmidx ] - center_of_mass
inertia += -atoms[ atmidx ].GetMass() / mass * np.matrix(
[[ 0, -cmz, cmy ],
[ cmz, 0, -cmx ],
[ -cmy, cmx, 0 ]]
) ** 2
if principal:
if moments_only:
return np.linalg.eigvalsh( inertia )
else:
return np.linalg.eigh( inertia )
else:
return inertia
if __name__=='__main__':
mols = [ mol for mol in Chem.SDMolSupplier( sys.argv, removeHs=False )  ]
for mol in mols:
I = inertia(mol)
npr1, npr2 = I/I, I/I
print( npr1, npr2 )
```

OK, let’s start calculation!
I picked up sample from one molecule from the reference.

```from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
IPythonConsole.ipython_useSVG=True
# pbf imported from rdkit/Contrib/PBF
import pbf
import calc_pmi

#(1) mol in the ref.
mol = Chem.MolFromSmiles( 'c1cc(O)c(C(=O)O)cc1N=Nc1ccc(O)c(C(=O)O)c1' )
AllChem.EmbedMolecule(hmol,useExpTorsionAnglePrefs=True ,useBasicKnowledge=True)
AllChem.UFFOptimizeMolecule(hmol)
rhmol = Chem.RemoveHs(hmol)

hmolI = calc_pmi.inertia(hmol)
rmolI = calc_pmi.inertia(rhmol)
```

Compare score with results in the reference.

```#PBF score
print(pbf.PBFRD(hmol), pbf.PBFRD(rhmol))
#out
#0.140618602774 0.0769181645171

#NPR1, NPR2
hmolI = calc_pmi.inertia(hmol)
rmolI = calc_pmi.inertia(rhmol)
print(hmolI/hmolI, hmolI/hmolI)
print(rmolI/rmolI, rmolI/rmolI)
# out
#with hydrogen
#0.128978367365 0.876583937787
#without hydorgen
#0.125631084212 0.879566191929
```

In the reference data is following.

NPR1 = 0.089 ( 0.128978367365, 0.125631084212 )
NPR2 = 0.910 ( 0.876583937787, 0.879566191929 )
PBF = 0.00313 ( 0.140618602774 , 0.0769181645171)
*data which surrounded by parentheses is calculated data with hydrogen, and without hydrogen in my code.

My data indicate that PMI does not see depend on existence of hydrogen, but PBF score is depended on hydrogen.
And data without hydrogen seems good correlation of literature data. I can’t understand the reason. :-(
I need to study more deep in the method.
BTW RDKit has a lots of function for chemoinformatics. It’s Swiss knife for chemoinformatics!

I referred following URL.
The URL gave good suggestion for me.
https://github.com/wrwrwr/mmoi-calc

Also, You can calculate PMI using Knime vernalis community nodes!!!
Wow! that’s nice for OSS! That’s so cool! ;-) 