Natural Product Likeness Score

Some years ago, a large number of molecules were made with palladium-catalyzed cross-coupling reactions such as Suzuki-Miyaura, Negishi, Stille, etc.
These reactions had a great impact on medicinal chemistry, but they tend to produce flat molecules, i.e. molecules with a low Fsp3 score (a low fraction of sp3 carbons).
Now I often read phrases like ‘Escape from Flatland’, ‘sp3-rich molecules’, ‘3D diversity’ …
Natural products have complex, sp3-rich structures.
So natural product likeness is one of the scores that can be used to evaluate a compound library.
Fortunately, we can get the NP score using RDKit. 😉
RDKit provides an implementation of the following algorithm, and it is easy to use.
Natural Product-likeness Score and Its Application for Prioritization of Compound Libraries
Peter Ertl, Silvio Roggo, and Ansgar Schuffenhauer
Journal of Chemical Information and Modeling, 48, 68-74 (2008)
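
To make the idea behind the score more concrete, here is a toy sketch of the method as I understand it from the paper: each substructure fragment gets a log-ratio weight from how often it appears in natural products versus synthetic molecules, and the score of a molecule is the sum of the weights of its fragments divided by its atom count. The function names, the fragment names and all counts below are made up for illustration, and the log base is my assumption. The real model is trained on large compound collections, so please do not read these numbers as real data.

```python
import math

def fragment_weights(np_counts, sm_counts, np_total, sm_total):
    """Weight per fragment: log10 of (frequency in NPs) / (frequency in synthetics)."""
    weights = {}
    # Only fragments seen in both sets get a weight here, which avoids dividing by zero.
    for frag in np_counts.keys() & sm_counts.keys():
        ratio = (np_counts[frag] / np_total) / (sm_counts[frag] / sm_total)
        weights[frag] = math.log10(ratio)
    return weights

def np_likeness(fragments, weights, n_atoms):
    """Sum the weights of a molecule's fragments and normalize by its atom count."""
    total = sum(weights.get(frag, 0.0) for frag in fragments)  # unknown fragments count as 0
    return total / n_atoms

# Hypothetical fragment counts in a natural product set and a synthetic set.
np_counts = {"sugar_ring": 400, "biaryl": 10}
sm_counts = {"sugar_ring": 10, "biaryl": 400}
weights = fragment_weights(np_counts, sm_counts, np_total=1000, sm_total=1000)

natural_like = np_likeness(["sugar_ring", "sugar_ring"], weights, n_atoms=20)
flat_like = np_likeness(["biaryl"], weights, n_atoms=12)
print(natural_like, flat_like)
```

In this toy case the sugar-containing molecule gets a positive score and the biaryl a negative one, which is the same direction as the real score: positive means more natural product-like.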

Let's try it. First, I got the datasets from NCI:
Diversity set 5 and the Natural products set, both as SDF. Then I converted the SDF files to SMILES.

from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit import rdBase

# Load both SD files, skipping records that RDKit could not parse (returned as None).
div5 = [m for m in Chem.SDMolSupplier('Div5_2DStructures_Oct2014.sdf') if m is not None]
nat = [m for m in Chem.SDMolSupplier('NAtProd4.sdf') if m is not None]

# Write one "SMILES NSC category" line per molecule; the with-block closes the file.
with open('dataset.txt', 'w') as f:
    for m in div5:
        name = m.GetProp('NSC')
        smi = Chem.MolToSmiles(m)
        f.write(smi + ' ' + name + ' DIV\n')
    for m in nat:
        name = m.GetProp('NSC')
        smi = Chem.MolToSmiles(m)
        f.write(smi + ' ' + name + ' NAT\n')

OK, next I ran the scorer on the dataset and merged the results with the categories.

NP_Score iwatobipen$ python npscorer.py dataset.txt > res.txt
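
The scores can also be calculated inside Python instead of from the command line. This is a minimal sketch, assuming that your RDKit installation ships the Contrib directory (`RDConfig.RDContribDir`) with `NP_Score/npscorer.py` and its model file. The function name `np_scores` is my own and not part of RDKit.

```python
import os
import sys

def np_scores(smiles_list):
    """Return (smiles, NP-likeness score) pairs; unparsable SMILES get None."""
    # Imported inside the function so this file can be loaded where RDKit is absent.
    from rdkit import Chem, RDConfig

    # npscorer.py lives in RDKit's Contrib directory, so add that folder to the path.
    contrib_dir = os.path.join(RDConfig.RDContribDir, "NP_Score")
    if contrib_dir not in sys.path:
        sys.path.append(contrib_dir)
    import npscorer

    fscore = npscorer.readNPModel()  # load the fragment model that comes with the script
    results = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        score = npscorer.scoreMol(mol, fscore) if mol is not None else None
        results.append((smi, score))
    return results
```

For example, `np_scores(['CCO', 'c1ccccc1'])` should give one pair per input SMILES. Loading the model takes a moment, so it is better to score many molecules in one call.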
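
import seaborn as sns
import pandas as pd
# dataset.txt is space-separated; res.txt (the scorer output) is tab-separated.
df1 = pd.read_table('dataset.txt', sep=' ', names=['smi', 'nsc', 'cat'])
df2 = pd.read_table('res.txt', sep='\t', names=['smi', 'nsc', 'np'])
# Merge on the NSC number (assumed key) so each score gets its DIV/NAT category.
df = pd.merge(df1[['nsc', 'cat']], df2, on='nsc')
# Note: distplot is deprecated in recent seaborn; histplot(..., kde=True) is the replacement.
sns.distplot(df[df['cat'] == 'DIV']['np'])
sns.distplot(df[df['cat'] == 'NAT']['np'])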
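
If pandas is not at hand, the same merge on the NSC number can be sketched with the standard library only. The rows below are made-up examples in the same layout as dataset.txt and res.txt (the scores are not real results), and the function name `scores_by_category` is my own.

```python
import csv
import io
import statistics
from collections import defaultdict

# Made-up rows: dataset.txt is space-separated, res.txt is tab-separated.
dataset_txt = "CCO 1 DIV\nc1ccccc1 2 DIV\nOCC(O)CO 3 NAT\n"
res_txt = "CCO\t1\t-0.5\nc1ccccc1\t2\t-1.5\nOCC(O)CO\t3\t1.0\n"

def scores_by_category(dataset_file, result_file):
    """Join scores to categories on the NSC column and group them per category."""
    # Map NSC -> category from the dataset file.
    category = {nsc: cat for _smi, nsc, cat in csv.reader(dataset_file, delimiter=" ")}
    grouped = defaultdict(list)
    # Inner join on NSC: keep only scored rows whose NSC has a known category.
    for _smi, nsc, score in csv.reader(result_file, delimiter="\t"):
        if nsc in category:
            grouped[category[nsc]].append(float(score))
    return dict(grouped)

grouped = scores_by_category(io.StringIO(dataset_txt), io.StringIO(res_txt))
for cat, scores in sorted(grouped.items()):
    print(cat, len(scores), statistics.mean(scores))
```

With the real files, passing `open('dataset.txt')` and `open('res.txt')` instead of the `io.StringIO` objects should work in the same way.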

I got the following image.
The NP set showed higher scores than the Div5 set.

[screenshot: distributions of the NP scores for the Div5 set and the NP set]

The data summaries are as follows: the Div5 set first, then the NP set.

count    1593.000000
mean       -0.654590
std         1.035716
min        -3.258000
25%        -1.325000
50%        -0.753000
75%        -0.162000
max         4.054000
Name: np, dtype: float64

count    419.000000
mean       1.594697
std        1.012731
min       -1.541000
25%        0.911500
50%        1.485000
75%        2.228000
max        4.054000
Name: np, dtype: float64

