Some years ago, a large number of molecules were produced using palladium-catalysed cross-coupling reactions such as Suzuki-Miyaura, Negishi, and Stille.
These reactions had a great impact on medicinal chemistry, but they tend to produce flat molecules, that is, molecules with a low Fsp3 score.
Now I often read phrases like ‘Escape from Flatland’, ‘sp3-rich molecules’, and ‘3D diversity’.
Natural products show complex, sp3-rich structures.
So natural product-likeness is one of the scores used to evaluate a compound library.
Fortunately, we can get an NP score using RDKit. ;-)
RDKit provides an implementation of the following algorithm, and it is easy to use.
Natural Product-likeness Score and Its Application for Prioritization of Compound Libraries
Peter Ertl, Silvio Roggo, and Ansgar Schuffenhauer
Journal of Chemical Information and Modeling, 48, 68-74 (2008)
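For reference, the scorer can also be called directly from Python instead of through the command line. The following is only a minimal sketch: it assumes your RDKit installation ships the Contrib directory (some installs do not), and the helper names `load_npscorer`, `np_score`, and `score_many` are hypothetical names chosen for this example. Only `readNPModel` and `scoreMol` come from the Contrib `npscorer.py` script.

```python
import os
import sys


def load_npscorer():
    """Import the Contrib npscorer module (it is not part of the regular rdkit package)."""
    # Imported inside the function so that a missing RDKit fails here, with a clear traceback
    from rdkit import RDConfig

    contrib_dir = os.path.join(RDConfig.RDContribDir, "NP_Score")
    if contrib_dir not in sys.path:
        sys.path.append(contrib_dir)
    import npscorer  # assumption: Contrib/NP_Score is present in this installation

    return npscorer


def np_score(smiles, scorer, model):
    """Return the NP-likeness score for one SMILES, or None if RDKit cannot parse it."""
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return scorer.scoreMol(mol, model)


def score_many(smiles_list):
    """Load the model once and score every SMILES in the list."""
    scorer = load_npscorer()
    model = scorer.readNPModel()  # reads the model file stored next to npscorer.py
    return {smi: np_score(smi, scorer, model) for smi in smiles_list}
```

Calling `score_many(['CCO', 'c1ccccc1'])` should return a dictionary that maps each SMILES to its score. Loading the model once and reusing it matters because reading the model file is the slow step.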
Let's try it. First, I got the datasets from NCI.
https://wiki.nci.nih.gov/display/NCIDTPdata/Compound+Sets
I downloaded Diversity Set 5 and the Natural Products Set as SDF files and converted them to SMILES.
from rdkit import Chem

# Load both NCI sets, skipping records that RDKit could not parse
div5 = [m for m in Chem.SDMolSupplier('Div5_2DStructures_Oct2014.sdf') if m is not None]
nat = [m for m in Chem.SDMolSupplier('NAtProd4.sdf') if m is not None]

# Write one "SMILES NSC category" line per molecule
with open('dataset.txt', 'w') as f:
    for m in div5:
        name = m.GetProp('NSC')
        smi = Chem.MolToSmiles(m)
        f.write(smi + ' ' + name + ' DIV\n')
    for m in nat:
        name = m.GetProp('NSC')
        smi = Chem.MolToSmiles(m)
        f.write(smi + ' ' + name + ' NAT\n')
OK, next I ran npscorer.py and merged the results with the dataset.
NP_Score iwatobipen$ python npscorer.py dataset.txt > res.txt
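import seaborn as sns
import pandas as pd

df1 = pd.read_table('dataset.txt', sep=' ', names=['smi', 'nsc', 'cat'])
df2 = pd.read_table('res.txt', sep='\t', names=['smi', 'nsc', 'np'])
# Positional join: assumes res.txt has the same rows, in the same order, as dataset.txt
df = df1.join(df2.np)
# Note: distplot is deprecated in recent seaborn releases (histplot/kdeplot replace it)
sns.distplot(df[df.cat == 'DIV'].np)
sns.distplot(df[df.cat == 'NAT'].np)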
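The join in the snippet above is positional, so it only works if npscorer.py writes exactly one row per input row, in the same order. If any molecule were skipped, the scores would shift against the wrong compounds. A rough sketch of merging on the NSC identifier instead is shown here, using only the standard library. The function names are hypothetical, and the tiny inline inputs are made-up placeholder values for illustration, not real scores.

```python
import csv
import io
from statistics import mean


def read_scores(handle):
    """Map NSC id -> NP score from tab-separated rows of (smiles, name, score)."""
    scores = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        _smiles, nsc, score = row
        scores[nsc] = float(score)
    return scores


def attach_scores(dataset_handle, scores):
    """Yield (nsc, category, score) for every dataset row that has a score."""
    for line in dataset_handle:
        parts = line.split()
        if len(parts) != 3:
            continue
        _smiles, nsc, category = parts
        if nsc in scores:
            yield nsc, category, scores[nsc]


def mean_by_category(records):
    """Average the scores within each category label (e.g. DIV, NAT)."""
    grouped = {}
    for _nsc, category, score in records:
        grouped.setdefault(category, []).append(score)
    return {category: mean(values) for category, values in grouped.items()}


# Made-up placeholder inputs in the same layout as dataset.txt and res.txt
dataset = io.StringIO("CCO 1 DIV\nc1ccccc1 2 DIV\nCC(=O)O 3 NAT\n")
result = io.StringIO("CCO\t1\t-0.5\nCC(=O)O\t3\t1.5\n")  # the row for NSC 2 is missing

records = list(attach_scores(dataset, read_scores(result)))
print(mean_by_category(records))
```

With the real files, the `io.StringIO` objects would be replaced by `open('dataset.txt')` and `open('res.txt')`. Looking scores up by key also tolerates a compound that appears in both sets, since the same structure gets the same score.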
I got the following image.
The NP set showed higher scores than the Div5 set.
The data summary is as follows.
count    1593.000000
mean       -0.654590
std         1.035716
min        -3.258000
25%        -1.325000
50%        -0.753000
75%        -0.162000
max         4.054000
Name: np, dtype: float64

In [39]: df[df.cat=='NAT'].np.describe()
Out[39]:
count    419.000000
mean       1.594697
std        1.012731
min       -1.541000
25%        0.911500
50%        1.485000
75%        2.228000
max        4.054000
Name: np, dtype: float64
Hello, nice share.
Could you please provide me with the file npscorer.py?
Hi Mohamed,
Thanks for your query.
You can call the script from the RDKit Contrib folder.
You can also see the code at the following URL:
https://github.com/rdkit/rdkit/tree/master/Contrib/NP_Score
Thanks ;)
Thanks, that helped me a lot.