Difference between sanitized mol and non-sanitized mol #memo #rdkit

I previously posted about fast compound search with RDKit, and in that post I used the pattern fingerprint.

Today I checked the behavior of this fingerprint. PatternFingerprint can be calculated for molecules that are not sanitized. However, the result can differ from the fingerprint calculated from the sanitized molecule.

Here is a simple example.

from rdkit import Chem
from rdkit import DataStructs

# Benzene written in aromatic and Kekulé form, each parsed with and without sanitization
m1 = Chem.MolFromSmiles('c1ccccc1')
m2 = Chem.MolFromSmiles('c1ccccc1', sanitize=False)
m3 = Chem.MolFromSmiles('C1=CC=CC=C1')
m4 = Chem.MolFromSmiles('C1=CC=CC=C1', sanitize=False)
mols = [m1, m2, m3, m4]
# Print the SMILES that RDKit writes back for each molecule
for m in mols:
    print(Chem.MolToSmiles(m))
# Pattern fingerprint of each molecule
fps = [Chem.PatternFingerprint(m) for m in mols]
print('#####################')
# Tanimoto similarity of each fingerprint to the fingerprint of m1
for fp in fps:
    print(DataStructs.TanimotoSimilarity(fp, fps[0]))

The output is below.

c1ccccc1
c1ccccc1
c1ccccc1
C1=CC=CC=C1
#####################
1.0
1.0
1.0
0.8269230769230769

I calculated fingerprints from the same structure with different settings. Only the last one, the kekulized and non-sanitized molecule, gave a different fingerprint.
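A likely reason is that aromaticity perception only runs during sanitization, so the unsanitized Kekulé input keeps its alternating single and double bonds. The following is a minimal sketch to check this by listing the bond types; the helper name bond_type_set is made up for illustration.

```python
from rdkit import Chem

def bond_type_set(mol):
    """Return the set of bond types present in a molecule (hypothetical helper)."""
    return {bond.GetBondType() for bond in mol.GetBonds()}

# Same Kekulé SMILES, parsed with and without sanitization
sanitized = Chem.MolFromSmiles('C1=CC=CC=C1')
unsanitized = Chem.MolFromSmiles('C1=CC=CC=C1', sanitize=False)

sanitized_types = bond_type_set(sanitized)
unsanitized_types = bond_type_set(unsanitized)

print(sanitized_types)    # bond types after aromaticity perception
print(unsanitized_types)  # bond types exactly as written in the SMILES
```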

So I think it is important to build a fingerprint DB from SMILES in a suitable, consistent format.
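One way to keep the input consistent is to sanitize every molecule before fingerprinting, whatever form the SMILES was written in. This is only a sketch under that assumption; the function name sanitized_pattern_fp is hypothetical and not part of RDKit.

```python
from rdkit import Chem, DataStructs

def sanitized_pattern_fp(smiles):
    """Parse without sanitization, sanitize explicitly, then fingerprint.

    Hypothetical helper: it mimics reading molecules stored in an
    unsanitized form and normalizing them before fingerprinting.
    """
    mol = Chem.MolFromSmiles(smiles, sanitize=False)
    if mol is None:
        raise ValueError(f'could not parse SMILES: {smiles}')
    Chem.SanitizeMol(mol)  # runs aromaticity perception, among other steps
    return Chem.PatternFingerprint(mol)

fp_aromatic = sanitized_pattern_fp('c1ccccc1')
fp_kekule = sanitized_pattern_fp('C1=CC=CC=C1')
similarity = DataStructs.TanimotoSimilarity(fp_aromatic, fp_kekule)
print(similarity)
```

With this normalization step, both ways of writing benzene should give the same fingerprint, at the cost of the extra sanitization time.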

Published by iwatobipen

I'm a medicinal chemist at a mid-size pharmaceutical company. I love cheminformatics, coding, organic synthesis, and my family.

2 thoughts on “Difference between sanitized mol and non-sanitized mol #memo #rdkit”

  1. You could do a partial sanitization. According to Greg’s blog you would still get some speedup….
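A partial sanitization along these lines could look like the sketch below. The choice of flags is an assumption based on the pattern shown in the RDKit documentation, not necessarily the exact recipe from the blog mentioned in the comment, and the helper name partially_sanitized is hypothetical.

```python
from rdkit import Chem

# Subset of sanitization steps (an assumed selection): ring finding,
# kekulization, aromaticity, conjugation, hybridization and radicals,
# while skipping the cleanup and strict valence checks.
PARTIAL_OPS = (Chem.SanitizeFlags.SANITIZE_FINDRADICALS
               | Chem.SanitizeFlags.SANITIZE_KEKULIZE
               | Chem.SanitizeFlags.SANITIZE_SETAROMATICITY
               | Chem.SanitizeFlags.SANITIZE_SETCONJUGATION
               | Chem.SanitizeFlags.SANITIZE_SETHYBRIDIZATION
               | Chem.SanitizeFlags.SANITIZE_SYMMRINGS)

def partially_sanitized(smiles):
    """Parse a SMILES and run only the selected sanitization steps (hypothetical helper)."""
    mol = Chem.MolFromSmiles(smiles, sanitize=False)
    if mol is None:
        raise ValueError(f'could not parse SMILES: {smiles}')
    mol.UpdatePropertyCache(strict=False)  # compute valences without strict checking
    Chem.SanitizeMol(mol, sanitizeOps=PARTIAL_OPS, catchErrors=True)
    return mol

mol = partially_sanitized('C1=CC=CC=C1')
all_aromatic = all(bond.GetIsAromatic() for bond in mol.GetBonds())
smiles_out = Chem.MolToSmiles(mol)
print(all_aromatic, smiles_out)
fp = Chem.PatternFingerprint(mol)
```

Because aromaticity is perceived here, the Kekulé input ends up with aromatic bonds before the fingerprint is calculated. Whether the speedup is worth the skipped checks depends on how clean the input data is.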
