I previously posted about fast compound search with RDKit, and in that post I used the pattern fingerprint.
Today I checked how that fingerprint behaves. PatternFingerprint can be calculated for molecules that are not sanitized. However, the result can differ from the fingerprint calculated from the sanitized mol.
Here is a simple example.
from rdkit import Chem
from rdkit.Chem import DataStructs
from rdkit.Chem import AllChem

m1 = Chem.MolFromSmiles('c1ccccc1')
m2 = Chem.MolFromSmiles('c1ccccc1', sanitize=False)
m3 = Chem.MolFromSmiles('C1=CC=CC=C1')
m4 = Chem.MolFromSmiles('C1=CC=CC=C1', sanitize=False)
mols = [m1, m2, m3, m4]
for m in mols:
    print(Chem.MolToSmiles(m))
fps = [Chem.PatternFingerprint(m) for m in mols]
print('#####################')
for fp in fps:
    print(DataStructs.TanimotoSimilarity(fp, fps[0]))
The output is below.
c1ccccc1
c1ccccc1
c1ccccc1
C1=CC=CC=C1
#####################
1.0
1.0
1.0
0.8269230769230769
I calculated fingerprints for the same structure with different settings. Only the last one, the kekulized and non-sanitized molecule, gave a different fingerprint.
So I think it is important to build the fingerprint DB from SMILES in a suitable form, for example aromatic rather than Kekulé SMILES when the molecules are read without sanitization.
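One way to get a "suitable form" is to sanitize each structure once, store RDKit's canonical aromatic SMILES, and read that back with sanitize=False later. The following is only a minimal sketch of that idea; the helper name normalize_smiles is hypothetical and not part of RDKit.

```python
from rdkit import Chem, DataStructs


def normalize_smiles(smiles):
    """Hypothetical helper: sanitize once, return canonical aromatic SMILES."""
    mol = Chem.MolFromSmiles(smiles)  # full sanitization happens here
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return Chem.MolToSmiles(mol)


raw = 'C1=CC=CC=C1'             # Kekule form, as it might arrive from a vendor file
stored = normalize_smiles(raw)  # the form to keep in the fingerprint DB

# Reference: fingerprint from the fully sanitized molecule.
ref_fp = Chem.PatternFingerprint(Chem.MolFromSmiles(raw))
# Fast path: parse the stored SMILES without sanitization.
fast_fp = Chem.PatternFingerprint(Chem.MolFromSmiles(stored, sanitize=False))

sim = DataStructs.TanimotoSimilarity(ref_fp, fast_fp)
print(stored, sim)
```

For benzene this corresponds to the m1 and m2 case above, where the two fingerprints were identical. I have not checked that it holds for every structure.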
You could do a partial sanitization. According to Greg’s blog, you would still get some speedup….
Thanks! I’ll check it ;)
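A partial sanitization along the lines suggested in the comment might look like the sketch below. The choice of operations (ring finding plus aromaticity perception only) is my assumption and is not taken from Greg's blog. The helper name partially_sanitized is hypothetical.

```python
from rdkit import Chem, DataStructs

# Assumption: ring perception and aromaticity are the only steps needed here.
PARTIAL_OPS = Chem.SANITIZE_SYMMRINGS | Chem.SANITIZE_SETAROMATICITY


def partially_sanitized(smiles):
    """Hypothetical helper: parse without sanitization, then run a few steps."""
    mol = Chem.MolFromSmiles(smiles, sanitize=False)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    # Compute valence information without strict valence checking.
    mol.UpdatePropertyCache(strict=False)
    Chem.SanitizeMol(mol, sanitizeOps=PARTIAL_OPS)
    return mol


full = Chem.MolFromSmiles('C1=CC=CC=C1')      # fully sanitized reference
partial = partially_sanitized('C1=CC=CC=C1')  # kekulized input, partial sanitization

partial_smiles = Chem.MolToSmiles(partial)
sim = DataStructs.TanimotoSimilarity(
    Chem.PatternFingerprint(full), Chem.PatternFingerprint(partial)
)
print(partial_smiles, sim)
```

Whether this is actually faster than full sanitization would need to be measured on a real dataset.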