Standardization of tautomers #RDKit

One of the hot topics in the new version of RDKit is the integration of MolVS, which is a tool for molecular standardization.
Molecular standardization is important not only for chemists but also for chemoinformaticians, because tautomers give different representations of the same molecule and this can affect the accuracy of QSAR models.
I wrote about a molecular standardization tool named ‘MolVS’ before, and MolVS was a separate library at the time. Now we can call MolVS from native RDKit.
I used 2-hydroxypyridine as an example.

from rdkit import Chem
from rdkit import rdBase
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole  # enables inline drawing in Jupyter
from rdkit.Chem import MolStandardize

# 2-hydroxypyridine (hydroxy form)
smi1 = 'c1cccc(O)n1'
mol1 = Chem.MolFromSmiles(smi1)
# 2-pyridone (keto form of the same compound)
smi2 = 'C1=CC(=O)NC=C1'
mol2 = Chem.MolFromSmiles(smi2)

Draw.MolsToGridImage([mol1, mol2])

The two molecules have the same molecular formula but different structures.
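A quick way to check this without looking at the picture is to compare molecular formulas and canonical SMILES. The snippet below is only a small sketch; the dictionary names are my own for illustration, and it assumes `rdMolDescriptors.CalcMolFormula` is available in your RDKit build.

```python
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

# The two tautomers used in this post
smiles = {
    '2-hydroxypyridine': 'c1cccc(O)n1',
    '2-pyridone': 'C1=CC(=O)NC=C1',
}
mols = {name: Chem.MolFromSmiles(smi) for name, smi in smiles.items()}

# Same atoms, so the formulas should match
formulas = {name: rdMolDescriptors.CalcMolFormula(m) for name, m in mols.items()}
# Plain canonical SMILES does not merge tautomers, so these should differ
canonical = {name: Chem.MolToSmiles(m) for name, m in mols.items()}

for name in smiles:
    print(name, formulas[name], canonical[name])
```

Without tautomer standardization, a duplicate check based on canonical SMILES would treat these as two different compounds.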

The standardization method is very simple.

stsmi1 = MolStandardize.canonicalize_tautomer_smiles(smi1)
stsmi2 = MolStandardize.canonicalize_tautomer_smiles(smi2)

Draw.MolsToGridImage([Chem.MolFromSmiles(stsmi1), Chem.MolFromSmiles(stsmi2)])
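The point of canonicalization is that both inputs should end up as the same structure. Here is a minimal sketch of that check, assuming a more recent RDKit where the C++ based `rdMolStandardize.TautomerEnumerator` is available (this is a different entry point from the function used above). The helper name `canonical_tautomer_smiles` is hypothetical.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

smi_hydroxy = 'c1cccc(O)n1'    # 2-hydroxypyridine
smi_keto = 'C1=CC(=O)NC=C1'    # 2-pyridone

enumerator = rdMolStandardize.TautomerEnumerator()

def canonical_tautomer_smiles(smiles):
    """Return the SMILES of the canonical tautomer (hypothetical helper)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f'could not parse SMILES: {smiles}')
    # Canonicalize() enumerates tautomers and returns the highest-scoring one
    return Chem.MolToSmiles(enumerator.Canonicalize(mol))

can_hydroxy = canonical_tautomer_smiles(smi_hydroxy)
can_keto = canonical_tautomer_smiles(smi_keto)
print(can_hydroxy, can_keto, can_hydroxy == can_keto)
```

Note that the canonical tautomer is just a consistent choice made by a scoring scheme. It is not necessarily the dominant form under real conditions.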

It is also easy to get the possible tautomers from a SMILES string. The MolStandardize module has many other methods too. I think it is very useful for data preprocessing.

tautomers = MolStandardize.enumerate_tautomers_smiles(smi1)
print(tautomers)
# {'O=c1cccc[nH]1', 'Oc1ccccn1'}
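The same enumeration can be sketched at the molecule level instead of the SMILES level. Again this assumes a recent RDKit with `rdMolStandardize.TautomerEnumerator`; I sort the SMILES only to make the printed order stable.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

mol = Chem.MolFromSmiles('c1cccc(O)n1')  # 2-hydroxypyridine

enumerator = rdMolStandardize.TautomerEnumerator()
# Enumerate() yields one molecule per tautomer found by the transform rules
tautomer_smiles = sorted({Chem.MolToSmiles(t) for t in enumerator.Enumerate(mol)})

for smi in tautomer_smiles:
    print(smi)
```

Working with molecule objects is handy when you want to compute descriptors for each tautomer directly, without parsing the SMILES again.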

I uploaded the snippet to my repo. It can be read at the following URL.

I will go to Kumamoto tomorrow to participate in a chemoinformatics conference. I hope I can have many fruitful discussions.