In computer vision, data augmentation is often used to obtain a larger data set. In chemoinformatics, on the other hand, canonical SMILES representations are normally used, so each molecule has a single string representation.
At last year's RDKit UGM, Dr. Esben proposed a new approach for RNNs with SMILES. He expanded 602 training molecules to almost 8000 by writing the same molecules as different SMILES strings.
This approach seems to work well.
At this year's UGM hackathon, this random SMILES generation function was implemented, and it can be called from the new version of RDKit!
I appreciate the RDKit developers!
It is very easy to use; please see the code below.
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit import rdBase
print(rdBase.rdkitVersion)
>2018.09.1
I used a kinase inhibitor as an example.
testsmi = 'CC(C1=C(C=CC(=C1Cl)F)Cl)OC2=C(N=CC(=C2)C3=CN(N=C3)C4CCNCC4)N'
mol = Chem.MolFromSmiles(testsmi)
mol
By default, MolToSmiles returns the canonical SMILES. But if you call MolToSmiles with the doRandom=True option, the function returns a random but valid SMILES.
mols = []
for _ in range(50):
    smi = Chem.MolToSmiles(mol, doRandom=True)
    print(smi)
    m = Chem.MolFromSmiles(smi)
    mols.append(m)
>Fc1c(Cl)c(C(Oc2cc(-c3cn(nc3)C3CCNCC3)cnc2N)C)c(cc1)Cl
>O(c1cc(-c2cnn(c2)C2CCNCC2)cnc1N)C(c1c(Cl)c(ccc1Cl)F)C
>--snip--
>c1(N)ncc(-c2cnn(C3CCNCC3)c2)cc1OC(c1c(Cl)ccc(F)c1Cl)C
#check molecules
Draw.MolsToGridImage(mols, molsPerRow = 10)
There are many deep learning approaches that use SMILES as input. I think augmenting the input data in this way is useful for these models.
I uploaded my example code to Google Colab and my GitHub repo.
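As a rough sketch of how this could be used for augmentation, the snippet below collects the unique random SMILES for one molecule and checks that each of them still describes the same structure. The helper name augment_smiles and its n_tries parameter are my own hypothetical choices, not part of RDKit; only MolFromSmiles and MolToSmiles with doRandom=True come from the library.

```python
from rdkit import Chem


def augment_smiles(smiles, n_tries=50):
    """Hypothetical helper: return the unique random SMILES found in n_tries attempts."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles}")
    # doRandom=True can return the same string more than once,
    # so a set is used to keep only the distinct variants.
    variants = {Chem.MolToSmiles(mol, doRandom=True) for _ in range(n_tries)}
    return sorted(variants)


testsmi = 'CC(C1=C(C=CC(=C1Cl)F)Cl)OC2=C(N=CC(=C2)C3=CN(N=C3)C4CCNCC4)N'
canonical = Chem.MolToSmiles(Chem.MolFromSmiles(testsmi))
variants = augment_smiles(testsmi)

# Sanity check: every variant should canonicalize back to the same SMILES.
roundtrip = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in variants}
print(len(variants), roundtrip == {canonical})
```

Because the strings are drawn at random, the number of distinct variants differs from run to run, so for a real training set it is probably better to ask for more attempts than the number of variants you need.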