encode and decode SMILES strings.

Today was very cold. ;-( But my kids played at a park for a long time! @_@
I want to go snowboarding with my family…
BTW, recently I have become interested in recurrent neural networks. There are a lot of reports about deep learning that use SMILES strings as input.
To use SMILES strings as input, they need to be converted to a matrix.
I tried to implement a SMILES encoder and decoder.
The encoder converts a SMILES string into a matrix of one-hot vectors.
The decoder converts the matrix back to a SMILES string.

import numpy as np
from rdkit import Chem

SMILES_CHARS = [' ',
                  '#', '%', '(', ')', '+', '-', '.', '/',
                  '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
                  '=', '@',
                  'A', 'B', 'C', 'F', 'H', 'I', 'K', 'L', 'M', 'N', 'O', 'P',
                  'R', 'S', 'T', 'V', 'X', 'Z',
                  '[', '\\', ']',
                  'a', 'b', 'c', 'e', 'g', 'i', 'l', 'n', 'o', 'p', 'r', 's',
                  't', 'u']
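# Lookup tables between characters and their column index in the one-hot matrix.
smi2index = {c: i for i, c in enumerate(SMILES_CHARS)}
index2smi = {i: c for i, c in enumerate(SMILES_CHARS)}

def smiles_encoder(smiles, maxlen=120):
    # Canonicalize with RDKit so that one molecule always gives the same string.
    smiles = Chem.MolToSmiles(Chem.MolFromSmiles(smiles))
    # One row per character position, one column per character in SMILES_CHARS.
    # Rows past the end of the string stay all-zero (padding).
    X = np.zeros((maxlen, len(SMILES_CHARS)))
    for i, c in enumerate(smiles):
        X[i, smi2index[c]] = 1
    return X

def smiles_decoder(X):
    smi = ''
    # argmax gives the index of the hot column in each row.
    # All-zero padding rows give index 0, which maps to ' '.
    X = X.argmax(axis=-1)
    for i in X:
        smi += index2smi[i]
    return smi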
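
Because the encoder canonicalizes its input with RDKit first, two different spellings of the same molecule should end up as the same string, and so as the same matrix. Here is a minimal sketch to check that, assuming RDKit is installed. The helper name `canonicalize` is my own for illustration and is not part of RDKit.

```python
from rdkit import Chem

def canonicalize(smiles):
    """Return RDKit's canonical SMILES, or None if the input cannot be parsed.

    Hypothetical helper for illustration; only MolFromSmiles and
    MolToSmiles are RDKit functions.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # MolFromSmiles returns None for SMILES it cannot parse
        return None
    return Chem.MolToSmiles(mol)

# Two spellings of the same molecule: Kekule form and aromatic form.
kekule_form = 'CC1CCN(CC1N(C)C2=NC=NC3=C2C=CN3)C(=O)CC#N'
aromatic_form = 'CC1CCN(C(=O)CC#N)CC1N(C)c1ncnc2[nH]ccc12'

same_molecule = canonicalize(kekule_form) == canonicalize(aromatic_form)
print(same_molecule)

# An unclosed ring and parenthesis cannot be parsed.
print(canonicalize('C1CC('))
```

The `None` check is worth keeping in real use, because passing `None` on to `Chem.MolToSmiles` raises an error.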

Test the code.

mat=smiles_encoder('CC1CCN(CC1N(C)C2=NC=NC3=C2C=CN3)C(=O)CC#N')
mat.shape
Out:
(120, 56)
print( mat )
[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]

Next, decode mat.

dec=smiles_decoder(mat)
print(dec)
>>>CC1CCN(C(=O)CC#N)CC1N(C)c1ncnc2[nH]ccc12          
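
The trailing spaces in the decoded string come from the padding rows: an all-zero row has its argmax at index 0, and index 0 is ' '. Below is a small sketch of a round trip that strips the padding. It assumes the same 56 characters as SMILES_CHARS (written compactly here), skips the RDKit canonicalization step so that it only needs NumPy, and uses my own function names `encode` and `decode`.

```python
import numpy as np

# Same 56 characters as SMILES_CHARS in the post, written as one string.
SMILES_CHARS = list(" #%()+-./0123456789=@ABCFHIKLMNOPRSTVXZ[\\]abcegilnoprstu")
CHAR_TO_INDEX = {c: i for i, c in enumerate(SMILES_CHARS)}

def encode(smiles, maxlen=120):
    """One-hot encode a SMILES string (no canonicalization in this sketch)."""
    if len(smiles) > maxlen:
        raise ValueError(f"SMILES has {len(smiles)} characters; maxlen is {maxlen}")
    X = np.zeros((maxlen, len(SMILES_CHARS)))
    for i, c in enumerate(smiles):
        X[i, CHAR_TO_INDEX[c]] = 1
    return X

def decode(X):
    """Decode a one-hot matrix and drop the trailing padding spaces."""
    indices = X.argmax(axis=-1)
    return ''.join(SMILES_CHARS[i] for i in indices).rstrip()

canonical = 'CC1CCN(C(=O)CC#N)CC1N(C)c1ncnc2[nH]ccc12'
mat = encode(canonical)

# Without rstrip() the decoded string is padded out to maxlen characters.
padded = ''.join(SMILES_CHARS[i] for i in mat.argmax(axis=-1))
print(len(padded))                # 120
print(decode(mat) == canonical)   # True
```

Stripping is safe here because a space never appears inside a SMILES string; it is only used as the padding character.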

This is the first step in my study of deep learning.
References
https://arxiv.org/abs/1610.02415
https://github.com/maxhodak/keras-molecules/blob/master/molecules/vectorizer.py

Published by iwatobipen

I'm a medicinal chemist at a mid-size pharmaceutical company. I love chemoinformatics, coding, organic synthesis, and my family.

7 thoughts on “encode and decode SMILES strings.”

  1. Hi, thank you for the posts; I believe they are very educational. May I ask why the decoded SMILES string is different from the input SMILES string?

    input : ‘CC1CCN(CC1N(C)C2=NC=NC3=C2C=CN3)C(=O)CC#N’
    decoded : ‘CC1CCN(C(=O)CC#N)CC1N(C)c1ncnc2[nH]ccc12’

  2. Hi William,
    Thank you for your query. First, these SMILES strings represent the same molecule. This means that a SMILES representation of a molecule is not unique. So I convert the input SMILES to canonical SMILES.

    ‘CC1CCN(CC1N(C)C2=NC=NC3=C2C=CN3)C(=O)CC#N’ is not canonical SMILES.

    As you can see in the following code, I converted the input SMILES to an RDKit Mol object and then generated RDKit canonical SMILES from that Mol object.

    """
    def smiles_encoder( smiles, maxlen=120 ):
        smiles = Chem.MolToSmiles(Chem.MolFromSmiles( smiles ))
    """
    # The code above is equivalent to the following code.
    rdkitmol = Chem.MolFromSmiles(smiles)
    smiles = Chem.MolToSmiles(rdkitmol)

    Does that make sense? ;-)

    1. Thank you for replying. I think that makes perfect sense. Currently, I am trying to use an RNN for chemical property prediction from SMILES molecules. May I please ask how you input the SMILES molecules into your RNN?

      I have tried using a 1 by 120 x len(SMILES_CHARS) array as the input, but I find that this may be too high-dimensional. I am available by private message if you prefer.

      1. Hi, I encoded the SMILES strings as one-hot matrices with dimensions [120, number of chars]. The order of the input dimensions depends on the package you would like to use, for example PyTorch, Keras, or TF.
        An RNN or CNN will handle a huge number of parameters. I recommend using a GPU machine for your calculations.

  3. How can I convert an entire .csv file into the encoded format? I want to apply a kNN algorithm, so I need to split the data into training and testing sets, and before that I need to encode it in this format. How can I do this?
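
For the questions above about input shape and CSV files, here is one possible sketch. It encodes each row to a [120, 56] matrix, stacks the matrices into a 3D batch (the usual kind of input for an RNN), flattens each matrix to one long vector for a method like kNN, and then splits into training and test sets. The column names `smiles` and `label`, the four example rows, and the function name `encode` are assumptions of mine for illustration. The sketch also skips RDKit canonicalization so that it only needs NumPy and the standard library.

```python
import csv
import io
import numpy as np

# Same 56 characters as SMILES_CHARS in the post, written as one string.
SMILES_CHARS = list(" #%()+-./0123456789=@ABCFHIKLMNOPRSTVXZ[\\]abcegilnoprstu")
CHAR_TO_INDEX = {c: i for i, c in enumerate(SMILES_CHARS)}

def encode(smiles, maxlen=120):
    """One-hot encode a SMILES string (no canonicalization in this sketch)."""
    X = np.zeros((maxlen, len(SMILES_CHARS)))
    for i, c in enumerate(smiles):
        X[i, CHAR_TO_INDEX[c]] = 1
    return X

# Stand-in for a real file with made-up rows. For a file on disk, pass
# open(path, newline="") to csv.DictReader instead of io.StringIO.
csv_text = "smiles,label\nCCO,0\nc1ccccc1,1\nCC(=O)O,0\nCCN,1\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# 3D batch: (number of molecules, maxlen, number of characters).
X = np.stack([encode(row["smiles"]) for row in rows])
y = np.array([int(row["label"]) for row in rows])

# kNN-style methods expect one flat feature vector per molecule.
X_flat = X.reshape(len(X), -1)

# Shuffle the row indices with a fixed seed, then hold out 25% as the test set.
rng = np.random.default_rng(0)
order = rng.permutation(len(X_flat))
n_test = len(X_flat) // 4
test_idx, train_idx = order[:n_test], order[n_test:]
X_train, X_test = X_flat[train_idx], X_flat[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(X.shape, X_flat.shape, X_train.shape, X_test.shape)
```

Since each row is encoded on its own with a fixed character table, encoding before or after the split gives the same matrices.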
