encode and decode SMILES strings.

Today was very cold. ;-( But, my kids play at a park long time! @_@
I want to go snow board with my family…
BTW, recently I am interested in recurrent neural network. There are a lot of report about deep learning using smiles strings as input.
To use smiles strings as input, it needs to convert smiles to matrix.
I tried to implement smiles encoder and decoder.
Encoder converts smiles strings as matrix of one hot vector.
Decoder converts matrix to smiles.

from rdkit import Chem

SMILES_CHARS = [' ',
                  '#', '%', '(', ')', '+', '-', '.', '/',
                  '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
                  '=', '@',
                  'A', 'B', 'C', 'F', 'H', 'I', 'K', 'L', 'M', 'N', 'O', 'P',
                  'R', 'S', 'T', 'V', 'X', 'Z',
                  '[', '\\', ']',
                  'a', 'b', 'c', 'e', 'g', 'i', 'l', 'n', 'o', 'p', 'r', 's',
                  't', 'u']
smi2index = dict( (c,i) for i,c in enumerate( SMILES_CHARS ) )
index2smi = dict( (i,c) for i,c in enumerate( SMILES_CHARS ) )
def smiles_encoder( smiles, maxlen=120 ):
    smiles = Chem.MolToSmiles(Chem.MolFromSmiles( smiles ))
    X = np.zeros( ( maxlen, len( SMILES_CHARS ) ) )
    for i, c in enumerate( smiles ):
        X[i, smi2index[c] ] = 1
    return X

def smiles_decoder( X ):
    smi = ''
    X = X.argmax( axis=-1 )
    for i in X:
        smi += index2smi[ i ]
    return smi

Test the code.

mat=smiles_encoder('CC1CCN(CC1N(C)C2=NC=NC3=C2C=CN3)C(=O)CC#N')
mat.shape
Out:
(120, 56)
print( mat )
[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]

Next, decode mat.

dec=smiles_decoder(mat)
print(dec)
>>>CC1CCN(C(=O)CC#N)CC1N(C)c1ncnc2[nH]ccc12          

This is first step of my study of deep learning.
ref
https://arxiv.org/abs/1610.02415
https://github.com/maxhodak/keras-molecules/blob/master/molecules/vectorizer.py

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s