Today was very cold ;-( but my kids played at the park for a long time! @_@
I want to go snowboarding with my family…
BTW, I have recently become interested in recurrent neural networks. There are a lot of reports about deep learning that use SMILES strings as input.
To use SMILES strings as input, they first need to be converted to a matrix.
I tried to implement a SMILES encoder and decoder.
The encoder converts a SMILES string into a matrix of one-hot vectors.
The decoder converts the matrix back to a SMILES string.
import numpy as np
from rdkit import Chem

# Characters that may appear in a SMILES string; index 0 (' ') also acts as padding.
SMILES_CHARS = [' ', '#', '%', '(', ')', '+', '-', '.', '/',
                '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
                '=', '@', 'A', 'B', 'C', 'F', 'H', 'I', 'K', 'L', 'M',
                'N', 'O', 'P', 'R', 'S', 'T', 'V', 'X', 'Z',
                '[', '\\', ']', 'a', 'b', 'c', 'e', 'g', 'i', 'l',
                'n', 'o', 'p', 'r', 's', 't', 'u']
smi2index = dict( (c,i) for i,c in enumerate( SMILES_CHARS ) )
index2smi = dict( (i,c) for i,c in enumerate( SMILES_CHARS ) )

def smiles_encoder( smiles, maxlen=120 ):
    # Canonicalize first so that the same molecule always gets the same encoding.
    smiles = Chem.MolToSmiles(Chem.MolFromSmiles( smiles ))
    X = np.zeros( ( maxlen, len( SMILES_CHARS ) ) )
    for i, c in enumerate( smiles ):
        X[i, smi2index[c] ] = 1
    return X

def smiles_decoder( X ):
    smi = ''
    # Take the index of the hot position in each row; all-zero rows give index 0 (' ').
    X = X.argmax( axis=-1 )
    for i in X:
        smi += index2smi[ i ]
    return smi
Test the code.
mat=smiles_encoder('CC1CCN(CC1N(C)C2=NC=NC3=C2C=CN3)C(=O)CC#N')
mat.shape
Out: (120, 56)
print( mat )
[[ 0. 0. 0. ..., 0. 0. 0.]
 [ 0. 0. 0. ..., 0. 0. 0.]
 [ 0. 0. 0. ..., 0. 0. 0.]
 ...,
 [ 0. 0. 0. ..., 0. 0. 0.]
 [ 0. 0. 0. ..., 0. 0. 0.]
 [ 0. 0. 0. ..., 0. 0. 0.]]
Next, decode mat.
dec=smiles_decoder(mat)
print(dec)
>>>CC1CCN(C(=O)CC#N)CC1N(C)c1ncnc2[nH]ccc12
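One detail is easy to miss in the printed result: the rows beyond the end of the SMILES string are all zero, so `argmax` returns index 0 for them and they decode to the space character. The sketch below shows this with a shortened, hypothetical alphabet and helper names (`encode`, `decode`) that are not part of the code above. It also skips the RDKit canonicalization so that it runs with NumPy alone.

```python
import numpy as np

# Shortened, hypothetical alphabet for illustration; index 0 is the space
# character, which doubles as padding (as in SMILES_CHARS).
CHARS = [' ', '#', '(', ')', '1', '=', 'C', 'N', 'O', 'c', 'n']
char2index = {c: i for i, c in enumerate(CHARS)}
index2char = {i: c for i, c in enumerate(CHARS)}


def encode(smiles, maxlen=20):
    """One-hot encode a SMILES string (no RDKit canonicalization here)."""
    X = np.zeros((maxlen, len(CHARS)))
    for i, c in enumerate(smiles):
        X[i, char2index[c]] = 1
    return X


def decode(X):
    """Map each row back to a character; all-zero rows decode to ' '."""
    return ''.join(index2char[i] for i in X.argmax(axis=-1))


mat = encode('CC(=O)N')
raw = decode(mat)
print(len(raw))        # 20: the padding rows come back as spaces
print(raw.rstrip())    # CC(=O)N
```

Calling `rstrip()` on the decoded string is probably the simplest way to remove the padding before passing the result on, for example to RDKit.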
This is the first step of my study of deep learning.
ref
https://arxiv.org/abs/1610.02415
https://github.com/maxhodak/keras-molecules/blob/master/molecules/vectorizer.py
Hi, thank you for the posts. I believe they are very educational. May I ask why the decoded SMILES string is different from the input SMILES string?
input : ‘CC1CCN(CC1N(C)C2=NC=NC3=C2C=CN3)C(=O)CC#N’
decoded : ‘CC1CCN(C(=O)CC#N)CC1N(C)c1ncnc2[nH]ccc12’
Hi William,
Thank you for your query. First of all, these two SMILES strings represent the same molecule. One molecule can be written as many different valid SMILES strings, so an arbitrary input SMILES is not necessarily canonical. That is why I convert the input SMILES to canonical SMILES before encoding it.
‘CC1CCN(CC1N(C)C2=NC=NC3=C2C=CN3)C(=O)CC#N’ is not canonical SMILES.
As you can see in the following code, I converted the input SMILES to an RDKit Mol object and then generated the RDKit canonical SMILES from that Mol object.
"""
def smiles_encoder( smiles, maxlen=120 ):
    smiles = Chem.MolToSmiles(Chem.MolFromSmiles( smiles ))
"""
# The code above is equivalent to the following code.
rdkitmol = Chem.MolFromSmiles(smiles)
smiles = Chem.MolToSmiles(rdkitmol)
Does that make sense? ;-)
Thank you for replying. I think that makes perfect sense. Currently, I am trying to use an RNN for chemical property prediction from SMILES strings. May I ask how you fed the SMILES strings into your RNN?
I have tried using a 1 by 120 x len(SMILES_CHARS) array as the input, but I find that this may be too high-dimensional. I am available by private message if you prefer.
Hi, I encoded each SMILES string as a one-hot matrix with dimensions [120, number of chars]. The order of the input dimensions depends on the package that you would like to use, for example PyTorch, Keras, or TF.
An RNN or CNN can handle a huge number of parameters. I recommend using a GPU machine to perform your calculation.
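To make the shapes in this exchange concrete, here is a minimal NumPy sketch. The toy alphabet, `MAXLEN`, and the helper names `one_hot` and `to_indices` are hypothetical, chosen only to keep the example small. The integer-index variant at the end is an alternative that is not used in the post: it assumes the model starts with an embedding layer that expands each index into a vector.

```python
import numpy as np

# Toy alphabet for illustration (the post uses 56 characters and maxlen=120).
CHARS = [' ', '(', ')', '=', 'C', 'N', 'O']
char2index = {c: i for i, c in enumerate(CHARS)}
MAXLEN = 10


def one_hot(smiles):
    """Return a (MAXLEN, len(CHARS)) one-hot matrix."""
    X = np.zeros((MAXLEN, len(CHARS)), dtype=np.float32)
    for i, c in enumerate(smiles):
        X[i, char2index[c]] = 1
    return X


def to_indices(smiles):
    """Return a length-MAXLEN vector of character indices (0 = padding)."""
    idx = np.zeros(MAXLEN, dtype=np.int64)
    for i, c in enumerate(smiles):
        idx[i] = char2index[c]
    return idx


smiles_list = ['CCO', 'CC(=O)N', 'CN']

# Sequence input for an RNN: (batch, timesteps, features)
batch = np.stack([one_hot(s) for s in smiles_list])
print(batch.shape)                                  # (3, 10, 7)

# Flattened input, as described in the question: (batch, timesteps * features)
print(batch.reshape(len(smiles_list), -1).shape)    # (3, 70)

# Compact alternative: integer indices, to be expanded by an embedding layer
index_batch = np.stack([to_indices(s) for s in smiles_list])
print(index_batch.shape)                            # (3, 10)
```

With the sequence layout, the RNN sees one character vector per time step instead of a single vector of length 120 x len(SMILES_CHARS). Whether the batch axis or the time axis comes first depends on the framework and its layer settings.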
How can I convert an entire .csv file into the encoded format? I want to apply the kNN algorithm, so I need to split the data into training and testing sets and encode it in this format. How can I do this?
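One possible approach, as a sketch under stated assumptions rather than a tested pipeline: read the file row by row, encode each SMILES string, flatten each matrix into a single feature vector (kNN expects one vector per sample), and then split. Everything specific here is hypothetical: the column names `smiles` and `label`, the in-memory CSV text standing in for a real file, the toy alphabet, and the helper `encode_flat`. The RDKit canonicalization step from the post is left out so that the sketch runs with NumPy alone. Because each row is encoded independently, it makes no difference whether the split happens before or after encoding.

```python
import csv
import io

import numpy as np

# Toy alphabet; a real run would use the full SMILES_CHARS list from the post.
CHARS = [' ', '(', ')', '=', 'C', 'N', 'O']
char2index = {c: i for i, c in enumerate(CHARS)}
MAXLEN = 10


def encode_flat(smiles):
    """One-hot encode a SMILES string and flatten it to a 1-D feature vector."""
    X = np.zeros((MAXLEN, len(CHARS)))
    for i, c in enumerate(smiles):
        X[i, char2index[c]] = 1
    return X.ravel()


# Stand-in for an opened CSV file; the column names and labels are made up.
csv_text = "smiles,label\nCCO,0\nCC(=O)N,1\nCN,0\nCC(=O)O,1\nCCN,0\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

X = np.array([encode_flat(r['smiles']) for r in rows])
y = np.array([int(r['label']) for r in rows])

# Shuffle once with a fixed seed, then hold out the last 20% for testing.
rng = np.random.default_rng(0)
order = rng.permutation(len(X))
n_test = max(1, int(0.2 * len(X)))
train_idx, test_idx = order[:-n_test], order[-n_test:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
print(X_train.shape, X_test.shape)   # (4, 70) (1, 70)
```

The resulting `X_train`, `y_train`, `X_test`, and `y_test` arrays can then be passed to a kNN implementation, for example `KNeighborsClassifier` from scikit-learn.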