Can a machine learn important features from SMILES?

Today I found a challenging article on arXiv.
It describes SMILES2Vec. https://arxiv.org/pdf/1712.02034.pdf
You know, word2vec is a very attractive and major application in the ML area, and SMILES2Vec uses the same concept.
It converts SMILES to vectors and learns which characters are important. The authors use a “black box” interpretation method. I am not sure about the “black box” method, but I think it is like leave-one-out: the method masks some features, builds a model, and finds the important features.
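As a rough illustration of the masking idea ( my own sketch of the concept, not the paper's actual code ), one could mask one input feature at a time and watch how the model's error changes:

import numpy as np

def masking_importance( model, X, y, error_fn ):
    # rough sketch: zero out one feature column at a time and
    # measure how much the prediction error grows
    base_error = error_fn( y, model.predict( X ) )
    importances = []
    for j in range( X.shape[1] ):
        X_masked = X.copy()
        X_masked[:, j] = 0  # mask feature j
        err = error_fn( y, model.predict( X_masked ) )
        importances.append( err - base_error )
    return np.array( importances )

A feature whose masking raises the error a lot is likely important for the prediction.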

Using this method, SMILES2Vec can find important characters in a given SMILES string.
They found that a CNN-GRU model gives the best result for solubility prediction. My question is… why does convolution over SMILES work so well?
My opinion is that solubility or logP depends on the presence or absence of substituents such as hydroxyl or amino groups, and in some cases does not strongly depend on their position. So I think the method is interesting, but it would have difficulty predicting biological affinity.

SMILES strings are a major input format in the deep learning area, and I often use SMILES myself. 😉 But I want to find another format for DL.

Ref for the black box model:
https://arxiv.org/pdf/1602.07043.pdf

molecule encoder/decoder in deepchem #rdkit #deepchem

Today I updated deepchem on my Mac.
It was easy to install the new version of deepchem on a Mac.

iwatobipen$ git clone https://github.com/deepchem/deepchem.git
iwatobipen$ cd deepchem
iwatobipen$ bash scripts/install_deepchem_conda.sh

That’s all. 😉

The new version of deepchem implements MoleculeVAE. MoleculeVAE generates new molecules by using a pre-defined model.
Deepchem can use a pre-defined model that was trained on the ZINC dataset.
OK, let's run the code.
I tested MoleculeVAE with reference to the following web page.
https://www.deepchem.io/_modules/deepchem/models/autoencoder_models/test_tensorflowEncoders.html

Deepchem provides lots of useful functions for data preparation.
For example, converting SMILES to one-hot vectors and vice versa.
I used cdk2.sdf from the RDKit distribution for structure generation.

from __future__ import print_function
import os
from rdkit import Chem, RDConfig
from rdkit.Chem import Draw
import deepchem as dc
from deepchem.models.autoencoder_models.autoencoder import TensorflowMoleculeEncoder, TensorflowMoleculeDecoder
from deepchem.feat.one_hot import zinc_charset
from deepchem.data import DiskDataset

datadir = os.path.join( RDConfig.RDDocsDir, 'Book/data/cdk2.sdf' )

mols = [ mol for mol in Chem.SDMolSupplier( datadir ) ]
smiles = [ Chem.MolToSmiles( mol ) for mol in mols ]
print( len( smiles ))

tf_encoder = TensorflowMoleculeEncoder.zinc_encoder()
tf_decoder = TensorflowMoleculeDecoder.zinc_decoder()

featurizer = dc.feat.one_hot.OneHotFeaturizer( zinc_charset, 120 )

# default setting ; encode smiles to one_hot vector and padding to 120 character.
features = featurizer( mols )
print( features.shape )

dataset = DiskDataset.from_numpy( features, features )
prediction = tf_encoder.predict_on_batch( dataset.X )
one_hot_dec = tf_decoder.predict_on_batch( prediction )
decoded_smiles = featurizer.untransform( one_hot_dec )

# many decoded SMILES are invalid, so keep only the molecules that parse
for smi in decoded_smiles:
    print( smi[0] )
    print( Chem.MolFromSmiles( smi[0] ))
mols = [ Chem.MolFromSmiles( smi[0] ) for smi in decoded_smiles ]
mols = [ mol for mol in mols if mol is not None ]
im = Draw.MolsToGridImage( mols )
im.save( 'res.png' )

And the results were…. ;-(

iwatobipen$ python molVAE.py
/Users/iwatobipen/.pyenv/versions/anaconda3-2.4.0/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
Using TensorFlow backend.
47
2017-10-10 22:25:23.051035: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-10 22:25:23.051056: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-10 22:25:23.051060: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-10 22:25:23.051064: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
(47, 120, 35)
TIMING: dataset construction took 0.059 s
Loading dataset from disk.
CCCCNC(=O)CCn1c(=O)ncc[n1)ccn2
[22:25:29] SMILES Parse Error: syntax error for input: 'CCCCNC(=O)CCn1c(=O)ncc[n1)ccn2'
None
CC(C))CCN1CCCC1)c2nc[nH+]nn2
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'CC(C))CCN1CCCC1)c2nc[nH+]nn2'
None
CC(C)(C)C(CCC(=O)NCCc1cc[nH+]cn1
[22:25:29] SMILES Parse Error: extra open parentheses for input: 'CC(C)(C)C(CCC(=O)NCCc1cc[nH+]cn1'
None
CC(C)(C)N1CCCCC1)c2nc[nH+]nn2
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'CC(C)(C)N1CCCCC1)c2nc[nH+]nn2'
None
CC(C)CCNCCCC)CN=C)c1cc[nH+]cn1
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'CC(C)CCNCCCC)CN=C)c1cc[nH+]cn1'
None
Cc1ccnnc1SCc2ccccc2Crc2NCCC))F)C
[22:25:29] SMILES Parse Error: syntax error for input: 'Cc1ccnnc1SCc2ccccc2Crc2NCCC))F)C'
None
CC((C)))C(=O)NCc1ccccc1)c(/c(=C)C(C(=])C
[22:25:29] SMILES Parse Error: syntax error for input: 'CC((C)))C(=O)NCc1ccccc1)c(/c(=C)C(C(=])C'
None
Cc1ccnn1CCC(=O)N((C)(C)Ccc2[n(ccn2)CCC)(C)C
[22:25:29] SMILES Parse Error: syntax error for input: 'Cc1ccnn1CCC(=O)N((C)(C)Ccc2[n(ccn2)CCC)(C)C'
None
CC(CC)(C)CC(C)(C)CCNCCC(CCC)N
<rdkit.Chem.rdchem.Mol object at 0x10079c210>
COc1ccc2c(c1)CC)Cc2cc((cn=O)C(=O)N3
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'COc1ccc2c(c1)CC)Cc2cc((cn=O)C(=O)N3'
None
Cc1ncsc(=O)n1C)CC(=O)Nc2ccss[N/C(=O)[C)[O-])C
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'Cc1ncsc(=O)n1C)CC(=O)Nc2ccss[N/C(=O)[C)[O-])C'
None
Cc1c(c(=O)n2c(n1)C(=O)CC(C)C)C)C
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'Cc1c(c(=O)n2c(n1)C(=O)CC(C)C)C)C'
None
CN(C))(O)CNN(Cc1cccn1+c2cc[nH]c2c2N
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'CN(C))(O)CNN(Cc1cccn1+c2cc[nH]c2c2N'
None
CC11ccc2n1c(ccnH=)Nc3cc((c(c3=O)C)NCC2OO
[22:25:29] SMILES Parse Error: syntax error for input: 'CC11ccc2n1c(ccnH=)Nc3cc((c(c3=O)C)NCC2OO'
None
CCNH+]1CCc2cnc2c1c(n(c3=))c4cc(cc(cc3C)C))CC1=O
[22:25:29] SMILES Parse Error: syntax error for input: 'CCNH+]1CCc2cnc2c1c(n(c3=))c4cc(cc(cc3C)C))CC1=O'
None
CC(=O)Nc1cc(ccc1OC)CCc2c3c([nH+nn2)cccco3
[22:25:29] SMILES Parse Error: syntax error for input: 'CC(=O)Nc1cc(ccc1OC)CCc2c3c([nH+nn2)cccco3'
None
COc1ccc2c(n1)C)C)ccnc3ccccc3)NCC=)))C(=O)CC
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'COc1ccc2c(n1)C)C)ccnc3ccccc3)NCC=)))C(=O)CC'
None
CCS(=O)(=O)c1ccc(c(c1)C)2CC=c3cc(ccc3)NC(C)C)C(=O)N2
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'CCS(=O)(=O)c1ccc(c(c1)C)2CC=c3cc(ccc3)NC(C)C)C(=O)N2'
None
CC(=O)Nc1cccccc1OC)CCc2c3c([nH+]n23cccccc4
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'CC(=O)Nc1cccccc1OC)CCc2c3c([nH+]n23cccccc4'
None
Cc1cccc2c11[nH]nc(c2))CCC(=O)N(C3)CCc3ccc(H+]c)C
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'Cc1cccc2c11[nH]nc(c2))CCC(=O)N(C3)CCc3ccc(H+]c)C'
None
Cc1cc(cc211[nH]cc(c2)CCCC(=O)CC(=O)NC3CCCn4ccc(=O)n4C
[22:25:29] SMILES Parse Error: extra open parentheses for input: 'Cc1cc(cc211[nH]cc(c2)CCCC(=O)CC(=O)NC3CCCn4ccc(=O)n4C'
None
Ccc1ccc2ccnn))n3c2ccc1)c4ccccc4F)
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'Ccc1ccc2ccnn))n3c2ccc1)c4ccccc4F)'
None
Cc1nc(c2c(nn))n3c2ccc3)S4ccs2)C(C)(C[NH+]CCC))C)))n
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'Cc1nc(c2c(nn))n3c2ccc3)S4ccs2)C(C)(C[NH+]CCC))C)))n'
None
CC(CCc1ccncc1)Cc2cccc3c2NN3CCNCC3=O
[22:25:29] SMILES Parse Error: unclosed ring for input: 'CC(CCc1ccncc1)Cc2cccc3c2NN3CCNCC3=O'
None
CC(=O)(=OOc1ccn1C(=C)C(=O)N)2cc(=O)(c2=O)())
[22:25:29] SMILES Parse Error: syntax error for input: 'CC(=O)(=OOc1ccn1C(=C)C(=O)N)2cc(=O)(c2=O)())'
None
CNS(=O)(=O)c(ccn1C((C)C(=O)Nc2ccc(cc2Cl)F
[22:25:29] SMILES Parse Error: syntax error for input: 'CNS(=O)(=O)c(ccn1C((C)C(=O)Nc2ccc(cc2Cl)F'
None
CCCC)CCC(=O)(=O)c1cO)1C(CC)C(=O)Nc2ccccc2F)/s1
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'CCCC)CCC(=O)(=O)c1cO)1C(CC)C(=O)Nc2ccccc2F)/s1'
None
CC((Cc1c(=O)nnc(=O)C(=O)CC(C)))c2ccccc2C2=O
[22:25:29] SMILES Parse Error: syntax error for input: 'CC((Cc1c(=O)nnc(=O)C(=O)CC(C)))c2ccccc2C2=O'
None
c1cs=c1C(=O)N(C2(CCCC2))c3c[nH]c(=O)])
[22:25:29] SMILES Parse Error: syntax error for input: 'c1cs=c1C(=O)N(C2(CCCC2))c3c[nH]c(=O)])'
None
CS(=O)(=O)c1cc(1CCc=O)NCC2CCC(CC2)c3c[nH+]c(=+)n3
[22:25:29] SMILES Parse Error: syntax error for input: 'CS(=O)(=O)c1cc(1CCc=O)NCC2CCC(CC2)c3c[nH+]c(=+)n3'
None
CN(Cc1ccccc1)c2nc(nnn+]c2c(c2SCc4ccccc4
[22:25:29] SMILES Parse Error: syntax error for input: 'CN(Cc1ccccc1)c2nc(nnn+]c2c(c2SCc4ccccc4'
None
CS(=O)(=O)c1cc==O)2c(n3c2cccc1SCC(=O)NO)))c(cc)/(([O)on
[22:25:29] SMILES Parse Error: syntax error for input: 'CS(=O)(=O)c1cc==O)2c(n3c2cccc1SCC(=O)NO)))c(cc)/(([O)on'
None
CN(C)C(=O)CC(C(=O)[O-])C(=O)CSc1ccc[nH+]c1=N
[22:25:29] Can't kekulize mol.  Unkekulized atoms: 14 15 16 17 18

None
Cc1c=O(c=n1Cc2cccn2)c3c(n2)CC((((=C)C)])CC-/)
[22:25:29] SMILES Parse Error: syntax error for input: 'Cc1c=O(c=n1Cc2cccn2)c3c(n2)CC((((=C)C)])CC-/)'
None
CC(C)(CC))C(=O)NCc1cccnc1Cl)CCc(=O)(=))CC///n2
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'CC(C)(CC))C(=O)NCc1cccnc1Cl)CCc(=O)(=))CC///n2'
None
CC(C)(CCN)C(=O)CSc1cc(=O)c(=CC))]))))))))S2/c2)=O)C))/s)
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'CC(C)(CCN)C(=O)CSc1cc(=O)c(=CC))]))))))))S2/c2)=O)C))/s)'
None
CCN1CC11C(C2)C(=O)N2C2N(C2S((=O)NC3CC2)C(=O)OCC
[22:25:29] SMILES Parse Error: syntax error for input: 'CCN1CC11C(C2)C(=O)N2C2N(C2S((=O)NC3CC2)C(=O)OCC'
None
CCN=N)c1cc=1Cc==)(=O)C)(C))C)CC2CCCCCC2)))())
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'CCN=N)c1cc=1Cc==)(=O)C)(C))C)CC2CCCCCC2)))())'
None
CC(CCc2ccccc2)C[=O)c3c(=O)n(c)+n3)c3nc(nc))c1
[22:25:29] SMILES Parse Error: syntax error for input: 'CC(CCc2ccccc2)C[=O)c3c(=O)n(c)+n3)c3nc(nc))c1'
None
CCS(=O)(=O)CS(=O)NC1CC)C(=O)Nc1ccc2c1cccc1F))n1
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'CCS(=O)(=O)CS(=O)NC1CC)C(=O)Nc1ccc2c1cccc1F))n1'
None
CC1CCc2c(s2c2nc(n(=O)n)NC(=O)c3cccc(c3)SS(=O)=OO)C1
[22:25:29] SMILES Parse Error: unclosed ring for input: 'CC1CCc2c(s2c2nc(n(=O)n)NC(=O)c3cccc(c3)SS(=O)=OO)C1'
None
C[NH+]1CCCC(=O)[O-])CCc1c(=O)oc2c2c2cccc2C(s=])CCCC2=O
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'C[NH+]1CCCC(=O)[O-])CCc1c(=O)oc2c2c2cccc2C(s=])CCCC2=O'
None
Cc1cn(c(=O)c1C)CC(=O)Nc2ccss[N]S(=O)N)C(O)C)3(CCCC(C)\C)//)))C
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'Cc1cn(c(=O)c1C)CC(=O)Nc2ccss[N]S(=O)N)C(O)C)3(CCCC(C)\C)//)))C'
None
C[NH]11CCCC(=O)[O-])NC(=O)CN1C)c2ccn12)C((O)/O))C)CCCC)2)C1
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'C[NH]11CCCC(=O)[O-])NC(=O)CN1C)c2ccn12)C((O)/O))C)CCCC)2)C1'
None
CCN1CC11(CCN)n(c(=NCc2ccccc2))4ccn23C4CCCC4=O2CCC
[22:25:29] SMILES Parse Error: unclosed ring for input: 'CCN1CC11(CCN)n(c(=NCc2ccccc2))4ccn23C4CCCC4=O2CCC'
None
CC(C=()Scc1n[nH]c2c(n1)CCCC(=O)CC(=O)N3CC[NH+](CC3)CCc4cccon)))
[22:25:29] SMILES Parse Error: syntax error for input: 'CC(C=()Scc1n[nH]c2c(n1)CCCC(=O)CC(=O)N3CC[NH+](CC3)CCc4cccon)))'
None
CC1CCc2c(c3cccccn2)C[H+]ccc(1O((((=+]3)Cc4ccss4Cl)1
[22:25:29] SMILES Parse Error: syntax error for input: 'CC1CCc2c(c3cccccn2)C[H+]ccc(1O((((=+]3)Cc4ccss4Cl)1'
None
Exception ignored in: <bound method BaseSession.__del__ of <tensorflow.python.client.session.Session object at 0x12a606128>>
Traceback (most recent call last):
  File "/Users/iwatobipen/.pyenv/versions/anaconda3-2.4.0/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 701, in __del__
TypeError: 'NoneType' object is not callable

Hmm…. I could not get suitable structures from the molecule autoencoder.
Molecule generation is difficult here because the input representation is based on SMILES strings, and the ratio of invalid SMILES was high. Still, I think DeepChem and RDKit make a nice combination for chemoinformatics.
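If you want to quantify the failure rate, a minimal sketch like the following ( reusing decoded_smiles from the script above ) counts how many decoded SMILES RDKit can actually parse:

valid = [ smi[0] for smi in decoded_smiles if Chem.MolFromSmiles( smi[0] ) is not None ]
print( '{0} / {1} decoded SMILES are valid'.format( len( valid ), len( decoded_smiles ) ) )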

Graph convolution classification with deepchem

I posted before about graph convolution regression using deepchem. Today, I tried graph convolution classification using deepchem.
The code is almost the same as for the regression model; the only difference is using dc.models.MultitaskGraphClassifier instead of dc.models.MultitaskGraphRegressor.
I got sample data ( JAK3 inhibitors ) from ChEMBL and tried to build a model.

At first I used pandas to convert the activity class ( active / non active ) to a 0/1 label. It is easy to do.

import pandas as pd
df = pd.read_table('jak3_chembl.txt', header=0)
# factorize returns ( integer codes, unique labels ); take the codes as class labels
df['activity_class'] = pd.factorize( df.ACTIVITY_COMMENT )[0]

df.to_csv('./preprocessed_jak3.csv', index=False)
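For reference, pd.factorize returns a tuple of integer codes and the unique labels; a toy example ( made-up labels, not the actual ChEMBL data ):

import pandas as pd
codes, uniques = pd.factorize( pd.Series( [ 'active', 'inactive', 'active' ] ) )
print( codes )    # [0 1 0]
print( uniques )  # unique labels, in order of first appearance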

Next, I wrote the model and tested it.

import tensorflow as tf
import deepchem as dc
import numpy as np
import pandas as pd

graph_featurizer = dc.feat.graph_features.ConvMolFeaturizer()
loader = dc.data.data_loader.CSVLoader( tasks=['activity_class'], smiles_field="CANONICAL_SMILES", id_field="CMPD_CHEMBLID", featurizer=graph_featurizer )
dataset = loader.featurize( './preprocessed_jak3.csv' )

splitter = dc.splits.splitters.RandomSplitter()
trainset,testset = splitter.train_test_split( dataset )

hp = dc.molnet.preset_hyper_parameters
param = hp.hps[ 'graphconv' ]
print(param['batch_size'])
g = tf.Graph()
graph_model = dc.nn.SequentialGraph( 75 )
graph_model.add( dc.nn.GraphConv( int(param['n_filters']), 75, activation='relu' ))
graph_model.add( dc.nn.BatchNormalization( epsilon=1e-5, mode=1 ))
graph_model.add( dc.nn.GraphPool() )
graph_model.add( dc.nn.GraphConv( int(param['n_filters']), int(param['n_filters']), activation='relu' ))
graph_model.add( dc.nn.BatchNormalization( epsilon=1e-5, mode=1 ))
graph_model.add( dc.nn.GraphPool() )
graph_model.add( dc.nn.Dense( int(param['n_fully_connected_nodes']), int(param['n_filters']), activation='relu' ))
graph_model.add( dc.nn.BatchNormalization( epsilon=1e-5, mode=1 ))
graph_model.add( dc.nn.GraphGather( 10 , activation='tanh'))

with tf.Session() as sess:
    model_graphconv = dc.models.MultitaskGraphClassifier( graph_model,
                                                      1,
                                                      75,
                                                     batch_size=10,
                                                     learning_rate = param['learning_rate'],
                                                     optimizer_type = 'adam',
                                                     beta1=.9,beta2=.999)
    model_graphconv.fit( trainset, nb_epoch=50 )

train_scores = {}
#regression_metric = dc.metrics.Metric( dc.metrics.pearson_r2_score, np.mean )
classification_metric = dc.metrics.Metric( dc.metrics.roc_auc_score, np.mean )
train_scores['graphconvreg'] = model_graphconv.evaluate( trainset,[ classification_metric ]  )
p=model_graphconv.predict( testset )

for i in range( len(p )):
    print( p[i], testset.y[i] )

print(train_scores) 

root@08d8f729f78b:/deepchem/pen_test# python graphconv_jak3.py

And the log output was….

Loading raw samples now.
shard_size: 8192
About to start loading CSV from ./preprocessed_jak3.csv
Loading shard 1 of size 8192.
Featurizing sample 0
TIMING: featurizing shard 0 took 2.023 s
TIMING: dataset construction took 3.830 s
Loading dataset from disk.
TIMING: dataset construction took 2.263 s
Loading dataset from disk.
TIMING: dataset construction took 1.147 s
Loading dataset from disk.
50
Training for 50 epochs
Starting epoch 0
On batch 0
...............
On batch 0
On batch 50
computed_metrics: [0.97176380945032259]
{'graphconvreg': {'mean-roc_auc_score': 0.97176380945032259}}

Not so bad.
The classification model gives a better result than the regression model.
All the code is pushed to my GitHub repository.
https://github.com/iwatobipen/deeplearning
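By the way, the script above only scores the training set; a minimal sketch for scoring the held-out set with the same metric ( reusing model_graphconv, testset, and classification_metric from the script ) would be:

test_scores = {}
test_scores['graphconvreg'] = model_graphconv.evaluate( testset, [ classification_metric ] )
print( test_scores )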

Build regression model in Keras

I introduced Keras at mishimasyk#9, and my presentation was about how to build a classification model in Keras.
A participant asked me how to build a regression model in Keras, and I could not answer his question at the time.
After syk#9, I searched the Keras API and found a good method.
Keras has a scikit-learn API, and the API can build regression models. 😉
https://keras.io/scikit-learn-api/
Example code follows.
The code builds a QSAR model.

import numpy as np
import pandas as pd
import sys

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs

from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import cross_val_score
from sklearn.cross_validation import KFold
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

from keras.models import Sequential
from keras.layers import Activation, Dense, Dropout
from keras.wrappers.scikit_learn import KerasRegressor

def getFpArr( mols, nBits = 1024 ):
    fps = [ AllChem.GetMorganFingerprintAsBitVect( mol, 2, nBits=nBits ) for mol in mols ]
    X = []
    for fp in fps:
        arr = np.zeros( (1,) )
        DataStructs.ConvertToNumpyArray( fp, arr )
        X.append( arr )
    return X

def getResponse( mols, prop="ACTIVITY" ):
    Y = []
    for mol in mols:
        act = mol.GetProp( prop )
        act = 9. - np.log10( float( act ) )  # convert IC50 ( nM ) to pIC50
        Y.append( act )
    return Y

def base_model():
    model = Sequential()
    model.add( Dense( input_dim=1024, output_dim = 100 ) )
    model.add( Activation( "relu" ) )
    model.add( Dense( 100 ) )
    model.add( Activation( "relu" ) )
    model.add( Dense( 1 ) )
    #model.add( Activation( 'relu' ) )
    model.compile( loss="mean_squared_error",  optimizer="adam" )
    return model


if __name__ == '__main__':
    filename = sys.argv[1]
    sdf = [ mol for mol in Chem.SDMolSupplier( filename ) ]
    X = getFpArr( sdf )
    Y = getResponse( sdf )

    trainx, testx, trainy, testy = train_test_split( X, Y, test_size=0.2, random_state=0 )
    trainx, testx, trainy, testy = np.asarray( trainx ), np.asarray( testx ), np.asarray( trainy ), np.asarray( testy )
    estimator = KerasRegressor( build_fn = base_model,
                                nb_epoch=100,
                                batch_size=20,
                                 )
    estimator.fit( trainx, trainy )
    pred_y = estimator.predict( testx )
    r2 = r2_score( testy, pred_y )
    rmse = np.sqrt( mean_squared_error( testy, pred_y ) )  # mean_squared_error returns MSE; take the square root for RMSE
    print( "KERAS: R2 : {0:f}, RMSE : {1:f}".format( r2, rmse ) )

Run the code.
I used a ChEMBL dataset.

mishimasyk9 iwatobipen$ python keras_regression.py sdf/CHEMBL952131_EGFR.sdf 
Using Theano backend.
Epoch 1/100
102/102 [==============================] - 0s - loss: 62.2934     
........................   
Epoch 100/100
102/102 [==============================] - 0s - loss: 0.0123     
KERAS: R2 : 0.641975, RMSE : 0.578806

R2 is 0.64. It’s not so bad. 😉
I pushed the script to the syk 9 repository.
https://github.com/Mishima-syk/9/tree/master/iwatobipen

RemoteMonitor in keras

There are several packages for performing deep learning in Python, and my favorite one is Keras.
https://keras.io/
Today, I found a new function in keras.callbacks named RemoteMonitor.
The function provides real-time visualization of the learning process.
So, I wrote a very simple example using the IRIS dataset.
First, to use RemoteMonitor, I needed to clone the API from the following URL.
https://github.com/fchollet/hualos
Then run api.py.
Now, I can access localhost:9000 via a web browser.

iwatobipen$ git clone https://github.com/fchollet/hualos
iwatobipen$ cd hualos
iwatobipen$ python api.py

Then, perform machine learning.

from sklearn.datasets import load_iris
from sklearn.cross_validation import train_test_split
from keras.utils.np_utils import to_categorical
from keras.callbacks import RemoteMonitor
# prepare dataset
irisdata = load_iris()
X,Y = irisdata["data"], irisdata["target"]
train_X, test_X, train_Y, test_Y = train_test_split( X, Y, test_size=0.2, random_state=123 )
# make monitor with default settings.
monitor = RemoteMonitor()
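# RemoteMonitor posts metrics to http://localhost:9000 by default;
# the target can be changed with the root argument, e.g. RemoteMonitor( root='http://myhost:9000' )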

from keras.models import Sequential
# build model
model = Sequential()
from keras.layers import Dense, Activation
model.add( Dense( output_dim=12, input_dim=4 ) )
model.add( Activation( "relu" ) )
model.add( Dense( output_dim=3 ) )
model.add( Activation( "softmax" ) )
model.compile( optimizer="sgd",
               loss="sparse_categorical_crossentropy",
               metrics = ["accuracy"] )

OK. Let's learn!
To monitor the progress, I set [ monitor ] in the callbacks argument.

hist = model.fit( train_X, train_Y, nb_epoch=50, batch_size=20, callbacks=[ monitor ] )

During the learning, localhost:9000 is automatically updated and I can see the following image.
The page is made with Flask and d3.js. Cool !!!
[image: keras_viz — screenshot of the hualos monitoring page]
Note that “api.py” will not work on Python 3.x.
If you want to run the code on Python 3.x, I recommend replacing the following line, because iteritems was renamed to items in Python 3.

        lines = ["%s: %s" % (v, k)
-                 for k, v in self.desc_map.iteritems() if k]
+                 for k, v in self.desc_map.items() if k]

FYI:
The 9th Mishima.syk will be held on 10th Dec. The topic is a TensorFlow and Keras hands-on.
https://connpass.com/event/42284/

Don’t miss it. 😉

Use convolution2D in QSAR.

Recently there have been lots of reports about using deep learning to predict biological activity (QSAR).
I think almost all of these predictors use an MLP, and I wondered whether another method, like a 2D CNN, could give a good predictor.
So, I tried to build a 2D CNN QSAR model.
Fortunately, a 1024 bit fingerprint converts easily to a 2D 32 x 32 fingerprint.
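As a quick illustration of that claim ( plain numpy, nothing model-specific ):

import numpy as np
# a 1024-dim fingerprint vector reshapes losslessly into a 32 x 32 "image"
fp_vector = np.zeros( 1024 )
fp_image = fp_vector.reshape( 1, 32, 32 )  # ( channel, row, col ) for Theano dim ordering
print( fp_image.shape )  # (1, 32, 32)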
Let's start.
At first, I converted molecules to 2D bit images. The data source is ChEMBL.

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem import Draw
from sklearn.cross_validation import train_test_split
import pandas as pd
import numpy as np
import glob
import pickle
import matplotlib.pyplot as plt


df = pd.read_table('bioactivity-15_13-09-28.txt', header=0)
df_bind = df[ df.ASSAY_TYPE=="B" ]
# "!= None" does not drop missing values in pandas; use notnull() instead
df_bind = df_bind[ df_bind.STANDARD_VALUE.notnull() ]
df_bind = df_bind[ df_bind.STANDARD_VALUE >= 0 ]

rows = df_bind.shape[ 0 ]
mols = [ ]
act = [ ]
X = [ ]

def generate_fparr( mol ):
    arr = np.zeros( (1,) )
    fp = AllChem.GetMorganFingerprintAsBitVect( mol, 2, nBits = 1024, useFeatures = True )
    DataStructs.ConvertToNumpyArray( fp, arr )
    size = 32
    # reshape the 1024 bit vector into a 1 x 32 x 32 image ( channel, row, col )
    return arr.reshape( 1, size, size )

def save_fig( fparr, filepath, size=32 ):
    X, Y = np.meshgrid( range(size), range(size) )
    Z = fparr
    Z = Z[::-1,:]
    plt.xlim( 0, 31 )
    plt.ylim( 0, 31 )
    plt.pcolor(X, Y, Z[0])
    plt.gray()
    plt.savefig( filepath )
    plt.close()

def act2bin( val ):
    # STANDARD_VALUE is an IC50 in nM; treat anything over 10 uM ( 10000 nM ) as inactive
    if val > 10000:
        return 0
    else:
        return 1

# look up rows by index label; labels removed by the filters above raise and are skipped
for i in range( rows ):
    try:
        smi = df_bind.CANONICAL_SMILES[i]
        mol = Chem.MolFromSmiles( smi )
        if mol is not None:
            mols.append( mol )
            act.append( act2bin( df_bind.STANDARD_VALUE[i]) )
        else:
            pass
    except:
        pass

# save mols image dataset
for idx, mol in enumerate( mols ):
    X.append( generate_fparr( mol ) )
    if act[ idx ] == 1:
        save_fig( X[ idx ], "./posi/idx_{}.png".format( idx ) )
    elif act[ idx ] == 0:
        save_fig( X[ idx ], "./nega/idx_{}.png".format( idx ) )


X = np.asarray(X)
Y = np.asarray(act)

x_train, x_test, y_train, y_test = train_test_split( X,Y, test_size=0.2, random_state=123 )

f = open( 'fpimagedataset.pkl', 'wb' )
pickle.dump([ ( x_train,y_train ), ( x_test, y_test ) ], f)
f.close()

Now I got a 2D fingerprint dataset and 2D fingerprint molecular bit images. The following image is one of the positive bit images. Of course, I cannot tell from the image why the molecule was active.
[image: idx_0 — bit image of one active molecule]

Next, I wrote a predictor using Keras.

import numpy as np
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout, Flatten
from keras.layers.normalization import BatchNormalization
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.utils import np_utils
import pickle
import matplotlib
import matplotlib.pyplot as plt

( x_train, y_train ), ( x_test, y_test ) = pickle.load( open('fpimagedataset.pkl', 'rb') )

batch_size = 200
nb_classes = 2
nb_epoch = 100
nb_filters = 32
nb_pool = 2
nb_conv = 3
im_rows , im_cols = 32, 32
im_channels = 1

x_train = x_train.astype( "float32" )
x_test = x_test.astype( "float32" )

y_train = np_utils.to_categorical( y_train, nb_classes )
y_test = np_utils.to_categorical( y_test, nb_classes )

print( x_train.shape[0], 'train samples' )
print( x_test.shape[0], 'test samples' )


model = Sequential()
model.add( Convolution2D( nb_filters, nb_conv, nb_conv,
                            input_shape = ( im_channels, im_rows, im_cols ) ) )
model.add( BatchNormalization() )
model.add( Activation( 'relu' ) )

model.add( Convolution2D( nb_filters, nb_conv, nb_conv ) )
model.add( BatchNormalization() )
model.add( Activation( 'relu' ) )

model.add( Convolution2D( nb_filters, nb_conv, nb_conv ) )
model.add( BatchNormalization() )
model.add( Activation( 'relu' ) )
model.add( MaxPooling2D( pool_size=( nb_pool, nb_pool ) ) )
model.add( Flatten() )
model.add( Dense( 200 ) )
model.add( Activation( 'relu' ) )
model.add( Dropout( 0.5 ) )
model.add( Dense(nb_classes) )
model.add( Activation('softmax') )
model.compile( loss='categorical_crossentropy',
               optimizer='adadelta',
               metrics=['accuracy'],
               )

hist = model.fit( x_train, y_train,
                  batch_size = batch_size,
                  nb_epoch = nb_epoch,
                  verbose = 1,
                  validation_data = ( x_test, y_test ))


print( model.summary() )
score = model.evaluate( x_test, y_test, verbose=0 )

loss = hist.history[ 'loss' ]
acc = hist.history[ 'acc' ]
val_loss = hist.history[ 'val_loss' ]
val_acc = hist.history[ 'val_acc' ]
plt.plot( range(len( loss )), loss, label='loss' )
plt.plot( range(len( val_loss )), val_loss, label='val_loss' )
plt.xlabel( 'epoch' )
plt.ylabel( 'loss' )
plt.savefig( 'loss.png' )
plt.close()
plt.plot( range(len( acc )), acc, label='accuracy' )
plt.plot( range(len( val_acc )), val_acc, label='val_accuracy' )
plt.xlabel( 'epoch' )
plt.ylabel( 'acc' )
plt.savefig( 'acc.png' )
plt.close()

Finally, I ran the code.
It took a long time to finish the calculation, and I got 2 images.
Accuracy on the training set kept increasing with the number of epochs, but accuracy on the test set did not follow.
The loss curves showed the same picture. I think the predictor is overfitting. ;-(
I used dropout and normalisation, but I couldn't avoid overfitting. Hmm……

[image: loss — training vs. validation loss]

[image: acc — training vs. validation accuracy]

Callback function of keras.

I'm still building QSAR models using deep learning, and I think I have a problem with overfitting. 🙂
Training error kept decreasing, but validation error increased with the number of epochs. :-/
It looks like overfitting, and I could not avoid it even when I used the dropout function.

I tried lots of learning conditions, but every attempt failed….. Finally I concluded that training for too long does not help QSAR.
I thought about why the overfitting occurred:
1st) I could not optimise the learning conditions.
2nd) I did not have a large enough training dataset. But this problem is hard to solve in an actual drug discovery project.
3rd) Training for too long was not good, so early stopping is better.

Keras has some callback functions, and there is an early stopping function too!
So, I wrote some code.
1st, make a dataset for hERG binary classification.

from __future__ import print_function
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from sklearn.cross_validation import train_test_split
import pandas as pd
import numpy as np
import pickle

df = pd.read_table('bioactivity-15_13-09-28.txt', header=0)
df_bind = df[ df.ASSAY_TYPE=="B" ]
# "!= None" does not drop missing values in pandas; use notnull() instead
df_bind = df_bind[ df_bind.STANDARD_VALUE.notnull() ]
df_bind = df_bind[ df_bind.STANDARD_VALUE >= 0 ]

rows = df_bind.shape[ 0 ]
mols = [ ]
act = [ ]
fps = []
def act2bin( val ):
    if val > 10000:
        return 0
    else:
        return 1

for i in range( rows ):
    try:
        smi = df_bind.CANONICAL_SMILES[i]
        mol = Chem.MolFromSmiles( smi )
        if mol is not None:
            mols.append( mol )
            act.append( act2bin( df_bind.STANDARD_VALUE[i]) )
        else:
            pass
    except:
        pass
for mol in mols:
    arr = np.zeros( (1,) )
    fp = AllChem.GetMorganFingerprintAsBitVect( mol, 2, nBits=1024 )
    DataStructs.ConvertToNumpyArray( fp, arr )
    # append the numpy array ( not the RDKit bit vector ) so np.array() below behaves as intended
    fps.append( arr )

fps = np.array( fps, dtype = np.float32 )
act = np.array( act, dtype = np.int32 )

train_x, test_x, train_y, test_y = train_test_split( fps, act, test_size=0.3, random_state=5 )

f = open('dataset_fp1024.pkl', 'wb')
pickle.dump( [train_x, train_y, test_x, test_y ], f )
f.close()

Distribution of logIC50:

[image: ic50dist — histogram of the logIC50 distribution]

Then I wrote the model builder using Keras.
The EarlyStopping class provides the early stopping function.
Using it is easy: just set the callbacks option in fit().
In the following code, I set another callback, ModelCheckpoint, to save the best parameters.
Finally I got better results, not only on the training set but also on the validation set.
Keras is an easy and good tool for deep learning.

import pickle
import matplotlib
import matplotlib.pyplot as plt

from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import SGD, Adagrad, Adam, Adadelta
from keras.utils import np_utils
from keras.callbacks import EarlyStopping
from keras.callbacks import ModelCheckpoint

earlystopping = EarlyStopping( monitor='val_loss',
                             patience = 4,
                             verbose=0,
                             mode='auto' )

modelcheckpoint = ModelCheckpoint( './bestmodel.hdf5',
                                   monitor='val_loss',
                                   verbose=0,
                                   save_best_only=True )

nb_epoch = 250
f = open( 'dataset_fp1024.pkl', 'rb' )
train_x, train_y, test_x, test_y = pickle.load( f )
train_y = np_utils.to_categorical(train_y, 2)
test_y = np_utils.to_categorical(test_y, 2)

model = Sequential()
model.add( Dense( output_dim = 500, input_shape=(1024,) ) )
model.add( Activation( 'sigmoid' ) )
model.add( Dropout(0.2) )
model.add( Dense( output_dim = 100 ) )
model.add( Activation( 'sigmoid' ))
model.add( Dropout(0.2) )

model.add( Dense( output_dim = 20 ) )
model.add( Activation( 'sigmoid' ))
model.add( Dropout(0.2) )

model.add( Dense( 2 ) )
model.add( Activation( 'sigmoid' ) )

model.compile( optimizer=Adadelta(lr=0.95), loss='categorical_crossentropy' )

hist = model.fit( train_x, train_y ,
                     nb_epoch=nb_epoch,
                     batch_size=50,
                     validation_split=0.1,
                     show_accuracy=True,
                     callbacks=[ earlystopping, modelcheckpoint ] )
score = model.evaluate( test_x, test_y, show_accuracy=True, verbose=1 )
print( ' testscore: ', score[0], ' testaccuracy: ', score[1] )
model.summary()

loss = hist.history[ 'loss' ]
val_loss = hist.history[ 'val_loss' ]
plt.plot( range( len( loss ) ), loss, label='loss' )
plt.plot( range( len( val_loss ) ), val_loss,  label = 'val_loss' )
plt.legend( loc='best', fontsize=10 )
plt.grid()
plt.xlabel( 'epoch' )
plt.ylabel( 'loss' )
plt.savefig( 'res_gpu.png' )
plt.close()
acc = hist.history[ 'acc' ]
val_acc = hist.history[ 'val_acc' ]
plt.plot( range( len( acc )), acc, label='accuracy' )
plt.plot( range( len( val_acc )), val_acc, label='val_accuracy' )
plt.xlabel( 'epoch' )
plt.ylabel( 'accuracy' )
plt.savefig( 'acc_gpu.png' )
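One more note ( not in the original run ): because ModelCheckpoint saved the best weights to './bestmodel.hdf5', the state with the lowest validation loss can be restored before a final evaluation. A minimal sketch, assuming the standard Keras load_weights API:

# restore the weights with the lowest validation loss saved by ModelCheckpoint
model.load_weights( './bestmodel.hdf5' )
score = model.evaluate( test_x, test_y, show_accuracy=True, verbose=1 )
print( ' best testscore: ', score[0], ' best testaccuracy: ', score[1] )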

Accuracy…
[image: acc_gpu — training vs. validation accuracy]
Loss….
[image: res_gpu — training vs. validation loss]