Create MMPDB ( matched molecular pair )!

Matched molecular pair analysis is very common method to analyze SAR for medicinal chemists. There are lots of publications about it and applications in these area.
I often use rdkit/Contrib/mmpa to make my own MMP dataset.
The origin of the algorithm is described in following URL.
https://www.ncbi.nlm.nih.gov/pubmed/20121045

Yesterday, good news announced by @RDKit_org. It is release the package that can make MMPDB.
I tried to use the package immediately.
This package is provided from github repo. And to use the package, I need to install apsw at first. APSW can install by using conda.
And the install mmpdb by python script.

iwatobipen$ conda install -c conda-forge apsw
iwatobipen$ git clone https://github.com/rdkit/mmpdb.git
iwatobipen$ cd mmpdb
iwatobipen$ python setup.py install

After success of installation, I could found mmpdb command in terminal.
I used CYP3A4 inhibition data from ChEMBL for the test.
I prepared two files, one has smiles and id, and the another has id and ic50 value.
* means missing value. In the following case, I provided single property ( IC50 ) but the package can handle multiple properties. If reader who is interested the package, please show more details by using mmpdb –help command.

iwatobipen$ head -n 10 chembl_cyp3a4.csv 
CANONICAL_SMILES,MOLREGNO
Cc1ccccc1c2cc(C(=O)n3cccn3)c4cc(Cl)ccc4n2,924282
CN(C)CCCN1c2ccccc2CCc3ccccc13,605
Cc1ccc(cc1)S(=O)(=O)\N=C(/c2ccc(F)cc2)\n3c(C)nc4ccccc34,1698776
NC[C@@H]1O[C@@H](Cc2c(O)c(O)ccc12)C34CC5CC(CC(C5)C3)C4,59721
Cc1ccc(cc1)S(=O)(=O)N(Cc2ccccc2)c3ccccc3C(=O)NCc4occc4,759749
O=C(N1CCC2(CC1)CN(C2)c3ccc(cc3)c4ccccc4)c5ccncc5,819161
iwatobipen$ head -n 10 prop.csv 
ID	STANDARD_VALUE
924282	*
605	*
1698776	*
59721	19952.62
759749	2511.89
819161	2511.89

mmdb fragment has –cut-smarts option.
It seems attractive for me! 😉
”’
–cut-smarts SMARTS alternate SMARTS pattern to use for cutting (default:
‘[#6+0;!$(*=,#[!#6])]!@!=!#[!#0;!#1;!$([CH2]);!$([CH3]
[CH2])]’), or use one of: ‘default’,
‘cut_AlkylChains’, ‘cut_Amides’, ‘cut_all’,
‘exocyclic’, ‘exocyclic_NoMethyl’
”’
Next step, make mmpdb and join the property to db.

# run fragmentation and my input file has header, delimiter is comma ( default is white space ). Output file is cyp3a4.fragments.
# Each line of inputfile must be unique!
iwatobipen$ mmpdb fragment chembl_cyp3a4.csv --has-header --delimiter 'comma' -o cyp3a4.fragments
# rung indexing with fragmented file and create a mmpdb. 
iwatobipen$ mmpdb index cyp3a4.fragments -o cyp3a4.mmpdb

OK I got cyp3a4.mmpdb file. (sqlite3 format)
Add properties to a DB.
Type following command.

iwatobipen$ mmpdb loadprops -p prop.csv cyp3a4.mmpdb
Using dataset: MMPs from 'cyp3a4.fragments'
Reading properties from 'prop.csv'
Read 1 properties for 17143 compounds from 'prop.csv'
5944 compounds from 'prop.csv' are not in the dataset at 'cyp3a4.mmpdb'
Imported 5586 'STANDARD_VALUE' records (5586 new, 0 updated).
Generated 83759 rule statistics (1329408 rule environments, 1 properties)
Number of rule statistics added: 83759 updated: 0 deleted: 0
Loaded all properties and re-computed all rule statistics.

Ready to use DB. Let’s play with the DB.
Identify possible transforms.

iwatobipen$ mmpdb transform --smiles 'c1ccc(O)cc1' cyp3a4.mmpdb --min-pair 10 -o transfom_res.txt
iwatobipen$ head -n3 transfom_res.txt 
ID	SMILES	STANDARD_VALUE_from_smiles	STANDARD_VALUE_to_smiles	STANDARD_VALUE_radius	STANDARD_VALUE_fingerprint	STANDARD_VALUE_rule_environment_id	STANDARD_VALUE_counSTANDARD_VALUE_avg	STANDARD_VALUE_std	STANDARD_VALUE_kurtosis	STANDARD_VALUE_skewness	STANDARD_VALUE_min	STANDARD_VALUE_q1	STANDARD_VALUE_median	STANDARD_VALUE_q3	STANDARD_VALUE_max	STANDARD_VALUE_paired_t	STANDARD_VALUE_p_value
1	CC(=O)NCCO	[*:1]c1ccccc1	[*:1]CCNC(C)=O	0	59SlQURkWt98BOD1VlKTGRkiqFDbG6JVkeTJ3ex3bOA	1049493	14	3632	5313.6	-0.71409	-0.033683	-6279.7	498.81	2190.5	7363.4	12530	-2.5576	0.023849
2	CC(C)CO	[*:1]c1ccccc1	[*:1]CC(C)C	0	59SlQURkWt98BOD1VlKTGRkiqFDbG6JVkeTJ3ex3bOA	1026671	20	7390.7	8556.1	-1.1253	-0.082107	-6503.9	-0	8666.3	13903	23534	-3.863	0.0010478

Output file has information of transformation with statistics values.
And the db can use to make a prediction.
Following command can generate two files with prefix CYP3A-.
CYP3A_pairs.txt
CYP3A_rules.txt

iwatobipen$ mmpdb predict --reference 'c1ccc(O)cc1' --smiles 'c1ccccc1' cyp3a4.mmpdb  -p STANDARD_VALUE --save-details --prefix CYP3A
iwatobipen$ head -n 3 CYP3A_pairs.txt
rule_environment_id	from_smiles	to_smiles	radius	fingerprint	lhs_public_id	rhs_public_id	lhs_smiles	rhs_smiles	lhs_value	rhs_value	delta
868610	[*:1]O	[*:1][H]	0	59SlQURkWt98BOD1VlKTGRkiqFDbG6JVkeTJ3ex3bOA	1016823	839661	C[C@]12CC[C@@H]3[C@H](CC[C@H]4C[C@@H](O)CC[C@@]43C)[C@@H]1CC[C@H]2C(=O)CO	CC(=O)[C@@H]1CC[C@H]2[C@H]3CC[C@H]4C[C@@H](O)CC[C@]4(C)[C@@H]3CC[C@@]21C	1000	15849	14849
868610	[*:1]O	[*:1][H]	0	59SlQURkWt98BOD1VlKTGRkiqFDbG6JVkeTJ3ex3bOA	3666	47209	O=c1c(O)c(-c2ccc(O)c(O)c2)oc2cc(O)cc(O)c12	O=c1cc(-c2ccc(O)c(O)c2)oc2cc(O)cc(O)c12	15849	5011.9	-10837
iwatobipen$ head -n 3 CYP3A_rules.txt 
rule_environment_statistics_id	rule_id	rule_environment_id	radius	fingerprint	from_smiles	to_smiles	count	avg	std	kurtosis	skewness	min	q1	median	q3	max	paired_t	p_value
28699	143276	868610	0	59SlQURkWt98BOD1VlKTGRkiqFDbG6JVkeTJ3ex3bOA	[*:1]O	[*:1][H]	16	-587.88	14102	-0.47579	-0.065761	-28460	-8991.5	-3247.8	10238	23962	0.16674	0.8698
54091	143276	1140189	1	tLP3hvftAkp3EUY+MHSruGd0iZ/pu5nwnEwNA+NiAh8	[*:1]O	[*:1][H]	15	-1617	13962	-0.25757	-0.18897	-28460	-9534.4	-4646	7271.1	23962	0.44855	0.66062

It is worth that the package ca handle not only structure based information but also properties.
I learned a lot of things from the source code.
RDKit org is cool community!
I pushed my code to my repo.
https://github.com/iwatobipen/mmpdb_test

original repo URL is
https://github.com/rdkit/mmpdb
Do not miss it!

3d conformer fingerprint calculation using RDKit # RDKit

Recently, attractive article was published in ACS journal.
The article describes how to calculate 3D structure based fingerprint and compare some finger prints that are well known in these area.
New method called “E3FP” is algorithm to calculate 3D conformer fingerprint like Extended Connectivity Fingerprint(ECFP). E3FP encodes information only atoms that are connected but also atoms that are not connected.
http://pubs.acs.org/doi/abs/10.1021/acs.jmedchem.7b00696

The author showed several examples. Very similar in 2D but not similar in 3D and vice versa.
Also compare E3FP similarity and ROCS score( TANIMOTO COMBO ) and showed good performance.
I was interested in the fingerprint. Fortunately, the author published the code in Anaconda cloud!!!!!!!
Install it and use it ASAP. ;-D
I am mac user, so installation is very very very easy! Just type.
I found some tips to use the package.
At first, molecules need _Name property to perform calculation.
Second, mol_from_sdf can read molecule from sdf but the function can not read sdf that has multiple molecules. So, I recommend to use molecule list instead of SDF.

conda install -c sdaxen sdaxen_python_utilities
conda install -c keiserlab e3fp

I used CDK2.sdf for test.
E3FP calculates unfolded finger print. But it can convert folded fingerprint and rdkit fingerprint using flod and to_rdkit function.

%matplotlib inline
import pandas as pd
import numpy as np
from rdkit import Chem
from e3fp.fingerprint.generate import fp, fprints_dict_from_mol
from e3fp.conformer.generate import generate_conformers
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit.Chem import DataStructs
from rdkit.Chem import AllChem
IPythonConsole.ipython_useSVG=True
# this sdf has 3D conformer, so I do not need to generate 3D conf.
mols = [ mol for mol in Chem.SDMolSupplier( "cdk2.sdf", removeHs=False ) ]
fpdicts = [ fprints_dict_from_mol( mol ) for mol in mols ]
# get e3fp fingerprint
# if molecule has multiple conformers the function will generate multiple fingerprints.
fps = [ fp[5][0] for fp in fpdicts]
# convert to rdkit fp from e3fp fingerprint
binfp = [ fp.fold().to_rdkit() for fp in fps ]
# getmorganfp
morganfp = [ AllChem.GetMorganFingerprintAsBitVect(mol,2) for mol in mols ]

# calculate pair wise TC
df = {"MOLI":[], "MOLJ":[], "E3FPTC":[], "MORGANTC":[],"pairidx":[]}
for i in range( len(binfp) ):
    for j in range( i ):
        e3fpTC = DataStructs.TanimotoSimilarity( binfp[i], binfp[j] )
        morganTC = DataStructs.TanimotoSimilarity( morganfp[i], morganfp[j] )
        moli = mols[i].GetProp("_Name")
        molj = mols[j].GetProp("_Name")
        df["MOLI"].append( moli )
        df["MOLJ"].append( molj )
        df["E3FPTC"].append( e3fpTC )
        df["MORGANTC"].append( morganTC )
        df["pairidx"].append( str(i)+"_vs_"+str(j) )

The method is fast and easy to use. Bottle neck is how to generate suitable conformer(s).
Readers who interested in the package, please check the authors article.

I pushed my sample code to my repo.
https://github.com/iwatobipen/3dfp/blob/master/sample_script.ipynb

GET TID and PREF_NAME from CHEMBL

I want to retrieve relationship between TID and PREF_NAME in specific case from ChEMBL DB.
SQL query is following.

COPY (
  2         SELECT  DISTINCT TID, PREF_NAME FROM ACTIVITIES
  3                       JOIN ASSAYS USING (ASSAY_ID)
  4                       JOIN TARGET_DICTIONARY USING (TID)
  5                       WHERE STANDARD_TYPE = 'Ki'
  6                       AND STANDARD_VALUE IS NOT NULL
  7                       AND STANDARD_RELATION = '='
  8                        )
  9         TO '/path/td.csv'
 10         ( FORMAT CSV )

I ran the sql and got result like ….

1,Maltase-glucoamylase
3,Phosphodiesterase 5A
6,Dihydrofolate reductase
7,Dihydrofolate reductase
8,Tyrosine-protein kinase ABL
9,Epidermal growth factor receptor erbB1
11,Thrombin
12,Plasminogen
13,Beta-lactamase TEM
14,Adenosine deaminase
15,Carbonic anhydrase II
19,Estrogen receptor alpha
21,Neuraminidase
23,Plasma kallikrein
24,HMG-CoA reductase
25,Glucocorticoid receptor
28,Thymidylate synthase
30,Aldehyde dehydrogenase
35,Insulin receptor
36,Progesterone receptor
41,Alcohol dehydrogenase alpha c

🙂

 

Installing TensorFlow on Mac OX X with GPU support

Yesterday, I tried to install tensorflow-gpu on my mac.
My PC is MacBook Pro (Retina, 15-inch, Mid 2014). The PC has NVIDA GPU.
OS is Seirra.
Details are described in following URL.
https://www.tensorflow.org/install/install_mac

I installed tensorflow directly by using pip command.

 $ pip install --upgrade tensorflow-gpu  # for Python 2.7 and GPU #for python2
 $ pip3 install --upgrade tensorflow-gpu # for Python 3.n and GPU  #for python2

Almost done, but not finished yet.
To finish the installation, I need to disable System Integrity Protection (SIP).
To do that I need follow these steps.

Restart my Mac.
Before OS X starts up, hold down Command-R and keep it held down until you see an Apple icon and a progress bar. ...
From the Utilities menu, select Terminal.
At the prompt type exactly the following and then press Return: csrutil disable.

I tested following code.

import tensorflow as tf

# Creates a graph.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)

# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

# Runs the op.
print sess.run(c)

And the results seems tensorflow can use GPU.

iwatobipen$ python testcode.py
2017-06-13 22:24:28.952288: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-13 22:24:28.952314: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-06-13 22:24:28.952319: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-13 22:24:28.952323: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-06-13 22:24:29.469570: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:865] OS X does not support NUMA - returning NUMA node zero
2017-06-13 22:24:29.470683: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties:
name: GeForce GT 750M
major: 3 minor: 0 memoryClockRate (GHz) 0.9255
pciBusID 0000:01:00.0
Total memory: 2.00GiB
Free memory: 1.80GiB
2017-06-13 22:24:29.470713: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0
2017-06-13 22:24:29.470720: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0:   Y
2017-06-13 22:24:29.470731: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 750M, pci bus id: 0000:01:00.0)
Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: GeForce GT 750M, pci bus id: 0000:01:00.0
2017-06-13 22:24:29.490805: I tensorflow/core/common_runtime/direct_session.cc:257] Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: GeForce GT 750M, pci bus id: 0000:01:00.0

MatMul: (MatMul): /job:localhost/replica:0/task:0/gpu:0
2017-06-13 22:24:29.495363: I tensorflow/core/common_runtime/simple_placer.cc:841] MatMul: (MatMul)/job:localhost/replica:0/task:0/gpu:0
b: (Const): /job:localhost/replica:0/task:0/gpu:0
2017-06-13 22:24:29.495384: I tensorflow/core/common_runtime/simple_placer.cc:841] b: (Const)/job:localhost/replica:0/task:0/gpu:0
a: (Const): /job:localhost/replica:0/task:0/gpu:0
2017-06-13 22:24:29.495395: I tensorflow/core/common_runtime/simple_placer.cc:841] a: (Const)/job:localhost/replica:0/task:0/gpu:0
[[ 22.  28.]
 [ 49.  64.]]

ref URL
https://github.com/tensorflow/tensorflow/issues/3723

Open drug discovery toolkit for python

Recently There are lots of python libraries for chemoinformatics and machine learning. One of my favorites is RDKit. 😉
These area is still active. And today I tried new library named “ODDT” open drug discovery toolkit.
Reference URL is
https://jcheminf.springeropen.com/articles/10.1186/s13321-015-0078-2.
ODDT is well documented in http://oddt.readthedocs.io/en/latest/index.html?highlight=InteractionFingerprint. ⭐️
Oddt is implemented shape and electronic similarities!! I have never known the open source library that implemented electronic similarity. And the library also implemented function that can detect protein ligand interaction.
So, I tried to use oddt.
First, calculation of some similarities codes are below.
To calculate electroshape, just use shape.electroshape method.

from oddt import toolkit
from oddt import shape
from oddt import fingerprints
from rdkit.Chem import Draw
mols = toolkit.readfile( 'sdf', 'cdk2.sdf' )
mols = [ m for m in mols ]
print(len( mols ))
[out] 47
e_shapes = [ shape.electroshape( mol ) for mol in mols ]
usrcats = [ shape.usr_cat( mol ) for mol in mols ]
usrs = [ shape.usr( mol ) for mol in mols ]

To calculate similarity, just use usr_similarity method.
Following.

for i in range( len( mols[ :5 ] ) ):
    for j in range( i ):
        e_sim = shape.usr_similarity( e_shapes[i], e_shapes[j] )
        usrcat_sim = shape.usr_similarity( usrcats[i], usrcats[j] )
        usr_sim = shape.usr_similarity( usrs[i], usrs[j])
        print( i, j, "e_shim", e_sim, 'usrcat_sim', usrcat_sim,'usr_sim',usr_sim )
1 0 e_shim 0.879372074943 usrcat_sim 0.742055515733 usr_sim 0.676152090576
2 0 e_shim 0.865690221755 usrcat_sim 0.428271350002 usr_sim 0.686898339111
2 1 e_shim 0.896725884564 usrcat_sim 0.481233989554 usr_sim 0.763231432529
3 0 e_shim 0.766813506629 usrcat_sim 0.609482600031 usr_sim 0.463058006246
3 1 e_shim 0.7349875959 usrcat_sim 0.548950403001 usr_sim 0.459194544856
3 2 e_shim 0.715411936912 usrcat_sim 0.360330544106 usr_sim 0.424537194619
4 0 e_shim 0.810683079155 usrcat_sim 0.62174869307 usr_sim 0.61705827303
4 1 e_shim 0.774077718141 usrcat_sim 0.635441642096 usr_sim 0.694498992613
4 2 e_shim 0.755174336047 usrcat_sim 0.394074936141 usr_sim 0.618174238781
4 3 e_shim 0.931446873697 usrcat_sim 0.780733001638 usr_sim 0.562721912484

OK, next check protein-ligand contact. To do that I prepare protein and ligand file from pdb.
And the read each files and preform calculation.

from oddt import interactions
pdb1 = next(toolkit.readfile( 'pdb', '1atp_apo.pdb'))
pdb1.protein = True
ligand = next( toolkit.readfile('sdf', 'atp.sdf'))
proteinatoms, ligandatoms, strict=interactions.hbonds( pdb1, ligand )
proteinatoms['resname']
[out]
array(['GLU', 'GLU', 'GLU', 'GLU', 'HOH', 'HOH', 'ARG', 'VAL', 'SER',
       'ALA', 'HOH', 'PHE', 'GLY', 'LYS', 'HOH', 'HOH', 'THR'], 
      dtype='<U3')

ODDT also can calculate protein-ligand interaction fingerprint.

IFP = fingerprints.InteractionFingerprint( ligand, pdb1)
print( IFP )
[out] array([0, 0, 0, ..., 0, 0, 0], dtype=uint8)

I think oddt is very nice toolkit for chemoinformatics.
I uploaded my code on my github repo.

https://github.com/iwatobipen/oddt_test

Graph convolution classification with deepchem

I posted about graph convolution regression using deepchem. And today, I tried graph convolution classification using deepchem.
Code is almost same as regression model. The only a difference point is use dc.models.MultitaskGraphClassifier instead of dc.models.MultitaskGraphRegressor.
I got sample ( JAK3 inhibitor ) data from chembl and tried to make model.

At first I used pandas to convert activity class ( active, non active )to 0,1 bit. Easy to do it.

import panda as pd
import pandas as pd
df = pd.read_table('jak3_chembl.txt', header=0)
df['activity_class'] = pd.factorize( df.ACTIVITY_COMMENT )
pd.factorize( df.ACTIVITY_COMMENT )
len(pd.factorize( df.ACTIVITY_COMMENT ))
df['activity_class'] = pd.factorize( df.ACTIVITY_COMMENT )[0]

df.to_csv('./preprocessed_jak3.csv', index=False)

Next wrote model and test it.

import tensorflow as tf
import deepchem as dc
import numpy as np
import pandas as pd

graph_featurizer = dc.feat.graph_features.ConvMolFeaturizer()
loader = dc.data.data_loader.CSVLoader( tasks=['activity_class'], smiles_field="CANONICAL_SMILES", id_field="CMPD_CHEMBLID", featurizer=graph_featurizer )
dataset = loader.featurize( './preprocessed_jak3.csv' )

splitter = dc.splits.splitters.RandomSplitter()
trainset,testset = splitter.train_test_split( dataset )

hp = dc.molnet.preset_hyper_parameters
param = hp.hps[ 'graphconv' ]
print(param['batch_size'])
g = tf.Graph()
graph_model = dc.nn.SequentialGraph( 75 )
graph_model.add( dc.nn.GraphConv( int(param['n_filters']), 75, activation='relu' ))
graph_model.add( dc.nn.BatchNormalization( epsilon=1e-5, mode=1 ))
graph_model.add( dc.nn.GraphPool() )
graph_model.add( dc.nn.GraphConv( int(param['n_filters']), int(param['n_filters']), activation='relu' ))
graph_model.add( dc.nn.BatchNormalization( epsilon=1e-5, mode=1 ))
graph_model.add( dc.nn.GraphPool() )
graph_model.add( dc.nn.Dense( int(param['n_fully_connected_nodes']), int(param['n_filters']), activation='relu' ))
graph_model.add( dc.nn.BatchNormalization( epsilon=1e-5, mode=1 ))
graph_model.add( dc.nn.GraphGather( 10 , activation='tanh'))

with tf.Session() as sess:
    model_graphconv = dc.models.MultitaskGraphClassifier( graph_model,
                                                      1,
                                                      75,
                                                     batch_size=10,
                                                     learning_rate = param['learning_rate'],
                                                     optimizer_type = 'adam',
                                                     beta1=.9,beta2=.999)
    model_graphconv.fit( trainset, nb_epoch=50 )

train_scores = {}
#regression_metric = dc.metrics.Metric( dc.metrics.pearson_r2_score, np.mean )
classification_metric = dc.metrics.Metric( dc.metrics.roc_auc_score, np.mean )
train_scores['graphconvreg'] = model_graphconv.evaluate( trainset,[ classification_metric ]  )
p=model_graphconv.predict( testset )

for i in range( len(p )):
    print( p[i], testset.y[i] )

print(train_scores) 

root@08d8f729f78b:/deepchem/pen_test# python graphconv_jak3.py

And datalog file is….

Loading raw samples now.
shard_size: 8192
About to start loading CSV from ./preprocessed_jak3.csv
Loading shard 1 of size 8192.
Featurizing sample 0
TIMING: featurizing shard 0 took 2.023 s
TIMING: dataset construction took 3.830 s
Loading dataset from disk.
TIMING: dataset construction took 2.263 s
Loading dataset from disk.
TIMING: dataset construction took 1.147 s
Loading dataset from disk.
50
Training for 50 epochs
Starting epoch 0
On batch 0
...............
On batch 0
On batch 50
computed_metrics: [0.97176380945032259]
{'graphconvreg': {'mean-roc_auc_score': 0.97176380945032259}}

Not so bad.
Classification model gives better result than regression model.
All code is pushed my github repository.
https://github.com/iwatobipen/deeplearning

Graph convolution regression with deepchem

Somedays ago, I posted blog about deepchem. I am still playing with deepchem. Today I tried to use graph convolution regression model.
Deepchem provided Graph convolution Regressor. Cool.
I used solubility data provided from AstraZeneca. https://www.ebi.ac.uk/chembl/assay/inspect/CHEMBL3301364
My test code is following. Almost same as deepchem”s example code.
CSVLoader method is very useful because it can not only read data but also calculate graph feature of each molecule.
Next, Define of Graph convolution network.

import tensorflow as tf
import deepchem as dc
import numpy as np
graph_featurizer = dc.feat.graph_features.ConvMolFeaturizer()
loader = dc.data.data_loader.CSVLoader( tasks=['LogS'], smiles_field="CANONICAL_SMILES", id_field="CMPD_CHEMBLID", featurizer=graph_featurizer )
dataset = loader.featurize( './bioactivity.csv' )

splitter = dc.splits.splitters.RandomSplitter()
trainset,testset = splitter.train_test_split( dataset )

hp = dc.molnet.preset_hyper_parameters
param = hp.hps[ 'graphconvreg' ]
print(param['batch_size'])
g = tf.Graph()
graph_model = dc.nn.SequentialGraph( 75 )
graph_model.add( dc.nn.GraphConv( int(param['n_filters']), 75, activation='relu' ))
graph_model.add( dc.nn.BatchNormalization( epsilon=1e-5, mode=1 ))
graph_model.add( dc.nn.GraphPool() )
graph_model.add( dc.nn.GraphConv( int(param['n_filters']), int(param['n_filters']), activation='relu' ))
graph_model.add( dc.nn.BatchNormalization( epsilon=1e-5, mode=1 ))
graph_model.add( dc.nn.GraphPool() )
graph_model.add( dc.nn.Dense( int(param['n_fully_connected_nodes']), int(param['n_filters']), activation='relu' ))
graph_model.add( dc.nn.BatchNormalization( epsilon=1e-5, mode=1 ))
#graph_model.add( dc.nn.GraphGather(param['batch_size'], activation='tanh'))
graph_model.add( dc.nn.GraphGather( 10 , activation='tanh'))

with tf.Session() as sess:
    model_graphconv = dc.models.MultitaskGraphRegressor( graph_model,
                                                      1,
                                                      75,
                                                     batch_size=10,
                                                     learning_rate = param['learning_rate'],
                                                     optimizer_type = 'adam',
                                                     beta1=.9,beta2=.999)
    model_graphconv.fit( trainset, nb_epoch=30 )

train_scores = {}
regression_metric = dc.metrics.Metric( dc.metrics.pearson_r2_score, np.mean )
train_scores['graphconvreg'] = model_graphconv.evaluate( trainset,[ regression_metric ]  )
p=model_graphconv.predict( testset )

print(train_scores) 

Next run the code.

root@08d8f729f78b:/deepchem/pen_test# python graphconv_test.py > datalog

And datalog file is….

Loading raw samples now.
shard_size: 8192
About to start loading CSV from ./bioactivity.csv
Loading shard 1 of size 8192.
Featurizing sample 0
Featurizing sample 1000
...
Starting epoch 29
On batch 0
On batch 50
On batch 100
computed_metrics: [0.52744994044080606]
{'graphconvreg': {'mean-pearson_r2_score': 0.52744994044080606}}

r2 score is still row, but I think it can improve by change of nb_epochs.

All sample code was uploaded to github.
https://github.com/iwatobipen/deeplearning/blob/master/datalog