東海ブロックジュニアドッジボール選手権 #Dodgeball

今日はエコパアリーナで開催される東海ブロックジュニアドッジボール選手権の応援に行ってきました。ジュニアドッジボールとは小学4年生までの子供達で編成されるチームによる競技ドッジです。チームにより4年が多いところや、まだ2、3粘性が中心のチームがあったりするので結構デコボコしている感じはあるのですが競合はやはりどこもパス回しも、アタックも強烈です。

長男が所属しているチームは残念ながら今回は予選敗退となってしまいました。現4年生はこの選手権と次の桜杯がジュニアとして出られる最後の試合になります。全力を出しきりオフィシャルへ進んで欲しいものです。

予選で敗退してしまったものの、長男は高学年の球も頑張ってキャッチして自信をつけてきました。投げる球もだいぶ早くなってきましたしこれからも色々壁にぶつかるだろうけどへこたれずに頑張って欲しいと思います。

一年から始めた競技ドッジ。最初は強い球が取れず泣いたりすることが多かったですが、試合に起用して頂き、そこでキャッチできたり、色々経験してだんだん自信をつけてきたようです。小学生でエコパアリーナ、このはなアリーナなどの大きな体育館で大勢の観客のいる前でプレーできる機会なんてなかなかないと思います。これからも貪欲に練習して成長して欲しいなと思っています。まあ勉強はボロボロですがね、、、

レベルの高いチームと比べるとまだまだ課題もある気がしますが、これからもっともっと伸びていけるチームだと思います。今日の結果はしっかりと受け止め、次に向かってまた楽しく練習して強いチームになって欲しいです!

次の桜杯は私も審判で出る予定なのでもう長男の試合をゆっくり応援はできないかな。w

Advertisements

Make interactive dashboard with Dash2 #chemoinformatcs #RDKit

Still I am playing with dash. And I would like to write document about py4chemoinfomatics in this weekend (tonight).

Before I posted about how to make interactive 2D plot with Dash. And today I added file up loader and structure renderer to the code.

All code can find following URL.
https://github.com/iwatobipen/chemo_dash

For usage in closed environment, I make local css server in the code. Dash uses flask for web framework so, user can easy to make additional page like flask. Here is whole code of the demo.
I defined smiles to svg image and supply the SVG string to dash_dangerously_set_inner_html (dhtml). DHTML is not secure because the function can directly embed HTML tag to html.Div. I used the method because I would like to draw mol SVG on the fly.

import flask
import base64
import io
import dash
import dash_core_components as dcc
import dash_html_components as html
import dash_dangerously_set_inner_html as dhtml
import dash_table
from dash.dependencies import Input, Output

import os
import argparse
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem import Draw
from rdkit.Chem.Draw import rdDepictor
from sklearn.decomposition import PCA

import plotly.graph_objs as go
descdict = dict(Descriptors._descList)

app = dash.Dash(__name__)

css_directory = os.getcwd()+'/static/'
stylesheets = ['bWLwgP.css']
static_css_route = '/static/'

def smi2svg(smi):
    mol = Chem.MolFromSmiles(smi)
    rdDepictor.Compute2DCoords(mol)
    mc = Chem.Mol(mol.ToBinary())
    Chem.Kekulize(mc)
    drawer = Draw.MolDraw2DSVG(300,300)
    drawer.DrawMolecule(mc)
    drawer.FinishDrawing()
    svg = drawer.GetDrawingText().replace('svg:','')
    return svg

def parse_contents(contents):
    content_type, content_string = contents.split(',')
    decoded = base64.b64decode(content_string)
    with open('./static/temp.sdf', 'w') as f:
        f.write(decoded.decode('utf-8'))
    mols = [mol for mol in Chem.SDMolSupplier('./static/temp.sdf')]
    return mols

@app.server.route('{}'.format(static_css_route))
def serve_stylesheet(stylesheet):
    if stylesheet not in stylesheets:
        raise Exception()
    return flask.send_from_directory(css_directory, stylesheet)

for stylesheet in stylesheets:
    app.css.append_css({"external_url": "/static/{}".format(stylesheet)})


app.layout = html.Div(children=[
    html.H1(children='Hello Chemoinfo'),
    dcc.Upload(
        id='upload-data',
        children=html.Div(
            ['Drag and Drop SDF or ',
            html.A('Select Files')
            ]),
            style={
                    'width': '80%',
                    'height': '60px',
                    'lineHeight': '60px',
                    'borderWidth': '1px',
                    'borderStyle': 'dashed',
                    'borderRadius': '5px',
                    'textAlign': 'center',
                    'margin': '10px'
            },
            multiple=True
        ),
    
    html.Div(children='''
    Dash : sample plot
    '''),

    html.Div([dcc.Dropdown(id='x-column',
                           value='MaxEStateIndex',
                           options=[{'label': key, 'value': key} for key in descdict.keys()],
                           style={'width':'48%', 'display':'inline-block'}),
              dcc.Dropdown(id='y-column',
                           value='MaxEStateIndex',
                           options=[{'label': key, 'value': key} for key in descdict.keys()],
                           style={'width':'48%', 'display':'inline-block'}),
                           ]),
    html.Div([
        html.Div([html.Div(id="molimg")], className="four columns"),
        html.Div([dcc.Graph(id='example-graph')], className="eight columns")
        ], className="row"),
    #html.Div([dcc.Graph(id='chemical-space')])
    ])

@app.callback(
    Output('example-graph', 'figure'),
     
    [Input('upload-data', 'contents'),
     Input('x-column', 'value'),
     Input('y-column', 'value')]
)
def update_graph(contents, x_column_name, y_column_name):
    mols = parse_contents(contents[0])
    for i, mol in enumerate(mols):
        AllChem.Compute2DCoords(mol)
    x = [descdict[x_column_name](mol) for mol in mols]
    y = [descdict[y_column_name](mol) for mol in mols]

    return {'data':[go.Scatter(
        x=x,
        y=y,
        #text=['mol_{}'.format(i) for i in range(len(mols))],
        text=[Chem.MolToSmiles(mol) for mol in mols],
        mode='markers',
        marker={
            'size':15,
            'opacity':0.5
        }
    )],
    'layout':go.Layout(
        xaxis={'title':x_column_name},
        yaxis={'title':y_column_name}
    )}


@app.callback(
    Output('molimg', 'children'),
    [Input('example-graph', 'hoverData'),
    ]
)
def update_img(hoverData1):
    try:
        svg = smi2svg(hoverData1['points'][0]['text'])
    except:
        svg = 'Select molecule'
    return dhtml.DangerouslySetInnerHTML(svg)

if __name__=='__main__':
    app.run_server(debug=True)

Let’s run the app

iwatobipen$ python app.py

Official document supplies some examples for multiple type of file uploading. My code is not smart because I could not handle SDMolSupplier with StringIO so I saved temp.sdf and reload it. Any advice and comments are appreciated about how to handle uploaded sdf dynamically in the web app. And also I would like to know how to handle multiple out put in call back function.

Make interactive dashboard with Dash #chemoinformatcs #RDKit

I am happy that I could have fruitful discussions in mishima.syk #13 held on last Saturday.
And I knew the new package for data visualization named dash.
From the document, dash can provide interactive dashboard for data analysis. I read the documents and wrote sample code.

Following code provides 2D scatter plot with two descriptors which are calculated with RDKit. Callback function of Dash can make interactive response.

It is worth to know that by using dash, I can make cool web app with out JS coding!

import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output

import os
import argparse
from rdkit import Chem
from rdkit.Chem import Descriptors
import plotly.graph_objs as go
descdict = dict(Descriptors._descList)
app = dash.Dash(__name__)

def getParser():
    parser = argparse.ArgumentParser()
    parser.add_argument('--sdf', default='cdk2.sdf', type=str)
    return parser

app.layout = html.Div(children=[
    html.H1(children='Hello Chemoinfo'),
    html.Div(children='''
    Dash : sample plot
    '''),

    html.Div([dcc.Dropdown(id='x-column',
                           value='MaxEStateIndex',
                           options=[{'label': key, 'value': key} for key in descdict.keys()],
                           style={'width':'48%', 'display':'inline-block'}),
              dcc.Dropdown(id='y-column',
                           value='MaxEStateIndex',
                           options=[{'label': key, 'value': key} for key in descdict.keys()],
                           style={'width':'48%', 'display':'inline-block'}),
                           ]),
    dcc.Graph(id='example-graph'),
    ])

@app.callback(
    Output('example-graph', 'figure'),
    [Input('x-column', 'value'),
     Input('y-column', 'value')]
)
def update_graph(x_column_name, y_column_name):
    args = getParser().parse_args()
    mols = [mol for mol in Chem.SDMolSupplier(args.sdf) if mol != None]
    x = [descdict[x_column_name](mol) for mol in mols]
    y = [descdict[y_column_name](mol) for mol in mols]
    return {'data':[go.Scatter(
        x=x,
        y=y,
        text=[Chem.MolToSmiles(mol) for mol in mols],
        mode='markers',
        marker={
            'size':15,
            'opacity':0.5
        }
    )],
    'layout':go.Layout(
        xaxis={'title':x_column_name},
        yaxis={'title':y_column_name}
    )}
if __name__=='__main__':
    app.run_server(debug=True)

Let’s run the app

iwatobipen$ python app.py

You can see interactive plot. This is very simple example for dash usage. I would like to make more content rich dashboard for chemoinformatics near the future.

Convert fingerprint to numpy array and convert numpy array to fingerprint #RDKit #memorandum

This is just memorandum for my self.
RDKit has ConvertToNumpyArray method for converting rdkit fp to numpy array. But there is not direct method for convert numpy array to rdkit fp.
However, rdkit has CreateFromBitString method.

So, I tried to convert numpy array to rdkit fp with the method.

import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
mol = Chem.MolFromSmiles('C1CCCOC1')
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=512)
arr = np.zeros((0,), dtype=np.int8)
arr
> array([], dtype=int8)

Then convert fp to numpy array.

DataStructs.ConvertToNumpyArray(fp,arr)
arr
> array([0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

Next convert numpy array to rdkit fp.

bitstring="".join(arr.astype(str))
fp2 = DataStructs.cDataStructs.CreateFromBitString(bitstring)

Finally, check both fingerprint.

list(fp.GetOnBits())
> [2, 4, 11, 28, 144, 225, 381, 414, 438]

list(fp2.GetOnBits())
> [2, 4, 11, 28, 144, 225, 381, 414, 438]

RDKit has many functions. I am happy to find new methods that I have not used before. ;)

Virtual Screening(VS) with over hundred million compound in a few days! #Chemoinformatics

Recently virtual screening often is used for first screening for drug discovery project. Because it can screen huge amount of compound very fast compared to wet screening. I thought docking score is not reflect binding affinity of ligand and target protein. But today I read interesting article and I changed my mind.
The title was ‘Ultra-large library docking for discovering new chemotypes’, URL is below.

https://www.nature.com/articles/s41586-019-0917-9

The author focused on make-on-demand library which is compound library designed from 70,000 building blocks from Enamine and 130 well-charactalized reactions. So if there is hit in virtual screening, the hit compound can synthesis with high probability.

The authors conducted VS against two targets AmpC and dopamine D4 receptor with over 100 million compounds! In the case of D4 receptor, They used 138 million molecules for VS and it took only 1.2 calendar days! Of course they used many cores but it is exciting for me. From their analysis performance of virtual screening correlates number of compounds so bigger is better.

After the VS, they selected 589 molecules by docking score and 549 were successfully synthesized! Wow very high success rate amazing!!! ;)
Interestingly the authors analyzed performance of compound selection by human and machine.
Extended Data Fig. 7 shows comparison of hit rates achieved by combined docking score and human prioritization(visual inspection) compared to the rates achieved by docking score alone.
It is very interesting for me that hit rate from machine predictions seems slightly higher than human predictions. And compounds selected by human had better binding affinity. I surprised that compound selection by using only docking score is not so bad.

And finally they could get new chemotype with high potency. It is worth to know that huge amount of VS give an opportunity to accessing new chemical space and to get chance to get high quality hit compounds.

And I also amazed that delivery time of over 500 compounds was only six weeks. Of course all compounds has high purity (>90%).

Massive computing power is useful for drug discovery.

BTW, Mishima.syk #13 will be held on this weekend. If reader who will be participate the meeting and would like to enjoy hands-on session, I recommend to build conda python 3.6 environment with pytorch 0.3 and rdkit. ;). If it is difficult, I will provide google colab env.

3D printing technology for science #memorandum

Recently, 3D printing technology is wildly used in many area not only chemical but also pharmaceutical.
Here is a nice review for 3D printing applications.

https://www.nature.com/articles/s41570-018-0058-y

I know only a little bit about the application of 3D printing for pharmaceutical company, for example micro reactor or labo on tip. Also Cronin group published exciting articles about AI and 3D printing.

In the article, authors reviews many examples. One is a laboratory equipments. By using 3D printer, researcher can design their own custom equipment with low cost. I really amazed that 3D printing used for making NMR tube and it could be used for reaction monitoring! Wow!!!
https://pubs.rsc.org/en/content/articlelanding/2017/nj/c6nj03614g#!divAbstract

Second my interest was micro fluidics and milli fluidics. Making chemical reactor is interesting for me because I love organic synthesis! Fig3 in the article shows some examples of chemical reactors. Fig3-c is multi component reactor for synthesis of baclofen. And Fig3-d is flow reactor, the reactor used for a difluoromethylation reaction with n-BuLi at -65deg.
It is interesting for me that 3D printed reactor is compatible for organic reaction. And the 3D printing technology can provide catalytically active surfaces by modifying the chemical composition of the print material. The article describes some examples of catalytic reaction with the device.

Third my interest was drug delivery.
Fig8 shows some examples of API(Drug) printing for controlling drug release rates.

The technology is growing very rapidly and I would like to use the technology in my own research project in the future.

Try GCN QSPR with pytorch based graph library #RDKit #Pytorch #dgl

Recently many machine learning articles use pytorch for their implementation. And I found very attractive package for graph based deep learning, named ‘DGL;Deep Graph Library’. The package supports pytorch and mxnet for backend. The author provides not only package but also very nice documentation. I read the document and try GCN for QSPR with DGL.

https://www.dgl.ai/pages/about.html

The package can make graph object from networkx and can convert graph object to networkx object.
Following code example is based their example about batched graph convolution.
https://docs.dgl.ai/tutorials/basics/4_batch.html
At first, import packages which will use.

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from collections import namedtuple
import dgl
from dgl import DGLGraph
import dgl.function as fn

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, TensorDataset
from torch.utils.data import DataLoader
import torch.optim as optim

import networkx as nx
import copy
import os
from rdkit import Chem
from rdkit.Chem import RDConfig
import numpy as np

Next define mol2graph function. Following example only node features used bond features did not used.

# Following code borrowed from dgl's junction tree example.
ELEM_LIST = ['C', 'N', 'O', 'S', 'F', 'Si', 'P', 'Cl', 'Br', 'Mg', 'Na', 'Ca', 'Fe', 'Al', 'I', 'B', 'K', 'Se', 'Zn', 'H', 'Cu', 'Mn', 'unknown']

ATOM_FDIM = len(ELEM_LIST) + 6 + 5 + 1
MAX_ATOMNUM =60
BOND_FDIM = 5 
MAX_NB = 10

PAPER = os.getenv('PAPER', False)

def onek_encoding_unk(x, allowable_set):
    if x not in allowable_set:
        x = allowable_set[-1]
    return [x == s for s in allowable_set]

# Note that during graph decoding they don't predict stereochemistry-related
# characteristics (i.e. Chiral Atoms, E-Z, Cis-Trans).  Instead, they decode
# the 2-D graph first, then enumerate all possible 3-D forms and find the
# one with highest score.
'''
def atom_features(atom):
    return (torch.Tensor(onek_encoding_unk(atom.GetSymbol(), ELEM_LIST)
            + onek_encoding_unk(atom.GetDegree(), [0,1,2,3,4,5])
            + onek_encoding_unk(atom.GetFormalCharge(), [-1,-2,1,2,0])
            + [atom.GetIsAromatic()]))
'''
def atom_features(atom):
    return (onek_encoding_unk(atom.GetSymbol(), ELEM_LIST)
            + onek_encoding_unk(atom.GetDegree(), [0,1,2,3,4,5])
            + onek_encoding_unk(atom.GetFormalCharge(), [-1,-2,1,2,0])
            + [atom.GetIsAromatic()])

def bond_features(bond):
    bt = bond.GetBondType()
    return (torch.Tensor([bt == Chem.rdchem.BondType.SINGLE, bt == Chem.rdchem.BondType.DOUBLE, bt == Chem.rdchem.BondType.TRIPLE, bt == Chem.rdchem.BondType.AROMATIC, bond.IsInRing()]))

def mol2dgl_single(mols):
    cand_graphs = []
    n_nodes = 0
    n_edges = 0
    bond_x = []

    for mol in mols:
        n_atoms = mol.GetNumAtoms()
        n_bonds = mol.GetNumBonds()
        g = DGLGraph()        
        nodeF = []
        for i, atom in enumerate(mol.GetAtoms()):
            assert i == atom.GetIdx()
            nodeF.append(atom_features(atom))
        g.add_nodes(n_atoms)

        bond_src = []
        bond_dst = []
        for i, bond in enumerate(mol.GetBonds()):
            a1 = bond.GetBeginAtom()
            a2 = bond.GetEndAtom()
            begin_idx = a1.GetIdx()
            end_idx = a2.GetIdx()
            features = bond_features(bond)

            bond_src.append(begin_idx)
            bond_dst.append(end_idx)
            bond_x.append(features)
            bond_src.append(end_idx)
            bond_dst.append(begin_idx)
            bond_x.append(features)
        g.add_edges(bond_src, bond_dst)
        g.ndata['h'] = torch.Tensor(nodeF)
        cand_graphs.append(g)
    return cand_graphs

Next defined the original collate function for data loader. And defined msg and reduce function.
msg function get message from neighbor node and reduce function aggregates the massages.

msg = fn.copy_src(src="h", out="m")
def collate(sample):
    graphs, labels = map(list,zip(*sample))
    batched_graph = dgl.batch(graphs)
    return batched_graph, torch.tensor(labels)
def reduce(nodes):
    # summazation by avarage is different part
    accum = torch.mean(nodes.mailbox['m'], 1)
    return {'h': accum}

Then defined the network. By using the dgl user can easily access node and features. For example graph.ndata[‘name’] method can access node features named ‘name’. NodeApplyModule is used for calculation of each node.
It is worth to know that by using torch nn.ModuleList, user can write network like ‘Keras’.

class NodeApplyModule(nn.Module):
    def __init__(self, in_feats, out_feats, activation):
        super(NodeApplyModule, self).__init__()
        self.linear = nn.Linear(in_feats, out_feats)
        self.activation = activation
    
    def forward(self, node):
        h = self.linear(node.data['h'])
        h = self.activation(h)
        return {'h': h}
    

class GCN(nn.Module):
    def __init__(self, in_feats, out_feats, activation):
        super(GCN, self).__init__()
        self.apply_mod = NodeApplyModule(in_feats, out_feats, activation)
    
    def forward(self, g, feature):
        g.ndata['h'] = feature
        g.update_all(msg, reduce)
        g.apply_nodes(func=self.apply_mod)
        h =  g.ndata.pop('h')
        #print(h.shape)
        return h
    
class Classifier(nn.Module):
    def __init__(self, in_dim, hidden_dim, n_classes):
        super(Classifier, self).__init__()
        self.layers = nn.ModuleList([GCN(in_dim, hidden_dim, F.relu),
                                    GCN(hidden_dim, hidden_dim, F.relu)])
        self.classify = nn.Linear(hidden_dim, n_classes)
    def forward(self, g):
        h = g.ndata['h']
        for conv in self.layers:
            h = conv(g, h)
        g.ndata['h'] = h
        hg = dgl.mean_nodes(g, 'h')
        return self.classify(hg)

Let’s load data. I used solubility dataset in RDKit

solcls = {'(A) low':0, '(B) medium':1, '(C) high':2}
train_mols = [m for m in Chem.SDMolSupplier(os.path.join(RDConfig.RDDocsDir,'Book/data/solubility.train.sdf'))]
train_y = [solcls[m.GetProp('SOL_classification')] for m in train_mols]
test_mols = [m for m in Chem.SDMolSupplier(os.path.join(RDConfig.RDDocsDir,'Book/data/solubility.test.sdf'))]
test_y = [solcls[m.GetProp('SOL_classification')] for m in test_mols]
train_graphs = mol2dgl_single(train_mols)
test_graphs = mol2dgl_single(test_mols)

dataset = list(zip(train_graphs, train_y))
data_loader = DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=collate)

Finally run train and check the performance.

model = Classifier(ATOM_FDIM, 256, len(solcls))
loss_func = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
model.train()

epoch_losses = []
for epoch in range(200):
    epoch_loss = 0
    for i, (bg, label) in enumerate(data_loader):
        bg.set_e_initializer(dgl.init.zero_initializer)
        bg.set_n_initializer(dgl.init.zero_initializer)        
        pred = model(bg)
        loss = loss_func(pred, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.detach().item()
    epoch_loss /= (i + 1)
    if (epoch+1) % 20 == 0:
        print('Epoch {}, loss {:.4f}'.format(epoch+1, epoch_loss))
    epoch_losses.append(epoch_loss)

>Epoch 20, loss 0.6104
>Epoch 40, loss 0.5616
>Epoch 60, loss 0.5348
>Epoch 80, loss 0.5095
>Epoch 100, loss 0.4915
>Epoch 120, loss 0.5163
>Epoch 140, loss 0.5348
>Epoch 160, loss 0.4385
>Epoch 180, loss 0.4421
>Epoch 200, loss 0.4318
plt.plot(epoch_losses, c='b')
model.eval()
test_bg = dgl.batch(test_graphs)
test_y_tensor = torch.tensor(test_y).float().view(-1,1)
test_bg.set_e_initializer(dgl.init.zero_initializer)
test_bg.set_n_initializer(dgl.init.zero_initializer)
logit = model(test_bg)
probs = torch.softmax(logit, 1).detach().numpy()
pred_y = np.argmax(probs,1)

from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
accuracy_score(test_y, pred_y)
>0.7587548638132295
print(classification_report(test_y, pred_y))
              precision    recall  f1-score   support

           0       0.70      0.86      0.78       102
           1       0.79      0.64      0.71       115
           2       0.87      0.82      0.85        40

   micro avg       0.76      0.76      0.76       257
   macro avg       0.79      0.78      0.78       257
weighted avg       0.77      0.76      0.76       257

Hmm not so bad. OK I tried random forest next. I would like to use RDKit new descriptor, ‘dragon-type descriptor’. So I used it ;)

https://github.com/rdkit/UGM_2018/blob/master/Notebooks/Landrum_Whats_New.ipynb

from rdkit.Chem import AllChem
from rdkit.Chem.Descriptors import rdMolDescriptors
from sklearn.preprocessing import normalize
# generate 3D conf
train_mols2 = copy.deepcopy(train_mols)
test_mols2 = copy.deepcopy(test_mols)

ps = AllChem.ETKDGv2()
for m in train_mols2:
    m = Chem.AddHs(m)
    AllChem.EmbedMolecule(m,ps)
for m in test_mols2:
    m = Chem.AddHs(m)
    AllChem.EmbedMolecule(m,ps)
def calc_dragon_type_desc(mol):
    return rdMolDescriptors.CalcAUTOCORR3D(mol) + rdMolDescriptors.CalcMORSE(mol) + \
        rdMolDescriptors.CalcRDF(mol) + rdMolDescriptors.CalcWHIM(mol)
train_X = normalize([calc_dragon_type_desc(m) for m in train_mols2])
test_X = normalize([calc_dragon_type_desc(m) for m in test_mols2])

For convenience I use only one conformer the above code.

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(train_X, train_y)
rf_pred_y = rfc.predict(test_X)
accuracy_score(test_y, rf_pred_y)
>0.7587548638132295
print(classification_report(test_y, rf_pred_y))
              precision    recall  f1-score   support

           0       0.77      0.87      0.82       102
           1       0.79      0.66      0.72       115
           2       0.67      0.75      0.71        40

   micro avg       0.76      0.76      0.76       257
   macro avg       0.74      0.76      0.75       257
weighted avg       0.76      0.76      0.76       257

Random Forest showed same performance with GCN in this test.
DGL is very useful package for graph based deeplearning I think. Their original repo provides many example codes! Reader who is interested in pls check the repo.

I uploaded today’s code my git hub and it can check following URL.

https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/GCN_chemo.ipynb