This year is almost over.

This post is a scheduled one, set to be published at 23:55. I hope the scheduling works correctly...
I write this blog on WordPress.com. According to the access tracking, page views have roughly doubled compared with last year, although I actually posted more articles last year.

[Screenshot: blog access statistics, 2017-12-31]
I have been casually writing posts in English for the past few years, but these days I don't feel my English has improved at all.

Looking back on this year, I had many opportunities for new experiences at work. The people I work with every year are all highly skilled, which is stimulating and helpful, but it also increasingly makes me realize how much I still have to learn.
Also, while I had more chances to try new things, the frequency of my synthesis experiments in the lab dropped sharply (´・ω・`). Actually, I hardly did any... On the other hand, I was able to take on CompChem work this year, and contributing to a project by writing code as part of my actual job, even just a little, was a good stimulus that kept me motivated. I also did a fair amount of coordination work inside and outside the company, so my paperwork skills improved a bit. ;-)

I have now been at the company long enough that I am closer to the senior end of the roster. I still remember feeling unconvinced when, long ago, a far more senior colleague said, "From now on, you young people will open up the path. I'll be cheering for you." I am gradually approaching that age myself. I have no intention of saying such things to younger colleagues; of course I will cooperate with and support them rather than get in their way. If anything, I feel a sense of urgency that I must work hard enough not to be outdone by the younger generation... I also think that the moment you say "I leave the rest to you, I'll cheer you on", you have abandoned your responsibility as a researcher; if you are going to say that, you should hand over your post instead.

So, next year too, I will do my best one day at a time, with some breaks along the way.
To everyone I talked with face to face, and to those I have never met in person but interacted with through SNS: wishing you all even greater success, I will close this year's final post here.

I hope next year will be an even better one for all of you!!!!


Build QSAR model with pytorch and rdkit #RDKit

There are many deep learning frameworks in Python, for example Chainer, Keras, Theano, Tensorflow and pytorch.
I have tried Keras, Chainer and Tensorflow for QSAR modeling, so this time I tried to build a QSAR model using pytorch and RDKit.
As you know, pytorch offers dynamic neural networks ("Define-by-Run"), like Chainer.
I used the solubility data that is provided with RDKit; I have used this dataset before.
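
Before diving in, here is a minimal sketch (my own toy example, not part of the QSAR script) of what "Define-by-Run" means: the computation graph is built while the forward pass executes, so ordinary Python control flow decides the shape of the network.

import torch
import torch.nn.functional as F
from torch.autograd import Variable

x = Variable(torch.randn(1, 8))
w = Variable(torch.randn(8, 8), requires_grad=True)
h = x
# a plain Python loop: each iteration appends operations to the graph at run time
for _ in range(3):
    h = F.relu(h.mm(w))
loss = h.sum()
loss.backward()  # backprop through whatever graph was just built
print(w.grad.size())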

Let’s start coding.
First, I imported the packages needed for QSAR and defined some utility functions.

import pprint
import argparse
import torch
import torch.optim as optim
from torch import nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
import numpy as np
#from sklearn import preprocessing


def base_parser():
    parser = argparse.ArgumentParser(description="Simple QSAR test with pytorch")
    parser.add_argument("trainset", help="sdf for train")
    parser.add_argument("testset", help="sdf for test")
    parser.add_argument("--epochs", type=int, default=150)
    return parser

parser = base_parser()
args = parser.parse_args()
traindata = [mol for mol in Chem.SDMolSupplier(args.trainset) if mol is not None]
testdata = [mol for mol in Chem.SDMolSupplier(args.testset) if mol is not None]

def molsfeaturizer(mols):
    # convert molecules to 2048-bit Morgan fingerprints (radius 2) as a numpy array
    fps = []
    for mol in mols:
        arr = np.zeros((0,))
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
        DataStructs.ConvertToNumpyArray(fp, arr)
        fps.append(arr)
    fps = np.array(fps, dtype=np.float32)
    return fps

classes = {"(A) low":0, "(B) medium":1, "(C) high":2}
#classes = {"(A) low":0, "(B) medium":1, "(C) high":1}

trainx = molsfeaturizer(traindata)
testx = molsfeaturizer(testdata)
# for pytorch, y must be long type!!
trainy = np.array([classes[mol.GetProp("SOL_classification")] for mol in traindata], dtype=np.int64)
testy = np.array([classes[mol.GetProp("SOL_classification")] for mol in testdata], dtype=np.int64)

The torch.from_numpy function converts a numpy array to a torch tensor, which is very convenient.
Then I defined the neural network. This approach feels quite distinctive to me, because I mostly use Keras for deep learning:
to build a model in pytorch, I need to define each layer and the overall structure myself.

X_train = torch.from_numpy(trainx)
X_test = torch.from_numpy(testx)
Y_train = torch.from_numpy(trainy)
Y_test = torch.from_numpy(testy)
print(X_train.size(), Y_train.size())
print(X_test.size(), Y_test.size())

class QSAR_mlp(nn.Module):
    # simple MLP: 2048-bit fingerprint -> 3 solubility classes
    def __init__(self):
        super(QSAR_mlp, self).__init__()
        self.fc1 = nn.Linear(2048, 524)
        self.fc2 = nn.Linear(524, 10)
        self.fc3 = nn.Linear(10, 10)
        self.fc4 = nn.Linear(10, 3)

    def forward(self, x):
        x = x.view(-1, 2048)
        h1 = F.relu(self.fc1(x))
        h2 = F.relu(self.fc2(h1))
        h3 = F.relu(self.fc3(h2))
        output = F.sigmoid(self.fc4(h3))  # note: F.cross_entropy below already applies log-softmax
        return output

After defining the model, I trained it and made predictions.
The following code is the training and prediction part.

model = QSAR_mlp()
print(model)

losses = []
optimizer = optim.Adam(model.parameters(), lr=0.005)
for epoch in range(args.epochs):
    # full-batch training: wrap the tensors in Variables (pre-0.4 pytorch API)
    data, target = Variable(X_train).float(), Variable(Y_train).long()
    optimizer.zero_grad()
    y_pred = model(data)
    loss = F.cross_entropy(y_pred, target)
    print("Loss: {}".format(loss.data[0]))  # on newer pytorch, use loss.item()
    loss.backward()
    optimizer.step()

pred_y = model(Variable(X_test).float())
predicted = torch.max(pred_y, 1)[1]  # index of the highest-scoring class

for i in range(len(predicted)):
    print("pred:{}, target:{}".format(predicted.data[i], Y_test[i]))

print("Accuracy: {}".format(sum(p == t for p, t in zip(predicted.data, Y_test)) / len(Y_test)))

Let's check the output.

iwatobipen$ python qsar_pytorch.py solubility.train.sdf solubility.test.sdf 
torch.Size([1025, 2048]) torch.Size([1025])
torch.Size([257, 2048]) torch.Size([257])
QSAR_mlp(
  (fc1): Linear(in_features=2048, out_features=524)
  (fc2): Linear(in_features=524, out_features=10)
  (fc3): Linear(in_features=10, out_features=10)
  (fc4): Linear(in_features=10, out_features=3)
)
Loss: 1.1143544912338257
-snip-
Loss: 0.6231405735015869
pred:1, target:0
pred:1, target:0
-snip-
pred:0, target:0
Accuracy: 0.642023346303502

Hmm, the accuracy is not so high. I believe there is still room for improvement; I am a newbie at pytorch, and I will practice it more next year.
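
For example, one thing I would try first (just my guess, not something I have benchmarked here): the model above applies a sigmoid before F.cross_entropy, but F.cross_entropy already applies log-softmax internally, so returning raw logits, and perhaps adding dropout, might help.

class QSAR_mlp_v2(nn.Module):
    # same idea as QSAR_mlp, but returns raw logits and adds dropout
    def __init__(self):
        super(QSAR_mlp_v2, self).__init__()
        self.fc1 = nn.Linear(2048, 524)
        self.fc2 = nn.Linear(524, 64)
        self.out = nn.Linear(64, 3)
        self.drop = nn.Dropout(p=0.2)

    def forward(self, x):
        h = self.drop(F.relu(self.fc1(x.view(-1, 2048))))
        h = self.drop(F.relu(self.fc2(h)))
        return self.out(h)  # raw logits; F.cross_entropy applies log-softmax itself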

This is my last code of this year. I would like to post to my blog more often next year.
If any readers find mistakes in my code, please let me know.

Have a happy new year !!!!
;-)

Back to my home

My family and I visited Okinawa and came back home today. It was warm and comfortable there, and we enjoyed the trip. My son was excited about his first time on an airplane. ;-)
We enjoyed Okinawa's soul food.
Rafute

Goya champuru and Orion beer!

We visited the Churaumi Aquarium and saw a whale shark. The shark weighs over 5,000 kg! @_@

And we could see the beautiful sea.
The following movie was taken by my drone!
Kouri Bridge!

Sunset beach!

I think this family vacation was a real success.

New Year’s holiday.

I am waiting for my flight at Haneda airport now. ;-)
I planned a trip with my family for the New Year's holiday. It is the first time we have taken a trip like this since my kids were born.
My kids are looking forward to getting on an airplane.
I will take photos and tweet them. I am bringing my drone!
I hope they enjoy the vacation.

How do pharmaceutical companies improve their productivity?

One of my favorite journals is Drug Discovery Today from Elsevier.
I read a short review about productivity in it that I found interesting. The title is “Do large mergers increase or decrease the productivity of pharmaceutical R&D?”. @_@
https://www.ncbi.nlm.nih.gov/pubmed/28646641
I am in charge of medicinal chemistry at a mid-size pharma, so I have no first-hand experience of the effects of mergers.
The author analyzed the effects of large mergers and compared productivity pre- and post-merger.
Astellas Pharma (Yamanouchi + Fujisawa) is the company listed from Japan.
In Table 3, they show R&D productivity in post-merger windows versus controls, and the table shows that almost all the mergers improved productivity, except for GSK's. The productivity metric is based on “total peak sales generated from new molecular entities (NMEs, new drugs approved by the FDA) per dollar of R&D spend as the measure of success”.

It is interesting that the productivity ratios of Astellas and Bayer are more than 10 times their controls.
On the other hand, many pharmaceutical companies are downsizing their research organizations and focusing on specific therapeutic areas.
The blockbuster model is gone now, and a new strategy is needed for drug discovery.
Where will pharmaceutical companies go?

…..

Ultra fast clustering script with RDKit #RDKit

Some years ago, I came across very useful information about molecular clustering. Bayon is an ultra-fast clustering tool, and the author provides not only a Japanese tutorial but also an English one.
The tool is easy to use, but to apply bayon in the chemoinformatics area, the user needs to prepare the data first.
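
For reference, bayon expects one document per line: an identifier followed by tab-separated feature/value pairs. With the fingerprint featurizer in the script below, a molecule becomes a line like this (the molecule id and bit numbers here are only illustrative):

mol0001	FP_33	1.0	FP_356	1.0	FP_650	1.0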

I wrote a simple script that converts SMILES to bayon's input format, performs the clustering, and parses the output data.
The main code is below.

import argparse
import subprocess
import pickle
import os
from rdkit import Chem
from rdkit.Chem import AllChem

parser = argparse.ArgumentParser( "Fast clustering for chemoinformatics" )
parser.add_argument( "input" )
parser.add_argument( "n", help = "the number of clusters" )
#parser.add_argument( "-l", help = "limit value of cluster bisection" )
#parser.add_argument( "-p", help = "output similarity points" )
parser.add_argument( "--output", default = "clustered.tsv" )
parser.add_argument( "-c", help = "filename of centroid", default = "centroid.tsv" )

def smi2fp( molid, smiles ):
    # one bayon input line: molid followed by tab-separated "FP_<bit>\t1.0" pairs
    mol = Chem.MolFromSmiles( smiles )
    onbits = AllChem.GetMorganFingerprintAsBitVect( mol, 2 ).GetOnBits()
    row = molid
    for bit in onbits:
        row += "\tFP_{}\t1.0".format( bit )
    row += "\n"
    return row


if __name__ == "__main__":
    args = parser.parse_args()
    inputf = open( args.input, "r" )
    n = args.n
    c = args.c
    output = args.output
    # write the bayon input file: one line per molecule
    tempf = open( "fp.tsv", "w" )
    for line in inputf:
        line = line.rstrip().split( "\t" )
        tempf.write( smi2fp( line[0], line[1] ))
    tempf.close()
    # run bayon: -p outputs similarity points, -c writes the cluster centroids
    res = subprocess.call( "time bayon -p -c {} -n {} fp.tsv > {}".format( c, n, output ), shell=True )

    # parse results: bayon writes "cluster_id <tab> member <tab> point ..." per line
    parsefile = open( output.split(".")[0]+"_parse.tsv", "w" )
    inputf = open( output, "r" )
    for line in inputf:
        line = line.rstrip().split( "\t" )
        cluster_id = line[0]
        for i in range( 1, len( line )-1, 2 ) :
            molid = line[ i ]
            point = line[ i + 1 ]
            parsefile.write("{}\t{}\tCLS_ID_{}\n".format( molid, point, cluster_id ))
    parsefile.close()

The script clusters the molecules, outputs each cluster together with its similarity point, and parses bayon's default output format.
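
Concretely (an illustrative example based on the format the parsing loop assumes, not actual output): bayon writes one cluster per line as a cluster id followed by member/point pairs, and the script flattens that to one molecule per line:

1	mol0007	0.874	mol0012	0.861

becomes

mol0007	0.874	CLS_ID_1
mol0012	0.861	CLS_ID_1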

I ran the code with the cdk2 dataset that ships with RDKit.
The 47 compounds were clustered into 5 clusters in about 0.006 s of user time!

iwatobipen$ python fastcluster.py cdk2.smi 5

real	0m0.015s
user	0m0.006s
sys	0m0.002s
Done!

It seems to work fine. ;-)

Can machine learn important feature from SMILES?

Today I found a challenging article on arXiv.
It describes SMILES2Vec. https://arxiv.org/pdf/1712.02034.pdf
As you know, word2vec is a very attractive and widely used technique in the ML area, and SMILES2Vec follows the same concept.
It converts SMILES to vectors and learns which characters are important. The authors use a “black box” explanation method for the model. I am not sure about the details of the “black box” model, but I think it is something like leave-one-out: the method masks some features, rebuilds the model, and identifies the important features.
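
As a rough sketch of that idea (my own illustration in plain numpy, not the paper's code): mask one input position at a time and measure how much the model's prediction moves.

import numpy as np

def masking_importance(predict, x):
    # predict: any callable that maps a feature vector to a scalar prediction
    base = predict(x)
    scores = np.zeros(len(x))
    for i in range(len(x)):
        masked = x.copy()
        masked[i] = 0.0  # mask a single feature
        scores[i] = abs(base - predict(masked))
    return scores  # larger score = more important feature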

Using this method, SMILES2Vec can find the important characters in a given SMILES.
They found that a CNN-GRU model gives the best result for solubility prediction. My question is: why does convolution over SMILES work so well?
My opinion is that solubility and logP depend on the presence or absence of substituents such as hydroxyl or amino groups, and in many cases do not depend strongly on their positions. So I think the method is interesting, but it would be difficult to predict biological affinity this way.

SMILES strings are the major input format in the deep learning area, and I often use SMILES too. ;-) But I want to find another representation for DL.

Reference for the “black box” model:
https://arxiv.org/pdf/1602.07043.pdf