Ultra fast clustering script with RDKit #RDKit

Some years ago, I came across very useful information about molecular clustering. Bayon is an ultra-fast clustering tool, and the author provides both a Japanese and an English tutorial.
The tool is easy to use, but to apply bayon in the chemoinformatics area, the user needs to prepare the data in bayon's input format.

I wrote a simple script that converts SMILES to bayon's input format, runs the clustering, and then parses the output data.
The main code is as follows.

import argparse
import subprocess
import pickle
import os
from rdkit import Chem
from rdkit.Chem import AllChem

parser = argparse.ArgumentParser( "Fast clustering for chemoinformatics" )
parser.add_argument( "input" )
parser.add_argument( "n", help = "the number of clusters" )
#parser.add_argument( "-l", help = "limit value of cluster bisection" )
#parser.add_argument( "-p", help = "output similarity points" )
parser.add_argument( "--output", default = "clustered.tsv" )
parser.add_argument( "-c", help = "filename of centroid", default = "centroid.tsv" )

def smi2fp( molid, smiles ):
    # convert a SMILES string into one bayon input line:
    # molid<TAB>FP_<bit><TAB>1.0 ... for every on-bit of the Morgan fingerprint (radius 2)
    mol = Chem.MolFromSmiles( smiles )
    onbits = AllChem.GetMorganFingerprintAsBitVect( mol, 2 ).GetOnBits()
    row = molid
    for bit in onbits:
        row += "\tFP_{}\t1.0".format( bit )
    row += "\n"
    return row


if __name__ == "__main__":
    args = parser.parse_args()
    inputf = open( args.input, "r" )
    n = args.n
    c = args.c
    output = args.output
    tempf = open( "fp.tsv", "w" )
    # write one bayon input line per molecule (input file format: molid<TAB>SMILES)
    for line in inputf:
        line = line.rstrip().split( "\t" )
        tempf.write( smi2fp( line[0], line[1] ))
    tempf.close()
    # run bayon: -n sets the number of clusters, -c writes the centroid file, -p outputs similarity points
    res = subprocess.call( "time bayon -p -c {} -n {} fp.tsv > {}".format( c, n, output ), shell=True )

    # parse results: each bayon output line is cluster_id<TAB>member1<TAB>point1<TAB>member2<TAB>point2 ...
    parsefile = open( output.split(".")[0]+"_parse.tsv", "w" )
    inputf = open( output, "r" )
    for line in inputf:
        line = line.rstrip().split( "\t" )
        cluster_id = line[0]
        for i in range( 1, len( line )-1, 2 ) :
            molid = line[ i ]
            point = line[ i + 1 ]
            parsefile.write("{}\t{}\tCLS_ID_{}\n".format( molid, point, cluster_id ))
    parsefile.close()

The code clusters the molecules, outputs each cluster with its similarity points, and parses bayon's default output format.
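To make the parsing step concrete, here is a minimal sketch with a made-up bayon output line (the molecule IDs and points are purely illustrative); it uses the same cluster_id / member / point layout that the loop above assumes.

# illustration only: one hypothetical bayon output line and how the script reshapes it
line = "1\tmol_001\t0.95\tmol_007\t0.87".split("\t")
cluster_id = line[0]
for i in range(1, len(line) - 1, 2):
    print("{}\t{}\tCLS_ID_{}".format(line[i], line[i + 1], cluster_id))
# expected output:
# mol_001    0.95    CLS_ID_1
# mol_007    0.87    CLS_ID_1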

I ran the code on the cdk2 dataset (as SMILES) that ships with RDKit.
47 compounds were clustered into 5 clusters in about 0.006 s!

iwatobipen$ python fastcluster.py cdk2.smi 5

real	0m0.015s
user	0m0.006s
sys	0m0.002s
Done!

It seems to work fine. 😉

Can a machine learn important features from SMILES?

Today I found a challenging article on arXiv.
It describes SMILES2Vec. https://arxiv.org/pdf/1712.02034.pdf
As you know, word2vec is a very attractive and widely used technique in the ML area, and SMILES2Vec follows the same concept.
It converts SMILES to vectors and learns which characters are important. The authors use a “black box” explanation approach when building the model. I am not completely sure about the “black box” method, but I think it resembles a leave-one-out analysis: the method masks some features, rebuilds the model, and finds the important features.
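The following is not the paper's actual procedure, just a minimal sketch of the masking idea as I understand it, with made-up data and a plain scikit-learn model: mask one feature at a time, re-evaluate a simple model, and see how much the score drops.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# toy data: 100 samples, 5 features; y depends mainly on features 0 and 2
rng = np.random.RandomState(0)
X = rng.rand(100, 5)
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.randn(100)

baseline = cross_val_score(LinearRegression(), X, y, cv=5).mean()
for i in range(X.shape[1]):
    X_masked = X.copy()
    X_masked[:, i] = X_masked[:, i].mean()   # "mask" feature i by removing its variation
    score = cross_val_score(LinearRegression(), X_masked, y, cv=5).mean()
    print("feature {}: score drop = {:.3f}".format(i, baseline - score))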

Using this method, SMILES2Vec can find the important characters in a given SMILES string.
The authors found that a CNN-GRU model gave the best result for solubility prediction. My question is: why does convolution over SMILES work so well?
My opinion is that solubility and logP depend mainly on the presence or absence of substituents such as hydroxyl or amino groups, and in many cases do not depend strongly on their position. So I think the method is interesting, but it may be difficult to apply to predicting biological affinity.

SMILES strings are a major input format in the deep learning area, and I often use SMILES myself. 😉 But I would like to find another representation for DL.

Reference for the black-box model approach:
https://arxiv.org/pdf/1602.07043.pdf

About the work of a medicinal chemist #souyakuAC2017

Hello. I'm iwatobipen, not an especially tech-savvy person. I work as a medicinal chemist at a certain company.

I signed up for the Drug Discovery Advent Calendar 2017 on a bit of a whim, so today I would like to write, from my own perspective, about what a medicinal chemist's work is like.

Of course, the details vary from company to company (even people at my own company may say "that's not how it is!", and I won't deny that at all). Please just take this as a light read.
This article is the day-5 entry of the Drug Discovery Advent Calendar 2017 (#souyakuAC2017).

A medicinal chemist's job is to design and synthesize compounds that could become drugs. The synthesis is not necessarily done by the chemist themselves, so a medicinal chemist is not the same as a synthetic organic chemist; it is not a question of which is better, their goals are simply different. Now, many things are required of a drug:
– Efficacy (without it, we cannot help patients suffering from disease).
– No toxicity (no matter how potent a compound is, if the side effects outweigh the benefit, it becomes a poison).
– Ease of taking (small tablets, oral dosing rather than other routes, good solubility... of course, there are cases where an injectable is preferable).
– Cost (health economics and drug pricing are debated topics, but cheaper is of course preferable to more expensive).
– Rights to the compounds you make: if you cannot obtain a patent it is no good, so knowledge of intellectual property is also required.
– A manufacturable process: knowledge and skills in process chemistry to make the target compound efficiently are also needed.
My head spins just writing this all down. Some people are specialists in everything, but others build on their strengths and become specialists in one particular part: for example, a synthesis specialist who can turn any compound, no matter how difficult, into a commercial process, or someone with an almost sixth-sense intuition whose compounds are somehow always outstanding. There are many types of researchers, but the goal is the same: to bring drugs into the world and help people who are suffering!

Teamwork is also important. You need to discuss with pharmacologists, DMPK researchers, toxicologists, and specialists in many other fields, translate the findings into chemical structures, and think about the next move.
There are also days spent battling dilemmas: the SAR tells you that adding a substituent here would increase potency, but putting it there raises lipophilicity, which worsens metabolic stability or brings out toxicity. Hmm, so it cannot go there after all...
It would be wonderful to score 100 points on everything, but that almost never happens. What matters is weighing risk and benefit and building a strategy for how to move the project forward. A colleague at work told me about this paper, and I find it very interesting:
Drug Discovery: A Modern Decathlon. Drug discovery really is a multi-event competition.
By the way, I have been doing this job for quite a while now, and honestly I have never had a success. I do hope to contribute to bringing at least one drug into the world before I retire, though.
It is a rewarding job, but not one where success comes easily. Can you make a drug on passion alone? Absolutely not. Passion is necessary, but without a science-based strategy it is just wasted energy. A senior colleague I deeply respect once muttered, when a meeting drifted into an argument about sheer willpower, "Drugs are not made with passion; they are made with science." I have never forgotten it.

Now for some recent topics.
Mega pharma companies have been closing sites more and more often, and alongside that, CROs seem to be on the rise. CRO stands for Contract Research Organization: a company that takes on all kinds of drug discovery work under contract. They have experienced people from mega pharma and the facilities to match, and their pitch is "we will solve whatever problem you have": library synthesis for exploration, process development, scale-up, pharmacological evaluation, and so on. Such companies are priced accordingly. On the other hand, if you think about cost-performance, you can go to countries with lower labor costs, and there are CROs that offer synthesis staff relatively cheaply.
To put it in extreme terms, more and more of a medicinal chemist's work can be covered by outsourcing. In fact, there are "virtual pharma" companies that do drug discovery with just a handful of thinking heads and outsource all of the practical work to CROs. For those of us in this job, that is a rather serious issue. Receiving a salary from a company is the same as having the company invest in you, so I have to show that I am worth investing in. In other words, in a world full of battle-hardened veterans from mega pharma, can I confidently say "you should invest in me instead"?
So what should we do? Honestly, I do not have an answer either. But if you want to get by on a single specialty, I suspect you have to reach a truly exceptional level. As for me, I would like to step a little off the main road and become a T-shaped person. Lately some people say it should be π-shaped rather than T-shaped.

I have rambled on, but a job where you can combine cutting-edge science and technology with your own synthetic skills and turn what you design into reality is, at least for me, very rewarding.
To repeat, this is only my personal view, and I am sure there are many other opinions, including "no, that is not right at all." I welcome any feedback.

Finally, I would like to close by thanking you for staying with this rambling piece to the end.
Thank you very much for reading.

Next time, if I get the chance, I may write about a more technical topic.

Data sharing for drug development

Many kinds of predictive models are used not only for predicting target activity but also for predicting safety.
Building a robust predictive model often requires a lot of data, and that is a problem.
The article describes an effort to tackle this problem in the non-clinical safety area.

The eTOX project started in 2010, and it overcame the following problems!

“””
The first challenge to be overcome by the eTOX consortium
was the apprehension of pharmaceutical companies
about sharing sensitive proprietary preclinical data. This
required a combination of legal (consortium agreement),
technical (database installed behind companies’ firewalls
and models implemented within self-contained virtual
machines), organizational (the ‘honest broker’ concept),
psychological (trust gained through collaboration), political
(data-sharing pressure, such as the FAIR (Findability,
Accessibility, Interoperability and Reusability) principles)
and social (snowball effect) solutions

“””
For a traditional Japanese company, I think this would take a very long time and require a lot of human resources.... (what do readers think???)
They also tackled standardization of the data provided by the pharmaceutical companies. I am interested in how they did that, since data standardization might affect the accuracy of the models.
The project has now entered its sustainability phase.

Recently, collaborative research has become a very common strategy in the pharmaceutical area. An appropriate collaboration structure, data normalization, and speed are key factors for success.

Latent semantic analysis with python

Yesterday, I learned about gensim. Gensim is a free Python library for topic modeling.
The library seems easy to use and implements lots of methods such as doc2vec and word2vec.
As a first step, I tried the basic tutorial on corpus handling and similarity queries.
The code is below; it is almost the same as on the official site.
https://radimrehurek.com/gensim/tut3.html

In this tutorial, I found the pprint function very useful!
The basic workflow is:
– step 1: make a training set without stop words.
– step 2: keep only the words that are used at least twice.
– step 3: make a dictionary. gensim's corpora.Dictionary builds a dictionary that maps each word to an index.
– step 4: make a model. In the following code I used LsiModel; gensim can also build a Tfidf model. 😉 (tf-idf = term frequency, inverse document frequency; a tiny sketch follows right after this list.)
– step 5: calculate the similarities between the documents and a new document.
– step 6: enjoy!
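Before the full script, here is a tiny sketch of what the tf-idf weighting mentioned in step 4 does. This is my own illustration using the textbook definition; gensim's TfidfModel differs slightly in the details (log base 2 idf and vector normalization by default).

import math

# toy corpus: three token lists (purely illustrative)
docs = [["human", "computer"], ["computer", "system"], ["trees"]]
N = len(docs)

def tfidf(token, doc):
    tf = doc.count(token)                 # term frequency in this document
    df = sum(token in d for d in docs)    # number of documents containing the token
    idf = math.log(N / df)                # rarer tokens get a larger weight
    return tf * idf

print(tfidf("computer", docs[0]))   # in 2 of 3 docs -> smaller weight
print(tfidf("human", docs[0]))      # in 1 of 3 docs -> larger weight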

import logging
from pprint import pprint
#logging.basicConfig( format='%(asctime)s:%(levelname)s:%(message)s', level=logging.DEBUG)

from gensim import corpora, models, similarities
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",              
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

pprint( len( documents ))

# step 1: lower-case, tokenize, and remove stop words
stoplist = set( 'for of a the and to in'.split() )
texts = [ [word for word in document.lower().split() if word not in stoplist ] for document in documents ]
pprint( texts )

# step 2: count token frequencies and keep only tokens that appear more than once
from collections import defaultdict
frequency = defaultdict( int )
for text in texts:
    for token in text:
        frequency[ token ] += 1
pprint( frequency )

texts2 = [ [ token for token in text if frequency[ token ] > 1 ] for text in texts ]

pprint( texts2 )
# step 3: build the token -> id dictionary
dictionary = corpora.Dictionary( texts2 )
print( dictionary )
print( dictionary.token2id )

# new query document, converted to a bag-of-words vector
newdoc = 'human computer interaction'
newvec = dictionary.doc2bow( newdoc.split() )


# step 4: build the corpus and the models (doc2bow ignores tokens that are not in the dictionary)
corpus = [ dictionary.doc2bow( text ) for text in texts ]
corpora.MmCorpus.serialize( './deerwster.mm', corpus )
corpora.SvmLightCorpus.serialize('./corpus.svmlight', corpus)
tfidf = models.TfidfModel( corpus )
corpus_tfidf = tfidf[ corpus ]
for doc in corpus_tfidf:
    print( doc )
# latent semantic analysis: project the corpus into 2 latent topics (step 4 continued)
lsi = models.LsiModel( corpus, id2word=dictionary, num_topics=2)
# step 5: similarity index over the LSI-transformed corpus
index = similarities.MatrixSimilarity( lsi[ corpus ] )
veclsi = lsi[ newvec ]
print( '+'*20 )
print( veclsi )

sims = index[ veclsi ]
for i, sim in enumerate( sims):
    pprint( documents[i]+":::"+newdoc+" similarity_score_is {}".format( sim )  )

print( "+"*20 )
print( "+"*20 )
sims2 = index[  lsi[ dictionary.doc2bow(texts2[0])] ]
for i, sim in enumerate( sims2):
    pprint( documents[i]+":::"+documents[0]+" similarity_score_is {}".format( sim )  )

BTW, what is LsiModel?
This article describes LSI. It was published in 1990, and the basic concept is principal component analysis (PCA) / singular value decomposition (SVD) of the term/document matrix.
http://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf
The authors describe extracting 'latent semantics' from documents for classification and how to compare the similarity between document A and document B.
For drug discovery, medicinal chemists are always thinking about pharmacophores. Are they the 'semantics' of the molecule-target interaction? Hmm, I find that interesting. But I know there is a large gap between NLP and drug design. How do we fill it? 😉
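To make the SVD picture concrete, here is a minimal sketch of LSI with plain numpy. This is my own illustration, not code from the paper or from gensim: build a tiny term-document count matrix, truncate the SVD to two components, and compare documents in the reduced space.

import numpy as np

# toy term-document matrix: rows = terms, columns = documents (made-up counts)
A = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)

# SVD: A = U * S * Vt; keep only the first k singular vectors ("topics")
U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = (np.diag(S[:k]) @ Vt[:k]).T   # each row is a document in the 2D latent space

# cosine similarity between document 0 and every document
d0 = doc_vecs[0]
for i, d in enumerate(doc_vecs):
    cos = d0 @ d / (np.linalg.norm(d0) * np.linalg.norm(d) + 1e-12)
    print("doc 0 vs doc {}: {:.3f}".format(i, cos))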

The output of the gensim script above is as follows.

iwatobipen$ python gensimtest.py 
9
[['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'],
 ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'management', 'system'],
 ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
 ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
 ['generation', 'random', 'binary', 'unordered', 'trees'],
 ['intersection', 'graph', 'paths', 'trees'],
 ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'],
 ['graph', 'minors', 'survey']]
defaultdict(<class 'int'>,
            {'abc': 1,
             'applications': 1,
             'binary': 1,
             'computer': 2,
             'engineering': 1,
             'eps': 2,
             'error': 1,
             'generation': 1,
             'graph': 3,
             'human': 2,
             'interface': 2,
             'intersection': 1,
             'iv': 1,
             'lab': 1,
             'machine': 1,
             'management': 1,
             'measurement': 1,
             'minors': 2,
             'opinion': 1,
             'ordering': 1,
             'paths': 1,
             'perceived': 1,
             'quasi': 1,
             'random': 1,
             'relation': 1,
             'response': 2,
             'survey': 2,
             'system': 4,
             'testing': 1,
             'time': 2,
             'trees': 3,
             'unordered': 1,
             'user': 3,
             'well': 1,
             'widths': 1})
[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]
Dictionary(12 unique tokens: ['human', 'system', 'eps', 'survey', 'response']...)
{'human': 0, 'system': 4, 'eps': 8, 'survey': 6, 'response': 3, 'interface': 1, 'minors': 11, 'graph': 10, 'user': 5, 'time': 7, 'computer': 2, 'trees': 9}
[(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)]
[(2, 0.44424552527467476), (3, 0.44424552527467476), (4, 0.3244870206138555), (5, 0.3244870206138555), (6, 0.44424552527467476), (7, 0.44424552527467476)]
[(1, 0.5710059809418182), (4, 0.4170757362022777), (5, 0.4170757362022777), (8, 0.5710059809418182)]
[(0, 0.49182558987264147), (4, 0.7184811607083769), (8, 0.49182558987264147)]
[(3, 0.6282580468670046), (5, 0.45889394536615247), (7, 0.6282580468670046)]
[(9, 1.0)]
[(9, 0.7071067811865475), (10, 0.7071067811865475)]
[(9, 0.5080429008916749), (10, 0.5080429008916749), (11, 0.695546419520037)]
[(6, 0.6282580468670046), (10, 0.45889394536615247), (11, 0.6282580468670046)]
++++++++++++++++++++
[(0, 0.46182100453271591), (1, -0.070027665278999826)]
('Human machine interface for lab abc computer applications:::human computer '
 'interaction similarity_score_is 0.9980930089950562')
('A survey of user opinion of computer system response time:::human computer '
 'interaction similarity_score_is 0.9374863505363464')
('The EPS user interface management system:::human computer interaction '
 'similarity_score_is 0.9984452724456787')
('System and human system engineering testing of EPS:::human computer '
 'interaction similarity_score_is 0.9865885972976685')
('Relation of user perceived response time to error measurement:::human '
 'computer interaction similarity_score_is 0.9075594544410706')
('The generation of random binary unordered trees:::human computer interaction '
 'similarity_score_is -0.12416791915893555')
('The intersection graph of paths in trees:::human computer interaction '
 'similarity_score_is -0.10639259219169617')
('Graph minors IV Widths of trees and well quasi ordering:::human computer '
 'interaction similarity_score_is -0.09879463911056519')
('Graph minors A survey:::human computer interaction similarity_score_is '
 '0.050041764974594116')
++++++++++++++++++++
++++++++++++++++++++
('Human machine interface for lab abc computer applications:::Human machine '
 'interface for lab abc computer applications similarity_score_is 1.0')
('A survey of user opinion of computer system response time:::Human machine '
 'interface for lab abc computer applications similarity_score_is '
 '0.9142159223556519')
('The EPS user interface management system:::Human machine interface for lab '
 'abc computer applications similarity_score_is 0.9999819993972778')
('System and human system engineering testing of EPS:::Human machine interface '
 'for lab abc computer applications similarity_score_is 0.9947828650474548')
('Relation of user perceived response time to error measurement:::Human '
 'machine interface for lab abc computer applications similarity_score_is '
 '0.8799076676368713')
('The generation of random binary unordered trees:::Human machine interface '
 'for lab abc computer applications similarity_score_is -0.1851814091205597')
('The intersection graph of paths in trees:::Human machine interface for lab '
 'abc computer applications similarity_score_is -0.16756732761859894')
('Graph minors IV Widths of trees and well quasi ordering:::Human machine '
 'interface for lab abc computer applications similarity_score_is '
 '-0.1600322276353836')
('Graph minors A survey:::Human machine interface for lab abc computer '
 'applications similarity_score_is -0.011704295873641968')

Calculate USRCAT with RDKit #RDKit

Some years ago, I posted a blog entry about USRCAT.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3505738/
USRCAT is a shape-based method like ROCS, and it works very fast. The code was freely available, but the user had to install it separately.
As you know, the new version of RDKit implements this function! That is good news, isn't it?
I tried the function just now.
The source code is as follows.

import os
import seaborn as sns
import pandas as pd
from rdkit import Chem
from rdkit.Chem import rdBase
from rdkit.Chem import RDConfig
from rdkit.Chem import AllChem
from rdkit.Chem.rdMolDescriptors import GetUSRScore, GetUSRCAT
from rdkit.Chem import DataStructs
print( rdBase.rdkitVersion )
path = os.path.join( RDConfig.RDDocsDir, "Book/data/cdk2.sdf" )

mols = [ mol for mol in Chem.SDMolSupplier( path ) ]
# generate a 3D conformer for each molecule (USRCAT needs 3D coordinates)
for mol in mols:
    AllChem.EmbedMolecule( mol,
                           useExpTorsionAnglePrefs = True,
                           useBasicKnowledge = True )
usrcats = [ GetUSRCAT( mol ) for mol in mols ]
fps = [ AllChem.GetMorganFingerprintAsBitVect( mol, 2 ) for mol in mols ]

data = { "tanimoto":[], "usrscore":[] }

# compare every pair of molecules: 3D USR score vs 2D Morgan/Tanimoto similarity
for i in range( len( usrcats )):
    for j in range( i ):
        tc = DataStructs.TanimotoSimilarity( fps[ i ], fps[ j ] )
        score = GetUSRScore( usrcats[ i ], usrcats[ j ] )
        data["tanimoto"].append( tc )
        data["usrscore"].append( score )
        print( score, tc )
df = pd.DataFrame( data )

fig = sns.pairplot( df )
fig.savefig( 'plot.png' )

Run the code.

iwatobipen$ python usrcattest.py
# output
2017.09.1
0.4878222403055059 0.46296296296296297
0.2983740604270427 0.48148148148148145
0.36022943735904756 0.5660377358490566
0.3480531986117265 0.5
0.3593106395905704 0.6595744680851063
0.25662588527525304 0.6122448979591837
0.18452571918677163 0.46296296296296297
0.18534407651655047 0.5769230769230769
0.1698894448811921 0.5660377358490566
0.19927998441539707 0.6956521739130435
0.2052241644475582 0.15714285714285714
0.21930710455068858 0.10526315789473684
0.21800341857284924 0.1038961038961039

The Tanimoto coefficient and the USR score give quite different values (2D fingerprint vs 3D shape/pharmacophore similarity). I think the USR score provides a new way to estimate molecular similarity.
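As a small follow-up (my own addition, reusing the df DataFrame built in the script above), the agreement between the two scores can also be checked numerically:

# correlation between 2D Tanimoto similarity and 3D USR score
print( df.corr( method="pearson" ) )
print( df.corr( method="spearman" ) )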

RDKit is a really cool toolkit. I love it. 😉

Visit Berlin

This week I visited Berlin on a business trip. I had useful discussions and enjoyed the travel.
It was a pleasure to talk with lots of people and learn about new technology. On the other hand, I felt the limits of my English; it was very difficult to hold discussions with people from abroad in very limited time. Hmm... ;-(

By the way, in my free time I visited the East Side Gallery, an international memorial for freedom. I saw lots of art and was very impressed.


I also enjoyed traditional German food. 😉
Currywurst

Eisbein

And huge potato salad!!!!

It was a good experience. I need to learn more and more. Keep learning!!