今年を振り返ってみる

今年も残り2時間くらいになったところ。思い返すと今年は色々な機会や出会いに恵まれていたと思います。
twitterのフォローロフォワーの関係だった方に実際の学会でお会いしたり、ブログを書いていてレスをもらったり。
いうこと聞かないけど子供がだいぶ成長してきたり。
社内的な部分は相変わらずパッとしませんでしたが、対外的な活動ではいろいろとレスをいただきました。感謝しております。
私は一応今の職務は創薬化学研究者という位置付けで仕事をさせていただいています。日々の仕事の中で不平不満があったりするのですが、小さいころから化学が好きだったので今の職につけるということだけでとても恵まれていますよね、、、。
一方で、いろんな方とお話をしたり、業務をしながら創薬科学者の役割って何なんだろうと考える頻度が増えました。昨今、国内外には非常に優秀なCMOが沢山あります。合成に特化したところから、デザインからしてくれるところ、評価してくれるところ、ナドナド。潤沢な予算と、正確な判断ができるのであれば、自分で考えて合成しなくてもいい状況が作れてしまうわけです。しかもコスト燃やすとなった場合、この業界で生き残って行くために必要なことって何なんだろう。
まあ何にせよ、自分も経験つんで、勉強しないと正確な判断できないので、どうであれ足を止めたらそこでアウトでしょう。来年も理解が遅くともぼちぼち勉強を進めようと思います。

また、今年のmishimasykも楽しい話題が盛りだくさんでした。次回はどんな話題がいいでしょうか。社外の方との勉強会はとてもいい刺激になります。
来年も良い出会いがあるといいなあと思います。
皆様にとっても良い一年になることを願いつつそろそろ眠くなったので寝ようかなと。

今年の一番のフォト?フランスストラスブールでの一枚です

IMG_2143.JPG

Convert rdkit molecule object to igraph graph object.

Molecules are often handled as graph in chemoinformatics.
There are some libraries for graph analysis in python. Today, I wrote a sample script that convert from molecule to graph.
I used python-igraph and rdkit.
RDkit has method to get adjacency matrix from molecule so, I used the method.
Code is following.

from rdkit import Chem
from rdkit.Chem import rdmolops
import igraph
import numpy as np
# mol2graph function can convert molecule to graph. And add some node and edge attribute.
def mol2graph( mol ):
    admatrix = rdmolops.GetAdjacencyMatrix( mol )
    bondidxs = [ ( b.GetBeginAtomIdx(),b.GetEndAtomIdx() ) for b in mol.GetBonds() ]
    adlist = np.ndarray.tolist( admatrix )
    graph = igraph.Graph()
    g = graph.Adjacency( adlist ).as_undirected()
    for idx in g.vs.indices:
        g.vs[ idx ][ "AtomicNum" ] = mol.GetAtomWithIdx( idx ).GetAtomicNum()
        g.vs[ idx ][ "AtomicSymbole" ] = mol.GetAtomWithIdx( idx ).GetSymbol()
    for bd in bondidxs:
        btype = mol.GetBondBetweenAtoms( bd[0], bd[1] ).GetBondTypeAsDouble()
        g.es[ g.get_eid(bd[0], bd[1]) ][ "BondType" ] = btype
        print( bd, mol.GetBondBetweenAtoms( bd[0], bd[1] ).GetBondTypeAsDouble() )
    return g

Now test it.

# Oseltamivir ( tamiful )
mol2 = Chem.MolFromSmiles( "CCOC(=O)C1=C[C@@H](OC(CC)CC)[C@H](NC(C)=O)[C@@H](N)C1" )
g2=mol2graph(mol2)
(0, 1) 1.0
(1, 2) 1.0
(2, 3) 1.0
(3, 4) 2.0
(3, 5) 1.0
(5, 6) 2.0
(6, 7) 1.0
....

Seems work fine. ;-)
Next, get bond infromation.

for e in g2.es:
    print(e)
igraph.Edge(<igraph.Graph object at 0x10ae999a8>, 0, {'BondType': 1.0})
igraph.Edge(<igraph.Graph object at 0x10ae999a8>, 1, {'BondType': 1.0})
igraph.Edge(<igraph.Graph object at 0x10ae999a8>, 2, {'BondType': 1.0})
......

OK next, I tried to plot tamiful as Graph.
Igraph has plot function.
I added some options for plotting.

from igraph import GraphBase
layout = g2.layout_graphopt()
color_dict = {"C": "gray", "N": "blue", "O" :  "white" }
my_plot = igraph.Plot()
my_plot.add(g2,layout=layout,bbox=(400,400),
             margin=90,
             vertex_color = [color_dict[atom] for atom in g2.vs["AtomicSymbole"]],
             vertex_size = [ v.degree()*10 for v in g2.vs ])
my_plot.show()

tamiful

Ummmm @_@ not so good view.
But fun!

Convert chemical file format .

Recently I knew KCF file format. The format represents molecules as graph structure. And it is used in KEGG.
KCF uses atom label and orientation, and bond information.
RDKit or Openbabel can not convert sdf 2 kcf.

Someone who want to convert sdf to kcf. KEGG site provide us API for file format conversion.
Example is following.
I have a sdf file.

iwatobipen$ head cdk2.sdf 
ZINC03814457
                    3D
 Structure written by MMmdl.
 30 31  0  0  1  0            999 V2000
    5.4230   -0.4412    0.7616 C   0  0  0  0  0  0
    4.2434    0.3667    0.1880 C   0  0  0  0  0  0
    4.5978    0.9630   -1.1852 C   0  0  0  0  0  0
    2.9575   -0.4703    0.1074 C   0  0  0  0  0  0
    2.9988   -1.6999    0.0580 O   0  0  0  0  0  0
    1.6357    0.2975    0.0804 C   0  0  0  0  0  0

Know convert the file to kcf format using REST API.

iwatobipen$ curl -F sdffile=@cdk2.sdf http://rest.genome.jp/sdf2kcf/ > cdk2.kcf
iwatobipen$ head cdk2.kcf 
ENTRY       cdk21                       Compound
ATOM        17
            1   C1a C     5.4230   -0.4412
            2   C1c C     4.2434    0.3667
            3   C1a C     4.5978    0.9630
            4   C5a C     2.9575   -0.4703
            5   O5a O     2.9988   -1.6999
            6   C1b C     1.6357    0.2975
            7   O2a O     0.5374   -0.6063
            8   C8y C    -0.7229   -0.0532

API works fine. But if user want to convert large dataset, API is too slow.
I will try to make converter someday.

D3 based open source data visualization tool ;-)

Some days ago, I posted superset. It was amazing for me.
Superset needs database for retrieve data to make visualization. It is difficult to handle database such as Postgres, Oracle, MySQL etc. for medicinal chemist.
Today, I found very nice tool for data visualization based on D3.js.
The name is ‘raw’.
The tool can use not only web app but also your local environment. Wow it’s good news! Raw can make many visualization via web interface.
URL is following.
http://raw.densitydesign.org/
https://github.com/densitydesign/raw/

It is very easy to build local environment.
To install raw in your local PC, it is required to install bower at first. It is installed by npm.
And after installed bower, install raw is simple.

$ git clone git://github.com/densitydesign/raw.git
$ cd raw
$ bower install
# That's all ;-)
# Let's start server ( following command is for python3+ )
$ python -m http.server 4000

Now I can access http://localhost:4000 and get following image.
fig1
At first I need to paste data to input box. I used sample data for sanky diagram.

source,target1,target2,value
Barry,Elvis,Sarah,2
Frodo,Elvis,Sarah,2
Frodo,Sarah,Elvis,2
Barry,Alice,Elvis,2
Elvis,Sarah,Barry,2
Elvis,Alice,Barry,2
Sarah,Alice,Barry,4

And select some parameters for making figure. Just all, I could get sanky diagram.
fig2
Easy!!!!!

Also I can make scatter plot via very simple procedure. I used car dataset for example.
screen-shot-2016-12-14-at-22-43-34
fig4

It was amazing for me because to make visualization, it is only needed to paste data and set up some parameters through web interface. @_@
It is interesting app isn’t it ? If reader who interested in the code, please try it. ;-)

Build regression model in Keras

I introduced Keras in mishimasyk#9. And my presentation was how to build classification model in Keras.
A participant asked me that how to build regression model in Keras. I could not answer his question.
After syk#9, I searched Keras API and found good method.
Keras has Scikit-learn API. The API can build regression model. ;-)
https://keras.io/scikit-learn-api/
Example code is following.
The code is used for build QSAR model.

import numpy as np
import pandas as pd
import sys

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs

from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import cross_val_score
from sklearn.cross_validation import KFold
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

from keras.models import Sequential
from keras.layers import Activation, Dense, Dropout
from keras.wrappers.scikit_learn import KerasRegressor

def getFpArr( mols, nBits = 1024 ):
    fps = [ AllChem.GetMorganFingerprintAsBitVect( mol, 2, nBits=nBits ) for mol in mols ]
    X = []
    for fp in fps:
        arr = np.zeros( (1,) )
        DataStructs.ConvertToNumpyArray( fp, arr )
        X.append( arr )
    return X

def getResponse( mols, prop="ACTIVITY" ):
    Y = []
    for mol in mols:
        act = mol.GetProp( prop )
        act = 9. - np.log10( float( act ) )
        Y.append( act )
    return Y

def base_model():
    model = Sequential()
    model.add( Dense( input_dim=1024, output_dim = 100 ) )
    model.add( Activation( "relu" ) )
    model.add( Dense( 100 ) )
    model.add( Activation( "relu" ) )
    model.add( Dense( 1 ) )
    #model.add( Activation( 'relu' ) )
    model.compile( loss="mean_squared_error",  optimizer="adam" )
    return model


if __name__ == '__main__':
    filename = sys.argv[1]
    sdf = [ mol for mol in Chem.SDMolSupplier( filename ) ]
    X = getFpArr( sdf )
    Y = getResponse( sdf )

    trainx, testx, trainy, testy = train_test_split( X, Y, test_size=0.2, random_state=0 )
    trainx, testx, trainy, testy = np.asarray( trainx ), np.asarray( testx ), np.asarray( trainy ), np.asarray( testy )
    estimator = KerasRegressor( build_fn = base_model,
                                nb_epoch=100,
                                batch_size=20,
                                 )
    estimator.fit( trainx, trainy )
    pred_y = estimator.predict( testx )
    r2 = r2_score( testy, pred_y )
    rmse = mean_squared_error( testy, pred_y )
    print( "KERAS: R2 : {0:f}, RMSE : {1:f}".format( r2, rmse ) )

Run the code.
I used CHEMBLdatast.

mishimasyk9 iwatobipen$ python keras_regression.py sdf/CHEMBL952131_EGFR.sdf 
Using Theano backend.
Epoch 1/100
102/102 [==============================] - 0s - loss: 62.2934     
........................   
Epoch 100/100
102/102 [==============================] - 0s - loss: 0.0123     
KERAS: R2 : 0.641975, RMSE : 0.578806

R2 is 0.64. It’s not so bad. ;-)
I pushed the script to the syk 9 repository.
https://github.com/Mishima-syk/9/tree/master/iwatobipen

Cool web based data analytical platform

Yesterday, I enjoyed mishima.syk#9. ;-) I hope all participants also enjoyed the meeting.
BTY, I found cool platform for data analysis, named “Superset”.
https://github.com/airbnb/superset
You can see cool review in README.md.
If reader who want to install superset, it is very easy.
For example in MacOS. Only use pip !

# Install superset
pip install superset
# Create an admin user
fabmanager create-admin --app superset
# Initialize the database
superset db upgrade
# Load some data to play with
superset load_examples
# Create default roles and permissions
superset init
# Start the web server on port 8088
superset runserver -p 8088
# To start a development web server, use the -d switch
# superset runserver -d

Now you can access localhost:8088.

The web app can easily connect database via SQLalchemy. So, I connect local ChEMBL DB for test.
Configuration is very simple. I used pyscopg2
dbconfig
And push the test connection, I got list of tables.
dbconfig2

Next, I tried to retrieve data from chembldb and visualize it. To do it, I wrote SQL query from SQLLab.
SQLLAB
SQL Lab return the results interactively! Coool.
Test SQL is follwing.

SELECT molregno, mw_freebase, alogp, psa, ro3_pass
FROM public.compound_properties
LIMIT 500

Then made dashboard, push the visualize button and set parameters. I used bubble chart for plot alogp vs mw.
I worked fine!
bubble chart
Also users can make many type of visualization from their own dataset. Including bar chart, sankey diagram, pie chart, line chart…. ( I did not try the visualizations. )

dash board

It is surprising for me, there are many nice open source tools. Javascript and python is good partner I think. ;-D

Quantum annealing for QSAR!

In the chemoinformatics area, it is important to describe molecular similarity. Merit of fingerprint bit vector based similarity calculation is speed I think. But sometime ECFP4 or any other related methods do not sense of chemst feeling. By the way graph based similarity like a MCS is useful but calculation cost is high. You know, gWT ( graph indexing wavelet tree ) works very first for similarity search.
https://code.google.com/archive/p/gwt/
And also new version of RDKit implemented Atom-Atom Path based similarity algorithm. I’ll introduce the FP soon…

Recently I found very attractive report.

A Novel Graph-based Approach for Determining Molecular Similarity

https://arxiv.org/pdf/1601.06693v1.pdf

First author is researcher of 1QBit. 1QBit developer tools and end-to-end expertise in quantum software are built to solve the world’s most demanding computational challenges( from HP ).
http://1qbit.com/
;-) The article solve the molecular similarity based prediction used “D-Wave systems“.
D-wave system is the quantum computer based on quantum annealing (QA). QA is a metaheuristic for finding global minimum of given objective function by a process using quantum fluctuations.

They represent molecule as graph. The graph has many information, i.e. kind of atoms, bonds, formal charge, position, geometrical center of ring. In the article, the author did not describe the details of the way to molecule to graph.

Finally, they performed case study “prediction of mutagenicity”. And they compared with graph similarity based method and MACCS fingerprint based method. And graph based method showed good performance.
The results was very exiting for me. Because the result indicated that QA is useful for chemoinformatics.
I want to learn about QA more and more.
Keep watching QA!