# 今年を振り返ってみる

いうこと聞かないけど子供がだいぶ成長してきたり。

まあ何にせよ、自分も経験つんで、勉強しないと正確な判断できないので、どうであれ足を止めたらそこでアウトでしょう。来年も理解が遅くともぼちぼち勉強を進めようと思います。

また、今年のmishimasykも楽しい話題が盛りだくさんでした。次回はどんな話題がいいでしょうか。社外の方との勉強会はとてもいい刺激になります。

# Convert rdkit molecule object to igraph graph object.

Molecules are often handled as graph in chemoinformatics.
There are some libraries for graph analysis in python. Today, I wrote a sample script that convert from molecule to graph.
I used python-igraph and rdkit.
RDkit has method to get adjacency matrix from molecule so, I used the method.
Code is following.

```from rdkit import Chem
from rdkit.Chem import rdmolops
import igraph
import numpy as np
# mol2graph function can convert molecule to graph. And add some node and edge attribute.
def mol2graph( mol ):
bondidxs = [ ( b.GetBeginAtomIdx(),b.GetEndAtomIdx() ) for b in mol.GetBonds() ]
graph = igraph.Graph()
for idx in g.vs.indices:
g.vs[ idx ][ "AtomicNum" ] = mol.GetAtomWithIdx( idx ).GetAtomicNum()
g.vs[ idx ][ "AtomicSymbole" ] = mol.GetAtomWithIdx( idx ).GetSymbol()
for bd in bondidxs:
btype = mol.GetBondBetweenAtoms( bd[0], bd[1] ).GetBondTypeAsDouble()
g.es[ g.get_eid(bd[0], bd[1]) ][ "BondType" ] = btype
print( bd, mol.GetBondBetweenAtoms( bd[0], bd[1] ).GetBondTypeAsDouble() )
return g
```

Now test it.

```# Oseltamivir ( tamiful )
mol2 = Chem.MolFromSmiles( "CCOC(=O)C1=C[C@@H](OC(CC)CC)[C@H](NC(C)=O)[C@@H](N)C1" )
g2=mol2graph(mol2)
(0, 1) 1.0
(1, 2) 1.0
(2, 3) 1.0
(3, 4) 2.0
(3, 5) 1.0
(5, 6) 2.0
(6, 7) 1.0
....
```

Seems work fine. ;-)
Next, get bond infromation.

```for e in g2.es:
print(e)
igraph.Edge(<igraph.Graph object at 0x10ae999a8>, 0, {'BondType': 1.0})
igraph.Edge(<igraph.Graph object at 0x10ae999a8>, 1, {'BondType': 1.0})
igraph.Edge(<igraph.Graph object at 0x10ae999a8>, 2, {'BondType': 1.0})
......
```

OK next, I tried to plot tamiful as Graph.
Igraph has plot function.
I added some options for plotting.

```from igraph import GraphBase
layout = g2.layout_graphopt()
color_dict = {"C": "gray", "N": "blue", "O" :  "white" }
my_plot = igraph.Plot()
margin=90,
vertex_color = [color_dict[atom] for atom in g2.vs["AtomicSymbole"]],
vertex_size = [ v.degree()*10 for v in g2.vs ])
my_plot.show()
```

Ummmm @_@ not so good view.
But fun!

# Convert chemical file format .

Recently I knew KCF file format. The format represents molecules as graph structure. And it is used in KEGG.
KCF uses atom label and orientation, and bond information.
RDKit or Openbabel can not convert sdf 2 kcf.

Someone who want to convert sdf to kcf. KEGG site provide us API for file format conversion.
Example is following.
I have a sdf file.

```iwatobipen\$ head cdk2.sdf
ZINC03814457
3D
Structure written by MMmdl.
30 31  0  0  1  0            999 V2000
5.4230   -0.4412    0.7616 C   0  0  0  0  0  0
4.2434    0.3667    0.1880 C   0  0  0  0  0  0
4.5978    0.9630   -1.1852 C   0  0  0  0  0  0
2.9575   -0.4703    0.1074 C   0  0  0  0  0  0
2.9988   -1.6999    0.0580 O   0  0  0  0  0  0
1.6357    0.2975    0.0804 C   0  0  0  0  0  0
```

Know convert the file to kcf format using REST API.

```iwatobipen\$ curl -F sdffile=@cdk2.sdf http://rest.genome.jp/sdf2kcf/ > cdk2.kcf
```
```iwatobipen\$ head cdk2.kcf
ENTRY       cdk21                       Compound
ATOM        17
1   C1a C     5.4230   -0.4412
2   C1c C     4.2434    0.3667
3   C1a C     4.5978    0.9630
4   C5a C     2.9575   -0.4703
5   O5a O     2.9988   -1.6999
6   C1b C     1.6357    0.2975
7   O2a O     0.5374   -0.6063
8   C8y C    -0.7229   -0.0532
```

API works fine. But if user want to convert large dataset, API is too slow.
I will try to make converter someday.

# D3 based open source data visualization tool ;-)

Some days ago, I posted superset. It was amazing for me.
Superset needs database for retrieve data to make visualization. It is difficult to handle database such as Postgres, Oracle, MySQL etc. for medicinal chemist.
Today, I found very nice tool for data visualization based on D3.js.
The name is ‘raw’.
The tool can use not only web app but also your local environment. Wow it’s good news! Raw can make many visualization via web interface.
URL is following.
http://raw.densitydesign.org/
https://github.com/densitydesign/raw/

It is very easy to build local environment.
To install raw in your local PC, it is required to install bower at first. It is installed by npm.
And after installed bower, install raw is simple.

```\$ git clone git://github.com/densitydesign/raw.git
\$ cd raw
\$ bower install
# That's all ;-)
# Let's start server ( following command is for python3+ )
\$ python -m http.server 4000
```

Now I can access http://localhost:4000 and get following image.

At first I need to paste data to input box. I used sample data for sanky diagram.

```source,target1,target2,value
Barry,Elvis,Sarah,2
Frodo,Elvis,Sarah,2
Frodo,Sarah,Elvis,2
Barry,Alice,Elvis,2
Elvis,Sarah,Barry,2
Elvis,Alice,Barry,2
Sarah,Alice,Barry,4
```

And select some parameters for making figure. Just all, I could get sanky diagram.

Easy!!!!!

Also I can make scatter plot via very simple procedure. I used car dataset for example.

It was amazing for me because to make visualization, it is only needed to paste data and set up some parameters through web interface. @_@
It is interesting app isn’t it ? If reader who interested in the code, please try it. ;-)

# Build regression model in Keras

I introduced Keras in mishimasyk#9. And my presentation was how to build classification model in Keras.
A participant asked me that how to build regression model in Keras. I could not answer his question.
After syk#9, I searched Keras API and found good method.
Keras has Scikit-learn API. The API can build regression model. ;-)
https://keras.io/scikit-learn-api/
Example code is following.
The code is used for build QSAR model.

```import numpy as np
import pandas as pd
import sys

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs

from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import cross_val_score
from sklearn.cross_validation import KFold
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

from keras.models import Sequential
from keras.layers import Activation, Dense, Dropout
from keras.wrappers.scikit_learn import KerasRegressor

def getFpArr( mols, nBits = 1024 ):
fps = [ AllChem.GetMorganFingerprintAsBitVect( mol, 2, nBits=nBits ) for mol in mols ]
X = []
for fp in fps:
arr = np.zeros( (1,) )
DataStructs.ConvertToNumpyArray( fp, arr )
X.append( arr )
return X

def getResponse( mols, prop="ACTIVITY" ):
Y = []
for mol in mols:
act = mol.GetProp( prop )
act = 9. - np.log10( float( act ) )
Y.append( act )
return Y

def base_model():
model = Sequential()
model.add( Dense( input_dim=1024, output_dim = 100 ) )
return model

if __name__ == '__main__':
filename = sys.argv[1]
sdf = [ mol for mol in Chem.SDMolSupplier( filename ) ]
X = getFpArr( sdf )
Y = getResponse( sdf )

trainx, testx, trainy, testy = train_test_split( X, Y, test_size=0.2, random_state=0 )
trainx, testx, trainy, testy = np.asarray( trainx ), np.asarray( testx ), np.asarray( trainy ), np.asarray( testy )
estimator = KerasRegressor( build_fn = base_model,
nb_epoch=100,
batch_size=20,
)
estimator.fit( trainx, trainy )
pred_y = estimator.predict( testx )
r2 = r2_score( testy, pred_y )
rmse = mean_squared_error( testy, pred_y )
print( "KERAS: R2 : {0:f}, RMSE : {1:f}".format( r2, rmse ) )

```

Run the code.
I used CHEMBLdatast.

```mishimasyk9 iwatobipen\$ python keras_regression.py sdf/CHEMBL952131_EGFR.sdf
Using Theano backend.
Epoch 1/100
102/102 [==============================] - 0s - loss: 62.2934
........................
Epoch 100/100
102/102 [==============================] - 0s - loss: 0.0123
KERAS: R2 : 0.641975, RMSE : 0.578806
```

R2 is 0.64. It’s not so bad. ;-)
I pushed the script to the syk 9 repository.
https://github.com/Mishima-syk/9/tree/master/iwatobipen

# Cool web based data analytical platform

Yesterday, I enjoyed mishima.syk#9. ;-) I hope all participants also enjoyed the meeting.
BTY, I found cool platform for data analysis, named “Superset”.
https://github.com/airbnb/superset
You can see cool review in README.md.
If reader who want to install superset, it is very easy.
For example in MacOS. Only use pip !

```# Install superset
pip install superset
# Initialize the database
# Load some data to play with
# Create default roles and permissions
superset init
# Start the web server on port 8088
superset runserver -p 8088
# To start a development web server, use the -d switch
# superset runserver -d
```

Now you can access localhost:8088.

The web app can easily connect database via SQLalchemy. So, I connect local ChEMBL DB for test.
Configuration is very simple. I used pyscopg2

And push the test connection, I got list of tables.

Next, I tried to retrieve data from chembldb and visualize it. To do it, I wrote SQL query from SQLLab.

SQL Lab return the results interactively! Coool.
Test SQL is follwing.

```SELECT molregno, mw_freebase, alogp, psa, ro3_pass
FROM public.compound_properties
LIMIT 500
```

Then made dashboard, push the visualize button and set parameters. I used bubble chart for plot alogp vs mw.
I worked fine!

Also users can make many type of visualization from their own dataset. Including bar chart, sankey diagram, pie chart, line chart…. ( I did not try the visualizations. )

It is surprising for me, there are many nice open source tools. Javascript and python is good partner I think. ;-D

# Quantum annealing for QSAR!

In the chemoinformatics area, it is important to describe molecular similarity. Merit of fingerprint bit vector based similarity calculation is speed I think. But sometime ECFP4 or any other related methods do not sense of chemst feeling. By the way graph based similarity like a MCS is useful but calculation cost is high. You know, gWT ( graph indexing wavelet tree ) works very first for similarity search.
And also new version of RDKit implemented Atom-Atom Path based similarity algorithm. I’ll introduce the FP soon…

Recently I found very attractive report.

A Novel Graph-based Approach for Determining Molecular Similarity
https://arxiv.org/pdf/1601.06693v1.pdf

First author is researcher of 1QBit. 1QBit developer tools and end-to-end expertise in quantum software are built to solve the world’s most demanding computational challenges( from HP ).
http://1qbit.com/
;-) The article solve the molecular similarity based prediction used “D-Wave systems“.
D-wave system is the quantum computer based on quantum annealing (QA). QA is a metaheuristic for finding global minimum of given objective function by a process using quantum fluctuations.

They represent molecule as graph. The graph has many information, i.e. kind of atoms, bonds, formal charge, position, geometrical center of ring. In the article, the author did not describe the details of the way to molecule to graph.

Finally, they performed case study “prediction of mutagenicity”. And they compared with graph similarity based method and MACCS fingerprint based method. And graph based method showed good performance.
The results was very exiting for me. Because the result indicated that QA is useful for chemoinformatics.
I want to learn about QA more and more.
Keep watching QA!