2015 in review

I wish you a happy new year.

The WordPress.com stats team prepared a 2015 annual report for this blog.

Here is an excerpt.

A New York City subway train holds 1,200 people. This blog was viewed about 7,600 times in 2015. That is about 6 subway trains' worth of people.

Click here to see the complete report.

Download images from the web.

Web crawling is useful for gathering a broad range of information.
There are lots of documents about web crawling with Python.
Today I used Scrapy for crawling.
Scrapy is an open-source framework for web crawling.
The library can be installed with pip.
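For example:

iwatobipen$ pip install scrapy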
After installing Scrapy, I ran the following command.

iwatobipen$ scrapy startproject imagedl

Then the imagedl folder was created, with some files in it.

imagedl/
├── imagedl
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

Next I edited the settings.py file. This file defines the base settings of the crawler.
The ITEM_PIPELINES setting is needed for image handling, and IMAGES_STORE is the path of the download folder.

BOT_NAME = 'imagedl'
SPIDER_MODULES = ['imagedl.spiders']
NEWSPIDER_MODULE = 'imagedl.spiders'
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline':1}
IMAGES_STORE = './stock'

Next, items.py is defined.
The image_urls and images fields are the standard way to collect images with the images pipeline.

import scrapy

class ImagedlItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()

Finally, define spiders/imageget.py.
name is the name of this crawler, and start_urls is the page to crawl.
I used HtmlXPathSelector to select the image URLs in the web page.

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.selector import HtmlXPathSelector
from imagedl.items import ImagedlItem
class ImSpider( scrapy.Spider ):
    name = 'hoge'
    allowed_domains = ['test.org']
    start_urls = [ "https://www.google.co.jp/search?q=%E6%9D%B1%E4%BA%AC%E3%82%B0%E3%83%BC%E3%83%AB&source=lnms&tbm=isch&sa=X&ved=0ahUKEwjNw6yLzYPKAhXHHaYKHR7_Bv0Q_AUICCgC&biw=867&bih=602" ]
    def parse( self, response ):
        hxs = HtmlXPathSelector( response )
        item = ImagedlItem()
        image_urls = hxs.select( '//img/@src' ).extract()
        item['image_urls'] = [x for x in image_urls]
        return item
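As a side note, HtmlXPathSelector is deprecated in newer Scrapy releases. A minimal sketch of the same parse method using response.xpath (my assumption: Scrapy >= 1.0; the start URL is the same as above) would look like this:

import scrapy
from imagedl.items import ImagedlItem

class ImSpider(scrapy.Spider):
    name = 'hoge'
    allowed_domains = ['test.org']
    start_urls = ["https://www.google.co.jp/search?..."]  # same search URL as above

    def parse(self, response):
        # response.xpath returns a SelectorList; extract() yields plain strings
        item = ImagedlItem()
        item['image_urls'] = response.xpath('//img/@src').extract()
        return item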

All the settings are done.
Then run the crawler.

iwatobipen$ scrapy crawl hoge

After running the command, the stock folder appeared and the images were stored in it.
If readers are interested in what kind of images they are…, check the URL.
The images are from a Japanese anime.

Integrating RDKit and KNIME using the Python scripting node.

KNIME is a tool for building workflows.
Old versions of KNIME couldn't call RDKit from the Python scripting node directly.
The new version of KNIME and the Python scripting node can do that, which means users can build more flexible workflows. ;-)
I set up my environment and tested it.
First, I installed the new version of KNIME and the Python scripting node, and RDKit using Anaconda.
Then I set the KNIME preferences as in the following picture.
The path of conda's Python is …/Users/{username}/.pyenv/versions/anaconda-2.4.0/bin/python2.7.
Screen Shot 2015-12-29 at 9.31.06 AM

Next, I made sample workflow.
Screen Shot 2015-12-29 at 9.26.20 AM

This workflow retrieves data for a user-defined ChEMBL ID from ChEMBL DB and generates matched molecular pairs using the ErlWood Chemoinformatics node.
At the same time, the chemical sketcher node provides the user query.
This node can handle multiple molecules!
Screen Shot 2015-12-29 at 9.26.05 AM
Then the Python Script (2:1) node gets the user query molecules and the transformation rules in RXN format, and transforms the molecules based on the MMPs.
The Python snippet follows.

# Write in python scripting node
from rdkit import Chem
from rdkit.Chem import AllChem
import pandas as pd

mols = input_table_1['Molecule (RDKit Mol)']
rxns = input_table_2['Transformation']
# Too many rxns would take a long time, so only the first 100 are used.
rxns = [ AllChem.ReactionFromRxnBlock(str(rxn)) for rxn in rxns ][:100]

counter = 0
products = set()

for mol in mols:
    for rxn in rxns:
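        # add atom map 1 to the dummy atom so the reaction SMILES can be re-parsed as SMARTS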
        rxnsmi = AllChem.ReactionToSmiles(rxn).replace("*","*:1")
        reaction = AllChem.ReactionFromSmarts( rxnsmi )
        ps = reaction.RunReactants( [mol] )
        
        for y in range(len(ps)):
            for z in range(len(ps[y])):
                p = ps[y][z]
                try:
                    Chem.SanitizeMol(p)
                except:
                    pass
                products.add(Chem.MolToSmiles(p,isomericSmiles=True))
    counter += 1
#output_table = pd.DataFrame( list(products))
output_table = pd.DataFrame(list(products), columns=['smiles'])

This node handles data as pandas DataFrames.
Finally, it outputs the DataFrame of product SMILES strings.
Now I have the transformed molecules.
Screen Shot 2015-12-29 at 9.25.42 AM

There are some bugs in this node, but the combination of KNIME and RDKit will be a powerful tool for chemoinformatics.

Comparing a Chainer example with and without GPU.

A few days ago, @fmkz__-san posted about Chainer.
http://blog.kzfmix.com/entry/1450445178

I had not run the example code yet, so I tried it.
First, run without the GPU.

iwatobipen$ time python train_mnist.py
load MNIST dataset
epoch 1
graph generated
train mean loss=0.192379207825, accuracy=0.941316669857
test  mean loss=0.0953161568585, accuracy=0.968600004911
.....................
epoch 20
train mean loss=0.00968988090991, accuracy=0.997333335777
test  mean loss=0.0921416912593, accuracy=0.984500007629
save the model
save the optimizer

real	6m33.396s
user	11m35.861s
sys	0m18.857s

Next, run with the GPU.
Just add the option '--gpu=0'.

iwatobipen$ time python train_mnist.py --gpu=0
load MNIST dataset
epoch 1
graph generated
train mean loss=0.194442151414, accuracy=0.941800003027
test  mean loss=0.0869260625821, accuracy=0.972800006866
...............................
epoch 20
train mean loss=0.00370079859973, accuracy=0.998933334351
test  mean loss=0.104757102392, accuracy=0.983700006008
save the model
save the optimizer

real	2m4.095s
user	2m1.759s
sys	0m1.336s

Running with the GPU was roughly 3 times faster in wall time (about 2m04s vs. 6m33s) and 5 to 6 times faster in user (CPU) time than running without it.

Speeding up array calculations.

A colleague on the CADD team told me that, to process massive data in Python, his recommendation is NumPy.
And he showed me some very nice code. (Nice stuff.)
NumPy is the fundamental package for scientific computing with Python; it handles data as vectors and arrays.

Native Python processes a list of N items with a loop of N iterations, so the cost is O(N) interpreted steps.
NumPy operates on the whole array at once, so the user doesn't need to write the loop. ;-)
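As a toy illustration of that difference (a hypothetical example, not from the original post):

import numpy as np

a = list(range(1000000))
b = np.arange(1000000)

# Native Python: an explicit loop (list comprehension) over N elements
doubled_loop = [x * 2.5 for x in a]

# NumPy: a single vectorized operation over the whole array, no Python-level loop
doubled_vec = b * 2.5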

Some days ago, I found numexpr on PyPI.
The library can speed up calculations in some cases.
Numexpr can be installed using pip or conda.
http://www.numpy.org/
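Installation with pip looks like this:

iwatobipen$ pip install numexpr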

1st try: vector handling

import numpy as np
import numexpr as ne
a = np.arange(1e6)
b = np.arange(1e6)

#Use numpy
%timeit a*b-4.1*a > 2.5*b
#100 loops, best of 3: 8.51 ms per loop

#Use numexpr
%timeit ne.evaluate( "a*b-4.1*a > 2.5*b" )
#1000 loops, best of 3: 1.12 ms per loop

2nd try: array handling

ar1 = np.random.random((1000, 1000))
ar2 = np.random.random((1000, 1000))
#use numpy
%timeit ar1 * ar2
#100 loops, best of 3: 1.95 ms per loop

#use numexpr
%timeit ne.evaluate("ar1*ar2")
#1000 loops, best of 3: 673 µs per loop

In both examples, numexpr showed better performance.

Find MCS in R

Finding the maximum common substructure (MCS) is useful for identifying the core scaffold.
I think that finding the MCS using commercially available tools is common (Pipeline Pilot?).
I often use RDKit. ;-)
Today I found a library that searches for the MCS in R, named fmcsR.
That sounds nice, because if fmcsR works well, I'll integrate the library into Spotfire using TERR.
So, let's try it.
Installation is very easy.
Type the following command.

source("http://bioconductor.org/biocLite.R")
biocLite("fmcsR")

TIP: fmcsR depends on ChemmineR.
Then write the test code.

library(fmcsR)
data("fmcstest")
test <- fmcs(fmcstest[1], fmcstest[2], au=2,bu=1)
plotMCS(test, regenerateCoords=TRUE)

au is the upper bound for the number of atom mismatches.
bu is the upper bound for the number of bond mismatches.

Then I got the following image.
Rplot
It works fine.
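For comparison, RDKit can do the same kind of MCS search from Python. Below is a minimal sketch using rdkit.Chem.rdFMCS (the two input SMILES are hypothetical, not the fmcstest molecules):

from rdkit import Chem
from rdkit.Chem import rdFMCS

# two made-up molecules sharing a phenethyl scaffold
mols = [Chem.MolFromSmiles('c1ccccc1CCN'), Chem.MolFromSmiles('c1ccccc1CCO')]
res = rdFMCS.FindMCS(mols)
print(res.numAtoms, res.numBonds)
print(res.smartsString)  # SMARTS of the maximum common substructure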

The fmcsR library can also run batch searches.
An example follows.

> fmcsBatch(sdf[1], sdf[1:30], au=0, bu=0)
starting worker pid=2002 on localhost:11906 at 22:33:40.230
Query_Size Target_Size MCS_Size Tanimoto_Coefficient Overlap_Coefficient
CMP1 33 33 33 1.0000000 1.0000000
CMP2 33 26 11 0.2291667 0.4230769
CMP3 33 26 10 0.2040816 0.3846154
CMP4 33 32 9 0.1607143 0.2812500
CMP5 33 23 14 0.3333333 0.6086957
CMP6 33 19 13 0.3333333 0.6842105
CMP7 33 21 9 0.2000000 0.4285714
CMP8 33 31 8 0.1428571 0.2580645
CMP9 33 21 9 0.2000000 0.4285714
CMP10 33 21 8 0.1739130 0.3809524
CMP11 33 36 15 0.2777778 0.4545455
CMP12 33 26 12 0.2553191 0.4615385
CMP13 33 26 11 0.2291667 0.4230769
CMP14 33 16 12 0.3243243 0.7500000
CMP15 33 34 15 0.2884615 0.4545455
CMP16 33 25 8 0.1600000 0.3200000
CMP17 33 19 8 0.1818182 0.4210526
CMP18 33 24 10 0.2127660 0.4166667
CMP19 33 25 14 0.3181818 0.5600000
CMP20 33 26 10 0.2040816 0.3846154
CMP21 33 25 15 0.3488372 0.6000000
CMP22 33 21 11 0.2558140 0.5238095
CMP23 33 26 11 0.2291667 0.4230769
CMP24 33 17 6 0.1363636 0.3529412
CMP25 33 27 9 0.1764706 0.3333333
CMP26 33 24 13 0.2954545 0.5416667
CMP27 33 26 11 0.2291667 0.4230769
CMP28 33 20 10 0.2325581 0.5000000
CMP29 33 20 8 0.1777778 0.4000000
CMP30 33 18 7 0.1590909 0.3888889
>

This method is useful for batch search.

Unfortunately, the batch fmcs search is very slow in a Win7 32-bit environment. ;-(

mishima.syk #7

Yesterday, Mishima.syk #7 was held.
It was really impressive for me. Thanks to all participants and presenters.
I also presented a demonstration about "Chainer" and "TensorFlow" for QSAR.

My presentation was uploaded to github.
https://github.com/iwatobipen/mishimasyk/tree/master/mishimasyk7

In my opinion, using DL for QSAR is a good idea but still challenging, because there are some open problems.
1st: The kind of input data. (Are fingerprints (mostly 2D…) or descriptors suitable for DL?)
2nd: The data volume. (Is there enough training data?)
3rd: The design of the model. (How many layers, nodes, activation functions, etc. are needed to build a good model?)
I'm still thinking…. (I have no answer yet.)

Can AI play the role of a medicinal chemist?
Can AI think about "what to make next"?