Look Back at 2018

I started the blog site from Aug. 2012 and changed main language from Japanese to English from 2014. After changing the language, I can find access from many foreign countries. Number of visitors are increasing year by year. In 2018, number of site visitors and page views are two times greater than last year. @_@

2018 is very nice year for me. There were many events.

First, I had opportunity to participate RDKit UGM 2018 held in Cambridge. I could enjoy not only discussion with many RDKitters and Greg who is developer of RDKit but also nice dinner and beautiful scenery of UK.
And I participated Hackathon in the UGM. It was precious time for me.
Recently there are many articles which use Deep Learning/AI for Drug Discovery. The main part of chemoinformatics engine of these articles is RDKit in most cases. And good news for RDKitter in Japan is that there are many site which describes about RDKit in Japanese. RDKit is Army Knife of Chemoinformatics.
I hope I can participate in the UGM next year. ;-)

Second, I could have many chances to present about our computational chemistry team activity at outside company. It was exciting experience for me and I could get many useful feedbacks and comments from audiences. And luckily, I got a poster award at a conference. Thankful for my team!!!
Main topics of my presentation was AI and medchem. My opinion was also posted ‘Souyaku AC 2018‘(in Japanese). The post got the highest views in this year.
Recently there are many AI drug discovery startup companies. In Japan, I often see the word ‘AI創薬’ (=AI Drug Discovery) in news paper, press releases and web page etc etc. I feel some publications overhype. Because in my opinion, most of current AI system can generate new candidate of molecules but can not design new candidate of molecules.
Finally I found a survey in JPMA site about “the application of AI in Pharmaceutical companies in Japan”.
Hmm…. It is difficult to comment in this post.
There is no doubt that AI is helpful for drug discovery research but still developing. I need to keep eyes open for the research area. And check the progress of technologies.

Third, I got Class C license of dodge ball(Ha ha ha). I played dodge ball when I was young but official rule is quite different. And the speed of ball is very fast even if the players are Elementary school students. For example, the speed of a my kid pitched ball is 60~67 km/h. He is seven years old.
The game is exiting and speedy like this….
So it is exiting but difficult to judge for me.

Etc, etc….. I could really enjoy this year. Thankful for my family, friend, team member.

Now I am studying several programming language such as ruby, haskell and quantum chemistry, mathematics and physics. I have to keep going.

I hope I can enjoy next year.

 Finally, I want to wish you and your family a beautiful and happy New Year. 


Make original sklearn classifier-2 #sklearn #chemoinfo

After posted ‘Make original sklearn classifier’, I could get comment from my follower @yamasaKit_-san and @kzfm-san. (Thanks!)

So I checked diversity of models with principal component analysis(PCA).
The example is almost same as yesterday but little bit different at last part.
Last part of my code is below. Extract feature importances from L1 layer classifiers and mono-random forest classifier. And then conduct PCA.

labels = ["rf", "et", "gbc", "xgbc", "mono_rf"]
feature_importances_list = [clf.feature_importances_ for clf in blendclf.l1_clfs_]
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
res = pca.fit_transform(feature_importances_list)

Then plot results with matplotlib. And I used adjustText library for plot labeling. adjustText is powerful package for making nice view of labeled plot.

from adjustText import adjust_text
x, y = res[:,0], res[:,1]
plt.plot(x, y, 'bo')
plt.xlim(-0.1, 0.3)
plt.ylim(-0.05, 0.1)

texts = [plt.text(x[i], y[i], '{}'.format(labels[i])) for i in range(len(labels))]

The plot indicates that two models ET, RF learned similar feature importances to mono Randomforest but XGBC and GBC learned different feature importance. So, combination of ET and RF is redundant for Ensemble learning I think.

To build good ensemble model, I need to think about combination of models and how to tune many hyper parameters.

You can check whole code from the URL below.

Make original sklearn classifier #sklearn #chemoinfo

I posted and wrote code about ‘blending’ which is one of the strategy for ensemble learning. But the code had many hard coded part so it was difficult to use in my job. In this post, I tried to make new classification class of sklearn for ensemble learning and test the code.

At first, most of part came from my old post.
To make blending classifier, the class succeed to BaseEstimator, ClassifierMixin and TransformerMixin. And defined fit, predict and predict_proba method.

Main code is below.

from sklearn.base import BaseEstimator
from sklearn.base import ClassifierMixin
from sklearn.base import TransformerMixin
from sklearn.base import clone
import numpy as np
import six
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split

class BlendingClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, l1_clfs, l2_clf, 
                 n_hold=5, test_size=0.2, verbose=0,
                 use_clones=True, random_state=794):
        self.l1_clfs = l1_clfs
        self.l2_clf = l2_clf

        self.n_hold = n_hold
        self.test_size = test_size
        self.verbose = verbose
        self.use_clones = use_clones
        self.random_state = random_state
        self.num_cls = None

    def fit(self, X, y):
        skf = StratifiedKFold(n_splits=self.n_hold, random_state=self.random_state)
        if self.use_clones:
            self.l1_clfs_ = [clone(clf) for clf in self.l1_clfs]
            self.l2_clf_ = clone(self.l2_clf)
            self.l1_clfs_ = self.l1_clfs
            self.l2_clf_ = self.l2_clf

        self.num_cls = len(set(y))
        if self.verbose > 0:
            print("Fitting {} l1_classifiers...".format(len(self.l1_clfs)))
            print("{} classes classification".format(self.num_cls))
        dataset_blend_train = np.zeros((X.shape[0],len(self.l1_clfs_), self.num_cls))

        for j, clf in enumerate(self.l1_clfs_):
            for i, (train_idx, test_idx) in enumerate(skf.split(X, y)):
                if self.verbose > 0:
                    print('{}-{}th hold, {} classifier'.format(j+1, i+1, type(clf)))
                train_i_x, train_i_y  = X[train_idx], y[train_idx]
                test_i_x, test_i_y = X[test_idx], y[test_idx]
                clf.fit(train_i_x, train_i_y)
                dataset_blend_train[test_idx, j, :] = clf.predict_proba(test_i_x)
        if self.verbose > 0:
            print('--- Blending ---')
        dataset_blend_train = dataset_blend_train.reshape((dataset_blend_train.shape[0], -1))
        self.l2_clf_.fit(dataset_blend_train, y)
        return self

    def predict(self, X):
        l1_output = np.zeros((X.shape[0], len(self.l1_clfs_), self.num_cls))
        for i, clf in enumerate(self.l1_clfs_):
            pred_y = clf.predict_proba(X)
            l1_output[:, i, :] = pred_y
        l1_output = l1_output.reshape((X.shape[0], -1))
        return self.l2_clf_.predict(l1_output)

    def predict_proba(self, X):
        l1_output = np.zeros((X.shape[0], len(self.l1_clfs_), self.num_cls))
        for i, clf in enumerate(self.l1_clfs_):
            pred_y = clf.predict_proba(X)
            l1_output[:, i, :] = pred_y
        l1_output = l1_output.reshape((X.shape[0], -1))
        return self.l2_clf_.predict_proba(l1_output)

Now ready, let’s test the code! The code works like this …

from blending_classification import BlendingClassifier

rf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
et = ExtraTreesClassifier(n_estimators=100, n_jobs=-1)
gbc = GradientBoostingClassifier(learning_rate=0.01)
xgbc = XGBClassifier(n_estimators=100, n_jobs=-1)
# To use SVC, probability option must be True
svc = SVC(probability=True, gamma='auto')
# The class is two layers blending classifier.
# layer one is set of classifier.
# layer two is final classifier which uses output of layer one.

l1_clfs = [rf, et, gbc, xgbc]
l2_clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
blendclf = BlendingClassifier(l1_clfs, l2_clf, verbose=1)

blendclf.fit(train_X, train_y)
pred_y = blendclf.predict(test_X)
print(classification_report(test_y, pred_y))
              precision    recall  f1-score   support

           0       0.76      0.79      0.78       102
           1       0.69      0.61      0.65       115
           2       0.56      0.70      0.62        40

   micro avg       0.70      0.70      0.70       257
   macro avg       0.67      0.70      0.68       257
weighted avg       0.70      0.70      0.70       257
cm = confusion_matrix(test_y, pred_y)

Next, run same task with RandomForest.

mono_rf = RandomForestClassifier(n_estimators=100, n_jobs=-1) mono_rf.fit(train_X, train_y) pred_y2 = mono_rf.predict(test_X)  print(classification_report(test_y, pred_y2))
              precision    recall  f1-score   support

           0       0.81      0.81      0.81       102
           1       0.77      0.75      0.76       115
           2       0.72      0.78      0.75        40

   micro avg       0.78      0.78      0.78       257
   macro avg       0.77      0.78      0.77       257
weighted avg       0.78      0.78      0.78       257
cm2 = confusion_matrix(test_y, pred_y2)

Hmm…… RandomForest shows better performance than blending classifier. Ensemble method seems return robust model I think but it is needed to parameter tuning. (I know it same for every machine learning method ;-) )

Any way, I wrote blending classifier class today and up load the code to my repo.
Any comments and/or suggestions are appreciated.

*All code can found following repo.

*And note book can see following URL.

Two days to 2019!

Analysis and visualize tool kit for FMO #FMO

I and my daughter got the flu last week and now we are staying in my home…
Now I read some articles and found interesting work for FMO.
URL is below.
FMO means ‘Fragment Molecular Orbital’ that is powerful method for protein-ligand interaction energy calculation. Evotec which is a drug discovery alliance and development partnership company published many attractive articles about utilization of FMO for Drug design.
Some of examples are …
FMO is useful but lack of efficient analytical tool, it is difficult to understand the results for none specialist I think.

The authors developed new analysis and visualization tool kit for FMO RbAnalysisFMO and Pymol plugin written by Ruby and Python. Both tool kit s are freely available.

This tool kit can analyze two modes, ‘All-pairs’ and ‘Selected-pairs’. All-pairs mode can analyze all pair IEIE or PIE and ‘Selected-piars’ mode can analyze IEIE or PIE with user defined amino acid residues. And also it can analyze PIEDA (pair interaction energy decomposition analysis). PIEDA decomposes total interaction energy to four components “ES/electro static”, “DS/dispersion”, “CT/charge transfer” and “EX/exchange”. It is useful to understand of details of the interaction so visualize these component is useful.

The authors showed two examples for the tool kit usage.
One is the case of BC2L-C. They conducted FMO between BC2L-C and seleno methylated fucose(SFU), and revealed new interaction that can not indicated as an interacting residue of SFU in the PDB record.

Understanding protein-ligand interaction is key for drug design so efficient analytical and visualization tools are important. The tool kit looks useful not only computational chemist but also medicinal chemist.

メドケムxAI 創薬化学者の今後は明るいのか? #souyakuAC2018

これは"創薬アドベントカレンダー2018" 20日目の記事になります。

まずは、Drug Discovery Todayから
Medicinal chemistry in drug discovery in big pharma: past, present and future

論文中のTable1”summary of the topics covered from the past to the present and then the future.”から何個か抜粋してみます。


# 合成
- ハイスキルの人材がマニュアルで実施
- 50%は派遣スタッフや委託
- 90%以上は派遣スタッフや委託

# 合成反応
- 基本的な反応セット
- 同じ反応セット+Pd反応
- 現在の反応セット+CHActivationやBioconversion

# テクノロジー
- ホットプレートで攪拌、Evap
- +マイクロ波、パラレル反応キット
- ?

# Leads
- 様々なソースから
- Role of 5によるDrug like(経口投与を意識した)な構造から
- Role of 5を超えたスペース、様々な投与経路

# Compounds design and SAR
- 論文ベース、暗黙知、経験に基づく試行錯誤
- in silicoツールを活用して効率的に進める。
- より人がやるよりコンピューターが行うようなり、データを活用した洗練された手法になる。

# What medicinal Chemists do
- ラボでの合成が主、創薬化学が少し
- 合成より、創薬化学寄りに。薬探索者のフィールドはラボからオフィスへ
- 創薬探索と真のスペシャリストは基本オフィスへ。

# Communication
- Face to face がメイン
- 電子媒体の利用が増え、会わないでも議論できるように
- SNSのようなツールでリアルタイムにどこでも議論。


 文中では”Compound design and SAR: a transition from art to a process?”というセッションで詳細が述べられております。Drug design/SARは同じデータを見ても人により異なるデザインが出てくる部分であり、想像力が求められる部分もあるかと思います。しかし、コンピューターによって効率的に行われるようになるというのはあながち遠い未来の話ではないのかもしれません。同じくArtとも捉えられてきた有機合成経路探索に関しては、すでにAIが現場に投入されていますね。

 今後、勢い衰えずComputer scienceがさらに発展し、デザインがAIによってなされるようになった時SARの案はあるものの、物作りが間に合わないといったことがありえます。
A remote-controlled adaptive medchem lab: an innovative approach to enable drug discovery in the 21st Century

Design, Synthesis, and Testing of Potent, Selective Hepsin Inhibitors via Application of an Automated Closed-Loop Optimization Platform

Abstract Image

 2つ事例を見てきましたが、冒頭の#Leadsの部分にあるように近年Beyond Ro5ということで、これまで以上に広い探索スペースから化合物を探索するようになっています。ですので化合物の物性素性もさらに多様性が増しているはずです。多様な化合物合成全てに適用可能な全自動合成マシンの登場はもう少し時間がかかりそうです、(私の予想では)。ただ、比較的汎用的な部分は自動化すぐいけるんじゃないかなと期待しています。ペプチドや核酸はもう自動合成装置がありますしね。


 病で苦しむ患者の皆様に1日も早く新薬を届ける。という目的を達成するために目指すべきはなんなんでしょう。Drug designer・Drug hunterなどと表現されることもあった創薬化学者の行うべき仕事内容は今後どうなるのでしょう。




Make interactive MMP network with Knime #Knime #chemoinformatics

I posted an example that shows making interactive scatter plot with Knime. And I would like to try MMP network with Knime. I often make network view via python package such as igraph, networkx and py2cytoscape etc. BTW, today I want to learn how to do that on knime.

Recent version of Knime is provided several JS visualization nodes. So, I used Network viewrer(Js) node. And sample data is got from my githubrepo. https://github.com/iwatobipen/mmp_example

Now ready, I made very simple MMP generation workflow with Erlwood chemoinformatics node. The details of following screen shot is … 1 load data, 2 convert smiles to mol, 3 convert mol to rdkit mol obj, 4 Conduct MMP.


And Node8 Network creator makes empty network and Node 9 object inserter binds MMP data as network. I mapped ID-L and ID-R for node paring and Set row ID as Edge ID.

Now making network is done. Then make interactive JS plot in wrapped meta node. Screen shot of expanded meta node is below.

Two openbabel nodes convert moleculeL and moleculeR to SVG image from SMILES. Then connect Network viewer node and Table View.

Finally after running the flow, I could get interactive network view and Compound pair is shown when edges are selected.

Make interactive plot with Knime #RDKit #Chemoinformatics #Knime

Dalia Goldman provided very cool presentation in RDKit UGM 2018 about Knime.

She demonstrated interactive analysis with RDKit knime node and Javascript node. I was really interested but it was difficult to build the workflow by myself at that time.

BTW, I need to learn knime for data preparation in this week. So I learned about knime and how to make interactive plot with knime.

Following sample is very simple but shows power of knime.
The example is making interactive chemical space plot with knime. All work flow is below. Version of knime is 3.7.

At frist load SDF and calculate descriptors and fingerprint with RDKit node and the split fingerprint with FingerprintExnpander. Then conduct PCA with calculated FP.
Then convert molecule to SVG with openbabel node for visualization.

Key point of this flow is wrapped metanode!
This node is constructed from two JS node ‘Scatter plot’ and ‘Card view’.

After the making metanode, I defined visual layout. The setting can call from right botton of menue bar

And I set card view option as show selected only and several scatter plot option.

Now ready. Then run the work flow! I can view selected compounds from chemical space.
Image is below.

New version of Knime is powerful tool for not only data analysis but also data visualization. ;-)