Molecular set profiling with pandas_profiling #RDKit

Molecular descriptors are good indicators for molecular profiling. Visualizing and analyzing these descriptors is important for getting a bird’s-eye view of a given set of molecules.

I often use “pandas” and “seaborn” for this. Seaborn is a powerful tool for making cool visualizations, but it is difficult to obtain summary statistics from it.

Yesterday, I found an interesting tool for analyzing pandas data frames named “pandas_profiling”.
It seems to make generating an analysis report very easy. It can be installed with conda or pip.
https://github.com/pandas-profiling/pandas-profiling

Let’s install the package and use it!
First, import the libraries.

import os
import pandas as pd
import pandas_profiling
from rdkit import Chem
from rdkit import RDConfig
from rdkit import rdBase
from rdkit.Chem import Descriptors
from rdkit.Chem.Descriptors import _descList
from rdkit.ML.Descriptors import MoleculeDescriptors
# I used the cdk2.sdf dataset as a test set.
# Note: despite its name, datadir holds the path to the SDF file itself.
datadir = os.path.join(RDConfig.RDDocsDir, "Book/data/cdk2.sdf")

Then calculate the descriptors and build a dataframe.

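# The post does not show how desc_name and descs_list were defined;
# the next four lines are an assumed reconstruction based on the imports above.
mols = [m for m in Chem.SDMolSupplier(datadir) if m is not None]
desc_name = [name for name, _ in _descList]
calc = MoleculeDescriptors.MolecularDescriptorCalculator(desc_name)
descs_list = [calc.CalcDescriptors(mol) for mol in mols]
# Collect the values column-wise: one list per descriptor name.
data = {}
for name in desc_name:
    data[name] = []
for descs in descs_list:
    for i, desc in enumerate(descs):
        data[desc_name[i]].append(desc)
df = pd.DataFrame(data)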
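The column-wise dictionary can also be built more compactly by transposing the rows with zip. This is only a sketch: the two descriptor names and all values below are made-up stand-ins for desc_name and descs_list, not values computed from cdk2.sdf.

```python
# Toy stand-ins for desc_name and descs_list (hypothetical values,
# not real descriptors calculated from cdk2.sdf).
desc_name = ["MolWt", "MolLogP"]
descs_list = [(180.2, 1.3), (151.2, 0.5), (206.3, 3.1)]

# zip(*descs_list) transposes the rows (one tuple per molecule)
# into columns (one tuple per descriptor).
data = {name: list(column) for name, column in zip(desc_name, zip(*descs_list))}

print(data)
```

With the real lists, the resulting dictionary can be passed to pd.DataFrame in the same way, or the rows can be given directly as pd.DataFrame(descs_list, columns=desc_name).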

Let’s make the report. It is very, very easy! ;-)

pandas_profiling.ProfileReport(df)

Then you get an analysis report with bar charts.
Snapshots are below.

This package provides not only a summary of the dataset but also details of the data. It seems like a very cool package, doesn’t it?
You can check the whole code at the following URL.
https://nbviewer.jupyter.org/github/iwatobipen/chemo_info/blob/master/rdkit_notebook/pandas_profiling_test.ipynb
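To give a feel for the kind of per-column numbers such a report summarizes, here is a minimal standard-library sketch. It is not how pandas_profiling is implemented, the function name summarize_column is hypothetical, and the column values are made up.

```python
import statistics


def summarize_column(values):
    """Return basic summary statistics for one numeric column.

    None entries are treated as missing values.
    """
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "mean": statistics.mean(present),
        "std": statistics.stdev(present),  # sample standard deviation
        "min": min(present),
        "median": statistics.median(present),
        "max": max(present),
    }


# Hypothetical descriptor column with one missing value.
column = [1.0, 2.0, 2.0, None, 5.0]
summary = summarize_column(column)
for key, value in summary.items():
    print(f"{key}: {value}")
```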


Attending mishima.syk

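I attended mishima.syk, which has now been held 12 times in total since the first meeting; looking back, it has already been running for five years.
Once again the talks covered a wide range of topics and there were plenty of them, so I learned a lot. I hope the other participants also took something away.
The talks ranged from X-ray and crystal structure analysis to target finding and Knime, how open source and companies relate to each other, AI drug discovery, the latest on public databases, and developing an app that links a device and a camera via webRTC! There was also a practical Python talk that started from an eel / pike conger / conger eel topic that made me go "What? I didn't know that!", Pop Team Epic drug discovery (lol), and the difference between sushi and sashimi!
The talks are not just fun; they are packed with cutting-edge information and science, and every time I think the presenters are amazing. mishima.syk is great. If you are interested, please do join us :)
At the social gathering afterwards there was very tasty food and I heard all sorts of interesting stories, so it was a very satisfying day.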

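My presentation and materials are available from the URL below. This time I gave an introductory talk on reinforcement learning and a lightning talk (LT) on cyREST. For the LT, I had deleted a file and so committed the blunder of failing at the very highlight of the talk...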
https://github.com/Mishima-syk/12

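As for reinforcement learning, it already feels like long ago that DQN excited the world, and now all sorts of algorithms keep appearing one after another. A paper comes out and an implementation is uploaded right away, over and over. While I am still thinking about some application, the world has already moved on. It is exciting, but keeping up is very hard. Or rather, I am not keeping up.
Computer science seems to be advancing rapidly because the evolution of algorithms and machine resources fit together well. Unless the production side, which turns the resulting proposals and designs into real things, keeps pace, the process as a whole will not accelerate. Of course it would be different if AI could give answers that are right 100 times out of 100, but I think that is still difficult as far as drug discovery is concerned. I wonder what the role and shape of the medicinal chemist will be a few years from now. It looks like I will finish the weekend thinking over such things.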

In any case, thank you very much to all the participants and presenters.

New modalities in Drug Discovery #diary

Here is a nice review of recent new modalities in Drug Discovery.
https://pubs.acs.org/doi/abs/10.1021/acs.jmedchem.8b00378

The article covers a wide range of recent technologies.
1. Peptide-based drug discovery, covering not only synthetic peptides but also venoms.
2. DELI.
3. New structures for drug discovery partnerships. In this section, the authors document compound sharing (i.e. ELF), risk-sharing, collection-leasing partnerships, and crowdsourcing well. I am interested in compound sharing because there is a library consortium in Japan named J-CLIC (Japan Compound Library Consortium). Through the consortium, pharmaceutical companies in Japan jointly purchase many compounds from suppliers, which should be cost-effective. I think this is a nice proposal in a non-competitive area.
Crowdsourcing is also interesting to me. It means open innovation!
4. Strategies for protein structure mimetics, such as stabilizing an alpha-helix or beta-sheet. I did not know about scaffold grafting technology, and I find it impressive.
5. In this section, the authors introduce 2D combinatorial libraries and Inforna. This technology is used for direct RNA targeting by small molecules. I think modulating translation with small molecules is challenging, but this approach seems well designed and seems to work well. PROTACs, miRNA, and antisense oligomers are also covered.

Figure 20 in the article is a very nice summary of the scope and limitations of these technologies.

There are many toolkits for drug discovery today, and they are not limited to small molecules. The role of the medicinal chemist is still expanding. I will keep my eyes open and catch up with new technology and science to develop new drugs for human health.

Do rapid SAR iteration!

I am now attending JCUP, and it is exciting. Thanks to growing computing performance, such as GPU computing, in silico technology has become a very powerful method in drug discovery.
The DMTA cycle is also moving to the next stage. The recent publication from Merck amazed me: they make thousands of molecules on a very small scale and assay them in the crude state.
https://www.nature.com/articles/s41586-018-0056-8
There is already a nice review of that article, so I would like to post about another approach for rapid SAR.

Here is a report from Cyclofluidic.
https://pubs.acs.org/doi/10.1021/acs.jmedchem.7b01698
Their unique feature is a closed-loop structure-activity platform intended to revolutionise hit and lead optimisation. In the article, they explore the SAR of Hepsin, a membrane-anchored serine protease.
The compounds are built from three parts: acyl/sulfonyl, amino acid, and guanidino. The protease catalytic domain has a catalytic triad of His, Asp and Ser residues, which suggests that the guanidino group is necessary to keep activity.

The authors explored the SAR with flow chemistry. They changed the synthetic route compared with batch synthesis and used TMS-protected amino acids, because free amino acids show low solubility, which is problematic in flow. It is a good tip for flow synthesis.

Finally, they obtained highly active molecules with low toxicity. It seems to be a success story for the technology. BTW, as I wrote above, the guanidino moiety is required to keep the activity, and it has a bad effect on the ADMET profile, especially permeability.
I think the low cell toxicity comes from this low permeability. Of course, this target is a transmembrane protein, so the compounds do not need to get inside the cell. But low permeability is not a good feature for a drug (my opinion).
I am interested in the next step of this research.
I think the Cyclofluidic technology is very interesting and useful for rapid SAR.
What do readers think? ;-)

GCAPS-CNN

Graph convolutional neural networks (GCN) are useful for chemoinformatics because molecules can be represented as graph structures. However, the GCN approach is difficult in chemoinformatics because, unlike image data, each graph has a different structure.
There are many reports on applying GCN to chemoinformatics. Sometimes the GCN approach outperforms other methods such as CNN with molecular fingerprints.

By the way, the authors of the paper linked below point out several limitations of current GCN models.
– First, a basic GCN can only capture local structure information of the graph.
– Second, GCN models cannot be applied directly to graph classification because they are equivariant with respect to the node order of the graph.
– Third, GCN models have limited ability to exploit global information for the purpose of graph classification.

They developed a novel approach, Graph Capsule Convolutional Neural Networks (GCAPS-CNN), for graph classification.
The original capsule network was proposed by Hinton’s group, and the approach solves a problem of CNNs in image classification. It can handle orientational and relative spatial relationships within a small set of data.
In GCAPS-CNN, the graph capsule function is defined with statistical moments and polynomial coefficients.

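$$f^{(\ell)}(X, L) = \sigma\left(g\left(f^{(\ell-1)}(X, L), L\right) W^{(\ell)}\right) \qquad \text{eq. (2)}$$
Here $L$ is the graph Laplacian and $W^{(\ell)}$ is the weight of the $\ell$-th layer.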
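To make the layer equation more concrete, here is a rough pure-Python sketch of one propagation step on a tiny graph. Several things are my assumptions and not taken from the paper: the unnormalized Laplacian L = D - A, the simplification g(F, L) = L F in place of the real graph capsule function, ReLU as σ, and all feature and weight values.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def graph_laplacian(adj):
    """Unnormalized graph Laplacian L = D - A from an adjacency matrix A."""
    n = len(adj)
    return [[(sum(adj[i]) if i == j else 0) - adj[i][j] for j in range(n)]
            for i in range(n)]


def relu(m):
    """Element-wise ReLU, used here as the nonlinearity sigma."""
    return [[max(0.0, v) for v in row] for row in m]


def layer(features, lap, weight):
    """One layer sigma(g(F, L) W) with the ASSUMED choice g(F, L) = L F."""
    return relu(matmul(matmul(lap, features), weight))


# Path graph 0 - 1 - 2 with hypothetical node features X and weights W.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
X = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
W = [[1.0, -1.0],
     [0.5, 2.0]]

L = graph_laplacian(A)
H = layer(X, L, W)
print(L)
print(H)
```

Each row of L sums to zero, which is one of the characteristic properties of the unnormalized graph Laplacian.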

Their idea for permutation-invariant features in the GCAPS-CNN model is to compute the covariance of the $f(X, L)$ layer.
$$C(f(X, L)) = \frac{1}{N}\,(f(X, L) - \mu)^{T}(f(X, L) - \mu) \qquad \text{eq. (7)}$$
The merit of using this matrix is that not only is each element of the covariance matrix invariant to the node order, but the matrix also carries rich information about the relationships between the nodes’ features.
They can guarantee permutation invariance in the GCAPS-CNN model by using the strategy above.
Finally, they defined the model and tested it on several datasets such as COLLAB and IMDB, and GCAPS-CNN outperformed other methods.
* The graph Laplacian can be obtained from the adjacency matrix “A”, and these matrices have unique features.
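The permutation invariance of the covariance can be checked numerically with a small sketch. The feature matrix below is made up, and the code only illustrates the covariance formula itself, not the full model.

```python
def covariance(features):
    """C = (1/N) (F - mu)^T (F - mu) for an N x d feature matrix F (list of rows)."""
    n = len(features)
    d = len(features[0])
    # Column means: one value per feature dimension.
    mu = [sum(row[k] for row in features) / n for k in range(d)]
    centered = [[row[k] - mu[k] for k in range(d)] for row in features]
    return [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
            for i in range(d)]


# Hypothetical node features: 3 nodes, 2 features per node.
F = [[1.0, 0.0],
     [0.0, 4.0],
     [2.0, 2.0]]

# Reorder the nodes (rows) with an arbitrary permutation.
order = [2, 0, 1]
F_perm = [F[i] for i in order]

C = covariance(F)
C_perm = covariance(F_perm)

# Compare with a tolerance because the summation order differs.
same = all(abs(C[i][j] - C_perm[i][j]) < 1e-12
           for i in range(len(C)) for j in range(len(C)))
print(same)
```

Note that the covariance is a d x d matrix, so its size depends only on the number of features and not on the number of nodes N.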

https://arxiv.org/pdf/1805.08090.pdf

There are many approaches to GCN and the field is developing very rapidly. It is an exciting area for me, but the mathematics is difficult to follow @_@.

Think about Structure Kinetics Relationship

Here is a deep analysis of SKR from Merck.
https://pubs.acs.org/doi/abs/10.1021/acs.jmedchem.8b00080

Recently, understanding ligand-target binding kinetics has become increasingly important. There are tools for this such as SPR and ITC, and in silico methods like MD.
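As a reminder of why kinetics adds information beyond affinity: for a simple one-step binding model, the dissociation constant is K_D = k_off / k_on and the residence time is 1 / k_off, so two compounds with the same K_D can have very different on and off rates. A minimal sketch, with hypothetical rate constants that are not taken from the article:

```python
def dissociation_constant(k_on, k_off):
    """Equilibrium dissociation constant K_D = k_off / k_on (in M)."""
    return k_off / k_on


def residence_time(k_off):
    """Residence time tau = 1 / k_off (in s)."""
    return 1.0 / k_off


# Hypothetical rate constants for illustration only.
k_on = 1.0e6    # association rate constant, M^-1 s^-1
k_off = 1.0e-3  # dissociation rate constant, s^-1

k_d = dissociation_constant(k_on, k_off)
tau = residence_time(k_off)
print(f"K_D = {k_d:.2e} M, residence time = {tau:.0f} s")
```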

The authors analyzed kinetic data for Hsp90. They analyzed the relationship between the R-groups of several scaffolds and Kon using two types of compound sets, called “cavity-varying” and “entrance-varying”.
The “cavity” is a hydrophobic region of Hsp90 and the “entrance” is hydrophilic.
It is interesting that the substituents of the “cavity-varying” set show a strong relationship between lipophilicity and Kon. On the other hand, the substituents of the “entrance-varying” set show only a weak correlation.
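One simple way to quantify such a relationship is a Pearson correlation coefficient between a lipophilicity measure and log Kon. The sketch below is only an illustration of that calculation; the numbers are made up and are NOT data from the article, and I do not know which statistic the authors actually used.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up numbers for illustration only; NOT data from the article.
logp = [1.0, 2.0, 3.0, 4.0]      # hypothetical lipophilicity values
log_kon = [4.1, 5.0, 5.9, 7.0]   # hypothetical log10(Kon) values

r = pearson_r(logp, log_kon)
print(f"r = {r:.3f}")
```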

They also performed MD simulations to confirm a polar desolvation barrier. Unfortunately, I am not familiar with molecular dynamics, but the simulations reveal the effect of the desolvation step in molecular binding.

The authors provide a lot of data in the article. I think it is worth checking and learning from.