Molecular set profiling with pandas_profiling #RDKit

Molecular descriptors are good indicators for molecular profiling. Visualizing and analyzing these descriptors is important for getting a bird’s-eye view of a given set of molecules.

I often use “pandas” and “seaborn” to do it. Seaborn is a powerful tool for making cool visualizations, but it is difficult to obtain summary statistics with it.

Yesterday, I found an interesting tool for analyzing pandas data frames, named “pandas_profiling”.
It seems to make generating an analysis report very easy. It can be installed with conda/pip.

Let’s install the package and use it!
First, import the libraries.

import os
import pandas as pd
import pandas_profiling
from rdkit import Chem
from rdkit import RDConfig
from rdkit import rdBase
from rdkit.Chem import Descriptors
from rdkit.Chem.Descriptors import _descList
from rdkit.ML.Descriptors import MoleculeDescriptors
# I used the cdk2.sdf dataset bundled with the RDKit docs as a test set.
# Note: despite its name, datadir holds the path to the SDF file itself.
datadir = os.path.join(RDConfig.RDDocsDir, "Book/data/cdk2.sdf")

Then calculate the descriptors and make a dataframe.

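# desc_name / descs_list were not defined in this copy; assumed to be built from _descList like this.
mols = [mol for mol in Chem.SDMolSupplier(datadir) if mol is not None]
desc_name = [n[0] for n in _descList]
calc = MoleculeDescriptors.MolecularDescriptorCalculator(desc_name)
descs_list = [calc.CalcDescriptors(mol) for mol in mols]
# One column per descriptor, one row per molecule.
data = {name: [] for name in desc_name}
for descs in descs_list:
    for i, desc in enumerate(descs):
        data[desc_name[i]].append(desc)
df = pd.DataFrame(data)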

Let’s make the report. It is very, very easy!!!! ;-)
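
The report code itself is missing from this copy of the post, so here is a minimal sketch of how it might look. It assumes the `ProfileReport` class and its `to_file` method from pandas_profiling; the helper name `make_report` and the output file name are hypothetical, picked only for illustration.

```python
def make_report(df, out_path="cdk2_profile.html"):
    """Write an HTML profiling report for a descriptor DataFrame.

    Sketch only: make_report and the default file name are hypothetical.
    """
    # Imported here so that defining the helper does not require the package.
    import pandas_profiling

    # ProfileReport computes per-column statistics and histograms for df.
    report = pandas_profiling.ProfileReport(df)
    # Save the report as a standalone HTML file.
    report.to_file(out_path)
    return out_path
```

With the dataframe from the previous step, calling `make_report(df)` should write the HTML report to the current directory.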


Then you get an analysis report with bar charts.
Snapshots are below.

This package provides not only a summary of the dataset but also details of the data. It seems like a very cool package, doesn’t it?
You can check the whole code at the following URL.



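From talks on X-rays and crystal structure analysis, to target finding and KNIME, how open source and companies relate to each other, AI drug discovery, the latest news on public databases, and developing an app that links devices and cameras with WebRTC! Then from a "What? I didn't know that!" topic about eel, hamo (pike conger) and anago (conger eel) to practical Python, Pop Team Epic drug discovery (lol), and the difference between sushi and sashimi!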




New modalities in Drug Discovery #diary

Here is a nice review of recent new modalities in Drug Discovery.

The article covers a wide range of recent technologies.
1. Peptide-based drug discovery, covering not only synthetic peptides but also venoms.
2. DELI.
3. New structures for drug discovery partnerships. In this section, the authors give a good account of compound sharing (such as ELF), risk sharing, collection-leasing partnerships and crowdsourcing. I am interested in compound sharing because there is a library consortium in Japan named J-CLIC (Japan Compound Library Consortium). In the consortium, pharmaceutical companies in Japan jointly purchase many compounds from suppliers, which should be cost-effective. I think this is a nice proposal in a non-competitive area.
Crowdsourcing is also interesting to me. It means open innovation!
4. Strategies for protein structure mimetics that stabilize an alpha-helix or beta-sheet. I did not know about scaffold grafting technology, and it impressed me.
5. In this section, the authors introduce 2D combinatorial libraries and Inforna. This technology is used for direct RNA targeting by small molecules. I think modulating translation with small molecules is challenging, but this approach seems well designed and seems to work well. The section also covers PROTACs, miRNA and antisense oligomers.

Figure 20 in the article is a very nice summary of the scope and limitations of these technologies.

There are many toolkits for drug discovery today, and they are not limited to small molecules. The role of the medicinal chemist is still expanding. I want to keep my eyes open and catch up with new technology and science to develop new drugs for human health.

Do rapid SAR iteration!

I am now participating in JCUP, and it is exciting for me. Thanks to growing computer performance, such as GPU computing, in silico technology has become a very powerful method in drug discovery.
The DMTA cycle is also moving to the next stage. A recent publication from Merck amazed me: they make thousands of molecules on a very small scale and assay them in the crude state.
There is already a nice review of that article, so I would like to post about another approach for rapid SAR.

Here is a report from Cyclofluidic.
Their unique feature is a closed-loop structure-activity platform, intended to revolutionise hit and lead optimisation. In the article they explore the SAR of hepsin, a membrane-anchored serine protease.
The compounds are built from three parts: acyl/sulfonyl, amino acid and guanidino. The protease catalytic domain has the catalytic triad of His, Asp and Ser residues. This indicates that the guanidino residue is necessary to keep activity.

The authors explore SAR with flow chemistry. They changed the synthetic route compared with batch synthesis and used a TMS-protected amino acid for the flow chemistry, because the free amino acid shows low solubility, which is problematic. It is a good tip for flow synthesis.

Finally they obtained highly active molecules with low toxicity. It seems to be a success story for the technology. BTW, as I wrote above, the guanidino moiety is required to keep the activity, and it has a bad effect on the ADMET profile, especially permeability.
I think the low cell toxicity comes from this low permeability. Of course this target is transmembrane, so the compounds do not need to get inside the cell. But low permeability is not a good feature for a drug (my opinion).
I am interested in the next step of this research.
I think the Cyclofluidic technology is very interesting and useful for rapid SAR.
What do readers think? ;-)


Graph convolutional neural networks (GCN) are useful for chemoinformatics because molecules can be represented as graph structures. But the GCN approach in chemoinformatics has a difficulty: unlike image data, each graph has a different structure.
There are many reports about applying GCN to chemoinformatics. Sometimes the GCN approach outperforms other methods, such as CNN with molecular fingerprints.

By the way, the authors point out several limitations of current GCNs.
– First, a basic GCN can only capture local structure information of the graph.
– Second, GCN models cannot be applied directly to graph classification because they are equivariant (rather than invariant) with respect to the node order of the graph.
– Third, GCN models have a limited ability to exploit global information for the purpose of graph classification.

They developed a novel approach, Graph Capsule Convolutional Neural Networks (GCAPS-CNN), for graph classification.
The original capsule network was proposed by Hinton’s group, and the approach solves a problem of CNNs in image classification: it can handle orientational and relative spatial relationships between small sets of data.
In GCAPS-CNN, the graph capsule function is defined with statistical moments and polynomial coefficients.

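$$ f^{(\ell)}(X, L) = \sigma\left( g\left( f^{(\ell-1)}(X, L), L \right) W^{(\ell)} \right) \qquad \text{(2)} $$
Here $L$ is the graph Laplacian and $W^{(\ell)}$ is the weight matrix of the $\ell$-th layer.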

Their idea for permutation-invariant features in the GCAPS-CNN model is to compute the covariance of the layer output f(X, L).
$$ C(f(X, L)) = \frac{1}{N} \left( f(X, L) - \mu \right)^{T} \left( f(X, L) - \mu \right) \qquad \text{(7)} $$
The merit of using this matrix is that not only is each element of the covariance matrix invariant to node order, but the matrix also holds rich information relating the nodes’ features to one another.
They can guarantee permutation invariance in the GCAPS-CNN model by using the strategy above.
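
The invariance claim can be checked with a small sketch in plain Python. The feature values are toy numbers chosen only for illustration, and the helper name `covariance` is hypothetical; this is not the authors' code.

```python
def covariance(rows):
    """C = (1/N) (F - mu)^T (F - mu) for an N x d node-feature matrix F."""
    n, d = len(rows), len(rows[0])
    # Column-wise mean over all nodes.
    mu = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mu[j] for j in range(d)] for r in rows]
    # d x d matrix: entry (a, b) sums over nodes, so node order cannot matter.
    return [[sum(c[a] * c[b] for c in centered) / n for b in range(d)]
            for a in range(d)]

# Toy node features: 3 nodes, 2 features each (illustrative values only).
features = [[1.0, 0.0], [2.0, 1.0], [4.0, 5.0]]
# The same nodes listed in a different order.
permuted = [features[2], features[0], features[1]]

c1 = covariance(features)
c2 = covariance(permuted)
# Compare with a small tolerance because float summation order differs.
same = all(abs(c1[a][b] - c2[a][b]) < 1e-12
           for a in range(2) for b in range(2))
print(same)
```

Because every entry of the matrix is a sum over all nodes, reordering the rows of the feature matrix leaves the result unchanged up to rounding.
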
Finally they defined the model and tested it on several datasets, such as COLLAB and IMDB, and GCAPS-CNN outperformed other methods.
* The graph Laplacian can be obtained from the adjacency matrix “A” (L = D − A, where D is the degree matrix), and this matrix has unique features.
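
As a small illustration of the footnote, the sketch below builds the unnormalized Laplacian L = D − A for a three-node path graph. The choice of the unnormalized form and the helper name `graph_laplacian` are my assumptions; the paper may use a normalized variant.

```python
def graph_laplacian(adj):
    """Unnormalized graph Laplacian L = D - A from an adjacency matrix."""
    n = len(adj)
    # Degree of each node = number of neighbours (row sum of A).
    degrees = [sum(row) for row in adj]
    return [[(degrees[i] if i == j else 0) - adj[i][j] for j in range(n)]
            for i in range(n)]

# Path graph 0-1-2, e.g. the carbon skeleton C-C-C of propane.
adj = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
lap = graph_laplacian(adj)
for row in lap:
    print(row)
```

Two of the "unique features" are easy to see here: the matrix is symmetric for an undirected graph, and every row sums to zero.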

There are many approaches to GCN, and the field is developing very rapidly. It is an exciting area for me, but the mathematics is difficult to follow @_@.

Think about Structure Kinetics Relationship

Here is a deep analysis of SKR from Merck.

Recently, understanding ligand-target binding kinetics has become an important factor. There are experimental tools such as SPR and ITC, and in silico methods like MD.
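
For readers new to binding kinetics, here is a minimal sketch of how the rate constants relate to affinity, using the standard relations K_D = koff / kon and residence time = 1 / koff. The numbers are hypothetical and are not taken from the article.

```python
def dissociation_constant(k_on, k_off):
    """K_D in M, from k_on in 1/(M*s) and k_off in 1/s."""
    return k_off / k_on

def residence_time(k_off):
    """Mean lifetime of the ligand-target complex in seconds."""
    return 1.0 / k_off

# Hypothetical rate constants, chosen only for illustration.
k_on = 1.0e5    # association rate constant, 1/(M*s)
k_off = 1.0e-3  # dissociation rate constant, 1/s

kd = dissociation_constant(k_on, k_off)
tau = residence_time(k_off)
print(f"K_D = {kd:.1e} M, residence time = {tau:.0f} s")
```

Two compounds with the same K_D can have very different kon and koff, which is why kinetics gives information that affinity alone does not.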

The authors analyzed kinetic data for Hsp90. They analyzed the relationship between the R-groups of some scaffolds and kon, using two types of compound sets called “cavity-varying” and “entrance-varying”.
The “cavity” is a hydrophobic region of Hsp90 and the “entrance” is hydrophilic.
It is interesting that the substituents of the “cavity-varying” set show a strong relationship between lipophilicity and kon. On the other hand, the substituents of the “entrance-varying” set show a weak correlation.

They also performed MD simulations to confirm a polar desolvation barrier. Unfortunately I am not familiar with molecular dynamics, but the simulations reveal the effect of the desolvation step in molecular binding.

The authors provide lots of data in the article. I think it is worth checking and learning from.