AMES classification with WL graph kernel #RDKit

I often feel it difficult for me to implement algorithm from zero-base… I need to more practice. ;-)
BTW, recently I can find many articles about application of graph theory for chemoinformatics.
I found some interesting articles and they provides useful packages in github!

Today, I tried a library named Grakel.
You can find original article from the URL below.
https://arxiv.org/abs/1806.02193

I used the package and compared the performance to traditional SVC. Open AMES dataset is used for following test.
My code is below. The Grakel package has many algorithms and easy to use for calculation of graph kernel. I calculated WL graph kernel with Adjacency matrix from RDKit and built predictive model. At the same time, I built tradicional SVC model with ECFP4(Morgan Finger print radi=2).

To compare the results, it is interesting for me that WL graph kernel worked well even if the kernel does not have details for the molecules such as charge, num of hydrogen etc.
Is it means that Graph based model is powerful? This is only one experience for the descriptor.
I would like to try any other dataset.

These model is based on feature of ligand and not include protein information. For the real world, drug discovery process is needed many informations not only ligands, but also proteins.

I would like to know possibility of graph based approach for chemoinformatics.

from grakel import GraphKernel
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
import numpy as np
import argparse
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split


def getparser():
    parser = argparse.ArgumentParser('argparser')
    parser.add_argument('input', help='file path and name of input')
    parser.add_argument('prop', help='properties for predict')
    return parser.parse_args()

def molg_from_smi(smiles):
    mol = Chem.MolFromSmiles(smiles)
    atom_with_idx = { i:atom.GetSymbol() for i, atom in enumerate(mol.GetAtoms())}
    adj_m = Chem.GetAdjacencyMatrix(mol, useBO=True).tolist()
    return [adj_m, atom_with_idx]

def molg_from_rdkit(mol):
    atom_with_idx = { i:atom.GetSymbol() for i, atom in enumerate(mol.GetAtoms())}
    adj_m = Chem.GetAdjacencyMatrix(mol, useBO=True).tolist()
    return [adj_m, atom_with_idx]

def mol2fp(mol):
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr


if __name__=='__main__':
    args = getparser()
    mols = [mol for mol in Chem.SDMolSupplier(args.input) if mol != None]
    X = [molg_from_rdkit(mol) for mol in mols]
    Ames_dict = {'mutagen':1, 'nonmutagen':0}
    Y = [ Ames_dict[mol.GetProp('Ames test categorisation')] for mol in mols]
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y)

    gk = GraphKernel(kernel=[{"name": "weisfeiler_lehman", "niter": 5},{"name":"subtree_wl"}], normalize=True)
    K_train = gk.fit_transform(X_train)
    K_test = gk.transform(X_test)

    gclf = SVC(kernel='precomputed')
    gclf.fit(K_train, Y_train)
    y_pred_g = gclf.predict(K_test)

    from sklearn.metrics import classification_report
    from sklearn.metrics import confusion_matrix
    rep = classification_report(Y_test, y_pred_g)
    print('WL graph kernel')
    print(confusion_matrix(Y_test, y_pred_g))
    print(rep)

    print('\n')
    print('ECFP4')

    X = [mol2fp(mol) for mol in mols]
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y)
    clf = SVC(C=20.)
    clf.fit(X_train, Y_train)
    y_pred = clf.predict(X_test)
    rep = classification_report(Y_test, y_pred)
    print(confusion_matrix(Y_test, y_pred))
    print(rep)

WL graph kernel
[[381 125]
[ 85 494]]
precision recall f1-score support

      0       0.82      0.75      0.78       506
      1       0.80      0.85      0.82       579

avg / total 0.81 0.81 0.81 1085

ECFP4
[[381 110]
[111 483]]
precision recall f1-score support

      0       0.77      0.78      0.78       491
      1       0.81      0.81      0.81       594

avg / total 0.80 0.80 0.80 1085

real 0m40.446s
user 1m49.922s
sys 0m3.074s




		
Advertisements

Molecular set profiling with pandas_profiling #RDKit

Molecular descriptors are good indicator for molecular profiling. Visualize and analyze these descriptors are important to have a bird’s-eye view of given molecules set.

I often use “pandas” and “seaborn” to do it. Seaborn is powerful tool to make cool visualization but difficult to obtain statistics data.

Yesterday, I found interesting tool to analyze pandas data frame named “pandas_profiling”.
It seems very easy to make analyze report. It can be installed with conda/pip.
https://github.com/pandas-profiling/pandas-profiling

Let’s install the package and use!
First, call library.

import os
import pandas
import pandas_profiling
import pandas as pd
from rdkit import Chem
from rdkit import RDConfig
from rdkit.Chem import rdBase 
from rdkit.Chem import Descriptors
from rdkit.Chem.Descriptors import _descList
from rdkit.ML.Descriptors import MoleculeDescriptors
# I used cdk2.sdf dataset as test.
datadir =  os.path.join( RDConfig.RDDocsDir, "Book/data/cdk2.sdf" )

Then calculate descriptors and make dataframe

data = {}
for name in desc_name:
    data[name] = []
for descs in descs_list:
    for i, desc in enumerate(descs):
        data[desc_name[i]].append(desc)
df = pd.DataFrame(data)

Let’s make report. It is very very easy!!!! ;-)

pandas_profiling.ProfileReport(df)

Then you can get analyze repot with bar chart.
Snap shots are below.

This package provides not only summary of the dataset but also details of the data. It seems very cool package isn’t it?
You can check whole code is following URL.
https://nbviewer.jupyter.org/github/iwatobipen/chemo_info/blob/master/rdkit_notebook/pandas_profiling_test.ipynb

mishima.sykに参加した話

第一回目からで通算12回目、振り返れば早5年続いているmishima.sykに参加してきました。
今回も演題が多岐にわたり、かつ、盛りだくさんでとても勉強になりました。参加された他の皆様にも何か得るものがあったことを願っております。
X線、結晶構造解析の話から、ターゲットFindingとKnimeの話、Open source/企業の関わり方、AI創薬の話題、公共DBの最新の話題、webRTCでデバイスとカメラを連動させたアプリ開発の話!うなぎーハモーアナゴのえ!それ知らないっす!って話題からの実践的なPythonの話、ポプテピピック創薬w、寿司と刺身の違い!
面白いだけでなく、先端の情報、サイエンスの情報が盛り込まれていて毎回プレゼンターすげえなって思うわけですよ。mishima.sykすごい。興味ある方は是非ともご参加くださいねw
懇親会も大変美味しい料理と、面白い話を色々と聞けて大満足の1日でした。

私のプレゼンとマテリアルは下記のURLからゲットできます。今回は強化学習の最初の話とLTでcyRESTの話をさせていただきました。LTはファイルを消してしまっていて一番の見せ場で失敗するという失態を犯しましたが、、、、
https://github.com/Mishima-syk/12

強化学習に関してはDQNが世界を沸かせたのも早昔、今では様々なアルゴリズムがどんどこでていますね。論文でてすぐ実装が上がりの繰り返し。何か応用を考えようと思っているうちに世界は先に進んでいる。エキサイティングだけど追っていくのがとても大変。というか私は追えてない。
コンピュータサイエンスはアルゴリズムの進化とマシンリソースがうまく合わさってどんどん進んでいる感じがします。そこで生まれた提案、デザインを現実のものとして生み出すProduction部分もそれについていかないと、全体としての加速は起こらないでしょう。まあAIが100発100中の回答を出せれば別ですが、まだ創薬に関して考えるとそれは難しいかなって思うわけです。これから数年先、メディスナルケミストの役割、姿ってどうなってるんでしょうね。などとあれこれ考えながら週末を終えそうです。

何はともあれ参加者、発表者のみなさまどうもありがとうございました。

New modalities in Drug Discovery #diary

Here is a nice review of recent new modalities in Drug Discovery.
https://pubs.acs.org/doi/abs/10.1021/acs.jmedchem.8b00378

The article covered wide range of recent technologies.
1. Peptide based drug discovery not only synthetic but also venoms.
2. DELI.
3. New structure for drug discovery partnerships. In the section, the authors well documented about compound sharing (i.e. ELF) and risk-sharing, collection leasing partnerships and crowd sourcing. I am interested in compound sharing. Because there is a consortium for library in Japan named J-CLIC: Japan Compound Library ConsortiumJ-CLIC. The consortium joint purchase many compounds from supplier in pharmaceutical companies in Japan. It will be cost effective. I think this is one of the nice proposal in non competitive area.
Also crowdsourcing is interesting for me. It means open innovation!
4. Strategies for protein structure mimetics, stabilize alpha-helix or beta sheet. I did not know scaffold grafting technology. The technology is impressive for me.
5. In the section, the author introduced 2D combinatorial libraries and informa. This technology is used for direct RNA targeting by small molecules. Modulation of translation with small molecules is challenging I think, but this approach seems work well and well designed. Also PROTAC and miRNA, anti sense oligomer.

Figure 20 in the article is very nice summary about scope and limitation of these technologies.

There are many toolkit for drug discovery today. It is not limited only small molecules. The role of medicinal chemist is still expanding. Keep open my eyes and catch up new technology and science to develop new drug for human health.

mol2graph and graph2mol #rdkit #igraph

I posted about mol to graph object before.
In the blog post, I wrote example that convert RDKit mol object to igraph object. It was one way. There was no method igraph to rdkit mol object.
So I wrote very simple converter from graph to molecule.

First, import modules.

import numpy as np
import pandas as pd
import igraph
from rdkit import Chem
from rdkit.Chem.rdchem import RWMol
from rdkit.Chem import Draw
from rdkit.Chem import rdmolops
from rdkit.Chem.Draw import IPythonConsole
IPythonConsole.ipython_useSVG = True

Then define two way function, mol2graph and graph2mol. It is very simple.I did not sanitize process because I could not handle some compounds. RWMol method is very useful to do this work.

def mol2graph(mol):
    atoms_info = [ (atom.GetIdx(), atom.GetAtomicNum(), atom.GetSymbol()) for atom in mol.GetAtoms()]
    bonds_info = [(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond.GetBondType(), bond.GetBondTypeAsDouble()) for bond in mol.GetBonds()]
    graph = igraph.Graph()
    for atom_info in atoms_info:
        graph.add_vertex(atom_info[0], AtomicNum=atom_info[1], AtomicSymbole=atom_info[2])
    for bond_info in bonds_info:
        graph.add_edge(bond_info[0], bond_info[1], BondType=bond_info[2], BondTypeAsDouble=bond_info[3])
    return graph

def graph2mol(graph): 
    emol = RWMol()
    for v in graph.vs():
        emol.AddAtom(Chem.Atom(v["AtomicNum"]))
    for e in graph.es():
        emol.AddBond(e.source, e.target, e['BondType'])
    mol = emol.GetMol()
    return mol

Finally, I checked my function on jupyter notebook. And It worked well.

All code is uploaded my repo and can check from following URL. https://nbviewer.jupyter.org/github/iwatobipen/chemo_info/blob/master/rdkit_notebook/mol2g_g2mol.ipynb

mol2graph and graph2mol #rdkit #igraph

I posted about mol to graph object before.
In the blog post, I wrote example that convert RDKit mol object to igraph object. It was one way. There was no method igraph to rdkit mol object.
So I wrote very simple converter from graph to molecule.

First, import modules.

import numpy as np
import pandas as pd
import igraph
from rdkit import Chem
from rdkit.Chem.rdchem import RWMol
from rdkit.Chem import Draw
from rdkit.Chem import rdmolops
from rdkit.Chem.Draw import IPythonConsole
IPythonConsole.ipython_useSVG = True

Then define two way function, mol2graph and graph2mol. It is very simple.I did not sanitize process because I could not handle some compounds. RWMol method is very useful to do this work.

def mol2graph(mol):
    atoms_info = [ (atom.GetIdx(), atom.GetAtomicNum(), atom.GetSymbol()) for atom in mol.GetAtoms()]
    bonds_info = [(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond.GetBondType(), bond.GetBondTypeAsDouble()) for bond in mol.GetBonds()]
    graph = igraph.Graph()
    for atom_info in atoms_info:
        graph.add_vertex(atom_info[0], AtomicNum=atom_info[1], AtomicSymbole=atom_info[2])
    for bond_info in bonds_info:
        graph.add_edge(bond_info[0], bond_info[1], BondType=bond_info[2], BondTypeAsDouble=bond_info[3])
    return graph

def graph2mol(graph): 
    emol = RWMol()
    for v in graph.vs():
        emol.AddAtom(Chem.Atom(v["AtomicNum"]))
    for e in graph.es():
        emol.AddBond(e.source, e.target, e['BondType'])
    mol = emol.GetMol()
    return mol

Finally, I checked my function on jupyter notebook. And It worked well.

All code is uploaded my repo and can check from following URL. https://nbviewer.jupyter.org/github/iwatobipen/chemo_info/blob/master/rdkit_notebook/mol2g_g2mol.ipynb