Use mmpdblib on jupyter notebook #rdkit #mmpdb #chemoinformatics

I introduced MMPDB v3 few days ago. The package is CLI tool. So user can do MMPA by typing command line. However the tool doesn’t provide API interface.

I think, if we can use MMPDB on jupyter-notebook it’ll be more useful because output data can be used directly in coding process.

MMPDB Cli interface is implemented with Click which is python command line option parser. And click has CliRunner for testing. I found that CliRunner is nice solution to do that.

Following example is my first trial to call mmpdb on jupyter notebook. And most of SQL part is bollowed from RDKit-Blog post.

At first, connect postgresql chembl30 and retrive data.

import psycopg2
from rdkit.Chem.Draw import IPythonConsole
from rdkit import rdBase
from rdkit import Chem
from rdkit.Chem import PandasTools
import requests
import pandas as pd
from mmpdblib import cli
from click.testing import CliRunner
from IPython.display import HTML
%load_ext sql

data = %sql select canonical_smiles,molregno,activity_id,standard_value,standard_units from activities \
  join assays using (assay_id) \
  join compound_structures using (molregno) \
  where tid=165 and standard_type='Ki' and standard_value is not null and standard_relation='=' \
    and canonical_smiles not like '%.%';

df = data.DataFrame()

Next, calculate pKi and rename column for next data processing. And save compound data and property data separated as CSV format. It’s worth to know that properties data should be tab separated format.

import numpy as np
def get_pki(val):
    val = float(val)+0.001
    return 9.-np.log10(val)

df['pKi_hERG'] = df.standard_value.apply(get_pki)
df.rename(columns={'molregno':'ID'}, inplace=True)

cpddf = df[['canonical_smiles','ID']].copy()
cpddf.drop_duplicates().to_csv('chembl_herg.smi', sep=' ', index=False, header=False)

propdf = df[['ID', 'pKi_hERG']].copy()
propdf.drop_duplicates().to_csv('chembl_herg_prop.csv', sep='\t', index=False, columns=['ID', 'pKi_hERG'])

OK, ready to run MMPDBlib. It is easy to use CliRunner, I show examples.

# make fragmentdb
runner = CliRunner(mix_stderr=False)
args = ("fragment", "chembl_herg.smi", "-o", "chembl_herg.fragmentdb")
runner.invoke(cli.main, args=args)

# do indexing
args = ("index", "chembl_herg.fragmentdb", "-o", "chembl_herg.mmpdb")
runner.invoke(cli.main, args=args)

# check data
!mmpdb rulecat chembl_herg.mmpdb
id	from_smiles	to_smiles
4	[*:1]c1ccc(F)cc1Cl	[*:1]c1ccc(F)cc1O
1	[*:1]c1ccc(F)cc1O	[*:1]c1ccccc1O
2	[*:1]c1ccc(F)cc1Cl	[*:1]c1ccccc1O
4183	[*:1]c1ccc(F)cc1Br	[*:1]c1ccc(F)cc1Cl
4184	[*:1]c1c(Cl)cccc1Cl	[*:1]c1ccc(F)cc1Cl

# load properties
args = ("loadprops", "-p", "chembl_herg_prop.csv", "chembl_herg.mmpdb")
runner.invoke(cli.main, args=args)

It works fine. OK let’s use generate command.

# I used droperidol as an example molecule
smi = 'C1CN(CC=C1N2C3=CC=CC=C3NC2=O)CCCC(=O)C4=CC=C(C=C4)F'
droperidol = Chem.MolFromSmiles(smi)
smi = Chem.MolToSmiles(droperidol)

args = ('generate', "--smiles", smi, "--query", "c1c(F)ccc(*)c1", "--radius", "1", "chembl_herg.mmpdb")
res = runner.invoke(cli.main, args)

Result object named ‘res’ has stdout as text format. I used StringIO and convert the data as pandas dataframe ;)

from io import StringIO
sio = StringIO()
mmpdf = pd.read_csv(sio, sep='\t', skiprows=[1,2])

start	constant	from_smiles	to_smiles	r	pseudosmiles	final	heavies_diff	#pairs	pair_from_id	pair_from_smiles	pair_to_id	pair_to_smiles
0	O=C(CCCN1CC=C(n2c(=O)[nH]c3ccccc32)CC1)c1ccc(F...	*C(=O)CCCN1CC=C(n2c(=O)[nH]c3ccccc32)CC1	[*:1]c1ccc(F)cc1	[*:1]C	1	[*:1]-[C](~*)(~*)	CC(=O)CCCN1CC=C(n2c(=O)[nH]c3ccccc32)CC1	-6	2	423277	O=C(NC1CCc2cc(CCN3CCN(c4nsc5ccccc45)CC3)ccc21)...	423269	CC(=O)NC1CCc2cc(CCN3CCN(c4nsc5ccccc45)CC3)ccc21
1	O=C(CCCN1CC=C(n2c(=O)[nH]c3ccccc32)CC1)c1ccc(F...	*C(=O)CCCN1CC=C(n2c(=O)[nH]c3ccccc32)CC1	[*:1]c1ccc(F)cc1	[*:1]C(C)C	1	[*:1]-[C](~*)(~*)	CC(C)C(=O)CCCN1CC=C(n2c(=O)[nH]c3ccccc32)CC1	-4	2	423277	O=C(NC1CCc2cc(CCN3CCN(c4nsc5ccccc45)CC3)ccc21)...	423275	CC(C)C(=O)NC1CCc2cc(CCN3CCN(c4nsc5ccccc45)CC3)...

Finally, convert smiles to rdkit molobjects with Chem.PandasTools. I used changeMoleculeRendering method because before using the method molecules in dataframe are not rendered.

PandasTools.AddMoleculeColumnToFrame(mmpdf, smilesCol="start", molCol="start")
PandasTools.AddMoleculeColumnToFrame(mmpdf, smilesCol="final", molCol="final")

from rdkit.Chem import PandasPatcher

Worked fine ;)

By using the approach, we can call mmpdb function from python code without subprocess module. I think it is usful for calling function from the code with is click based CLI tool.

I uploaded whole code of the post on my gist.

Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

Thanks for reading.

Published by iwatobipen

I'm medicinal chemist in mid size of pharmaceutical company. I love chemoinfo, cording, organic synthesis, my family.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: