Rendering molecules doesn't directly contribute to drug design, but it's really important for us, because medicinal chemists have their own preferences for drawing molecules, such as font size, highlight color, etc.
It's too difficult to parameterize each person's preferences, even if we use AI ;)
A few days ago I learned that Optuna has a cool feature for implementing a learning process with a human in the loop (HITL).
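Before launching the dashboard, the study and the annotation widget have to be registered. Here is a minimal sketch, assuming optuna-dashboard's HITL widgets (the names follow the optuna-dashboard tutorial and may differ between versions):
import optuna
from optuna_dashboard import ChoiceWidget, register_objective_form_widgets, save_note

# persistent storage so the dashboard process can pick up the study
storage = optuna.storages.RDBStorage("sqlite:///db.sqlite3")
study = optuna.create_study(study_name="highlight-color", storage=storage)

# the "good / so-so / bad" annotation widget shown in the dashboard
register_objective_form_widgets(study, widgets=[
    ChoiceWidget(
        choices=["Good", "So-so", "Bad"],
        values=[-1, 0, 1],
        description="Do you like the color of the atom highlighting?",
    ),
])

# each trial suggests an RGB highlight color; a human scores it in the browser
trial = study.ask()
rgb = tuple(trial.suggest_float(c, 0.0, 1.0) for c in ("r", "g", "b"))
save_note(trial, f"highlight color: {rgb}")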
After typing optuna-dashboard, the web server launched and I could access localhost:8080.
By clicking Trials (List), I could see the following page. The page asked me whether I liked the color of the atom highlighting, and I could annotate each trial with good, so-so, or bad.
After several trials, Optuna learned that I like orange ;) This can be confirmed from the learning curve and the 2D density plot.
The x axis of the density plot is red and the y axis is blue. As you can see, I annotated orange-to-red highlight colors as good (the r value is near 1.0 and the b value is near 0).
This example doesn't have any scientific meaning, but it shows how to use Optuna in a HITL setting.
Machine learning is being democratized, and there are many great libraries available in Python, such as scikit-learn and PyCaret.
Recently we can build ML models without writing lots of code. Today I found a new package named 'falcon' which can build an optimized model in a few lines.
One of the famous packages for AutoML is auto-sklearn, which is also useful and powerful. But today I tried falcon instead of auto-sklearn.
Falcon can use Optuna for parameter optimization. I used the fully automated process in this blog post.
A solubility dataset is used for the example. To use falcon, I only need to import AutoML from falcon. It's too simple, isn't it?
After importing AutoML, just pass the data to the object. That's all. AutoML then builds a regression model; the sketch below shows the idea. Falcon can also build classification models; to do that, the task should be set to 'tabular_classification'.
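A minimal sketch of the falcon call described above; the AutoML signature follows the falcon-ml README at the time of writing, and the file and column names are placeholders:
import pandas as pd
from falcon import AutoML

# hypothetical file and column names for the solubility dataset
df = pd.read_csv("solubility.csv")

manager = AutoML(
    task="tabular_regression",   # or "tabular_classification" for classifiers
    train_data=df,
    features=["MolWt", "TPSA", "LogP"],
    target="solubility",
)
# the trained pipeline is serialized to ONNX automatically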
After the process, the model is saved in ONNX format ;)
I checked the model performance with the training and test datasets. As you can see, the accuracy on the test data is low compared to the training data.
Building an accurate model is still a challenging task for many chemoinformaticians. In some cases a huge amount of data is required to build an accurate model, but that's not always feasible :P
The balance between real and virtual data is important.
But it's fun to search for and test new ML packages. I updated my code on my gist.
The example described above reads a dataset and does some calculations on the numerical columns. This time I tried to use a chemoinformatics library with KNIME.
KNIME can read not only tabular data like CSV and SDF but also user-defined compounds. To do that, the ChemAxon MarvinSketch node is used. MarvinSketch is free for personal use.
By using the node, the user can define a query molecule with a GUI-based drawer.
After drawing a molecule and pushing the apply button, the node passes the molecule to the following nodes.
To load the Jupyter notebook, the Python Script node and the Conda Environment Propagation node are used. The Conda Environment Propagation node defines which conda env should be used in the Python Script node; the information is passed as a flow variable.
The whole workflow is shown here. I made a conda env named my_python_env which contains all the packages used in the notebook.
*The MolConverter node converts molecules to MDL format.
Let's move to the script tab of the Python Script node.
#code for python script
import knime.scripting.io as knio
# This example script simply outputs the node's input table.
notebook_directory = "/home/iwatobipen/dev/knime_dev/notebook"
# Filename of the notebook
notebook_name = "test1.ipynb"
# Load the notebook as a Python module
import knime.scripting.jupyter as knupyter
my_notebook = knupyter.load_notebook(notebook_directory, notebook_name)
# Print its textual contents
knupyter.print_notebook(notebook_directory, notebook_name)
input_table_1 = knio.input_tables[0].to_pandas()
print(input_table_1)
output_table_1 = my_notebook.compound_generation(input_table_1)
knio.output_tables[0] = knio.Table.from_pandas(output_table_1)
To use a function defined in a Jupyter notebook, import "knime.scripting.jupyter as knupyter" and define the notebook path and name.
The code of the notebook is shown here. The following code generates molecules from the user's input query molecule with CReM and returns them as a pandas dataframe.
#code from the jupyter notebook
import pandas as pd
import numpy as np
from rdkit import Chem
from crem.crem import mutate_mol
dbpath = '/home/iwatobipen/dev/sandbox/rdk/replacements02_sc2.5.db'
def smi2mol(smi):
    mol = Chem.MolFromSmiles(smi)
    Chem.KekulizeIfPossible(mol)
    return mol

def compound_generation(data_set):
    mol = Chem.MolFromMolBlock(data_set['Molecule'][0])
    mut_mol = mutate_mol(mol, dbpath)
    output = pd.DataFrame()
    molids = []
    smi_list = []
    for i, smi in enumerate(mut_mol):
        molids.append(i)
        smi_list.append(smi)
    output['molid'] = molids
    output['smiles'] = smi_list
    output['ROMol'] = output['smiles'].apply(smi2mol)
    return output
After running the workflow, I could get lots of mutated molecules.
After that, I can run additional chemoinformatics tasks with KNIME. By using the Jupyter notebook, the user can check the code interactively, which I think is useful for checking a function's behavior. KNIME can integrate Python in various ways, and it's easy to do.
Previously I wrote a blog post about 'exmol', which is a useful package for ML-related tasks. Exmol is developed by Andrew White. I installed it almost 2 years ago and didn't check for updates; recently I found that a new version of exmol was released, and the new version has new functions.
A RandomForest classifier from scikit-learn and mordred are used in the original documentation, but I used HistGradientBoostingClassifier and useful_rdkit_utils in my code, for the following reasons.
First, mordred depends on an old version of numpy, so I got an error from the np.float alias, which was removed in newer versions of numpy.
Second, I used HistGradientBoosting instead of RandomForest because some molecular features are NaN, and HistGB works even if the input has NaN. At first, import the libraries.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib as mpl
import rdkit, rdkit.Chem, rdkit.Chem.Draw
from rdkit.Chem.Draw import IPythonConsole
import numpy as np
import mordred, mordred.descriptors
from useful_rdkit_utils import RDKitDescriptors
from chembl_structure_pipeline import standardizer
import exmol
from rdkit.Chem.Draw import rdDepictor
from rdkit import Chem
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, RocCurveDisplay
from sklearn import metrics
import crem
rdDepictor.SetPreferCoordGen(True)
IPythonConsole.ipython_useSVG = True
sns.set_context("notebook")
sns.set_style(
    "dark",
    {
        "xtick.bottom": True,
        "ytick.left": True,
        "xtick.color": "#666666",
        "ytick.color": "#666666",
        "axes.edgecolor": "#666666",
        "axes.linewidth": 0.8,
        "figure.dpi": 300,
    },
)
color_cycle = ["#1BBC9B", "#F06060", "#F3B562", "#6e5687", "#5C4B51"]
mpl.rcParams["axes.prop_cycle"] = mpl.cycler(color=color_cycle)
Next, I read the dataset with pandas and calculated compound features with useful_rdkit_utils. Thanks Pat for sharing useful code!
data = pd.read_csv("BBBP.csv")
examplesmi = data.smiles[1232]
# make an object that can compute descriptors
calc = RDKitDescriptors()
# build RDKit molecules from the SMILES column
molecules = [rdkit.Chem.MolFromSmiles(smi) for smi in data.smiles]
# invalid SMILES give None, so keep only the valid molecules
valid_mol_idx = [idx for idx, m in enumerate(molecules) if m is not None]
valid_mols = [m for m in molecules if m is not None]
descs = [calc.calc_mol(molecules[idx]) for idx in valid_mol_idx]
raw_features = pd.DataFrame(descs, columns=calc.desc_names)
# boolean mask of valid molecules, used to filter the labels
valid_mol_idx = [bool(m) for m in molecules]
labels = data[valid_mol_idx].p_np
Chem.MolFromSmiles(examplesmi)
#define data normalizer
fm = raw_features.mean()
fs = raw_features.std()
def feature_convert(f):
    f -= fm
    f /= fs
    return f
features = feature_convert(raw_features)
# we have some nans in features, likely because std was 0
features = features.values.astype(float)
features_select = np.all(np.isfinite(features), axis=0)
features = features[:, features_select]
Let's train the model with the training dataset and plot a ROC curve.
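The training step itself isn't shown above, so here is a minimal sketch of what it could look like, using the imports above (the split parameters are my choice, not from the original post):
# train a HistGradientBoostingClassifier and plot the ROC curve on held-out data
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0
)
clf = HistGradientBoostingClassifier()
clf.fit(X_train, y_train)
RocCurveDisplay.from_estimator(clf, X_test, y_test)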
To use exmol, a model_eval function which returns predicted values from a list of SMILES is required, so I wrote one. With this approach, exmol can work with any kind of classifier.
def model_eval(smiles_list, _=None):
    molecules = [rdkit.Chem.MolFromSmiles(smi) for smi in smiles_list]
    molecules = [m for m in molecules if m is not None]
    # input wrangling: weird SMILES can produce weird descriptor values
    features = np.array([calc.calc_mol(m) for m in molecules])
    features = features[:, features_select]
    labels = clf.predict(features)
    return labels
OK, let's run the sample_space function with the preset options. The preset option controls the diversity of the sampled molecules, and user-defined molecules can also be used for sampling.
The preset option ranges from 'narrow' to 'wide' and controls the number of mutations used by the STONED sampler, as sketched below. Ideally, preset='wide' will generate more diverse molecules.
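A hedged sketch of the call, based on the exmol documentation (check the signature against your installed version):
# sample chemical space around the example molecule and explain the prediction
space = exmol.sample_space(examplesmi, model_eval, preset="wide", batched=True)
cfs = exmol.cf_explain(space)
exmol.plot_cf(cfs)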
The case shown above is not a good example ;P It's difficult to see the effect of the different preset options.
If you want to use your own generated molecules for sampling, preset='custom' should be used. As an example, I used molecules generated with CReM (inspired by Pat's blog post, thanks); see the sketch after the CReM code below.
from crem.crem import mutate_mol
examplemol = Chem.MolFromSmiles(examplesmi)
analog_smi_list = list(mutate_mol(examplemol, db_name='replacements02_sc2.5.db', max_size=10))
analog_mol_list = [Chem.MolFromSmiles(smi) for smi in analog_smi_list if 'I' not in smi]  # some molecules containing iodine caused errors, so I omitted them
from rdkit.Chem import Draw
Draw.MolsToGridImage(analog_mol_list[:40], molsPerRow=8)
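A hedged sketch of feeding these CReM analogs to exmol (the data keyword for preset='custom' follows the exmol docs; verify with your version):
space = exmol.sample_space(
    examplesmi,
    model_eval,
    preset="custom",
    data=[Chem.MolToSmiles(m) for m in analog_mol_list],
    batched=True,
)
cfs = exmol.cf_explain(space)
exmol.plot_cf(cfs)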
Now we can use any model and any generated molecules with exmol.
It's interesting. I would like to make KNIME nodes for these packages.
I uploaded today's code to my gist.
Golden Week is a collection of four national holidays within seven days. My kid graduated from elementary school, left his favorite dodgeball team, and joined a volleyball team. I quit coaching the dodgeball team at the same time. So I have time for coding and drinking beer again ;)
I watched the KNIME summit about the new version of KNIME. It's really cool. I'm interested in node development with Python. Previous versions let you develop your own node, but it required Java; now we can develop KNIME nodes with Python instead of Java.
I read the blog post and tried to make my own KNIME chemoinformatics node today. I made a node which standardizes molecules with chembl_structure_pipeline. This library is really useful for normalizing molecules.
Following section shows my log.
At first, I got template code(basic.zip) from here .
#config.yml
org.tutorial.first_extension: # {group_id}.{name} from the knime.yml
    src: /home/iwatobipen/dev/knime_dev/basic/tutorial_extension # Path to folder containing the extension files
    conda_env_path: /home/iwatobipen/miniconda3/envs/my_python_env # Path to the Python environment to use
    debug_mode: true # Optional line, if set to true, it will always use the latest changes of execute/configure, when that method is used within the KNIME Analytics Platform
#my_conda_env.yml
name: my_python_env
channels:
- knime
- conda-forge
dependencies:
- python=3.9
- knime-extension=4.7
- knime-python-base=4.7
- rdkit
- chembl_structure_pipeline
How to define the config.yml is well documented in the KNIME blog post.
After defining my_conda_env.yml, I made the conda env from the yml file.
$ conda env create -f my_conda_env.yml
After making the env, I wrote the code for the KNIME node. My node takes SMILES strings as input, standardizes the molecules, and generates a molecular hash as output.
import logging
import knime.extension as knext
from rdkit import Chem
from rdkit.Chem import rdMolHash
from functools import partial
from chembl_structure_pipeline import standardize_mol
from chembl_structure_pipeline import get_parent_mol

LOGGER = logging.getLogger(__name__)
#molhash = partial(rdMolHash.MolHash, rdMolHash.HashFunction.HetAtomTautomer)

@knext.node(name="chembl structure pipeline", node_type=knext.NodeType.MANIPULATOR, icon_path="demo.png", category="/")
@knext.input_table(name="SMILES column", description="read smiles")
@knext.output_table(name="Output Data", description="rdkit mol which is standardized with chembl structure pipeline")
class TemplateNode:
    """Short one-line description of the node.

    This is a sample node implemented with chembl_structure_pipeline.
    Input data should be SMILES.
    """

    column_param = knext.ColumnParameter(label="label", description="description", port_index=0)

    def std_mol(self, smiles):
        # standardize the molecule and strip salts/solvents with get_parent_mol
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            return None
        stdmol = standardize_mol(mol)
        pm, _ = get_parent_mol(stdmol)
        Chem.Kekulize(pm)
        return pm

    def get_mol_hash(self, rdmol):
        res = rdMolHash.MolHash(rdmol, rdMolHash.HashFunction.HetAtomTautomer)
        return res

    def configure(self, configure_context, input_schema_1):
        return input_schema_1.append(knext.Column(Chem.rdchem.Mol, "STD_ROMol")).append(knext.Column(knext.string(), "MolHash"))

    def execute(self, exec_context, input_1):
        input_1_pandas = input_1.to_pandas()
        input_1_pandas['STD_ROMol'] = input_1_pandas['column1'].apply(self.std_mol)
        input_1_pandas['MolHash'] = input_1_pandas['STD_ROMol'].apply(self.get_mol_hash)
        return knext.Table.from_pandas(input_1_pandas)
After writing the code, add the line "-Dknime.python.extension.config=/home/iwatobipen/dev/knime_dev/basic/config.yml" to knime.ini, which is located in the KNIME install folder.
Then, after launching KNIME, I could see my own node ;)
I made a simple workflow with the node; 'chembl structure pipeline' is the node I developed ;)
I added some SMILES with the Table Creator node.
After running the node, I could get standardized molecules as output.
The workflow doesn't just standardize molecules; it also generates the molhash, so the output looks like the table below. The Count row is the count of each molhash. You can see a count of 2 for 2-hydroxypyridine and pyridone; of course, they are tautomers.
It's interesting for me to make new nodes with Python. I think it's useful not only for coders but also for non-coders.
I enjoyed the KNIME summit last week. I learned lots of useful features of the new version of KNIME (ver. 5).
KNIME is a useful tool for chemoinformatics, but I mainly write code in Python, so I'm not a heavy KNIME user. Recently, though, I've become interested in KNIME and am learning to apply it in my projects.
As readers know, the KNIME desktop is a GUI tool for low-code work. But a workflow can also run without the GUI, from the command line.
Let’s try to do it ;)
At first, I made an example workflow to test it, shown below. The workflow reads an SDF from a local directory, removes salts, adds hydrogens, then generates 3D geometries and optimizes them with MMFF94, and finally saves the results as an SDF.
Very simple ;)
I saved the workflow as ‘batchtes’.
OK, next let's run KNIME from the command line. The command is shown here.
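A hedged example of the command (the flags follow KNIME's batch-mode documentation; the workflow path is a placeholder for wherever the workflow is saved):
$ knime -nosplash -consoleLog -application org.knime.product.KNIME_BATCH_APPLICATION -workflowDir="/path/to/knime-workspace/batchtes" -nosave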
A relative path to the workflow didn't work for me, so I passed an absolute path.
I passed the -nosave option because I would like to keep the workflow ready to run. Without the option, the workflow can't be run again after it finishes, because it remembers that the task is done.
This approach makes it easy to do routine tasks more efficiently.
This example is simple, and not so useful as a batch job by itself, but it's worth knowing how to run KNIME in batch mode ;)
Matched molecular pairs (MMPs) are an old but still useful approach, not only for retrospective analysis but also for forward compound design. As many chemoinformaticians know, mmpdb is one of the famous MMP CLI tools. I like the package because it's easy to make your own MMP db. However, mmpdb doesn't provide a visualization tool for chemists, so we need to build tools to visualize MMPs, because extracting MMP information from SMILES and float values alone is hard for many medicinal chemists ;)
Andrew Hoover's group, researchers at Merck, published a nice article with code. The article is available on ChemRxiv.
The tool, named Matcher, provides an interactive view of MMPs. The view makes it easy to understand the effect of a transformation, and the system is flexible: it can visualize MMPs from our own data.
I tried it with the test data from the original repo.
To use Matcher, docker-compose is required, so I installed it before the task. Next, get the code from the GitHub repo.
$ gh repo clone Merck/matcher
$ cd matcher
The code runs with docker-compose, so we don't need to install additional packages; the Dockerfiles install all required packages automatically. Type the following command to build the docker image and run it.
$ docker-compose up
After waiting a few minutes, I could access the Matcher top page (localhost:8000). With the default settings, the Matcher backend runs PostgreSQL with its default configuration, so port 5432 is used. If you run another PostgreSQL instance with default settings, it will conflict, so you may need to change some settings.
In my env, I run a PostgreSQL db for ChEMBL32, so I changed the port number ;)
The top page of Matcher is shown below. A query can be made in two ways: 1) specify the position you want to change (starting molecule), or 2) specify the transformation (starting molecule and product).
The following example is a starting-molecule-and-product type query of H to Me.
After running the query, I could get an interactive view of the delta-hERG change and the structures. You can find great improvements in hERG inhibition from adding a methyl without loss of target potency. It seems like a magic methyl!!!!
The next example queries only a small part of the molecule, but the results are grouped by each substructure.
As the figure above shows, morpholine improves hERG inhibition compared to many amine substituents.
These analyses provide a bird's-eye view of the effects on compound properties.
In summary, Matcher is useful for chemoinformatics tasks. I'm writing some code to make a property extractor for mmpdb and Matcher!
Recently we can use LLMs (large language models) such as ChatGPT, and they return (generate) plausible answers to questions. Lots of chatbot systems have been used in many services, but the use cases have been limited. LLMs seem to be really powerful tools, not only for general tasks but also for life science.
In the article, the authors asked ChatGPT some chemistry-related questions, listed below: 1) name to SMILES / SMILES to name, 2) predict LogP of given compounds, 3) get structural information on coordination compounds, 4) predict the water solubility of polymers, 5) determine molecular point groups.
The results showed that task 1 is difficult for ChatGPT. It depends on the amount of training data and the quality of the prompt. Name-to-structure and structure-to-name are common tasks in chemoinformatics, but it's a really specialized field, so the amount of training data is limited.
The authors describe their considerations in depth in the article. LLMs work very well on many problems, but chemistry is still a challenging area even after ChatGPT's release.
But I can't predict the near future of LLMs, whether they will understand chemistry or not. In any case, I shouldn't stop learning, not only in the in silico area but also in the wet-lab area…
Recently we can get huge numbers of molecules and use them for drug discovery tasks. That's really good news; however, we need more and more computational resources…
When I write code for chemoinformatics tasks in Python, I often use a 'for loop' approach. That's the straightforward way, but it's not very efficient. Long ago, I wrote a blog post about jug, which is a Python package for task-based parallelization.
Today, I would like to share another approach to parallelization, with ipyparallel. The original idea came from Greg's suggestion. Thanks Greg!
The following example code is a conformer generation task. I used first_5k.smi as a sample dataset.
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole
from chembl_structure_pipeline import standardizer
from rdkit import RDLogger
import time
RDLogger.DisableLog('rdApp.info')
suppl = Chem.SmilesMolSupplier('./first_5K.smi', smilesColumn=0, delimiter='\t')
mols = [m for m in suppl if m!=None]
#for convenience, I used first 1K molecules
mols = mols[:1000]
At first, I cleaned molecules with chembl_structure_pipeline.
def stdmol(mol):
    stdmol = standardizer.standardize_mol(mol)
    pmol, _ = standardizer.get_parent_mol(stdmol)
    return pmol
mols = [stdmol(m) for m in mols]
Then I defined the confgen function, which generates a 3D conformer and optimizes it with MMFF. Note the tpl argument: it is required for the parallelization, telling each engine which share of the molecules to process.
def confgen(tpl, mols=mols):
    from rdkit import Chem
    from rdkit.Chem import AllChem
    nengine, clsidx = tpl
    nwrite = 0
    for idx, mol in enumerate(mols):
        # each engine handles the molecules whose index matches its id
        if idx % nengine == clsidx:
            try:
                hmol = Chem.AddHs(mol)
                res = AllChem.EmbedMolecule(hmol)
                if res == 0:
                    AllChem.MMFFOptimizeMolecule(hmol)
                    nwrite += 1
            except:
                continue
    return nwrite
Now we're ready to run the code in parallel. I defined a function to test the parallel job.
import ipyparallel as ipp

def runtask(nengines=1):
    start = time.time()
    with ipp.Cluster(n=nengines) as rc:
        view = rc.load_balanced_view()
        asyncresult = view.map_async(confgen, [(nengines, idx) for idx in range(nengines)])
        asyncresult.wait_interactive()
        # retrieve actual results
        result = asyncresult.get()
        print(result)
    finish = time.time()
    print(finish - start)
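For example, the comparison below is how I called it (the actual timings depend on your machine):
# usage: compare a serial run with a 16-engine run
runtask(nengines=1)
runtask(nengines=16)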
Comparing the job with nengines=1 and nengines=16, the latter is faster than the former. In this small case the speed-up is not linear, but if lots of cores are available, more speed-up is expected.
In summary, ipyparallel is a really cool package for data science.
I pushed my code to my gist. Thanks for reading. Any comments and suggestions are greatly appreciated :)
Recently, I updated the version of PyTorch in my env from 1.x to 2.0.
I think it's worth updating because the original site says…
PYTORCH 2.X: FASTER, MORE PYTHONIC AND AS DYNAMIC AS EVER
Today, we announce torch.compile, a feature that pushes PyTorch performance to new heights and starts the move for parts of PyTorch from C++ back into Python. We believe that this is a substantial new direction for PyTorch – hence we call it 2.0. torch.compile is a fully additive (and optional) feature and hence 2.0 is 100% backward compatible by definition.
Underpinning torch.compile are new technologies – TorchDynamo, AOTAutograd, PrimTorch and TorchInductor.
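In practice, opting in is a one-line change; a minimal sketch (my addition, not part of the quoted announcement):
import torch
from torch import nn

model = nn.Linear(16, 1)            # any existing nn.Module
compiled = torch.compile(model)     # compilation is triggered on the first call
out = compiled(torch.randn(4, 16))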
Also, pytorch_geometric now supports PyTorch 2.0. Previously I wrote a blog post about building a QSAR model with PyG. Old versions of PyG didn't support building a graph from a molecule, so I wrote my own function to do that. But recent versions of PyG have the function!
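For example, recent PyG versions ship a molecule-to-graph helper; a minimal sketch (the helper lives in torch_geometric.utils in the versions I checked, so verify with your install):
from torch_geometric.utils import from_smiles

data = from_smiles("c1ccccc1O")  # phenol as a toy example
print(data)  # a Data object with x, edge_index, edge_attr, and the smiles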
I attended my kid's elementary school graduation ceremony today. It was a nice ceremony. …Time passes very fast ;) This season in Japan is very nice; the beautiful cherry blossoms are starting to bloom. It's a good time to start anything ;-)
I would like to say thanks to my kid.
BTW, ChEMBL ver. 32 was recently released. As you know, the ChEMBL DB has lots of biological activity data for small molecules, so it's a really useful source for multi-target predictive models.
He uses ONNX to convert a PyTorch predictive model to a TensorFlow model, because current KNIME doesn't support PyTorch. I was interested in the approach and wanted to run the workflow with ChEMBL32.
To do that, I needed to build a model with ChEMBL32. The original code had to be modified because the schema has changed.
After some trials, I could build the workflow, so I would like to share my experience.
At first, I made the dataset. The code is below. I got the data from the ChEMBL FTP site. The origin of the code is Eloy's work.
#!/usr/bin/env python
# coding: utf-8
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem import rdMolDescriptors
from sqlalchemy import create_engine
from sqlalchemy.sql import text
import pandas as pd
import numpy as np
import tables as tb
import json
from tables.atom import ObjectAtom
engine = create_engine('sqlite:///chembl_32/chembl_32_sqlite/chembl_32.db')
qtext = """
SELECT
activities.doc_id AS doc_id,
activities.standard_value AS standard_value,
molecule_hierarchy.parent_molregno AS molregno,
compound_structures.canonical_smiles AS canonical_smiles,
molecule_dictionary.chembl_id AS chembl_id,
target_dictionary.tid AS tid,
target_dictionary.chembl_id AS target_chembl_id,
protein_classification.pref_name AS pref_name,
protein_classification.short_name AS short_name,
protein_classification.PROTEIN_CLASS_DESC AS protein_class,
protein_classification.class_level AS class_level
FROM activities
JOIN assays ON activities.assay_id = assays.assay_id
JOIN target_dictionary ON assays.tid = target_dictionary.tid
JOIN target_components ON target_dictionary.tid = target_components.tid
JOIN component_class ON target_components.component_id = component_class.component_id
JOIN protein_classification ON component_class.protein_class_id = protein_classification.protein_class_id
JOIN molecule_dictionary ON activities.molregno = molecule_dictionary.molregno
JOIN molecule_hierarchy ON molecule_dictionary.molregno = molecule_hierarchy.molregno
JOIN compound_structures ON molecule_hierarchy.parent_molregno = compound_structures.molregno
WHERE activities.standard_units = 'nM' AND
activities.standard_type IN ('EC50', 'IC50', 'Ki', 'Kd', 'XC50', 'AC50', 'Potency') AND
activities.data_validity_comment IS NULL AND
activities.standard_relation IN ('=', '<') AND
activities.potential_duplicate = 0 AND assays.confidence_score >= 8 AND
target_dictionary.target_type = 'SINGLE PROTEIN'"""
with engine.begin() as conn:
    res = conn.execute(text(qtext))
    df = pd.DataFrame(res.fetchall())
    df.columns = res.keys()
df = df.where((pd.notnull(df)), None)
cls_list=df["protein_class"].to_list()
uniq = list(set(cls_list))
df = df.sort_values(by=['standard_value', 'molregno', 'tid'], ascending=True)
df = df.drop_duplicates(subset=['molregno', 'tid'], keep='first')
df.to_csv('chembl_activity_data.csv', index=False)
def set_active(row):
    active = 0
    if row['standard_value'] <= 1000:
        active = 1
    if "ion channel" in row['protein_class']:
        if row['standard_value'] <= 10000:
            active = 1
    if "kinase" in row['protein_class']:
        if row['standard_value'] > 30:
            active = 0
    if "nuclear receptor" in row['protein_class']:
        if row['standard_value'] > 100:
            active = 0
    if "membrane receptor" in row['protein_class']:
        if row['standard_value'] > 100:
            active = 0
    return active
df['active'] = df.apply(lambda row: set_active(row), axis=1)
# get targets with at least 100 different active molecules
acts = df[df['active'] == 1].groupby(['target_chembl_id']).agg('count')
acts = acts[acts['molregno'] >= 100].reset_index()['target_chembl_id']
# get targets with at least 100 different inactive molecules
inacts = df[df['active'] == 0].groupby(['target_chembl_id']).agg('count')
inacts = inacts[inacts['molregno'] >= 100].reset_index()['target_chembl_id']
# get targets mentioned in at least two docs
docs = df.drop_duplicates(subset=['doc_id', 'target_chembl_id'])
docs = docs.groupby(['target_chembl_id']).agg('count')
docs = docs[docs['doc_id'] >= 2.0].reset_index()['target_chembl_id']
t_keep = set(acts).intersection(set(inacts)).intersection(set(docs))
# get data for filtered targets
activities = df[df['target_chembl_id'].isin(t_keep)]
ion = pd.unique(activities[activities['protein_class'].str.contains("ion channel", na=False)]['tid']).shape[0]
kin = pd.unique(activities[activities['protein_class'].str.contains("kinase", na=False)]['tid']).shape[0]
nuc = pd.unique(activities[activities['protein_class'].str.contains("nuclear receptor", na=False)]['tid']).shape[0]
gpcr = pd.unique(activities[activities['protein_class'].str.contains("membrane receptor", na=False)]['tid']).shape[0]
print('Number of unique targets: ', len(t_keep))
print(' Ion channel: ', ion)
print(' Kinase: ', kin)
print(' Nuclear receptor: ', nuc)
print(' GPCR: ', gpcr)
print(' Others: ', len(t_keep) - ion - kin - nuc - gpcr)
# save it to a file
activities.to_csv('chembl_activity_data_filtered.csv', index=False)
def gen_dict(group):
    return {tid: act for tid, act in zip(group['target_chembl_id'], group['active'])}
print('MULTI TASK DATA PREP')
group = activities.groupby('chembl_id')
temp = pd.DataFrame(group.apply(gen_dict))
mt_df = pd.DataFrame(temp[0].tolist())
mt_df['chembl_id'] = temp.index
mt_df = mt_df.where((pd.notnull(mt_df)), -1)
structs = activities[['chembl_id', 'canonical_smiles']].drop_duplicates(subset='chembl_id')
print('GET MOL')
# drop mols not sanitizing on rdkit
def molchecker(smi):
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        return None
    else:
        return 1
#structs['romol'] = structs.apply(lambda row: Chem.MolFromSmiles(row['canonical_smiles']), axis=1)
structs['romol'] = structs.apply(lambda row: molchecker(row['canonical_smiles']), axis=1)
structs = structs.dropna()
del structs['romol']
# add the structures to the final df
mt_df = pd.merge(structs, mt_df, how='inner', on='chembl_id')
# save to csv
mt_df.to_csv('chembl_multi_task_data.csv', index=False)
FP_SIZE = 1024
RADIUS = 2
def calc_fp(smiles, fp_size, radius):
    """
    calcs morgan fingerprints as a numpy array.
    """
    mol = Chem.MolFromSmiles(smiles, sanitize=False)
    mol.UpdatePropertyCache(False)
    Chem.GetSSSR(mol)
    fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(mol, radius, nBits=fp_size)
    a = np.zeros((0,), dtype=np.float32)
    DataStructs.ConvertToNumpyArray(fp, a)
    return a
# calc fps
print('CALC FP')
descs = [calc_fp(smi, FP_SIZE, RADIUS) for smi in mt_df['canonical_smiles'].values]
descs = np.asarray(descs, dtype=np.float32)
# put all training data in a pytables file
print('SAVE DATA')
with tb.open_file('mt_data.h5', mode='w') as t_file:
    # set compression filter. It will make the file much smaller
    filters = tb.Filters(complib='blosc', complevel=5)
    # save chembl_ids
    tatom = ObjectAtom()
    cids = t_file.create_vlarray(t_file.root, 'chembl_ids', atom=tatom)
    for cid in mt_df['chembl_id'].values:
        cids.append(cid)
    # save fps
    fatom = tb.Atom.from_dtype(descs.dtype)
    fps = t_file.create_carray(t_file.root, 'fps', fatom, descs.shape, filters=filters)
    fps[:] = descs
    del mt_df['chembl_id']
    del mt_df['canonical_smiles']
    # save target chembl ids
    tcids = t_file.create_vlarray(t_file.root, 'target_chembl_ids', atom=tatom)
    for tcid in mt_df.columns.values:
        tcids.append(tcid)
    # save labels
    labs = t_file.create_carray(t_file.root, 'labels', fatom, mt_df.values.shape, filters=filters)
    labs[:] = mt_df.values
    # save task weights
    # each task loss will be weighted inversely proportional to its number of data points
    weights = []
    for col in mt_df.columns.values:
        c = mt_df[mt_df[col] >= 0.0].shape[0]
        weights.append(1 / c)
    weights = np.array(weights)
    ws = t_file.create_carray(t_file.root, 'weights', fatom, weights.shape)
    ws[:] = weights

with tb.open_file('mt_data.h5', mode='r') as t_file:
    print(t_file.root.chembl_ids.shape)
    print(t_file.root.target_chembl_ids.shape)
    print(t_file.root.fps.shape)
    print(t_file.root.labels.shape)
    print(t_file.root.weights.shape)
    # save targets to a json file
    with open('targets.json', 'w') as f:
        json.dump(t_file.root.target_chembl_ids[:], f)
After making the dataset, I built the model with PyTorch. The code is below.
import numpy as np
import torch
from torch import nn
import torch.nn.functional as F
import torch.utils.data as D
import tables as tb
from sklearn.metrics import (matthews_corrcoef,
                             confusion_matrix,
                             f1_score,
                             roc_auc_score,
                             accuracy_score)
# set the device to GPU if available
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
MAIN_PATH = '.'
DATA_FILE = 'mt_data.h5'
MODEL_FILE = 'chembl_mt.model'
N_WORKERS = 8 # Dataloader workers, prefetch data in parallel to have it ready for the model after each batch train
BATCH_SIZE = 32 # https://twitter.com/ylecun/status/989610208497360896?lang=es
LR = 2 # Learning rate. Big value because of the way we are weighting the targets
N_EPOCHS = 2 # You should train longer!!!
class ChEMBLDataset(D.Dataset):
    def __init__(self, file_path):
        self.file_path = file_path
        with tb.open_file(self.file_path, mode='r') as t_file:
            self.length = t_file.root.fps.shape[0]
            self.n_targets = t_file.root.labels.shape[1]

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        with tb.open_file(self.file_path, mode='r') as t_file:
            structure = t_file.root.fps[index]
            labels = t_file.root.labels[index]
        return structure, labels
dataset = ChEMBLDataset(f"{MAIN_PATH}/{DATA_FILE}")
validation_split = .2
random_seed= 42
dataset_size = len(dataset)
indices = list(range(dataset_size))
split = int(np.floor(validation_split * dataset_size))
np.random.seed(random_seed)
np.random.shuffle(indices)
train_indices, test_indices = indices[split:], indices[:split]
train_sampler = D.sampler.SubsetRandomSampler(train_indices)
test_sampler = D.sampler.SubsetRandomSampler(test_indices)
# dataloaders can prefetch the next batch if using n workers while
# the model is tranining
train_loader = torch.utils.data.DataLoader(dataset,
                                           batch_size=BATCH_SIZE,
                                           num_workers=N_WORKERS,
                                           sampler=train_sampler)
test_loader = torch.utils.data.DataLoader(dataset,
                                          batch_size=BATCH_SIZE,
                                          num_workers=N_WORKERS,
                                          sampler=test_sampler)
class ChEMBLMultiTask(nn.Module):
    """
    Architecture borrowed from: https://arxiv.org/abs/1502.02072
    """
    def __init__(self, n_tasks):
        super(ChEMBLMultiTask, self).__init__()
        self.n_tasks = n_tasks
        self.fc1 = nn.Linear(1024, 2000)
        self.fc2 = nn.Linear(2000, 100)
        self.dropout = nn.Dropout(0.25)
        # add an independent output for each task in the output layer
        for n_m in range(self.n_tasks):
            self.add_module(f"y{n_m}o", nn.Linear(100, 1))

    def forward(self, x):
        h1 = self.dropout(F.relu(self.fc1(x)))
        h2 = F.relu(self.fc2(h1))
        out = [torch.sigmoid(getattr(self, f"y{n_m}o")(h2)) for n_m in range(self.n_tasks)]
        return out
# create the model, to GPU if available
model = ChEMBLMultiTask(dataset.n_targets).to(device)
# binary cross entropy
# each task loss is weighted inversely proportional to its number of datapoints, borrowed from:
# http://www.bioinf.at/publications/2014/NIPS2014a.pdf
with tb.open_file(f"{MAIN_PATH}/{DATA_FILE}", mode='r') as t_file:
    weights = torch.tensor(t_file.root.weights[:])
    weights = weights.to(device)
criterion = [nn.BCELoss(weight=w) for x, w in zip(range(dataset.n_targets), weights.float())]
# stochastic gradient descend as an optimiser
optimizer = torch.optim.SGD(model.parameters(), LR)
# model is by default in train mode. Training can be resumed after .eval() but needs to be set to .train() again
model.train()
for ep in range(N_EPOCHS):
    for i, (fps, labels) in enumerate(train_loader):
        # move it to GPU if available
        fps, labels = fps.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(fps)
        # calc the loss
        loss = torch.tensor(0.0).to(device)
        for j, crit in enumerate(criterion):
            # mask keeping labeled molecules for each task
            mask = labels[:, j] >= 0.0
            if len(labels[:, j][mask]) != 0:
                # the loss is the sum of each task/target loss.
                # there are labeled samples for this task, so we add its loss
                loss += crit(outputs[j][mask], labels[:, j][mask].view(-1, 1))
        loss.backward()
        optimizer.step()
        if (i + 1) % 500 == 0:
            print(f"Epoch: [{ep+1}/{N_EPOCHS}], Step: [{i+1}/{len(train_indices)//BATCH_SIZE}], Loss: {loss.item()}")
y_trues = []
y_preds = []
y_preds_proba = []
# do not track history
with torch.no_grad():
    for fps, labels in test_loader:
        # move it to GPU if available
        fps, labels = fps.to(device), labels.to(device)
        # set model to eval, so it will not use the dropout layer
        model.eval()
        outputs = model(fps)
        for j, out in enumerate(outputs):
            mask = labels[:, j] >= 0.0
            mask = mask.to(device)
            y_pred = torch.where(out[mask].to(device) > 0.5, torch.ones(1).to(device), torch.zeros(1).to(device)).view(1, -1)
            if y_pred.shape[1] > 0:
                for l in labels[:, j][mask].long().tolist():
                    y_trues.append(l)
                for p in y_pred.view(-1, 1).tolist():
                    y_preds.append(int(p[0]))
                for p in out[mask].view(-1, 1).tolist():
                    y_preds_proba.append(float(p[0]))
tn, fp, fn, tp = confusion_matrix(y_trues, y_preds).ravel()
sens = tp / (tp + fn)
spec = tn / (tn + fp)
prec = tp / (tp + fp)
f1 = f1_score(y_trues, y_preds)
acc = accuracy_score(y_trues, y_preds)
mcc = matthews_corrcoef(y_trues, y_preds)
auc = roc_auc_score(y_trues, y_preds_proba)
print(f"accuracy: {acc}, auc: {auc}, sens: {sens}, spec: {spec}, prec: {prec}, mcc: {mcc}, f1: {f1}")
print(f"Not bad for only {N_EPOCHS} epochs!")
torch.save(model.state_dict(), f"./{MODEL_FILE}")
Since I ran the code on my personal machine, I set the number of epochs to 2, but it's better to train for more epochs in production.
After the training, I converted the model to ONNX format. The code is…
import numpy as np
import torch
from torch import nn
import torch.nn.functional as F
import torch.utils.data as D
import tables as tb
from sklearn.metrics import (matthews_corrcoef,
                             confusion_matrix,
                             f1_score,
                             roc_auc_score,
                             accuracy_score)
from torch import onnx
# set the device to GPU if available
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
MAIN_PATH = '.'
DATA_FILE = 'mt_data.h5'
MODEL_FILE = 'chembl_mt.model'
N_WORKERS = 8 # Dataloader workers, prefetch data in parallel to have it ready for the model after each batch train
BATCH_SIZE = 32 # https://twitter.com/ylecun/status/989610208497360896?lang=es
LR = 2 # Learning rate. Big value because of the way we are weighting the targets
N_EPOCHS = 2 # You should train longer!!!
class ChEMBLDataset(D.Dataset):
    def __init__(self, file_path):
        self.file_path = file_path
        with tb.open_file(self.file_path, mode='r') as t_file:
            self.length = t_file.root.fps.shape[0]
            self.n_targets = t_file.root.labels.shape[1]

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        with tb.open_file(self.file_path, mode='r') as t_file:
            structure = t_file.root.fps[index]
            labels = t_file.root.labels[index]
        return structure, labels
class ChEMBLMultiTask(nn.Module):
    """
    Architecture borrowed from: https://arxiv.org/abs/1502.02072
    """
    def __init__(self, n_tasks):
        super(ChEMBLMultiTask, self).__init__()
        self.n_tasks = n_tasks
        self.fc1 = nn.Linear(1024, 2000)
        self.fc2 = nn.Linear(2000, 100)
        self.dropout = nn.Dropout(0.25)
        # add an independent output for each task in the output layer
        for n_m in range(self.n_tasks):
            self.add_module(f"y{n_m}o", nn.Linear(100, 1))

    def forward(self, x):
        h1 = self.dropout(F.relu(self.fc1(x)))
        h2 = F.relu(self.fc2(h1))
        out = [torch.sigmoid(getattr(self, f"y{n_m}o")(h2)) for n_m in range(self.n_tasks)]
        return out
dataset = ChEMBLDataset(f"{MAIN_PATH}/{DATA_FILE}")
validation_split = .2
random_seed= 42
dataset_size = len(dataset)
model = ChEMBLMultiTask(dataset.n_targets).to(device)
# load the trained weights before exporting; without this line the export
# would contain randomly initialized weights
model.load_state_dict(torch.load(f"{MAIN_PATH}/{MODEL_FILE}", map_location=device))
path = './model_onnx.onnx'
dummy = torch.tensor([[0.5 for _ in range(1024)]],
                     dtype=torch.float32).to(device)
onnx.export(model, dummy, path, input_names=['input_1'], output_names=['output'])
Now I have an ONNX model for ChEMBL target prediction, so I tried to run the workflow with the model. But it didn't work, so I used the TensorFlow model directly in the KNIME workflow.
At first, I converted the ONNX model to a TensorFlow model with onnx-tf.
$ onnx-tf convert -i model_onnx.onnx -o model_tf
Now we're ready to run the prediction ;)
I built the workflow (most of it came from Greg's great work!).
By running the WF, I could get a heat map of the target predictions.
The deep learning environment for KNIME is shown below.
It has been a long time since my last post ;) Now we are in February… I hope every reader had a nice start to the year.
I'm enjoying my days, but I'm really busy.
This is the first post of the year!
Thinking about the strength of hydrogen-bond donors and acceptors is important for drug design but difficult to calculate. Recently I found a useful package for it, so I would like to share the package.
The package is named Jazzy!!!!!! The code is disclosed in an AZ repo and is easy to use. After installing the package, you can calculate the hydrogen-bond donor and acceptor strengths of a molecule and visualize them.
OK let’s write code.
Install jazzy with pip command.
$ pip install jazzy
jazzy depends on some Python packages: click, kallisto, numpy, and rdkit. These packages are installed automatically when pip install jazzy is called.
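Here is a minimal sketch of the API, following the Jazzy documentation (the function name comes from the docs and may differ between versions):
from jazzy.api import molecular_vector_from_smiles

# hydrogen-bond donor/acceptor strengths plus free-energy terms for one molecule
vector = molecular_vector_from_smiles("c1ccc(O)cc1")  # phenol as a toy example
print(vector)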
As you can see, jazzy provides useful methods for calculating compound properties. These parameters are useful not only for analyzing compound properties but also as descriptors for machine learning.
If readers are interested in the package, please try it and check the official documentation.
It has been almost 3 years since I changed jobs. Fortunately, I'm still working in the early drug discovery field as a chemoinformatician ;)
As many drug hunters know, bioisosteric replacement is one of the major ways to design novel molecules from known biologically active molecules.
Transformations from a carboxylic acid to a tetrazole, or from an amide to an oxazole, are major examples of bioisosteric replacement.
Are these pairs, "carboxylic acid & tetrazole" and "amide & oxazole", similar? And are there any metrics to compare their similarity other than biological activities?
Espsim is an interesting OSS package for comparing molecules by their electrostatic potentials.
I tried to compare the following tetrazole and carboxylic acid pair, and the amide and oxazole pair.
To clarify the similarity, I used not only the neutral state of the carboxylic acid but also the ionized state.
Let's compare the carboxylic acid and the tetrazole with Morgan FP and ESPsim; a sketch of the comparison is below. As you can see, the tetrazole and the carboxylic acid show low 2D fingerprint similarity but high ESP and shape similarity.
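A hedged sketch of the comparison (EmbedAlignScore and the fingerprint part follow the espsim README and RDKit; the exact return shape may differ by version):
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs
from espsim import EmbedAlignScore

acid = Chem.AddHs(Chem.MolFromSmiles("OC(=O)c1ccccc1"))
tetrazole = Chem.AddHs(Chem.MolFromSmiles("c1ccccc1-c1nnn[nH]1"))

# embed 3D conformers, align them, and score shape and ESP similarity
simShape, simEsp = EmbedAlignScore(acid, tetrazole)

# 2D fingerprint similarity for comparison
fp1 = AllChem.GetMorganFingerprintAsBitVect(acid, 2)
fp2 = AllChem.GetMorganFingerprintAsBitVect(tetrazole, 2)
print(DataStructs.TanimotoSimilarity(fp1, fp2), simShape, simEsp)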
Next, let's compare the carboxylate and the carboxylic acid methyl ester.
Interestingly, the ESPsim between the carboxylate and the methyl ester is very low compared to their shape similarity.
How about the similarity between the carboxylic acid and the methyl ester? Ah! These molecules show high similarity not only in shape but also in electrostatic potential.
Next, let's look at the amide and the oxazole.
The result is the same as in the carboxylate case: the pair shows low 2D similarity but high shape and ESP similarity.
From these experiments, ESPsim looks like an interesting tool for predicting bioisosteric fragments. I think it's worth using espsim as a tool for drug design.
This is the last chemoinformatics blog post of this year. Thanks for visiting and reading my blog. I hope you and your family have a great rest of the year and a happy new year ;)
I hope all readers are enjoying chemoinformatics ;) Unfortunately I've been struggling with making PPTX files this year, so I haven't had enough time for coding. But it's the same for medicinal chemists, because they often make presentation slides for the project team, their boss, seniors, etc.…
They transfer SAR data from spreadsheets to PPTX. It sometimes looks like a time-consuming task to me.
So I would like to write code for an automatic PPTX generator in Python ;)
As you know, there is useful prior art on this topic. The URLs are listed below.
The first one uses RDKit and the second one uses OETK.
In this post, I used RDKit, and most of the code is borrowed from OE's example code.
Here is my example code. At first, import the libraries required to build a PPTX and handle chemical structures.
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import rdDepictor
from rdkit.Chem.Draw import rdMolDraw2D
from io import BytesIO, StringIO
from pptx import Presentation
from pptx.util import Inches
# read example molecules
mols = [m for m in Chem.SDMolSupplier('./cdk2.sdf')]
for m in mols:
AllChem.Compute2DCoords(m)
Then define a function to draw a molecule and build the PPTX. I used BytesIO because with BytesIO I don't need to create a temporary image file.
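A hedged sketch of that step, continuing from the imports above (the helper name and layout numbers are my own choices; the calls follow rdMolDraw2D and python-pptx):
def mol_to_stream(mol, size=(300, 300)):
    # draw the molecule to PNG bytes in memory, no temporary file needed
    drawer = rdMolDraw2D.MolDraw2DCairo(*size)
    drawer.DrawMolecule(mol)
    drawer.FinishDrawing()
    return BytesIO(drawer.GetDrawingText())

prs = Presentation()
slide = prs.slides.add_slide(prs.slide_layouts[6])  # blank layout
for i, mol in enumerate(mols[:4]):
    left = Inches(0.5 + 2.3 * i)
    slide.shapes.add_picture(mol_to_stream(mol), left, Inches(1.0), width=Inches(2.0))
prs.save("molecules.pptx")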
As many RDKitters know, rdSubstructLibrary is one of the cool tools for conducting substructure searches. Greg Landrum introduced how to use it in his great blog post.
I love the method because it works very fast for substructure searching. So I would like to make a CLI tool for building a substructure-library database.
To do it, I used click, which is a useful package for making CLI tools.
After installing the package, three commands become available: 1. make_rdssslib makes an sslib from an sdf.gz; 2. update_rdssslib updates an sslib with a new sdf.gz; 3. run_rdsss runs an SSS with a given SMARTS query.
The example is shown below.
# make ssslib from sdf.gz
$ make_rdssslib cdk2.sdf.gz cdk2.sslib.pkl
# search with ssslib from CLI
$ run_rdsss 'c1ccccc1' cdk2.sslib.pkl
After running run_rdsss, a hits.csv file will be generated.
The whole process can be done from the CLI with this code. But to handle a large sslib, I think the user should run the SSS in the interpreter, because the IO of the sslib becomes the bottleneck; a sketch of that is below.
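For reference, here is a minimal sketch of building and querying the library directly in the interpreter (the pattern follows Greg's blog post; the file names match the example above):
import gzip
import pickle
from rdkit import Chem
from rdkit.Chem import rdSubstructLibrary

# build the library: cached SMILES plus pattern fingerprints for fast screening
mols = rdSubstructLibrary.CachedTrustedSmilesMolHolder()
fps = rdSubstructLibrary.PatternHolder()
with gzip.open("cdk2.sdf.gz") as f:
    for mol in Chem.ForwardSDMolSupplier(f):
        if mol is None:
            continue
        mols.AddSmiles(Chem.MolToSmiles(mol))
        fps.AddFingerprint(Chem.PatternFingerprint(mol))
sslib = rdSubstructLibrary.SubstructLibrary(mols, fps)

with open("cdk2.sslib.pkl", "wb") as out:
    pickle.dump(sslib, out)

# query in memory, without the file IO overhead of the CLI
matches = sslib.GetMatches(Chem.MolFromSmarts("c1ccccc1"))
print(len(matches))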
This code is still under development. Any advice or suggestions will be greatly appreciated.
I introduced MMPDB v3 a few days ago. The package is a CLI tool, so the user can do MMPA by typing commands. However, the tool doesn't provide an API interface.
I think that if we could use MMPDB in a Jupyter notebook it would be more useful, because the output data could be used directly in the coding process.
The MMPDB CLI is implemented with Click, a Python command-line option parser, and Click provides a CliRunner for testing. I found that CliRunner is a nice solution for this.
The following example is my first trial of calling mmpdb in a Jupyter notebook. Most of the SQL part is borrowed from an RDKit blog post.
At first, connect to the PostgreSQL ChEMBL30 database and retrieve the data.
import psycopg2
from rdkit.Chem.Draw import IPythonConsole
from rdkit import rdBase
from rdkit import Chem
from rdkit.Chem import PandasTools
import requests
import pandas as pd
from mmpdblib import cli
from click.testing import CliRunner
from IPython.display import HTML
%load_ext sql
print(rdBase.rdkitVersion)
data = %sql select canonical_smiles,molregno,activity_id,standard_value,standard_units from activities \
join assays using (assay_id) \
join compound_structures using (molregno) \
where tid=165 and standard_type='Ki' and standard_value is not null and standard_relation='=' \
and canonical_smiles not like '%.%';
df = data.DataFrame()
Next, calculate pKi and rename the columns for the next data-processing step, then save the compound data and the property data separately in CSV format. It's worth knowing that the property data should be in tab-separated format.
# I used droperidol as an example molecule
smi = 'C1CN(CC=C1N2C3=CC=CC=C3NC2=O)CCCC(=O)C4=CC=C(C=C4)F'
droperidol = Chem.MolFromSmiles(smi)
smi = Chem.MolToSmiles(droperidol)
droperidol
Finally, convert the SMILES to RDKit mol objects with Chem.PandasTools. I used the ChangeMoleculeRendering method because, before calling it, the molecules in the dataframe were not rendered.
With this approach, we can call mmpdb functions from Python code without the subprocess module. I think it's useful for calling any Click-based CLI tool's functions from code.
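Here is a hedged sketch of the pattern; I'm assuming mmpdblib exposes its Click group as cli.main (as its console entry point suggests), and the file names are placeholders:
from click.testing import CliRunner
from mmpdblib import cli

runner = CliRunner()
# equivalent to "$ mmpdb fragment ..." and "$ mmpdb index ..." in a shell
result = runner.invoke(cli.main, ["fragment", "compounds.smi", "-o", "compounds.fragments"])
print(result.output)
result = runner.invoke(cli.main, ["index", "compounds.fragments", "-o", "compounds.mmpdb"])
print(result.output)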
After the commands above, I could call the mmpdb command ;)
Make cdk2.smi from cdk2.sdf with RDKit.
from rdkit import Chem
mols = Chem.SDMolSupplier('cdk2.sdf')
with open('cdk2.smi', 'w') as of:
    for m in mols:
        molid = m.GetProp('_Name')
        smi = Chem.MolToSmiles(m)
        of.write(f'{smi} {molid}\n')
After making the smi file, let's build the MMP database.
$ mmpdb fragment cdk2.smi -o cdk2.fragment
$ mmpdb index cdk2.fragment -o cdk2.mmpdb
60 rules were generated from cdk2.smi, which contains 47 molecules.
One of the interesting features of mmpdb v3 is the implementation of the generate command. The command allows the user to generate molecules using a user-defined substructure as the query.
cdk2.mmpdb has a methyl-to-thiazole rule ([*:1]C >>> [*:1]c1nccs1). So I tried the generate command with a simple molecule.
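From memory, the call looks something like the line below; the flag names are my best recollection of the v3 docs, so please check mmpdb generate --help before relying on them.
$ mmpdb generate --smiles 'c1ccccc1C' --query '[*:1]C' cdk2.mmpdb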
I often use Flask for my web app development because it is a lightweight web framework and easy to write.
BTW, rendering molecules as a grid image is easy in a Jupyter notebook with RDKit's Draw.MolsToGridImage function. Also, the recently developed package named mols2grid is really cool for rendering molecules in a Jupyter notebook. mols2grid generates the grid image as HTML, which means we can embed these grid images in a webpage.
So I tried it ;)
The following code uses Flask, Flask-Bootstrap, and mols2grid. All packages were installed from conda-forge.
The directory structure is below.
root/app.py
root/cdk2.sdf
root/templates/top.html
And app.py
from flask import Flask
from flask import render_template
from flask_bootstrap import Bootstrap
from rdkit import Chem
import mols2grid
app = Flask(__name__)
Bootstrap(app)
@app.route('/')
def top():
    im = mols2grid.display('./cdk2.sdf',
                           #tooltip=["id", "r_mmffld_Potential_Energy-OPLS_2005"]
                           )
    return render_template('/top.html', data=im.data)

if __name__ == "__main__":
    app.run(debug=True)
Then access localhost:5000. I could see the molecules rendered as a grid image.
The grid has pagination and check boxes, and selected molecules can be downloaded. I think it is a convenient way to implement molecule rendering in my web applications.
Last week I enjoyed RDKit UGM 2022. It was really great and exciting, even though I participated online. I hope I can participate in RDKit UGM 2023 in person ;)
As you know, RDKit is one of the most useful OSS packages for chemoinformaticians. It has a nice community and is actively developed. I respect the community's and the developers' efforts!
And fortunately, we can also use FW (Free-Wilson) analysis from the RDKit Contrib package ;)
The new version of RDKit (2022.09.1) isn't available from conda-forge yet, but I could get it from the original repo, and the FW package can be installed with pip.
By using the package, the user can conduct FW analysis really easily. The original repo also shows another approach which uses FMCS to get a scaffold for the dataset.
I love RDKit because it supports lots of chemoinformatics features and is backed by a top-level scientific community.
But for early-career chemoinformaticians, it's necessary to learn Python or C++ to write chemoinformatics code. As you know, KNIME is a good option for no-code chemoinformatics. Another option for beginners is a good wrapper around RDKit.
Recently, a package named datamol was introduced on my timeline.
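As a taste of the wrapper, a minimal sketch based on the datamol docs (verify the function names against your installed version):
import datamol as dm

mol = dm.to_mol("c1ccccc1CC(=O)O")  # SMILES to RDKit mol
mol = dm.standardize_mol(mol)       # standardization helper
fp = dm.to_fp(mol)                  # fingerprint as a numpy array
smi = dm.to_smiles(mol)             # back to canonical SMILES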