Make a learning process with a human in the loop #optuna #memo #python #rdkit

Many chemoinformaticians have checked Greg’s great blog post, which describes how to draw molecules in various ways.
https://greglandrum.github.io/rdkit-blog/posts/2023-05-26-drawing-options-explained.html

Rendering molecules doesn’t directly contribute to drug design, but it’s really important for us because medicinal chemists have their own preferences for drawing molecules, such as font size, highlight color, etc.

It’s too difficult to parameterize each person’s preferences, even if we use AI ;)

A few days ago I learned that optuna has a cool feature for implementing a learning process with a human in the loop.

By using the feature, we can optimize parameters from human feedback. So I wanted to use it as soon as possible and tried it today ;)
https://github.com/optuna/optuna-dashboard/releases/tag/v0.9.0b4

Today’s code doesn’t include an ML/DL process, so no GPU is needed.

OK, let’s write the code. The main code comes from the original documentation.
https://optuna-dashboard.readthedocs.io/en/latest/tutorials/hitl.html

#main.py
import os
import textwrap
import time
from typing import NoReturn

import optuna
from optuna.trial import TrialState
from optuna_dashboard import ChoiceWidget
from optuna_dashboard import register_objective_form_widgets
from optuna_dashboard import save_note
from optuna_dashboard.artifact import get_artifact_path
from optuna_dashboard.artifact import upload_artifact
from optuna_dashboard.artifact.file_system import FileSystemBackend
from PIL import Image
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import rdDepictor
# Most of the molecule-drawing code is borrowed from Greg's great blog post https://greglandrum.github.io/rdkit-blog/posts/2023-05-26-drawing-options-explained.html

doravirine = Chem.MolFromSmiles('Cn1c(n[nH]c1=O)Cn2ccc(c(c2=O)Oc3cc(cc(c3)Cl)C#N)C(F)(F)F')

def make_png(mol, r, g, b, legend="", highlightAtoms=()):
    # draw the molecule at 350x300 with an RGBA highlight color built from r, g, b
    d2d = Draw.MolDraw2DCairo(350, 300)
    dopts = d2d.drawOptions()
    dopts.setHighlightColour((r, g, b, .4))
    d2d.DrawMolecule(mol, legend=legend, highlightAtoms=highlightAtoms)
    d2d.FinishDrawing()
    return d2d.GetDrawingText()

def suggest_and_generate_image(study: optuna.Study, artifact_backend: FileSystemBackend) -> None:
    # 1. Ask new parameters
    trial = study.ask()
    r = trial.suggest_float("r", 0, 1)
    g = trial.suggest_float("g", 0, 1)
    b = trial.suggest_float("b", 0, 1)

    # 2. Generate image
    image_path = f"tmp/sample-{trial.number}.png"
    drawingtext = make_png(doravirine, r, g, b, legend=f"trial {trial.number}", highlightAtoms=(0,1,2,3,4,5,6))
    with open(image_path, 'wb') as f:
        f.write(drawingtext)

    # 3. Upload Artifact
    artifact_id = upload_artifact(artifact_backend, trial, image_path)
    print(artifact_id)
    artifact_path = get_artifact_path(trial, artifact_id)
    #artifact_path = os.path.join("artifact", artifact_id)
    print(artifact_path)

    # 4. Save Note
    note = textwrap.dedent(
        f"""\
    ## Trial {trial.number}

    ![generated-image]({artifact_path})
    """
    )
    save_note(trial, note)


def start_optimization(artifact_backend: FileSystemBackend) -> NoReturn:
    # 1. Create Study
    study = optuna.create_study(
        study_name="Human-in-the-loop Optimization",
        storage="sqlite:///db.sqlite3",
        sampler=optuna.samplers.TPESampler(constant_liar=True, n_startup_trials=5),
        load_if_exists=True,
    )

    # 2. Set an objective name
    study.set_metric_names(["Do you like this color?"])

    # 3. Register ChoiceWidget
    register_objective_form_widgets(
        study,
        widgets=[
            ChoiceWidget(
                choices=["Good 👍", "So-so👌", "Bad 👎"],
                values=[-1, 0, 1],
                description="Please input your score!",
            ),
        ],
    )

    # 4. Start Human-in-the-loop Optimization
    n_batch = 4
    while True:
        running_trials = study.get_trials(deepcopy=False, states=(TrialState.RUNNING,))
        if len(running_trials) >= n_batch:
            time.sleep(1)  # Avoid busy-loop
            continue
        suggest_and_generate_image(study, artifact_backend)


def main() -> NoReturn:
    tmp_path = os.path.join(os.path.dirname(__file__), "tmp")

    # 1. Create Artifact Store
    artifact_path = os.path.join(os.path.dirname(__file__), "artifact")
    artifact_backend = FileSystemBackend(base_path=artifact_path)

    if not os.path.exists(artifact_path):
        os.mkdir(artifact_path)

    if not os.path.exists(tmp_path):
        os.mkdir(tmp_path)

    # 2. Run optimize loop
    start_optimization(artifact_backend)


if __name__ == "__main__":
    main()

After writing the code, I ran the script from two terminals.

# terminal 1
$ python main.py
# terminal 2
$ optuna-dashboard sqlite:///db.sqlite3 --artifact-dir ./artifact/

After typing the optuna-dashboard command, the web server launched and I could access localhost:8080.

By clicking Trials (List), I could see the following page. The page asked me whether I like the color of the atom highlighting, and I could annotate each trial with good, so-so, or bad.

After several trials, optuna learned that I like orange ;) It can be confirmed from the learning curve and the 2D density plot.

The x axis of the density plot is red and the y axis is blue. As you can see, I gave 'good' to orange-to-red highlight colors (the r value is near 1.0 and the b value is near 0.0).
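The learned preference can also be checked outside the dashboard. Here is a minimal sketch (it only assumes the study name and storage used above) that sorts the annotated trials by score:

#check_study.py
import optuna

study = optuna.load_study(
    study_name="Human-in-the-loop Optimization",
    storage="sqlite:///db.sqlite3",
)
df = study.trials_dataframe()
# Good=-1, So-so=0, Bad=1, so an ascending sort puts the favorite colors first
print(df.sort_values("value")[["number", "value", "params_r", "params_g", "params_b"]].head())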

This example doesn’t have any scientific meaning, but it shows how to use optuna in a HITL setting.

I uploaded today’s code to my github repo.

https://github.com/iwatobipen/hitl_chem/tree/main

enjoy programming!


Optimize an ML model with a few lines of code #chemoinformatics #ML #RDKit #falcon

Machine learning is democratizing, and there are many great libraries available in Python, such as scikit-learn and pycaret.

Recently, we can build ML models without writing lots of code. Today I found a new package named ‘falcon‘ which can build an optimized model in a few lines.

One of the famous packages for AutoML is auto-sklearn, which is also useful and powerful. But today I tried falcon instead of auto-sklearn.

Falcon can use optuna for parameter optimization. I used the fully automated process in this blog post.

The solubility dataset is used for the example. To use falcon, I only need to import AutoML from falcon. It’s toooooooo simple, isn’t it?

After importing AutoML, just pass the data to the object. That’s all. AutoML then builds a regression model. The following example shows how to build a regression model, but falcon can also build a classification model; to do so, the task should be set to ‘tabular_classification’.

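Here is a minimal sketch of the usage. The file name is hypothetical, and the AutoML arguments follow my reading of the falcon README, so please check the falcon docs for the exact signature.

#automl_sketch.py
import pandas as pd
from falcon import AutoML

df = pd.read_csv("solubility.csv")  # hypothetical file name for the solubility dataset

# task='tabular_regression' builds a regression model;
# set task='tabular_classification' to build a classifier instead
manager = AutoML(task="tabular_regression", train_data=df)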

After the process, the model is saved in ONNX format ;)

I checked the model performance with the training and test datasets. As you can see, the accuracy on the test data is low compared to the training data.

Building an accurate model is still a challenging task for many chemoinformaticians. In some cases a huge amount of data is required to build an accurate model, but it’s not always feasible :P

The balance between real and virtual data is important.

But it’s fun to search for and test new ML packages. I updated my code on my gist.

Use jupyter notebook from knime #knime #chemoinformatics #memo

Recently I posted an example of building a custom knime node with python. It was a nice experience for me, and I’m still reading the knime documentation.

Today I tried to use a jupyter notebook from knime. There is a good example of how to do that, with a workflow and a dataset.
https://docs.knime.com/latest/python_installation_guide/index.html#jupyter-notebooks

The example described above reads a dataset and does some calculations on the numerical columns. I tried to use a chemoinformatics library with knime.

Knime can read not only table data like CSV and SDF but also user-defined compounds. To do that, the ChemAxon MarvinSketch node is used. MarvinSketch is free for personal use.

By using the node, the user can define a query molecule with a GUI-based drawer.

After drawing the molecule and pushing the apply button, the node passes the molecule to the following nodes.

To load the jupyter notebook, the Python Script node and the Conda Environment Propagation node are used. The Conda Environment Propagation node defines which conda env to use in the Python Script node, and the information is passed as a flow variable.

The whole workflow is shown here. I made a conda env named my_python_env which contains all the packages used in the notebook.

*MolConverter node converts molecules to MDL format.

Let’s move to the script tab of the Python Script node.

#code for python script
import knime.scripting.io as knio

# Load a function from a jupyter notebook and apply it to the input table

# Directory containing the notebook
notebook_directory = "/home/iwatobipen/dev/knime_dev/notebook"

# Filename of the notebook
notebook_name = "test1.ipynb"

# Load the notebook as a Python module
import knime.scripting.jupyter as knupyter
my_notebook = knupyter.load_notebook(notebook_directory, notebook_name)

# Print its textual contents
knupyter.print_notebook(notebook_directory, notebook_name)
input_table_1 = knio.input_tables[0].to_pandas()
print(input_table_1)
output_table_1 = my_notebook.compound_generation(input_table_1)
knio.output_tables[0] = knio.Table.from_pandas(output_table_1)

To use a function defined in a jupyter notebook, import knime.scripting.jupyter as knupyter, then define the notebook path and name.

The code of the notebook is shown here. The following code generates molecules from the user’s input query molecule with CReM and returns them as a pandas dataframe.

#code from jupyternotebook
import pandas as pd
import numpy as np
from rdkit import Chem
from crem.crem import mutate_mol
dbpath = '/home/iwatobipen/dev/sandbox/rdk/replacements02_sc2.5.db'

def smi2mol(smi):
    mol = Chem.MolFromSmiles(smi)
    Chem.KekulizeIfPossible(mol)
    return mol

def compound_generation(data_set):
    mol = Chem.MolFromMolBlock(data_set['Molecule'][0])
    mut_mol = mutate_mol(mol, dbpath)
    output = pd.DataFrame()
    molids = []
    smi_list = []
    for i, smi in enumerate(mut_mol):
        molids.append(i)
        smi_list.append(smi)
    output['molid'] = molids
    output['smiles'] = smi_list
    output['ROMol'] = output['smiles'].apply(smi2mol)
    return output

After running the workflow, I could get lots of mutated molecules.

After this work, I can run additional chemoinformatics tasks with knime. By using the jupyter notebook, the user can check the code interactively, which I think is useful for checking a function’s behavior. Knime can integrate python in various ways, and it’s easy to do.

Interesting ;)

Try to use the recent version of exmol #exmol #rdkit #chemoinformatics

Previously I wrote a blog post about ‘exmol’, which is a useful package for ML-related tasks. exmol is developed by Andrew White. I installed it almost 2 years ago and didn’t check for updates; recently I found that a new version of exmol was released with new functions.

The following code is borrowed from the original documentation and Pat’s nice blog post.

A RandomForest classifier from scikit-learn and mordred are used in the original documentation, but I used HistGradientBoostingClassifier and useful_rdkit_utils in my code, for the following reasons.

First, mordred depends on an old version of numpy, so I got an error from the np.float attribute, which was removed in newer versions of numpy.

Second, I used HistGradientBoosting instead of RandomForest because some molecular features are NaN, and HistGB works even if the input has NaN.
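Here is a toy sketch (my own illustration, not part of the original analysis) showing that HistGradientBoostingClassifier accepts NaN inputs without error:

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[rng.random(X.shape) < 0.1] = np.nan  # inject ~10% missing values
y = (np.nan_to_num(X[:, 0]) > 0).astype(int)

clf = HistGradientBoostingClassifier(random_state=0).fit(X, y)  # no error despite NaNs
print(clf.score(X, y))

OK, at first import the libraries for coding.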

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib as mpl
import rdkit, rdkit.Chem, rdkit.Chem.Draw
from rdkit.Chem.Draw import IPythonConsole
import numpy as np
import mordred, mordred.descriptors
from useful_rdkit_utils import RDKitDescriptors
from chembl_structure_pipeline import standardizer
import exmol
from rdkit.Chem.Draw import rdDepictor
from rdkit import Chem
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, RocCurveDisplay
from sklearn import metrics
import crem
rdDepictor.SetPreferCoordGen(True)

IPythonConsole.ipython_useSVG = True
sns.set_context("notebook")
sns.set_style(
    "dark",
    {
        "xtick.bottom": True,
        "ytick.left": True,
        "xtick.color": "#666666",
        "ytick.color": "#666666",
        "axes.edgecolor": "#666666",
        "axes.linewidth": 0.8,
        "figure.dpi": 300,
    },
)
color_cycle = ["#1BBC9B", "#F06060", "#F3B562", "#6e5687", "#5C4B51"]
mpl.rcParams["axes.prop_cycle"] = mpl.cycler(color=color_cycle)

Next, I read the dataset with pandas and calculate the compound features with useful_rdkit_utils. Thanks Pat for sharing useful code!

data = pd.read_csv("BBBP.csv")
examplesmi = data.smiles[1232]
# make an object that can compute descriptors
calc = RDKitDescriptors()
molecules = [rdkit.Chem.MolFromSmiles(smi) for smi in data.smiles]

# invalid molecules are None, so keep only the valid ones
valid_mols = [m for m in molecules if m is not None]

descs = [calc.calc_mol(m) for m in valid_mols]
raw_features = pd.DataFrame(descs, columns=calc.desc_names)
# boolean mask over the original rows, used to align the labels
valid_mask = [m is not None for m in molecules]
labels = data[valid_mask].p_np
Chem.MolFromSmiles(examplesmi)
#define data normalizer

fm = raw_features.mean()
fs = raw_features.std()


def feature_convert(f):
    f -= fm
    f /= fs
    return f


features = feature_convert(raw_features)

# we have some nans in features, likely because std was 0
features = features.values.astype(float)
features_select = np.all(np.isfinite(features), axis=0)
features = features[:, features_select]

Let’s train the model with the training dataset and plot the ROC curve.

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.4, shuffle=True
)
print(X_train.shape)
# clf = RandomForestClassifier(max_depth=8, random_state=0)  # replaced: HistGB tolerates NaN inputs
clf = HistGradientBoostingClassifier(random_state=0)
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)
fpr, tpr, thresholds = metrics.roc_curve(y_test, predicted)
roc_auc = metrics.auc(fpr, tpr)
print("AUC", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
#plt.figure(figsize=(4, 3), dpi=300)

display = RocCurveDisplay.from_estimator(clf, X_test, y_test)
plt.plot([0, 1], [0, 1], linestyle="--")
plt.savefig("HGBC-ROC.png")

#AUC is not so bad
#(1225, 193)
#AUC 0.9132069317364468

To use exmol, a model_eval function which returns predicted values from SMILES is required, so I wrote one. With this approach, exmol can work with any kind of classifier.

def model_eval(smiles_list, _=None):
    molecules = [rdkit.Chem.MolFromSmiles(smi) for smi in smiles_list]
    molecules = [m for m in molecules if m is not None]
    # input wrangling. Get some weird values from weird smiles
    features = np.array([calc.calc_mol(m) for m in molecules])
    #features = feature_convert(raw_features)
    #features = features.astype(float)
    
    features = features[:, features_select]
    labels = clf.predict(features)
    return labels

OK, let’s run the sample_space function with the preset options. The preset option controls the diversity of the molecules, and user-defined molecules can also be used for sampling.

The preset option, from ’narrow’ to ’wide’, controls the number of STONED mutations. Ideally, preset=’wide’ will generate more diverse molecules.

space = exmol.sample_space(examplesmi, model_eval, quiet=False, preset='wide') #'wide/medium/narrow'
exps = exmol.cf_explain(space)
fkw = {"figsize": (8, 6)}
mpl.rc("axes", titlesize=12)
exmol.plot_cf(exps, figure_kwargs=fkw, mol_size=(450, 400), nrows=1)

preset=’wide’

preset=’medium’

preset=’narrow’

The case shown above is not a good example ;P, so it’s difficult to compare the effects of the different preset options.

If you want to use your own generated molecules for sampling, preset=’custom’ should be used. As an example, I used molecules generated with CReM (inspired by Pat’s blog post, thanks).

from crem.crem import mutate_mol
examplemol = Chem.MolFromSmiles(examplesmi)
analog_smi_list = list(mutate_mol(examplemol, db_name='replacements02_sc2.5.db', max_size=10))
analog_mol_list = [Chem.MolFromSmiles(smi) for smi in analog_smi_list if 'I' not in smi]  # some molecules containing I caused errors, so I omitted them
from rdkit.Chem import Draw
Draw.MolsToGridImage(analog_mol_list[:40], molsPerRow=8)
space = exmol.sample_space(examplesmi, model_eval, quiet=False, preset='custom', data=analog_mol_list[:2000])
exps = exmol.cf_explain(space)
fkw = {"figsize": (8, 6)}
mpl.rc("axes", titlesize=12)
exmol.plot_cf(exps, figure_kwargs=fkw, mol_size=(450, 400), nrows=3, ncols=5)
Now we can use any model and any generated molecules with exmol.
It's interesting. I would like to make knime nodes for these packages.
I uploaded today's code on my gist.
https://gist.github.com/iwatobipen/2a3428e0fbb78e2280be7ca8251e1517

Develop a new knime node with python #chemoinformatics #knime

The Golden Week is a collection of four national holidays within seven days. My kid graduated from elementary school and from his favorite dodgeball team, and he joined a volleyball team. I quit coaching the dodgeball team at the same time. So I have time for coding and drinking beer again ;)

I watched the knime summit about the new version of Knime. It’s really cool. I’m interested in node development with python. Previous versions could be extended with your own nodes, but that required java. Now we can develop knime nodes with python instead of java.

The feature has been supported since knime version 4.6. The details are described in the official knime blog post.
https://www.knime.com/blog/4-steps-for-your-python-team-to-develop-knime-nodes

I read the blog post and tried to make my own knime chemoinformatics node today. I made a node which standardizes molecules with chembl_structure_pipeline. This library is really useful for normalizing molecules.

Following section shows my log.

At first, I got the template code (basic.zip) from here.

The structure of the zip file is below.

(base) iwatobipen@penguin:~/dev/knime_dev/basic$ tree
.
├── config.yml
├── Example_with_Python_node.knwf
├── my_conda_env.yml
├── README.md
└── tutorial_extension
    ├── icon.png
    ├── knime.yml
    ├── LICENSE.TXT
    └── my_extension.py

I modified config.yml and my_conda_env.yml as below.

#config.yml
org.tutorial.first_extension: # {group_id}.{name} from the knime.yml
  src: /home/iwatobipen/dev/knime_dev/basic/tutorial_extension # Path to folder containing the extension files
  conda_env_path: /home/iwatobipen/miniconda3/envs/my_python_env # Path to the Python environment to use
  debug_mode: true # Optional line, if set to true, it will always use the latest changes of execute/configure, when that method is used within the KNIME Analytics Platform

#my_conda_env.yml
name: my_python_env
channels:
  - knime
  - conda-forge
dependencies:
  - python=3.9
  - knime-extension=4.7
  - knime-python-base=4.7
  - rdkit
  - chembl_structure_pipeline

How to define config.yml is well documented in the knime blog post.

After defining my_conda_env.yml, I made the conda env from the yml file.

$ conda env create -f my_conda_env.yml

After making the env, I wrote the code for the knime node. My node takes SMILES strings as input, standardizes the molecules, and generates a molecular hash as output.

The code is below. Decorators are used to define the input and output. The following code defines one input port and one output port; additional ports can be added with the @knext.input_table and @knext.output_table decorators (https://knime-python.readthedocs.io/en/stable/content/content.html#python-script-api).

import logging
import knime.extension as knext
from rdkit import Chem
from rdkit.Chem import rdMolHash
from functools import partial
from chembl_structure_pipeline import standardize_mol
from chembl_structure_pipeline import get_parent_mol

LOGGER = logging.getLogger(__name__)
#molhash = partial(rdMolHash.MolHash, rdMolHash.HashFunction.HetAtomTautomer)

@knext.node(name="chembl structure pipeline", node_type=knext.NodeType.MANIPULATOR, icon_path="demo.png", category="/")
@knext.input_table(name="SMILES column", description="read smiles")
@knext.output_table(name="Output Data", description="rdkit mol which is standarized with chembl structure pipeline")
class TemplateNode:
    """Short one-line description of the node.
    This is sample node which is implemented with chembl structure pipeline.
    input data should be SMILES.
    """

    # standardize the molecule with chembl_structure_pipeline and return the parent (salt-stripped) mol
    def std_mol(self, smiles):
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            return None
        else:
            stdmol = standardize_mol(mol)
            pm, _ = get_parent_mol(stdmol)
            Chem.Kekulize(pm)
            return pm
    
    def get_mol_hash(self, rdmol):
        res = rdMolHash.MolHash(rdmol, rdMolHash.HashFunction.HetAtomTautomer)
        return res

    column_param = knext.ColumnParameter(label="label", description="description", port_index=0)
   

    def configure(self, configure_context, input_schema_1):   
            
        #return input_schema_1.append(knext.Column(Chem.rdchem.Mol, "STD_ROMol"))
        return input_schema_1.append(knext.Column(Chem.rdchem.Mol, "STD_ROMol")).append(knext.Column(knext.string(), 'MolHash'))

 
    def execute(self, exec_context, input_1):
        input_1_pandas = input_1.to_pandas()
        input_1_pandas['STD_ROMol'] = input_1_pandas['column1'].apply(self.std_mol)
        input_1_pandas['MolHash'] = input_1_pandas['STD_ROMol'].apply(self.get_mol_hash)
        return knext.Table.from_pandas(input_1_pandas)


After writing the code, add the following line to the knime.ini file, which is located in the knime install folder:

-Dknime.python.extension.config=/home/iwatobipen/dev/knime_dev/basic/config.yml

Then, after launching knime, I could see my own knime node ;)

I made a simple workflow with the node; ‘chembl structure pipeline’ is the node I developed ;)

I added some SMILES with the Table Creator node.

After running the node, I could get the standardized molecules as output.

The workflow not only standardizes the molecules but also generates the molhash, so the output looks like below. The Count row is the count of each molhash; you can see a count of 2 for 2-hydroxypyridine and 2-pyridone. Of course, they are tautomers.
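The tautomer-insensitive hash can be checked directly with RDKit, too. Here is a small sketch (the SMILES for the 2-hydroxypyridine/2-pyridone pair are my own example):

from rdkit import Chem
from rdkit.Chem import rdMolHash

# both tautomers give the same HetAtomTautomer hash
for smi in ['Oc1ccccn1', 'O=c1cccc[nH]1']:
    mol = Chem.MolFromSmiles(smi)
    print(smi, rdMolHash.MolHash(mol, rdMolHash.HashFunction.HetAtomTautomer))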

It was interesting for me to make a new node with python. I think it’s useful not only for coders but also for non-coders.

Run Knime workflow without GUI #chemoinfo #knime #RDKit

I enjoyed the knime summit last week. I learned lots of useful features of the new version of Knime (ver5).

Knime is a useful tool for chemoinformatics, but I mainly write code in python, so I’m not a heavy user of knime. Recently, though, I’ve become interested in Knime and am learning to apply it in my projects.

As readers know, Knime desktop is a GUI tool for low-code work. But a workflow can also run without the GUI, from the command line.

Let’s try to do it ;)

At first, I made an example workflow to test it, shown below. The workflow reads an SDF from a local directory, removes salts, adds hydrogens, then generates 3D geometries and optimizes them with MMFF94, and finally saves the results as an SDF.

Very simple ;)

I saved the workflow as ‘batchtest’.

OK, next run knime from the command line. The command is shown here:

$ /home/iwatobipen/knime/knime -nosplash -application org.knime.product.KNIME_BATCH_APPLICATION -workflowDir="/home/iwatobipen/knime-workspace/chemoinfo/batchtest" -nosave

A relative path to the workflow didn’t work with knime, so I passed the absolute path.

I passed the -nosave option because I would like to keep the workflow ready to run. When the option is removed, I can’t run the workflow again after it has run once, because the workflow remembers that the task is finished.

This approach makes it easy to do routine tasks more efficiently.

This example is simple and not so useful as a batch job, but it’s worth knowing how to run knime in batch mode. ;)

Visualize MMP data with OSS webapp! #chemoinformatics #rdkit #mmpdb

Matched molecular pairs (MMPs) are an old but still useful approach, not only for retrospective analysis but also for forward compound design. As many chemoinformaticians know, mmpdb is one of the most famous MMP CLI tools. I like the package because it’s easy to make your own MMP db. However, mmpdb doesn’t have a visualization tool for chemists, so we need to make tools to visualize MMPs, because extracting MMP information from SMILES and float values alone is hard for many medicinal chemists ;)

Andrew Hoover’s group, researchers at Merck, published a nice article with code. The article is available on chemrxiv.

https://chemrxiv.org/engage/chemrxiv/article-details/63586c15aca19850f7e53e55

The tool, named Matcher, provides an interactive view of MMPs. The view makes it easy to understand the effect of a transformation. And the system is flexible: it can visualize MMPs from our own data.

I tried it with the test data from the original repo.

To use Matcher, docker-compose is required, so I installed it before the task. Next, get the code from the github repo.

$ gh repo clone Merck/matcher
$ cd matcher

The code runs with docker-compose, so we don’t need to install additional packages; the Dockerfiles install everything required automatically. Type the following command to build the docker image and run it.

$ docker-compose up

After waiting a few minutes, I could access the matcher top page (localhost:8000). With the default settings, the Matcher backend runs postgresql with its default configuration, so port number 5432 is used. If the user runs an additional postgresql job with default settings, it will cause a conflict, so some settings need to be changed, I think.

In my env, I run a postgresql db for chembl32, so I changed the port number ;)
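For example, a docker-compose override file like the sketch below could remap the host port. The service name ‘database’ is hypothetical, so check the actual service names in Matcher’s docker-compose.yml.

# docker-compose.override.yml (sketch; the service name is hypothetical)
services:
  database:
    ports:
      - "5433:5432"  # expose container port 5432 on host port 5433 to avoid the clash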

The top page of matcher is shown below. A query can be made in two ways:
1) Specify the position you want to change (starting molecule)
2) Specify the transformation (starting molecule and product)

The following example is a starting-molecule-and-product type query, H to Me.

After running the query, I could get an interactive view of the delta hERG change and the structures. You can find a great improvement in hERG inhibition from adding a methyl, without loss of target potency. It seems like a magic methyl!!!!

The next example queries only a small part of the molecule, with the results grouped by each substructure.

As the figure above shows, morpholine improves hERG inhibition compared to many amine substituents.

These analyses provide a bird’s-eye view of the effects on compound properties.

In summary, Matcher is useful for chemoinformatics tasks. I’m writing some code to build a property extractor for mmpdb and matcher!

Have a nice weekend ;)

Can LLMs understand chemistry? #memo #journal #JCIM

Recently we can use LLMs (large language models) such as ChatGPT, and they return (make) plausible answers to questions. Lots of chatbot systems have been used in many services, but the use cases are limited. LLMs seem to be really powerful tools, not only for general tasks but also for life science.

So many people are trying to apply LLMs to the life science area, including drug design. ChemGPT is one of the interesting projects in this area. https://huggingface.co/ncfrey/ChemGPT-4.7M

And now @souyakuchan is holding a competition for designing molecules with LLMs. It seems interesting ;)

And I read an interesting article; the title is ‘Do Large Language Models Understand Chemistry? A Conversation with ChatGPT‘. Luckily, we can read the pdf on ResearchGate: ‘https://www.researchgate.net/publication/369299429_Do_Large_Language_Models_Understand_Chemistry_A_Conversation_with‘.

In the article, the authors asked ChatGPT some chemistry-related questions, listed below.
1, name to SMILES / SMILES to name
2, predict LogP for given compounds
3, get structural information on coordination compounds
4, predict water solubility of polymers
5, determine molecular point groups

The results showed that task 1 is difficult for ChatGPT. It depends on the amount of training data and the quality of the prompt. Name to structure and structure to name are common tasks in the chemoinformatics area, but it’s a really specific field in the world, so the amount of training data is limited.

The authors describe their considerations deeply in the article. LLMs work very well on many problems, but chemistry is still a challenging area even after ChatGPT’s release.

But I can’t predict the near future of LLMs, or whether LLMs will come to understand chemistry or not. In any case, I shouldn’t stop learning, not only in the in silico area but also in the wet area…..

The authors disclosed their code! You can read it at the following URL.
https://github.com/andresilvapimentel/AI4Chem/tree/main

Run RDKit tasks in parallel #rdkit #chemoinformatics

Recently we can get huge numbers of molecules and use them for drug discovery tasks. That’s really good news; however, we need more and more computational resources…

When I write code for chemoinformatics tasks in python, I often use a ‘for loop’ approach. It is the straightforward way, but it’s not so efficient. A long time ago, I wrote a blog post about jug, which is a python package for task-based parallelization.

Today, I would like to share another approach to parallelization, using ipyparallel. The original idea came from Greg’s suggestion. Thanks Greg!

The following example code is a conformer generation task. I used first_5K.smi as a sample dataset.

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole
from chembl_structure_pipeline import standardizer
from rdkit import RDLogger
import time
RDLogger.DisableLog('rdApp.info')
suppl = Chem.SmilesMolSupplier('./first_5K.smi', smilesColumn=0, delimiter='\t')
mols = [m for m in suppl if m is not None]
#for convenience, I used first 1K molecules
mols = mols[:1000]

At first, I cleaned molecules with chembl_structure_pipeline.

def stdmol(mol):
    stdmol = standardizer.standardize_mol(mol)
    pmol, _ = standardizer.get_parent_mol(stdmol)
    return pmol
mols = [stdmol(m) for m in mols]

Then I defined the confgen function. The function generates a 3D conformer and optimizes it with MMFF. The tip is the tpl argument: it is required for the parallelization, because each engine receives the total number of engines and its own index.

def confgen(tpl, mols=mols):
    from rdkit import Chem
    from rdkit.Chem import AllChem     
    nengine, clsidx = tpl
    nwrite = 0
    for idx, mol in enumerate(mols):
        if idx % nengine == clsidx:
            try:
                hmol = Chem.AddHs(mol)
                res = AllChem.EmbedMolecule(hmol)
                if res == 0:
                    AllChem.MMFFOptimizeMolecule(hmol)
                    nwrite += 1
            except Exception:
                continue
    return nwrite

Now we’re ready to run the code in parallel. I defined a function to test the parallel job.

import ipyparallel as ipp
def runtask(nengines=1):
    start = time.time()
    with ipp.Cluster(n=nengines) as rc:
        view = rc.load_balanced_view()
        asyncresult = view.map_async(confgen, [(nengines, idx) for idx in range(nengines)])
        asyncresult.wait_interactive()
        # retrieve actual results
        result = asyncresult.get()
        print(result)
    finish = time.time()
    print(finish - start)

OK let’s test the code!

runtask(1)
Starting 1 engines with <class 'ipyparallel.cluster.launcher.LocalEngineSetLauncher'>
  0%|          | 0/1 [00:00<?, ?engine/s]
confgen:   0%|          | 0/1 [00:00<?, ?tasks/s]
[992]
Stopping engine(s): 1680783042
engine set stopped 1680783042: {'engines': {'0': {'exit_code': 0, 'pid': 429592, 'identifier': '0'}}, 'exit_code': 0}
Stopping controller
Controller stopped: {'exit_code': 0, 'pid': 429563, 'identifier': 'ipcontroller-1680783041-w6r9-429542'}
32.62002611160278

runtask(16)
Starting 16 engines with <class 'ipyparallel.cluster.launcher.LocalEngineSetLauncher'>
  0%|          | 0/16 [00:00<?, ?engine/s]
confgen:   0%|          | 0/16 [00:00<?, ?tasks/s]
[63, 61, 63, 63, 63, 62, 63, 63, 61, 61, 61, 62, 62, 62, 62, 61]
Stopping engine(s): 1680783075
engine set stopped 1680783075: {'engines': {'0': {'exit_code': 0, 'pid': 429664, 'identifier': '0'}, '1': {'exit_code': 0, 'pid': 429666, 'identifier': '1'}, '2': {'exit_code': 0, 'pid': 429668, 'identifier': '2'}, '3': {'exit_code': 0, 'pid': 429670, 'identifier': '3'}, '4': {'exit_code': 0, 'pid': 429676, 'identifier': '4'}, '5': {'exit_code': 0, 'pid': 429682, 'identifier': '5'}, '6': {'exit_code': 0, 'pid': 429688, 'identifier': '6'}, '7': {'exit_code': 0, 'pid': 429700, 'identifier': '7'}, '8': {'exit_code': 0, 'pid': 429713, 'identifier': '8'}, '9': {'exit_code': 0, 'pid': 429728, 'identifier': '9'}, '10': {'exit_code': 0, 'pid': 429738, 'identifier': '10'}, '11': {'exit_code': 0, 'pid': 429754, 'identifier': '11'}, '12': {'exit_code': 0, 'pid': 429770, 'identifier': '12'}, '13': {'exit_code': 0, 'pid': 429786, 'identifier': '13'}, '14': {'exit_code': 0, 'pid': 429802, 'identifier': '14'}, '15': {'exit_code': 0, 'pid': 429818, 'identifier': '15'}}, 'exit_code': 0}
Stopping controller
Controller stopped: {'exit_code': 0, 'pid': 429633, 'identifier': 'ipcontroller-1680783074-y6mr-429542'}
12.983936071395874

Comparing the jobs with nengines=1 and nengines=16, the latter case is faster than the former. In this small case, the effect is not linear, but if lots of cores are available, more speedup is expected.

In summary, ipyparallel is a really cool package for data science.

I pushed my code on my gist. Thanks for reading. And any comments and suggestions are greatly appreciated :)


Make a QSAR model with PyG and pytorch 2.0 #RDKit #Chemoinfo

Recently, I updated the version of pytorch in my env from 1.x to 2.0.

I think it’s worth updating because the original site says….

PYTORCH 2.X: FASTER, MORE PYTHONIC AND AS DYNAMIC AS EVER

Today, we announce torch.compile, a feature that pushes PyTorch performance to new heights and starts the move for parts of PyTorch from C++ back into Python. We believe that this is a substantial new direction for PyTorch – hence we call it 2.0. torch.compile is a fully additive (and optional) feature and hence 2.0 is 100% backward compatible by definition.

Underpinning torch.compile are new technologies – TorchDynamo, AOTAutograd, PrimTorch and TorchInductor.

Also, pytorch_geometric supports pytorch 2.0. Previously I wrote a blog post about building a QSAR model with PyG. The old version of PyG didn’t support building a graph from a molecule, so I wrote my own function to do that. But the recent version of PyG has the function!

You can check the details on the official documentation.

I was interested in the new versions of pyg and pytorch, so I rewrote my code with the new packages.

The code is outlined below.

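This outline is a minimal sketch, not my full script: it shows PyG’s from_smiles utility, which builds a graph directly from a SMILES string, and torch.compile from pytorch 2.0. The small GCN and the nine-dimensional default atom features are assumptions of the sketch.

import torch
import torch.nn.functional as F
from torch_geometric.utils import from_smiles
from torch_geometric.nn import GCNConv, global_mean_pool

class QSARGCN(torch.nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.conv1 = GCNConv(9, hidden)  # from_smiles yields 9 atom features by default
        self.conv2 = GCNConv(hidden, hidden)
        self.lin = torch.nn.Linear(hidden, 1)

    def forward(self, data):
        x = data.x.float()  # from_smiles returns integer features
        x = F.relu(self.conv1(x, data.edge_index))
        x = F.relu(self.conv2(x, data.edge_index))
        return self.lin(global_mean_pool(x, data.batch))

data = from_smiles("c1ccccc1O")  # no hand-written mol-to-graph function needed
data.batch = torch.zeros(data.num_nodes, dtype=torch.long)  # single-molecule batch
model = torch.compile(QSARGCN())  # new in pytorch 2.0
print(model(data))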

Using the compiled model, I got many warning messages, but it worked and gave almost the same figure as in the previous post.

This was my first trial of the new versions of pyg and pytorch. I would like to dive more deeply into the packages as soon as possible.

Make a multi-target predictive model for Knime #RDKit #Knime #DeepLearning

I participated in my kid’s elementary school graduation ceremony today. It was a nice ceremony. …Time passes very fast ;) This season in Japan is very good.
Beautiful cherry blossoms are starting to bloom. It’s a good time to start anything ;-)

I would like to say thank to my kid.

BTW, chembl ver32 was recently released. As you know, the chembl DB has lots of biological activities of small molecules, so it’s a really useful source for a multi-target predictive model.

Fortunately, Greg Landrum posted a nice blog about prediction with knime. The URL is below.
https://www.knime.com/blog/interactive-bioactivity-prediction-with-multitask-neural-networks

He uses onnx to convert the pytorch predictive model to a tensorflow model because current knime doesn’t support pytorch. I was interested in the approach and wanted to run the workflow with chembl32.

To do that, I needed to build the model with chembl32, and the original code had to be modified because the schema has changed.

After some trials, I could build the workflow, so I would like to share my experience.

At first, I made the dataset. The code is below. I got the data from the ChEMBL FTP site. The origin of the code is Eloy’s work.

http://chembl.blogspot.com/2019/05/multi-task-neural-network-on-chembl.html.

#!/usr/bin/env python
# coding: utf-8
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem import rdMolDescriptors
from sqlalchemy import create_engine
from sqlalchemy.sql import text
import pandas as pd
import numpy as np
import tables as tb
import json
from tables.atom import ObjectAtom

engine = create_engine('sqlite:///chembl_32/chembl_32_sqlite/chembl_32.db') 


qtext = """
SELECT
  activities.doc_id                    AS doc_id,
  activities.standard_value            AS standard_value,
  molecule_hierarchy.parent_molregno   AS molregno,
  compound_structures.canonical_smiles AS canonical_smiles,
  molecule_dictionary.chembl_id        AS chembl_id,
  target_dictionary.tid                AS tid,
  target_dictionary.chembl_id          AS target_chembl_id,
  protein_classification.pref_name     AS pref_name,
  protein_classification.short_name     AS short_name,
  protein_classification.PROTEIN_CLASS_DESC    AS protein_class,
  protein_classification.class_level     AS class_level
FROM activities
  JOIN assays ON activities.assay_id = assays.assay_id
  JOIN target_dictionary ON assays.tid = target_dictionary.tid
  JOIN target_components ON target_dictionary.tid = target_components.tid
  JOIN component_class ON target_components.component_id = component_class.component_id
  JOIN protein_classification ON component_class.protein_class_id = protein_classification.protein_class_id
  JOIN molecule_dictionary ON activities.molregno = molecule_dictionary.molregno
  JOIN molecule_hierarchy ON molecule_dictionary.molregno = molecule_hierarchy.molregno
  JOIN compound_structures ON molecule_hierarchy.parent_molregno = compound_structures.molregno
WHERE activities.standard_units = 'nM' AND
      activities.standard_type IN ('EC50', 'IC50', 'Ki', 'Kd', 'XC50', 'AC50', 'Potency') AND
      activities.data_validity_comment IS NULL AND
      activities.standard_relation IN ('=', '<') AND
      activities.potential_duplicate = 0 AND assays.confidence_score >= 8 AND
      target_dictionary.target_type = 'SINGLE PROTEIN'"""

with engine.begin() as conn:
    res = conn.execute(text(qtext))
    df = pd.DataFrame(res.fetchall())

df.columns = res.keys()
df = df.where((pd.notnull(df)), None)

cls_list = df["protein_class"].to_list()
uniq = list(set(cls_list))

df = df.sort_values(by=['standard_value', 'molregno', 'tid'], ascending=True)
df = df.drop_duplicates(subset=['molregno', 'tid'], keep='first')

df.to_csv('chembl_activity_data.csv', index=False)

def set_active(row):
    active = 0
    if row['standard_value'] <= 1000:
        active = 1
    if "ion channel" in row['protein_class']:
        if row['standard_value'] <= 10000:
            active = 1
    if "kinase" in row['protein_class']:
        if row['standard_value'] > 30:
            active = 0
    if "nuclear receptor" in row['protein_class']:
        if row['standard_value'] > 100:
            active = 0
    if "membrane receptor" in row['protein_class']:
        if row['standard_value'] > 100:
            active = 0
    return active

df['active'] = df.apply(lambda row: set_active(row), axis=1)

# get targets with at least 100 different active molecules
acts = df[df['active'] == 1].groupby(['target_chembl_id']).agg('count')
acts = acts[acts['molregno'] >= 100].reset_index()['target_chembl_id']

# get targets with at least 100 different inactive molecules
inacts = df[df['active'] == 0].groupby(['target_chembl_id']).agg('count')
inacts = inacts[inacts['molregno'] >= 100].reset_index()['target_chembl_id']

# get targets mentioned in at least two docs
docs = df.drop_duplicates(subset=['doc_id', 'target_chembl_id'])
docs = docs.groupby(['target_chembl_id']).agg('count')
docs = docs[docs['doc_id'] >= 2.0].reset_index()['target_chembl_id']

t_keep = set(acts).intersection(set(inacts)).intersection(set(docs))

# get data for filtered targets
activities = df[df['target_chembl_id'].isin(t_keep)]

ion = pd.unique(activities[activities['protein_class'].str.contains("ion channel",  na=False)]['tid']).shape[0]
kin = pd.unique(activities[activities['protein_class'].str.contains("kinase",  na=False)]['tid']).shape[0]
nuc = pd.unique(activities[activities['protein_class'].str.contains("nuclear receptor",  na=False)]['tid']).shape[0]
gpcr = pd.unique(activities[activities['protein_class'].str.contains("membrane receptor", na=False)]['tid']).shape[0]
print('Number of unique targets: ', len(t_keep))
print('  Ion channel: ', ion)
print('  Kinase: ', kin)
print('  Nuclear receptor: ',  nuc)
print('  GPCR: ', gpcr)
print('  Others: ', len(t_keep) - ion - kin - nuc - gpcr)


# save it to a file
activities.to_csv('chembl_activity_data_filtered.csv', index=False)

def gen_dict(group):
    return {tid: act  for tid, act in zip(group['target_chembl_id'], group['active'])}

print('MULTI TASK DATA PREP')
group = activities.groupby('chembl_id')
temp = pd.DataFrame(group.apply(gen_dict))
mt_df = pd.DataFrame(temp[0].tolist())
mt_df['chembl_id'] = temp.index
mt_df = mt_df.where((pd.notnull(mt_df)), -1)

structs = activities[['chembl_id', 'canonical_smiles']].drop_duplicates(subset='chembl_id')

print('GET MOL')
# drop mols not sanitizing on rdkit
def molchecker(smi):
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        return None
    else:
        return 1

#structs['romol'] = structs.apply(lambda row: Chem.MolFromSmiles(row['canonical_smiles']), axis=1)
structs['romol'] = structs.apply(lambda row: molchecker(row['canonical_smiles']), axis=1)
structs = structs.dropna()
del structs['romol']

# add the structures to the final df
mt_df = pd.merge(structs, mt_df, how='inner', on='chembl_id')


# save to csv
mt_df.to_csv('chembl_multi_task_data.csv', index=False)

FP_SIZE = 1024 
RADIUS = 2

def calc_fp(smiles, fp_size, radius):
    """
    calcs morgan fingerprints as a numpy array.
    """
    mol = Chem.MolFromSmiles(smiles, sanitize=False)
    mol.UpdatePropertyCache(False)
    Chem.GetSSSR(mol)
    fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(mol, radius, nBits=fp_size)
    a = np.zeros((0,), dtype=np.float32)
    Chem.DataStructs.ConvertToNumpyArray(fp, a)
    return a

# calc fps
print('CALC FP')
descs = [calc_fp(smi, FP_SIZE, RADIUS) for smi in mt_df['canonical_smiles'].values]
descs = np.asarray(descs, dtype=np.float32)

# put all training data in a pytables file
print('SAVE DATA')
with tb.open_file('mt_data.h5', mode='w') as t_file:

    # set compression filter. It will make the file much smaller
    filters = tb.Filters(complib='blosc', complevel=5)

    # save chembl_ids
    tatom = ObjectAtom()
    cids = t_file.create_vlarray(t_file.root, 'chembl_ids', atom=tatom)
    for cid in mt_df['chembl_id'].values:
        cids.append(cid)

    # save fps
    fatom = tb.Atom.from_dtype(descs.dtype)
    fps = t_file.create_carray(t_file.root, 'fps', fatom, descs.shape, filters=filters)
    fps[:] = descs

    del mt_df['chembl_id']
    del mt_df['canonical_smiles']

    # save target chembl ids
    tcids = t_file.create_vlarray(t_file.root, 'target_chembl_ids', atom=tatom)
    for tcid in mt_df.columns.values:
        tcids.append(tcid)

    # save labels
    labs = t_file.create_carray(t_file.root, 'labels', fatom, mt_df.values.shape, filters=filters)
    labs[:] = mt_df.values
    
    # save task weights
    # each task loss will be weighted inversely proportional to its number of data points
    weights = []
    for col in mt_df.columns.values:
        c = mt_df[mt_df[col] >= 0.0].shape[0]
        weights.append(1 / c)
    weights = np.array(weights)
    ws = t_file.create_carray(t_file.root, 'weights', fatom, weights.shape)
    ws[:] = weights

with tb.open_file('mt_data.h5', mode='r') as t_file:
    print(t_file.root.chembl_ids.shape)
    print(t_file.root.target_chembl_ids.shape)
    print(t_file.root.fps.shape)
    print(t_file.root.labels.shape)
    print(t_file.root.weights.shape)
    
    # save targets to a json file
    with open('targets.json', 'w') as f:
        json.dump(t_file.root.target_chembl_ids[:], f)

After making the dataset, I built the model with pytorch. The code is below.

import numpy as np
import torch
from torch import nn
import torch.nn.functional as F
import torch.utils.data as D
import tables as tb
from sklearn.metrics import (matthews_corrcoef, 
                             confusion_matrix, 
                             f1_score, 
                             roc_auc_score,
                             accuracy_score,
                             roc_auc_score)


# set the device to GPU if available
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
MAIN_PATH = '.'
DATA_FILE = 'mt_data.h5'
MODEL_FILE = 'chembl_mt.model'
N_WORKERS = 8 # Dataloader workers, prefetch data in parallel to have it ready for the model after each batch train
BATCH_SIZE = 32 # https://twitter.com/ylecun/status/989610208497360896?lang=es
LR = 2 # Learning rate. Big value because of the way we are weighting the targets
N_EPOCHS = 2 # You should train longer!!!

class ChEMBLDataset(D.Dataset):
    
    def __init__(self, file_path):
        self.file_path = file_path
        with tb.open_file(self.file_path, mode='r') as t_file:
            self.length = t_file.root.fps.shape[0]
            self.n_targets = t_file.root.labels.shape[1]
        
    def __len__(self):
        return self.length
    
    def __getitem__(self, index):
        with tb.open_file(self.file_path, mode='r') as t_file:
            structure = t_file.root.fps[index]
            labels = t_file.root.labels[index]
        return structure, labels


dataset = ChEMBLDataset(f"{MAIN_PATH}/{DATA_FILE}")
validation_split = .2
random_seed= 42

dataset_size = len(dataset)
indices = list(range(dataset_size))
split = int(np.floor(validation_split * dataset_size))

np.random.seed(random_seed)
np.random.shuffle(indices)
train_indices, test_indices = indices[split:], indices[:split]

train_sampler = D.sampler.SubsetRandomSampler(train_indices)
test_sampler = D.sampler.SubsetRandomSampler(test_indices)

# dataloaders can prefetch the next batch if using n workers while
# the model is tranining
train_loader = torch.utils.data.DataLoader(dataset,
                                           batch_size=BATCH_SIZE,
                                           num_workers=N_WORKERS,
                                           sampler=train_sampler)

test_loader = torch.utils.data.DataLoader(dataset, 
                                          batch_size=BATCH_SIZE,
                                          num_workers=N_WORKERS,
                                          sampler=test_sampler)


class ChEMBLMultiTask(nn.Module):
    """
    Architecture borrowed from: https://arxiv.org/abs/1502.02072
    """
    def __init__(self, n_tasks):
        super(ChEMBLMultiTask, self).__init__()
        self.n_tasks = n_tasks
        self.fc1 = nn.Linear(1024, 2000)
        self.fc2 = nn.Linear(2000, 100)
        self.dropout = nn.Dropout(0.25)

        # add an independent output for each task in the output layer
        for n_m in range(self.n_tasks):
            self.add_module(f"y{n_m}o", nn.Linear(100, 1))
        
    def forward(self, x):
        h1 = self.dropout(F.relu(self.fc1(x)))
        h2 = F.relu(self.fc2(h1))
        out = [torch.sigmoid(getattr(self, f"y{n_m}o")(h2)) for n_m in range(self.n_tasks)]
        return out
    
# create the model, to GPU if available
model = ChEMBLMultiTask(dataset.n_targets).to(device)

# binary cross entropy
# each task loss is weighted inversely proportional to its number of datapoints, borrowed from:
# http://www.bioinf.at/publications/2014/NIPS2014a.pdf
with tb.open_file(f"{MAIN_PATH}/{DATA_FILE}", mode='r') as t_file:
    weights = torch.tensor(t_file.root.weights[:])
    weights = weights.to(device)

criterion = [nn.BCELoss(weight=w) for x, w in zip(range(dataset.n_targets), weights.float())]

# stochastic gradient descend as an optimiser
optimizer = torch.optim.SGD(model.parameters(), LR)

# model is by default in train mode. Training can be resumed after .eval() but needs to be set to .train() again
model.train()
for ep in range(N_EPOCHS):
    for i, (fps, labels) in enumerate(train_loader):
        # move it to GPU if available
        fps, labels = fps.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = model(fps)
        
        # calc the loss
        loss = torch.tensor(0.0).to(device)
        for j, crit in enumerate(criterion):
            # mask keeping labeled molecules for each task
            mask = labels[:, j] >= 0.0
            if len(labels[:, j][mask]) != 0:
                # the loss is the sum of each task/target loss.
                # there are labeled samples for this task, so we add it's loss
                loss += crit(outputs[j][mask], labels[:, j][mask].view(-1, 1))

        loss.backward()
        optimizer.step()

        if (i+1) % 500 == 0:
            print(f"Epoch: [{ep+1}/{N_EPOCHS}], Step: [{i+1}/{len(train_indices)//BATCH_SIZE}], Loss: {loss.item()}")
    
y_trues = []
y_preds = []
y_preds_proba = []

# do not track history
with torch.no_grad():
    for fps, labels in test_loader:
        # move it to GPU if available
        fps, labels = fps.to(device), labels.to(device)
        # set model to eval, so will not use the dropout layer
        model.eval()
        outputs = model(fps)
        for j, out in enumerate(outputs):
            mask = labels[:, j] >= 0.0
            mask = mask.to(device)
            y_pred = torch.where(out[mask].to(device) > 0.5, torch.ones(1).to(device), torch.zeros(1).to(device)).view(1, -1)

            if y_pred.shape[1] > 0:
                for l in labels[:, j][mask].long().tolist():
                    y_trues.append(l)
                for p in y_pred.view(-1, 1).tolist():
                    y_preds.append(int(p[0]))
                for p in out[mask].view(-1, 1).tolist():
                    y_preds_proba.append(float(p[0]))

tn, fp, fn, tp = confusion_matrix(y_trues, y_preds).ravel()
sens = tp / (tp + fn)
spec = tn / (tn + fp)
prec = tp / (tp + fp)
f1 = f1_score(y_trues, y_preds)
acc = accuracy_score(y_trues, y_preds)
mcc = matthews_corrcoef(y_trues, y_preds)
auc = roc_auc_score(y_trues, y_preds_proba)

print(f"accuracy: {acc}, auc: {auc}, sens: {sens}, spec: {spec}, prec: {prec}, mcc: {mcc}, f1: {f1}")
print(f"Not bad for only {N_EPOCHS} epochs!")

torch.save(model.state_dict(), f"./{MODEL_FILE}")

Because I ran the code on my personal environment, I set the number of epochs to 2, but it’s better to train for more epochs in production.

After the training, I converted the model to onnx format. The code is…

import numpy as np
import torch
from torch import nn
import torch.nn.functional as F
import torch.utils.data as D
import tables as tb
from sklearn.metrics import (matthews_corrcoef, 
                             confusion_matrix, 
                             f1_score, 
                             roc_auc_score,
                             accuracy_score,
                             roc_auc_score)
from torch import onnx

# set the device to GPU if available
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
MAIN_PATH = '.'
DATA_FILE = 'mt_data.h5'
MODEL_FILE = 'chembl_mt.model'
N_WORKERS = 8 # Dataloader workers, prefetch data in parallel to have it ready for the model after each batch train
BATCH_SIZE = 32 # https://twitter.com/ylecun/status/989610208497360896?lang=es
LR = 2 # Learning rate. Big value because of the way we are weighting the targets
N_EPOCHS = 2 # You should train longer!!!

class ChEMBLDataset(D.Dataset):
    
    def __init__(self, file_path):
        self.file_path = file_path
        with tb.open_file(self.file_path, mode='r') as t_file:
            self.length = t_file.root.fps.shape[0]
            self.n_targets = t_file.root.labels.shape[1]
        
    def __len__(self):
        return self.length
    
    def __getitem__(self, index):
        with tb.open_file(self.file_path, mode='r') as t_file:
            structure = t_file.root.fps[index]
            labels = t_file.root.labels[index]
        return structure, labels


class ChEMBLMultiTask(nn.Module):
    """
    Architecture borrowed from: https://arxiv.org/abs/1502.02072
    """
    def __init__(self, n_tasks):
        super(ChEMBLMultiTask, self).__init__()
        self.n_tasks = n_tasks
        self.fc1 = nn.Linear(1024, 2000)
        self.fc2 = nn.Linear(2000, 100)
        self.dropout = nn.Dropout(0.25)

        # add an independent output for each task in the output layer
        for n_m in range(self.n_tasks):
            self.add_module(f"y{n_m}o", nn.Linear(100, 1))
        
    def forward(self, x):
        h1 = self.dropout(F.relu(self.fc1(x)))
        h2 = F.relu(self.fc2(h1))
        out = [torch.sigmoid(getattr(self, f"y{n_m}o")(h2)) for n_m in range(self.n_tasks)]
        return out

dataset = ChEMBLDataset(f"{MAIN_PATH}/{DATA_FILE}")
validation_split = .2
random_seed= 42

dataset_size = len(dataset)
model = ChEMBLMultiTask(dataset.n_targets).to(device)
# load the trained weights before exporting; otherwise the exported model would be untrained
model.load_state_dict(torch.load(f"./{MODEL_FILE}"))
model.eval()
path = './model_onnx.onnx'
dummy = torch.tensor([[0.5 for _ in range(1024)]], dtype=torch.float32).to(device)
onnx.export(model, dummy, path, input_names=['input_1'], output_names=['output'])

Now I have an onnx model for chembl target prediction. I tried to run the workflow with the onnx model, but it didn’t work, so I used a tensorflow model directly in the knime workflow.

At first, I converted the onnx model to a tensorflow model with onnx-tf.

$ onnx-tf convert -i model_onnx.onnx -o model_tf
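Before moving to knime, the exported onnx model can be sanity-checked. Here is a minimal sketch with onnxruntime (an assumption of mine; it is not part of the workflow itself):

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model_onnx.onnx")
dummy = np.random.rand(1, 1024).astype(np.float32)  # one fake 1024-bit fingerprint
outs = sess.run(None, {"input_1": dummy})  # input name was set during onnx.export
print(len(outs), outs[0].shape)  # one sigmoid score per target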

Now ready to run the prediction ;)

I built the workflow (most parts came from Greg’s great work!).

After running the WF, I could get a heat map of the target predictions.

The deep learning environment for knime is below.


# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
abseil-cpp                20210324.2           h9c3ff4c_0    conda-forge
absl-py                   1.4.0              pyhd8ed1ab_0    conda-forge
aiohttp                   3.8.3            py37h540881e_0    conda-forge
aiosignal                 1.3.1              pyhd8ed1ab_0    conda-forge
astunparse                1.6.3              pyhd8ed1ab_0    conda-forge
async-timeout             4.0.2              pyhd8ed1ab_0    conda-forge
asynctest                 0.13.0                     py_0    conda-forge
attrs                     22.2.0             pyh71513ae_0    conda-forge
blinker                   1.5                pyhd8ed1ab_0    conda-forge
brotlipy                  0.7.0           py37h540881e_1004    conda-forge
c-ares                    1.18.1               h7f98852_0    conda-forge
ca-certificates           2022.12.7            ha878542_0    conda-forge
cached-property           1.5.2                hd8ed1ab_1    conda-forge
cached_property           1.5.2              pyha770c72_1    conda-forge
cachetools                5.3.0              pyhd8ed1ab_0    conda-forge
certifi                   2022.12.7          pyhd8ed1ab_0    conda-forge
cffi                      1.15.1           py37h43b0acd_1    conda-forge
charset-normalizer        2.1.1              pyhd8ed1ab_0    conda-forge
click                     8.1.3            py37h89c1867_0    conda-forge
conda                     4.12.0           py37h89c1867_0    conda-forge
conda-package-handling    2.0.2              pyh38be061_0    conda-forge
conda-package-streaming   0.7.0              pyhd8ed1ab_1    conda-forge
cryptography              38.0.2           py37h38fbfac_1    conda-forge
cudatoolkit               11.8.0              h37601d7_11    conda-forge
cudnn                     8.4.1.50             hed8a83a_0    conda-forge
frozenlist                1.3.1            py37h540881e_0    conda-forge
gast                      0.5.3              pyhd8ed1ab_0    conda-forge
giflib                    5.2.1                h0b41bf4_3    conda-forge
google-auth               2.16.2             pyh1a96a4e_0    conda-forge
google-auth-oauthlib      0.4.6              pyhd8ed1ab_0    conda-forge
google-pasta              0.2.0              pyh8c360ce_0    conda-forge
grpc-cpp                  1.43.2               h9e046d8_3    conda-forge
grpcio                    1.43.0           py37hb27c1af_0    conda-forge
h5py                      3.7.0           nompi_py37hf1ce037_101    conda-forge
hdf5                      1.12.2          nompi_h2386368_101    conda-forge
icu                       70.1                 h27087fc_0    conda-forge
idna                      3.4                pyhd8ed1ab_0    conda-forge
importlib-metadata        4.11.4           py37h89c1867_0    conda-forge
jpeg                      9e                   h0b41bf4_3    conda-forge
keras                     2.8.0              pyhd8ed1ab_0    conda-forge
keras-preprocessing       1.1.2              pyhd8ed1ab_0    conda-forge
keyutils                  1.6.1                h166bdaf_0    conda-forge
krb5                      1.20.1               hf9c8cef_0    conda-forge
ld_impl_linux-64          2.40                 h41732ed_0    conda-forge
libaec                    1.0.6                hcb278e6_1    conda-forge
libblas                   3.9.0           16_linux64_openblas    conda-forge
libcblas                  3.9.0           16_linux64_openblas    conda-forge
libcurl                   7.87.0               h6312ad2_0    conda-forge
libedit                   3.1.20191231         he28a2e2_2    conda-forge
libev                     4.33                 h516909a_1    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 12.2.0              h65d4601_19    conda-forge
libgfortran-ng            12.2.0              h69a702a_19    conda-forge
libgfortran5              12.2.0              h337968e_19    conda-forge
libgomp                   12.2.0              h65d4601_19    conda-forge
liblapack                 3.9.0           16_linux64_openblas    conda-forge
libnghttp2                1.51.0               hdcd2b5c_0    conda-forge
libnsl                    2.0.0                h7f98852_0    conda-forge
libopenblas               0.3.21          pthreads_h78a6416_3    conda-forge
libpng                    1.6.39               h753d276_0    conda-forge
libprotobuf               3.19.4               h780b84a_0    conda-forge
libsolv                   0.7.23               h3eb15da_0    conda-forge
libsqlite                 3.40.0               h753d276_0    conda-forge
libssh2                   1.10.0               haa6b8db_3    conda-forge
libstdcxx-ng              12.2.0              h46fd767_19    conda-forge
libzlib                   1.2.13               h166bdaf_4    conda-forge
mamba                     0.1.2            py37h99015e2_0    conda-forge
markdown                  3.4.1              pyhd8ed1ab_0    conda-forge
markupsafe                2.1.1            py37h540881e_1    conda-forge
multidict                 6.0.2            py37h540881e_1    conda-forge
nccl                      2.14.3.1             h0800d71_0    conda-forge
ncurses                   6.3                  h27087fc_1    conda-forge
numpy                     1.21.6           py37h976b520_0    conda-forge
oauthlib                  3.2.2              pyhd8ed1ab_0    conda-forge
onnx                      1.13.1                   pypi_0    pypi
onnx-tf                   1.10.0                   pypi_0    pypi
openssl                   1.1.1t               h0b41bf4_0    conda-forge
opt_einsum                3.3.0              pyhd8ed1ab_1    conda-forge
packaging                 23.0                     pypi_0    pypi
pip                       23.0.1             pyhd8ed1ab_0    conda-forge
protobuf                  3.20.3                   pypi_0    pypi
pyasn1                    0.4.8                      py_0    conda-forge
pyasn1-modules            0.2.7                      py_0    conda-forge
pycosat                   0.6.4            py37h540881e_0    conda-forge
pycparser                 2.21               pyhd8ed1ab_0    conda-forge
pyjwt                     2.6.0              pyhd8ed1ab_0    conda-forge
pyopenssl                 23.0.0             pyhd8ed1ab_0    conda-forge
pysocks                   1.7.1            py37h89c1867_5    conda-forge
python                    3.7.12          hb7a2778_100_cpython    conda-forge
python-flatbuffers        23.1.21            pyhd8ed1ab_0    conda-forge
python_abi                3.7                     3_cp37m    conda-forge
pyu2f                     0.1.5              pyhd8ed1ab_0    conda-forge
pyyaml                    6.0                      pypi_0    pypi
re2                       2022.02.01           h9c3ff4c_0    conda-forge
readline                  8.1.2                h0f457ee_0    conda-forge
requests                  2.28.2             pyhd8ed1ab_0    conda-forge
requests-oauthlib         1.3.1              pyhd8ed1ab_0    conda-forge
rsa                       4.9                pyhd8ed1ab_0    conda-forge
ruamel_yaml               0.15.80         py37h540881e_1007    conda-forge
scipy                     1.7.3            py37hf2a6cf1_0    conda-forge
setuptools                67.6.0             pyhd8ed1ab_0    conda-forge
six                       1.16.0             pyh6c4a22f_0    conda-forge
snappy                    1.1.10               h9fff704_0    conda-forge
sqlite                    3.40.0               h4ff8645_0    conda-forge
tensorboard               2.8.0              pyhd8ed1ab_1    conda-forge
tensorboard-data-server   0.6.0            py37h38fbfac_2    conda-forge
tensorboard-plugin-wit    1.8.1              pyhd8ed1ab_0    conda-forge
tensorflow                2.8.0           cuda112py37h01c6645_0    conda-forge
tensorflow-addons         0.19.0                   pypi_0    pypi
tensorflow-base           2.8.0           cuda112py37hd7e45b3_0    conda-forge
tensorflow-estimator      2.8.0           cuda112py37h25bb9bc_0    conda-forge
tensorflow-gpu            2.8.0           cuda112py37h0bbbad9_0    conda-forge
termcolor                 2.2.0              pyhd8ed1ab_0    conda-forge
tk                        8.6.12               h27826a3_0    conda-forge
typeguard                 3.0.1                    pypi_0    pypi
typing-extensions         4.5.0                hd8ed1ab_0    conda-forge
typing_extensions         4.5.0              pyha770c72_0    conda-forge
urllib3                   1.26.15            pyhd8ed1ab_0    conda-forge
werkzeug                  2.2.3              pyhd8ed1ab_0    conda-forge
wheel                     0.40.0             pyhd8ed1ab_0    conda-forge
wrapt                     1.14.1           py37h540881e_0    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge
yaml                      0.2.5                h7f98852_2    conda-forge
yarl                      1.7.2            py37h540881e_2    conda-forge
zipp                      3.15.0             pyhd8ed1ab_0    conda-forge
zlib                      1.2.13               h166bdaf_4    conda-forge
zstandard                 0.18.0           py37h540881e_0    conda-forge

I uploaded today's code to GitHub and KNIME Hub.

https://github.com/iwatobipen/chembl_targetprediction

https://hub.knime.com/-/spaces/-/latest/~keV0drWZt67jVOx3/

Feel free to modify the code and the KNIME workflow. Let's enjoy chemoinformatics!

Calculate atomic properties with a new chemoinformatics package♪ #RDKit #Jazzy

It has been a long time since my last post ;) Now we are in February… I hope every reader has had a nice start to the year.

I'm enjoying my days, but I'm really busy.

This is the first post of the year!

Thinking about the strength of hydrogen bond donors and acceptors is important for drug design, but these strengths are difficult to calculate. Recently I found a useful package that does it, so I would like to share it.

The package is named Jazzy!!!!!! The code is disclosed in an AstraZeneca repo and it is easy to use. After installing the package, you can calculate the hydrogen bond donor and acceptor strengths of a molecule and visualize them.

OK let’s write code.

Install jazzy with the pip command.

$ pip install jazzy

Jazzy depends on some Python packages: click, kallisto, numpy and rdkit. These packages are installed automatically when the pip install jazzy command is called.

The following code is an example of how to use jazzy.
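Since the gist embed is broken here, the snippet below is a minimal sketch based on the functions documented in jazzy's cookbook (jazzy.api); treat the exact function names as assumptions and check the official documentation linked below.

from jazzy.api import molecular_vector_from_smiles
from jazzy.api import atomic_map_from_smiles

smiles = 'c1ccccc1O'  # phenol as a simple example

# molecular-level hydrogen-bond descriptors (donor/acceptor strengths etc.)
mol_vector = molecular_vector_from_smiles(smiles)
print(mol_vector)

# per-atom donor/acceptor strengths, useful for visualization
atomic_map = atomic_map_from_smiles(smiles)
print(atomic_map)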


As you can see, jazzy provides useful methods for calculating compound properties. These parameters are useful not only for analyzing compound properties but also as descriptors for machine learning.

If readers are interested in the package, please try it and check the official documentation.

https://jazzy.readthedocs.io/en/latest/cookbook.html

Thanks for reading.

Calculate similarity of popular bioisosteres #RDKit #espsim #memo

It has been almost 3 years since I changed jobs. Fortunately, I'm still working in the early drug discovery field as a chemoinformatician ;)

As many drug hunters know, bioisosteric replacement is one of the major ways of designing novel molecules from known biologically active molecules.

Transformations from carboxylic acid to tetrazole and from amide to oxazole are major examples of bioisosteric replacement.

Are these pairs, "carboxylic acid & tetrazole" and "amide & oxazole", actually similar? And are there any metrics to compare their similarity other than biological activity?

Espsim is an interesting OSS package for comparing molecules by their electrostatic potentials.

I tried to compare the following tetrazole/carboxylic acid pair and amide/oxazole pair.

To clarify the similarity, I used not only the neutral state of the carboxylic acid but also its ionized state.

Let's compare carboxylic acid and tetrazole with Morgan FP and ESPsim. As you can see, tetrazole and carboxylic acid show low 2D fingerprint similarity but high similarity in terms of ESP and shape.
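As a minimal sketch of that comparison (assuming espsim's EmbedAlignScore helper, which embeds conformers, aligns them and returns shape and ESP similarities; check the espsim repo for the exact signature):

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from espsim import EmbedAlignScore

acid = Chem.MolFromSmiles('c1ccccc1C(=O)O')
tetrazole = Chem.MolFromSmiles('c1ccccc1-c1nnn[nH]1')

# 2D similarity with Morgan fingerprints
fp1 = AllChem.GetMorganFingerprintAsBitVect(acid, 2, 2048)
fp2 = AllChem.GetMorganFingerprintAsBitVect(tetrazole, 2, 2048)
print('Tanimoto:', DataStructs.TanimotoSimilarity(fp1, fp2))

# 3D shape and electrostatic-potential similarity
simShape, simEsp = EmbedAlignScore(acid, [tetrazole])
print('Shape:', simShape[0], 'ESP:', simEsp[0])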

Next, let's compare carboxylate and carboxylic acid methyl ester.

Interestingly, the ESP similarity between carboxylate and methyl ester is very low compared to their shape similarity.

How about the similarity between carboxylic acid and methyl ester? Ah! These molecules show high similarity not only in shape but also in electrostatic potential.

Next, let's see amide and oxazole.

The result is the same as in the carboxylate case: the pair shows low 2D similarity but high shape and ESP similarity.

From these experiments, ESPsim looks like an interesting tool for finding bioisosteric fragments. I think it is worth using espsim as a tool for drug design.

Today's code is uploaded to my gist.


This is the last chemoinformatics blog post of the year. Thanks for visiting and reading my blog. I hope you and your family have a great rest of the year and a happy new year ;)

Make pptx file from Python #chemoinformatics #memo #python-pptx

I hope all readers are enjoying chemoinformatics ;)
Unfortunately I've been struggling with making PPTX files this year, so I can't find enough time for coding. It's the same for medicinal chemists, because they often make presentation slides for the project team, their boss, seniors, etc….

They transfer SAR data from spreadsheets to pptx. It seems like a time-consuming task to me.

So I would like to write code for an automatic pptx generator with Python ;)

As you know, there is useful prior art on this topic. The URLs are listed below.

https://github.com/dkuhn/sdf2ppt
https://docs.eyesopen.com/toolkits/cookbook/python/depiction/csv2pptx.html?highlight=pptx

The first one uses RDKit and the second one uses OETK.

In this post I used RDKit, and most of the code is borrowed from OE's example code.

Here is my example code.
At first, import the libraries required to make a pptx file and handle chemical structures.

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import rdDepictor
from rdkit.Chem.Draw import rdMolDraw2D
from io import BytesIO, StringIO
from pptx import Presentation
from pptx.util import Inches
# read example molecules
mols = [m for m in Chem.SDMolSupplier('./cdk2.sdf')]
for m in mols:
    AllChem.Compute2DCoords(m)

Then define functions to draw a molecule and make the pptx. I used BytesIO because it means I don't need to make a temporary image file.

def getmol_bio(mol, size=(200, 200), fontsize=-1):
    bio = BytesIO()
    drawer = rdMolDraw2D.MolDraw2DCairo(size[0], size[1])
    op = drawer.drawOptions()
    op.fixedFontSize = fontsize
    drawer.DrawMolecule(mol)
    drawer.FinishDrawing()
    bio.write(drawer.GetDrawingText())
    bio.seek(0)  # rewind so python-pptx can read the PNG data from the stream
    return bio

def make_pptx(mol):
    pres = Presentation()
    title_slide_layout = pres.slide_layouts[0]
    title_slide = pres.slides.add_slide(title_slide_layout)
    title = title_slide.shapes.title
    title.text = 'mol props'
    slide_layout = pres.slide_layouts[5]
    slide = pres.slides.add_slide(slide_layout)
    bio = getmol_bio(mol, size=(300,200), fontsize=18)
    slide.shapes.add_picture(bio, left=Inches(1.0), top=Inches(2.0), width=Inches(2.5))
    props = mol.GetPropsAsDict()
    rows = len(props)
    cols = 2
    table = slide.shapes.add_table(rows, cols, left=Inches(4.0), top=Inches(2.0),
                                   width=Inches(5.5), height=Inches(0.8)).table
    table.columns[0].width = Inches(2.0)
    table.columns[1].width = Inches(3.5)
    table.first_row = False
    for row, (k,v) in enumerate(props.items()):
        table.cell(row, 0).text = k[:20]+':'
        table.cell(row, 1).text = str(v)
    pres.save('mol.pptx')

The main part of the code is done. Let's make a PPTX file.

make_pptx(mols[0])

After calling make_pptx, I could get the pptx file shown below.

It worked fine. This is a simple example, but using this approach it should be easy to make many compound-related pptx files (IMHO).
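For example, a hypothetical helper like make_pptx_multi below (a sketch reusing getmol_bio from above) would put one molecule per slide in a single deck:

def make_pptx_multi(mols, outfile='mols.pptx'):
    # one deck, one slide per molecule, reusing getmol_bio() defined above
    pres = Presentation()
    layout = pres.slide_layouts[5]
    for mol in mols:
        slide = pres.slides.add_slide(layout)
        slide.shapes.title.text = mol.GetProp('_Name')
        bio = getmol_bio(mol, size=(300, 200), fontsize=18)
        slide.shapes.add_picture(bio, left=Inches(1.0), top=Inches(2.0), width=Inches(2.5))
    pres.save(outfile)

make_pptx_multi(mols[:5])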

The python-pptx package is a really cool and useful tool, I think. It's interesting for me! Love coding!

CLI tool for making ssslib #chemoinformatics #rdkit

As many RDKitters know, rdSubstructLibrary is a cool tool for conducting substructure searches. Greg Landrum introduced how to use it in his great blog post.

I love the method because substructure searching with it is very fast. So I decided to make a CLI tool for building a substructure library database.

To do it, I used click, which is a useful package for making CLI tools.

This is an example of how to use the code.

$ gh repo clone iwatobipen/rdsss
$ cd rdsss
$ pip install -e .

After installing the package, three commands are available:
1. make_rdssslib, which makes an sslib from an sdf.gz file
2. update_rdssslib, which updates the sslib with a new sdf.gz file
3. run_rdsss, which runs SSS with a given SMARTS query

The example is shown below.

# make ssslib from sdf.gz
$ make_rdssslib cdk2.sdf.gz cdk2.sslib.pkl

# search with ssslib from CLI
$ run_rdsss 'c1ccccc1' cdk2.sslib.pkl

After running run_rdsss, a hits.csv file will be generated.

$ cat hits.csv
Cn1cnc2c(NCc3ccccc3)nc(NCCO)nc21,ZINC01641925
CC[C@H](CO)Nc1nc(NCc2ccccc2)c2ncn(C(C)C)c2n1,ZINC01649340
COc1ccc(CNc2nc(N(CCO)CCO)nc3c2ncn3C(C)C)cc1,ZINC01487345
COc1ccc2c(c1)/C(=C/c1cnc[nH]1)C(=O)N2,ZINC03814467
COc1cc[nH]c1/C=C1\C(=O)Nc2ccc([N+](=O)[O-])cc21,ZINC03814470
COc1cc(-c2ccc[nH]2)c2c3c(ccc(F)c13)NC2=O,ZINC00003491
[NH3+]CCSc1cc(-c2ccc[nH]2)c2c3c(ccc(F)c13)NC2=O,ZINC03814473
NC(=O)Nc1cccc2c1C(=O)c1c-2n[nH]c1-c1cccs1,ZINC03814477

The CSV file contains the hit SMILES and the _Name props.

The whole process can be done from the CLI with this code. But to handle a large sslib, I think users should run SSS in the interpreter, because loading the sslib from disk is the bottleneck of the code.
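A minimal sketch of that interactive use, assuming the sslib file is a pickled rdSubstructLibrary.SubstructLibrary as the commands above suggest:

import pickle
from rdkit import Chem

# load the pickled substructure library once, then reuse it for many queries
with open('cdk2.sslib.pkl', 'rb') as f:
    sslib = pickle.load(f)

query = Chem.MolFromSmarts('c1ccccc1')
match_ids = sslib.GetMatches(query)  # indices of the matching molecules
hits = [sslib.GetMol(i) for i in match_ids]
print(len(hits), 'hits')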

This code is still under development. Any advice or suggestions will be greatly appreciated.

https://github.com/iwatobipen/rdsss

Use mmpdblib on jupyter notebook #rdkit #mmpdb #chemoinformatics

I introduced MMPDB v3 a few days ago. The package is a CLI tool, so users can do MMPA by typing commands on the command line. However, the tool doesn't provide an API interface.

I think that if we can use MMPDB in a Jupyter notebook it will be more useful, because the output data can be used directly in the coding process.

The MMPDB CLI interface is implemented with Click, which is a Python command-line option parser. Click provides CliRunner for testing, and I found that CliRunner is a nice solution for this.

The following example is my first trial of calling mmpdb in a Jupyter notebook. Most of the SQL part is borrowed from an RDKit blog post.

At first, connect to the PostgreSQL ChEMBL30 database and retrieve the data.

import psycopg2
from rdkit.Chem.Draw import IPythonConsole
from rdkit import rdBase
from rdkit import Chem
from rdkit.Chem import PandasTools
import requests
import pandas as pd
from mmpdblib import cli
from click.testing import CliRunner
from IPython.display import HTML
%load_ext sql
print(rdBase.rdkitVersion)

data = %sql select canonical_smiles,molregno,activity_id,standard_value,standard_units from activities \
  join assays using (assay_id) \
  join compound_structures using (molregno) \
  where tid=165 and standard_type='Ki' and standard_value is not null and standard_relation='=' \
    and canonical_smiles not like '%.%';

df = data.DataFrame()

Next, calculate pKi and rename a column for the next data-processing step. Then save the compound data and property data as separate files. It's worth knowing that the properties data must be in tab-separated format.

import numpy as np
def get_pki(val):
    val = float(val)+0.001
    return 9.-np.log10(val)

df['pKi_hERG'] = df.standard_value.apply(get_pki)
df.rename(columns={'molregno':'ID'}, inplace=True)

cpddf = df[['canonical_smiles','ID']].copy()
cpddf.drop_duplicates().to_csv('chembl_herg.smi', sep=' ', index=False, header=False)

propdf = df[['ID', 'pKi_hERG']].copy()
propdf.drop_duplicates().to_csv('chembl_herg_prop.csv', sep='\t', index=False, columns=['ID', 'pKi_hERG'])

OK, now we're ready to run mmpdblib. CliRunner is easy to use; here are some examples.

# make fragmentdb
runner = CliRunner(mix_stderr=False)
args = ("fragment", "chembl_herg.smi", "-o", "chembl_herg.fragmentdb")
runner.invoke(cli.main, args=args)

# do indexing
args = ("index", "chembl_herg.fragmentdb", "-o", "chembl_herg.mmpdb")
runner.invoke(cli.main, args=args)

# check data
!mmpdb rulecat chembl_herg.mmpdb
id	from_smiles	to_smiles
4	[*:1]c1ccc(F)cc1Cl	[*:1]c1ccc(F)cc1O
1	[*:1]c1ccc(F)cc1O	[*:1]c1ccccc1O
2	[*:1]c1ccc(F)cc1Cl	[*:1]c1ccccc1O
4183	[*:1]c1ccc(F)cc1Br	[*:1]c1ccc(F)cc1Cl
4184	[*:1]c1c(Cl)cccc1Cl	[*:1]c1ccc(F)cc1Cl
.....

# load properties
args = ("loadprops", "-p", "chembl_herg_prop.csv", "chembl_herg.mmpdb")
runner.invoke(cli.main, args=args)

It works fine. OK, let's use the generate command.

# I used droperidol as an example molecule
smi = 'C1CN(CC=C1N2C3=CC=CC=C3NC2=O)CCCC(=O)C4=CC=C(C=C4)F'
droperidol = Chem.MolFromSmiles(smi)
smi = Chem.MolToSmiles(droperidol)
droperidol

args = ('generate', "--smiles", smi, "--query", "c1c(F)ccc(*)c1", "--radius", "1", "chembl_herg.mmpdb")
res = runner.invoke(cli.main, args)

The result object named 'res' holds stdout as text. I used StringIO to convert the data into a pandas dataframe ;)

from io import StringIO
sio = StringIO()
sio.write(res.output)
sio.seek(0)
mmpdf = pd.read_csv(sio, sep='\t', skiprows=[1,2])
mmpdf.head(3)

#
start	constant	from_smiles	to_smiles	r	pseudosmiles	final	heavies_diff	#pairs	pair_from_id	pair_from_smiles	pair_to_id	pair_to_smiles
0	O=C(CCCN1CC=C(n2c(=O)[nH]c3ccccc32)CC1)c1ccc(F...	*C(=O)CCCN1CC=C(n2c(=O)[nH]c3ccccc32)CC1	[*:1]c1ccc(F)cc1	[*:1]C	1	[*:1]-[C](~*)(~*)	CC(=O)CCCN1CC=C(n2c(=O)[nH]c3ccccc32)CC1	-6	2	423277	O=C(NC1CCc2cc(CCN3CCN(c4nsc5ccccc45)CC3)ccc21)...	423269	CC(=O)NC1CCc2cc(CCN3CCN(c4nsc5ccccc45)CC3)ccc21
1	O=C(CCCN1CC=C(n2c(=O)[nH]c3ccccc32)CC1)c1ccc(F...	*C(=O)CCCN1CC=C(n2c(=O)[nH]c3ccccc32)CC1	[*:1]c1ccc(F)cc1	[*:1]C(C)C	1	[*:1]-[C](~*)(~*)	CC(C)C(=O)CCCN1CC=C(n2c(=O)[nH]c3ccccc32)CC1	-4	2	423277	O=C(NC1CCc2cc(CCN3CCN(c4nsc5ccccc45)CC3)ccc21)...	423275	CC(C)C(=O)NC1CCc2cc(CCN3CCN(c4nsc5ccccc45)CC3)...

Finally, convert the SMILES columns to RDKit mol objects with PandasTools. I used the changeMoleculeRendering method because, without it, the molecules in the dataframe are not rendered.

PandasTools.AddMoleculeColumnToFrame(mmpdf, smilesCol="start", molCol="start")
PandasTools.AddMoleculeColumnToFrame(mmpdf, smilesCol="final", molCol="final")

from rdkit.Chem import PandasPatcher
PandasPatcher.changeMoleculeRendering(mmpdf)

Worked fine ;)

By using this approach, we can call mmpdb functions from Python code without the subprocess module. I think it is useful for calling functions from any click-based CLI tool.
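For example, a small helper (run_mmpdb is a hypothetical name) wraps the CliRunner pattern used above so that any subcommand's stdout can be captured directly:

from io import StringIO
import pandas as pd
from click.testing import CliRunner
from mmpdblib import cli

def run_mmpdb(*args):
    # invoke an mmpdb subcommand in-process and return its stdout as text
    runner = CliRunner(mix_stderr=False)
    res = runner.invoke(cli.main, args=args)
    return res.output

# e.g. capture 'generate' output straight into a dataframe
out = run_mmpdb('generate', '--smiles', 'Cc1ccccc1', '--query', '*C', 'chembl_herg.mmpdb')
df = pd.read_csv(StringIO(out), sep='\t', skiprows=[1, 2])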

I uploaded the whole code of this post to my gist.


Thanks for reading.

Let’s use mmpdb v3 #memo #rdkit #chemoinformatics

Matched molecular pair (MMP) analysis is not an AI-based compound design method; however, it is still a useful and powerful approach for compound design.

MMPDB is a cool package for MMP analysis, and Andrew, the developer of MMPDB, presented the new version at the RDKit UGM 2022.

link

Version 3 is still under development, but you can install it from Andrew's repo. I installed it and used mmpdb v3.

$ gh repo clone adalke/mmpdb -- -b v3-dev
$ cd mmpdb
$ pip install -e .

After the commands above, I could call the mmpdb command ;)

Make cdk2.smi from cdk2.sdf with RDKit.

from rdkit import Chem
mols = Chem.SDMolSupplier('cdk2.sdf')
with open('cdk2.smi', 'w') as of:  # open in write mode; the original was missing the mode
    for m in mols:
        molid = m.GetProp('_Name')
        smi = Chem.MolToSmiles(m)
        of.write(f'{smi} {molid}\n')

After making the smi file, let's build the MMP database.

$ mmpdb fragment cdk2.smi -o cdk2.fragment
$ mmpdb index cdk2.fragment -o cdk2.mmpdb

Check the rule list with the rulecat command.

$ mmpdb rulecat cdk2.mmpdb
id	from_smiles	to_smiles
1	[*:1]c1[nH]cc2c1CCOC2=O	[*:1]c1cnc[nH]1
2	[*:1]c1c(F)cc(Br)cc1F	[*:1]c1nccs1
3	[*:1]Oc1nc([*:2])nc(N)c1N=O	[*:1]Oc1nc([*:2])nc2[nH]cnc12
4	[*:1]Oc1nc([*:2])nc2[nH]cnc12	[*:1]Oc1nc(N)nc([*:2])c1N=O
5	[*:1]C1CC1	[*:1]c1ccccc1
6	[*:1]C(=O)C(C)C	[*:1][C@@H]1CCCO1
7	[*:1]C(=O)C(C)C	[*:1][C@H]1CCC(=O)N1
8	[*:1]C(=O)C(C)C	[*:1]C1CCCCC1
9	[*:1]C(=O)C(C)C	[*:1][C@@H]1CC=CCC1
10	[*:1][C@@H]1CCCO1	[*:1][C@H]1CCC(=O)N1
11	[*:1]C1CCCCC1	[*:1][C@@H]1CCCO1
12	[*:1][C@@H]1CC=CCC1	[*:1][C@@H]1CCCO1
13	[*:1]C1CCCCC1	[*:1][C@H]1CCC(=O)N1
14	[*:1][C@@H]1CC=CCC1	[*:1][C@H]1CCC(=O)N1
15	[*:1]C1CCCCC1	[*:1][C@@H]1CC=CCC1
16	[*:1]c1ccc(C(=O)[O-])c([*:2])c1	[*:1]c1cccc([*:2])c1
17	[*:1]Nc1ccc(C(=O)[O-])c([*:2])c1	[*:1]Nc1cccc([*:2])c1
18	[*:1]c1nc([*:2])c(N=O)c(N)n1	[*:1]c1nc([*:2])c2nc[nH]c2n1
19	[*:1]c1nc([*:2])c2nc[nH]c2n1	[*:1]c1nc(N)nc([*:2])c1N=O
20	[*:1]C	[*:1]c1nccs1
21	[*:1]c1ccc(C(N)=O)cc1	[*:1][H]
22	[*:1]c1ccc(C(=O)[O-])c(Cl)c1	[*:1]c1cccc(Cl)c1
23	[*:1]CC1CC1	[*:1]Cc1ccccc1
24	[*:1]c1ccc(S(N)(=O)=O)cc1	[*:1]c1ccccc1
25	[*:1]c1ccccc1	[*:1][H]
26	[*:1]c1ccc(S(N)(=O)=O)cc1	[*:1][H]
27	[*:1]c1nc(N)nc(N)c1N=O	[*:1]c1nc(N)nc2[nH]cnc12
28	[*:1]CC(=O)C(C)C	[*:1]C[C@@H]1CCCO1
29	[*:1]CC(=O)C(C)C	[*:1]C[C@H]1CCC(=O)N1
30	[*:1]CC(=O)C(C)C	[*:1]CC1CCCCC1
31	[*:1]CC(=O)C(C)C	[*:1]C[C@@H]1CC=CCC1
32	[*:1]C[C@@H]1CCCO1	[*:1]C[C@H]1CCC(=O)N1
33	[*:1]CC1CCCCC1	[*:1]C[C@@H]1CCCO1
34	[*:1]C[C@@H]1CC=CCC1	[*:1]C[C@@H]1CCCO1
35	[*:1]CC1CCCCC1	[*:1]C[C@H]1CCC(=O)N1
36	[*:1]C[C@@H]1CC=CCC1	[*:1]C[C@H]1CCC(=O)N1
37	[*:1]CC1CCCCC1	[*:1]C[C@@H]1CC=CCC1
38	[*:1]c1ccc(OC)cc1	[*:1]c1cccs1
39	[*:1]N1CCC[C@H](C(N)=O)C1	[*:1][H]
40	[*:1]OC	[*:1]SCC[NH3+]
41	[*:1]S(=O)(=O)NC	[*:1]S(=O)(=O)NC(=N)N
42	[*:1]S(=O)(=O)NC	[*:1]S(=O)(=O)Nc1nccs1
43	[*:1]S(=O)(=O)NC(=N)N	[*:1]S(=O)(=O)Nc1nccs1
44	[*:1]C(=O)[O-]	[*:1][H]
45	[*:1]S(N)(=O)=O	[*:1][H]
46	[*:1]OC[C@@H](O)C[NH+](C)C	[*:1][H]
47	[*:1]NC(=O)NN(C)C	[*:1]NC(N)=O
48	[*:1]N	[*:1]Nc1ccc(C(N)=O)cc1
49	[*:1]OCC(=O)C(C)C	[*:1]OC[C@@H]1CCCO1
50	[*:1]OCC(=O)C(C)C	[*:1]OC[C@H]1CCC(=O)N1
51	[*:1]OCC(=O)C(C)C	[*:1]OCC1CCCCC1
52	[*:1]OCC(=O)C(C)C	[*:1]OC[C@@H]1CC=CCC1
53	[*:1]OC[C@@H]1CCCO1	[*:1]OC[C@H]1CCC(=O)N1
54	[*:1]OCC1CCCCC1	[*:1]OC[C@@H]1CCCO1
55	[*:1]OC[C@@H]1CC=CCC1	[*:1]OC[C@@H]1CCCO1
56	[*:1]OCC1CCCCC1	[*:1]OC[C@H]1CCC(=O)N1
57	[*:1]OC[C@@H]1CC=CCC1	[*:1]OC[C@H]1CCC(=O)N1
58	[*:1]OCC1CCCCC1	[*:1]OC[C@@H]1CC=CCC1
59	[*:1]NCC1CC1	[*:1]NCc1ccccc1
60	[*:1]N	[*:1]Nc1ccccc1

60 rules were generated from cdk2.smi, which contains 47 molecules.

One of the interesting features of mmpdb v3 is the implementation of the generate command. The command allows the user to generate molecules with a user-defined substructure as a query.

cdk2.mmpdb has a methyl-to-thiazole rule ([*:1]C >>> [*:1]c1nccs1), so I tried the generate command with a simple molecule.

$ mmpdb generate --smiles 'Cc1ccccc1' --query '*C' cdk2.mmpdb
start	constant	from_smiles	to_smiles	r	pseudosmiles	final	heavies_diff	#pairs	pair_from_id	pair_from_smiles	pair_to_id	pair_to_smiles
GENERATE:
  EXEC: *C (22, 22, 1, 1)
Cc1ccccc1	*c1ccccc1	[*:1]C	[*:1]c1nccs1	0	[*:1](~*)	c1ccc(-c2nccs2)cc1	4	1	ZINC03814443	CNS(=O)(=O)c1ccc(N/C=C2\C(=O)Nc3ccccc32)cc1	ZINC03814447	O=C1Nc2ccccc2/C1=C/Nc1ccc(S(=O)(=O)Nc2nccs2)cc1

The molecule generated from 'Cc1ccccc1' is 'c1ccc(-c2nccs2)cc1', and the image is shown below.

The generate command can also take a radius option, which gives flexibility in how the query environment is defined.
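For example (a sketch; the radius value here is arbitrary):

$ mmpdb generate --smiles 'Cc1ccccc1' --query '*C' --radius 2 cdk2.mmpdb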

The new features of mmpdb v3 are not limited to the generate command shown here. You should read Andrew's presentation and check his repo!

Thanks for reading ;)

Embed mols2grid to web page #mols2grid #RDKit #webapp

I often use Flask for my web app development because it is a lightweight web framework and easy to write with.

BTW, rendering molecules as a grid image is easy in a Jupyter notebook using RDKit's Draw.MolsToGridImage function. The recently developed package named mols2grid is also really cool for rendering molecules in a Jupyter notebook. mols2grid generates the grid image as HTML, which means we can embed these grid images into a web page.

So I tried it ;)

The following code uses Flask, Flask-Bootstrap and mols2grid. All packages were installed from conda-forge.

The directory structure is below.

root/app.py
root/cdk2.sdf
root/templates/top.html

And app.py

from flask import Flask
from flask import render_template
from flask_bootstrap import Bootstrap
from rdkit import Chem
import mols2grid


app = Flask(__name__)
Bootstrap(app)

@app.route('/')
def top():
    im = mols2grid.display('./cdk2.sdf',
    #tooltip=["id", "r_mmffld_Potential_Energy-OPLS_2005"]
    )
    return render_template('top.html', data=im.data)


if __name__=="__main__":
    app.run(debug=True)

And top.html is below.

{% extends "bootstrap/base.html" %}
{% block title %}This is an example page{% endblock %}


{% block navbar %}
  <div class="navbar navbar-inverse" role="navigation">
    <div class="container">
      <a class="navbar-brand" href="#">Navbar</a>
    </div>
</div>
{% endblock %}

{% block content %}
<div class="container">
  <div>{{data|safe}}</div>
</div>
{% endblock %}

That's all. It is really simple, isn't it? ;)

Let’s run the web app.

$ python app.py

Then access localhost:5000. I could see the molecules as a grid image.

The grid has pagination and checkboxes, and selected molecules can be downloaded. I think it is a convenient way to implement a molecule-rendering function in my web applications.

Run FW analysis easily with the rdkit contrib FW package #rdkit #chemoinformatics #memo

Last week I enjoyed the RDKit UGM 2022. It was really great and exciting even though I participated online. I hope I can attend the RDKit UGM 2023 in person ;)

As you know, RDKit is a useful OSS package for chemoinformaticians. It has a nice community and is actively developed. I respect the community's and the developers' efforts!

I found a new contrib package in the RDKit GitHub repo: "FreeWilson" (FW). The logic of FW analysis is simple, but it is a powerful approach for finding better combinations of R groups.
The original article is shown here. Pat Walters disclosed and described the code and a use case in his repo and blog post.
https://github.com/PatWalters/Free-Wilson
http://practicalcheminformatics.blogspot.com/2018/05/free-wilson-analysis.html

And fortunately, we can also use FW from the RDKit contrib package ;)

The new version of RDKit (2022.09.1) isn't available from conda-forge yet, but I could get it from the original repo, and the FW package can be installed with the pip command.

(base)$ conda activate chemo_py309
(chemo_py309)$ gh repo clone rdkit/rdkit
(chemo_py309)$ cd rdkit/Contrib/FreeWilson
(chemo_py309)$ pip install .

After installing FW, I ran the code with sample data. My code (almost the same as the readme) is uploaded to my gist.
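Since the gist embed does not render here, the snippet below is a minimal sketch following the package readme; treat the scaffold SMILES, the property used as a score and the attribute names as assumptions made for illustration:

from rdkit import Chem
from freewilson import FWDecompose, FWBuild

# load sample molecules and pick a numeric property as the activity to model
mols = [m for m in Chem.SDMolSupplier('cdk2.sdf') if m is not None]
scores = [float(m.GetProp('r_mmffld_Potential_Energy-OPLS_2005')) for m in mols]
scaffold = Chem.MolFromSmiles('c1ccc2nc(N)nc(N)c2c1')  # hypothetical core for illustration

# decompose the molecules into R groups and fit the Free-Wilson model
fw = FWDecompose(scaffold, mols, scores)
print(f'R^2 of the decomposition: {fw.r2}')

# enumerate new R-group combinations with their predicted scores
for pred in FWBuild(fw):
    print(pred.smiles, pred.prediction)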


By using the package, users can conduct FW analysis really easily. The original repo also shows another approach, which uses FMCS to get a scaffold for the dataset.

Of course, FW does not always propose correct combinations; however, it gives us a chance to think about non-additive SAR (e.g. https://jcheminf.biomedcentral.com/articles/10.1186/s13321-021-00525-z).

I felt the package was useful and easy to use, so I would like to apply it to my internal projects ;)

Make chemoinformatics coding easier! #RDKit #datamol

I love RDKit because it supports lots of chemoinformatics features and is backed by a top-level scientific community.

But early-career chemoinformaticians need to learn Python or C++ to write chemoinformatics code. As you know, KNIME is a good option for no-code chemoinformatics. Another option for beginners is a good wrapper of RDKit.

Recently, a package named datamol was introduced on my TL.

You can install datamol via the conda command. Datamol was also introduced at the RDKit UGM 2021!
https://www.youtube.com/watch?v=ibE10Xy_5VY

I used datamol today, and I felt that it is a useful tool for chemoinformatics because it wraps lots of RDKit functions and makes them easy to use.

My example was uploaded to my gist.
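Since the gist embed is broken here, the snippet below is a minimal sketch of the kind of wrapper functions datamol provides (the example molecule is arbitrary; check the datamol docs for the full API):

import datamol as dm

# build a molecule from SMILES and clean it up with datamol's rdkit wrappers
mol = dm.to_mol('Cc1ccccc1OC(C)=O')
mol = dm.sanitize_mol(mol)
mol = dm.standardize_mol(mol)

print(dm.to_smiles(mol))   # canonical SMILES
print(dm.to_inchi(mol))    # InChI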


I think what is important for these kinds of packages is sustainability, so I hope datamol will be maintained for the long term.

If readers are interested in datamol, please install it in your environment and try it ;)