Try to use new LLM phi3 #memo #LLM

As the name "Large Language Model" implies, using these kinds of models requires a lot of GPU memory, which is not so cost effective for personal use ;)

To overcome this limitation, lots of technologies have been developed and are still being developed. llama.cpp is one of them.

Today I would like to share a new model named Phi-3, developed by Microsoft.

The original article can be found on arXiv.
https://arxiv.org/pdf/2404.14219

From the abstract, “phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone”

It seems cool!

The model is available on Hugging Face, so I tried to use it on my personal PC.
https://huggingface.co/microsoft/Phi-3-mini-4k-instruct#chat-format

To use the model, I made my own environment for testing LLMs.

$ conda create -n phi3 python=3.10
$ conda activate phi3
$ pip install torch torchvision torchaudio jupyterlab
$ pip install git+https://github.com/huggingface/transformers

Now that the test environment is ready, let's test the model!

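The notebook I used follows the chat example from the Hugging Face model card; here is a minimal sketch (the generation parameters are my choices, not required values):

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "microsoft/Phi-3-mini-4k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",   # load onto the GPU
    torch_dtype="auto",  # half-precision weights fit into 8GB VRAM
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

messages = [
    {"role": "system", "content": "You are a very talented drug discovery scientist"},
    {"role": "user", "content": "Talk about what drug discovery will look like in 10 years"},
]
# greedy decoding, up to 500 new tokens
output = pipe(messages, max_new_tokens=500, return_full_text=False, do_sample=False)
print(output[0]["generated_text"])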

I asked the model about the future of drug discovery in both Japanese and English ;)

The questions and answers are below.

# In Japanese
messages = [
    {"role": "system", "content": "あなたは非常に優秀な創薬科学者です"},
    {"role": "user", "content": "10年後の創薬はどのようになっているかを語ってください"}
]

# Answer
10年後の創薬科学は、多大な進歩を遂げていると考えられます。この時点での創薬の進歩は、以下のような要素によって形作られているでしょう。


1. ゲノム編集技術の発展: CRISPR-Cas9などのゲノム編集技術は、遺伝子疾患の治療に革命をもたらし、個別化医療の普及を促進しています。

2. ターゲットDrug Discovery: 高度なバイオインフォマティクスと機械学習を用いたデータ分析により、より効果的な薬剤の発見が進み、特定のターゲットに焦点を当てた治療が可能になりました。

3. 細胞生物学: 細胞生物学の進展により、薬剤の効果を細かく評価し、副作用のリスクを最小限に抑える治療法が開発されています。

4. 臨床試験の改善: 個別化医療により、より適切な治療薬を効果的に試験するためのパーソナライズされた治療試験が行われています。

5. 臨床診断の進歩: 精密医


# In English
messages = [
    {"role": "system", "content": "You are a very talented drug discovery scientist"},
    {"role": "user", "content": "Talk about what drug discovery will look like in 10 years"}
]

In the next decade, drug discovery is poised to undergo a transformative shift, driven by advancements in technology, artificial intelligence (AI), and a deeper understanding of human biology. Here are some key aspects that drug discovery is likely to evolve in the next 10 years:

1. Precision medicine: The focus will shift towards personalized medicine, where treatments are tailored to an individual's genetic makeup, lifestyle, and environment. This will involve the use of genomic data and AI to identify the most effective drugs for a specific patient, minimizing side effects and improving treatment outcomes.

2. AI and machine learning: AI and machine learning algorithms will play a crucial role in drug discovery, enabling researchers to analyze vast amounts of data, identify patterns, and predict potential drug candidates more efficiently. This will significantly reduce the time and cost required for drug development.

3. Advanced drug delivery systems: The development of innovative drug delivery systems, such as nanotechnology-based carriers, will improve the targeted delivery of drugs to specific cells or tissues, minimizing side effects and enhancing drug efficacy.

4. High-throughput screening: The use of automated high-throughput screening technologies will enable researchers to rapidly test thousands of compounds for their potential therapeutic effects, accelerating the drug discovery process.

5. Collaborative drug discovery: The next decade will see an increase in collaborative efforts between academia, industry, and government agencies, as well as the use of open-source platforms and data sharing. This will foster a more efficient and effective drug discovery process, with a greater focus on addressing unmet medical needs.

6. Regenerative medicine: Advances in regenerative medicine, such as stem cell therapy and tissue engineering, will lead to the development of novel treatments for a wide range of diseases, including degenerative conditions and injuries.

7. Immunotherapy: The use of immunotherapy, which harnesses the body's immune system to fight diseases, will continue to grow, with new therapies being developed for cancer, autoimmune diseases, and other conditions.

8. Virtual and in sil

As you can see, the answer in English is much better than the Japanese one.

My PC has an Nvidia RTX 4060 GPU with 8GB of memory. It's not huge, but the model works well.

It's really interesting to me. Current AI requires massive GPU resources because training a model takes a huge amount of training data. That is reasonable but not cost effective. I hope research and development of cost-effective and useful AI goes well.

Generate new molecules from fragments with Diffusion model #cheminformatics #rdkit #difflinker #memo

Designing linked molecules from fragments is one of the important tasks in drug design, for example in FBDD, scaffold hopping (e.g. core replacement) and PROTAC molecule design.

As readers know, there are lots of solutions for this; for example, BROOD is one of the famous commercial packages for fragment replacement. I can't use commercial packages in my hobby, so I like OSS which can be applied to compound design ;)

Today I would like to share DiffLinker, which can design linkers with a diffusion model. The original article is open access; here is the URL.
https://www.nature.com/articles/s42256-024-00815-9

The authors share the code on github, so I tried to use it.
https://github.com/igashov/DiffLinker

The interesting point for me is that DiffLinker can link more than two fragments and can set a pocket constraint. It means that we can design linkers which fit a given pocket. It is cool, isn't it?

Okay, let's try the code!

At first I made an env for DiffLinker.

$ gh repo clone igashov/DiffLinker
$ cd DiffLinker
$ mamba create -c conda-forge -n difflinker rdkit
$ conda activate difflinker
$ mamba install -c conda-forge -c pytorch biopython imageio networkx pytorch pytorch-lightning scipy scikit-learn tqdm wandb

I don't recommend making the env with the yml file included in the repo, because the version of pytorch is a little bit old and the env won't work on recent GPUs.

After making the env, I got the model parameters. The details are described in README.md.

mkdir -p models
wget https://zenodo.org/record/7121300/files/geom_difflinker.ckpt?download=1 -O models/geom_difflinker.ckpt
wget https://zenodo.org/record/7121300/files/geom_size_gnn.ckpt?download=1 -O models/geom_size_gnn.ckpt
wget https://zenodo.org/records/10988017/files/pockets_difflinker_full_no_anchors_fc_pdb_excluded.ckpt?download=1 -O models/pockets_difflinker_full.ckpt
wget https://zenodo.org/record/7121300/files/pockets_difflinker_backbone.ckpt?download=1 -O models/pockets_difflinker_backbone.ckpt
wget https://zenodo.org/records/10988017/files/pockets_difflinker_full_fc_pdb_excluded.ckpt?download=1 -O models/pockets_difflinker_full_given_anchors.ckpt
wget https://zenodo.org/record/7121300/files/zinc_size_gnn.ckpt?download=1 -O models/zinc_size_gnn.ckpt

Now almost there ;) I used a COT kinase PDB as an example.

I removed the central core from the ligand to make the input fragment file shown below.
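If you want to do the same preparation programmatically, here is a minimal RDKit sketch (the input file name and the core SMARTS are hypothetical; in practice use whatever matches your ligand):

from rdkit import Chem

# load the ligand with its 3D coordinates
ligand = Chem.SDMolSupplier('testdata/ligand.sdf', removeHs=False)[0]
# hypothetical pattern for the central core to be removed
core = Chem.MolFromSmarts('c1ccncc1')
# deleting the matched atoms leaves the disconnected fragments,
# keeping the 3D coordinates of the remaining atoms
frags = Chem.DeleteSubstructs(ligand, core)
w = Chem.SDWriter('testdata/ligand_frag.sdf')
w.write(frags)
w.close()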

Then I ran the script with the following command.

$ python generate_with_protein.py --protein testdata/4y85_apo.pdb --fragments testdata/ligand_frag.sdf --output sample_size5 --linker_size 5 --model models/pockets_difflinker_full.ckpt

After running the code, I could get the generated molecules in xyz format in the sample_size5 directory. Then I converted the xyz files to mol format with openbabel.

$ cd sample_size5
$ obabel -m -ixyz *.xyz -omol

After that I could get mol files of the generated molecules. Here is the result.

The gray molecule at the bottom left is the original ligand and the others are generated ligands.

An overlaid image is shown below.

As you can see, all the ligands are well aligned. In summary, DiffLinker is a really interesting tool for fragment linking.

I would like to generate lots of molecules with the package. Thanks for sharing such cool code!

Try to use new version of REINVENT #cheminformatics #memo #rdkit

As many cheminformaticians know (I expect…), REINVENT, developed by the AZ team, is one of the most useful and famous AI-based compound generators in the cheminformatics field. REINVENT 4 is still under active development; recently the version was bumped and some useful code was added to the repository.

The DL framework has moved from pytorch 1.x to 2.x, and cool example notebooks are available.
https://github.com/MolecularAI/REINVENT4

I tried to use it today ;)

If you've installed reinvent already, you should update it before running the new code. The way to update is described in README.md.

# Updating dependencies
# Update the lock files with pip-tools (please, do not edit the files manually):
pip-compile --extra-index-url=https://download.pytorch.org/whl/cu121 --extra-index-url=https://pypi.anaconda.org/OpenEye/simple --resolver=backtracking pyproject.toml

I had modified the original code in my env, so I could not update it smoothly; instead I removed my old env and created a new one.

Then I installed some packages to run the new code.

pip install jupytext mols2grid seaborn

After the installation I tried to run the example code. The following code is the same as in the original README.md, so there is nothing new to get from this blog post ;P

The new version of reinvent has a notebooks directory.

(reinvent4)$ cd REINVENT4/notebooks
#generate jupyter notebook from .py file with jupytext.
(reinvent4)$ jupytext --to ipynb -o Reinvent_TLRL.ipynb Reinvent_TLRL.py 

After running the code above, I got the ipynb file. The notebook is really cool because it uses mols2grid and tensorboard in the notebook and provides a nice view.


The code has a 2-step reinforcement learning process.
Step 1: learn drug-like molecules from the reinvent.prior model. Drug-likeness is defined with QED and alert structures.

Step 2: target-focused learning with a ChemProp QSAR model (tankyrase-2 IC50 data is used).

Here is a screenshot of the first step. I didn't know that tensorboard can render in a notebook!

After the first step, transfer learning is conducted with the tankyrase dataset.

I also got a cool view in the notebook. Here is a grid view of active molecules.

And a learning curve from tensorboard.


The example is useful for learning how to use REINVENT4 with prior models. The example code provides not only how to run reinvent but also a discussion of each learning step. It was worth studying for me.

I would like to say thanks for the great work. And I recommend updating REINVENT in your env!

BTW, there are lots of compound generative models, and these models can generate chemically reasonable compounds. I think it's really amazing! But deep discussion with medicinal chemists is still needed. I would like to discuss with cheminformaticians who use generative models how to use them in real drug discovery projects.

Any comments, discussions are greatly appreciated.

Edit atom indices of RDKit Mol object #memo #cheminformatics

Atom indices are unique numbers for each atom, and RDKit assigns them when a mol object is generated. RDKit makes mol objects from SMILES, InChI, molblock and lots of other formats. To make a canonical representation of a molecule, the atom indices are always assigned automatically by the same rules.

The indices are assigned regardless of scaffold, so two molecules with different substituents on the same scaffold will not have the same indices on the scaffold. Atom map numbers are sometimes used to memorize indices, but I wanted to know whether the same thing is possible with atom indices themselves, i.e. how to modify the atom indices of an rdkit mol object.

After googling, I found useful discussion on github.
https://github.com/rdkit/rdkit/discussions/5091

rdmolops.RenumberAtoms returns a new molecule with a user-defined order of atom indices.

Here is an example; as you can see, after calling the function I could get a new molecule with a different index order.

It's worth knowing how to modify atom indices in rdkit ;)
I tested the function; the code is shown in my gist.
TIPS: To get _smilesAtomOutputOrder from a mol object, MolToSmiles should be called first.
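A minimal sketch of the idea (a toy example of mine, not the full gist):

import ast
from rdkit import Chem

mol = Chem.MolFromSmiles('CC(=O)Nc1ccccc1O')
# MolToSmiles must be called first, otherwise the property is not set
_ = Chem.MolToSmiles(mol)
order = ast.literal_eval(mol.GetProp('_smilesAtomOutputOrder'))
print(order)
# atom i of the new molecule is atom order[i] of the original one
newmol = Chem.RenumberAtoms(mol, list(order))
print(Chem.MolToSmiles(newmol))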

https://gist.github.com/iwatobipen/81a1c28e376fd5fc144c61c5101769b3

Thanks for reading!

Enumerate molecules with CXSMILES #RDKit #cheminforamtics #memo

Recent versions of RDKit support not only SMILES but also CXSMILES (ChemAxon extended SMILES). As the name suggests, CXSMILES can add lots of extra information to SMILES strings, though it's a little bit difficult to understand. However, I found it useful for enumerating molecules ;)

For example, suppose I would like to enumerate a core and two R-groups from SMILES; there are lots of ways to do that. MolZip is one useful way, however annotation is required to enumerate molecules, as the sketch below shows.
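To illustrate, a minimal molzip sketch (a toy example of mine): the dummy atoms must be annotated with matching atom map numbers before zipping.

from rdkit import Chem

core = Chem.MolFromSmiles('c1ccc2[nH]ccc2c1[*:1]')  # core with an annotated attachment point
rgroup = Chem.MolFromSmiles('CO[*:1]')              # R-group carrying the same map number
welded = Chem.molzip(core, rgroup)                  # joins the atoms that share map number 1
print(Chem.MolToSmiles(welded))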

Today I tried to enumerate molecules from CXSMILES with the molecule enumerator. More details of the enumerator are described in Greg's great blog post.

OK, let’s start.

I used indole as a core and Cl and methoxy group as R-Groups.

import rdkit
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import rdMolEnumerator
IPythonConsole.drawOptions.addAtomIndices = True
IPythonConsole.drawOptions.baseFontSize = 1.0
IPythonConsole.drawOptions.padding = 0.001
coresmi = 'c1ccc2[nH]ccc2c1'
core = Chem.MolFromSmiles(coresmi)
r1 = 'CO*'
r2 = 'Cl*'
m = Chem.MolFromSmiles(f'{r1}.{r2}.{coresmi}')
m

I tried to add a Cl atom at the atoms with index 5 or 13, and a methoxy group at the atoms with index 6 or 7.

At first I made a mol from CXSMILES with multicenter S-groups.

m2 = Chem.MolFromSmiles(f'{r1}.{r2}.{coresmi} |m:4:5.13,2:6.7|')
m2

m:4:5.13 means that the dummy atom at index 4 is connected to the aromatic carbons at index 5 and 13, and 2:6.7 means that the dummy atom at index 2 is connected to the aromatic carbons at index 6 and 7. Each multicenter connection should be separated by a comma.

After that I could get the enumerated molecules with the following command.

res = rdMolEnumerator.Enumerate(m2)
res

As you can see, the enumerated molecules have the OMe and Cl groups at the user-defined positions.
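Enumerate returns a MolBundle; a quick way to dump the products as SMILES (a small addition of mine, not in the original notebook):

for i in range(res.Size()):
    print(Chem.MolToSmiles(res.GetMol(i)))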

In summary, the combination of rdkit and CXSMILES will be a powerful tool for cheminformatics.

I updated today’s notebook on my gist.


Current status of DMTA cycle in AZ #memo #DDT #publication #AI/ML

March is the end of the fiscal year at most Japanese companies. I've spent lots of time on paperwork these days… ;P

As many readers know, the DMTA cycle is key to the drug discovery/optimization process, and lots of computational and high-throughput experimental approaches have recently become available for it.

I think AZ is a front runner in this area, and I respect their activity because they not only publish papers but also share code. And by sharing the code, it gets improved by the open science community.

Today I read a nice open access publication by Gian Marco et al. from AstraZeneca (AZ). The title is “Augmenting DMTA using predictive AI modelling at AstraZeneca“. It's worth reading because the authors describe the details of the infrastructure of the Predictive Insight Platform (PIP), which AZ developed for improving the DMTA cycle with ML/AI.

Now almost 250 predictive models are running on the platform; the tasks are not limited to property prediction but also include synthetic route prediction to evaluate proposed molecules. They provide 'Global' and 'Local' models for users; these models are optimized with their own QSAR engine named 'Qptuna'. The origin of the name is that Optuna is used in the model optimization step ;). I'm happy because Optuna is developed by PFN and the package is used worldwide!

The service is currently used for small molecules, but they are planning to expand it to other modalities and to integrate an LLM to support users. I'm looking forward to future progress.

Interesting points for me are below.
– Cloud based #scalable
– Models are deployed as REST APIs #easy to integrate with lots of services
– K8s is used to manage containers

With these features, PIP can maintain high developability/flexibility. It seems really cool. Readers can read the details for free because the article is open access. So let's move on to coding. As I wrote above, Qptuna is used in PIP, and Qptuna is shared in the AZ github repo. Let's try to use it!

The structure of Qptuna is the same as REINVENT, I think; the config is defined by a JSON file.

To build the qptuna env, poetry is required. My current development environment is based on conda, so I built a new environment for the task.

# install pyenv
# https://github.com/pyenv/pyenv?tab=readme-ov-file#installation
$ curl https://pyenv.run | bash
$ echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
$ echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
$ echo 'eval "$(pyenv init -)"' >> ~/.bashrc
# qptuna is developed with python 3.10.10
$ pyenv install 3.10.10

# install poetry
# https://python-poetry.org/docs/#installing-with-the-official-installer
$ curl -sSL https://install.python-poetry.org | python3 -

Then I got qptuna from the original repo and built the venv.

$ gh repo clone MolecularAI/Qptuna
$ cd Qptuna
$ pyenv local 3.10.10
$ poetry env use 3.10.10
$ poetry install

After waiting for the installation process to finish, I had the qptuna env ;)

Let's run the command line for a test. Qptuna provides 4 commands: qptuna-optimize, qptuna-build, qptuna-predict and qptuna-schemagen. The last command seems to generate a JSON template for an ML task; the main commands are qptuna-optimize, -build and -predict.

The original document shows how to run the CLI with a singularity container, but today I tried to run it directly with the same files.

Here is the config JSON file.
The config defines molwt as the prediction target, and ECFP and MACCS keys are used as descriptors. I added "track_to_mlflow": true to integrate mlflow with the experiment.

# examples/optimization/regression_drd2_50.json 
{
  "task": "optimization",
  "data": {
    "input_column": "canonical",
    "response_column": "molwt",
    "training_dataset_file": "tests/data/DRD2/subset-50/train.csv",
    "test_dataset_file": "tests/data/DRD2/subset-50/test.csv"
  },
  "descriptors": [
    {
      "name": "ECFP",
      "parameters": {
        "radius": 3,
        "nBits": 2048
      }
    },
    {
      "name": "MACCS_keys",
      "parameters": {}
    }
  ],
  "settings": {
    "mode": "regression",
    "cross_validation": 3,
    "direction": "maximize",
    "n_startup_trials": 30,
    "n_trials": 15,
    "track_to_mlflow":true  <<< ADDED
  },
  "algorithms": [
    {
      "name": "SVR",
      "parameters": {
        "C": {
          "low": 1E-10,
          "high": 100.0
        },
        "gamma": {
          "low": 0.0001,
          "high": 100.0
        }
      }
    },
    {
      "name": "RandomForestRegressor",
      "parameters": {
        "max_depth": {
          "low": 2,
          "high": 32
        },
        "n_estimators": {
          "low": 10,
          "high": 250
        },
        "max_features": [
          "auto"
        ]
      }
    },
    {
      "name": "Ridge",
      "parameters": {
        "alpha": {
          "low": 0,
          "high": 2
        }
      }
    },
    {
      "name": "Lasso",
      "parameters": {
        "alpha": {
          "low": 0,
          "high": 2
        }
      }
    },
    {
      "name": "PLSRegression",
      "parameters": {
        "n_components": {
          "low": 2,
          "high": 3
        }
      }
    },
    {
      "name": "XGBRegressor",
      "parameters": {
        "max_depth": {
          "low": 2,
          "high": 32
        },
        "n_estimators": {
          "low": 3,
          "high": 100
        },
        "learning_rate": {
          "low": 0.1,
          "high": 0.1
        }
      }
    }
  ]
}

To run the optimization, I typed the following command.

poetry run qptuna-optimize \
        --config examples/optimization/regression_drd2_50.json \
        --best-buildconfig-outpath qptuna-target/best.json \
        --best-model-outpath qptuna-target/best.pkl \
        --merged-model-outpath qptuna-target/merged.pkl

This structure is really similar to REINVENT ;) After a few seconds I got the pkl and json files in the folder I defined. I also got an mlruns folder because I added the mlflow option.

Then I ran a prediction with the model.

poetry run qptuna-predict \
        --model-file ./qptuna-target/merged.pkl \
        --input-smiles-csv-file tests/data/DRD2/subset-50/test.csv \
        --input-smiles-csv-column "canonical" \
        --output-prediction-csv-file ./qptuna-target/prediction.csv

The command accepts a smiles file as input and generates prediction.csv with the predicted values and smiles. It's easy, isn't it?

Next, let's check mlflow.

$ poetry run mlflow ui

Then access localhost:5000. I could see all the experiment results. With mlflow it is easy to track the ML process.

The repo also provides lots of notebooks for using qptuna from python. Readers who are interested in the package, please try it! Thanks to the AZ team for sharing great code and a great article.

https://github.com/MolecularAI/Qptuna

I would also like to know how they built the culture of a prediction (AI/ML) driven research workflow with wet researchers. That is a much more difficult task than building the system.. IMHO

Predict protein-ligand complex dynamics with python #dynamicbind #cheminformatics #memo

Computer-aided drug design is one of the most powerful approaches for drug discovery these days. A docking study of a target protein and ligands is a common procedure to evaluate whether a compound fits the target protein's pocket or not. However, there is a limitation in the method: most docking approaches handle the protein and ligand as rigid bodies, so it's difficult to consider dynamics. Molecular dynamics can overcome this limitation, but it requires lots of computational resources.

Thanks to recent progress in Protein Language Models (PLMs), these models can predict a protein's 3D structure from its sequence. But the predicted structures are apo forms, and sometimes they differ from the holo structures.
Here is an example of using AlphaFold2-predicted structures for molecular docking, from researchers at Schrodinger.
https://chemrxiv.org/engage/chemrxiv/article-details/62ac0e7b04a3a9682d49ce98

They developed induced fit docking (IFD), and the method improved docking performance with AF2-proposed structures. These results showed that using AF2-predicted structures directly is not a good strategy.

To search a large chemical space with AF2-predicted structures, a method that can rapidly consider protein dynamics is required.

Recently I found an interesting article that tackles this issue with OSS. The article was published by Wei Lu et al. in Nature Communications.

The title is 'DynamicBind: predicting ligand-specific protein-ligand complex structure with a deep equivariant generative model'. The article is open access ;)

DynamicBind performs flexible docking with the following process.
1) The ligand conformation is changed only in the first few steps.
2) The model simultaneously translates and rotates the protein residues.

The details of the procedure are described in Fig. 1.
https://www.nature.com/articles/s41467-024-45461-2/figures/1


During the training and inference steps, the ligand is represented as a graph, and the protein is also represented as a graph of C-alpha atoms whose node features include embeddings from ESM and side chain angles.

The important point of their approach is a morph-like transformation that interpolates between the crystal protein structure and the structure predicted by AlphaFold, thereby reducing the transition barriers between meta-stable configurations, such as the AlphaFold-predicted conformation, and the ligand-bound holo configuration. This is a big difference compared to current diffusion-based models.

Using the approach described above, DynamicBind can predict dynamics from an AF2 structure very rapidly; for example, the DFG-in to DFG-out structural transformation of a kinase in a very short time. It's really amazing, because simulating this process with MD takes a long time.

Fortunately the authors shared the code on github!
https://github.com/luwei0917/DynamicBind

So I tried to test the code on my PC.

At first I made the dynamicbind environment and also installed the relax dependencies into it.

$ gh repo clone luwei0917/DynamicBind
$ cd DynamicBind
$ mamba env create -f environment.yml
$ conda activate dynamicbind
$ conda install -c conda-forge openmm pdbfixer libstdcxx-ng
$ conda install -c conda-forge openmmforcefields openff-toolkit ambertools=22 compilers biopython ipython

Then I got the trained ckpt file from zenodo: https://zenodo.org/records/10137507
And I unzipped workdir.zip into the DynamicBind directory.

Almost there; let's run the test code.

To run the code on my PC, I needed to set the MKL_THREADING_LAYER environment variable, and then run the example code shown below. The --paper option is required too; without it, the script tries to read a ckpt file that is not provided and raises an error.

(dynamicbind) $ export MKL_THREADING_LAYER=GNU
# the code can use SMILES or SDF as ligand input
(dynamicbind) $ cat data/test_imatinib_.csv
protein_path,ligand
data/af2_1opj_STI_A_aligned.pdb,data/1opj_STI_A.sdf

(dynamicbind) $ python run_single_protein_inference.py data/cleaned_input_proteinFile.pdb data/test_imatinib_.csv --savings_per_complex 40 --inference_steps 20 --header test --device 0 --python /home/iwatobipen/miniforge3/envs/dynamicbind/bin/python --relax_python /home/iwatobipen/miniforge3/envs/dynamicbind/bin/python --paper

The example uses an AF2-predicted structure and docks imatinib into it.

After running the code above, I could make a movie (pdb files with several frames).

(dynamicbind) $ python movie_generation.py results/test/index0_idx_0/ 1+2 --device 0 --python /home/iwatobipen/miniforge3/envs/dynamicbind/bin/python --relax_python /home/iwatobipen/miniforge3/envs/dynamicbind/bin/python

After running it I got pdb files named rank1_receptor_reverseprocess_relaxed.pdb and rank1_ligand_reverseprocess_relaxed.pdb in the results folder.

Opening these files in pymol, I could see the dynamics from DFG-in to DFG-out.

During the steps, the ligand fits into the pocket of the protein and the A-loop moves.

I tested other ligand and kinase pairs and found that the DFG-in to DFG-out transformation is not always reproduced.

But it's interesting and useful, because the calculation time is really short and the code can be run against multiple molecules.

In summary, DynamicBind is useful for virtual screening with AF2-predicted structures, even if there is still room for improvement.

I'll try to use it against different protein families.

Pocket aware structure generation #DiffDec #cheminformatics

Diffusion models are one of the hot areas of generative modeling, not only in computer vision but also in cheminformatics. Diffusion models are interesting because they generate objects from noise.

BTW, de novo compound design with target protein structure information is a really attractive but difficult approach in drug design. There are some approaches that do this with diffusion models, such as DiffSBDD. DiffSBDD generates molecules with pocket information but sometimes generates strange molecules. One reason is that DiffSBDD generates the whole molecule through the diffusion process without any scaffold information. But this doesn't reflect the realistic case, because in drug discovery projects medicinal chemists decorate molecules based on their own scaffold. So de novo drug design with a scaffold constraint is a reasonable approach.

Recently I found and read a really interesting article about 'DiffDec'. It can be obtained from bioRxiv.
https://www.biorxiv.org/content/10.1101/2023.10.08.561377v1.full.pdf

The interesting points of DiffDec are listed below.
1. DiffDec can generate molecules with a scaffold constraint.
2. DiffDec can generate molecules with target pocket information.

Fortunately the authors shared their code on github ‘https://github.com/biomed-AI/DiffDec‘.

I tried to use it on my PC :)

OK, let's dive into the code! I built the environment with environment.yaml. But my PC env did not fit cuda-10.x, so I reinstalled a pytorch build for cuda 11. Otherwise the installation process is almost the same as the readme.md.

$ gh repo clone biomed-AI/DiffDec
$ cd DiffDec
$ mamba env create -f  environment.yaml
$ conda activate DiffDec
# the following packages aren't in the yaml but are required.
# openbabel installation is important; if this step is skipped you can't get novel molecules.
$ mamba install -c conda-forge openbabel pymol-open-source

After the installation finished, I downloaded the model parameters to run the code. The details are described in the following files.
https://github.com/biomed-AI/DiffDec/blob/master/README.md
https://github.com/biomed-AI/DiffDec/blob/master/data/README.md
After putting the model parameters in the appropriate place, I ran the code with the PDB file of baricitinib, a JAK2 inhibitor, in complex with the kinase.

DiffDec decorates a scaffold using pocket information, so the generated molecules are expected to fit the target protein. To run the code, the user should provide the protein and the scaffold information they would like to keep during the generation process. I defined the scaffold of baricitinib as shown below.

And I prepared the apo pdb and ligand files from 6VN8.

Almost there; I ran the following command.

CUDA_VISIBLE_DEVICES=0 python sample_single_for_specific_context.py \
  --scaffold_smiles_file ./data/examples/bali_scaf.smi \
  --protein_file ./data/examples/jak2apo.pdb \
  --scaffold_file ./data/examples/balicitinib.sdf \
  --task_name exp \
  --data_dir ./data/examples \
  --checkpoint ./ckpt/diffdec_single.ckpt \
  --samples_dir samples_jak2 \
  --n_samples 50 \
  --device cuda:0

After running the code, I got new molecules with my defined scaffold.

Original structure.

Examples of generated structures.

All the generated molecules keep the scaffold!

As the authors mentioned, this program has some limitations, but it is really interesting for novel compound design. I would like to use it in combination with other in silico approaches to design new molecules.

Update rdkit/shape-it #RDKit #shape-it #cheminformatics

Today I tried to build rdkit/shape-it, because I could not build shape-it with the current version of rdkit. So I struggled with the error messages to fix the issue (I'm not so good at C++ ;P).

There are two issues in the current code.
1. The C++ version in CMakeLists.txt is old, so I changed it from c++14 to c++17.
2. The version of Catch2 is old too. The recent version of Catch2 is 3.x, and the old version causes problems during the make process, so I updated CMakeLists.txt and some cpp files.

After the modifications, the build process worked well and I could call shape-it from python.

Here is the install process.

$ gh repo clone rdkit/shape-it
$ cd shape-it
$ mkdir build && cd build  # the trailing '..' below assumes an out-of-source build dir
$ cmake -DCMAKE_INSTALL_PREFIX=/home/iwatobipen/src/shape-it -DBUILD_RDKIT_SUPPORT=ON -DBUILD_PYTHON_SUPPORT=ON -DRDKIT_INCLUDE_DIR=$CONDA_PREFIX/include/rdkit -DPYTHON_INSTDIR=$CONDA_PREFIX/lib/python3.10/site-packages ..
$ make -j8
$ make install

After the process described above, I can call pyshapeit from python.

import os
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import RDConfig
import py3Dmol
from pyshapeit import cpyshapeit
mols = [m for m in Chem.SDMolSupplier(os.path.join(RDConfig.RDDocsDir,'Book/data/cdk2.sdf'))]

ref = Chem.Mol(mols[0])
probe = mols[1]
temp = Chem.Mol(mols[1])
# draw molecules before alignment
p = py3Dmol.view(width=400,height=400)
p.addModel(Chem.MolToMolBlock(ref), 'sdf')
p.addModel(Chem.MolToMolBlock(probe), 'sdf')
p.setStyle({'stick':{}})
p.setBackgroundColor('0xeeeeee')
p.show()
score = cpyshapeit.AlignMol(ref, temp)
print(score)
# draw molecules after alignment
p = py3Dmol.view(width=400,height=400)
p.addModel(Chem.MolToMolBlock(ref), 'sdf')
p.addModel(Chem.MolToMolBlock(temp), 'sdf')
p.setStyle({'stick':{}})
p.setBackgroundColor('0xeeeeee')
p.show()

pyshapeit seems to work well.

There are still some issues in rdkit/shape-it; I would like to try to fix them.

https://github.com/rdkit/shape-it

Visualize feature importance with marimo #cheminformatics #RDKit #marimo

I recently posted about marimo, a new generation of notebook. It is cool and makes it easy to build an interactive analysis environment with python.

I'm interested in the package and am thinking about how to use it in cheminformatics tasks. In QSAR tasks, cheminformaticians are often asked the reason for a model's prediction, so XAI (explainable AI) is an attractive area in the field. rkakamilan shared really useful posts, with code, about visualizing the feature importance of ML models on his blog.

Now I'm writing code to visualize ML weights on compound structures; most of the code comes from his posts. There are lots of ways to calculate feature importance. The current code supports only one of them, but I would like to support SHAP values at least in the near future.

Here is my code, named chem_viz (still under development…).

import base64
import functools
import rdkit
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from matplotlib import cm
import matplotlib_inline
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator
from rdkit.Chem import Draw
from rdkit.Chem import DataStructs
from rdkit.Chem.Draw import SimilarityMaps
from rdkit.Chem.Draw import rdDepictor, rdMolDraw2D

def red_blue_cmap(x):
    """Red to Blue color map
    Args:
        x (float): value between -1 ~ 1, represents normalized saliency score
    Returns (tuple): tuple of 3 float values representing R, G, B.
    """
    if x > 0:
        # Red for positive value
        # x=0 -> 1, 1, 1  (white)
        # x=1 -> 1, 0, 0 (red)
        return 1.0, 1.0-x, 1.0-x
    else:
        # Blue for negative value
        x *= -1
        return 1.0-x, 1.0-x, 1.0

def is_visible(begin, end):
    if begin <= 0 or end <= 0:
        return 0
    elif begin >= 1 or end >= 1:
        return 1
    else:
        return (begin+end) * 0.5
def color_bond(bond, saliency, color_fn):
    begin = saliency[bond.GetBeginAtomIdx()]
    end = saliency[bond.GetEndAtomIdx()]
    return color_fn(is_visible(begin, end))

def label_cat(label):
    return '$+$' if bool(label != 0) else r'$\cdot$'

class Mol2Img():
    def __init__(self, mol, atom_colors, bond_colors, molSize=(450, 150), kekulize=True):
        self.mol = mol
        self.atom_colors = atom_colors
        self.bond_colors = bond_colors
        self.molSize = molSize
        self.kekulize = kekulize
        self.mc = Chem.Mol(self.mol.ToBinary())
        if self.kekulize:
            try:
                Chem.Kekulize(self.mc)
            except:
                self.mc = Chem.Mol(self.mol.ToBinary())

    def mol2png(self):
        drawer = rdMolDraw2D.MolDraw2DCairo(self.molSize[0], self.molSize[1])
        self._getDrawingText(drawer)
        return drawer.GetDrawingText()
    
    def mol2svg(self):
        drawer = rdMolDraw2D.MolDraw2DSVG(self.molSize[0], self.molSize[1])
        self._getDrawingText(drawer)
        return drawer.GetDrawingText()
    
    def _getDrawingText(self, drawer):
            dops = drawer.drawOptions()
            dops.useBWAtomPalette()
            dops.padding = .2
            dops.addAtomIndices = True
            drawer.DrawMolecule(
                self.mc,
                highlightAtoms=[i for i in range(len(self.atom_colors))], 
                highlightAtomColors=self.atom_colors, 
                highlightBonds=[i for i in range(len(self.bond_colors))],
                highlightBondColors=self.bond_colors,
                highlightAtomRadii={i: .5 for i in range(len(self.atom_colors))}
                )
            drawer.FinishDrawing()
            
class XMol():
    def __init__(self, mol, weight_fn, weights=None, atoms=['C', 'N', 'O', 'S', 'F', 'Cl', 'P', 'Br'], drawingfmt='svg'):
        self.mol = mol
        self.weight_fn = weight_fn
        self.weights = weights
        self.atoms = atoms
        self.drawingfmt = drawingfmt

    def make_explainable_image(self, kekulize=True, molSize=(450, 150)):
        symbols = [f'{self.mol.GetAtomWithIdx(i).GetSymbol()}_{i}' for i in range(self.mol.GetNumAtoms())]
        #df = pd.DataFrame(columns=self.atoms)
        if self.weights is None:
            contribs = self.weight_fn(self.mol)
        else:
            contribs = self.weights
        num_atoms = self.mol.GetNumAtoms()
        arr = np.zeros((num_atoms, len(self.atoms)))
        for i in range(self.mol.GetNumAtoms()):
            _a = self.mol.GetAtomWithIdx(i).GetSymbol()
            arr[i, self.atoms.index(_a)] = contribs[i]
        df = pd.DataFrame(arr, index=symbols, columns=self.atoms)
        self.weights, self.vmax = SimilarityMaps.GetStandardizedWeights(contribs)
        self.vmin = -self.vmax
        atom_colors = {i: red_blue_cmap(e) for i, e in enumerate(self.weights)}
        # bondlist = [bond.GetIdx() for bond in mol.GetBonds()]
        bond_colors = {i: color_bond(bond, self.weights, red_blue_cmap) for i, bond in enumerate(self.mol.GetBonds())}
        viz = Mol2Img(self.mol, atom_colors, bond_colors, molSize=molSize, kekulize=kekulize)
        if self.drawingfmt == 'svg':
            matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
            self.drawingtxt = viz.mol2svg()
        elif self.drawingfmt == 'png':
            # fix: use the same attribute name as the svg branch (drawingtxt)
            self.drawingtxt = viz.mol2png()
            matplotlib_inline.backend_inline.set_matplotlib_formats('png')
        else:
            raise Exception("Please select drawingfmt form 'svg' or 'png'")
        self.fig = plt.figure(figsize=(18, 9))
        self.grid = plt.GridSpec(15, 10)
        self.ax = self.fig.add_subplot(self.grid[1:, -1])
        self.ax.barh(range(self.mol.GetNumAtoms()), np.maximum(0, df.values).sum(axis=1), color='C3')
        self.ax.barh(range(self.mol.GetNumAtoms()), np.minimum(0, df.values).sum(axis=1), color='C0')
        self.ax.set_yticks(range(self.mol.GetNumAtoms()))
        self.ax.set_ylim(-.5, self.mol.GetNumAtoms()-0.5)
        symbols= {i: f'${self.mol.GetAtomWithIdx(i).GetSymbol()}_{{{i}}}$' for i in range(self.mol.GetNumAtoms())}
        self.ax.axvline(0, color='k', linestyle='-', linewidth=.5)
        self.ax.spines['top'].set_visible(False)
        self.ax.spines['right'].set_visible(False)
        self.ax.spines['left'].set_visible(False)
        self.ax.tick_params(axis='both', which='both', left=False, labelleft=False)

        self.ax = self.fig.add_subplot(self.grid[1:, :-1], sharey=self.ax)
        self.im = self.ax.imshow(df.values, cmap='bwr', vmin=self.vmin, vmax=self.vmax, aspect='auto')
        self.ax.set_yticks(range(self.mol.GetNumAtoms()))
        self.ax.set_ylim(self.mol.GetNumAtoms() -.5, -.5)
        self.symbols= {i: f'${self.mol.GetAtomWithIdx(i).GetSymbol()}_{{{i}}}$' for i in range(self.mol.GetNumAtoms())}
        self.ax.set_yticklabels(symbols.values())
        self.ax.set_xticks(range(len(self.atoms)))


        self.ax.set_xlim(-.5, len(self.atoms) -.5)
        self.ax.set_xticklabels(self.atoms, rotation=90)
        self.ax.set_ylabel('Node')

        for (j,i),label in np.ndenumerate(df.values):
            self.ax.text(i,j, label_cat(label) ,ha='center',va='center')
        self.ax.tick_params(axis='both', which='both', bottom=True, labelbottom=True, top=False, labeltop=False)
        #ax.grid(c=None)

        self.ax = self.fig.add_subplot(self.grid[0, :-1])
        self.fig.colorbar(mappable=self.im, cax=self.ax, orientation='horizontal')
        self.ax.tick_params(axis='both', which='both', bottom=False, labelbottom=False, top=True, labeltop=True)
        

I added SVG support to the original code.

Now let's use the code from marimo! First, launch the marimo notebook with the marimo edit command.

import marimo as mo
import chemviz
# sorry, this code does not use shap for now...
import functools
import os
import warnings
warnings.filterwarnings(action='ignore')

from matplotlib import cm
import matplotlib.gridspec as gridspec
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import rdkit
from rdkit import Chem, RDPaths
from rdkit.Chem import AllChem,  DataStructs, Draw, rdBase, rdCoordGen, rdDepictor
from rdkit.Chem.Draw import IPythonConsole, rdMolDraw2D, SimilarityMaps
from rdkit.ML.Descriptors import MoleculeDescriptors
print(f'RDKit: {rdBase.rdkitVersion}')
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import shap
print(f'SHAP: {shap.__version__}')

def mol2fp(mol,radius=2, nBits=1024):
    bitInfo={}
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=radius, nBits=nBits, bitInfo=bitInfo)
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr, bitInfo

train_path = os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.train.sdf')
test_path = os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.test.sdf')
train_path='/home/iwatobipen/miniforge3/pkgs/rdkit-2023.09.3-py311h4c2f14b_1/share/RDKit/Docs/Book/data/solubility.train.sdf'
test_path='/home/iwatobipen/miniforge3/pkgs/rdkit-2023.09.3-py311h4c2f14b_1/share/RDKit/Docs/Book/data/solubility.test.sdf'
train_mols = Chem.SDMolSupplier(train_path)
test_mols = Chem.SDMolSupplier(test_path)
print(len(train_mols), len(test_mols))

sol_classes = {'(A) low': 0, '(B) medium': 1, '(C) high': 2}
X_train = np.array([mol2fp(m)[0] for m in train_mols])
y_train = np.array([sol_classes[m.GetProp('SOL_classification')] for m in train_mols], dtype=np.int_) 
X_test = np.array([mol2fp(m)[0] for m in test_mols])
y_test = np.array([sol_classes[m.GetProp('SOL_classification')] for m in test_mols], dtype=np.int_)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

clf = RandomForestClassifier(random_state=20191215)
clf.fit(X_train, y_train)

def get_proba(fp, proba_fn, class_id):
    return proba_fn((fp,))[0][class_id]

def fp_partial(nBits):
    return functools.partial(SimilarityMaps.GetMorganFingerprint, nBits=nBits)

def show_pred_results(mol, model):
    y_pred = model.predict(mol2fp(mol)[0].reshape((1,-1)))
    sol_dict = {val: key for key, val in sol_classes.items()}
    print(f"True: {mol.GetProp('SOL_classification')} vs Predicted: {sol_dict[y_pred[0]]}")

def makeimg(smi):
    mol = Chem.MolFromSmiles(smi)
    weights = SimilarityMaps.GetAtomicWeightsForModel(mol, 
                                                      fp_partial(1024), 
                                                      lambda x:get_proba(x, clf.predict_proba, 2))
    xmol=chemviz.XMol(mol, weight_fn=None, weights=weights)
    xmol.make_explainable_image()
    return xmol.drawingtxt


In this case I built the ML model in the notebook, but the model could be imported from an external source too, which would make the notebook more flexible and readable.

After preparing the required data (molecules, fingerprints, helper functions, the ML model…), I made a form for input and a field to show the results, as shown below.


form = mo.ui.text(label="input SMILES")
form

mo.Html(f"""{makeimg(form.value)}""")

Marimo provides an interactive view, so I can explore the explainability image interactively.

As a result, I could make an interactive visualization app for a QSAR model with very few lines of code. It's interesting for me.

BTW, XAI for QSAR often gives results that are difficult for chemists to understand; they feel or say that the images don't fit the current SAR, so they can't agree with the results.

I think there is room for improvement in this area. And I would like to hear opinions about the usability of XAI in QSAR from readers who have lots of experience in the field ;)

I updated today’s code on my github repo.

https://github.com/iwatobipen/chem_viz

Run conformer generation in parallel #RDKit #Cheminformatics #lig3dlens

These days there are useful open source packages not only for SBDD but also for LBDD. VSFlow is one useful package for LBDD, and I used it in my cheminformatics tutorial before. Recently I found another interesting package for LBDD named lig3dlens, which is developed by researchers at Healx, an AI drug discovery company.

The features of lig3dlens are well described in their blog post.
https://healx.ai/lig3dlens-an-open-source-ligand-based-3d-virtual-screening-toolbox/

The package can generate 3D conformers and align molecules not only by shape but also by electrostatics, using espsim.

It seems really cool and useful. The package is mainly used from the CLI, but it has a simple wrapper for conformer generation. Today I tried to use lig3dlens for conformer generation and ran it in parallel.

In general, 'multiprocessing' is used for parallelization. But in today's case multiprocessing will not work; it raises a "multiprocessing TypeError: cannot pickle 'module' object" error. To avoid the error, pathos should be used for multiprocessing.

From the original documentation.
pathos is a framework for heterogeneous computing. It provides a consistent high-level interface for configuring and launching parallel computations across heterogeneous resources. pathos provides configurable launchers for parallel and distributed computing, where each launcher contains the syntactic logic to configure and launch jobs in an execution environment. Examples of launchers that plug into pathos are: a queue-less MPI-based launcher (in pyina), a ssh-based launcher (in pathos), and a multi-process launcher (in multiprocess).
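The practical difference is the serializer: pathos uses dill instead of the standard pickle, so objects that pickle chokes on (lambdas, closures, functions defined in a notebook) can be shipped to worker processes. A tiny standalone check of this, independent of lig3dlens:

from pathos.multiprocessing import ProcessingPool

pool = ProcessingPool(4)
# a lambda like this would fail to pickle with the standard multiprocessing.Pool
print(pool.map(lambda x: x * x, range(8)))
pool.close()
pool.join()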

At first, I installed lig3dlens and pathos.

(base) $ gh repo clone healx/lig3dlens
(base) $ conda create -n lig3dlens python=3.10
(base) $ conda activate lig3dlens
(lig3dlens) $ cd lig3dlens
(lig3dlens) $ pip install -r requirements.txt
(lig3dlens) $ pip install .
(lig3dlens) $ pip install pathos

Then I tested pathos with lig3dlens, with pools of 20, 10 and 2 nodes.

from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem.Draw import rdMolDraw2D, rdDepictor
from pathos import multiprocessing
import copy

sdf = Chem.SDMolSupplier("./Enamine_Hit_Locator_Library_Representative_plated_5120cmpds_20211105.sdf")
mols = [m for m in sdf]
for m in mols:
    rdDepictor.Compute2DCoords(m)
Draw.MolsToGridImage(mols[:5])
from lig3dlens import gen_conformers
from functools import partial

generate_conformers_p = partial(gen_conformers.generate_conformers,
                           num_conformers=5,
                           optimize=True)

mols1k_1 = copy.deepcopy(mols)
mols1k_2 = copy.deepcopy(mols)
mols1k_3 = copy.deepcopy(mols)
pool1 = multiprocessing.ProcessingPool(20)
pool2 = multiprocessing.ProcessingPool(10)
pool3 = multiprocessing.ProcessingPool(2)
%time res1=pool1.map(generate_conformers_p, mols1k_1)
>CPU times: user 1.13 s, sys: 111 ms, total: 1.24 s
Wall time: 1min 43s
%time res2=pool2.map(generate_conformers_p, mols1k_2)
>CPU times: user 657 ms, sys: 99.1 ms, total: 756 ms
Wall time: 2min
%time res3=pool3.map(generate_conformers_p, mols1k_3)
>CPU times: user 443 ms, sys: 92.9 ms, total: 536 ms
Wall time: 4min 55s

Comparing pool3 with pool1, the speed improved, but not by 10 times. I ran this from jupyter-lab, so I would also like to test the code from a script.

In summary, pathos and lig3dlens are interesting and useful packages for cheminformatics. lig3dlens has lots of functions for LBDD; I'm testing the code now.

New ML package for cheminformatics #cheminformatics #QSAR #ML

I introduced scikit-mol in a previous blog post. The package integrates scikit-learn and rdkit, and it's easy to use because users can build QSAR models through scikit-learn's API. I like the package. And recently I found another useful package for cheminformatics named 'molflux', which is developed by researchers at Exscientia, a famous AI drug discovery company.

molflux supports lots of useful methods for building cheminformatics-related predictive models.

  • Featurization with various fingerprints and molecular descriptors.
  • Splitting data with various strategies, not only mathematical ways but also cheminformatics ways (i.e. scaffold split).
  • Building classification and regression models with lots of algorithms.

I'm interested in the package, so I tried to use it today. The following code is almost the same as the original documentation.

molflux provides some datasets, so users can test the package without preparing a demo dataset.

from molflux.datasets import load_dataset
from molflux import datasets
from pprint import pprint
import numpy as np
dataset = load_dataset('esol',
                      trust_remote_code=True)

There are several featurization methods, and the load_from_dicts method can generate multiple features from molecules. It seems useful for finding the best molecular representation.

from molflux.datasets import featurise_dataset
from molflux.features import load_from_dicts as load_representations_from_dicts
from molflux import features
featuriser = load_representations_from_dicts([
    {"name":"morgan"},
    {"name":"maccs_rdkit"}
])

To make a featurised dataset, the user should pass the dataset and the featuriser to featurise_dataset. Simple!

featurised_dataset = featurise_dataset(dataset,
                                      column='smiles',
                                   representations=featuriser)

Then split the dataset. molflux can split data into train, validation and test sets. The default setting splits the data into 80% for training, 20% for test and 0% for validation. You can see the available splitting strategies with the following code ;)

from molflux.datasets import split_dataset, splitting
from molflux.splits import load_from_dict as load_split_from_dict
from molflux.splits import list_splitting_strategies
print(list_splitting_strategies())
{'core': ['group_k_fold', 'group_shuffle_split', 'k_fold', 'leave_one_group_out', 'leave_p_groups_out', 'linear_split', 'ordered_split', 'shuffle_split', 'stratified_k_fold', 'stratified_ordered_split', 'stratified_shuffle_split', 'time_series_split'], 'openeye': ['scaffold'], 'rdkit': ['scaffold_rdkit', 'tanimoto_rdkit']}

# define the strategy before splitting (this line was missing above)
shuffle_strategy = load_split_from_dict({'name': 'shuffle_split'})
split_featurised_dataset = next(split_dataset(featurised_dataset, shuffle_strategy))

Next, build a predictive model and test it. After making the model object, call model.train to build the predictive model.

import json
from molflux.modelzoo import load_from_dict as load_model_from_dict
from molflux.metrics import load_suite
import matplotlib.pyplot as plt
model = load_model_from_dict(
    {'name':'random_forest_regressor',
    'config':{
        'x_features':['smiles::morgan', 'smiles::maccs_rdkit'],
        'y_features':['log_solubility']
    }}
)

model.train(split_featurised_dataset['train'])
preds = model.predict(split_featurised_dataset['test'])
regression_suite = load_suite('regression')
scores = regression_suite.compute(
    references=split_featurised_dataset['test']['log_solubility'],
    predictions=preds['random_forest_regressor::log_solubility']
)
print(json.dumps({k: round(v, 2) for k, v in scores.items()}, indent=4))

{
    "explained_variance": 0.81,
    "max_error": 3.37,
    "mean_absolute_error": 0.62,
    "mean_squared_error": 0.72,
    "root_mean_squared_error": 0.85,
    "median_absolute_error": 0.46,
    "r2": 0.81,
    "spearman::correlation": 0.91,
    "spearman::p_value": 0.0,
    "pearson::correlation": 0.9,
    "pearson::p_value": 0.0
}

regression_suite can calculate many metrics, so users can evaluate model performance by writing only one line of code.

molflux is also flexible: users can add their own models, featurisers and split strategies.

I should read the documentation more deeply. And write code!

Today's code is uploaded to my gist. If you are interested, please try molflux.


New type of python notebook #marimo #cheminformatics #RDKit

Jupyter-lab, Jupyter notebook, streamlit and other packages are useful for data science because they let you analyze and visualize data step by step. I like streamlit and dash for making simple web apps.

Some days ago I found a new and cool package named marimo. From the documentation, marimo is an open-source reactive notebook for Python — reproducible, git-friendly, executable as a script, and shareable as an app.

You can read the details and examples on the original site. I was interested in the package, so I tried to use it for cheminformatics tasks.

The following example will:

1. Make a SMILES input form.
2. Visualize the molecule with some molecular properties.

The rendering code is borrowed from Greg's great blog post.

To use marimo, I installed the package with pip.

$ pip install marimo
$ marimo edit

After the installation, the 'marimo' command is callable from the shell, so I ran the marimo edit command. The notebook launched and I could start coding in it. Today's example code is shown below.

At first I imported the required packages and defined the functions.

import marimo as mo
from rdkit import Chem
from rdkit.Chem import rdDepictor
from rdkit.Chem.Draw import rdMolDraw2D
from rdkit.Chem import Descriptors

def smitosvg(smi,molSize=(450,150),kekulize=True):
    mol = Chem.MolFromSmiles(smi)
    mc = Chem.Mol(mol.ToBinary())
    if kekulize:
        try:
            Chem.Kekulize(mc)
        except:
            mc = Chem.Mol(mol.ToBinary())
    if not mc.GetNumConformers():
        rdDepictor.Compute2DCoords(mc)
    drawer = rdMolDraw2D.MolDraw2DSVG(molSize[0],molSize[1])
    drawer.DrawMolecule(mc)
    drawer.FinishDrawing()
    svg = drawer.GetDrawingText()
    # It seems that the svg renderer used doesn't quite hit the spec.
    # Here are some fixes to make it work in the notebook, although I think
    # the underlying issue needs to be resolved at the generation step
    return svg.replace('svg:','')

def calc_desk(smi):
    mol = Chem.MolFromSmiles(smi)
    mw = Descriptors.MolWt(mol)
    qed = Descriptors.qed(mol)
    return f"SMILES:{smi}<br>MW:{mw:.2f}<br>qed:{qed:.2f}"


Then, make a SMILES input form. It is really easy to make!

form = mo.ui.text(label="input SMILES")
form

After the coding shown above, the input form appears in the cell ;-)

Finally I defined the visualization part, which shows the results as HTML. Marimo can visualize not only HTML but also markdown and other things. It's really cooooooooooool.

mo.Html(f"""{smitosvg(form.value)}<br>
{calc_desk(form.value)}""")

All of today's coding is done, and now I have an interactive molecule renderer in my notebook with just a few lines of code.

In summary, marimo is a really cool and useful package for data science.

Thank you for reading.

Try to use mmpdb v3.1 #RDKit #MMPDB #cheminformatics

As many RDKitters know, Andrew released a new version of mmpdb! The recent version of mmpdb has lots of useful methods; one of them is the generate method, which can generate new molecules from a given smiles. The method can generate not only all possible molecules from the MMP database but also constrained molecules via the '--query' and '--constant' options. The '--query' option defines the substructure you would like to change, and the '--constant' option defines the substructure you would like to keep during the transformation process.

OK, let's make the dataset first. I used a ChEMBL 33 hERG dataset. The SQL query is shown below.

# get_herg.sql
select canonical_smiles,molregno,activity_id,pchembl_value,molecule_dictionary.chembl_id from activities 
  join assays using (assay_id) 
  join compound_structures using (molregno) 
  join molecule_dictionary using (molregno) 
  where tid=165 and standard_type='Ki' and standard_value is not null and standard_relation='=' 
    and canonical_smiles not like '%.%';

Run the psql command to get the data, and prepare the data for mmpdb.

# run psql command
$ psql -d chembl_33 -a -f get_herg.sql --csv -F, > ./chembl_herg.csv

# make unique SMILES list
$ awk -F ',' '!count[$1]++ {print $1" "$5}' chembl_herg.csv > chembl_herg_uni.smi 
# make unique ID properties list, hERG pChEMBL value
$ awk -F ',' '!count[$5]++ {print $5" "$4}' chembl_herg.csv > chembl_herg_uni_prop.csv

# make fragment and mmpdb with mmpdb command
$ mmpdb fragment chembl_herg_uni.smi --has-header
$ mmpdb index chembl_herg_uni.fragdb --memory --title chembl_herg --properties chembl_herg_uni_prop.csv

After running the code, I could get chembl_herg_uni.mmpdb ;)

Next I ran the mmpdb generate command from a jupyter notebook. To use click from jupyter, the CliRunner.invoke method is used.

E-4031 is used as an example molecule. I tried two situations: the first one fixes the whole substructure except the sulfonamide, and the second one is a ‘--query’ example with the sulfonamide as the query. Sulfonamide is a useful substructure for drug design, but it sometimes increases PSA and MW, so it reduces permeability in some cases. So I think finding a replacement strategy for sulfonamide is useful.


I defined a function which runs mmpdb generate with the given options and returns a pandas DataFrame with RDKit mol objects.

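The function relies on a few imports, listed below; the mmpdblib import path is my assumption for mmpdb v3, so please check it against your installation.

import pandas as pd
from click.testing import CliRunner
from mmpdblib import cli  # assumed module path for the mmpdb v3 click CLI
from rdkit.Chem import PandasTools
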
def gen_mol(smiles: str,
            mmpdbpath: str,
            tempdir='/tmp/',
            query: str = None,
            const: str = None) -> pd.DataFrame:
    if query is not None and const is not None:
        raise ValueError('only one option, query or const, is allowed!!')
    outputpath = tempdir + "gen.csv"
    runner = CliRunner()
    if query is None and const is None:
        res = runner.invoke(cli.main, ['generate',
                                       '--smiles', smiles,
                                       '--output', outputpath,
                                       mmpdbpath])
    elif query:
        res = runner.invoke(cli.main, ['generate',
                                       '--smiles', smiles,
                                       '--query', query,
                                       '--output', outputpath,
                                       mmpdbpath])
    elif const:
        res = runner.invoke(cli.main, ['generate',
                                       '--smiles', smiles,
                                       '--constant', const,
                                       '--output', outputpath,
                                       mmpdbpath])
    print(res.stdout)
    try:
        df = pd.read_csv(outputpath, sep='\t')
        # replace the SMILES columns with RDKit mol objects for easy rendering
        PandasTools.AddMoleculeColumnToFrame(df, smilesCol='start', molCol='start')
        PandasTools.AddMoleculeColumnToFrame(df, smilesCol='constant', molCol='constant')
        PandasTools.AddMoleculeColumnToFrame(df, smilesCol='final', molCol='final')
        return df
    except Exception:
        raise Exception("error occurred when reading output")

Let’s run the code!

df1 = gen_mol(smiles='Cc1cccc(n1)CCN2CCC(CC2)C(=O)c3ccc(cc3)NS(=O)(=O)C',
              const='Cc1cccc(n1)CCN2CCC(CC2)C(=O)c3ccc(cc3)*',
              mmpdbpath='./chembl_herg_uni.mmpdb')
df1

As readers can see, the final compounds keep the constant part.

Then run the function with the query option.

df2 = gen_mol(smiles='Cc1cccc(n1)CCN2CCC(CC2)C(=O)c3ccc(cc3)NS(=O)(=O)C',
              query='*NS(=O)(=O)*',
              mmpdbpath='./chembl_herg_uni.mmpdb')
df2

As I expected, the output compounds don’t have the sulfonamide part ;)
And the current MMPDB has a #pairs column which indicates the number of pairs in the MMPDB. The details are described in The Playbooks of Medicinal Chemistry Design Moves, J. Chem. Inf. Model., 2021, 61 (2), pp 729-742.

In summary, MMPDB is a really useful and powerful tool for cheminformatics. I pushed my code to gist.
Thanks for reading.


Extract format data from chembl with Dask #cheminformatics #memo #chembl

Happy new year. I hope readers had nice year-end holidays. I started running on 1st Jan. I would like to run 1,500 km this year.

Last year, I posted about the REST API usage of the ChEMBL multitask target prediction. It allows predicting possible targets without any ML environment on the user’s PC. But sometimes users would like to build their own predictive model or fine-tune the current model. So it’s worth knowing how to extract the training data from ChEMBL. As many readers know, the original repo provides code to extract the training data from the sqlite3-based ChEMBL DB, link here! To use the data extraction code, users should get the ChEMBL DB from the FTP site. https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/

At first, I tried to extract data with the original code, extract_format_dataset.py. The code extracts data with pandas, and the extraction process was killed due to OOM on my PC. It seems the dataframe merge process requires a huge amount of memory; my Dell PC has 32GB RAM… Hmm… After googling, I found that memory usage can be reduced when the copy=False option is added to the merge step, and the strategy worked.

# https://github.com/chembl/chembl_multitask_model/blob/52a68f5a40989f35d0d2a88cd677db51e085a66a/extract_format_dataset.py#L162
# mt_df = pd.merge(structs, mt_df, how="inner", on="molregno")
mt_df = pd.merge(structs, mt_df, how="inner", on="molregno", copy=False)

It worked fine, but I would like to modify the code to use dask, because I’m interested in dask and there are interesting blog posts published by Greg and Pat.

https://rdkit.blogspot.com/2017/01/a-great-looking-option-for-doing-easy.html
https://patwalters.github.io/practicalcheminformatics/jupyter/dask/parallel/2021/03/28/dask-cheminformatics.html

It seems that Dask is a powerful package for cheminformatics.

After struggling for a few hours, I finished the migration from pandas to dask ;)
Here is my code.
I modified a few lines from the original code. The original code checks whether a SMILES is RDKit-readable and returns an RDKit mol object when the SMILES is valid. But after the checking process, the column is deleted, so I added a function that returns ‘ok’ if the SMILES is valid and None if it is invalid; rows with None are then dropped.

from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
from sqlalchemy import create_engine
from sqlalchemy.sql import text
import numpy as np
import pandas as pd
import tables as tb
from tables.atom import ObjectAtom
import json
import gc
from dask import dataframe as ddf
import dask as da

CHEMBL_VERSION = 33
FP_SIZE = 1024
RADIUS = 2
# num of active and inactive different molecules needed to include a target
ACTIVE_MOLS = 100
INACTIVE_MOLS = 100

# ----------------------------------------------------------------------------
# Get data from ChEMBL using the SQLite dump
engine = create_engine(f"sqlite:///chembl_33/chembl_33_sqlite/chembl_{CHEMBL_VERSION}.db")

qtext = """
SELECT
  activities.doc_id                    AS doc_id,
  activities.standard_value            AS standard_value,
  molecule_hierarchy.parent_molregno   AS molregno,
  compound_structures.canonical_smiles AS canonical_smiles,
  target_dictionary.tid                AS tid,
  target_dictionary.chembl_id          AS target_chembl_id,
  protein_classification.protein_class_desc     AS protein_class_desc
FROM activities
  JOIN assays ON activities.assay_id = assays.assay_id
  JOIN target_dictionary ON assays.tid = target_dictionary.tid
  JOIN target_components ON target_dictionary.tid = target_components.tid
  JOIN component_class ON target_components.component_id = component_class.component_id
  JOIN protein_classification ON component_class.protein_class_id = protein_classification.protein_class_id
  JOIN molecule_dictionary ON activities.molregno = molecule_dictionary.molregno
  JOIN molecule_hierarchy ON molecule_dictionary.molregno = molecule_hierarchy.molregno
  JOIN compound_structures ON molecule_hierarchy.parent_molregno = compound_structures.molregno
WHERE activities.standard_units = 'nM' AND
      activities.standard_type IN ('EC50', 'IC50', 'Ki', 'Kd', 'XC50', 'AC50', 'Potency') AND
      activities.data_validity_comment IS NULL AND
      activities.standard_relation IN ('=', '<') AND
      activities.potential_duplicate = 0 AND assays.confidence_score >= 8 AND
      target_dictionary.target_type = 'SINGLE PROTEIN'"""

with engine.connect() as conn:
    df = pd.read_sql(text(qtext), conn, dtype_backend="pyarrow")
    df = ddf.from_pandas(df, npartitions=10)

# Drop duplicate activities keeping the activity with lower concentration for each molecule-target pair
df = df.sort_values(by=["standard_value", "molregno", "tid"], ascending=True)
df = df.drop_duplicates(subset=["molregno", "tid"], keep="first")

# save to csv
df.to_csv(f"chembl_{CHEMBL_VERSION}_activity_data.csv", index=False)

# ----------------------------------------------------------------------------
# Set to active/inactive by threshold
#     Based on IDG protein families: https://druggablegenome.net/ProteinFam
#         Kinases: <= 30nM
#         GPCRs: <= 100nM
#         Nuclear Receptors: <= 100nM
#         Ion Channels: <= 10μM
#         Non-IDG Family Targets: <= 1μM
def set_active(row):
    active = 0
    if row["standard_value"] is not pd.NA and row["standard_value"] <= 1000:
        active = 1
    if "ion channel" in row["protein_class_desc"]:
        if row["standard_value"] is not pd.NA and row["standard_value"] <= 10000:
            active = 1
    if "enzyme  kinase  protein kinase" in row["protein_class_desc"]:
        if row["standard_value"] is not pd.NA and row["standard_value"] > 30:
            active = 0
    if "transcription factor  nuclear receptor" in row["protein_class_desc"]:
        if row["standard_value"] is not pd.NA and row["standard_value"] > 100:
            active = 0
    if "membrane receptor  7tm" in row["protein_class_desc"]:
        if row["standard_value"] is not pd.NA and row["standard_value"] > 100:
            active = 0
    return active

def dd_set_active(indf):
    return indf.apply(lambda row: set_active(row), axis=1)

df["active"] = df.map_partitions(dd_set_active, meta=('active', 'int'))

# ----------------------------------------------------------------------------
# Filter target data
#     Keep targets mentioned at least in two different docs
#     Keep targets with at least ACTIVE_MOLS active and INACTIVE_MOLS inactive molecules.

# get targets with at least ACTIVE_MOLS different active molecules
acts = df[df["active"] == 1].groupby(["target_chembl_id"]).agg("count")
acts = acts[acts["molregno"] >= ACTIVE_MOLS].reset_index()["target_chembl_id"]

# get targets with at least INACTIVE_MOLS different inactive molecules
inacts = df[df["active"] == 0].groupby(["target_chembl_id"]).agg("count")
inacts = inacts[inacts["molregno"] >= INACTIVE_MOLS].reset_index()["target_chembl_id"]

# get targets mentioned in at least two docs
docs = df.drop_duplicates(subset=["doc_id", "target_chembl_id"])
docs = docs.groupby(["target_chembl_id"]).agg("count")
docs = docs[docs["doc_id"] >= 2.0].reset_index()["target_chembl_id"]

# keep data for filtered targets
t_keep = set(acts).intersection(set(inacts)).intersection(set(docs))
activities = df[df["target_chembl_id"].isin(t_keep)]

ion = pd.unique(
    activities[activities["protein_class_desc"].str.contains("ion channel", na=False)]["tid"].compute()
).shape[0]
kin = pd.unique(
    activities[activities["protein_class_desc"].str.contains("enzyme  kinase  protein kinase", na=False)]["tid"].compute()
).shape[0]
nuc = pd.unique(
    activities[activities["protein_class_desc"].str.contains("transcription factor  nuclear receptor", na=False)]["tid"].compute()
).shape[0]
gpcr = pd.unique(
    activities[activities["protein_class_desc"].str.contains("membrane receptor  7tm", na=False)]["tid"].compute()
).shape[0]

print("Number of unique targets: ", len(t_keep))
print("  Ion channel: ", ion)
print("  Kinase: ", kin)
print("  Nuclear receptor: ", nuc)
print("  GPCR: ", gpcr)
print("  Others: ", len(t_keep) - ion - kin - nuc - gpcr)

# save it to a file
activities.to_csv(f"chembl_{CHEMBL_VERSION}_activity_data_filtered.csv", index=False)


# ----------------------------------------------------------------------------
# Prepare the label matrix for the multi-task deep neural network
#     known active = 1
#     known no-active = 0
#     unknown activity = -1, easy to filter out when calculating the loss during model training

# The matrix is extremely sparse so using sparse matrices (COO/CSR/CSC) should be considered
#     https://github.com/pytorch/pytorch/issues/20248

def gen_dict(group):
    return {tid: act for tid, act in zip(group["target_chembl_id"], group["active"])}

#group = activities.groupby("molregno")
group = activities.compute().groupby("molregno")
temp = pd.DataFrame(group.apply(gen_dict))
mt_df = pd.DataFrame(temp[0].tolist())

mt_df["molregno"] = temp.index
mt_df = mt_df.where((pd.notnull(mt_df)), -1)
structs = activities[["molregno", "canonical_smiles"]].drop_duplicates(
    subset="molregno"
)


def checkmol(smi):
    m = Chem.MolFromSmiles(smi)
    if m:
        return 'ok' 
    else:
        return m
def dd_checkmol(indf):
    return indf["canonical_smiles"].apply(checkmol)
structs["romol"] = structs.map_partitions(dd_checkmol, meta=('romol', 'object'))
#structs["romol"] = structs["canonical_smiles"].apply(checkmol)
del structs["romol"]

mt_df = ddf.from_pandas(mt_df, npartitions=10)
mt_df = ddf.merge(structs, mt_df, how="inner", on="molregno")
# save df to csv
mt_df.to_csv(f"chembl_{CHEMBL_VERSION}_multi_task_data.csv", index=False)
print("all process is finished!!!")

But most of the code is the same as the original one. After running the code I got a bunch of xx.part files in the xxxxx.csv folder. Dask can read these files like a single CSV file.

from dask import dataframe
# I set npartitions=10 in the code, so I got 10 part files which is csv format.
ddf = dataframe.read_csv('./chembl_33_multi_task_data.csv/*.part')
ddf.head(2)
Out[6]: 
   molregno                                   canonical_smiles  CHEMBL1293237  ...  CHEMBL3713062  CHEMBL4879459  CHEMBL4523156
0       925    CCC(C)(C)C(=O)C(=O)N1CCC[C@H]1C(=O)OCCCc1cccnc1           -1.0  ...           -1.0           -1.0           -1.0
1      8090  CC(C)[C@@H](CN1CC[C@@](C)(c2cccc(O)c2)[C@@H](C...           -1.0  ...           -1.0           -1.0           -1.0

[2 rows x 682 columns]

ddf.compute().shape
Out[9]: (828677, 682)

Dask uses lazy evaluation, so to get the shape of a dataframe, the compute() method should be called.
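
For example (a small sketch using one of the target columns shown in the output above):

# dask objects are lazy; nothing is executed until compute() is called
lazy_mean = ddf["CHEMBL1293237"].mean()
print(lazy_mean)            # prints a lazy scalar, not a number
print(lazy_mean.compute())  # triggers the actual computation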

The code with dask could reduce memory usage. Dask is a useful and interesting package. I would like to use it in my cheminformatics projects!

Look Back at 2023 #diary

I’m writing this post after the last run of 2023. It has been over 10 years since starting my blog site. Today I would like to look back on 2023 ;-)

  1. Running
    My running pace is slow but I’m enjoying running. I started running at lunch time after the COVID pandemic.
    This year:
    Total distance: 1,373km / year
    Cumulative elevation: 22,606m up & 2,259m down
    The total distance is not so long…. I would like to run at least 1,500km next year. I should do long runs on the weekend :-)
  2. Coding
    From my github profile, there are 69 commits and 40 pull requests this year. I got a B+ from github-readme-stats. I would like to contribute to the open science community more and more next year.
  3. My blog-site
    I posted 34 posts (excluding this one) this year, almost the same pace as last year. I’m getting busy, but I would like to keep the pace and keep learning.
  4. About my work
    I’m happy to have had some opportunities for chemoinformatics hands-on training this year. I hope the training was useful for every participant. I learned the difficulties of making training materials. I respect the TeachOpenCADD activities, because the topics are well maintained and updated; I think it’s difficult to keep up the motivation for these kinds of tasks.
    I didn’t have any presentation of my research activities at a conference; I would like to try to present my activities and have discussions next year.

This will be the last post of this year. Thanks for reading.

I want to wish you and your family a safe, beautiful and happy New Year.

ChEMBL multitask prediction with python requests #memo #chembl #cheminfo

ChEMBL provides a multitask prediction model from its github repo, and shared a useful blog post.
https://chembl.blogspot.com/2019/05/multi-task-neural-network-on-chembl.html

By using the code, we can get a predicted target list from given molecules. And the prediction can run in Python, C++, JS and Knime! The ChEMBL team provides not only the source code but also precomputed predictions when we search a compound in the ChEMBL DB. The screenshot is shown below.

I found that at the bottom of the predicted target list, a prediction API is provided too! For example, I can get the predicted results in JSON format when I post the query below.

curl 'https://www.ebi.ac.uk/chembl/target-predictions' -H 'Content-Type: application/json' --data-raw '{"smiles":"CN(Cc1cnc2nc(N)nc(N)c2n1)c1ccc(C(=O)N[C@@H](CCC(=O)O)C(=O)O)cc1"}' 

A web-based API is useful because users don’t need to prepare anything in their own computational environment. So I tried to make the query with python requests, because by using python I can build a chemoinformatics pipeline easily.

Here is an example code ;-)

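A minimal sketch of the idea is below (the celecoxib SMILES is my own encoding, used as the example molecule to match the discussion that follows):

import requests

# minimal sketch: POST a SMILES to the ChEMBL target-prediction endpoint
url = "https://www.ebi.ac.uk/chembl/target-predictions"
smi = "Cc1ccc(-c2cc(C(F)(F)F)nn2-c2ccc(S(N)(=O)=O)cc2)cc1"  # celecoxib
data = f'{{"smiles":"{smi}"}}'  # double braces produce literal { } in an f-string
res = requests.post(url, data=data, headers={"Content-Type": "application/json"})
print(res.json())  # predicted targets as JSON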

The tricky part of making the query is the data definition.
An f-string can’t escape curly brackets with a backslash; {{something}} is rendered as {something} in the resulting string.

### tips
data = f'{{"smiles":"{smi}"}}' >>> '{"smiles" : "user defined smiles"}'

The code worked well, but I wondered why the predicted results don’t contain COX2, which is the primary target of celecoxib. I would like to check some other approved drugs in the same way.

Build the PostgreSQL RDKit cartridge and install it into a new version of PostgreSQL #RDKit #postgresql #cheminformatics

As readers know, rdkit supports not only Python and C++ but also a PostgreSQL cartridge. We can install rdkit-postgresql via the conda command without building it from source. However, conda’s rdkit-postgresql supports only an old version of PostgreSQL and hasn’t been updated recently.

I would like to install the rdkit cartridge into a new version of PostgreSQL. So today I built the cartridge and installed it into a new version of PostgreSQL. I would like to share today’s trial in this blog post :-)

At first, I made a chemdb env with mamba.

$ mamba create -n chemdb python=3.11
$ conda activate chemdb

Then install some packages.

(chemdb) $ conda install -c conda-forge -c anaconda -y cmake cairo pillow eigen pkg-config
(chemdb) $ conda install -c conda-forge -c anaconda  -y boost-cpp boost py-boost
(chemdb) $ conda install -c conda-forge matplotlib gxx_linux-64 postgresql

Now the required packages are installed. Before building the cartridge, I initialized and started PostgreSQL.

(chemdb) $ initdb /home/iwatobipen/postgresdb
(chemdb) $ pg_ctl -D /home/iwatobipen/postgresdb -l /home/iwatobipen/postgresdb/logfile start

Then get the rdkit code from the github repo and build the cartridge!

(chemdb) $ gh repo clone rdkit/rdkit
(chemdb) $ cd rdkit
(chemdb) $ mkdir build && cd build
(chemdb) $ cmake -DPy_ENABLE_SHARED=1 \
  -DRDK_INSTALL_INTREE=ON \
  -DRDK_INSTALL_STATIC_LIBS=OFF \
  -DRDK_BUILD_CPP_TESTS=ON \
  -DRDK_BUILD_PGSQL=ON \
  -DRDK_PGSQL_STATIC=ON \
  -DPYTHON_NUMPY_INCLUDE_PATH="$(python -c 'import numpy ; print(numpy.get_include())')" \
  -DBOOST_ROOT="$CONDA_PREFIX" \
  -DPostgreSQL_INCLUDE_DIR=/home/iwatobipen/miniforge3/envs/chemdb/include \
  -DPostgreSQL_TYPE_INCLUDE_DIR=/home/iwatobipen/miniforge3/envs/chemdb/include/server \
  -DPostgreSQL_LIBRARY=/home/iwatobipen/miniforge3/envs/chemdb/lib/libpq.so \
  ..

(chemdb) $ make -j8
(chemdb) $ make install
# test the code...
(chemdb) $  RDBASE=$PWD/.. PYTHONPATH=$RDBASE LD_LIBRARY_PATH=$RDBASE/lib:$LD_LIBRARY_PATH ctest

I could build the cartridge with the process described above.

Then install ChEMBL 33 and add rdkit mol objects and related data.

(chemdb) $ cd ~/Download
(chemdb) $ wget https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/chembl_33_postgresql.tar.gz
(chemdb) $ tar vxfz chembl_33_postgresql.tar.gz 
(chemdb) $ cd chembl_33/chembl_33_postgresql
(chemdb) $ createdb chembl_33
(chemdb) $ pg_restore --no-owner -d chembl_33 chembl_33_postgresql.dmp
(chemdb) $ psql chembl_33
chembl_33=# create extension if not exists rdkit;
chembl_33=# create schema rdk;
chembl_33=# select * into rdk.mols from (select molregno,mol_from_ctab(molfile::cstring) m  from compound_structures) tmp where m is not null;
create index molidx on rdk.mols using gist(m);
alter table rdk.mols add primary key (molregno);
select molregno,torsionbv_fp(m) as torsionbv,morganbv_fp(m) as mfp2,featmorganbv_fp(m) as ffp2 into rdk.fps from rdk.mols;
create index fps_ttbv_idx on rdk.fps using gist(torsionbv);
create index fps_mfp2_idx on rdk.fps using gist(mfp2);
create index fps_ffp2_idx on rdk.fps using gist(ffp2);
alter table rdk.fps add primary key (molregno);
create or replace function get_mfp2_neighbors(smiles text)
returns table(molregno bigint, m mol, similarity double precision) as
$$
select molregno,m,tanimoto_sml(morganbv_fp(mol_from_smiles($1::cstring)),mfp2) as similarity
from rdk.fps join rdk.mols using (molregno)
where morganbv_fp(mol_from_smiles($1::cstring))%mfp2
order by morganbv_fp(mol_from_smiles($1::cstring))<%>mfp2;
$$ language sql stable ;
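
With the helper function in place, a similarity search can be run like this (the query SMILES, taken from the result table below, is just an example; output omitted):

chembl_33=# select * from get_mfp2_neighbors('Cc1ccccc1Cn1nc2ccccc2cc1=O') limit 5;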

After all these steps, I could use the rdkit functions from the new version of PostgreSQL.

psql (16.1)
Type "help" for help.

chembl_33=# select * from rdk.mols where m@>'c1cccc2c1nncc2' limit 10;
 molregno |                                  m                                   
----------+----------------------------------------------------------------------
   301422 | CO[C@@H]1CN(C2CCN(C(=O)c3cnnc4ccccc34)CC2)CC[C@@H]1Cc1ccc(Cl)c(Cl)c1
  2682343 | N#Cc1nnc2ccccc2c1Oc1ccc2ncccc2c1
  2784139 | c1cc2cccc3c4cc5n[n+]6c7ccccc7n7cccc7c6cc5cc4c(c1)c23
  2778841 | C[C@H]1COC(=O)N1c1cc(N)c2nnc(NC(=O)[C@@H]3C[C@H]3c3cnn(C)c3)cc2c1
  1605815 | CC(=O)N1C(=O)/C(=C/c2nnc3ccccc3c2Cl)c2ccccc21
  2696533 | O=C(NCc1ccc2c(c1)OCO2)c1nn(-c2ccc(F)cc2)c2ccccc2c1=O
  1573832 | COc1cc2nnc(C(N)=O)c(Nc3ccc(C)cc3F)c2cc1C1CCN(C(C)C)CC1
  1312884 | Cc1ccccc1Cn1nc2ccccc2cc1=O
  1294859 | Cc1cc2c(=O)c(C)n[nH]c2c(C#N)c1C
   748403 | Cc1cc2c(NC(=O)Nc3ccc(F)cc3)cccc2nn1

Building from source sounds a little bit difficult, but there is lots of useful information listed below. By using the same procedure, we can use a new version of the rdkit cartridge on a new version of PostgreSQL ;)

ref
https://www.rdkit.org/docs/Cartridge.html
https://github.com/rdkit/rdkit/tree/master/Code/PgSQL/rdkit
https://github.com/rdkit/rdkit/discussions/5940
https://github.com/rdkit/rdkit/blob/master/Docs/Book/Install.md

Thank you for reading!

Useful package for filtering molecules in python #RDKit #Python #memo

I wrote a blog post about how to use the Lilly filter from REINVENT4. In that post, I built the Lilly filter from source code. After posting it, Hadrien introduced me to a package for using the Lilly filter which can be called from python. Thanks!

The package, named medchem, is provided from the datamol-io repository. I was interested in the package, so I installed it and tried to use it.

To use the Lilly filter from the package, lilly-medchem-rules should be installed via conda.

$ mamba install -c conda-forge lilly-medchem-rules

Then install the medchem package via pip.

$ pip install medchem

After installing both packages, the Lilly filter can be applied from python. The example is uploaded to my gist.

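A minimal sketch of the idea is below; the LillyDemeritsFilters class name and its call signature are my assumptions from the medchem documentation, so please check them against the installed version:

# a minimal sketch; class name and call signature are assumptions from the medchem docs
from rdkit import Chem
from medchem.structural import lilly_demerits

smiles = ["CC(=O)Oc1ccccc1C(=O)O", "c1ccccc1N=Nc1ccccc1"]  # example molecules
mols = [Chem.MolFromSmiles(smi) for smi in smiles]

dfilter = lilly_demerits.LillyDemeritsFilters()
res = dfilter(mols=mols)  # results come back as a pandas DataFrame
print(res.head())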

As you can see, it is easy to apply the filter to rdkit mol objects, and the results come back as a pandas dataframe. It makes post-processing easy.

I think it’s worthwhile because the filtering pipeline can run in python only ;) Thanks for sharing useful code on github!

Use Lilly MedChem rules for scoring of REINVENT4 #cheminformatics #reinvent4 #rdkit

## install LillyMedChem Rules
$ gh repo clone IanAWatson/Lilly-Medchem-Rules
$ cd Lilly-Medchem-Rules
# to run make process, ruby should be installed at first :)
$ make
# Done
$ cd ../

## install bazel-6.2.0
$ sudo apt install apt-transport-https curl gnupg -y
$ curl -fsSL https://bazel.build/bazel-release.pub.gpg | gpg --dearmor > bazel-archive-keyring.gpg
$ sudo mv bazel-archive-keyring.gpg /usr/share/keyrings
$ echo "deb [arch=amd64 signed-by=/usr/share/keyrings/bazel-archive-keyring.gpg] https://storage.googleapis.com/bazel-apt stable jdk1.8" | sudo tee /etc/apt/sources.list.d/bazel.list
$ sudo apt update && sudo apt install bazel-6.2.0

## install LillyMol
$ gh repo clone EliLillyCo/LillyMol
$ cd LillyMol
$ make

## install mordred
$ pip install mordred

Let’s use them. Move to the REINVENT4 directory and check the contrib folder.

REINVENT4$ tree ./contrib/
./contrib/
├── conftest.py
├── reinvent_plugins
│   └── components
│       ├── comp_mordred.py
│       ├── config
│       │   ├── mordred_descriptors.toml
│       │   ├── pharm2dfp.toml
│       │   ├── rdkit_descriptors.toml
│       │   ├── scoring.toml
│       │   └── scoring.toml.org
│       ├── Lilly
│       │   ├── comp_lilly_descriptors.py
│       │   ├── comp_lilly_pains.py
│       │   ├── comp_medchem_rules.py
│       │   ├── __init__.py
│       │   ├── lcm_commands.lst
│       │   ├── pains_scores.csv
│       │   └── __pycache__
│       │       ├── comp_lilly_descriptors.cpython-310.pyc
│       │       ├── comp_lilly_pains.cpython-310.pyc
│       │       ├── comp_medchem_rules.cpython-310.pyc
│       │       └── __init__.cpython-310.pyc
│       ├── NIBRSubstructureFilters
│       │   ├── assign_filters.py
│       │   ├── build_filter_catalog.py
│       │   ├── catalog.pkl
│       │   ├── comp_nibr_substructure_filter.py
│       │   ├── __init__.py
│       │   └── __pycache__
│       │       ├── assign_filters.cpython-310.pyc
│       │       ├── comp_nibr_substructure_filter.cpython-310.pyc
│       │       └── __init__.cpython-310.pyc
│       ├── __pycache__
│       │   └── comp_mordred.cpython-310.pyc
│       ├── RDKit_extra
│       │   ├── comp_pharm2d.py
│       │   ├── comp_rdkit_descriptors.py
│       │   ├── comp_unwanted_substructures.py
│       │   ├── features
│       │   │   ├── BaseFeatures_DIP2_NoMicrospecies.fdef
│       │   │   ├── BaseFeatures.fdef
│       │   │   └── MinimalFeatures.fdef
│       │   ├── __init__.py
│       │   └── __pycache__
│       │       ├── comp_pharm2d.cpython-310.pyc
│       │       ├── comp_rdkit_descriptors.cpython-310.pyc
│       │       ├── comp_unwanted_substructures.cpython-310.pyc
│       │       └── __init__.cpython-310.pyc
│       └── README.md
└── tests
    └── reinvent_plugins
        └── unit_tests
            └── components
                ├── test_comp_lilly_descriptors.py
                ├── test_comp_lilly_medchem_rules.py
                ├── test_comp_lilly_pains.py
                ├── test_comp_nibr_substructure_filter.py
                └── test_comp_unwanted_substructures.py

# set env
$ export PYTHONPATH=/home/iwatobipen/dev/REINVENT4/contrib
$ export LILLY_MEDCHEM_RULES_ROOT=/home/iwatobipen/src/Lilly-Medchem-Rules
$ export LILLY_MOL_ROOT=/home/iwatobipen/src/LillyMol

Then define config.toml; my example is shown below. I commented out CHEMBL_Glaxo because my rdkit env could not import that filter at the moment.

run_type = "scoring"
output_csv = "scoring.csv"
json_out_config = "_scoring.json"

[parameters]

#smiles_file = "tests/chembl_random_100.smi"
smiles_file = "chembl.100.smi"

[scoring]

type = "geometric_mean"
parallel = false

# filters
[[scoring.component]]
[scoring.component.UnwantedSubstructures]

[[scoring.component.UnwantedSubstructures.endpoint]]
#name = "PAINS/CHEMBL_Glaxo"
name = "PAINS/ZINC"

#params.catalogs = ['ALL']
#params.catalogs = ['PAINS', "CHEMBL_Glaxo"]
params.catalogs = ['PAINS', "ZINC"]


[[scoring.component]]
[scoring.component.UnwantedSubstructures]

[[scoring.component.UnwantedSubstructures.endpoint]]
name = "PAINS"

params.catalogs = ['PAINS']


[[scoring.component]]
[scoring.component.UnwantedSubstructures]

[[scoring.component.UnwantedSubstructures.endpoint]]
#name = "CHEMBL_Glaxo"
name = "ZINC"

#params.catalogs = ["CHEMBL_Glaxo"]
params.catalogs = ["ZINC"]


[[scoring.component]]
[scoring.component.NIBRSubstructureFilters]

[[scoring.component.NIBRSubstructureFilters.endpoint]]
name = "NIBR"

params.cutoff = 11


# "normal" scoring components
[[scoring.component]]
[scoring.component.QED]

[[scoring.component.QED.endpoint]]
name = "QED"
weight = 1.2


[[scoring.component]]
[scoring.component.LillyMedchemRules]

[[scoring.component.LillyMedchemRules.endpoint]]
name = "LillyMedchem"

params.relaxed = false

transform.type = "reverse_sigmoid"
transform.high = 201
transform.low = 1.0
transform.k = 1.2


[[scoring.component]]
[scoring.component.TPSA]

[[scoring.component.TPSA.endpoint]]
name = "RDKit TPSA"

transform.type = "double_sigmoid"
transform.low = 51
transform.high = 181
transform.coef_div = 101
transform.coef_si = 151
transform.coef_se = 151


[[scoring.component]]
[scoring.component.LillyDescriptors]

[[scoring.component.LillyDescriptors.endpoint]]
name = "Novartis PSA"

params.descriptors = "w_nvrtspsa"

transform.type = "double_sigmoid"
transform.low = 51
transform.high = 181
transform.coef_div = 101
transform.coef_si = 151
transform.coef_se = 151

[[scoring.component.LillyDescriptors.endpoint]]
name = "RO6 OHNH"

params.descriptors = "w_ro6_ohnh"

transform.type = "left_step"
transform.high = 11


[[scoring.component]]
[scoring.component.LillyPAINS]

[[scoring.component.LillyPAINS.endpoint]]
name = "Alpha"

params.assay = 'Alpha'

transform.type = "sigmoid"
transform.high = 101
transform.low = 1.0
transform.k = 1.2

[[scoring.component.LillyPAINS.endpoint]]
name = "FRET_HS"

params.assay = "FRET_HS"

transform.type = "sigmoid"
transform.high = 101
transform.low = 1.0
transform.k = 1.2

[[scoring.component.LillyPAINS.endpoint]]
name = "QCEnrichment"

params.assay = "QCEnrichment"

transform.type = "sigmoid"
transform.high = 101
transform.low = 1.0
transform.k = 1.2

All settings are done. The config above is not for learning but for scoring, so it conducts scoring on the given SMILES.

Let’s run the reinvent4 :)

$ reinvent ./contrib/reinvent_plugins/components/config/scoring.toml

The input SMILES file is …

$ head -n 5 chembl.100.smi 
CC1(COC(C(=O)Nc2cccc(Cl)c2)=C(C=N)N2CCN(S(=O)(=O)NCc3nc4cc(Cl)ccc4o3)CC2)CC1
CC(=O)C(NC(=O)C(CCCCNC(C)=S)NC(=O)C(CC(=O)O)NC(=O)C(N)CO)C(O)NC(C)C(=O)O
Cc1cc(S(=O)(=O)NC(C)c2nnc3ccccn23)ccc1Br
Cc1cc(=NC(=O)c2ccc3c(-c4nc5ccccc5[nH]4)[nH]nc3c2)[nH][nH]1
O=C1CC(c2ccc(Br)cc2)Nc2c(Br)cc(Br)cc21

And the results file is …

$ head -n 5 scoring.csv 
SMILES,Score,QED,LillyMedchem,RDKit TPSA,Novartis PSA,RO5 OHNH,Alpha,FRET_HS,QCEnrichment,PAINS/ZINC,PAINS,ZINC,NIBR,QED (raw),LillyMedchem (raw),RDKit TPSA (raw),Novartis PSA (raw),RO5 OHNH (raw),Alpha (raw),FRET_HS (raw),QCEnrichment (raw),PAINS/ZINC (raw),PAINS (raw),ZINC (raw),NIBR (raw)
CC1(COC(C(=O)Nc2cccc(Cl)c2)=C(C=N)N2CCN(S(=O)(=O)NCc3nc4cc(Cl)ccc4o3)CC2)CC1,0.0020991266734168117,0.16378955198231193,1.023292991233625e-09,1.0,1.0,0.0,0.09090909090909091,0.09090909090909091,0.09090909090909091,0,1,0,0,0.16378955198231193,999.0,140.85999999999999,149.24,3.0,0.0,0.0,0.0,0,1,0,0
CC(=O)C(NC(=O)C(CCCCNC(C)=S)NC(=O)C(CC(=O)O)NC(=O)C(N)CO)C(O)NC(C)C(=O)O,0.0001855987134810415,0.040164377268719686,1.023292991233625e-09,nan,nan,0.0,0.09090909090909091,0.09090909090909091,0.09090909090909091,1,1,1,0,0.040164377268719686,999.0,269.51,301.6,11.0,0.0,0.0,0.0,1,1,1,0
Cc1cc(S(=O)(=O)NC(C)c2nnc3ccccn23)ccc1Br,0.027502003498003445,0.7372736355713102,0.820495207286423,1.0,1.0,0.0,0.09090909090909091,0.09090909090909091,0.09090909090909091,1,1,1,1,0.7372736355713102,34.0,76.36,84.74,1.0,0.0,0.0,0.0,1,1,1,1
Cc1cc(=NC(=O)c2ccc3c(-c4nc5ccccc5[nH]4)[nH]nc3c2)[nH][nH]1,0.026866000966922644,0.388841533999667,0.7879267680774021,1.0,1.0,0.0,0.09090909090909091,0.09090909090909091,0.09090909090909091,1,1,1,1,0.388841533999667,43.0,118.37,118.37,4.0,0.0,0.0,0.0,1,1,1,1
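
The output CSV can then be post-processed with pandas; a small sketch (column names taken from the header above):

import pandas as pd

# load the REINVENT4 scoring output and rank molecules by total score
df = pd.read_csv("scoring.csv")
print(df.sort_values("Score", ascending=False)[["SMILES", "Score", "LillyMedchem"]].head())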

As you can see, Reinvent4 can use the Lilly filter as a scoring function. It means that the compound generator can learn to avoid unwanted chemical structures based on the Lilly MedChem Rules.

This is an example of how to use a new scoring function in Reinvent4. As a next step, I would like to make my own scoring function for reinvent4.

Thanks for reading.