Quantum computing for chemistry #RDKit #psi4 #qiskit

Some years ago I read book about quantum computing written in Japanese. https://www.amazon.co.jp/%E9%87%8F%E5%AD%90%E3%82%B3%E3%83%B3%E3%83%94%E3%83%A5%E3%83%BC%E3%82%BF%E3%81%8C%E4%BA%BA%E5%B7%A5%E7%9F%A5%E8%83%BD%E3%82%92%E5%8A%A0%E9%80%9F%E3%81%99%E3%82%8B-%E8%A5%BF%E6%A3%AE-%E7%A7%80%E7%A8%94/dp/4822251896

It was very exciting for me and I had interested in quantum computer.

Quantum computer has potential to solve many difficult problems for conventional computer. Of course it is useful for chemistry. You know qiskit is one of python library for quantum computing frame work which is developed by IBM.

Fortunately, pythonista can use qiskit just only type pip install qiskit!

Qiskit has driver for several quantum chemistry calculation frame work such as PySCF, Psi4 and Gaussian etc.

Today I tried to use qiskit for chemistry problem with psi4(psikit).

Following code calculate energy of the query molecule with psi4 and optimize hamiltonian with quantum computer (used simulator in following case).

Let’s write code. Import packages for calculation and define querymaker method for qiskit driver object. I used PSI4Driver but other drivers are also available. If reader would like to use pyscf, pyquante and gaussian, please check the document.

 
import qiskit
from qiskit.chemistry import drivers
from qiskit.chemistry import FermionicOperator
from psikit import Psikit

# Defiene query maker for Psi4Driver I used psikit for mol to xyz
def querymaker(smiles, basis='sto-3g'):
    pk = Psikit()
    pk.read_from_smiles(smiles)
    xyz = pk.mol2xyz()
    setting_str = 'molecule mol {' + xyz + '}\n\n'
    setting_str += 'set {\n basis ' + basis + '\n scf_type pk\n}'
    setting_str = setting_str.replace('no_reorient', '').replace('no_com', '')
    return setting_str

Next, I passed the query to the driver and run the 1st step calculation.

 
# make query for quantum calculation

query = querymaker('O')
driver = drivers.PSI4Driver(query.split('\n'))

# run psi4 calculation
mol = driver.run()
print('finish psi4 calc')
num_particles = mol.num_alpha + mol.num_beta
num_spin_orbitals = mol.num_orbitals * 2

After the calculation, make qubit operator. The code is almost same as the original repository’s README.md.

 
# Build qubit operator
# please see eq 5 of the reference
ferOp = FermionicOperator(h1=mol.one_body_integrals, h2=mol.two_body_integrals)

map_type = 'PARITY'
qubitOp = ferOp.mapping(map_type)
qubitOp = qubitOp.two_qubit_reduced_operator(num_particles)
num_qubits = qubitOp.num_qubits
print(f'There are {num_particles} particles ')
print(f'{num_qubits} Qubits will be used for calculation')

Fermionoperator is the function which maps fermionic Hamiltonian to qubit Hamiltonian.

And the final part of this code is quantum computation.

 
# set the backend for the quantum computation
from qiskit import Aer
backend = Aer.get_backend('statevector_simulator')

from qiskit.aqua.components.optimizers import L_BFGS_B
optimizer = L_BFGS_B()

from qiskit.chemistry.aqua_extensions.components.initial_states import HartreeFock
init_state = HartreeFock(num_qubits, num_spin_orbitals, num_particles)
from qiskit.aqua.components.variational_forms import RYRZ
var_form = RYRZ(num_qubits, initial_state=init_state)

from qiskit.aqua.algorithms import VQE
algorithm = VQE(qubitOp, var_form, optimizer)
result = algorithm.run(backend)

print(result)

VQE is a variational eigenvalue solver on a quantum processor. The details are described following article.

https://arxiv.org/pdf/1304.3061.pdf

It is useful for solving large eigenvalue problems. And final output was below.

 
iwatobipen$ python qiskit_test.py 

  Memory set to   3.725 GiB by Python driver.
  Threads set to 4 by Python driver.
finish psi4 calc
There are 10 particles 
12 Qubits will be used for calculation
{'num_optimizer_evals': 1067, 'min_val': -82.1551123486962, 'opt_params': array([ 3.14159265, -3.14159265,  3.14159265, -0.47209231,  3.14159265,
        1.26538018,  3.08659821, -2.03722008,  3.14159265,  1.83786188,
       -1.14116453, -1.22806647,  3.14159265, -2.51141238,  2.76518682,
        0.65038408, -0.5401877 ,  0.94977249,  0.16367673, -2.61606226,
        2.14714991,  1.33265892,  0.14693617, -2.29699039, -0.06600933,
        1.37467628, -3.14159265, -1.58506755, -2.26983475, -0.34581384,
        1.11665802, -0.7554235 ,  1.3765809 , -0.14743813,  2.31945686,
       -0.35074729,  3.13606185, -1.74064297, -1.1052301 ,  1.83308917,
       -0.55939713,  0.58150969, -2.57860126, -2.64458621, -2.38124305,
       -3.03600205,  1.50484991, -3.07866269,  3.14159265, -0.28057194,
        3.14159265, -1.20003684,  0.2060857 ,  0.7300913 ,  0.13547982,
        0.93496467, -2.7768489 ,  2.78637461,  3.06799269,  2.39524209,
        3.14159265, -2.55084639, -2.42941145,  2.57242438,  0.25916875,
        1.85199233,  2.96304219, -0.15247656, -0.09554275,  1.10617151,
        0.83882494,  0.73484547,  0.09034047, -0.95804794, -2.97968157,
       -0.96409293, -1.70094319,  0.719477  ,  0.96564404,  0.65027412,
       -1.65178365, -2.26639198, -1.25597346, -3.01669026, -3.11323418,
        0.82251654,  1.33881735,  0.45800526,  0.32908445, -2.53067172,
        0.36776732, -2.94786506,  0.22166631, -1.00564524, -2.19364914,
       -2.35394437]), 'eval_time': 150.38587403297424, 'eval_count': 1067, 'energy': -82.1551123486962, 'eigvals': array([-82.15511235]), 'min_vector': array([ 1.60736505e-04+1.29185427e-04j, -1.48578723e-03+4.25322038e-04j,
        1.60461226e-05-4.73763479e-06j, ...,
        1.05276114e-04-1.39253104e-04j, -7.71091869e-07+1.48904495e-07j,
       -4.41171486e-06-1.34586907e-05j]), 'eigvecs': array([[ 1.60736505e-04+1.29185427e-04j, -1.48578723e-03+4.25322038e-04j,
         1.60461226e-05-4.73763479e-06j, ...,
         1.05276114e-04-1.39253104e-04j, -7.71091869e-07+1.48904495e-07j,
        -4.41171486e-06-1.34586907e-05j]])}

The result object was dictionary so energy can get by result[‘energy’].

The algorithm of quantum computing is based on physics and quantum chemistry. Now I am learning not only quantum coding but also physics, mathematics, quantum chemistry.

An article about fluorine #memo #medchem #journal

Recently there are many drugs that contain fluorine atom. Fluorine atom is often used in medicinal chemistry to improve potency, metabolic stability reduce toxicity etc. And the recent progress of chemistry enable to introduce fluorine atoms in many position of molecule. You know, there are many fluorination reagents.

Fluorine atom is familiar to medicinal chemists, of course I try to introduce fluorine atoms when I design new molecules.

Today, I read very useful article about fluorine atoms in drug discovery. URL is below.

https://pubs.acs.org/doi/10.1021/acsmedchemlett.9b00235

The author shows some example of instability of fluorine containing drugs.

Fig1 and 2 shows an example of instability of F containing compounds. Fluorine substituted alkyl has risk of F- releases and produce reactive metabolite.

The substitution and elimination mechanism of fluorine atom is based on organic chemistry. So medicinal chemist can predict it I think.

F atoms are very useful for drug design so drug designer need to think about not only potency, ADMET profile but also stability of designed compounds.

Chemistry is exciting!

GetTanimtoSim/DistMat as a lower triangular matrix and convert 2D array #RDKit #Chemoinformatics

Similarity / Dissimilarity matrix is often used for not only checking the bulk of molecular similarity but also the chemical space (such as MDS).

The cost of pair similarity matrix is O(N^2). After calculation, I can get N x N similarity / dis similarity matrix.

I found some functions to calculate similarity metrics in rdkit named GetTanimotoDistMat and GetTanimotoSimMat. Official document is below.
https://www.rdkit.org/docs/source/rdkit.DataManip.Metric.rdMetricMatrixCalc.html

These function compute the distance/simirality matrix from a list of BitVects using the Tanimoto distance metric. Returns 1D l-triangular matrix.

I used the function and after getting the results, I tried to convert 1D l-tri matrix to 2D symmetric matrix. At first I load some functions to do that.

 
%matplotlib inline
import os
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
import numpy as np
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
import seaborn as sns
from rdkit.DataManip.Metric import GetTanimotoDistMat
from rdkit.DataManip.Metric import GetTanimotoSimMat
from rdkit import rdBase
from rdkit.Chem import RDConfig
print(rdBase.rdkitVersion)
>2019.03.2

Then read cdk2.sdf as an example.

sdfdir = os.path.join(RDConfig.RDDocsDir, 'Book/data/cdk2.sdf')
mols = [m for m in Chem.SDMolSupplier(sdfdir)]

To use GetTanimotoSim(Dist)Mat, list of bit vector is required. So I made the list and calculate the matrix. Got MorganFP with common way in RDKit.

 
morganfps = [AllChem.GetMorganFingerprintAsBitVect(m,2) for m in mols]
distmat = GetTanimotoDistMat(morganfps)
simmat = GetTanimotoSimMat(morganfps)
print(distmat, len(distmat))
print(simmat, len(simmat))
print(len(mols))
> [0.53703704 0.51851852 0.43396226 ... 0.88659794 0.92079208  0.86538462] 1081
> [0.46296296 0.48148148 0.56603774 ... 0.11340206 0.07920792 0.13461538] 1081
>47

OK it seems work well. By the way, how can I convert the l-tri mat to 2d array?

Mathematically, the number of similarity combination of M molecules are calculated with following equation.

N = M (M + 1)/2 – M

The above case, I have 47 molecules, so N = 47(1+47)/2 – 47 = 1081 ;)

So, I can calculate M from N with quadratic formula.

M = [1 + sqr(1 + 4 * 2 * N)]/2

Now ready to write code! It seems redundant. If reader have more smart code, please let me know ;)

 
def tri2mat(tri_arr):
    n = len(tri_arr)
    m = int((np.sqrt(1 + 4 * 2 * n) + 1) / 2)
    arr = np.ones([m, m])
    for i in range(m):
        for j in range(i):
            arr[i][j] = tri_arr[i + j - 1]
            arr[j][i] = tri_arr[i + j - 1]
    return arr

Check the code and visualize these matrix.

 
distarr = tri2mat(distmat)
simmat = tri2mat(simmat)
sns.heatmap(simmat[:10,:10])

Molecules with index 8 and 9 showed very dissimilar for me and heat map showed the trend. And Molecules with index 0 and 1 showed high similarity.

I think it is rare case convert from tri-mat to symmetric-mat because tri-mat form is computer cost effective. Symmetric-mat form needs more memory.

The code is just my practice. I posted whole code to my gist.

QSAR toolbox for myself #RDKit #Chemoinformatics

I often work with SDF to build some QSAR models. Data preparation process is almost same. So I decided to make tool box for myself. It is not complex code, very simple.
https://github.com/iwatobipen/QSAR_TOOLBOX

Current code has two main scripts. One is finger print generator and another is model builder and optimizer which make SVM, RF and Gaussian Process Classifier and optimize hyper parameters with optuna. Optuna is developed by preferred networks.

Fingerprint maker named fpgen.py needs SDF as input. And the script generates fingerprint array and target array as npz format. It can select ECFP/FCFP as an option.

Following code is an example. I downloaded JAK3 inhibitors from ChEMBL as TSV and converted the file to SDF.

 
usage: fpgen.py [-h] [--fptype FPTYPE] [--radius RADIUS] [--nBits NBITS]
                [--molid MOLID] [--target TARGET] [--output OUTPUT]
                input

positional arguments:
  input            finename of sdf

optional arguments:
  -h, --help       show this help message and exit
  --fptype FPTYPE  ECFP, FCFP,
  --radius RADIUS  radius of ECFP, FCFP
  --nBits NBITS    number of bits
  --molid MOLID    molid prop
  --target TARGET  target name for predict
  --output OUTPUT  output path
# Example usage
python qsartools/fpgen.py example/CHEMBL25-chembl_activity-JAK3.sdf --molid Molecule --target Class

The fingerprint and class array was stored in ./data folder.

Ready to build model. Then run the next script. I referred following blog post. https://qiita.com/koshian2/items/ef9c0c74fe38739599d5
And wrote the code. This code make model and optimize parameters and save the model as pkl.

import argparse
import uuid
import pickle
import optuna
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier


def make_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument('X', type=str, help='data for X npz format')
    parser.add_argument('Y', type=str, help='data for Y npz format')
    parser.add_argument('target', type=str, help='name of target, it will be database name')
    parser.add_argument('--testsize', type=float, help='testsize of train/test split default = 0.2', default=0.2)
    parser.add_argument('--n_trials', type=int, help='number of trials default = 200', default=200)
    parser.add_argument('--outpath', type=str, help='path for output default is lgb_output', default='svc_rf_gp_output')
    return parser

def objectives(trial):
    trial_uuid = str(uuid.uuid4())
    trial.set_user_attr('uuid', trial_uuid)
    classifier_name = trial.suggest_categorical('classifier', ['SVC', 'RandomForest', 'GP'])
    if classifier_name == 'SVC':
        svc_c = trial.suggest_loguniform('svc_c', 1e-10, 1e10)
        svc_gamma = trial.suggest_loguniform('svc_gamma', 1e-10, 1e10)
        classifier_obj = SVC(C=svc_c, gamma=svc_gamma)
        trial_uuid += 'svc_'
    elif classifier_name == 'RandomForest':
        rf_max_depth = int(trial.suggest_loguniform('rf_max_depth', 2, 32))
        classifier_obj = RandomForestClassifier(max_depth=rf_max_depth, n_estimators=10)
        trial_uuid += 'rf_'
    else:
        classifier_obj = GaussianProcessClassifier()
        trial_uuid += 'gp_'
    classifier_obj.fit(train_x, train_y)
    score = cross_val_score(classifier_obj, train_x, train_y, n_jobs=4, cv=5)
    val_accuracy = 1.0-score.mean()
    trial.set_user_attr('val_acc', val_accuracy)
    #classifier_obj.fit(train_x, train_y)

    y_pred_train = classifier_obj.predict(train_x)
    y_pred_test = classifier_obj.predict(test_x)

    acc_train = accuracy_score(train_y, y_pred_train)
    acc_test =  accuracy_score(test_y, y_pred_test)
    
    error_train = 1.0 - acc_train
    error_test = 1.0 - acc_test

    trial.set_user_attr('train_error', error_train)
    trial.set_user_attr('train_acc', acc_train)
    trial.set_user_attr('test_error', error_test)
    trial.set_user_attr('test_arcc', acc_test)

    if not os.path.exists('svc_rf_gp_output'):
        os.mkdir('svc_rf_gp_output')
    with open('svc_rf_gp_output/' + f'{trial_uuid}.pkl', 'wb') as fp:
        pickle.dump(classifier_obj, fp)
    return error_test

if __name__=='__main__':
    parser = make_parser()
    args = parser.parse_args()
    study = optuna.create_study(storage=f'sqlite:///{args.target}_svc_rf.db')
    X = np.load(args.X)['arr_0']
    Y = np.load(args.Y)['arr_0']
    print(X.shape)
    print(Y.shape)
    train_x, test_x, train_y, test_y = train_test_split(X, Y, test_size=args.testsize)
    study.optimize(objectives, n_trials=args.n_trials)
    print(study.best_params)
    print(study.best_value)
    df = study.trials_dataframe()
    df.to_csv(f'{args.outpath}/optuna_svc_rf.csv')

Run the code is very easy.

 
usage: svm_rf_gp_optuna.py [-h] [--testsize TESTSIZE] [--n_trials N_TRIALS]
                           [--outpath OUTPATH]
                           X Y target

positional arguments:
  X                    data for X npz format
  Y                    data for Y npz format
  target               name of target, it will be database name

optional arguments:
  -h, --help           show this help message and exit
  --testsize TESTSIZE  testsize of train/test split default = 0.2
  --n_trials N_TRIALS  number of trials default = 200
  --outpath OUTPATH    path for output default is lgb_output

python qsartools/svm_rf_gp_optuna.py data/ECFP_2_1024_X.npz data/Class_arr.npz JAK3_ --n_trial 50

After finished the code, I could get csv file as a summary optimization process.

The screen shot is below. SVC showed the best performance against test dataset. And GP was the second.

code

There are many useful packages, documents and web resources. So I can learn lots from them.

I would like to update the toolbox with some functions near the feature.

The last piece of gem-difluorocycloalkans #medchem #memo #journal

Recently fluorination is one of an important strategy in medicinal chemistry, it sometime improves not only potency but also any ADMET profiles. Yes I like fluorination strategy ;)

However fluorination reaction is needed very harsh conditions. So I think fluorinated building blocks are very useful for medicinal chemistry to avoid such as harsh reaction steps in their synthetic routes.

Today I found interesting article published from ‘Enamine’ which is the one of big CRO in drug discovery area.

https://pubs.acs.org/doi/full/10.1021/acs.joc.9b00719

The group developed effective synthetic route of gem-difluorinated cyclobutan derivatives. In fig2 shows that only one article is found in Reaxys which describes gem-F cyclo-butane.

The author tried several route for the synthesis and key of the success was preventing enolization of cyclobutanone which is precursor of gem-F cyclobutane. (see Scheme2 in the article)

Finally they synthesized many type of gem-F cyclobutane derivatives in large scale.

In the experimental section, all reactions were conducted in gram scale!

The building blocks reported in the article have only one reaction point but it will be very useful BB for MedChem to control compound’s profiles.

Some years ago, I could have opportunity to visit Enamine. In the lab. was well organized and have high level science.

I would like to these BB if I could have my own drug discovery project near the feature.

Psikit update/Draw ESP, HOMO LUMO #RDKit #Chemoinformatics #quantumchemistry

I just updated psikit which is package for quantum-chemoinformatics ;)

It can be installed from conda / pypi :)

I added and updated new function for molecular property rendering.

Current version of psikit can draw not only frontier orbital but also ESP and dual descriptor. Dual descriptor is calculated by psi4. What is dual descriptor? From original ducment.

Calculates the dual descriptor from frontier orbitals: 𝑓2(𝐫)=𝜌LUMO(𝐫)−𝜌HOMO(𝐫)f2(r)=ρLUMO(r)−ρHOMO(r). The dual descriptor is a good measure of nucleophilicity and electrophilicity, containing information essentially equivalent to both Fukui functions combined. More details on the dual descriptor itself can be found in [Morell:2005:205], while the current implementation is described in [Martinez-Araya:2015:451]. This feature is currently only supported for closed shell systems.

It is very easy to get these images!

Let’s test the functions with acetic acid as an example.

Following code is almost borrowed form the UGM material. Thanks for sharing nice code. Import packages, read Reaction data and reaction objects at first. For convenience, I recommend to install ipymol at first.

 
import ipymol
from psikit import Psikit
pk = Psikit()
v = ipymol.viewer
v.start() # pymol will launch

Calculate energy of acetic acid.

Then call getMOview() for getting some cube files.


pk.getMOview()

Now data preparation is finished. I can get several views just call view_on_pymol with target option which I would like to draw in pymol.

 
pk.view_on_pymol(target='ESP')
pk.view_on_pymol(target='DUAL', maprange=0.001)
pk.view_on_pymol('FRONTIER')

Now I could get 3 views in pymol. Following images are results.

ESP
DUAL
LUMO
HOMO

Visualization of quantum chemistry properties are useful for medchem I think. Any comments, requests and suggestions are greatly appreciated.

code example

Today’s code is uploaded my repo and URL is below.

https://github.com/iwatobipen/playground/blob/master/new_function_psikit.ipynb

Nice review about ML for drug discovery #memo #chemoinfomatics

Here is a nice review of Machine learning(ML) in drug discovery. If reader has interested in ML and drug discovery, I recommend to read the article ;)

https://www.nature.com/articles/s41573-019-0024-5

The article describes how ML is used in drug discovery. Recently DL is used in wide are in drug discovery target identification and validation, compound screening and lead discovery, preclinical development and clinical development.

Fig2 in the article summaries the applications of ML in drug discovery. I think the figure is well summarized.
As aforementioned, this article covers wide range and it is very informative for me because I’m not familiar to bioinformatics. So I’m not sure how to use ML in bioinformatics area.

And also the article describes ML in chemofinromatics area.

In the section of small-molecule design and optimization where is the most interested part for me, and I agree the authors opinion at the part.

Applications of machine learning in drug discovery and development

”’
An unresolved challenge in the field of small-molecule design is how to best represent the chemical structure. A plethora of representations exist, from simple cir- cular fingerprints such as the extended-connectivity fingerprint (ECFP) to sophisticated symmetry functions (Fig. 3). It is still not clear which structure represen- tation works best for which small-molecule design problem. Therefore, it will be interesting to see if the rise in ML studies in the field of cheminformatics will give more guidance about the best choice for structure representation.
”’

Recent trend of the representation of molecule is graph based representation. But it is not the best way to molecular representation. It has still room for inprovement.

Recently we can see the word ‘AI drug discovery’ but it is hype I think.

If AI work so well in the drug discovery project, compound production step will be bottle neck. Lab automation, robotics is very important and attractive area for pharma I think.

I hope AI drug discovery will be good tool for drug discovery not hype.

And the science keep human health.

I appreciate the authors for writing such a nice review about ML in drug discovery area.