Current version Openforcefield supports rdkit #RDKit #Openforcefield #chemoinformatics

I posted about openforcefield(OpenFF) before. You know, old version of openff supports only OpenEyeTK but current version supports RDKit too.

It is worth to know that we can use openff with open source tool kit. I really appreciate developer’s work! It is great. Today I use the package and ipymol which can control pymol in ipython session. It means that you can control pymol from jupyter notebook! My example code is below.

import openforcefield as off
from rdkit import Chem
from rdkit.Chem import AllChem
from simtk import openmm, unit
from simtk.openmm import app
from openforcefield.topology import Topology
from openforcefield.topology import Molecule
from openforcefield.typing.engines.smirnoff import ForceField
from rdkit.Chem.Draw import IPythonConsole

Load SMIRNOF99FROSS forcefield.

ff = ForceField('test_forcefields/smirnoff99Frosst.offxml')

Then define get_energy function.

def get_energy(system, positions):
    integrator = openmm.VerletIntegrator(1.0 * unit.femtoseconds)
    context = openmm.Context(system, integrator)
    context.setPositions(positions)
    state = context.getState(getEnergy=True)
    energy = state.getPotentialEnergy().in_units_of(unit.kilocalories_per_mole)
    return energy

Make sample mol object with rdkit and convert openff mol object. It is easy, just call from_rdkit!.

rdmol = Chem.MolFromSmiles('c1c(c2sccc2)c(c3c[nH]cc3)oc1')
rdmol = Chem.AddHs(rdmol)
rdmol
ofmol = Molecule.from_rdkit(rdmol)

Then generate conformer with openff function. And get topology.

ofmol.generate_conformers(n_conformers=10)
topology = ofmol.to_topology()

Next, create openMM system with generated topology and get position of one conformer.

org_system = ff.create_openmm_system(topology)
pos = ofmol.conformers[0]

At first, calculate energy with default conformation.

get_energy(org_system, pos)
> Quantity(value=80.93719044789302, unit=kilocalorie/mole)

Next I tried to minimize energy with openMM method.

new_system = ff.create_openmm_system(topology)
new_energy = get_energy(new_system, pos)
from sys import stdout
def minimizeOpenMM(Topology, System, Positions):
    integrator = openmm.LangevinIntegrator(
                                        300.0 * unit.kelvin,
                                        1.0 / unit.picosecond,
                                        2.0 * unit.femtosecond)
                                        #.002 * unit.picoseconds)
    simulation = app.Simulation(Topology, System, integrator)
    simulation.context.setPositions(Positions)
    simulation.minimizeEnergy(tolerance=5.0E-9, maxIterations=2000)
    state =  simulation.context.getState(getPositions=True, getEnergy=True)
    positions =state.getPositions(asNumpy=True)
    energy = state.getPotentialEnergy().in_units_of(unit.kilocalories_per_mole)
    
    simulation.reporters.append(app.StateDataReporter(stdout, 1000, step=True, potentialEnergy=True, temperature=True))
    simulation.step(30000)
    print(energy)
    positions = positions / unit.angstroms
    coordlist = list()
    for atom_coords in positions:
        coordlist += [i for i in atom_coords]
    return coordlist, positions

cl, pos=minimizeOpenMM(topology, org_system, pos)

Now I could get minimized position by calling minimizeOpenMM.

Next I updated atom position with optimized atom geometries.

from rdkit.Chem import rdGeometry
from rdkit.Chem.rdchem import Conformer
AllChem.EmbedMolecule(rdmol, useExpTorsionAnglePrefs = True , useBasicKnowledge = True)

conf = rdmol.GetConformer()
w=Chem.SDWriter("test1.sdf")
w.write(rdmol)
w.close()
for i in range(rdmol.GetNumAtoms()):
    conf.SetAtomPosition(i, rdGeometry.Point3D(pos[i][0], pos[i][1],pos[i][2],))
w=Chem.SDWriter("test2.sdf")
w.write(rdmol)
w.close()

I made two SDF one is default conformer and another is minimized conformer.

OK, I use ipymol to communicate pymol and visualize in jupyter notebook.

from ipymol import viewer
viewer.start() # this method launches pymol
viewer.load("test1.sdf")
viewer.load("test2.sdf")
viewer.align("test1","test2")
from ipymol import viewer
viewer.start() # this method launches pymol
viewer.load("test1.sdf")
viewer.load("test2.sdf")
viewer.align("test1","test2")

viewer.ray()
viewer.display()
viewer.deleteAll()
viewer.load("test2.sdf")
viewer.ray()
viewer.display()
viewer.deleteAll()
viewer.load("test1.sdf")
viewer.ray()
viewer.display()

The result seems different to original article. https://www.biorxiv.org/content/early/2018/07/13/286542.full.pdf

Hmm why???

BTW, OpenFF is very attractive package for chemoinformatics I think.

My code can access below.

https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/openff.ipynb

Advertisements

Comparison of rdChemReactions and EditableMol #RDKit #chemoinformatics

In this year I moved from MedChem team to CompChem team. And now I need to learn SBDD. Today I struggled mol object that has 3D information.

I would like to replace hydrogen which attached aromatic carbon to some atoms. I thought it is easy if I use rdChemReactions method. But I found that it was not good approach. Because RunReactants method generates products but the method can not keep 3D information of reacted atoms.

It is very interesting and good information for me. Let see example code.

Following code is very simple example.

code example

At first, I tried to replace atom with rdChemReactions method. It worked well but the position of fluorine atom was 0, 0, 0. It indicates that the method can not keep 3D information. I think it is reasonable because some chemical reaction dramatically changes molecular conformation.

The second example used Editable mol.

This approach could keep 3D information, it means that the method do just replace atom!

Of course bond length of C-H and C-F is different but the approach is more suitable for atom scanning I think.

I always surprised that RDKit has many useful function for drug design.

Molecular encoder/decoder (VAE) with python3.x (not new topic) #rdkit #chemoinformatics

The first day of 10-day holiday is rainy. And I and my kid will go to dodge ball tournament.

Two years ago, I tried to modify the keras-molecule which is code of molecular encoder decoder. The code is written for python 2.x. So I would like to run the code on python 3.6. I stopped the modification because I often use pytorch and found molencoder for pytorch. https://github.com/cxhernandez/molencoder

Today I tried to modify keras-molecule again.
I think almost done and the code will run on python 3.x, with keras 2.x tensorflow backend.

https://github.com/iwatobipen/keras-molecules

I checked the code on google colab. Because google colab provides GPU environment. It is useful for deep learning. At first, I installed conda and several packages which are used in keras-molecule.

Following code was written on google colab.

# install conda
!wget https://repo.continuum.io/miniconda/Miniconda3-4.4.10-Linux-x86_64.sh
!chmod +x ./Miniconda3-4.4.10-Linux-x86_64.sh
!time bash ./Miniconda3-4.4.10-Linux-x86_64.sh -b -f -p /usr/local

import sys
import os
sys.path.append('/usr/local/lib/python3.6/site-packages/')
# install packages via conda command
!time conda install -y -c conda-forge rdkit
!conda install -y -c conda-forge h5py
!conda install -y -c conda-forge scikit-learn
!conda install -y -c conda-forge pytables
!conda install -y -c conda-forge progressbar
!conda install -y -c conda-forge pandas
!conda install -y -c conda-forge keras

For in my case, rdkit installation took very long time, 20min or more…. I waited patiently.

After installation, check rdkit. Installed old version. But it does not matter.

from rdkit import Chem
from rdkit import rdBase
print(rdBase.rdkitVersion)
> 2018.09.1

Then clone code from github repo, change current dir and download sample smiles. Keras-molecule uses git-lfs for data strage. So files which are stored is not real .h5 files. I skipped installation of git-lfs to colab.

!git clone https://github.com/iwatobipen/keras-molecules
os.chdir('keras-molecules/')
# original data is stored with git-lft. So it can't get directly with git.
# So I uploaded small sample data to my repo.
!wget https://github.com/iwatobipen/playground/raw/master/smiles_50k.h5

Now ready, preprocess is needed at first. It is easy to do it.
After finished processed.h5 file will be stored in data/ .

!python preprocess.py smiles_50k.h5 data/processed.h5

Then load the data and do training. I did it on notebook, so following code is different from original README section.

import train
data_train, data_test, charset = train.load_dataset('data/processed.h5')
model = train.MoleculeVAE()
model.create(charset,  latent_rep_size=292)
checkpointer = train.ModelCheckpoint(filepath = 'model.h5',
                                   verbose = 1,
                                   save_best_only = True)

reduce_lr = train.ReduceLROnPlateau(monitor = 'val_loss',
                                  factor = 0.2,
                                  patience = 3,
                                  min_lr = 0.0001)
model.autoencoder.fit( data_train,
                  data_train,
                  shuffle = True,
                  epochs = 20,
                  batch_size = 600,
                  callbacks = [checkpointer, reduce_lr],
                  validation_data = (data_test, data_test)
            )

It took 20-30min for training on google colab with GPU. I don’t recommend run the code on PC which does not have GPU. I will take too long time for training.

After training, I ran interpolate. Interpolate samples molecules between SORUCE and DEST from latent space. Ideally the approach can change molecule gradually because latent space is continuous space. It sounds interesting.

import interpolate
h5f = interpolate.h5py.File('data/processed.h5' ,'r')
charset = list(h5f['charset'][:])
charset = [ x.decode('utf-8') for x in charset ]
h5f.close()

LATENT_DIM = 292
WIDTH = 120
model.load(charset, 'model.h5', latent_rep_size = LATENT_DIM)

I set two molecules for SOURCE and DEST.

SOURCE = 'CC2CCN(C(=O)CC#N)CC2N(C)c3ncnc1[nH]ccc13'
DEST = 'CCS(=O)(=O)N1CC(C1)(CC#N)N2C=C(C=N2)C3=C4C=CNC4=NC=N3'
STEPS = 10
results = interpolate.interpolate(SOURCE, DEST, STEPS, charset, model, LATENT_DIM, WIDTH)
for result in results:
  print(result[2])

The result was….

Cr1cccccCCCCCCCCCCCCCCCcccccccccccccccc Cr1cccccCCCCCCCCCCCCCCCcccccccccccccccc Cr1cccccCCCCCCCCCCCCCCCCCccccccccccccccc Cr1cccccCCCCCCCCCCCCCCCCCCCcccccccccccccc Cr1cccccCCCCCCCCCCCCCCCCCCCCCccccccccccccc Cr1ccccCCCCCCCCCCCCCCCCCCCCCCCCCCcccccccccc Cr1CcccCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCccccccc Cr1CccCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCcccc CC1CccCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC CC1CccCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

Opps! All molecules are not molecules. It just almost C/c……

The model will be improved if I set more large number of epochs and give more learning data.

Naive VAE approach is difficult to generate valid molecules. Now there are many approaches about molecule generation are reported. JT-VAE, Grammer-VAE, etc, etc…. What kind of generator do you like?

Today’s code is redundant but uploaded my github repo.

https://github.com/iwatobipen/playground/blob/master/mol_vae.ipynb

Visualize Molecular Orbital with pymol and psikit #RDKit #psi4 #psikit #pymol

I posted how to visualize MO with VMD, the data was generated from psikit.

I used VMD because psi4 provides visualize function for cubefile and I could not find the same method on pymol. Pymol is familier for me than VMD. So I would like to do same thing on pymol. And today I could do it.

It is very easy! Let’s try it. At first, geometry optimize and generate cubefile of acetic acid and tetrazole.

import sys
sys.path.append('/Users/iwatobipen/develop/chemoenv/psikit/psikit')
from psikit import Psikit
from rdkit import Chem
pk = Psikit()
# acetic acid
pk.read_from_smiles("CC(=O)[O-]")
pk.optimize()
Chem.MolToMolFile(pk.mol, 'acetic_acid.mol')
pk.getMOview()

# tetrazole
pk.read_from_smiles("CC1=NN=N[N-]1")
pk.optimize()
pk.getMOview()

The getMOview() to generate MO cube files. The function generates 4 files named Psi_a/b_n_n_A.cube. n is number of alpha electrons and number of alpha electrons + 1. It means HOMO and LUMO.

After file generation, launch pymol and load mol file and cubefiles. For acetic acid, acetic_acid.mol, Psi_a_16_16-A.cube and Psi_b_16_16-A.cube files are loaded for HOMO drawing.

Then, draw isomesh and set color from pymol command line.

isomesh HOMO_A,  Psi_a_16_16-A, -0.02
isomesh HOMO_B,  Psi_b_16_16-A, 0.02
color blue, HOMO_A
color red, HOMO_B

Now I could get following image on pymol.

And I could get HOMO of tetrazole with same manner.

Pymol can draw ligand-protein complex easily. I think it is interesting for rendering ligand MO on protein-ligand complex and think about new drug design.

Adjustment bond length and align molecule to scaffold #RDKit #chemoinformatics

To align rdkitmol object to given scaffold, GenerateDepictionMatching2DStructure is useful to do that. But, somedays ago I asked my colleague that the function could not adjust bond length when scaffold which is read from sdf and it has different bond length of RDKit’s default settings.

I found useful information in rdkit-discuss about rdMolTransforms.

rdMolTransforms.TransformConformer method can change bond length, but all bonds length are changed.

So I considered how to solve the problem. At first, I tried to use rdTransform.SetBondlength method, however the function could not apply the bond in the ring.

So I defined original function to do it. Here is my example code.

At first, I made simple scaffold and changed bond length of the scaffold.

= 0.1
rdMolTransforms.TransformConformer(scaf.GetConformer(), tm)
scaf

Now I could get pretty scaffold. ;)

Write the molecule to SD File.

writer = Chem.SDWriter('scaf.sdf')
writer.write(scaf)
writer.close()

Then I made new molecule to align.

mol = Chem.MolFromSmiles('Nc1ccc(C)cc1')

Next, load scaffold and align without bond length adjustment.

scaf = Chem.SDMolSupplier('scaf.sdf')[0]
cpmol = copy.deepcopy(mol)
AllChem.GenerateDepictionMatching2DStructure(cpmol, scaf)
cpmol

Now I could get strange image…

To fix the issue I defined following function.

fdef bondnormalize(mol, temp):
    AllChem.Compute2DCoords(mol)
    ref_bond = mol.GetBonds()[0]
    ref_length = rdMolTransforms.GetBondLength(mol.GetConformer(), ref_bond.GetBeginAtomIdx(),ref_bond.GetEndAtomIdx() )
    prob_bond = temp.GetBonds()[0]
    prob_length = rdMolTransforms.GetBondLength(temp.GetConformer(), prob_bond.GetBeginAtomIdx(),prob_bond.GetEndAtomIdx() )
    ratio = ref_length / prob_length
    tm = np.zeros((4,4), np.double)
    for i in range(3):
        tm[i,i] = ratio
    rdMolTransforms.TransformConformer(temp.GetConformer(), tm)
    AllChem.GenerateDepictionMatching2DStructure(mol, temp)
    return mol

Check the function.

m=bondnormalize(mol, scaf)

Wow it worked well.

It is rare case that reader would like to do like that. RDKit has many function for chemoinformatics. It really fun!.

Today’s code can read from following URL.

https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/set_same_bond_length.ipynb

Any comments and advices are highly appreciate.


Handling chemoinformatics data with pandas #RDKit #chemoinformatics

I often use Pandas for data analysis. RDKit provides useful method named PandasTools. The method can load sdf and return data as pandas dataframe. By using dataframe, It isn’t needed to do something with for loop.

I found an interesting information in rdkit issues. A package named pyjanitor.

The package wraps pandas and provides useful method not only basic data handling but also chemoinformatics.

The package can be installed with conda from conda-forge channel. I used the library. Following code is example on my jupyter notebook. At first, read libraries for the trial.

%matplotlib inline
import matplotlib.pyplot as plt
import os
from rdkit import Chem
from rdkit import RDConfig

import pandas as pd
import janitor
from janitor import chemistry

from rdkit.Chem import PandasTools
from rdkit.Chem.Draw import IPythonConsole

from sklearn.decomposition import PCA
plt.style.use('ggplot')

Then load sdf data which is provided from rdkit.

path = os.path.join(RDConfig.RDDocsDir,'Book/data/cdk2.sdf')
df = PandasTools.LoadSDF(path)

pyjanitor has chemoinformatics function, one is morgan_fingerprint. It can calculate data and add the result in current dataframe. To do it, it is very easy ;)

fp1=chemistry.morgan_fingerprint(df, mols_col='ROMol', radius=2, nbits=512, kind='bits')

Now I can get fingerprint information as pandas dataframe which shape is (num of compounds, num of bits).

Conduct PCA with fingerprint.

fp2=chemistry.morgan_fingerprint(df, mols_col='ROMol', radius=2, nbits=512, kind='counts')
pca = PCA(n_components=3)
pca_res = pca.fit_transform(fp2)

janitor.chemistry has smi2mol method. It converts SMILES to mol objects in pandas dataframe.

df['SMILES'] = df.ROMol.apply(Chem.MolToSmiles)
df.head(2)
df = chemistry.smiles2mol(df, smiles_col='SMILES', mols_col='newROMol', progressbar='notebook')

And janitor has add_column method. To use the method, I can add multiple columns with chain method.

# add_column function is not native pandas method.
# df['NumAtm'] = df.ROMol.apply(Chem.Mol.GetAtoms) 
df = df.add_column('NumAtm', [Chem.Mol.GetNumAtoms(mol) for mol in df.ROMol])

It is easy to make fingerprint dataframe with janitor. And pyjanitor has other useful methods for pandas data handling.

code
https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/pyjanitor_test.ipynb
pyjanitor
https://pyjanitor.readthedocs.io/index.html


New version of openforcefield supports RDKit! #RDKit #chemoinformatics #openforcefield

In this morning I found great news on my tiwtter TL that openforcefield supports RDKit! Older version of openforcefiled is required openeyeTK which is commercial license for industry. I had interested in the package but could not install because I’m not academia and our company does not OETK license now.

https://open-forcefield-toolkit.readthedocs.io/en/latest/releasehistory.html

I can’t wait to use it so I installed the package and check it immediately.
Following code is ran on google colab. If would like to run on local env, you can skip several steps.

# install conda
!wget https://repo.continuum.io/miniconda/Miniconda3-4.4.10-Linux-x86_64.sh
!chmod ./Miniconda3-4.4.10-Linux-x86_64.sh
!time bash ./Miniconda3-4.4.10-Linux-x86_64.sh -b -f -p /usr/local
# install rdkit via conda command
!time conda install -c conda-forge rdkit -y

import sys
import os
sys.path.append('/usr/local/lib/python3.6/site-packages/')
from rdkit import Chem
from rdkit import rdBase
print(rdBase.rdkitVersion)

Then install openforcefield. Openforcefield is required openmm. Cuda version of colab was 10.0 so I did not need change of openmm install version. So, just type install ‘conda install openforcefield’. ;)

! conda config --add channels omnia --add channels conda-forge
! conda install -y openforcefield

Now ready, let’s call openforcefield.

from openforcefield.topology import Molecule
from openforcefield.topology import Topology
from openforcefield.utils import RDKitToolkitWrapper
from openforcefield.typing.engines.smirnoff import ForceField
from simtk import openmm
from simtk import unit

Following code was borrowed from original repo.

def get_energy(system, positions):
    """
    Return the potential energy.

    Parameters
    ----------
    system : simtk.openmm.System
        The system to check
    positions : simtk.unit.Quantity of dimension (natoms,3) with units of length
        The positions to use
    Returns
    ---------
    energy
    """

    integrator = openmm.VerletIntegrator(1.0 * unit.femtoseconds)
    context = openmm.Context(system, integrator)
    context.setPositions(positions)
    state = context.getState(getEnergy=True)
    energy = state.getPotentialEnergy().in_units_of(unit.kilocalories_per_mole)
    return energy

Load molecule with openFF method and calculate energy.

mol = Molecule.from_smiles('CCO')
mol.generate_conformers()
positions = mol.conformers[0]
#load Force Field
ff = ForceField('smirnoff99Frosst.offxml')
topology = mol.to_topology()
orginal_system = ff.create_openmm_system(topology)
orig_energy = get_energy(orginal_system, positions)
print(orig_energy)
> 2.3399880921635527 kcal/mol

Next calculate energy from rdkit mol object.

mol = Chem.MolFromSmiles('c1ccccc1')
rdw = RDKitToolkitWrapper()
ofmol = rdw.from_rdkit(mol)
ofmol.generate_conformers()
positions = ofmol.conformers[0]
topology = ofmol.to_topology()
orginal_system = ff.create_openmm_system(topology)
orig_energy = get_energy(orginal_system, positions)
print(orig_energy)
>7.987402593200566 kcal/mol

This is very simple example of OpenFF usage with RDKit. The package seems powerful for CADD. I would like to learn the package more and molecular dynamics.

Whole code of the post can check from my repo.

And I really thank to developer of RDKit, Openforcefield and many open source chemoinformatics tools.

https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/OpenFF_test.ipynb