Open data source of Quantum chemistry! #qcportal #rdkit #cheminformatics #quantum_chemisry

In RDKit UGM 2019, I had interest about QCArchive. QCArchive is MolSSI quantum chemistry archive. It provides useful data and python packages.

By using one package named qcportal, we can access huge data source of quantum chemistry. It is very useful because QC calculation is useful but it requires computational cost. QC data is useful for drug design and machine learning (i.e. building machine learning based force field etc…..).

I used the package. At first I installed qcportal via conda in my env. It isn’t good choice because I couldn’t install new version of the package. Old version of qcportal causes error. So I installed via pip. It worked fine.

Following code is almost same as original document. But I tried it for my memorandum. At first import packages and make client object. I used datasource from MolSSI.

from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
import qcportal as ptl
client = ptl.FractalClient()

Then checked the list of torsion drive dataset. There are many dataset is available.

client.list_collections("TorsionDriveDataset")

>Fragment Stability Benchmark	None
>OpenFF Fragmenter Phenyl Benchmark	Phenyl substituent torsional barrier heights.
>OpenFF Full TorsionDrive Benchmark 1	None
>OpenFF Group1 Torsions	None
>OpenFF Primary TorsionDrive Benchmark 1	None
>OpenFF Substituted Phenyl Set 1	None
>Pfizer Discrepancy Torsion Dataset 1	None
>SMIRNOFF Coverage Torsion Set 1	None
>TorsionDrive Paper	None

ds = client.get_collection("TorsionDriveDataset", "OpenFF Fragmenter Phenyl Benchmark")
ds.df.head()

>c1c[cH:1][c:2](cc1)[C:3](=[O:4])O
>c1[cH:1][c:2](cnc1)[C:3](=[O:4])O
>[cH:1]1cncc[c:2]1[C:3](=[O:4])O
>[cH:1]1cc(nc[c:2]1[C:3](=[O:4])O)[O-]
>Cc1c[cH:1][c:2](cn1)[C:3](=[O:4])O

OK I succeeded to loading data. Let’s visualize some completed dataset. RDKit is very useful package for drawing molecules!!!!!

complete_data = ds.status(["b3lyp-d3"], collapse=False, status="COMPLETE")
Draw.MolsToGridImage([Chem.MolFromSmiles(complete_data['B3LYP-D3'].index[i]) for i in range(10)],
                    molsPerRow=5)

Finally visualize torsion energy!

ds.visualize([complete_data['B3LYP-D3'].index[i] for i in range(10)],"B3LYP-D3", units="kJ / mol")

Purple line (4th structure) has highest torsion energy at -90, 90 degree.
The molecule is 5-Hydroxynicotinic acid. Hydroxyl group is located para-positon of carboxylic group. So conjugation effect to make relative energy higher than other structures.

The package is useful for not only data source of QC but also visualization and analysis of molecules.

I uploaded today’s code on my gist.

Small molecule MD with openMM #MD #Openforcefield

I updated openforcefield from ver 0.5 to ver 0.6. ForceField of SMIRNOFF is also updated.

I tried to use new version of OpenFF.
At first, I calculated partial charge with semi empirical method ‘AM1-BCC’. Ambertools is used for the calculation, it is easy.

from openforcefield.topology import Molecule
from openforcefield.utils.toolkits import RDKitToolkitWrapper, AmberToolsToolkitWrapper
from openforcefield.topology import Topology
from openforcefield.typing.engines.smirnoff import ForceField
biar = Molecule.from_smiles('c1ccccc1-c1c(C)ccnc1')
#Gerates conformers, default number of generated conformers is 10.
biar.generate_conformers()
biar.compute_partial_charges_am1bcc()

Just finished, check the result. Nitrogen has the most negative charge and neighbor aromatic carbons has positive charges.

for i, atm in enumerate(biar.atoms):
    print(pc[i], atm)
-0.1175 e 
-0.1305 e 
-0.125 e 
-0.1305 e 
-0.1175 e 
-0.036 e 
-0.1543 e 
-0.0243 e 
-0.0648 e 
-0.2513 e 
0.3952 e 
-0.668 e 
0.4062 e 
0.136 e 
0.1335 e 
0.133 e 
0.1335 e 
0.136 e 
0.0527 e 
0.0527 e 
0.0527 e 
0.143 e 
0.0221 e 
0.0251 e 

It seems work fine. OK let’s try to MD calculation.

For convenience, I wrote simple script and config file for calculation.
Following code calculate MD with SMILES as sys.argv[1]

#small_mol_md.py
import yaml
import sys
import os
import time
import matplotlib.pyplot as plt
from openforcefield.topology import Molecule
from openforcefield.topology import Topology
from openforcefield.typing.engines.smirnoff import ForceField
from openforcefield.utils.toolkits import RDKitToolkitWrapper
from openforcefield.utils.toolkits import AmberToolsToolkitWrapper
from simtk import openmm
from simtk import unit
from rdkit import Chem

def run_md(molecule, confId=0):
    off_topology = molecule.to_topology()
    omm_topology = off_topology.to_openmm()
    system = forcefield.create_openmm_system(off_topology)

    time_step = config["time_step"] * unit.femtoseconds
    temperature = config["temperature"] * unit.kelvin
    friction = 1 / unit.picosecond
    integrator = openmm.LangevinIntegrator(temperature, friction, time_step)
    
    conf = molecule.conformers[confId]
    simulation = openmm.app.Simulation(omm_topology,
                                       system,
                                       integrator)
    simulation.context.setPositions(conf)
    if not os.path.isdir('./log'):
        os.mkdir('./log')
    pdb_reporter = openmm.app.PDBReporter('./log/trj.pdb', config["trj_freq"])
    state_data_reporter = openmm.app.StateDataReporter("./log/data.csv",
                                                       config["data_freq"],
                                                       step=True,
                                                       potentialEnergy=True,
                                                       temperature=True,
                                                       density=True)
    simulation.reporters.append(pdb_reporter)
    simulation.reporters.append(state_data_reporter)
    start = time.process_time()
    simulation.step(config["num_steps"])
    end = time.process_time()
    print(f"Elapsed time {end-start:.2f} sec")
    print("Done")

if __name__=="__main__":
    forcefield = ForceField("openff-1.0.0.offxml")
    config = yaml.load(open("mdconf.yml", "r"), yaml.Loader)
    molecule = Molecule.from_smiles(sys.argv[1])
    molecule.generate_conformers()
    run_md(molecule)

And calculation configuration is below.

#mdconfig.yml
time_step: 2
temperature: 300
friction: 1
trj_freq: 1
data_freq: 1
num_steps: 1000

Run calculation.
$ python small_mol_md.py ‘c1ccc(C)cc1-c2c(OC)nccc2’

After the calculation, I could get pdb and csv file.
Pdb file has 1000 states. And CSV file has calculated data.

blue shows energy and red shows temperature

It took ~10 sec for the molecule, it will take long time for large scale calculation.

MD calculation requires many parameters. I’m not familiar for the calculation so started to learn it. Now I installed GROMACS in my PC.

There are lots of things what I would like to learn….