Python package for Ensemble learning #Chemoinformatics #Scikit learn

Ensemble learning is a technique for machine learning. I wrote post about blending learning before. URL is below.
https://iwatobipen.wordpress.com/2018/11/11/ensemble-learning-with-scikit-learn-and-xgboost-machine-learning/
I implemented the code by myself at that time.

Ensemble learning sometime outperform than single model. So it is useful for try to use the method. Fortunately now we can use ensemble learning very easily by using a python package named ‘ML-Ens‘ Installation is very easy, only use pip command common way for pythonista I think ;)

After installing the package user can build and train ensemble learning model with few lines. I would like to introduce two example of them one is stacking method and the other is a blending method. OK let’s go to code.

At first, load dataset and make input features. I used morgan fingerprint as input data.

from rdkit import Chem
from rdkit.Chem import DataStructs
from rdkit.Chem import AllChem
from rdkit import RDPaths
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit.Chem import PandasTools
import numpy as np
import pandas as pd
from IPython.display import HTML
traindf = PandasTools.LoadSDF(os.path.join(RDPaths.RDDocsDir,'Book/data/solubility.train.sdf'))
testdf = PandasTools.LoadSDF(os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.test.sdf'))
# Chek data
HTML(traindf.head(2).to_html())

cls2lab = {'(A) low':0, '(B) medium':1, '(C) high':2}

def fp2np(fp):
    arr = np.zeros((0,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr
trainfp = [AllChem.GetMorganFingerprintAsBitVect(m, 2) for m in traindf.ROMol]
testfp =  [AllChem.GetMorganFingerprintAsBitVect(m, 2) for m in testdf.ROMol]
trainX = np.array([fp2np(fp) for fp in trainfp])
testX = np.array([fp2np(fp) for fp in testfp])
trainY = np.array([cls2lab[i] for i in traindf.SOL_classification.to_list()])
testY =  np.array([cls2lab[i] for i in testdf.SOL_classification.to_list()])

Then import several package for ensemble learning. SuperLearner is class for stacking and BlendEnsemble is class for blending.

Making ensemble model is easy. Just use add method to layer addition and finally call add_meta method for adding final prediction layer.

from mlens.ensemble import SuperLearner
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import r2_score, accuracy_score
from sklearn.svm import SVR, SVC

# For stacnking
ensemble = SuperLearner(scorer=accuracy_score, random_state=794, verbose=2)
ensemble.add([RandomForestClassifier(n_estimators=100, random_state=794), SVC(gamma='auto', C=1000)])
ensemble.add_meta(LogisticRegression(solver='lbfgs', multi_class='auto'))

ensemble.fit(trainX, trainY)
pred = ensemble.predict(testX)
accuracy_score(testY, pred)

# Blending
from mlens.ensemble import BlendEnsemble
ensemble2 = BlendEnsemble(scorer=accuracy_score, test_size=0.2, verbose=2)
ensemble2.add([RandomForestClassifier(n_estimators=794, random_state=794),
                 SVC(gamma='auto')])
ensemble2.add_meta(LogisticRegression(solver='lbfgs', multi_class='auto'))
ensemble2.fit(trainX, trainY)
pred_b = ensemble2.predict(testX)
accuracy_score(pred_b, testY)

Also more models can added with add method. I uploaded whole code on my gist. After calling fit, it is easy to access result data by using data method.

code example

Unfortunately the ensemble models described in the post don’t outperform single random forest model but mlens is nice tool for ensemble learning there is still room of improvement for model performance such as kind of models, hyper parameters etc.

Original document give more informations. Please go to link if reader has interest.
http://ml-ensemble.com/docs/

Quantum Chemistry data of drug bank #QCportal #Quantum_Chemistry

I’m still learning QCArchive. I posted qcportal with reaction dataset. And today I tried to retrieve of drug bank from qcportal. QCportal provides not only calculated numeric data but also 3D mol view by using py3Dmol.

OK let’s go to code. get_molecule method provides many data from qcportal web server.

import qcportal as ptl
client = ptl.FractalClient()
ds = client.get_collection("Dataset", "COMP6 DrugBank")
mols = ds.get_molecules()
mols.shape
> (13379, 1)

What kinds of data in the dataset? It is easy to do it, just call some methods.

ds.list_values().reset_index()['method'].unique()
> array(['ωB97x', 'b3lyp', 'b3lyp-d3m(bj)', 'hf', 'pbe', 'pbe-d3m(bj)',
       'svwn', 'wb97m', 'wb97m-d3(bj)'], dtype=object)
ds.list_values().reset_index()['basis'].unique()
> array(['6-31g*', 'def2-tzvp'], dtype=object)

ds.list_values()

This dataset has not only data from psi4 but also gaussian!.

I got data from method=’wB97x’

val = ds.get_values(method='ωB97x')
val.columns
> Index(['CM5 Charges', 'Hirshfeld Charges', 'Energy', 'Gradient',
       'Hirshfeld Dipole', 'Spin Density'],
      dtype='object')

I got energy from the data and visualize molecules.

energy = val['Energy']
mols['molecule'][0].show()
energy[0]
> -636107.9519541461

Py3Dmol works very well. I could get QC energy of molecule in drug bank and could render molecule as 3D object.

It is very cool!

My whole code is uploaded following URL.

Have a nice week end! ;)

https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/drug_bank.ipynb

Open data source of Quantum chemistry! #qcportal #rdkit #cheminformatics #quantum_chemisry

In RDKit UGM 2019, I had interest about QCArchive. QCArchive is MolSSI quantum chemistry archive. It provides useful data and python packages.

By using one package named qcportal, we can access huge data source of quantum chemistry. It is very useful because QC calculation is useful but it requires computational cost. QC data is useful for drug design and machine learning (i.e. building machine learning based force field etc…..).

I used the package. At first I installed qcportal via conda in my env. It isn’t good choice because I couldn’t install new version of the package. Old version of qcportal causes error. So I installed via pip. It worked fine.

Following code is almost same as original document. But I tried it for my memorandum. At first import packages and make client object. I used datasource from MolSSI.

from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
import qcportal as ptl
client = ptl.FractalClient()

Then checked the list of torsion drive dataset. There are many dataset is available.

client.list_collections("TorsionDriveDataset")

>Fragment Stability Benchmark	None
>OpenFF Fragmenter Phenyl Benchmark	Phenyl substituent torsional barrier heights.
>OpenFF Full TorsionDrive Benchmark 1	None
>OpenFF Group1 Torsions	None
>OpenFF Primary TorsionDrive Benchmark 1	None
>OpenFF Substituted Phenyl Set 1	None
>Pfizer Discrepancy Torsion Dataset 1	None
>SMIRNOFF Coverage Torsion Set 1	None
>TorsionDrive Paper	None

ds = client.get_collection("TorsionDriveDataset", "OpenFF Fragmenter Phenyl Benchmark")
ds.df.head()

>c1c[cH:1][c:2](cc1)[C:3](=[O:4])O
>c1[cH:1][c:2](cnc1)[C:3](=[O:4])O
>[cH:1]1cncc[c:2]1[C:3](=[O:4])O
>[cH:1]1cc(nc[c:2]1[C:3](=[O:4])O)[O-]
>Cc1c[cH:1][c:2](cn1)[C:3](=[O:4])O

OK I succeeded to loading data. Let’s visualize some completed dataset. RDKit is very useful package for drawing molecules!!!!!

complete_data = ds.status(["b3lyp-d3"], collapse=False, status="COMPLETE")
Draw.MolsToGridImage([Chem.MolFromSmiles(complete_data['B3LYP-D3'].index[i]) for i in range(10)],
                    molsPerRow=5)

Finally visualize torsion energy!

ds.visualize([complete_data['B3LYP-D3'].index[i] for i in range(10)],"B3LYP-D3", units="kJ / mol")

Purple line (4th structure) has highest torsion energy at -90, 90 degree.
The molecule is 5-Hydroxynicotinic acid. Hydroxyl group is located para-positon of carboxylic group. So conjugation effect to make relative energy higher than other structures.

The package is useful for not only data source of QC but also visualization and analysis of molecules.

I uploaded today’s code on my gist.

Small molecule MD with openMM #MD #Openforcefield

I updated openforcefield from ver 0.5 to ver 0.6. ForceField of SMIRNOFF is also updated.

I tried to use new version of OpenFF.
At first, I calculated partial charge with semi empirical method ‘AM1-BCC’. Ambertools is used for the calculation, it is easy.

from openforcefield.topology import Molecule
from openforcefield.utils.toolkits import RDKitToolkitWrapper, AmberToolsToolkitWrapper
from openforcefield.topology import Topology
from openforcefield.typing.engines.smirnoff import ForceField
biar = Molecule.from_smiles('c1ccccc1-c1c(C)ccnc1')
#Gerates conformers, default number of generated conformers is 10.
biar.generate_conformers()
biar.compute_partial_charges_am1bcc()

Just finished, check the result. Nitrogen has the most negative charge and neighbor aromatic carbons has positive charges.

for i, atm in enumerate(biar.atoms):
    print(pc[i], atm)
-0.1175 e 
-0.1305 e 
-0.125 e 
-0.1305 e 
-0.1175 e 
-0.036 e 
-0.1543 e 
-0.0243 e 
-0.0648 e 
-0.2513 e 
0.3952 e 
-0.668 e 
0.4062 e 
0.136 e 
0.1335 e 
0.133 e 
0.1335 e 
0.136 e 
0.0527 e 
0.0527 e 
0.0527 e 
0.143 e 
0.0221 e 
0.0251 e 

It seems work fine. OK let’s try to MD calculation.

For convenience, I wrote simple script and config file for calculation.
Following code calculate MD with SMILES as sys.argv[1]

#small_mol_md.py
import yaml
import sys
import os
import time
import matplotlib.pyplot as plt
from openforcefield.topology import Molecule
from openforcefield.topology import Topology
from openforcefield.typing.engines.smirnoff import ForceField
from openforcefield.utils.toolkits import RDKitToolkitWrapper
from openforcefield.utils.toolkits import AmberToolsToolkitWrapper
from simtk import openmm
from simtk import unit
from rdkit import Chem

def run_md(molecule, confId=0):
    off_topology = molecule.to_topology()
    omm_topology = off_topology.to_openmm()
    system = forcefield.create_openmm_system(off_topology)

    time_step = config["time_step"] * unit.femtoseconds
    temperature = config["temperature"] * unit.kelvin
    friction = 1 / unit.picosecond
    integrator = openmm.LangevinIntegrator(temperature, friction, time_step)
    
    conf = molecule.conformers[confId]
    simulation = openmm.app.Simulation(omm_topology,
                                       system,
                                       integrator)
    simulation.context.setPositions(conf)
    if not os.path.isdir('./log'):
        os.mkdir('./log')
    pdb_reporter = openmm.app.PDBReporter('./log/trj.pdb', config["trj_freq"])
    state_data_reporter = openmm.app.StateDataReporter("./log/data.csv",
                                                       config["data_freq"],
                                                       step=True,
                                                       potentialEnergy=True,
                                                       temperature=True,
                                                       density=True)
    simulation.reporters.append(pdb_reporter)
    simulation.reporters.append(state_data_reporter)
    start = time.process_time()
    simulation.step(config["num_steps"])
    end = time.process_time()
    print(f"Elapsed time {end-start:.2f} sec")
    print("Done")

if __name__=="__main__":
    forcefield = ForceField("openff-1.0.0.offxml")
    config = yaml.load(open("mdconf.yml", "r"), yaml.Loader)
    molecule = Molecule.from_smiles(sys.argv[1])
    molecule.generate_conformers()
    run_md(molecule)

And calculation configuration is below.

#mdconfig.yml
time_step: 2
temperature: 300
friction: 1
trj_freq: 1
data_freq: 1
num_steps: 1000

Run calculation.
$ python small_mol_md.py ‘c1ccc(C)cc1-c2c(OC)nccc2’

After the calculation, I could get pdb and csv file.
Pdb file has 1000 states. And CSV file has calculated data.

blue shows energy and red shows temperature

It took ~10 sec for the molecule, it will take long time for large scale calculation.

MD calculation requires many parameters. I’m not familiar for the calculation so started to learn it. Now I installed GROMACS in my PC.

There are lots of things what I would like to learn….

Calculate free solvent accessible surface area #RDKit #Chemoinformatics

Recent version of rdkit has method to calculate FreeSASA.
I never used the function so I used it. So I tried to use it.

I calculated freeSASA with very simple molecules Phenol and hydroxy pyridine.

from rdkit import Chem
from rdkit.Chem import rdFreeSASA
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import AllChem
mol1 = Chem.MolFromSmiles('Oc1ccccc1')
mol2 = Chem.MolFromSmiles('Oc1ccncc1')
hmol1 = Chem.AddHs(mol1)
hmol2 = Chem.AddHs(mol2)
AllChem.EmbedMolecule(hmol1)
AllChem.EmbedMolecule(hmol2)

To calculate FreeSASA, prepare raddii is needed.

radii1 = rdFreeSASA.classifyAtoms(hmol1)
radii2 = rdFreeSASA.classifyAtoms(hmol2)

Now ready, let’s calculate FreeSASA.

rdFreeSASA.CalcSASA(hmol1, radii1)
> 137.43293375181904
rdFreeSASA.CalcSASA(hmol2, radii2)
> 128.34398350646256

At first I expected that FreeSASA of pyridine is larger than phenol but result is opposite. So I would like to know details of the reason.

After calculating FreeSASA, each atom has SASA property.

atoms1 = hmol1.GetAtoms()
atoms2 = hmol2.GetAtoms()
for i in range(len(atoms1)):
    print(atoms1[i].GetSymbol(), atoms1[i].GetProp('SASAClassName'), atoms1[i].GetProp("SASA"))
    
sum(float(a.GetProp("SASA")) for a in atoms1)


O Unclassified 10.276248749137361
C Unclassified 5.6117335908330768
C Unclassified 4.8812286399274658
C Unclassified 4.9178986731131236
C Unclassified 4.923259125887407
C Unclassified 4.8241215955112828
C Unclassified 4.8595021375180254
H Unclassified 16.645522512291386
H Unclassified 16.254710140190241
H Unclassified 15.866400115020539
H Unclassified 16.022421230036539
H Unclassified 16.089713316178983
H Unclassified 16.260173926173582
137.43293375181904

for i in range(len(atoms2)):
    print(atoms2[i].GetSymbol(), atoms2[i].GetProp('SASAClassName'), atoms2[i].GetProp("SASA"))
    
sum(float(a.GetProp("SASA")) for a in atoms2)

O Unclassified 10.443721296042458
C Unclassified 5.5711494477882848
C Unclassified 4.7609239637426501
C Unclassified 4.9640112698257193
N Unclassified 11.64593971756287
C Unclassified 4.6638358234073181
C Unclassified 5.0220733873889731
H Unclassified 16.474508728534751
H Unclassified 16.091035411384464
H Unclassified 16.409635462176684
H Unclassified 16.194368688350266
H Unclassified 16.102780310258137
128.34398350646256

The reason is that phenol has one more hydrogen atom and it occupy more surface area than aromatic nitrogen.

I think the parameter can use a property for GCN.
Today’s sample code is here.