Use ORBKIT for rendering MO #orbkit #rdkit #psikit #quantum_chemistry

As you know, there are many packages for quantum chemistry, not only commercial software but also free tools, and each one has its own output format. So the user needs to learn how to handle each format, which I think is a very time-consuming step.

ORBKIT is a modular Python toolbox for cross-platform post-processing of quantum chemical wavefunction data.

The reference URL is below.

It seems nice, doesn't it? Fortunately, the source code is available on GitHub.

So I tried to use orbkit with psikit. At first, I installed orbkit, referring to the original site.

I installed cython, numpy, scipy, and h5py with conda, and mayavi with pip (mayavi couldn't be installed with conda due to package conflicts).

Then install orbkit.

$ cd $HOME
$ git clone https://github.com/orbkit/orbkit.git
$ export ORBKITPATH=$HOME/orbkit
$ cd $ORBKITPATH
$ python setup.py build_ext --inplace clean

I also added the environment variables to my .bashrc.

Now we are ready, so let's use orbkit. I used psikit to make an fchk file for orbkit. orbkit supports many kinds of input formats; if the reader is interested in the package, please check the original documentation.

Following code shows how to use psikit and visualize MO with orbkit.

After running the code, the mayavi viewer launched and I could get the MO view shown in the image below.

Of course, the current psikit has some functions for communicating with PyMOL to render MOs, but orbkit is also a useful package for quantum chemistry.

I uploaded today’s code on my github repo.

New molecular fingerprint for chemoinformatics #map4 #RDKit #memo #chemoinformatics

A molecular fingerprint (FP) is very important for chemoinformatics because it is used for building many predictive models, not only for ADMET but also for biological activities.

As you know, ECFP (Morgan fingerprint) is one of the gold-standard FPs of chemoinformatics, because it shows stable performance across many problems. Since ECFP was reported, many new fingerprint algorithms have been proposed, but ECFP is still used in many cases.

This means that further improvement of fingerprints is required, but it is a difficult task.

Recently, a new fingerprint algorithm named MAP4 was proposed by Prof. Reymond's lab. The URL is below.

MAP4 stands for "MinHashed Atom-Pair fingerprint up to a diameter of four bonds".

The calculation process is below.

First, the circular substructures surrounding each non-hydrogen atom in the molecule at radii 1 to r are written as canonical, non-isomeric, rooted SMILES strings.

Second, the minimum topological distance TPj,k separating each atom pair (j, k) in the input molecule is calculated.

Third, all atom-pair shingles are written for each atom pair (j, k) and each value of r, placing the two SMILES strings in lexicographical order.

Fourth, the resulting set of atom-pair shingles is hashed to a set of integers using the unique mapping SHA-1, and the corresponding transposed vector is finally MinHashed to form the MAP4 vector.
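
The four steps above can be condensed into a small sketch in plain Python. This is only an illustration of the idea, not the authors' implementation: the atom environments and topological distances are hard-coded toy values, and the MinHash here uses simple affine permutations.

```python
import hashlib
import itertools
import random

def sha1_int(shingle):
    """Map a shingle string to an integer with SHA-1 (step 4)."""
    return int(hashlib.sha1(shingle.encode("utf-8")).hexdigest(), 16)

def atom_pair_shingles(atom_envs, topo_dist):
    """Step 3: one shingle per atom pair (j, k) and radius r.

    atom_envs[j][r] is the rooted SMILES of atom j's environment
    (toy strings here); topo_dist[(j, k)] is the topological distance.
    """
    shingles = set()
    for j, k in itertools.combinations(range(len(atom_envs)), 2):
        for r in range(len(atom_envs[j])):
            a, b = sorted([atom_envs[j][r], atom_envs[k][r]])  # lexicographic order
            shingles.add(f"{a}|{topo_dist[(j, k)]}|{b}")
    return shingles

def minhash(hashes, n_perm=64, prime=(1 << 61) - 1, seed=42):
    """Step 4: MinHash a set of integers with n_perm affine permutations."""
    rng = random.Random(seed)
    sig = []
    for _ in range(n_perm):
        a, b = rng.randrange(1, prime), rng.randrange(prime)
        sig.append(min((a * h + b) % prime for h in hashes))
    return sig

# toy 3-atom molecule: environments at radii 1 and 2, pairwise distances
envs = [["C", "CC"], ["C", "CO"], ["O", "OC"]]
dist = {(0, 1): 1, (0, 2): 2, (1, 2): 1}
sig = minhash({sha1_int(s) for s in atom_pair_shingles(envs, dist)})
print(len(sig))  # 64
```

The signature is deterministic for a fixed seed, which is what makes MinHash signatures comparable between molecules.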

It is a similar approach to ECFP, but it uses MinHashing techniques. It works very fast, and this fingerprint outperforms the authors' previously developed MHFP, ECFP, and other fingerprints implemented in RDKit.

In their article, Fig. 2 shows virtual screening performance against various targets, and the MAP4 FP outperformed the others on many targets.

Fig. 8 shows the Jaccard distance between sets of molecules. Compared to other fingerprints such as the atom-pair FP, ECFP, etc., MAP4 shows better performance not only for small molecules but also for large molecules such as peptides.
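
The Jaccard distance behind that comparison is worth stating explicitly; a minimal version over sets of substructure shingles (toy strings here, not real fingerprints):

```python
def jaccard_distance(a, b):
    """Jaccard distance: 1 - |A intersection B| / |A union B|."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

small = {"C", "CC", "CO"}
large = small | {"CN", "C(=O)N", "CCN"}  # superset: shares everything in `small`
print(jaccard_distance(small, large))  # 0.5
print(jaccard_distance(small, small))  # 0.0
```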

I would like to use this FP in my notebooks, so I tried it.

The authors disclosed the source code, so I could get it from github.

git clone https://github.com/reymond-group/map4.git
cd map4

I think the original repo had a bug in the folded fingerprint calculation, so I made a PR to the original repo.

The following code uses my modified version of the original code.

I compared the FP against the Morgan FP from rdkit.

Today’s code was uploaded my gist and github.

MAP4Calculator provides calculate and calculate_many methods. The first one calculates the MAP4 FP of a molecule, and the second one calculates MAP4 FPs for a list of molecules.

The is_folded option is False by default, so to get a folded (fixed-length) fingerprint, the user needs to change the option from False to True.
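
Folding an unfolded (arbitrary-integer) fingerprint down to a fixed length is conceptually just a modulo over the hash values; here is a hypothetical sketch of the idea, not the map4 code itself:

```python
def fold(hashes, n_bits=1024):
    """Fold integer hashes into a fixed-length bit vector:
    each hash sets bit (h % n_bits), so distinct hashes may collide."""
    bits = [0] * n_bits
    for h in hashes:
        bits[h % n_bits] = 1
    return bits

fp = fold([15, 1024 + 15, 777], n_bits=1024)
print(len(fp), sum(fp))  # 1024 2  (15 and 1039 collide on bit 15)
```

The collisions introduced by folding are the usual trade-off for getting a fixed-length vector that standard machine learning code can consume.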

For the test data shown above, the Morgan FP seems more sensitive to differences in compound structure, because its similarity values are lower than those of the MAP4 FP. The folded and unfolded MAP4 FPs showed almost identical behavior.

Today I showed how to calculate the MAP4 FP; next I would like to check the FP's performance on a real drug discovery project with some machine learning algorithms. :-)

I also uploaded the code to my github repo.

Benchmarking platform for generative models. #RDKit #Chemoinformatics #DeepLearning #guacamol

Yesterday I posted about the benchmarking platform named 'moses' and found that it worked on test data. Then I got a comment from @Mufei Li, a developer of DGL, suggesting that I try guacamol. I had checked guacamol before but hadn't tried it, so I installed guacamol and used it.

From the original repo: "GuacaMol is an open source Python package for benchmarking of models for de novo molecular design."

This package is developed by BenevolentAI. It can assess de novo molecular generators with the following metrics.

  1. Validity
  2. Uniqueness
  3. Novelty
  4. Fréchet ChemNet Distance
  5. KL divergence
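
The first three metrics in the list are essentially set arithmetic over canonical structures. A minimal sketch using plain strings as stand-ins for canonical SMILES (the real package parses and canonicalizes with RDKit first):

```python
def basic_metrics(generated, training_set):
    """Validity, uniqueness and novelty over canonical SMILES strings.
    None entries stand in for structures that failed to parse."""
    valid = [s for s in generated if s is not None]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "validity": len(valid) / len(generated),
        "uniqueness": len(unique) / len(valid),
        "novelty": len(novel) / len(unique),
    }

gen = ["CCO", "CCO", "c1ccccc1", None]  # one duplicate, one invalid
train = ["CCO"]
print(basic_metrics(gen, train))  # validity 0.75, uniqueness 2/3, novelty 0.5
```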

Metric 5 is the point of difference from moses. The KL divergence metric is based on molecular descriptors, not on fingerprints, so sets with different structures but similar molecular properties will show a high score.
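
The idea of a descriptor-based KL metric can be sketched as follows: discretize one descriptor (a toy "molecular weight" list here) into histograms for the reference and generated sets, then compute D_KL(P || Q). This is only the core idea; guacamol aggregates several descriptors and maps the divergence to a score.

```python
import math

def histogram(values, bins):
    """Normalized histogram over the half-open intervals [bins[i], bins[i+1])."""
    counts = [0] * (len(bins) - 1)
    for v in values:
        for i in range(len(counts)):
            if bins[i] <= v < bins[i + 1]:
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q, eps=1e-10):
    """D_KL(P || Q) = sum_i p_i * log(p_i / q_i), smoothed to avoid log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

bins = [0, 200, 400, 600]
ref = histogram([150, 250, 350, 450], bins)       # reference descriptor values
gen = histogram([160, 240, 330, 440, 500], bins)  # generated descriptor values
print(kl_divergence(ref, gen))  # small positive number; 0.0 for identical sets
```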

To test the package, I installed guacamol. Fortunately, guacamol can be installed via pip ;).

I tested two metrics: one was KLDivBenchmark and the other was UniquenessBenchmark. As you can see, most of the following code is the same as the test code in the original repository.

I uploaded my code to my gist.

Moses and Guacamol are both useful packages for benchmarking.

It is important that we benchmark against the same baseline. However, in real drug discovery projects, the molecular properties required depend on the situation. So generator performance is not a big problem, I think, because current models can already generate reasonable structures.

We need to go beyond that… ;)

Benchmarking platform for generative models. #RDKit #Chemoinformatics #DeepLearning #moses

There are lots of publications about molecular generators. Each publication implements a novel algorithm, so we need a tool for comparing these models to see which is better for us.

I often use PCA and t-SNE for chemical space visualization and calculate some scores such as QED, SA/SC score, and molecular properties. However, I need unified metrics, and I think Molecular Sets (MOSES) is a nice tool for that.

MOSES provides useful metrics shown below.

  1. Fragment similarity (Frag), which is defined as the cosine distance between vectors of fragment frequencies.
  2. Scaffold similarity (Scaff), which is defined similarly to Frag but uses frequencies of scaffolds instead of fragments.
  3. Nearest neighbor similarity (SNN), which is the average Tanimoto similarity between test molecules and generated molecules.
  4. Internal diversity (IntDiv_p), which assesses the chemical diversity within the generated set of molecules G, with p = 1 or 2 and T the Tanimoto similarity:
    IntDiv_p(G) = 1 - \sqrt[p]{\frac{1}{|G|^2}\sum_{m_1, m_2 \in G} T(m_1, m_2)^p}
  5. Fréchet ChemNet Distance (FCD), which uses the penultimate layer of ChemNet and measures the distance between reference molecules and generated molecules.
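
The internal diversity metric from the list can be written in a few lines; here Tanimoto similarity is computed over toy on-bit sets rather than real fingerprints:

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprint on-bit sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def internal_diversity(fps, p=1):
    """IntDiv_p(G) = 1 - ((1/|G|^2) * sum over all pairs of T(m1, m2)^p)^(1/p)."""
    n = len(fps)
    total = sum(tanimoto(a, b) ** p for a in fps for b in fps)
    return 1.0 - (total / n ** 2) ** (1.0 / p)

fps = [frozenset({1, 2, 3}), frozenset({3, 4, 5}), frozenset({1, 2, 3})]
print(internal_diversity(fps, p=1))
print(internal_diversity([frozenset({1, 2})] * 3))  # 0.0: identical molecules, no diversity
```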

Fortunately, the source code of MOSES is freely available. I installed moses on my PC and tested it.

The MOSES repository provides an install script, so it is easy to install moses and the required packages. I modified the script before installing because I use torch 1.2.0 while the original requires version 1.1.0, so I commented out line 16 of the script.

After installing the package, I used moses from a jupyter notebook. All molecules are borrowed from the test script of the repository.

The following code is an example. It is easy to get the metrics, just by calling metrics.get_all_metrics(testmolecules, generatedmolecules). As shown below, the method calculates all metrics implemented in MOSES along with some additional properties such as the ratio of valid molecules, QED, MolWt, etc.

import pandas as pd
from rdkit import Chem

from moses import metrics

test = ['Oc1ccccc1-c1cccc2cnccc12','COc1cccc(NC(=O)Cc2coc3ccc(OC)cc23)c1']
test_sf = ['COCc1nnc(NC(=O)COc2ccc(C(C)(C)C)cc2)s1',
gen = ['CNC', 'Oc1ccccc1-c1cccc2cnccc12',

metrics.get_all_metrics(test, gen, k=3)

>> out

{'valid': 0.8,
 'unique@3': 1.0,
 'FCD/Test': 52.58373533265676,
 'SNN/Test': 0.3152585653588176,
 'Frag/Test': 0.30000000000000004,
 'Scaf/Test': 0.5,
 'IntDiv': 0.7189187332987785,
 'IntDiv2': 0.49790709357032226,
 'Filters': 0.75,
 'logP': 4.9581881764518005,
 'SA': 0.5086898026154574,
 'QED': 0.045033731661603064,
 'NP': 0.2902816615644048,
 'weight': 14761.927533455337}

I think moses is a very useful tool for checking the performance of molecular generators. Thanks to the authors for developing such a nice tool for in silico drug discovery!

Today's code is uploaded to my gist.

Enjoy chemoinformatics! ;)

Example usage of psi4-openmm-interface #Psi4 #OpenMM #RDKit

Molecular dynamics and quantum chemistry are important tools for CADD. I'm interested in these topics, and OpenMM and Psi4 are nice tools for handling MD and QM.

Today I tried psi4-openmm-interface, which allows passing molecular systems between the two programs.

I reviewed the test script and found that the package passes a molecule optimized with openmm to psi4 in xyz format. The package can also run MD calculations very simply.

I used it; the following code is almost the same as in the original repo.

At first, I installed the package; you can find how to install it at the following URL.

Install psi4 and openmm and set PYTHONPATH. It was simple. But I needed to modify some code to import the package correctly.

  1. It uses the PeriodicTable class of qcelemental, but the module can't import the _temp_symbol attribute, so I added a _temp_symbol definition to the PeriodicTable class in qcelemental.
# qcelemental/ 
class PeriodicTable:
    """Nuclear and mass data about chemical elements from NIST.


    A : list of int
        Mass number, number of protons and neutrons, starting with 0 for dummies.
    Z : list of int
        Atomic number, number of protons, starting with 0 for dummies.
    E : list of str
        Element symbol from periodic table, starting with "X" for dummies. "Fe" capitalization.
    EA : list of str
        Nuclide symbol in E + A form, e.g., "Li6".
        List `EA` is a superset of `E`; that is, both "Li6" and "Li" present.
        For hydrogen, "D" and "T" also included.
    mass : list of :py:class:`decimal.Decimal`
        Atomic mass [u].
        For nuclides (e.g., "Li6"), the reported mass.
        For stable elements (e.g., "Li"), the mass of the most abundant isotope ("Li7").
        For unstable elements (e.g., "Pu"), the mass of the longest-lived isotope ("Pu244").
    name : list of str
        Element name from periodic table, starting with "Dummy". "Iron" capitalization.
    """

    def __init__(self):

        from . import data

        # Of length number of elements
        self.Z = data.nist_2011_atomic_weights["Z"]
        self.E = data.nist_2011_atomic_weights["E"]
        self.name = data.nist_2011_atomic_weights["name"]

        self._el2z = dict(zip(self.E, self.Z))
        self._z2el = collections.OrderedDict(zip(self.Z, self.E))
        self._element2el = dict(zip(self.name, self.E))
        self._el2element = dict(zip(self.E, self.name))

        # Of length number of isotopes
        self._EE = data.nist_2011_atomic_weights["_EE"]
        self.EA = data.nist_2011_atomic_weights["EA"]
        self.A = data.nist_2011_atomic_weights["A"]
        self.mass = data.nist_2011_atomic_weights["mass"]

        self._eliso2mass = dict(zip(self.EA, self.mass))
        self._eliso2el = dict(zip(self.EA, self._EE))
        self._eliso2a = dict(zip(self.EA, self.A))
        self._el2a2mass = collections.defaultdict(dict)
        for EE, m, A in zip(self._EE, self.mass, self.A):
            self._el2a2mass[EE][A] = float(m)
        # following code was added
        self._temp_symbol = [
    "X", "H", "HE", "LI", "BE", "B", "C", "N", "O", "F", "NE", "NA", "MG", "AL", "SI", "P", "S", "CL", "AR", "K", "CA",
    "SC", "TI", "V", "CR", "MN", "FE", "CO", "NI", "CU", "ZN", "GA", "GE", "AS", "SE", "BR", "KR", "RB", "SR", "Y",
    "ZR", "NB", "MO", "TC", "RU", "RH", "PD", "AG", "CD", "IN", "SN", "SB", "TE", "I", "XE", "CS", "BA", "LA", "CE",
    "PR", "ND", "PM", "SM", "EU", "GD", "TB", "DY", "HO", "ER", "TM", "YB", "LU", "HF", "TA", "W", "RE", "OS", "IR",
    "PT", "AU", "HG", "TL", "PB", "BI", "PO", "AT", "RN", "FR", "RA", "AC", "TH", "PA", "U", "NP", "PU", "AM", "CM",
    "BK", "CF", "ES", "FM", "MD", "NO", "LR", "RF", "DB", "SG", "BH", "HS", "MT", "DS", "RG", "UUB", "UUT", "UUQ",
    "UUP", "UUH", "UUS", "UUO",
    ]

2. Line 81 of psiomm/ should be changed as below.

            # from

3. gaff2.xml, which contains the forcefield parameters, should be placed in the openmm install path openmm/app/data/, because in my environment the gaff2.xml file couldn't be found in that folder.

4. Got the needed file from qcdb/qcdb and placed it in the psi4/driver/qcdb folder.

Now ready. Let’s write code.

First, import some packages. I used psikit for molecular geometry generation.

import numpy as np

from psikit import Psikit
from psiomm import molecule
from psiomm import psi_omm as po

from simtk.openmm.app import Simulation
from simtk.openmm.app import ForceField
from simtk.openmm.app import Modeller
from simtk.openmm.app import PDBReporter

from simtk.openmm import Platform
from simtk.openmm import LangevinIntegrator

from simtk.openmm import Vec3

from simtk.unit import picosecond
from simtk.unit import kelvin
from simtk.unit import kilocalorie_per_mole
from simtk.unit import angstrom

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole

Work flow

  • make molecule from xyz string => solute

  • generate bonds

  • generate atom types

  • generate charges

  • make topology

  • define forcefield

  • generate template with FF, topology and solute

  • make modeller

  • perform simulation to get trajectory

pk = Psikit()

Psikit generates xyz strings with charge and multiplicity information, but psiomm can't read those lines, so I removed them and passed the string to psiomm. Then I generated some molecular properties.

# remove multiplicity, formal charge
xyzstring = '\n'.join(pk.mol2xyz().split('\n')[2:])
solute = molecule.Molecule.from_xyz_string(xyzstring)
# generate bonds, atom types and charges (the workflow steps above)
solute.generate_bonds()
solute.generate_atom_types()
solute.generate_charges()
# check generated atoms and charges
for i, at in enumerate(solute.atom_types):
    print(i, at, solute.charges[i])

0 c3 -0.16240415191041357
1 c3 0.03679384783580364
2 os -0.263056064493707
17 h4 0.07500144690962385

The following workflow is based on openmm. GAFF2 is a forcefield implemented in AMBER, and TIP3P is a widely used water model.

topology = po.make_topology(solute)
forcefield = ForceField('gaff2.xml', 'tip3p.xml')
po.generateTemplate(forcefield, topology, solute)
coords = []
for i in range(len(solute.xyz)):
    coords.append(Vec3(solute.xyz[i][0], solute.xyz[i][1], solute.xyz[i][2]))
positions = coords * angstrom

Make a modeller with the topology and positions. I added 50 water molecules to the model and defined the simulation system.

modeller = Modeller(topology, positions)
modeller.addSolvent(forcefield, numAdded=50)
omm_sys = forcefield.createSystem(modeller.topology)
integrator = LangevinIntegrator(300*kelvin, 1/picosecond, 0.002*picosecond)

simulation = Simulation(modeller.topology, omm_sys, integrator)
simulation.context.setPositions(modeller.positions)

I used the PDBReporter module to store the simulation. The number of simulation steps was set to 400 and the reporter saves data every 10 steps, so I got an output pdb with 40 (400 / 10) states.

simulation.reporters.append(PDBReporter('output.pdb', 10))
simulation.step(400)

MD movie

It worked fine. The code above didn't use psi4-based calculation; to do that I need to modify the input, because the final system contains not only the target molecule but also the 50 water molecules. I'll check the package and psi4 more.

Today's code can be found in my gist/github repo.

Example code of DGL for chemoinformatics task #DGL #chemoinformatics #RDKit #memo

There are many publications about graph-based approaches in the chemoinformatics area. I can't cover all of them, but I still have interest in this area. I think pytorch_geometric (PyG) and the deep graph library (DGL) are very attractive and useful packages for chemoinformaticians.

I have written some posts about DGL and PyG. Recent DGL is more chemoinformatics-friendly, so I used DGL for GCN model building today.

At first I tried to use DGL from Skorch but it failed, so I used DGL on its own for model building. As usual, I used the solubility data for the test, RDKit for compound structure handling, and GCNClassifier for the model.
To use DGL for chemoinformatics, the user should install mdtraj first, because mol_to_graph tries to import that package; if the code can't find mdtraj, the user doesn't get any error message, but the mol_to_graph related modules can't be used.

import os
from rdkit import Chem
from rdkit import RDPaths
import numpy as np

import torch
import dgl
if torch.cuda.is_available():
    device = torch.device('cuda')
    print('use GPU')
else:
    device = torch.device('cpu')
    print('use CPU')

from dgl.model_zoo.chem import GCNClassifier
from dgl.data.chem import mol_to_graph
from dgl.data.chem import mol_to_complete_graph
from dgl.data.chem import CanonicalAtomFeaturizer
from dgl.data.chem import CanonicalBondFeaturizer
from torch import nn

import torch.nn.functional as F
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from torch.nn import CrossEntropyLoss

trainsdf = os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.train.sdf')
testsdf =  os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.test.sdf')

trainmols = [m for m in Chem.SDMolSupplier(trainsdf)]
testmols = [m for m in Chem.SDMolSupplier(testsdf)]

prop_dict = {
    "(A) low": 0,
    "(B) medium": 1,
    "(C) high": 2
}

After the data is loaded, the rdkit mol objects are converted to graph objects. I used the canonical atom featurizer for the task; the features used in the module are the same as deepchem's.

It is easy to make a graph object with DGL: just call mol_to_complete_graph. Of course you can use smiles_to_complete_graph for SMILES strings. ;)

atom_featurizer = CanonicalAtomFeaturizer()
n_feats = atom_featurizer.feat_size('h')
> 74

ncls = 3
train_g = [mol_to_complete_graph(m, node_featurizer=atom_featurizer) for m in trainmols]
train_y = np.array([prop_dict[m.GetProp('SOL_classification')] for m in trainmols])
train_y = np.array(train_y, dtype=np.int64)

test_g = [mol_to_complete_graph(m, node_featurizer=atom_featurizer) for m in testmols]
test_y = np.array([prop_dict[m.GetProp('SOL_classification')] for m in testmols])
test_y = np.array(test_y, dtype=np.int64)

OK, let's define the GCN model. GCNClassifier takes a gcn_hidden_feats argument as a list: to add n GCN layers, the user should pass a list with n hidden-layer sizes. I added 2 GCN layers with 60 and 20 hidden units to the following model.

# define the GCN net with 2 GCN layers
gcn_net = GCNClassifier(in_feats=n_feats,
                        gcn_hidden_feats=[60, 20],
                        n_tasks=ncls)
gcn_net = gcn_net.to(device)

Then define the collate function for the dataloader object.

def collate(sample):
    graphs, labels = map(list,zip(*sample))
    batched_graph = dgl.batch(graphs)
    return batched_graph, torch.tensor(labels)
train_data = list(zip(train_g, train_y))
train_loader = DataLoader(train_data, batch_size=128, shuffle=True, collate_fn=collate, drop_last=True)

Next, train the model. This example is a multi-class classification task, so I used CrossEntropyLoss as the loss function.

loss_fn = CrossEntropyLoss()
optimizer = torch.optim.Adam(gcn_net.parameters(), lr=0.01)
epoch_losses = []
epoch_accuracies = []
for epoch in range(1, 201):
    epoch_loss = 0
    epoch_acc = 0
    for i, (bg, labels) in enumerate(train_loader):
        atom_feats = bg.ndata.pop('h').to(device)
        labels = labels.to(device)
        pred = gcn_net(bg, atom_feats)
        loss = loss_fn(pred, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.detach().item()
        pred_cls = pred.argmax(-1).detach().to('cpu').numpy()
        true_label = labels.to('cpu').numpy()
        epoch_acc += sum(true_label == pred_cls) / true_label.shape[0]
    epoch_acc /= (i + 1)
    epoch_loss /= (i + 1)
    epoch_losses.append(epoch_loss)
    epoch_accuracies.append(epoch_acc)
    if epoch % 20 == 0:
        print(f"epoch: {epoch}, LOSS: {epoch_loss:.3f}, ACC: {epoch_acc:.3f}")

> epoch: 20, LOSS: 0.384, ACC: 0.834
> epoch: 40, LOSS: 0.327, ACC: 0.842
> epoch: 60, LOSS: 0.290, ACC: 0.874
> epoch: 80, LOSS: 0.257, ACC: 0.893
> epoch: 100, LOSS: 0.244, ACC: 0.896
> epoch: 120, LOSS: 0.249, ACC: 0.899
> epoch: 140, LOSS: 0.222, ACC: 0.903
> epoch: 160, LOSS: 0.186, ACC: 0.928
> epoch: 180, LOSS: 0.158, ACC: 0.941
> epoch: 200, LOSS: 0.178, ACC: 0.928

Plot the learning process with matplotlib.

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.plot([i for i in range(1, 201)], epoch_losses, c='b', alpha=0.6, label='loss')
plt.plot([i for i in range(1, 201)], epoch_accuracies, c='r', alpha=0.6, label='acc')

The learning curve indicates that the process works fine.

In summary, DGL is a useful package for chemoinformatics, so I hope more chemistry-related example code will be provided by the original repository. ;)

Today’s code can be found in my repository and gist.

Use pytorch for QSAR model building more simply like scikit-learn #pytorch #chemoinformatics #RDKit

I often use pytorch as my deep learning framework. I like pytorch because it is very flexible and many recent articles use it for their implementations.

But to build and train a model, I need to define the training procedure myself. So it would be nice if I could train a pytorch model just by calling fit like scikit-learn, wouldn't it?

Fortunately, by using skorch, the training process becomes that simple.

Skorch is a scikit-learn compatible neural network library that wraps pytorch. Wow, it's cool! Skorch can be installed via pip.
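
What "scikit-learn compatible" buys you is the plain fit/predict contract. Here is a torch-free toy estimator showing just that interface; skorch's NeuralNetClassifier exposes the same methods around a wrapped pytorch module:

```python
import collections

class MajorityClassifier:
    """Toy estimator following the scikit-learn fit/predict interface."""
    def fit(self, X, y):
        # "training" = remember the most frequent class label
        self.majority_ = collections.Counter(y).most_common(1)[0][0]
        return self  # sklearn convention: fit returns self

    def predict(self, X):
        return [self.majority_ for _ in X]

clf = MajorityClassifier().fit([[0], [1], [2]], [1, 1, 0])
print(clf.predict([[5], [6]]))  # [1, 1]
```

Because the contract is the same, such an estimator (or a skorch-wrapped network) can be dropped into sklearn utilities like cross-validation or pipelines.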

I installed it and then tried to build QSAR model with skorch.
At first, import packages and make datasets for training and test. Note that the descriptor arrays should be float32 (float64 will cause an error), and for the classification task the class indices should be int64.

import os
from rdkit import Chem
from rdkit import RDPaths
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
import numpy as np

trainsdf = os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.train.sdf')
testsdf =  os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.test.sdf')
prop_dict = {
    "(A) low": 0,
    "(B) medium": 1,
    "(C) high": 2
}

def fp2arr(fp):
    arr = np.zeros((0,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

nbits = 1024
ncls = 3

trainmols = [m for m in Chem.SDMolSupplier(trainsdf)]
testmols = [m for m in Chem.SDMolSupplier(testsdf)]

train_fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=nbits) for m in trainmols]
train_X = np.array([fp2arr(fp) for fp in train_fps], dtype=np.float32)

train_y = np.array([prop_dict[m.GetProp('SOL_classification')] for m in trainmols])
train_y = np.array(train_y, dtype=np.int64)

test_fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=nbits) for m in testmols]
test_X = np.array([fp2arr(fp) for fp in test_fps], dtype=np.float32)

test_y = np.array([prop_dict[m.GetProp('SOL_classification')] for m in testmols])
test_y = np.array(test_y, dtype=np.int64)

Then import pytorch and skorch. I checked whether a GPU is available or not.

import torch
if torch.cuda.is_available():
    device = 'cuda'
    print('use GPU')
else:
    device = 'cpu'
    print('use CPU')

from torch import nn
import torch.nn.functional as F
from skorch import NeuralNetClassifier

Next, define a simple multi-layer perceptron model. The following example is a multi-class classification, so I used a softmax at the last layer.

class SimpleMLP(nn.Module):
    def __init__(self, num_units=512, nonlin=F.relu):
        super(SimpleMLP, self).__init__()
        self.dense0 = nn.Linear(nbits, num_units)
        self.nonlin = nonlin
        self.dropout = nn.Dropout(0.5)
        self.dense1 = nn.Linear(num_units, 10)
        self.output = nn.Linear(10, ncls)
    def forward(self, x, **kwargs):
        x = self.nonlin(self.dense0(x))
        x = self.dropout(x)
        x = F.relu(self.dense1(x))
        x = F.softmax(self.output(x), dim=-1)
        return x

Next, make a NeuralNetClassifier with the defined model. It is very easy to use a GPU: if you pass the device='cuda' option to the object, that's all! ;-) After that, let's train the model with the fit method.

net = NeuralNetClassifier(SimpleMLP,
                          max_epochs=100,
                          device='cuda' if torch.cuda.is_available() else 'cpu')
net.fit(train_X, train_y)

 epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        1.1241       0.3902        1.1102  0.1851
      ----    snip---------
    100        0.0831       0.6146        1.0327  0.0251
    (dense0): Linear(in_features=1024, out_features=512, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
    (dense1): Linear(in_features=512, out_features=10, bias=True)
    (output): Linear(in_features=10, out_features=3, bias=True)

OK, let's check the model performance.

pred = net.predict(test_X)
from sklearn.metrics import confusion_matrix, classification_report
confusion_matrix(test_y, pred)
array([[83, 18,  1],
       [35, 72,  8],
       [ 0, 11, 29]])

print(classification_report(test_y, pred))

              precision    recall  f1-score   support

           0       0.70      0.81      0.75       102
           1       0.71      0.63      0.67       115
           2       0.76      0.72      0.74        40

    accuracy                           0.72       257
   macro avg       0.73      0.72      0.72       257
weighted avg       0.72      0.72      0.71       257

It seems work fine.

Then I tried to build a regression model. NeuralNetRegressor is used for this. The model structure is almost the same as above, but the last layer is a little bit different.

from skorch import NeuralNetRegressor
class MLP_regressor(nn.Module):
    def __init__(self, num_units=512, nonlin=F.relu):
        super(MLP_regressor, self).__init__()
        self.dense0 = nn.Linear(nbits, num_units)
        self.nonlin = nonlin
        self.dropout = nn.Dropout(0.5)
        self.dense1 = nn.Linear(num_units, 10)
        self.output = nn.Linear(10, 1)
    def forward(self, x, **kwargs):
        x = self.nonlin(self.dense0(x))
        x = self.dropout(x)
        x = F.relu(self.dense1(x))
        x = self.output(x)
        return x

regressor = NeuralNetRegressor(MLP_regressor,
                               max_epochs=100,
                               device='cuda' if torch.cuda.is_available() else 'cpu')

Then make the dataset for the regression task.

train_y_reg = np.array([float(m.GetProp('SOL')) for m in trainmols], dtype=np.float32).reshape(-1,1)
test_y_reg = np.array([float(m.GetProp('SOL')) for m in testmols], dtype=np.float32).reshape(-1,1)
regressor.fit(train_X, train_y_reg)
  epoch    train_loss    valid_loss     dur
-------  ------------  ------------  ------
      1        8.3883       17.1863  0.0306
    100        0.3729        2.7019  0.0245
    (dense0): Linear(in_features=1024, out_features=512, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
    (dense1): Linear(in_features=512, out_features=10, bias=True)
    (output): Linear(in_features=10, out_features=1, bias=True)

Check the model.

reg_y = regressor.predict(test_X)
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(np.arange(-9,2), np.arange(-9,2), color='pink', alpha=0.7)
plt.scatter(reg_y, test_y_reg)
from sklearn.metrics import r2_score
print(f"R2 score of testset was {r2_score(test_y_reg, reg_y):.2f}")
> R2 score of testset was 0.65

The current models have room for improvement because I didn't optimize the model structure and parameters. ;)

In summary, skorch is a very useful package for programmers who would like to write model building and training code quickly.

I just uploaded today's code to my github repo.

And it can be run on binder.