ML visualization tool for Python #machine_learning #visualization

Machine learning is used for QSAR in drug discovery. After building a model, analyzing its performance is important. One of the major ML packages for Python, scikit-learn, has many functions for model performance metrics, but it does not provide visualization tools. So I usually make plots with matplotlib, seaborn, etc.

But a few days ago, I learned about a useful package for ML visualization. Its name is yellowbrick.
https://www.scikit-yb.org/en/latest/index.html
It can be installed via conda or pip.

By using the package, you can make many plots very efficiently. The following code shows some brief examples.

1st step) Import packages.



%matplotlib inline
import numpy as np
import os
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import RDConfig
from rdkit.Chem import DataStructs
from rdkit.Chem.Draw import DrawMorganBit
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from yellowbrick.features import FeatureImportances
from yellowbrick.classifier import PrecisionRecallCurve
from yellowbrick.classifier import ROCAUC
from yellowbrick.classifier import ClassificationReport
from yellowbrick.classifier import ClassPredictionError
from IPython import display
le = LabelEncoder()

2nd step) Make a fingerprint matrix and label index for ML. I used the solubility data from the rdkit package.

def fp2arr(fp):
    arr = np.zeros((0,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr
train_mols = [m for m in Chem.SDMolSupplier(os.path.join(RDConfig.RDDocsDir, 'Book/data/solubility.train.sdf'))]
test_mols = [m for m in Chem.SDMolSupplier(os.path.join(RDConfig.RDDocsDir, 'Book/data/solubility.test.sdf'))]
cls = set([m.GetProp('SOL_classification') for m in train_mols])
n_cls = len(cls)
train_fp_bitifo = [{} for _ in range(len(train_mols))]
train_fp = np.asarray([fp2arr(AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=512, bitInfo=train_fp_bitifo[idx])) for idx, m in enumerate(train_mols)])
train_y = [m.GetProp('SOL_classification') for m in train_mols]
train_y_le = le.fit_transform(train_y)

test_fp_bitifo = [{} for _ in range(len(test_mols))]
test_fp = np.asarray([fp2arr(AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=512, bitInfo=test_fp_bitifo[idx])) for idx, m in enumerate(test_mols)])
test_y = [m.GetProp('SOL_classification') for m in test_mols]
test_y_le = le.transform(test_y)

3rd step) OK, let's look at the feature importances. The general yellowbrick flow is:

viz(model) => viz.fit(x, y) => viz.score(x,y) => viz.poof()

In the following example, bits 49, 356, and 295 are the top 3 important features.

rf = RandomForestClassifier(n_estimators=200)
viz = FeatureImportances(rf)
viz.size = np.array([600, 6000])
viz.fit(train_fp, train_y_le)
viz.poof()

Let’s see the structure.

Draw.MolsToGridImage(test_mols[50:80], legends=test_y[50:80], molsPerRow=5)

print(test_fp_bitifo[57])
test_mols[57]
> {33: ((3, 0), (4, 0), (5, 0)), 54: ((2, 2),), 80: ((1, 0),), 114: ((2, 0),), 115: ((1, 1),), 222: ((0, 1),), 295: ((0, 0),), 392: ((3, 1), (4, 1), (5, 1)), 485: ((2, 1),)}
#see bit295 of test mol 57.
IPythonConsole.DrawMorganBit(test_mols[57], 295, bitInfo=test_fp_bitifo[57])

Bit 295 corresponds to a hydroxyl group. Hmm, that is reasonable.

It is easy to visualize other metrics; the coding style is almost the same.

rf = RandomForestClassifier(n_estimators=200)
viz_cls = PrecisionRecallCurve(rf)
viz_cls.size = np.array([600, 400])
viz_cls.fit(train_fp, train_y_le)
viz_cls.score(test_fp, test_y_le)
viz_cls.poof()
rf = RandomForestClassifier(n_estimators=200)
viz_rocauc = ROCAUC(rf)
viz_rocauc.size = np.array([600, 400])
viz_rocauc.fit(train_fp, train_y_le)
viz_rocauc.score(test_fp, test_y_le)
viz_rocauc.poof()
rf = RandomForestClassifier(n_estimators=200)
viz_cls = ClassificationReport(rf)
viz_cls.size = np.array([600, 400])
viz_cls.fit(train_fp, train_y_le)
viz_cls.score(test_fp, test_y_le)
viz_cls.poof()

rf = RandomForestClassifier(n_estimators=200)
viz_ce = ClassPredictionError(rf)
viz_ce.size = np.array([600, 400])
viz_ce.fit(train_fp, train_y_le)
viz_ce.score(test_fp, test_y_le)
viz_ce.poof()

It is nice, isn't it?

Yellowbrick is a very useful package for ML.

All the code is uploaded to my gist.

Extracting chemical information from patent data #pat-informatics #chemoinformatics

As you know, patent informatics is important for drug discovery projects. SureChEMBL is one of the datasets of chemical structures extracted from patent documents by OCR. It is valuable because it is a freely available data source.

I was surprised to find that Google Patents recently started providing chemical data too.

It does not seem to fully cover all structures, but it looks cool. Let's see an example; the URL is below.
https://patents.google.com/patent/WO2012125893A1/en?oq=WO2012125893A1

The machine-extracted information, including structures, is listed in a 'Concepts' table like the one below. I'm not sure which structures are extracted by the machine, so it may not be all of them.

The page source is HTML, so I tried to extract the data with Python ;)

To parse the HTML, I used BeautifulSoup; it's very useful for HTML parsing. The SMILES data is located under 'concept => ul => span(itemprop=smiles)'. Additional information is also provided, such as the name and the domain, which shows where in the document the data was extracted from.

The following code extracts the dataset and makes a pandas data frame.
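Since the extraction code itself is embedded elsewhere, here is a minimal sketch of the idea, run on a hand-made HTML fragment. The fragment's layout and the itemprop names other than 'smiles' are my assumptions based on the description above; the real Google Patents markup may differ.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical fragment mimicking the described 'concept' list structure.
html = """
<ul>
  <li><span itemprop="name">ethanol</span>
      <span itemprop="smiles">CCO</span>
      <span itemprop="domain">Description</span></li>
  <li><span itemprop="name">benzene</span>
      <span itemprop="smiles">c1ccccc1</span>
      <span itemprop="domain">Claims</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for li in soup.find_all("li"):
    # Each concept entry carries a name, a SMILES string, and the domain
    # (the part of the patent the structure was extracted from).
    rows.append({
        "name": li.find("span", itemprop="name").text,
        "smiles": li.find("span", itemprop="smiles").text,
        "domain": li.find("span", itemprop="domain").text,
    })
df = pd.DataFrame(rows)
print(df)
```

For the real page you would fetch the HTML first (e.g. with requests) and apply the same loop to the concept table.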

If the machine can extract all the data in the experimental, description, and claims sections, it will be a powerful tool for pat-informatics.

Google provides many services which are freely available.

Visualizing chemical space as a grid #chemoinformatics #rdkit

Visualizing chemical space is important for medicinal chemists, I think. Recently, Prof. Bajorath's group published a nice article. The URL is below.

https://pubs.acs.org/doi/pdf/10.1021/acsomega.9b00595

The authors describe a new approach that combines SAR matrices and molecular grid maps. The SAR matrix is one of the methods for SAR analysis, like Free-Wilson analysis.

I was interested in their approach because they use molecular grid maps. I often use PCA and/or t-SNE for chemical space mapping, but those do not give a grid.

Molecular grid maps are like SOMs. To make the maps, they used the Jonker-Volgenant (JV) algorithm. The details are described at the following URL.
https://link.springer.com/article/10.1007/BF02278710

I would like to try the mapping method.

Fortunately, a Python package for the JV algorithm is available on GitHub! The name is lapjv. I installed it and tried it out.
https://github.com/src-d/lapjv
My code is below.

First, import the packages and load the data. The sample data came from ChEMBL.

%matplotlib inline
import matplotlib.pyplot as plt
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import pandas as pd
import numpy as np
from scipy.spatial.distance import cdist
df = pd.read_csv('CHEMBL3888506.tsv', sep='\t', header=0)

To make the grid mapping, the number of samples must be N^2. The dataset has 467 molecules, so I used 400 of them for this test, which means I will embed them into a 20 x 20 grid.

mols = [Chem.MolFromSmiles(smi) for smi in df.Smiles]
sampleidx = np.random.choice(list(range(len(mols))), size=400, replace=False)
samplemols = [mols[i] for i in sampleidx]
sampleact = [9-np.log10(df['Standard Value'][idx]) for idx in sampleidx]
fps = [AllChem.GetMorganFingerprintAsBitVect(m,2) for m in samplemols]
def fp2arr(fp):
    arr = np.zeros((0,))
    DataStructs.ConvertToNumpyArray(fp,arr)
    return arr
X = np.asarray([fp2arr(fp) for fp in fps])
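As an aside, the activity transform used above, 9 - np.log10(value), converts standard values reported in nM into pIC50, i.e. -log10 of the molar concentration. A tiny sanity check (the helper name is mine):

```python
import numpy as np

def nM_to_pIC50(value_nM):
    """Convert an activity in nM to pIC50 = -log10(value in mol/L).

    Since 1 nM = 1e-9 M, -log10(value_nM * 1e-9) = 9 - log10(value_nM).
    """
    return 9 - np.log10(value_nM)

print(nM_to_pIC50(100.0))  # 100 nM -> 7.0
print(nM_to_pIC50(1.0))    # 1 nM -> 9.0
```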

Then perform PCA followed by t-SNE to get the chemical space, and normalize the result.

size = 20
N = size*size
data = PCA(n_components=100).fit_transform(X.astype(np.float32))
embeddings = TSNE(init='pca', random_state=794, verbose=2).fit_transform(data)
embeddings -= embeddings.min(axis=0)
embeddings /= embeddings.max(axis=0)

Check the t-SNE result. Activity is used for color mapping.

plt.scatter(embeddings[:,0], embeddings[:,1], c=sampleact, cmap='hsv')

Next, let's project the chemical space onto the grid. Using lapjv is very simple. First, calculate the cost matrix with scipy's cdist function, then pass the matrix to lapjv.

grid = np.dstack(np.meshgrid(np.linspace(0,1,size), np.linspace(0,1,size))).reshape(-1,2)
from lapjv import lapjv
cost_mat = cdist(grid, embeddings, 'sqeuclidean').astype(np.float32)
cost_mat2 = cost_mat * (10000 / cost_mat.max())
row_asses, col_asses, _ = lapjv(cost_mat2)
grid_lap = grid[col_asses]

Now we're ready. Let's plot the grid map.

plt.scatter(grid_lap[:,0], grid_lap[:,1], c=sampleact, cmap='hsv')

It seems to work well. Dots with the same color are located near each other.

A grid plot is useful because it avoids overlapping dots.
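If you cannot install lapjv, the same assignment can be computed with scipy's linear_sum_assignment, which solves the same linear assignment problem as the JV algorithm (though more slowly for large N). A minimal sketch on a 2 x 2 grid with random "embeddings":

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
size = 2
embeddings = rng.random((size * size, 2))   # 4 points in [0, 1]^2
grid = np.dstack(np.meshgrid(np.linspace(0, 1, size),
                             np.linspace(0, 1, size))).reshape(-1, 2)

# cost[i, j]: squared distance from grid cell i to embedded point j
cost = cdist(grid, embeddings, 'sqeuclidean')
row_ind, col_ind = linear_sum_assignment(cost)  # minimize total cost

# Invert the assignment to get the grid position of each embedded point.
grid_pos = grid[np.argsort(col_ind)]
print(grid_pos.shape)  # (4, 2)
```

Each point ends up on exactly one grid cell, which is what removes the overlap.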

The authors have developed a more sophisticated tool; however, the source code is not disclosed. It seems very attractive for medchem ;-)

Computer-assisted rational drug design is very important. But I think visualization and analysis methods for medicinal chemists are equally important for drug design.

Today’s code is below.


If you are interested in lapjv, please try it.

Process development of a fluorinated pyrrolidine analogue #memo #organic_chemistry

Here is an interesting article from Pfizer on the efficient synthesis of fluorinated pyrrolidines.
https://pubs.acs.org/doi/abs/10.1021/acs.oprd.9b00245

Fluorine-containing building blocks are often used in medicinal chemistry, so an efficient synthetic route is very useful for us.

Some years ago, I synthesized similar compounds, 1-fluoro-2-amino cyclic amine derivatives, using almost the same synthetic scheme described in Scheme 1: 1) epoxidation, 2) epoxide opening with sodium azide, 3) DAST fluorination, 4) azide reduction.

The scheme is stereoselective if the epoxide is chiral. I like substrate-controlled synthetic routes like this.

Scheme 1 is the first-generation route, which uses mCPBA, NaN3, and DAST; it is not suitable for large-scale production. So the authors' group developed a very efficient synthetic route using a modified Burgess-type reagent. The reaction of the reagent with a 1,2-trans-diol gives a cis-cyclic sulfamate intermediate. The key point of the strategy is transforming the trans-diol into a cis-amino alcohol in protected form. Moreover, the hydroxyl group is converted to an SO2NR group, which means it becomes a leaving group.

The cyclic sulfamate then reacts with TBAF to give the trans-fluoro protected amine. You can check the overview of the synthesis in the online abstract.

The merits of the scheme, I think, are switching the fluorination reagent from DAST to TBAF and switching the nitrogen source from sodium azide to the Burgess-type reagent (of course there are more advantages, which are described in the article).

Thanks to the authors for publishing this nice synthetic approach.

A useful package for descriptor calculation #chemoinformatics #rdkit

Descriptor calculation is an important task in chemoinformatics. I often use rdkit to do it. Today I found a very useful package for descriptor calculation named descriptastorus. The URL is below.

https://github.com/bp-kelley/descriptastorus

It is very easy to install the package; just run the following command.

pip install git+https://github.com/bp-kelley/descriptastorus

After that, I could use the package.

With the package, the following descriptor sets can be calculated very efficiently.

* atompaircounts
* morgan3counts
* morganchiral3counts
* morganfeature3counts
* rdkit2d
* rdkit2dnormalized
* rdkitfpbits

I tried the package. A very simple example:

from descriptastorus.descriptors.DescriptorGenerator import MakeGenerator
gen1 = MakeGenerator(('rdkit2d',))
smi = 'c1ccncc1'
data = gen1.process(smi)
data
>out
[True,
3.000000000000001,
75.86113958768547,
4.242640687119286,
3.333964941448087,
3.333964941448087,
…]

Of course, it is easy to get the column names.

for col in gen1.GetColumns():
    print(col[0])
>out

RDKit2D_calculated
BalabanJ
BertzCT
Chi0
Chi0n
Chi0v
Chi1
Chi1n
Chi1v
Chi2n
Chi2v
Chi3n
Chi3v
Chi4n
Chi4v
EState_VSA1
EState_VSA10

The first element indicates whether the calculation was successful or not.

The package also provides normalized descriptors, which are useful for machine learning. It is easy to get them.

from descriptastorus.descriptors import rdDescriptors
from descriptastorus.descriptors import rdNormalizedDescriptors
gen2 = rdDescriptors.RDKit2D()
gen3 = rdNormalizedDescriptors.RDKit2DNormalized()
data2 = gen2.process(smi)
data3 = gen3.process(smi)
for i in range(len(data2)):
    print(data2[i], data3[i])
>out
True True
3.000000000000001 0.9749367759562906
75.86113958768547 0.0018713336795980954
4.242640687119286 0.0001226379059079931
3.333964941448087 0.00012708489238074725
3.333964941448087 9.889979057774486e-05
3.0 0.00031922002005261777
1.8497311128276561 0.0003685049537727884
1.8497311128276561 0.0005439421913480759
….

Descriptastorus can also make a DescriptaStore. I failed at this because I couldn't install kyotocabinet in my environment…

In summary, the package is very useful for chemoinformaticians because you can get many rdkit descriptors by typing only a few lines.

It is amazing to me that we can use such useful open-source packages so easily.

Thanks for OSS developers!


Today’s gist is below.

Calculate solvent effect in Psi4 #psi4 #quantumchemistry

Recently I have been using not only chemoinformatics tools but also quantum chemistry tools; my favorite is Psi4. Psi4 has many options and plug-ins for quantum calculations. Most calculations are set up in vacuum by default, but real systems are not in vacuum, so considering the solvent around the molecule is important.

Can Psi4 perform calculations with a solvent effect?

Yes!

PCMSolver is a plugin for PCM calculations. It can be installed via conda.

$ conda install -c psi4 pcmsolver

PCM stands for 'polarizable continuum model'. PCM describes solute-solvent interactions only by means of electrostatics and polarization between the solute molecule and the solvent. The solvent is modeled as a dielectric continuum with a given permittivity ε.

To use PCMSolver in Psi4, the user needs to write a specific input file for the calculation, which should specify the kind of solvent, the basis set, CPCM or IEFPCM, etc.

I made the input files from a toluene.mol file retrieved from ChEMBL.

The input files are below. PCMSolver cannot be called from the Python interpreter, so I needed to make an input.dat.

# toluene in water
#! pcm

molecule mol {

0 1
C -2.0705902285428452 0.09342117730391691 0.21438842743524625
C 0.06881622580040792 1.2107631870687914 0.004377262725162135
C -0.043709196272760036 -1.211640074147377 -0.026831077837745607
C -1.3202142335884217 1.271231179761799 0.15648583022799556
C 0.714364020910022 -0.03184758851751915 -0.10370200526939809
C 2.2076605753403253 -0.0996279156897954 -0.22684462930707974
C -1.4325455046382927 -1.1469904802166477 0.1253309895373383
H -3.1452716192643337 0.14180771794289201 0.3335057008466673
H 0.6396267675023236 2.1305648085368674 -0.0295934787763336
H 0.4394292831589224 -2.1792111515476216 -0.08511724686650844
H -1.8144108252945523 2.2311166905139106 0.23329105314598383
H 2.59680405039162 0.7889905119305699 -0.7681174009541791
H 2.5134086150024317 -1.0062726983460601 -0.7912601501934461
H 2.660308829389096 -0.1337122631434785 0.78606242092797
H -2.013676759894258 -2.058593101450173 0.17802430435839692
no_reorient
no_com
units angstrom

}

set {
basis 6-31G**
scf_type pk
pcm true
pcm_scf_type total
}

pcm = {
Units = Angstrom
Medium {
SolverType = IEFPCM
Solvent = Water
}

Cavity {
RadiiSet = UFF
Type = GePol
Scaling = false
Area = 0.3
Mode = Implicit
}
}

energy_pte, wfn = energy('scf', return_wfn=True)

# toluene in benzene
#! pcm

molecule mol {

0 1
C -2.0705902285428452 0.09342117730391691 0.21438842743524625
C 0.06881622580040792 1.2107631870687914 0.004377262725162135
C -0.043709196272760036 -1.211640074147377 -0.026831077837745607
C -1.3202142335884217 1.271231179761799 0.15648583022799556
C 0.714364020910022 -0.03184758851751915 -0.10370200526939809
C 2.2076605753403253 -0.0996279156897954 -0.22684462930707974
C -1.4325455046382927 -1.1469904802166477 0.1253309895373383
H -3.1452716192643337 0.14180771794289201 0.3335057008466673
H 0.6396267675023236 2.1305648085368674 -0.0295934787763336
H 0.4394292831589224 -2.1792111515476216 -0.08511724686650844
H -1.8144108252945523 2.2311166905139106 0.23329105314598383
H 2.59680405039162 0.7889905119305699 -0.7681174009541791
H 2.5134086150024317 -1.0062726983460601 -0.7912601501934461
H 2.660308829389096 -0.1337122631434785 0.78606242092797
H -2.013676759894258 -2.058593101450173 0.17802430435839692
no_reorient
no_com
units angstrom

}

set {
basis 6-31G**
scf_type pk
pcm true
pcm_scf_type total
}

pcm = {
Units = Angstrom
Medium {
SolverType = IEFPCM
Solvent = Benzene
}

Cavity {
RadiiSet = UFF
Type = GePol
Scaling = false
Area = 0.3
Mode = Implicit
}
}

energy_pte, wfn = energy('scf', return_wfn=True)

PCMSolver supports many kinds of solvents, listed below.

    1:'Water',
    2:'Propylene Carbonate',
    3:'Dimethylsulfoxide',
    4:'Nitromethane',
    5:'Acetonitrile',
    6:'Methanol',
    7:'Ethanol',
    8:'Acetone',
    9:'1,2-Dichloroethane',
    10:'Methylenechloride',
    11:'Tetrahydrofurane',
    12:'Aniline',
    13:'Chlorobenzene',
    14:'Chloroform',
    15:'Toluene',
    16:'1,4-Dioxane',
    17:'Benzene',
    18:'Carbon Tetrachloride',
    19:'Cyclohexane',
    20:'N-heptane',
    21:'Explicit' 

After making the input files, the following steps are the same as a default Psi4 calculation.

$ psi4 input.dat -o output.dat

After waiting a few seconds, I got the two output results. Lines 355-363 of the output are shown below.

## toluene in water
@RHF Final Energy: -269.75544369636918

=> Energetics <=

Nuclear Repulsion Energy = 268.6733324038011688
One-Electron Energy = -895.7789138809855558
Two-Electron Energy = 357.3548941675961714
PCM Polarization Energy = -0.0019460266422257
Total Energy = -269.7526333362304172

Computation Completed

There is a difference in the PCM polarization energy between water and benzene: in water it is -0.0052558561183017 a.u. (-3.30 kcal/mol) and in benzene it is -0.0019460266422257 a.u. (-1.22 kcal/mol).
*1 a.u. = 627.50 kcal/mol
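The conversion can be double-checked in a couple of lines, using the constant quoted above:

```python
# Convert the PCM polarization energies from a.u. (hartree) to kcal/mol.
HARTREE_TO_KCAL = 627.50

e_pcm_water = -0.0052558561183017    # a.u., from the water run
e_pcm_benzene = -0.0019460266422257  # a.u., from the benzene run

print(round(e_pcm_water * HARTREE_TO_KCAL, 2))    # -3.3
print(round(e_pcm_benzene * HARTREE_TO_KCAL, 2))  # -1.22
```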

I would like to test some more molecules. Is there a freely available experimental data set? I'll try to find one.

Do pharmas need more data to build better models? #memo #medchem

Are 'AI' and 'Big Data' the keys to success? Recently, many pharmaceutical companies have been trying to use AI to accelerate their drug discovery projects and reduce costs. However, it sometimes fails because of a lack of training data. It is rare for lots of data to be available in early drug discovery projects, and it is still difficult to build accurate predictive models with small data sets.

Recently, an attractive project was reported in Nature Reviews Drug Discovery.
https://www.nature.com/articles/d41573-019-00120-w

The project is named MELLODDY (Machine Learning Ledger Orchestration for Drug Discovery), an €18-million, 3-year IMI project.

According to the following URL, MELLODDY consists of 17 partners, and the partners will share structure and activity data without compromising their IP.
https://www.janssen.com/emea/new-research-consortium-seeks-accelerate-drug-discovery-using-machine-learning-unlock-maximum

  • 10 pharmaceutical companies: Amgen, Astellas, AstraZeneca, Bayer, Boehringer Ingelheim, GSK, Janssen Pharmaceutica NV, Merck KGaA, Novartis, and Institut de Recherches Servier
  • 2 academic universities: KU Leuven, Budapesti Muszaki es Gazdasagtudomanyi Egyetem
  • 4 subject matter experts: Owkin, Substra Foundation, Loodse, Iktos
  • 1 large AI computing company: NVIDIA

The consortium uses Amazon Web Services for machine learning. I think it is a nice approach for data and model management, though it would be difficult for pharma companies in Japan (just my personal opinion)…

What about the chemical space of the combined compound data? Is the data enough to overcome the applicability domain problem?

I hope the project goes well.