Cut molecule to ring and linker with RDKit #RDKit #chemoinformatics #memo

Sometime chemists analyze molecule as small parts such as core, linker and substituents.

RDKit has functions for molecular decomposition RECAP, BRICS and rdMMPA. It’s useful functions but these functions can’t extract directly extract core and linker from molecule.

I had interested how to do it and tried it.

Following code, Core is defined Rings in the molecule and Linker is defined chain which connects two rings.

I defined main function named ‘getLinkerbond’ which extracts bonds that connects atom in ring and atom not in ring.

Then call Chem.FragmentOnBonds function to get fragments by cutting linker bond.

Let’s see the code. First, import some libraries. And define test molecules.

import copy
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from IPython.display import display

mol1 = Chem.MolFromSmiles("C1CCNCC1")
mol2 = Chem.MolFromSmiles("c1ccccc1")
mol3 = Chem.MolFromSmiles("C1CC1CC")
mol4 = Chem.MolFromSmiles("C1CC1CCC2CCOC2")
mol5 = Chem.MolFromSmiles("C1CCC2CCCCC2C1")
mol6 = Chem.MolFromSmiles("OC1CC2(C1)CCCC2")
mol7 = Chem.MolFromSmiles("CC1CCN(CC1N(C)C2=NC=NC3=C2C=CN3)C(=O)CC#N")
mol8 = Chem.MolFromSmiles("c1ccccc1c2ccccc2OC")
mol9 = Chem.MolFromSmiles("CC1=C(C=C(C=C1)NC(=O)C2=CC=C(C=C2)CN3CCN(CC3)C)NC4=NC=CC(=N4)C5=CN=CC=C5")
mol10 = Chem.MolFromSmiles("C1CCCC1CCCOCOCOCOCOCc1ccccc1")
mols = [mol1,mol2,mol3,mol4,mol5,mol6,mol7,mol8,mol9,mol10]

Then define main function. I wrote helper function for the main function. is_in_samering function check the atoms which construct a bond are in same ring or not.

def is_in_samering(idx1, idx2, bond_rings):
    for bond_ring in bond_rings:
        if idx1 in bond_ring and idx2 in bond_ring:
            return True
    return False

Following code, I set atom prop at first to keep original atom index. Because GetScaffoldForMol function returns new molecule with new atom index. I would like to use original atom idex in this function.

def getLinkerbond(mol, useScaffold=True):
    res = []
    for atom in mol.GetAtoms():
        atom.SetIntProp("orig_idx", atom.GetIdx())
    for bond in mol.GetBonds():
        bond.SetIntProp("orig_idx", bond.GetIdx())
    
    if useScaffold:
        mol = MurckoScaffold.GetScaffoldForMol(mol)
        
    ring_info = mol.GetRingInfo()
    bond_rings = ring_info.BondRings()
    ring_bonds = set()
    for ring_bond_idxs in bond_rings:
        for idx in ring_bond_idxs:
            ring_bonds.add(idx)
    all_bonds_idx = [bond.GetIdx() for bond in mol.GetBonds()]
    none_ring_bonds = set(all_bonds_idx) - ring_bonds
    for bond_idx in none_ring_bonds:
        bgn_idx = mol.GetBondWithIdx(bond_idx).GetBeginAtomIdx()
        end_idx = mol.GetBondWithIdx(bond_idx).GetEndAtomIdx()
        if mol.GetBondWithIdx(bond_idx).GetBondTypeAsDouble() == 1.0:
            if mol.GetAtomWithIdx(bgn_idx).IsInRing()+mol.GetAtomWithIdx(end_idx).IsInRing() == 1:
                bond = mol.GetBondWithIdx(bond_idx)
                orig_idx = bond.GetIntProp("orig_idx")
                res.append(orig_idx)
            elif not is_in_samering(bgn_idx, end_idx, bond_rings) and mol.GetAtomWithIdx(bgn_idx).IsInRing()+mol.GetAtomWithIdx(end_idx).IsInRing() == 2:
                bond = mol.GetBondWithIdx(bond_idx)
                orig_idx = bond.GetIntProp("orig_idx")
                res.append(orig_idx)
    return res

Check the function.


for mol in mols:
    bonds = getLinkerbond(mol)
    if bonds:
        res = Chem.FragmentOnBonds(mol, bonds)
        display(res)
    else:
        display(mol)

The function cut murckoscaffold structure as default. So bonds not connect rings is not cut.

Next, case of useScaffold False. In this case all bonds outside of rings are cut.

for mol in mols:
    bonds = getLinkerbond(mol, False)
    if bonds:
        res = Chem.FragmentOnBonds(mol, bonds)
        display(res)
    else:
        display(mol)

Works fine.

It is useful for molecular decomposition at linker site I think.

Today’s whole code is uploaded my gist. Any comments and advices will be high appreciated. ;)

Draw RDKit mol/reaction object on HTML without static png image #RDKit #memo

I think this post might not useful for many people. Just for my memorandum.

When I make web app, I use draw SVG function for rdkit object rendering. But from last my post and Greg’s gist, I could learn draw png image without writing static png files. So I wonder that can I draw PNG image on HTML without png file.

Fortunately I could find the solution!

To do that, at first draw rdkit object and convert it png byte text.
Then encode base64 and decode UTF8, that’s all!

Generated text can render in HTML with <img src=’data:image/png;base64> tag.

Today’s code is below. ;)

Edit reaction image and draw it #RDKit #memo

I often use rdkit on jupyter notebook because notebook can render molecules very conveniently. However I couldn’t find way to edit font size of reaction some days ago.

Following code is an example. Changing atom font size of MolToImage is easy but reaction image can’t apply same manner.

from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import rdMolDraw2D
from rdkit.Chem import rdChemReactions as Reactions
from rdkit.Chem.Draw import IPythonConsole
from PIL import Image

print(Draw.DrawingOptions.atomLabelFontSize)
>12
mol = Chem.MolFromSmiles('c1ccncc1')
rxn = Reactions.ReactionFromSmarts('CC(=O)C>>CC(O)C', useSmiles=True)

Draw.MolToImage(mol)
Draw.ReactionToImage(rxn)
Image from default settings

OK, next tried to change font size.


Draw.DrawingOptions.atomLabelFontSize = 30
print(Draw.DrawingOptions.atomLabelFontSize)

Draw.MolToImage(mol)
Draw.ReactionToImage(rxn)

mol object was drawn with changed font size but reaction object wasn’t.

rxn object was drawn with small font size

I found the answer to solve the issue. It was enable by using MolDraw2DCairo and change its settings. It is not efficient way because png file was required. But I could not find any other solution now.

drawer = rdMolDraw2D.MolDraw2DCairo(800, 200)
drawer.DrawReaction(rxn)
drawer.FinishDrawing()
drawer.WriteDrawingText('test.png')
im = Image.open('test.png')
im


drawer = rdMolDraw2D.MolDraw2DCairo(800, 200)
drawer.SetFontSize(1.0)
drawer.DrawReaction(rxn)
drawer.FinishDrawing()
drawer.WriteDrawingText('test2.png')
im = Image.open('test2.png')
im

As you see, atom font size was changed. ;)

Today’s code was uploaded to my gist.

Any advice will be great fully appreciated. Have a nice weekend!!!

Update

Greg showed me how to render PNG image without drawing temporal file.
Thanks so much!

I have some idea to using the method. I’ll post it after tried it.

Partial charge visualization with RDKit #RDKit #QuantumChemistry #psikit

Last post, I showed how to visualize each fingerprint contribution for predictive model results with RDKit.

The function also can visualize atomic properties such as a partial charge of atoms.

Dr. Thomas Evangelidis shared me an example of visualization of quantum chemistry based atom property. He showed an example of conformation effect to the QM calculation.

He tried to calculate PM-6 semi empirical partial charge and draw it with rdkit. He used MOPAC as QM engine. Following figure shows partial charge of same molecule with different conformation. Red arrow shows difference of each molecule.

And also by using partial change from 100 conformers data, he showed how to visualize std and mean of the partial charge.

Mean of partial charge
std of two conformers partial charge

From above image, conjugated aromatic system shows large difference between two conformers. It seems reasonable I think.
You can check whole his code on following URL.
https://github.com/tevang/tutorials/tree/master/compare_atomic_properties

Back to the latest rdkit blog post, Greg showed an example of visualization of partial charge with Extended Huckel method which is newly implemented on RDKit!

I had interest the approach so I tried to use psikit for partial charge calculation.

Following code is almost same as original blog post but little difference, just using psikit.

Current version of psikit can calculate mulliken charge, esp, resp charge and lowdin charge very conveniently. But it took longer time for calculation compared to rdEHTools method.

code on my gist

psikit seems more sensitive to aromatic atoms. I think it is needed to select method in case by case. But rdkit’s rdEHTools works very fast so it is useful tool of light weight QM calc.

Finally thanks Dr. Thomas again for sharing nice code example!

Model interporation with new drawing code of RDKit #RDKit #Machine learning #chemoinformatics

Following code does not use new drawing code but it revised one of my old post. :)

I think, everyone who visits my blog has already read Gregs nice blog post about the drawing similarity map with new code. If you don’t read it I recommend to read it soon. URL is below. ;)
http://rdkit.blogspot.com/2020/01/similarity-maps-with-new-drawing-code.html

Now I like the drawing code because it can draw only atom property of own molecule but also compare molecular similarity and fingerprint contribution of predictive model.

Interpolation of the machine learning model is very important because it is use for next design of the new molecule.

So let’s write code with new drawing code. Following code is almost same as RDKit cook book and Greg’s blog post. Import some packages and check the version of rdkit. I used the newest version.

%matplotlib inline
import matplotlib.pyplot as plt
import os
from functools import partial
from rdkit import Chem
from rdkit import rdBase
from rdkit import RDPaths
from rdkit.Chem import Draw
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem.Draw import SimilarityMaps, IPythonConsole
from IPython.display import SVG
import io
from PIL import Image
import rdkit
import numpy as np
from sklearn.ensemble import RandomForestClassifier
print(rdkit.__version__)
>2019.09.2

Load solubility data which is provided from rdkit data folder for example.

trainpath = os.path.join(RDPaths.RDDocsDir,'Book/data/solubility.train.sdf')
testpath =  os.path.join(RDPaths.RDDocsDir,'Book/data/solubility.test.sdf')
train_mols = [m for m in Chem.SDMolSupplier(trainpath) if m is not None]
test_mols = [m for m in Chem.SDMolSupplier(testpath) if m is not None]
val_dict = {'(A) low':0,
           '(B) medium':1,
           '(C) high':2}

Then define some helper function.

def mol2arr(mol, fpfunc):
    fp = fpfunc(mol)
    arr = np.zeros((0,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr
# https://rdkit.blogspot.com/2020/01/similarity-maps-with-new-drawing-code.html
def show_png(data):
    bio = io.BytesIO(data)
    img = Image.open(bio)
    return img
fpfunc = partial(SimilarityMaps.GetMorganFingerprint, nBits=1024, radius=2)

Then calculate fingerprint and get solubility class. And make randomforest classifier object and fit it. It’s very common way for sunday chemoinformatician. ;)

SOL_classification
trainX = [fpfunc(m) for m in train_mols]
trainY = [val_dict[m.GetProp('SOL_classification')] for m in train_mols]
​
testX = [fpfunc(m) for m in test_mols]
testY = [val_dict[m.GetProp('SOL_classification')] for m in test_mols]
rfc = RandomForestClassifier(random_state=794)
rfc.fit(trainX, trainY)

Next define getProba function the code is borrowed from rdkit cookbook. One difference to cookbook is I used third probability because I would like to visualize effect of the each fingerprint to the high solubility. And defined draw function with some optional arguments. ‘alpha’ is useful because it can control the blending value for the contour lines. It is key for beautiful visualization.

# https://www.rdkit.org/docs/Cookbook.html
# Following example I would like to get probability of High solubility
def getProba(fp, predctionFunction):
    return predctionFunction((fp,))[0][2]
def drawmol(mols, idx):
    d = Draw.MolDraw2DCairo(1,1)
    _, maxWeight = SimilarityMaps.GetSimilarityMapForModel(mols[idx],fpfunc, 
                                                           lambda x: getProba(x, rfc.predict_proba),
                                                           colorMap='coolwarm',
                                                          size=(200,200),
                                                          step=0.001,
                                                          alpha=0.2)
    d.FinishDrawing()
    show_png(d.GetDrawingText())
    print(mols[idx].GetProp('SOL_classification'))

Now almost there. Let’s visualize High and Low solubility molecules with the drawing code!

high_test_mols = [mol for mol in test_mols if mol.GetProp('SOL_classification') == '(C) high']
low_test_mols = [mol for mol in test_mols if mol.GetProp('SOL_classification') == '(A) low']
drawmol(high_test_mols, 0)

Hmm good, the model detects hydroxyl group is positive effect to solubility. How about second example?

drawmol(high_test_mols, 10)

OK, carbonyl group detected as positive group for solubility.

drawmol(low_test_mols, 5)

In this case it is difficult because all atoms are lipophilic. So remove any atoms are positive effect to solubility in my understanding.

The code investigates the effect of the each atom by removing one by one. And calculate difference of probability of prob molecule and atom remove molecule.

So, the magnitude of color shows relative effect of the each atom not absolute effect. This is reason why low solubility molecule shows many positive (red) colors in this example.

Any way, model interporation with 2D color mapping on the molecule is quite nice tool for QSAR analysis.

I would like to write more example codes.

Today’s code can be found on my gist and github.

https://github.com/iwatobipen/playground/blob/master/modelviz.ipynb

Any comments suggestions and requests will be greatly appreciated. ;)

Thanks,

New visualization way for SAR #RDKit #chemoinformatics

Recently I found very cool visualization code for SAR that is provided by @simonduerr. It seems cool!

You can get the code from his repository.
https://github.com/duerrsimon/substrate-scope-plot

I’ve never seen such a cool visualization. Current version of the package can fix the position of the rgroup automatically.

So I tried to use the code for my own dataset.

I got SAR data from ChEMBL, its target is MPS1. At first import packages and load data.

from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import PandasTools
from rdkit.Chem import Draw
from radialscope import RadialScope as rs
from radialscope import draw_with_indeces
import matplotlib.pyplot as plt
from matplotlib.cm import ScalarMappable
import svgutils.compose as sc
from IPython.display import SVG, HTML
import numpy as np
import pandas as pd
df = pd.read_csv('./data/mps.csv', sep=';')

Then define core and template for data extraction and visualization. Next retrieve molecules and R group information which has defined core structure.

core = Chem.MolFromSmiles('c2cn(C1CCCCC1)c3ncncc23')
templ =  Chem.MolFromSmiles('c2cn(C1CCCCC1)c3nc(C)ncc23')
matchmols = []
AlogP = []
std_val = []
for idx, mol in enumerate(mols):
    if mol.HasSubstructMatch(core):
        matchmols.append(mol)
        AlogP.append(df.AlogP[idx])
        pIC50 = np.round(9-np.log10(df['Standard Value'][idx]),2)
        std_val.append(pIC50)

rg = []
for mol in matchmols:
    sc = Chem.ReplaceCore(mol,core)
    sc = Chem.MolToSmiles(sc).replace('[1*]', '~')
    rg.append(sc)

Then, define drawing settings, I defined white_cutoff to zero because I would like to parameter with white.

settings={ 'SMILESSTRING':Chem.MolToSmiles(templ),  # use methyl groups for your rests, smiles strings can't use labels
  'use_bw_atom_theme':False,  # draw all atoms black
 'use_bold_font': True, # replace all fonts wherever possible with bold text
 'white_cutoff': 0,  #make text of labels white if greater than this number
  'scalefactor':0.7  #scale the total plot in the end, for large molecule you will need to decrease this or make the viewbox bigger in a vector software such as Inkscape
}

Next, define R group information. I used pIC50 and AlogP as SAR information. And other setting is same as original example.

Attach_atom_id is key parameter, it defines where SAR information place.

atom index 0 is dummy methyl group of the template.

radial_scope_setup_R1= {
    'rest_label':"R$_1$", # Label of the atom in the radial scope plot
    'no_wedges':len(rg), # Number of wedges in the Radial scope Plot
    'coverangle_wedges':180, # Degrees of a full circle the Rscope should cover
    'startangle':-30, #Start angle of the Rscope 
    'CMAPINNER':"Blues",  # Colormap of the inner circle. Max value has full color, options see below
    'CMAPOUTER':"RdPu", # Colormap of the outer circle. Max value has full color
    'OUTERLABEL':"pIC50", # Label of the outer circle
    'INNERLABEL':"AlogP", # Label of the inner circle
    'value_inner_circle':AlogP,
    'value_outer_circle':std_val,
    'rounding':True, # ENABLE rounding if you want something like >99 
    'rounding_boundary':10,  # cutoff for displaying > 
    'value_groups':rg, # Labels for the outer circle, you can use math using $, any ~ will be interpreted as smiles string
    'attach_atom_id': 0,
    
}

Now almost there, type following code.

scope_plot=rs(settings, radial_scope_setup_R1) # add all radial_scope dictionaries to this call
SVG('substrate_scope_replaced.svg')

It works fine!

The package can visualize multiple substituents information. However I think it should be visualize one R group information per image. Because this way is difficult to understand combination of R groups.

I don’t understand the details of the code (it’s too hacky for me) but really useful for MedChem I think.

Thanks again for sharing the nice coode.

Today’s code was uploaded my gist and github repo.

https://github.com/iwatobipen/substrate-scope-plot/blob/master/MPS1.ipynb

Clustering molecules with HDBSCAN and predict chemical space with scikit-learn #chemoinformatics #RDKit

The contents of the post is almost same as yesterday’s one that was for souyaku-advent calendar 2019.

In drug discovery area we often discuss about compound chemical space. Chemist should be explore chemical space to find new IP space etc. It is defined with descriptors, fingerprint and biological response etc… It is often high dimensional space so we would like to reduce the dimension with some mathematical approach such as PCA, TSNE and UMAP.

And also we use clustering method for grouping molecules.

There are many clustering method for example k-means, hierarchical clustering. However k-means need to define k value and hierarchical method need to define threshold for doing that. So try and error is required. If you would like to get clustering label conveniently, I recommend to try to use HDBSCAN. HDBSCAN clusters data with their density. Ref URL is below.
https://link.springer.com/chapter/10.1007%2F978-3-642-37456-2_14

Fortunately we can use HDBSCAN from python. It is easy to install via pip or conda. Please check original document to install. ;)

I tried to demo with HDBSCAN. Following data is bollowed from chaper8 in py4chemoinformatics (Thanks@fmkz___ for providing data and code!)

Let’s write code! Import packages at first. I’ll use mlinsght later.

%matplotlib inline
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Draw
from rdkit import RDLogger
from sklearn.model_selection import train_test_split
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.manifold import TSNE
from mlinsights.mlmodel import PredictableTSNE
from hdbscan import HDBSCAN

sns.set_context('poster')
sns.set_style('white')
sns.set_color_codes()
plot_kwds = {'alpha' : 0.5, 's' : 80, 'linewidths':0}
RDLogger.DisableLog('rdApp.*')
seed = 794

I defined tanimoto-dis tfunction for HDBSCAN. HDBSCAN requires metric arg. for measuring distances of each data point. And there isn’t tanimoto distance as a default metric. If user would like to original metric function, pass ‘pyfunc’ for metric arugment and your own function is passed to func argument. Func should be took two numpy array and return distance.
@@@@@
My tanimoto dist function is tooooooooooooooooooooooooo slow it has room for improvement I fill improve it later. ;)
@@@@@

oxrs = [("CHEMBL3098111", "Merck" ),  ("CHEMBL3867477", "Merck" ),  ("CHEMBL2380240", "Rottapharm" ),
             ("CHEMBL3352684", "Merck" ),  ("CHEMBL3769367", "Merck" ),  ("CHEMBL3526050", "Actelion" ),
             ("CHEMBL3112474", "Actelion" ),  ("CHEMBL3739366", "Heptares" ),  ("CHEMBL3739395", "Actelion" ), 
             ("CHEMBL3351489", "Eisai" )]

fps = []
docs = []
companies = []
mol_list = []
for cid, company in oxrs:
    sdf_file = os.path.join("data", cid + ".sdf")
    mols = Chem.SDMolSupplier(sdf_file)
    for mol in mols:
        if mol is not None:
            mol_list.append(mol)
            fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
            arr = np.zeros((1,))
            DataStructs.ConvertToNumpyArray(fp, arr)
            docs.append(cid)
            companies.append(company)
            fps.append(arr)
fps = np.array(fps)
companies = np.array(companies)def tanimoto_dist(ar1, ar2):
    a = np.dot(ar1, ar2)
    b = ar1 + ar2 - ar1*ar2
    return 1 - a/np.sum(b)

clusterer = HDBSCAN(algorithm='best', min_samples=5, metric='pyfunc', func=tanimoto_dist)
clusterer.fit(fps)
docs = np.array(docs)

trainIDX, testIDX = train_test_split(range(len(fps)), random_state=seed)

Now ready, let’s check data without and with HDBSCAN. I made two plots one is coloured with company name and the other is coloured with HDBCAN label.

tsne = TSNE(random_state=seed)
res = tsne.fit_transform(fps)
plt.clf()
plt.figure(figsize=(12, 6))
sns.scatterplot(res[:,0], res[:,1], hue=companies, **plot_kwds)

plt.clf()
plt.figure(figsize=(12, 6))
palette = sns.color_palette()
cluster_colors = [sns.desaturate(palette[col], sat)
                 if col >= 0 else (0.5, 0.5, 0.5) for col, sat in zip(clusterer.labels_, clusterer.probabilities_)]
plt.scatter(res[:,0], res[:,1], c=cluster_colors, **plot_kwds)
coloured with company name
coloured with hdbscan label.

Coloured with Gray color point is not labelled data. The result shows HDBSCAN could do almost reasonable labelling in this case.

Plot new compound with defined chemical space

Next topic is how to plot new compound in pre-defined chemical space.
Theoretically, PCA and tSNE can’t add new data to pre-defined latent space. Re calculation is required if user would like to add new data. But we would like to map new designed compounds to current chemical space. Hmm…. how to do it?

The answer for the question is provided from mlinghts. The package has PredictableTSNE class. The class take transformer and estimator object. And it makes predictive model for transformer output (such as PCA, TSNE output) with estimator. It will be clear if you read source code. After learning, PredictableTSNE can predict latent space from data features.

Following code, I used random sampling and random forest and GP regressor for estimator.

trainFP = [fps[i] for i in trainIDX]
train_mol = [mol_list[i] for i in trainIDX]

testFP = [fps[i] for i in testIDX]
test_mol = [mol_list[i] for i in testIDX]
allFP = trainFP + testFP
tsne_ref = TSNE(random_state=seed)
res = tsne_ref.fit_transform(allFP)
plt.clf()
plt.figure(figsize=(12, 6))
sns.scatterplot(res[:,0], res[:,1], hue=['train' for i in range(len(trainFP))] + ['test' for i in range(len(testFP))])
blue is training data, orange is test data.

Use predictiveTSNE for new compound chemical space prediction. Following examples are in RF case and GP case.


rfr = RandomForestRegressor(random_state=seed)
tsne1 = TSNE(random_state=seed)
pred_tsne_rfr = PredictableTSNE(transformer=tsne1, estimator=rfr, keep_tsne_outputs=True)
pred_tsne_rfr.fit(trainFP, list(range(len(trainFP))))


pred1 = pred_tsne_rfr.transform(testFP)
plt.clf()
plt.figure(figsize=(12, 6))
plt.scatter(pred_tsne_rfr.tsne_outputs_[:,0], pred_tsne_rfr.tsne_outputs_[:,1], c='blue', alpha=0.5)
plt.scatter(pred1[:,0], pred1[:,1], c='red', alpha=0.5)


gbr = GaussianProcessRegressor(random_state=seed)
tsne2 = TSNE(random_state=seed)
pred_tsne_gbr = PredictableTSNE(transformer=tsne2, estimator=gbr, keep_tsne_outputs=True)
pred_tsne_gbr.fit(trainFP, list(range(len(trainFP)))pred2 = pred_tsne_gbr.transform(testFP)
plt.clf()
plt.figure(figsize=(12, 6))
plt.scatter(pred_tsne_gbr.tsne_outputs_[:,0], pred_tsne_gbr.tsne_outputs_[:,1], c='blue', alpha=0.5)
plt.scatter(pred2[:,0], pred2[:,1], c='red', alpha=0.5))
RF as an estimator
GP as an estimator

In above case, random forest seems better performance for new compound prediction.

The method of estimator has a big influence for preditiveTSNE performance. It is useful for plot new data but it has risk for miss-leading for design.

At the end, I checked cluster label and preditiveTSNE data.

allmol = train_mol + test_mol
fps2 = []
clusterer2 = HDBSCAN(algorithm='best', min_samples=5, metric='pyfunc', func=tanimoto_dist)
for mol in allmol:
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    fps2.append(arr)
fps2 = np.array(fps2)
clusterer2.fit(fps2)
plt.clf()
plt.figure(figsize=(12, 6))
plt.scatter(pred_tsne_rfr.tsne_outputs_[:,0], pred_tsne_rfr.tsne_outputs_[:,1], c=clusterer2.labels_[:len(trainFP)], alpha=0.5, marker='x')
plt.scatter(pred1[:,0], pred1[:,1], c=clusterer2.labels_[len(trainFP):], alpha=0.5, marker='o')


x training data, o test data, colour : class label.

Not only training data but also test data which has same colour are plotted almost same area. It indicates that the approach works well in this case.

Finally we often discuss chemicals pace and use the term ‘chemical space’ but it is very difficult to define, use and compare it.

All code is uploaded in my gist. Any comments and/or suggestions are greatly appreciated.