A new way to plot chemical space

I often use principal component analysis (PCA) for dimensionality reduction.
Of course, PCA is not the only option: there are also MDS, kernel PCA, Isomap, Sammon mapping, ICA, and others.
A few days ago, I found an interesting method called t-SNE (t-Distributed Stochastic Neighbor Embedding).
The key idea of t-SNE is that it uses a Student's t-distribution instead of a normal distribution to model similarities between points in the low-dimensional map.
PCA is a linear method, whereas t-SNE is a nonlinear method.
If you are interested in t-SNE, please refer to the following paper.
https://lvdmaaten.github.io/publications/papers/JMLR_2008.pdf
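
To illustrate why the heavier tails matter, here is a minimal sketch (not taken from the paper's code) that compares an unnormalised Gaussian kernel with the Student's t kernel with one degree of freedom. The function names are made up for this illustration.

```python
import math

def gaussian_kernel(d, sigma=1.0):
    """Unnormalised Gaussian similarity for a distance d."""
    return math.exp(-d ** 2 / (2.0 * sigma ** 2))

def student_t_kernel(d):
    """Unnormalised Student's t similarity (one degree of freedom)."""
    return 1.0 / (1.0 + d ** 2)

# Compare how quickly each similarity decays as the distance grows.
for d in (0.5, 1.0, 2.0, 4.0):
    print(f"d={d:.1f}  gaussian={gaussian_kernel(d):.4f}  student_t={student_t_kernel(d):.4f}")
```

At larger distances the t kernel decays much more slowly than the Gaussian one, so moderately dissimilar points can sit far apart in the map without a large penalty.
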
t-SNE works well in image recognition. BTW, I could not find much information about its use in chemistry.
So I tried to visualise chemical space using t-SNE. Fortunately, scikit-learn implements t-SNE!
I wrote an example to do that.
The code is as follows.
I used GSK3b inhibitors as the dataset.
First, with PCA:

import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from sklearn.decomposition import PCA
import pickle, sys
import seaborn as sns
# Skip entries that RDKit could not parse (the supplier yields None for them).
drugs = [ mol for mol in Chem.SDMolSupplier( "gsk3b.sdf" ) if mol is not None ]

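def calc_fp_arr( mols ):
    # Convert each molecule to a Morgan fingerprint (radius 2, 2048 bits by default)
    # and stack the bit vectors into a 2D numpy array, one row per molecule.
    fplist = []
    for mol in mols:
        arr = np.zeros( (1,) )
        fp = AllChem.GetMorganFingerprintAsBitVect( mol, 2 )
        # ConvertToNumpyArray resizes arr to the fingerprint length.
        DataStructs.ConvertToNumpyArray( fp, arr )
        fplist.append( arr )
    return np.asarray( fplist )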

res=calc_fp_arr(drugs)
pca = PCA( n_components = 3 )
pca.fit( res )
with open( 'pca.pkl', 'wb' ) as f:
    pickle.dump( pca, f )


x = pca.transform( res )

act = []
with open( "pcares.txt", "w" ) as f:
    for i in range( x.shape[0] ):
        # One CSV row per molecule: SMILES, activity value, PC1, PC2
        value = drugs[i].GetProp( "STANDARD_VALUE" )
        line = Chem.MolToSmiles( drugs[i] ) + "," + value + "," + str( x[i][0] )  + "," + str( x[i][1] ) + "\n"
        f.write( line )
        act.append( float( value ) )
x = pd.DataFrame( x )
x.columns = [ 'PC1', 'PC2', 'PC3' ]
x["IC50"] = act

# Recent seaborn versions require x and y to be passed as keywords.
g = sns.jointplot( x = "PC1", y = "PC2", data = x )

g.savefig( "PCA.png" )
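
As a side check, it can be useful to see how much variance the first components capture, since a 2D PCA plot of 2048-bit fingerprints often explains only a small share of the total. A minimal sketch, using random bits as a stand-in for the real fingerprint matrix `res` (the random data and the variable names are assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for the fingerprint matrix: 100 "molecules" x 2048 bits.
fake_fps = rng.integers(0, 2, size=(100, 2048)).astype(float)

pca = PCA(n_components=3)
coords = pca.fit_transform(fake_fps)

# Fraction of the total variance captured by each component.
ratios = pca.explained_variance_ratio_
for i, r in enumerate(ratios, start=1):
    print(f"PC{i}: {r:.3%}")
print(f"total: {ratios.sum():.3%}")
```

With the real data, the same attribute is available on the fitted `pca` object from the script above.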

Next, with t-SNE:

import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from sklearn.manifold import TSNE
import pickle, sys
import seaborn as sns
# Skip entries that RDKit could not parse (the supplier yields None for them).
drugs = [ mol for mol in Chem.SDMolSupplier( "gsk3b.sdf" ) if mol is not None ]

def calc_fp_arr( mols ):
    fplist = []
    for mol in mols:
        arr = np.zeros( (1,) )
        fp = AllChem.GetMorganFingerprintAsBitVect( mol, 2 )
        DataStructs.ConvertToNumpyArray( fp, arr )
        fplist.append( arr )
    return np.asarray( fplist )

res=calc_fp_arr(drugs)
# Note: scikit-learn 1.5 renamed n_iter to max_iter; on recent versions use max_iter=2000 instead.
model = TSNE( n_components=3, init="pca",  n_iter=2000 )
#model = TSNE( n_components=2,random_state=42 ,  n_iter=1000 )
x = model.fit_transform( res )
with open( 'pca_tsne.pkl', 'wb' ) as f:
    pickle.dump( model, f )



act = []
with open( "pca_tsneres.txt", "w" ) as f:
    for i in range( x.shape[0] ):
        # One CSV row per molecule: SMILES, activity value, A1, A2
        value = drugs[i].GetProp( "STANDARD_VALUE" )
        line = Chem.MolToSmiles( drugs[i] ) + "," + value + "," + str( x[i][0] )  + "," + str( x[i][1] ) + "\n"
        act.append( float( value ) )
        f.write( line )
x = pd.DataFrame( x )
x.columns = [ 'A1', 'A2', 'A3'  ]

# Recent seaborn versions require x and y to be passed as keywords.
g = sns.jointplot( x = "A1", y = "A2", data = x )
g.savefig( "TSNE.png" )
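
t-SNE results depend on the random seed and on `perplexity`, so for a 2D plot it may help to fix both. The sketch below is one possible variant, not the script above: it uses random bits in place of the real fingerprints, and it feeds t-SNE a precomputed Jaccard distance matrix, which on bit vectors equals 1 - Tanimoto similarity. The data and the parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
# Stand-in for real fingerprints: 60 "molecules" x 256 bits.
fake_fps = rng.integers(0, 2, size=(60, 256)).astype(bool)

# Jaccard distance on boolean vectors is 1 - Tanimoto similarity.
dist = pairwise_distances(fake_fps, metric="jaccard")

# metric="precomputed" requires a non-PCA initialisation, hence init="random".
# perplexity must be smaller than the number of samples.
model = TSNE(
    n_components=2,
    metric="precomputed",
    init="random",
    perplexity=10,
    random_state=42,
)
emb = model.fit_transform(dist)
print(emb.shape)
```

Whether Tanimoto-based distances give a nicer map than the default Euclidean distance depends on the dataset, so it is worth trying both.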

Now I have the following two images.
In this case, PCA seems better than t-SNE, but when I tried an in-house dataset, t-SNE performed well.
I want to implement GTM in Python.
Which method do you use to represent chemical space?

PCA
TSNE
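
Judging the two plots by eye is subjective. One rough way to put a number on it is scikit-learn's `trustworthiness` score, which measures how well local neighbourhoods are preserved in the embedding (1.0 is best). A minimal sketch on random stand-in data; the data and the parameter values are assumptions, and the scores say nothing about the GSK3b results above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness

rng = np.random.default_rng(0)
# Stand-in for real fingerprints: 80 "molecules" x 256 bits.
fake_fps = rng.integers(0, 2, size=(80, 256)).astype(float)

# Project the same data to 2D with both methods.
pca_emb = PCA(n_components=2).fit_transform(fake_fps)
tsne_emb = TSNE(
    n_components=2, init="pca", perplexity=15, random_state=0
).fit_transform(fake_fps)

# Score how well each 2D map keeps the 5 nearest neighbours of every point.
pca_score = trustworthiness(fake_fps, pca_emb, n_neighbors=5)
tsne_score = trustworthiness(fake_fps, tsne_emb, n_neighbors=5)
print(f"PCA:   {pca_score:.3f}")
print(f"t-SNE: {tsne_score:.3f}")
```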

The code is pushed to my repo.
https://github.com/iwatobipen/chemo_info/tree/master/chemicalspace2
😉
