A new way to plot chemical space.

I often use Principal Component Analysis (PCA) for dimensionality reduction.
Of course, PCA is not the only option: there are also MDS, kernel PCA, Isomap, Sammon mapping, ICA, and others.
A few days ago, I found an interesting method called t-SNE (t-Distributed Stochastic Neighbor Embedding).
The key idea of t-SNE is that it uses a Student's t-distribution instead of a normal distribution to model similarities between points in the low-dimensional map.
PCA is a linear method, whereas t-SNE is a nonlinear method.
If you are interested in t-SNE, please refer to the following paper.
https://lvdmaaten.github.io/publications/papers/JMLR_2008.pdf
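
To get a feeling for why the choice of distribution matters, here is a minimal sketch that compares the two unnormalised similarity kernels as a function of the distance between two points. The function names and the `sigma` parameter are my own, for illustration only; this is not how scikit-learn implements t-SNE internally.

```python
import math

def gaussian_kernel(d, sigma=1.0):
    """Unnormalised Gaussian similarity for a pairwise distance d.

    sigma is an illustrative bandwidth chosen for this sketch.
    """
    return math.exp(-(d ** 2) / (2.0 * sigma ** 2))

def student_t_kernel(d):
    """Unnormalised Student-t similarity with one degree of freedom."""
    return 1.0 / (1.0 + d ** 2)

# Compare how fast each similarity decays as two points move apart.
for d in (0.5, 1.0, 2.0, 4.0):
    g = gaussian_kernel(d)
    t = student_t_kernel(d)
    print(f"d={d:3.1f}  gaussian={g:.4f}  student_t={t:.4f}")
```

The t kernel decays much more slowly at large distances (a heavy tail), so moderately dissimilar points can sit far apart in the map without a large penalty.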
t-SNE works well for image recognition data. BTW, I could not find much information about its use in chemistry.
So I tried to visualise chemical space using t-SNE. Fortunately, scikit-learn implements t-SNE!
I wrote an example to do that.
The code follows.
I used GSK3b inhibitors as the dataset.
With PCA:

import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from sklearn.decomposition import PCA
import pickle
import seaborn as sns

# Load the molecules, skipping entries that RDKit could not parse.
drugs = [ mol for mol in Chem.SDMolSupplier( "gsk3b.sdf" ) if mol is not None ]

def calc_fp_arr( mols ):
    # Convert each molecule to a Morgan fingerprint (radius 2) as a numpy array.
    fplist = []
    for mol in mols:
        arr = np.zeros( (1,) )
        fp = AllChem.GetMorganFingerprintAsBitVect( mol, 2 )
        DataStructs.ConvertToNumpyArray( fp, arr )
        fplist.append( arr )
    return np.asarray( fplist )

res = calc_fp_arr( drugs )
pca = PCA( n_components = 3 )
pca.fit( res )
# Save the fitted model for later reuse.
with open( 'pca.pkl', 'wb' ) as f:
    pickle.dump( pca, f )

x = pca.transform( res )

# Write SMILES, activity and the first two components, one molecule per line.
act = []
with open( "pcares.txt", "w" ) as f:
    for i in range( x.shape[0] ):
        line = Chem.MolToSmiles( drugs[i] ) + "," + drugs[i].GetProp( "STANDARD_VALUE" ) + "," + str( x[i][0] )  + "," + str( x[i][1] ) + "\n"
        f.write( line )
        act.append( float( drugs[i].GetProp( "STANDARD_VALUE" ) ) )

x = pd.DataFrame( x )
x.columns = [ 'PC1', 'PC2', 'PC3' ]
x["IC50"] = act

# Keyword arguments are required by recent seaborn versions.
g = sns.jointplot( x = "PC1", y = "PC2", data = x )
g.savefig( "PCA.png" )
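
Before trusting a 2D PCA plot, it can help to check how much of the variance the first few components actually capture. The sketch below does this with `explained_variance_ratio_` from scikit-learn. Note that `fake_fps` is a random bit matrix I made up as a stand-in for the fingerprint array, so the numbers it prints say nothing about the gsk3b data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for the fingerprint matrix: 200 "molecules",
# 256 bits each, roughly 10% of bits set. Not the real gsk3b data.
rng = np.random.default_rng(0)
fake_fps = (rng.random((200, 256)) < 0.1).astype(float)

pca = PCA(n_components=3)
coords = pca.fit_transform(fake_fps)

# Fraction of the total variance captured by each component.
ratio = pca.explained_variance_ratio_
print("variance explained per component:", ratio)
print("total for 3 components:", ratio.sum())
```

With sparse binary fingerprints the first components often explain only a small part of the variance, which is worth keeping in mind when reading the plot.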

With t-SNE:

import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from sklearn.manifold import TSNE
import pickle
import seaborn as sns

# Load the molecules, skipping entries that RDKit could not parse.
drugs = [ mol for mol in Chem.SDMolSupplier( "gsk3b.sdf" ) if mol is not None ]

def calc_fp_arr( mols ):
    # Convert each molecule to a Morgan fingerprint (radius 2) as a numpy array.
    fplist = []
    for mol in mols:
        arr = np.zeros( (1,) )
        fp = AllChem.GetMorganFingerprintAsBitVect( mol, 2 )
        DataStructs.ConvertToNumpyArray( fp, arr )
        fplist.append( arr )
    return np.asarray( fplist )

res = calc_fp_arr( drugs )
# Note: recent scikit-learn releases rename n_iter to max_iter.
model = TSNE( n_components=3, init="pca",  n_iter=2000 )
#model = TSNE( n_components=2,random_state=42 ,  n_iter=1000 )
x = model.fit_transform( res )
# Save the fitted model.
with open( 'pca_tsne.pkl', 'wb' ) as f:
    pickle.dump( model, f )

# Write SMILES, activity and the first two embedding axes, one molecule per line.
act = []
with open( "pca_tsneres.txt", "w" ) as f:
    for i in range( x.shape[0] ):
        line = Chem.MolToSmiles( drugs[i] ) + "," + drugs[i].GetProp( "STANDARD_VALUE" ) + "," + str( x[i][0] )  + "," + str( x[i][1] ) + "\n"
        act.append( float( drugs[i].GetProp( "STANDARD_VALUE" ) ) )
        f.write( line )

x = pd.DataFrame( x )
x.columns = [ 'A1', 'A2', 'A3'  ]

# Keyword arguments are required by recent seaborn versions.
g = sns.jointplot( x = "A1", y = "A2", data = x )
g.savefig( "TSNE.png" )
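
A t-SNE map depends on its settings, especially `perplexity` and the random seed, so it is probably worth trying a few values before judging the result. Here is a minimal sketch of such a sweep. Again `fake_fps` is a made-up random bit matrix rather than the gsk3b fingerprints, and the perplexity values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical stand-in for the fingerprint matrix: 60 "molecules",
# 128 bits each. Not the real gsk3b data.
rng = np.random.default_rng(0)
fake_fps = (rng.random((60, 128)) < 0.1).astype(float)

# Embed the same data with different perplexity values.
# random_state is fixed so that each run is reproducible.
embeddings = {}
for perplexity in (5, 15):
    model = TSNE(n_components=2, perplexity=perplexity,
                 init="pca", random_state=42)
    embeddings[perplexity] = model.fit_transform(fake_fps)
    print(perplexity, embeddings[perplexity].shape)
```

Each entry of `embeddings` can then be plotted in the same way as above to see how stable the clusters are across settings.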

Now I have the following two images.
In this case, PCA seems to work better than t-SNE, but when I tried an in-house dataset, t-SNE performed well.
I want to implement GTM (Generative Topographic Mapping) in Python.
Which method do you use to represent chemical space?

[Figure: PCA plot (PCA.png)]
[Figure: t-SNE plot (TSNE.png)]

The code is pushed to my repo.
https://github.com/iwatobipen/chemo_info/tree/master/chemicalspace2
😉
