mol2vec: an analogue of word2vec #RDKit

Last year, a researcher at BioMed X published the following article.
Recently, the article was also published by ACS.
The concept of Mol2vec is the same as that of word2vec.
Word2vec converts words to vectors by training on a large text corpus, and it has been successful in NLP. Mol2vec converts molecules to vectors using ECFP (Morgan fingerprint) information.
Fortunately, the Mol2vec source code is available on GitHub.
I tried mol2vec, following the original repository.
mol2vec wraps gensim, a Python library for topic modeling, so users can build a corpus and a model very easily.
First, I built a corpus and a model from a very small dataset.
*The GitHub repo provides a model trained on the ZINC DB. The following code is a sample of how to build a model with mol2vec.

from mol2vec.features import generate_corpus, insert_unk, train_word2vec_model

# Build the corpus from the SD file using Morgan identifiers of radius 1.
generate_corpus("cdk2.sdf", "cdk2.corp", 1)
# I do not need to make an UNK file because I use the same data for the analysis.
# insert_unk('cdk2.corp', 'unk_cdk2.corp')
# min_count=0 keeps every identifier, even those that appear only once.
train_word2vec_model('cdk2.corp', 'cdk2word2vec.model', min_count=0)

Cool! It is easy to build a model.
Next, I ran a t-SNE analysis and visualized the data.
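I skipped the insert_unk step above, so here is a minimal sketch of what I understand it to do conceptually: identifiers that are rare in the corpus are replaced by an "UNK" token. The function name replace_rare_identifiers, the toy identifiers and the exact threshold rule are my own assumptions for illustration, not the mol2vec implementation.

```python
from collections import Counter


def replace_rare_identifiers(sentences, threshold=3, unk="UNK"):
    """Replace identifiers seen fewer than `threshold` times with `unk`.

    Hypothetical helper; the boundary rule in mol2vec itself may differ.
    """
    counts = Counter(token for sentence in sentences for token in sentence)
    return [
        [token if counts[token] >= threshold else unk for token in sentence]
        for sentence in sentences
    ]


# Toy corpus: each inner list stands for the identifiers of one molecule.
toy_corpus = [
    ["id_a", "id_b", "id_c"],
    ["id_a", "id_b"],
    ["id_a", "id_d"],
]
cleaned = replace_rare_identifiers(toy_corpus, threshold=2)
print(cleaned)
```

In this toy corpus id_c and id_d appear only once, so they are the ones that become "UNK".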

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from gensim.models import word2vec
from rdkit import Chem
from rdkit.Chem import Draw
from mol2vec.features import mol2alt_sentence, MolSentence, DfVec, sentences2vec
from mol2vec.helpers import depict_identifier, plot_2D_vectors, IdentifierTable, mol_to_svg

# Model supplied by the repo (skip-gram, window size of 10, radius 1, UNK to replace all identifiers that appear fewer than 4 times)
#model = word2vec.Word2Vec.load('model_300dim.pkl')
model = word2vec.Word2Vec.load('cdk2word2vec.model')
print("****number of unique identifiers in the model: {}".format(len(model.wv.vocab.keys())))
# get feature vec of 2246728737

# SDMolSupplier yields None for records it cannot parse, so skip those.
mols = [ mol for mol in Chem.SDMolSupplier("cdk2.sdf") if mol is not None ]

# encode molecules as sentences (sequences of Morgan fingerprint identifiers)
sentences = [ mol2alt_sentence(mol, 1) for mol in mols ]
flat_list = [ item for sublist in sentences for item in sublist ]
mol_identifiers_unique = list(set( flat_list ))
print("mols has {} identifiers and has {} unique identifiers".format(len(flat_list), len(mol_identifiers_unique) ))
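# project the identifier vectors to 2D using a PCA/t-SNE combination
# (300D refers to the supplied model_300dim.pkl; my small model uses the default training settings)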
df_vec = pd.DataFrame()
df_vec['identifier'] = mol_identifiers_unique
df_vec.index = df_vec['identifier']

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

pca_model = PCA(n_components=30)
#tsne_model = TSNE(n_components=2, perplexity=10, n_iter=1000, metric='cosine')
# I use scikit-learn v0.19, where metric='cosine' did not work (see the following URL / Stack Overflow), so I use 'euclidean' instead.
tsne_model = TSNE(n_components=2, perplexity=10, n_iter=1000, metric='euclidean')

tsne_pca = tsne_model.fit_transform( pca_model.fit_transform([model.wv.word_vec(x) for x in mol_identifiers_unique]))

df_vec['PCA-tSNE-c1'] = tsne_pca.T[0]
df_vec['PCA-tSNE-c2'] = tsne_pca.T[1]
projections = df_vec.to_dict()
# Function that extracts projected values for plotting
def get_values(identifier, projections):
    return np.array((projections['PCA-tSNE-c1'][str(identifier)],projections['PCA-tSNE-c2'][str(identifier)]))

# Substructure vectors are plotted in green and molecule vectors are plotted in cyan.

mol_vals = [ get_values(x, projections) for x in sentences[0]]
subplt = plot_2D_vectors( mol_vals )

plot_2D_vectors visualizes the vector of each fingerprint identifier together with the vector of the molecule.
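A note on versions: the listing above uses model.wv.vocab, which exists in gensim 3.x but, as far as I know, was replaced by key_to_index in gensim 4.x. The sketch below is a small helper that copes with both. The helper name vocab_size is hypothetical, and the two SimpleNamespace objects are only stand-ins for a real model.wv so that the snippet runs without a trained model.

```python
from types import SimpleNamespace


def vocab_size(wv):
    """Return the number of identifiers in a gensim KeyedVectors-like object."""
    if hasattr(wv, "key_to_index"):  # gensim 4.x style
        return len(wv.key_to_index)
    return len(wv.vocab)  # gensim 3.x style


# Stand-ins for model.wv (assumption: only the attributes used above matter).
new_style = SimpleNamespace(key_to_index={"id_a": 0, "id_b": 1})
old_style = SimpleNamespace(vocab={"id_a": None, "id_b": None, "id_c": None})

print(vocab_size(new_style), vocab_size(old_style))
```

With a real model, the call would be vocab_size(model.wv). For the vectors themselves, I believe plain indexing such as model.wv[identifier] works in both versions.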
I often use PCA to analyze chemical space, but the results of PCA depend on the dataset. Mol2vec, in contrast, builds a model first and then applies data to that model. This means Mol2vec can analyze different datasets on the same basis (the same model).
My example code is very simple and shows only a small part of Mol2vec.
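To make "apply data to the model" a bit more concrete: as far as I understand, the vector of a molecule is obtained by summing the vectors of its substructure identifiers, and identifiers the model has never seen fall back to the "UNK" vector (this is my reading of sentences2vec in mol2vec). Here is a minimal pure-Python sketch of that idea. The function sentence_to_vector and the toy embedding values are made up for illustration.

```python
def sentence_to_vector(sentence, embedding, unk="UNK"):
    """Sum the vectors of all identifiers in one molecule sentence.

    Identifiers missing from `embedding` use the vector stored under `unk`.
    """
    dim = len(embedding[unk])
    total = [0.0] * dim
    for token in sentence:
        vec = embedding.get(token, embedding[unk])
        total = [t + v for t, v in zip(total, vec)]
    return total


# Toy 2D embedding (assumption: real models use far more dimensions).
toy_embedding = {
    "id_a": [1.0, 0.0],
    "id_b": [0.0, 2.0],
    "UNK": [0.5, 0.5],
}

# "id_x" is not in the embedding, so it contributes the UNK vector.
mol_vec = sentence_to_vector(["id_a", "id_b", "id_x"], toy_embedding)
print(mol_vec)
```

Because the embedding is fixed once the model is trained, molecules from different datasets end up in the same vector space.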
BTW, word2vec can do arithmetic on word vectors, for example tokyo - japan + france = paris.
How about Mol2vec? mol A - mol B + mol C = mol D? Maybe it is difficult, because the fingerprint algorithm uses a hash function, so I think a vector cannot be mapped back to a structure directly.
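For readers who have not seen the word2vec arithmetic before, here is a toy sketch of how such an analogy is usually evaluated: compute a - b + c and pick the remaining entry with the highest cosine similarity. The 2D vectors are hand-made for illustration and do not come from any trained model. The same lookup would apply to molecule vectors, but it can only return a molecule that is already in the table, not generate a new one.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the key closest to vectors[a] - vectors[b] + vectors[c].

    The three query keys themselves are excluded from the candidates.
    """
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [k for k in vectors if k not in (a, b, c)]
    return max(candidates, key=lambda k: cosine(target, vectors[k]))


# Hand-made toy vectors: first axis ~ country, second axis ~ "is a capital".
toy_vectors = {
    "tokyo": [1.0, 1.0],
    "japan": [1.0, 0.0],
    "paris": [3.0, 1.0],
    "france": [3.0, 0.0],
    "berlin": [5.0, 1.0],
}

result = analogy("tokyo", "japan", "france", toy_vectors)
print(result)
```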
Readers who are interested in the package, please visit the official repository and enjoy. ;-)

  1. Interesting, but I didn’t get the point why we need to calculate mol A – mol B + mol C = mol D

    1. Thank you for your comment. I find the equation interesting because, if it works, it means that DL can extract the features of each molecule and build a molecule that has the selected features: highly potent but toxic features − toxic features + different scaffold features = potent and non-toxic molecule. I feel it is the same way medicinal chemists design molecules.

      1. Oh, I can’t wait to try it now :)
        Thank you for your reply!

