Last year, researchers at BioMed X published the following article.
https://chemrxiv.org/articles/Mol2vec_Unsupervised_Machine_Learning_Approach_with_Chemical_Intuition/5513581
Recently, the article was also published by ACS.
The concept of Mol2vec is the same as that of word2vec.
Word2vec converts words to vectors by training on a large text corpus, and it has been successful in NLP. Mol2vec converts molecules to vectors using ECFP (Morgan fingerprint) information.
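To make the analogy with word2vec a bit more concrete, here is a minimal sketch of how skip-gram style training pairs could be formed from one molecule "sentence". The helper `skipgram_pairs` and every identifier except 2246728737 are made up for illustration; the real training happens inside gensim and is more involved than this.

```python
def skipgram_pairs(sentence, window=1):
    """Return (center, context) pairs for tokens within `window` positions."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs


# Hypothetical molecule "sentence": substructure identifiers as strings.
sentence = ["2246728737", "1111111111", "2222222222"]
pairs = skipgram_pairs(sentence, window=1)
for center, context in pairs:
    print(center, "->", context)
```

Each identifier plays the role of a word, and the identifiers around it play the role of its context.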
Fortunately, the Mol2vec source code is available on GitHub.
I tried out mol2vec, following the original repository.
mol2vec wraps gensim, a Python library for topic modeling, so users can build a corpus and a model very easily.
https://radimrehurek.com/gensim/index.html
First, I made a corpus and a model from a very small dataset.
*The GitHub repository provides a model trained on the ZINC database. The following code is a sample of how to make a model with mol2vec.
from mol2vec.features import DfVec, mol2alt_sentence, mol2sentence, generate_corpus, insert_unk
from mol2vec.features import train_word2vec_model

generate_corpus("cdk2.sdf", "cdk2.corp", 1)
# I do not need to make an unk file because I used the same data for analysis.
# insert_unk('cdk2.corp', 'unk_cdk2.corp')
train_word2vec_model('cdk2.corp', 'cdk2word2vec.model', min_count=0)
Cool! It is easy to make a model.
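The insert_unk step that I commented out replaces rare identifiers with an UNK token, so that substructures the model has not seen can still be handled later. The sketch below shows the idea in plain Python. The function `replace_rare`, its threshold and the toy corpus are assumptions for illustration, not the actual mol2vec implementation.

```python
from collections import Counter


def replace_rare(sentences, threshold=3, unk="UNK"):
    """Replace identifiers that occur `threshold` times or fewer with `unk`."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return [
        [tok if counts[tok] > threshold else unk for tok in sent]
        for sent in sentences
    ]


# Toy corpus: "a" is common, "b" and "c" each appear only once.
corpus = [["a", "b", "a"], ["a", "c", "a"]]
cleaned = replace_rare(corpus, threshold=1)
print(cleaned)
```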
Next, I ran a t-SNE analysis and visualized the data.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from gensim.models import word2vec
from rdkit import Chem
from rdkit.Chem import Draw
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from mol2vec.features import mol2alt_sentence, MolSentence, DfVec, sentences2vec
from mol2vec.helpers import depict_identifier, plot_2D_vectors, IdentifierTable, mol_to_svg

# Model from the supplied data (skip-gram, window size 10, radius 1,
# UNK replaces all identifiers that appear fewer than 4 times):
# model = word2vec.Word2Vec.load('model_300dim.pkl')
model = word2vec.Word2Vec.load('cdk2word2vec.model')
print("****number of unique identifiers in the model: {}".format(len(model.wv.vocab.keys())))
# get feature vec of 2246728737
# print(model.wv.word_vec('2246728737'))

mols = [mol for mol in Chem.SDMolSupplier("cdk2.sdf")]
# encode molecules as sentences (represented by MorganFP)
sentences = [mol2alt_sentence(mol, 1) for mol in mols]
flat_list = [item for sublist in sentences for item in sublist]
mol_identifiers_unique = list(set(flat_list))
print("mols has {} identifiers and has {} unique identifiers".format(len(flat_list), len(mol_identifiers_unique)))

# make projection of 300D vectors to 2D using PCA/TSNE combination
df_vec = pd.DataFrame()
df_vec['identifier'] = mol_identifiers_unique
df_vec.index = df_vec['identifier']

pca_model = PCA(n_components=30)
# tsne_model = TSNE(n_components=2, perplexity=10, n_iter=1000, metric='cosine')
# I use sklearn v0.19, where metric='cosine' did not work. See the following Stack Overflow URL.
# https://stackoverflow.com/questions/46791191/valueerror-metric-cosine-not-valid-for-algorithm-ball-tree-when-using-sklea
tsne_model = TSNE(n_components=2, perplexity=10, n_iter=1000, metric='euclidean')
tsne_pca = tsne_model.fit_transform(
    pca_model.fit_transform([model.wv.word_vec(x) for x in mol_identifiers_unique]))
df_vec['PCA-tSNE-c1'] = tsne_pca.T[0]
df_vec['PCA-tSNE-c2'] = tsne_pca.T[1]
print(df_vec.head(4))
projections = df_vec.to_dict()

# Function that extracts projected values for plotting
def get_values(identifier, projections):
    return np.array((projections['PCA-tSNE-c1'][str(identifier)],
                     projections['PCA-tSNE-c2'][str(identifier)]))

# The index holds identifier strings, so take the first row by position.
print(get_values(df_vec["identifier"].iloc[0], projections))

# Substructure vectors are plotted in green and the molecule vector is plotted in cyan.
mol_vals = [get_values(x, projections) for x in sentences[0]]
subplt = plot_2D_vectors(mol_vals)
plt.savefig('plt.png')
plot_2D_vectors can visualize the vector of each fingerprint identifier and of the molecule.
I often use PCA to analyze chemical space, but the results of PCA depend on the dataset. Mol2vec, in contrast, builds the model first and then applies the data to it. This means Mol2vec can analyze different datasets on the same basis (the same model).
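One way to picture "the same basis": once the model is trained, the substructure vectors are fixed, and a molecule vector is built by summing the vectors of its identifiers (as far as I understand, this is what sentences2vec does). Below is a minimal sketch with a made-up 2D embedding table. The names `embedding` and `mol_vector`, the identifiers and all the numbers are hypothetical.

```python
# Toy embedding table standing in for a trained model (values are made up).
embedding = {
    "id_a": (1.0, 0.0),
    "id_b": (0.0, 2.0),
    "UNK": (0.5, 0.5),
}


def mol_vector(sentence, table, unseen="UNK"):
    """Sum the vectors of all identifiers; unknown ones fall back to `unseen`."""
    total = [0.0] * len(table[unseen])
    for tok in sentence:
        vec = table.get(tok, table[unseen])
        total = [t + v for t, v in zip(total, vec)]
    return total


dataset_1 = [["id_a", "id_b"]]
dataset_2 = [["id_a", "id_new"]]  # contains an identifier the model never saw

# Both datasets are embedded with the same fixed table,
# so their vectors live in the same coordinate system.
vecs_1 = [mol_vector(s, embedding) for s in dataset_1]
vecs_2 = [mol_vector(s, embedding) for s in dataset_2]
print(vecs_1, vecs_2)
```

With PCA, the axes themselves would change whenever the dataset changes. Here only the input molecules change.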
The example code in this post is very simple and shows only a small part of Mol2vec.
BTW, word2vec supports vector arithmetic on words, for example japan − tokyo + paris ≈ france.
How about Mol2vec? mol A − mol B + mol C = mol D? I think it may be difficult, because the fingerprint algorithm uses a hash function.
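For readers who have not seen the word arithmetic before, here is a minimal sketch of the query japan − tokyo + paris with hand-made 3D vectors. The vectors and the `analogy` helper are invented for illustration. A real model learns its vectors from data, and the match is only approximate.

```python
import math

# Hand-made vectors: (japan-ness, france-ness, capital-ness).
vectors = {
    "tokyo": (1.0, 0.0, 1.0),
    "japan": (1.0, 0.0, 0.0),
    "paris": (0.0, 1.0, 1.0),
    "france": (0.0, 1.0, 0.0),
    "berlin": (0.0, 0.1, 1.0),
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def analogy(positive, negative, table):
    """Return the word closest to sum(positive) - sum(negative)."""
    dim = len(next(iter(table.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, table[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, table[word])]
    # Exclude the query words themselves, as is usual for analogy queries.
    candidates = [w for w in table if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(query, table[w]))


result = analogy(["japan", "paris"], ["tokyo"], vectors)
print(result)
```

With a gensim model, the same kind of query is written as model.wv.most_similar(positive=[...], negative=[...]). For Mol2vec the result would be a vector, and it is not obvious how to turn that vector back into a molecule.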
Readers who are interested in the package, please visit the official repository and enjoy. ;-)
http://mol2vec.readthedocs.io/en/latest/index.html?highlight=features#features-main-mol2vec-module
Interesting, but I didn’t get why we need to calculate mol A − mol B + mol C = mol D.
Thank you for your comment. I find the equation interesting because, if it works, it means that DL can extract the features of each molecule and build a molecule with selected features: highly potent but toxic features − toxic features + different scaffold features = a potent and non-toxic molecule. I feel this is the same way medicinal chemists design molecules.
Oh, I can’t wait to try it now :)
Thank you for your reply!