mol2vec analogy of word2vec #RDKit

Last year a researcher at BioMed X published the following article.
And recently the article was published by ACS.
The concept of mol2vec is the same as that of word2vec.
Word2vec converts words to vectors using a large text corpus and has shown success in NLP. Mol2vec converts molecules to vectors using ECFP information.
Fortunately, the mol2vec source code is uploaded to github.
I played with mol2vec by referring to the original repository.
mol2vec wraps gensim, a python library for topic modeling, so users can make a corpus and a model very easily.
At first I made a corpus and a model using a very small dataset.
*The github repo provides a model built from the ZINC DB. The following code is a sample of how to make a model with mol2vec.

from mol2vec.features import DfVec, mol2alt_sentence, mol2sentence, generate_corpus, insert_unk
from mol2vec.features import train_word2vec_model
generate_corpus("cdk2.sdf", "cdk2.corp", 1 )
#I do not need to make an unk file because I used the same data for analysis.
#insert_unk('cdk2.corp', 'unk_cdk2.corp')
train_word2vec_model('cdk2.corp', 'cdk2word2vec.model', min_count=0)

Cool! It is easy to make a model.
Next, I conducted t-SNE analysis and visualized the data.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from gensim.models import word2vec
from rdkit import Chem
from rdkit.Chem import Draw
from mol2vec.features import mol2alt_sentence, MolSentence, DfVec, sentences2vec
from mol2vec.helpers import depict_identifier, plot_2D_vectors, IdentifierTable, mol_to_svg

#from supplied data (skip-gram, window size of 10, radius 1, UNK to replace all identifiers that appear less than 4 times)
#model = word2vec.Word2Vec.load('model_300dim.pkl')
model = word2vec.Word2Vec.load('cdk2word2vec.model')
print("****number of unique identifiers in the model: {}".format(len(model.wv.vocab.keys())))
# get the feature vector of identifier 2246728737
print(model.wv.word_vec('2246728737'))

mols = [ mol for mol in Chem.SDMolSupplier("cdk2.sdf")]

# encode molecules as sentences (represented by Morgan FP identifiers)
sentences = [ mol2alt_sentence(mol, 1) for mol in mols ]
flat_list = [ item for sublist in sentences for item in sublist ]
mol_identifiers_unique = list(set( flat_list ))
print("mols has {} identifiers and has {} unique identifiers".format(len(flat_list), len(mol_identifiers_unique) ))
# make projection of 300D vectors to 2D using PCA/TSNE combination
df_vec = pd.DataFrame()
df_vec['identifier'] = mol_identifiers_unique
df_vec.index = df_vec['identifier']

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

pca_model = PCA(n_components=30)
#tsne_model = TSNE(n_components=2, perplexity=10, n_iter=1000, metric='cosine')
# with sklearn v0.19, metric='cosine' did not work; see the following URL / SOF.
tsne_model = TSNE(n_components=2, perplexity=10, n_iter=1000, metric='euclidean')

tsne_pca = tsne_model.fit_transform( pca_model.fit_transform([model.wv.word_vec(x) for x in mol_identifiers_unique]))

df_vec['PCA-tSNE-c1'] = tsne_pca.T[0]
df_vec['PCA-tSNE-c2'] = tsne_pca.T[1]
projections = df_vec.to_dict()
# Function that extracts projected values for plotting
def get_values(identifier, projections):
    return np.array((projections['PCA-tSNE-c1'][str(identifier)],projections['PCA-tSNE-c2'][str(identifier)]))

# substructure vectors are plotted in green and molecule vectors in cyan.

mol_vals = [ get_values(x, projections) for x in sentences[0]]
subplt = plot_2D_vectors( mol_vals )

plot_2D_vectors can visualize the vectors of each fingerprint and molecule.
I often use PCA to analyze chemical space, but the result of PCA depends on the dataset. Mol2vec, by contrast, builds the model first and then applies the data to the model, which means mol2vec can analyze different datasets on the same basis (the model).
My example code is very simple and shows only a small part of mol2vec.
BTW, word2vec can do arithmetic on word vectors, for example tokyo - japan + paris = france.
How about mol2vec? mol A - mol B + mol C = mol D???? Maybe it is difficult, I think, because the fingerprint algorithm uses a hash function.
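As a toy illustration of why the arithmetic is at least well defined: in mol2vec a molecule vector is just the sum of its substructure identifier vectors, so adding and subtracting molecule vectors is ordinary vector arithmetic. The identifiers and vectors below are made up for illustration; turning the resulting vector back into a molecule is the hard part.

```python
# Toy sketch, not the mol2vec API: identifiers and vectors are invented.
# In mol2vec a molecule "sentence" is a list of Morgan substructure
# identifiers, and the molecule vector is the sum of their vectors.
toy_embeddings = {
    "id_a": [1.0, 0.0],
    "id_b": [0.0, 1.0],
    "id_c": [2.0, 1.0],
}

def sentence2vec(sentence, embeddings):
    """Sum the embedding vectors of every identifier in the sentence."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for identifier in sentence:
        for i, value in enumerate(embeddings[identifier]):
            total[i] += value
    return total

mol_a = sentence2vec(["id_a", "id_b"], toy_embeddings)  # [1.0, 1.0]
mol_b = sentence2vec(["id_b"], toy_embeddings)          # [0.0, 1.0]
# "mol A - mol B" is then plain element-wise vector arithmetic
diff = [a - b for a, b in zip(mol_a, mol_b)]            # [1.0, 0.0]
print(diff)
```

So the subtraction itself is easy; mapping the result back to a valid molecule is where the hashed identifiers get in the way.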
Readers who are interested in the package, please visit the official repository and enjoy. 😉


Simple way for making SMILES file #RDKit

To convert an SDF to SMILES, I used to write code like the following.

sdf = Chem.SDMolSupplier('some.sdf')
with open('smiles.smi', 'w') as f:
    for mol in sdf:
        smi = Chem.MolToSmiles(mol)
        f.write(smi + '\n')

In this way, to write SMILES strings with properties, each property has to be retrieved with GetProp("some prop"). If I need several properties, my code tends to get long.
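Just as a reference point, here is a self-contained sketch of that manual approach. The molecule and its "Cluster" property are set by hand to keep the snippet runnable; real code would read them from the SDF.

```python
from rdkit import Chem

# build a molecule in memory instead of reading an SDF, just to keep
# the snippet self-contained; SDF records would carry these props already
mol = Chem.MolFromSmiles('CC(C)C(=O)COc1nc(N)nc2[nH]cnc12')
mol.SetProp('_Name', 'ZINC03814457')
mol.SetProp('Cluster', '1')

# every extra property needs its own GetProp call and manual formatting
line = '{} {} {}'.format(Chem.MolToSmiles(mol),
                         mol.GetProp('_Name'),
                         mol.GetProp('Cluster'))
print(line)
```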
Greg, who is a developer of RDKit, advised me to use SmilesMolWriter. 😉

I had never used this function, so I tried it and found it very useful.
Let me show an example. (copied & pasted from a Jupyter notebook)

import pprint
from rdkit import rdBase
from rdkit import Chem
from rdkit.Chem.rdmolfiles import SmilesWriter

mols = [mol for mol in Chem.SDMolSupplier('cdk2.sdf') if mol is not None]
# make a writer object with a file name.
writer = SmilesWriter('cdk2smi1.smi')
# Check prop names.
print(list(mols[0].GetPropNames()))

# The SetProps method sets the properties that will be written to the file with the SMILES.
writer.SetProps(['Cluster'])
# Molecules are then written in the common way.
for mol in mols:
    writer.write(mol)
writer.close()

Then check the file.

!head -n 10 cdk2smi1.smi
SMILES Name Cluster
CC(C)C(=O)COc1nc(N)nc2[nH]cnc12 ZINC03814457 1
Nc1nc(OCC2CCCO2)c2nc[nH]c2n1 ZINC03814459 2
Nc1nc(OCC2CCC(=O)N2)c2nc[nH]c2n1 ZINC03814460 2

How about setting all the props?

writer = SmilesWriter('cdk2smi2.smi')
writer.SetProps(list(mols[0].GetPropNames()))
for mol in mols:
    writer.write(mol)
writer.close()
!head -n 10 cdk2smi2.smi
SMILES Name id Cluster MODEL.SOURCE MODEL.CCRATIO r_mmffld_Potential_Energy-OPLS_2005 r_mmffld_RMS_Derivative-OPLS_2005 b_mmffld_Minimization_Converged-OPLS_2005
CC(C)C(=O)COc1nc(N)nc2[nH]cnc12 ZINC03814457 ZINC03814457 1 CORINA 3.44 0027  09.01.2008 1 -78.6454 0.000213629 1
Nc1nc(OCC2CCCO2)c2nc[nH]c2n1 ZINC03814459 ZINC03814459 2 CORINA 3.44 0027  09.01.2008 1 -67.4705 9.48919e-05 1
Nc1nc(OCC2CCC(=O)N2)c2nc[nH]c2n1 ZINC03814460 ZINC03814460 2 CORINA 3.44 0027  09.01.2008 1 -89.4303 5.17485e-05 1
Nc1nc(OCC2CCCCC2)c2nc[nH]c2n1 ZINC00023543 ZINC00023543 3 CORINA 3.44 0027  09.01.2008 1 -70.2463 6.35949e-05 1

Wow, it works fine. This method is useful and easy to use.
Do most RDKiters use this function?
I’m totally out of fashion!!!
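By the way, SmilesWriter also accepts options at construction time, such as the column delimiter (plus nameHeader / includeHeader). A small self-contained sketch using an in-memory molecule, assuming tab-separated output is wanted:

```python
from rdkit import Chem
from rdkit.Chem.rdmolfiles import SmilesWriter

mol = Chem.MolFromSmiles('c1ccccc1O')
mol.SetProp('_Name', 'phenol')

# tab-separated output instead of the default space delimiter
writer = SmilesWriter('tab_sep.smi', delimiter='\t')
writer.write(mol)
writer.close()

content = open('tab_sep.smi').read()
print(content)
```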

I pushed my Fast Clustering code to the rdkit/Contrib repository. It was big news for me.

API for opentargets

Associations of drug targets with diseases are important information for drug discovery, and I think there are many databases that provide such information.

I like python. 😉 So, I was interested in the following article.
Opentargets is "a data integration and visualization platform that provides evidence about the association of known and potential drug targets with diseases".
That sounds good. The platform is provided as a web app, and the UI is very simple and sophisticated.
A Python API is also provided via pypi / github. I used the package by referring to the documentation.
At first I installed the package with the pip command.

iwatobipen$ pip install git+git://

Next, I tried to get evidence for a disease, retrieved the data related to my query, and printed it.

import pprint
from opentargets import OpenTargetsClient
ot = OpenTargetsClient()
e_for_disease = ot.get_evidence_for_disease('arthritis')
res = []
for e in e_for_disease:
    # field paths for mechanism and target name are reconstructed here
    # to match the printed output below
    res.append([e['disease']['efo_info']['label'],
                e['evidence']['target2drug']['mechanism_of_action'],
                e['target']['target_name'],
                e['drug']['molecule_name']])
# get unique dataset
uniqres = []
for i in res:
    if i not in uniqres:
        uniqres.append(i)

# check the result
for i in uniqres:
    print(i)
['gout', 'Cyclooxygenase inhibitor', 'Cyclooxygenase', 'INDOMETHACIN']
['rheumatoid arthritis', 'T-lymphocyte activation antigen CD86 inhibitor', 'T-lymphocyte activation antigen CD86', 'ABATACEPT']
['rheumatoid arthritis', 'Dihydrofolate reductase inhibitor', 'Dihydrofolate reductase', 'METHOTREXATE']
['rheumatoid arthritis', 'T-lymphocyte activation antigen CD80 inhibitor', 'T-lymphocyte activation antigen CD80', 'ABATACEPT']
['rheumatoid arthritis', 'Dihydroorotate dehydrogenase inhibitor', 'Dihydroorotate dehydrogenase', 'LEFLUNOMIDE']
['osteoarthritis', 'Plasminogen inhibitor', 'Plasminogen', 'TRANEXAMIC ACID']
['gout', 'Xanthine dehydrogenase inhibitor', 'Xanthine dehydrogenase', 'ALLOPURINOL']
['osteoarthritis', 'Adrenergic receptor agonist', 'Adrenergic receptor', 'EPINEPHRINE']
['rheumatoid arthritis', 'Janus Kinase (JAK) inhibitor', 'Janus Kinase (JAK)', 'TOFACITINIB']
['rheumatoid arthritis', 'TNF-alpha inhibitor', 'TNF-alpha', 'ADALIMUMAB']
['chronic childhood arthritis', 'Dihydrofolate reductase inhibitor', 'Dihydrofolate reductase', 'METHOTREXATE']
['osteoarthritis, hip', 'Adrenergic receptor agonist', 'Adrenergic receptor', 'EPINEPHRINE']
['rheumatoid arthritis', 'Interleukin-6 receptor alpha subunit inhibitor', 'Interleukin-6 receptor alpha subunit', 'TOCILIZUMAB']
['osteoarthritis', 'Coagulation factor X inhibitor', 'Coagulation factor X', 'RIVAROXABAN']
['osteoarthritis', 'Cyclooxygenase-2 inhibitor', 'Cyclooxygenase-2', 'CELECOXIB']
['rheumatoid arthritis', 'Cyclooxygenase-2 inhibitor', 'Cyclooxygenase-2', 'CELECOXIB']
['chronic childhood arthritis', 'Cyclooxygenase-2 inhibitor', 'Cyclooxygenase-2', 'CELECOXIB']
['rheumatoid arthritis', 'FK506-binding protein 1A inhibitor', 'FK506-binding protein 1A', 'TACROLIMUS']
['osteoarthritis', 'Mu opioid receptor agonist', 'Mu opioid receptor', 'MORPHINE']
['rheumatoid arthritis', 'TNF-alpha inhibitor', 'TNF-alpha', 'INFLIXIMAB']
['chronic childhood arthritis', 'Histamine H2 receptor antagonist', 'Histamine H2 receptor', 'FAMOTIDINE']
['osteoarthritis', 'Adrenergic receptor alpha-2 agonist', 'Adrenergic receptor alpha-2', 'DEXMEDETOMIDINE']
['chronic childhood arthritis', 'Cyclooxygenase inhibitor', 'Cyclooxygenase', 'IBUPROFEN']

Next, I retrieved data with limited fields. The returned object can be converted to a pandas dataframe.

from opentargets import OpenTargetsClient
ct = OpenTargetsClient()
# the field list here is illustrative; see the documentation for available field names
res = ct.get_associations_for_target('PDCD1',
    fields=['association_score.overall', 'target.id', 'disease.id'])
resdf = res.to_dataframe()

The statistics module can handle scores. (I do not understand this part well, so I need to read the documentation.)

from opentargets import OpenTargetsClient
from opentargets.statistics import HarmonicSumScorer
ot = OpenTargetsClient()
r = ot.get_associations_for_target('PDCD1')
interesting_datatypes = ['genetic_association', 'known_drug', 'somatic_mutation']

def score_with_datatype_subset(datatypes,results):
    for i in results:
        datatype_scores = i['association_score']['datatypes']
        filtered_scores = [datatype_scores[dt] for dt in datatypes]
        custom_score = HarmonicSumScorer.harmonic_sum(filtered_scores)
        if custom_score:
            yield(custom_score, i['disease']['id'], dict(zip(datatypes, filtered_scores)))
for i in score_with_datatype_subset(interesting_datatypes, r):
    print(i)
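To get an intuition for what HarmonicSumScorer.harmonic_sum is doing, here is my own simplified stand-in (not the library's exact implementation): scores are sorted in descending order and the i-th score is damped by 1/(i+1)^2, so one strong evidence type dominates the sum while weak ones add little.

```python
def harmonic_sum(scores):
    # simplified sketch of a harmonic-sum style aggregation (not the
    # exact opentargets implementation): sort descending, damp the
    # i-th score by 1 / (i + 1) ** 2
    ordered = sorted(scores, reverse=True)
    return sum(s / (i + 1) ** 2 for i, s in enumerate(ordered))

print(harmonic_sum([1.0, 1.0, 1.0]))  # 1 + 1/4 + 1/9 ≈ 1.361
print(harmonic_sum([1.0, 0.0, 0.0]))  # 1.0 -- one strong score dominates
```

This kind of damping rewards targets supported by one strong datatype over targets with many weak ones of the same total.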

The whole code is pushed to my repo.







Build QSAR model with pytorch and rdkit #RDKit

There are many deep learning frameworks in python, for example Chainer, Keras, Theano, Tensorflow and pytorch.
I have tried Keras, Chainer and Tensorflow for QSAR modeling, and this time I tried to build a QSAR model using pytorch and RDKit.
You know, pytorch has dynamic neural networks ("Define-by-Run") like chainer.
I used the solubility data that is provided with rdkit; I have used this dataset before.

Let’s start coding.
At first I imported the packages needed for QSAR and defined some utility functions.

import pprint
import argparse
import torch
import torch.optim as optim
from torch import nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
import numpy as np
#from sklearn import preprocessing

def base_parser():
    parser = argparse.ArgumentParser("This is simple test of pytorch")
    parser.add_argument("trainset", help="sdf for train")
    parser.add_argument("testset", help="sdf for test")
    parser.add_argument("--epochs", default=150, type=int)
    return parser

parser = base_parser()
args = parser.parse_args()
traindata = [mol for mol in Chem.SDMolSupplier(args.trainset) if mol is not None]
testdata = [mol for mol in Chem.SDMolSupplier(args.testset) if mol is not None]

def molsfeaturizer(mols):
    fps = []
    for mol in mols:
        arr = np.zeros((0,))
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
        DataStructs.ConvertToNumpyArray(fp, arr)
        fps.append(arr)
    fps = np.array(fps, dtype=np.float)
    return fps

classes = {"(A) low":0, "(B) medium":1, "(C) high":2}
#classes = {"(A) low":0, "(B) medium":1, "(C) high":1}

trainx = molsfeaturizer(traindata)
testx = molsfeaturizer(testdata)
# for pytorch, y must be long type!!
trainy = np.array([classes[mol.GetProp("SOL_classification")] for mol in traindata], dtype=np.int64)
testy = np.array([classes[mol.GetProp("SOL_classification")] for mol in testdata], dtype=np.int64)

The torch.from_numpy function converts a numpy array to a torch tensor. It is very convenient.
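A minimal check of that behavior: the tensor dtype follows the numpy dtype, which is why the label arrays above were created as int64.

```python
import numpy as np
import torch

y = np.array([0, 1, 2], dtype=np.int64)
t = torch.from_numpy(y)
# the tensor dtype follows the numpy dtype, so int64 becomes a LongTensor,
# which is what cross_entropy expects for class labels
print(t.type())  # torch.LongTensor
```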
And then I defined the neural network. I feel this style is very unique, because I mostly use Keras for deep learning.
To build a model in pytorch, I need to define each layer and then the whole structure.

X_train = torch.from_numpy(trainx)
X_test = torch.from_numpy(testx)
Y_train = torch.from_numpy(trainy)
Y_test = torch.from_numpy(testy)
print(X_test.size(), Y_train.size())

class QSAR_mlp(nn.Module):
    def __init__(self):
        super(QSAR_mlp, self).__init__()
        self.fc1 = nn.Linear(2048, 524)
        self.fc2 = nn.Linear(524, 10)
        self.fc3 = nn.Linear(10, 10)
        self.fc4 = nn.Linear(10,3)
    def forward(self, x):
        x = x.view(-1, 2048)
        h1 = F.relu(self.fc1(x))
        h2 = F.relu(self.fc2(h1))
        h3 = F.relu(self.fc3(h2))
        output = F.sigmoid(self.fc4(h3))
        return output

After defining the model, I trained it and made predictions.
The following code is the training and prediction part.

model = QSAR_mlp()
print(model)

losses = []
optimizer = optim.Adam( model.parameters(), lr=0.005)
for epoch in range(args.epochs):
    data, target = Variable(X_train).float(), Variable(Y_train).long()
    optimizer.zero_grad()
    y_pred = model(data)
    loss = F.cross_entropy(y_pred, target)
    print("Loss: {}".format(loss.data[0]))
    loss.backward()
    optimizer.step()
    losses.append(loss.data[0])

pred_y = model(Variable(X_test).float())
predicted = torch.max(pred_y, 1)[1]

for i in range(len(predicted)):
    print("pred:{}, target:{}".format(predicted[i], Y_test[i]))

print("Accuracy: {}".format(sum(p == t for p, t in zip(predicted, Y_test)) / len(Y_test)))

Check the code.

iwatobipen$ python solubility.train.sdf solubility.test.sdf 
torch.Size([1025, 2048]) torch.Size([1025])
torch.Size([257, 2048]) torch.Size([1025])
QSAR_mlp(
  (fc1): Linear(in_features=2048, out_features=524)
  (fc2): Linear(in_features=524, out_features=10)
  (fc3): Linear(in_features=10, out_features=10)
  (fc4): Linear(in_features=10, out_features=3)
)
Loss: 1.1143544912338257
Loss: 0.6231405735015869
pred:1, target:0
pred:1, target:0
pred:0, target:0
Accuracy: 0.642023346303502

Hmm, the accuracy is not so high. I believe there is still room for improvement. I am a newbie at pytorch, so I will practice pytorch more next year.

This is my last code of this year. I would like to post to my blog more next year.
If any readers find mistakes in my code, please let me know.

Have a happy new year !!!!

Back to my home

My family and I visited Okinawa and came back home today. The place was warm and comfortable, and we enjoyed the trip. My son was excited by his first experience of taking an airplane. 😉
We enjoyed Okinawa’s soul food.

Goya Chample and orion beer!

We visited the Churaumi Aquarium and saw a whale shark. The body weight of the shark is over 5,000 kg! @_@

And we could see the beautiful sea.
The following movie was taken by my drone!
Kouri Bridge!

Sunset beach!

I think this family vacation was really successful.

New Year’s holiday.

I am waiting for my flight in Haneda airport now. 😉
I planned a trip with my family for the new year’s holiday. It is the first time for me to take such a trip since my kids were born.
My kids are looking forward to getting on an airplane.
I will take photos and tweet them. I am taking my drone!
I hope they enjoy the vacation.