Replace strings in pandas dataframe / Memo

I would like to convert SMILES strings that contain Cl or Br atoms into new strings with Cl written as L and Br as R.
I wrote some simple code as a note.

import pandas as pd
df = pd.read_table('cdk2.smi', sep=' ')
# make a replace function with a lambda
rep_func = lambda x: x.replace('Cl', 'L').replace('Br','R')
df['repsmi'] = df.SMILES.apply(rep_func)
#check
for row in df.repsmi:
    if "L" in row:
        print(row)

The result was:

CC(C)C(CO)Nc1nc2c(ncn2C(C)C)c(Nc2cccc(L)c2)n1
CC(C)C(CO)Nc1nc2c(ncn2C(C)C)c(Nc2ccc(C(=O)[O-])c(L)c2)n1
C[NH+]1CCC(c2c(O)cc(O)c3c(=O)cc(-c4ccccc4L)oc32)C(O)C1

OK. Using a vectorized function should improve performance further.
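For example, pandas' vectorized string methods can do the same replacement without apply. A minimal sketch with a toy dataframe (the column name SMILES matches the data above, but the example molecules are made up):

```python
import pandas as pd

# toy dataframe standing in for the cdk2.smi data
df = pd.DataFrame({"SMILES": ["Clc1ccccc1", "Brc1ccccc1Cl"]})

# vectorized replace; regex=False because Cl/Br are literal substrings
df["repsmi"] = (df["SMILES"]
                .str.replace("Cl", "L", regex=False)
                .str.replace("Br", "R", regex=False))
print(list(df["repsmi"]))  # ['Lc1ccccc1', 'Rc1ccccc1L']
```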


Business trip to India

I visited India this week and came back to Japan today. When I went to Narita Airport the day before departure, a cold snap had hit Japan and trains were disrupted due to snow. The bus from the airport to the hotel was also stuck in traffic, so I walked to the hotel. Getting to the airport and the hotel was hard…

I was surprised on my flight. The window was a bit strange: it has no shade, but it has a button, and when I pushed the button the window changed color!!!

Before…
before
After…..
after
I found a video of it on YouTube.

Amazing! What happened??? I searched the internet and found the answer.
The window system was developed by GENTEX and is based on “electrochromic technology”.
This technology uses an electrochromic gel sandwiched between thin panels, and the gel changes color when an electric current is passed through it. Chem-Station also describes the technology, and the site estimates that the material is WO3, because WO3 is colorless in its neutral state but turns blue in its cationic state. Inorganic materials are also more durable than organic ones.
https://www.chem-station.com/chemglossary/2016/12/electrochromism.html
Science! Science! Science! 😉

Of course I enjoyed traditional Indian food: curry, masala dosa, biryani, tandoori chicken, and more….

And in my free time I caught Torkoal, a region-exclusive Pokémon. 😉
pokemon

mol2vec analogy of word2vec #RDKit

Last year a researcher at BioMed X published the following article.
https://chemrxiv.org/articles/Mol2vec_Unsupervised_Machine_Learning_Approach_with_Chemical_Intuition/5513581
And recently the article was published by ACS.
The concept of mol2vec is the same as that of word2vec.
Word2vec converts words to vectors using a large corpus and has shown success in NLP. Mol2vec converts molecules to vectors using ECFP information.
Fortunately the Mol2vec source code is available on GitHub.
I played with mol2vec with reference to the original repository.
mol2vec wraps gensim, a python library for topic modeling, so users can build a corpus and a model very easily.
https://radimrehurek.com/gensim/index.html
At first I made a corpus and a model using a very small dataset.
*The github repo provides a model built from the ZINC DB. The following code is a sample of how to build a model with mol2vec.

from mol2vec.features import DfVec, mol2alt_sentence, mol2sentence, generate_corpus, insert_unk
from mol2vec.features import train_word2vec_model
generate_corpus("cdk2.sdf", "cdk2.corp", 1 )
#I do not need to make unk file because I used same data for analysis.
#insert_unk('cdk2.corp', 'unk_cdk2.corp')
train_word2vec_model('cdk2.corp', 'cdk2word2vec.model', min_count=0)

Cool! It is easy to make a model.
Next I ran a tSNE analysis and visualized the data.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from gensim.models import word2vec
from rdkit import Chem
from rdkit.Chem import Draw
from mol2vec.features import mol2alt_sentence, MolSentence, DfVec, sentences2vec
from mol2vec.helpers import depict_identifier, plot_2D_vectors, IdentifierTable, mol_to_svg

#from supplied data (skip-gram, window size of 10, radius 1, UNK replaces all identifiers that appear fewer than 4 times)
#model = word2vec.Word2Vec.load('model_300dim.pkl')
model = word2vec.Word2Vec.load('cdk2word2vec.model')
print("****number of unique identifiers in the model: {}".format(len(model.wv.vocab.keys())))
# get feature vec of 2246728737
#print(model.wv.word_vec('2246728737'))

mols = [ mol for mol in Chem.SDMolSupplier("cdk2.sdf")]

# encode molecules as sentence(represented by MorganFP)
sentences = [ mol2alt_sentence(mol, 1) for mol in mols ]
flat_list = [ item for sublist in sentences for item in sublist ]
mol_identifiers_unique = list(set( flat_list ))
print("mols has {} identifiers and has {} unique identifiers".format(len(flat_list), len(mol_identifiers_unique) ))
# make projection of 300D vectors to 2D using PCA/TSNE combination
df_vec = pd.DataFrame()
df_vec['identifier'] = mol_identifiers_unique
df_vec.index = df_vec['identifier']

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

pca_model = PCA(n_components=30)
#tsne_model = TSNE(n_components=2, perplexity=10, n_iter=1000, metric='cosine')
# I use sklearn v0.19; metric='cosine' did not work, see the following StackOverflow question:
# https://stackoverflow.com/questions/46791191/valueerror-metric-cosine-not-valid-for-algorithm-ball-tree-when-using-sklea
tsne_model = TSNE(n_components=2, perplexity=10, n_iter=1000, metric='euclidean')


tsne_pca = tsne_model.fit_transform( pca_model.fit_transform([model.wv.word_vec(x) for x in mol_identifiers_unique]))

df_vec['PCA-tSNE-c1'] = tsne_pca.T[0]
df_vec['PCA-tSNE-c2'] = tsne_pca.T[1]
print(df_vec.head(4))
projections = df_vec.to_dict()
# Function that extracts projected values for plotting
def get_values(identifier, projections):
    return np.array((projections['PCA-tSNE-c1'][str(identifier)],projections['PCA-tSNE-c2'][str(identifier)]))
print(get_values(df_vec["identifier"][0],projections))

# substructure vectors are plotted in green and molecule vectors in cyan.

mol_vals = [ get_values(x, projections) for x in sentences[0]]
subplt = plot_2D_vectors( mol_vals )
plt.savefig('plt.png')

plot_2D_vectors can visualize the vector of each fingerprint fragment together with the molecule vector.
plt
I often use PCA to analyze chemical space, but PCA results depend on the dataset. Mol2vec, by contrast, builds the model first and then applies the data to it, so different datasets can be analyzed on the same basis (model).
My example code is very simple and shows only a small part of Mol2vec.
BTW, word2vec can do arithmetic on word vectors, for example tokyo – japan + paris = france.
How about Mol2vec? mol A – mol B + mol C = mol D???? Maybe it is difficult, because the fingerprint algorithm uses a hash function, I think.
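The word-analogy trick is just vector arithmetic plus a nearest-neighbor search by cosine similarity. A toy sketch with made-up 2D embeddings (in gensim you would call model.wv.most_similar(positive=[...], negative=[...]) on real vectors instead):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# made-up 2D embeddings, purely for illustration
vecs = {
    "tokyo":  [1.0, 1.0],
    "japan":  [1.0, 0.0],
    "paris":  [2.0, 1.0],
    "france": [2.0, 0.0],
    "oslo":   [-1.0, 2.0],
}

# tokyo - japan + paris
query = [a - b + c for a, b, c in zip(vecs["tokyo"], vecs["japan"], vecs["paris"])]
# nearest neighbor among the remaining words
best = max((w for w in vecs if w not in ("tokyo", "japan", "paris")),
           key=lambda w: cosine(query, vecs[w]))
print(best)  # france
```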
Readers who are interested in the package, please visit the official repository and enjoy. 😉
http://mol2vec.readthedocs.io/en/latest/index.html?highlight=features#features-main-mol2vec-module

Simple way for making SMILES file #RDKit

To convert an SDF file to SMILES, I used to write code like the following.

..snip..
sdf = Chem.SDMolSupplier( 'some.sdf' )
with open('smiles.smi', 'w') as f:
    for mol in sdf:
        smi = Chem.MolToSmiles(mol)
        f.write("{}\n".format(smi))

With this approach, writing SMILES strings together with properties requires fetching each property with GetProp("some prop"), so the code tends to get long when several properties are needed.
Greg, a developer of RDKit, advised me to use SmilesMolWriter. 😉

I had never used this class before, so I tried it and found it very useful.
Let me show an example (copied & pasted from a Jupyter notebook).

import pprint
from rdkit import rdBase
from rdkit import Chem
from rdkit.Chem.rdmolfiles import SmilesWriter
print(rdBase.rdkitVersion)
[OUT]
2017.09.2

mols = [mol for mol in Chem.SDMolSupplier('cdk2.sdf') if mol is not None]
# make writer object with a file name.
writer = SmilesWriter('cdk2smi1.smi')
#Check prop names.
pprint.pprint(list(mols[0].GetPropNames()))
[OUT]
['id',
 'Cluster',
 'MODEL.SOURCE',
 'MODEL.CCRATIO',
 'r_mmffld_Potential_Energy-OPLS_2005',
 'r_mmffld_RMS_Derivative-OPLS_2005',
 'b_mmffld_Minimization_Converged-OPLS_2005']

#The SetProps method sets the properties that will be written to the file along with the SMILES.
writer.SetProps(['Cluster'])
#Molecules are written in the usual way.
for mol in mols:
    writer.write( mol )
writer.close()

Then check the file.

!head -n 10 cdk2smi1.smi
[OUT]
SMILES Name Cluster
CC(C)C(=O)COc1nc(N)nc2[nH]cnc12 ZINC03814457 1
Nc1nc(OCC2CCCO2)c2nc[nH]c2n1 ZINC03814459 2
Nc1nc(OCC2CCC(=O)N2)c2nc[nH]c2n1 ZINC03814460 2

How about setting all the props?

writer = SmilesWriter('cdk2smi2.smi')
writer.SetProps(list(mols[0].GetPropNames()))
for mol in mols:
    writer.write( mol )
writer.close()
!head -n 10 cdk2smi2.smi
[OUT]
SMILES Name id Cluster MODEL.SOURCE MODEL.CCRATIO r_mmffld_Potential_Energy-OPLS_2005 r_mmffld_RMS_Derivative-OPLS_2005 b_mmffld_Minimization_Converged-OPLS_2005
CC(C)C(=O)COc1nc(N)nc2[nH]cnc12 ZINC03814457 ZINC03814457 1 CORINA 3.44 0027  09.01.2008 1 -78.6454 0.000213629 1
Nc1nc(OCC2CCCO2)c2nc[nH]c2n1 ZINC03814459 ZINC03814459 2 CORINA 3.44 0027  09.01.2008 1 -67.4705 9.48919e-05 1
Nc1nc(OCC2CCC(=O)N2)c2nc[nH]c2n1 ZINC03814460 ZINC03814460 2 CORINA 3.44 0027  09.01.2008 1 -89.4303 5.17485e-05 1
Nc1nc(OCC2CCCCC2)c2nc[nH]c2n1 ZINC00023543 ZINC00023543 3 CORINA 3.44 0027  09.01.2008 1 -70.2463 6.35949e-05 1

Wow, it works fine. This method is useful and easy to use.
Do most RDKiters already use this function?
I’m totally out of fashion!!!
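For completeness, here is a self-contained round trip along the same lines: write one molecule with a property and read the file back. The molecule, property name, and temp-file path are my own toy choices, not from the cdk2 data.

```python
import os
import tempfile

from rdkit import Chem
from rdkit.Chem.rdmolfiles import SmilesWriter

# toy molecule with a name and one property
mol = Chem.MolFromSmiles("c1ccccc1O")
mol.SetProp("_Name", "phenol")
mol.SetProp("Cluster", "1")

path = os.path.join(tempfile.mkdtemp(), "demo.smi")
writer = SmilesWriter(path)   # default delimiter is a space, header included
writer.SetProps(["Cluster"])  # properties appended after SMILES and Name
writer.write(mol)
writer.close()

with open(path) as f:
    print(f.read())
# SMILES Name Cluster
# Oc1ccccc1 phenol 1
```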

I pushed my Fast Clustering code to rdkit/Contrib repository. It was big news for me.

API for opentargets

Associations of drug targets with diseases are important information for drug discovery. There are lots of databases that provide this kind of information, I think.

I like python. 😉 So I was interested in the following article.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5210543/
Opentargets is “a data integration and visualization platform that provides evidence about the association of known and potential drug targets with diseases”.
That sounds good. The platform is provided as a web app; you can access the data at https://www.targetvalidation.org/.
The UI is very simple and sophisticated.
A Python API is also available on pypi / github. I used the package by referring to the documentation.
First I installed the package with pip.

iwatobipen$ pip install git+git://github.com/CTTV/opentargets-py.git

Next, I tried to get evidence for a disease. I retrieved the data related to my query and printed it.

import pprint
from opentargets import OpenTargetsClient
ot = OpenTargetsClient()
e_for_disease = ot.get_evidence_for_disease('arthritis')
res = []
for e in e_for_disease:
    try:
        res.append([ e['disease']['efo_info']['label'], 
                   e['evidence']['target2drug']['mechanism_of_action'],
                   e['target']['target_name'],
                   e['drug']['molecule_name']] )
    except KeyError:  # some evidence records lack these fields
        pass
# get unique dataset
uniqres = []
for i in res:
    if i in uniqres:
        pass
    else:
        uniqres.append(i)

# check the result
for i in uniqres:
    print(i)
    print("**********")

###output###
['gout', 'Cyclooxygenase inhibitor', 'Cyclooxygenase', 'INDOMETHACIN']
**********
['rheumatoid arthritis', 'T-lymphocyte activation antigen CD86 inhibitor', 'T-lymphocyte activation antigen CD86', 'ABATACEPT']
**********
['rheumatoid arthritis', 'Dihydrofolate reductase inhibitor', 'Dihydrofolate reductase', 'METHOTREXATE']
**********
['rheumatoid arthritis', 'T-lymphocyte activation antigen CD80 inhibitor', 'T-lymphocyte activation antigen CD80', 'ABATACEPT']
**********
['rheumatoid arthritis', 'Dihydroorotate dehydrogenase inhibitor', 'Dihydroorotate dehydrogenase', 'LEFLUNOMIDE']
**********
['osteoarthritis', 'Plasminogen inhibitor', 'Plasminogen', 'TRANEXAMIC ACID']
**********
['gout', 'Xanthine dehydrogenase inhibitor', 'Xanthine dehydrogenase', 'ALLOPURINOL']
**********
['osteoarthritis', 'Adrenergic receptor agonist', 'Adrenergic receptor', 'EPINEPHRINE']
**********
['rheumatoid arthritis', 'Janus Kinase (JAK) inhibitor', 'Janus Kinase (JAK)', 'TOFACITINIB']
**********
['rheumatoid arthritis', 'TNF-alpha inhibitor', 'TNF-alpha', 'ADALIMUMAB']
**********
['chronic childhood arthritis', 'Dihydrofolate reductase inhibitor', 'Dihydrofolate reductase', 'METHOTREXATE']
**********
['osteoarthritis, hip', 'Adrenergic receptor agonist', 'Adrenergic receptor', 'EPINEPHRINE']
**********
['rheumatoid arthritis', 'Interleukin-6 receptor alpha subunit inhibitor', 'Interleukin-6 receptor alpha subunit', 'TOCILIZUMAB']
**********
['osteoarthritis', 'Coagulation factor X inhibitor', 'Coagulation factor X', 'RIVAROXABAN']
**********
['osteoarthritis', 'Cyclooxygenase-2 inhibitor', 'Cyclooxygenase-2', 'CELECOXIB']
**********
['rheumatoid arthritis', 'Cyclooxygenase-2 inhibitor', 'Cyclooxygenase-2', 'CELECOXIB']
**********
['chronic childhood arthritis', 'Cyclooxygenase-2 inhibitor', 'Cyclooxygenase-2', 'CELECOXIB']
**********
['rheumatoid arthritis', 'FK506-binding protein 1A inhibitor', 'FK506-binding protein 1A', 'TACROLIMUS']
**********
['osteoarthritis', 'Mu opioid receptor agonist', 'Mu opioid receptor', 'MORPHINE']
**********
['rheumatoid arthritis', 'TNF-alpha inhibitor', 'TNF-alpha', 'INFLIXIMAB']
**********
['chronic childhood arthritis', 'Histamine H2 receptor antagonist', 'Histamine H2 receptor', 'FAMOTIDINE']
**********
['osteoarthritis', 'Adrenergic receptor alpha-2 agonist', 'Adrenergic receptor alpha-2', 'DEXMEDETOMIDINE']
**********
['chronic childhood arthritis', 'Cyclooxygenase inhibitor', 'Cyclooxygenase', 'IBUPROFEN']
**********
..snip...
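A side note on the uniqueness filter above: `if i in uniqres` rescans the whole list for every row, which is O(n²). Since each row is a list (unhashable), a set of tuple keys gives an order-preserving linear-time version; a small sketch with sample rows mimicking the output above:

```python
def unique_rows(rows):
    """Order-preserving dedup for a list of lists."""
    seen = set()
    out = []
    for row in rows:
        key = tuple(row)  # lists are unhashable, tuples are not
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

rows = [
    ["gout", "Cyclooxygenase inhibitor", "Cyclooxygenase", "INDOMETHACIN"],
    ["rheumatoid arthritis", "TNF-alpha inhibitor", "TNF-alpha", "ADALIMUMAB"],
    ["gout", "Cyclooxygenase inhibitor", "Cyclooxygenase", "INDOMETHACIN"],
]
print(len(unique_rows(rows)))  # 2
```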

Next, I retrieved data with a limited set of fields. The returned object can be converted to a pandas dataframe.

from opentargets import OpenTargetsClient
ct = OpenTargetsClient()
res = ct.get_associations_for_target('PDCD1',
               fields=['association_score.datasource*',
                           'association_score.overall',
                            'target.gene_info.symbol',
                           ]
   )
resdf = res.to_dataframe()

The statistics module can handle scores. (I do not understand this part well yet, so I need to read the documentation.)

from opentargets import OpenTargetsClient
from opentargets.statistics import HarmonicSumScorer
ot = OpenTargetsClient()
r = ot.get_associations_for_target('PDCD1')
interesting_datatypes = ['genetic_association', 'known_drug', 'somatic_mutation']

def score_with_datatype_subset(datatypes,results):
    for i in results:
        datatype_scores = i['association_score']['datatypes']
        filtered_scores = [datatype_scores[dt] for dt in datatypes]
        custom_score = HarmonicSumScorer.harmonic_sum(filtered_scores)
        if custom_score:
            yield(custom_score, i['disease']['id'], dict(zip(datatypes, filtered_scores)))
for i in score_with_datatype_subset(interesting_datatypes, r):
    print(i)

The whole code is pushed to my repo.
https://github.com/iwatobipen/opentargetstest/blob/master/opentarget.ipynb

This year is almost over

ใ€€ใ“ใฎ่จ˜ไบ‹ใฏไบˆ็ด„ๆŠ•็จฟใง2355ใซใƒ‘ใƒ–ใƒชใƒƒใ‚ทใƒฅใ•ใ‚Œใ‚‹ไบˆๅฎšใงใ™ใ€‚ใ†ใพใๅ‹•ใ„ใฆใ„ใ‚‹ใจใ„ใ„ใงใ™ใŒใ€‚ใ€‚ใ€‚
ใ“ใฎใƒ–ใƒญใ‚ฐใฏWordPress.comใงๆ›ธใ„ใฆใ„ใพใ™ใ€‚ใ‚ขใ‚ฏใ‚ปใ‚นใƒˆใƒฉใƒƒใ‚ญใƒณใ‚ฐใ‚’่ฆ‹ใ‚‹ใจๆ˜จๅนดใ‚ˆใ‚Šใ‚ขใ‚ฏใ‚ปใ‚นๆ•ฐใฏใŠใ‚ˆใไบŒๅ€ใใ‚‰ใ„ใซๅข—ใˆใฆใ„ใพใ™ใŒใ€ๆŠ•็จฟๆ•ฐใฏๅฎŸใฏๆ˜จๅนดใฎๆ–นใŒๅคšใ‹ใฃใŸใงใ™ใ€‚
↓
Screen Shot 2017-12-31 at 22.09.25
ใ€€ใ“ใ“ๆ•ฐๅนดใ€ใชใ‚“ใจใชใ่‹ฑ่ชžใฎ่จ˜ไบ‹ใ‚’ๆ›ธใ„ใฆใ„ใ‚‹ใ‚ใ‘ใงใ™ใŒใ€ๅ…จ็„ถใƒฌใƒ™ใƒซใŒไธŠใŒใฃใฆใ„ใ‚‹ๆฐ—ใŒใ—ใชใ„ไปŠๆ—ฅใ“ใฎ้ ƒใงใ™ใ€‚

Looking back on this year, I had various opportunities for new experiences at work. The people I get to work with every year are all highly skilled, and while the constant stimulation is a great help, I increasingly feel how much I am still lacking.
Also, while I had more chances to try new things, the frequency of my synthesis experiments in the lab dropped sharply (´・ω・`)… in fact I hardly did any. This year I also started working on CompChem tasks, and being able to contribute to projects by writing code, even a little, was a good stimulus that kept me motivated. Beyond that, I did quite a lot of coordination work inside and outside the company, so my paperwork skills improved a bit. (lol)

My career has also become fairly long, and it is now quicker to count my remaining years from the far end. I still remember the vaguely uneasy feeling I had long ago when a senior colleague far above me said, "From now on, you young people will blaze the trail. I'm rooting for you." I am gradually approaching that age myself. I have no intention of saying such things to younger colleagues; of course I will cooperate with them and cheer them on rather than get in their way. But rather, I feel a sense of crisis that I must work hard enough not to lose to the younger generation… I think that the moment you say "I'll leave it to you, I'm rooting for you," you have abandoned your responsibility as a researcher, and if you are going to say that, you should give up your post.

ใ€€ใจ่จ€ใ†ใ‚ใ‘ใงๆฅๅนดใ‚‚1ๆ—ฅ1ๆ—ฅใŒใ‚“ใฐใ‚Šใพใ™ใ€ๆฏๆŠœใใ‚‚ใ—ใคใคใ€‚
F2FใงใŠ่ฉฑใ—ใ•ใ›ใฆใ„ใŸใ ใ„ใŸๆ–นใ€ใ‚ใฃใŸใ“ใจใฏใชใ„ใ‘ใฉSNSใ‚’้€šใ˜ใฆไบคๆตใ•ใ›ใฆใ„ใŸใ ใ„ใŸๆ–นใ€็š†ๆง˜ใฎใ‚ˆใ‚Šไธ€ๅฑคใฎใ”ๆดป่บใจใ€ๆˆๅŠŸใ‚’็ฅˆใ‚Šใคใคใ€ไปŠๅนดๆœ€ๅพŒใฎ่จ˜ไบ‹ใจใ—ใŸใ„ใจๆ€ใ„ใพใ™ใ€‚

May next year be an even better year for everyone!!!!

Build QSAR model with pytorch and rdkit #RDKit

There are many deep learning frameworks in python, for example Chainer, Keras, Theano, Tensorflow and pytorch.
I have tried Keras, Chainer and Tensorflow for QSAR modeling, so this time I tried building a QSAR model with pytorch and RDKit.
You know, pytorch has dynamic neural networks (“define-by-run”), like chainer.
I used the solubility data that ships with rdkit; I have used this dataset before.
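As a tiny illustration of define-by-run (using the current pytorch autograd API rather than the old Variable wrapper that appears in the code below), the computation graph is recorded while the forward pass executes and can then be backpropagated:

```python
import torch

# the graph is built as these ops run (define-by-run)
x = torch.ones(3, requires_grad=True)
y = (x * 2.0).sum()

# backpropagate through the recorded graph
y.backward()
print(x.grad)  # tensor([2., 2., 2.])
```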

Let’s start coding.
First I imported the packages needed for QSAR and defined some utility functions.

import pprint
import argparse
import torch
import torch.optim as optim
from torch import nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
import numpy as np
#from sklearn import preprocessing


def base_parser():
    parser = argparse.ArgumentParser("This is simple test of pytorch")
    parser.add_argument("trainset", help="sdf for train")
    parser.add_argument("testset", help="sdf for test")
    parser.add_argument("--epochs", default=150, type=int)
    return parser

parser = base_parser()
args = parser.parse_args()
traindata = [mol for mol in Chem.SDMolSupplier(args.trainset) if mol is not None]
testdata = [mol for mol in Chem.SDMolSupplier(args.testset) if mol is not None]

def molsfeaturizer(mols):
    fps = []
    for mol in mols:
        arr = np.zeros((0,))
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
        DataStructs.ConvertToNumpyArray(fp, arr)
        fps.append(arr)
    fps = np.array(fps, dtype = np.float)
    return fps

classes = {"(A) low":0, "(B) medium":1, "(C) high":2}
#classes = {"(A) low":0, "(B) medium":1, "(C) high":1}

trainx = molsfeaturizer(traindata)
testx = molsfeaturizer(testdata)
# for pytorch, y must be long type!!
trainy = np.array([classes[mol.GetProp("SOL_classification")] for mol in traindata], dtype=np.int64)
testy = np.array([classes[mol.GetProp("SOL_classification")] for mol in testdata], dtype=np.int64)

The torch.from_numpy function converts a numpy array to a torch tensor, which is very convenient.
Then I defined the neural network. This approach feels quite distinctive to me because I mostly use Keras for deep learning.
To build a model in pytorch, I need to define each layer and the overall structure.

X_train = torch.from_numpy(trainx)
X_test = torch.from_numpy(testx)
Y_train = torch.from_numpy(trainy)
Y_test = torch.from_numpy(testy)
print(X_train.size(),Y_train.size())
print(X_test.size(), Y_train.size())  # note: Y_test.size() was probably intended here

class QSAR_mlp(nn.Module):
    def __init__(self):
        super(QSAR_mlp, self).__init__()
        self.fc1 = nn.Linear(2048, 524)
        self.fc2 = nn.Linear(524, 10)
        self.fc3 = nn.Linear(10, 10)
        self.fc4 = nn.Linear(10,3)
    def forward(self, x):
        x = x.view(-1, 2048)
        h1 = F.relu(self.fc1(x))
        h2 = F.relu(self.fc2(h1))
        h3 = F.relu(self.fc3(h2))
        # note: F.cross_entropy applies log_softmax internally, so raw logits are more usual here
        output = F.sigmoid(self.fc4(h3))
        return output

After defining the model I trained it and made predictions.
The following code contains the training and prediction parts.

model = QSAR_mlp()
print(model)

losses = []
optimizer = optim.Adam( model.parameters(), lr=0.005)
for epoch in range(args.epochs):
    data, target = Variable(X_train).float(), Variable(Y_train).long()
    optimizer.zero_grad()
    y_pred = model(data)
    loss = F.cross_entropy(y_pred, target)
    print("Loss: {}".format(loss.data[0]))
    loss.backward()
    optimizer.step()

pred_y = model(Variable(X_test).float())
predicted = torch.max(pred_y, 1)[1]

for i in range(len(predicted)):
    print("pred:{}, target:{}".format(predicted.data[i], Y_test[i]))

print( "Accuracy: {}".format(sum(p==t for p,t in zip(predicted.data, Y_test))/len(Y_test)))

Check the code.

iwatobipen$ python qsar_pytorch.py solubility.train.sdf solubility.test.sdf 
torch.Size([1025, 2048]) torch.Size([1025])
torch.Size([257, 2048]) torch.Size([1025])
QSAR_mlp(
  (fc1): Linear(in_features=2048, out_features=524)
  (fc2): Linear(in_features=524, out_features=10)
  (fc3): Linear(in_features=10, out_features=10)
  (fc4): Linear(in_features=10, out_features=3)
)
Loss: 1.1143544912338257
-snip-
Loss: 0.6231405735015869
pred:1, target:0
pred:1, target:0
-snip-
pred:0, target:0
Accuracy: 0.642023346303502

Hmm, the accuracy is not so high. I believe there is still room for improvement. I am a newbie at pytorch, so I will keep practicing next year.

This is my last code of this year. I would like to post to my blog more next year.
If any readers find mistakes in my code, please let me know.

Have a happy new year !!!!
😉