python in astronomy

When I was junior high school age, I read a book written by Stephen Hawking.
The book was very difficult for me so I could not understand well, but it was exciting.
Now I often read journals about organic chemistry, medicinal chemistry.
However Today I read another review letters about astronomy because some days ago I watched news about gravitational waves.

The title of the letters are “Observation of Gravitational Waves from a Binary Black Hole Merger”.
This is the first report that finding gravitational waves.
In the abstract, the author described
…… This is the first direct detection of gravitational waves and the first observation of a binary black hole merger.

The gravitational waves were detected the large detector named LIGO(The Laser Interferometer Gravitational-Wave Observatory ).
The results was very exciting and the analysis of the event (Named GW150914) can reproduce anyone using python.

The procedure was uploaded github.
Url is following.
The tutorial is using ipython notebook. ;-)

I think open data is important for science.
And I was encouraged again because I can see python anywhere.

SDWriter in rdkit

I have a fever since yesterday… ;-(
But not Flu.
I hope I’ll get well soon……

By the way, I have a problem about molecular chirality handling in rdkit.
I often use SDWriter for write molecules as SDF format. The function write molecule as V2000 format.
I wander can rdkit write molecule as v3000?
The answer is Yes.
Here is sample code.
SetForceV3000 method can change the file format.

from rdkit import Chem
mol = Chem.MolFromSmiles("N[C@@H](C)C(O)=O")
w2000 = Chem.SDWriter("2000.sdf")
w3000 = Chem.SDWriter("3000.sdf")

Check results.

In [14]: %cat 2000.sdf


  6  5  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
  2  3  1  0
  2  4  1  0
  4  5  1  0
  4  6  2  0

In [15]: %cat 3000.sdf


  0  0  0  0  0  0  0  0  0  0999 V3000
M  V30 COUNTS 6 5 0 0 0
M  V30 1 N 0 0 0 0
M  V30 2 C 0 0 0 0
M  V30 3 C 0 0 0 0
M  V30 4 C 0 0 0 0
M  V30 5 O 0 0 0 0
M  V30 6 O 0 0 0 0
M  V30 1 1 1 2
M  V30 2 1 2 3
M  V30 3 1 2 4
M  V30 4 1 4 5
M  V30 5 2 4 6

Works fine. I’ll go straight to bed.

Structure generation from query molecule using rdkit

I’m thinking about making app for auto structure generator that generate compounds from query_mol and fragment list.

Query mol is the molecule that will be changed and fragment list is the fragment molecule stocker.
I tried to code simple test script.
Following code is not smart I’ll refine ASAP.
(It’s difficult to handle large dataset.)

My strategy is that reaction definition using “Xe” as bond formation point.
And analyse molecule using FragmentOnBRICSBonds function of rdkit.

from rdkit import Chem
from rdkit.Chem import DataStructs
from rdkit.Chem import AllChem
from rdkit.Chem import Descriptors
import numpy as np
from rdkit.Chem.Draw import IPythonConsole

rxn = AllChem.ReactionFromSmarts('[*:1]-[Xe].[*:2]-[Xe]>>[*:1]-[*:2]')
#Tc calculator. Tempolary function. I'll planned to use Reduced Graph Fingerprint.
def calc_tanimoto(m1,m2):
    fp1 = AllChem.GetMorganFingerprintAsBitVect( m1,2 )
    fp2 = AllChem.GetMorganFingerprintAsBitVect( m2,2 )
    tc = DataStructs.TanimotoSimilarity( fp1, fp2 )
    return tc

#fragment mol using BRICS and return smiles list
def fragmenter( mol ):
    fgs = AllChem.FragmentOnBRICSBonds( mol )
    fgs_smi = Chem.MolToSmiles( fgs ).replace( "*", "Xe" ).split( "." )
    return fgs_smi

# check structure of start molecule
def check_querymol( fgs_smi ):
    res = [ smi.count("Xe") for smi in fgs_smi ]
    return res 

# generate fragment dictionary for design.
def gen_frag_dict( mol_list ):
    frag_dict = {}
    for mol in mol_list:
        fgs = fragmenter( mol )
        qmol = check_querymol( fgs )
        for i, j in enumerate( qmol ):
            if j in frag_dict.keys():
                frag_dict[ j ].add( str(fgs[i]) )
                frag_dict[ j ] = set( [str(fgs[i])] )
    keys = frag_dict.keys()
    for key in keys:
        frag_dict[ key ] = list( frag_dict[key] )
    return frag_dict

# generate molecules
def struct_gen( query_mol, frag_dict ):
    q_frgs = fragmenter( query_mol )
    # get query molecule's infromation.
    q_des = check_querymol( q_frgs )
    q_des.sort( reverse=True )
    # select starting point as random
    print( frag_dict[ q_des[0] ] )
    ps =  frag_dict[ q_des[0] ][ np.random.randint( len( frag_dict[ q_des[0] ] ) ) ] 
    ps = [ Chem.MolFromSmiles( ps ) ]
    for i in range( 1,len( q_des ) ):
        print( str(i)+" STEP" )
        #print(frag_dict[ q_des[i] ])
        ps = AllChem.EnumerateLibraryFromReaction( rxn, (ps, [Chem.MolFromSmiles(smi) for smi in frag_dict[ q_des[i] ]] ) )
        res = set()
        for p in ps:
                m = p[0]
                s = Chem.MolToSmiles(m)
                res.add( s )
        ps = [ Chem.MolFromSmiles( smi ) for smi in res ][:20]
    ps = [ mol for mol in ps if calc_tanimoto(query_mol,mol) <= 0.6 and Descriptors.MolWt(mol) <= 500 ]
    return ps

Then I tested the code.

from rdkit.Chem import Draw
mol = Chem.MolFromSmiles('CCCS(=O)(=O)Nc1ccc(F)c(c1F)C(=O)c2c[nH]c3c2cc(cn3)c4ccc(Cl)cc4')
mol2 = Chem.MolFromSmiles('Cc1ccccc1c1ccncc1')
mol3 = Chem.MolFromSmiles('CN1CCN(CC2=CC=C(C=C2)C(=O)NC2=CC(NC3=NC=CC(=N3)C3=CN=CC=C3)=C(C)C=C2)CC1')
a = gen_frag_dict([mol,mol2,mol3])
p = struct_gen(mol,a)

Now I got following image.

Test code seems working well.
Next step, I’ll implement some functions.

1) Random selection of fragments. ( Current code select molecule using indexing it’s not good for diversity.)
2) Fragment database builder instead of dictionary.
3) Suitable filter and optimiser for novel molecule generation.


Compound quality for drug discovery

I think there are some approaches to start drug discovery project.

  1. HTS, 2. Phenotypic screening, 3. FBDD, 4. Me too (Patent) etc.

We have to improve succeed rate of screening. So, compound quality is key factor.
Because good starting point is key for compound quality.
Serge et. al. published good review in drug discovery today. URL is following.

There are lots of indicators about drug likeness. For example Ro5, Ro3, QED, Golden Triangle etc. Some indicators are publicly-available. I consider about some rules in my project.

In the review, the author described about entropy and enthalpy.
I’m interested in entropy and enthalpy driven compound optimisation. These data tells chemist more details of binding affinity.

The entropic contribution relies on favorable but non-specific interaction. It causes molecular obesity.
And the enthalpic contributions relies of specific geometric complementarity between drug and target. It’s difficult to design. But we need control these interactions.

So, I think medchem take care about why the affinity is changed?
But, I have a question about entropy/enthalpy driven optimisation process.
If there are no structural data, can we understand details about interaction between drug and target, can we design molecule with suitable direction of the substituents ?

I think many compounds are needed to understand ligand-target interactions in the case of the above.

Recently medchem can access lots of indicators and in vitro data and are required rapid decision.

Who do you think about quality of compound?

Several reaction steps handling in RDKit

I want to make molecule using several reaction steps.
My situation was focused in substituted phenyl ring.
Greg who is Developer of RDKit posted great blog post before.
So, I read the post and try to make sample code.

My strategy is run reaction and get products as smiles because products of RunReactants need to sanitize( this is key step. Thanks for RDKit discuss for advice. :-) ) when products take to next step.

Sample snippet I wrote was following….

import sys
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole

#start sys.argv[1]

def aromaticsubstitution( smi ):
    smarts = "[cH&$(c(c)c):2]>>[c:2]{}".format( smi )
    rxn = AllChem.ReactionFromSmarts( smarts )
    return rxn

def runreaction( list_smiles, rxn ):
    mols = [ Chem.MolFromSmiles( smi ) for smi in list_smiles ]
    products = set()
    for mol in mols:
        ps = rxn.RunReactants( (mol,) )
        for x in ps:
            products.add( Chem.MolToSmiles( x[0], isomericSmiles=True ) )
    return products
# Define any substitution pattern user want to use.
rxnF = aromaticsubstitution( "(F)" )
rxnCl = aromaticsubstitution( "(Cl)" )
rxnOMe = aromaticsubstitution( "(OC)" )

Next run the reaction.

#1st round
monoCl = runreaction( ["N#Cc1ccccc1"], rxnCl )
#2nd round
diCl = runreaction( monoCl, rxnCl )
#3rd round
triCl = runreaction( diCl, rxnCl )
[/sourcecode ]

Check the results.

mols = []
for i in [ monoCl, diCl, triCl ]:
    for j in i:
        mols.append( Chem.MolFromSmiles(j) )
# All code run on IPython, so Molecules can visualise Grid image. 
from rdkit.Chem import Draw
Draw.MolsToGridImage( mols, molsPerRow=6 )


Works fine.
Using same protocol, I will make substituted phenyl rings MECE.

Create matched molecular series with RDKit.

New version of rdkit implemented new function about MMP named rdMMPA.
The class has FragmentMol function that returns fragments for MMP.
The function can set max number of Cut, and also can set cutting rules.
That’s means rdkit provide flexibility to chemo-informatician.
Now I’m trying to develop web app about mmpa and mmps.
So, I tested new function.
Following code was written in python3.( Dictionary object dose not have has_key method.)

from rdkit import Chem
from rdkit.Chem import rdMMPA
# I used sample files about cdk2.sdf in RDKit
mols = [ mol for mol in Chem.SDMolSupplier("cdk2.sdf") ]
# generate Fragment list using rdMMPA module.
# Condition 1) single cut, 2) get results as smiles.
fragmentlist = [ rdMMPA.FragmentMol( mol, maxCuts=1, resultsAsMols=False ) for mol in mols ]

Now I got fragmentlist.
Check it.

In [43]:fragmentlist[1]
Out[43]:(('', 'C1COC(C1)CO[*:1].Nc1nc(c2nc[nH]c2n1)[*:1]'),
 ('', 'N[*:1].c1nc2c(nc(nc2[nH]1)[*:1])OCC1CCCO1'),
 ('', 'C1COC(C1)C[*:1].Nc1nc(O[*:1])c2nc[nH]c2n1'),
 ('', 'C1COC(C1)[*:1].Nc1nc(OC[*:1])c2nc[nH]c2n1'))


Next, I made MMS(?) as python dictionary object.

fragdict = dict()
for fragments in fragmentlist:
    for fragment in fragments:
        core = fragment[1].split('.')[1]
        chain =  fragment[1].split('.')[0]
        if core in fragdict:
            fragdict[core].append( chain )
            fragdict.setdefault( core, [ chain ] )        

Results was….

# print result that has more than 3 fragments.
for k,v in fragdict.items():
    if len(v) >= 3:
        print(k, v)
        print( "="*20 )

OC[*:1] ['CC(C)C(Nc1nc(Nc2ccc(C(=O)[O-])c(Cl)c2)c2ncn(c2n1)C(C)C)[*:1]', 'CC(C)C(Nc1nc(Nc2cccc(Cl)c2)c2ncn(c2n1)C(C)C)[*:1]', 'CCC(Nc1nc(NCc2ccccc2)c2ncn(c2n1)C(C)C)[*:1]', 'COc1ccc(cc1)CNc1nc(nc2c1ncn2C(C)C)N(CCO)C[*:1]', 'Cn1cnc2c(nc(nc21)NC[*:1])NCc1ccccc1']
Nc1nc(OC[*:1])c2nc[nH]c2n1 ['C1=CCC(CC1)[*:1]', 'C1CCC(CC1)[*:1]', 'C1COC(C1)[*:1]', 'CC(C)C(=O)[*:1]']
c1ccc(cc1)C[*:1] ['CCC(CO)Nc1nc(N[*:1])c2ncn(c2n1)C(C)C', 'Cn1cnc2c(nc(nc21)NCCO)N[*:1]', 'O=C(c1ccccc1)c1cnc2n[nH]cc2c1O[*:1]', '[NH3+]C1CCC(CC1)Nc1nc(N[*:1])c2ncn(c2n1)C1CCCC1']
C[*:1] ['CC(C(=O)COc1nc(N)nc2[nH]cnc12)[*:1]', 'CC(C)C(=O)Nc1ncc(SCc2ncc(C[*:1])o2)s1', 'CC(C)C(CO)Nc1nc(Nc2ccc(C(=O)[O-])c(Cl)c2)c2ncn(c2n1)C(C)[*:1]', 'CC(C)C(CO)Nc1nc(Nc2cccc(Cl)c2)c2ncn(c2n1)C(C)[*:1]', 'CC(C)n1cnc2c(nc(nc21)N(CCO)CCO)NCc1ccc(cc1)O[*:1]', 'CC(C)n1cnc2c(nc(nc21)NC(CO)C(C)[*:1])Nc1ccc(C(=O)[O-])c(Cl)c1', 'CC(C)n1cnc2c(nc(nc21)NC(CO)C(C)[*:1])Nc1cccc(Cl)c1', 'CC(C)n1cnc2c(nc(nc21)NC(CO)C[*:1])NCc1ccccc1', 'CCC(CO)Nc1nc(NCc2ccccc2)c2ncn(c2n1)C(C)[*:1]', 'CCc1cnc(CSc2cnc(NC(=O)C(C)[*:1])s2)o1', 'CN(C)NC(=O)Nc1cccc2-c3n[nH]c(-c4ccc(cc4)O[*:1])c3C(=O)c12', 'COc1ccc(cc1)-c1[nH]nc2-c3cccc(NC(=O)NN(C)[*:1])c3C(=O)c21', 'COc1ccc(cc1)CNc1nc(nc2c1ncn2C(C)[*:1])N(CCO)CCO']
Cl[*:1] ['CC(C)C(CO)Nc1nc(Nc2ccc(C(=O)[O-])c(c2)[*:1])c2ncn(c2n1)C(C)C', 'CC(C)C(CO)Nc1nc(Nc2cccc(c2)[*:1])c2ncn(c2n1)C(C)C', 'C[NH+]1CCC(c2c(O)cc(O)c3c(=O)cc(oc23)-c2ccccc2[*:1])C(O)C1']
c1ccc(cc1)[*:1] ['CCC(CO)Nc1nc(NC[*:1])c2ncn(c2n1)C(C)C', 'Cc1nc2ccccn2c1-c1ccnc(n1)N[*:1]', 'Cn1cnc2c(nc(nc21)NCCO)NC[*:1]', 'O=C(c1ccccc1)c1cnc2n[nH]cc2c1OC[*:1]', 'O=C(c1cnc2n[nH]cc2c1OCc1ccccc1)[*:1]', '[NH3+]C1CCC(CC1)Nc1nc(NC[*:1])c2ncn(c2n1)C1CCCC1']
c1ccc(cc1)CN[*:1] ['CCC(CO)Nc1nc(c2ncn(c2n1)C(C)C)[*:1]', 'Cn1cnc2c(nc(nc21)NCCO)[*:1]', '[NH3+]C1CCC(CC1)Nc1nc(c2ncn(c2n1)C1CCCC1)[*:1]']
F[*:1] ['CCCCOc1c(cnc2[nH]ncc12)C(=O)c1c(F)cc(Br)cc1[*:1]', 'COc1cc(-c2ccc[nH]2)c2C(=O)Nc3ccc(c1c32)[*:1]', 'Cc1ccc(c(c1)Nc1ccnc(n1)Nc1ccc(cc1)S(N)(=O)=O)[*:1]']
N[*:1] ['C1=CCC(CC1)COc1nc(nc2[nH]cnc12)[*:1]', 'CC(C)C(=O)COc1nc(nc2[nH]cnc12)[*:1]', 'NC(=O)c1ccc(cc1)Nc1nc(OCC2CCCCC2)c(N=O)c(n1)[*:1]']
O=[N+]([O-])[*:1] ['COc1cc[nH]c1/C=C1\\C(=O)Nc2ccc(c(c21)N1CCCC(C1)C(N)=O)[*:1]', 'COc1cc[nH]c1/C=C1\\C(=O)Nc2ccc(cc21)[*:1]', 'NS(=O)(=O)c1ccc(cc1)Nc1cc([nH]n1)-c1ccc(cc1)[*:1]']

It seems work fine.
This is very simple example, I’ll make mms with assay data, and assist medchem data analysis more easily.

I think MATSY is interesting and familiar for medchem.