mishima.syk #8に参加してきました。

 週末に参加してきました。もう8回目になるのですね。毎回楽しい勉強会になり元気をいただきます。幹事、参加者、プレゼンターの方には感謝の気持ちでいっぱいでございます。

 今回は直前でハプニングもあったんですが、資料が素敵だったのでpython (jupyter)の良さが共有できたのではと感じました。有償でも行けると思えるレベルじゃないだろうか。
私も毎度のことながら稚拙なプレゼンをさせていただきまして、一応コードは下記にプッシュしました。8の下が良かったのかもしれないが、、、
https://github.com/Mishima-syk
個人的に新しい取り組みとしてはプレゼンにpreziを使ってみました。そっちに時間かけすぎてしまいましたが、ツールとしては面白いと思いました。あのヌルヌル感が素敵です。
 
 さて、以下答えのない独り言となります。
 アカデミア、企業にかかわらず、研究職といえど、本当に自分が研究したいこと、主務以外にもあれこれとやらなければならいことってありますよね。(私見ですが)
 優秀、有能な人はテキパキっとその辺捌けるので、いろんな仕事が集まるってのも理解できます。あいつに任せればまあ大丈夫でしょうと、信頼があるってのはきっとそれはそれで素晴らしいことだと思う。
 一方で、適材適所、その人の能力を120%発揮してもらおうと思ったら、ちょっとそれではいけないような気もする。もしあなたが経営者だったら、どういった戦略を立てますか?と考えてみる。

勉強会後は沼津バルに参加してワイワイと美味しいものを食べました。
とても楽しい1日でした!
牡蠣美味しかった!!!
Screen Shot 2016-05-29 at 10.44.49 PM

こっちはちょっと、、、、、みりん干しは美味しかったんでその勢いで食べたんですけど、いきなり全力でパンチを食らったくらいのショックがありました。w
Screen Shot 2016-05-29 at 10.44.59 PM

明日からまた勉強しよう!

Quantum chemistry calculation with RDKit.

I enjoyed JCUP VII. All presentations were very impressive for me and I interested in psi4 that is open source tool for quantum chemistry.
http://www.psicode.org/
Fortunately, I could install psi4 using conda command. It supports only python2.x.
If user want to install psi4, just type “conda install -c psi4 psi4 “. That’s all !
Ready to run psi4. Next, I need to prepare inputfile.
Basic structure of input file of psi4 is following. (DFT energy calculation)

#! Sample UHF/6-31G** CH2 computation <==comment here

memory 250 mb <== option

molecule ch2 {
  0 1 <= 0 means 0 the molecule is neutral, 1 means the spin mautiplicity
  C position.x position.y position.z <= atom symbole, X, Y, Z
  .....

}

set basis 6-31G** <==basis function
set reference uhf <==unrestricted Hartree-Fock
energy ('scf')

molecule was represented as atom coordination of x, y, z. It is not common structure for me.
So, I wrote simple convert code that transform sdf to input file.
My code is following.(This code translate only first molecule of SDF.)

from __future__ import print_function
import sys, os, argparse
from rdkit import Chem
from rdkit.Chem import AllChem

inf = sys.argv[ 1 ]
sdf = Chem.SDMolSupplier( inf )
mol = sdf.next()
print(mol.GetNumConformers())
if mol.GetNumConformers() <= 1:
    hmol = Chem.AddHs( mol )
    AllChem.EmbedMolecule(  hmol,
                            useExpTorsionAnglePrefs=True,
                            useBasicKnowledge=True )
    #AllChem.EmbedMolecule( hmol )
else:
    hmol = Chem.AddHs( mol )

atoms = [ atom for atom in hmol.GetAtoms() ]

def atomposition2string( atom ):
    line = "{} {} {} {}"
    conf = atom.GetOwningMol().GetConformer()
    posi = conf.GetAtomPosition( atom.GetIdx() )
    line = line.format( atom.GetSymbol(), posi.x, posi.y, posi.z )
    line +="\n\n"
    return line

outf = open('in.dat', 'w')

header="#! psi4input\n\n"
molstring ="molecule inputmol "

setstring = """
set basis 6-31G**\n
set reference uhf\n
energy( "scf" )\n
"""

outf = open('in.dat', 'w')
outf.write(header)

molstring += "{\n0 1\n\n"
for atom in atoms:
    l=atomposition2string(atom)
    molstring += l
molstring += "}\n"

outf.write( molstring )
outf.write(setstring)
outf.close()

All option was hardcoding.

Next I made sample molecule using RDKIT.

from rdkit import Chem
mol = Chem.MolFromSmiles( "Cc1ccccc1" )
w = Chem.SDWriter( "tol.sdf" )
w.write( mol )
w.close()

OK ready, then convert sdf to inputfile.

iwatobipen$ python sdf2psi.py toluene.sdf
iwatobipen$ cat in.dat
#! psi4input

molecule inputmol {
0 1

C 2.12643946569 -0.148076377601 0.280172726109

C 0.737112657203 -0.0563266799793 -0.220757232163

C 0.0894292148607 1.17309385348 -0.272662802846

C -1.26388345191 1.29962103669 0.0220149512856

C -1.99460856032 0.133837305932 0.199621363415

C -1.42391701302 -1.05115760794 -0.254293865166

C -0.0465313107692 -1.17873499271 -0.410958831529

H 2.06321195027 -0.624503035816 1.29295672814

H 2.54504673868 0.854508459836 0.429328334067

H 2.763215791 -0.825994020449 -0.31296772677

H 0.680672612474 2.01686875154 -0.629481432558

H -1.64020718362 2.2772793939 0.325940161739

H -3.02108403858 0.17949008105 0.519270208312

H -2.0538300122 -1.92491235886 -0.416048311006

H 0.438933140245 -2.12499380908 -0.552134271027

}

set basis 6-31G**

set reference uhf

energy( "scf" )

It seems to work fine! Then run calculation.
Just type, psi4 input.dat output.dat -n

iwatobipen$ time psi4 in.dat out.dat -n 6

real	0m4.868s
user	0m14.404s
sys	0m0.537s

I got out.dat file.
I’m reading document of psi4.
I think psi4 is one of nice open source tool of computational chemistry.
I’ll study quantum chemistry.

iwatobipen$ cat out.dat
    -----------------------------------------------------------------------
          Psi4: An Open-Source Ab Initio Electronic Structure Package
                              Psi4 1.0rc0 Driver

                          Git: Rev {_conda_cache_origin_head} 15fc63c 

    J. M. Turney, A. C. Simmonett, R. M. Parrish, E. G. Hohenstein,
    F. A. Evangelista, J. T. Fermann, B. J. Mintz, L. A. Burns, J. J. Wilke,
    M. L. Abrams, N. J. Russ, M. L. Leininger, C. L. Janssen, E. T. Seidl,
    W. D. Allen, H. F. Schaefer, R. A. King, E. F. Valeev, C. D. Sherrill,
    and T. D. Crawford, WIREs Comput. Mol. Sci. 2, 556-565 (2012)
    (doi: 10.1002/wcms.93)

                         Additional Contributions by
    A. E. DePrince, U. Bozkaya, A. Yu. Sokolov, D. G. A. Smith, R. Di Remigio,
    R. M. Richard, J. F. Gonthier, H. R. McAlexander, M. Saitow, and
    B. P. Pritchard
    -----------------------------------------------------------------------

    Psi4 started on: Sat May 21 14:22:12 2016

    Process ID:   5371
    PSI4DATADIR: /Users/iwatobipen/.pyenv/versions/anaconda-2.4.0/share/psi4
    Memory level set to 256.000 MB

  ==> Input File <==

--------------------------------------------------------------------------
#! psi4input

molecule inputmol {
0 1

C 2.12643946569 -0.148076377601 0.280172726109

C 0.737112657203 -0.0563266799793 -0.220757232163

C 0.0894292148607 1.17309385348 -0.272662802846

C -1.26388345191 1.29962103669 0.0220149512856

C -1.99460856032 0.133837305932 0.199621363415

C -1.42391701302 -1.05115760794 -0.254293865166

C -0.0465313107692 -1.17873499271 -0.410958831529

H 2.06321195027 -0.624503035816 1.29295672814

H 2.54504673868 0.854508459836 0.429328334067

H 2.763215791 -0.825994020449 -0.31296772677

H 0.680672612474 2.01686875154 -0.629481432558

H -1.64020718362 2.2772793939 0.325940161739

H -3.02108403858 0.17949008105 0.519270208312

H -2.0538300122 -1.92491235886 -0.416048311006

H 0.438933140245 -2.12499380908 -0.552134271027

}

set basis 6-31G**

set reference uhf

energy( "scf" )

--------------------------------------------------------------------------

*** tstart() called on tkyks-MacBook-Pro.local
*** at Sat May 21 14:22:13 2016

         ---------------------------------------------------------
                                   SCF
            by Justin Turney, Rob Parrish, and Andy Simmonett
                              UHF Reference
                        6 Threads,    256 MiB Core
         ---------------------------------------------------------

  ==> Geometry <==

    Molecular point group: c1
    Full point group: C1

    Geometry (in Angstrom), charge = 0, multiplicity = 1:

       Center              X                  Y                   Z               Mass
    ------------   -----------------  -----------------  -----------------  -----------------
           C          2.338487050599    -0.168643622045     0.358601540853    12.000000000000
           C          0.949160242112    -0.076893924424    -0.142328417419    12.000000000000
           C          0.301476799769     1.152526609036    -0.194233988102    12.000000000000
           C         -1.051835867001     1.279053792246     0.100443766030    12.000000000000
           C         -1.782560975411     0.113270061488     0.278050178159    12.000000000000
           C         -1.211869428111    -1.071724852384    -0.175865050422    12.000000000000
           C          0.165516274139    -1.199302237154    -0.332530016785    12.000000000000
           H          2.275259535179    -0.645070280260     1.371385542884     1.007825032070
           H          2.757094323589     0.833941215392     0.507757148811     1.007825032070
           H          2.975263375909    -0.846561264893    -0.234538912026     1.007825032070
           H          0.892720197383     1.996301507096    -0.551052617814     1.007825032070
           H         -1.428159598711     2.256712149456     0.404368976483     1.007825032070
           H         -2.809036453671     0.158922836606     0.597699023056     1.007825032070
           H         -1.841782427291    -1.945479603304    -0.337619496262     1.007825032070
           H          0.650980725154    -2.145561053524    -0.473705456283     1.007825032070

  Running in c1 symmetry.

  Rotational constants: A =      0.17882  B =      0.08793  C =      0.06227 [cm^-1]
  Rotational constants: A =   5360.95290  B =   2636.07055  C =   1866.71219 [MHz]
  Nuclear repulsion =  271.879087177987628

  Charge       = 0
  Multiplicity = 1
  Electrons    = 50
  Nalpha       = 25
  Nbeta        = 25

  ==> Algorithm <==

  SCF Algorithm Type is DF.
  DIIS enabled.
  MOM disabled.
  Fractional occupation disabled.
  Guess Type is CORE.
  Energy threshold   = 1.00e-06
  Density threshold  = 1.00e-06
  Integral threshold = 0.00e+00

  ==> Primary Basis <==

  Basis Set: 6-31G**
    Number of shells: 66
    Number of basis function: 145
    Number of Cartesian functions: 145
    Spherical Harmonics?: false
    Max angular momentum: 2

  ==> Pre-Iterations <==

   -------------------------------------------------------
    Irrep   Nso     Nmo     Nalpha   Nbeta   Ndocc  Nsocc
   -------------------------------------------------------
     A        145     145       0       0       0       0
   -------------------------------------------------------
    Total     145     145      25      25      25       0
   -------------------------------------------------------

  ==> Integral Setup <==

  ==> DFJK: Density-Fitted J/K Matrices <==

    J tasked:                  Yes
    K tasked:                  Yes
    wK tasked:                  No
    OpenMP threads:              6
    Integrals threads:           6
    Memory (MB):               183
    Algorithm:                Core
    Integral Cache:           NONE
    Schwarz Cutoff:          1E-12
    Fitting Condition:       1E-12

   => Auxiliary Basis Set <=

  Basis Set:
    Number of shells: 240
    Number of basis function: 767
    Number of Cartesian functions: 767
    Spherical Harmonics?: false
    Max angular momentum: 3

  Minimum eigenvalue in the overlap matrix is 6.6381480018E-04.
  Using Symmetric Orthogonalization.
  SCF Guess: Core (One-Electron) Hamiltonian.

  ==> Iterations <==

                           Total Energy        Delta E     RMS |[F,P]|

   @DF-UHF iter   1:  -205.23304615372797   -2.05233e+02   6.04480e-02
   @DF-UHF iter   2:  -170.13079066988712    3.51023e+01   5.20196e-02 DIIS
   @DF-UHF iter   3:  -226.19365660788827   -5.60629e+01   4.49774e-02 DIIS
   @DF-UHF iter   4:  -254.10885985388944   -2.79152e+01   3.33189e-02 DIIS
   @DF-UHF iter   5:  -268.80837968087422   -1.46995e+01   8.19862e-03 DIIS
   @DF-UHF iter   6:  -269.58619393756834   -7.77814e-01   2.85186e-03 DIIS
   @DF-UHF iter   7:  -269.69754836796079   -1.11354e-01   8.65443e-04 DIIS
   @DF-UHF iter   8:  -269.71152278937490   -1.39744e-02   3.29262e-04 DIIS
   @DF-UHF iter   9:  -269.71354390608400   -2.02112e-03   1.57613e-04 DIIS
   @DF-UHF iter  10:  -269.71419440323365   -6.50497e-04   5.79147e-05 DIIS
   @DF-UHF iter  11:  -269.71433938905835   -1.44986e-04   2.62495e-05 DIIS
   @DF-UHF iter  12:  -269.71437022304207   -3.08340e-05   1.44521e-05 DIIS
   @DF-UHF iter  13:  -269.71438177501813   -1.15520e-05   3.78687e-06 DIIS
   @DF-UHF iter  14:  -269.71438263614840   -8.61130e-07   1.01334e-06 DIIS
   @DF-UHF iter  15:  -269.71438267844809   -4.22997e-08   3.50502e-07 DIIS

  ==> Post-Iterations <==

   @Spin Contamination Metric:  -8.881784197E-14
   @S^2 Expected:                0.000000000E+00
   @S^2 Observed:               -8.881784197E-14
   @S   Expected:                0.000000000E+00
   @S   Observed:                0.000000000E+00

    Orbital Energies (a.u.)
    -----------------------

    Alpha Occupied:                                                       

       1A    -11.232193     2A    -11.228721     3A    -11.228561
       4A    -11.228179     5A    -11.223100     6A    -11.222292
       7A    -11.220477     8A     -1.153293     9A     -1.043927
      10A     -1.003502    11A     -0.928172    12A     -0.818142
      13A     -0.787929    14A     -0.687707    15A     -0.627932
      16A     -0.626262    17A     -0.576379    18A     -0.569458
      19A     -0.561284    20A     -0.539412    21A     -0.504515
      22A     -0.483283    23A     -0.452811    24A     -0.321967
      25A     -0.304862  

    Alpha Virtual:                                                        

      26A      0.137085    27A      0.148732    28A      0.243181
      29A      0.263514    30A      0.288079    31A      0.301195
      32A      0.316476    33A      0.318434    34A      0.334419
      35A      0.343650    36A      0.371991    37A      0.417017
      38A      0.479660    39A      0.491313    40A      0.525890
      41A      0.529168    42A      0.652208    43A      0.734546
      44A      0.741592    45A      0.750655    46A      0.768670
      47A      0.783180    48A      0.790292    49A      0.822630
      50A      0.834993    51A      0.838154    52A      0.845380
      53A      0.868186    54A      0.897361    55A      0.909334
      56A      0.931892    57A      0.969084    58A      0.998175
      59A      1.033428    60A      1.070228    61A      1.077825
      62A      1.088957    63A      1.095137    64A      1.116538
      65A      1.151673    66A      1.160011    67A      1.187652
      68A      1.205487    69A      1.224702    70A      1.237064
      71A      1.249958    72A      1.297255    73A      1.339902
      74A      1.360023    75A      1.378131    76A      1.443170
      77A      1.539029    78A      1.559924    79A      1.594993
      80A      1.646101    81A      1.722888    82A      1.724744
      83A      1.837813    84A      1.849758    85A      1.955008
      86A      1.987361    87A      2.078420    88A      2.090871
      89A      2.133700    90A      2.135491    91A      2.186987
      92A      2.226221    93A      2.267664    94A      2.282939
      95A      2.298196    96A      2.324082    97A      2.340714
      98A      2.372543    99A      2.374890   100A      2.391041
     101A      2.415743   102A      2.447705   103A      2.527864
     104A      2.594864   105A      2.616772   106A      2.664942
     107A      2.694929   108A      2.716371   109A      2.731899
     110A      2.741392   111A      2.769221   112A      2.802090
     113A      2.834865   114A      2.837176   115A      2.867525
     116A      2.889260   117A      2.909152   118A      2.932169
     119A      2.967933   120A      3.019370   121A      3.037355
     122A      3.087291   123A      3.128255   124A      3.178227
     125A      3.188039   126A      3.218667   127A      3.421825
     128A      3.429738   129A      3.475633   130A      3.551688
     131A      3.624122   132A      3.641477   133A      3.708909
     134A      3.778148   135A      3.819525   136A      3.828444
     137A      3.905211   138A      4.280298   139A      4.580291
     140A      4.598241   141A      4.617029   142A      4.801975
     143A      4.885618   144A      4.963210   145A      5.228419  

    Beta Occupied:                                                        

       1A    -11.232193     2A    -11.228721     3A    -11.228561
       4A    -11.228179     5A    -11.223100     6A    -11.222292
       7A    -11.220477     8A     -1.153293     9A     -1.043927
      10A     -1.003502    11A     -0.928172    12A     -0.818142
      13A     -0.787929    14A     -0.687707    15A     -0.627932
      16A     -0.626262    17A     -0.576379    18A     -0.569458
      19A     -0.561284    20A     -0.539412    21A     -0.504515
      22A     -0.483283    23A     -0.452811    24A     -0.321967
      25A     -0.304862  

    Beta Virtual:                                                         

      26A      0.137085    27A      0.148732    28A      0.243181
      29A      0.263514    30A      0.288079    31A      0.301195
      32A      0.316476    33A      0.318434    34A      0.334419
      35A      0.343650    36A      0.371991    37A      0.417017
      38A      0.479660    39A      0.491313    40A      0.525890
      41A      0.529168    42A      0.652208    43A      0.734546
      44A      0.741592    45A      0.750655    46A      0.768670
      47A      0.783180    48A      0.790292    49A      0.822630
      50A      0.834993    51A      0.838154    52A      0.845380
      53A      0.868186    54A      0.897361    55A      0.909334
      56A      0.931892    57A      0.969084    58A      0.998175
      59A      1.033428    60A      1.070228    61A      1.077825
      62A      1.088957    63A      1.095137    64A      1.116538
      65A      1.151673    66A      1.160011    67A      1.187652
      68A      1.205487    69A      1.224702    70A      1.237064
      71A      1.249958    72A      1.297255    73A      1.339902
      74A      1.360023    75A      1.378131    76A      1.443170
      77A      1.539029    78A      1.559924    79A      1.594993
      80A      1.646101    81A      1.722888    82A      1.724744
      83A      1.837813    84A      1.849758    85A      1.955008
      86A      1.987361    87A      2.078420    88A      2.090871
      89A      2.133700    90A      2.135491    91A      2.186987
      92A      2.226221    93A      2.267664    94A      2.282939
      95A      2.298196    96A      2.324082    97A      2.340714
      98A      2.372543    99A      2.374890   100A      2.391041
     101A      2.415743   102A      2.447705   103A      2.527864
     104A      2.594864   105A      2.616772   106A      2.664942
     107A      2.694929   108A      2.716371   109A      2.731899
     110A      2.741392   111A      2.769221   112A      2.802090
     113A      2.834865   114A      2.837176   115A      2.867525
     116A      2.889260   117A      2.909152   118A      2.932169
     119A      2.967933   120A      3.019370   121A      3.037355
     122A      3.087291   123A      3.128255   124A      3.178227
     125A      3.188039   126A      3.218667   127A      3.421825
     128A      3.429738   129A      3.475633   130A      3.551688
     131A      3.624122   132A      3.641477   133A      3.708909
     134A      3.778148   135A      3.819525   136A      3.828444
     137A      3.905211   138A      4.280298   139A      4.580291
     140A      4.598241   141A      4.617029   142A      4.801975
     143A      4.885618   144A      4.963210   145A      5.228419  

    Final Occupation by Irrep:
              A
    DOCC [    25 ]
    SOCC [     0 ]

  Energy converged.

  @DF-UHF Final Energy:  -269.71438267844809

   => Energetics <=

    Nuclear Repulsion Energy =            271.8790871779876284
    One-Electron Energy =                -902.0381267071792308
    Two-Electron Energy =                 360.4446568507435131
    DFT Exchange-Correlation Energy =       0.0000000000000000
    Empirical Dispersion Energy =           0.0000000000000000
    PCM Polarization Energy =               0.0000000000000000
    EFP Energy =                            0.0000000000000000
    Total Energy =                       -269.7143826784481462

  Saving occupied orbitals to File 180.

  UHF NO Occupations:
  HONO-2 :   23  A 2.0000000
  HONO-1 :   24  A 2.0000000
  HONO-0 :   25  A 2.0000000
  LUNO+0 :   26  A 0.0000000
  LUNO+1 :   27  A 0.0000000
  LUNO+2 :   28  A 0.0000000
  LUNO+3 :   29  A 0.0000000

*** tstop() called on tkyks-MacBook-Pro.local at Sat May 21 14:22:17 2016
Module time:
	user time   =      13.83 seconds =       0.23 minutes
	system time =       0.40 seconds =       0.01 minutes
	total time  =          4 seconds =       0.07 minutes
Total time:
	user time   =      13.83 seconds =       0.23 minutes
	system time =       0.40 seconds =       0.01 minutes
	total time  =          4 seconds =       0.07 minutes

Properties will be evaluated at   0.000000,   0.000000,   0.000000 Bohr

Properties computed using the SCF density matrix

  Nuclear Dipole Moment: (a.u.)
     X:     3.2552      Y:    -0.3157      Z:     1.2040

  Electronic Dipole Moment: (a.u.)
     X:    -3.0898      Y:     0.2837      Z:    -1.0784

  Dipole Moment: (a.u.)
     X:     0.1654      Y:    -0.0320      Z:     0.1256     Total:     0.2101

  Dipole Moment: (Debye)
     X:     0.4204      Y:    -0.0814      Z:     0.3192     Total:     0.5341

*** Psi4 exiting successfully. Buy a developer a beer!

3D shape recognition using DL.

Recognition of 3D object is common task for human but hard for computer.
Interesting report was published by researchers work in Princeton University.
URL is following.
https://people.csail.mit.edu/khosla/papers/cvpr2015_wu.pdf
They developed 3D shape prediction system named ‘3D ShapeNets’.
I was interested the system because it collect image data from one side and then, the system predict next view point to minimise uncertain data.
The author described ….
For example, humans do not need to see the legs of a table to know that they are there and potentially what they might look like behind the visible surface

To do that, next best view prediction is key.
They tried to predict 3D object like furniture i.e. table, sofa, bed, etc….
And 3D-Shapenets (Based powerful CNN) showed good performance.

In this report, I could know descriptor named Light Field Descriptor(LFD), Spherical Harmonic Descriptor(SHD), and voxel ( like pixel in 2D ). ;-)

In drug discovery project, 3D shape of molecule was predicted from conformational search. So, It’s different in this case.
But 3D shape recognition is interesting and deep area.

Also I found amazing report in J Chem inf models.
http://pubs.acs.org/doi/abs/10.1021/acs.jcim.5b00544?journalCode=jcisd8
The system uses kinect and source code was uploaded github.
Think molecule as 3D!

molecular classification using molecular images (failed)

In my opinion, Good Medicinal Chemists has good eye.
I think they can classify molecules efficiently. It obvious but almost of them can’t under stand molecular fingerprint. My question is that how do they design molecules, 3D, 2D or their own experiment?
If they design molecules based 2D, deep learning is quite useful.
Deep learning is powerful method to image classification in these days.
So, I tried to make molecular image classification code.
At first I coded mol to image script. The code is following.
I used hERG dataset from ChEMBLDB.

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from sklearn.cross_validation import train_test_split
from rdkit.Chem import Draw
import pandas as pd
import numpy as np
import cv2 as cv
import glob
import numpy as np
import pickle
from sklearn.cross_validation import train_test_split

df = pd.read_table('bioactivity-15_13-09-28.txt', header=0)
df_bind = df[ df.ASSAY_TYPE=="B" ]
df_bind = df_bind[ df_bind.STANDARD_VALUE != None ]
df_bind = df_bind[ df_bind.STANDARD_VALUE >= 0 ]

rows = df_bind.shape[ 0 ]
mols = [ ]
act = [ ]

def act2bin( val ):
    if val > 10000:
        return 0
    else:
        return 1

for i in range( rows ):
    try:
        smi = df_bind.CANONICAL_SMILES[i]
        mol = Chem.MolFromSmiles( smi )
        if mol != None:
            mols.append( mol )
            act.append( act2bin( df_bind.STANDARD_VALUE[i]) )
        else:
            pass
    except:
        pass

# save mols image dataset
for idx, mol in enumerate( mols ):
    if act[ idx ] == 1:
        Draw.MolToFile( mols[idx],"./posi/idx_{}.png".format( idx ), size=(150,150)  )
    elif act[ idx ] == 0:
        Draw.MolToFile( mols[idx], "./nega/idx_{}.png".format( idx ), size=(150,150) )

# transpose image shape from [ 300,300,3 ] to [ 3, 300, 300 ]
def transimage( image ):
    im = cv.imread( image )
    im = im.transpose( [ 2,0,1 ] )
    return im
# get image filenames
pos = glob.glob( 'posi/*.png' )
neg = glob.glob( 'nega/*.png' )

X = []
Y = []

for posf in pos:
    x = transimage( posf )
    X.append( x )
    Y.append( 1 )
for negf in neg:
    x = transimage( negf )
    X.append( x )
    Y.append( 0 )
X = np.asarray(X)
Y = np.asarray(Y)

x_train, x_test, y_train, y_test = train_test_split( X,Y, test_size=0.2, random_state=123 )

f = open( 'imagedataset.pkl', 'wb' )
pickle.dump([ ( x_train,y_train ), ( x_test, y_test ) ], f)
f.close()

Next, I coded learner and classifer.
I used CNN.

import numpy as np

from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout, Flatten
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.utils import np_utils
import pickle

( x_train, y_train ), ( x_test, y_test ) = pickle.load( open('imagedataset.pkl', 'rb') )

batch_size = 100
nb_classes = 2
nb_epoch = 20
nb_filters = 32
nb_pool = 2
nb_conv = 3
im_rows , im_cols = 150, 150
im_channels = 3

x_train = x_train.astype( "float32" )
x_test = x_test.astype( "float32" )
x_train /= 255
x_test /= 255
y_train = np_utils.to_categorical( y_train, nb_classes )
y_test = np_utils.to_categorical( y_test, nb_classes )

print( x_train.shape[0], 'train samples' )
print( x_test.shape[0], 'test samples' )


model = Sequential()
model.add( Convolution2D( 10, 3, 3,
                            border_mode = 'same',
                            input_shape = ( im_channels, im_rows, im_cols ) ) )
model.add( Activation( 'relu' ) )
model.add( MaxPooling2D( pool_size=(2, 2) ) )
model.add( Convolution2D( 10, 3, 3,
                             ))
model.add( Activation( 'relu' ) )
model.add( Convolution2D( 10, 3, 3 ) )
model.add( Activation( 'relu' ) )
model.add( MaxPooling2D( pool_size=( 2, 2 ) ) )
model.add( Flatten() )
model.add( Dense( 10 ) )
model.add( Activation( 'relu' ) )
model.add( Dense(nb_classes) )
model.add( Activation('softmax') )
model.compile( loss='categorical_crossentropy',
               optimizer='adadelta',
               metrics=['accuracy'],
               )

hist = model.fit( x_train, y_train,
                  batch_size = batch_size,
                  nb_epoch = nb_epoch,
                  verbose = 1,
                  validation_data = ( x_test, y_test ))
print( model.summary() )
score = model.evaluate( x_test, y_test, verbose=0 )

Then run code.

iwatobipen$ python learn_mol_image.py 
Using Theano backend.
Using gpu device 0: GeForce GT 750M (CNMeM is disabled, cuDNN 5004)
4364 train samples
1092 test samples
Train on 4364 samples, validate on 1092 samples
Epoch 1/20
4364/4364 [==============================] - 15s - loss: 5.9538 - acc: 0.6244 - val_loss: 6.2288 - val_acc: 0.6136
Epoch 2/20
4364/4364 [==============================] - 15s - loss: 6.0461 - acc: 0.6249 - val_loss: 6.2288 - val_acc: 0.6136
Epoch 3/20
4364/4364 [==============================] - 15s - loss: 6.0461 - acc: 0.6249 - val_loss: 6.2288 - val_acc: 0.6136
Epoch 4/20
4364/4364 [==============================] - 15s - loss: 6.0461 - acc: 0.6249 - val_loss: 6.2288 - val_acc: 0.6136
Epoch 5/20
4364/4364 [==============================] - 15s - loss: 6.0461 - acc: 0.6249 - val_loss: 6.2288 - val_acc: 0.6136
Epoch 6/20
4364/4364 [==============================] - 15s - loss: 6.0461 - acc: 0.6249 - val_loss: 6.2288 - val_acc: 0.6136
Epoch 7/20
4364/4364 [==============================] - 15s - loss: 6.0461 - acc: 0.6249 - val_loss: 6.2288 - val_acc: 0.6136
Epoch 8/20
4364/4364 [==============================] - 15s - loss: 6.0461 - acc: 0.6249 - val_loss: 6.2288 - val_acc: 0.6136
Epoch 9/20
4364/4364 [==============================] - 15s - loss: 6.0461 - acc: 0.6249 - val_loss: 6.2288 - val_acc: 0.6136
Epoch 10/20
4364/4364 [==============================] - 15s - loss: 6.0461 - acc: 0.6249 - val_loss: 6.2288 - val_acc: 0.6136
Epoch 11/20
4364/4364 [==============================] - 15s - loss: 6.0461 - acc: 0.6249 - val_loss: 6.2288 - val_acc: 0.6136
Epoch 12/20
4364/4364 [==============================] - 15s - loss: 6.0461 - acc: 0.6249 - val_loss: 6.2288 - val_acc: 0.6136
Epoch 13/20
4364/4364 [==============================] - 15s - loss: 6.0461 - acc: 0.6249 - val_loss: 6.2288 - val_acc: 0.6136
Epoch 14/20
4364/4364 [==============================] - 15s - loss: 6.0461 - acc: 0.6249 - val_loss: 6.2288 - val_acc: 0.6136
Epoch 15/20
4364/4364 [==============================] - 15s - loss: 6.0461 - acc: 0.6249 - val_loss: 6.2288 - val_acc: 0.6136
Epoch 16/20
4364/4364 [==============================] - 15s - loss: 6.0461 - acc: 0.6249 - val_loss: 6.2288 - val_acc: 0.6136
Epoch 17/20
4364/4364 [==============================] - 15s - loss: 6.0461 - acc: 0.6249 - val_loss: 6.2288 - val_acc: 0.6136
Epoch 18/20
4364/4364 [==============================] - 15s - loss: 6.0461 - acc: 0.6249 - val_loss: 6.2288 - val_acc: 0.6136
Epoch 19/20
4364/4364 [==============================] - 15s - loss: 6.0461 - acc: 0.6249 - val_loss: 6.2288 - val_acc: 0.6136
Epoch 20/20
4364/4364 [==============================] - 15s - loss: 6.0461 - acc: 0.6249 - val_loss: 6.2288 - val_acc: 0.6136
____________________________________________________________________________________________________
Layer (type)                       Output Shape        Param #     Connected to                     
====================================================================================================
convolution2d_1 (Convolution2D)    (None, 10, 150, 150)280         convolution2d_input_1[0][0]      
____________________________________________________________________________________________________
activation_1 (Activation)          (None, 10, 150, 150)0           convolution2d_1[0][0]            
____________________________________________________________________________________________________
maxpooling2d_1 (MaxPooling2D)      (None, 10, 75, 75)  0           activation_1[0][0]               
____________________________________________________________________________________________________
convolution2d_2 (Convolution2D)    (None, 10, 73, 73)  910         maxpooling2d_1[0][0]             
____________________________________________________________________________________________________
activation_2 (Activation)          (None, 10, 73, 73)  0           convolution2d_2[0][0]            
____________________________________________________________________________________________________
convolution2d_3 (Convolution2D)    (None, 10, 71, 71)  910         activation_2[0][0]               
____________________________________________________________________________________________________
activation_3 (Activation)          (None, 10, 71, 71)  0           convolution2d_3[0][0]            
____________________________________________________________________________________________________
maxpooling2d_2 (MaxPooling2D)      (None, 10, 35, 35)  0           activation_3[0][0]               
____________________________________________________________________________________________________
flatten_1 (Flatten)                (None, 12250)       0           maxpooling2d_2[0][0]             
____________________________________________________________________________________________________
dense_1 (Dense)                    (None, 10)          122510      flatten_1[0][0]                  
____________________________________________________________________________________________________
activation_4 (Activation)          (None, 10)          0           dense_1[0][0]                    
____________________________________________________________________________________________________
dense_2 (Dense)                    (None, 2)           22          activation_4[0][0]               
____________________________________________________________________________________________________
activation_5 (Activation)          (None, 2)           0           dense_2[0][0]                    
====================================================================================================
Total params: 124632
____________________________________________________________________________________________________
None

It’s hopeless…. ;-(
The code can’t learn from dataset.

Hmm…..Is size of image too small ?
Maybe molecular description as image is not right way manner…..
少しはいけるかなあって思ってみたんですが安直だったかな。

Interesting Web app Named ‘ChemTreeMap’ using RDKit.

I like web app because user does not need client soft to use it.
I often use cytoscape to visualise molecular network. Network view is very informative.
Yesterday, I found cool web app that using RDKit.
URL is following.
http://ajing.github.io/ChemTreeMap/
The app is an open source application for visualizing molecular networks.
If user can use docker, Installation is very easy. ;-)
I’m mac user, so I start boot2docker at first.
And run following command.

iwatobipen$ boot2docker start
iwatobipen$ docker pull ajing/chemtreemap

Wait several minutes……
Then run the server.

iwatobipen$ docker run -t -i -p 8000:8000 ajing/chemtreemap /bin/bash
root@6969be03007e:/# cd examples/
root@6969be03007e:/examples# python examples.py

Ready!
Access to server from mac.
http://’docker’s ip’:8000/dist/#/aff

To get docker’s ip just input following command  ‘boot2docker ip’.

I got following interactive view.
chemtreemap_sc

Worked fine. The app seems light weight and colourful.
And the library using RDKit for chemical structure handling, it’s meaning the app has  flexibility for development.
I wish the app had structure search function.

Docker is very useful to share nice technology for everyone.
I’m looking forward to major release of native docker app for MAC.