Use wolfram from python.

Wolfram alpha is a computational knowledge engine developed by Wolfram Research. It is an online service that answers factual queries directly by computing the answer from externally sourced “curated data”.
You know, the service can be used from following URL.
https://www.wolframalpha.com/
By the way, if there is API to use wolfram alpha from another applications, it sounds nice doesn’t it.
Yah, wolfram python API was developed and uploaded pypi!
So, to use the API I installed wolframalpha by using pip command.

iwatobioen$ pip install walframalpha

Let’s use the library.
To use API, I created wolfram web site account and got API-Key. Then start jupyter notebook.

import wolframalpha
client = wolframalpha.Client( "yourAPIkey!!"  )
# get weather forecast
res = client.query( "weather in Newyork tomorrow" )
# print response
for pod in res.pods:
    for sub in pod.subpods:
        print( sub["plaintext"] )
>>>output
weather forecast | New York City, United States
  | tomorrow
between 6 °C and 8 °C
few clouds (very early morning  |  early morning onward)  |  partly cloudy (very early morning to early morning)
between 6 °C and 13 °C
few clouds (early morning  |  late morning to early afternoon)  |  clear (early morning to late morning  |  early afternoon onward)
between 3 °C and 10 °C
clear (late afternoon to night)  |  cloudy (late night onward)
None
| clear: 51.7% (3.8 days)   |  overcast: 0% (0 minutes)
| rain: 19.6% (1.4 days)
None
None
low: -11 °C
Feb 1993 | average high:  | 7 °C
average low:  | -2 °C | high: 19 °C
Feb 2011
(daily ranges, not corrected for changes in local weather station environment)
res = client.query("np complete problems")
for pod in res.pods:
    for sub in pod.subpods:
        print( sub["plaintext"] )
>>>output
NP complete problems
bin packing problem  |  chromatic number problem  |  clique problem  |  crossing number problem  |  directed Hamiltonian cycle problem  |  domatic number problem  |  dominating set problem  |  feedback arc set problem  |  feedback vertex set problem  |  Hamiltonian cycle problem  |  Hamiltonian path problem  |  hitting set problem  |  independent set problem  |  knapsack problem  |  max cut problem  |  minimum cover problem  |  partition problem  |  satisfiability problem  |  set packing problem  |  set splitting problem  |  ...   (total:  27)
Is there a partition of a finite set of items U into disjoint sets U_1, U_2, ..., U_k such that the sum of the sizes of the items in each U_i is B or less.
Given a graph G = (V, E) and a positive integer K<=|V|, is G K-colorable?
Given a graph G = (V, E) and a positive integer K<=|V|, does G contain a clique of size K or more?
Does a graph have a crossing number of K or less?
Does a graph contain a directed Hamiltonian cycle?
Given a graph G = (V, E) and a positive integer K<=|V|, is the domatic number of G at least K?
Given a graph G = (V, E) and a positive integer K<=|V|, is there a dominating set of size K or less for G?
Given a directed graph G = (V, A) and a positive integer K<=|A|, is there a subset A'(subset equal)A with |A'|<=K such that A' contains at least one arc from every directed cycle in G?
Given a directed graph G = (V, A) and a positive integer K<=|V|, is there a subset V'(subset equal)V with |V'|<=K such that V' contains at least one vertex from every directed cycle in G?
Does a graph contain a Hamiltonian cycle?
Does there exist a function f:V->{1, 2, ..., K} such that f(u)!=f(v) whenever {u, v} element E?
Does G contain a subset V'(subset equal)V with |V'|>=K such that every two vertices in V' are joined by an edge in E?
Can a graph be embedded in the plane with K or fewer pairs edges crossing one another.
Can V be partitioned into k>=K disjoint sets V_1, V_2, ..., V_k such that each V_i is a dominating set for G?
Is there a subset V'(subset equal)V with |V'|<=K such that for all u element V - V',  there is a v element V' for which {u, v} element E?
Does G contain a subset V'(subset equal)V such that |V'|>=K and such that no two vertices in V' are joined by an edge in E?
Does C contain a subset C'(subset equal)C with |C'| with |C'|<=K such that every element of S belongs to at least one member of C'?
Does G contain a subset V(subset equal)V_1 and a subset E(subset equal)E_1 such that |V| = |V_2|, |E| = |E_2|, and there exists a one-to-one function f:V_2->V satisfying {u, v} element E_2 iff {f(u), f(v)} element E?
Does M contain a subset M'(subset equal)M such that |M'| = q and no two elements of M' agree in any coordinate?
Is there a subset V'(subset equal)V with |V'|<=K such that for each edge {u, v} element E,  at least one of u and v belongs to V'?
| formulation date | formulators | status
bin packing problem |  |  | proved NP-complete
chromatic number problem |  |  | proved NP-complete
clique problem |  |  | proved NP-complete
crossing number problem |  |  | proved NP-complete
directed Hamiltonian cycle problem |  |  | proved NP-complete
domatic number problem | 1975  (42 years ago) | Ernest Cockayne  |  Stephen Hedetniemi | proved NP-complete
dominating set problem |  |  | proved NP-complete
feedback arc set problem |  |  | proved NP-complete
feedback vertex set problem |  |  | proved NP-complete
Hamiltonian cycle problem |  |  | proved NP-complete
Hamiltonian path problem |  |  | proved NP-complete
hitting set problem |  |  | proved NP-complete
independent set problem |  |  | proved NP-complete
knapsack problem |  |  | proved NP-complete
max cut problem |  |  | proved NP-complete
minimum cover problem |  |  | proved NP-complete
partition problem |  |  | proved NP-complete
satisfiability problem |  |  | proved NP-complete
set packing problem |  |  | proved NP-complete
set splitting problem |  |  | proved NP-complete
subgraph isomorphism problem |  |  | proved NP-complete
subset product problem |  |  | proved NP-complete
subset sum problem |  |  | proved NP-complete
three-dimensional matching problem |  |  | proved NP-complete
3-satisfiability problem |  |  | proved NP-complete
traveling salesman problem |  |  | proved NP-complete
vertex cover problem |  |  | proved NP-complete
 | proof date | provers
chromatic number problem | 1972 (45 years ago) | Richard Karp
clique problem | 1972 (45 years ago) | Richard Karp
crossing number problem | 1983 (34 years ago) | Michael Garey  |  David S. Johnson
directed Hamiltonian cycle problem | 1972 (45 years ago) | Richard Karp
domatic number problem | 1976 (1 year later)  (41 years ago) | Michael Garey  |  David S. Johnson  |  Robert Tarjan
feedback arc set problem | 1972 (45 years ago) | Richard Karp
feedback vertex set problem | 1972 (45 years ago) | Richard Karp
Hamiltonian cycle problem | 1972 (45 years ago) | Richard Karp
hitting set problem | 1972 (45 years ago) | Richard Karp
knapsack problem | 1972 (45 years ago) | Richard Karp
max cut problem | 1972 (45 years ago) | Richard Karp
minimum cover problem | 1972 (45 years ago) | Richard Karp
partition problem | 1972 (45 years ago) | Richard Karp
satisfiability problem | 1971 (46 years ago) | Stephen Cook
set packing problem | 1972 (45 years ago) | Richard Karp
set splitting problem | 1973 (44 years ago) | László Lovász
subgraph isomorphism problem | 1971 (46 years ago) | Stephen Cook
subset product problem | 1972 (45 years ago) | Richard Karp
subset sum problem | 1972 (45 years ago) | Richard Karp
three-dimensional matching problem | 1972 (45 years ago) | Richard Karp
3-satisfiability problem | 1971 (46 years ago) | Stephen Cook
vertex cover problem | 1972 (45 years ago) | Richard Karp
NP complete problems  |  NP problems  |  mathematical problems  |  solved mathematics problems

It worked. Wolfram alpha is high quality service so web site very useful. The site can use image file as query!!!
And the python library is cool too I think.

Traveling Salesperson Problem

Some days ago, I wrote blog about knapsack problem.
Next I tried to solve traveling salesperson problem(TSP).

The problem is : “Given a list of cites and the distances between each pair of cities, what is the shortest possible route that visits each city exactly once and returns to the origin city ?”. It is an NP-hard problem. -from wiki-
Following code uses genetic algorithm to solve it.
Each cities positions(X,Y) are represented as complex. And define some function for calculation.
All tours are permutation of possible combination. It will be huge number if I use large number of cities.

import matplotlib.pyplot as plt
import matplotlib.colors as colors
import matplotlib.cm as cmx
import itertools, operator
import time
import random
import numpy, math
random.seed( 789 )
def exact_TSP( cities ):
    return shortest( alltours( cities ) )
def shortest( tours ):
    return min( tours, key = total_distance )
alltours = itertools.permutations
def total_distance( tour ):
    return sum( distance( tour[i], tour[i-1] ) for i in range( len(tour) ) )
City = complex
def generate_cities( n ):
    return set( City( random.randrange(10,890),
                                 random.randrange( 10,590) ) for i in range( n) )

Next I defined parameters for GA.
numpy.random.permutation returns random order permutation of given length.
Initial individual has random order for traveling.
Evaluate function calculate total distance of travel and GA will minimize the distance.

from deap import algorithms, base, creator, tools
from deap.tools import initIterate
num_cities  = 30
cities = generate_cities( num_cities )
toolbox = base.Toolbox()
creator.create( "FitnessMin", base.Fitness, weights=( -1.0, ) )
creator.create( "Individual", list, fitness = creator.FitnessMin )
toolbox.register( "indices", numpy.random.permutation, len( cities ) )
toolbox.register( "individual", initIterate, creator.Individual, toolbox.indices )
toolbox.register( "population", tools.initRepeat, list, toolbox.individual )
toolbox.register( "mate", tools.cxOrdered )
toolbox.register( "mutate", tools.mutShuffleIndexes, indpb=0.05 )
def create_tour( individual ):
    return [ list(cities) [e] for e in individual ]
def evaluation( individual ):
    return ( total_distance( create_tour( individual ) ), )

Next step. Set evaluator and selector, run code.

toolbox.register("evaluate", evaluation)
toolbox.register("select", tools.selTournament, tournsize=3)
%%time
res, log = algorithms.eaSimple( pop, toolbox,
                                                        cxpb = 0.8,
                                                        mutpb=0.2,
                                                        ngen=400, verbose=False)
>>> CPU times: user 5.38 s, sys: 9.52 ms, total: 5.39 s
>>>Wall time: 5.39 s

best_individual = tools.selBest(res, k=1)[0]
print('Fitness of the best individual: ', evaluation(best_individual)[0])
>>>Fitness of the best individual:  3558.4312378302448

The route seems cross. But better route than my choice.
Next, I will try to multi knapsack problem.
fig

Knapshack problem

Drug discovery project is multi objective optimization process. It’s difficult to solve you know.
There are many approaches to solve the problem and one is genetic algorithm.
Recently, deap is active library of GA for python.
I used deap for knapsack problem.
You know….
the knapsack problem or rucksack problem is a problem in combinatorial optimization: Given a set of items, each with a weight and a value, determine the number of each item to include in a collection so that the total weight is less than or equal to a given limit and the total value is as large as possible. It derives its name from the problem faced by someone who is constrained by a fixed-size knapsack and must fill it with the most valuable items.
From wikipedia.

Code is following. ( Almost same as default example code. )
Deap has default function for mutation and cross over. But following code use set for individual, so define cxSet / cxMut.

import random
import numpy as np

from deap import algorithms
from deap import base, creator, tools

IND_INIT_SIZE = 5
MAX_ITEM = 50
MAX_WEIGHT = 50

random.seed( 123 )
items = {}
# itembox ( weight, price )
for i in range( NBR_ITEMS ):
    items[ i ] = ( random.randint(1,10), random.uniform(0,100) )

creator.create( "Fitness", base.Fitness, weights=(-1.0,1.0) )
creator.create("Individual", set, fitness = creator.Fitness )

toolbox = base.Toolbox()

toolbox.register( "attr_item", random.randrange, NBR_ITEMS )
toolbox.register( "individual", tools.initRepeat, creator.Individual,toolbox.attr_item, IND_INIT_SIZE )
toolbox.register( "population", tools.initRepeat, list, toolbox.individual )

def evalKnapsack( individual ):
    weight = 0.0
    value = 0.0
    for item in individual:
        weight += items[ item ][0]
        value += items[ item ][1]
    if len( individual ) > MAX_ITEM or weight > MAX_WEIGHT:
        return 10000, 0
    return weight, value

def cxSet( ind1, ind2 ):
    temp = set( ind1)
    ind1 &= ind2
    ind2 ^= temp
    return ind1, ind2

def mutSet( individual ):
    if random.random() < 0.5:
        if len(individual)>0:
            individual.remove( random.choice( sorted(tuple(individual)) ) )
    else:
        individual.add( random.randrange(NBR_ITEMS))
    return individual,

toolbox.register("evaluate", evalKnapsack)
toolbox.register( "mate", cxSet )
toolbox.register( "mutate", mutSet )
toolbox.register( 'select', tools.selNSGA2 )

def main():
    random.seed(123)
    NGEN = 50
    MU = 50
    LAMBDA = 100
    CXPB = 0.7
    MUTPB = 0.2
    
    pop = toolbox.population(n=MU)
    hof = tools.ParetoFront()
    stats = tools.Statistics(lambda ind: ind.fitness.values)
    stats.register("avg", np.mean, axis=0)
    stats.register("std", np.std, axis=0)
    stats.register("min", np.min, axis=0)
    stats.register("max", np.max, axis=0)
    
    algorithms.eaMuPlusLambda(pop, toolbox, MU, LAMBDA, CXPB, MUTPB, NGEN, stats,
                              halloffame=hof)
    
    return pop, stats, hof
main()
gen	nevals	avg                          	std                        	min                        	max                          
0  	50    	[  21.92        199.47169802]	[  6.24768757  63.6981168 ]	[ 10.          63.12779453]	[  36.          326.38666121]
..........
50 	88    	[  17.58        447.32874335]	[  11.69630711  180.42873947]	[ 0.  0.]                  	[  42.         706.3663599]  
Out[102]:
([{1, 2, 3, 4, 7, 8, 10, 13, 14, 15, 18},
  {1, 2, 3, 4, 7, 8, 10, 13, 14, 15, 18},
  ........
  {2, 8, 10, 13, 14, 15}],

Deap has many function. I try to use Deap more and more.

Call rdkit from Ruby

Recently I use web app made with ruby.
The app is very useful for me, but there is not chemoinformatics library in ruby.
So, I want to find way to use rdkit from ruby.
I found some web page that described how to run shell script from ruby. The simplest way is using back quotation.
It is means call “python xxx.py” from ruby.
I tested it.
At first, I wrote sample script named test.py. Just load SDF and print molecules as smiles.

 from rdkit import Chem
 sdf = Chem.SDMolSupplier( 'cdk2.sdf' )
  
 for mol in sdf:
     smi = Chem.MolToSmiles( mol )
     print( smi )
                               

Then, I wrote ruby script.
It just call test.py. That’s all. 😉

callpy.rb

         
res = `python test.py`
  print res

Run callpy.rb.

         
iwatobipen$ ruby callpy.rb 
...........
Cn1cnc2c(NCc3ccccc3)nc(NCCO)nc21
CCC(CO)Nc1nc(NCc2ccccc2)c2ncn(C(C)C)c2n1
COc1ccc(CNc2nc(N(CCO)CCO)nc3c2ncn3C(C)C)cc1
Nc1nc(N)c(N=O)c(OCC2CCCCC2)n1
COc1ccc2c(c1)C(=Cc1cnc[nH]1)C(=O)N2
COc1cc[nH]c1C=C1C(=O)Nc2ccc([N+](=O)[O-])cc21
CCc1cnc(CSc2cnc(NC(=O)C(C)C)s2)o1
CCCCOc1c(C(=O)c2nccs2)cnc2[nH]ncc12
COc1cc(-c2ccc[nH]2)c2c3c(ccc(F)c13)NC2=O
[NH3+]CCSc1cc(-c2ccc[nH]2)c2c3c(ccc(F)c13)NC2=O
NC(=O)Nc1cccc2c1C(=O)c1c-2n[nH]c1-c1cccs1
COc1ccc2c(c1)C(=Cc1[nH]cc3c1CCOC3=O)C(=O)N2
CNS(=O)(=O)c1ccc2c(c1)C(=Cc1[nH]cc3c1CCNC3=O)C(=O)N2
CC(=O)Nc1cccc2c1C(=O)c1c-2n[nH]c1-c1ccncc1
....

It worked. 😉

There are some ways to call shell script from ruby. I will try another method soon.

encode and decode SMILES strings.

Today was very cold. ;-( But, my kids play at a park long time! @_@
I want to go snow board with my family…
BTW, recently I am interested in recurrent neural network. There are a lot of report about deep learning using smiles strings as input.
To use smiles strings as input, it needs to convert smiles to matrix.
I tried to implement smiles encoder and decoder.
Encoder converts smiles strings as matrix of one hot vector.
Decoder converts matrix to smiles.

from rdkit import Chem

SMILES_CHARS = [' ',
                  '#', '%', '(', ')', '+', '-', '.', '/',
                  '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
                  '=', '@',
                  'A', 'B', 'C', 'F', 'H', 'I', 'K', 'L', 'M', 'N', 'O', 'P',
                  'R', 'S', 'T', 'V', 'X', 'Z',
                  '[', '\\', ']',
                  'a', 'b', 'c', 'e', 'g', 'i', 'l', 'n', 'o', 'p', 'r', 's',
                  't', 'u']
smi2index = dict( (c,i) for i,c in enumerate( SMILES_CHARS ) )
index2smi = dict( (i,c) for i,c in enumerate( SMILES_CHARS ) )
def smiles_encoder( smiles, maxlen=120 ):
    smiles = Chem.MolToSmiles(Chem.MolFromSmiles( smiles ))
    X = np.zeros( ( maxlen, len( SMILES_CHARS ) ) )
    for i, c in enumerate( smiles ):
        X[i, smi2index[c] ] = 1
    return X

def smiles_decoder( X ):
    smi = ''
    X = X.argmax( axis=-1 )
    for i in X:
        smi += index2smi[ i ]
    return smi

Test the code.

mat=smiles_encoder('CC1CCN(CC1N(C)C2=NC=NC3=C2C=CN3)C(=O)CC#N')
mat.shape
Out:
(120, 56)
print( mat )
[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]

Next, decode mat.

dec=smiles_decoder(mat)
print(dec)
>>>CC1CCN(C(=O)CC#N)CC1N(C)c1ncnc2[nH]ccc12          

This is first step of my study of deep learning.
ref
https://arxiv.org/abs/1610.02415
https://github.com/maxhodak/keras-molecules/blob/master/molecules/vectorizer.py

Convert function in RDKit

It become large data file when large amount of molecules are saved as SDF format. So I often convert to SMILES from SDF.
I use MolToSmiles function to do that. But, new version of RDKit has convenient method to convert file format.
Here is sample snippet.

from rdkit import Chem
from rdkit.Chem.ChemUtils import SDFToCSV
f = open( 'out.csv', 'w' )
suppl = Chem.SDMolSupplier( 'cdk2.sdf' )
# convert sdf to smiles
SDFToCSV.Convert( suppl, f )
f.close()

Now I got out.csv file. Check the file.

iwatobipen$ head -n 10 out.csv
SMILES,id,Cluster,MODEL.SOURCE,MODEL.CCRATIO,r_mmffld_Potential_Energy-OPLS_2005,r_mmffld_RMS_Derivative-OPLS_2005,b_mmffld_Minimization_Converged-OPLS_2005
CC(C)C(=O)COc1nc(N)nc2[nH]cnc12,ZINC03814457,1,CORINA 3.44 0027  09.01.2008,1,-78.6454,0.000213629,1
Nc1nc(OCC2CCCO2)c2nc[nH]c2n1,ZINC03814459,2,CORINA 3.44 0027  09.01.2008,1,-67.4705,9.48919e-05,1
Nc1nc(OCC2CCC(=O)N2)c2nc[nH]c2n1,ZINC03814460,2,CORINA 3.44 0027  09.01.2008,1,-89.4303,5.17485e-05,1
Nc1nc(OCC2CCCCC2)c2nc[nH]c2n1,ZINC00023543,3,CORINA 3.44 0027  09.01.2008,1,-70.2463,6.35949e-05,1
Nc1nc(OCC2CC=CCC2)c2nc[nH]c2n1,ZINC03814458,3,CORINA 3.44 0027  09.01.2008,1,-72.9091,6.51479e-05,1
Cn1cnc2c(NCc3ccccc3)nc(NCCO)nc21,ZINC01641925,3,CORINA 3.44 0027  09.01.2008,1,-42.2404,0.000120409,1
CCC(CO)Nc1nc(NCc2ccccc2)c2ncn(C(C)C)c2n1,ZINC01649340,3,CORINA 3.44 0027  09.01.2008,1,-33.4734,7.14544e-05,1
COc1ccc(CNc2nc(N(CCO)CCO)nc3c2ncn3C(C)C)cc1,ZINC01487345,3,CORINA 3.44 0027  09.01.2008,1,-23.1357,8.18592e-05,1
Nc1nc(N)c(N=O)c(OCC2CCCCC2)n1,ZINC03814479,4,CORINA 3.44 0027  09.01.2008,1,-112.542,8.83166e-05,1

🎵🎵

Generate fragment fingerprint using RDKit

I often use MorganFP( ECFP or FCFP like ) for machine learning.
But for the chemist, I think it’s difficult to understand. On the other hand, fragment based fingerprint is easy to understand.
There are lots of report about fragment based fingerprint for QSAR. Fortunately, RDKit already implemented the method to generate FragmentFingerprint. 😉
It seems fun! I wrote a snippet to generate fragmentfingerprint.
I used a python library called Click for option parser. “Click” is very easy to understand and the library can improve readability of the code.
My code is following.
This code need one argument and one option. The argument is input.sdf and option is output filename. Default name of outputfile is calc_frag_fp.npz.
npz extension is needed because of I used savezmethod,

import click
import os
import numpy as np
from rdkit.Chem import RDConfig
from rdkit import Chem
from rdkit.Chem import FragmentCatalog
from rdkit.Chem import DataStructs

fName = os.path.join( RDConfig.RDDataDir, 'FunctionalGroups.txt' )
fparams = FragmentCatalog.FragCatParams( 1, 6, fName )
fcat = FragmentCatalog.FragCatalog( fparams )
fcgen = FragmentCatalog.FragCatGenerator()
fpgen = FragmentCatalog.FragFPGenerator()

@click.command()
@click.argument( 'infile' )
@click.option( '--output','-o', default = 'calc_frag_fp.npz' )
def getfragmentfp( infile, output ):
    fragfps = []
    sdf = Chem.SDMolSupplier( infile )
    counter = 0
    for mol in sdf:
        if mol == None:
            continue
        counter += 1
        nAdded = fcgen.AddFragsFromMol( mol, fcat )
    print( "{} mols read".format( counter ) )
    for mol in sdf:
        if mol == None:
            continue
        arr = np.zeros((1,))
        fp = fpgen.GetFPForMol( mol, fcat )
        DataStructs.ConvertToNumpyArray( fp, arr )
        fragfps.append( arr )
    fragfps = np.asarray( fragfps )
    np.savez( output, x = fragfps )
    print( fragfps.shape )
    print( 'done!' )

if __name__ == '__main__':
    getfragmentfp()

To use the code, command is very simple. Just type..

# not set option value.
iwatobipen$ python fragfpgen.py cdk2.sdf

Then some massages got from console and output.npz file is generated.

47 mols read
(47, 6118)
done!

The npz file has ndarray data of fragmentfingerprint of the SDF. The array can use input for Machine learning.
I pushed the code to my repository,

https://github.com/iwatobipen/fragfp