Traveling Salesperson Problem

Some days ago, I wrote blog about knapsack problem.
Next I tried to solve traveling salesperson problem(TSP).

The problem is : “Given a list of cites and the distances between each pair of cities, what is the shortest possible route that visits each city exactly once and returns to the origin city ?”. It is an NP-hard problem. -from wiki-
Following code uses genetic algorithm to solve it.
Each cities positions(X,Y) are represented as complex. And define some function for calculation.
All tours are permutation of possible combination. It will be huge number if I use large number of cities.

import matplotlib.pyplot as plt
import matplotlib.colors as colors
import matplotlib.cm as cmx
import itertools, operator
import time
import random
import numpy, math
random.seed( 789 )
def exact_TSP( cities ):
    return shortest( alltours( cities ) )
def shortest( tours ):
    return min( tours, key = total_distance )
alltours = itertools.permutations
def total_distance( tour ):
    return sum( distance( tour[i], tour[i-1] ) for i in range( len(tour) ) )
City = complex
def generate_cities( n ):
    return set( City( random.randrange(10,890),
                                 random.randrange( 10,590) ) for i in range( n) )

Next I defined parameters for GA.
numpy.random.permutation returns random order permutation of given length.
Initial individual has random order for traveling.
Evaluate function calculate total distance of travel and GA will minimize the distance.

from deap import algorithms, base, creator, tools
from deap.tools import initIterate
num_cities  = 30
cities = generate_cities( num_cities )
toolbox = base.Toolbox()
creator.create( "FitnessMin", base.Fitness, weights=( -1.0, ) )
creator.create( "Individual", list, fitness = creator.FitnessMin )
toolbox.register( "indices", numpy.random.permutation, len( cities ) )
toolbox.register( "individual", initIterate, creator.Individual, toolbox.indices )
toolbox.register( "population", tools.initRepeat, list, toolbox.individual )
toolbox.register( "mate", tools.cxOrdered )
toolbox.register( "mutate", tools.mutShuffleIndexes, indpb=0.05 )
def create_tour( individual ):
    return [ list(cities) [e] for e in individual ]
def evaluation( individual ):
    return ( total_distance( create_tour( individual ) ), )

Next step. Set evaluator and selector, run code.

toolbox.register("evaluate", evaluation)
toolbox.register("select", tools.selTournament, tournsize=3)
%%time
res, log = algorithms.eaSimple( pop, toolbox,
                                                        cxpb = 0.8,
                                                        mutpb=0.2,
                                                        ngen=400, verbose=False)
>>> CPU times: user 5.38 s, sys: 9.52 ms, total: 5.39 s
>>>Wall time: 5.39 s

best_individual = tools.selBest(res, k=1)[0]
print('Fitness of the best individual: ', evaluation(best_individual)[0])
>>>Fitness of the best individual:  3558.4312378302448

The route seems cross. But better route than my choice.
Next, I will try to multi knapsack problem.
fig

Advertisements

Knapshack problem

Drug discovery project is multi objective optimization process. It’s difficult to solve you know.
There are many approaches to solve the problem and one is genetic algorithm.
Recently, deap is active library of GA for python.
I used deap for knapsack problem.
You know….
the knapsack problem or rucksack problem is a problem in combinatorial optimization: Given a set of items, each with a weight and a value, determine the number of each item to include in a collection so that the total weight is less than or equal to a given limit and the total value is as large as possible. It derives its name from the problem faced by someone who is constrained by a fixed-size knapsack and must fill it with the most valuable items.
From wikipedia.

Code is following. ( Almost same as default example code. )
Deap has default function for mutation and cross over. But following code use set for individual, so define cxSet / cxMut.

import random
import numpy as np

from deap import algorithms
from deap import base, creator, tools

IND_INIT_SIZE = 5
MAX_ITEM = 50
MAX_WEIGHT = 50

random.seed( 123 )
items = {}
# itembox ( weight, price )
for i in range( NBR_ITEMS ):
    items[ i ] = ( random.randint(1,10), random.uniform(0,100) )

creator.create( "Fitness", base.Fitness, weights=(-1.0,1.0) )
creator.create("Individual", set, fitness = creator.Fitness )

toolbox = base.Toolbox()

toolbox.register( "attr_item", random.randrange, NBR_ITEMS )
toolbox.register( "individual", tools.initRepeat, creator.Individual,toolbox.attr_item, IND_INIT_SIZE )
toolbox.register( "population", tools.initRepeat, list, toolbox.individual )

def evalKnapsack( individual ):
    weight = 0.0
    value = 0.0
    for item in individual:
        weight += items[ item ][0]
        value += items[ item ][1]
    if len( individual ) > MAX_ITEM or weight > MAX_WEIGHT:
        return 10000, 0
    return weight, value

def cxSet( ind1, ind2 ):
    temp = set( ind1)
    ind1 &= ind2
    ind2 ^= temp
    return ind1, ind2

def mutSet( individual ):
    if random.random() < 0.5:
        if len(individual)>0:
            individual.remove( random.choice( sorted(tuple(individual)) ) )
    else:
        individual.add( random.randrange(NBR_ITEMS))
    return individual,

toolbox.register("evaluate", evalKnapsack)
toolbox.register( "mate", cxSet )
toolbox.register( "mutate", mutSet )
toolbox.register( 'select', tools.selNSGA2 )

def main():
    random.seed(123)
    NGEN = 50
    MU = 50
    LAMBDA = 100
    CXPB = 0.7
    MUTPB = 0.2
    
    pop = toolbox.population(n=MU)
    hof = tools.ParetoFront()
    stats = tools.Statistics(lambda ind: ind.fitness.values)
    stats.register("avg", np.mean, axis=0)
    stats.register("std", np.std, axis=0)
    stats.register("min", np.min, axis=0)
    stats.register("max", np.max, axis=0)
    
    algorithms.eaMuPlusLambda(pop, toolbox, MU, LAMBDA, CXPB, MUTPB, NGEN, stats,
                              halloffame=hof)
    
    return pop, stats, hof
main()
gen	nevals	avg                          	std                        	min                        	max                          
0  	50    	[  21.92        199.47169802]	[  6.24768757  63.6981168 ]	[ 10.          63.12779453]	[  36.          326.38666121]
..........
50 	88    	[  17.58        447.32874335]	[  11.69630711  180.42873947]	[ 0.  0.]                  	[  42.         706.3663599]  
Out[102]:
([{1, 2, 3, 4, 7, 8, 10, 13, 14, 15, 18},
  {1, 2, 3, 4, 7, 8, 10, 13, 14, 15, 18},
  ........
  {2, 8, 10, 13, 14, 15}],

Deap has many function. I try to use Deap more and more.

select diverse molecules from dataset.

To select subset from compound library, I think about molecular diversity.
So, I often use clustering .
There are many clustering method in chemo-informatics area.
But, today I used another approach. Use GA.
My strategy is that, evaluate diversity of molecules is sum of dissimilarity.
Let’s start.
For convenience, I used rdkit first_prop_200.sdf as dataset.

import scipy.misc as scm
import numpy as np
import random
from rdkit import Chem
from rdkit.Chem import DataStructs
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from deap import creator, base
from deap import algorithms
from deap import tools

mols = [mol for mol in Chem.SDMolSupplier('first_200.props.sdf')]
n_mols = len( mols )

# define diversity as sum of dissimilarity
def eval_diversity(individual):
    combination = scm.comb( len(individual),2 )
    dis_sim_mat=[]
    fps =[ AllChem.GetMorganFingerprintAsBitVect(mols[i], 2) for i in individual ]
    for idx in range(1,len(fps)):
        # return distance returnDistance=1
        dis_sim = DataStructs.BulkCosineSimilarity(fps[idx], fps[:idx], returnDistance=1)
        dis_sim_mat.extend(dis_sim)
    return sum(dis_sim_mat)/combination,

Points.
1. I want to maximise dissimilarity
2. each chromosome has ten molecules.
3. population is 20.

creator.create( 'Fitness', base.Fitness, weights=(1.0,) )
creator.create( 'Individual', list, fitness = creator.Fitness )
toolbox = base.Toolbox()
#generate random int from range(200). It means select molecules in random.
toolbox.register('attr_item', random.randint, 0,n_mols)
# select 10 molecules from library and maximize diversity.
toolbox.register('individual', tools.initRepeat, creator.Individual, toolbox.attr_item, 10)
toolbox.register('population', tools.initRepeat, list, toolbox.individual)

toolbox.register('evaluate', eval_diversity)
toolbox.register('mate', tools.cxTwoPoint)
toolbox.register('mutate', tools.mutShuffleIndexes, indpb = 0.05)
toolbox.register('select', tools.selTournament, tournsize=3)

Then write main function.

def main():
    random.seed(64)
    #set cross over and mutation probability and generation.
    CXPB, MUTPB, NGEN= 0.5, 0.2, 40
    # use all molecures.
    pop = toolbox.population(n=20)
    print('start evaluation !!!')
    fitnesses = list( map(toolbox.evaluate, pop))
    for ind, fit in zip(pop, fitnesses):
        ind.fitness.values = fit
    print('evaled %i inds' %len(pop))
    for g in range(NGEN):
        print('--GENERATION %i--' % g)
        # select the next generation individuals
        offspring = toolbox.select( pop, len(pop) )
        # clone the selected individuals
        offspring = list( map(toolbox.clone, offspring) )
        # Apply crossover and mutaion on the offspring
        for  child1, child2 in zip( offspring[::2], offspring[1::2]):
            if random.random() < CXPB:
                toolbox.mate( child1, child2 )
                del child1.fitness.values
                del child2.fitness.values
        for mutant in offspring:
            if random.random() < MUTPB:
                toolbox.mutate(mutant)
                del mutant.fitness.values
        # Evaluate the individuals with an invalid fitness
        invalid_ind = [ ind for ind in offspring if not ind.fitness.valid ]
        fitnesses = map(toolbox.evaluate, invalid_ind)
        for ind, fit in zip(invalid_ind, fitnesses):
            ind.fitness.values = fit
        print(' evaled %i individuals '%len(invalid_ind))
        pop[:] = offspring
        fits = [ind.fitness.values[0] for ind in pop]
        length = len(pop)
        mean = sum(fits) / length
        sum2 = sum(x*x for x in fits)
        std = abs(sum2/length-mean**2)**0.5
        print('min %s' % min(fits))
        print ('max %s' % max(fits))
        print ('avg %s' % mean)
        print('std %s' % std)
    print('--end of evolution--')
    best_ind = tools.selBest(pop, 1)[0]
    print('bset individual is %s, %s' % (best_ind, best_ind.fitness.values))

Ready.
Then run script.

main()
<out put from here>
start evaluation !!!
evaled 20 inds
--GENERATION 0--
 evaled 18 individuals 
min 0.765494194159
max 0.850630899091
avg 0.818827150217
std 0.0248685727781
........
 evaled 10 individuals 
min 0.912074571976
max 0.912074571976
avg 0.912074571976
std 2.58095682795e-08
--end of evolution--
bset individual is [41, 130, 1, 113, 5, 198, 94, 130, 93, 151], (0.91207457197601005,)
<end>

Now I got answer.
Check molecules.

res = [mols[i] for i in [41, 130, 1, 113, 5, 198, 94, 130, 93, 151]]
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
img = Draw.MolsToGridImage( res,molsPerRow=5,subImgSize=(200,200))
img

Ah …
Screen Shot 2015-02-08 at 10.48.01 PM

Now I got best dissimilarity score was 0.91207457197601005.
Compare with random fashion.

sample = random.sample(range(200),10)
sample
>[37, 160, 157, 30, 163, 38, 112, 118, 135, 85]
eval_diversity(sample)
>(0.82878571084365449,)

😉
Select molecules using GA is useful because, I could make evaluate function in myself.
Today I used only maximise dissimilarity, but I can also use maximise drug likeness or activity for target, or any more…
So I’ll study more deeply about deap.