Population builder for TSP GA

https://www.replica-uhren.org

Genetic algorithms are an amazing intersection of biology and computer science. Ive wanted to write about it for a long time and finally, that time is here! So please enjoy the travelling coffee drinker A coded walk-through of the solution I used to find the quickest route to every Starbucks in London!

The full code for this project is available in a notebook on my GitHub page. Link at the end of the article.

IntroductionProblem Statement

The travelling salesman problem (TSP) has been explored in mathematics for a long time. In its simplest form (the version solved in this project) the TSP was most probably first stated in the 18th Century and the full definition is given below:

What is the shortest route a traveling salesman can take to visit n cities and return back to his home city, only going through each city once? Biron, 2006

So in a more modern adaptation of the problem:

What is the shortest route a coffee fanatic can take to visit 165 Starbucks cafs and return back to his home caf, only sampling each caf once? Lewis, 2022

Exploration

The data used comes from Kaggle with a link at the end of this article, and luckily is all licensed as Open Source! The data shows that there are 975 total Starbucks Cafs within the GB country code. With most of those being in London (165):

Now how can we reach each one of these cafs whilst travelling the least possible distance? At this scale, it would take a long time to do the manual calculation for each possible solution (165 factorial possible solutions!). So instead we can optimise and use a Genetic Algorithm.

The Genetic Algorithm is brilliantly explained by user Vijini Mallawaarachchi in a link at the end of the article. However, what we are seeking to do is apply the processes of natural selection and evolution to a set of possible solutions, more on this in the methodology.

However, a generic genetic algorithm is not tailored to meet the constraints of a TSP, so is not always an appropriate solution. This article shows that genetic algorithms can be tailored solutions to restrictive problems like the TSP. But first, it is important to look at why the TSP is so restrictive and where a genetic algorithm could go wrong.

The Constraints

A solution to the TSP has to have:

The traveller begin and end their journey at the same cafThe traveller not revisiting any intermediate caf so each one in the solution must be uniqueThe traveller visiting all 165 + 1 (for returning) = 166 cafs

An out-of-the-box implementation of a genetic algorithm will truly mimic biological processes and is not fit for use on restrictive problems. This means that mutations are random (so could cause duplication breaking point 2 above), crossovers between solutions are also random (so could break point 1) and deletions are also possible mutations where a gene could be removed entirely (breaking point 3). So how can they be altered and developed for the TSP?

Methodology

In building the genetic algorithm existing packages like NumPy and PyGAD can be leveraged. PyGAD is a package dedicated to Genetic Algorithms written in Python. It is really useful and has many use cases most heavily weighted toward training neural networks. This means that it has not been fully adapted to solving the TSP. However, it does have very intuitive customisation options and parameters as you build a genetic algorithm instance. Which are explored below to make it more bespoke for the restrictive TSP.

Building the Genetic Algorithm

A genetic algorithm is made up of a set number of parts:

An initial populationA fitness functionA crossover operationA mutation operation

These constituents in the case of this solution to fix the genetic algorithm for the restrictive TSP will be:

A pool of solutions programmatically generated with a random order of stores starting and ending at a given storeA function to determine the distance between each store in a solutionA function to imitate breeding in a natural process, where parts of two very fit solutions are passed on to the children using the partial match crossover operator which upholds restrictionsA function to randomly change an element each solution to imitate mutations in natural processes an inversion which again upholds restrictions

There is also parent selection however this is taken care of by PyGAD. Now its important to make sure our biology is up to scratch here too and we can start to replace some of the lengthier sentences above with shorter more precise ones:

gene a store to visit in a given solution

chromosome a set of stores strung together to make a solution

population the pool of chromosomes

Using PyGADs documentation as a source the methodology when applying the package is given in the helpful flow chart below:

Implementation

Now for the code. There is a lot of code for this project so it wont all be included here. Especially the boilerplate genetic algorithm code and operations. The most important takeaways are the application of a custom population builder, fitness function, crossover operator and mutation operator. As well as solution verification.

First things first the population needs to be generated. The simplest method for getting random values is making use of random in Python. So from the data, a list of stores is generated. A hash (dictionary) containing each store and some lon/lat information for where the store is, and some random selections courtesy of the random package in Python can be used to build the chromosomes:

Notably, the first and final genes are set to 0 indicating the travellers return.

Now for the fitness function. To determine which chromosomes will make it to the next generation and breed, a fitness value is needed. The fittest ones will make it through. The fitness of a solution in this context is how long or short the distance that the traveller has to go before reaching the end of the line. So a solution that bounces the coffee drinker from North to South London and back again wont be very fit. The work here makes use of GeoPy and measures distance as the crow flies:

This is where things get really interesting. Out-of-the-box genetic algorithms in PyGad come with several crossover operators. However, none of these is suitable for the TSP! All of the crossover operations would result in invalid solutions. So there are two options either carry on as normal and generate incorrect values but apply a mutation that rematches the chromosome to validate the solution. Or develop a custom crossover method that does not result in invalid solutions. This project uses option two and applies the Partially Matched Crossover (PMX) method.

The reason behind crossover is to combine very good solutions in an effort to make two even better child solutions. PMX does this by:

Define a run of genes in parent 1Pass these genes on to the childFind the first gene in parent 2 that is not already in the childIf this gene the position of this gene is already filled in the child then match to parent 1 to see what gene it isIterate until a position is found for every gene that the child is missing from parent 2Repeat the process with parent 2 as parent 1

The diagram below comes from a paper based on the PMX crossover, in biology:

So the process is defining a run of consecutive genes to take into a child from parent one in the exact same position. Then fill in the rest of the childs chromosome with the remaining genes in the safe positions from parent two. To create the second child the parents just swap and the run of consecutive genes is taken from parent two and parent one is used to fill in the gaps. The code implementation is split into two functions below:

The final impactful function to design is the mutation function. For the TSP again there are no suitable out-of-the-box mutation functions. So the options are to build one or to build a custom mutation that re-validates a solution. The first option is more favourable because it is more in line with the concepts of genetic algorithms and involves less meddling. The mutation used is an inversion, in which a sequence of genes of random length is defined and then reversed so [, 1, 2, 4, ] would become [, 4, 2, 1, ]. The code implementation is more simple than the crossover:

Now the genetic algorithm ga_instance can be initialised and run below:

The key things to note are the implementation of the custom functions. As well as a few other parameters namely:

num_generations: the number of iterations to run the initialisation, selection, breed and mutation processsol_per_pop: the number of totals solutions per generationnum_parents: the number of parents that will breed based on fitness rankings (this value is kept fairly high here to ensure diversity in the solution pool)keep_parents: keep the two fittest parents at each generation (elitism) so that they can still be considered fit solutions until they are outranked and selected against laterResults

Analysing the results of the genetic algorithm is where using the pre-built PyGAD package is very useful. The results are super accessible with just a few simple function calls:

To show the results in a more simple format initially a version of the algorithm is trained on just 10 cafs. And the best solution found is given below:

Now at first this doesnt look like anything too clever, but zooming into the middle section its clear how well the solution has been optimised. As there can be no repeated sections the algorithm has found paths on the