Beginner's Regatta

Here we look at an example of solving the Traveling Salesman Problem using a genetic algorithm.

Traveling Salesman Problem Using Genetic Algorithms

This, from this site, demonstrates solving the Traveling Salesman Problem using a genetic algorithm. It implements the solution in C# but provides a good explanation of the method and the issues involved so we can use the article as guidance for writing our own version.

Problem Statement

From the article:

I have developed a solution to the Traveling Salesman Problem (TSP) using a Genetic Algorithm (GA). In the Traveling Salesman Problem, the goal is to find the shortest distance between N different cities. The path that the salesman takes is called a tour.
Testing every possibility for an N city tour would be N! math additions. A 30 city tour would have to measure the total distance of be 2.65 X 1032 different tours. Assuming a trillion additions per second, this would take 252,333,390,232,297 years. Adding one more city would cause the time to increase by a factor of 31. Obviously, this is an impossible solution.
A genetic algorithm can be used to find a solution is much less time. Although it might not find the best solution, it can find a near perfect solution for a 100 city tour in less than a minute. There are a couple of basic steps to solving the traveling salesman problem using a GA.

Basic Algorithm

The article outlines these steps:

First, create a group of many random tours in what is called a population. This algorithm uses a greedy initial population that gives preference to linking cities that are close to each other.
Second, pick 2 of the better (shorter) tours parents in the population and combine them to make 2 new child tours. Hopefully, these children tour will be better than either parent.
A small percentage of the time, the child tours are mutated. This is done to prevent all tours in the population from looking identical.
The new child tours are inserted into the population replacing two of the longer tours. The size of the population remains the same.
New children tours are repeatedly created until the desired goal is reached.

As the name implies, Genetic Algorithms mimic nature and evolution using the principles of Survival of the Fittest.

Difficulties with Crossover

We are next introduced to an issues unique to TSP with genetic algorithms: the problem with the crossover step, the fourth on above where a part of one tour is inserted into another tour. This raises the problem that since each path has exactly one instance of each node, steps must be taken to deal with duplicates and missing elements when inserting an arbitrary piece of one path into another.

This author deals with the problem by using a doubly-linked-list with a link to both the prior and subsequent node in a path. How exactly this helps is not explained but is presumably available in the code.

Parameters

In this article, there are six parameters that control how the GA works. These are

Population size - how many tours to look at at once.
Neighborhood size - number of tours to choose from the population at each iteration.
Mutation rate - a percentage determining how likely it is for a mutation to occur.
Number of nearby cities - the number of cities in a grouping of cities that are considered close to each other.
Nearby city odds - the chance that a given city will choose the next step in the path from one of the nearby cities rather than from the population in general.
Random seed - allow for setting the starting point of the random number generation; supposed to be handy for debugging.
City list - list of cities and locations

Preliminary J Work Implementing TSP Using GA

We don't have a complete solution here but show some preliminary work in J to work toward an implementation of this method. We did a more complete job on this for a later meeting.

Here’s some code:

NB.* geneticTSP.ijs: look at using genetic algos to solve the
NB. Traveling Salesman Problem.

disBtwPts=: 13 : '%:+/*:x-y'
disBtwPts=: [: %: [: +/ [: *: -

NB.* mat2col2Rxy: 2-col mat -> R assignments
mat2col2Rxy=: 3 : 0
   assert. 2={:$y   NB. 2-columns
   assert. 2=#$y    NB. 2-dimensional
   str=. 'xx<- ',1|.')c(',}:;',',~&.>j2n &.> 0{"1 y
   str,'; yy<- ',1|.')c(',}:;',',~&.>j2n &.> 1{"1 y
)

J2Rassign2rowMat=: 3 : 0
   smoutput ('xx<- c('),')',~}.;',',&.>j2n &.> 0{"1 y
   smoutput ('yy<- c('),')',~}.;',',&.>j2n &.> 1{"1 y
)

NB.* circuit: repeat 1st point at end -> complete circuit.
circuit=: ],{.

eg=: 0 : 0
   |:locs=. 6 2?@$0
0.375026   0.821357 0.843802 0.151142 0.188003 0.994509
0.448847 0.00569082 0.136437 0.294528 0.603056 0.184087
)

egShowPath=: 0 : 0
   pd 'reset'
   pd 'title TSP Example: Shortest Path'
   pd <"1 |:locs [ pd 'type point;pensize 2'
   pd <"1 |:circuit locs{~ord [ pd 'type line;itemcolor green'
   dis=. +/2 disBtwPts/\circuit locs{~ord
   pd 'text 100 100 Total Distance=',":dis
   pd 'show'
)

Here’s some exploration of how we might use this:

   load 'plot'
   pd 'reset'
   |:circuit locs=. 6 2?@$0
0.375026   0.821357 0.843802 0.151142 0.188003 0.994509 0.375026
0.448847 0.00569082 0.136437 0.294528 0.603056 0.184087 0.448847
   $locs
6 2
   !#locs
720
   alldiss=. +/&>2 disBtwPts/\&.> circuit &.> <"2 locs{~(i.!#locs) A. i.#locs
   alldiss i. <./alldiss
84
   ]ord=. 84 A. i.#locs
0 4 3 1 2 5
   ]dis=. +/2 disBtwPts/\circuit locs{~ord
2.24734

   pd 'reset'
   pd <"1 |:locs [ pd 'type point;pensize 2'
   pd 'title TSP Example: Shortest Path'
   pd <"1 |:circuit locs{~ord [ pd 'type line;itemcolor green'
   pd 'text 100 100 Total Distance=', ":dis
   pd 'show'

Expanding on this:

   |:locs=. 10 2?@$0
0.432281 0.981256 0.865989 0.175182 0.422217 0.188374 0.323206 0.348423 0.876925 0.421128
0.497782  0.18366 0.432289 0.617645 0.646875  0.10098 0.537415 0.345876 0.870853 0.217829
   ord=. i.#locs
   pd 'reset'
   pd 'title TSP Example: Initial Path'
   pd <"1 |:locs [ pd 'type point;pensize 2'
   pd <"1 |:circuit locs{~ord [ pd 'type line;itemcolor green'
   dis=. +/2 disBtwPts/\circuit locs{~ord
   pd 'text 100 100 Total Distance=',":dis
   pd 'show'

   !#locs                             NB. Possible number of paths
3.6288e6
   $paths=. (10?!#locs) A. i.#locs    NB. Some random permutations.
10 10
   +/&>2 disBtwPts/\&.>circuit &.> <"2 locs{~paths
5.04213 4.41731 4.68363 5.36805 5.00227 5.45471 4.61821 5.26628 4.29059 5.18575
   /:+/&>2 disBtwPts/\&.>circuit &.> <"2 locs{~paths
8 1 6 2 4 0 9 7 3
   pd 'reset'
   pd 'title TSP Example: First Best (Random) Guess'
   pd <"1 |:locs [ pd 'type point;pensize 2'
   pd <"1 |:circuit locs{~8{paths [ pd 'type line;itemcolor green'
   dis=. +/2 disBtwPts/\circuit locs{~8{paths
   pd 'text 100 100 Total Distance=',":dis
   pd 'show'

   $best=. paths{~(<.0.5+-:#paths){./:+/&>2 disBtwPts/\&.>circuit &.> <"2 locs{~paths
5 10
   +/&>2 disBtwPts/\&.>circuit &.> <"2 locs{~best
4.29059 4.41731 4.61821 4.68363 5.00227
   best
8 4 6 9 0 1 2 5 7 3
5 0 9 2 8 1 6 4 7 3
3 6 0 1 9 5 8 4 2 7
4 5 6 0 1 7 3 9 2 8
3 9 4 6 1 8 7 5 2 0
   A."1 best
3093843 1851015 1290908 1633018 1429439
   best i."(1 0) 0
4 1 2 3 9
   (best i."(1 0) 0)|."(0 1) best
0 1 2 5 7 3 8 4 6 9
0 9 2 8 1 6 4 7 3 5
0 1 9 5 8 4 2 7 3 6
0 1 7 3 9 2 8 4 5 6
0 3 9 4 6 1 8 7 5 2
   A."1 (best i."(1 0) 0)|."(0 1) best
1812 332002 38092 26538 117743
   best=. (best i."(1 0) 0)|."(0 1) best  NB. Start at same point for each path.
   ]ixs=. A."1 best
1812 332002 38092 26538 117743
   i:5
_5 _4 _3 _2 _1 0 1 2 3 4 5
   $ixs +/ i: 5
5 11
   _1 A. i.3
2 1 0
   newixs=. ~.,ixs +/ i: 5
   3{.newixs
1807 1808 1809
   (3{.newixs) A. i.#locs
0 1 2 5 7 3 6 4 9 8
0 1 2 5 7 3 6 8 4 9
0 1 2 5 7 3 6 8 9 4
   diss=. +/&>2 disBtwPts/\&.>circuit &.> <"2 locs{~newixs A. i.#locs
   (<./,>./) diss
4.03417 5.54674
   $paths=. locs{~newixs A. i.#locs
55 10 2
   $paths=. <"2 locs{~newixs A. i.#locs
55
   $best=. paths{~(<.0.5+-:#paths){./:+/&>2 disBtwPts/\&.>circuit &.> paths
28
   diss{~(#best){./:diss
4.03417 4.13188 4.18143 4.20547 4.23095 4.29059 4.29463 4.29489 4.30649 4.35064 4.36072 4.39746 4.40792 4.40901 4.41017 4.41731 4.43094 4.44933 4.45013 4.46523 4.49365 4.51075 4.52487 4.56752 4.57172 4.58078 4.58707 4.61821
   ]newixs=. newixs{~5{.\:diss
26537 26536 26533 26535 26534
   newixs A. i.#locs
0 1 7 3 9 2 6 8 5 4
0 1 7 3 9 2 6 8 4 5
0 1 7 3 9 2 6 4 8 5
0 1 7 3 9 2 6 5 8 4
0 1 7 3 9 2 6 5 4 8
   bp=. {.newixs
   pd <"1 |:circuit locs{~bp [ pd 'type line;itemcolor blue'
   pd 'reset'
   pd 'title TSP Example: Next Best (Permutation Neighbors) Guess'
   pd <"1 |:locs [ pd 'type point;pensize 2'
   pd <"1 |:circuit locs{~bp [ pd 'type line;itemcolor green'
   dis=. +/2 disBtwPts/\circuit locs{~bp
   pd 'text 100 100 Total Distance=',":dis

   bp=. 4{newixs
   pd <"1 |:circuit locs{~bp [ pd 'type line;itemcolor red;pensize 3'
   dis=. +/2 disBtwPts/\circuit locs{~bp
   pd 'textcolor red;text 100 200 Total Distance=',":dis
   pd 'show'
   pd <"1 |:circuit locs{~bp [ pd 'type line;itemcolor red;pensize 5'
   pd 'textcolor red;text 100 200 Total Distance=',":dis
   pd 'show'

Odds'n'Sods

We looked at an article on "Corpora and Vector Spaces" and implementing this in Python.

Corpora and Vector Spaces

We read about Genesim, a Python library for analyzing bodies ("corpora") of plain text by building vector spaces of the words in a given corpus. A vector space represents a word as a high-dimensional vector of numbers.

The complete article is included in the Materials section below.

About the Entry

An explanation of the winner of the 2015 "Underhanded C Contest": we looked at the rather simple bit of C code for comparing pairs of gamma-ray spectra as well as at a more general explanation of the results of that contest in which there was discussion of "NaN poisoning attacks". These attacks rely on the undefined behavior of "NaN" (Not a Number) which has representation in many programming languages, including J.

The complete articles are included in the Materials section below.

Got to be a Better Way

Here we looked at some code in Javascript for doing financial analysis. This perhaps falls under the heading of things for which a particular language is not well-suited.

Similarly, we examined some Python code for calculating means, medians, and quantile creation.

We compared the J code for doing this with the Python code.

J Code for Mean and Median

   mean=: (+/ % #) :wmean
   median=: -:@(+/)@((<. , >.)@midpt {  /:~)
   midpt=: -:@<:@#
   wmean=: +/@[ %~ +/@:*

Python Code for Mean and Median (with graphing)

Here's some Python code, with wrapping for plotting the resulting quantiles, for calculating these values.

# percentile breaker-upper
import pandas as pd
from pandas import read_csv, DataFrame
import matplotlib.pyplot as plt
plt.rcdefaults()
import numpy as np
import matplotlib.pyplot as plt
path_to_file = input("path:")
#path_to_file = 'test_data.csv'
# remove trailing space if exists, as happens when getting the file path 
# by dragging an icon from the desktop on a Mac.
path_to_file = path_to_file.strip()
# read the file into a DataFrame
dframe = read_csv(path_to_file)
list_headers = list(dframe.columns.values)
len_data = len(dframe.index)
print('length of data: ',len_data)
# print a list of the headers, with an index number to select
for x in range(0,len(list_headers)):
    print('column #',x,' ',list_headers[x])
col_sort = int(input('enter column to sort by: '))
col_meanmed = int(input('enter mean/median column: '))
tiles = int(input("enter number of quantiles to use: "))
# sort the data
dframe = dframe.sort_values([list_headers[col_sort]],ascending=True)
# handle uneven number of rows by assigning remainder to first and last quantiles.
# if remainder is odd, the first quantile gets the extra one.
int_tile = int(len_data/tiles)
modrem = len_data % tiles
offset = int((modrem/2) + (modrem % 2))
stuff_begin = int_tile + offset
stuff_end = len_data - int(modrem/2) - int_tile
# this is the computed quantile mean and median data
tile_data =[]
# the first and last quantiles are handled outside the loop, since they
# are possibly of differing row lengths to the inside quantiles.
new_slice=slice(0,stuff_begin)
temp_mean = dframe.iloc[new_slice,col_meanmed]
print('\n\nquantile data:\n\nmean/median 0 :\t',DataFrame.mean(temp_mean),'\t',DataFrame.median(temp_mean))
tile_data.append([0,DataFrame.mean(temp_mean),DataFrame.median(temp_mean)])
for x in range(1,tiles-1):
    new_slice=slice((int_tile * x)+ offset, (int_tile * (x+1)) + offset)
    temp_mean = dframe.iloc[new_slice,col_meanmed]
    print('mean/median',x,':\t',DataFrame.mean(temp_mean),'\t',DataFrame.median(temp_mean))
    tile_data.append([x,DataFrame.mean(temp_mean),DataFrame.median(temp_mean)])
new_slice = slice(stuff_end,len_data)
temp_mean = dframe.iloc[new_slice,col_meanmed]
print('mean/median',x+1,' :\t',DataFrame.mean(temp_mean),'\t',DataFrame.median(temp_mean))
tile_data.append([x+1,DataFrame.mean(temp_mean),DataFrame.median(temp_mean)])
# matplotlib seems to want tuples instead of other data types.
# adding 1 to each element in the quantile row, so that quantiles don't start
# at zero.
row_t = tuple([row[0]+1 for row in tile_data])
row_mn = tuple([row[1] for row in tile_data])
row_md = tuple([row[2] for row in tile_data])
print("""
making bar chart....
""")
# graphing stuff. Copied and pasted from samples, with much head-scratching involved.
n_groups = len(row_t)
title_graph = str(row_t[len(row_t)-1]) + ' quantiles, sorted by "' + str(list_headers[col_sort]) + '", data: "' + str(list_headers[col_meanmed]) + '"'
fig, ax = plt.subplots()
index = np.arange(n_groups)
bar_width = 0.35
opacity = 1
rects1 = plt.bar(index, row_mn, bar_width,
                 alpha=opacity,
                 color='b',
                 label='Mean')
rects2 = plt.bar(index + bar_width, row_md, bar_width,
                 alpha=opacity,
                 color='r',
                 label='Median')
plt.xlabel('quantiles')
plt.ylabel('value')
plt.title(title_graph)
plt.xticks(index + bar_width, row_t)
plt.legend()
plt.show()

Miscellaneous

We looked at an example of a crossword using regular expressions as clues.

Here is a partially solved one to illustrate how this works.

Materials

NYCJUG/2016-02-09

Contents