NYCJUG/2018-04-10


Meeting Agenda for NYCJUG 20180410

1. Beginner's regatta: see "Example of Building a Function to Calculate End-of-month Dates".

2. Show-and-tell: see "Notes on 'Mastermind' Game".

3. Advanced topics: see "Using Self-Organizing Maps to solve the Traveling Salesman Problem".

4. Learning and teaching J: see "Partial correlation" - can we do what R does easily? See "A Programmable Programming Language".

Example of Building a Function to Calculate End-of-month Dates

End-of-month dates are a useful subset of all dates, but the rules of a calendar are somewhat irregular, with different months having different numbers of days. We could generate dates directly from the number of days per month and a leap-year adjustment (a sketch of that approach appears below), but here's a different way, using some of the standard date functions.
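
For comparison, this is roughly what the direct days-per-month approach might look like. It is our own illustrative sketch, and daysInMonth is a hypothetical name, not part of the 'dt' library loaded below.

   daysInMonth=: 4 : 0
NB.* daysInMonth: number of days in month y of year x (illustrative sketch).
leap=. (0=4|x) *. (0~:100|x) +. 0=400|x
(y { 0 31 28 31 30 31 30 31 31 30 31 30 31) + leap * y=2
)
   2000 daysInMonth 2    NB. 2000 is a leap year
29
   1900 daysInMonth 2    NB. 1900 is not
28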

   load 'dt'  NB. Load Date/Time verbs
   toJulian=: 3 : 0
NB.* toJulian: convert YYYYMMDD date to Julian day number.
   dd=. |: 0 100 100 #: y          NB. Rows of dd: year; month; day
   mm=. (12*xx=. mm<:2)+mm=. 1{dd  NB. Treat Jan/Feb as months 13/14 of the prior year
   yy=. (0{dd)-xx
   nc=. <.0.01*yy                  NB. Century number
   jd=. (<.365.25*yy)+(<.30.6001*1+mm)+(2{dd)+1720995+2-nc-<. 0.25*nc  NB. Last terms: Gregorian correction
NB.EG toJulian 19590524 19921216
)

   toGregorian=: 3 : 0
NB.* toGregorian: convert Julian day numbers to dates in form YYYYMMDD
NB. (>15 Oct 1582).  Adapted from "Numerical Recipes in C" by Press,
NB. Teukolsky, et al.
   igreg=. 2299161        NB. Gregorian calendar conversion day (1582/10/15).
   xx=. igreg<:ja=. <. ,y   NB. Mask: day numbers on/after the Gregorian switch
   jalpha=. <.((<.(xx#ja)-1867216)-0.25)%36524.25
   ja=. ((-.xx) (#^:_1) (-.xx)#ja)+xx (#^:_1) (xx#ja)+1+jalpha-<.0.25*jalpha  NB. Correct the Gregorian days; merge back in original order
   jb=. ja+1524
   jc=. <.6680+((jb-2439870)-122.1)%365.25
   jd=. <.(365*jc)+0.25*jc
   je=. <.(jb-jd)%30.6001
   id=. <.(jb-jd)-<.30.6001*je   NB. Day of month
   mm=. mm-12*12<mm=. <.je-1     NB. Month, wrapping values above 12
   iyyy=. iyyy-0>iyyy=. (<.jc-4715)-mm>2   NB. Year, adjusted for Jan/Feb and non-positive years
   gd=. (10000*iyyy)+(100*mm)+id NB. Reassemble as numeric YYYYMMDD
)

Now we can play with these routines to determine end-of-month dates "empirically".

Experiment with Date Routines

   toJulian 19991231
2451544
   qts''
2016 10 17 12 50 4.982
   toJulian 20161017
2457679
   toJulian 20161017 19991231
2457679 2451544
   -/toJulian 20161017 19991231   NB. Number of days between these two dates.
6135

Build the vector of all day numbers between two dates:

   days=. (toJulian 19991231)+ i. >: -/toJulian 20161017 19991231

Figure out how to extract the month from a date in (numeric) YYYYMMDD form:

   toGregorian 0{days
19991231
   <.100%~toGregorian 0{days       NB. Reduce YYYYMMDD to YYYYMM
199912
   100|<.100%~toGregorian 0{days   NB. Extract month numbers MM
12
   $mos=. 100|<.100%~toGregorian days
6136

Build Boolean indicating where the month changes and check that result looks correct:

   whdiff=. 2~:/\mos                     NB. For list of month numbers, where does the month change?
   I. 100{. whdiff                       NB. Offsets for month change
0 31 60 91
   (toGregorian days){~I. 100{. whdiff   NB. Dates from offsets in YYYYMMDD form
19991231 20000131 20000229 20000331

Extract these end-of-month dates:

   #eomdts=. (toGregorian days){~I. whdiff
202
   _4{.eomdts
20160630 20160731 20160831 20160930

Figure how to convert numeric YYYYMMDD dates into literal form “YYYY-MM-DD”:

   1 0 0 0 1 0 1 0<;.1 ":20160930            NB. Partition date number as literal.
+----+--+--+
|2016|09|30|
+----+--+--+
   '-',~&.>1 0 0 0 1 0 1 0<;.1 ":20160930    NB. Put a hyphen at end of each partitioned item.
+-----+---+---+
|2016-|09-|30-|
+-----+---+---+
   ;'-',~&.>1 0 0 0 1 0 1 0<;.1 ":20160930   NB. Raze -> extra hyphen at end
2016-09-30-
   }:;'-',~&.>1 0 0 0 1 0 1 0<;.1 ":20160930 NB. Drop last character (= extra hyphen)
2016-09-30

Save the end-of-month dates into a file so we can examine them:

   svdir=. '\amisc\Clarifi\Data\'
   13 : '}:;''-'',~&.>1 0 0 0 1 0 1 0<;.1 ":y' NB. Tacit date conversion
[: }: [: ; '-' ,~&.> 1 0 0 0 1 0 1 0 <;.1 ":
   3{.eomCharDts=. ([: }: [: ; '-' ,~&.> 1 0 0 0 1 0 1 0 <;.1 ":)&.>eomdts
+----------+----------+----------+
|1999-12-31|2000-01-31|2000-02-29|
+----------+----------+----------+
   (;LF,~&.>eomCharDts) fwrite svdir,'EOMDates.txt'   NB. LF-delimited character strings to file
2222

Check that this looks correct:

   fread (svdir,'EOMDates.txt');0,54
1999-12-31
2000-01-31
2000-02-29
2000-03-31
2000-04-30

Now we can go back over the lines above and extract only the relevant ones to assemble the complete sequence:

   load 'dt'
   days=. (toJulian 19991231)+ i. >: -/toJulian 20161017 19991231
   mos=. 100|<.100%~toGregorian days
   whdiff=. 2~:/\mos
   eomdts=. (toGregorian days){~I. whdiff
   svdir=. '\amisc\Clarifi\Data\'
   eomCharDts=. ([: }: [: ; '-' ,~&.> 1 0 0 0 1 0 1 0 <;.1 ":)&.>eomdts
   svdir fileVar_WS_ 'eomCharDts'
   load 'dsv'
   (;LF,~&.>eomCharDts) fwrite svdir,'EOMDates.txt'

Game of Mastermind

Dan showed us some work he's done implementing a solver for the game of Mastermind. In this game, the "tester" arranges 4 pegs, with a choice of 8 colors per peg, in a location not visible to the player. The player sets up 4 pegs as a guess at the solution and the tester gives a score to this guess that indicates how many pegs are the correct color but not in the correct position and how many are both the right color and in the correct position. The player uses this score to make a new guess, which gets scored, and so on until the player deduces the hidden arrangement.

Here are Dan's notes on how he worked out a J solution to help a player solve a puzzle in as few guesses as possible.

Notes on "Mastermind" Game

Mastermind

4 pegs, choice of 8 colors for each: 8^4 combinations (4096)
score black and white pegs
black for each digit in correct position
white for digits included, but not in the correct position

Numeralogic

four places for digits 0-9, no duplicates

   */10 9 8 7   NB. Number of combinations
5040

score in 2 digits (number of correct digits, number of correct digits in correct position)

   correct_position =: +/ guess = solution

(if no duplicates present)

   correct_digits =: +/ +/ guess =/ solution

Since I am used to the Numeralogic scoring, I am using its definition for whitescore.

   blackscore=: 4 : '+/ x = y'

   1 2 3 4 blackscore 2 3 1 1
0
   1 2 3 4 blackscore 1 2 3 1 
3
   1 2 3 4 blackscore 5 1 2 3 
0

NB. if using Numeralogic rules (no duplicates in guess or key)

   whitescore3=: 4 : '+/ +/ x =/ y'

   1 2 3 4 whitescore3 5 1 2 3 
3
   1 2 3 4 = / 5 1 2 3 
0 1 0 0
0 0 1 0
0 0 0 1
0 0 0 0

NB. these are incorrect (duplicates do not score correctly)
   1 2 3 4 whitescore3 2 3 1 1
4
   1 2 3 4 =/ 2 3 1 1
0 0 1 1
1 0 0 0
0 1 0 0
0 0 0 0

   1 2 2 3 whitescore3 1 1 2 3
5
   1 2 2 3 =/ 1 1 2 3
1 1 0 0
0 0 1 0
0 0 1 0
0 0 0 1

NB. to score with duplicates we need to step through the digits of the key, removing matches from the guess.

   key
1 2 2 3 

   guess 
1 1 2 3
check if 1 is within 1 1 2 3
yes, add 1 to whitescore (now 1)
check if 2 is within 1 2 3 (remove first matched 1)
yes, add 1 to whitescore (now 2)
check if 2 is within 1 3
no, move on
check if 3 is within 1 3
yes, add 1 to whitescore (now 3)

the above with recursion:

   whitescore1=: 4 : 0    NB. x = key   y = guess    use recursion
if. 0 = # x  do. r=. 0
NB. match all of y to first element of x.  if there are digit matches, nb will remove the first matched instance from y
else. nb =. y ([: -. ([: i. [: # [) e. ] {.~ 1 <. [: # ]) I. b =. y = {. x
r =. (+./ b) + (}. x) whitescore1 nb # y   NB. add 1 if there were any matches
end.
)

   1 2 2 3 whitescore1 1 1 2 3
3
   1 2 2 3 whitescore1~ 1 1 2 3
3

Can it be done without looping or recursion?

NB. first idea:
NB. or insert (+./) followed by plus insert (+/)

   +/ +./ 1 2 3 4 =/ 2 3 1 1
4
   +/ +./ 2 3 1 1 =/ 1 2 3 4 
3

NB. transpose binary matrix is the same as swapping x and y
   +/ +./ |: 1 2 3 4 =/ 2 3 1 1  
3
NB. order matters.
NB. I thought I could take the lower value of the two.

NB. this fails regardless of order

   +/ +./ 1 2 2 3 =/ 1 1 2 3
4
   +/ +./ |: 1 2 2 3 =/ 1 1 2 3
4

NB. make a list of all possible digits (no duplicates)
NB. compare guess to list and compare key to same list
NB. lower value is correct answer

   ]x1=. +/ 1 2 2 3 =/ i. 10
0 1 2 1 0 0 0 0 0 0
   ]y1=. +/ 1 1 2 3 =/ i. 10
0 2 1 1 0 0 0 0 0 0
   x1 <. y1
0 1 1 1 0 0 0 0 0 0
   +/ x1 <. y1
3

NB. don't need i. 10, nub of key and guess will suffice  

   +/ 1 2 2 3 =/ ~. 1 2 2 3 , 1 1 2 3
1 2 1

   whitescore2=: 4 : 0    NB. x = key   y = guess
cmb=. ~. x,y
x1=. +/ x =/ cmb
y1=. +/ y =/ cmb
+/ x1 <. y1
)
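
Putting the pieces together (our own wrap-up, not from Dan's notes; nlScore and mmScore are hypothetical names): a Numeralogic-style score pairs the number of correct digits with the number in the correct position, while a Mastermind-style score reports black pegs, then white pegs, i.e. the digit count minus the black pegs.

   nlScore=: whitescore2 , blackscore                 NB. (correct digits, correct positions)
   mmScore=: blackscore , whitescore2 - blackscore    NB. (black pegs, white pegs)

   1 2 2 3 nlScore 1 1 2 3
3 3
   1 2 2 3 mmScore 1 1 2 3
3 0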

Using Self-organizing Maps to Solve the Traveling Salesman Problem

These are excerpts from an article showing how a Python implementation performs.

The Traveling Salesman Problem is a well-known challenge in Computer Science: it consists of finding the shortest route that traverses all cities in a given map exactly once. ... the difficulty of solving it increases rapidly with the number of cities, and in fact no general solution to the problem is known. For that reason, any method able to find a sub-optimal solution is currently considered good enough (most of the time we cannot verify whether the returned solution is the optimal one).

To solve it, we can try to apply a modification of the Self-Organizing Map (SOM) technique. Let us take a look at what this technique consists of, and then apply it to the TSP once we understand it better.

The original paper released by Teuvo Kohonen in 1998 [explains] that a self-organizing map is [a] grid of nodes, [using principles of] a neural network. ...the nodes in the map are spatially organized so that the more similar they are to each other, the closer they lie. ...To obtain this structure, [we apply] a regression operation to update the nodes ... from the model (e). The expression used for the regression is:

   n_{t+1} = n_t + w_e \cdot \Delta(e, n_t)

This means that the position of node n is updated by adding to it the distance from it to the given element, multiplied by the neighborhood factor of the winner neuron, w_e. The winner for an element is the most similar node in the map, usually the closest node by Euclidean distance. ...the neighborhood is defined as a convolution-like kernel over the map around the winner. Doing this, we update the winner and the nearby neurons toward the element, obtaining a smooth and proportional result. The neighborhood function is usually defined as a Gaussian distribution....

Modifying the Technique

To use the network to solve the TSP, the main concept to understand is how to modify the neighborhood function. If instead of a grid we declare a circular array of neurons, each node will only be conscious of the neurons in front of and behind it. That is, the inner similarity will work just in one dimension. Making this slight modification, the self-organizing map will behave as an elastic ring, getting closer to the cities but trying to minimize the perimeter of it thanks to the neighborhood function.

Although this modification is the main idea behind the technique, it will not work as is: the algorithm will hardly ever converge. To ensure convergence, we can include a learning rate, α, to control the exploration and exploitation of the algorithm. To obtain high exploration early and high exploitation later in the execution, we must include a decay in both the neighborhood function and the learning rate. Decaying the learning rate ensures less aggressive displacement of the neurons around the model, and decaying the neighborhood results in more moderate exploitation of the local minima of each part of the model. Then, our regression can be expressed as:

   n_{t+1} = n_t + \alpha_t \cdot g(w_e, h_t) \cdot \Delta(e, n_t)

where α_t is the learning rate at a given time, and g is the Gaussian function centered on the winner with neighborhood dispersion h_t. The decay function consists of simply multiplying the learning rate and the neighborhood distance by their respective discounts, γ:

   \alpha_{t+1} = \gamma_{\alpha} \cdot \alpha_t, \qquad h_{t+1} = \gamma_{h} \cdot h_t

This expression is indeed quite similar to that of Q-Learning, and convergence is sought in a similar fashion. Decaying the parameters can be useful in unsupervised learning tasks like these. It is also similar to the functioning of the Learning Vector Quantization technique, likewise developed by Teuvo Kohonen.

Finally, to obtain the route from the SOM, it is only necessary to associate each city with its winner neuron, traverse the ring starting from any point, and sort the cities by order of appearance of their winner neurons in the ring. If several cities map to the same neuron, it is because the SOM has not resolved the order of traversing those cities (due to lack of relevance to the final distance, or to insufficient precision). In that case, any possible order can be used for those cities.
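
To make the update concrete, here is a minimal J sketch of one regression step for a single city, following the description above. This is our own illustration of the technique, not a translation of the article's Python; all names here are our own.

   somStep=: 4 : 0
NB.* somStep: move a circular ring of neurons toward one city (sketch).
NB. x: alpha,h (learning rate, neighborhood dispersion);  y: city;net
'alpha h'=. x
'city net'=. y
win=. (i. <./) +/"1 *: city-"1 net      NB. winner: closest neuron (Euclidean)
rd=. (m-d) <. d=. | win - i. m=. #net   NB. circular distance along the ring
g=. ^-*:rd%h                            NB. Gaussian neighborhood around winner
net + alpha * g *"0 1 city-"1 net       NB. elastic-ring update
)
   net=. 8 2 ?@$ 0                      NB. small random initial ring
   net=. 0.8 2 somStep 0.3 0.7;net      NB. present one city at (0.3,0.7)

A full solver would loop over all cities, multiplying alpha and h by their discounts γ after each pass per the decay expression above, and finally read the route off the ring by sorting cities by their winner neurons.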

Implementing and testing the SOM

For the task, an implementation of the previously explained technique is provided in Python 3. It is able to parse and load any 2D instance modelled as a TSPLIB file and run the regression to obtain the shortest route. This format was chosen because testing and evaluation use the Traveling Salesman Problem instances offered by the University of Waterloo, which also provides the optimal route length for each instance, allowing us to check the quality of our solutions.

On a lower level, the numpy package was used for the computations, which enables vectorization and higher performance, as well as more expressive and concise code. pandas is used to load the .tsp files into memory easily, and matplotlib is used to plot the graphical representation. These dependencies are all included in the Anaconda distribution of Python, or can be easily installed using pip.

... implementation and testing details omitted ...


The code is available on GitHub and licensed under MIT....

Partial Correlation

[From Wikipedia]

In probability theory and statistics, partial correlation measures the degree of association between two random variables, with the effect of a set of controlling random variables removed. If we are interested in finding whether or to what extent there is a numerical relationship between two variables of interest, using their correlation coefficient will give misleading results if there is another, confounding, variable that is numerically related to both variables of interest. This misleading information can be avoided by controlling for the confounding variable, which is done by computing the partial correlation coefficient. This is precisely the motivation for including other right-side variables in a multiple regression.

For example, if we have economic data on the consumption, income, and wealth of various individuals and we wish to see if there is a relationship between consumption and income, failing to control for wealth when computing a correlation coefficient between consumption and income would give a misleading result, since income might be numerically related to wealth which in turn might be numerically related to consumption; a measured correlation between consumption and income might actually be contaminated by these other correlations. The use of a partial correlation avoids this problem.

Like the correlation coefficient, the partial correlation coefficient takes on a value in the range from –1 to 1. The value –1 conveys a perfect negative correlation controlling for some variables (that is, an exact linear relationship in which higher values of one variable are associated with lower values of the other); the value 1 conveys a perfect positive linear relationship, and the value 0 conveys that there is no linear relationship. The partial correlation coincides with the conditional correlation if the random variables are jointly distributed as the multivariate normal, other elliptical, multivariate hypergeometric, multivariate negative hypergeometric, multinomial, or Dirichlet distribution, but not in general otherwise.

Example

Suppose we have the following data on three variables, X, Y, and Z. These data have the feature that whenever Z = 0, X equals exactly twice Y, and whenever Z = 1, X is exactly 5 times Y. Thus, contingent on the value of Z, there is an exact linear relationship between X and Y, but the relationship cannot be said to be exact without reference to the value of Z.

    X  Y  Z
    2  1  0
    4  2  0
   15  3  1
   20  4  1

In fact, if we compute the correlation coefficient between variables X and Y, the result is approximately 0.969, while if we compute the partial correlation between X and Y, using the formula given below, we find a partial correlation of 0.919. The computations were done using R with the code shown here.
> X = c(2,4,15,20)
> Y = c(1,2,3,4)
> Z = c(0,0,1,1)
> mm1 = lm(X~Z)
> res1 = mm1$residuals
> mm2 = lm(Y~Z)
> res2 = mm2$residuals
> cor(res1,res2)
[1] 0.919145
> cor(X,Y)
[1] 0.9695016

Formal Definition

Formally, the partial correlation between X and Y given a set of n controlling variables Z = {Z_1, Z_2, ..., Z_n}, written ρ_{XY·Z}, is the correlation between the residuals e_X and e_Y resulting from the linear regression of X with Z and of Y with Z, respectively. The first-order partial correlation (i.e. when n=1) is the difference between a correlation and the product of the removable correlations, divided by the product of the coefficients of alienation of the removable correlations. The coefficient of alienation, and its relation with joint variance through correlation, are available in Guilford (1973, pp. 344–345).
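
In symbols, for a single controlling variable Z, that verbal description is the standard formula (where the coefficient of alienation of a correlation \rho is \sqrt{1-\rho^2}):

   \rho_{XY \cdot Z} = \frac{\rho_{XY} - \rho_{XZ}\,\rho_{YZ}}{\sqrt{1-\rho_{XZ}^2}\,\sqrt{1-\rho_{YZ}^2}}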

Replicating the R Example in J

The R code above gives a few statistics:
> cor(res1,res2)
[1] 0.919145
> cor(X,Y)
[1] 0.9695016
> summary(mm1)
Call:
lm(formula = X ~ Z)
Residuals:
  1    2    3    4 
-1.0  1.0 -2.5  2.5 
Coefficients:
           Estimate Std. Error t value Pr(>|t|)  
(Intercept)    3.000      1.904   1.576   0.2558  
Z             14.500      2.693   5.385   0.0328 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
Residual standard error: 2.693 on 2 degrees of freedom
Multiple R-squared: 0.9355,	Adjusted R-squared: 0.9032 
F-statistic:    29 on 1 and 2 DF,  p-value: 0.03280

Here's J code to calculate the first few statistics shown in R:
   x=. 2 4 15 20
   y=. >:i.4
   z=. 0 0 1 1

   corrCoeff=: 3 : 0
NB.* corrCoeff: coefficient of linear correlation between the 2 columns of y.
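NB. Assumes the verb  mean=: +/ % #  is defined (e.g. via load 'stats').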
   avgdif=. y-"(1 1)mean y
   (+/*/"1 avgdif)%%:*/+/*:avgdif
)
   corrCoeff x,.y
0.969502
   x%.1,.z   NB. Regress x on z w/intercept
3 14.5
   (1,.z)+/ . *x%.1,.z  NB. Fit points
3 3 17.5 17.5
   NB. Calculate x and y residuals
   res1=. x-(1,.z)+/ . *x%.1,.z
   res2=. y-(1,.z)+/ . *y%.1,.z
   corrCoeff res1,.res2
0.919145
   res1  NB. Residuals as shown in R
_1 1 _2.5 2.5
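
The steps above can be packaged into a single verb. This is our own wrap-up, not part of the original material; partialCorr is a hypothetical name, and it relies on corrCoeff defined above.

   partialCorr=: 3 : 0
NB.* partialCorr: partial correlation of the first two columns of table y,
NB. controlling for the remaining column(s).  Hypothetical wrapper.
'xv yv'=. ({."1 ; 1&{"1) y
zz=. 1,.2}."1 y              NB. controlling variable(s) plus intercept
res1=. xv-zz+/ . *xv%.zz     NB. residuals of xv regressed on zz
res2=. yv-zz+/ . *yv%.zz
corrCoeff res1,.res2
)
   partialCorr x,.y,.z
0.919145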