NYCJUG/2011-04-12/RedOrBlackGameSimulation

The Red or Black Game

The “red or black” game is played with a shuffled deck of 52 cards, half red and half black. The object of the game is to predict the color of each card before it is turned over. The naïve expectation is that one would average 26 correct guesses over a full deck, but a simple strategy raises this expectation to about 30 correct guesses.

An article about this, “Interesting Probability Calculations Using Cards To Win The Red Or Black Game” by Yutaka Nishiyama (International Journal of Pure and Applied Mathematics, Volume 66, No. 1, 2011, pp. 81-90), calculates that a simple strategy yields an expectation of 30.007 correct guesses. The calculation looks reasonable, but a simple simulation in J seems to give a slightly different average of about 30.0407. Below is an explanation of how the paper lays out the problem, followed by a J simulation of it.

The paper considers the general case of any sequence from a deck of n red and n black cards as corresponding to a path on an n x n grid. It is evident that, for instance, as more red cards are turned up, black becomes a better guess; as more black cards have been seen, red becomes more likely. On the diagonal of the grid, however, representing points at which equal numbers of red and black cards have come up, either color is as likely as the other. This is illustrated below by a section of a grid representing the first eleven draws in an arbitrary sequence. The horizontal axis represents draws of red cards and the vertical axis represents draws of black cards. The squares of the grid are colored to indicate which color is more likely to come up at each point in a given sequence: the upper triangle, in which more red cards have been drawn, is colored black, and the reverse is true for the lower triangle.

GuessesOnGridWithEllipses.png

The paper makes the point that each time the path hits the diagonal, the expected value of that guess is 0.5 but, for points not on the diagonal, the expected value of a guess is higher than this, depending on how far off the diagonal the path is, i.e. how many more cards of one color than the other have been drawn. The author extends this argument thusly:

PredictedScoreFromNishiPaper.png

The remainder of the paper details the calculations for determining the general formula and, applying it to the (26, 26) case, gives the value of 30.007 for the expected value using the strategy outlined here. The derivation of the formula, involving generating functions and convolutions as well as the use of Stirling’s Formula to calculate the value of large factorials, was a bit beyond my own modest mathematical abilities so I resorted to simulation of the problem. However, running this simulation for quite a few trials (several billion), seems to point to an expected mean value of about 30.0407, which is a small but significant difference from the value given in the paper. So, which is more correct? To figure this out, we’ll first look at the J code used to arrive at this estimate by simulation.

J Simulation of the “Red or Black” Game

A J routine to run a simulation of this game is as follows:

scoreRB=: 3 : 0"0
NB. Simulate "Red or Black" game once per argument item (value ignored).
   guess=. -*+/\0,}:seq=. (] {~ [: ?~ #)26 26#_1 1
   (-:0+/ . =guess)+seq +/ . = guess
NB.EG scores=. scoreRB 1e6$0  NB. Scores for one million simulations.
)

The routine works by first setting the variable “seq” to a random sequence of red and black cards: the cards are represented by 26 negative ones and 26 ones (26 26#_1 1), which are then scrambled into random order. The expression ?~ # evaluates to 52?52 when applied to the vector of 52 ones and negative ones, and this random permutation of the first 52 integers selects ({~) the items of the initial, ordered set in random order.
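
For example, on a hypothetical toy deck of three cards of each color, one run of this shuffling expression might look like the following (the result is random, so any particular run will differ):

   (] {~ [: ?~ #) 3 3#_1 1   NB. Shuffle a toy 6-card deck: three _1s and three 1s.
1 _1 _1 1 1 _1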

The variable “guess” is calculated from this random sequence of red and black values by cumulatively summing (+/\) it. The signum (*) of this running total shows whether the cards drawn so far are over-weighted toward ones or negative ones at each point in the sequence: if the total is positive, more ones have been drawn; if it’s negative, more negative ones; and if it’s zero, the ones and negative ones are balanced. Our guess is simply the opposite (-) of whichever has predominated so far, since the remaining deck is then weighted toward the other color.
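
To sketch this on a short, made-up sequence (not a legal 52-card deck, just an illustration):

   seq=. 1 1 _1 1 _1 _1      NB. Hypothetical short sequence of cards
   +/\ seq                   NB. Running total: positive means more 1s drawn so far
1 2 1 2 1 0
   -* +/\ seq                NB. Guess the opposite of whichever has predominated
_1 _1 _1 _1 _1 0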

This “running total” is a simpler way of looking at the problem than explicitly tracking each path but seems as adequate an approach as the one given in the paper referenced earlier, at least for purposes of simulation.

The score for a simulation is calculated by comparing the sequence to the guesses and assigning one point for each match (seq +/ . = guess). This amount is incremented by half the number of zeroes in the guesses (-:0+/ . =guess), on the assumption that a guess made at a point where the remaining deck is evenly divided between red and black is worth half a point, reflecting the equal likelihood of either color at such points.
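
Here is a small worked example of the scoring on a hypothetical three-and-three sequence (the offset of the guesses, explained next, is included so the numbers match scoreRB):

   seq=. _1 1 1 _1 _1 1               NB. Hypothetical shuffle of 3 3#_1 1
   ] guess=. -*+/\0,}:seq             NB. Guesses, offset by one as in scoreRB
0 1 0 _1 0 1
   seq +/ . = guess                   NB. Three outright matches...
3
   (-:0+/ . =guess)+seq +/ . = guess  NB. ...plus half a point for each of three zero guesses
4.5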

The only other part of the code not yet mentioned is 0,}: in the first line which offsets the guesses by one from the sequence and represents our initial guess as a zero, reflecting the equal balance of red and black in the initial state. The last element is dropped (}:) to equalize the length of “guess” to that of “seq” and properly align them with each other. We could have accomplished the same thing by assigning the guesses using a rotation of the cumulative sum since we know the last element of this sum is always zero as there are equal numbers of ones and negative ones:

guess=. -*_1|.+/\seq   NB. Alternative calculation

Running the Code

Now that we have the simulation figured out, what results do we get? Running the code for one million simulations takes only about 20 seconds and gives an average score of about 30.04 as shown here.

   6!:2 'sc=. scoreRB 1e6$0'
22.136179
   (<./,>./,mean,stddev) sc
26 46 30.039792 2.3721194
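
How much can we trust that fourth digit? A rough gauge is the standard error of the mean, using the observed standard deviation from the run above:

   2.3721194 % %: 1e6   NB. Standard error = standard deviation % square root of number of simulations
0.00237212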

We justify rounding to four digits on the basis that the precision of a stochastic estimate like this scales as the square root of the number of iterations, so 10^6 iterations justify roughly three digits, and experience with this sort of simulation justifies reporting one more digit than that. Since this last digit is somewhat uncertain, at this point we can’t tell whether our estimate truly differs from the value of 30.007 given in the paper. So we need to run more simulations to achieve better precision. However, each additional digit requires about 100 times as many simulations, which means we probably want about 10^10 simulations to be confident of the five digits or so needed to compare our estimate with the value given in the paper. We can estimate the time for this by scaling the time for one million simulations from our earlier run:

   0 60 60#:22.136179*1e4  NB. Scale the 1e6-simulation time by 1e4, then convert seconds to hours, minutes, seconds.
61 29 21.79

So, we can get an answer in about 60 hours. However, we can get the answer more quickly by running multiple instances of the simulation in parallel. The following explores an example of how we might do this.

Running Multiple, Parallel Simulations

Parallel instances of simulations can be run on separate machines or on different cores of a single machine; the results, written to distinctly-named files, can be combined once the various runs have completed.

The parallel processes are spun off from a template file customized for each instance. Here’s an initial attempt at such a template. It has two parameters to be customized for each new instance: "{IterationsBatchSize}" is replaced by two integers giving the number of iterations and the number of simulations per iteration, and "{outputFile}" is replaced by the name of an output file in which to save the results.

NB.* runRorBSim0.template: template to run simulations of the Red or Black game
NB. multiple times.

scoreRB=: 3 : 0"0
NB. Simulate "Red or Black" game once per argument item (value ignored).
   guess=. -*+/\0,}:seq=. (] {~ [: ?~ #)26 26#_1 1
   (-:0+/ . =guess)+seq +/ . = guess
NB.EG scores=. scoreRB 1e6$0  NB. Scores for one million simulations.
)

stddev=: %:@var=: (ssdev=: +/@:*:@dev=: -"_1 _ mean=: (+/%#)) % <:@#
usus=: (<./,>./,mean,stddev)

NB.* int2csv: convert integers (full-precision) to comma-separated vector.
int2csv=: (LF,~ [: }. [: ; ',' ,&.> ] ":&.>~ [: <. [: >:10^.])
NB.* num2csv: convert numbers (floating-point) to comma-separated vector.
num2csv=: (LF ,~[: }. [: ; ',' ,&.> ":&.>)

NB. Replace {IterationsBatchSize} w/2 ints: # iters, # simulations/iteration.
3 : 0 ] {IterationsBatchSize}
   'ni nspi'=. y [ sttime=. 6!:1'' [ rand0=. ?2147483647
   (int2csv ni,nspi,rand0) fwrite '{outputFile}'
   while. _1<ni=. <:ni do.
       (num2csv usus scoreRB nspi$0) fappend '{outputFile}' end.
NB. Register elapsed clock time in last row.
   (num2csv sttime-~6!:1'') fappend '{outputFile}'
)
2!:55''  NB. Exit this J instance when done.

This last, anonymous function actually runs the simulations. Its first line sets values for the number of iterations ni and the number of simulations per iteration nspi, as well as the starting time sttime of the process; the final variable rand0 records a random number for reasons which will become clear shortly. The results of each batch of simulations are summarized: we retain only the minimum, the maximum, the mean, and the standard deviation for each iteration.
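
For reference, usus summarizes a vector of scores as its minimum, maximum, mean, and standard deviation, e.g.

   usus 1 2 3 4 5
1 5 3 1.58114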

This template is used in the following way to generate two parallel processes on a dual-core machine. These lines, entered in a J session, spin off two simultaneous runs of ten batches of one million simulations each, giving us 20 million simulations in all.

   JEXE=: '"C:\Program Files\J701\bin\jconsole.exe" '
   tmplt=. fread 'runRorBSim0.template'
   (tmplt rplc '{IterationsBatchSize}';'10 1e6';'{outputFile}';'RB0.csv') fwrite xflnm=. 'runR&B0.ijs'
   fork JEXE,xflnm
   (tmplt rplc '{IterationsBatchSize}';'10 1e6';'{outputFile}';'RB1.csv') fwrite xflnm=. 'runR&B1.ijs'
   fork JEXE,xflnm

This creates and runs two scripts. However, examination of the output reveals a problem with this method. After the simulations complete, we get the results from the files (after loading the CSV functions):

   fls=. ([:,"2[:".&>readcsv)&.>'RB0.csv';'RB1.csv'

We can also check how many seconds each task took:

   _1{"1 {:&>fls
200.938 206.282 

However, when we examine the parameters and the results of random number generation, we find that the random number is the same for both instances.

   ([: (LF-.~ int2csv)"1 }:"1) {.&>fls
10,1000000,2094548590
10,1000000,2094548590

This means that our parallel simulations are duplicates of each other, as we can verify by comparing the core results:

   -: / (}.@:}:)&.>fls
1

For this reason, our parallel runs do not complement each other: they achieve no greater precision than a single run would. We need to ensure that each set of simulations starts with a different random seed. Here’s one attempt at this: the main template function in a new file, runRorBSim1.template, which introduces a new parameter, "{randseed}", to specify the random seed for each instance.

3 : 0 ] {IterationsBatchSize}
   'ni nspi'=. y [ sttime=. 6!:1'' [ 9!:1]{randseed} [ max=. 2147483647
   (int2csv ni,nspi,2?@$max) fwrite '{outputFile}'
   (int2csv 4?@$max) fappend '{outputFile}'
   while. _1<ni=. <:ni do.
       (num2csv usus scoreRB nspi$0) fappend '{outputFile}' end.
NB. Register elapsed clock time in last row.
   (num2csv (3?@$max),sttime-~6!:1'') fappend '{outputFile}'
)
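
As an aside on why this matters: 9!:1 sets the seed of J’s random-number generator, and restarting from the same seed reproduces exactly the same stream of “random” numbers, which is just what happened to our two unseeded instances. A small illustration, using an arbitrary seed value:

   9!:1]12345        NB. Set the random-number seed (arbitrary example value)
   a=. ?10#100       NB. Ten random integers
   9!:1]12345        NB. Reset to the same seed...
   a -: ?10#100      NB. ...and the same ten "random" integers come out again
1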

We run two parallel instances of this modified version, built from the new template runRorBSim1.template, in the following way:

   tmplt=. (fread 'runRorBSim1.template') rplc '{IterationsBatchSize}';'10 1e6'
   pcs=. tmplt rplc '{outputFile}';'RB0.csv'
   (pcs rplc '{randseed}';":?2147483647) fwrite xflnm=. 'runR&B0.ijs'
   fork JEXE,xflnm
   pcs=. tmplt rplc '{outputFile}';'RB1.csv'
   (pcs rplc '{randseed}';":?2147483647) fwrite xflnm=. 'runR&B1.ijs'
   fork JEXE,xflnm

The results from this are encouragingly different (we examine several random numbers to ensure that setting the seed this way has the desired effect of starting entirely new random sequences):

   fls=. ([:".&>readcsv)&.>'RB0.csv';'RB1.csv'
   (LF-.~ int2csv)"1] 2{.&>fls
10,1000000,1111077906,181488023
286527472,1823991428,1503387545,1222810738

10,1000000,1134926174,591816778
394038841,1445999514,281044172,1120398835

Resolving the Discrepancy

Now that we have a good way to run multiple, parallel sets of simulations, we can get the results of ten billion runs relatively quickly. What do these look like? For one set of 10,000 batches of a million simulations per batch, we get a mean score of 30.0407. This looks significantly different from the published value of 30.007. Can we verify that one is a better estimate than the other? Fortunately, on re-reading the paper to prepare this essay, I noticed that its result was obtained using an approximation for a large factorial, and that a more precise formula is also given, as we see in this excerpt:

ApplyingStirlingApproximation.png

Here, the notation “C(2n,n)” can be rendered in J as the hook (! +:): since the dyad x ! y gives the binomial coefficient “y choose x”, applying (! +:) to n computes n ! 2n, i.e. C(2n,n), directly, without resorting to Stirling’s approximation. So, the first equation shown above can be written

   RBScore=: 3 : 'y+-:<:(4^y)%(! +:)y'

This gives us a value for S(26) of 30.040665, which agrees well with our ten-billion-simulation average of 30.0407.
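
As a quick check of this formula (after raising the print precision so all the digits show), S(26) reproduces the value above, and S(1) gives the obviously correct answer for a two-card deck, where the first guess is a coin flip and the second is certain:

   9!:11]8        NB. Set print precision to 8 significant digits
   RBScore 26
30.040665
   RBScore 1      NB. Two-card deck: 0.5 expected for the first guess plus 1 for the second
1.5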