Doc/Articles/Play161

From J Wiki
Jump to navigation Jump to search

At Play With J ... Table of Contents ... Previous Chapter ... Next Chapter

21. New Big Deal

. By Eugene McDonnell, with a major contribution from Roger Hui. First published in Vector, 16, 1, (July 1999), 113-119.

Chris Burke was displeased, to begin with. He had tried J's deal with a small left argument and a very large right argument, and this took a perceptibly long time. He tried it with a larger argument and was told he had run out of space. He tried it with a still larger argument, and was told he had exceeded deal's limits. He wrote to Roger Hui about these distressing circumstances, showing him several examples:

  (time=:6!:2) '1 ? 1e7'  NB. this takes too long
0.93

   1 ? 1e8
|out of memory
| 1 ?100000000

   1 ? 1e9
|limit error
| 1 ?1000000000

Roger's first thought was how to get around this limitation. He said that if the left argument is much smaller than the right Chris would be better off just doing ?x$y ; because deal allocates a boolean vector of length y to compute unique answers in the result of x?y. Thus a very large y would give the results Chris noted. He forwarded Chris's message to me, copying Chris, with the message "Perhaps Eugene can comment."

I checked the literature on my shelf but couldn't find any worthwhile "selection without replacement" algorithms. I then remembered the very early days of APL\360, before deal was made a primitive -- some time in 1966, perhaps. I had written a defined function to perform deal, using an algorithm that was quite fast. I remembered it vaguely, and thought that with I might be able to do better with J. I wrote one, tried it out on a dozen or so cases, then communicated it to Chris and Roger:

   deal =: dyad define
   NB. experimental deal
   NB. for small x and very large y
   count =: 0
   t =. i.0
   k =. 1.1 NB. adjust as you see fit
   NB. maybe make it a function of y
   u =. >. k * x
   whilst. (# t) < x do.
   t =. ~. ? u # y
   count =: count + 1
   end.
   x {. t
   )

As you can see, I was uncertain about the factor 1.1. I had put a counter in to see how often more than one execution of the whilst section was needed. In the few dozen cases I tried, there were none. Happily, the timings showed a large improvement over current deal for Chris's cases: the cases that were slow were much faster, and the range of the right argument was significantly extended.

Here are the timings I experienced:

   time '1 ? 1e7'
4.07
   100 time '1 deal 1e7'
0
   1000 time '1 deal 1e7'
0.00044
   1000 time '1 deal 1e8'
0.00044
   1000 time '1 deal 1e9'
0.00044
   1000 time '100 deal 1e7'
0.00082
   count
1
   1000 time '100 deal 1e8'
0.00076
   count
1
   1000 time '100 deal 1e9'
0.00083
   count
1
   ts=: 6!:2 , 7!:2@] NB. time and space
   100 ts '1 deal 1e7'
0 2240
   100 ts '1 deal 1e8'
0 2240
   100 ts '1 deal 1e9'
0.0005 2240
   100 ts '100 deal 1e7'
0.0005 4288
   100 ts '100 deal 1e8'
0.0005 4288
   100 ts '100 deal 1e9'
0.0005 4288

Roger thought this was neat. He implemented my algorithm, invoked when x<0.01*y. He rewrote my algorithm, simplifying it:

   deal =: dyad define
   u =. >. 1.1 * x
   while. x># t=. ~. ? u # y do. end.
   x {. t
   )

His C implementation looked like this:

static A bigdeal(m,n)I m,n;{A t,x,y;
RZ(x=sc((I)floor(1.11*m)));
RZ(y=sc(n));
do{RZ(t=nub(roll(reshape(x,y))));}while(m>AN(t));
R vec(INT,m,AV(t));
} /* E.E. McDonnell circa 1966, small m and large n */

But he worried that this would run into the birthday problem, which gets its name from its most celebrated instance, that the odds are in your favor if you bet that in a group of 23 or more people at least two of them will have the same birthday. The greater the number of people, of course, the higher the probability that this will occur. With more than 365 people, the probability is certain, or 1, since the pigeon hole principle dictates that if you have x pigeonholes and y pigeons, with x less than y, at least one of the pigeonholes will have more than one pigeon in it. What Roger wanted to know was, what is the value of x as a function of y where the probability of a duplicate appearing was 0.5. If he had this information, he could make the multiplier more accurate. Perhaps 1.11 wasn't good enough.

I wrote the following:

   Hui_test =: dyad define
   tests =: 0
   successes =: 0
   whilst. 1000 > tests =. >: tests do.
   successes =: successes + *./~:?x$y
   end.
   <. 100 * 0.001 * successes
   )

I ran this test for y in all of the hundreds, thousands, ten thousands, hundred thousands, millions, and 10,000,000 and 20,000,000. To make it easier to digest I'll only show the results for y=:10^2 3 4 5 6 7. The other results are consistent with these.

       y    x
     100   12
    1000   37
   10000  116
  100000  370
 1000000 1180
10000000 3740
 I thought this looked roughly like a square root relationship so tried a few maneuvers:
   %:y
10 31.6228 100 316.228 1000 3162.28
   %:1.4*y
11.8322 37.4166 118.322 374.166 1183.22 3741.66
 This was quite a good fit, and shows:
   y = (x^2) % 1.4
   x = %: 1.4 * y

to a close approximation. I communicated these results to Roger. He studied this and was able to apply some more theory to it, and wrote back to me: In choosing m items from a universe of n items, the probability of all the m items distinct is (*/n-i.m) % (n^m). The numerator is the number of ways of choosing m distinct items; the denominator is the number of ways of choosing m items. Thus:

   (*/n-i.m) % (n^m)                      (a)
   (*/n-i.m) % (*/m$n)
   */ (n-i.m) % (m$n)
   */ (n-i.m) % n

   f=: [: */ ] %~ i.@[ -~ ]

   22 23 f"0 ] 365 NB. the classic birthday problem
0.524305 0.492703

The first m, for which m f n is less than 0.5, is:

   1 + (0.5 > */\ n %~ n - i.n) i. 1

   f1=: >: @ (i.&1) @ (0.5&>) @ (*/\) @ ([ %~ [ - i.)
   f1 365
23
   n=: , (,10^1 2 3 4 5)*/1 2 4 8
   m=: f1"0 n
   n,.m ,. %: 1.4*n
    10    5 3.74166
    20    6  5.2915
    40    8 7.48331
    80   11  10.583
   100   13 11.8322
   200   17 16.7332
   400   24 23.6643
   800   34 33.4664
  1000   38 37.4166
  2000   53  52.915
  4000   75 74.8331
  8000  106  105.83
 10000  119 118.322
 20000  167 167.332
 40000  236 236.643
 80000  334 334.664
100000  373 374.166
200000  527  529.15
400000  745 748.331
800000 1054  1058.3

For extremely large values of n, Roger saw that the function f1 is impractical to compute, and he concluded that a method based on root finding on the original function f, using my approximation of m=: %: 1.4*n as an initial guess, would be more suitable.

1e7 3724
1e8 11775
1e9 37234
2e9 52656

The end result was that Roger found that the rule I had originally suggested of switching to the "roll" algorithm for m?n when m<0.01*n does run into the birthday problem, and he replaced it with the more accurate rule I had found following his suggestion. There the matter rests, with one small postscript. As I studied Roger's mathematics, particularly the phrase (*/n-i.m) in line (a) above, I recalled J's "falling factorial" function. The J Dictionary defines this as follows:

The fit conjunction applies to ^ to yield a stope defined as follows: x^!.k n is */x + k*i. n. In particular, ^!._1 is the falling factorial function.

Let me elaborate on this. Think of the caret (^) as being defined in the first instance as denoting a function of three arguments: a base, a count, and a step. Then caret (base, count, step) is the product over base + step * integers count. For example,

  'base count step' =. 3 5 7
   i. count
0 1 2 3 4
   step * i. count
0 7 14 21 28
   base + step * i. count
3 10 17 24 31
   */ base + step * i. count
379440

   caret =: monad define
   NB. general caret function
   NB. y is a list of three values:
   NB.  base
   NB.  count
   NB.  step
   'base count step' =. y
   */ base + step * i. count
   )
   caret(3 5 7)
379440

This generality hides the fact that there are really just three variations of significant interest, steps having values _1, 0, and 1. These three provide falling factorials, steady factorials (powers) and rising factorials. Each of these has to do with a product over count number of values, beginning with base , and continuing with values increasing by step . See what results come with a base of 5 and a count of 3, with each of the three significant step sizes:

   caret 5 3 _1
60
   5 * 4 * 3
60
   caret 5 3 0
125
   5 * 5 * 5
125
   5 ^ 3
125
   caret 5 3 1
210
   5 * 6 * 7
210

The case with step zero is the default case of caret, and is the power function. We can use the falling and steady factorial cases to write a more compact version of Hui's function f:

   probdupes =: dyad define
   NB. Probability of no duplicates when drawing with
   NB. replacement from among the first count integers
   base =. x
   count =. y
   (base ^!._1 count) % (base ^!.0 count)
   )

NB. simplified version of probdupe
   pd =: ^!._1 % ^

NB. version exploiting family resemblances
   pdx =: [: %/ ^!._1 0
   365 probdupes 23
0.492703
   365 pd 23
0.492703
  365 pdx 23
0.492703

These functions fail when the values of the falling factorial get very large, causing its value to exceed the maximum real value, and thus to be represented by machine infinity. When this happens, the steady factorial (power) value will already be machine infinity, and the quotient of two infinities is Not a Number, or NaN. To get around this problem, use extended integers as arguments, and extended precision inverse on the result. When the result is smaller than the smallest nonzero number, a result of zero is forced:

   365 pd 200                NB. result is NaN error, from _ % _
|NaN error: pd
|   365     pd 200

   x:^:_1 [ 365x pd 200x     NB. result of zero is forced
0

Vector Vol.16 No.1