NYCJUG/2006-10-10

From J Wiki
Jump to navigation Jump to search

Beginner's regatta

Parametric Plots

There is a new plot feature for specifying a range and the name of a function to apply over that range. Here are some example of using it.

Expression Plot
       'pensize 3' plot _1p1 1p1;'sin' 

Here we supply the the end points of _π to π and the sin function, defined as sin=: 1&o.. We use the fat pensize so the plot shows up better when presented in a small format.

PlotSin.jpg

Here we plot both sin and cos (defined as cos=: 2&o.) from negative two pi to positive pi.

       'pensize 3' plot _2p1 1p1;'sin`cos' 

The backtick between the two functions gives us a vector of functions, each applied to the argument implied by the point range.

PlotSinCos.jpg

This parametric plot of sin on the x axis and cos on the y gives us regularly-spaced points on a circle.

       egparm=: 3 : '<"1 (sin,:cos)y'
       'point;pensize 5' plot _1p1 1p1;'egparm'
    |domain error: plotf_userverb__origlocale
    |   'yvals zvals'=.    plotf_userverb__origlocale pfmody^:_1 yscrn=.steps y
       'point;pensize 5' plot egparm i:4j100
ParametricPlotSinCos.jpg

Show-and-tell

Pascal's rectangle (see pascalRectangles.txt and pascal.ijs). Also, see "Tiny Precision Differences.doc".

Pascal's Rectangle

You may have heard of Pascal's triangle but what is Pascal's rectangle?

The Problem: Walking Blocks

Recently we set out to calculate the number of possible paths one might have on a walk that traverses a rectangle so many blocks wide by so many blocks long. The only rule is that there is no backtracking. We eventually came up with this nice graphical representation of the problem.

   5 pascalRect 5
1 1  1  1  1
1 2  3  4  5
1 3  6 10 15
1 4 10 20 35
1 5 15 35 70

This rectangle gives the number of all possible paths one could take walking down 5 blocks and over 5 blocks. The number at each point in the grid is how many different paths intersect there. So, we see that the first row and column all have only a single path along them, whereas interior points have more. Looking at just the upper left corner, we can see how it works.

1 1  1  
1 2  3  
1 3  6

There is one path at the starting point and one of two rectangular directions we can move from that one. However the "2" represents the two different ways we could reach that point, either by moving across and down or down and across.

At this time, my regular daily walk was from 14th and 1st to 32nd and Park Avenue, 5 blocks over and 18 blocks up, which would be represented like this:

         
   5 pascalRect 18
1 1  1  1  1   1   1   1   1   1    1    1    1    1    1    1    1    1
1 2  3  4  5   6   7   8   9  10   11   12   13   14   15   16   17   18
1 3  6 10 15  21  28  36  45  55   66   78   91  105  120  136  153  171
1 4 10 20 35  56  84 120 165 220  286  364  455  560  680  816  969 1140
1 5 15 35 70 126 210 330 495 715 1001 1365 1820 2380 3060 3876 4845 5985

We can derive this analytically:

   0!18                     NB. The upper left corner
1
   0 1 2 3 4 5!18           NB. This gives part of an anti-diagonal from the right.
1 18 153 816 3060 8568
   0 1 2 3 4!17 18 19 20 21 NB. This gives the first column
1 18 171 1140 5985

This was based on my daughter Maia's idea - for a 5x18 walk, we have 5+18 or 23 blocks to cover, minus the endpoints, for each path, so there are this many paths from one corner to the far corner:

   (5-1)!(18-1)+(5-1)
5985

The Code Version 1 - Slow and Wasteful

We depend on this utility to generate the next row of Pascal's triangle given the current one:

pascalNextRow=: ({.,(2: (+/)\]),{:)

As we can see, it saves the first and last items from the input but inserts the cumulative sum of adjacent pairs between them.

It gives these results:

    pascalNextRow ,1
1 1
      pascalNextRow pascalNextRow ,1
1 2 1
      pascalNextRow pascalNextRow pascalNextRow ,1
1 3 3 1

Generalizing with the power conjunction "^:":

     pascalNextRow ^:3 ] ,1

1 3 3 1

      pascalNextRow ^:10 ] ,1

1 10 45 120 210 252 210 120 45 10 1

     pascalNextRow ^:(i.10) ] ,1

1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 2 1 0 0 0 0 0 0 0 1 3 3 1 0 0 0 0 0 0 1 4 6 4 1 0 0 0 0 0 1 5 10 10 5 1 0 0 0 0 1 6 15 20 15 6 1 0 0 0 1 7 21 35 35 21 7 1 0 0 1 8 28 56 70 56 28 8 1 0 1 9 36 84 126 126 84 36 9 1

This gives us a tool to build as many rows of the triangle as we need which we can then shift into place in the rectangle.

NB.* pascalRect1: make matrix with Pascal's triangle arranged on minor diags.
pascalRect1=: 4 : 0
   pt=. ,:lastrow=. ,1
   max=. 2*x+y             NB. Need twice the rows of triangle to fill mat.
   while. 0<max=. <:max do.
       pt=. pt,lastrow=. pascalNextRow lastrow
   end.
   rectpt=. y{."1 x{.(2$>.-:$pt){.(i.1{$pt)|."(0 1) |:pt
NB.* 1 1 1 1 1 2 3 4 1 3 6 10 -: 3 pascalRect 4
NB.* 1 1 1 1 1 2 3 4 1 3 6 10 1 4 10 20 1 5 15 35 -: 5 pascalRect 4
)

Here's what it does:

   5 pascalRect1 18
1 1  1  1  1   1   1   1   1   1    1    1    1    1    1    1    1    1
1 2  3  4  5   6   7   8   9  10   11   12   13   14   15   16   17   18
1 3  6 10 15  21  28  36  45  55   66   78   91  105  120  136  153  171
1 4 10 20 35  56  84 120 165 220  286  364  455  560  680  816  969 1140
1 5 15 35 70 126 210 330 495 715 1001 1365 1820 2380 3060 3876 4845 5985
   

The Code Version 2 & Up - Better

This version is a little shorter and simpler but it still has an explicit loop.

NB.* pascalRect4: make matrix with Pascal's triangle arranged on minor diags.
pascalRect4=: 3 : 0
   'rows cols'=. x: y
   pr=. (1x) 0}y$0x
   for_ii. }.i.rows do.
       pr=. (+/\(<:ii){pr) ii}pr
   end.
)

This direct method based on an idea from Maia McCormick.

NB.* pascalRect2: make matrix with Pascal's triangle arranged on minor diags.
pascalRect2=: 3 : 0
   'rows cols'=. x: y
   cols{."1(i.rows)|."(0 1)(i.rows)!/i.rows+cols
)

Based on simplifying pascalRect4 above, we get to another version with no explicit loop:

NB.* pascalRect5: make matrix with Pascal's triangle arranged on minor diags.
pascalRect5=: 3 : 0
   'rows cols'=. y
   (+/\^:(i.rows))cols$1x
)

Finally, just a substitution into the above to eliminate two temporary local variables.

pascalRect=: 3 : '+/\^:(i.{.y) ] 1x#~{:y'

Since all of these take the same two parameters on the right, we can compare them simply:

   *./({.-:&>}:)(pascalRect;pascalRect2;pascalRect4;pascalRect5) 5 18
1

Tiny Precision Differences

I stumbled across this trying to help my daughter with her homework of finding a zero for the following function. I attempted to cheat by graphing it but was puzzled by what I got until I realized that, applied to negatives, this function returns complex numbers and the plot package treats these as 2-D point pairs.

I actually solved it for one root by inspection.

   ff=: 13 : '((3*y^%2)-(5*y^%4))-2'    NB. Here we use explicit numbers for the fractional powers
   ff2=: 13 : '((3*%:y)-5*%:%:y)-2'     NB. Here we use (repeated) square roots %: instead

These should be the same but there's a tiny precision difference

   (ff-:ff2)i:16j1000    NB. Not the same over this domain
0
   (ff,ff2) _16          NB. Negative arguments give complex results
_9.0710678j4.9289322 _9.0710678j4.9289322

   >./"1 | 9 11 o./ (ff-ff2)i:16j1000  NB. So compare real and imaginary parts
1.7763568e_15 2.6645353e_15            NB.  separately->tiny differences...
   plot (ff-ff2) i: 16j1000            
NB. Gives an interesting pattern.

I don't know what causes this but it's probably some periodic rounding/precision issue at the least significant part of the pair of floating point numbers representing the very small differences between two things that ought to be the same.

Small Real vs Imaginary Differences in Small Floating Point Numbers.jpg

Advanced topics

an idea for a project: the Netflix challenge (see netflixchallengeRules.txt, NetflixChallengeREADME.txt, and toc.dir).

The Netflix Challenge

The video distribution company Netflix has an algorithm for making suggestions about movies based on a user's rating of other movies.

The Dataset

They put together an anonymized dataset described like this:

The movie rating files contain over 100 million ratings from 480 thousand randomly-chosen, anonymous Netflix customers over 17 thousand movie titles. The data were collected between October, 1998 and December, 2005 and reflect the distribution of all ratings received during this period. The ratings are on a scale from 1 to 5 (integral) stars....

The date of each rating and the title and year of release for each movie id are also provided.

There is a training set of records with one file for each of 17,770 movies. Each file contains a movie ID and a customer rating, one per line, in this format: Customer ID, rating, date.

Futhermore,

  • - MovieIDs range from 1 to 17770 sequentially.
  • - CustomerIDs range from 1 to 2649429, with gaps. There are 480189 users.
  • - Ratings are on a five star (integral) scale from 1 to 5.
  • - Dates have the format YYYY-MM-DD.

The Rules

This is an outline of the rules but without the page or so of legalese following it.

We’re quite curious, really. To the tune of one million dollars.

Netflix is all about connecting people to the movies they love. To help customers find those movies, we’ve developed our world-class movie recommendation system: CinematchSM. Its job is to predict whether someone will enjoy a movie based on how much they liked or disliked other movies. We use those predictions to make personal movie recommendations based on each customer’s unique tastes. And while Cinematch is doing pretty well, it can always be made better.

Now there are a lot of interesting alternative approaches to how Cinematch works that we haven’t tried. Some are described in the literature, some aren’t. We’re curious whether any of these can beat Cinematch by making better predictions. Because, frankly, if there is a much better approach it could make a big difference to our customers and our business.

So, we thought we’d make a contest out of finding the answer. It’s "easy" really. We provide you with a lot of anonymous rating data, and a prediction accuracy bar that is 10% better than what Cinematch can do on the same training data set. (Accuracy is a measurement of how closely predicted ratings of movies match subsequent actual ratings.) If you develop a system that we judge most beats that bar on the qualifying test set we provide, you get serious money and the bragging rights. But (and you knew there would be a catch, right?) only if you share your method with us and describe to the world how you did it and why it works.

Serious money demands a serious bar. We suspect the 10% improvement is pretty tough, but we also think there is a good chance it can be achieved. It may take months; it might take years. So to keep things interesting, in addition to the Grand Prize, we’re also offering a $50,000 Progress Prize each year the contest runs. It goes to the team whose system we judge shows the most improvement over the previous year’s best accuracy bar on the same qualifying test set. No improvement, no prize. And like the Grand Prize, to win you’ll need to share your method with us and describe it for the world.

There is no cost to enter, no purchase required, and you need not be a Netflix subscriber. So if you know (or want to learn) something about machine learning and recommendation systems, give it a shot. We could make it really worth your while. Terms and Conditions in a Nutshell

  • Contest begins October 2, 2006 and continues through at least October 2, 2011.
  • Contest is open to anyone, anywhere (except certain countries listed below).
  • You have to register to enter.
  • Once you register and agree to these Rules, you’ll have access to the Contest training data and qualifying test sets.
  • To qualify for the $1,000,000 Grand Prize, the accuracy of your submitted predictions on the qualifying set must be at least 10% better than the accuracy Cinematch can achieve on the same training data set at the start of the Contest.
  • To qualify for a year’s $50,000 Progress Prize the accuracy of any of your submitted predictions that year must be less than or equal to the accuracy value established by the judges the preceding year.
  • To win and take home either prize, your qualifying submissions must have the largest accuracy improvement verified by the Contest judges, you must share your method with (and non-exclusively license it to) Netflix, and you must describe to the world how you did it and why it works.

Learning and teaching J

Here we look at a way of applying rank to apply a function between arrays in an unusual way.

Iterated Rank

From: Jose Mario Quintana <JoseMarioQuintana@2bestsystems.com>	
Reply-To: General forum <general@jsoftware.com>
To: General forum <general@jsoftware.com>
Date: Sep 25, 2006 8:57 AM
Subject: RE: Cell (was Re: [Jgeneral] I expect a table of orderedpairs forthis)
   > On Behalf Of Roger Hui
   > When I talk about the rows of a matrix I am
   > talking about the 1-cells of a matrix.  Not a verb
   > in sight.  Likewise for items.

After some thought and modifying the question:

What is an instance of a coding usage of talking about n-cells of an array without referencing a verb?

  ir=. ("1 0) ("1 1)  NB. An instance of talking about n-cells
                      NB. without a verb in sight

  (i.2 3) *ir (i.2 4) NB. An instance of eventual usage
 0  0  0
 0  1  2
 0  2  4
 0  3  6

12 16 20
15 20 25
18 24 30
21 28 35

This might seem purely academic but apparently at least one person finds this kind of "iterated rank" constructions useful: http://www.jsoftware.com/pipermail/general/2004-September/018139.html

   > This may or may not answer your question depending
   > on what "implicitly reference" or "practical use"
   > means.
   >
   > From: Jose Mario Quintana <JoseMarioQuintana@2BestSystems.Com>

Trying This

We have the dissect tool available now but it was not around at the time of this discussion. Here's what we get from it:

IteratedRankDemoDissected.jpg

We get more detail if we don't hide the rank conjunction behind a name:

IteratedRankExplicitDissected.jpg

Some Experimentation with Iterated Rank

   i. 2 3
0 1 2
3 4 5
   i. 2 4
0 1 2 3
4 5 6 7

So, (i.2 3) *ir (i.2 4) multiplies each scalar in i. 2 3 by each row in i. 2 4.

Not quite:

   (i. 2 3)*"(0 1) i. 2 4
 0  0  0  0
 0  1  2  3
 0  2  4  6

12 15 18 21
16 20 24 28
20 25 30 35
   
   ir=. ("1 0) ("1 1)  NB. An instance of talking about n-cells
   (i.2 3) *ir (i.2 4) NB. An instance of eventual usage
 0  0  0
 0  1  2
 0  2  4
 0  3  6

12 16 20
15 20 25
18 24 30
21 28 35
   0 1 2 */ 0 1 2 3
0 0 0 0
0 1 2 3
0 2 4 6
   |: 0 1 2 */ 0 1 2 3
0 0 0
0 1 2
0 2 4
0 3 6
   |: 3 4 5 */ 4 5 6 7
12 16 20
15 20 25
18 24 30
21 28 35

Some equivalent expressions:


   |:(|:i. 2 3)*"1/|:i.2 4
 0  0  0
 0  1  2
 0  2  4
 0  3  6

12 16 20
15 20 25
18 24 30
21 28 35

Get rid of the plethora of explicit transposes above by using "each" with transpose:

   (i. 2 3)*"1/&.|:i.2 4
 0  0  0
 0  1  2
 0  2  4
 0  3  6

12 16 20
15 20 25
18 24 30
21 28 35

So, this is applying multiplication orthogonally between the column vectors.

End Quotes

   Convictions are more dangerous enemies of truth than lies.
     - Friedrich Nietzsche

   A man who tells lies, like me, merely hides the truth. 
   But a man who tells half-lies has forgotten where he put it.
     - Mr. Dryden, in Lawrence of Arabia