NYCJUG/2007-04-10

NYCJUG meeting; explaining dimensionality, frames, cells, rank; 64-bit J; Netflix Bayesian approach; installation standards; Linux; re-framing efficiency.

NYC JUG Meeting of 20070410

In this meeting, we talked about how we think about dimensionality (along with frames, cells, axes, and rank), problems (and eventual success) installing J on a 64-bit machine, what J needs to do to adhere to common installation standards, the Dr. Dobbs states' anagram puzzle and reframing the efficiency discussion.

Meeting Agenda for NYC JUG 20070410

1. Beginner's regatta: what is a good way to clarify succession of array
dimensionality?  What is behind the confusion about frames?  See
"frameConfusion.txt", "arrayShapeDimensionality.txt" and
"Array description-excerpt from JfCP.doc".  Also, see use of "frame" in
section 4 below.


2. Show-and-tell: 64-bit J installation and use.

Netflix update.


3. Advanced topics: why does J not adhere to installation standards?


4. Learning and teaching J: reframing the efficiency discussion; see
"reframingEfficiency.txt" and "DrDobbsStatesAnagramPuzzle.txt".


5. Publicizing J: Dr. Dobbs author guidelines "DrDobbsAuthorGuidelines.doc":
http://www.ddj.com/authors.htm#2.

The 10th ICFP Programming Contest?  See "ICFP Programming Contest
announcement.doc".


            +.--------------------+.

Arrogance is consolation for those
who are doomed to stay small.
  - Arabic proverb

Proceedings

We'll look at the topics more or less in agenda order. We only barely touched on our new section 5 "Publicizing J".

Beginner's regatta: Explaining Dimensionality

There has been a lot of discussion recently on the J Forum that indicates the difficulty people have understanding how J looks at multi-dimensional arrays. Since this is one of the language's great strengths, and may be something most of us have not considered for a long time, it might pay to re-visit how we think about this.

When I first learned these concepts in APL, I thought of zero-dimensional arrays as corresponding to a geometric point; one-dimensional arrays are like line segments; two-dimensional arrays are plane segments, and so on to higher-dimensional arrays. For some reason though, in the recent discussions, the conceptual problems seem to be with the smallest dimensionality (scalars), not the large ones as we might expect. However, I think this confusion can be resolved by examining the ramifications of the following insight.

One great insight of APL, and J, is that

  The shape of any array is a vector of non-negative integers.

This simple notion allows us to work with even high-dimensional arrays quite easily because we don't have to grapple with how to visualize them explicitly: we can look at only the vector of numbers representing the array's dimensions. I'll shortly give an example of how this is helpful for higher-dimensional arrays.

Sticking with this simple insight - the shape is a vector of numbers - makes sense "all the way down":

  A three-dimensional array's shape consists of three numbers.
  A two-dimensional array's shape consists of two numbers.
  A one-dimensional array's shape consists of one number.
  A zero-dimensional array's shape consists of zero numbers.

The other sticking point, besides why scalars have the empty vector as their shape, seems to be frames versus cells. Again, remembering only that the shape is a vector, we can explain this dichotomy thusly: for a given shape (vector of numbers), draw an imaginary (arbitrary) line between any two numbers, separating the vector into two parts: the left portion is the frame, the right one is the cells. For example, the shape of "i. 2 3 4 5" can be divided like this

   2 3 | 4 5

giving us a frame of shape "2 3" and cells of shape "4 5". The default framing treats the leading axis specially, so this array, by default, would be

  2 | 3 4 5

that is, two items of shape "3 4 5".

That's all there is to it.

Everything else is "sound and fury, signifying nothing."

At our next meeting, I had come up with the following diagram to attempt to summarize the above:

Easy 4-D Arrays

I've recently had occasion to work with four-dimensional arrays. One of the things I've been doing is reducing a 4-D array on each of its dimensions in turn. I never try to look at the whole array as it's awkward and difficult to display neatly. Instead, I look only at its shape.

So, if my array has shape "20 30 80 25", I can sum the 20 3-D arrays by +/array. To move the second coordinate "30" to the front, I use dyadic transpose

   $1 0 2 3|:array
30 20 80 25

again, looking only at the shape.

So I can sum the 30 3-D arrays by +/1 0 2 3|:array.

Similarly to move the third coordinate "80" to the front and sum, simply do this +/2 0 1 3|:array.

By considering only the easily-displayable shape, I'm always certain I'm working on the array the way I want to.

Learning and Teaching J

It's about time we in the Array-Language Community took charge of the discussion about "programming efficiency" and re-framed it in a way that makes more sense. Not so coincidentally, this also strengthens the case for interpreted, interactive environments.

Reframing the Efficiency Discussion

Recently, there was thread on the J Programming Forum about solving a problem posed in the Dr. Dobbs blog: http://www.ddj.com/dept/architect/198701685?cid=RSSfeed_DDJ_All. Randy MacDonald had pointed out this problem: http://www.jsoftware.com/pipermail/chat/2007-April/000467.html. The problem is to figure out which two US states have names that are the anagrams of two other US states. The article in question has four parts in which the author develops and refines a solution using C++.

From the perspective of truly high-level langages, a four-part article requires shameless padding: this is a very simple problem. The padding comes in because of the clumsiness and notational overhead of a language like C++. Also, much of the refinement is aimed at speeding up the original solution (which was unacceptably slow).

I posted this comment in response to the original request for ideas on solving the problem:

Reply Devon McCormick to Programming  5:28 pm (7 hours ago)
subject	 [Jprogramming] index error on load (but works in console)

I just solved this and posted a come-on for my usual rant about these
sorts of problems: what was the total time to reach a solution, coding plus
execution time?

As is common in these sort of articles, there are multiple iterations to
solving the problem but the main concern at each iteration is CPU time.
This emphasis frequently puts J at an apparent disadvantage.  We should
not let the dialogue continue to be framed in terms of CPU time to the
neglect of total time to solution.

--
Devon McCormick, CFA

Randy MacDonald agreed with me:

Reply ramacd <ramacd@nbnet.nb.ca> to Programming 5:57 pm (7 hours ago)

Hi Devon;

Yes, the CPU time metric is still aggravatingly present. IME, APL have had
nothing to be ashamed about since 1990, J essentially for its entire
existence.

I also reject the "written once, executed many times" rationale.  Software
that is worth writing gets improved in minutes, with perhaps a few dozen
executions before it gets those improvements.

BTW, I roughly recall 5" for coding, and 0.9" for execution, but that's just
me.

------------------------------------------------------------------------
|\/| Randy A MacDonald   | APL: If you can say it, it's done.. (ram)
...

Once Randy was brave enough to give an estimate of his total time-to-solution, I followed suit:

Reply Devon McCormick to Programming 6:16 pm (6 hours ago)

It took me about 21 minutes of coding and the CPU time was negligible,
roughly 0.01 seconds.  Of course, most of the coding time was spent
assembling the data and wandering up a blind alley but that's all
"billable hours".

In our NYCJUG meeting, two others had also solved this problem and kindly offered their estimates of how long it took them. Dan estimated that his initial idea took him about 30 seconds but that the code ran for about three hours because he considered all possible anagrams. John's reported time was very close to mine. Dan also opined that the ratio of 30 seconds of his time to three hours of computer time was fine with him; I agree with this idea. Why does the bottleneck to productivity of problem-solving - the coder's time - continue to be ignored whereas the cheap resource - CPU time - continue to be the primary consideration? Maybe it's because CPU time is simple to measure and doesn't this measurement doesn't threaten one's ego as much as admitting how long it took to figure out a problem?

This is a direction we should push all discussions of this type whenever we see them: what is the total time-to-solution? For most problems, this is much more relevant than any other metric. It's ludicrous that these discussions are still framed in terms appropriate to the mainframe, batch-processing era of 40 years ago and we should point this out relentlessly.

Scan of Meeting Notes