NYCJUG/2010-03-09

parallel processing, problems, introductory material, multi-core, multi-thread, multicore, multithread

Location::ThomasNet, 5 Penn Plaza, NYC (Partial) video recording of the meeting here.

Agenda for NYCJUG of 20100309

1. Beginner's regatta: what would entice beginners and how do we figure
out what they want and need?  See "BeginnerHarness.pdf".


2. Show-and-tell: de-speckling an image - could be a third-level of complexity
for parallelism: see "ShadesOfGray.pdf".

A simple example of parallelization: see "SimpleExampleOfParallelization.pdf".


3. Advanced topics: the complexities of multi-core multi-threading: see
"CommonMulticoreProgrammingProblems.pdf", "FollyOfDo-it-yourselfMultithreading.pdf", "MulticoreMatrixMultiplyEG.pdf", and "ParallelArchitectureInJ.pdf".


4. Learning, teaching and promoting J: there is no such thing as standard C:
see "AFewBillionLinesLater-CodeVerification.pdf"; an example of exposition:
see "ExamplesFromPakinReferenceManual.pdf".

A large set of problems: see "ProblemSetsUVa.pdf".

Beginner's regatta

We considered the following request from Jim Korn:

I'd like to have a general purpose shell to create desktop widgets: a procedure
that describes how to create an executable through a series of steps with an <insert your code here>
sort of thing.  A simple widget might be an application like an alarm clock (your timer);
an intermediate one, something the old windows cardfile.exe; and advanced, a media widget
that allows both cursor and keyboard shortcut controls for mute, volume, play, pause, etc.
and maybe goes the extra level of dealing with Flash, etc. (I've never fully understood the
layers of these things: operating system, application, etc.).   As a beginner if I had a
generic shell I might be tempted to learn to program more.  I know many (all?) of the pieces
are available for this.  If what I'm talking about already exists, maybe a beginner's segment
on it would be of interest.

Jim has offered to put together a more detailed specification on this idea for anyone interested in attempting to implement it. You can contact Jim by posting to the J Forum.

At the meeting, though we all agreed that something to entice beginners is a good thing, this particular project strongly emphasizes the GUI aspect of programming which is not really a core strength of J. Writing a GUI toolkit would be a non-trivial project in its own right. Perhaps if someone has a suggestion for hooking J code into an existing widget framework we could better accommodate this suggestion.

Dan had a suggestion about elucidating the connections between J and other APLs (array-programming languages). This is a topic that has been touched on in the Forum traffic but is perhaps of limited interest to a general population.

We also discussed what sort of J might interest students studying beginning statistics. We agreed with Dan's idea that most people would like anything that helped them with their homework. We'll consult with someone teaching such a class to find out what these kind of students might want and what they find good or bad about J.

Advanced Topics: Parallel Programming Taking Advantage of Multi-Core Architectures

We switched the order of the agenda to look at this third item before the second one as it led to a more natural progression of ideas.

We looked at an essay on some difficulties of multi-core programming - see File:CommonMulticoreProgrammingProblems.pdf. This focused on some issues we've talked about before in our discussion of parallel programming, including the paramount importance of minimizing data movement and the related problems of maintaining consistency of data split between processes and handling contentions between multiple processes working on the same data.

Some of the suggestions in this essay were remote from a J-programmer's consideration, particularly the suggestion to re-arrange data structures for tighter byte-packing to reduce the amount of data movement between cores. This is the sort of bit-twiddling we are happy to leave behind when we work in J.

However, the following suggestion for working with data in cache probably has general applicability though we don't have explicit control of that (and it's probably a mistake to customize code too much for the particular machine on which one happens to be working at the time).

The interesting thing I noticed about the code examples in this essay - variants of the Sieve of Eratosthenes - and the code in some of the other essays - such as File:FollyOfDo-it-yourselfMultithreading.pdf and File:MulticoreMatrixMultiplyEG.pdf - was how much work had to be done on simple, straight-forward code to take advantage of multi-core architecture. These latter two essays also start with simple code examples, the Fibonacci series and matrix-multiply, respectively, and progressively show how to analyze and modify them to use a particular product ("Cilk") to multi-thread and use multiple cores.

The Fibonacci code starts out looking something like this:

1	#include <stdio.h>
2	#include <stdlib.h>  	  	
3	int fib(int n)
4 	  	{ if (n < 2) return n;
5 	  	  else {
6                   int x = fib(n-1);
7	  	    int y = fib(n-2); 	  	
8 	  	    return x + y;
9 	  	  }
10 	  	}
11	int main(int argc, char *argv[])
12 	  	{ int n = atoi(argv[1]);
13 	  	  int result = fib(n);
14 	  	  printf("Fibonacci of %d is %d.\n", n, result);
15 	  	  return 0;
16 	  	}

I won't show the do-it-yourself multi-threading code given as an example of why you shouldn't do this yourself - it weighs in at over 40 lines, but the good example, using "Cilk" ends up looking like this after multi-threading is added:

1	#include <stdio.h>
2 	#include <stdlib.h>
3	#include <cilk.h>
4	int fib(int n)
5		{ if (n < 2) return n;
6 	  	  else {
7 	  	    int x = cilk_spawn fib(n-1);
8 	  	    int y = fib(n-2);
9 	  	    cilk_sync;
10 	  	    return x + y;
11 	  	  }
12		}
13	int cilk_main(int argc, char *argv[])
14 	  	{ int n = atoi(argv[1]);
15 	  	  int result = fib(n);
16 	  	  printf("Fibonacci of %d is %d.\n", n, result);
17 	  	  return 0;
18 	  	}

which doesn't seem too bad. However, the next example shows a lot more bloat when multi-threading is added.

This matrix-multiply example was illuminating in a couple of respects. One is that the author's initial attempt to multi-thread the algorithm resulted in worse performance. The other is the hint that it might be possible to take advantage of these techniques at the level of the J-interpreter (using the tools being promoted).

This latter idea speaks to the siren call of parallelism that we've all heard: automatic, painless parallelism without having to change our own code. However, I'm afraid the reality is usually the opposite - as demonstrated by the examples in these essays - that taking advantage of this requires lots of work. It looks like it often requires pretty substantial revision of existing code.

Here's how the code starts out:

void matrix_multiply_1(matrix_t A, matrix_t B, matrix_t C) {
    for (int i = 0; i < A.rows; ++i) {
       for (int j = 0; j < B.cols; ++j) {
        for (int k = 0; k < A.cols; ++k)
            C[i][j] += A[i][k] * B[k][j];
                                }
                            }
                        }

and here's what it ends up like even with the advantage of using a product designed to help implement multi-threading:

void matrix_multiply_5(matrix_t A, matrix_t B, matrix_t C,
        int i0, int i1, int j0, int j1, int k0, int k1) {
    int di = i1 - i0;
    int dj = j1 - j0;
    int dk = k1 - k0;
    if (di >= dj && di >= dk && di >= THRESHOLD) {
     int mi = i0 + di / 2;
     cilk_spawn matrix_multiply_5(A, B, C, i0, mi, j0, j1, k0, k1);
     matrix_multiply_5(A, B, C, mi, i1, j0, j1, k0, k1);
     cilk_sync;
} else if (dj >= dk && dj >= THRESHOLD) {
  int mj = j0 + dj / 2;
  cilk_spawn matrix_multiply_5(A, B, C, i0, i1, j0, mj, k0, k1);
  matrix_multiply_5(A, B, C, i0, i1, mj, j1, k0, k1);
  cilk_sync;
} else if (dk >= THRESHOLD) {
  int mk = k0 + dk / 2;
// N.B. It's not safe to use a spawn here. Fun exercise: try putting
// it in and then running Cilkscreen to detect the resulting race.
   matrix_multiply_5(A, B, C, i0, i1, j0, j1, k0, mk);
   matrix_multiply_5(A, B, C, i0, i1, j0, j1, mk, k1);
} else {
// The problem is now small enough that we can just do things serially.
for (int i = i0; i < i1; ++i) {
  for (int j = j0; j < j1; ++j) {
   for (int k = k0; k < k1; ++k)
    C[i][j] += A[i][k] * B[k][j];
     }
   }
  }
}

To be fair, the algorithm was also changed a bit - can you find the core of the algorithm buried in the extraneous code?

Also, not to be shown here but it's at the end of the document File:FollyOfDo-it-yourselfMultithreading.pdf, is a matrix version of the Fibonacci series which weighs in at 225 lines; there's also a Java version here at over 100 lines. Compare this to the J version

f7=: 3 : 0
 mp=. +/ .*
 {.{: mp/ mp~^:(I.|.#:y) 2 2$0 1 1 1x
)

found on the wiki.

Parallel-Programming Projects on the J Wiki

However, it is possible to take advantage of multiple cores using J and there are a few resources on our own wiki to lend a hand with this. One that came up recently is Dave Mitchell's contribution to invoke a Windows API to force a particular core to be used.

However, I personally find the argument for doing this unconvincing, at least for the general case. The argument given for why you want to do this is that one of your cores may have cached data from earlier processing so you want to be sure to use the same core for subsequent work to avoid data movement. This may happen on occasion but it sounds like more of an exception than a common, general case. However, if you think you can use this to your advantage, it's there for your use.

There's a more broadly-based page, started by Don Guinn, to help J developers create multiple instances of J in order to take advantage of multiple cores or multiple machines. This seems like a good framework for sharing experiences getting this plausible scenario to work.

My own contribution to this is a page outlining some specific, realistic problems on which we can test parallelization attempts. I'm in agreement with Tracy Harms when he opined - over our conference link - that the key to accomplishing parallelization is to focus on a particular problem. On this page, I'm attempting to present a series of concrete examples which can be solved simply and used to benchmark different approaches to accomplishing parallelization.

To summarize what we have so far on this page:

Case 0. The "Embarassingly Parallel" example - generate data series that can be independently used to produce results that can be summarized separately. In this case, we might envisage passing data as small as short J programs to multiple compute engines so that large amounts of data can be generated and manipulated but only relatively small result sets will be passed back to the central dispatcher.

Case 1. The "Simple Contention" example - we sort a large data set, broken into pieces according to one sort column, by another column and assemble these pieces grouped by the new sort order. The input datasets can be read independently but the outputs have potential collision problems which have to be somehow resolved. This problem mandates more data traffic as well.

There is a third example in progress which is the subject of our next section.

Show-and-Tell

An algorithm recently outlined on the J Forum by Skip Cave seems to be a good candidate for the next level of parallel complexity - see the document File:ShadesOfGray.pdf for an explanation of the algorithm. Basically, the goal is to "de-speckle" an image by applying some simple rules to the pixels in a fixed-size window applied across the entire image. The complexity added by this problem is that an image might naturally be broken into non-overlapping tiles in order to work in parallel but the operation must work across the borders of these tiles. This raises difficulties both with data movement and co-ordination in multiple dimensions.

Also, like the example problems already proposed, the size of the dataset can be scaled up to be arbitrarily large. This allows us to benchmark against increasingly large problems to get measurements of a solution's scalability (its "big O" function).

Example of a Simple Parallelization

One of the longest-running tasks I regularly invoke is a simple J program to rotate all the photos in a directory by a quarter-turn to get most of them oriented properly. I use the "images" add-on to read and write the photo images.

Here we see that doing this for 219 photos takes about 620 seconds.

   info,<6!:2 'info=: getFrom1Drive ''C:\amisc\'''
+----------------------------------------+-------+---------+----------+
|"c:\amisc\pix\Photos\2010Q1\20100306-07"|1 0 219|830703363|0.51579433|
+----------------------------------------+-------+---------+----------+
   NB. Number of photos.......................^^^
   6!:2 'regflipphotos ''"''-.~>0{info'
620.17066

Looking at a newer batch of photos, we see that we have more of them (331):

   info,<6!:2 'info=: getFrom1Drive ''C:\amisc\'''
+----------------------------------------+---------------------------+-----------...
|"c:\amisc\pix\Photos\2010Q1\20100307-08"|0.99698795 0.0030120482 331|1265904085|...
+----------------------------------------+---------------------------+-----------...

Simple extrapolation shows we would expect rotating this new group to take over 900 seconds:

   620.17066*331%219
937.33556

However, by breaking them into two directories and running a separate J session against each, we can fully use the dual cores on this machine to reduce the time by about half if we imagine the two examples shown here running simultaneously.

   6!:2 'regflipphotos ''c:\amisc\pix\Photos\2010Q1\20100307-08'''
461.62112
   6!:2 'regflipphotos ''c:\amisc\pix\Photos\2010Q1\20100307-08_1'''
463.9133

As a "proof-of-concept", I manually divided the images between two directories and ran two processes simultaneously in two separate J sessions.

This screen-shot illustrates how the processing is distributed across the two cores - you can also see the bottom part of each of the two J sessions, one (blue background) running under the J session manager and the other (black background) running in jconsole-mode (using "jee.exe") under the emacs editor. The timings are visible, near the bottom, in each session.

The top two line-graphs show the CPU utilization for each of the two cores; the longer line graph under these shows page-file use (for a period twice as long). The sawtooth page-file pattern indicates when each individual image files are read from disk and "flipped".

Learning, Teaching and Promoting J

We discussed issues with the difficulty of learning languages in general, and an example of some good explanation of APL.

Finding Defects in C Code

We talked about File:AFewBillionLinesLater-CodeVerification.pdf from the Communications of the ACM (Vol. 53 No. 2, pp. 66-75) about a bug-finding tool for C from a company called Coverity. Some of the more salient points from the article were the authors' discovery that there is no such thing as a standard C implementation - the tool had to be customized to handle idiosyncrasies of the major compilers - and that even experience coders have fundamental mis-conceptions about how C (or C++) is supposed to work.

It was also illuminating to read about the code-verifiers experience with real-life code bases. One early difficulty they encountered was that use of different "make" tools by potential clients. This high-lighted a problem with figuring out exactly what code was being included in a given build. There were also difficulties with early versions of the verifier being command-line based in an increasingly GUI world.

Another unexpected finding was that improving the tool to find more errors sometimes provoked a bad reaction from their clients. They had been working against one version of the tool and plotting the number of errors found after each new build, giving them an encouraging, downward-sloping trend. However, a newer, better version of the tool would wreck their nice trend because the number of bugs found would suddenly shoot up.

A Good Example of Explanation

We looked at a reference manual I remember finding very useful when I was learning APL: Sandra Pakin's APL Reference File:ExamplesFromPakinReferenceManual.pdf. What I like about her style is that she combines the abstract explanation of each APL function with several examples of how it might commonly be used. The J dictionary examples do this to some extent but it's always helpful to have working examples.