Vocabulary/ArrayProcessing

Back to: Vocabulary

Array Processing

How do you process an array?

You can view this question in two ways:

  • How do you (currently) process an array?
  • How might you ideally process an array, if you could throw away the crutches of crippled thinking?

Data comes in arrays.

Well no, it doesn't always. But when it doesn't (as it doesn't with SQL or COM) deep down inside you know it really should.

So data processing is array processing. At least it is up there in programmers' heaven.

But just try to collect your data into one big array in preparation for a really bold approach, and you find that SQL breaks your fingers for you upfront, right from the word go: every finger bar the first, leaving you able only to pick through records one-by-one.


How do you currently do it?

Sit comfortably – this is going to be messy...

Other people's languages (OPLs) have arrays. They have functions that work on an array; call it A. Take a function f, say, which calls a function g. Some languages import the whole array A upfront into the namespace of f and then expect you to chip away at it with a for-loop.

Sometimes you're painfully aware of having two copies of A.

Sometimes the OPL is smart enough not to pass a snapshot of A to f, but only a pointer to where it starts in memory.

Some folk think, when passing arrays between functions, that's as smart as it's possible to be. But, to find your way around the universe at the end of your pointer, you need a pointer-based template which you push around by dead-reckoning.

That's like sailing in uncharted waters with a chart that needs unrolling on the surface of the sea itself.

Ask yourself this:

  • Are the numbers in the array A bytes, integers or floating point?
  • Signed or unsigned?
  • What precision?

Your template needs to know. Else your dead-reckoning will leave you dead in the water.

You cross your fingers that the guy who built the A-table used the same .h file as you did.

When you port your code to a different platform, something as simple as the definition of an integer can make everything come unstuck. In 10 years the int in that .h file can swell from 16-bit to 32-bit. It can swap its bytes, or suddenly start paying attention to its sign-bit.

If you're lucky, g has been written to work on exactly the same kind of array as f does. Think of all the possible arrays there are in the world. Still feeling lucky?

And if, at last, you manage to point your one unbroken finger at a valid number, does that number mean what you think it does? Is it inches or centimetres? (Spacecraft have blundered full-speed into Mars due to that source of confusion.)

Okay, so you've got all that stuff right. Now remember: if A has 256 entries, the last entry is not  A[256] but  A[255]. (1-in-3 bugs boil down to getting that simple fact wrong.)


How would it be nice to do it?

J is an Array-Processing Language (…an "APL").

APL also happens to be the name of a particular language, invented by Ken Iverson. He later invented J in response to criticisms of the original APL, notably its funny character-set. Here we use "APL" in the general sense.

Other people's languages (OPLs) have arrays. They have functions that work on those arrays. So that makes them APLs too, right?

Wrong.

An OPL fetches array A in one form or another (e.g. "by name" or "by value"), and dumps it down in your lap, so to speak. Then it leaves you to poke inside it using nested for-loops. Or you can lay a paper-trail of pointers and use them to mark where you've been, to go back to. That's the world as IBM created it, and it's never going to change.

Pardon me, it just has.

J is an APL, not an OPL. What makes an APL different from an OPL is that it doesn't just dump A in the lap of function f to flip through with one finger. An APL has primitives which are array-savvy. That means you can develop your functions from the primitives using single numbers, but then you discover to your delight that they automatically generalize from single numbers to arrays.

For example, consider the J primitive (+), which adds two numbers. Adding 1 to a number is perfectly simple:

   A =: 6
   A + 1
7

Now suppose you have an array A defined as follows:

   A =: 6 5 9 2 4 9 0 7 0 4

You want to add 1 to every element of A. Do you have to write a loop to step through A applying the "function": (+1)? No, you don't. Adding 1 to A is just as simple as adding 1 to 6

   A + 1
7 6 10 3 5 10 1 8 1 5

That's because (+) is array-savvy. A function f built using (+), and other such J primitives, will automatically generalize from individual numbers to whole arrays.
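
To make that concrete, here is a minimal sketch (the verb incr and the use of the primitive (]) are our own aside, not part of the original page): a one-line verb built from (+) behaves exactly the same way, with no extra work on our part

   incr =: 1 + ]        NB. a verb built from (+): add 1 to whatever arrives
   incr 6               NB. works on a single number...
7
   incr A               NB. ...and, unchanged, on the whole of A
7 6 10 3 5 10 1 8 1 5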

Function f need never know the size of A, nor even its structure. It need never get passed a copy of the whole of A, nor even a pointer to its length-field. Not even if it needs to call function g to contract-out some of the processing of A. And what goes for f goes for g too.

In fact you can write f to take delivery of just a single number, or (if A is a matrix) a row of numbers, leaving it to J to scale f up to handle A, whatever the shape and size of A, and to behave sensibly if A runs out of numbers altogether. You can even leave it to J to determine the precision of those numbers at runtime.
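
Here is a rough sketch of what that looks like (the names avg and M are ours, and (+/ % #) is the classic J spelling of the mean, not something the page has introduced yet)

   avg =: +/ % #        NB. mean: sum divided by count
   M =: 3 4 $ i. 12     NB. a 3-by-4 matrix holding the integers 0..11
   avg"1 M              NB. ask J to apply avg to each row of M (rank 1)
1.5 5.5 9.5
   avg i. 0             NB. an empty list: J still gives a sensible answer
0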

J lets you process numbers as just numbers, ignoring questions of precision (unless such questions interest you).
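
A small illustration, assuming a standard J session (datatype is a verb from J's standard library; the printed values reflect J's default display precision)

   datatype A           NB. J picked a representation for the list A by itself
integer
   A % 7                NB. divide by 7: J switches representation as needed
0.857143 0.714286 1.28571 0.285714 0.571429 1.28571 0 1 0 0.571429
   datatype A % 7
floating
   2 ^ 100x             NB. exact, extended precision is there if it interests you
1267650600228229401496703205376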


How can I get hold of J so I can do all that?

There are two ways to view this question:

  • How do you download and install J?
  • How do you get to use J as it should be used?

The first question is easy.

Just download and install J from http://jsoftware.com

It's free, needs Mbytes not Gbytes of disk space, and installation is easy.

The second question is tougher.

The trouble is this. J can easily be used like an OPL. You'll recognise all the language features it offers and will have no trouble putting them to work instantly. They'll (mostly) do what you expect, and when they don't, you'll soon spot the problem and learn the solution.

But experience has shown that people trained on OPLs have a lot of trouble using J like the APL it's meant to be. They can't relax enough to let J do the hard work for them. They want to do the things they're used to doing, to build the ship out of matchsticks, because then they feel in-control. They end up with code that's a hundred times longer than it needs to be, and a hundred times slower. This ball-park figure is literal, not metaphorical, as we shall demonstrate.

The purpose of this article is to show you how to use J like an APL right from the word "go", rather than using it for a year (enjoying the neat IDE) before cautiously trialling those tricky features you don't really understand.


Use J first as an OPL, then as an APL

Let's make an array of numbers, A, and process it using two simple functions f and g, where f calls g.

  • We'll do it the OPL way, with code which will be clear enough to an OPL-trained programmer.
  • Then we'll do it the APL way (with only a brief explanation for now).
  • Then we'll time both solutions, showing a better-than-hundredfold improvement.

Here is A, created using the primitive (?.) which generates a list of random numbers. Let's make A heavy enough by giving it 100,000 entries instead of just 10

   smoutput A=: ?. 100000 $ 10   NB. That's 100000 random #s - each 0 thru 9
6 5 9 2 4 9 0 7 0 4 6 8 3 8 1 2 8 0 0 2 1 6 0 4 4 1 3 3 6 6 5 7 8 8 2 1 9 7 9 9 1 4 3 9 7 2 4 3 7 5 3 8 3 6 ...

(?.) is a variant of (?). The advantage of (?.) for us here is that it generates the same sequence of numbers each time it's used. Good for developing demos, like this one, and for testing your apps. Not so good for Monte Carlo simulations.
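
You can see the fixed seed at work for yourself (the values below match the A shown above, and assume the same J version and default seed as used for this page)

   ?. 10 $ 10           NB. ten random digits from the fixed seed...
6 5 9 2 4 9 0 7 0 4
   ?. 10 $ 10           NB. ...and exactly the same ten again
6 5 9 2 4 9 0 7 0 4

Replace (?.) with (?) in those lines and each run would produce a fresh, different list.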

Now let's write a function: av4 to calculate the average of all the entries in A greater than 4.

In J we normally implement a mathematical function as a verb.

First write it the traditional way, using the looping constructs in J which are familiar to any programmer trained on an OPL. Then run it with argument A

   require 'stats/base/univariate'   NB. needed for the verb: mean

   av4 =: verb define
z=. i.0                 NB. initialize buffer: z to the empty list
for_i. i.$y do.         NB. loop thru the entries of y
  j=. i { y             NB. j is the i'th element of y
  if. j>4 do.           NB. if j>4 then
    z=. z,j             NB. append j to z
  end.
end.
mean z                  NB. average the buffer: z
)

   smoutput av4 A       NB. run verb av4 on array A
7.00823

Now rewrite av4 using function composition, which is J's special strength.

We'll give the verb a different name: av4j so that it can coexist with av4 for comparison purposes.

Let's define two verbs f and g

   f =: >&4             NB. returns 1 if y>4
   g =: mean            NB. g is the function: mean

We're not forced to do this. We could use (>&4) directly instead of f, and mean instead of g. But naming them shows the process more clearly.
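
A quick sanity check (our own aside, using a few hand-picked numbers) shows what the two verbs produce

   f 3 4 5 6            NB. 1 flags the entries greater than 4
0 0 1 1
   g 5 6 7              NB. g is just mean
6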

Now let's define av4j as a simple-looking composition of three functions: f, g and a primitive: (#). Then run it with argument A as we did with av4

   av4j =: [: g f # ]   NB. combine the verbs f, g and # to act in 1-shot

   smoutput av4j A      NB. run verb av4j on array A
7.00823

It gives the same answer as av4, which is reassuring.

But in practice you'd want to conduct more exhaustive tests to convince yourself the two verbs are functionally equivalent.

Now time the two expressions. For this we need a test verb timex

   timex =: 6!:2

Don't worry about how timex works. All we need to know is that it estimates the time taken to run a given string of J code, returning the answer in (fractional) seconds.

Let's run timex for av4 and av4j in turn, acting on the array A

   smoutput timex 'av4 A'
0.340405
   smoutput timex 'av4j A'
0.00117706

   0.340405 % 0.00117706   NB. divide one result by the other
289.199

So we find that av4j ran 289.199 times faster than av4 on this occasion.

The figure will vary slightly as you repeat the experiment.

That's a big gain in speed, but it's not atypical when a verb written in the conventional OPL way is rewritten in the APL way.


Download a script of the entire demo: File:Av4.ijs


Now tell me how it works

The verb av4j uses a construct called a Fork: first f, # and ] combine into the fork (f # ]), which keeps just the entries of A greater than 4, and then the cap ([:) combines that result with g.
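
You can check that reading of av4j directly (our own aside, reusing the definitions and the array A from the demo above)

   g (f A) # A          NB. the fork spelled out by hand: same answer as av4j A
7.00823
   mean (A > 4) # A     NB. or with no named verbs at all
7.00823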

The page Fork explains the construct in detail and will introduce you to further primitives, which you can look up in the NuVoc portal.

But first take a look at the ancillary page Vocabulary/Loopless, which goes deeper into the topic of loopless programming. We've only scratched the surface.

Each J primitive has a page to itself. Special topics are explored in the ancillary pages listed at the end of NuVoc.

Welcome to the world of J!