
tab-delimited table, scientific visualization, diffusion-limited aggregation, 3D interactive scatterplot, finite-state machine, finite-state automata, Raspberry Pi, finding J code

Location: Heartland

Meeting Agenda for NYCJUG 20130611

0. Beginner's regatta: see "numericMatToTabDelimitedCharacterVector.doc" and
"Learn Visualization, Young Coder.doc".

1. Show-and-tell: diffusion-limited aggregation: see "Recap of Work on DLA.doc".

2. Beginner's regatta re-visited: see "Primitive 3D Interactive Scatterplot
Using d3.doc".

3. Show-and-tell re-visited: see "Parsing CSV Files with a Finite State

4. Learning, teaching and promoting J, et al.: see "Newbie Roadblocks.doc"


We looked at a simple example of turning a table of numbers into a tab-delimited character array suitable for import into something like Excel, and considered the pros and cons of scientists as programmers. After an introduction to an approach to coding diffusion-limited aggregation, we looked at how we might extend this into higher than two dimensions and considered the question of how to look at a three-dimensional set of points interactively in a browser. Then we had a tutorial on using J's finite state automata routines, then finished up with a discussion of what roadblocks J newbies face.

Beginner's Regatta, part 0

Tab-Delimiting Numeric Table

In testing the robustness of the standard "standard deviation" calculation in J, we thought we'd compare our results in J to those in Excel.

   stddev                     NB. Look at definition.
   (<0 1 2)+&.>1e7;1e8;1e9    NB. Special cases with known results: 1.
|10000000 10000001 10000002|100000000 100000001 100000002|1000000000 1000000001 1000000002|

   stddev&>(<0 1 2)+&.>1e7;1e8;1e9
1 1 1
   (0 1 2)+/1e7 1e8 1e9
10000000 100000000 1000000000
10000001 100000001 1000000001
10000002 100000002 1000000002
   |:(0 1 2)+/1e7 1e8 1e9     NB. More intuitive display
 10000000   10000001   10000002
 100000000  100000001  100000002
 1000000000 1000000001 1000000002

Now we want to put these results into a spreadsheet for display and comparison. Start by tab-delimiting the character versions of these numbers.

   TAB,~&.>":&.>|:(0 1 2)+/1e7 1e8 1e9
|10000000   |10000001   |10000002   |
|100000000  |100000001  |100000002  |
|1000000000 |1000000001 |1000000002 |

   mm=. TAB,~&.>":&.>|:(0 1 2)+/1e7 1e8 1e9  NB. Convenience assignment

Drop last tab, then append linefeeds to each line.

   (}:&.>_1{"1 mm),&.>LF
|10000002 |100000002 |1000000002 |

  ;((}:&.>_1{"1 mm),&.>LF) _1}&.|:mm        NB. How does whole table look?
10000000        10000001        10000002
100000000       100000001       100000002
1000000000      1000000001      1000000002

Now bring these pieces together into one piece of code, with some help.

  13 : ';((}:&.>_1{"1 y),&.>x) _1}&.|:y'    NB. Tacit crutch
[: ; ] _1}&.|:~ [ ,&.>~ [: }:&.> _1 {"1 ]

Using this code to produce the table we want:

   LF ([: ; ] _1}&.|:~ [ ,&.>~ [: }:&.> _1 {"1 ]) TAB,~&.>":&.>|:(0 1 2) +/ 10^7+i.10
10000000        10000001        10000002
100000000       100000001       100000002
1000000000      1000000001      1000000002
10000000000     10000000001     10000000002
100000000000    100000000001    100000000002
1000000000000   1000000000001   1000000000002
10000000000000  10000000000001  10000000000002
100000000000000 100000000000001 100000000000002
1000000000000000        1000000000000001        1000000000000002
10000000000000000       10000000000000001       10000000000000002
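
For readers who want to check the logic outside J, the same steps (format each number, separate the values in a row with tabs, end each row with a linefeed) can be sketched in Python. This is only an illustration of the idea; the name `to_tab_delimited` is ours and does not come from the original notes.

```python
def to_tab_delimited(rows):
    """Join each row's values with tabs and end every row with a linefeed."""
    return "".join("\t".join(str(v) for v in row) + "\n" for row in rows)

# Rows are 10^7 .. 10^9; columns add the offsets 0 1 2, as in the J example.
table = [[10**p + k for k in range(3)] for p in range(7, 10)]
text = to_tab_delimited(table)
print(text, end="")
```

Joining with tabs has the same net effect as the J phrase, which appends a tab to every cell and then swaps the last tab of each row for a linefeed.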

Now, test "stddev" on extreme values:

   stddev"1 |:(0 1 2)+/ 10^7+i.10
1 1 1 1 1 1 1 1 1 1.41421

We get the right answer for all but the last case. Compare this to results from Excel:

SD      Should be "1" for each row
1       10000000        10000001        10000002
1       100000000       100000001       100000002
1       1000000000      1000000001      1000000002
1       10000000000     10000000001     10000000002
1       1E+11   1E+11   1E+11
1       1E+12   1E+12   1E+12
1       1E+13   1E+13   1E+13
1       1E+14   1E+14   1E+14
0       1E+15   1E+15   1E+15
0       1E+16   1E+16   1E+16

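So, Excel falls down for the last two cases.

We get incorrect results at the higher values because of the limited precision of floating-point numbers: a 64-bit float carries a 53-bit significand, so not every integer above 2^53 (about 9.007e15) can be represented exactly. However, we can do better in J by using extended precision: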

   stddev"1 |:(0 1 2)+/x: 10^7+i.10
1 1 1 1 1 1 1 1 1 1
   stddev"1 |:(0 1 2)+/x: 10^7+i.20
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
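
The same effect can be reproduced outside J. The following Python sketch assumes the usual two-pass definition of the sample standard deviation (deviations from the mean, divisor n-1); we have not confirmed that this matches J's "stddev" line for line. The names `stddev`, `as_floats` and `as_exact` are ours. Exact rationals from Python's `fractions` module stand in for J's extended precision.

```python
import math
from fractions import Fraction

def stddev(xs):
    """Two-pass sample standard deviation: deviations from the mean, n-1 divisor."""
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))

as_floats = [1e16 + k for k in range(3)]             # 1e16+1 is not representable
as_exact = [Fraction(10**16 + k) for k in range(3)]  # exact rational arithmetic

print(as_floats[0] == as_floats[1])  # True: 1e16 and 1e16+1 collapse to one float
print(stddev(as_floats))             # no longer 1
print(stddev(as_exact))              # exactly 1
```

With floats the three inputs are effectively 1e16, 1e16 and 1e16+2, so the answer is wrong before any statistics are computed; the exact version never loses the middle value.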

The Importance of Visualization

We discussed an essay about the importance to scientists of learning to code and why they should concentrate more on coding that helps them visualize data.

From the introduction:

My First Recommendation to New Scientific Coders: Learn Visualization

Scientists are learning programming at an unprecedented rate. I’ve expressed concern over the fast-paced growth of computing across the sciences and what this could mean for reproducibility and incorrect findings in the sciences. Perhaps the best example that illustrates the severity of this issue is Coombes and Baggerly’s Duke Saga.

I think a lot about how scientists learn programming and how we can change this process to yield a better outcome (fewer errors, more readable and reproducible code). Scientific coders must learn to program in a particular fashion that “stacks the deck” to make errors apparent. On this front, unit tests, following coding standards, and peer code review get a lot of deserved attention. Yet for some reason, visualization does not. This is unfortunate; visualization should be learned to a high degree of competency very early on in a programmer’s career.

An early section of this essay is relevant to J because the author emphasizes that problems look different when you can visualize quickly. This puts an onus on those of us developing tools for J to improve its graphical capabilities.


We re-visited an old essay showing how some J code was reworked to make it more J-like. This covers - in some depth - improvement of some code for DLA (diffusion-limited aggregation) and shows how easy this modification becomes once we work with an appropriate array structure. We make extensive use of visualization as we develop our strategy for efficient generation of these sets.

Also, once we've settled on a good data structure, it becomes trivially easy to extend the code from two dimensions to multiple dimensions. However, it raises a problem with the visualization we used originally as it is limited to two dimensions. We show some code to help with the three-dimensional visualization but have not yet come up with a good solution for extending this visualization generally to higher dimensions.

Recap of Work on Diffusion-limited Aggregation

The following are excerpts from our exploration of how the "release policy" affects the resulting shape of a diffusion-limited aggregation (DLA).

Basic Code

We start with this basic code, all of which can be found here. First, load some sub-routines, define our own namespace dla, and create an initialization routine which defines a few globals:

   load 'coutil'
   cocurrent 'dla'
   load '~Code/math.ijs mystats viewmat'

init=: 3 : 0
   D=: y            NB. How many dimensions?
   RNI=: D dimnbr 1 NB. Relative neighbor indexes: 1-away in D-space
   PTS=: ,:y$0      NB. List of points: start with one at origin.
NB.EG  init 2       NB. Initialize 2-dimensional space.
)

Here's the main routine, which aggregates points to the global PTS:

NB.* aggStruc: randomly walk a point y times to aggregate it to cluster:
NB. (>0{x) away from random point in >1{x.
aggStruc=: 4 : 0"(_ 0)
   point=. (>0{x) release >1{x [ ctr=. _1
   while. y>ctr=. >:ctr do. point=. walk point
       if. PTS check_collision point do. y=. ctr [ addPt point end. end.
NB.EG (5;PTS) aggStruc 1e2$1e2
)

The basic sub-routines:

NB.* check_collision: 1 iff x in neighborhood of points y.
check_collision=: 4 : '1 e. x e. RNI+"1 y'

NB.* dimnbr: x-dimensional y-nearest neighbor offsets.
dimnbr=: 13 : '(x$0)-.~,/>{x$<y-~i.>:+:y'
NB. RNI=: (D$0)-.~(_1 0 1){~(D$3)#:i.3^D     NB. Relative neighbor offsets

NB.* addPt: add point to cluster.
addPt=: 3 : 'PTS=: PTS,y'

NB.* walk: walk one random step.
walk=: 3 : 'y+RNI{~?#RNI'

Finally, an early version of the nub of the problem: the function that "releases" a new particle.

NB.* releaseIfOpen: release new particle x away to find open neighborhood.
releaseIfOpen=: 4 : 0
   while. 1 e. PTS e. RNI+"1 newpt=. ((]{~[:?#)y)+x*_1 1{~?D$2 do. end.
   newpt   NB. Return the neighborless point (see the explanation below).
)

Make this our release policy (for now):

release=: releaseIfOpen

The main routine randomly moves the newly-released point until it either encounters the existing cluster and sticks, or finishes a pre-set number of moves without encountering the cluster, at which point it disappears.
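
As a rough cross-check of the logic just described, here is a Python sketch of the walk-and-stick loop. It is not a translation of the J code: the cluster is held in a Python set rather than a matrix, and the names `neighbor_offsets`, `touches` and `aggregate` are ours. The simple starting-point rule in the usage lines (start a few cells beyond the cluster's extent) is also our assumption and merely stands in for the release policy.

```python
import itertools
import random

def neighbor_offsets(d):
    """All offsets in {-1,0,1}^d except the origin (the role RNI plays in J)."""
    return [o for o in itertools.product((-1, 0, 1), repeat=d) if any(o)]

def touches(cluster, point, offsets):
    """True if any neighbor of `point` is already in the cluster."""
    return any(tuple(p + o for p, o in zip(point, off)) in cluster
               for off in offsets)

def aggregate(cluster, start, max_steps, offsets, rng):
    """Walk from `start`; stick on contact with the cluster or give up.

    Returns the number of steps taken, as the J routine does.
    """
    point = start
    for step in range(1, max_steps + 1):
        off = rng.choice(offsets)
        point = tuple(p + o for p, o in zip(point, off))
        if touches(cluster, point, offsets):
            cluster.add(point)
            return step
    return max_steps

rng = random.Random(0)
offsets = neighbor_offsets(2)
cluster = {(0, 0)}
steps = []
for _ in range(20):
    # Assumed stand-in for the release policy: start just beyond the cluster.
    reach = max(abs(c) for p in cluster for c in p)
    steps.append(aggregate(cluster, (reach + 3, reach + 3), 50, offsets, rng))
print(steps, len(cluster))
```

A walk that returns the maximum either never met the cluster or met it only on its last step, which is the same ambiguity noted for the J results.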

We'll explain in more detail how the release policy works shortly but first let's see how to run the code. If we run the above lines of J, then move back into the base namespace, and make the dla namespace transparently available here, we can initialize our cluster.

   coclass 'base'
   coinsert 'dla'
   init 2
0 0
1 2

So, we start with one point at the origin; its shape is 1x2 because it's one point in two dimensions: we plan to concatenate new points to the bottom of this matrix (as can be seen by the definition of addPt above).

The left argument to the main routine is a boxed pair: a parameter of 3 (a "perimeter" distance explained below) and the existing list of points. The right argument gives how long each random walk should be before we give up. We'll arbitrarily try 20 walks of length 10 each:

   (3;PTS) aggStruc 20$10
4 10 10 10 10 10 10 1 10 10 10 10 3 10 10 7 10 9 10 10

The result is how many steps were in each walk. We see that most of them took the maximum of 10 - which means they didn't encounter the cluster (or encountered it only on the last step) - but a few ended early, hence became part of the cluster.

So, how many points do we now have and what are they?

7 2
0 _1 _2 _2  0 _1 _2
0 _1 _2  0 _2 _3  1

We have 7 points which we show transposed simply because it's a more efficient use of display space.

Let's try the same thing again:

   (3;PTS) aggStruc 20$10
10 10 1 10 6 10 10 10 3 10 10 10 2 3 10 1 10 4 4 10
|16 2|0 _1 _2 _2  0 _1 _2 _3  1 _4 _1 0 _2 1  2 _3|
|    |0 _1 _2  0 _2 _3  1  2 _2  1  2 3 _4 1 _1 _3|

This time we see that more points stuck. Again, we display the shape and the values as a vector of boxed items just to make more efficient use of our display.

How about a picture of these points? To do this, re-enter the namespace and define a new utility function bordFill, then return to the base namespace. This new function takes a left argument of how many empty cells we want to border our points and a right argument of our point co-ordinates.

   coclass 'dla'

   NB.* bordFill: fill 0-mat w/1 according to y, having border of x cells.
   bordFill=: 4 : '(1)(<"1 y-"1 x-~<./y)}0$~2$(>:+:x)+(>./y)-<./y'
   NB.EG viewmat 1 bordFill PTS
   coclass 'base'

We use this in conjunction with the general "viewmat" utility to display the points:

   viewmat 1 bordFill PTS
Which looks like this: [image]

Explaining First Release Policy

Our first release function is very crude and its limitations will become apparent as the cluster grows but it's fine for a start. Here's most of the code:

while. 1 e. PTS e. RNI+"1 newpt=. ((]{~[:?#)y)+x*_1 1{~?D$2 do. end.

This is a while loop where all the work gets done in the conditional part. Here, we check if any of the immediate neighbors of the new, randomly chosen point newpt, are in the existing set of points. If so, we do nothing and try again until this is false. Once we exit the loop, we return newpt: a neighborless point near the cluster.

Let's examine this line in detail. The first part after while. is

   1 e. PTS e. RNI+"1 newpt

which evaluates true if one or more points are in the set formed by RNI+"1 newpt. RNI is one of the three globals from our initialization: it looks like this (transposed for more efficient display):

_1 _1 _1  0 0  1 1 1
_1  0  1 _1 1 _1 0 1

Arranging these eight points around an origin,

   3 3$<"1]1 1 1 1 0 1 1 1 1#^:_1]RNI
|_1 _1|_1 0|_1 1|
|0 _1 |0 0 |0 1 |
|1 _1 |1 0 |1 1 |

we can see that they are the two-dimensional offsets around a given point. So, if we add each of these (one-dimensional) elements of RNI to a single point like newpt, we get the co-ordinates of all of the neighbors of newpt, e.g. for (5,5) shown here arranged two-dimensionally with the point itself inserted into the middle:

   (13 : '(<y)(<1 1)}3 3$<"1]1 1 1 1 0 1 1 1 1#^:_1]RNI+"1 y') 5 5
|4 4|4 5|4 6|
|5 4|5 5|5 6|
|6 4|6 5|6 6|

The first part of the assignment of newpt:  newpt=. ((]{~[:?#)y)+x*_1 1{~?D$2 uses a tacit function  (]{~[:?#) which generates a random number ( ? ) based on the length ( # ) of its right argument and uses this to select ( {~ ) an element (row). This is applied to the right argument y which is the list of points in the cluster, so we randomly choose one of our existing points.

This random point is added to the latter part of this expression -  x*_1 1{~?D$2 in which we apply an offset determined by the left argument x multiplied by a random combination of ones and negative ones set by  _1 1{~?D$2 . This multiplication essentially moves x points in a random direction from the original point. The D is in there to support generalization to more than two dimensions.

So, to return to our long-awaited explanation for the magic number 3 in our invocation of the main function ( (3;PTS) aggStruc 20$10 ): this is the distance we move along each axis, in a randomly chosen direction, from a randomly chosen cluster point to find an empty neighborhood in which to release our new point. If we pick a larger number for this argument, we'll pick release points that are, on average, further away from the cluster. This means the random walk will have to go on longer to have a good chance of hitting the cluster.
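
The release step can be sketched in Python as follows. This is our reading of the J line, not a verified port: `release_if_open` and its parameter names are ours, `points` plays the role of the right argument y (the points we release near) and `occupied` plays the role of the global PTS.

```python
import itertools
import random

def release_if_open(distance, points, occupied, offsets, rng):
    """Pick a point at random, jump `distance` along each axis with a random
    sign, and retry until no neighbor of the candidate is occupied."""
    while True:
        base = rng.choice(points)
        candidate = tuple(c + distance * rng.choice((-1, 1)) for c in base)
        if not any(tuple(c + o for c, o in zip(candidate, off)) in occupied
                   for off in offsets):
            return candidate

rng = random.Random(1)
offsets = [o for o in itertools.product((-1, 0, 1), repeat=2) if any(o)]
cluster = [(0, 0), (-1, -1), (-2, -2)]   # first three points of the example cluster
start = release_if_open(3, cluster, set(cluster), offsets, rng)
print(start)
```

Because every axis moves by exactly `distance`, the candidate always lies on a diagonal from the chosen point; only the signs are random.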

So, if we choose a much bigger number than 3, we might see something like this:

   (20;PTS) aggStruc 20$10
10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
16 2

We added no new points because a random walk of 10 steps is not long enough to travel (more or less) 20 points to where the cluster is. If we are willing to walk a lot more, say on the order of 20^2 steps, it's a different story:

   (20;PTS) aggStruc 20$400
400 288 400 400 400 400 309 400 400 400 400 400 400 247 400 400 400 400 400 400
19 2

However, as you might guess, it takes more time to walk further, i.e. starting with the original 16 points and timing 20 walks of 10 versus 20 walks of 400:

   6!:2 '(3;PTS) aggStruc 20$10'
30 2
   $PTS_dla_=: 16{.PTS_dla_
16 2
   4!:55 <'PTS'   NB. Ensure local copy doesn't shadow one in "dla" namespace.
   6!:2 '(20;PTS) aggStruc 20$400'
18 2

So, we see that the longer walks take many times as long and add fewer points. In fact, this set of parameters will often add no points.

Concentrating on the Perimeter

Fortunately, it's easy to see how to address the problem of the growing number of points in the cluster reducing the efficiency of our process: instead of giving all the points as an argument to aggStruc, why don't we give only the perimeter points? This will reduce the number of false hits. How do we know which points are on the perimeter? One simple way is to look at only the most recently added points since we are, by the nature of our aggregation, adding to the perimeter.

Guessing that the most recent 10% of points is a reasonable bunch to try, let's extract those:

   $PERIMPTS_dla_=: PTS{.~->.10%~#PTS
78 2
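
For comparison, the "most recent 10%" selection amounts to taking the tail of the point list, rounding the count up. A minimal Python sketch (the name `recent_points` and the `denom` parameter are ours):

```python
def recent_points(points, denom=10):
    """Return the last ceil(n / denom) points, i.e. the most recently added."""
    k = -(-len(points) // denom)      # integer ceiling division
    return points[len(points) - k:]

print(len(recent_points(list(range(780)))))
print(recent_points(list(range(25))))
```

Integer ceiling division is used deliberately: multiplying by 0.1 and rounding up can overshoot by one because of floating-point error.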

Looking at how these compare to the whole set:

    viewmat 2 (<"1]1+PERIMPTS-"1 <./PTS)}1 bordFill PTS

The manipulation we do to PERIMPTS is necessary to map the origin-centered co-ordinates to positive-only co-ordinates. This is a replication of what bordFill does internally as well.

This looks good: we can see that the perimeter - indicated by the pink points around the edge - appears to be what we expect. [image]

Let's try using this as our argument to aggStruc to see how it affects our timings.

1005.5137 995.09202 564.62669
|attention interrupt: releaseIfOpen
|   point=.(>0{x)    release>1{x[ctr=._1
831 2

The points-per-second rate initially goes back up - to over 1,000 per second - but quickly declines, then the code freezes up and has to be interrupted. A look at the points and their perimeter makes the problem clear: [image]

We see that we quickly exceed our original perimeter. Let's see how costly it is to re-calculate the perimeter with each call to the main routine:

   do1_dla_=: 3 : '(sv-~#PTS)%tm=. 6!:2 ''(2;PTS{.~->.10%~#PTS) aggStruc y'' [ sv=. #PTS'
   NB.  Inserted the perimeter calculation   ^^^^^ here ^^^^^^
1009.626 1297.3126 1809.8097 1466.0253 1293.6868

Increasing the number of attempts per call:

1578.9208 857.49015 1561.9932 1310.9005

This seems to be working fine though there does seem to be a gradual decrease in efficiency.

1358 2
892.7828 859.53069 827.92105 809.63688 852.42751

This looks pretty good but there is an aesthetic problem: this DLA looks a bit "chunky" compared to other examples we've seen. [image]

Our policy of releasing very near the existing cluster has a tendency to fill in the gaps between branches of the cluster. Also, since we pick release points randomly in relation to the perimeter, sometimes we release within the existing cluster as indicated by the interior pink points on the lower part of this picture. What we'd really like to do is to pick release points only outside the perimeter. One way to do this is shown in the following section.

A simple measure of how densely packed the above-pictured DLA is this:

   (13 : '(#y)%~*/$0 bordFill y') PTS

This shows us that 1 in every 2.5 points in the bounding rectangle is part of the cluster. If we add more points, getting up to almost 62,000 using these parameters, this sparsity ratio goes up to 4.07.
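
The measure is the number of cells in the bounding box divided by the number of cluster points. A Python sketch of that arithmetic follows; the name `sparsity` is ours. As input it uses the 16-point cluster listed earlier, not the larger cluster to which the 2.5 figure refers.

```python
def sparsity(points):
    """Cells in the bounding box divided by the number of cluster points."""
    cells = 1
    for axis in range(len(points[0])):
        coords = [p[axis] for p in points]
        cells *= max(coords) - min(coords) + 1
    return cells / len(points)

xs = [0, -1, -2, -2, 0, -1, -2, -3, 1, -4, -1, 0, -2, 1, 2, -3]
ys = [0, -1, -2, 0, -2, -3, 1, 2, -2, 1, 2, 3, -4, 1, -1, -3]
pts = list(zip(xs, ys))
print(sparsity(pts))  # 3.5: a 7-by-8 bounding box holding 16 points
```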

New Release Policy

(The following method is also elaborated here in the context of an earlier version of the DLA code which works on a fixed-size, Boolean matrix rather than an explicit set of points.)

First, we define a verb to generate all the immediate neighbors of a point set not in that set.

neigh2=: 13 : 'x-.~~.,/y+"1/RNI'   NB. Empty neighbors of these

We use this to get the nearest neighbors:

   $NP=: PTS neigh2 PTS
26386 2

Then we expand the neighborhood by finding the next closest set of neighbors to these, then their neighbors, and so on five times:

   6!:2 'NP=: PTS neigh2^:5]NP'

This gives us an envelope six deep around our original points. We find the edge of this envelope by isolating those outermost neighbors (those with neighbors not in the set of neighbors):

   $edges=. ~.PTS-.~NP -.~ ,/NP+"1/RNI
1029 2
   viewmat 2 (<"1]1+edges-"1 <./PTS,edges)}1 bordFill PTS,edges

Here's what the edge of the envelope looks like in relation to our existing cluster: [image]
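
The envelope construction is easy to follow with sets. The Python sketch below mirrors our reading of neigh2 and of the "edges" expression; `empty_neighbors` is our name, and a one-point cluster is used so the counts can be checked by hand.

```python
import itertools

def empty_neighbors(exclude, points, offsets):
    """Neighbors of `points` that are not in `exclude` (the role of neigh2)."""
    found = {tuple(c + o for c, o in zip(p, off))
             for p in points for off in offsets}
    return found - set(exclude)

offsets = [o for o in itertools.product((-1, 0, 1), repeat=2) if any(o)]
cluster = {(0, 0)}

envelope = empty_neighbors(cluster, cluster, offsets)    # nearest empty neighbors
for _ in range(2):                                       # grow the envelope outward
    envelope = empty_neighbors(cluster, envelope, offsets)

# Outermost shell: neighbors of the envelope in neither it nor the cluster.
edges = empty_neighbors(cluster, envelope, offsets) - envelope
print(len(envelope), len(edges))  # 48 32
```

Around a single point, three rounds give the 48 cells within distance 3 and the edge is the 32-cell ring at distance 4, which matches the picture of an envelope with a thin outer shell.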

This looks like a good perimeter set on which to release points but it's a bit thin. Let's thicken it by finding its neighbors two levels deep:

   $edges=. ~.PTS neigh2^:2]edges
5138 2

Giving us this fatter release area: [image]

Adjust our top-level calling routine to use this perimeter points global:

   do1_dla_=: 4 : '(sv-~#PTS)%tm=. 6!:2 ''(x;PPTS) aggStruc y'' [ sv=. #PTS'
   PPTS_dla_=: edges
   1 do1&>10$<10$25
12.281323 21.910126 23.042773 79.635403 23.813149 22.408149 34.180425 48.340221 21.374501 50.795483

Our initial points/second are a bit low but seem to be improving. Let's try some more:

12590 2
   viewmat 2 (<"1]1+PPTS-"1 <./PTS,PPTS)}1 bordFill PTS,PPTS
   1 do1&>20$<20$25
24.342477 61.743025 41.15663 27.630435 34.954674 34.740216 60.580126 52.822625 57.748258 22.060313 22.321549 52.2986 21.649241 40.913895 67.019374 34.597901 40.404442 52.186518 39.882113 42.753015

One problem we'll have if we try to generate too many points using one perimeter is that the cluster will eventually "invade" the perimeter. If we don't do something to adjust it, this will lead to unsightly clumps of aggregation within the perimeter band.

After running this enough times to add a significant number of points, we look at the cluster with its perimeter:

This is interesting because we can plainly see the effect of the new release policy on the appearance of the cluster. The inner, denser portion was formed using our flawed "releaseIfOpen" method. The outer, sparser part of the cluster was formed using our new perimeter-based release policy. [image]

If we generate a cluster solely based on this newer release policy, we get something like the following with a sparsity measure of 6.7, which makes this less dense than the clusters generated by our earlier policies.

Here’s an example of a DLA generated with this modified code. [image]

Multi-Dimensional Change

To extend this code to multiple dimensions, it turns out we need to make only one change to our working code. Instead of this initialization:

init=: 3 : 0
   D=: y            NB. D-dimensions
   RNI=: D dimnbr 1 NB. Relative neighbor indexes: 1 cell around center.
   PTS=: ,:y$0      NB. Start point list w/Origin.
NB.EG  init 2       NB. Initialize 2-dimensional space.
)

We embed the definition of our "empty neighbor finder" neigh2 in our initialization routine as its definition depends on the relative neighbor indexes which, in turn, depend on the number of dimensions in which we are working:

NB. Re-define initialization routine to allow multi-D:
init_dla_=: 3 : 0
   D=: y                 NB. D-dimensions
   RNI=: D dimnbr 1      NB. Relative neighbor indexes: 1 cell around center.
   neigh2=: 13 : 'x-.~~.,/y+"1/RNI'   NB. Empty neighbors of these
   PTS=: ,:y$0           NB. Start point list w/Origin.
NB.EG  init 3            NB. Initialize 3-dimensional space.
)

Here’s an example of running this code.

   load 'DiffLimAgg.ijs'
   load 'progDLAParmsTemplate.ijs'
   init_dla_ 3           NB. Initialize in three dimensions
0 0 0
   do1_dla_              NB. Grow the cluster and show stats
3 : 'tm,(#PTS),(sv-~#PTS),(tm%~sv-~#PTS),(usus nn),(#nn)%~nn+/ . =>./nn [ tm=. 6!:2 ''nn=. growMany y''[sv=. #PTS'
   6!:2 'smoutput do1_dla_ 125;100;5 3'
0.149511 22 21 140.458 3 125 110.06 33.8833 0.79
   NB. Grew to 22 points in about 1/5 second...
# % [: */ >./ - <./
   %ptsDensity PTS_dla_  NB. About 1 filled cell per 13 empty ones
   6!:2 'smoutput do1_dla_ 125;1000;5 3'
0.284263 780 758 2666.54 1 125 54.096 45.3134 0.242
   (%ptsDensity PTS_dla_),#PTS_dla_
35.5808 780

These results look plausible but how can we display them? This is what we do in the next section.

Beginner's Regatta re-visited

Primitive 3D Interactive Scatterplot Using d3.js

First, we generate three-dimensional points in J using Diffusion-limited Aggregation.

   load '~Code/DiffLimAgg.ijs'
   init_dla_ 3
0 0 0
   do1_dla_ 20;100;5 3
0.0334947 5 4 119.422 4 20 19.66 1.89214 0.96
5 3

After generating more points for this cluster, we end up with 5,255 of them. Looking into the JavaScript module “ScatterPlot3D.js” gives us an idea of the format into which we need to put these points. We got this code from this "Data Explorer" site:

function ScatterPlot3D(userConfig)
{, userConfig,
  { // The parent container of this chart.
    'parent'           : null,
    // Set these when you need to CSS style components independently.
    'id'               : 'ScatterPlot3D',
    'class'            : 'ScatterPlot3D',
    // Our data...
    'csv'              :
    { // Give folks without data something to look at anyhow.
        'header'         : [ "X", "Y", "Z" ],
        'data'           : [[0,0,0],[1,1,1],[2,4,8],[3,9,27]]
    'width'            : 400,
    'height'           : 400,
    'xoffset'          : 20,
    'yoffset'          : 0
  this.chart = this;

So, here’s how we go about this.

   3{.j2n&.> pts=. PTS_dla_   NB. “j2n” converts J numbers to common character
+-+--+-+                      NB. representation.
|0|0 |0|

Since we’re only looking at the first three points for now anyway, concentrate only on these.

   ]pp=. 3{.pts
0  0 0
0 _1 1
0 _2 2
   ;}.&.>;&.><"1 (<'],['),.~',',&.>j2n&.>pp

   }:_1|.;}.&.>;&.><"1 (<'],['),.~',',&.>j2n&.>pp

Now that we have a piece of code to properly bracket the numbers in our J table, name it and use it to write the bracketed points to a file.

   cvt2d3Array=: 3 : '}:_1|.;}.&.>;&.><"1 (<''],[''),.~'','',&.>j2n&.>y'

   (cvt2d3Array pts) fwrite 'egDLA3DPts2.txt'
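
As we read it, the J phrase produces bracketed, comma-separated rows suitable for pasting into the 'data' array shown above. For comparison, the same text can be produced by the following Python sketch; the name `to_js_rows` is ours.

```python
def to_js_rows(points):
    """Render points as comma-separated bracketed rows, e.g. [0,0,0],[0,-1,1]."""
    return ",".join("[" + ",".join(str(v) for v in p) + "]" for p in points)

pp = [(0, 0, 0), (0, -1, 1), (0, -2, 2)]   # the first three points shown above
print(to_js_rows(pp))  # [0,0,0],[0,-1,1],[0,-2,2]
```

Note that negative numbers must appear with a leading "-" rather than J's "_", which is what j2n takes care of on the J side.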

Right now, I just cut-and-paste the contents of this file into an HTML shell I adapted from an example I found illustrating the "Data Explorer" Dex (which uses d3.js).

Here’s a sample screenshot of what we end up with:


We can rotate these points in real-time in any modern browser. I'd like to make the whole experience more seamless, perhaps by using JHS to generate the code directly from a J table or by incorporating JavaScript to read in data that J outputs.

There's still a lot of work to do in this world of multi-dimensional data rendering as there appears to be little understanding of good ways of dealing with data in more than two dimensions. For instance, much of my work of adapting the "Dex" HTML was removing an ugly and useless alternate rendering of these 3D points in what’s called a “parallel coordinate view”, an example of which is shown here on the left.


The chart on the left renders the points on the right by drawing sets of lines linking (x,y,z) sets. I don't see the point of this muddy and confusing representation; I've seen it elsewhere as well, but in only one case did it seem to illuminate anything.

Show-and-tell re-visited

Parsing CSV Files with a “Finite State Machine”

from:    Zachary Elliott
date:    Fri, May 24, 2013 at 11:57 AM
subject: Re: [Jprogramming] sequential machines examples

Provided below is an example of using ;: for parsing CSV files. The variable sj is defined twice (once in the way I would normally define it in a script and once in a more verbose manner for clarity)

   fa =. a. #~ # }. [: -. [: -. [: ~: a. ,~ ]

   sj =. 7 5 2 $ 0 0 1 1 2 1 3 1 4 1 0 0 1 0 2 2 1 0 1 0 0 0 1 1 1 2 3 1 4 1 0 0 3 0 3 0 5 0 3 0 0 0 4 0 4 0 4 0 6 0 0 0 0 0 2 2 3 0 0 0 0 0 0 0 2 2 0 0 4 0

The following is a much more well-documented version of the above assignment.

   sj =. _2] \"1 }.".;._2  (0 : 0)
 NB.    X    C    D    Q    S
       0 0  1 1  2 1  3 1  4 1        NB. 0 - Other
       0 0  1 0  2 2  1 0  1 0        NB. 1 - Char
       0 0  1 1  1 2  3 1  4 1        NB. 2 - Delim
       0 0  3 0  3 0  5 0  3 0        NB. 3 - Quote
       0 0  4 0  4 0  4 0  6 0        NB. 4 - SQuote
       0 0  0 0  2 2  3 0  0 0        NB. 5 - Second Quote
       0 0  0 0  2 2  0 0  4 0        NB. 6 - Second SQuote
)

   mj =. <'';(fa ',"''');',';'"';''''

   csv_parser =. (0;sj;mj)&;:

   csv_parser 'Year,"Make,Model",''foo,bar'',Hello,World'
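
To show the general shape of a table-driven machine without J's encoding, here is a small Python sketch. It is not a transcription of the sj table above (we have not re-derived its state and output codes, and it does not handle doubled quotes); the states, character classes and action names are all ours. Each table entry pairs a next state with an action, which is the same idea as the (state, output code) pairs in sj.

```python
FIELD, IN_DQ, IN_SQ = range(3)               # states
OTHER, DELIM, DQUOTE, SQUOTE = range(4)      # character classes
ADD, EMIT, SKIP = "add", "emit", "skip"      # actions

# TABLE[state][char_class] -> (next_state, action)
TABLE = [
    [(FIELD, ADD), (FIELD, EMIT), (IN_DQ, SKIP), (IN_SQ, SKIP)],   # FIELD
    [(IN_DQ, ADD), (IN_DQ, ADD),  (FIELD, SKIP), (IN_DQ, ADD)],    # IN_DQ
    [(IN_SQ, ADD), (IN_SQ, ADD),  (IN_SQ, ADD),  (FIELD, SKIP)],   # IN_SQ
]
CLASS_OF = {",": DELIM, '"': DQUOTE, "'": SQUOTE}

def split_csv_line(line):
    """Split one line into fields; quotes protect embedded delimiters."""
    state, current, fields = FIELD, [], []
    for ch in line:
        state, action = TABLE[state][CLASS_OF.get(ch, OTHER)]
        if action == ADD:
            current.append(ch)
        elif action == EMIT:
            fields.append("".join(current))
            current = []
    fields.append("".join(current))          # flush the final field
    return fields

print(split_csv_line('Year,"Make,Model",\'foo,bar\',Hello,World'))
```

This version strips the quote characters from the fields; whether the J machine keeps them depends on its output codes.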

A Suggestion

from:  Raul Miller
date:  Fri, May 24, 2013 at 1:19 PM

Why do you use an empty character class? (This is the X column in your commented definition of sj).

If you got rid of this part of sj, you would not need fa and could define

   mj=: (a.&-.;<"0)',"'''

Though, granted, this phrasing would need to be changed if you used non-default character classes with multiple members. But unless you are treating underspecified sequential machines, or a collection of them, "full generality" is probably not needed and is somewhat illusory.

Generally speaking, sequential machines are not fully general.

Thanks, -- Raul

Learning, teaching and promoting J

Newbie Roadblocks

from:    Philip Hunt
date:    Mon, May 27, 2013 at 2:35 PM
subject: Re: [Jprogramming] How read text file, sort its lines, write it back out?

Don, Murray and Bill, I'm a newbie too and I found the files lab helpful but incomplete. For instance it mentions 'm' as an option too, is there a list of these options somewhere? (I didn't go back and I could be wrong but I thought it was in files-lab - but I did certainly see it recently in the J help system)

Which brings up another point for newbie and newbie attraction to is often there but finding it is tough, topics like file read/write are critical to a good impression of J system for general computing. The general approach of most J pro's seems to be read it in as fast as you can then use the power of J to manipulate it the way you want. Which is great once you know J well enough. Having the skills to read and manipulate data into J from files of the OS is so fundamental I'd say its worth more than a lab or some references like in foreign - it needs good clear well explained examples that assume the user has the simplest J skills.

The skills to learn the hard way and that I found most useful were :-

1) fundamental reading of a file using names in strings (I literally had to find WHERE files are assumed to be on my system first, where's home, I ended up using absolute strings to ensure I found things..and this explanation needs to be explicit with paths even if it varies by OS, list 'em there's not that many!)

2) Use and need for boxing of names and values (both of file names and of data ) is something you'll need right away and something that's not a level 0 skill in J (its not that hard of course but it is above +/ i. 5 type level which I think of as level 0)

3) extraction of commas / quotes / LF /LFCR to start to remove the extraneous human readable crutch elements from data

4) making data into lists for J
   a) use of ". and ": (do and format)
   b) getting around the data being mixed types (integer, string, etc.)
   c) making data appear in the right form - what shape has it now, and how to change that shape (easily!) - that's a level 2 skill

5) same again for writing out files

I know these are all in the system for users to find (I found most of them the afternoon I attacked a ProjectEuler problem on names in a file) but frankly I'm still a novice and having a clear deeper intro (files in labs almost does it) would really have helped me.

As a newbie I found J alluring (in the extreme) but some hurdles can make that shine wear off fast, files is one, and there's sometimes an attitude of "well this is hard, so work on it a while" - which is ok if you have the time. The first time you thing J can be useful you want that faith returned not made hollow with this type of brush off...honestly I have tried to get work colleagues interested in J and they are put off with the learning curve (make that learning cliff) they see ahead. I have worked at this a while now and its just beginning to workout...I hope its worth it!

Now if I can just have my brain click over into J-thinking like it did once before into Smalltalk-thinking I'd be all set ! (Hey its coming I know it is LOL)

Phil
-- Philip Hunt
Dum spiro spero.

from:  Murray Eisenberg
date:  Mon, May 27, 2013 at 3:09 PM

Yes, to the wiki -- please! Or even to the J help system. (I seem to recall that many of these script files used to be documented in the help system pages, both as distributed with J and on-line at

On 27 May 2013 09:37:51 -0600, Don Guinn wrote:
  > Look in the "Files" lab. Years ago I fought the same battle. At that time
  > the comments were still in "files.ijs" and I stole them to create the
  > "files" lab. This was all before the J WIKI existed, so I wrote the lab in
  > a reference style. Now it would be better to move the reference material to
  > WIKI and  keep just examples in the lab. One of these days.
---
from:  Joey K Tuttle
date:  Mon, May 27, 2013 at 6:04 PM

I think having examples in the wiki is a great idea. For many years (starting c. 1990) my profile.ijs included my definitions of

    fread=:  1!:1@<
    fwrite=: 1!:2 <

A year or so ago I commented out my definitions in my profile so that when I tried something for/from the forum I would be using the definitions provided with the system. The idea of an argument to make boxes 'b' or a matrix 'm' may be useful, but it loses a certain amount of flexibility in answering the original question of how to sort a text file. That flexibility is mainly lost because of the use of <;._2 (instead of <;.2) for boxing the file. Often times, "helpful features" make more complicated solutions (things I encounter in M$ Word and Excel leap to mind...)

The original question of how to sort the lines of a text file could be written, with no cover functions at all, as -

   sortfile=: ([:;[:/:~[:<;.2[:][:1!:1<)1!:2[:<'sorted.'"_,]

Where the output file is prefixed with 'sorted.' to avoid destroying the original file. I'm sure that (13 : generated) tacit definition could be made more readable - but the point is that having "convenient cover functions" can make things obscure in a different way. Note that sortfile completely avoids the issues around "line end character(s)" and works for lines ending with CR, LF, or CR,LF.

A ground up set of examples of file fiddling in J would be a useful project. Some things need to be stated at the beginning, e.g. All files are treated as text strings (character vectors). Text strings such as '3.14159 2.71828' have to be changed to numbers for calculations. Numeric results have to be converted to character vectors to be written into files.

There are various utilities provided to manipulate special/binary text vectors (e.g. files of binary data [read as text] can be converted using 3!:n). That reminds me of my favorite FORTRAN trivia question - what FORMAT command is used to write binary representations of numbers (integers floating point etc.)? The answer is A (alphabetic) since if you use I (integer) or F (floating point), what is written in your file is character representations (EBCDIC in the day, surely ASCII these days) of the numbers...

A discussion about the idea of "lines" and line/record delimiters, and so forth. By the way, it would be useful to note that some widely used programs (e.g. Excel) write text files as lines, but without the final line end character, thus causing <;.2 to fail... Why the choice was made to create a file that isn't complete is lost to history, but it is unlikely to change.

Maybe if the wiki page was set up in a way that contributors could fill in things about their personal tricks, examples would show up. I have been enjoying rummaging in system files with J (and APL before that) for 40+ years. I'm still amazed at some of the things I find that have been generated by professional/standard programs (e.g. the HTML files named something.xls that I ranted about in chat a few weeks ago).

It has been entertaining to follow some of these "simple questions" in the forums of late - and it certainly indicates a need and opportunity for the creation of more helpful learning materials.
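
For readers coming from other languages, the sortfile idea in the message above (keep each line's own terminator, sort, write to a file whose name is prefixed with 'sorted.') can be sketched in Python as follows. The name `sort_file` is ours. Two choices are our assumptions and not part of the J phrase: an unterminated last line is given a linefeed so it cannot fuse with its neighbor after sorting, and the prefix is applied to the file name only, leaving any directory part alone.

```python
import os

def sort_file(path):
    """Sort the lines of `path`; write them to a sibling file prefixed 'sorted.'."""
    with open(path, newline="") as f:          # newline="" keeps CR, LF, CRLF as-is
        lines = f.read().splitlines(keepends=True)
    if lines and not lines[-1].endswith(("\r", "\n")):
        lines[-1] += "\n"                      # terminate an unterminated last line
    directory, name = os.path.split(path)
    out_path = os.path.join(directory, "sorted." + name)
    with open(out_path, "w", newline="") as f:
        f.writelines(sorted(lines))
    return out_path
```

One caveat: `str.splitlines` also breaks on a few less common separators (form feed, for example), so this sketch is only appropriate for ordinary text files.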


-- Devon McCormick 2013-12-13 15:53:38 -0200