NYCJUG/2023-01-10

From J Wiki
Jump to navigation Jump to search

Beginner's regatta

At the meeting, Dan told us of many other odd characters (byte combinations) that crop up regularly when you have to gather input from many varied sources. However, the lowly NBSP was enough to discuss in the beginner section.

Problem with Non-breaking Spaces

It started with a cry for help:

​From: Arthur Anger <ArtAnger@comcast.net>
Date: Wed, Dec 28, 2022 at 12:47 PM
Subject: [Jgeneral] Spelling errors
To: J-General <General@jsoftware.com>

I compose and edit my scripts using TextEdit on a Macintosh. Occasionally I copy a line or two from a J-session into a script, and find that it may cause a Spelling error when used. Those errors are always at the beginning of a line, where an invisible print-control character has crept in.

Recently, I got Spelling errors on most blank spaces when trying to load a script, and had to retype each one.

Can anyone suggest a reason and a cure?


Henry Rich had a suggestion:

​From: Henry Rich <henryhrich@gmail.com>
Date: Wed, Dec 28, 2022 at 12:52 PM
Subject: Re: [Jgeneral] Spelling errors
To: <general@jsoftware.com>

You probably have nonbreaking spaces (codepoint 0xa0) that needs to be translated to 0x20. Other possibilities are smart quotes. Why don't you write a script that translate those characters on the clipboard? Others may want to use it too.

A Solution

It turned out to be not quite as simple as Henry had suggested though his diagnosis was correct. The complication arises from the fact that the non-breaking space was in Art's file as a Unicode byte sequence. Here is how we narrowed down the problem to a few characters.

Running my J session under emacs, I loaded the file Art had sent me:

   load 'D:\amisc\J\Hk_LF.ijs'
|spelling error
|   sff=: +/"2 ff , (|.!.0 ff) ,: (|.!.0 |.!.0 ff)
|        ^
|   reinigen=:3     :0
|[-48] D:\amisc\J\Hk_LF.ijs

If we look at an image of this session, it is immediately clear which characters are suspect:

FixNBSPinFile.png

We see cyan underscores in the line flagged for the "spelling error". Since we are in an emacs session, we can select a suspicious-looking underscore and look it up in a.. This gives us a pair of values, confirming that this is a multi-byte Unicode sequence.

As we can see in the image above, simply replacing this pair of characters with a regular space character and writing the result to file allows us to load it without getting the spelling error.

Show-and-tell

Recently I have been posting past NYCJUG meetings to the J wiki. There are quite a few meetings with interesting topics, the information for which is languishing on my hard drive.

One recent set of notes from 2006 included this slight attempt to find solutions in J for the equation a^2 + b^2 + c^2 = (a - b)(b - c)(c - a).

In that section, we didn't try to solve the problem as stated which was to find integer solutions to this equation. Instead, we developed some iterative code which attempts to search for a close solution for non-integers as this can be searched incrementally in an arbitrary neighborhood.

Here is code to incrementally move from an arbitrary point to one closer to an exact solution.

NB.* findPoints.ijs: find points that are solutions for the given equality.

eqn0=: 3 : '+/*:y'                 NB. Set up equations
eqn1=: 3 : '*/2-/\y,{.y'
diffeqns=: 3 : '(eqn0 y)-eqn1 y'  NB. and differencer.

First we set up the two parts of the equation as eqn0 and eqn1, then we subtract one from the other in diffeqns so we have a function that approaches zero as we get closer to a solution.

The main routine takes an existing set of points or creates some random ones if none are given.

findPoints=: 3 : 0
   'rndn np'=. 2{.y
   if. rndn-:' ' do. rndn=. (_1e2+?1e5$201)%>:?1e5$1e2 [ np=. 25 end.
   diffs=. 3 diffeqns\rndn
   close1s=. np{./:|diffs     NB. Pick closest results as
   seeds=. close1s{3[\rndn    NB.  starting points...
   for_around. <"1 ] 1 _1*/~0.1^>:i.4 do.
       pts=. 0 3$0
       for_seed. seeds do.
           nn=. >,&.>/&.>/seed +&.>/ <steps (>around),11
           dd=. diffeqns&>nn
           pts=. pts,>nn{~<($dd)#:(,|dd)i.<./,|dd
       end.
       seeds=. pts
   end.
   pts
)   

steps=: 3 : 0
NB.* steps: vector of numbers from num, to num, in numsteps steps.
   'from to numsteps'=. y
   from+(to-from)*(numsteps-1)%~i.numsteps
)

We see that findPoints works by testing a lot of points and selecting those where diffeqns is closest to zero. These are used in the loop as starting points from which we generate a small grid of points around each one to select a new point even closer to the solution.

Converging to a Solution

The argument to the for_around. statement deserves some elucidation. This sets the value of the loop variable around to be each of these four successive intervals:

      <"1 ] 1 _1*/~0.1^>:i.4
+--------+----------+------------+--------------+
|0.1 _0.1|0.01 _0.01|0.001 _0.001|0.0001 _0.0001|
+--------+----------+------------+--------------+

Each of these intervals is used to generate a set of points around the initial seed points in the expression >,&.>/&.>/seed +&.>/ <steps (>around),11.

For example, for an arbitrary seed point and an interval from _0.1 to 0.1, we get 11 points spaced around the initial one.

   seed=. _1.4 0.7 _1.9
   around=. 0.1 _0.1
   >seed +&.>/ <steps (>around),11
_1.3 _1.31818 _1.33636 _1.35455 _1.37273 _1.39091 _1.40909 _1.42727 _1.44545 _1.46364 _1.48182 _1.5
 0.8 0.781818 0.763636 0.745455 0.727273 0.709091 0.690909 0.672727 0.654545 0.636364 0.618182  0.6
_1.8 _1.81818 _1.83636 _1.85455 _1.87273 _1.89091 _1.90909 _1.92727 _1.94545 _1.96364 _1.98182   _2

For each seed, we generate and evaluate points in a grid around the initial one. The best of these are selected for the next round which has a tighter around interval in order to get increasingly closer to a possible solution.

Back to the Original Problem

Interestingly, this method works well if we consider the original problem which restricted us to integer solutions. In fact, we can evaluate a set of integers directly without trying to converge on a non-integral solution. However, a three-dimensional grid of points can get arbitrarily large so we have to pick some maximum size of the cube with which we want to work.

Once we have tested the values in a particular cube, we can work on all the adjacent cubes, and so on.

Original Cube

We can generate points for our original cube (centered about the origin) like this:

   $pts=. >(i:50),&.>/(i:50),&.>/i:50
101 101 101 3
   7!:5<'pts'
3.35544e7

The points take up over 300 megabytes of memory. Do they look like what we expect?

   2 2 2 3{.pts
_50 _50 _50
_50 _50 _49

_50 _49 _50
_50 _49 _49


_49 _50 _50
_49 _50 _49

_49 _49 _50
_49 _49 _49
   _2 _2 _2 3{.pts
49 49 49
49 49 50

49 50 49
49 50 50


50 49 49
50 49 50

50 50 49
50 50 50

Applying the objective function, we get these results.

   $dd=. diffeqns&>pts
101 101 101 3

We locate the zeros and find their indexes by converting the integer indexes from I. into three-dimensional coordinates with ($dd)#: , then selecting the points to which these indexes correspond.

     (<"1 ($dd)#: I. 0=,dd){pts
_21 _14  _7
_14  _7 _21
 _7 _21 _14
 _2  _1   1
 _1   0   1
 _1   1  _2
 _1   1   2
  0   0   0
  0   1  _1
  1  _2  _1
  1  _1   0
  1   2  _1
  2  _1   1
  7  14  21
 14  21   7
 21   7  14

Verifying one of the points

   diffeqns 14  21   7
0

Expansion Cubes

Thinking about how to build cubes of points adjacent to our original one highlights the power of array languages. Starting with the corners, we know that a cube has eight of them so what is the offset from our original cube for each of the eight corners?

Adding each of these rows gives us the 8 corner cubes.

   ]cornerShifts=. _100 100{~#:i.8
_100 _100 _100
_100 _100  100
_100  100 _100
_100  100  100
 100 _100 _100
 100 _100  100
 100  100 _100
 100  100  100

Adding each of these rows gives us the 6 face cubes.

  
   ]faceShifts=. ,/0 1 2|."(0 1)/ _100 100 ,"(0 1) 0 0
_100    0    0
 100    0    0
   0    0 _100
   0    0  100
   0 _100    0
   0  100    0

These are the offsets which give us the 12 edge-aligned cubes.

  
   ]edgeShifts=. ,/0 1 2|."(0 1)/0,._100 100{~#:i.4
   0 _100 _100
   0 _100  100
   0  100 _100
   0  100  100
_100 _100    0
_100  100    0
 100 _100    0
 100  100    0
_100    0 _100
 100    0 _100
_100    0  100
 100    0  100

We see that a cube has 8 corners, 6 faces, and 12 edges so there are 26 cubes that are adjacent to this one. Since we are going from one cube to 3x3x3 cubes, this is the correct number of additional cubes. Also, per Euler, we know that F + V - E = 2, so these numbers make sense. We are using 100 and _100 because the original cube ranges from _50 to 50 on each dimension, so moving 100 from the origin in each dimension takes us to the limit of the existing cube with an overlap of one point.

Looking at the corner cubes:

   $dd=. diffeqns"1 pts+"1/ ] cornerShifts
101 101 101 8
   (<"1 ($dd)#:I. 0=,dd){pts+"1/ ] cornerShifts
  55   85  100
  60   75  105
_125 _100  _75
  75  100  125
  75  105   60
  85  100   55
_105  _75  _60
 100   55   85
_100  _85  _55
_100  _75 _125
 100  125   75
 105   60   75
 _85  _55 _100
 _75 _125 _100
 125   75  100
 _75  _60 _105
 _60 _105  _75
 _55 _100  _85

   diffeqns _85  _55 _100
0

Evaluating the edge and the face cubes gives us no solutions.

   (<"1 ($dd)#:I. 0=,dd){pts+"1/edgeShifts 
   <./|,dd                                  NB. Closest distance
1

We have no exact matches. How many are only off by one?

   $(<"1 ($dd)#:I. 1=|,dd){pts+"1/edgeShifts
18 3

Similarly for the face cubes, there are no exact matches but we have some points off by only 2.

   6!:2 'dd=. diffeqns"1 pts+"1/ ] faceShifts'
9.67004
   (<"1 ($dd)#:I. 0=,dd){pts+"1/ ] faceShifts
   <./|,dd
2
   (<"1 ($dd)#:I. 2=|,dd){pts+"1/faceShifts
_33 _30 _89
_30 _89 _33
 89  30  33
_89 _33 _30
 30  33  89
 33  89  30

Advanced topics

Recently there has been some discussion in the J community about things that seem to be missing from the language. Often this particular discussion focuses on data structures like trees. There are a few essays on the wiki about trees and I have not heard of any objections to those ways of working with trees so it seems that these approaches are adequate.

What are some other data structures or ways of processing that are widely available in other languages but seem to have to equivalent in J? One that has been mentioned often is the hash table. How would we build one in J?

I should note that this is a bare introduction to hash tables that overlooks many of the issues commonly associated with implementing hash-tables, such as handling overloaded keys and bundling insertion and deletion capabilities.

Hash Table

The key idea of a hash table is that if we can generate a unique key for each item in a large array and if the key is much smaller than an array element, we should be able to look up elements in the array more quickly than if we have to compare the full element.

What should we use for our large array? The work we've done on poker gives us the idea of looking up poker hands in a table comprising all possible hands. To do this, we need a function to generate all possible combinations of a given size from a set. A search of the wiki yields this link which gives us this definition:

    comb=: 4 : 0 M.   NB. All size x combinations of i.y
 if. (x>:y)+.0=x do. i.(x<:y),x else. (0,.x comb&.<: y),1+x comb y-1 end.
)

A quick test shows us what to expect.

   $3 comb 5
10 3
   3!5
10
   3 comb 5
0 1 2
0 1 3
0 1 4
0 2 3
0 2 4
0 3 4
1 2 3
1 2 4
1 3 4
2 3 4

To generate the table of all possible 5-card hands, we do this:

   $RH5=: 5 comb 52
2598960 5
   5!52              NB. How many should we expect?
2598960

Look at a few sample hands:

   3{.RH5
0 1 2 3 4
0 1 2 3 5
0 1 2 3 6
   _3{.RH5
46 47 49 50 51
46 48 49 50 51
47 48 49 50 51
   showCards 100 1000 10000 100000 1000000{RH5
+---+--+--+---+--+
|K♣ |7♣|4♣|3♣ |2♣|
+---+--+--+---+--+
|10♠|8♥|4♣|3♣ |2♣|
+---+--+--+---+--+
|2♣ |2♥|A♣|10♦|3♣|
+---+--+--+---+--+
|A♥ |9♥|8♣|4♠ |2♣|
+---+--+--+---+--+
|6♣ |6♦|J♥|4♦ |3♠|
+---+--+--+---+--+

This should give us unique hash values.

   makeHash=: 52 #."(0 1) ]
   6!:2 'hash5=. makeHash RH5'
10.7526
   usus hash5          NB. minimum, maximum, mean, and standard deviation
146176 3.5053e8 5.96886e7 5.22142e7
   (#-:#@:~.) hash5
1

This last expression compares the tally of the table to the tally of its nub. If they are equal, we have unique hashes.

   #RH5
2598960
   ixs=. 2598960?#hash5   NB. Some random rows to look up
   6!:2 'wh=. RH5 i. ixs{RH5'
2.09659
   wh-:ixs                NB. Are the answers correct?
1

The basic lookup took about two seconds for about 2.6 million items. How does the hash compare?

   6!:2 'wh=. hash5 i. ixs{hash5'
0.123261
   wh-:ixs             NB. Are the answers correct?
1

Time to Free Memory

When we extend this exercise to a table of six-card hands we see that the hash lookup is still much faster than the naïve one. However, we encountered some slowness in J's response to the creation of this hash vector.

   6!:2 'RH6=: 6 comb 52'     NB. All 6-card hands
0.66211
   6!:2 'hash6=. 52#."(0 1) RH6' [ smoutput qts''
2023 1 9 17 59 55.97
qts''
usus hash6
22.7858
   2023 1 9 18 17 41.905
   7.60116e6 1.78399e10 2.60503e9 2.36078e9
   20358520
   2023 1 9 17 59 55.97 TSDiff 2023 1 9 18 17 41.905
+--------+-------------------+
|_1065.93|0 0 0 0 _17 _45.935|
+--------+-------------------+
   (#-:#@:~.)hash6
1

Notice how the inputs following the generation of hash6 are flush with the left margin. This is because we entered them immediately after the expression for generating the hashes but it took more than the 22.8 seconds of compute time for J to be ready to accept further input. So the difference between the two timestamps - over 1,000 seconds - reflects a period where J was non-responsive even though the computation was finished.

Here we see how memory is slowly returned after peaking during the hash generation.

Taking time to return memory after hash.png

   ixs=. 2598960?#hash6        NB. Same number of samples as in the 5-card case
   6!:2 'hash6 i. ixs{hash6'
0.537749
   6!:2 'RH6 i. ixs{RH6'
17.7146

Once again, we see that the hash lookup is much faster than the basic J expression.

Learning and Teaching J

There is this essay about algorithms with which every programmer should have experience. These are common algorithms much as hashing is so it would be interesting to see how well suited J is for coding them. There is already code for some of these which I note below.

Topological sort

We use topological sort to sort items that are dependent on other items. Common scenarios include determining the order of tasks for build systems, the evaluation order of cells in a spreadsheet, which classes a student should take each semester to graduate, and a variety of scheduling problems.

Here is the same graph represented two different ways; the latter one is topologically sorted.

Toposort.png

More information:
Used in J code for writing a WAV file
Using topological sort in J for an Advent of Code problem
Topological sorting (Wikipedia)
Topological sort interactive visualization (web)

Recursive descent parsing

We have examples of this in J here and here.

This one felt like unlocking a super power when it clicked for me. If you ever need to ingest a recursive data format or make a programming language, this is the tool for you. Sketch out a grammar then map each grammar rule to a code function. Bam! It feels like it codes itself.

Teenyparser.png

You could have a simple compiler whipped up in a few hours with it.

More information:
Regex lexer
Essay by Oleg Kobchenko includes an example of parsing JSON
Let's make a Teeny Tiny compiler (author's tutorial series)
Crafting Interpreters and Amazon
Recursive descent parser (Wikipedia)

Myers string diff

Processing strings is such a common programming task, but I can usually get away with brute forcing whatever it is I need to calculate. We've all learned about Levenshtein distance, but what if you need to show the differences between two strings in a sensible way?

Codediff.png

The default algorithm used by git for showing diffs is the Myers diff algorithm.

More information:
The Myers diff algorithm
Myers diff algorithm - Code and interactive visualization

Bloom filter

Apparently Bloom filters are used internally in J.

Given the remark here about "a really big hash table", this might be a follow-on to the hash table work above.

The most known unknown data structure. In fact, I polled some friends at Big Tech about what data structure they'd suggest I add to this list and all of them said bloom filter!

It really only helps with large-scale problem, such as when you need a really big hash table, but it is a great introduction to probabilistic data structures. With very little memory, it can tell you if an item is not present in the table, otherwise it can only tell you maybe the item is present.

Google asked me two interview problems involving these back when I was in college. Have I ever needed this in practice? Nope!

More information:
Bloom filter (Wikipedia)
Let's implement a bloom filter
Bloom filters by example

Piece table

When I first tried making my own text editor, I stored all of the text in an array. But that hits performance problems right away (e.g., when the user inserts text anywhere but the end). Years later I even got asked how to implement a performant text buffer at an internship interview.

What should I have used? Well, there are a few options with different tradeoffs: rope, gap buffer, or piece table. With a piece table, you track the edits performed on text rather than only keeping the resulting text. It makes a lot of other features trivial to add (e.g., undo support and incremental saving).

Piece tables aren't only useful for text editors though. You can use it whenever you have a large buffer of data that can be arbitrarily modified.

More information:
Piece table (Wikipedia)
VS Code's text buffer story
The piece table

Splay tree

How about a self-optimizing data structure? Splay trees are binary trees that tend to have the more recently accessed elements closer to the root, and it does so without maintaining any additional metadata. Each time an element is accessed, it gets moved to the root.

Splay tree 0.png Splay tree 1.png

It's a tree with a built-in cache and it is beautifully simple to implement! As a student, I had a really hard time implementing red-black trees (we wrote the code on paper, ugh) so I was pleasantly surprised when splay trees just clicked for me.

More information:
Splay tree interactive visualization
Splay tree (Wikipedia)
Implementation in TypeScript (GitHub)

Miscellaneous Other

If you are still craving more, here are the honorable mentions that I cut from the list:

Try these out, and let me know if I missed any!