NYCJUG/2020-12-08


Beginner's regatta

Data structures in J

A member of the J forum asked about how to set up and use data structures in J, analogous to typedef or struct in C.

 From: emacstheviking <objitsu@gmail.com>
 Date: Sun, Dec 6, 2020 at 3:14 PM
 Subject: [Jprogramming] Seeking advice on structures
 To: <Programming@jsoftware.com>
What's the conventional wisdom / best practice on defining data structures for an application?
Given there is no explicit keyword/operator support like C (typedef, struct), is it merely a case of convention and using boxed structures? I have read about several operators that can modify structures, both as new aliased copies and as in-place modifications, but I do not have enough experience with J to know what's efficient at run time in terms of time/memory, etc.
My specific use case is that of a vertically scrolling star field... I intend to recreate and hopefully extend the tiny little game I wrote but never finished, screenshot here: http://seancharles.xyz/posts/2019-10-06-all-at-c.html
In that I had a struct that had the x, y, dy and type values but it seems to me that given that J is all about arrays, it might be more efficient using parallel arrays i.e. x array, y array, dy array etc.
Also, given that the state is being updated in a tight event loop using the time differential between frames to calculate the step motion (i.e. CPU-speed independent), what are your thoughts on immutable updates producing new arrays versus updating in place?
Thanks,
Sean.

There were a number of helpful suggestions, including advice to consider using jdb if the data lends itself to a database, and to use namespaces to simulate an OOP approach. My own suggestions and examples follow.

Some Examples

As others have already indicated, J is most efficient with unboxed, simple, homogeneous arrays. That said, it would be helpful to have an example of a data structure, maybe one you have in C or C++, that you would like to implement in J.

Looking around, here is one of the more complex ones I found in C++ (from https://codescracker.com/cpp/cpp-data-structures.htm):

struct stud
{
	int rollno;
	char name[20];
	char branch[3];
	char batch[2];
	float marks[5];
	char grade;
}stud_var;

This has two numeric and four character fields, so it may lead us to think we should use a boxed matrix, perhaps like this:

   students=. ,:101;'Joe Blow';'BAT';'FR';60 70 85 96.8 9;'B'             NB. Use ",:" to give us a one-row table
   students=. students,202;'Cruella De Ville';'DOG';'SO';91 92 93 94 89;'A'
   students=. students,303;'Christopher Xavier Columbus';'EXP';'JR';14 92 10 15 0;'F'

We could specify the (column) labels for each field:

   labels=. 'rollno';'name';'branch';'batch';'marks';'grade'

This allows us to avoid "magic numbers" in the code and to make it clear which column we are referring to:

   students{"1~labels i. <'name'
+--------+----------------+---------------------------+
|Joe Blow|Cruella De Ville|Christopher Xavier Columbus|
+--------+----------------+---------------------------+
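
We can also use the labels when updating a field. Here is a minimal sketch (the new grade value is made up purely for illustration) that amends one student's grade in place:

   students=. (<'A+') (<1,labels i. <'grade')} students   NB. change the grade of student 1 (row 1)
   students{"1~labels i. <'grade'
+-+--+-+
|B|A+|F|
+-+--+-+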

Alternatively, we could build this data structure using unboxed arrays with distinct names in a namespace:

   rollno_students_=. 101 202 303
   name_students_=. >20{.&.>'Joe Blow';'Cruella De Ville';'Christopher Xavier Columbus'
   branch_students_=. 'BAT','DOG',:'EXP'
   batch_students_=. 'FR','SO',:'JR'
   marks_students_=. 60 70 85 96.8 9,91 92 93 94 89,:14 92 10 15 0
   grade_students_=. 'BAF'

Notice how I now enforce the length limitations, on the "name" field in particular, that the original C++ example imposed. This is one of the things I dislike about these more primitive languages: they force you to make data structure decisions in advance of the actual data.

   name_students_
Joe Blow            
Cruella De Ville    
Christopher Xavier C

This length limitation does allow us to use simpler, unboxed data structures. Of course, we don't have to fix the maximum length in advance in J:

   ]name_students_=. 'Joe Blow','Cruella De Ville',:'Christopher Xavier Columbus'
Joe Blow                   
Cruella De Ville           
Christopher Xavier Columbus

It depends on how closely you wish to mimic the C++ structure and what trade-offs you want to make.
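
With the parallel-array approach, one way to gather all the fields of a single "record" is to index each vector at the same position. Here is a minimal sketch (the verb name getStudent is mine, just for illustration, not part of any library):

   getStudent=: 3 : '(y{rollno_students_);(y{name_students_);(y{branch_students_);(y{batch_students_);(y{marks_students_);y{grade_students_'
   getStudent 1          NB. all of student 1's fields, gathered from the parallel arrays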

Unboxed arrays are often more efficient to process, but boxed ones are more flexible: e.g. what if the number of "marks" varies substantially from student to student?
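
For instance, here is a hedged sketch of what boxed, variable-length marks might look like (the numbers are made up and the name marksv_students_ is mine):

   marksv_students_=. (60 70 85 96.8 9);(91 92 93);(14 92 10 15 0 88 77)
   #&>marksv_students_              NB. number of marks per student
5 3 7
   (+/ % #)&>marksv_students_       NB. mean mark per student
64.16 92 42.2857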

Also, boxed arrays can be quite efficient. As a rule of thumb, you don't want to box very small things as each box incurs the overhead of a pointer; for large things, say full names or paragraphs, the overhead is amortized.

Other Methods and Examples

Here is an example of J code to work directly with C data structures.

Here is an example of six J vectors used for saving the results of parsing a file directory tree. There are four equal-length vectors representing file information: all file names, last modified dates and times, sizes, and indexes into a vector of directory names in which each file appears. The other two equal-length vectors contain the dependency tree relating directories and the full names of each directory.
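
Here is a hedged sketch of that layout with made-up data (all names and values below are hypothetical):

   flnms=. 'readme.txt';'dirSize.ijs';'notes.md'                      NB. file names
   fldts=. 3 6$2020 12 1 9 0 0  2020 12 5 14 30 0  2020 12 7 8 15 0  NB. last-modified date/times
   flszs=. 1024 4096 2048                                             NB. file sizes in bytes
   fldir=. 0 1 1                                                      NB. index into dirnms for each file
   dirnms=. 'C:/work';'C:/work/J'                                     NB. full directory names
   dirdep=. _1 0                      NB. one possible dependency encoding: each directory's parent index (_1 for the root)
   (fldir{dirnms),.flnms              NB. each file's directory alongside its name
+---------+-----------+
|C:/work  |readme.txt |
+---------+-----------+
|C:/work/J|dirSize.ijs|
+---------+-----------+
|C:/work/J|notes.md   |
+---------+-----------+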

Condition Number

Preliminary definitions

require 'numeric' NB. for clean
mp=:+/ .*         NB. matrix product
det=:-/ .*        NB. determinant
id=: =@i.         NB. identity matrix

Recall that monad %. is matrix inverse. We expect that multiplying a with %. a will give the identity.

   a=:2 2 $ 1 2 3 _4
   a mp %. a
1 0
0 1
   (id # a) -: a mp %. a
1

However, we can define matrices for which this is not the case.

NB. Vandermonde matrix on i.y
v=:^/~ @: i.
   v 4
1 0 0  0
1 1 1  1
1 2 4  8
1 3 9 27
   a=:v 4
   (id # a) -: a mp %. a
0
   a mp %. a
           1 3.44169e_15 _3.19189e_15  1.29757e_15
 2.66454e_15           1   2.9976e_15 _1.55431e_15
_1.77636e_15 1.77636e_15            1  1.33227e_15
 1.77636e_15           0            0            1

The verb clean replaces quantities close to 0 by 0.

   (id # a) -: clean a mp %. a
1

We can measure the maximum divergence from the identity.

   >./ ,| (id # a) - clean a mp %. a
3.44169e_15
   a=:v 20
   >./ ,| (id # a) - clean a mp %. a
0.377341

The key to predicting this bad behavior is the condition number of a. We first define the infinity norm (take absolute values, sum each row, take the maximum). Then the condition number is the product of the norms of a and of the inverse of a.

NB. Infinity norm
n=:>./@:(+/"_1)@:|
NB. Condition number
k=:n * n @: %.

Condition numbers of the first few Vandermonde matrices:

   k@:v"0 >:i.10
1 4 28 266.667 4546.67 104160 2.85658e6 9.2066e7 3.42818e9 1.45316e11
   k v 20
9.80872e27

A large condition number indicates that numerical calculations may be off. For example, consider this nearly singular system, whose exact solution is 1 1:

e=:0.00001
a=:2 2 $ 1 3 1, (3+e)
b=:4,4+e
xs=:1 1  NB. Exact solution of a mp x = b

However:

   b %. a
0.999856 1.00005

It can be hard to tell when we are near a solution.

   b - a mp 4 0  NB. 4 0 is nowhere near 1 1 
0 1e_5
   k a
2.39997e6

We would not worry about this except that matrices with large condition numbers arise all the time. The Vandermonde matrices arise when doing interpolation with equally spaced points, among other contexts.

Conclusion: Beware of large condition numbers when doing numerical calculations. Ideally, you should reformulate your problem so that condition numbers are small.

Show-and-tell

Sizing directories

I recently wrote a utility function to list the space taken by all sub-directories in a given directory:

NB.* dirSize: total size(s) of all files and all subdirectories under dir y
dirSize=: 3 : 0
   0 dirSize y                        NB. Default is don't sum
:
   y=. y,PATHSEP_j_#~PATHSEP_j_~:{:y
NB.   y=. PATHSEP_j_ (],[#~[~:[:{:]) y
   dl=. dir y,'*'
   whd=. 'd' e. &> 4{"1 dl
   szs=. +/(-.whd)#;2{"1 dl             NB. Total size of files under top dir.
   if. 1 e. whd do.                   NB. Sum everything below top-level dirs
       szs=. szs,1 dirSize&>(<y),&.>PATHSEP_j_,~&.>whd#0{"1 dl
   end.
   if. x do. +/szs else. ,szs end.      NB. bytes in files immediately under dir,
NB.EG dirSize 'C:\Program Files'   NB. then total bytes under sub-dirs.
NB.EG (((<'Files:'),[:jd[:dir '*',~ endSlash),.[: <"0 dirSize) 'C:\Program Files'
NB.EG (1 dirSize 'C:\Temp')=+/0 dirSize 'C:\Temp'
)

Let's see how this works in practice:

   $dirsFls=. dir 'C:\Program Files\*'  NB. Directories and files
50 5
   3{.dirsFls
+---------------------------+------------------+-+---+------+
|7-Zip                      |2020 6 25 21 49 1 |0|rw-|----d-|
+---------------------------+------------------+-+---+------+
|Amazon Redshift ODBC Driver|2020 12 2 10 32 57|0|rw-|----d-|
+---------------------------+------------------+-+---+------+
|Application Verifier       |2020 6 23 12 49 49|0|rw-|----d-|
+---------------------------+------------------+-+---+------+
   6!:2 'szs=. dirSize ''C:\Program Files''   NB. then total bytes under sub-dirs.'
0.845075
   $szs
50

However, this timing is deceptive because I had run this once before I timed it. Here we compare the timing of an initial run with that of a second run of the same code:

   6!:2 'szse=. dirSize ''e:\Program Files'''   NB. Initial timing
8.32851
   6!:2 'szse=. dirSize ''e:\Program Files'''   NB. Same thing again
1.28927
   $szse
75

We see a substantial speed-up because Windows must be doing some caching.

Here's how we might associate the sizes returned with the list of sub-directories:

   dirsFls=. (}:"1 dirsFls),.<"0 szs
   5{.dirsFls\:{:"1 dirsFls                  NB. What are the largest sub-dirs?
+----------------------------+-------------------+-+---+------+----------+
|Tableau                     |2020 12 4 22 32 55 |0|rw-|----d-|3186342310|
+----------------------------+-------------------+-+---+------+----------+
|Microsoft Office            |2020 12 4 15 2 15  |0|rw-|----d-|2765183321|
+----------------------------+-------------------+-+---+------+----------+
|Docker                      |2020 11 10 10 59 57|0|rw-|----d-|2359099642|
+----------------------------+-------------------+-+---+------+----------+
|NVIDIA Corporation          |2020 11 18 15 1 38 |0|rw-|----d-|1560623001|
+----------------------------+-------------------+-+---+------+----------+
|NVIDIA GPU Computing Toolkit|2020 11 18 15 0 38 |0|rw-|----d-|1370955044|
+----------------------------+-------------------+-+---+------+----------+

New K-like Language

There is a new language based on K. It sounds kind of exciting. Here's a little bit about it.

The How

From r/apljk on Reddit:

Posted by u/vsovietov: "New vibe man in town"

Preliminary announcement of ThePlatform, a new k-like language and time-series database

Dear colleagues

I'm happy to announce a new solution based on a k-like language. Actually, it's not brand new; this project has been under development for a while as an internal project of Lynx Capital Partners and is now being used as the main tool to implement (or re-implement) numerous internal services, especially in places where kdb+ would usually be used. Despite serving similar purposes, our implementation (we call it ThePlatform so far) is completely different from kdb+. From the dev's point of view, it may look similar in many cases (the same data model, an APL-like or K-like input language, etc.), but everything else differs. Here's a brief list of differences:

Core implementation: The Platform is written in Rust and uses LLVM infrastructure to perform run-time optimizations and to take advantage of the performance provided by modern vector instruction-set extensions.

Memory management is fully automatic, rather than requiring manual execution of .Q.gc[] as in kdb+.

The core functionality of ThePlatform can be extended with custom plug-ins (for example, a GUI plug-in integrates Qt/QML into The Platform; we're using it to develop desktop apps). If we ever need code that must avoid unnecessary copying and/or IPC, we can embed it in ThePlatform as a plug-in.

The Language

The Platform's language (codenamed "O") resembles K with some flavors of q (SQL-like selects, etc.).

o O implements some extensions to APL-like languages, such as pattern matching with proper destructuring, join calculus, AST instrumentation, etc.; this makes the O programming style significantly different from q's (and from k/J/Shakti's k5/k9 as well).
o We rely heavily on O's meta-programming capabilities and create DSLs when convenient: the standard library includes a PEG parser-generator for doing so.
o O supports streams (which can look like tables updated in real time) and reactions to stream events.
o O's tables and streams can be indexed to speed up further lookups.
o O's queries can be "lazy", working like cursors and returning chunks of data as needed.
o O implements join calculus to avoid manual implementation of lock/sync/rendezvous logic.

Code written in APL-like languages can look quite cryptic (especially to beginners). If one is not satisfied with O, other interpreters can easily be implemented as plug-ins (we have considered Lua-like and Clojure-like syntaxes so far).

Easy, Powerful Concurrency

The Platform runs O interpreters (or other user code) as lightweight tasks managed by schedulers (which can be bound to specific CPU cores to reduce latency).

Since O implements join calculus, interaction with multiple data/event streams is trivial and can be defined by specifying declarative join rules which fire reactively.

There's a standard API to define "reagents" in The Platform to interact with join rules. It simplifies network and CEP programming a lot.

The concurrent and lock-free nature of The Platform's core enables non-blocking execution; in other words, The Platform can execute as many simultaneous queries as the hardware can bear.

Advanced topics

Hard to Parallelize Intrinsically Scalar Languages

The following is excerpted from an essay introducing the "Alan" programming language.

The Turing-Completeness Problem

4 November 2020 | Luis F. De Pombo, David Ellis

Can we write programs with predictable execution that could be automatically parallelized? Part of the work of Alan Turing and Alonzo Church on computability demonstrated that for some programs it cannot be determined ahead of time whether or not the program can be computed. This constitutes the crux of the Halting Problem. This "Turing-Completeness Problem" is not an issue with what they have shown, but how the software development process is hindered by the lack of assurances on computability and our industry's lack of recognition of the burden this has put onto software developers. Further, it has been exacerbated by recent computing mechanisms, such as multithreading, NUMA, GPGPU, and distributed computing, which have moved away from the "single tape" Turing Machine model of computation that most programming languages are founded upon.

The following snippet of C code succinctly contains the problems of Turing completeness complicating modern development:

while (node) {
  doSomethingWith(node);
  node = node->next;
}

1. Automatic Parallelization - It's not possible to automatically parallelize this code. The number of nodes is not clear to the compiler for it to potentially bunch those calls across threads, and it's also unclear if the function call mutates the node and adds/removes nodes to process.

2. Halting Problem - It's not possible to know if the code will halt. Even ignoring the possibility that doSomethingWith may always add a new node after the one it is processing, there's nothing stopping a node from pointing at itself, so the loop may never terminate.

Automatic parallelization is only a problem because we want/need to parallelize our code. Until very recently this was a niche need as computing power had been increasing exponentially for decades and has only flattened in the past decade. If this was still the case it is likely that it would have remained a niche problem that only experts need to worry about, but with multicore now the norm even in personal laptops and unlikely to go away as signal propagation itself becomes the limiting factor, it has become a problem for the average developer, as well.

...

Side-stepping the Halting Problem by defining a barely-Turing-incomplete subset also makes the automatic parallelization story possible. Now the compiler can work with code for which it can determine where and how to parallelize the work, and it can model the code's finite execution time to determine whether or not parallelization is worth it under various conditions. This is the thesis of the Alan programming language [1]: that with minor syntax constraints (which we believe the vast majority of developers will not notice, or at least won't mind) we can be sure that the code you write always halts, and we can parallelize your code for you.

...

The root of most of the problems with the example C code involves the ambiguity in the data structure and the behavior of the user-defined function. The other problem is that the while loop is an unbounded looping construct that cannot be reasoned about except "the body of the while loop is executed zero or more times." In Alan, arbitrary looping and recursion are disallowed, and all data is built on top of data types with knowable constraints, no arbitrary JUMP statements or data pointer logic allowed.

What this means in practice is that instead of writing that while loop above for your list of nodes, you would write something like:

nodes.each(doSomethingWith)

Or, as we might put it in J:

   doSomethingWith &.> nodes
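
For instance, here is a small sketch with a made-up doSomethingWith (it just squares the contents of each boxed node):

   doSomethingWith=. *:              NB. hypothetical operation on a node's contents
   nodes=. 1;2;3                     NB. a boxed list standing in for the linked list
   doSomethingWith&.>nodes           NB. applied under open to each node; no explicit loop
+-+-+-+
|1|4|9|
+-+-+-+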

Learning and Teaching J

The Writings of Keith Smillie

A recent post on Reddit's r/apljk forum submitted a link to some of the writings of Keith Smillie. These were written some years ago but are oriented around learning J and calculating statistical measures in J, so they are worth a look. A summary of these writings follows:

A Lecture on Array Languages

(14 pages)

Programming languages are introduced briefly and a distinction is made between conventional and array languages. The language J is given as a modern exemplar of array languages and is illustrated with a few simple examples. Some comments are given on the teaching of languages and on the history of computing.

Beginning J

(4 pages)

J IS A GENERAL-PURPOSE programming language developed by Kenneth Iverson, the originator of APL, and Roger Hui. It is intended to be a modern dialect of APL that will provide the simplicity and generality of APL while at the same time be readily and inexpensively available on a variety of computers and capable of being printed on standard printers. In this article we shall give a brief introduction to J illustrated by a discussion of the coupon collector's problem which serves as a model for collecting a complete set of prizes included, one prize per package, in products such as breakfast cereal.

J Companion for Statistical Calculations

(63 pages)

J is used for the development of algorithms for a variety of statistical calculations including means, medians and quartiles, frequency tabulations, variances and covariances, regression analysis, graphical presentation, analysis of variance, random number generation and simulation, nonparametric tests, and probability distributions.

Table of Contents: Introduction; The J language; Arithmetic mean; Geometric and harmonic means; Frequencies I - Range; Frequencies II - Nub; J for Windows; Frequencies III - Continuous; Frequencies IV - Multidimensional; Barcharts; Stem-and-leaf diagrams; Median and quartiles; Mode; An example; Another example; Variance; Summary statistics; Covariance and correlation; Linear regression; Analysis of variance I - Sums of squares; Analysis of variance II - Orthogonal contrasts; Analysis of variance III - Unequal numbers; Probability; Discrete probability distributions; Three problems; J constants; Continuous probability distributions; Simulation I - Discrete variables; Simulation II - Continuous variables; Simulation III - Central limit theorem; Simulation IV - Sampling; One more example; Chi-square distribution; Nonparametric statistics - Preliminaries; Nonparametric statistics - Three examples; Nonparametric statistics - Runs; Markov chains; Moving averages; Windows programming; A final example; Acknowledgements; References
   Appendices
1. Summary of verbs
2. Figures
3. Table of computer breakdowns
4. REG and AOV verbs
5. Windows programming
6. Utilities

JSP: A J Statistical Package

(4 pages)

An overview of the above and a precursor of the essay below.

J Tutorial and Statistical Package

(68 pages)

The array language J is introduced with examples chosen mostly from elementary statistics. The topics include frequency tabulations, measures of central tendency and dispersion, correlation and regression, analysis of variance, probability distributions, random variables, chi-square, and nonparametric methods.

J4.01 Windows Programming Examples

(10 pages)

Examples of simple GUI creation using a very old dialect of J. This one is bound to be dated, as the GUI tools in J have gone through at least a couple of major revisions since then.

A Ray of Hope (video)

This video, "A Ray of Hope: Array Programming for the 21st Century" by Gilad Bracha on YouTube (start at 1:50), is probably one of the least ept videos I've seen, but the content is good.

 o New word: "orthotope"
 o Operations are lifted to arrays
 o Dot product is the first example  

It has a very Python orientation (a symptom of our times) but intersperses mathematical terms like "currying". He thinks that, by using named operations like "shapereq" (?), he has a way to avoid the "strange markings that constitute APL syntax", making it "a bit more readable I think".

He does emphasize that it "makes you think different".

Visualization Considered Harmful

The preceding section on calculating directory sizes was inspired in part by noticing a nice-looking graphical representation of directory and file sizes called "Diskitude". This is a very small executable (10,240 bytes) that quickly produces a circular representation of directory and sub-directory sizes. The sizes of individual sub-directories can be brought up interactively by hovering the cursor over parts of the diagram.

[Image: Diskitude's reduced intro screen]

The introduction says:

 Diskitude helps you figure out what is taking up space on your hard drive.  Its intuitive and animated interface allows you to easily explore your file system and prune your files.  Plus it's super small (only 10 KB) so it loads really fast and won't take up any space.

It works as advertised but speed and smallness do not compensate for unwieldy interface design. To be fair, it's currently a very popular type of poor interface - one that requires you to hover the cursor over each unlabeled segment of the diagram to get the size and name of the associated directory. And, while the segmented circular area display looks really cool, it suffers from two basic problems: it's hard for people to accurately estimate relative areas and the shape of the graph distorts the areas anyway.

So, looking at this, I was able to pick out the largest and second-largest directory but I had to hover over my guesses and - what? Write down the name of the directory and its size? Remember these two things while I hover over my next guesses?

Looking at this diagram, can you tell what the top three largest directories are and about how many bytes each is? Of course not.

[Image: Diskitude's display of C:\Program Files]

Why does one segment extend up and another down? Who knows?

As I mentioned, in spite of this, I did guess the top two directories on my first two guesses. Of course, I already knew their names because I had already looked at "C:\Program Files" using my own disk-sizing tool as we saw in a preceding section so that's cheating slightly.

Digging into the Flaws

Here were my first two (lucky) guesses of the largest two sub-directories.

[Images: the largest and second-largest segments highlighted in Diskitude's display of C:\Program Files]

OK, that's not so bad but what if we compare two segments corresponding to sub-directories a level or two down (out from center) from the first layer?

[Images: two large segments at different distances from the center of Diskitude's display]

The one on the right looks larger but, because it is two steps out from the center, it cannot be visually compared to the one on the left. So not only can you not rely on your immediate perception of the relative size of any two segments, you also have to stop and count how far out each one is from the center, or mentally trace around the corresponding circle segment.

Keeping track of how far you are from the center becomes more of an issue when you think about where you would be pruning files and directories to increase free space: probably not near the top of the hierarchy.

Proposing an Alternative

Compare the above, scalar-oriented way of looking at the data to the start of this simple table:

      @dir	  w/subdirs	Directory
       174	16683886022	C:/Program Files/
         0	 3186342310	C:/Program Files/Tableau
   4773264	 2765183321	C:/Program Files/Microsoft Office
         0	 2753241638	C:/Program Files/Microsoft Office/root
  23072215	 2359099642	C:/Program Files/Docker/Docker
         0	 2359099642	C:/Program Files/Docker
1129374534	 2139101895	C:/Program Files/Docker/Docker/resources
       312	 1649671486	C:/Program Files/Tableau/Tableau 2020.3
     27203	 1560623001	C:/Program Files/NVIDIA Corporation
       312	 1528582440	C:/Program Files/Tableau/Tableau Public 2020.3
...

With a table like this, which shows the amount of space taken up by the files immediately under each sub-directory in the "@dir" column as well as the amount taken up by all sub-directories, we can simply scan down for a likely deletion candidate in order of size. So, we might ask if the gigabyte of files under "C:/Program Files/Docker/Docker/resources" is needed. Compare this to one of the above diagrams where these files are the light-blue segments at about 10 o'clock: we have to hover over each and mentally add them to come up with the total size of files at that level.

Contrasting the two presentations: with the visualization we have to hover, remember the number and file name, then do this again, and maybe yet again:

[Images: hovering over three large files in Diskitude's display, one at a time]

versus looking at this line:

 1129374534	 2139101895	C:/Program Files/Docker/Docker/resources
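
As a final note, here is a hedged sketch of how the two size columns for a single directory like this one might be computed with the dirSize verb from the earlier section (assuming the same directory tree as when the table above was generated):

   d=. 'C:/Program Files/Docker/Docker/resources'
   ({. 0 dirSize d) , 1 dirSize d    NB. bytes in files directly under d, then total bytes under d
1129374534 2139101895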