NYCJUG/2014-05-13

From J Wiki
Jump to navigation Jump to search

string replace, file streaming, optimization software, working with large files, innate language instinct, effect of using foreign language on decision, coding in education


Summary of NYCJUG Meeting on 20140513

We looked at some variations of string replacement, from how we might develop a simple version in J versus the standard one supplied. From this, we introduced the problem of applying something like string replacement across a file too large to be read into memory in one piece, then generalized this to the problem of applying an arbitrary verb across a large file in pieces.

We also considered some aspects of human cognition as it relates to language acquisition and how language shapes cognition. Continuing in this vein, we talked about the usefulness of including coding as a standard part of an education curriculum.

Meeting Agenda for NYCJUG 20140513

1. Beginner's Regatta: developing a simple string replacement verb: see "String Replacement".

2. Show-and-tell: various ways to replace a string in a file: see "Replace a String in a File".
Also, see "Streaming Through Large Files".

3. Advanced topics: see "Optimization Software".

4. Learning, teaching and promoting J, et al.: cognitive aspects of language: see "Born to chat: Humans may have innate language instinct" and "Using a foreign language changes moral decisions".

Learning to code - see "Reading, Writing, Arithmetic, and Lately, Coding" - and why this is increasingly important - see excerpt from Forrester Computing white paper.

Beginner's Regatta: Simple String Replacement

We have a standard routine for replacing one string by another in a given text:

   whereDefined 'stringreplace'
C:\Program Files\J701\system\main\stdlib.ijs   NB. YMMV

Here's an example of how it works:

   ('aa';'XXX') stringreplace 'abcaabbccaabbbccc'
abcXXXbbccXXXbbbccc

Let's see how we might develop a simple version of this. First, we'll name the pieces using the example above.

   'old new txt'=. 'aa';'XXX';'abcaabbccaabbbccc'

   whold=. old E. txt     NB. Find the old string
   (' '-.~":whold),:txt   NB. Show together
00010000010000000
abcaabbccaabbbccc

   ]ixold=. (I. whold)+/i.#old  NB. Start/stop indexes of old
3  4
9 10

Now that we have the indexes of the old strings within the text, we can remove them:

   ]rmold=. (0) ixold}1$~#txt   NB. 0s to remove old
1 1 1 0 0 1 1 1 1 0 0 1 1 1 1 1 1
   rmold#txt
abcbbccbbbccc

The tricky parts are making room to insert the new text and keeping track of where it should go in the reduced, intermediate text.

   rmold#1|.-.rmold        NB. Shift 1 to mark starting points of
0 0 1 0 0 0 1 0 0 0 0 0 0  NB. where to insert.
   (#new)*rmold#1|.-.rmold NB. Size of insertions
0 0 3 0 0 0 3 0 0 0 0 0 0
   >:(#new)*rmold#1|.-.rmold
1 1 4 1 1 1 4 1 1 1 1 1 1

Here we've built an expansion vector with ones for the letters left untouched and larger numbers to allow room for both the original character at that point and the new string we'll be inserting.

   ]expand=: >:(#new)*rmold#1|.-.rmold NB. Expand for insertion
1 1 4 1 1 1 4 1 1 1 1 1 1
   expand#rmold#txt
abccccbbcccccbbbccc

Now we need to figure out where to place the new strings...

   expand#rmold#1|.rmold
1 1 0 0 0 0 1 1 1 0 0 0 0 1 1 1 1 1 1

This isn't quite right: we want the first zero to be a one to preserve the existing character before the insertion.

   13 : 'y+.1,}:y'     NB. generate tacit expression
] +. 1 , }:
   (] +. 1 , }:) expand#rmold#1|.rmold
1 1 1 0 0 0 1 1 1 1 0 0 0 1 1 1 1 1 1

Create the Boolean with ones corresponding to the insertion points for the new strings.

   ]insnew=. -.(] +. 1 , }:) expand#rmold#1|.rmold  NB. Insert new
0 0 0 1 1 1 0 0 0 0 1 1 1 0 0 0 0 0 0
   ]insnew=. I. insnew   NB. Use indexes
3 4 5 10 11 12

Ensure we replicate the new string enough times to correspond to how man times we're inserting a copy.

   new$~#insnew
XXXXXX

Put it all together to see that it works:

   (new$~#insnew) insnew}expand#rmold#txt
abcXXXbbccXXXbbbccc

Gather the relevant portions of code from above to define a simple string replacement verb.

simpleReplace=: 4 : 0
   'old new'=. x [ txt=. y             NB. Remove old text, insert new text.
   whold=. old E. txt                  NB. Find old strings.
   ixold=. (I. whold)+/i.#old          NB. Indexes of old
   rmold=. (0) ixold}1$~#txt           NB. Remove old
   expand=. >:(#new)*rmold#1|.-.rmold  NB. Expand to make room for new
   insnew=. I.-.(]+.1,}:) expand#rmold#1|.rmold NB. Insert new
   (new$~#insnew) insnew}expand#rmold#txt
)

Compare our new version to the existing one:

   ('aa';'XXX') stringreplace 'abcaabbccaabbbccc'
abcXXXbbccXXXbbbccc
   ('aa';'XXX') simpleReplace 'abcaabbccaabbbccc'
abcXXXbbccXXXbbbccc

See where it falls short:

   ('aa';'XXX') stringreplace 'aBaaaBBaaaa'  NB. More complex because
aBXXXaBBXXXXXX                               NB. self-overlap.
   ('aa';'XXX') simpleReplace 'aBaaaBBaaaa'
aBXXXBBXXX

We see that our simple version gives a different answer in the case where the old string is found in self-overlapping instances. That is "aa" is found twice in the string "aaa": once starting at position zero and again at position one.

This may be sufficiently unusual that we are willing to live with the difference. However, there is also a bonus with our simple version: unlike the original, this also works with numeric vectors:

   (1 1;9 9 9) simpleReplace 1 2 3 1 1 2 2 3 3 1 1 4 4
1 2 3 9 9 9 2 2 3 3 9 9 9 4 4

Show-and-tell

We looked at a wide array of suggestions for text replacement in various (non-J) languages in response to a StackOverflow question. The responses ranged from large bodies of code (JScript) to very succinct expressions (Perl and, surprisingly, the Windows Batch command language). These are of interest to compare with J in terms of readability.

Streaming Through Large Files

Of greater interest is a more general solution of developing a J adverb to apply something like a string replacement verb across a large file without reading the entire file into memory. The code explained here can be found in its entirety here.

An occasional disadvantage of an array-oriented language like J is that we like to work on an entire array at a time but sometimes the data is too large for this to work well. Often we have large amounts of data in a file, for instance:

   fsize 'CIQ_G_History_New.txt'
185661834
   10^.fsize 'CIQ_G_History_New.txt'
8.26872

Here’s one way we might start on this problem: by specifying a verb to work through the file until we’ve accumulated at least a full “line” of data, here defaulted to being delimited by LF (10{a.).

NB.* getFirstLine: get 1st line of tab-delimited file, along w/info
NB. to apply this repeatedly to get subsequent lines.
getFirstLine=: 3 : 0
   (10{a.) getFirstLine y     NB. Default to LF line-delimiter.
:
   if. 0=L. y do. y=. 0;10000;y;'' end.
   'st len flnm accum'=. 4{.y NB. Starting byte, length to read, file name,
   len=. len<.st-~fsize flnm  NB. any previous accumulation.
   continue=. 1               NB. Flag indicates OK to continue (1) or no
   if. 0<len do. st=. st+len  NB. header found (_1), or still accumulating (0).
       if. x e. accum=. accum,fread flnm;(st-len),len do.
           accum=. accum{.~>:accum i. x [ continue=. 0
       else. 'continue st len flnm accum'=. x getFirstLine st;len;flnm;accum end.
   else. continue=. _1 end.   NB. Ran out of file w/o finding x.
   continue;st;len;flnm;accum
NB.EG hdr=. <;._1 TAB,(CR,LF) -.~ >_1{getFirstLine 0;10000;'bigFile.txt' NB. Assumes 1e4>#(1st line).
)
   6!:2 'hdr=. >_1{getFirstLine ''CIQ_G_History_New.txt'''
0.423876
   $hdr
285
   $<;._1 TAB,hdr
24
   5{.<;._1 TAB,hdr
+----+-------+-------------+------+-------+
|Date|$ticker|$company_name|$cusip|DAT_ITM|
+----+-------+-------------+------+-------+

The idea to expand this is that, contrary to the indication of the name, we could continue to read lines from the file by feeding the resulting pointer from one invocation to begin the next. However, we have already developed such a routine that we’ve covered in another meeting: an adverb called “doSomething”. The advantage of an adverb is that it leaves open a slot for a verb to do whatever arbitrary thing we want to with a file, within limits, as we shall soon see. This adverb is designed to work on text files that are conceptually large matrixes starting with a header line naming the columns.

This is implemented in the adverb by treating the first line of the file specially – essentially, we save this line to pass to subsequent invocations of the adverb after the first one. For this reason, as can been seen below, there are two major logic branches in the code for the adverb: the first branch segregates the first line into the variable “hdr” which is then returned as part of the output – to be passed on to subsequent invocations as the processing moves through the file – and is also offered as part of the argument to the verb supplied to “doSomething”.

Basic Adverb for Processing a Large File

doSomething=: 1 : 0
   'curptr chsz max flnm leftover hdr'=. 6{.y
   if. curptr>:max do. ch=. curptr;chsz;max;flnm
   else. if. 0=curptr do. ch=. readChunk curptr;chsz;max;flnm
           chunk=. leftover,CR-.~>_1{ch NB. Work up to last complete line.
           'chunk leftover'=. (>:chunk i: LF) split chunk  NB. LF-delimited lines
           'hdr body'=. (>:chunk i. LF) split chunk   NB. Assume 1st line header.
           hdr=. }:hdr            NB. Retain trailing partial line as "leftover".
       else. chunk=. leftover,CR-.~>_1{ch=. readChunk curptr;chsz;max;flnm
           'body leftover'=. (>:chunk i: LF) split chunk
       end.
       u body;<hdr
   end.
   (4{.ch),leftover;<hdr
NB.EG CTR [ ((10{a.)&(4 : 'CTR=: CTR + x +/ . = >0{y')) doSomething ^:_ ] 0x;1e6;(fsize 'bigFile.txt');'bigFile.txt' [ CTR=: 0
)

The example use shown in the “NB.EG” comment simply counts the number of LFs in a file, accumulating the result into the global variable “CTR”. Here’s an example of running this on a large file, along with a timing:

   load '~Code/workOnLargeFile.ijs'
   6!:2 '((10{a.)&(4 : ''CTR=: CTR + x +/ . = >0{y'')) doSomething ^:_ ] 0x;1e6;(fsize bigfl);bigfl [ CTR=: 0 [ bigfl=. ''CIQ_G_History_New.txt'''
4.38501
   CTR
1371898
   shell 'wc CIQ_G_History_New.txt'  NB. Verify result using "wc"
1371899 32901273 185661834 CIQ_G_History_New.txt

The difference of one is due to the header being treated separately.

Issues With Using Adverb

Following is an example of using “doSomething” for a real problem that raises some issues for considering how this idea might be better-built. The problem to be solved is that a large data file contains some entries with duplicate keys which consist of the first two columns. To locate the duplicated records, we process the file to extract the contents of these first two columns.

To do this, we first define a verb to return a matrix given a character vector with LF-delimited lines and TAB-separated fields.

   usuMatify=: (TAB,LF)&([: <;._1&> (0{[) ,&.> [: <;._2 ] (],[#~[~:[:{:])~ 1{ [)"0 1
   usuMatify ('hi',TAB,'ho'),LF,'har',TAB,'har'
+---+---+
|hi |ho |
+---+---+
|har|har|
+---+---+

The initially-evaluated verb phrase on the right - ( ],[#~[~:[:{:] ) – ensures that there is a trailing line-delimiter (internally generalized as " 1{[ " – the 1th character of the left argument) as this is expected by the line-partitioning phrase " <;._2 ". The tacit version of this right-most phrase was generated using J’s "explicit-to-tacit" verb like this:  13 : 'y,x#~x~:{:y' .

flnm=. 'CIQ_G_History_New.txt'
   accumKeys=: 3 : 'KEYS=: KEYS,2{."1 usuMatify y'
   6!:2 '(accumKeys >0{y) doSomething ^:_ ] 0x;5e6;(fsize flnm);flnm [ KEYS=: 0  2$'''''
30.8413
   $KEYS
1450459 2
   3{.KEYS
+--------+----+
|20140207|AIR |
+--------+----+
|20140207|AFAP|
+--------+----+
|20140207|AAL |
+--------+----+
   6!:2 'KEYS=: ,&.>/"1 KEYS'  NB. Combine keys into single item.
0.887262
   $KEYS
1450459
   $~.KEYS
1384524
   (-/,])(#,#@:~.) KEYS        NB. # duplicates, total # records, # unique recs
65935 1450459 1384524

When we want to remove the duplicates from the file, we find a difficulty with using our adverb which arises because of the external information we must use in the verb: the compression vector. We create this vector this way:

RMDUPS=: whUnq KEYS

Where

   whUnq      NB.* whUnq: 1 corresponds to unique item, 0 to duplicate one.
3 : '(/:/:y){1,2~:/\/:~y'

We implement the verb to remove duplicates using a number of global variables, not the most elegant way to do things.

rmDupes=: 3 : 0
   ixs=. RECCTR+i.#mm=. usuMatify y
   (unmatify (ixs{RMDUPS)#mm) fappend NEWFL
   RECCTR=: RECCTR+#ixs
)

We set up the necessary globals and run like this:

   fsize flnm=. 'CIQ_VM.txt'
413733741
   hdr=. >_1{getFirstLine flnm
   hdr fwrite NEWFL=: 'CIQ_VM_new.txt'   NB. Initialize new file with header.
283
   6!:2 '([: rmDupes [: > 0 { ]) doSomething ^:_ ] 0x;5e6;(fsize flnm);flnm [ RECCTR=: 0'
214.309

This last phrase shows that we can process the entire file in about 214 seconds.

Verify that the new file is smaller and does not have duplicate keys:

   fsize NEWFL
394026934
   6!:2 '(3 : ''KEYn=: KEYn,2{."1 usuMatify >0{y'') doSomething ^:_ ] 0x;5e6;(fsize NEWFL);NEWFL [ KEYn=: 0 2$'''''
74.5355
   6!:2 'KEYn=: ,&.>/"1 KEYn'
0.904184
   (-/,])(#,#@:~.) KEYn
0 1384592 1384592

One side-effect of processing these pieces of the file is left-over memory allocation:

   7!:3''
 128 2769329
 256    3200
 512      39
1024      65
   */"1]7!:3''
354474240 819200 19456 66560
   +/*/"1]7!:3''
355379456

Advanced topics: Optimization Software

Scott Locklin posted this note:

from:	 Scott Locklin scott_locklin@yahoo.com
to:	 programming@jsoftware.com
date:	 Fri, Feb 22, 2013 at 12:50 AM
subject: Re: [Jprogramming] deoptim and lbfgs minimization

For what it is worth, lbfgs works on linux. I stuck it on github, along with fortran source and Makefile in case it helps others. https://github.com/locklin/j-lbfgs

I also took the liberty of using the same code to do the box constrained version of the algorithm. The example is no good, and I think the loop needs to be fixed up, but it does call lbfgsb, and notice when you pass it infeasible initial conditions. Perhaps it is useful to someone: https://github.com/locklin/j-lbfgsb

I also got the flann library to mostly work for locality hashing, k-means trees and kd-trees for near neighbor searches. There's an issue with passing a raw pointer around (something I'd like to do for multiple online queries to a pre-sorted tree), so use it at your own risk. If anyone can spot what I'm doing wrong on the "buildtree/searchtree" examples, I'd appreciate it. It does calculate nearest neighbors, and very quickly. https://github.com/locklin/j-nearest-neighbor

I'll have a look at 64 bit lapack, as PCA/eigenvectors are important to me.

LBFGS

According to this,

The L-BFGS method solves the unconstrainted minimization problem,

    minimize F(x), x = (x1, x2, ..., xN),
 only if the objective function F(x) and its gradient G(x) are computable.

The well-known Newton's method requires computation of the inverse of the hessian matrix of the objective function. However, the computational cost for the inverse hessian matrix is expensive especially when the objective function takes a large number of variables. The L-BFGS method iteratively finds a minimizer by approximating the inverse hessian matrix by information from last m iterations. This innovation saves the memory storage and computational time drastically for large-scaled problems.

The other techniques Locklin mentions - k-means trees and kd-trees - are general methods for dealing with large amounts of multi-dimensional data. These methods allow us to specify clusters of points that are near each other to speed up locating nearest neighbors.

Learning and Teaching J

Language Studies

We looked at the results of some studies on language acquisition. One, which asserts that people instinctively organize language according to an internal semantic hierarchy, is seen by some as supporting the Chomskyan idea of an innate, Universal Grammar facility in human brains, though others point out that it could just as easily be an expression of general organizing principles of cognition. Another study measured blood flow in the brains of 24 newborn infants as they listened to recordings of nonsense syllables. This found that these infants appeared to distinguish between syllables that are more or less easily pronounceable in spite of presumably having had little exposure to speech. This was also seen as supporting the Chomskyan idea of innate language facilities, which would have implications for computer language design.

We also looked at an article, about another study, titled "Using a foreign language changes moral decisions". This article claims that a study by psychologists from the University of Chicago and Pompeu Fabra University in Barcelona "...that people using a foreign language take a relatively utilitarian approach to moral dilemmas, making decisions based on assessments of what's best for the common good. That pattern holds even when the utilitarian choice would produce an emotionally difficult outcome, such as sacrificing one life so others could live." This raises interesting ideas about the utility of reasoning with a (foreign) computational notation.

Materials