NYCJUG/2013-11-12

From J Wiki
Jump to navigation Jump to search

weighted averages, empty arrays, randomly-generated J phrases, large file processing

Location:: The Heartland

Agenda

             Meeting Agenda for NYCJUG 20131112
             ----------------------------------
1. Beginner's regatta: comparing averaging methods: see "Weighted Moving
Averages.doc".


2. Show-and-tell: see "Adventures in Random J.doc".

Iterating through a dataset too large to process at once: see
"newBreakupFile.doc".


3. Advanced topics: J Conference 2014: see "J Conference 2014.doc".

See "Why Empty Arrays of Different Types are the Same.doc".


4. Learning, teaching and promoting J, et al.: report on d3.js workshop and
FinTech Hackathon: where J might aim.

See "Old schooled: You never stop learning like a child" and "People in their 
90s are Getting Smarter".

Beginner's regatta

Weighted Moving Averages

[from http://en.wikipedia.org/wiki/Moving_average#Weighted_moving_average]

Weighted moving average

WtdMovingAvgFromWikipedia.jpg

Exponential moving average

[from http://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average] ExponentMovingAvgFromWikipedia.jpg

n-period Exponential Weighted Mean

This differs from the Wikipedia definition as it explicitly specifies the window size n rather than leaving it ambiguous.

For every date in a time-series, we can specify a window to average the current point and the n-1 previous items. The values in this window are used to calculate the exponential weighted mean of those n values:

ExponentialWeightedMeanFormula.jpg

where

SumOfWeightsFormula.jpg

and where At is the input value at time t in the calculation window, base is the base weighting, and p is the power multiplier.

In this worked-out example, n=5, base=2, and p=0.4 for the observation date of 11/11/2013:

ExpWtdMeanDetailedWorkout 30p.jpg

Note that the result is shorter than the input by n-1.

Exponential Weighted Mean in J

Let's see what we can do using the values from the above example in a more array-oriented fashion.

   2^_0.4*|.i.5
0.329877 0.435275 0.574349 0.757858 1
   +/2^_0.4*|.i.5
3.09736
   (]%+/) 2^_0.4*|.i.5
0.106503 0.140531 0.185432 0.244679 0.322856
   15 17.5 22 20 18+/ . *(]%+/) 2^_0.4*|.i.5
18.8413

So, we could write this:

expWtdMean=: 4 : 0
   'n base pm'=. x
   base^(-pm)*|.i.n
)

But it seems not very J-like because of the fixed length. On reflection, we see that the length n is implicitly the window across which we apply the verb, so we can re-write it more generally like this:

expWtdMean=: 4 : 0
   'base pm'=. x
   y+/ . * (]%+/) base^(-pm)*|.i.#y
)

And apply it like this:

   5 (2 0.4&expWtdMean)\15 17.5 22 20 18
18.8413

So, can we approximate the simple moving average with the exponentially-weighted version?

   movingAvg=: 3 : '(([: -: ] * >:)#y)%~y+/ . *>:i.#y'
   3 movingAvg\ >:i.10
2.33333 3.33333 4.33333 5.33333 6.33333 7.33333 8.33333 9.33333
   3 (2 0.67&expWtdMean)\>:i.10
2.29897 3.29897 4.29897 5.29897 6.29897 7.29897 8.29897 9.29897
   3 (2 0.75&expWtdMean)\>:i.10
2.33182 3.33182 4.33182 5.33182 6.33182 7.33182 8.33182 9.33182
   rr=. 10?20
   3 movingAvg\ rr
6 11.3333 10.8333 12 6.66667 4.83333 2.16667 7
   3 (2 0.75&expWtdMean)\rr
6.14211 11.4071 10.5654 12.1999 6.44623 5.0229 2.09712 7.19081
   (2}.rr)%~(3 movingAvg\ rr)-3 (1.9 0.75&expWtdMean)\rr
_0.02084 0.005675 0.0304 _0.01684 0.06102 _0.0668 0.04873 _0.006125

So, to some degree of precision, it looks like we can approximate the one by the other.

   mean (2}.rr)%~(3 movingAvg\ rr)-3 (1.9 0.75&expWtdMean)\rr
0.004408
   mean (2}.rr)%~(3 movingAvg\ rr)-3 (1.9 0.72&expWtdMean)\rr
_0.000272

Adventures in Randomly-generated J Code

I set out to answer the question perhaps no one else has asked: in a random 10 by 10 table of J tokens, how many adjacent triples – vertically or horizontally – form valid J expressions? Continuing this question to longer sequences, how often does a random sequence of four J tokens form a valid expression? Of five? And so on.

This might seem to be a silly exercise but the exploration of this question led to some unexpected insights.

First of all, what is the set of valid J tokens and how might we determine valid combinations of these? The following simple verb helps us answer both of these questions.

checkPhrase=: 3 : 0
   try. ". y catch. y=. '' end.
   y
)

All this does is attempt to run the character string passed as its argument, returning the string if it succeeds, or an empty vector if it doesn’t.

Token J

We first use checkPhrase to identify all valid J tokens under the assumption that these are only one, two, or three characters long. We can generate all triplets like this but there are quite a few of them:

   $ { 3$<a.
256 256 256

This large an array can cause problems on a machine with only 3 GB:

   $goodj=. ~.a:-.~checkPhrase &.> a:-.~' '-.~&.>,{ 3$<a.
|out of memory
|   $goodj=.~.a:-.~checkPhrase&.>a:-.~' '    -.~&.>,{3$<a.

  7!:3''
  64  829
 128  180
 256  215
 512 2042
1024   56

A simplifying assumption is that the last and second-to-last character is always a dot or a colon and there are no spaces in a valid token:

   $goodj=. ~.a:-.~checkPhrase &.> a:-.~' '-.~&.>, { a.;' .:';' .:'
200

So we have about 200 tokens to start with. However, upon inspection, there are a number we probably don’t care about, so we’ll eliminate them.

   goodj=. ~.goodj-.&.>TAB                   NB. TAB is like whitespace
   goodj=. goodj-.a:                         NB. Any empties?
   goodj=. goodj-.,&.>Alpha_j_               NB. A plain letter?
   $goodj=. goodj-.,&.>'0123456789',&.>'.'   NB. numeral, ‘.’, e.g. “1.”?
136

A manual count of tokens on the NuVoc page on the J wiki turns up about 143 valid J tokens, so this is most of them; it does not include a few that may have been added since I did this exercise.

What are the lengths of these tokens?

   ~.#&>goodj
1 2 3
   +/"1 =#&>goodj       NB. How many of each number?
39 92 5
   goodj#~3=#&>goodj    NB. What are the triplets?
+---+---+---+---+---+
|&.:|p..|{::|}::|NB.|
+---+---+---+---+---+

Counting Phrases – Some Surprises

I found a number of surprises in this exploration, even at this early stage of simply determining the valid tokens. One of the single character tokens indicates a minor bug in J:

   232{a.
Φ
   Φ 5

   Φ 'hi'

   Φ / i.10

So, there appears to be a special, high-bit character that doesn’t really do anything. Since this is a fluke, we’ll eliminate it. The next set of surprises is perhaps not so surprising to someone who understands the power of J: very many random combinations form valid J phrases and some of these phrases can be deadly.

I started out by creating a 10 by 10 table of tokens,

genMat=: [ $ ] {~ ([: */ [) ?@$ [: # ]
NB.EG samp=. 10 10 genMat goodj

Then, I applied checkPhrase to all the horizontal triples:

   #&>([: checkPhrase ;)&.>3<\"1 samp NB. Any non-zero length is a good combo.
6 6 0 6 0 0 6 5
4 5 0 0 0 0 0 3
6 0 0 0 0 6 6 6
6 7 0 0 0 5 0 4
0 0 0 0 0 5 5 0
0 0 0 0 0 0 0 0
6 6 6 7 0 6 5 0
0 4 5 0 0 0 0 5
5 0 0 0 0 3 0 5
4 5 6 6 0 0 0 5

So, at first glance, it looks like it will be easy to count up all the triples and longer phrases:

   countGood=: 13 : '0+/ . ~:|:#&>checkPhrase &.> ;&.>x <\"1 y'
   (3+i.8) countGood samp  NB. Check for good phrases lengths 3 through 10
5 3 4 4 2 0 6 3 3 5
2 1 2 2 1 0 5 1 0 3
1 0 1 1 0 0 4 0 0 2
0 0 0 0 0 0 4 0 0 1
0 0 0 0 0 0 3 0 0 0
0 0 0 0 0 0 2 0 0 0
0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0

This table is eight (number of phrase lengths) by ten (number of rows in our sample). Notice how there are a few hits even for some of the longer phrase lengths. So, now all I have to do is run this a few thousand times to get some interesting statistics. However, the practice turns out not to be quite so simple:

   6!:2 'stats=. (<3+i.8) countGood &.> samps'
|domain error: checkPhrase
|       0:M.a:

It turns out there are some phrases that break our simple method of validity checking. Even more annoying, there are some perfectly valid phrases that cause J to do an inordinate amount of work or attempt to allocate more memory than is possible. One of the simplest of this latter class is this simple phrase:

   NB. ,:/a.         NB. Refrain from attempting this: here’s why:
   $,:/5{.a.         NB. How large is result with much shortened argument?
2 2 2 2 1
   $,:/10{.a.        NB. How does result grow with size of argument?
2 2 2 2 2 2 2 2 2 1
   #$,:/10{.a.       NB. So, result of full argument would have 256 dimensions.
10

It seems there are any number of problematic expressions that are not invalid. Through trial and error, I identified a number of these, some of which appear to belong to families of troublesome expressions. Those I’ve identified so far, I assign to a noun badPhrases, which I use to curry a verb rmBadPhrases, thus:

rmBadPhrases=: badPhrases&(] #~&.> [: -. [: +./ [: +./&> E.&.>/)
   
rmBadPhrases_test_=: 3 : 0
   samp=. ('_';'+.';'_.') (0,&.>i.3)}10 10 genMat goodj
   assert. a:-:(<0 0){rmBadPhrases 3 ([: ;&.> <\"1) samp
)

Where ‘_+._.’, used in the test here, is one of the first bad phrases I identified. A number of them are not bad phrases until their argument exceeds a certain value, e.g.

NB. *.^:x:24 -> "out of memory"
NB. $^:!<:@~.12  NB. OOME >:12

How Many are Good?

So, by excluding bad phrases and families of bad phrases, I was able to come up with some statistics on how likely a random J phrase is to be valid, checking the horizontal and vertical of 1,000 ten by tens.

   tit=. 'Length';'Min';'Max';'Mean';'SD'
   load 'dsv'
   sumstats=. usus"1 (+/>stats),.+/>stats11
   (":&.>tit,<"0 (3+i.8),.sumstats) writedsv 'stats2K.txt';TAB;''
238
   fread 'stats2K.txt'
Length	Min	Max	Mean		SD
3		3838	4004	3927.15	51.4119
4		2494	2639	2568.7	45.17
5		1687	1838	1757.25	43.2312
6		1089	1207	1148.4	36.6583
7		693	787	742.2		26.4587
8		399	482	447.15	23.1659
9		213	272	246.2		15.6192
10		86	121	101.85	9.65878

   |.>:i.8       NB. # of each length substring/row.
8 7 6 5 4 3 2 1
   2*10*|.>:i.8  NB. Horizontal, vertical * # rows * # substrings of each length
160 140 120 100 80 60 40 20
   (2{"1 sumstats)%1000*2*10*|.>:i.8
0.0245447 0.0183479 0.0146438 0.011484 0.0092775 0.0074525 0.006155 0.0050925
   ff=. (2{"1 sumstats)%1000*2*10*|.>:i.8
   2%~/\ff
0.747529 0.798118 0.784225 0.807863 0.803288 0.825897 0.827376

Bad to the Bone

Here are some of the very “bad” phrases I’ve uncovered so far:

badPhrases=: ' '-.~&.><;._2 ] 0 : 0
,:/a.
$M.a:
,:^:]a:
$^:!<:@~.
_+._.
A.!9
A.!8
*._.
+._.
*.^:x:
p.[~:a.
>M.@{_.
%.H.p:,._.
)
NB. *.^:x:24 -> "out of memory"
NB. $^:!<:@~.12  NB. OOME >:12
NB. ,:^:]a:      NB. OOME

vn=: ' '-.~&.>(<'_0:')-.~,{' _';'0123456789';<':'        NB. all verbal numbers
NB. badPhrases=: badPhrases,,;&.>{(<'*f.H.');vn;<<'_.'   NB. e.g. *f.H.1:_.
NB. badPhrases=: badPhrases,,}.{vn;(<'H.');vn;<<'_.'     NB. e.g. _2:H.0:_.
badPhrases=: badPhrases,,;&.>{ (<'H.');vn;<<'_.'         NB. e.g. H._1:_.
badPhrases=: badPhrases,,(<'_.*.'),&.>'123456789'        NB. e.g. _.*.1
badPhrases=: badPhrases,,}.vn,&.><'H.{:_.'               NB. e.g. 1:H.{:_.
badPhrases=: badPhrases,,}.vn,&.><'H.{._.'               NB. e.g. 1:H.{._.
badPhrases=: badPhrases,vn,&.><'M.a:'                    NB. e.g. 1:M.a:
badPhrases=: badPhrases,vn,&.><'M.a:'                    NB. e.g. 1:M.a:
badPhrases=: badPhrases,'6789',~&.><'q:^:|:^*:'          NB. e.g. q:^:|:^*:7
badPhrases=: badPhrases,(<',:"'),&.>(vn,,&.>'0123456789'),&.><'/a.' NB. e.g. ,:"3/a.

Show-and-Tell

Working with Large Files in Pieces

In order to work with a file too large to fit into memory in one piece, we develop a verb to break it into pieces and an adverb to apply an arbitrary verb across the file. In this case, our objective is break a large file into small pieces to facilitate transmission of it, then re-assemble the pieces on the target machine to re-create our original, large file.

NB.* breakUpFile: inner verb to break apart file into smaller pieces.
breakUpFile=: 4 : 0
   'curptr chsz max flnm ctr'=. 5{.y
   if. curptr>:max do. ch=. (curptr;chsz;max;flnm;'');ctr
   else. ch=. readChunk curptr;chsz;max;flnm
       x writeFilePiece (>{:ch);ctr
       ch=. ch;>:ctr
   end.
   ch
NB.EG ('pfx';'.suf')&breakUpFile ^:_ ] 0;1e6;(fsize 'big.dat');'big.dat';0
)

writeFilePiece=: 4 : 0
   'pfx suff'=. x [ 'ch ctr'=. y
   ch fwrite pfx,(":ctr),suff
)

NB.* doSomething: do something to a large file in sequential blocks.
doSomething=: 1 : 0
   'curptr chsz max flnm leftover hdr'=. 6{.y
   if. curptr>:max do. ch=. curptr;chsz;max;flnm
   else. if. 0=curptr do. ch=. readChunk curptr;chsz;max;flnm
           chunk=. leftover,CR-.~>_1{ch
           'chunk leftover'=. (>:chunk i: LF) split chunk
           'hdr body'=. (>:chunk i. LF) split chunk
           hdr=. }:hdr
       else. chunk=. leftover,CR-.~>_1{ch=. readChunk curptr;chsz;max;flnm
           'body leftover'=. (>:chunk i: LF) split chunk
       end.
       u body;<hdr
   end.
   (4{.ch),leftover;<hdr
NB.EG (('PRCCD - Price - Close - Daily - USD';'$issue_id';'IDsDateRanges-Daily.txt')&accumDts2File) doSomething ^:_ ] 0;1e6;(fsize 'gvkeyIID-USD.txt');'gvkeyIID-USD.txt'
NB.EG (('PRCCD - Price - Close - Daily';'IDsDateRanges.txt')&accumDts2File) doSomething ^:_ ] 0;1e6;(fsize 'GvkeyIID.txt');'GvkeyIID.txt'
)

readChunk=: 3 : 0
   'curptr chsz max flnm'=. 4{.y
   if. 0<chsz2=. chsz<.0>.max-curptr do. chunk=. fread flnm;curptr,chsz2
   else. chunk=. '' end.
   (curptr+chsz2);chsz2;max;flnm;chunk
NB.EG chunk=. >_1{ch0=. readChunk 0;1e6;(fsize 'GvkeyIID.txt');'GvkeyIID.txt'
)

readChunk_egUse_=: 0 : 0
   ch0=. readChunk 0;1e6;(fsize 'GvkeyIID.txt');'GvkeyIID.txt'
   chunk=. CR-.~>_1{ch0
   'chunk leftover'=. (>:chunk i: LF) split chunk
   'hdr body'=. split <;._1&> TAB,&.><;._2 chunk
   body=. body#~-.a: e.~ body{"1~hdr i. <'PRCCD - Price - Close - Daily'
   unqids=. ~.ids=. ;&.><"1 body{"1~ hdr i. '$gvkey';'$iid'
   dts=. MDY2ymdNum&>0{"1 body
   (unqids textLine ids (<./,>./) /. dts) fappend 'IDsDateRanges.txt'
)

Still to Do

We need to create a batch file with the commands to re-assemble the pieces into the original file. Here's an example of doing this manually.

First, we group the assembly of the smallest pieces into intermediate files, in order.

   $nmlst=. 0{"1 dir 'Bridge*.dat'
100
   3{.nmlst
+-----------+-----------+------------+
|Bridge0.dat|Bridge1.dat|Bridge10.dat|
+-----------+-----------+------------+
   nmlst=. nmlst /: ".&>6}.&.>_4}.&.>nmlst   NB. Order by numeric portion
   _3{.nmlst
+------------+------------+------------+
|Bridge97.dat|Bridge98.dat|Bridge99.dat|
+------------+------------+------------+
   #&>nmlst
11 11 11 11 11 11 11 11 11 11 12 12 12 12 12 12 12 12 12 12 12 12 12 12...

We need to account for the length of the start of the command, its result and each of the small file names separated by plus signs - "+" is the DOS "copy" command concatenation symbol.

   #st=. 'copy /b ',end=. 'Br000.tmp'
17
   255-17             NB. Maximum line length is 255
238
   12%~255-17         NB. How many intermediate joins can we do per line?
19.8333
   +/ptn=. (#nmlst)$19{.1
6
   #ptn
100
   bb=. ptn<;.1 nmlst
join1=: 3 : 0
   'nms outnm ctr'=. y
   nms=. >nms
   ('copy /b '),(}.;'+',&.>nms),' ',(outnm{.~outnm i. '.'),(":ctr),outnm}.~outnm i. '.'
)

Check that this works as we think it ought to.

   join1 (0{bb);'Br.tmp';0
copy /b Bridge0.dat+Bridge1.dat+Bridge2.dat+Bridge3.dat+Bridge4.dat+Bridge5…
   ;LF,~&.>join1 &.> (<"0 bb);&.>(<'Br.tmp');&.>i.+/ptn
copy /b Bridge0.dat+Bridge1.dat+Bridge2.dat+Bridge3.dat+…+Bridge18.dat Br0.tmp
copy /b Bridge19.dat+Bridge20.dat+Bridge21.dat+Bridge22.dat+...
…
copy /b Bridge95.dat+Bridge96.dat+Bridge97.dat+Bridge98.dat+Bridge99.dat Br5.tmp
   (;LF,~&.>join1 &.> (<"0 bb);&.>(<'Br.tmp');&.>i.+/ptn) fwrite 'Assemble529.bat'
1386

Now we have to do the same thing at the next level: join together the intermediate files that are the aggregates of the smallest pieces.

   (<'.tmp'),~&.>(<'Br'),&.>":&.>i.6
+-------+-------+-------+-------+-------+-------+
|Br0.tmp|Br1.tmp|Br2.tmp|Br3.tmp|Br4.tmp|Br5.tmp|
+-------+-------+-------+-------+-------+-------+
   (LF,~'copy /b ',(}.;'+',&.>(<'.tmp'),~&.>(<'Br'),&.>":&.>i.6),' 5.2.9_Clarifi_BridgeInstaller.exe') fappend 'Assemble529.bat'
90

Again, an example from the top, for another file. First, we break down the large file named by finalNm into two million byte pieces with names of the form "PatchN.dat" where "N" is a sequence number.

   finalNm=. '5.2.9_Clarifi_PatchInstaller.exe'
   ('Patch';'.dat')&breakUpFile ^:_ ] 0;2e6;(fsize finalNm);finalNm;'';0
+---------+-----+---------+--------------------------------+...
|396056125|56125|396056125|5.2.9_Clarifi_PatchInstaller.exe|...
+---------+-----+---------+--------------------------------+...

Now, get the list of names of the small pieces.

   nmlst=. 0{"1 dir 'Patch*.dat'
   #'Patch'
5

Check that the file names are in numeric order (by the number embedded in the file name).

   11{.nmlst=. nmlst /: ".&>5}.&.>_4}.&.>nmlst
+----------+----------+----------+----------+----------+----------+--...
|Patch0.dat|Patch1.dat|Patch2.dat|Patch3.dat|Patch4.dat|Patch5.dat|Pa...
+----------+----------+----------+----------+----------+----------+--...
   _11{.nmlst=. nmlst /: ".&>5}.&.>_4}.&.>nmlst
+------------+------------+------------+------------+------------+---...
|Patch188.dat|Patch189.dat|Patch190.dat|Patch191.dat|Patch192.dat|Pat...
+------------+------------+------------+------------+------------+---...

Check the sizes of these names and use the longest to calculate how many we can group to assemble the intermediate pieces.

   #&>nmlst
10 10 10 10 10 10 10 10 10 10 11 11 11 11 11 11 11 11 11 11 11 11 11 ...
   11%~255-17
21.6364

Partition the name list into sufficiently short groups so we can build the commands within the 255-character limit.

   +/ptn=. (#nmlst)$20{.1
10
   bb=. ptn<;.1 nmlst
   qts''
2013 10 21 11 38 18.048

Generate the first level of commands to assemble the smallest files into intermediate, larger files.

   (;LF,~&.>join1 &.> (<"0 bb);&.>(<'Pa.tmp');&.>i.+/ptn) fappend 'Assemble529.bat'
2637

Generate the final assembly of the intermediate pieces into the original file.

   (LF,~'copy /b ',(}.;'+',&.>(<'.tmp'),~&.>(<'Pa'),&.>":&.>i.+/ptn),' ',finalNm) fappend 'Assemble529.bat'
121

Remember to put together "send.ftp" file to transmit all the pieces over to the target machine.

Advanced Topics

Learning, teaching and promoting J

Materials

-- Devon McCormick <<DateTime>>