NYCJUG/2013-11-12


weighted averages, empty arrays, randomly-generated J phrases, large file processing


Location: The Heartland

Agenda

             Meeting Agenda for NYCJUG 20131112
             ----------------------------------
1. Beginner's regatta: comparing averaging methods: see "Weighted Moving
Averages.doc".


2. Show-and-tell: see "Adventures in Random J.doc".

Iterating through a dataset too large to process at once: see
"newBreakupFile.doc".


3. Advanced topics: J Conference 2014: see "J Conference 2014.doc".

See "Why Empty Arrays of Different Types are the Same.doc".


4. Learning, teaching and promoting J, et al.: report on d3.js workshop and
FinTech Hackathon: where J might aim.

See "Old schooled: You never stop learning like
a child" and "People in their 90s are Getting Smarter".

Beginner's regatta

Show-and-Tell

Working with Large Files in Pieces

To work with a file too large to fit into memory in one piece, we develop a verb to break it into pieces and an adverb to apply an arbitrary verb across the file. In this case, our objective is to break a large file into small pieces to facilitate its transmission, then re-assemble the pieces on the target machine to re-create the original large file.
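The verbs below are driven by the power conjunction ^:_ which re-applies a verb to its own result until the result stops changing. The following toy sketch shows that pattern without any file I/O; stepDemo and its state layout are hypothetical names chosen for illustration, not part of the code that follows.

```j
NB. Toy sketch (hypothetical names): walk through a character list in fixed-size
NB. pieces. State is  pointer ; piece size ; data ; boxed pieces collected so far.
stepDemo=: 3 : 0
   'ptr sz data acc'=. y
   if. ptr>:#data do. st=. y                    NB. Done: return state unchanged
   else. piece=. (sz<.(#data)-ptr){.ptr}.data   NB. Next piece, short at the end
       st=. (ptr+#piece);sz;data;<acc,<piece
   end.
   st
)

final=: stepDemo^:_ ] 0;4;'abcdefghij';<0$a:
pieces=: >{:final            NB. 'abcd';'efgh';'ij'
```

Once the pointer reaches the end of the data, the verb returns its argument unchanged, which is what lets ^:_ terminate.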

NB.* breakUpFile: inner verb to break apart file into smaller pieces.
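breakUpFile=: 4 : 0
   'curptr chsz max flnm chunk ctr'=. 6{.y     NB. State: pointer, chunk size, file size, name, last chunk, piece counter
   if. curptr>:max do. ch=. curptr;chsz;max;flnm;'';ctr   NB. Done: state stops changing, so ^:_ ends
   else. ch=. readChunk curptr;chsz;max;flnm
       x writeFilePiece (>{:ch);ctr
       ch=. (4{.ch),'';>:ctr                   NB. Drop chunk from state, increment piece counter
   end.
   ch
NB.EG ('pfx';'.suf')&breakUpFile ^:_ ] 0;1e6;(fsize 'big.dat');'big.dat';'';0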
)

NB.* writeFilePiece: write one chunk to the file named pfx,counter,suff.
writeFilePiece=: 4 : 0
   'pfx suff'=. x [ 'ch ctr'=. y
   ch fwrite pfx,(":ctr),suff
)

NB.* doSomething: do something to a large file in sequential blocks.
doSomething=: 1 : 0
   'curptr chsz max flnm leftover hdr'=. 6{.y
   if. curptr>:max do. ch=. curptr;chsz;max;flnm
   else. if. 0=curptr do. ch=. readChunk curptr;chsz;max;flnm
           chunk=. leftover,CR-.~>_1{ch
           'chunk leftover'=. (>:chunk i: LF) split chunk
           'hdr body'=. (>:chunk i. LF) split chunk
           hdr=. }:hdr
       else. chunk=. leftover,CR-.~>_1{ch=. readChunk curptr;chsz;max;flnm
           'body leftover'=. (>:chunk i: LF) split chunk
       end.
       u body;<hdr
   end.
   (4{.ch),leftover;<hdr
NB.EG (('PRCCD - Price - Close - Daily - USD';'$issue_id';'IDsDateRanges-Daily.txt')&accumDts2File) doSomething ^:_ ] 0;1e6;(fsize 'gvkeyIID-USD.txt');'gvkeyIID-USD.txt'
NB.EG (('PRCCD - Price - Close - Daily';'IDsDateRanges.txt')&accumDts2File) doSomething ^:_ ] 0;1e6;(fsize 'GvkeyIID.txt');'GvkeyIID.txt'
)
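The key step in doSomething is cutting each chunk just after its last LF, so that a record split across a chunk boundary is carried forward and prefixed to the next chunk. A minimal sketch of that step on made-up data follows; cutAtLastLF is a hypothetical name, and split is assumed to be the standard-library verb ({. ,&< }.) already used above.

```j
NB. Sketch with made-up data; cutAtLastLF is a hypothetical name.
NB. split is the standard-library verb  {. ,&< }.
cutAtLastLF=: 3 : '(>:y i: LF) split y'

chunk1=: 'id,price',LF,'A,1.5',LF,'B,2'       NB. Chunk ends mid-record
'body1 left1'=: cutAtLastLF chunk1            NB. body1 holds 2 whole lines; left1 is 'B,2'
chunk2=: left1,'.25',LF,'C,3',LF              NB. Prepend leftover to the next chunk
'body2 left2'=: cutAtLastLF chunk2            NB. body2 is 'B,2.25',LF,'C,3',LF; left2 is empty
```

The sketch assumes every chunk holds at least one LF. If a chunk had none, i: would return the chunk length and the take would overrun by one, padding the body with a blank.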

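NB.* readChunk: read the next chunk of a file; return the updated state with the chunk as its last item.
readChunk=: 3 : 0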
   'curptr chsz max flnm'=. 4{.y
   if. 0<chsz2=. chsz<.0>.max-curptr do. chunk=. fread flnm;curptr,chsz2
   else. chunk=. '' end.
   (curptr+chsz2);chsz2;max;flnm;chunk
NB.EG chunk=. >_1{ch0=. readChunk 0;1e6;(fsize 'GvkeyIID.txt');'GvkeyIID.txt'
)

readChunk_egUse_=: 0 : 0
   ch0=. readChunk 0;1e6;(fsize 'GvkeyIID.txt');'GvkeyIID.txt'
   chunk=. CR-.~>_1{ch0
   'chunk leftover'=. (>:chunk i: LF) split chunk
   'hdr body'=. split <;._1&> TAB,&.><;._2 chunk
   body=. body#~-.a: e.~ body{"1~hdr i. <'PRCCD - Price - Close - Daily'
   unqids=. ~.ids=. ;&.><"1 body{"1~ hdr i. '$gvkey';'$iid'
   dts=. MDY2ymdNum&>0{"1 body
   (unqids textLine ids (<./,>./) /. dts) fappend 'IDsDateRanges.txt'
)

Still to Do

We need to create a batch file with the commands to re-assemble the pieces into the original file. Here's an example of doing this manually.

First, we group the assembly of the smallest pieces into intermediate files, in order.

   $nmlst=. 0{"1 dir 'Bridge*.dat'
100
   3{.nmlst
+-----------+-----------+------------+
|Bridge0.dat|Bridge1.dat|Bridge10.dat|
+-----------+-----------+------------+
   nmlst=. nmlst /: ".&>6}.&.>_4}.&.>nmlst   NB. Order by numeric portion
   _3{.nmlst
+------------+------------+------------+
|Bridge97.dat|Bridge98.dat|Bridge99.dat|
+------------+------------+------------+
   #&>nmlst
11 11 11 11 11 11 11 11 11 11 12 12 12 12 12 12 12 12 12 12 12 12 12 12...

We need to account for the length of the start of the command ("copy /b "), the name of its result file, and each of the small file names separated by plus signs; "+" is the concatenation symbol for the DOS "copy" command.

   #st=. 'copy /b ',end=. 'Br000.tmp'
17
   255-17             NB. Maximum line length is 255
238
   12%~255-17         NB. How many intermediate joins can we do per line?
19.8333
   +/ptn=. (#nmlst)$19{.1
6
   #ptn
100
   bb=. ptn<;.1 nmlst

NB.* join1: build one "copy /b" command joining a group of names into a numbered output file.
join1=: 3 : 0
   'nms outnm ctr'=. y
   nms=. >nms
   ('copy /b '),(}.;'+',&.>nms),' ',(outnm{.~outnm i. '.'),(":ctr),outnm}.~outnm i. '.'
)

Check that this works as we think it ought to.

   join1 (0{bb);'Br.tmp';0
copy /b Bridge0.dat+Bridge1.dat+Bridge2.dat+Bridge3.dat+Bridge4.dat+Bridge5…
   ;LF,~&.>join1 &.> (<"0 bb);&.>(<'Br.tmp');&.>i.+/ptn
copy /b Bridge0.dat+Bridge1.dat+Bridge2.dat+Bridge3.dat+…+Bridge18.dat Br0.tmp
copy /b Bridge19.dat+Bridge20.dat+Bridge21.dat+Bridge22.dat+...
…
copy /b Bridge95.dat+Bridge96.dat+Bridge97.dat+Bridge98.dat+Bridge99.dat Br5.tmp
   (;LF,~&.>join1 &.> (<"0 bb);&.>(<'Br.tmp');&.>i.+/ptn) fwrite 'Assemble529.bat'
1386
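The division by 12 above counts only the name lengths, but each name after the first also costs one character for its "+" separator. As a rough check, the sketch below measures the command lines for a synthetic name list. It assumes names Bridge0.dat through Bridge99.dat, 19 names per group, and single-digit output names such as Br0.tmp; cmdLen and maxPer are hypothetical names.

```j
NB. Sketch: lengths of the generated command lines for a synthetic name list.
NB. Assumptions: names Bridge0.dat .. Bridge99.dat, 19 names per group,
NB. and single-digit output names such as Br0.tmp.
names=: (<'Bridge'),&.>(":&.>i.100),&.><'.dat'
groups=: ((#names)$19{.1) <;.1 names

NB. 8 for 'copy /b ', 8 for ' BrN.tmp', the names themselves, one '+' between each pair
cmdLen=: 3 : '16+(+/#&>y)+<:#y'
lens=: cmdLen&> groups       NB. 252 262 262 262 262 80

NB. Largest group size guaranteed to fit 255 characters with 12-character names
maxPer=: <.13%~1+255-16      NB. 18
```

Under these assumptions the middle lines come to 262 characters, so if the 255-character limit must hold strictly, a group size of 18 would be the safer choice.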

Now we have to do the same thing at the next level: join together the intermediate files that are the aggregates of the smallest pieces.

   (<'.tmp'),~&.>(<'Br'),&.>":&.>i.6
+-------+-------+-------+-------+-------+-------+
|Br0.tmp|Br1.tmp|Br2.tmp|Br3.tmp|Br4.tmp|Br5.tmp|
+-------+-------+-------+-------+-------+-------+
   (LF,~'copy /b ',(}.;'+',&.>(<'.tmp'),~&.>(<'Br'),&.>":&.>i.6),' 5.2.9_Clarifi_BridgeInstaller.exe') fappend 'Assemble529.bat'
90

Here is the whole process again, from the top, for another file. First, we break the large file named by finalNm into two-million-byte pieces with names of the form "PatchN.dat", where "N" is a sequence number.

   finalNm=. '5.2.9_Clarifi_PatchInstaller.exe'
   ('Patch';'.dat')&breakUpFile ^:_ ] 0;2e6;(fsize finalNm);finalNm;'';0
+---------+-----+---------+--------------------------------+...
|396056125|56125|396056125|5.2.9_Clarifi_PatchInstaller.exe|...
+---------+-----+---------+--------------------------------+...
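Before building the batch file, it may be worth confirming that the pieces account for every byte of the original. The sketch below compares the original's size with the summed sizes of the pieces; sizesMatch and the throwaway file names are hypothetical, and the demonstration writes small files to the J temp directory rather than using the installer file.

```j
NB. Sketch: confirm that the pieces account for every byte of the original.
NB. sizesMatch is a hypothetical name; x is the original file, y the boxed piece names.
sizesMatch=: 4 : '(fsize x) = +/ fsize&> y'

NB. Small self-contained demonstration using throwaway files in the J temp directory
orig=: jpath '~temp/sizeDemo.dat'
parts=: jpath&.> '~temp/sizeDemo0.dat';'~temp/sizeDemo1.dat'
'abcdefghij' fwrite orig
'abcdef' fwrite 0{parts
'ghij' fwrite 1{parts
ok=: orig sizesMatch parts      NB. 1 when the byte counts agree
ferase orig;parts
```

For the real data, the equivalent call would be finalNm sizesMatch nmlst once the list of piece names has been built.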

Now, get the list of names of the small pieces.

   nmlst=. 0{"1 dir 'Patch*.dat'
   #'Patch'
5

Sort the file names into numeric order (by the number embedded in each file name), then check the first and last few.

   11{.nmlst=. nmlst /: ".&>5}.&.>_4}.&.>nmlst
+----------+----------+----------+----------+----------+----------+--...
|Patch0.dat|Patch1.dat|Patch2.dat|Patch3.dat|Patch4.dat|Patch5.dat|Pa...
+----------+----------+----------+----------+----------+----------+--...
   _11{.nmlst=. nmlst /: ".&>5}.&.>_4}.&.>nmlst
+------------+------------+------------+------------+------------+---...
|Patch188.dat|Patch189.dat|Patch190.dat|Patch191.dat|Patch192.dat|Pat...
+------------+------------+------------+------------+------------+---...

Check the lengths of these names and use them to estimate how many we can group on one line when assembling the intermediate pieces.

   #&>nmlst
10 10 10 10 10 10 10 10 10 10 11 11 11 11 11 11 11 11 11 11 11 11 11 ...
   11%~255-17
21.6364

Partition the name list into sufficiently short groups so we can build the commands within the 255-character limit.

   +/ptn=. (#nmlst)$20{.1
10
   bb=. ptn<;.1 nmlst
   qts''
2013 10 21 11 38 18.048
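The partition relies on the cut conjunction: in x <;.1 y, each 1 in the boolean list x starts a new boxed group of y. A minimal sketch on a short character list, with hypothetical names grp and ngrp, shows the shape of the result.

```j
NB. Sketch: how a boolean "frets" list drives the cut  <;.1
ptn=: 7$3{.1              NB. 1 0 0 1 0 0 1 : a 1 starts each new group
grp=: ptn <;.1 'abcdefg'  NB. 'abc';'def';(,'g')
ngrp=: +/ptn              NB. 3, the number of groups
```

The last group is simply whatever remains, which is why the final command line in each batch joins fewer files than the others.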

Generate the first level of commands to assemble the smallest files into intermediate, larger files.

   (;LF,~&.>join1 &.> (<"0 bb);&.>(<'Pa.tmp');&.>i.+/ptn) fappend 'Assemble529.bat'
2637

Generate the final assembly of the intermediate pieces into the original file.

   (LF,~'copy /b ',(}.;'+',&.>(<'.tmp'),~&.>(<'Pa'),&.>":&.>i.+/ptn),' ',finalNm) fappend 'Assemble529.bat'
121

Remember to put together a "send.ftp" file to transmit all the pieces to the target machine.

Advanced Topics

Learning, teaching and promoting J

Materials

-- Devon McCormick