User:Devon McCormick/Code/WorkOnLargeFiles

From J Wiki

A slightly earlier version of the code below is explained in some detail here. The major change since that explanation was written is the "passedOn" parameter of the adverb "doSomething", which applies a verb across a large file in pieces. "doSomething" assumes that the file consists of LF-delimited rows, the first of which holds the column headers. The header row is saved and handed to the verb along with every piece of the file, including the pieces after the first, so a column can be located by name using the header row as a reference.
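The line and header handling described above can be sketched on an in-memory string. This is only an illustration, not part of the original script: the sample text and the names sample, whole and partial are made up, and split, TAB and LF are assumed to come from the J standard library.

```j
NB. Hypothetical sample: header row, one complete data row, one partial row.
sample=: 'date',TAB,'price',LF,'01/14/2015',TAB,'10.5',LF,'01/15'

NB. Keep everything up to the last LF; the rest waits for the next chunk.
'whole partial'=: (>: sample i: LF) split sample

NB. The first line is the header; drop its trailing LF.
'hdr body'=: (>: whole i. LF) split whole
hdr=: }: hdr
```

Here partial holds the incomplete row '01/15', which would be prepended to the next chunk read, while hdr and body hold the header line and the complete data rows.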

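The "passedOn" parameter lets the verb called by "doSomething" pass information to its subsequent invocations: whatever the verb returns for one piece of the file is handed back to it as "passedOn" when it is applied to the next piece. This might include things like file statistics or a running row count.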
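As a sketch of that protocol, a verb given to "doSomething" receives the boxed list body;hdr;passedOn and returns the value to hand on. The verb name countRows and the sample arguments are hypothetical, and the calls below invoke the verb directly rather than through "doSomething".

```j
NB. Hypothetical verb: running count of LF-terminated rows.
countRows=: 3 : 0
   'body hdr passedOn'=. y
   passedOn + +/ LF = body
)

hdr=: 'id',TAB,'price'
n1=: countRows ('a',TAB,'1',LF,'b',TAB,'2',LF);hdr;0   NB. first block, count starts at 0
n2=: countRows ('c',TAB,'3',LF);hdr;n1                  NB. next block receives n1
```

The initial "passedOn" value is supplied explicitly as 0 here, as in the script's own example; if it were left out, the verb would be adding to an empty value rather than to a number.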

An example that uses the script below to apply verbs across a large, tab-delimited file is given in this code.
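For a tab-delimited block, the parsing idiom used later in the script turns the text into a boxed table whose columns can be looked up by header name. The following is a minimal sketch with made-up column names and data; text, tbl and prices are hypothetical names.

```j
NB. Hypothetical header line and two data rows.
hdr=: <;._1 TAB,'date',TAB,'id',TAB,'price'
text=: '01/14/2015',TAB,'A1',TAB,'10.5',LF,'01/15/2015',TAB,'A1',TAB,'11',LF

NB. One boxed row per line, one boxed cell per tab-separated field.
tbl=: <;._1&> TAB ,&.> <;._2 text

NB. Pull out the "price" column by looking its name up in the header.
prices=: tbl {"1~ hdr i. <'price'
```

The cells stay as boxed character strings, so numeric columns still need converting before any arithmetic.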

NB.* workOnLargeFile.ijs: apply arbitrary verb across large file in blocks.

NB.* doSomething: do something to a large file in sequential blocks, by lines.
NB. Args: pointer to current location in file, size of chunk to read each time,
NB. size of file, name of file, [piece of a chunk left over from previous
NB. call, file header (first line), result of previous call to be passed on
NB. to next one].
doSomething=: 1 : 0
   'curptr chsz max flnm leftover hdr passedOn'=. 7{.y
   if. curptr>:max do. ch=. curptr;chsz;max;flnm
   else. if. 0=curptr do. ch=. readChunk curptr;chsz;max;flnm
           chunk=. leftover,CR-.~>_1{ch NB. Drop CRs so that lines are LF-delimited.
           'chunk leftover'=. (>:chunk i: LF) split chunk   NB. Keep complete lines; retain trailing partial line as "leftover".
           'hdr body'=. (>:chunk i. LF) split chunk    NB. Assume 1st line is header.
           hdr=. }:hdr                  NB. Drop header's trailing LF.
       else. chunk=. leftover,CR-.~>_1{ch=. readChunk curptr;chsz;max;flnm
           'body leftover'=. (>:chunk i: LF) split chunk
       end.
       passedOn=. u body;hdr;<passedOn  NB. Allow u's work to be passed on to next invocation
   end.
   (4{.ch),leftover;hdr;<passedOn
NB.EG ((10{a.)&(4 : '(>_1{y) + x +/ . = >0{y')) doSomething ^:_ ] 0x;1e6;(fsize 'bigFile.txt');'bigFile.txt';'';'';0  NB. Count LFs in file body (header line's LF excluded).
)

NB.* getFirstLine: get 1st line of tab-delimited file, along w/info
NB. to apply this repeatedly to get subsequent lines.
getFirstLine=: 3 : 0
   (10{a.) getFirstLine y     NB. Default to LF line-delimiter.
:
   if. 0=L. y do. y=. 0;10000;y;'' end.
   'st len flnm accum'=. 4{.y NB. Starting byte, length to read, file name,
   len=. len<.st-~fsize flnm  NB. any previous accumulation.
   continue=. 1               NB. Flag: 1=keep reading (initial value), 0=found x,
   if. 0<len do. st=. st+len  NB. _1=ran out of file without finding x.
       if. x e. accum=. accum,fread flnm;(st-len),len do.
           accum=. accum{.~>:accum i. x [ continue=. 0
       else. 'continue st len flnm accum'=. x getFirstLine st;len;flnm;accum end.
   else. continue=. _1 end.   NB. Ran out of file w/o finding x.
   continue;st;len;flnm;accum
NB.EG hdr=. <;._1 TAB,(CR,LF) -.~ >_1{getFirstLine 0;10000;'bigFile.txt' NB. Assumes 1e4>#(1st line).
)

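NB.* readChunk: read next chunk of file; returns updated pointer, number of
NB. bytes read, size of file, name of file, and the chunk itself.
readChunk=: 3 : 0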
   'curptr chsz max flnm'=. 4{.y
   if. 0<chsz2=. chsz<.0>.max-curptr do. chunk=. fread flnm;curptr,chsz2
   else. chunk=. '' end.
   (curptr+chsz2);chsz2;max;flnm;chunk
NB.EG chunk=. >_1{ch0=. readChunk 0;1e6;(fsize 'bigFile.txt');'bigFile.txt'
)

NB.* readChunk_egUse_: example (held as text) of processing one chunk by hand.
readChunk_egUse_=: 0 : 0
   ch0=. readChunk 0;1e6;(fsize 'bigFile.txt');'bigFile.txt'
   chunk=. CR-.~>_1{ch0
   'chunk leftover'=. (>:chunk i: LF) split chunk
   'hdr body'=. split <;._1&> TAB,&.><;._2 chunk
   body=. body#~-.a: e.~ body{"1~hdr i. <'PRCCD - Price - Close - Daily'
   unqids=. ~.ids=. ;&.><"1 body{"1~ hdr i. '$gvkey';'$iid'
   dts=. MDY2ymdNum&>0{"1 body
   (unqids textLine ids (<./,>./) /. dts) fappend 'IDsDateRanges.txt'
)

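NB.* accumDts2File: append (earliest, latest) date per ID in a block to a file.
NB. x is price column name; ID column name; output file name.
accumDts2File=: 4 : 0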
   'body hdr'=. y
   hdr=. <;._1 TAB,hdr
   'lkupPxs lkupID'=. hdr i. 2{.x [ outflnm=. >_1{x
   body=. <;._1&> TAB,&.><;._2 body
   body=. body#~-.a: e.~ body{"1~lkupPxs
   unqids=. ~.ids=. body{"1~lkupID
   dts=. MDY2ymdNum&>0{"1 body
   (unqids textLine ids (<./,>./) /. dts) fappend outflnm
NB.EG ('PRCCD - Price - Close - Daily - USD';'$issue_id';'IDsDateRanges.txt') accumDts2File body;<hdr
)

NB.* MDY2ymdNum: 'mm/dd/yyyy' -> yyyymmdd
MDY2ymdNum=: [: ". [: ; _1 |. [: <;._1 ] ,~ [: {. [: ~. '0123456789' -.~ ]
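
A quick check of MDY2ymdNum, together with the key idiom that accumDts2File uses to find each ID's date range, might look like the following sketch. The IDs and dates are invented for illustration, and the definition is repeated only so that the example runs on its own.

```j
NB. Same definition as in the script above, repeated so this runs on its own.
MDY2ymdNum=: [: ". [: ; _1 |. [: <;._1 ] ,~ [: {. [: ~. '0123456789' -.~ ]

NB. Hypothetical IDs and dates, one per row.
ids=: 'A1';'B2';'A1';'B2'
dts=: MDY2ymdNum&> '01/14/2015';'02/03/2015';'01/05/2015';'12/31/2014'

NB. Earliest and latest date for each distinct ID, in order of first appearance.
ranges=: ids (<./ , >./) /. dts
```

Because the verb simply rotates the year to the front and joins the digits, it assumes two-digit months and days; an input such as '1/5/2015' would not produce a valid yyyymmdd number.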

-- Devon McCormick, 2015-01-14T17:07:38-0200