User:Devon McCormick/Code/largeFileVet

From J Wiki
Jump to: navigation, search

Vetting a Large File

The following provides an example of using the code to work with large files on a tab-delimited file with column headers as the initial row. In this case, we assume that some combination of columns (Date and security ID) comprises a key for which there should be no duplicate entries.

This code is in transition from using an earlier version of the "doSomething" adverb to the current version which has added the "passedOn" parameter. This parameter allows us to write the verb applied by "doSomething" to blocks of a file without needing global variables to retain information between calls to the verb: any such global information is now passed on to the subsequent invocations of the verb as the >2{y element of argument from "doSomething". See the old and new definitions of "rmDupes" below for an example of the change from using globals to using the "passedOn" parameter.

The newer version of "doSomething" is backwards-compatible with the older one.

NB.* largeFileVet.ijs: routines to vet large tab-delimited file with column headers as the initial row.

require '~Code/workOnLargeFile.ijs'

NB.* cvtYMD2MsDsY: '20140923' -> '09/23/2014'
cvtYMD2MsDsY=: '/'&(13 : '}.;x,&.>1|.1 0 0 0 1 0 1 0 <;.1 y')

saveBadKeyRecs=: 3 : 0
   ixs=. RECCTR+i.#mm=. usuMatify y
   (unmatify (ixs{-.RMDUPS)#mm) fappend BADFL

NB.EG hdr=. >_1{getFirstLine flnm
NB.EG (hdr-.CR) fwrite  BADFL=. 'badVkeys.txt'
NB.EG 6!:2 '([: saveBadKeyRecs [: > 0 { ]) doSomething ^:_ ] 0x;5e6;(fsize flnm);flnm' [ RECCTR=: 0

NB.* usuMatify: usual tab-delimited fields, LF-terminated lines -> boxed matrix.
usuMatify=: (TAB,LF)&([: <;._1&> (0 { [) ,&.> [: <;._2 ] (] , [ #~ [ ~: [: {: ])~ 1 { [)"0 1
NB.* unmatify: inverse of "usuMatify" above.
unmatify=: ([: ; [: , [: (] _1}&.|:~ (10{a.) ,~&.> [: }:&.> {:"1) (9{a.) ,~&.> ])
NB.* accumGblKeysVar: (different versions of) global key columns accumulator.
accumGblKeysVar0=: 4 : '(x)=: (".x),2{."1 usuMatify >0{y'
accumGblKeysVar1=: 4 : '(x)=: (".x),0 3{"1 usuMatify >0{y'  NB. key is date,CUSIP
accumGblKeysVar=: 3 : '(>_1{y),2{."1 usuMatify >0{y'        NB. Now using "passedOn" in "doSomething".
NB.* accumulateCol: generalized global key accumulator: specify name of global in which to
NB. accumulate keys and key column indexes as left argument; provide this as argument to
NB. adverb "doSomething" to process entire file in pieces.
accumulateCol=: 4 : '(GblVar)=: (".GblVar),colnum{"1 usuMatify >0{y [ ''GblVar colnum''=. x'
NB.EG (('VAR';5) accumulateCol ]) doSomething ^:_ ] 0x;5e6;(fsize flnm);flnm [ VAR=: ''

Here we see the older version of "rmDupes" that worked with the older version of "doSomething".

NB. This "rmDupes" is an older version, which required global variables, before
NB. "doSomething" had the "passedOn" parm added to allow this global information to
NB. be localized.
rmDupes=: 3 : 0
   ixs=. RECCTR+i.#mm=. usuMatify y
   (unmatify (ixs{RMDUPS)#mm) fappend NEWFL
NB.EG ([: rmDupes [:>0{]) doSomething 0x;5e6;flnm;(fsize flnm [ RECCTR=: 0x [ NEWFL=: newFlNm [ RMDUPES=: ...

The newer version of "rmDupes", defined below, achieves the same effect as the older one by returning the values of what were the global variables as part of the result. So, whereas the older "rmDupes" took only a single argument and was not intended to return a result, the newer version takes the former single argument as the first element of y and uses the last element of y to hold the three globals formerly named in the older version.

NB. rmDupes: remove records with duplicate keys from file
rmDupes=: 3 : 0
   'fldat flhdr args'=. y
   'recctr rmdups newfl'=. args
   ixs=. recctr+i.#mm=. usumatify fldat
   (unmatify (ixs{rmdups)#mm) fappend newfl
   recctr=. recctr+#ixs

The "example usage" text in "egUse" has not yet been updated to take advantage of the newer version of "doSomething".

egUse=: 0 : 0
NB. To check if file needs de-duping:
   flnm=. 'CIQ_G_History_New.txt'
   6!:2 '(''KEYS'' accumGblKeysVar ]) doSomething ^:_ ] 0x;5e6;(fsize flnm);flnm [ KEYS=: 0 2$'''''
   $KEYS=: ,&.>/"1 KEYS
   (-/,])(#,#@:~.) KEYS   NB. If {.~:0, there are that many duplicate keys.

NB. To de-duplicate flnm:
   NEWFL=. 'CIQ_G_new2.txt'
   hdr=. >_1{getFirstLine flnm
   wu=. whUnq KEYS
   RMDUPS=: -. (-.wu) +. (8=#&>KEYS) +. 0=#&>KEYS
   (hdr-.CR) fwrite NEWFL=: 'CIQ_VMgood.txt'
   6!:2 '([: rmDupes [: > 0 { ]) doSomething ^:_ ] 0x;5e6;(fsize flnm);flnm' [ RECCTR=: 0
   fsize NEWFL            NB. Should be smaller than flnm if there were dupes
   6!:2 '(''KEYn'' accumGblKeysVar ]) doSomething ^:_ ] 0x;5e6;(fsize NEWFL);NEWFL [ KEYn=: 0 2$'''''
   KEYn=: ,&.>/"1 KEYn
   (-/,])(#,#@:~.) KEYn   NB. {. should be 0

countColumnItems=: 3 : 0
    ctr=. _1 [ stats=. 0 4$0 [ 'flnm y'=. split y
    arg=. 0x;5e6;(fsize flnm);flnm
    while. (#y)>ctr=. >:ctr do.
        VAR=: ''
        smoutput tm=. 6!:2 '((''VAR'';ctr) accumulateCol ]) doSomething ^:_ ] arg'
        stats=. stats, tm, ((<./,>./)#&>VAR), +/0=#&>VAR NB. timing, min & max size, how many empty
NB.EG stats=. countColumnItems 'CIQ_VM.txt';<;._1 TAB,>_1{(CR,LF)-.~getFirstLine 'CIQ_VM.txt'

NB. (3 : 'STATS=: STATS,([: ((<./ ,: >./) , [: +/ 0 = ]) [: #&> [: > 0 { ]) y') doSomething^:_]0x;5e6;(fsize ; ])flnm ] STATS=: 0 3 56$0

pullRecs=: 3 : 0
   ixs=. RECCTR+i.#mm=. usuMatify y
NB. 6!:2 '([: pullRecs [: > 0 { ]) doSomething ^:_ ] 0x;9e6;(fsize flnm);flnm [ RECCTR=: 0[NULLRECS=: a:$~0,#vmhdr[SELVEC=: whnull'

Example Usage

The noun "egUse" gives an example of using this code to find duplicate keys and remove the records associated with them. The duplicates removed are those containing a duplicate key and occurring after the first instance of those key columns with the duplicated values. So, the first instance of any duplicate is kept and any following ones are removed.

First, we assign the file name to a variable for subsequent re-use, then we accumulate the key values into a global variable "KEYS".

   fsize flnm=. dd,'ranksAll.txt'
   6!:2 '(''KEYS'' accumGblKeysVar ]) doSomething ^:_ ] 0x;5e6;(fsize flnm);flnm [ KEYS=: 0 2$'''''

Here we see that for a file of almost 750 MB, it takes about 97 seconds to extract all the key columns from it and that there are 668,000 of them. We look at the first three to confirm that they look they way we expect them to.

Now we'll check if any of the key values are duplicated. However, because the simplest J expression for counting unique values - #~.KEYS - has poor performance, we first simplify the set of keys by concatenating the two columns together because we know, in this case, that concatenation will preserve the uniqueness of the keys.

   $KEYS=: ,&.>/"1 KEYS
   (-/,])(#,#@:~.) KEYS   NB. # duplicate keys, # total, # unique
68 668000 667932

This last set of numbers shows us the total number of keys concatenated to the number of unique keys ((#,#@:~.)); the difference between these two is prepended ((-/,])) to give us the number of duplicate records, followed by the total number of keys and the number of unique ones. So, we need to remove 68 records.

A Digression: Investigating Duplicates

Before we proceed with removing the duplicates from the file, let's further investigate the duplicate keys to see if they have something in common or appear to be random.

First, we check if there are different sizes of our concatenated keys.


In this case, they are all the same length. However, since it was possible to have keys missing one of the key columns, or with columns of differing lengths, this is a good thing to check.

Next, let's extract the duplicate keys and examine some of them to see if they have anything in common.

   whDup=: [: I. [: -. ~:             NB.* whDup: where duplicate elements are
   $dupes=. KEYS{~whDup kk

This shows us that all the duplicates are for the same security ID but that they seem to have occurred only over a particular time period. Doing the math, it looks like we had 12 duplicates for this ID for the years 1995-1999. Whatever was causing this seems to have begun with October of 1994 and ended with June of 2000. This could be useful for tracking down the initial cause of these duplicates.

Removing Duplicate Records

Now that we know something about the duplicate records, let's eliminate them from our large file. Here we see the special-purpose verb "rmDupes" for doing this.

rmDupes=: 3 : 0
   ixs=. RECCTR+i.#mm=. usuMatify y
   (unmatify (ixs{RMDUPS)#mm) fappend NEWFL

This is not very good code as it depends on three global variables, with all the potential problems this implies. However, as a one-shot fix, it works well enough. In this case, we use the globals to keep track of state - our current position in the file as given by "RECCTR" - and which records are duplicates, given by the Boolean "RMDUPS". The third global "NEWFL" is simply the name of the new file we'll create to mimic the existing file but with the duplicates removed.

It would be simple enough to re-write this code to eliminate the globals by passing in their values and instantiating local equivalents of them, but maybe we'll save that for a future enhancement of this code.

In any case, we assign these globals as shown below. The case of "RMDUPS" - a Boolean with the same length as the number of records, having a one for each record to retain and a zero for each one to remove - is the most interesting because it includes a couple of conditions irrelevant to this particular case but which would apply more generally.

   RMDUPS=: -. (-.~:KEYS) +. (10=#&>KEYS) +. 0=#&>KEYS

The only condition relevant to this case is the leftmost one ( -.~:KEYS ) as this is the Boolean negation -. of the Boolean indicator of the unique elements of "KEYS". The other two conditions with which this is or'd (+.) pertain to the cases where one or both key columns are missing, which, as we saw in the digression above, is not the case for this particular file. However, it doesn't hurt to retain the more general condition when it requires so little extra effort and does not affect our result.

We continue by checking that our global "RMDUPS" has the expected number of zeros and by initializing the new file (after checking that it does not yet exist).

   fexist NEWFL=: dd,'ranksAll2.txt'
   hdr=. >_1{getFirstLine flnm     NB. Use existing header line.
   (hdr-.CR) fwrite NEWFL          NB. as first row of new file.

Now that we've assigned two of our globals and initialized the new file, we run through the source file and write only non-duplicate-key records to the new file. We initialize the third global "RECCTR" on the same line to ensure consistency. Note that, for a file with more than 2^31 records, we would want to initialize "RECCTR" as an extended integer, i.e. "0x", much as we do with the first argument (the file position pointer) to "doSomething".

   6!:2 '([: rmDupes [: > 0 { ]) doSomething ^:_ ] 0x;5e6;(fsize flnm);flnm' [ RECCTR=: 0
   fsize NEWFL            NB. Should be smaller than flnm if there were dupes
   fsize flnm

We see that this took slightly less than six minutes (358.553 seconds) to create the new file and that it is slightly smaller than the old one, as expected.

As a final check, we should be able to repeat our first few lines of code above, but on the new file, to confirm that we now have no duplicate keys.

   6!:2 '(''KEYS2'' accumGblKeysVar ]) doSomething ^:_ ] 0x;5e6;(fsize NEWFL);NEWFL [ KEYS2=: 0 2$'''''
   $KEYS2=: ,&.>/"1 KEYS2     NB. Simplify key to single item per record
   (-/,])(#,#@:~.) KEYS2      NB. # duplicate keys, # total, # unique
0 667932 667932

-- Devon McCormick <<DateTime(2015-01-12T19:10:28-0200)>>