Scripts/Working with Big Files

From J Wiki
Jump to navigation Jump to search

This is Mostly Obsolete

With modern 64-bit J, the regular file verbs covered in "stdlib" - which are loaded by default - work fine on files larger than 2 GB. However, the following is still applicable for 32-bit systems. Also these definitions may still be useful as examples of using external DLLs from within J; some of the verbs may still be useful on their own as well. For instance, "bigNumdFl" is still handy for quickly building a usefully-defined file. To use this on a 64-bit system, I make the following global definitions after loading it since I can use the regular versions of file verbs:

   bfsize=: fsize
   bappend=: fappend

Working with Big Files: Using Numbers Bigger than an Integer

Sometimes we have to deal with large files -- those larger than 2 gigabytes.

The immediate use of the functions in breakupBigFiles2CD.ijs is to allow us to break apart a file too large to fit onto a writable CD so that we can write the file onto CDs in pieces in order to move it or archive it. There is also a function to allow us to re-assemble the pieces into the original, large file.

Download the code File:BreakupBigFiles2CD.ijs (or you can take a look at it here). Be sure to load it

   load '~Code/breakupBigFiles2CD.ijs'

(where "Code" was set in Edit/Configure/Folders to point to my code directory) and make the namespace transparently available:

   coinsert 'bbf2cd'

before trying the following examples.

The pedagogical purpose of these functions is to demonstrate how to use the functions in bigfiles.ijs for dealing with files larger than 2 GB.

The main problem with using these functions is the restriction on the size of an integer: how do we address parts of a file with locations not representable by a standard 32-bit (or signed, 31-bit) integer?

In the breakupBigFiles2CD.ijs code, we use a function cvtIntPair to convert between a (floating point) number of sufficient magnitude and a pair of signed integers suitable for passing to the Windows APIs. However, since J does have extended integers, we can use these externally as long as we're careful to coerce them to int before passing them to the API.

That is, since the underlying bigfiles functions require integer arguments which exceed the scope of individual signed integers, we use pairs of signed integers, a datatype native to J, to allow the APIs to recognize the numbers properly.

For example, the largest signed integer is normally 2^31 (2147483648).

So, to represent the next largest integer, we use

   cvtIntPair 2147483649
0 _2147483647

The high-order bits are in the first integer of the resulting pair (0). The second integer is negative because the highest order bit is the sign bit in a signed integer. So, 2^32, which is 1 followed by 32 zeroes in binary, appears thusly:

   cvtIntPair 2^32
1 0

Note that cvtIntPair is its own inverse. If the argument is a pair of numbers, it converts them to a single number; a single number gets converted to a suitable pair of numbers.

NB.* cvtIntPair: integer pair converter (both ways); fails above <:2x^64.
cip=: cvt1to2 :.cvt2to1
cvtIntPair=: 3 : 0
   if. 2={:$y do. cvt2to1 y
   else. cvt1to2 y
   end.
)

NB.* cvt1to2: convert pair of integers into single number.
cvt1to2=: 3 : '<.(_4294967296x*qq>2147483648x)+qq=. 4294967296x 4294967296x#:y'

NB.* cvt2to1: convert single number into pair of integers.
cvt2to1=: 3 : '4294967296x#.|:4294967296x&||:y'

The self-inverse property of cvtIntPair also allows us to verify its correctness. The following examples show how the function applied to its output (should) give us the original input. However, for sufficiently large numbers, this scheme [used to] break down because of the limited precision of floating point numbers.

   cvtIntPair 4e9
0 _294967296
   cvtIntPair^:2 ] x: 4e9
4000000000

Since we use extended integers, it's not necessary to use formatting (16j0":) to see all the digits to verify that output looks the same as input.

Now, look at successively larger numbers to see where cvtIntPair breaks:

   cvtIntPair^:2 ] 1234567890
1234567890
   cvtIntPair^:2 ] 12345678901234
12345678901234
   cvtIntPair^:2 ] 123456789012345
123456789012345
   cvtIntPair^:2 ] 9234567890123456
9234567890123456

This last one still looks OK, but we've already passed the point of losing precision - the example above just happened to work out, as the following examples, using extended integers, or not, should make clear:

   cvtIntPair^:2 ] 9123456789012345   NB. Last digit changes
9123456789012344
   cvtIntPair^:2 ] 9123456789012345x  NB. Or doesn't if done properly w/extended integer
9123456789012345

The good news is that we should be fine up to one less than 2^64:

   2x^64
18446744073709551616
   cvtIntPair^:2 ] 18446744073709551616x
0
   cvtIntPair^:2 ] 18446744073709551615x
18446744073709551615

   10^.2x^64
19.26592

Explanation of Selected Functions

We'll look at the following major three functions:

NB.* buildBigNumdFl: build large file with (text representation of) number N
NB.* breakUpBigFile: break up file >0{y. into number of pieces using
NB.* assembleBrokenFiles: put pieces back together->big file.

buildBigNumdFl

The first of these, buildBigNumdFl, builds a large file we can use to test the other two major functions of this suite. This file has a special format to make it handy for testing large-file functions: it's a text file with the file locations designated by the file contents. That is, starting at the 1000th byte is the text '1000'.

Obviously, this cannot work for every location since each number takes more than one byte but we just fill in each number as available. So, since the string '1000 ' (with a space separator at the end) takes 5 bytes, the next number has to be '1005' since it's starting in the 1005th byte. The initial byte is location zero so the first string in the file is '0 '.

Interestingly, at least the first several integer powers of 10 seem to be present: does this hold in general? For a look at some examples, see the end of this essay: Powers of Ten in a Self-Numbered File.

The format of the file makes it readily apparent, for instance, that our file-reading functions are working on the expected locations. Some examples follow.

   (1!:11) 'C:\Temp\BigFile.txt';1000 20
|index error
|       (1!:11)'C:\Temp\BigFile.txt';1000 20

The standard indexed file read foreign conjunction does not work for this large file (except, as noted above, if you're running 64-bit version of J). For a file larger than 2 GB, we must use (our adaptation of) bigfiles function "bixreadx":

   bixreadx 'C:\Temp\BigFile.txt';1000 20
1000 1005 1010 1015

This shows us that we properly started reading at file location 1000.

Now look at other locations:

   bixreadx 'C:\Temp\BigFile.txt';2147483656x 32
2147483656 2147483667 2147483678
   bfsize 'C:\Temp\BigFile.txt'
2196023059

   NB. Read last few bytes in file:
   bixreadx 'C:\Temp\BigFile.txt';(2196023059x-33),33
2196023026 2196023037 2196023048

Here's an example of starting construction of a large file by writing 210 million integers which results in a file slightly larger than 2^31:

   buildBigNumdFl 2.1e8;'C:\Temp\BigFile.txt'

Since it may take some time (20 minutes or more, depending on the speed of your machine) to build a file larger than 2 GB, the function is designed to pick up where it left off so you can build the file in pieces as you have the time to spare.

NB.* buildBigNumdFl: build large file with (text representation of) number N: faster.
buildBigNumdFl=: 3 : 0
   'nn bigfl'=. y                       NB. Append nn numbers
   if. -.fexist bigfl do. nn=. <:nn     NB. Initialize if no file.
       '0 ' fwrite bigfl end.           NB. Start counting at zero.
   while. 0<nn do.
       ctr=. bfsize bigfl               NB. Reduce number of file writes
       len=. 2+<.10^.ctr                NB. Length of nums now
       n2app=. 1e4<.nn<.>.len%~ctr-~10^<:len
       (' ',~":ctr+len*i.n2app) bappend bigfl
       nn=. nn-n2app
   end.
   bfsize bigfl
NB.EG buildBigNumdFl2 2e8;'C:\Temp\BigFile.txt'
)

breakUpBigFile

This is ostensibly the main function. The default arguments are chosen under the assumption that we want to put the resulting pieces of the large file onto writable CDs with a capacity of 720 million bytes. We resist the all-too-common convention of calling this 700 Megabytes because we would rather not obfuscate the meaning of well-established prefixes, which are based on decimal, rather than muddying their meaning with base-2 approximations.

NB.* breakUpBigFile: break up file >0{y into number of pieces using
NB. dir/name >1{y reading in chunks of 0{x, filling up to 1{x byte
NB. output files. Handles large files (>2^31 bytes).
breakUpBigFile=: 3 : 0
   1e7 720e6 breakUpBigFile y NB. Work w/1e7 bytes at-a-time, write 720e6
:
   'flnm outflnm'=. y
   sufflen=. ->:outflnm i.&.|. '.'
   'outsuff outpre'=. sufflen split outflnm
   totflsz=. bfsize flnm
   'rdctr flctr'=. x: 0 0     NB. Need extended integers to read large file.
   flbase=. actsz=. 0
   'maxchsz maxperfl'=. x: x
   if. -. nameExists 'BREAKATLINE' do. BREAKATLINE=. 0 end.
NB. Max (0-origin) file counter length-># digits->leading zeros in output file
NB. name number portion.
   maxctrlen=. >.10^.maxperfl%~totflsz

   while. rdctr<totflsz do.
       outflnm=. outpre,(maxctrlen lead0s flctr),outsuff
       '' fwrite outflnm      NB. Initialize output file
       chsz=. maxchsz
       flmax=. maxperfl+flbase=. flbase+actsz
       flmax=. flmax<.totflsz
       actsz=. adj=. 0
       while. rdctr<flmax+adj do.
           chsz=. chsz<.flmax-rdctr
           ch=. bixreadx flnm;rdctr,chsz
           adj=. 0
           if. flmax<:rdctr+chsz do.
               if. BREAKATLINE do.
                   if. LF~:{:ch do. ch=. ch}.~adj=. -<:(#ch)-ch i: LF end. end.
           end.
           assert. (chsz+adj)=ch bappend outflnm  NB. Write as much as read?
           actsz=. actsz+chsz+adj
           rdctr=. x: rdctr+chsz+adj
       end.
       smoutput 'Wrote ',outflnm,'; length = ',":rdctr-flbase
       flctr=. >:flctr
   end.
NB.EG breakUpBigFile 'F:\Video\RailwayChP2.avi';'C:\Video\RailwayChP2avi.dat'
)

assembleBrokenFiles

This is not a very well-designed function but is included here for the sake of completeness. It works by creating a DOS binary copy command (using copy option /b) to re-assemble the smaller files into the larger one.

The function expends much effort validating the input file names. This is because we expect them to have been generated by the above breakUpBigFile function, hence we know what they should look like, and a missing or mis-placed piece would give us an incorrect assembly.

The known problems with the design of this function are that it relies on an OS-specific set of commands and is vulnerable to breaking if, for instance, a large file has been broken into so many pieces that the resulting command line is too large, requires too much interim memory, or is too slow to run.

NB.* assembleBrokenFiles: put pieces back together->big file.
assembleBrokenFiles=: 3 : 0
   'flnm partfls'=. y.
   sufflen=. (#-~'.'i:~])partfls
   'outsuff outpre'=. sufflen split partfls
NB. The "broken" files to be assembled should have names like, e.g.
NB. 'Outfl0.dat';'Outfl1.dat';'Outfl2.dat'..., so most of the following work
NB. is to extract and validate the numbers at the end of "Outfl" before the
NB. suffix ".dat": they should all be valid numbers and in sequence.

NB. We do a lot work validating these names because they should have been
NB. generated according to the above function "breakUpBigFiles".  We expect
NB. the inputs to this function to conform as it is the inverse of that one.

   outpath=. outpre{.~>:outpre i: PATHSEP_j_
   flnmprelen=. #>{:<;._1 PATHSEP_j_,outpre
   ofls=. {."1 jfi dir outpre,'*',outsuff
   flnums=. (flnmprelen}.sufflen}.])&.>ofls
   if. 0 e. whvn=. isValNum&>flnums do.
       smoutput 'Excluding file',('s'#~1~:0 +/ . =whvn),' with invalid '
       smoutput 'numeric portion: ','.',~punclist ofls#~-.whvn
       'ofls flnums'=. (<whvn)#&.>ofls;<flnums
   end.

NB. It might be an error to proceed with an incomplete sequence, but we will.
   misssq=. sequenceGap flnums=. /:~".&>flnums
   if. 0~:#misssq do.
       smoutput 'Proceeding with missing sequence number',('s'#~1~:#misssq),':'
       smoutput '  ','.',~punclist ":&.>misssq
   end.

   osuff=. ~.sufflen{.partfls
   cmd=. }:;((<outpre),&.>":&.>flnums),&.><osuff,'+'
   smoutput 'Running command: ','...',~cmd=. 'copy /b ',cmd,' ',flnm
   shell cmd        NB. This could take a while depending on file size.
NB. copy /b BigFlPart0.dat+BigFlPart1.dat+BigFlPart2.dat+BigFlPart3.dat BFl.txt
)

Powers of Ten in a Self-Numbered File

Does the self-numbered file produced by buildBigNumdFl always happen to line up to give us integral powers of ten? It seems like it for this first five in this sample:

   bixreadx&>(<flnm);&.>(10-~10^>:i.9),&.>18+6,~2#+:i.4
0 2 4 6 8 10 13 16
 91 94 97 100 104
8 992 996 1000 1005
9990 9995 10000 1000
988 99994 100000 10000
999990 999997 1000004
99988 9999996 10000004 1
9986 99999995 100000004
984 999999994 1000000004

But we see from looking at a window around the integral powers of ten that it fails when we get to one million. This sequence can be found in the On-line Encyclopedia of Integer Sequences.