User:Devon McCormick/Code/breakupBigFiles2CD.ijs

This code is explained in some detail here.

NB.* breakupBigFiles2CD.ijs: break up large files to fit onto CDs - 720e6
NB. bytes - or whatever size desired.
NB. 20090826: use extended integers instead of floating point or integer pairs.

coclass 'bbf2cd'
load '~system/packages/files/bigfiles.ijs'
NB. load '~Code/bigfiles.ijs'
coinsert 'jbf'

load 'task'    NB. To run command in "assembleBrokenFiles".

NB.* breakUpBigFile: break up file >0{y into number of pieces using
NB.* assembleBrokenFiles: put pieces back together->big file.
NB.* jfi: just files from dir listing
NB.* sequenceGap: find missing integers in supposedly contiguous sequence.
NB.* cvtIntPair: integer pair converter (both ways); fails above <:2x^64.
NB.* cvt1to2: convert single number into pair of integers.
NB.* cvt2to1: convert pair of integers into single number.
NB.* cvtLargeInt: convert (fp) num from 2^31 to <:2^32 to (signed) int.
NB.* doSeveral: use "each" to break up several files from indir to outdir,
NB.* buildBigNumdFl: build large file with (text representation of) number N

NB. The following is still necessary because the underlying OS-specific
NB. calls use integer pairs to overcome the size limitation of integers.
NB. The overlying J code is not so constrained as extended integers can be
NB. arbitrarily large but will fail for sizes >: 2^64 because of the
NB. limitations of the underlying system calls.
explainNecessaryLongIntegers=: 0 : 0
The "bigfiles" functions require integer arguments which exceed the scope
of individual signed integers, so we must use pairs of signed integers to
allow the APIs to recognize the numbers properly.  In other words,
we use a pair of signed, 32-bit, integers to represent a 64-bit integer.

For example, the largest signed 32-bit integer is (2^31)-1 (2147483647).
So, to represent a larger integer such as 2147483649, we use
   cvtIntPair 2147483649
0 _2147483647

The high-order bits are in the first integer of the resulting pair (0).
The second integer is negative because the highest order bit is the
sign bit in a signed integer.  So, 2^32, which is 1 followed by 32 zeroes
in binary, appears thusly:
   cvtIntPair 2^32
1 0
)

NB. This warning is obsolete following the change to extended integers:
NB. **Danger: the following routine uses floating point numbers as large
NB. integers, assuming we can convert while retaining precision.  This
NB. should be OK for sizes up to about a quadrillion, i.e. 10^15 or so
NB. but I haven't really checked it thoroughly to determine the correct limits.

NB.* breakUpBigFile: break up file >0{y into numbered pieces named after
NB. dir/name >1{y, reading in chunks of 0{x bytes and filling output files
NB. of up to 1{x bytes each. Handles large files (>2^31 bytes).
breakUpBigFile=: 3 : 0
   1e7 720e6 breakUpBigFile y NB. Work w/1e7 bytes at-a-time, write 720e6
:
   'flnm outflnm'=. y
   sufflen=. ->:outflnm i.&.|. '.'
   'outsuff outpre'=. sufflen split outflnm
   totflsz=. bfsize flnm
   'rdctr flctr'=. x: 0 0     NB. Need extended integers to read large file.
   flbase=. actsz=. 0
   'maxchsz maxperfl'=. x: x
   if. -. nameExists 'BREAKATLINE' do. BREAKATLINE=. 0 end.
NB. Max (0-origin) file counter length-># digits->leading zeros in output file
NB. name number portion.
   maxctrlen=. >.10^.maxperfl%~totflsz

   while. rdctr<totflsz do.
       outflnm=. outpre,(maxctrlen lead0s flctr),outsuff
       '' fwrite outflnm      NB. Initialize output file
       chsz=. maxchsz
       flmax=. maxperfl+flbase=. flbase+actsz
       flmax=. flmax<.totflsz
       actsz=. adj=. 0
       while. rdctr<flmax+adj do.
           chsz=. chsz<.flmax-rdctr
           ch=. bixreadx flnm;rdctr,chsz
           adj=. 0
           if. flmax<:rdctr+chsz do.
               if. BREAKATLINE do.
                   if. LF~:{:ch do. ch=. ch}.~adj=. -<:(#ch)-ch i: LF end. end.
           end.
           assert. (chsz+adj)=ch bappend outflnm  NB. Write as much as read?
           actsz=. actsz+chsz+adj
           rdctr=. x: rdctr+chsz+adj
       end.
       smoutput (': ',~":qts ''),'Wrote ',outflnm,'; length = ',":rdctr-flbase
       flctr=. >:flctr
   end.
NB.EG breakUpBigFile 'F:\Video\RailwayChP2.avi';'C:\Video\RailwayChP2avi.dat'
)
NB. 13!:3 'breakUpBigFile : 14 20 25 28'
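
NB. A small sketch of the name and size arithmetic used above, assuming the
NB. standard-library "split" ({. ,&< }.) and BREAKATLINE=0; the 2e9-byte file
NB. size is hypothetical and the results were worked out by hand.
NB.    outflnm=. 'C:\Video\RailwayChP2avi.dat'
NB.    ->:outflnm i.&.|. '.'          NB. Negative length of suffix '.dat'
NB. _4
NB.    _4{.outflnm                    NB. outsuff
NB. .dat
NB.    _4}.outflnm                    NB. outpre
NB. C:\Video\RailwayChP2avi
NB.    >.2e9%720e6                    NB. Pieces needed for 2e9 bytes
NB. 3
NB.    >.10^.2e9%720e6                NB. maxctrlen: digits in piece number
NB. 1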

NB.* assembleBrokenFiles: put pieces back together->big file.
assembleBrokenFiles=: 3 : 0
   'flnm partfls'=. y
   sufflen=. (#-~'.'i:~])partfls
   'outsuff outpre'=. sufflen split partfls
NB. The "broken" files to be assembled should have names like, e.g.
NB. 'Outfl0.dat';'Outfl1.dat';'Outfl2.dat'..., so most of the following work
NB. is to extract and validate the numbers at the end of "Outfl" before the
NB. suffix ".dat": they should all be valid numbers and in sequence.

NB. We do a lot of work validating these names because they should have been
NB. generated according to the above function "breakUpBigFile".  We expect
NB. the inputs to this function to conform as it is the inverse of that one.

   outpath=. outpre{.~>:outpre i: PATHSEP_j_
   flnmprelen=. #>{:<;._1 PATHSEP_j_,outpre
   ofls=. {."1 jfi dir outpre,'*',outsuff
   flnums=. (flnmprelen}.sufflen}.])&.>ofls
   if. 0 e. whvn=. isValNum&>flnums do.
       smoutput 'Excluding file',('s'#~1~:0 +/ . =whvn),' with invalid '
       smoutput 'numeric portion: ','.',~punclist ofls#~-.whvn
       'ofls flnums'=. (<whvn)#&.>ofls;<flnums
   end.

NB. It might be an error to proceed with an incomplete sequence, but we will.
   misssq=. sequenceGap flnums=. /:~".&>flnums
   if. 0~:#misssq do.
       smoutput 'Proceeding with missing sequence number',('s'#~1~:#misssq),':'
       smoutput '  ','.',~punclist ":&.>misssq
   end.

   osuff=. sufflen{.partfls      NB. Suffix of the part names, e.g. '.dat'
   cmd=. }:;((<outpre),&.>":&.>flnums),&.><osuff,'+'
   smoutput 'Running command: ','...',~cmd=. 'copy /b ',cmd,' ',flnm
   shell cmd        NB. This could take a while depending on file size.
NB. copy /b BigFlPart0.dat+BigFlPart1.dat+BigFlPart2.dat+BigFlPart3.dat BFl.txt
)
NB. 13!:3 'assembleBrokenFiles 16 19 24 32'  NB. Useful breakpoints
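
NB. A sketch of how the file list for the copy command is assembled above,
NB. using hypothetical values for outpre, osuff and flnums (result worked out
NB. by hand). Since flnums is numeric at this point, ": formats each number
NB. without leading zeros.
NB.    outpre=. 'BigFlPart'
NB.    osuff=. '.dat'
NB.    flnums=. 0 1 2
NB.    }:;((<outpre),&.>":&.>flnums),&.><osuff,'+'
NB. BigFlPart0.dat+BigFlPart1.dat+BigFlPart2.dat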

NB. These two utility fns, included for completeness, are from other libraries.
jfi=: 3 : '(-.''d''e.&>4{"1 y)#y'            NB.* jfi: just files from dir listing
nameExists=: 0:"_ <: [: 4!:0 <^:(L. = 0:)    NB.* nameExists: 1 if name exists

NB.* sequenceGap: find missing integers in supposedly contiguous sequence.
sequenceGap=: 3 : 0
   fullseq=. (<./y)+i.>:(>./-<./)y
   /:~fullseq-.y
NB.EG sequenceGap 13 14 18-.~10+i.10
)
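
NB. A worked example, assuming a sequence 10..19 with 13, 14 and 18 missing
NB. (results worked out by hand):
NB.    sequenceGap 10 11 12 15 16 17 19
NB. 13 14 18
NB.    #sequenceGap 3 4 5             NB. No gaps -> empty result
NB. 0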

NB.* cvtIntPair: integer pair converter (both ways); fails above <:2x^64.
cip=: cvt1to2 :.cvt2to1
cvtIntPair=: 3 : 0
   if. 2={:$y do. cvt2to1 y
   else. cvt1to2 y
   end.
)

NB.* cvt1to2: convert single number into pair of integers.
cvt1to2=: 3 : '<.(_4294967296x*qq>:2147483648x)+qq=. 4294967296x 4294967296x#:y'

NB.* cvt2to1: convert pair of integers into single number.
cvt2to1=: 3 : '4294967296x#.|:4294967296x&||:y'
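
NB. A round-trip sketch of the two converters above (results worked out by
NB. hand): cvt1to2 splits one number into a (high, low) pair of signed 32-bit
NB. values and cvt2to1 recombines such a pair.
NB.    cvt1to2 5000000000x
NB. 1 705032704
NB.    cvt1to2 2147483649x
NB. 0 _2147483647
NB.    cvt2to1 0 _2147483647
NB. 2147483649
NB.    cvt2to1 cvt1to2 5000000000x
NB. 5000000000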

NB. These next 2 are based on Dave Mitchell's "bigfiles" code: for exegesis.
3 : 0 ''
   if. -.nameExists 'K31' do. K31=: 2^31 end.
)

NB.* cvtLargeInt: convert (fp) num from 2^31 to <:2^32 to (signed) int.
cvtLargeInt=: 3 : 0
   if. y>:K31 do. K31-~K31#:y else. y end.
NB.EG    cvtLargeInt"0 ] 2147483647 2147483648 2147483649 4294967295 4294967296
NB. 2147483647 _2147483648 _2147483647 _1 _2147483648
NB. Note failure for final argument above (=2^32).
)

NB.* doSeveral: use "each" to break up several files from indir to outdir,
NB. auto-generate name prefixes and suffixes.
doSeveral=: 4 : 0
   'indir outdir'=. endSlash&.>x
   flnm=. y
   infl=. indir,flnm
   outfl=. outdir,(flnm-.'.'),'.dat'
NB. Break into 240e6-byte pieces because this is about 1/3 of a CD and using
NB. pieces smaller than 1/2 of CD gives more flexibility.
   1e7 240e6 breakUpBigFile infl;outfl
NB.EG (<'F:\bigfiles\';'C:\brokenFiles\') doSeveral&.>'file1.zun';'file2.foo'
)
NB.* endSlash: ensure path has ending slash.
endSlash=: 13 : 'y,PATHSEP_j_#~PATHSEP_j_~:{:y'
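
NB. A sketch of the name handling in doSeveral and endSlash, assuming Windows
NB. (PATHSEP_j_ is '\'); the paths are hypothetical and the results were worked
NB. out by hand.
NB.    endSlash 'F:\bigfiles'
NB. F:\bigfiles\
NB.    endSlash 'F:\bigfiles\'
NB. F:\bigfiles\
NB.    ('file1.zun'-.'.'),'.dat'      NB. Output name built by doSeveral
NB. file1zun.dat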

NB.* buildBigNumdFl: build large file with (text representation of) number N
NB. beginning at byte N; can be run to append to existing file.
NB. This large file handy for testing large-file fns as file locations are
NB. designated by the file contents, i.e.
NB.    frdix 'C:\Temp\BigFile.txt';1000 20   NB. Read 20 bytes starting at 1000
NB. 1000 1005 1010 1015
NB. shows us that we properly started reading at file location 1000.
NB. Or, for an even larger file, we must use "bixreadx":
NB.    bixreadx 'C:\Temp\BigFile.txt';2147483649x 32
NB. 2147483649 2147483660 2147483671
NB.    bfsize 'C:\Temp\BigFile.txt'
NB. 2303603063
NB.* buildBigNumdFl: build large file with (text representation of) number N: faster.
buildBigNumdFl=: 3 : 0
   'nn bigfl'=. y                       NB. Append nn numbers
   if. -.fexist bigfl do. nn=. <:nn     NB. Initialize if no file.
       '0 ' fwrite bigfl end.           NB. Start counting at zero.
   while. 0<nn do.
       ctr=. bfsize bigfl               NB. Reduce number of file writes
       len=. 2+<.10^.ctr                NB. Length of nums now
       n2app=. 1e5<.nn<.>.len%~ctr-~10^<:len
       (' ',~":ctr+len*i.n2app) bappend bigfl
       nn=. nn-n2app
   end.
   bfsize bigfl
NB.EG buildBigNumdFl 2e8;'C:\Temp\BigFile.txt'
)

NB.-- Older version: this is off by one on the initial run because of the
NB. uncounted initialization.
explanationOfPeculiarDualNumArg=: 0 : 0
One peculiarity of the arguments to the function "buildBigNumdFlSlower" is
the use of two numbers, which are essentially multiplied together, to
specify how many integers to write out.  The numbers are, respectively,
the number of times through the outer and inner loops.  The inner
loop builds the string in memory; the string is only written to file
at the end of the inner loop.

The intention here is to allow the user to try different combinations
to balance file writing time versus memory allocation time to achieve
the best throughput for a given machine.  This could undoubtedly be
made more efficient but I haven't bothered since it's such a limited-use
function.
)
buildBigNumdFlSlower=: 3 : 0
   'nouter ninner bigfl'=. y            NB. Append nouter*ninner numbers
   if. -.fexist bigfl do.               NB. Need to initialize file?
       '0 ' fwrite bigfl end.           NB. Start counting at zero.
   while. _1<nouter=. <:nouter do.      NB. Build string in inner loop to
       fsz=. bfsize bigfl               NB.  reduce number of file writes
       ctr=. 0 [ str=. ''               NB.  (for better efficiency).
       while. ninner>:ctr=. >:ctr do.
           str=. str,' ',~0j0": fsz+#str
       end.
       str bappend bigfl
   end.
   0j0":bfsize bigfl
NB.EG buildBigNumdFlSlower 10000;1000;'C:\Temp\BigFile.txt'
)

bfsize=: 3 : 0
   if. t y do.
       try. fh=. CreateFileR (y,{.a.);GENERIC_READ;0;NULLPTR;OPEN_EXISTING;0;0 catch.
           (13!:11 '');(13!:12 '')
           return.
       end.
       if. fh=_1 do.
           cderx''
           return.
       end.
       F=. 1
   else.
       fh=. y
       F=. 0
   end.
   b=. ,2
   ts=. GetFileSizeR fh;b
   if. F do. CloseHandleR fh end.
NB.  K32#.|:K32&|b,ts
   cvt2to1 b,ts
)

cvt1to2_jbf_=: ([: ([: <. (_4294967296x * 2147483648x <: ]) + ]) 4294967296 4294967296x #: ])

NB.* bixreadx: read (big) files using extended integers for indexing.
NB. bixreadx fname;startx[,len]
bixreadx_jbf_=: 13 : 'bixread (0{y),<x:^:_1 (cvt1to2 {.>1{y),}.>1{y'

NB.* bixwritex: write (big) files using extended integers for indexing.
NB. data bixwritex fname;startx[,len]
bixwritex_jbf_=: 13 : 'x bixwrite (0{y),<x:^:_1 (cvt1to2 {.>1{y),}.>1{y'