User:Devon McCormick/JDBWithNetflixChallengeData

From J Wiki
Jump to navigation Jump to search

Here's my experience so far attempting to use JDB on the Netflix Challenge dataset.

First, I load JDB, then create my folder, database, and table for Netflix data:

   load '~addons/data/jdb/jdb.ijs'
   hf=: Open_jdb_ 'C:\Data\Netflix'
   hd=: Create__hf 'nfbase'
   ht=: Create__hd 'mcrd';0 : 0
movid int
custno int
rating byte
rdate int
)

Notice the "in-line" table definition beginning with "0 : 0" above. There are four fields: the movie ID number, the customer number, the customer rating of the movie, and the rating date.

Now, I attempt to load the Netflix data into the database but, first, a word about this data and how I currently work with it. Since there's too much data to handle all at once - about 100 million records - I've broken it into 100 numbered variables which I've written to file. Each variable is a four-row table of integers, as seen here in variable "murd0" retrieved from file:

   VDIR
\Data\Netflix\VarsDir\
   VDIR unfileVar_WS_ 'murd0'  NB. Movie, User, Rating, Date
+-+----------------------------------------+
|1|+----------------------+---------+-----+|
| ||\Data\Netflix\VarsDir\|MURD0.DAT|murd0||
| |+----------------------+---------+-----+|
+-+----------------------------------------+
   $murd0
4 982878
   5{."1 murd0
       1        1        1        1        1
 1596531  1366860  1181550  1227322  2413320
       5        4        3        4        4
20040123 20040126 20040201 20040206 20040206

As you can see, this is a matrix of integers with a little less than one million columns. In the table definition, I attempted to save some space by defining the rating as a byte since it can only have the values 1-5.

I'm attempting to load this data from these 100 variables into the database in the way I usually deal with this data - using an adverb that applies a verb to specified variables on file:

NB.* getVarInfo: apply arbitrary function to each (filed) var named.
getVarInfo=: 1 : 0
   'dd varnm'=. y.                 NB. Vars dir, var names.
   rc=. dd unfileVar_WS_ varnm
   if. >{.rc do. rc=. 1;u. ".varnm
       [4!:55 <varnm
   end.
   rc
NB.EG ({."1,.{:"1) getVarInfo &.>(<'C:\data\');&.>'var1';'var2';'var3'
NB.EG dts=: (3&{) getVarInfo&.>(<VDIR);&.>MVN
)

I have a list "MVN" of the variable names "murd0" through "murd99", so after defining the database update verb like this:

insRecs=: 3 : 0
   Insert__hd 'mcrd';<(<'012345'{~>2{y) 2}y=. <"1 y
)

As you can see, I convert the rating to a character on-the-fly. So, I should be able to do this to load all the records into the database:

   rc=. insRecs getVarInfo&.>(<VDIR);&.>MVN

However, after processing about 30 of these million column variables, I get this error:

|bad view: assert
|   'bad view'    assert 0[free fh,mh,0

Here's what the table directory looks like at this point:

   dir 'C:\Data\Netflix\nfbase\mcrd\*.*'
+--+-------------------+---------+---+------+
|c0|2008 10 19 21 15 40|136762360|rw-|-----a|
+--+-------------------+---------+---+------+
|c1|2008 10 19 21 15 40|136762360|rw-|-----a|
+--+-------------------+---------+---+------+
|c2|2008 10 19 21 15 40|136762360|rw-|-----a|
+--+-------------------+---------+---+------+
|c3|2008 10 19 21 15 40|136762360|rw-|-----a|
+--+-------------------+---------+---+------+
|c4|2008 10 19 21 15 40|34191553 |rw-|-----a|
+--+-------------------+---------+---+------+
|c5|2008 10 19 21 15 40|136762360|rw-|-----a|
+--+-------------------+---------+---+------+
|d0|2008 10 19 21 15 40|136762360|rw-|-----a|
+--+-------------------+---------+---+------+
|d1|2008 10 19 21 9 24 |5284     |rw-|-----a|
+--+-------------------+---------+---+------+
|d2|2008 10 19 21 9 24 |1296     |rw-|-----a|
+--+-------------------+---------+---+------+

   +/,>2{"1 dir 'C:\Data\Netflix\nfbase\mcrd\*.*'
854772293

There are some records loaded:

   Reads__hd 'from mcrd where custno=25'
+-----+------+------+--------+
|movid|custno|rating|rdate   |
+-----+------+------+--------+
|3427 |25    |2     |20050831|
|4432 |25    |5     |20050831|
|4656 |25    |3     |20050831|
|4996 |25    |2     |20050831|
|5317 |25    |4     |20050831|
|6037 |25    |4     |20050831|
|6786 |25    |5     |20050831|
+-----+------+------+--------+

but not all of them:

   0 0$''[VDIR unfileVar_WS_ 'murd99'
   5{."1 murd99
   17622    17622    17622    17622    17622
  816549   359127   921587  2483371  1265459
       3        4        2        4        5
20011216 20011218 20011218 20011221 20011222

NB. "17622" is first movie number in last var:
   Reads__hd 'from mcrd where movid=17622'
+-----+------+------+-----+
|movid|custno|rating|rdate|
+-----+------+------+-----+
+-----+------+------+-----+

The following appears to be the last variable from file loaded:

   0 0$''[VDIR unfileVar_WS_ 'murd37'
   5{."1 murd37
    6722     6722     6722     6722     6722
 1539813   302850  1337655  1003230  2012582
       1        4        5        5        5
20030417 20030420 20030420 20030424 20030425
   $&.>Reads__hd 'from mcrd where movid=6722'
+------+------+------+------+
|5     |6     |6     |5     |
+------+------+------+------+
|1105 1|1105 1|1105 1|1105 1|
+------+------+------+------+

So, I was able to get a little over one-third of the data loaded. I suspect I'm running into system-related limits to paging or other memory space though I have a 4GB pagefile defined on this machine.

I got the following message from JDB Browser upon attempting to open "Data" tab:

Query: movid,custno,rating,rdate
from mcrd
|out of memory: getwhere
|       I.active
No data



See further JDB/Netflix -- Oleg Kobchenko <<DateTime(2008-10-22T06:09:46Z)>>