by earl, 6493 days ago
Another occasion to let kdb+ do what it does best, namely crunching data: as part of the Netflix Prize competition, Netflix released a dataset of roughly 100 million date-stamped human ratings of 17770 movies. That's certainly a lovely heap of data to grind the mighty axe of kdb+ on, and well, that's just what Chris and I did the other night.

First things first, namely some pre-processing. Each movie's ratings are contained in a separate file (mv_nnnnnnn.txt), leaving us with 17770 files bundled in a 2.0 GB tar archive. The movie's ID (terminated by a colon) is repeated as the first line of the movie's file, followed by comma-separated user,rating,date records.

Now let's see how fast we can load that bunch of data into kdb+. We start off with a simple pre-processing AWK script which prefixes each record with the correct movie ID and joins all those records together into a single huge file:

BEGIN       { OFS="," }
/^[0-9]+:/ { split($1, parts, ":"); m = parts[1] }
/^[0-9]+,/ { print m, $1 }

This awk script takes 221326 milliseconds to pre-process the dataset:

$ tar Oxf training_set.tar | awk -f join.awk > training_set.csv
# real 3m41.326s

I'm certainly no awk wizard, so if you have suggestions for making this script faster, please leave them in the comments. The resulting file training_set.csv weighs in at 2.5 GB (a 25% increase in file size).

Now that we have that data processed into a very nice and regular form, loading it into kdb+ is a matter of only a few keystrokes:
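A load along these lines should do the trick — note this is a sketch: the column names u and dt and the types "IIHD" are my assumptions, while m and r match the queries further down:

/ no header row in the csv, so 0: with a plain "," delimiter returns column vectors
/ m: movie ID, u: user ID, r: rating, dt: date (names u, dt and types are assumptions)
q)d:flip `m`u`r`dt!("IIHD";",") 0: `:training_set.csv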


Kdb+ finishes loading the data after 76077 milliseconds, achieving a load rate of almost 33 MB/sec. Quite terrific, considering that hdparm's buffered disk-read benchmark (-t) comes out at ~54 MB/sec for the device the data is read from. Writing the data to disk in kdb+'s native format takes another 74 seconds, resulting in 1.6 GB on disk.
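The save command isn't shown above either; the simplest variant would be a flat binary save (the target path is my choice):

/ write the whole table as a single kdb+ binary file
q)`:training_set set d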

Now finally, let's do some quick analysis:

/ number of records
q)count d

/ average rating [1183 msec]
q)avg d.r

/ rating distribution [3833 msecs]
q)total:count d
q)select p:(count r)%total, n:count r by r from d
r| p n
-| -------------------
1| 0.04595906 4617990
2| 0.1008363 10132080
3| 0.2867347 28811247
4| 0.3358956 33750958
5| 0.2305744 23168232

/ worst-rated movies [3340 msec]
q)5#`r xasc select r:avg r, n:count r by m from d
m | r n
-----| ------------
515 | 1.287879 132
16487| 1.346021 289
11768| 1.37561 205
17429| 1.396761 247
3575 | 1.4 165

Hah, enough fun for one evening. Matching the movie IDs to their titles is left as an exercise for the reader. And to put it bluntly: kdb+ rocks :)
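For the impatient, a sketch of that exercise, assuming the dataset's movie_titles.txt holds id,year,title records (untested; titles that themselves contain commas would need extra care):

/ assumed format: id,year,title; "*" reads the title as a string
q)t:flip `m`y`title!("II*";",") 0: `:movie_titles.txt
/ unkey the worst-rated result and left-join the titles on m
q)(0!5#`r xasc select r:avg r, n:count r by m from d) lj `m xkey t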

All timings were done on a dual AMD Opteron 246 with 3.5 GB RAM running Debian GNU/Linux 3.1 with a 2.6-series kernel.