start | find | index | login or register | edit | ||
2006-10-06
by earl, 6660 days ago
Another occassion to let Kdb+ do what it does best, namely crunching data: As part of the Netflix Prize competition, Netflix released a dataset of roughly 100 million datestamped human ratings of 17770 movies. That's certainly a lovely heap of data to grind the mighty axe of kdb+ on and well, that's just what Chris and I did the other night.First things first, namely some pre-processing. Each movie's ratings are contained in a seperate file ( mv_nnnnnnn.txt ) leaving us with 17770 files which are bundled in a 2.0 GB tar archive. The movie's ID (terminated by a colon) is repeated as the first line in the movie's file, followed by comma-separated user,rating,date records.Now let's see how fast we can load that bunch of data into kdb+. We start off with a simple pre-processing AWK-script which prefixes each record with the correct movie ID and join together all those records into a single huge file: BEGIN { OFS="," } This awk script takes 221326 milliseconds to pre-process the dataset: $ tar Oxf training_set.tar | awk -f join.awk > training_set.csv I'm certainly no awk wizard, so if you have suggestions on making this script faster, please leave them in the comments. The resulting file training_set.csv is 2.5 GB large (a 25% increase in file size).Now that we have that data processed into a very nice and regular form, loading it into kdb+ is a matter of only a few keystrokes:
Kdb+ finishes loading the data after 76077 milliseconds achieving a load rate of almost 33 MB/sec. Quite terrific, considering that hdparm's buffered disk read-benchmark (-t) comes out at ~54 MB/sec for the actual device the data is read from. Writing the data to disk in a kdb+ native format takes another 74 secs resulting in 1.6 GB used on disk. Now finally, let's do some quick analysis: / number of records Hah, enough fun for one evening. Getting the movie titles associated with the movies is left as an exercise to the reader. And to put it bluntly: kdb+ rocks :) All timings were done on a dual AMD Opteron 246 with 3.5 GB RAM running Debian GNU/Linux 3.1 with a 2.6-series kernel. |
search 2 active users
backlinks (more) none, yet recent stores (more) recent comments echo earlZstrainYat|tr ZY @. |
|
earl.strain.at • esa3 • online for 8701 days • c'est un vanilla site |