esa://2006-10-06

/ number of records
q)count d
100480507

/ average rating [1183 msec]
q)avg d.r
3.60429

/ rating distribution [3833 msecs]
q)select p:(count r)%total, n:count r by r from d
r| p n
-| -------------------
1| 0.04595906 4617990
2| 0.1008363 10132080
3| 0.2867347 28811247
4| 0.3358956 33750958
5| 0.2305744 23168232

/ worst-rated movies [3340 msec]
q)5#`r xasc select r:avg r, n:count r by m from d
m | r n
-----| ------------
515 | 1.287879 132
16487| 1.346021 289
11768| 1.37561 205
17429| 1.396761 247
3575 | 1.4 165

oct 6, in:

2006 / 2002 / 2001

18 active users

unknown (18)

backlinks (more)

none, yet

echo earlZstrainYat|tr ZY @.


start \| find \| index \| login or register \| edit
2006-10-06 by earl, 6861 days ago Another occassion to let Kdb+ do what it does best, namely crunching data: As part of the Netflix Prize competition, Netflix released a dataset of roughly 100 million datestamped human ratings of 17770 movies. That's certainly a lovely heap of data to grind the mighty axe of kdb+ on and well, that's just what Chris and I did the other night. First things first, namely some pre-processing. Each movie's ratings are contained in a seperate file (`mv_nnnnnnn.txt`) leaving us with 17770 files which are bundled in a 2.0 GB tar archive. The movie's ID (terminated by a colon) is repeated as the first line in the movie's file, followed by comma-separated `user,rating,date` records. Now let's see how fast we can load that bunch of data into kdb+. We start off with a simple pre-processing AWK-script which prefixes each record with the correct movie ID and join together all those records into a single huge file: BEGIN { OFS="," } /^[0-9]+:/ { split($1, parts, ":"); m = parts[1] } /^[0-9]+,/ { print m, $1 } This awk script takes 221326 milliseconds to pre-process the dataset: $ tar Oxf training_set.tar \| awk -f join.awk > training_set.csv # real 3m41.326s I'm certainly no awk wizard, so if you have suggestions on making this script faster, please leave them in the comments. The resulting file `training_set.csv` is 2.5 GB large (a 25% increase in file size). Now that we have that data processed into a very nice and regular form, loading it into kdb+ is a matter of only a few keystrokes: d:+`m`u`r`d!("IIID";",")0:`$"training_set.csv" Kdb+ finishes loading the data after 76077 milliseconds achieving a load rate of almost 33 MB/sec. Quite terrific, considering that hdparm's buffered disk read-benchmark (-t) comes out at ~54 MB/sec for the actual device the data is read from. Writing the data to disk in a kdb+ native format takes another 74 secs resulting in 1.6 GB used on disk. Now finally, let's do some quick analysis: / number of records q)count d 100480507 / average rating [1183 msec] q)avg d.r 3.60429 / rating distribution [3833 msecs] q)select p:(count r)%total, n:count r by r from d r\| p n -\| ------------------- 1\| 0.04595906 4617990 2\| 0.1008363 10132080 3\| 0.2867347 28811247 4\| 0.3358956 33750958 5\| 0.2305744 23168232 / worst-rated movies [3340 msec] q)5#`r xasc select r:avg r, n:count r by m from d m \| r n -----\| ------------ 515 \| 1.287879 132 16487\| 1.346021 289 11768\| 1.37561 205 17429\| 1.396761 247 3575 \| 1.4 165 Hah, enough fun for one evening. Getting the movie titles associated with the movies is left as an exercise to the reader. And to put it bluntly: kdb+ rocks :) All timings were done on a dual AMD Opteron 246 with 3.5 GB RAM running Debian GNU/Linux 3.1 with a 2.6-series kernel.	search oct 6, in: 2006 / 2002 / 2001 18 active users unknown (18) backlinks (more) none, yet recent stores (more) 2025-01-01 films-seen 2022-01-01 2021-06-12 2019-12-17 2017-05-15 recent comments #1 at 2025-01-01 by taggy #1 at 2021-06-12 by taggy #1 at 2019-12-17 by taggy #1 at 2016-08-24 by taggy #1 at 2016-06-12 by taggy `echo earlZstrainYat\|tr ZY @.`
earl.strain.at • esa3 • online for 8902 days • c'est un vanilla site