start | find | index | login or register | edit
Dienstag, 23. Dezember 2008 link

Kragen Sitaker's "dumbfts" is a "full-text indexing and search engine for my email, in about 250 lines of code. It can produce results for many (most?) queries on my 12-gigabyte mailbox in under two seconds on my 700MHz laptop."

250 lines of Python, that is. Most amazing. It finally inspired a clear vision of how I'd want my email system to look like, cf 2008-10-02. More on that later, maybe.


Kragen Javier Sitaker 615 days ago:
I'm glad you like it! Have you tried using it? I'll probably fix a few of the more annoying aspects of its UI soon... and maybe I'll try using an SSD for the index storage to get faster results :)

Python actually makes it quite a challenge to get fast search response times with an on-disk index. There's a tradeoff between waiting for more seeks and chewing through more data, and Python is pretty slow at the chewing-through-data part.

earl 614 days ago:
Not yet, as my mail is stored in maildirs. But I plan to play around with it later today. I'll report back, then :)

Kragen Javier Sitaker 614 days ago:
I'm not sure how to index maildirs fast. I mean, obviously you want to use a different method for finding the messages than what I'm using, but I mean if you want the indexing to run fast, it has to not do a disk seek per email message. So how do you read messages from a maildir in mostly the order they're stored on the disk? You could maybe `stat` all of them and sort the list by inode number before you start opening files.

I haven't had much luck with storing more than a few hundred thousand messages in a maildir, though.

earl 611 days ago:
So, here are some first results:

8m18s to index a 1.3 GB mbox on my dual-core 2.6GHz notebook. The resulting index is 1.2GB according to du and spread over 27 segments.

Without any post-processing, most queries take between 6 to 10 seconds wall clock on the same machine, as measured by time fts mbox query > /dev/null. Query time doesn't change much if I store the index in memory (on a ramfs), but I've 4GB memory here, so the index was in a fs cache even without ramfs.

Kragen Javier Sitaker 603 days ago:
6 to 10 seconds for queries on a 27-segment index is about what I'd expect, although of course it's frustratingly slow. I've just changed the default chunksize from 200k to 50k, and also changed indexmail.py to build the skipfiles automatically, which should afford noticeably better performance.

If you run merge.py, it will suggest merging all or most of those segments into a single new segment, then deleting the old segments, which will give you much better performance for most queries.

If you wanted to optimize for answering queries out of filesystem cache instead of from disk, you could use a much smaller chunksize, like 1k or 4k. The reason for making the chunksize largish is to avoid spending all of our time doing disk seeks.

earl 602 days ago:
Yes, as expected, things are getting much better with the merged index.

Indexing speed stayed the same at roughly 8m7s. merge.py suggested merging all segments into one, which I did and then rebuilt a skipfile with buildskips.py.

With the same timing method as before, most queries now take between 0.1s and 3s. For queries with large result sets, most of that time is spent reading the actual mail content, so if you page through the results there wouldn't be no noticeable delay at all before you get the first page of content.

That's once again with all settings left at their defaults. Maybe I'll do a testrun with a smaller chunksize later, but it's already fast enough:

$ time ./fts mbox 'kragen from ' 'bicicleta ' |
formail -3 -s formail -z -x Subject:
naming of "self" in Bicicleta
rational number class in Bicicleta's language
various approaches to primitive functions

real 0m0.128s
user 0m0.056s
sys 0m0.028s

(That requires a small patch of msgs.py to ignore the IOError that's caused by formail closing the pipe.)

Kragen Javier Sitaker 601 days ago:
How long did it take to do the merge for you?

There are cases where there will be a noticeable delay before you get the first page of results --- try fts mbox a b c for example. There are even queries with small result sets that will have the same problems, like fts mbox a b c 'kragen from ' 'bicicleta '. A little bit of query optimization here could go a long way: estimate the number of postings for each query term, evaluate the query terms that produce the smallest number of results first, and then just use the rest to filter messages out from the stream produced by the more selective terms. Where this usually hurts is in queries like fts mbox 'watchdog list-id ' 'sep date ' '2008 date ', which takes about 30 seconds on my 12GB mailbox.

Is there a place I can pull your patch from?

Please log in (you may want to register first) to post comments!

powered by vanilla
echo earlZstrainYat|tr ZY @.
earl.strain.at • esa3 • online for 3469 days • c'est un vanilla site