Re: ANNOUNCE: "Indexing Arbitrary Data with SWISH-E"

From: Dennis Nichols <nichols(at)>
Date: Tue Aug 10 2004 - 03:40:13 GMT
At 8/9/2004 06:29 PM, Josh Rabinowitz wrote:
>Hello Everyone:
>I just wanted to let everyone know that I've made my USENIX paper
>"Indexing Arbitrary Data with SWISH-E" available on my website at

Thanks for posting this - I've been looking forward to seeing it. I think 
the paper is really good and will find lots of use in the future.

The only project I've done with SWISH-E was prototyped using MySQL 
3.23.something without using MySQL's full-text indexing. It worked OK for a 
few thousand documents, but ran out of steam long before it reached my 
current 40,000 documents (average length = 350 words). I would have 
switched to full-text indexing, but MySQL 4.0.1 is required for boolean searches 
and a new version wasn't going to be available on the intended server. So I 
cast about, landing on SWISH-E. Implementation wasn't too hard, performance 
has been great, and I've chosen not to eliminate any common words. This 
means that my search users get pretty much what they expect no matter what 
they search for. On top of that, it now appears that SWISH-E is faster and 
smaller than a MySQL implementation would have been. Of course, I still use 
MySQL for lots of other (non-text-search) stuff.

As a side effect of writing this, I became curious about the importance of 
short words. I do keep track of unique words used in searches, but not 
their frequency of use. So, as a really crude approximation, the vocabulary 
of my search users consists of:

  1% one-character words
  4% two-character words
 11% three-character words
 17% four-character words
 17% five-character words
 14% six-character words
 12% seven-character words
  9% eight-character words
  6% nine-character words
  4% ten-character words
  3% eleven-character words
  1% twelve-character words
  1% thirteen-character through eighteen-character words

This is based on a total vocabulary of about 800 searched-for words (so 
far). The one-character words used were: a, c, f, g, h, j, m, r, s, t, and 
z. The only eighteen-character word was: electroluminescent.
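
For anyone who wants to produce a similar breakdown from their own search 
logs, here is a minimal Python sketch. It is not the script used for the 
numbers above. It assumes the unique searched-for words are already available 
as a list of strings; the function name length_distribution and the sample 
words are made up for illustration.

```python
from collections import Counter


def length_distribution(words):
    """Return {word_length: rounded percent of unique words}.

    Words are lower-cased and de-duplicated first, so this measures the
    vocabulary, not how often each word was searched for.
    """
    vocab = {w.strip().lower() for w in words if w.strip()}
    if not vocab:
        return {}
    counts = Counter(len(w) for w in vocab)
    total = len(vocab)
    return {n: round(100 * counts[n] / total) for n in sorted(counts)}


# Hypothetical sample vocabulary, for illustration only.
sample = ["a", "c", "led", "LED", "lamp", "panel", "display",
          "electroluminescent"]

for length, pct in length_distribution(sample).items():
    print(f"{pct:3d}% {length}-character words")
```

Because each percentage is rounded independently, the rows may not add up to 
exactly 100 for every vocabulary.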

Dennis Nichols  
Received on Mon Aug 9 20:40:36 2004