At 8/9/2004 06:29 PM, Josh Rabinowitz wrote:
>Hello Everyone:
>
>I just wanted to let everyone know that I've made my USENIX paper
>"Indexing Arbitrary Data with SWISH-E" available on my website at
>http://joshr.com/src/docs/IndexingWithSwishe-Rabinowitz.pdf
Thanks for posting this - I've been looking forward to seeing it. I think
the paper is really good and will find lots of use in the future.
The only project I've done with SWISH-E was prototyped using MySQL
3.23.something without using MySQL's full-text indexing. It worked OK for a
few thousand documents, but ran out of steam long before it reached my
current 40,000 documents (average length = 350 words). I would have
switched to using full-text but MySQL 4.0.1 is required for boolean searches
and a new version wasn't going to be available on the intended server. So I
cast about, landing on SWISH-E. Implementation wasn't too hard, performance
has been great, and I've chosen not to eliminate any common words. This
means that my search users get pretty much what they expect no matter what
they search for. And on top of that, now it appears that it is faster and
smaller than a MySQL implementation would have been. Of course, I still use
MySQL for lots of other (non-text-search) stuff.
As a side effect of writing this, I became curious about the importance of
short words. I do keep track of unique words used in searches, but not
their frequency of use. So, as a really crude approximation, the vocabulary
of my search users consists of:
1% one-character words
4% two-character words
11% three-character words
17% four-character words
17% five-character words
14% six-character words
12% seven-character words
9% eight-character words
6% nine-character words
4% ten-character words
3% eleven-character words
1% twelve-character words
1% thirteen-character through eighteen-character words
This is based on a total vocabulary of about 800 searched-for words (so
far). The one-character words used were: a, c, f, g, h, j, m, r, s, t, and
z. The only eighteen-character word was: electroluminescent.
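For anyone who wants to run the same kind of tally on their own search log, here is a minimal sketch in Python. It is not the script used for the figures above. The function name length_distribution and the sample word list are made up for illustration, and it assumes the log has already been reduced to a plain list of searched-for words. It counts each unique word once, ignoring case, just as the numbers above ignore frequency of use.

```python
from collections import Counter


def length_distribution(words):
    """Return {word_length: percent of unique words}, rounded to whole percents.

    Each distinct word is counted once (case-insensitive), so this measures
    the vocabulary, not how often each word was searched for.
    """
    unique = {w.lower() for w in words if w}
    total = len(unique)
    if total == 0:
        return {}
    counts = Counter(len(w) for w in unique)
    return {n: round(100 * c / total) for n, c in sorted(counts.items())}


# Hypothetical sample; a real run would read the logged search words instead.
sample = ["a", "c", "swish", "index", "mysql", "search", "electroluminescent"]

for length, pct in length_distribution(sample).items():
    print(f"{pct}% {length}-character words")
```

Because each percentage is rounded separately, the column may not add up to exactly 100. Keeping the raw counts from the Counter as well would make it easy to add frequency of use later.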
Dennis Nichols
Received on Mon Aug 9 20:40:36 2004