Re: SWISH-E digest 2364

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Sat May 27 2006 - 15:37:43 GMT

On Sat, May 27, 2006 at 08:02:10AM -0700, Glenn Hammonds wrote:
> I've a set of indices totaling 240,000 documents, 1.3 million words,
> 134 million word positions installed in a relatively unloaded BSD
> system.    Discounting initial runs that took an extra second and
> appear to have cached the indices, both searches took 4.8 seconds,
> with remarkably little difference seen between various words used.
> 
> On another set installed on the same system with 7 million files, 2.4
> million words and 700 million word positions, the search times were
> 208 seconds for 'not wordnotinindex' and 194 seconds for 'not(word and
> not word)'.  I did them in that order, and expect a portion of the
> difference in time was due to caching.  I only ran these once.
> 
> Were you expecting a big difference?

Kind of.

    not dkdkdksksks

does the hash lookup in the table, and if the word is not found then it
knows to create a list of all documents.

    not(word and not word)

grabs all docs with "word", then grabs all with "word" again and
inverts that.  Then the two result sets are merged and then that set
is inverted.
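
To make the difference concrete, here is a rough sketch of the two evaluation paths using plain Python sets. This is only a toy model and not Swish-e's actual code: the `index` dict, the `all_docs` set, and both function names are assumptions made up for illustration.

```python
def search_not_missing(index, all_docs, word):
    """Model of 'not <word>': one hash lookup, then invert."""
    postings = index.get(word)          # single hash lookup
    if postings is None:
        # Word is not in the index, so every document matches.
        return set(all_docs)
    return all_docs - postings


def search_not_word_and_not_word(index, all_docs, word):
    """Model of 'not(word and not word)': two lookups, a merge, two inversions."""
    left = index.get(word, set())               # all docs with "word"
    right = all_docs - index.get(word, set())   # "word" again, inverted
    merged = left & right                       # AND of the two sets (always empty)
    return all_docs - merged                    # final inversion


# Tiny hypothetical index: word -> set of document ids.
index = {"word": {1, 2}}
all_docs = {1, 2, 3, 4}

print(search_not_missing(index, all_docs, "dkdkdksksks"))
print(search_not_word_and_not_word(index, all_docs, "word"))
```

Both calls end up returning every document. The second form just does more set work to get there, and in both cases the full document list still has to be built, which fits the similar timings reported above.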

Either way, it's not very efficient.  And you can also see that "not
dkdkdkdk" is a potential DoS on your server.  In the past I've looked
for and rejected queries that contained only "not" and a search word.
I've also limited other things that could eat CPU, like the number of
wildcards:  a* or b* or c* or d* or e*  can get painful.
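
A minimal sketch of that kind of pre-check, run in the front-end script before the query is handed to swish-e, might look like the following. The function name `is_query_allowed`, the `max_wildcards` limit of 3, and the simple tokenizing are all assumptions for illustration. It only catches the plain "not word" case, so a nested form like not(word and not word) would still get through.

```python
import re

OPERATORS = {"and", "or", "not"}


def is_query_allowed(query, max_wildcards=3):
    """Return False for queries that are likely to be expensive."""
    # Split on whitespace and parentheses; compare case-insensitively.
    tokens = re.findall(r"[^\s()]+", query.lower())
    words = [t for t in tokens if t not in OPERATORS]

    # Reject a query that is nothing but "not" plus a single search word.
    if len(tokens) == 2 and tokens[0] == "not" and len(words) == 1:
        return False

    # Limit the number of wildcard terms (hypothetical threshold).
    wildcards = sum(1 for t in words if "*" in t)
    return wildcards <= max_wildcards


for q in ("not dkdkdkdk", "a* or b* or c* or d* or e*", "swish and index*"):
    print(q, "->", is_query_allowed(q))
```

The right threshold depends on the index size and the hardware, so treat the default here as a placeholder.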

-- 
Bill Moseley
moseley@hank.org

Received on Sat May 27 08:37:44 2006