Re: Swish-e scalability, performance

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Mon Nov 15 2004 - 21:17:30 GMT
On Mon, Nov 15, 2004 at 01:04:57PM -0800, dstevens@roaddog.com wrote:
> A couple of the drawbacks with swish-e for a large Web wide search tool
> were spider.pl after long (700-800k pages, or a few days) crawls would
> hang or become incredibly slow even on a dual Opteron 242 with 4GB ram. 

Hmm, do you think the machine was running low on memory?  spider.pl
simply keeps a hash of URLs seen, so it's all in memory.  It would be
nice to have spider.pl use either a database or BerkeleyDB so that it
could be restarted -- I thought about just using Storable to dump the
hash to disk if it gets a signal to abort.  Then read that back in to
continue.
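
The dump-and-reload idea can be sketched roughly as follows. This is a Python illustration only, not spider.pl code: pickle stands in for Perl's Storable, and the SeenURLs class, its method names, and install_abort_handler are hypothetical names made up for the example. It assumes the whole set of seen URLs still fits in memory, so it addresses restarting a crawl, not the memory use itself.

```python
import os
import pickle
import signal
import tempfile


class SeenURLs:
    """In-memory set of visited URLs that can be dumped to disk and reloaded."""

    def __init__(self, path):
        self.path = path
        self.urls = set()
        # If an earlier run left a dump behind, pick up where it stopped.
        if os.path.exists(path):
            with open(path, "rb") as fh:
                self.urls = pickle.load(fh)

    def first_visit(self, url):
        """Return True the first time a URL is seen, False afterwards."""
        if url in self.urls:
            return False
        self.urls.add(url)
        return True

    def dump(self):
        """Write to a temp file, then rename it over the old dump,
        so an interrupted write cannot corrupt the previous copy."""
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "wb") as fh:
            pickle.dump(self.urls, fh)
        os.replace(tmp, self.path)


def install_abort_handler(seen, signum=signal.SIGINT):
    """Dump the seen set when the crawl gets a signal to abort, then exit."""
    def handler(sig, frame):
        seen.dump()
        raise SystemExit(1)
    signal.signal(signum, handler)
```

A crawl loop would call install_abort_handler(seen) once at startup and seen.first_visit(url) before fetching each link. Restarting with the same path then skips everything already visited. The queue of URLs still waiting to be fetched would need to be saved the same way; that part is left out here.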

> To be fair I don't think the original intent of swish-e was to be a
> Web wide level search tool, but it does a pretty good job up to a
> million or two pages.

That's the bottom line. Kevin wrote the original swish in a weekend or
so and the basic design hasn't really changed.  Things are faster, but
that's about it.  That's kind of a problem: when you evaluate swish it
looks really fast compared to other indexers, but then you hit some
limit and it slows down sharply.

I'm always amazed when people post that they are using it for millions
of documents.

-- 
Bill Moseley
moseley@hank.org

Received on Mon Nov 15 13:17:31 2004