Re: swish-e solution compared to others ...

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Jun 05 2002 - 06:02:41 GMT
At 07:56 AM 06/04/02 -0700, Roland BENEDETTI wrote:
>Hello,
>I have recently discovered Swish-e software and previously I have used Lucene
>Java Jakarta software for the development of an indexing and search
>engine.
>Could someone help me compare those two solutions?
>
>Does Swish-e scale as large as Lucene?
>Is it faster?
>Is it simpler?

I can't answer any of those questions, as I know nothing about Lucene.

Swish-e seems to be commonly used for reasonably small collections (not
sure what "small" means -- 50,000-100,000 docs?) that don't change very
often.  It sounds like Lucene is designed for incremental indexing, where
swish isn't at this time.  Swish-e is very fast at indexing, and in some
cases that can make up for its lack of incremental indexing.

Here are a few samples of indexing times.  The first is on my Linux desktop,
indexing /usr/doc:

  24735 files indexed.  184345060 total bytes.  20254324 total words.
  Elapsed time: 00:01:20 CPU time: 00:01:00

Here's one on a reasonably busy FreeBSD server:

  46023 files indexed.  689283714 total bytes.  23783228 total words.
  Elapsed time: 00:18:21 CPU time: 00:11:39

(By the way, that second one uses -e to reduce memory usage, which
adds a little time to the indexing.)
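
To make the two samples easier to compare, here is a small sketch that turns the reported figures into rates.  The function name and the choice of files per second and MB per second (1 MB taken as 10^6 bytes) are my own; only the input numbers come from the runs above.  Keep in mind the two runs are different machines and different document sets, and the second used -e, so this is not a benchmark.

```python
def indexing_rate(files, total_bytes, elapsed):
    """Return (files per second, MB per second) for one indexing run.

    `elapsed` is the "HH:MM:SS" elapsed time as printed in the summary.
    """
    hours, minutes, seconds = (int(part) for part in elapsed.split(":"))
    total_seconds = hours * 3600 + minutes * 60 + seconds
    return files / total_seconds, total_bytes / total_seconds / 1e6


# Figures copied from the two runs quoted above.
desktop = indexing_rate(24735, 184345060, "00:01:20")
server = indexing_rate(46023, 689283714, "00:18:21")

for name, (files_per_s, mb_per_s) in (("desktop", desktop), ("server", server)):
    print(f"{name}: {files_per_s:.1f} files/s, {mb_per_s:.2f} MB/s")
```

That works out to roughly 309 files per second on the desktop and roughly 42 files per second on the busy server.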

Fast indexing speed helps if you want to attempt some type of incremental
indexing.

For example, if you had a mailing list with, say, 50 messages a day, you
might reindex all the new messages every time a message comes in until you
hit some limit (perhaps when reindexing takes more than a few seconds), then
reindex everything into a master index and start over with an empty "new
messages" index.  That gives you two indexes, and swish can search both at
the same time.  It's kind of like a full backup plus an incremental
backup.
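
A rough sketch of that rollover policy, as a toy model only.  The class name, the index file names (master.index, recent.index), and the 3-second limit are all made up for illustration, and the sketch never runs swish-e: the caller passes in however long the last reindex of the small index took.  The search command uses -w for the query and -f with two index files; check the switches against the docs for your version.

```python
from dataclasses import dataclass, field


@dataclass
class TwoTierIndex:
    """Toy model of a master index plus a small "new messages" index."""

    limit_seconds: float = 3.0  # hypothetical rollover threshold
    master: list = field(default_factory=list)
    recent: list = field(default_factory=list)

    def add_message(self, message, reindex_seconds):
        """Record a new message and report which index would be rebuilt.

        `reindex_seconds` is the measured time of the last rebuild of the
        small index; measuring it is left to the caller.
        """
        self.recent.append(message)
        if reindex_seconds > self.limit_seconds:
            # Small index got too slow: fold everything into the master
            # index and start over with an empty "new messages" index.
            self.master.extend(self.recent)
            self.recent.clear()
            return "master"
        return "recent"

    def search_command(self, query):
        """Build an argument list that searches both indexes in one call."""
        return ["swish-e", "-w", query, "-f", "master.index", "recent.index"]


index = TwoTierIndex(limit_seconds=3.0)
print(index.add_message("msg-1", reindex_seconds=0.4))  # small index only
print(index.add_message("msg-2", reindex_seconds=3.5))  # over the limit
print(index.search_command("incremental"))
```

Basing the rollover on measured reindex time rather than a message count means the limit adapts to the machine and to message size.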

On the other hand, if you had 100,000 docs that changed daily and they had
to be searchable as soon as they were modified then swish-e probably would
not be the best choice.

So it really depends on what kind of data you have, how often it changes,
and how many docs you need to maintain.

Lucene's FAQ says it's very fast at indexing.  If you do try it out, could
you post some numbers?

Searching speed is fine with swish.  

You can get an idea of swish-e's features by looking at the config options
listed at http://swish-e.org/2.2/docs/

Let's see, Lucene doesn't have a web crawler according to their FAQ, and
swish includes one.  It wouldn't be too hard to find one, though.
Swish also has internal parsers for HTML, XML, and text file types.  I
think with Lucene you need to find something to parse your docs.  Again,
that shouldn't be too hard.

Swish-e is typically run by calling the binary program, but there is also a
swish-e library for searching.

Swish-e only indexes 8-bit chars, and I doubt that will change any time
soon.  I don't know what Lucene indexes.

Sorry I couldn't offer more specifics.  




-- 
Bill Moseley
mailto:moseley@hank.org
Received on Wed Jun 5 06:06:23 2002