Re: Big Index works!

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Jun 12 2002 - 14:36:36 GMT
At 02:53 AM 6/12/2002 -0700, Cristiano Corsani wrote:

>Hi all,
>
>I write just to tell that swish-e works with my big DB.

Cool.



>1111343 files indexed.  2026462918 total bytes.  92327081 total words.
>Elapsed time: 16:27:15 CPU time: 16:27:15

16 hours!   

Average size of a doc is about 1,823 bytes.

>on a Pentium IV with 250Mb RAM.

On my Athlon 1800+ with 1/2G of RAM I can index about 24,000 files in
a minute.  Less than an hour for a million.  On my PIII-550 the same 24,000
files take about 4 minutes.  So that's about 3 hours to do a million files.
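
As a rough check on the arithmetic above, here is a small Python sketch that
turns the quoted figures into files-per-minute rates. The function names are
made up for illustration, and it assumes indexing speed scales linearly with
the number of files, which may not hold once a machine starts swapping.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an 'H:MM:SS' elapsed-time string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def files_per_minute(files: int, seconds: int) -> float:
    """Average indexing rate over the whole run."""
    return files / seconds * 60


def hours_for(files: int, rate_per_minute: float) -> float:
    """Hours needed to index `files` at a constant rate (linear assumption)."""
    return files / rate_per_minute / 60


if __name__ == "__main__":
    # Figures taken from the quoted swish-e summary.
    files, total_bytes = 1111343, 2026462918
    elapsed = hms_to_seconds("16:27:15")

    print("average doc size (bytes):", round(total_bytes / files))
    print("reported run, files/min: ", round(files_per_minute(files, elapsed)))

    # Rates mentioned in the reply: 24,000 files/min and 24,000 files per 4 min.
    for label, rate in (("Athlon 1800+", 24000), ("PIII-550", 24000 / 4)):
        print(f"{label}: {hours_for(1_000_000, rate):.1f} hours per million files")
```

The reported run works out to a little over a thousand files a minute, well
below either of the rates above, which is what points toward memory pressure.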

My guess is you are running out of memory while indexing.  Did you index
with the -e switch?  It will keep your disk drive busy, but will save RAM.
Better to let swish swap than the OS.  Best to use a machine with more RAM.

How does one monitor memory usage on Windows?

So it says: 2,778,708 unique words indexed.

That's a lot of words to index.  Will people be searching all those words?
Trim that number down and you will save memory.

Make sure you are *not* indexing a unique record identifier.  No point
indexing something you can use to look up the item directly in a database.

Run swish-e -T index_words_only > word_list and then you can look at the
words indexed.  You may see words that do not need to be indexed.
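
With nearly 2.8 million unique words, reading word_list by eye is not
practical. The following is a minimal Python sketch for getting a first
impression of what is in it. It assumes word_list holds whitespace-separated
words (the exact output format of the debug switch is not checked here), and
the categories and the 25-character cutoff are arbitrary choices for
illustration, not anything swish-e defines.

```python
import sys
from collections import Counter


def classify(word: str) -> str:
    """Put a word into a rough bucket that hints at whether it is worth indexing."""
    if word.isdigit():
        return "numeric"        # e.g. record numbers, dates, page counts
    if any(ch.isdigit() for ch in word):
        return "alphanumeric"   # e.g. identifiers such as rec00017
    if len(word) > 25:          # arbitrary cutoff, adjust to taste
        return "very_long"
    return "plain"


def summarize(words) -> Counter:
    """Count how many words fall into each bucket."""
    return Counter(classify(w) for w in words)


def read_words(path: str):
    """Yield words from a file, assuming whitespace-separated tokens."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            yield from line.split()


if __name__ == "__main__" and len(sys.argv) > 1:
    for bucket, count in summarize(read_words(sys.argv[1])).most_common():
        print(f"{bucket:>12}: {count}")
```

Saved under a name of your choosing and run with word_list as its argument,
it prints one count per bucket. A large "numeric" or "alphanumeric" count
would suggest that record identifiers or similar tokens are being indexed.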

Also, you might search the archive using a Subject Only search for "multi
millions words" and also search for BIGHASHSIZE to look at possible tuning
you might be able to do.

Hope this helps.

Bill Moseley
mailto:moseley@hank.org
Received on Wed Jun 12 14:40:14 2002