Re: Indexing large nbrs of docs

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Jun 01 2001 - 04:59:17 GMT

At 07:44 PM 05/31/01 -0700, Greg Caulton wrote:

>    Large, well compared to my other indexes :-)
>
>    I wish to index a directory with 2800 word docs, of which the total
>combined size is 720MB.

I think Jose has indexed something like 600,000 docs.  (Is that right, Jose?)


>    However the indexing is getting slower and slower as the number of
>documents indexed increases - and I believe it will run for several
>hours before slowing to a crawl.

Hard to tell without more information.  Are you running out of memory when
indexing?

Swish 2.1 has an -e (economy) switch that uses less memory, but it's not
yet clear how much it helps in general.  If it keeps you from swapping then
it's a big help.

The other issue is with filters.  If you are using a shell or (especially)
a perl script with FileFilters then, yes, it can be very slow because it
runs the script for every document.

FileFilters are smarter now: for some filters you can skip the wrapper
shell or perl script and have swish run the filter program directly.  Swish
still calls popen for every document, so a shell is still started each
time, but that is much, much faster than running a perl script for every
document.

    FileFilter .doc "/usr/local/bin/catdoc" "-s8859-1 -d8859-1 '%p'"

Swish 2.1 has a new input method called "prog" where an external program
feeds documents to swish.  So the external program can be a perl script
that runs (compiles) only one time and stays running while indexing all
documents.

This can be a very significant increase in indexing speed if you *must* use
a perl or shell script in your processing.
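
To make the idea concrete, here is a rough sketch of such a feeder, written
in Python rather than perl.  It assumes the header-plus-body format used by
the prog method in later Swish-e releases (Path-Name, Content-Length and
Last-Mtime headers, a blank line, then the document); check the prog-bin
examples for what 2.1-dev actually expects.  The names format_document,
feed_directory and the convert hook are made up for this illustration.

```python
import os


def format_document(path, content, mtime):
    """Return one document as headers, a blank line, then the raw bytes.

    The header names are an assumption taken from later Swish-e releases.
    """
    header = (
        f"Path-Name: {path}\n"
        f"Content-Length: {len(content)}\n"
        f"Last-Mtime: {int(mtime)}\n"
        "\n"
    )
    return header.encode("utf-8") + content


def feed_directory(root, out, convert=None):
    """Walk root and write every file to the binary stream out.

    convert is an optional hypothetical hook, convert(path, raw_bytes),
    that returns the bytes to index (for example, text extracted from a
    word doc).  Returns the number of documents written.
    """
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                content = handle.read()
            if convert is not None:
                content = convert(path, content)
            out.write(format_document(path, content, os.path.getmtime(path)))
            count += 1
    return count
```

The point is that the script starts once and loops over all the files
itself, instead of being started once per document.  A wrapper would call
feed_directory(some_dir, sys.stdout.buffer) so that swish reads the
documents from the program's standard output.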

If you cannot avoid a shell or perl script for filtering, then you should
probably try using the prog method.  There are examples in the prog-bin
directory of the 2.1-dev distribution.  But if you are just indexing word
docs, then try that FileFilter command first and let us know what happens.

>    Is it possible to merge separate smaller indexes?

Yes, but it will only help if your problem is running low on memory.
Otherwise it probably won't save you any time.

But you must first find out if you are running out of memory while indexing.
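
One rough way to check, sketched here in Python for a Unix-like system: run
the indexing command as a child process and look at its peak memory and
major page fault counts afterwards.  The function name run_and_report is
made up for this example, and the swish command line in the note below is
only a placeholder for whatever you normally run.

```python
import resource
import subprocess


def run_and_report(command):
    """Run command to completion and return its exit code plus memory stats.

    The figures cover all child processes this script has waited for.
    ru_maxrss is in kilobytes on Linux and in bytes on macOS.  A large
    ru_majflt (page faults that needed disk I/O) suggests swapping.
    """
    completed = subprocess.run(command)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "exit_code": completed.returncode,
        "peak_rss": usage.ru_maxrss,
        "major_faults": usage.ru_majflt,
    }
```

For example, run_and_report(["swish-e", "-c", "swish.conf"]) would index
with a config file called swish.conf.  If the peak is close to your physical
memory, or the major fault count is large, memory is probably the problem.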



Bill Moseley
mailto:moseley@hank.org
Received on Fri Jun 1 04:59:30 2001