Re: Indexing takes forever

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Fri May 06 2005 - 20:09:53 GMT
Nick scribbled on 5/6/05 2:54 PM:
> I currently have swish-e 2.4.3 up and working.  It appears to be working
> fine (with a small set of files) but indexing all my files is taking a
> really long time.


You're right, it should not be taking that long.


> 
> I am somewhat confused at the best (for speed) way to setup indexing.  I
> have read through all the docs (or at least I think I did), and I am still
> somewhat confused at the best way to setup the filters.

As luck has it, I spent the morning working on the docs, so at least I have it 
fresh in my head (which may not mean much).

swish-e does not know how to read non-text files like .pdf, .doc, .xls and .ppt. 
You need some 3rd party programs to convert those to text so that swish-e can 
index them. The Windows distribution of swish-e bundles some of those 3rd party 
apps: xpdf and catdoc (see the note here: 
http://swish-e.org/download/index.html). Since you're using Linux and mounting 
the Windows volume remotely, you need to install the 3rd party apps for Linux. I 
think the filters/README file talks about that (I haven't gotten to that doc 
revision yet...).

You're also calling swish-e with the default -S fs method (since you don't 
specify one explicitly). You probably want -S prog, in order to get your docs 
filtered with the 3rd party apps.
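
To make the -S prog idea concrete: with that method swish-e reads documents 
from the output of an external program instead of opening the files itself. The 
program prints a small header block, a blank line, and then the document text. 
Below is a minimal shell sketch of that output. It is an illustration only, not 
what DirTree.pl does internally; emit_doc is a hypothetical helper name, and 
only the Path-Name and Content-Length headers are shown.

```shell
#!/bin/sh
# emit_doc PATH TEXTFILE
# Print one document in the header-plus-body form that swish-e reads
# from a -S prog program. PATH is the name recorded in the index;
# TEXTFILE holds the already-converted plain text.
# (emit_doc is a hypothetical name used for this sketch.)
emit_doc() {
    path=$1
    textfile=$2
    # Content-Length must be the exact byte count of the body.
    size=$(wc -c < "$textfile" | tr -d ' ')
    printf 'Path-Name: %s\n' "$path"
    printf 'Content-Length: %s\n' "$size"
    printf '\n'              # blank line ends the headers
    cat "$textfile"
}

# Example use (assumes the text was produced earlier by a converter
# such as pdftotext or catdoc):
#   emit_doc /mnt/docs/report.pdf /tmp/report.txt
```

A real program would loop over the directory tree, convert each binary file to 
text, and call something like emit_doc once per file.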

A few things I would try:

1. make sure the SWISH::Filter class is in your Perl include path:

  % export PERL5LIB=/usr/local/lib/swish-e  # bash, bourne shells
  % setenv PERL5LIB /usr/local/lib/swish-e  # csh, tcsh


2. index with this command instead:

  % swish-e -c /etc/swish.conf -S prog -i DirTree.pl

3. if you're going to index every night, but the binary docs (.pdf, .doc, etc.) 
don't change that often, consider caching the filtered output. The filtering 
causes the most overhead: a new forked process for each doc.

You can cache output with the DirTree.pl script, or roll your own.

4. like I mentioned, I'm working on the docs even now, so if there are specific 
ways you think that they could be improved, post back to the list.
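
For the "roll your own" caching in item 3, here is a rough shell sketch of the 
idea: keep the converted text in a cache directory and rerun the converter only 
when the source file is newer than the cached copy. Everything named here 
(CACHE_DIR, cached_filter, pdf_to_text, the paths) is an assumption made for 
illustration, not part of swish-e or DirTree.pl.

```shell
#!/bin/sh
# Hypothetical cache location; any writable directory works.
CACHE_DIR=${CACHE_DIR:-/var/tmp/swish-cache}

# cached_filter SRC CMD [ARGS...]
# Print the path of a cached text version of SRC. "CMD ARGS... SRC" is
# run (writing text to stdout) only when no cached copy exists or SRC
# has been modified since the copy was made.
cached_filter() {
    src=$1
    shift
    mkdir -p "$CACHE_DIR" || return 1
    # Flatten the source path into a file name. Paths that differ only
    # in "/" versus "_" would collide; acceptable for a sketch.
    key=$(printf '%s' "$src" | tr '/' '_')
    cached=$CACHE_DIR/$key.txt
    # -nt ("newer than") is supported by bash, dash and ksh.
    if [ ! -f "$cached" ] || [ "$src" -nt "$cached" ]; then
        # Write to a temp file first so a failed conversion never
        # leaves a truncated cache entry behind.
        if ! "$@" "$src" > "$cached.tmp"; then
            rm -f "$cached.tmp"
            return 1
        fi
        mv "$cached.tmp" "$cached" || return 1
    fi
    printf '%s\n' "$cached"
}

# Example use (assumes xpdf's pdftotext is installed; "-" sends the
# text to stdout):
#   pdf_to_text() { pdftotext "$1" -; }
#   txt=$(cached_filter /mnt/docs/report.pdf pdf_to_text)
```

On a nightly run, only new or changed files pay the cost of forking the 
converter; everything else is read straight from the cache.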


-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Fri May 6 13:09:54 2005