Re: Indexing takes forever

From: Nick <newsgroups(at)not-real.2thebatcave.com>
Date: Fri May 06 2005 - 20:28:55 GMT
As far as docs go, I would like to see a few different sample swish.conf
files (and possibly related command line options like you showed below)
for different applications.  Generally when I am setting something up I
like to see example setups/configs and play around with them before trying
to fine-tune anything.  If there were more example configs, a user could
just pick one that is close to what they are looking for to get going,
then work from that.

On the same note, what should I put in the config file if I use the
command you suggested below?

swish-e -c /etc/swish.conf -S prog -i DirTree.pl

I need to be able to search MS Word, Excel, PowerPoint, PDF, HTML, and
text files.
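
A minimal sketch of a swish.conf that could go with that command, assuming
DirTree.pl takes the directories to walk as its arguments and hands each
file to SWISH::Filter.  The paths /mnt/docs and /var/lib/swish are
placeholders, not values from this thread:

```
# Sketch only; run as: swish-e -c /etc/swish.conf -S prog

# With -S prog, IndexDir names the program whose output swish-e reads.
IndexDir DirTree.pl

# Arguments passed to that program (assumed here: the directory to walk).
SwishProgParameters /mnt/docs

# Where to write the index (placeholder path).
IndexFile /var/lib/swish/index.swish-e
```

With IndexDir set in the config, the -i DirTree.pl on the command line
should be redundant; either form ought to work.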

The doc files especially change very often so I probably wouldn't want to
cache those, but since I have mostly doc files I probably won't bother
caching anything at this point.

I have xpdf, catdoc, ppthtml, and the Excel Perl stuff installed on the
Linux box.

I am guessing that it was just using the default HTML filter to find text
in the doc and ppt files that I searched, then?  I know that it could find
text in these binary files using my existing config, which is why I thought
it was somehow finding the extra progs I had installed to filter the file
types.

>
>
> Nick scribbled on 5/6/05 2:54 PM:
>> I currently have swish-e 2.4.3 up and working.  It appears to be working
>> fine (with a small set of files) but indexing all my files is taking a
>> really long time.
>
>
> you're right. should not be taking that long.
>
>
>>
>> I am somewhat confused at the best (for speed) way to setup indexing.  I
>> have read through all the docs (or at least I think I did), and I am still
>> somewhat confused at the best way to setup the filters.
>
> as luck has it, I spent the morning working on the docs. So at least I have
> it fresh in my head (which may not mean much).
>
> swish-e does not know about non-text files like .pdf, .doc, .xls and .ppt.
> You need some 3rd party programs to convert those to text so that swish-e
> can index them. For the windows distrib of swish-e, some of those 3rd party
> apps are bundled in: xpdf and catdoc (see the note here:
> http://swish-e.org/download/index.html). Since you're using Linux and
> mounting the windows volume remotely, you need to install the 3rd party
> apps for Linux. I think the filters/README file talks about that (I haven't
> gotten to that doc revision yet...).
>
> You're also calling swish-e with the default -S fs method (since you don't
> specify one explicitly). You probably want -S prog, in order to get your
> docs filtered with the 3rd party apps.
>
> A few things I would try:
>
> 1. make sure the SWISH::Filter class is in your Perl include path:
>
>   % export PERL5LIB=/usr/local/lib/swish-e  # bash, bourne shells
>   % setenv PERL5LIB /usr/local/lib/swish-e  # csh, tcsh
>
>
> 2. index with this command instead:
>
> swish-e -c /etc/swish.conf -S prog -i DirTree.pl
>
> 3. if you're going to index every night, but the binary docs (pdf, .doc,
> etc) don't change that often, consider caching the filtered output. The
> filtering causes the most overhead: a new forked process for each doc.
>
> you can cache output with the DirTree.pl script, or roll your own.
>
> 4. like I mentioned, I'm working on the docs even now, so if there are
> specific ways you think that they could be improved, post back to the list.
>
>
> --
> Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
>
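
The caching idea in point 3 of the quoted reply can be sketched in shell
roughly as follows.  This is an illustration only: cached_filter and
CACHE_DIR are hypothetical names, not part of swish-e or DirTree.pl, and
the converter is assumed to take the file name as its last argument and
write text to stdout.

```shell
#!/bin/bash
# Hypothetical helper, not part of swish-e or DirTree.pl.
# Usage: cached_filter FILE CONVERTER [ARGS...]
# Runs "CONVERTER ARGS... FILE" only when no cached copy exists or FILE is
# newer than the cached copy, then prints the cached text on stdout.
cached_filter() {
    local src=$1; shift
    local cache_dir=${CACHE_DIR:-/tmp/swish-filter-cache}
    mkdir -p "$cache_dir" || return 1

    # Derive a cache file name from the source path. cksum is only a
    # convenient short key here; two paths could in principle collide.
    local key
    key=$(printf '%s' "$src" | cksum | cut -d' ' -f1)
    local cached="$cache_dir/$key.txt"

    # Re-run the converter only if the cache is missing or stale.
    if [ ! -f "$cached" ] || [ "$src" -nt "$cached" ]; then
        "$@" "$src" > "$cached" || { rm -f "$cached"; return 1; }
    fi
    cat "$cached"
}
```

For example, cached_filter /mnt/docs/report.doc catdoc would fork catdoc
only on the first run or after report.doc changes (the path is a
placeholder).  Unchanged files then cost one cat instead of one converter
process per nightly run.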
Received on Fri May 6 13:28:56 2005