Re: Indexing takes forever

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Sat May 07 2005 - 16:33:12 GMT

I have a few comments to add:

On Fri, May 06, 2005 at 12:53:45PM -0700, Nick wrote:
> 11,322	.doc
> 137	.txt
> 8,536	.xls
> 2,026	.ppt
> 1,575	.pdf
> 1,129	.htm
> 25	.html

Not very many docs, but I don't know how large those are.  If those
were text I'd expect it to take just a few minutes to index.  If the
files are huge then you might want to use -e to have swish use disk
instead of RAM.  Still, that's not that many files.
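
To get a feel for how large those files are, something like the
following can total the bytes per extension.  This is only a sketch in
plain shell: the function name size_by_ext is made up here, and it
assumes file names whose extension contains no whitespace.

```shell
# size_by_ext DIR
# Prints "<extension> <file count> <total bytes>" for files under DIR.
size_by_ext() {
    find "$1" -type f -name '*.*' -exec wc -c {} + |
    awk '
        # wc adds a "total" line whenever it is given several files; skip it
        $NF == "total" { next }
        {
            n = split($NF, parts, ".")   # extension = text after the last dot
            ext = parts[n]
            count[ext]++
            bytes[ext] += $1
        }
        END { for (e in count) print e, count[e], bytes[e] }
    ' | sort
}
```

Running "size_by_ext /home/foo" would show whether the .doc and .xls
files are large enough that -e is worth trying.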

> I am somewhat confused at the best (for speed) way to setup indexing.
> have read through all the docs (or at least I think I did), and I am still
> somewhat confused at the best way to setup the filters.  In some places it
> seems to say I don't need to configure anything specifically to get the
> extra ms word/excel/powerpoint functionality, and in others I get the
> impression I am supposed to actively configure something for each file
> type.  I have installed all the programs I am supposed to for pdf, word,
> excel, and powerpoint from what I read.

Here's a little overview -- I suppose you get all this by now, though:

Swish can parse only xml, html, and text files.  All others need to
be filtered.  There are basically two ways to do that:

FileFilter in the swish config causes swish to call an external
program for each document that needs to be filtered.  This works
fine for simple filtering tasks.
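
As an illustration of the FileFilter approach, the sketch below writes
a small config fragment from the shell.  The directive lines follow
the examples in the swish-e configuration docs as I remember them, so
check them against your version.  The function name and the file name
filters.conf are made up, and pdftotext is assumed to be on your PATH.

```shell
# write_filter_config FILE
# Writes a sample swish-e config fragment that runs PDF files through
# pdftotext.  %p stands for the path of the document being filtered.
write_filter_config() {
    cat > "$1" <<'EOF'
# Filtered output is plain text, so index it with the text parser.
IndexContents TXT* .pdf
FileFilter .pdf pdftotext "'%p' -"
EOF
}
```

After "write_filter_config filters.conf" you would index with
something like "swish-e -c filters.conf -i /path/to/docs", and swish
would start pdftotext once per PDF file.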

But when the external filter is a Perl script there is a considerable
startup cost each time the filter is run, so it's nice to have the
filter code loaded only once.

The second way is the -S prog input method, where one long-running
program fetches and filters the documents and feeds them to swish.
That method is often preferred.  You can think of it as working like
this series of pipes:

    fetch_docs | filter_docs | swish-e

In practice, the first two steps are done together.  DirTree.pl is an
example of that.  So, you can actually do this:

   $ /path/to/DirTree.pl $docs_dir | swish-e -S prog -i stdin -c config

DirTree.pl scans the file system starting at one or more top directories
("$docs_dir"), and for each document it calls the SWISH::Filter module
for filtering.  SWISH::Filter is a Perl module that combines all
filtering into a single tool.

DirTree.pl calls SWISH::Filter which looks up the mime type of the
document and then looks for a SWISH::Filters::* module to handle that
type of document.
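
For a sense of what a program like DirTree.pl hands to swish, here is
a rough stand-in written in shell for plain-text files only.  It
assumes the -S prog input format as I understand it: a Path-Name and a
Content-Length header, an optional Document-Type, a blank line, then
exactly that many bytes of content.  The function names are invented,
and it does no filtering at all, which is the part SWISH::Filter adds.

```shell
# emit_doc FILE
# Writes one document in the header-plus-content form read by -S prog.
emit_doc() {
    size=$(wc -c < "$1" | tr -d '[:space:]')   # byte count without padding
    printf 'Path-Name: %s\n' "$1"
    printf 'Content-Length: %s\n' "$size"
    printf 'Document-Type: TXT*\n\n'           # blank line ends the headers
    cat "$1"
}

# emit_tree DIR
# Emits every .txt file found under DIR, one document after another.
emit_tree() {
    find "$1" -type f -name '*.txt' | sort | while IFS= read -r file; do
        emit_doc "$file"
    done
}
```

You would use it the same way as DirTree.pl, for example
"emit_tree /home/foo | swish-e -S prog -i stdin -c config".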

The individual SWISH::Filters::* modules do the work for each file
type. These filters often call external programs -- so just having the
SWISH::Filters::* module installed is not enough to have the filter
work.  On Linux, for example, you need to install the helper programs
pdftotext and pdfinfo from the Xpdf package.
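
A quick way to see whether the helper programs are actually on your
PATH is a loop like the one below.  It is just a sketch: the function
name is made up, and you pass in whatever helper names your filters
need (pdftotext and pdfinfo are the only two named above).

```shell
# missing_helpers PROGRAM...
# Prints the name of each program that cannot be found on PATH.
missing_helpers() {
    for prog in "$@"; do
        command -v "$prog" >/dev/null 2>&1 || printf '%s\n' "$prog"
    done
}
```

"missing_helpers pdftotext pdfinfo" prints nothing when both are
installed, and otherwise lists the ones you still need.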

So thinking about those pipes above you can then do this:

   $ DirTree.pl /home/foo > docs.txt
   $ swish-e -S prog -i stdin -c config < docs.txt

to break indexing into two steps.  That's nice since you can then
look at docs.txt and see how your docs are filtered.  Pipe into gzip
if you care about disk space.
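
The gzip variant of those two steps could look like this.  The
function names and the docs.txt.gz file name are only for
illustration, and DirTree.pl and swish-e are assumed to be on your
PATH.

```shell
# save_docs DOCS_DIR OUTPUT_GZ
# Step 1: fetch and filter once, keeping a compressed copy of the output.
save_docs() {
    DirTree.pl "$1" | gzip > "$2"
}

# index_saved INPUT_GZ CONFIG
# Step 2: index from the saved copy, as often as needed.
index_saved() {
    gzip -dc "$1" | swish-e -S prog -i stdin -c "$2"
}
```

For example, "save_docs /home/foo docs.txt.gz" followed by
"index_saved docs.txt.gz config".  You can still inspect the filtered
output with "gzip -dc docs.txt.gz | less".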

There's also the program swish-filter-test that is a thin wrapper
around SWISH::Filter.  You can use it to test if you can filter a
given file.  That's faster than generating a complete docs.txt above.


On Fri, May 06, 2005 at 01:09:32PM -0700, Peter Karman wrote:
>
> 1. make sure the SWISH::Filter class is in your Perl include path:
> 
>   % export PERL5LIB=/usr/local/lib/swish-e  # bash, bourne shells
>   % setenv PERL5LIB /usr/local/lib/swish-e  # csh, tcsh

You don't need to do that when using DirTree.pl.  That path gets set
when you install swish-e.  Look at the top of DirTree.pl.
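
If you want to check that, a small helper like this shows the library
path lines near the top of a script.  It assumes the installed script
sets its path with a "use lib" line, which is how I recall DirTree.pl
doing it.  The function name is made up.

```shell
# show_lib_paths SCRIPT
# Prints any "use lib" lines, with line numbers, from the first 40 lines.
show_lib_paths() {
    head -n 40 "$1" | grep -n '^[[:space:]]*use lib'
}
```

Try "show_lib_paths /path/to/DirTree.pl".  If nothing is printed, then
setting PERL5LIB as Peter describes is the fallback.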


Parsing Windows documents doesn't always work.  You may need to use
the most current versions of the filtering tools you can get.  I think
the format of those docs may change over time, and I'm not so sure
Microsoft publishes the specification for them.

-- 
Bill Moseley
moseley@hank.org

Received on Sat May 7 09:33:24 2005