Re: Filters/HTTP (was:Documentation structure)

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Dec 13 2000 - 00:26:51 GMT
At 05:07 PM 12/12/00 -0500, David Norris wrote:
>The filter section could become large.  A separate section for that is a
>good idea.  The indexing robot may be worthy of its own section, as
>well.  The robot really acts as a special form of index filter; maybe it
>could be combined with filters?

Filters and the HTTP method are different things, of course.  The HTTP
(spider) method feeds files to swish for indexing, while filters
translate the files into something swish can index, no matter how swish
gets the files.

I don't use either but I do have comments (surprise!).

I've mentioned this before, but I'm not sure how integrated the HTTP method
should be in swish.  I'm not saying that there shouldn't be a way to spider
documents, but rather that maybe there should be a modular approach to the
way the HTTP method is connected to swish.

Spidering is tricky, and it can be slow, especially if grabbing one file
at a time.  And there are a lot of options to consider when spidering:
robots.txt, how deep to spider, how fast to spider, how to deal with
timeouts, how often requests should be retried, which domain names should
be considered the same site, and so on.  That's a lot of detail about
issues that aren't really related to indexing documents.
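
Just to illustrate how much policy is involved, here's a rough, purely
hypothetical sketch (in Python) of the kind of fetch loop a spider needs.
The URL, delay, and timeout values are made-up examples, not anything
swish provides:

  import time
  import urllib.request
  import urllib.robotparser

  START = "http://www.example.com/"   # hypothetical starting URL
  DELAY = 1.0                         # seconds to wait between requests
  TIMEOUT = 30                        # per-request timeout in seconds

  robots = urllib.robotparser.RobotFileParser(START + "robots.txt")
  robots.read()

  def fetch(url):
      """Fetch one URL, honoring robots.txt, a politeness delay, and timeouts."""
      if not robots.can_fetch("*", url):
          return None                 # disallowed by robots.txt
      time.sleep(DELAY)               # crude rate limiting
      try:
          with urllib.request.urlopen(url, timeout=TIMEOUT) as response:
              return response.read()
      except OSError:
          return None                 # a real spider would retry a few times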

So, in some ways I'd like to see a system where you tell swish to use an
external program for delivering the files to swish for indexing instead of
using the built-in file access method.

We should include a basic spider program in the distribution, but
someone might want to write a spider program, for example, that grabs
files from multiple sites in parallel and does some type of special
filtering before passing the documents on to swish.

Or, another example: say you have local documents that contain
descriptions of various web pages.  You might use a program that indexes
the local files but, at the same time, retrieves via HTTP the remote web
pages that each local file describes, extracts the words, and includes
them in a <META> tag.  Then searching for words found in a remote web
page would return the local file, even though none of the search terms
appear in the local file.
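
A rough sketch of what that could look like (purely illustrative -- the
URL and META name are made up, and the tag stripping is deliberately
crude):

  import re
  import urllib.request

  def remote_words(url):
      """Fetch a remote page and return its words with the markup stripped."""
      html = urllib.request.urlopen(url).read().decode("latin-1", "replace")
      text = re.sub(r"<[^>]+>", " ", html)    # crude tag stripper
      return " ".join(sorted(set(text.split())))

  # The URL would really come from the local description file.
  url = "http://www.example.com/some-page.html"
  print('<meta name="remotewords" content="%s">' % remote_words(url))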

Now, about filters.  Again, I don't use filters, but with the current
system it looks like you define a file extension and a program that
swish calls:

      FilterDir   /usr/local/apache/swish-e/filters-bin/
      FileFilter  .pdf   pdf-filter.sh

pdf-filter.sh will get passed the name of the file to filter.  
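
In other words, a filter is just a program that takes a filename as its
argument and writes indexable text to stdout.  A hypothetical stand-in
for pdf-filter.sh might look like this (here in Python, shelling out to
the pdftotext utility from xpdf -- any converter would do):

  #!/usr/bin/env python
  # Hypothetical pdf filter: convert the PDF named on the command line
  # to plain text on stdout so swish can index it.
  import subprocess
  import sys

  pdf = sys.argv[1]    # swish passes the name of the file to filter
  # pdftotext writes to stdout when "-" is given as the output file
  subprocess.run(["pdftotext", pdf, "-"], check=True)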

I'm unclear on whether you can use filters in HTTP mode.  The
documentation indicates that a URL is passed, which would mean the filter
would also need to retrieve the remote document first -- a process that
isn't really related to filtering.

Anyway, with the current system swish must fork and exec /bin/sh -c for
each document.  Forking isn't that expensive in modern operating systems,
but it still seems like it would be slower than opening the filter
program once and feeding it the documents one after another, leaving the
filter program running in memory.

For example, with a really simple filter that just calls cat $1,
indexing 6000 files on my Linux P550 with 128M went from 26 seconds to
1:38.  And on a two-processor Sun machine with 2GB of RAM it went from
1:12 to 3:38.

Granted, most filters will call an external program anyway, but if
filtering were needed and speed were important, all the filtering code
could be put in one program, with no need to constantly fork and exec a
shell.

So what am I proposing?  First, the file method would be the only built-in
access method.  Then, for spidering or even just filtering all documents:

  DocumentSource /usr/local/bin/wget_wrapper
or
  DocumentSource /usr/local/swish/httpd  # spider included with swish
or
  DocumentSource /usr/local/bin/dump_mysql_db

The program that's called might get passed IndexDir and IndexOnly, or,
maybe better, the path to the swish config file.  This external program
could then return records consisting of the document name, length,
content-type (if known), and the contents, one after another.
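
None of this exists today, so the record format below is invented purely
for illustration, but a DocumentSource program might be as simple as:

  #!/usr/bin/env python
  # Hypothetical DocumentSource program: walk a directory tree and write
  # one record per document to stdout -- name, length, content-type,
  # then the raw contents.  The header layout is made up for this sketch.
  import os
  import sys

  ROOT = "/usr/local/apache/htdocs"   # would really come from the config file

  for dirpath, dirnames, filenames in os.walk(ROOT):
      for name in filenames:
          path = os.path.join(dirpath, name)
          data = open(path, "rb").read()
          ctype = "text/html" if name.endswith(".html") else "text/plain"
          header = ("Path-Name: %s\nContent-Length: %d\nContent-Type: %s\n\n"
                    % (path, len(data), ctype))
          sys.stdout.buffer.write(header.encode())
          sys.stdout.buffer.write(data)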

And the same with filtering:

  DocumentFilter text/html /usr/local/bin/htmlstrip
  DocumentFilter .gz /usr/local/bin/expand_gz.pl

But the idea is that the programs are only exec'd once and then are passed
document after document.
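
For example, a persistent filter might read one filename per line on
stdin and write back a length-prefixed chunk of filtered text, so swish
knows where each document ends.  The framing here is invented just for
the sake of the sketch:

  #!/usr/bin/env python
  # Hypothetical persistent filter: started once by swish, then fed one
  # filename per line on stdin.  For each file it writes the byte count
  # followed by the filtered contents.
  import sys

  for line in sys.stdin:
      path = line.rstrip("\n")
      text = open(path, "rb").read()   # a real filter would convert here
      sys.stdout.buffer.write(b"%d\n" % len(text))
      sys.stdout.buffer.write(text)
      sys.stdout.buffer.flush()        # don't make swish wait on a buffer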

I think these things could make swish more scalable and more flexible.

Bill Moseley
mailto:moseley@hank.org
Received on Wed Dec 13 00:29:31 2000