
Re: Filters/HTTP (was:Documentation structure)

From: <Rainer.Scherg(at)>
Date: Wed Dec 13 2000 - 08:34:54 GMT
> -----Original Message-----
> From: Bill Moseley []
> Sent: Wednesday, December 13, 2000 1:27 AM
> To: Multiple recipients of list
> Subject: [SWISH-E] Re: Filters/HTTP (was:Documentation structure)

> I've mentioned this before, but I'm not sure how integrated the HTTP
> method should be in swish.  I'm not saying that there shouldn't be a
> way to spider documents, but rather that maybe there should be a
> modular approach to the way the HTTP method is connected to swish.


We should have two methods in the future (IMO):

  - internal swish httpd spidering.
  - external feeding of swish.

The best program I know for spidering is "wget" (as you also mentioned).

Including parts of "wget" as an internal spider in swish would provide
all the means we need:
  - spider level control
  - host span options
  - domain control
  - etc.
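
Until something like that exists inside swish, "external feeding" can
already be approximated by mirroring with wget and then indexing the
local tree. A rough sketch (directory and domain are made up; see the
wget man page for the exact options):

   # mirror up to 3 levels deep, staying below the start URL,
   # into a local tree that swish can index as plain files;
   # -H -D <list> would add host-span / domain control
   wget -r -l 3 --no-parent -P /tmp/mirror http://www.mydomain.example/

   # then index the mirrored tree with the normal file-system method,
   # e.g. with "IndexDir /tmp/mirror" in swish.conf
   swish-e -c swish.conf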

> Now about filters.  Again, I don't use filters, but the current system
> looks like you define a file extension and a program that swish calls.
>       FilterDir   /usr/local/apache/swish-e/filters-bin/
>       FileFilter  .pdf
> will get passed the name of the file to filter.
> I'm unclear if you can use filters in http mode.  The documentation
> indicates that a URL is passed, which would mean that the filter would
> also need to retrieve the remote document first -- a process that
> isn't really related to filtering.

That's not quite correct.

The following parameters are passed to a filter script:
   - file path to index
   - real path or url

In the case of file indexing, "file path" and "real path" are the same.

Passing the real path/URL is just for information or for special
purposes; it is mostly not used by the filter program.
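
To make the calling convention concrete: a trivial PDF filter could
look like the sketch below (assuming the two positional arguments
described above, and that pdftotext is installed; this is an
illustration, not an official filter):

   #!/usr/bin/env python
   # sketch of a swish filter: argv[1] is the local file to filter,
   # argv[2] is the real path or URL (informational only)
   import subprocess
   import sys

   local_path = sys.argv[1]   # the file swish wants indexed
   real_path = sys.argv[2]    # original path/URL; usually ignored

   # write the document as plain text to stdout ("-" means stdout)
   subprocess.run(["pdftotext", local_path, "-"], check=True)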

> Anyway, with the current system swish must fork and exec /bin/sh -c
> for each document.  Forking isn't that expensive in modern operating
> systems, but it still seems like it would be slower than just opening
> up the filter program once and feeding it the documents one after
> another, leaving the filter program running in memory.

Yep, that's correct, but it also has some disadvantages:

  - you have to implement a communication protocol between swish and
    the filter; the filter has to know when a new document starts and
    what document it is (see the sketch after this list).

  - you cannot use simple scripts.

  - you have to implement a multi-filter protocol
    (there are more filters than just PDF filters).

  - you still have to fork/exec the external filter programs (xpdf,
    ghostscript, catdoc, etc.).
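
Just to show what such a protocol would involve, here is an invented
framing scheme (nothing like this exists in swish; the
"<length> <real-path>" header is made up for illustration): the filter
stays resident, and swish sends each document as a length-prefixed
message on its stdin.

   # invented long-running filter loop; the framing is an assumption
   import sys

   def read_exact(stream, n):
       # read exactly n bytes from stream or raise EOFError
       buf = b""
       while len(buf) < n:
           chunk = stream.read(n - len(buf))
           if not chunk:
               raise EOFError("pipe closed mid-document")
           buf += chunk
       return buf

   while True:
       header = sys.stdin.buffer.readline()   # b"<length> <real-path>\n"
       if not header:
           break                              # swish closed the pipe
       length, real_path = header.split(None, 1)
       body = read_exact(sys.stdin.buffer, int(length))
       # a real filter would convert body (PDF, DOC, ...) to text here
       sys.stdout.buffer.write(body)
       sys.stdout.flush()

Even this toy version shows the cost: every filter author has to get
the framing exactly right, which is what rules out simple scripts.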

With the httpd method, IMO most of the time is spent retrieving the
documents from the remote server. Following your proposal, it would
make sense to have a multithreaded swish engine with an httpd
read-ahead. But this would mean a major redesign of swish.
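
As a sketch of what that read-ahead could look like (worker count,
queue size, and URLs are made-up illustration values):

   # rough httpd read-ahead: worker threads fetch URLs ahead of the
   # single-threaded indexer, so indexing never waits on the network
   import queue
   import threading
   import urllib.request

   urls = queue.Queue()           # URLs still to fetch
   docs = queue.Queue(maxsize=8)  # bounded buffer = the "read-ahead"

   def fetcher():
       while True:
           url = urls.get()
           if url is None:
               docs.put(None)     # tell the indexer this worker is done
               return
           try:
               with urllib.request.urlopen(url) as resp:
                   docs.put((url, resp.read()))
           except OSError:
               pass               # skip unreachable documents in this sketch

   def index(url, body):
       print("indexing %s (%d bytes)" % (url, len(body)))

   workers = [threading.Thread(target=fetcher) for _ in range(4)]
   for w in workers:
       w.start()
   for u in ["http://localhost/a.html", "http://localhost/b.html"]:
       urls.put(u)
   for _ in workers:
       urls.put(None)             # one stop marker per worker

   done = 0
   while done < len(workers):
       item = docs.get()
       if item is None:
           done += 1
       else:
           index(*item)

The bounded queue is the read-ahead: the fetchers stay at most a few
documents ahead of the indexer instead of racing through the whole site.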

But we should keep the proposal in mind.

cu - rainer

Received on Wed Dec 13 08:37:58 2000