VERY CONFUSED ABOUT FILTERS

From: David Larkin <david.larkin(at)not-real.djl.co.uk>
Date: Fri Dec 02 2005 - 15:57:02 GMT
The only tutorial I can find on indexing external file formats is Josh's 'How to Index Anything'. It is very readable, and the examples work.

But I note that, although it is highlighted as 'recently featured', it was written in July 2003.

I now read:

There are two ways to filter documents with Swish-e. Both are described in the SWISH-CONFIG man page. They use the FileFilter directive and the SWISH::Filter Perl module.

Does Josh use the FileFilter directive?

I assume he does, but I can't find it mentioned anywhere.

I then read:
Previous versions of Swish-e (before 2.4.0) used a collection of filter programs for converting files such as PDF or MS Word documents. The external programs call other programs to do the work of filtering (e.g. pdftotext to extract the contents from PDF files). Although these filter programs are still included with the Swish-e distribution as examples, it is recommended to use the SWISH::Filter method, instead.

So, is this saying DO NOT use Josh's approach?

Then I read a paragraph which I simply don't understand.

But, Swish-e will not use SWISH::Filter by default when using the file system method of indexing. To use SWISH::Filter when indexing by file system method (-S fs), you can use a FileFilter directive with the swish_filter.pl filter (which is just a program that uses SWISH::Filter) or use the -S prog method of indexing and use the DirTree.pl program for fetching documents.
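
My best guess at what the second option means is a config something like the one below. This is untested: the location of DirTree.pl, the directory name, and the index file name are all placeholders of mine, and I am assuming DirTree.pl is told which directory to walk through SwishProgParameters.

```
# prog-method sketch (untested; all paths are placeholders)
# IndexDir names the program that fetches documents, not a directory
IndexDir             /usr/local/lib/swish-e/DirTree.pl
# Arguments handed to that program: the directory tree to walk
SwishProgParameters  /path/to/docs
IndexFile            ./docs.index
# Presumably run as: swish-e -c swish.conf -S prog
```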

Can I have a single index for a directory with different file types?

I guess I have to add lines like 

FileFilter .pdf  pdftotext   "'%p' -"
IndexContents TXT* .pdf

to the config file
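
Put together, I imagine the whole file would look roughly like this for the file system method. It is only a sketch: the directory, the index file name, and the list of suffixes are my own placeholders, and it assumes pdftotext is on the PATH.

```
# fs-method sketch (untested; paths and suffixes are placeholders)
IndexDir       /path/to/docs
IndexFile      ./docs.index
# Only pick up these file types from the directory
IndexOnly      .html .txt .pdf
# Run PDFs through pdftotext, writing the extracted text to stdout
FileFilter     .pdf  pdftotext  "'%p' -"
# Choose a parser per suffix; filtered PDF output is plain text
IndexContents  HTML* .html
IndexContents  TXT*  .txt .pdf
# Presumably run as: swish-e -c swish.conf -S fs
```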

But then what args do I use with swish-e to create the index?

Is there an example or tutorial anywhere?

Thanks
Received on Fri Dec 2 07:57:03 2005