The only tutorial I can find on indexing external file formats is Josh's 'How to Index anything'. It is very readable, and the examples work.
But I note that, although it is highlighted as 'recently featured', it was written in July 2003.
I now read:
There are two ways to filter documents with Swish-e. Both are described in the SWISH-CONFIG man page. They use the FileFilter directive and the SWISH::Filter Perl module.
Does Josh use the FileFilter directive?
I assume he does, but I can't find it mentioned anywhere.
I then read:
Previous versions of Swish-e (before 2.4.0) used a collection of filter programs for converting files such as PDF or MS Word documents. The external programs call other program to do the work of filtering (e.g. pdftotext to extract the contents from PDF files). Although these filter programs are still included with the Swish-e distribution as examples, it is recommended to use the SWISH::Filter method, instead.
So is this saying DO NOT use Josh's approach?
Then I read a paragraph which I simply don't understand:
But, Swish-e will not use SWISH::Filter by default when using the file system method of indexing. To use SWISH::Filter when indexing by file system method (-S fs), you can use a FileFilter directive with the swish_filter.pl filter (which is just a program that uses SWISH::Filter) or use the -S prog method of indexing and use the DirTree.pl program for fetching documents.
Can I have a single index for a directory containing different file types?
I guess I have to add lines like
FileFilter .pdf pdftotext "'%p' -"
IndexContents TXT* .pdf
to the config file
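Putting that together, my untested guess at a config that covers several file types in one index would be something like the sketch below. The paths, the index file name, and the catdoc line for Word files are my own assumptions (catdoc would have to be installed and on the PATH); only the two .pdf lines come from what I quoted above.

```
# Hypothetical swish.conf: one index for a directory of mixed file types.
# Assumed locations of the documents and of the resulting index file.
IndexDir /path/to/docs
IndexFile /path/to/docs.index

# Only consider files with these extensions.
IndexOnly .html .txt .pdf .doc

# PDF: convert with pdftotext, then parse the output as plain text.
FileFilter .pdf pdftotext "'%p' -"
IndexContents TXT* .pdf

# MS Word: assumes catdoc is available; its output is plain text too.
FileFilter .doc catdoc "'%p'"
IndexContents TXT* .doc

# HTML and plain text files should need no filter.
IndexContents HTML* .html
IndexContents TXT* .txt
```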
But then what arguments do I pass to swish-e to create the index?
Is there an example or tutorial anywhere?
Received on Fri Dec 2 07:57:03 2005