Re: Problems indexing PDF files using HTTP crawler

From: Bill Moseley <moseley(at)>
Date: Mon Jan 09 2006 - 12:59:36 GMT

On Mon, Jan 09, 2006 at 03:32:09AM -0800, Rosalyn Hatcher wrote:
> Consequently, I decided it must be my config file so ditched it and
> started again.  The problem line in my swish.conf file was
> FileFilter .pdf pdftotext "'%p' -"
> Once that was removed all seems to work ok.  Not sure I understand
> why this line isn't needed as my internet searches indicated that it
> was.

Because the spider uses SWISH::Filter by default to filter PDF files.
The spider was fetching the PDF and converting it to text, and then
swish was passing that already-converted text to pdftotext, which
doesn't take plain text as input.
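
As a rough sketch of the fix, assuming the rest of swish.conf stays as
it was and that PDFs are only ever indexed through the spider, the
FileFilter line quoted above can simply be commented out so the text
is converted once, by the spider:

```
# swish.conf (fragment)
# The spider already converts PDFs to text via SWISH::Filter,
# so running pdftotext again here would be handed plain text.
# FileFilter .pdf pdftotext "'%p' -"
```

A FileFilter line like that is probably still what you want if swish
reads the PDF files itself rather than getting them from the spider.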

Bill Moseley

