Re: Problems indexing PDF files using HTTP crawler

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Mon Jan 09 2006 - 12:59:36 GMT

On Mon, Jan 09, 2006 at 03:32:09AM -0800, Rosalyn Hatcher wrote:
> Consequently, I decided it must be my config file so ditched it and
> started again.  The problem line in my swish.conf file was
> 
> FileFilter .pdf pdftotext "'%p' -"
> 
> Once that was removed all seems to work ok.  Not sure I understand
> why this line isn't needed as my internet searches indicated that it
> was.

Because the spider already uses SWISH::Filter by default to filter PDF
files.  The spider was fetching the PDF and converting it to text, and
then swish was passing that text to pdftotext via your FileFilter line.
pdftotext doesn't take plain text as input, so the second pass failed.
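
To make the double conversion concrete, here is a rough Python sketch
of the same situation.  It is only an illustration: it does not call
swish-e, the spider, or pdftotext, and the names fake_pdf_to_text and
index_document are made up for the example.  The one real-world fact
it relies on is that PDF files begin with the "%PDF-" header.

```python
PDF_MAGIC = b"%PDF-"  # real PDF files begin with this header


def fake_pdf_to_text(data: bytes) -> bytes:
    """Hypothetical stand-in for a PDF-to-text converter such as pdftotext."""
    if not data.startswith(PDF_MAGIC):
        # A converter that expects PDF input cannot handle plain text.
        raise ValueError("input is not a PDF")
    # Pretend everything after the header line is the extracted text.
    return data.split(b"\n", 1)[1]


def index_document(data: bytes, filters) -> bytes:
    """Run each filter in order, like a chain of conversion stages."""
    for convert in filters:
        data = convert(data)
    return data


pdf = b"%PDF-1.4\nhello world"

# One conversion stage (the spider's built-in filtering): works.
once = index_document(pdf, [fake_pdf_to_text])

# Two conversion stages (spider filtering plus a FileFilter line):
# the second stage receives plain text and fails.
try:
    index_document(pdf, [fake_pdf_to_text, fake_pdf_to_text])
    twice_failed = False
except ValueError:
    twice_failed = True
```

Dropping the second stage is the equivalent of removing the FileFilter
line from swish.conf when the spider is already doing the filtering.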

-- 
Bill Moseley
moseley@hank.org

Received on Mon Jan 9 04:59:38 2006