
Re: Problems indexing PDF files using HTTP crawler

From: Rosalyn Hatcher <r.s.hatcher(at)>
Date: Mon Jan 09 2006 - 11:38:27 GMT
Bill Moseley wrote:

>Try updating xpdf, perhaps.  Looks like it's not able to process that
>pdf file.  Fetch the file and then try:
>    pdfinfo Report05.pdf
>    pdftotext Report05.pdf -
>and see if those work directly.
>I have no problem with it:
>$ /usr/local/lib/swish-e/ default > /dev/null
>/usr/local/lib/swish-e/ Reading parameters from 'default'
>Summary for:
>         Connection: Close:      1  (0.1/sec)
>               Total Bytes: 72,475  (10353.6/sec)
>                Total Docs:      1  (0.1/sec)
>               Unique URLs:      1  (0.1/sec)
>application/pdf->text/html:      1  (0.1/sec)
Both pdftotext and pdfinfo were fine, and I also got a success with the
spider command you used above with the default configuration. Bizarre,
since I was sure I'd tried the default before and still got the error -
I guess I must have made a mistake somewhere.

Consequently, I decided it must be my config file, so I ditched it and
started again.  The problem line in my swish.conf file was

FileFilter .pdf pdftotext "'%p' -"

Once that was removed, everything seems to work OK.  I'm not sure I
understand why this line isn't needed, as my internet searches
indicated that it was.
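For anyone hitting the same problem, here is a minimal sketch of a
spider-based configuration without the FileFilter line. The URL and
filenames are placeholders, and this assumes swish-e 2.4.x with the
SWISH::Filter modules installed, as described in the swish-e docs:

    # swish.conf - minimal sketch (assumed setup, adjust paths/URL)
    IndexDir spider.pl
    SwishProgParameters default http://example.org/
    # No FileFilter for .pdf here: when indexing through spider.pl,
    # the SWISH::Filter layer already converts application/pdf to
    # HTML before swish-e sees the document (note the
    # "application/pdf->text/html" line in the summary above), so an
    # extra pdftotext FileFilter would be applied to content that is
    # no longer a PDF.

and then index with something like:

    swish-e -S prog -c swish.conf

FileFilter is mainly for the file-system (-S fs) method, where swish-e
reads raw files itself and needs to be told how to convert them; that
is probably why so many examples on the web include it.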

Thanks for your help,

Rosalyn Hatcher
CGAM, Dept. of Meteorology, University of Reading, 
Earley Gate, Reading. RG6 6BB
Email:     Tel: +44 (0) 118 378 7841
Received on Mon Jan 9 03:38:32 2006