Re: Problems indexing PDF files using HTTP crawler

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Jan 06 2006 - 18:39:11 GMT
On Fri, Jan 06, 2006 at 03:47:57AM -0800, Rosalyn Hatcher wrote:
> Summary for: http://prism.enes.org/Publications/Reports/Report05.pdf
>          Connection: Close:      1  (1.0/sec)
>                Total Bytes: 72,475  (72475.0/sec)
>                 Total Docs:      1  (1.0/sec)
>                Unique URLs:      1  (1.0/sec)
> application/pdf->text/html:      1  (1.0/sec)
> Error: May not be a PDF file (continuing anyway)
> Error (0): PDF file is damaged - attempting to reconstruct xref table...
> Error: Couldn't find trailer dictionary
> Error: Couldn't read xref table

Try updating xpdf, perhaps.  It looks like your installed version is
not able to process that PDF file.  Fetch the file and then try:

    pdfinfo Report05.pdf
    pdftotext Report05.pdf -

and see if those work directly.
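
If it helps, here is a rough sketch of those checks wrapped in a small shell script. The function names looks_like_pdf and diagnose_pdf are made up for illustration and are not part of swish-e or xpdf. The header check rests on an assumption: that "May not be a PDF file" means the tool found no %PDF- marker near the start of the file, which would happen if the saved file were an HTML error page or a truncated download.

```shell
#!/bin/sh
# Hypothetical helpers, not part of swish-e or xpdf.

# looks_like_pdf FILE
# Succeeds if the "%PDF-" marker appears in the first 1024 bytes.
# (Assumption: this roughly mirrors the header test behind the
# "May not be a PDF file" message.)
looks_like_pdf() {
    head -c 1024 "$1" | LC_ALL=C grep -q '%PDF-'
}

# diagnose_pdf FILE
# Runs the header check first, then the two xpdf tools directly,
# so a failure can be pinned on the file or on xpdf itself.
diagnose_pdf() {
    if ! looks_like_pdf "$1"; then
        echo "$1: no %PDF- header found; check what the server returned" >&2
        return 1
    fi
    pdfinfo "$1" && pdftotext "$1" - > /dev/null
}
```

After fetching the file (with wget or curl, say), source the script and run "diagnose_pdf Report05.pdf". If the header check passes but pdfinfo or pdftotext still fails, that points at the installed xpdf rather than at the file.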

I have no problem with it:

$ /usr/local/lib/swish-e/spider.pl default http://prism.enes.org/Publications/Reports/Report05.pdf > /dev/null
/usr/local/lib/swish-e/spider.pl: Reading parameters from 'default'

Summary for: http://prism.enes.org/Publications/Reports/Report05.pdf
         Connection: Close:      1  (0.1/sec)
               Total Bytes: 72,475  (10353.6/sec)
                Total Docs:      1  (0.1/sec)
               Unique URLs:      1  (0.1/sec)
application/pdf->text/html:      1  (0.1/sec)

-- 
Bill Moseley
moseley@hank.org

Received on Fri Jan 6 10:39:28 2006