Re: PDFToText Getting Partial Files when Spidering

From: Bill Moseley <moseley(at)>
Date: Wed Jan 02 2002 - 22:21:16 GMT
At 01:41 PM 01/02/02 -0800, Deane Barker wrote:
>We're using Swish-E on Windows (gasp!).  Working great except for PDF files
>when spidering.  We're getting an error bubbling up from (we believe) the
>pdftotext program:
>Error (0): PDF file is damaged - attempting to reconstruct xref table...
>As near as we can tell, the spider is not writing the complete file to the
>temp directory before pdftotext tries to pick it up and convert it.  Thus,
>pdftotext gets a partial file and says it's corrupted.

I assume you are talking about the -S http method (not -S prog), correct?

The spider exits after every document, so everything should be flushed.  I
doubt it's an issue of the spider not writing the complete file before
swish reads it.

To debug this properly, I'd first run swishspider on its own, without
swish, to fetch the document (saved as *.content).  Then fetch the same
document some other way that you know works and compare the two files.
Then at least we will know what problem we are trying to fix.
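
The comparison step could be scripted roughly as below.  This is only a
sketch, not part of swish-e: the helper names (slurp_raw, first_difference,
count_crlf) are made up for illustration.  It reads both files as raw bytes
and reports their sizes, how many CR LF pairs each contains, and where they
first differ.

```perl
use strict;
use warnings;

# Read a whole file as raw bytes.  binmode prevents any line-ending
# translation on Windows while reading.
sub slurp_raw {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    binmode $fh;
    local $/;                      # slurp mode: read everything at once
    my $data = <$fh>;
    close $fh;
    return defined $data ? $data : '';
}

# Return the offset of the first differing byte, or -1 if identical.
sub first_difference {
    my ($x, $y) = @_;
    return -1 if $x eq $y;
    my $min = length($x) < length($y) ? length($x) : length($y);
    for my $i (0 .. $min - 1) {
        return $i if substr($x, $i, 1) ne substr($y, $i, 1);
    }
    return $min;                   # one file is a prefix of the other
}

# Count CR LF pairs in a byte string.
sub count_crlf {
    my ($data) = @_;
    my $n = () = $data =~ /\r\n/g;
    return $n;
}

# Usage: perl compare.pl <spidered-file> <reference-file>
if (@ARGV == 2) {
    my ($spidered, $reference) = map { slurp_raw($_) } @ARGV;
    printf "sizes: %d vs %d bytes\n", length($spidered), length($reference);
    printf "CRLF pairs: %d vs %d\n",
        count_crlf($spidered), count_crlf($reference);
    my $off = first_difference($spidered, $reference);
    print $off < 0
        ? "files are identical\n"
        : "first difference at byte offset $off\n";
}
```

If the spidered copy is larger and has many more CR LF pairs than the
reference, that points at line-ending translation rather than a truncated
write.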

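I'd guess it's a DOS line ending issue: on Windows, Perl writes in text
mode by default and turns every \n into \r\n, which corrupts binary data
like a PDF.  You could try adding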

   binmode CONTENTS;

right after the open() call.  To do it right, you would want to select
binmode based on the content-type header in the response, but running in
binmode all the time should be ok, I'd think, as swish-e doesn't care
about the line endings.
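
Selecting binmode from the content type might look something like this.
It is a sketch under assumptions, not the actual swishspider code: the
subs wants_binmode and save_content are hypothetical, it uses a lexical
filehandle instead of CONTENTS, and it treats everything that is not
text/* (including a missing content type) as binary.

```perl
use strict;
use warnings;

# Decide whether a response body should be written in binary mode.
# Assumption: only text/* types are written in text mode.
sub wants_binmode {
    my ($content_type) = @_;
    return 1 unless defined($content_type) && length($content_type);
    return $content_type =~ m{^\s*text/}i ? 0 : 1;
}

# Write a body to a file, switching to binmode for non-text content.
sub save_content {
    my ($path, $body, $content_type) = @_;
    open(my $fh, '>', $path) or die "Cannot write $path: $!";
    binmode $fh if wants_binmode($content_type);
    print {$fh} $body;
    close($fh) or die "Cannot close $path: $!";
}
```

With an LWP response object, the call would presumably be along the lines
of save_content($file, $response->content,
$response->header('Content-Type')), where $file and $response are whatever
the spider already has in hand.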

How are you calling pdftotext in your FileFilter directive?

Bill Moseley
Received on Wed Jan 2 22:22:44 2002