RE: PDFToText Getting Partial Files when Spidering

From: Deane Barker <deane.barker(at)not-real.bankfirstcorp.com>
Date: Wed Jan 02 2002 - 22:32:10 GMT
Two files attached:

a.pdf -- the actual PDF document, obtained via an HTTP call from my browser
a.pdf.contents -- the document returned by the swishspider

They are not the same.  If you open the second file as a PDF, it is blank,
whereas the first file has the correct two sentences of data. 

Deane



-----Original Message-----
From: Bill Moseley [mailto:moseley@hank.org] 
Sent: Wednesday, January 02, 2002 4:19 PM
To: deane.barker@bankfirstcorp.com; Multiple recipients of list
Subject: Re: [SWISH-E] PDFToText Getting Partial Files when Spidering

At 01:41 PM 01/02/02 -0800, Deane Barker wrote:
>We're using Swish-E on Windows (gasp!).  Working great except for PDF files
>when spidering.  We're getting an error bubbling up from (we believe) the
>pdftotext program:
> 
>Error (0): PDF file is damaged - attempting to reconstruct xref table...
> 
>As near as we can tell, the spider is not writing the complete file to the
>temp directory before pdftotext tries to pick it up and convert it.  Thus,
>pdftotext gets a partial file and says it's corrupted.

I assume you are talking about -S http method (not -S prog), correct?

The spider exits after every document so everything should be flushed, so I
doubt it's an issue of the spider not writing the complete file before
swish reads it. 

To correctly debug, I'd first run the swishspider without swish to fetch
the document (as *.content).  Then fetch the document some other way that
you know works and compare the files.  Then at least we will know what
problem we are trying to fix.
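
A rough sketch of that comparison, reading both files as raw bytes. It is
not part of swish-e; the helper names (slurp_raw, first_difference,
count_crlf) are made up for illustration. A different CRLF count between
the two files would point at line-ending translation.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read a whole file as raw bytes. binmode stops Windows from translating
# line endings while we read, so we see exactly what is on disk.
sub slurp_raw {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    binmode $fh;
    local $/;                       # slurp mode: read everything at once
    my $data = <$fh>;
    close $fh;
    return defined $data ? $data : '';
}

# Offset of the first differing byte, or -1 if the strings are identical.
# If one string is a prefix of the other, returns the shorter length.
sub first_difference {
    my ($x, $y) = @_;
    return -1 if $x eq $y;
    my $min = length($x) < length($y) ? length($x) : length($y);
    for my $i (0 .. $min - 1) {
        return $i if substr($x, $i, 1) ne substr($y, $i, 1);
    }
    return $min;
}

# Number of CR LF pairs in the data.
sub count_crlf {
    my ($data) = @_;
    my $n = () = $data =~ /\r\n/g;
    return $n;
}

# Usage (file names are examples): perl compare.pl a.pdf a.pdf.contents
if (@ARGV == 2) {
    my @data = map { slurp_raw($_) } @ARGV;
    for my $i (0, 1) {
        printf "%s: %d bytes, %d CRLF pairs\n",
            $ARGV[$i], length($data[$i]), count_crlf($data[$i]);
    }
    my $off = first_difference(@data);
    print $off < 0 ? "Files are identical\n"
                   : "First difference at byte offset $off\n";
}
```
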

I'd guess it's a DOS line ending issue.  You could try adding

   binmode CONTENTS;

right after the open() call.  To do it right, you would want to select
binmode based on the content-type header in the response, but running in
binmode all the time should be ok, I'd think, as swish-e doesn't care
about the line endings.
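
Something along these lines is what I mean by selecting on content-type.
This is only a sketch, not the actual swishspider code: save_contents and
is_binary_type are hypothetical names, and treating everything except
text/* as binary is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: treat anything that is not text/* as binary.
# A missing content-type is treated as binary to be safe.
sub is_binary_type {
    my ($content_type) = @_;
    return 1 unless defined $content_type;
    return $content_type =~ m{^\s*text/}i ? 0 : 1;
}

# Hypothetical helper: write a fetched document to disk.
# Without binmode, Perl on Windows turns "\n" into "\r\n" on output,
# which changes the bytes of a PDF.
sub save_contents {
    my ($path, $content, $content_type) = @_;
    open(CONTENTS, '>', $path) or die "Cannot write $path: $!";
    binmode CONTENTS if is_binary_type($content_type);
    print CONTENTS $content;
    close(CONTENTS) or die "Cannot close $path: $!";
}
```

Calling binmode unconditionally is the simpler variant; on Unix it makes
no difference either way.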

How are you calling pdftotext in your FileFilter directive?


-- 
Bill Moseley
mailto:moseley@hank.org




*********************************************************************
Due to deletion of content types excluded from this list by policy,
this multipart message was reduced to a single part, and from there
to a plain text message.
*********************************************************************
Received on Wed Jan 2 22:32:15 2002