PDFToText Getting Partial Files when Spidering

From: Deane Barker <deane.barker(at)not-real.bankfirstcorp.com>
Date: Wed Jan 02 2002 - 21:42:56 GMT
We're using Swish-E on Windows (gasp!).  It works great except for PDF files
when spidering.  We're getting an error that appears to bubble up from the
pdftotext program:
 
Error (0): PDF file is damaged - attempting to reconstruct xref table...
 
As near as we can tell, the spider is not writing the complete file to the
temp directory before pdftotext tries to pick it up and convert it.  Thus,
pdftotext gets a partial file and says it's corrupted.
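
A minimal sketch of the ordering that would rule out this kind of race, assuming the fetched document arrives as a series of byte chunks. It is written in Python purely for illustration and is not the spider's own code; the names save_complete and convert_to_text are hypothetical. The temp file is opened in binary mode, flushed, and closed before its path is handed to the converter.

```python
import os
import subprocess
import tempfile


def save_complete(chunks, suffix=".pdf"):
    """Write every chunk to a temp file and close it before returning the path."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    # Binary mode: on Windows, text mode would translate newlines and
    # alter the PDF bytes.
    with os.fdopen(fd, "wb") as fh:
        for chunk in chunks:
            fh.write(chunk)
        fh.flush()
        os.fsync(fh.fileno())  # push the data to disk before anyone reads it
    # The file is closed here, so a reader sees the whole document.
    return path


def convert_to_text(path):
    """Run pdftotext on a finished file (assumes pdftotext is on the PATH)."""
    # A "-" as the output name asks pdftotext to write the text to stdout.
    result = subprocess.run(["pdftotext", path, "-"], capture_output=True)
    return result.returncode, result.stdout
```

The point of the sketch is only the sequence: the converter is started after save_complete has returned, never while the file is still being written.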
 
Reasons why we think this is so:
 
-- Everything works fine when we use the file system method (because then it
has the actual file and doesn't have to bother with writing temp files)
 
-- It will index very small PDF documents.  We made a one-paragraph (4 KB) PDF
document, and it indexed this via HTTP, presumably because the file was
small enough to be written to the temp directory in time for pdftotext to
pick it up.
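
One way to test the partial-file theory directly is to inspect the temp file before it is converted. The sketch below, in Python for illustration, assumes the expected size is known from somewhere such as the HTTP Content-Length header; the function name looks_complete and its parameters are hypothetical. It relies on the fact that a well-formed PDF carries a %%EOF marker near the end of the file.

```python
import os


def looks_complete(path, expected_size=None, tail_bytes=1024):
    """Return True if the file has the expected size and a %%EOF marker near its end."""
    size = os.path.getsize(path)
    # A size mismatch means the file on disk is not the file that was served.
    if expected_size is not None and size != expected_size:
        return False
    with open(path, "rb") as fh:
        # Read only the last tail_bytes bytes, where the marker should sit.
        fh.seek(max(0, size - tail_bytes))
        tail = fh.read()
    return b"%%EOF" in tail
```

If this returned False for the large PDFs fetched over HTTP and True for the 4 KB one, that would support the idea that pdftotext is being handed an incomplete file.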
 
Any thoughts?
Deane Barker 
Technical Research Analyst 
BANKFIRST 
605-988-3355 
 
Received on Wed Jan 2 21:44:23 2002