Re: PDF to HTML causing swish-e to crash

From: Greg Fenton <greg_fenton(at)>
Date: Thu Oct 10 2002 - 22:48:55 GMT
--- Bill Moseley <> wrote:
> I sure wish pdftotext printed the file name with its errors.

Agreed...  ;-(

> So when you pass the file directly to pdftotext and pdfinfo it works
> fine?


> Looks like you are passing an invalid file to pdftotext.  Have your
> filter write it to disk and compare with the source file.
> Try indexing a single file so you can be sure of what file is
> generating the errors.


Thanks for the list of steps.  It appears to be a bug in swish-e, but
I'm open to other interpretations.

Here's what I have:

1. I enhanced to write $$content to a temporary file.
2. I enhanced the filter (filter-bin/ to copy $file
   to another temporary file (using File::Copy->copy)

The temp file from step 1 is identical to the source file (size and
sum).  xpdf runs against this file by hand without problems.

The temp file from step 2 is different: same size, different sum.
xpdf blows up when run against this file.  I compared the files via "od
-c" and it seems that most (all?) of the \0 in the original PDF have
been converted to \n.
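
For anyone who wants to repeat the comparison without reading through
"od -c" output by eye, here is a minimal sketch in Python that tallies
which byte values changed between two same-sized files.  The function
name and the file names in the usage note are hypothetical, not part of
swish-e or its filters.

```python
from collections import Counter


def tally_byte_changes(original_path, copy_path):
    """Count (original_byte, copied_byte) pairs at offsets where the files differ."""
    # Read both files in binary mode so that no newline translation happens here.
    with open(original_path, "rb") as f:
        original = f.read()
    with open(copy_path, "rb") as f:
        copied = f.read()
    # The comparison is offset by offset, so it only makes sense for equal sizes.
    if len(original) != len(copied):
        raise ValueError("files differ in size; offsets would not line up")
    # Iterating over bytes yields integers, so each key is a pair like (0, 10).
    return Counter((a, b) for a, b in zip(original, copied) if a != b)
```

Calling tally_byte_changes("source.pdf", "filter-copy.pdf") (both names
made up) returns a Counter keyed by (original byte, copied byte).  If
the only key is (0, 10), then every difference is a \0 that became \n,
which would match what "od -c" suggested.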

Question: is anyone successfully indexing PDF documents on Linux with
swish-e-2.2.1 ?  If so, can you please post your swish-e configuration
indicating how you are filtering PDF to HTML (or text)?

I'm going to have to find a fix for this tonight (deadlines and all
that).  Can someone point me to where in the code I should look to fix
this?  I know it will be roughly between the spidering and the
filtering, but I'm not sure where that code is (I've only briefly
glanced over the code before).

Thanks again,

Greg Fenton

Received on Thu Oct 10 22:52:44 2002